
Commit 25871cd

Introduce a SetBuilder that exploits common prefixes and compresses very well with gzip so shipping dictionaries becomes even easier
1 parent 1a86fc3 commit 25871cd

4 files changed

+268
-28
lines changed

README.md

Lines changed: 39 additions & 5 deletions
Original file line number | Diff line number | Diff line change
@@ -85,24 +85,57 @@ composer require toflar/fast-set
8585

8686
### 1. Build the set (one-time)
8787

88+
Hashes are an excellent tool for evenly distributing entries, which is exactly what makes lookups in `FastSet`
89+
extremely fast. However, hashes are not well suited for shipping with an application:
90+
91+
* They are often larger than the original terms (especially for short words).
92+
* They are effectively random, which means gzip compression performs very poorly.
93+
94+
Shipping prebuilt hash files would therefore often mean shipping more data than the original dictionary.
95+
96+
The solution: The `SetBuilder`
97+
98+
This library ships with a `SetBuilder` that is designed specifically for distribution size efficiency.
99+
Instead of shipping hashes, you ship a compressed dictionary that:
100+
101+
* exploits shared prefixes between terms
102+
* avoids repeating identical prefixes
103+
* compresses extremely well with gzip
104+
105+
The hash-based data structures are then generated locally at build time.
106+
88107
```php
108+
$myOriginalSet = __DIR__ . '/dictionary.txt'; // one entry per line
109+
110+
// Encode/Compress with the prefix algorithm:
111+
SetBuilder::buildSet($myOriginalSet, './compressed.txt');
112+
113+
// Encode/Compress with the prefix algorithm and gzip on top (the .gz suffix determines that):
114+
SetBuilder::buildSet($myOriginalSet, './compressed.gz');
115+
116+
// You then ship either "compressed.txt" or "compressed.gz" with your application. Instantiating
117+
// is then done as follows:
89118
$set = new FastSet(__DIR__ . '/dict');
90-
$set->build(__DIR__ . '/dictionary.txt'); // one entry per line
119+
$set->build(__DIR__ . '/compressed.gz'); // or '/compressed.txt'; must be a file built using the SetBuilder
91120
```
92121

93-
This creates:
122+
Calling `build` creates the following files on-the-fly:
94123
```
95124
dict/
96125
├── hashes.bin
97126
└── index.bin
98127
```
99128

100-
You can ship these files with your application.
101-
129+
> Important:
130+
> Do not ship `hashes.bin` or `index.bin`.
131+
> Only ship the compressed dictionary created by the `SetBuilder`.
102132
---
103133

104134
### 2. Lookup
105135

136+
Once you have called `build()` on your `FastSet` so that the required files exist, you can
137+
then use it as follows:
138+
106139
```php
107140
$set = new FastSet(__DIR__ . '/dict');
108141

@@ -111,7 +144,8 @@ if ($set->has('look-me-up')) {
111144
}
112145
```
113146

114-
The files are loaded lazily on first lookup, but you can also call `initialize()` explicitly if you want to.
147+
The `hashes.bin` and `index.bin` files are loaded lazily on first lookup, but you can also call `initialize()`
148+
explicitly if you want to load them into memory at a specific point in time.
115149

116150
---
117151
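The prefix scheme described in the README's build section can be sketched in a few lines. This is a minimal illustration and not the library's implementation: `frontEncode()` is a hypothetical helper name, and the sample terms are invented. It produces the same `<prefixLen>\t<suffix>` line shape that `SetBuilder::buildSet()` writes.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of the library): encodes terms as
 * "<prefixLen>\t<suffix>" lines relative to the previous sorted term.
 *
 * @param list<string> $terms
 *
 * @return list<string>
 */
function frontEncode(array $terms): array
{
    // Sorting puts terms with shared prefixes next to each other
    sort($terms, SORT_STRING);

    $lines = [];
    $previous = '';

    foreach ($terms as $term) {
        // Count the leading bytes shared with the previous term
        $limit = min(strlen($previous), strlen($term));
        $shared = 0;
        while ($shared < $limit && $previous[$shared] === $term[$shared]) {
            ++$shared;
        }

        // Store only the shared length and the bytes that differ
        $lines[] = $shared."\t".substr($term, $shared);
        $previous = $term;
    }

    return $lines;
}

$encoded = frontEncode(['mailbox', 'mail', 'mailadresse', 'main']);
echo implode("\n", $encoded), "\n";
```

After sorting, each term repeats only the bytes that differ from its predecessor, so `mailadresse` following `mail` is stored as the length `4` plus the suffix `adresse`.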

src/FastSet.php

Lines changed: 8 additions & 22 deletions
@@ -4,7 +4,7 @@
44

55
namespace Toflar\FastSet;
66

7-
class FastSet
7+
final class FastSet
88
{
99
private readonly string $hashesPath;
1010

@@ -92,32 +92,18 @@ public function has(string $entry): bool
9292
}
9393

9494
/**
95-
* The source path must be a file with all entries separated by new lines.
95+
* The source path must be a file built using the SetBuilder.
9696
*/
9797
public function build(string $sourcePath): void
9898
{
99-
if (!is_file($sourcePath)) {
100-
throw new \InvalidArgumentException('Source file does not exist.');
101-
}
102-
103-
$handle = fopen($sourcePath, 'r');
104-
if (false === $handle) {
105-
throw new \InvalidArgumentException('Source file is not readable.');
106-
}
107-
10899
$fingerPrints = [];
109100

110-
while (($line = fgets($handle)) !== false) {
111-
$entry = trim($line);
112-
113-
if ('' === $entry) {
114-
continue;
115-
}
116-
117-
$fingerPrints[] = $this->getFingerPrintForEntry($entry);
118-
}
119-
120-
fclose($handle);
101+
SetBuilder::readSet(
102+
$sourcePath,
103+
function (string $entry) use (&$fingerPrints): void {
104+
$fingerPrints[] = $this->getFingerPrintForEntry($entry);
105+
},
106+
);
121107

122108
// Sort all fingerprints so we can binary-search them later
123109
sort($fingerPrints, SORT_STRING);
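The reason for sorting the fingerprints is the later binary search. The sketch below shows that idea in isolation and makes several assumptions: the fingerprint function (first 8 raw bytes of an MD5 digest), the fixed width, and all function names are hypothetical and are not taken from `FastSet`, whose `getFingerPrintForEntry()` is not shown in this commit.

```php
<?php

declare(strict_types=1);

const FINGERPRINT_BYTES = 8; // assumed width, chosen for this sketch only

// Hypothetical fingerprint: the first 8 raw bytes of an MD5 digest
function sketchFingerprint(string $entry): string
{
    return substr(md5($entry, true), 0, FINGERPRINT_BYTES);
}

/** @param list<string> $entries */
function packFingerprints(array $entries): string
{
    $fingerprints = array_map('sketchFingerprint', $entries);
    sort($fingerprints, SORT_STRING); // byte-wise order is required for binary search

    return implode('', $fingerprints);
}

function containsFingerprint(string $packed, string $entry): bool
{
    $needle = sketchFingerprint($entry);
    $low = 0;
    $high = intdiv(strlen($packed), FINGERPRINT_BYTES) - 1;

    while ($low <= $high) {
        $middle = intdiv($low + $high, 2);
        $candidate = substr($packed, $middle * FINGERPRINT_BYTES, FINGERPRINT_BYTES);
        $comparison = strcmp($candidate, $needle);

        if (0 === $comparison) {
            return true;
        }

        if ($comparison < 0) {
            $low = $middle + 1; // needle sorts after the candidate
        } else {
            $high = $middle - 1; // needle sorts before the candidate
        }
    }

    return false;
}

$packed = packFingerprints(['mailadresse', 'stolperfalle', 'zytozym']);
var_dump(containsFingerprint($packed, 'stolperfalle'));
```

Because every fingerprint has the same width, the packed string can be addressed by offset without an in-memory array. In this sketch a truncated digest could in principle report a term that was never added; whether and how the library handles that is not visible in this diff.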

src/SetBuilder.php

Lines changed: 181 additions & 0 deletions
@@ -0,0 +1,181 @@
1+
<?php
2+
3+
declare(strict_types=1);
4+
5+
namespace Toflar\FastSet;
6+
7+
final class SetBuilder
8+
{
9+
/**
10+
* @param string $sourcePath the source path must be a file with all entries separated by new lines
11+
* @param string $outputPath if the output path ends with ".gz", the encoded set is additionally gzip-compressed
12+
*/
13+
public static function buildSet(string $sourcePath, string $outputPath): void
14+
{
15+
$inputHandle = fopen($sourcePath, 'r');
16+
if (false === $inputHandle) {
17+
throw new \RuntimeException('Unable to open file: '.$sourcePath);
18+
}
19+
20+
$terms = [];
21+
22+
while (($line = fgets($inputHandle)) !== false) {
23+
$line = trim($line);
24+
if ('' !== $line) {
25+
$terms[] = $line;
26+
}
27+
}
28+
fclose($inputHandle);
29+
30+
sort($terms, SORT_STRING);
31+
32+
$outputHandle = self::openForWritingPossiblyGzip($outputPath);
33+
34+
$previousTerm = '';
35+
36+
foreach ($terms as $term) {
37+
$commonPrefixLength = self::commonPrefixByteLength($previousTerm, $term);
38+
$suffix = substr($term, $commonPrefixLength);
39+
40+
// <prefixLen>\t<suffix>\n
41+
fwrite($outputHandle, (string) $commonPrefixLength);
42+
fwrite($outputHandle, "\t");
43+
fwrite($outputHandle, $suffix);
44+
fwrite($outputHandle, "\n");
45+
46+
$previousTerm = $term;
47+
}
48+
49+
self::closePossiblyGzip($outputHandle, $outputPath);
50+
}
51+
52+
public static function readSet(string $setPath, callable $callable): void
53+
{
54+
$handle = self::openForReadingPossiblyGzip($setPath);
55+
56+
$previousTerm = '';
57+
58+
while (($line = self::readLinePossiblyGzip($handle, $setPath)) !== false) {
59+
$line = rtrim($line, "\r\n");
60+
if ('' === $line) {
61+
continue;
62+
}
63+
64+
$tabPosition = strpos($line, "\t");
65+
if (false === $tabPosition) {
66+
throw new \UnexpectedValueException('Invalid file format. Ensure you have built it using the SetBuilder class!');
67+
}
68+
69+
$prefixLengthText = substr($line, 0, $tabPosition);
70+
$suffix = substr($line, $tabPosition + 1);
71+
72+
$prefixLength = (int) $prefixLengthText;
73+
74+
$term = substr($previousTerm, 0, $prefixLength).$suffix;
75+
76+
$callable($term);
77+
78+
$previousTerm = $term;
79+
}
80+
81+
self::closePossiblyGzip($handle, $setPath);
82+
}
83+
84+
private static function commonPrefixByteLength(string $left, string $right): int
85+
{
86+
$limit = min(\strlen($left), \strlen($right));
87+
$index = 0;
88+
89+
// Byte-wise common prefix
90+
while ($index < $limit && $left[$index] === $right[$index]) {
91+
++$index;
92+
}
93+
94+
return $index;
95+
}
96+
97+
/**
98+
* @return resource
99+
*/
100+
private static function openForWritingPossiblyGzip(string $path)
101+
{
102+
if (str_ends_with($path, '.gz')) {
103+
if (!\function_exists('gzopen')) {
104+
throw new \RuntimeException('Cannot open for gzip write (gzopen not available): '.$path);
105+
}
106+
107+
$handle = gzopen($path, 'wb9');
108+
if (false === $handle) {
109+
throw new \RuntimeException('Cannot open for gzip write: '.$path);
110+
}
111+
112+
return $handle;
113+
}
114+
115+
$handle = fopen($path, 'w');
116+
if (false === $handle) {
117+
throw new \RuntimeException('Cannot open for write: '.$path);
118+
}
119+
120+
return $handle;
121+
}
122+
123+
/**
124+
* @return resource
125+
*/
126+
private static function openForReadingPossiblyGzip(string $path)
127+
{
128+
if (str_ends_with($path, '.gz')) {
129+
if (!\function_exists('gzopen')) {
130+
throw new \RuntimeException('Cannot open for gzip read (gzopen not available): '.$path);
131+
}
132+
133+
$handle = gzopen($path, 'rb');
134+
if (false === $handle) {
135+
throw new \RuntimeException('Cannot open for gzip read: '.$path);
136+
}
137+
138+
return $handle;
139+
}
140+
141+
$handle = fopen($path, 'r');
142+
if (false === $handle) {
143+
throw new \RuntimeException('Cannot open for reading: '.$path);
144+
}
145+
146+
return $handle;
147+
}
148+
149+
/**
150+
* @param resource $handle
151+
*/
152+
private static function readLinePossiblyGzip($handle, string $path): string|false
153+
{
154+
if (str_ends_with($path, '.gz')) {
155+
if (!\function_exists('gzgets')) {
156+
throw new \RuntimeException('Cannot read gzip: '.$path);
157+
}
158+
159+
return gzgets($handle);
160+
}
161+
162+
return fgets($handle);
163+
}
164+
165+
/**
166+
* @param resource $handle
167+
*/
168+
private static function closePossiblyGzip($handle, string $path): void
169+
{
170+
if (str_ends_with($path, '.gz')) {
171+
if (!\function_exists('gzclose')) {
172+
throw new \RuntimeException('Cannot close gzip (gzclose not available): '.$path);
173+
}
174+
175+
gzclose($handle);
176+
177+
return;
178+
}
179+
fclose($handle);
180+
}
181+
}
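Decoding reverses the encoding by keeping the first `prefixLen` bytes of the previous term and appending the suffix. The sketch below illustrates this on in-memory lines. `frontDecode()` is a hypothetical helper, and its extra checks (digits-only prefix length, prefix not longer than the previous term) are assumptions about what a stricter reader might do; `readSet()` above only checks for the tab and casts the prefix length to `int`.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of the library): expands
 * "<prefixLen>\t<suffix>" lines back into the original terms.
 *
 * @param iterable<string> $lines
 *
 * @return \Generator<int, string>
 */
function frontDecode(iterable $lines): \Generator
{
    $previous = '';

    foreach ($lines as $line) {
        $line = rtrim($line, "\r\n");
        if ('' === $line) {
            continue;
        }

        // Split into the prefix length and the suffix at the first tab only
        $parts = explode("\t", $line, 2);
        if (2 !== count($parts) || 1 !== preg_match('/^\d+$/D', $parts[0])) {
            throw new \UnexpectedValueException('Malformed line: '.$line);
        }

        $prefixLength = (int) $parts[0];
        if ($prefixLength > strlen($previous)) {
            throw new \UnexpectedValueException('Prefix length exceeds previous term: '.$line);
        }

        // Reuse the shared bytes of the previous term and append the suffix
        $previous = substr($previous, 0, $prefixLength).$parts[1];

        yield $previous;
    }
}

$terms = iterator_to_array(
    frontDecode(["0\tmail\n", "4\tadresse\n", "4\tbox\n", "3\tn\n"]),
    false,
);
echo implode("\n", $terms), "\n";
```

A generator keeps only the previous term in memory, which matches the callback-per-term style of `readSet()`.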

tests/FastSetTest.php

Lines changed: 40 additions & 1 deletion
@@ -7,6 +7,7 @@
77
use PHPUnit\Framework\TestCase;
88
use Symfony\Component\Filesystem\Filesystem;
99
use Toflar\FastSet\FastSet;
10+
use Toflar\FastSet\SetBuilder;
1011

1112
class FastSetTest extends TestCase
1213
{
@@ -21,11 +22,49 @@ protected function setUp(): void
2122
$this->testDirectory = $testDir;
2223
}
2324

24-
public function testFastSet(): void
25+
public function testFastSetFailsWithWrongFileFormat(): void
2526
{
27+
$this->expectException(\UnexpectedValueException::class);
28+
$this->expectExceptionMessage('Invalid file format. Ensure you have built it using the SetBuilder class!');
29+
2630
$fastSet = new FastSet($this->testDirectory);
2731
$fastSet->build(__DIR__.'/Fixtures/terms_de.txt');
32+
}
33+
34+
public function testWorkingWithSetBuilderWithoutGzipCompression(): void
35+
{
36+
// Build a set without gzip but with our prefix algorithm
37+
SetBuilder::buildSet(__DIR__.'/Fixtures/terms_de.txt', $this->testDirectory.'/terms_encoded.txt');
38+
39+
// The encoded file must be smaller than the original dictionary
40+
$this->assertTrue(filesize($this->testDirectory.'/terms_encoded.txt') < filesize(__DIR__.'/Fixtures/terms_de.txt'));
41+
42+
$fastSet = new FastSet($this->testDirectory);
43+
$fastSet->build($this->testDirectory.'/terms_encoded.txt');
44+
45+
$this->assertFastSetContents($fastSet);
46+
}
2847

48+
public function testWorkingWithSetBuilderWithGzipCompression(): void
49+
{
50+
// Build a set without gzip but with our prefix algorithm
51+
SetBuilder::buildSet(__DIR__.'/Fixtures/terms_de.txt', $this->testDirectory.'/terms_encoded.txt');
52+
// Also build the gzipped one
53+
SetBuilder::buildSet(__DIR__.'/Fixtures/terms_de.txt', $this->testDirectory.'/terms_gzipped.gz');
54+
55+
// The encoded file must be smaller than the original dictionary
56+
$this->assertTrue(filesize($this->testDirectory.'/terms_encoded.txt') < filesize(__DIR__.'/Fixtures/terms_de.txt'));
57+
// File size of the gzipped file must be even smaller
58+
$this->assertTrue(filesize($this->testDirectory.'/terms_gzipped.gz') < filesize($this->testDirectory.'/terms_encoded.txt'));
59+
60+
$fastSet = new FastSet($this->testDirectory);
61+
$fastSet->build($this->testDirectory.'/terms_gzipped.gz');
62+
63+
$this->assertFastSetContents($fastSet);
64+
}
65+
66+
private function assertFastSetContents(FastSet $fastSet): void
67+
{
2968
$this->assertTrue($fastSet->has('mailadresse'));
3069
$this->assertTrue($fastSet->has('stolperfalle'));
3170
$this->assertTrue($fastSet->has('zytozym'));
