perf(lzss,huffman): hash-chain match finder + table Huffman decode #111
Merged
Reviewed the codec suite for optimization headroom (bench across every algorithm). Two clear algorithmic wins, both keeping output correct:

lzss encode: the finder compared each position against all 4096 ring-buffer slots, i.e. O(N·n) regardless of content, so incompressible input collapsed to ~0.3 MB/s. Replace it with a hash chain over the raw input (translating a match source at input position `cand` to the decoder's ring index `(cand + N - F) & (N - 1)`). Output size is unchanged because it depends only on match lengths, which the fully-walked chain reproduces; only the tie-broken source position can differ. ~9x faster on text, ~700x on random at 1 MiB; compressed sizes within 0.01% across text/binary/zeros/code.

huffman decode: the canonical decoder walked each code one bit at a time (one BitReader call per bit). Build a single peek-and-lookup table indexed by the next max_length bits (<= 15, so <= 64 KiB) and decode a symbol per lookup. ~1.9-2.1x fewer decode instructions on both text and high-entropy input; output identical, corrupt/truncated streams still rejected without panic.

Verified: full suite (61 binaries), clippy, fmt clean; lzss ratio preserved and round-trips; 60-case huffman fuzz + 30 corrupt inputs round-trip through our decoder without panic.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Reviewed the whole codec suite for optimization headroom (benchmarked encode+decode throughput across every algorithm) and kept the changes that clear the ≥10% bar. Two clear algorithmic wins:
lzss encode: O(N·n) brute force → hash chain
The encoder compared every position against all 4096 ring-buffer slots, i.e. O(N·n) regardless of content, so incompressible input collapsed to ~0.3 MB/s (~110k instructions/byte). Replaced it with a hash-chain finder over the raw input, translating a match source at input position `cand` to the ring index the decoder expects, `(cand + N - F) & (N - 1)`. A `u32` chain array keeps the fixed allocation small.
huffman decode: bit-by-bit → table lookup
The standalone canonical-Huffman decoder walked each code one bit at a time (a `BitReader` call per bit). It now builds a single peek-and-lookup table indexed by the next `max_length` bits (≤15 ⇒ ≤64 KiB) and decodes one symbol per lookup. Corrupt and truncated streams are still rejected (`Corrupt`/`UnexpectedEnd`) without panicking.
Notes
Other slow paths were checked and left alone as already-optimal or inherently costly: the range coder (8-bit bit-tree, ~37 instr/bit), `h2-huffman` (already a byte-wide FSA), `mtf` (already single-pass; random is a non-goal), `bwt` (SA-IS). The multi-MB, highly-repetitive lzss case is ~30% slower than the old early-breaking brute force but stays >270 MB/s, an acceptable trade for fixing the 700× worst case.
Verification
Full suite (61 binaries), clippy, fmt clean. lzss ratio preserved + round-trips; 60-case huffman fuzz + 30 corrupt inputs round-trip through our decoder without panic.
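For readers unfamiliar with the lzss change, here is a minimal sketch of a hash-chain finder over the raw input, with the match source translated to a ring index via `(cand + N - F) & (N - 1)`. It is not the code from this PR. `N = 4096` comes from the description; `F = 18`, `MIN_MATCH`, `HASH_BITS`, the `Token` type, the hash function, and the space-filled ring in the toy decoder are all assumptions made for illustration.

```rust
const N: usize = 4096; // ring-buffer size (the 4096 slots mentioned above)
const F: usize = 18; // maximum match length: assumed, classic LZSS value
const MIN_MATCH: usize = 3; // shortest match worth emitting: assumed
const HASH_BITS: u32 = 12; // hash-table size: assumed
const NIL: u32 = u32::MAX; // "no earlier position" marker

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Token {
    Literal(u8),
    Match { src: usize, len: usize }, // src is a decoder ring index
}

/// Ring index at which the decoder stored input byte number `cand`.
pub fn ring_index(cand: usize) -> usize {
    (cand + N - F) & (N - 1)
}

fn hash3(data: &[u8], i: usize) -> usize {
    let v = (data[i] as u32) | ((data[i + 1] as u32) << 8) | ((data[i + 2] as u32) << 16);
    (v.wrapping_mul(2_654_435_761) >> (32 - HASH_BITS)) as usize
}

pub fn encode(data: &[u8]) -> Vec<Token> {
    let mut head = vec![NIL; 1 << HASH_BITS]; // newest position per hash bucket
    let mut prev = vec![NIL; N]; // u32 chain links, indexed by position mod N
    let mut out = Vec::new();
    let mut pos = 0;
    while pos < data.len() {
        let max_len = F.min(data.len() - pos);
        let (mut best_len, mut best_src) = (0, 0);
        if max_len >= MIN_MATCH {
            let mut cand = head[hash3(data, pos)];
            while cand != NIL {
                let c = cand as usize;
                if pos - c > N - F {
                    break; // chain positions only get older, so stop here
                }
                let len = (0..max_len).take_while(|&k| data[c + k] == data[pos + k]).count();
                if len > best_len {
                    best_len = len;
                    best_src = c;
                }
                cand = prev[c & (N - 1)];
            }
        }
        let step = if best_len >= MIN_MATCH {
            out.push(Token::Match { src: ring_index(best_src), len: best_len });
            best_len
        } else {
            out.push(Token::Literal(data[pos]));
            1
        };
        // Register every consumed position that still has three bytes to hash.
        for p in pos..pos + step {
            if p + MIN_MATCH <= data.len() {
                let h = hash3(data, p);
                prev[p & (N - 1)] = head[h];
                head[h] = p as u32;
            }
        }
        pos += step;
    }
    out
}

/// Toy decoder: the write cursor starts at N - F, which is why `ring_index` adds that offset.
pub fn decode(tokens: &[Token]) -> Vec<u8> {
    let mut ring = vec![b' '; N];
    let mut r = N - F;
    let mut out = Vec::new();
    for &t in tokens {
        let (src, len) = match t {
            Token::Literal(b) => {
                ring[r] = b; // stage the literal so the copy loop below emits it
                (r, 1)
            }
            Token::Match { src, len } => (src, len),
        };
        for k in 0..len {
            let b = ring[(src + k) & (N - 1)];
            out.push(b);
            ring[r] = b;
            r = (r + 1) & (N - 1);
        }
    }
    out
}
```

Calling `decode(&encode(&data))` should return the original bytes. The chain walk costs time proportional to the number of earlier positions sharing a hash, not to the full window, which is where the worst-case improvement described above would come from.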
🤖 Generated with Claude Code
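The table-driven Huffman decode described in the huffman section can be sketched as follows. This is an illustration under stated assumptions, not the decoder from this PR. The `Corrupt` and `UnexpectedEnd` names come from the description; the `DecodeError` and `Table` types, the MSB-first bit order, the 256-symbol limit, and the packed `u16` entry layout (length in the high byte, symbol in the low byte, chosen so that 2^15 entries occupy 64 KiB) are assumptions.

```rust
#[derive(Debug, PartialEq)]
pub enum DecodeError {
    Corrupt,
    UnexpectedEnd,
}

pub struct Table {
    pub max_len: u32,
    pub entries: Vec<u16>, // (code length << 8) | symbol; 0 marks an unused slot
}

/// Build a lookup table from canonical code lengths (index = symbol, 0 = unused).
pub fn build_table(lengths: &[u8]) -> Result<Table, DecodeError> {
    let max_len = lengths.iter().copied().max().unwrap_or(0) as u32;
    if max_len == 0 || max_len > 15 || lengths.len() > 256 {
        return Err(DecodeError::Corrupt);
    }
    let mut entries = vec![0u16; 1 << max_len];
    let mut code = 0u32; // next canonical code at the current length
    for len in 1..=max_len {
        for (sym, _) in lengths.iter().enumerate().filter(|&(_, &l)| u32::from(l) == len) {
            if code >= (1 << len) {
                return Err(DecodeError::Corrupt); // over-subscribed lengths
            }
            // Every index whose top `len` bits equal `code` maps to this symbol.
            let shift = max_len - len;
            let start = (code << shift) as usize;
            let entry = ((len as u16) << 8) | (sym as u16);
            entries[start..start + (1 << shift)].fill(entry);
            code += 1;
        }
        code <<= 1;
    }
    Ok(Table { max_len, entries })
}

/// Decode `count` symbols, one table lookup per symbol.
pub fn decode(table: &Table, input: &[u8], count: usize) -> Result<Vec<u8>, DecodeError> {
    let mut buf: u64 = 0; // bit buffer, valid bits kept at the top
    let mut nbits: u32 = 0; // number of valid bits in `buf`
    let mut next = 0usize; // next input byte to load
    let mut out = Vec::with_capacity(count);
    for _ in 0..count {
        // Refill byte-wise until a full peek is possible or the input is exhausted.
        while nbits < table.max_len && next < input.len() {
            buf |= (input[next] as u64) << (56 - nbits);
            next += 1;
            nbits += 8;
        }
        // Peek max_len bits; past the end of input they read as zeros.
        let entry = table.entries[(buf >> (64 - table.max_len)) as usize];
        let len = (entry >> 8) as u32;
        if len == 0 {
            return Err(DecodeError::Corrupt); // bit pattern maps to no code
        }
        if len > nbits {
            return Err(DecodeError::UnexpectedEnd); // code runs past the input
        }
        out.push(entry as u8); // low byte holds the symbol
        buf <<= len;
        nbits -= len;
    }
    Ok(out)
}
```

With lengths `[1, 2, 3, 3]` the canonical codes are `0`, `10`, `110`, `111`, so the bytes `0x5B 0x80` decode to symbols `0, 1, 2, 3, 0`. Short codes are replicated across every index that shares their prefix, which is what lets a single peek of `max_len` bits resolve any symbol.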