perf(lzss,huffman): hash-chain match finder + table Huffman decode #111

Merged: MagicalTux merged 1 commit into master from perf/lzss-huffman-decode on Jul 1, 2026

@MagicalTux

Reviewed the whole codec suite for optimization headroom (benchmarked encode+decode throughput across every algorithm) and kept the changes that clear the ≥10% bar. Two clear algorithmic wins:

lzss encode — O(N·n) brute force → hash chain

The encoder compared every input position against all N = 4096 ring-buffer slots, i.e. O(N·n) for an n-byte input regardless of content, so incompressible input collapsed to ~0.3 MB/s (~110k instructions/byte). Replaced it with a hash-chain finder over the raw input; a match source at input position `cand` is translated to the ring index the decoder expects, `(cand + N - F) & (N - 1)`.

  • Output size is unchanged: it depends only on match lengths, which the fully-walked chain reproduces; only the tie-broken source position can differ.
  • ~9× faster on text, ~700× on random (at 1 MiB); zeros neutral. Compressed sizes within 0.01% across text/binary/zeros/source.
  • u32 chain array keeps the fixed allocation small.
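
A minimal sketch of a hash-chain finder of this general shape, not the code merged in this PR. `N = 4096` and the ring-index formula come from the description above; `F = 18`, `THRESHOLD`, `HASH_BITS`, the 3-byte multiplicative hash, the `N - F` distance limit, and every name (`Finder`, `hash3`, `ring_index`) are assumptions made for illustration.

```rust
const N: usize = 4096;      // ring-buffer size (from the PR text)
const F: usize = 18;        // ASSUMED maximum match length (classic LZSS value)
const THRESHOLD: usize = 2; // ASSUMED: only matches longer than this are worth emitting
const HASH_BITS: u32 = 12;  // ASSUMED hash-table size (4096 heads)
const NIL: u32 = u32::MAX;  // "no position" marker

/// Hash of the three bytes starting at `i` (multiplicative hash, an arbitrary choice).
fn hash3(data: &[u8], i: usize) -> usize {
    let v = (data[i] as u32) | ((data[i + 1] as u32) << 8) | ((data[i + 2] as u32) << 16);
    (v.wrapping_mul(2_654_435_761) >> (32 - HASH_BITS)) as usize
}

/// Translate an input position to the ring index the decoder expects (formula from the PR).
fn ring_index(cand: usize) -> usize {
    (cand + N - F) & (N - 1)
}

struct Finder {
    head: Vec<u32>, // hash -> most recent input position with that hash
    prev: Vec<u32>, // (position mod N) -> previous position with the same hash
}

impl Finder {
    fn new() -> Self {
        Finder { head: vec![NIL; 1usize << HASH_BITS], prev: vec![NIL; N] }
    }

    /// Record `pos` so later positions can match against it.
    fn insert(&mut self, data: &[u8], pos: usize) {
        if pos + 3 > data.len() {
            return; // fewer than three bytes left: nothing to hash
        }
        let h = hash3(data, pos);
        self.prev[pos & (N - 1)] = self.head[h];
        self.head[h] = pos as u32;
    }

    /// Longest match for `data[pos..]`; returns (source position, length).
    /// Call this before `insert(data, pos)`.
    fn longest_match(&self, data: &[u8], pos: usize) -> Option<(usize, usize)> {
        if pos + 3 > data.len() {
            return None;
        }
        let max_len = F.min(data.len() - pos);
        let mut best = None;
        let mut best_len = THRESHOLD;
        let mut cand = self.head[hash3(data, pos)];
        while cand != NIL {
            let c = cand as usize;
            if c >= pos || pos - c > N - F {
                break; // outside the assumed window; the rest of the chain is older still
            }
            let len = (0..max_len).take_while(|&k| data[c + k] == data[pos + k]).count();
            if len > best_len {
                best_len = len; // strict '>' keeps the most recent source on ties
                best = Some((c, len));
            }
            cand = self.prev[c & (N - 1)];
        }
        best
    }
}
```

Walking the whole chain (rather than stopping at the first acceptable match) is what lets the best length agree with an exhaustive search; only the choice among equal-length sources depends on the tie-break. An encoder built on this would emit `ring_index(src)` and the length for each match returned by `longest_match`.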

huffman decode — bit-by-bit → table lookup

The standalone canonical-Huffman decoder walked each code one bit at a time (a `BitReader` call per bit). It now builds a single peek-and-lookup table indexed by the next `max_length` bits (`max_length` ≤ 15, so the table is ≤ 64 KiB) and decodes one symbol per lookup.

  • ~1.9–2.1× fewer decode instructions (deterministic callgrind) on both text and high-entropy input.
  • Output identical; corrupt/truncated streams still rejected (Corrupt/UnexpectedEnd) without panicking.
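
A minimal sketch of single-table canonical-Huffman decoding, not the decoder from this PR. The error names `Corrupt` and `UnexpectedEnd` and the 15-bit limit come from the description above; MSB-first bit order, the packed `u16` entry layout, a caller-supplied symbol count, and the names `Table`, `build_table`, `peek`, and `decode` are assumptions made for illustration.

```rust
#[derive(Debug, PartialEq)]
enum DecodeError {
    Corrupt,
    UnexpectedEnd,
}

struct Table {
    max_len: u32,
    entries: Vec<u16>, // (symbol << 4) | code length; 0 marks an unused slot
}

/// Build the lookup table from per-symbol code lengths (0 = symbol unused).
fn build_table(lengths: &[u8]) -> Result<Table, DecodeError> {
    let max_len = lengths.iter().copied().max().unwrap_or(0) as u32;
    if max_len == 0 || max_len > 15 || lengths.len() > (1usize << 12) {
        return Err(DecodeError::Corrupt);
    }
    let mut entries = vec![0u16; 1usize << max_len];
    let mut code: usize = 0; // next canonical code at the current length
    for len in 1..=max_len {
        let shift = max_len - len;
        for (sym, &l) in lengths.iter().enumerate() {
            if l as u32 != len {
                continue;
            }
            // Every table index whose top `len` bits equal `code` maps to `sym`.
            let start = code << shift;
            let end = start + (1usize << shift);
            if end > entries.len() {
                return Err(DecodeError::Corrupt); // over-subscribed code lengths
            }
            entries[start..end].fill(((sym as u16) << 4) | len as u16);
            code += 1;
        }
        code <<= 1;
    }
    Ok(Table { max_len, entries })
}

/// Peek `n` (<= 15) bits at `bitpos`, MSB-first, zero-padded past the end of `data`.
fn peek(data: &[u8], bitpos: usize, n: u32) -> usize {
    let byte = bitpos / 8;
    let mut acc: u32 = 0;
    for i in 0..3 {
        acc = (acc << 8) | data.get(byte + i).copied().unwrap_or(0) as u32;
    }
    let shift = 24 - (bitpos % 8) as u32 - n;
    ((acc >> shift) & ((1u32 << n) - 1)) as usize
}

/// Decode `count` symbols, one table lookup each.
fn decode(table: &Table, data: &[u8], count: usize) -> Result<Vec<u16>, DecodeError> {
    let mut out = Vec::with_capacity(count);
    let mut bitpos = 0usize;
    for _ in 0..count {
        let entry = table.entries[peek(data, bitpos, table.max_len)];
        let len = (entry & 0xF) as usize;
        if len == 0 {
            return Err(DecodeError::Corrupt); // bit pattern matches no code
        }
        if bitpos + len > data.len() * 8 {
            return Err(DecodeError::UnexpectedEnd); // code runs past the input
        }
        bitpos += len;
        out.push(entry >> 4);
    }
    Ok(out)
}
```

With 2-byte entries and at most 15 index bits the table tops out at 2^15 × 2 bytes = 64 KiB, matching the bound quoted above. For code lengths `[1, 2, 2]` the codes are `0`, `10`, `11`, so the byte `0x58` (`0 10 11 0` plus padding) decodes to symbols 0, 1, 2, 0.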

Notes

Other slow paths were checked and left alone as already-optimal or inherently costly: the range coder (8-bit bit-tree, ~37 instr/bit), h2-huffman (already a byte-wide FSA), mtf (already single-pass; random is a non-goal), bwt (SA-IS). The multi-MB, highly-repetitive lzss case is ~30% slower than the old early-breaking brute force but stays >270 MB/s — an acceptable trade for fixing the 700× worst case.

Verification

Full suite (61 binaries), clippy, and fmt are clean. lzss ratio is preserved and output round-trips; a 60-case huffman fuzz round-trips, and 30 corrupt inputs go through our decoder without a panic.

🤖 Generated with Claude Code

Reviewed the codec suite for optimization headroom (bench across every
algorithm). Two clear algorithmic wins, both keeping output correct:

lzss encode: the finder compared each position against all 4096 ring-buffer
slots — O(N·n) regardless of content, so incompressible input collapsed to
~0.3 MB/s. Replace it with a hash chain over the raw input (translating a
match source at input position `cand` to the decoder's ring index
`(cand + N - F) & (N - 1)`). Output size is unchanged because it depends only
on match lengths, which the fully-walked chain reproduces; only the tie-broken
source position can differ. ~9x faster on text, ~700x on random at 1 MiB;
compressed sizes within 0.01% across text/binary/zeros/code.

huffman decode: the canonical decoder walked each code one bit at a time
(one BitReader call per bit). Build a single peek-and-lookup table indexed by
the next max_length bits (<= 15, so <= 64 KiB) and decode a symbol per lookup.
~1.9-2.1x fewer decode instructions on both text and high-entropy input;
output identical, corrupt/truncated streams still rejected without panic.

Verified: full suite (61 binaries), clippy, fmt clean; lzss ratio preserved
and round-trips; 60-case huffman fuzz round-trips and 30 corrupt inputs go
through our decoder without panic.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@MagicalTux MagicalTux merged commit ca385e2 into master Jul 1, 2026
42 checks passed
@MagicalTux MagicalTux deleted the perf/lzss-huffman-decode branch July 1, 2026 00:31