Encoder compression-ratio improvements (lzma/lz4/zstd/bzip2/brotli) #96
Merged
Replace the textbook frequency-only Huffman length builder with a faithful port of bzip2's BZ2_hbMakeCodeLengths. The reference packs subtree depth into the low 8 bits of each heap key so equal-frequency merges prefer the shallower subtree, reproducing bzip2's exact per-table bit costs. Design length is capped at 17 bits (bzip2's limit since 1.0.3); over-long codes trigger frequency halving and retry. The decode side still accepts up to 20 bits for pre-1.0.3 compatibility.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
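The depth-in-the-key trick can be sketched as follows. This is a minimal illustration, not the ported code: `add_weights` and `make_code_lengths` are hypothetical names, and the sketch uses `std::collections::BinaryHeap` rather than bzip2's hand-rolled heap, so tie order between identical keys is not guaranteed to match the reference. It assumes frequencies fit in 24 bits.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Merge two heap keys: sum the frequencies held in the high 24 bits and
/// store 1 + max(depth) in the low 8 bits.
fn add_weights(a: u32, b: u32) -> u32 {
    let freq = (a & 0xffff_ff00) + (b & 0xffff_ff00);
    // Clamped (a simplification) so the depth never spills into the frequency bits.
    let depth = (1 + (a & 0xff).max(b & 0xff)).min(0xff);
    freq | depth
}

/// Returns one code length per symbol, each at most `max_len`.
/// Assumes `max_len` is large enough for the alphabet; otherwise the
/// halve-and-retry loop cannot terminate.
pub fn make_code_lengths(freqs: &[u32], max_len: u32) -> Vec<u32> {
    let n = freqs.len();
    // A zero frequency is treated as 1 so every symbol gets a code.
    let mut weights: Vec<u32> = freqs.iter().map(|&f| f.max(1) << 8).collect();
    loop {
        // parent[i] is the tree parent of node i; leaves are 0..n.
        let mut parent: Vec<Option<usize>> = vec![None; n];
        let mut heap: BinaryHeap<Reverse<(u32, usize)>> = weights
            .iter()
            .enumerate()
            .map(|(i, &w)| Reverse((w, i)))
            .collect();
        while heap.len() > 1 {
            // Smallest key first: lower frequency, then shallower subtree.
            let Reverse((w1, n1)) = heap.pop().unwrap();
            let Reverse((w2, n2)) = heap.pop().unwrap();
            let id = parent.len();
            parent.push(None);
            parent[n1] = Some(id);
            parent[n2] = Some(id);
            heap.push(Reverse((add_weights(w1, w2), id)));
        }
        // A leaf's code length is its distance to the root.
        let lengths: Vec<u32> = (0..n)
            .map(|leaf| {
                let mut depth = 0;
                let mut node = leaf;
                while let Some(p) = parent[node] {
                    node = p;
                    depth += 1;
                }
                depth
            })
            .collect();
        if lengths.iter().all(|&l| l <= max_len) {
            return lengths;
        }
        // Too deep: halve every frequency (keeping it >= 1) and retry.
        for w in weights.iter_mut() {
            *w = (1 + (*w >> 8) / 2) << 8;
        }
    }
}
```

Because the depth sits below the frequency in the key, two subtrees with equal frequency compare by depth, which is what makes the shallower one merge first.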
Replace the single-shared-table encoder (which replicated one table
num_tables times and pinned every selector to 0, giving up all
multi-table compression) with a faithful port of reference bzip2's
sendMTFValues:
- Initialise 2..=6 tables by partitioning the alphabet into bands of
roughly equal cumulative frequency (low->high, matching the
reference's nPart/gs/ge walk including the odd-iteration back-off).
- Run 4 refinement passes: assign each 50-symbol group to the table
that codes it cheapest, accumulate that table's per-symbol
frequencies, then rebuild every table from its assigned groups.
- Emit real per-group selectors (MTF + unary, already wired) instead
of all-zeros.
This is the bulk of the ratio win: at matching block size our output is
now byte-for-byte identical to bzip2 on the test corpus.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
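One refinement pass from the list above might look like the sketch below. The function name `assign_groups` and its signature are assumptions for illustration; only the 50-symbol group size and the "cheapest table wins, then accumulate its frequencies" rule come from the description.

```rust
const GROUP_SIZE: usize = 50;

/// One refinement pass: pick the cheapest table for each 50-symbol group
/// and accumulate that table's per-symbol frequencies.
/// `lens[t][s]` is the code length of symbol `s` in table `t`.
/// Returns (selectors, per-table frequencies); the caller would rebuild
/// every table from the frequencies and repeat.
pub fn assign_groups(symbols: &[u16], lens: &[Vec<u8>]) -> (Vec<usize>, Vec<Vec<u32>>) {
    let alpha = lens.first().map_or(0, |t| t.len());
    let mut freqs = vec![vec![0u32; alpha]; lens.len()];
    let mut selectors = Vec::new();
    for group in symbols.chunks(GROUP_SIZE) {
        // Bit cost of the group under table t is the sum of its code lengths.
        // `min_by_key` keeps the first table on ties.
        let best = (0..lens.len())
            .min_by_key(|&t| {
                group
                    .iter()
                    .map(|&s| u32::from(lens[t][s as usize]))
                    .sum::<u32>()
            })
            .expect("at least one table");
        selectors.push(best);
        for &s in group {
            freqs[best][s as usize] += 1;
        }
    }
    (selectors, freqs)
}
```

The returned selectors are the per-group values that get MTF + unary coded; pinning them all to 0 is what the old encoder effectively did.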
Reference bzip2 sizes each block by its post-RLE-1 length (nblock), not by the count of raw input bytes, so on compressible data it packs more raw bytes into a block. Our encoder capped raw input, producing more (smaller) blocks than bzip2 -9 (4 vs 3 on the corpus) and paying extra block + BWT-context overhead.

Add a streaming Rle1Encoder that the block-builder feeds raw bytes through, cutting a block once the tracked post-RLE-1 length reaches the cap. The raw buffer is still retained for the block CRC and the in-block RLE-1 re-run. This closes the remaining gap to bzip2 -9 and makes our level-6/9 output bit-identical to the reference.

Tests: streaming-RLE1 vs one-shot equivalence (incl. 255-cap and run-straddling), encoded_len tracking, large compressible multi-block round-trip, and a low-level forced-multi-block round-trip.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
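A rough sketch of tracking the post-RLE-1 length byte by byte, next to a one-shot encoder it can be checked against. `Rle1LenTracker` and `rle1_encode` are hypothetical names, not the PR's `Rle1Encoder`; the sketch only assumes bzip2's RLE-1 rule that a run of 4 to 255 equal bytes becomes four literal bytes plus a count byte.

```rust
/// Streaming tracker for the post-RLE-1 length of the bytes fed so far.
#[derive(Default)]
pub struct Rle1LenTracker {
    prev: Option<u8>,
    run: u32,
    pub encoded_len: usize,
}

impl Rle1LenTracker {
    pub fn push(&mut self, b: u8) {
        if self.prev == Some(b) && self.run < 255 {
            self.run += 1;
            // Bytes 2 and 3 of a run are literal; the 4th adds itself plus
            // the count byte; bytes 5..=255 are absorbed by the count byte.
            self.encoded_len += match self.run {
                2 | 3 => 1,
                4 => 2,
                _ => 0,
            };
        } else {
            // New run (different byte, or the previous run hit the 255 cap).
            self.prev = Some(b);
            self.run = 1;
            self.encoded_len += 1;
        }
    }
}

/// One-shot RLE-1 encoder used as the reference for the tracker.
pub fn rle1_encode(data: &[u8]) -> Vec<u8> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < data.len() {
        let b = data[i];
        let mut run = 1;
        while run < 255 && i + run < data.len() && data[i + run] == b {
            run += 1;
        }
        for _ in 0..run.min(4) {
            out.push(b);
        }
        if run >= 4 {
            out.push((run - 4) as u8);
        }
        i += run;
    }
    out
}
```

A block builder in this style would call `push` for each raw byte and cut the block once `encoded_len` reaches the cap.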
Replace the single-entry hash table on the high-compression path with an LZ4-HC-style hash-chain match finder (head + prev chains) that searches up to nb_attempts candidates per position for the longest match in the 64 KiB window, plus one-step lazy matching. Search depth scales with the level. The fast greedy single-hash parse is preserved for low levels (LZ4's speed crown).

`encode_block_level(input, out, level)` dispatches: level < 3 uses the fast path, level >= 3 uses HC. Wire the level knob through both block (`Lz4`) and frame (`LZ4Frame`) EncoderConfig and `factory::encoder_by_name_with_level`, so `-l 9` now actually engages HC.

The bitstream is unchanged: only the parse improves, so existing decoders (ours and the reference `lz4` tool) read it unchanged. On the 2.9 MB corpus the block stream drops from 1214053 (fast) to ~933986 bytes at level 12 (ratio vs `lz4 -9` 796072: 1.53 -> 1.17).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
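The head + prev chain search can be sketched like this. Everything here is illustrative: `HashChain`, the 15-bit hash, and the multiplier are assumptions, and lazy matching is left out. The caller is expected to query a position before inserting it.

```rust
const MIN_MATCH: usize = 4;
const MAX_DISTANCE: usize = 65_535; // largest offset an LZ4 sequence can encode
const HASH_BITS: u32 = 15; // assumed table size for this sketch
const NIL: u32 = u32::MAX;

fn hash4(data: &[u8], pos: usize) -> usize {
    let v = u32::from_le_bytes([data[pos], data[pos + 1], data[pos + 2], data[pos + 3]]);
    (v.wrapping_mul(2_654_435_761) >> (32 - HASH_BITS)) as usize
}

/// `head[h]` is the most recent position with hash `h`;
/// `prev[p]` is the previous position with the same hash as `p`.
/// Positions are stored as u32, so inputs are assumed smaller than 4 GiB.
pub struct HashChain {
    head: Vec<u32>,
    prev: Vec<u32>,
}

impl HashChain {
    pub fn new(input_len: usize) -> Self {
        HashChain {
            head: vec![NIL; 1 << HASH_BITS],
            prev: vec![NIL; input_len],
        }
    }

    pub fn insert(&mut self, data: &[u8], pos: usize) {
        if pos + MIN_MATCH > data.len() {
            return;
        }
        let h = hash4(data, pos);
        self.prev[pos] = self.head[h];
        self.head[h] = pos as u32;
    }

    /// Walk at most `nb_attempts` chain entries and return the longest
    /// match as (length, distance).
    pub fn longest_match(
        &self,
        data: &[u8],
        pos: usize,
        nb_attempts: usize,
    ) -> Option<(usize, usize)> {
        if pos + MIN_MATCH > data.len() {
            return None;
        }
        let mut cand = self.head[hash4(data, pos)];
        let mut best: Option<(usize, usize)> = None;
        for _ in 0..nb_attempts {
            if cand == NIL {
                break;
            }
            let c = cand as usize;
            if c >= pos || pos - c > MAX_DISTANCE {
                break; // chain entries only get older from here
            }
            let len = data[c..]
                .iter()
                .zip(&data[pos..])
                .take_while(|(a, b)| a == b)
                .count();
            if len >= MIN_MATCH && best.map_or(true, |(l, _)| len > l) {
                best = Some((len, pos - c));
            }
            cand = self.prev[c];
        }
        best
    }
}
```

With one attempt the finder behaves like the single-entry table (most recent candidate only); raising `nb_attempts` lets it reach older, longer matches.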
Add a forward dynamic-programming optimal parse used at levels >= 10. It minimises the encoded byte cost: price[i] is the cheapest way to reach position i via a marginal-priced literal step (tracking the run length so the run-length token overflow is charged exactly) or any hash-chain match (token + offset + match-length-extension cost). Backtracking recovers the cheapest path. Long matches (>= OPT_SUFFICIENT_LEN) are taken greedily and the interior is skipped, keeping highly-repetitive inputs near-linear (an all-'a' 2 MB block drops from 72 s to 0.01 s).

Also fix a latent conformance bug in the FAST greedy parse: its inner probe loop was bounded by hash_limit (len - 4 - 5) rather than match_limit (len - MFLIMIT), so it could emit a match *starting* inside the final 12 bytes of a block. Such blocks round-trip through our (lenient) decoder but are rejected by the strict reference `lz4` decoder. Bounding the probe at match_limit makes every emitted block spec-conformant. A new test walks the emitted bitstream and asserts the end-of-block rules at all levels.

With this fix the whole 2.9 MB corpus now cross-decodes through the system `lz4 -d` at every level (0..12), not just the HC levels.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
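A bitstream walker for the end-of-block rules could look roughly like the following. `check_end_of_block` is a hypothetical name and this is not the PR's test; it assumes the LZ4 block format rules that the last sequence is literals-only, the last 5 bytes are literals, and the last match starts at least 12 bytes before the end of the decoded block.

```rust
/// Read an LZ4 length: a 4-bit base, extended by 255-valued bytes when 15.
fn read_len_ext(block: &[u8], ip: &mut usize, base: usize) -> Result<usize, String> {
    let mut len = base;
    if base == 15 {
        loop {
            let b = *block.get(*ip).ok_or("truncated length")? as usize;
            *ip += 1;
            len += b;
            if b != 255 {
                break;
            }
        }
    }
    Ok(len)
}

/// Walk the sequences of one LZ4 block (without decoding the bytes) and
/// check the end-of-block rules.
pub fn check_end_of_block(block: &[u8]) -> Result<(), String> {
    let mut ip = 0usize; // read position in the compressed block
    let mut out_len = 0usize; // decoded bytes produced so far
    let mut last_match_start: Option<usize> = None;
    loop {
        let token = *block.get(ip).ok_or("truncated token")?;
        ip += 1;
        let lit_len = read_len_ext(block, &mut ip, (token >> 4) as usize)?;
        ip += lit_len;
        if ip > block.len() {
            return Err("literals run past the block".into());
        }
        out_len += lit_len;
        if ip == block.len() {
            // Final, literals-only sequence.
            if let Some(start) = last_match_start {
                if lit_len < 5 {
                    return Err("fewer than 5 trailing literals".into());
                }
                if start + 12 > out_len {
                    return Err("match starts inside the final 12 bytes".into());
                }
            }
            return Ok(());
        }
        if ip + 2 > block.len() {
            return Err("truncated offset".into());
        }
        let offset = u16::from_le_bytes([block[ip], block[ip + 1]]) as usize;
        ip += 2;
        if offset == 0 || offset > out_len {
            return Err("invalid offset".into());
        }
        let match_len = read_len_ext(block, &mut ip, (token & 0x0f) as usize)? + 4;
        last_match_start = Some(out_len);
        out_len += match_len;
    }
}
```

A block that ends directly after a match fails with "truncated token", which matches the rule that the final sequence carries literals only.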
Replace the greedy/lazy parser in the shared LZMA chunk encoder and the
.lzma alone encoder with a forward dynamic-programming optimal parse
modelled on the LZMA SDK's GetOptimum:
- Range-coder bit-price model (fixed-point -log2 ProbPrices table) read
from a snapshot of the live probability model, covering literals,
matches, rep0..rep3, and short-reps.
- Hash-chain match finder extended to return the full candidate set
(shortest distance per achievable length) plus the four rep distances.
- Forward DP over a level-scaled look-ahead window with per-segment price
refresh (COMMIT_CAP) so the snapshot tracks the adapting model; early
commit on long matches (nice_len) mirrors the SDK's cut-off.
- level knob wired through EncoderParams/LevelParams: higher level =
deeper chain, longer nice_match, larger window.
- min(greedy, optimal) guard guarantees no level regresses below the
greedy baseline on pathological tiny inputs.
Ratio on the 2.9 MB source corpus (ours vs xz -9 = 485868):
.lzma level 9: 585481 -> 521918 (1.21 -> 1.07)
xz level 9: 763828 -> 734100 (xz path capped by 64 KiB per-chunk
model+dict reset in the framing)
Decoders unchanged; output stays valid and cross-decodes with system xz.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
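The bit-price model in the first bullet can be approximated as below. This sketch assumes the usual LZMA constants (11-bit probabilities, a table indexed by the top 7 probability bits, prices in 1/16 bit) and computes the table with floating point for brevity, which is not how the SDK builds it; `build_prob_prices` and `bit_price` are hypothetical names.

```rust
const PROB_BITS: u32 = 11;
const PROB_ONE: u32 = 1 << PROB_BITS; // 2048
const MOVE_REDUCING_BITS: u32 = 4; // table is indexed by prob >> 4
const PRICE_SCALE: f64 = 16.0; // prices are in 1/16 of a bit

/// Fixed-point -log2(p) table with 128 entries.
pub fn build_prob_prices() -> Vec<u32> {
    (0..(PROB_ONE >> MOVE_REDUCING_BITS))
        .map(|i| {
            // Use the midpoint of the probability bucket this entry covers.
            let p = f64::from((i << MOVE_REDUCING_BITS) + 8) / f64::from(PROB_ONE);
            (-p.log2() * PRICE_SCALE).round() as u32
        })
        .collect()
}

/// Price of coding `bit` when `prob` is the model's probability of a 0 bit
/// (expected to lie in 1..PROB_ONE).
pub fn bit_price(prices: &[u32], prob: u32, bit: u32) -> u32 {
    let p = if bit == 0 { prob } else { PROB_ONE - prob };
    prices[(p >> MOVE_REDUCING_BITS) as usize]
}
```

The DP sums these prices along each candidate literal, match, or rep encoding; reading them from a snapshot keeps pricing cheap while the live model keeps adapting.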
The greedy-vs-optimal min() guard only matters on small, highly-repetitive inputs where the optimal parser's cold-start price model can briefly lose. On inputs larger than 64 KiB the optimal parse always wins overall, so the second greedy pass is pure waste. Gate it on input size, cutting the .lzma level-9 encode of the 2.9 MB corpus from ~31s to ~27s with identical output.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
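The size gate amounts to something like this sketch. `encode_guarded` and the closure-based signature are assumptions for illustration; the 64 KiB threshold is the one stated above.

```rust
const GUARD_MAX_INPUT: usize = 64 * 1024;

/// Run the optimal parser, and only on small inputs also run the greedy
/// parser and keep whichever output is shorter.
pub fn encode_guarded<G, O>(input: &[u8], greedy: G, optimal: O) -> Vec<u8>
where
    G: Fn(&[u8]) -> Vec<u8>,
    O: Fn(&[u8]) -> Vec<u8>,
{
    let opt = optimal(input);
    if input.len() > GUARD_MAX_INPUT {
        return opt; // skip the second pass entirely on large inputs
    }
    let gre = greedy(input);
    if gre.len() < opt.len() {
        gre
    } else {
        opt
    }
}
```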
…alphabets

The literals section silently fell back to a Raw_Literals_Block (no entropy coding at all) whenever the literal alphabet's highest present byte exceeded index 127, which is the common case for UTF-8 text, since multi-byte lead/continuation bytes (0xC2, 0xA7, …) push `last_present` past the 128-weight cap of the direct nibble tree encoding. On the 2.9 MB source-code corpus every block hit this path, so ~180 KB of literals were stored uncompressed.

Add `encode_huff_tree_fse`, the encoder counterpart to the decoder's existing `decode_fse_weights`: it normalises the weight histogram (alphabet 0..=11, accuracy_log ≤ 6), serialises an FSE table header, and writes the two interleaved FSE weight streams backwards so the forward decoder replays them in order. The literals builder now serialises the tree as the smaller of the direct and FSE encodings, and only bails when neither fits.

Spec-compliant (system `zstd -d` decodes our frames byte-for-byte) and round-trips through our own decoder. Corpus at max level: 731606 -> 686651.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
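For context, the direct (nibble) encoding and its 128-weight cap can be sketched as below. `encode_weights_direct` is a hypothetical name; the layout assumed is the Zstandard one, where a header byte of 127 + N announces N four-bit weights packed two per byte, high nibble first, with the last symbol's weight left implicit and therefore excluded from the input.

```rust
/// Direct Huffman weight encoding: returns None when the weights do not
/// fit, in which case a caller would fall back to the FSE-compressed form.
pub fn encode_weights_direct(weights: &[u8]) -> Option<Vec<u8>> {
    let n = weights.len();
    // The header byte is 127 + n and must stay <= 255, so n <= 128.
    // Weights above 11 are not valid for an 11-bit maximum code length.
    if n == 0 || n > 128 || weights.iter().any(|&w| w > 11) {
        return None;
    }
    let mut out = Vec::with_capacity(1 + (n + 1) / 2);
    out.push((127 + n) as u8);
    for pair in weights.chunks(2) {
        let hi = pair[0];
        let lo = pair.get(1).copied().unwrap_or(0); // pad an odd tail with 0
        out.push((hi << 4) | lo);
    }
    Some(out)
}
```

Any literal alphabet reaching into the upper half of the byte range needs more weights than this form can carry, which is why UTF-8 text kept missing it.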
A fresh offset at distance D spends ~log2(D) FSE-code bits plus ~log2(D) offset extra bits; a repeat offset (codes 1..=3) spends only the FSE code and no extra bits. On the source corpus the offset extra bits are the single largest part of the output (~288 KB of a ~550 KB sequence section), so a repeat match that is a few bytes shorter than the best fresh match is often the cheaper encoding.

`best_at` previously took any fresh match strictly longer than the repeat candidate. It now requires the fresh match to beat the repeat by a distance-dependent byte margin (~2*log2(D)/6) before displacing it, keeping the repeat when the fresh match's extra offset bits wouldn't pay for the few bytes of extra length.

Small but clean win; round-trips through our decoder and decodes byte-for-byte under system zstd.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
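The margin rule reads roughly like this in code. The function names are hypothetical, and the rounding (floor of log2, integer division) and the strict comparison are assumptions; the description only gives the approximate margin ~2*log2(D)/6.

```rust
/// floor(log2(d)) for d >= 1 (0 is treated as 1).
fn floor_log2(d: u32) -> u32 {
    31 - d.max(1).leading_zeros()
}

/// Extra bytes a fresh match at distance `dist` must gain over the repeat
/// candidate before it is worth paying for the offset bits.
pub fn repeat_margin(dist: u32) -> usize {
    (2 * floor_log2(dist) / 6) as usize
}

/// True when the fresh match should displace the repeat-offset match.
pub fn fresh_displaces_repeat(fresh_len: usize, fresh_dist: u32, repeat_len: usize) -> bool {
    fresh_len > repeat_len + repeat_margin(fresh_dist)
}
```

At a distance of 64 KiB the margin works out to 5 bytes under these assumptions, while nearby matches need only about one extra byte.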
Greedy/lazy parsing leaves most of the offset-extra-bit cost on the table: each fresh offset at distance D spends ~log2(D) extra bits, and on the source corpus those extra bits are ~288 KB of a ~550 KB sequence section. A forward dynamic program that prices each literal/match by estimated encoded bits (with repeat offsets priced near-free and every match length from MIN_MATCH up to the candidate maximum considered) recovers a meaningful part of that.

- New `optimal_parse`: DP over the block. `price[i]` = cheapest bits to encode buffer[0..i]; reached by a literal step or a match (from the hash chain or the three active repeat offsets). Repeat-offset ring state is carried along the minimum-price path so the DP can prefer matches that reuse offsets.
- New `MatchFinder::collect_matches`: returns distinct-length candidates at the smallest distance achieving each length, which is what the DP needs.
- Shared `finish_compressed_block` / `build_sequences_section` extracted as free functions so both parsers reuse them without aliasing `self.pending`.
- Gated to level >= 17; per-position chain depth capped (OPTIMAL_MAX_CHAIN) so encode time stays bounded (~3.4 s for the 2.9 MB corpus at level 22).

Corpus at level 22: 686651 -> 665337. Round-trips through our decoder and decodes byte-for-byte under system zstd across text, incompressible, mixed, empty, and block-boundary inputs.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
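The shape of the forward DP, stripped of the zstd specifics, is sketched below. This is an illustration under simplifying assumptions: a flat literal price, caller-supplied candidates and match prices, and no repeat-offset ring carried along the path. The name `optimal_parse` is borrowed from the description, but the signature and the `Step` type are hypothetical.

```rust
const MIN_MATCH: usize = 3;

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Step {
    Literal,
    Match { len: usize, dist: usize },
}

/// Forward DP over `n` positions. `candidates(i)` yields (max_len, dist)
/// pairs usable at position i; `match_price(len, dist)` estimates bits.
pub fn optimal_parse<M, P>(n: usize, lit_price: u32, candidates: M, match_price: P) -> Vec<Step>
where
    M: Fn(usize) -> Vec<(usize, usize)>,
    P: Fn(usize, usize) -> u32,
{
    // price[i] = cheapest estimated bits to encode the first i bytes.
    let mut price = vec![u32::MAX; n + 1];
    let mut from: Vec<Option<Step>> = vec![None; n + 1];
    price[0] = 0;
    for i in 0..n {
        // Always finite: the literal step below reaches every position.
        let base = price[i];
        if base + lit_price < price[i + 1] {
            price[i + 1] = base + lit_price;
            from[i + 1] = Some(Step::Literal);
        }
        for (max_len, dist) in candidates(i) {
            // Try every length from MIN_MATCH up to the candidate maximum.
            for len in MIN_MATCH..=max_len.min(n - i) {
                let p = base + match_price(len, dist);
                if p < price[i + len] {
                    price[i + len] = p;
                    from[i + len] = Some(Step::Match { len, dist });
                }
            }
        }
    }
    // Backtrack from the end to recover the cheapest path.
    let mut steps = Vec::new();
    let mut i = n;
    while i > 0 {
        let step = from[i].expect("every position is reachable by literals");
        i -= match step {
            Step::Literal => 1,
            Step::Match { len, .. } => len,
        };
        steps.push(step);
    }
    steps.reverse();
    steps
}
```

Considering shorter lengths of each candidate matters because stopping a match early can line the parse up with a cheaper follow-on match.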
The price-based optimal parser beats lazy on the corpus at every level it can run, so lower its activation threshold from 17 to 13. Levels 13..=16 now use the DP with progressively deeper (still capped) chains, smoothing the ratio curve: L13 686k->681k, L16 687k->671k. Encode time stays well bounded (~1.7 s for the 2.9 MB corpus at level 16). Round-trips and cross-decodes clean across text/incompressible/mixed inputs at every newly-promoted level.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Three encoder-side ratio improvements; bitstream stays spec-compliant (verified: system `brotli -d` decodes our output byte-for-byte across empty/single/text/random/repetitive inputs at q0..11).

- Literal context modeling (new encoder_ctx): per-meta-block the encoder now tallies literal histograms across the 64 brotli literal contexts, evaluates all four context modes (UTF8/MSB6/LSB6/Signed), clusters the contexts into up to 16 trees by an agglomerative bit-cost merge, and emits NTREESL>1 with a literal context map. Previously a single literal tree (NTREESL=1) was always used.
- Cost-aware match selection (find_match_cost): at quality >= 4 the finder maximises an approximate bit gain (len*V - log2(dist)) instead of raw length, preferring closer/cheaper-distance matches. Distance coding is ~58% of our output, so this directly attacks distance extra bits.
- Repeat-distance preference: the LZ77 walk carries a local distance ring (kept in lockstep with plan_commands) and prefers matches reachable at a recent ring distance, which encode as a cheap short code.

On the 2.9 MB corpus, q11: 716306 -> 707558 bytes (ratio vs `brotli -q11` 480480 improves 1.491 -> 1.473). Honest limit: the remaining gap to q11 is the optimal (zopfli-style) parse, which greedy match selection cannot close, since most matches here are at genuinely distinct far distances.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
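The histogram tally and mode comparison from the first bullet can be illustrated with the two context modes that depend only on the previous byte. This is a partial sketch with hypothetical names: the UTF8 and Signed modes need the lookup tables from the brotli specification and are omitted, every input byte is treated as a literal, and the cost is a plain entropy estimate rather than the encoder's real bit cost.

```rust
#[derive(Clone, Copy)]
pub enum ContextMode {
    Lsb6,
    Msb6,
}

/// Literal context ID (0..64) from the previous byte `p1`.
pub fn context_id(mode: ContextMode, p1: u8) -> usize {
    match mode {
        ContextMode::Lsb6 => (p1 & 0x3f) as usize,
        ContextMode::Msb6 => (p1 >> 2) as usize,
    }
}

/// One 256-bin literal histogram per context. The byte before the start
/// of the data is taken to be 0.
pub fn tally(data: &[u8], mode: ContextMode) -> Vec<[u32; 256]> {
    let mut hist = vec![[0u32; 256]; 64];
    let mut p1 = 0u8;
    for &b in data {
        hist[context_id(mode, p1)][b as usize] += 1;
        p1 = b;
    }
    hist
}

/// Entropy estimate in bits of coding each context with its own tree.
pub fn cost_bits(hist: &[[u32; 256]]) -> f64 {
    hist.iter()
        .map(|h| {
            let total: u32 = h.iter().sum();
            h.iter()
                .filter(|&&c| c > 0)
                .map(|&c| -f64::from(c) * (f64::from(c) / f64::from(total)).log2())
                .sum::<f64>()
        })
        .sum()
}
```

An encoder in this style would compute the cost under each mode and keep the cheapest, then merge contexts whose histograms are similar so the number of trees stays small.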
A ratio-focused pass on the encoders. Audit found our deflate/zlib/gzip/lzw are already at parity with reference max level, but the high-effort formats were ~50% behind because their encoders used a simple greedy/lazy parse. This closes much of that gap. Encoder-only — every decoder is untouched, and each format's output still decodes byte-for-byte with its reference tool.
Results (2.9 MB real-source corpus, our `-l 9` vs reference max level; ours/ref, lower = better)

| Reference level | Cross-decode |
|---|---|
| `-9`, `bzip2 -9` | `bzip2 -d` ✅ |
| `xz -9` | `xz --format=lzma -d` ✅ |
| `-9` | `lz4 -d` ✅ |
| `-19` | `zstd -d` ✅ |
| `-9` | `xz -d` ✅ |
| `-q11` | `brotli -d` ✅ |

Highlights: two were genuine encoder bugs, not just weak parsing

- … `bzip2 -9`.
- `.lzma` is now near parity with `xz -9`.
- … `lz4 -d` rejected it).

Honest limits (documented)

- … `-8..-10` territory; brotli ~`-q2..-q4` parse class. Reaching `-19`/`-q11` needs their multi-iteration optimal parsers, beyond a bounded change. Agents reverted speculative changes that didn't pay off.

Checks

- `cargo test --all-features`: 61 suites green, 0 failures (incl. the bzip2 `bunzip2` cross-validation test + new regression tests).
- `cargo fmt --check`, `cargo clippy --all-features --all-targets -D warnings`, rustdoc `-D warnings`: clean.

🤖 Generated with Claude Code