Ratio + speed round 2: xz/zstd/lz4 near parity; bwt/mtf/range faster by MagicalTux · Pull Request #97 · KarpelesLab/compcol

MagicalTux · 2026-06-15T07:46:51Z

Second ratio+speed pass. The previous round left structural ratio gaps on xz/zstd/lz4 (their encoders only searched a small window) and a few slow standalone codecs. This closes most of those. The ratio work is encoder-only: decoders are unchanged apart from one raw-LZMA2 stored-chunk fix (noted under xz/lzma2 below), and every format still decodes byte-for-byte with its reference tool.

Ratio (2.9 MB real-source corpus, our `-l 9` vs reference max; `ours/ref`)

| Format | Prev | Now | Cross-decode |
|---|---|---|---|
| xz / lzma2 vs `xz -9` | 1.51 | 1.10 | `xz -d` ✅ |
| zstd vs `-19` | 1.40 | 1.14 (`-l 9`); 1.04 at `-l 19` (beats `-12`) | `zstd -d` ✅ |
| lz4 vs `-9` | 1.18 | 1.02 (`-l 12` beats `lz4 -9`) | `lz4 -d` ✅ |
| lzma vs `xz -9` | 1.07 | 1.07 | ✅ |
| bzip2 vs `-9` | 1.00 | 1.00 | ✅ |
| gzip vs `-9` | 1.01 | 1.01 | ✅ |
| brotli vs `-q11` | 1.48 | 1.48 (unchanged this round) | ✅ |

How

xz/lzma2 — continuous cross-chunk dictionary. The LZMA2 chunk encoder now keeps the LZ history continuous across chunks (first chunk 0xE0 reset-dict, the rest 0xC0 continue-dict; one match-finder spanning the whole input) instead of full-resetting every 64 KiB. That was the entire structural gap — it now nearly matches the .lzma path. Also fixed the raw-LZMA2 decoder to feed stored chunks into the dictionary.
zstd — cross-block matching + two-pass repricing. A retained ≤8 MiB sliding window lets back-refs cross block boundaries (where most of zstd's edge lives), plus a btultra2-style statistics-driven re-parse and repeat-offset-aware DP pricing. Now beats zstd -12.
lz4 — frame linked-block mode. Matches can now reference up to 64 KiB of prior blocks' output (FLG independence bit cleared; the bitstream is ordinary LZ4, offsets just reach further). lz4 -d resolves them natively.
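For readers unfamiliar with the FLG byte mentioned above, here is a minimal sketch of how the block-independence bit sits in the LZ4 frame descriptor. The bit positions follow the published LZ4 frame format; the names `flg_byte` and `is_linked` are hypothetical and are not taken from this crate.

```rust
// Bit layout of the LZ4 frame descriptor FLG byte (per the LZ4 frame format):
// bits 7-6 version (must be 01), bit 5 block independence,
// bit 4 block checksum, bit 3 content size, bit 2 content checksum,
// bit 1 reserved, bit 0 dictionary ID.
const FLG_VERSION_01: u8 = 0b01 << 6;
const FLG_BLOCK_INDEPENDENCE: u8 = 1 << 5;
const FLG_CONTENT_CHECKSUM: u8 = 1 << 2;

/// Hypothetical helper: build an FLG byte for the two options shown.
fn flg_byte(independent_blocks: bool, content_checksum: bool) -> u8 {
    let mut flg = FLG_VERSION_01;
    if independent_blocks {
        flg |= FLG_BLOCK_INDEPENDENCE;
    }
    if content_checksum {
        flg |= FLG_CONTENT_CHECKSUM;
    }
    flg
}

/// Linked-block mode is signalled by the independence bit being clear.
fn is_linked(flg: u8) -> bool {
    (flg & FLG_BLOCK_INDEPENDENCE) == 0
}
```

With the bit clear, a decoder keeps the previous blocks' output available, so offsets that reach back past the start of the current block resolve without any format extension.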

Speed (standalone codecs, output byte-identical)

bwt encode ~3× (7 → 21 MB/s) — prefix-doubling rotation sort → linear-time SA-IS.
mtf encode ~2.3× (51 → 118 MB/s) — single-pass scan-and-shift.
rangecoder encode ~+15% (42.6 → 49 MB/s), decode ~+4% (56 → 58 MB/s) — tightened hot loops.

Honest notes

The xz/lzma2 and zstd encoders now buffer the whole input to drive one continuous match-finder (same memory profile the .lzma path already had). Higher levels trade encode time for ratio; decode speed is unchanged.
brotli stays ~1.48 (unchanged this round) — reaching -q11 needs its full iterative context-modeling optimal parse, which is beyond a bounded change.

Checks

cargo test --all-features — 61 suites green, 0 failures (incl. new linked-block + continue-dict + SA-IS regression tests, and the bzip2 bunzip2 cross-validation).
Independent reference cross-decode for all formats (table above), re-run on the integrated branch.
cargo fmt --check, cargo clippy --all-features --all-targets -D warnings, rustdoc -D warnings — clean.

🤖 Generated with Claude Code

Add `encode_lzma2_stream`, which range-codes an entire input through a single `LzmaEncCore` + hash-chain match-finder and slices the result into framed LZMA2 chunks.

Only the LZMA *state* (probabilities, reps, range coder) resets per chunk via the new `reset_state_keep_pos`; `output_pos` and the LZ history continue, so a match in a later chunk can reference data from any earlier chunk up to `dict_size` (default 4 MiB) — the same effective window the continuous `.lzma` path enjoys.

The greedy/optimal parses now take an explicit `[pos_start, pos_end)` and clamp emitted match/rep lengths to `pos_end`, so a chunk ends exactly at its boundary while match finding still reaches back over the whole history. Each chunk records `reset_dict` (true only for the first chunk) so callers emit the right control byte.

The single-chunk `encode_lzma_chunk` is kept for the unit tests behind `#[cfg_attr(not(test), allow(dead_code))]`.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
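The chunk slicing described in this commit can be pictured with a small sketch. It only models the bookkeeping (half-open ranges, the `reset_dict` flag, and length clamping); `ChunkPlan`, `plan_chunks` and `clamp_len` are hypothetical names, and the real `encode_lzma2_stream` presumably cuts chunks based on encoded size limits rather than a fixed input stride.

```rust
/// Hypothetical per-chunk record: a half-open input range plus the flag
/// that tells the caller which control byte to emit.
#[derive(Debug, PartialEq)]
struct ChunkPlan {
    pos_start: usize,
    pos_end: usize,
    reset_dict: bool,
}

/// Split `total_len` input bytes into consecutive [pos_start, pos_end) ranges.
/// Only the first chunk resets the dictionary; later ones continue it.
fn plan_chunks(total_len: usize, chunk_len: usize) -> Vec<ChunkPlan> {
    assert!(chunk_len > 0, "chunk_len must be non-zero");
    let mut plans = Vec::new();
    let mut pos = 0;
    while pos < total_len {
        let end = (pos + chunk_len).min(total_len);
        plans.push(ChunkPlan {
            pos_start: pos,
            pos_end: end,
            reset_dict: pos == 0,
        });
        pos = end;
    }
    plans
}

/// Clamp a match or rep length found at `pos` so it never crosses `pos_end`.
/// The match *source* may still lie anywhere in the earlier history.
fn clamp_len(pos: usize, len: usize, pos_end: usize) -> usize {
    len.min(pos_end.saturating_sub(pos))
}
```

A parser using this would fall back to a literal when the clamped length drops below the format's minimum match length.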

…dict

Encoder side: both the xz block payload and the raw LZMA2 stream now buffer the whole input and produce their chunks from `encode_lzma2_stream`. The first compressed chunk is emitted as `0xE0` (reset dict) and every later one as `0xC0` (reset state + props, dictionary continues); the uncompressed fallback likewise switches from `0x01` (reset) to `0x02` (continue) after the first chunk. This drops xz/lzma2 output on a 2.9 MB corpus from ~734 KB (ratio 1.51 vs `xz -9`) to ~532 KB (1.10), matching the `.lzma` path.

Decoder side: an uncompressed (stored) chunk now feeds its bytes into the LZMA2 dictionary via `append_literals`, lazily creating the LZ core with the canonical default props when the stream/block opens with an uncompressed chunk. Without this a `0xC0`/`0x02` continue chunk following an uncompressed chunk would reference an empty or absent dictionary.

The resulting streams cross-decode byte-for-byte with system `xz -d` at every level, including inputs with uncompressed chunks in the middle of an otherwise compressible stream.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
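The control-byte choice in this commit can be sketched as a single function. The `0xE0`/`0xC0`/`0x01`/`0x02` values come from the message above; folding the high bits of `unpacked_len - 1` into the low five bits of a compressed chunk's control byte follows the LZMA2 chunk layout in the xz format. The function name and signature are hypothetical.

```rust
/// Hypothetical helper: pick the LZMA2 control byte for one chunk.
///
/// * uncompressed chunk: 0x01 resets the dictionary, 0x02 continues it
/// * compressed chunk:   0xE0 = reset state + new props + reset dictionary,
///                       0xC0 = reset state + new props, dictionary continues
///
/// For compressed chunks the low 5 bits carry bits 16..=20 of
/// (unpacked_len - 1); the remaining 16 bits follow in the chunk header.
fn lzma2_control_byte(first_chunk: bool, compressed: bool, unpacked_len: usize) -> u8 {
    // A compressed LZMA2 chunk holds at most 2 MiB of unpacked data.
    assert!((1..=(1usize << 21)).contains(&unpacked_len));
    if !compressed {
        return if first_chunk { 0x01 } else { 0x02 };
    }
    let base: u8 = if first_chunk { 0xE0 } else { 0xC0 };
    let size_hi = (((unpacked_len - 1) >> 16) & 0x1F) as u8;
    base | size_hi
}
```

For chunks of 64 KiB or less the size bits are zero, so the emitted bytes are exactly the `0xE0` and `0xC0` values named in the commit.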

The frame encoder previously compressed each block in isolation, so its match window never exceeded the block's own <=64 KiB. Switch the default to LZ4 linked blocks: carry a sliding 64 KiB window of recently emitted raw output and offer it to the block match finder as a dictionary, so a match in one block can reference bytes emitted by earlier blocks.

- block.rs: add `encode_block_level_dict(dict, input, out, level)` plus dict-aware HC and optimal parses. Internally dict and block share one combined buffer; only block positions may start a sequence, but their back-references may reach into the dict region. End-of-block rules (last 5 bytes are literals, last match starts >=12 bytes before block end) are unchanged, and match distances stay <= 65535 so offsets fit the 16-bit field. Empty dict reproduces the previous output exactly.
- frame.rs: Encoder carries a 64 KiB sliding window of prior raw output, passes it as the dict, and advances it after each block. The FLG block-independence bit was already cleared by default. Independent mode keeps an empty window and compresses each block in isolation.

The emitted bitstream is an ordinary LZ4 block; the reference decoder resolves the cross-block offsets natively.

On the 2.9 MB corpus this takes lz4-frame -l9 from 935774 to 811139 bytes (vs lz4 -9 = 796072), and -l12 to 788013 (beating lz4 -9).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
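As an illustration of the sliding window described for frame.rs, the sketch below keeps the last 64 KiB of raw output and exposes it as the dictionary for the next block. `LinkedWindow` and its methods are hypothetical names; the crate's actual window handling may avoid the copy shown here.

```rust
/// Maximum history a 16-bit LZ4 offset can usefully see.
const WINDOW: usize = 64 * 1024;

/// Hypothetical window type: holds up to the last 64 KiB of raw (uncompressed)
/// bytes emitted by earlier blocks.
struct LinkedWindow {
    buf: Vec<u8>,
}

impl LinkedWindow {
    fn new() -> Self {
        Self { buf: Vec::with_capacity(WINDOW) }
    }

    /// Bytes offered to the block match finder as its dictionary.
    fn dict(&self) -> &[u8] {
        &self.buf
    }

    /// Slide the window forward over a block that was just encoded.
    fn advance(&mut self, block: &[u8]) {
        if block.len() >= WINDOW {
            // The block alone fills the window: keep only its tail.
            self.buf.clear();
            self.buf.extend_from_slice(&block[block.len() - WINDOW..]);
        } else {
            // Drop just enough old bytes to make room for the new block.
            let keep = WINDOW - block.len();
            if self.buf.len() > keep {
                let excess = self.buf.len() - keep;
                self.buf.drain(..excess);
            }
            self.buf.extend_from_slice(block);
        }
    }
}
```

In independent mode the same encoder path would simply never call `advance`, so `dict()` stays empty and each block is compressed in isolation.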

- linked_beats_independent_across_boundary: a payload whose second block copies the first block's tail compresses strictly smaller in linked mode than independent mode (only a cross-block reference can exploit it).
- cross_tool_linked_multiblock_our_encode_system_decode: our default (linked) multi-block output decodes byte-for-byte through the system `lz4` tool.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

The forward BWT built the cyclic-rotation order with an O(n log n) prefix-doubling (Manber-Myers) sort. Replace it with linear-time SA-IS suffix-array construction over the doubled block T+T+$, the same algorithm class bzip2's BWT uses (implemented independently here).

A KMP-based tie-break pass reorders runs of equal cyclic rotations (only reachable on fully periodic blocks) by ascending offset, so the emitted (last column, primary index) pair is byte-identical to the previous stable cyclic sort. Verified identical compressed output on text/corpus/repetitive/random/small inputs.

Encode throughput on 2 MiB text: ~8.2 -> ~21 MB/s (2.6x). Decode (inverse transform) unchanged.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
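To make the byte-identical requirement concrete, here is a deliberately naive reference transform: it sorts cyclic rotations directly and breaks ties between equal rotations by ascending offset, which is the ordering the commit says the SA-IS path reproduces. It is not the SA-IS implementation, and it assumes "primary index" means the sorted row holding the unrotated input; `bwt_naive` is a hypothetical name.

```rust
/// Naive forward BWT over cyclic rotations, O(n^2 log n) in the worst case.
/// Returns (last column, primary index), where the primary index is assumed
/// to be the row of the rotation starting at offset 0.
fn bwt_naive(input: &[u8]) -> (Vec<u8>, usize) {
    let n = input.len();
    if n == 0 {
        return (Vec::new(), 0);
    }
    let mut rows: Vec<usize> = (0..n).collect();
    rows.sort_by(|&a, &b| {
        let ra = (0..n).map(|i| input[(a + i) % n]);
        let rb = (0..n).map(|i| input[(b + i) % n]);
        // Equal rotations only occur on fully periodic input;
        // order them by ascending start offset.
        ra.cmp(rb).then(a.cmp(&b))
    });
    // The last column holds the byte that precedes each rotation's start.
    let last: Vec<u8> = rows.iter().map(|&r| input[(r + n - 1) % n]).collect();
    let primary = rows
        .iter()
        .position(|&r| r == 0)
        .expect("offset 0 is always one of the rows");
    (last, primary)
}
```

A slow reference like this is mainly useful as a test oracle: a faster construction can be compared against it on small, periodic, and random inputs.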

Replace the per-bit encode_bit/decode_bit calls with whole-buffer encode_bytes/decode_bytes that:

* hoist range/low (encoder) and range/code/pos (decoder) into locals so they stay in registers across the inner 8-bit walk;
* index the probability array with node & 0xFF on the fixed [u16; 256] model, eliding the per-bit bounds check (the bit-tree walk only ever reads nodes 1..=255);
* inline the probability-counter update and the renormalization byte I/O;
* reserve output capacity up front so renorm pushes don't re-check.

The coding arithmetic and stream format are unchanged, so the bitstream is byte-identical (verified on text + corpus) and existing streams still decode.

Encode ~42.6 -> ~49 MB/s, decode ~56 -> ~58 MB/s on 2 MiB text.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
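The claim that the bit-tree walk only reads nodes 1..=255 is what makes the `node & 0xFF` mask a no-op on the index. The sketch below shows only the node-index walk, not the range-coding arithmetic, and assumes an MSB-first walk starting at node 1 (the usual LZMA-style layout); `bit_tree_nodes` is a hypothetical name.

```rust
/// Return the probability-model node visited for each of the 8 bits of `byte`,
/// walking MSB first from the root node 1.
fn bit_tree_nodes(byte: u8) -> [usize; 8] {
    let mut nodes = [0usize; 8];
    let mut node: usize = 1;
    for (slot, shift) in (0..8).rev().enumerate() {
        let bit = ((byte >> shift) & 1) as usize;
        // This is the node whose probability would code `bit`.
        nodes[slot] = node;
        // Descend: left child for 0, right child for 1.
        node = (node << 1) | bit;
    }
    nodes
}
```

Because the deepest node visited is at most 255, masking with `0xFF` never changes the index; it only lets the compiler prove the access into a `[u16; 256]` array is in bounds.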

encode_byte found the rank with a linear scan and then shifted the prefix with copy_within — two passes over the [0, rank) prefix per byte. Fuse them: walk the table once, sliding each scanned entry back one slot (carrying the previous value forward) until b is found. Each element is touched exactly once, with better locality.

Output is byte-identical (verified on text/corpus/random); decode_byte (and its move_to_front) are unchanged.

Encode ~55 -> ~118 MB/s on 2 MiB text (2.1x).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
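The two variants described here can be compared side by side. This is a sketch of the technique under the assumption that the table is a permutation of all 256 byte values; the function names are hypothetical and the crate's `encode_byte` may differ in detail.

```rust
/// Two-pass version: find the rank, then shift the prefix with copy_within.
fn mtf_encode_byte_two_pass(table: &mut [u8; 256], b: u8) -> u8 {
    let rank = table
        .iter()
        .position(|&x| x == b)
        .expect("table is a permutation of 0..=255");
    table.copy_within(0..rank, 1);
    table[0] = b;
    rank as u8
}

/// Fused version: one walk that slides each scanned entry back one slot,
/// carrying the previous value forward, until `b` is found.
fn mtf_encode_byte_fused(table: &mut [u8; 256], b: u8) -> u8 {
    let mut carry = table[0];
    if carry == b {
        return 0;
    }
    let mut i = 1;
    loop {
        // Terminates because `b` is guaranteed to be somewhere in the table.
        let cur = table[i];
        table[i] = carry;
        if cur == b {
            table[0] = b;
            return i as u8;
        }
        carry = cur;
        i += 1;
    }
}

/// Encode a buffer starting from the identity table.
fn mtf_encode(input: &[u8], fused: bool) -> Vec<u8> {
    let mut table = [0u8; 256];
    for (i, slot) in table.iter_mut().enumerate() {
        *slot = i as u8;
    }
    input
        .iter()
        .map(|&b| {
            if fused {
                mtf_encode_byte_fused(&mut table, b)
            } else {
                mtf_encode_byte_two_pass(&mut table, b)
            }
        })
        .collect()
}
```

Both functions leave the table in the same state after every byte, which is why the fused walk can replace the two-pass one without changing the output.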

…parse

Push the zstd encoder's compression ratio from ~1.35 to ~1.04 vs `zstd -19` on the 2.9 MB source corpus (max level: 665337 -> 512756 bytes), now beating `zstd -12` at every optimal level. Stays fully spec-compliant — `zstd -d` decodes our output byte-for-byte, and our own decoder is unchanged.

Three levers, in order of impact:

- Cross-block matching (the big one): retain a sliding window of previously emitted bytes (`history`, capped at 8 MiB, well inside the advertised 16 MiB window) and let back-references in each block point into earlier blocks. Previously every 128 KiB block was compressed in isolation, which left most of zstd's advantage on the table — `zstd -19` with a matching 128 KiB window only reaches 579965, vs 492721 with its default 8 MiB window. The matcher indexes `history ++ pending` per block; each parser splices the current block's positions into the hash chains lazily as it advances, preserving the LZ invariant that chains never contain positions ahead of the probe.
- Two-pass statistics-driven repricing (btultra2-style): after a first heuristic-priced optimal parse, rebuild a fractional-bit price model from the block's actual LL/ML/OF FSE code statistics and literal-byte Huffman entropy, then reparse; keep the cheaper result (one or two iterations at level >= 15). Prices are carried in fixed point (8 fractional bits) so the DP can weigh sub-bit differences between offset codes.
- Repeat-offset- and literal-length-aware pricing in the DP: carry the literal run length per node to price the LL FSE code correctly and resolve the LL==0 repeat-offset aliasing, and price a match landing on an active repeat offset at just its (cheap) FSE code with no extra bits.

Encode time stays in single-digit seconds (L19 ~5s, L22 ~7s on the corpus) via a level-scaled per-position chain cap for the optimal parse.

Validation: all existing tests green (`cargo test --features all`, 464+ pass); round-trip through our decoder across sizes incl. exact block boundaries, incompressible, and multi-block; reference cross-decode (`zstd -d`) byte-exact at levels 3/12/19/22 on the corpus, random data, duplicated/periodic inputs, and >8 MiB inputs that exercise the window-trim path.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
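The fixed-point pricing mentioned in the second lever can be illustrated with a small sketch: a symbol's cost is its information content, -log2(count/total), scaled by 2^8 so that sub-bit differences survive integer arithmetic. This uses `f64` for clarity, whereas a production encoder would more likely use an integer log approximation; `symbol_price_fp8` and `offset_price` are hypothetical names, not the crate's API.

```rust
/// Number of fractional bits carried by every price (8, as in the commit).
const PRICE_FRAC_BITS: u32 = 8;

/// Price of a symbol seen `count` times out of `total`, in 1/256-bit units.
fn symbol_price_fp8(count: u32, total: u32) -> u32 {
    assert!(count > 0 && count <= total);
    let bits = (total as f64 / count as f64).log2();
    (bits * (1u32 << PRICE_FRAC_BITS) as f64).round() as u32
}

/// Price of an offset: its FSE code price plus the raw extra bits it carries.
/// A match on an active repeat offset carries no extra bits, so it costs
/// only the code price.
fn offset_price(code_price_fp8: u32, extra_bits: u32) -> u32 {
    code_price_fp8 + (extra_bits << PRICE_FRAC_BITS)
}
```

With whole-bit prices, a code used 3 times in 4 and one used 4 times in 4 would both round to zero bits; in 1/256-bit units they cost 106 and 0, which is the kind of difference the DP needs to see.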

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

MagicalTux and others added 9 commits June 15, 2026 16:38

docs: changelog for round-2 ratio + speed work

70e3550

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

MagicalTux merged commit 1d15f7c into master Jun 15, 2026
41 checks passed

MagicalTux merged 9 commits into master from ratio-speed-round2
