
Ratio + speed round 2: xz/zstd/lz4 near parity; bwt/mtf/range faster #97

Merged
MagicalTux merged 9 commits into master from ratio-speed-round2
Jun 15, 2026

@MagicalTux

Member

Second ratio+speed pass. The previous round left structural ratio gaps on xz/zstd/lz4 (their encoders only searched a small window) and a few slow standalone codecs. This closes most of those. Encoder-only for ratio — decoders unchanged; every format still decodes byte-for-byte with its reference tool.

Ratio (2.9 MB real-source corpus, our -l 9 vs reference max; ours/ref)

| Format | Prev | Now | Cross-decode |
|---|---|---|---|
| xz / lzma2 vs xz -9 | 1.51 | 1.10 | xz -d |
| zstd vs -19 | 1.40 | 1.14 (-l 9); 1.04 at -l 19 (beats -12) | zstd -d |
| lz4 vs -9 | 1.18 | 1.02 (-l 12 beats lz4 -9) | lz4 -d |
| lzma vs xz -9 | 1.07 | 1.07 | |
| bzip2 vs -9 | 1.00 | 1.00 | |
| gzip vs -9 | 1.01 | 1.01 | |
| brotli vs -q11 | 1.48 | 1.48 (unchanged this round) | |
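
For readers who want to check the ours/ref arithmetic, here is a minimal sketch that recomputes a few ratios from the exact byte counts quoted in the commit messages of this PR (lz4-frame -l9 935774 → 811139 and -l12 788013 vs `lz4 -9` = 796072; zstd max level 665337 → 512756 vs `zstd -19` = 492721). The helper names `ratio` and `ratio_2dp` are hypothetical and only illustrate the calculation.

```rust
/// Size ratio "ours / reference"; values above 1.0 mean our output is larger.
fn ratio(ours_bytes: u64, ref_bytes: u64) -> f64 {
    ours_bytes as f64 / ref_bytes as f64
}

/// Format a ratio with two decimals, the precision used in the table.
fn ratio_2dp(ours_bytes: u64, ref_bytes: u64) -> String {
    format!("{:.2}", ratio(ours_bytes, ref_bytes))
}
```

With the lz4 counts, `ratio_2dp(935774, 796072)` gives the previous 1.18 and `ratio_2dp(811139, 796072)` gives the current 1.02; a value below 1.00 (as for -l12) means our output is smaller than the reference.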

How

  • xz/lzma2 — continuous cross-chunk dictionary. The LZMA2 chunk encoder now keeps the LZ history continuous across chunks (first chunk 0xE0 reset-dict, the rest 0xC0 continue-dict; one match-finder spanning the whole input) instead of full-resetting every 64 KiB. That was the entire structural gap — it now nearly matches the .lzma path. Also fixed the raw-LZMA2 decoder to feed stored chunks into the dictionary.
  • zstd — cross-block matching + two-pass repricing. A retained ≤8 MiB sliding window lets back-refs cross block boundaries (where most of zstd's edge lives), plus a btultra2-style statistics-driven re-parse and repeat-offset-aware DP pricing. Now beats zstd -12.
  • lz4 — frame linked-block mode. Matches reference up to 64 KiB of prior blocks' output (FLG independence bit cleared; the bitstream is ordinary LZ4, offsets just reach further). lz4 -d resolves them natively.

Speed (standalone codecs, output byte-identical)

  • bwt encode ~3× (7 → 21 MB/s) — prefix-doubling rotation sort → linear-time SA-IS.
  • mtf encode ~2.3× (51 → 118 MB/s) — single-pass scan-and-shift.
  • rangecoder ~+15% encode (42.6 → 49 MB/s), ~+4% decode (56 → 58 MB/s); tightened hot loops.

Honest notes

  • The xz/lzma2 and zstd encoders now buffer the whole input to drive one continuous match-finder (same memory profile the .lzma path already had). Higher levels trade encode time for ratio; decode speed is unchanged.
  • brotli stays at ~1.48: reaching -q11 parity needs its full iterative context-modeling optimal parse, which is beyond the scope of a bounded change.

Checks

  • cargo test --all-features: 61 suites green, 0 failures (incl. new linked-block + continue-dict + SA-IS regression tests, and the bzip2 bunzip2 cross-validation).
  • Independent reference cross-decode for all formats (table above), re-run on the integrated branch.
  • cargo fmt --check, cargo clippy --all-features --all-targets -D warnings, rustdoc -D warnings — clean.

🤖 Generated with Claude Code

MagicalTux and others added 9 commits June 15, 2026 16:38
Add `encode_lzma2_stream`, which range-codes an entire input through a
single `LzmaEncCore` + hash-chain match-finder and slices the result into
framed LZMA2 chunks. Only the LZMA *state* (probabilities, reps, range
coder) resets per chunk via the new `reset_state_keep_pos`; `output_pos`
and the LZ history continue, so a match in a later chunk can reference data
from any earlier chunk up to `dict_size` (default 4 MiB) — the same
effective window the continuous `.lzma` path enjoys.

The greedy/optimal parses now take an explicit `[pos_start, pos_end)` and
clamp emitted match/rep lengths to `pos_end`, so a chunk ends exactly at
its boundary while match finding still reaches back over the whole history.

Each chunk records `reset_dict` (true only for the first chunk) so callers
emit the right control byte. The single-chunk `encode_lzma_chunk` is kept
for the unit tests behind `#[cfg_attr(not(test), allow(dead_code))]`.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
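
The chunk slicing described in the commit above can be sketched roughly as follows. This is not the crate's code: `ChunkPlan`, `plan_chunks`, and `clamp_match_len` are hypothetical names, and the chunk size is left as a parameter. It only shows the two ideas from the message: `reset_dict` is true for the first chunk alone, and emitted lengths are clamped so a chunk ends exactly at `pos_end`.

```rust
/// One planned chunk over the half-open input range [start, end).
/// (Hypothetical type, for illustration only.)
#[derive(Debug, PartialEq)]
struct ChunkPlan {
    start: usize,
    end: usize,
    /// True only for the first chunk; later chunks keep the LZ history.
    reset_dict: bool,
}

/// Split `input_len` bytes into consecutive chunks of at most `chunk_size`.
fn plan_chunks(input_len: usize, chunk_size: usize) -> Vec<ChunkPlan> {
    assert!(chunk_size > 0, "chunk_size must be non-zero");
    let mut plans = Vec::new();
    let mut start = 0;
    while start < input_len {
        let end = (start + chunk_size).min(input_len);
        plans.push(ChunkPlan { start, end, reset_dict: start == 0 });
        start = end;
    }
    plans
}

/// Clamp a candidate match/rep length at `pos` so it never crosses `pos_end`.
/// The match *source* may still lie anywhere in the earlier history.
fn clamp_match_len(pos: usize, len: usize, pos_end: usize) -> usize {
    len.min(pos_end.saturating_sub(pos))
}
```

For example, a 10-byte match found 4 bytes before a chunk boundary would be emitted with length 4, and the remaining bytes are left for the next chunk's parse.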
…dict

Encoder side: both the xz block payload and the raw LZMA2 stream now buffer
the whole input and produce their chunks from `encode_lzma2_stream`. The
first compressed chunk is emitted as `0xE0` (reset dict) and every later
one as `0xC0` (reset state + props, dictionary continues); the uncompressed
fallback likewise switches from `0x01` (reset) to `0x02` (continue) after
the first chunk. This drops xz/lzma2 output on a 2.9 MB corpus from ~734 KB
(ratio 1.51 vs `xz -9`) to ~532 KB (1.10), matching the `.lzma` path.

Decoder side: an uncompressed (stored) chunk now feeds its bytes into the
LZMA2 dictionary via `append_literals`, lazily creating the LZ core with
the canonical default props when the stream/block opens with an
uncompressed chunk. Without this a `0xC0`/`0x02` continue chunk following an
uncompressed chunk would reference an empty or absent dictionary. The
resulting streams cross-decode byte-for-byte with system `xz -d` at every
level, including inputs with uncompressed chunks in the middle of an
otherwise compressible stream.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
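
As an illustration of the control-byte choice described in the commit above, here is a small sketch. The function name `lzma2_control_byte` is hypothetical. The 0xE0 / 0xC0 / 0x01 / 0x02 values come from the commit message; the detail that the low five bits of a compressed chunk's control byte carry the high bits of (uncompressed size - 1) is taken from the LZMA2 chunk layout as I understand it and should be checked against the format documentation.

```rust
/// Pick the LZMA2 control byte for a chunk (illustrative sketch).
///
/// * compressed, first chunk:  0xE0 (reset state + props + dictionary)
/// * compressed, later chunks: 0xC0 (reset state + props, dictionary continues)
/// * stored, first chunk:      0x01 (dictionary reset)
/// * stored, later chunks:     0x02 (dictionary continues)
fn lzma2_control_byte(first_chunk: bool, compressed: bool, unpacked_len: usize) -> u8 {
    if compressed {
        // A compressed chunk holds 1 byte to 2 MiB of uncompressed data.
        assert!((1..=(1usize << 21)).contains(&unpacked_len));
        let base: u8 = if first_chunk { 0xE0 } else { 0xC0 };
        // Low 5 bits: bits 16..=20 of (unpacked_len - 1).
        base | ((((unpacked_len - 1) >> 16) as u8) & 0x1F)
    } else {
        // A stored chunk holds 1 byte to 64 KiB.
        assert!((1..=(1usize << 16)).contains(&unpacked_len));
        if first_chunk { 0x01 } else { 0x02 }
    }
}
```

A decoder that sees 0xC0 or 0x02 keeps its existing dictionary, which is why stored chunks must also be appended to it.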
The frame encoder previously compressed each block in isolation, so its
match window never exceeded the block's own <=64 KiB. Switch the default
to LZ4 linked blocks: carry a sliding 64 KiB window of recently emitted
raw output and offer it to the block match finder as a dictionary, so a
match in one block can reference bytes emitted by earlier blocks.

- block.rs: add `encode_block_level_dict(dict, input, out, level)` plus
  dict-aware HC and optimal parses. Internally dict and block share one
  combined buffer; only block positions may start a sequence, but their
  back-references may reach into the dict region. End-of-block rules
  (last 5 bytes are literals, last match starts >= 12 bytes before block end) are unchanged, and
  match distances stay <= 65535 so offsets fit the 16-bit field. Empty
  dict reproduces the previous output exactly.
- frame.rs: Encoder carries a 64 KiB sliding window of prior raw output,
  passes it as the dict, and advances it after each block. The FLG block-
  independence bit was already cleared by default. Independent mode keeps
  an empty window and compresses each block in isolation.

The emitted bitstream is an ordinary LZ4 block; the reference decoder
resolves the cross-block offsets natively. On the 2.9 MB corpus this
takes lz4-frame -l9 from 935774 to 811139 bytes (vs lz4 -9 = 796072),
and -l12 to 788013 (beating lz4 -9).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
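
A rough sketch of the two frame-level pieces described in the commit above: the FLG byte with the block-independence bit cleared, and a 64 KiB sliding window of prior raw output. The names `flg_byte` and `advance_window` are hypothetical and the bit positions follow the LZ4 frame format as I understand it (version in bits 7-6, block independence in bit 5, content checksum in bit 2); this is not the code in frame.rs.

```rust
/// Maximum history a linked block can reference (matches the 16-bit offset field).
const WINDOW: usize = 64 * 1024;

/// Build an LZ4 frame FLG byte (sketch; only two of the flags are modeled).
fn flg_byte(independent_blocks: bool, content_checksum: bool) -> u8 {
    let mut flg: u8 = 0b01 << 6; // version field, always 01
    if independent_blocks {
        flg |= 1 << 5; // set: blocks independent; clear: linked blocks
    }
    if content_checksum {
        flg |= 1 << 2;
    }
    flg
}

/// Append a block's raw bytes to the window and keep only the last 64 KiB.
fn advance_window(window: &mut Vec<u8>, block: &[u8]) {
    window.extend_from_slice(block);
    if window.len() > WINDOW {
        let excess = window.len() - WINDOW;
        window.drain(..excess);
    }
}
```

In independent mode the encoder would simply never call `advance_window`, so the dictionary passed to the block encoder stays empty.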
- linked_beats_independent_across_boundary: a payload whose second block
  copies the first block's tail compresses strictly smaller in linked mode
  than independent mode (only a cross-block reference can exploit it).
- cross_tool_linked_multiblock_our_encode_system_decode: our default
  (linked) multi-block output decodes byte-for-byte through the system
  `lz4` tool.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The forward BWT built the cyclic-rotation order with an O(n log n)
prefix-doubling (Manber-Myers) sort. Replace it with linear-time SA-IS
suffix-array construction over the doubled block T+T+$, the same
algorithm class bzip2's BWT uses (implemented independently here).

A KMP-based tie-break pass reorders runs of equal cyclic rotations
(only reachable on fully periodic blocks) by ascending offset, so the
emitted (last column, primary index) pair is byte-identical to the
previous stable cyclic sort. Verified identical compressed output on
text/corpus/repetitive/random/small inputs.

Encode throughput on 2 MiB text: ~8.2 -> ~21 MB/s (2.6x). Decode
(inverse transform) unchanged.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
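
To make the byte-identical requirement in the commit above concrete, here is a deliberately naive reference transform, not SA-IS: it sorts all cyclic rotations with a stable sort, so equal rotations stay in ascending offset order, and returns the (last column, primary index) pair. The name `bwt_naive` is hypothetical, and the primary index is assumed to be the row of the unrotated block, which the source does not spell out. A faster construction could be checked against an oracle of this kind on small inputs.

```rust
/// Naive cyclic-rotation BWT (roughly O(n^2 log n)); reference only.
/// Returns (last column, primary index), where the primary index is assumed
/// to be the sorted position of the original, unrotated block.
fn bwt_naive(block: &[u8]) -> (Vec<u8>, usize) {
    let n = block.len();
    let mut rotations: Vec<usize> = (0..n).collect();
    // `sort_by` is stable, so equal rotations (periodic blocks) keep
    // ascending start offsets.
    rotations.sort_by(|&a, &b| {
        let ra = (0..n).map(|k| block[(a + k) % n]);
        let rb = (0..n).map(|k| block[(b + k) % n]);
        ra.cmp(rb)
    });
    // Last column: the byte preceding each rotation's start, cyclically.
    let last: Vec<u8> = rotations.iter().map(|&r| block[(r + n - 1) % n]).collect();
    let primary = rotations.iter().position(|&r| r == 0).unwrap_or(0);
    (last, primary)
}
```

On `b"banana"` this yields `b"nnbaaa"` with primary index 3; on the fully periodic `b"abab"` the stable order gives `b"bbaa"` with primary index 0.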
Replace the per-bit encode_bit/decode_bit calls with whole-buffer
encode_bytes/decode_bytes that:

* hoist range/low (encoder) and range/code/pos (decoder) into locals so
  they stay in registers across the inner 8-bit walk;
* index the probability array with node & 0xFF on the fixed [u16; 256]
  model, eliding the per-bit bounds check (the bit-tree walk only ever
  reads nodes 1..=255);
* inline the probability-counter update and the renormalization byte I/O;
* reserve output capacity up front so renorm pushes don't re-check.

The coding arithmetic and stream format are unchanged, so the bitstream
is byte-identical (verified on text + corpus) and existing streams still
decode. Encode ~42.6 -> ~49 MB/s, decode ~56 -> ~58 MB/s on 2 MiB text.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
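
The `node & 0xFF` indexing mentioned in the commit above can be illustrated without the range coder itself. The sketch below walks the 8-level bit tree for one byte and adapts a `[u16; 256]` model; the function name `walk_and_adapt` is hypothetical, and the 11-bit probabilities with a shift-by-5 update are an assumption (the LZMA-style counter), since the commit does not state the exact update rule. The actual arithmetic coding step is omitted.

```rust
const PROB_ONE: u16 = 1 << 11; // assumed 11-bit probability scale
const MOVE_BITS: u32 = 5; // assumed adaptation rate

/// Walk the bit tree for `byte`, update the visited counters, and return
/// the eight node indices that were used.
fn walk_and_adapt(model: &mut [u16; 256], byte: u8) -> [usize; 8] {
    let mut visited = [0usize; 8];
    let mut node: usize = 1;
    for step in 0..8 {
        let bit = (byte >> (7 - step)) & 1; // most significant bit first
        // Masking keeps the index below 256, so the compiler can see it is
        // in range for the fixed-size array.
        let idx = node & 0xFF;
        visited[step] = idx;
        let p = model[idx];
        model[idx] = if bit == 0 {
            p + ((PROB_ONE - p) >> MOVE_BITS) // a 0 bit became more likely
        } else {
            p - (p >> MOVE_BITS) // a 1 bit became more likely
        };
        node = (node << 1) | bit as usize;
    }
    visited
}
```

Before each of the eight steps `node` lies in 1..=255, so the mask never changes its value; it only makes the range visible to the optimizer.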
encode_byte found the rank with a linear scan and then shifted the
prefix with copy_within — two passes over the [0, rank) prefix per byte.
Fuse them: walk the table once, sliding each scanned entry back one slot
(carrying the previous value forward) until b is found. Each element is
touched exactly once, with better locality.

Output is byte-identical (verified on text/corpus/random); decode_byte
(and its move_to_front) are unchanged. Encode ~55 -> ~118 MB/s on
2 MiB text (2.1x).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
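
The fused scan-and-shift described in the commit above can be sketched as follows, next to a two-pass version for comparison. The function names are hypothetical and this is an independent illustration, not the crate's `encode_byte`. Both assume the table is a permutation of all 256 byte values.

```rust
/// Single pass: slide each scanned entry back one slot until `b` is found.
fn mtf_encode_byte_fused(table: &mut [u8; 256], b: u8) -> u8 {
    let mut carried = table[0];
    if carried == b {
        return 0; // already at the front, nothing moves
    }
    let mut i = 1usize;
    loop {
        let current = table[i];
        table[i] = carried; // shift the previous entry into this slot
        if current == b {
            break;
        }
        carried = current;
        i += 1;
    }
    table[0] = b;
    i as u8
}

/// Two passes: linear scan for the rank, then shift the prefix.
fn mtf_encode_byte_two_pass(table: &mut [u8; 256], b: u8) -> u8 {
    let rank = table
        .iter()
        .position(|&x| x == b)
        .expect("table holds every byte value");
    table.copy_within(0..rank, 1);
    table[0] = b;
    rank as u8
}

/// Initial table 0, 1, ..., 255.
fn identity_table() -> [u8; 256] {
    let mut table = [0u8; 256];
    for (i, slot) in table.iter_mut().enumerate() {
        *slot = i as u8;
    }
    table
}
```

Starting from the identity table, both versions encode `b"banana"` as `[98, 98, 110, 1, 1, 1]` and leave identical tables behind.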
…parse

Push the zstd encoder's compression ratio from ~1.35 to ~1.04 vs `zstd -19`
on the 2.9 MB source corpus (max level: 665337 -> 512756 bytes), now beating
`zstd -12` at every optimal level. Stays fully spec-compliant — `zstd -d`
decodes our output byte-for-byte, and our own decoder is unchanged.

Three levers, in order of impact:

- Cross-block matching (the big one): retain a sliding window of previously
  emitted bytes (`history`, capped at 8 MiB, well inside the advertised 16 MiB
  window) and let back-references in each block point into earlier blocks.
  Previously every 128 KiB block was compressed in isolation, which left most
  of zstd's advantage on the table — `zstd -19` with a matching 128 KiB window
  only reaches 579965, vs 492721 with its default 8 MiB window. The matcher
  indexes `history ++ pending` per block; each parser splices the current
  block's positions into the hash chains lazily as it advances, preserving the
  LZ invariant that chains never contain positions ahead of the probe.

- Two-pass statistics-driven repricing (btultra2-style): after a first
  heuristic-priced optimal parse, rebuild a fractional-bit price model from the
  block's actual LL/ML/OF FSE code statistics and literal-byte Huffman entropy,
  then reparse; keep the cheaper result (one or two iterations at level >= 15).
  Prices are carried in fixed point (8 fractional bits) so the DP can weigh
  sub-bit differences between offset codes.

- Repeat-offset- and literal-length-aware pricing in the DP: carry the literal
  run length per node to price the LL FSE code correctly and resolve the LL==0
  repeat-offset aliasing, and price a match landing on an active repeat offset
  at just its (cheap) FSE code with no extra bits.

Encode time stays in single-digit seconds (L19 ~5s, L22 ~7s on the corpus) via
a level-scaled per-position chain cap for the optimal parse.

Validation: all existing tests green (`cargo test --features all`, 464+ pass);
round-trip through our decoder across sizes incl. exact block boundaries,
incompressible, and multi-block; reference cross-decode (`zstd -d`) byte-exact
at levels 3/12/19/22 on the corpus, random data, duplicated/periodic inputs,
and >8 MiB inputs that exercise the window-trim path.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
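
The fixed-point pricing mentioned in the commit above (8 fractional bits) can be illustrated with a plain Shannon-cost estimate. This is an assumption for illustration: the helper `price_fp8` is hypothetical, and the real encoder derives its prices from FSE and Huffman statistics in ways the commit message does not fully specify.

```rust
const FRAC_BITS: u32 = 8;

/// Estimated cost of a symbol seen `count` times out of `total`,
/// in units of 1/256 bit: round(log2(total / count) * 256).
fn price_fp8(count: u32, total: u32) -> u32 {
    assert!(count > 0 && count <= total);
    let bits = (total as f64 / count as f64).log2();
    (bits * f64::from(1u32 << FRAC_BITS)).round() as u32
}
```

A symbol with probability 1/2 costs 256 units (one bit) and one with probability 3/4 costs 106 units (about 0.41 bit), a difference that whole-bit pricing would round away.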
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@MagicalTux MagicalTux merged commit 1d15f7c into master Jun 15, 2026
41 checks passed