Ratio + speed round 2: xz/zstd/lz4 near parity; bwt/mtf/range faster #97
Merged
Add `encode_lzma2_stream`, which range-codes an entire input through a single `LzmaEncCore` + hash-chain match-finder and slices the result into framed LZMA2 chunks. Only the LZMA *state* (probabilities, reps, range coder) resets per chunk via the new `reset_state_keep_pos`; `output_pos` and the LZ history continue, so a match in a later chunk can reference data from any earlier chunk up to `dict_size` (default 4 MiB), the same effective window the continuous `.lzma` path enjoys.

The greedy/optimal parses now take an explicit `[pos_start, pos_end)` and clamp emitted match/rep lengths to `pos_end`, so a chunk ends exactly at its boundary while match finding still reaches back over the whole history. Each chunk records `reset_dict` (true only for the first chunk) so callers emit the right control byte.

The single-chunk `encode_lzma_chunk` is kept for the unit tests behind `#[cfg_attr(not(test), allow(dead_code))]`.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
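The length clamp described above can be sketched as follows. This is a minimal illustration, not the PR's code: `clamp_match_len` and `MIN_MATCH_LEN` are hypothetical names, and the sketch assumes a candidate that falls below the 2-byte LZMA minimum match length after clamping is simply rejected so the parser can fall back to a literal.

```rust
/// Minimum length of an ordinary LZMA match (assumed constant name).
pub const MIN_MATCH_LEN: usize = 2;

/// Clamp a candidate match of `len` bytes starting at `pos` so it does not
/// run past `pos_end` (exclusive end of the current chunk).
///
/// Returns `None` when the clamped length is too short to emit as a match.
/// Only the length is clamped: the match *source* may still lie anywhere in
/// the earlier history.
pub fn clamp_match_len(pos: usize, len: usize, pos_end: usize) -> Option<usize> {
    // Bytes left in the chunk; zero if `pos` is already at or past the end.
    let room = pos_end.saturating_sub(pos);
    let clamped = len.min(room);
    (clamped >= MIN_MATCH_LEN).then_some(clamped)
}
```

With a chunk ending at position 100, a 5-byte candidate at position 98 is shortened to 2 bytes, and the same candidate at position 99 is rejected.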
…dict

Encoder side: both the xz block payload and the raw LZMA2 stream now buffer the whole input and produce their chunks from `encode_lzma2_stream`. The first compressed chunk is emitted as `0xE0` (reset dict) and every later one as `0xC0` (reset state + props, dictionary continues); the uncompressed fallback likewise switches from `0x01` (reset) to `0x02` (continue) after the first chunk. This drops xz/lzma2 output on a 2.9 MB corpus from ~734 KB (ratio 1.51 vs `xz -9`) to ~532 KB (1.10), matching the `.lzma` path.

Decoder side: an uncompressed (stored) chunk now feeds its bytes into the LZMA2 dictionary via `append_literals`, lazily creating the LZ core with the canonical default props when the stream/block opens with an uncompressed chunk. Without this a `0xC0`/`0x02` continue chunk following an uncompressed chunk would reference an empty or absent dictionary.

The resulting streams cross-decode byte-for-byte with system `xz -d` at every level, including inputs with uncompressed chunks in the middle of an otherwise compressible stream.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
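The control-byte choice above can be sketched as a small function. The names `ChunkKind` and `control_byte` are hypothetical, not taken from the PR. The sketch also assumes the usual LZMA2 layout in which the low five bits of a compressed chunk's control byte carry bits 16 to 20 of the uncompressed size minus one, so `0xE0` and `0xC0` are the base values before those bits are added.

```rust
/// How a chunk's payload is stored (hypothetical type for this sketch).
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum ChunkKind {
    /// Range-coded LZMA data.
    Compressed,
    /// Stored (uncompressed) bytes.
    Stored,
}

/// Pick the LZMA2 control byte for one chunk.
///
/// `reset_dict` is true only for the first chunk of the stream.
/// `uncompressed_len` must be in 1..=2 MiB, the LZMA2 chunk limit.
pub fn control_byte(kind: ChunkKind, reset_dict: bool, uncompressed_len: usize) -> u8 {
    assert!((1..=1 << 21).contains(&uncompressed_len));
    match (kind, reset_dict) {
        (ChunkKind::Stored, true) => 0x01,  // stored, dictionary reset
        (ChunkKind::Stored, false) => 0x02, // stored, dictionary continues
        (ChunkKind::Compressed, reset) => {
            // 0xE0: reset state + props + dict; 0xC0: reset state + props only.
            let base: u8 = if reset { 0xE0 } else { 0xC0 };
            // Bits 16..=20 of (size - 1) live in the low five bits.
            let hi = (((uncompressed_len - 1) >> 16) & 0x1F) as u8;
            base | hi
        }
    }
}
```

For chunks of at most 64 KiB the size bits are zero, so the bytes are exactly `0xE0` and `0xC0` as quoted in the commit message.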
The frame encoder previously compressed each block in isolation, so its match window never exceeded the block's own <=64 KiB. Switch the default to LZ4 linked blocks: carry a sliding 64 KiB window of recently emitted raw output and offer it to the block match finder as a dictionary, so a match in one block can reference bytes emitted by earlier blocks.

- block.rs: add `encode_block_level_dict(dict, input, out, level)` plus dict-aware HC and optimal parses. Internally dict and block share one combined buffer; only block positions may start a sequence, but their back-references may reach into the dict region. End-of-block rules (last 5 bytes are literals, last match starts >=12 bytes before block end) are unchanged, and match distances stay <= 65535 so offsets fit the 16-bit field. Empty dict reproduces the previous output exactly.
- frame.rs: Encoder carries a 64 KiB sliding window of prior raw output, passes it as the dict, and advances it after each block. The FLG block-independence bit was already cleared by default. Independent mode keeps an empty window and compresses each block in isolation.

The emitted bitstream is an ordinary LZ4 block; the reference decoder resolves the cross-block offsets natively. On the 2.9 MB corpus this takes lz4-frame -l9 from 935774 to 811139 bytes (vs lz4 -9 = 796072), and -l12 to 788013 (beating lz4 -9).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
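The sliding window that frame.rs is described as carrying could look roughly like this. `LinkedWindow`, `dict`, and `advance` are hypothetical names for illustration, and the sketch assumes the window simply keeps the most recent 64 KiB of raw (uncompressed) bytes.

```rust
/// Window size offered to the next block as its dictionary.
pub const WINDOW: usize = 64 * 1024;

/// Most recent raw output, capped at `WINDOW` bytes (hypothetical type).
#[derive(Default)]
pub struct LinkedWindow {
    buf: Vec<u8>,
}

impl LinkedWindow {
    /// Bytes to pass as the dictionary for the next block.
    pub fn dict(&self) -> &[u8] {
        &self.buf
    }

    /// Slide the window forward over a block that was just emitted.
    pub fn advance(&mut self, block: &[u8]) {
        if block.len() >= WINDOW {
            // The block alone fills the window: keep only its tail.
            self.buf.clear();
            self.buf.extend_from_slice(&block[block.len() - WINDOW..]);
        } else {
            // Drop just enough old bytes to make room for the new block.
            let keep = WINDOW - block.len();
            if self.buf.len() > keep {
                let excess = self.buf.len() - keep;
                self.buf.drain(..excess);
            }
            self.buf.extend_from_slice(block);
        }
    }
}
```

An encoder loop would call `dict()` before compressing each block and `advance()` afterwards; independent mode would correspond to never calling `advance()`, so the dictionary stays empty.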
- `linked_beats_independent_across_boundary`: a payload whose second block copies the first block's tail compresses strictly smaller in linked mode than independent mode (only a cross-block reference can exploit it).
- `cross_tool_linked_multiblock_our_encode_system_decode`: our default (linked) multi-block output decodes byte-for-byte through the system `lz4` tool.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The forward BWT built the cyclic-rotation order with an O(n log n) prefix-doubling (Manber-Myers) sort. Replace it with linear-time SA-IS suffix-array construction over the doubled block T+T+$, the same algorithm class bzip2's BWT uses (implemented independently here). A KMP-based tie-break pass reorders runs of equal cyclic rotations (only reachable on fully periodic blocks) by ascending offset, so the emitted (last column, primary index) pair is byte-identical to the previous stable cyclic sort.

Verified identical compressed output on text/corpus/repetitive/random/small inputs. Encode throughput on 2 MiB text: ~8.2 -> ~21 MB/s (2.6x). Decode (inverse transform) unchanged.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
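For reference, the output contract that the SA-IS path has to reproduce can be written as a deliberately naive stable sort of cyclic rotations. This is not the PR's algorithm (it is quadratic or worse), and `naive_bwt` is a hypothetical name. It assumes the primary index is the row of the unrotated block in the sorted order, and that equal rotations are ordered by ascending start offset, which is what the tie-break pass is said to preserve.

```rust
/// Naive reference BWT: stable-sort all cyclic rotations of `block`.
///
/// Returns (last column, primary index), where the primary index is taken
/// to be the sorted row that holds the original, unrotated block.
pub fn naive_bwt(block: &[u8]) -> (Vec<u8>, usize) {
    let n = block.len();
    if n == 0 {
        return (Vec::new(), 0);
    }
    let mut order: Vec<usize> = (0..n).collect();
    // `sort_by` is stable, so rotations that compare equal (fully periodic
    // blocks) stay in ascending start-offset order.
    order.sort_by(|&a, &b| {
        let ra = (0..n).map(|k| block[(a + k) % n]);
        let rb = (0..n).map(|k| block[(b + k) % n]);
        ra.cmp(rb)
    });
    // The last column holds the byte preceding each rotation's start.
    let last: Vec<u8> = order.iter().map(|&s| block[(s + n - 1) % n]).collect();
    let primary = order
        .iter()
        .position(|&s| s == 0)
        .expect("rotation 0 is always present");
    (last, primary)
}
```

A fast implementation can be checked against this on small inputs, including periodic ones such as `abab`, where rotations 0 and 2 are equal and must stay in offset order.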
Replace the per-bit `encode_bit`/`decode_bit` calls with whole-buffer `encode_bytes`/`decode_bytes` that:

* hoist range/low (encoder) and range/code/pos (decoder) into locals so they stay in registers across the inner 8-bit walk;
* index the probability array with `node & 0xFF` on the fixed `[u16; 256]` model, eliding the per-bit bounds check (the bit-tree walk only ever reads nodes 1..=255);
* inline the probability-counter update and the renormalization byte I/O;
* reserve output capacity up front so renorm pushes don't re-check.

The coding arithmetic and stream format are unchanged, so the bitstream is byte-identical (verified on text + corpus) and existing streams still decode. Encode ~42.6 -> ~49 MB/s, decode ~56 -> ~58 MB/s on 2 MiB text.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
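The claim that the walk only reads nodes 1..=255 can be illustrated with a sketch of an 8-bit bit-tree walk. `bit_tree_nodes` is a hypothetical helper, and the MSB-first bit order is an assumption; the actual coder may differ in detail.

```rust
/// Return the model indices an 8-bit bit-tree walk reads for `byte`,
/// assuming bits are coded most-significant first.
pub fn bit_tree_nodes(byte: u8) -> [usize; 8] {
    let mut nodes = [0usize; 8];
    let mut node = 1usize; // the root of the tree is node 1, not 0
    for (i, slot) in nodes.iter_mut().enumerate() {
        // The probability for this bit is read at `node`.
        *slot = node;
        let bit = ((byte >> (7 - i)) & 1) as usize;
        // Descend: the left child is 2*node, the right child is 2*node + 1.
        node = (node << 1) | bit;
    }
    // After the eighth bit `node` is 256 + byte, but it is never read.
    nodes
}
```

Because every index read is below 256, masking with `0xFF` never changes it, and on a `[u16; 256]` array the mask lets the compiler prove the access is in bounds.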
`encode_byte` found the rank with a linear scan and then shifted the prefix with `copy_within`: two passes over the [0, rank) prefix per byte. Fuse them: walk the table once, sliding each scanned entry back one slot (carrying the previous value forward) until b is found. Each element is touched exactly once, with better locality.

Output is byte-identical (verified on text/corpus/random); `decode_byte` (and its `move_to_front`) are unchanged. Encode ~55 -> ~118 MB/s on 2 MiB text (2.1x).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
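A minimal sketch of the fused scan-and-shift described above, under the assumption that the table is a permutation of all 256 byte values (so the byte is always found). `identity_table` and `mtf_encode_byte` are hypothetical names, not the PR's.

```rust
/// Initial move-to-front table: byte value i sits at rank i.
pub fn identity_table() -> [u8; 256] {
    std::array::from_fn(|i| i as u8)
}

/// Encode one byte: return its current rank and move it to the front,
/// scanning and shifting in a single pass over the prefix.
pub fn mtf_encode_byte(table: &mut [u8; 256], b: u8) -> u8 {
    // `carry` holds the entry that must be written one slot further back.
    let mut carry = table[0];
    if carry == b {
        return 0; // already at the front, nothing moves
    }
    let mut i = 1usize;
    loop {
        let cur = table[i];
        table[i] = carry; // slide the previous entry back by one slot
        if cur == b {
            break; // found: `i` is the rank of `b`
        }
        carry = cur;
        i += 1;
    }
    table[0] = b;
    i as u8
}
```

Starting from `identity_table()`, encoding the byte `2` returns rank 2 and leaves the table beginning `2, 0, 1, 3, ...`, which matches what a separate scan followed by a prefix shift would produce.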
…parse

Push the zstd encoder's compression ratio from ~1.35 to ~1.04 vs `zstd -19` on the 2.9 MB source corpus (max level: 665337 -> 512756 bytes), now beating `zstd -12` at every optimal level. Stays fully spec-compliant: `zstd -d` decodes our output byte-for-byte, and our own decoder is unchanged.

Three levers, in order of impact:

- Cross-block matching (the big one): retain a sliding window of previously emitted bytes (`history`, capped at 8 MiB, well inside the advertised 16 MiB window) and let back-references in each block point into earlier blocks. Previously every 128 KiB block was compressed in isolation, which left most of zstd's advantage on the table: `zstd -19` with a matching 128 KiB window only reaches 579965, vs 492721 with its default 8 MiB window. The matcher indexes `history ++ pending` per block; each parser splices the current block's positions into the hash chains lazily as it advances, preserving the LZ invariant that chains never contain positions ahead of the probe.
- Two-pass statistics-driven repricing (btultra2-style): after a first heuristic-priced optimal parse, rebuild a fractional-bit price model from the block's actual LL/ML/OF FSE code statistics and literal-byte Huffman entropy, then reparse; keep the cheaper result (one or two iterations at level >= 15). Prices are carried in fixed point (8 fractional bits) so the DP can weigh sub-bit differences between offset codes.
- Repeat-offset- and literal-length-aware pricing in the DP: carry the literal run length per node to price the LL FSE code correctly and resolve the LL==0 repeat-offset aliasing, and price a match landing on an active repeat offset at just its (cheap) FSE code with no extra bits.

Encode time stays in single-digit seconds (L19 ~5s, L22 ~7s on the corpus) via a level-scaled per-position chain cap for the optimal parse.

Validation: all existing tests green (`cargo test --features all`, 464+ pass); round-trip through our decoder across sizes incl. exact block boundaries, incompressible, and multi-block; reference cross-decode (`zstd -d`) byte-exact at levels 3/12/19/22 on the corpus, random data, duplicated/periodic inputs, and >8 MiB inputs that exercise the window-trim path.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
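The fixed-point pricing mentioned in the second lever can be illustrated with a small sketch. `symbol_price` and `FRAC_BITS` are hypothetical names, and the formula (the ideal code length, -log2 of the symbol's observed frequency, scaled by 2^8) is an assumption about the general approach rather than the PR's exact price model.

```rust
/// Number of fractional bits carried in a price (8, as in the commit text).
pub const FRAC_BITS: u32 = 8;

/// Approximate cost of coding a symbol seen `count` times out of `total`,
/// in units of 1/256 bit.
pub fn symbol_price(count: u32, total: u32) -> u32 {
    assert!(count > 0 && count <= total, "symbol must occur at least once");
    // Ideal code length in bits: -log2(count / total).
    let bits = (total as f64 / count as f64).log2();
    // Scale to fixed point and round to the nearest 1/256 bit.
    (bits * (1u32 << FRAC_BITS) as f64).round() as u32
}
```

A symbol with probability 1/2 costs 256 units (one bit) and one with probability 3/4 costs about 106 units, so a dynamic program summing these prices can tell apart choices that differ by less than one whole bit.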
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Second ratio+speed pass. The previous round left structural ratio gaps on xz/zstd/lz4 (their encoders only searched a small window) and a few slow standalone codecs. This closes most of those. Encoder-only for ratio — decoders unchanged; every format still decodes byte-for-byte with its reference tool.
Ratio (2.9 MB real-source corpus, our `-l 9` vs reference max; ours/ref)

[Table flattened in extraction; the ratio cells are lost. Surviving cells: `xz -9`, `xz -d` ✅; `-19`, "`-l 9`); 1.04 at `-l 19` (beats `-12`)", `zstd -d` ✅; `-9`, "`-l 12` beats `lz4 -9`)", `lz4 -d` ✅; `xz -9`, `-9`, `-9`, `-q11`.]

How

- … `0xE0` reset-dict, the rest `0xC0` continue-dict; one match-finder spanning the whole input) instead of full-resetting every 64 KiB. That was the entire structural gap; it now nearly matches the `.lzma` path. Also fixed the raw-LZMA2 decoder to feed stored chunks into the dictionary.
- … `zstd -12`.
- … `lz4 -d` resolves them natively.

Speed (standalone codecs, output byte-identical)
Honest notes
- … `.lzma` path already had). Higher levels trade encode time for ratio; decode speed is unchanged.
- … `-q11` needs its full iterative context-modeling optimal parse, beyond a bounded change (unchanged this round).

Checks

- `cargo test --all-features`: 61 suites green, 0 failures (incl. new linked-block + continue-dict + SA-IS regression tests, and the bzip2 `bunzip2` cross-validation).
- `cargo fmt --check`, `cargo clippy --all-features --all-targets -- -D warnings`, rustdoc `-D warnings`: clean.

🤖 Generated with Claude Code