Throughput optimizations across the codec suite by MagicalTux · Pull Request #90 · KarpelesLab/compcol

MagicalTux · 2026-06-12T03:20:08Z

Parallel performance pass over every algorithm module. Nine specialists each took a disjoint set of modules, profiled the hot paths, applied speed optimizations, and validated against the existing tests + bench harness. All changes preserve byte-identical decoder output (decoders are checked against round-trips and real reference fixtures); a few encoders changed output but still round-trip and pass their reference fixtures. No unsafe, no new dependencies.

Validation

cargo test --all-features — 60/60 suites green, 0 failures (this is the correctness gate: every decoder still produces identical bytes, encoders still round-trip + match fixtures incl. system gzip/bunzip2/zstd/xz and RAR/StuffIt/zip fixtures).
cargo fmt --check, cargo clippy --all-features --all-targets -D warnings — clean.
cargo build --all-features + bench harness run clean.

Highlights (measured by the agents, 1 MiB inputs, release)

| Area | Change | Δ |
|---|---|---|
| deflate / deflate64 | vectorized match-copy (spans + doubling `copy_within` for overlap) | deflate Random decode ~3.5×, deflate64 long-match decode several× |
| LZMA / xz | bulk + overlapping dict match-copy | RLE-heavy `.lzma` decode ~6× |
| zstd | inlined backward bit-reader, single-load FSE transitions, hoisted LL/ML tables | ~1.5× (−16% instr.) on Huffman/FSE-heavy input |
| brotli | wider Huffman LUT, single-tree literal fast path, persistent bit accumulator | literal-heavy decode ~2.3× |
| lz4 / lz5 / lzo / snappy | bulk overlapping match-copy; lzo/snappy encoder skip-step | decode multi-GB/s; incompressible encode ~6× |
| xpress-huffman | O(n²) history-trim → O(n) | orders of magnitude on large inputs |
| lha / rar1–5 / zip-implode·reduce·shrink / arc-* | bulk LZSS/LZW window copy | lha lh5 decode +64%, etc. |
| delta | vectorizable filter recurrence | encode ~15× |
| hpack | byte-wide FSA Huffman decode | random decode +44% |
| bzip2 | SA-IS suffix-array: fewer allocs + in-place recursion | BWT build +14–31% (dominant encode cost) |
| checksum / rle90 | CRC-32 slice-by-8; rle90 bulk literal copy | ~4× / ~3.5× |

Method

Each module group was optimized in an isolated git worktree (disjoint files → conflict-free integration via cherry-pick). Agents were instructed to revert any change without a measurable win; several speculative changes (e.g. a zstd reserve pre-pass, deflate encoder block-copy elision, lzs bulk-copy) were dropped for showing no gain or a regression. One agent caught and fixed a self-introduced FSE table transcription bug before committing, surfaced by the bench's long-match corpus.

Notes:

The bzip2 "naive O(n²log n) BWT" called out in the Cargo.toml comment was already SA-IS from prior work; that comment is stale (left untouched here since the feature table wasn't in scope).
Decode-only codecs (RAR, PPMd, LZFSE, zip methods) have no encoder to bench, so they were validated via their fixture tests + code review of the hot loops.

🤖 Generated with Claude Code

Replace the byte-at-a-time CRC-32 inner loop with Intel slice-by-8: fold eight bytes per iteration through eight precomputed tables instead of one. Output is byte-identical (verified against the byte-at-a-time loop over 16 MiB). Standalone microbench: 642 -> 2525 MB/s. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
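The slice-by-8 idea can be sketched as follows. This is a minimal illustration, not the crate's code: the names `make_tables`, `TABLES`, `crc32_bytewise` and `crc32_slice8` are hypothetical, and the sketch assumes the usual reflected CRC-32 polynomial used by gzip and zip.

```rust
/// Reflected CRC-32 polynomial (assumed: the gzip/zip variant).
const POLY: u32 = 0xEDB8_8320;

/// Build the eight 256-entry tables at compile time.
const fn make_tables() -> [[u32; 256]; 8] {
    let mut t = [[0u32; 256]; 8];
    let mut i = 0;
    while i < 256 {
        let mut c = i as u32;
        let mut k = 0;
        while k < 8 {
            c = if c & 1 != 0 { (c >> 1) ^ POLY } else { c >> 1 };
            k += 1;
        }
        t[0][i] = c;
        i += 1;
    }
    // Table j is table j-1 advanced by one additional zero byte.
    let mut j = 1;
    while j < 8 {
        let mut i = 0;
        while i < 256 {
            let prev = t[j - 1][i];
            t[j][i] = (prev >> 8) ^ t[0][(prev & 0xFF) as usize];
            i += 1;
        }
        j += 1;
    }
    t
}

static TABLES: [[u32; 256]; 8] = make_tables();

/// Reference loop: one table lookup per input byte.
pub fn crc32_bytewise(data: &[u8]) -> u32 {
    let mut crc = !0u32;
    for &b in data {
        crc = (crc >> 8) ^ TABLES[0][((crc ^ b as u32) & 0xFF) as usize];
    }
    !crc
}

/// Slice-by-8: fold eight input bytes per iteration through eight tables.
pub fn crc32_slice8(data: &[u8]) -> u32 {
    let mut crc = !0u32;
    let mut chunks = data.chunks_exact(8);
    for c in &mut chunks {
        let lo = u32::from_le_bytes([c[0], c[1], c[2], c[3]]) ^ crc;
        let hi = u32::from_le_bytes([c[4], c[5], c[6], c[7]]);
        crc = TABLES[7][(lo & 0xFF) as usize]
            ^ TABLES[6][((lo >> 8) & 0xFF) as usize]
            ^ TABLES[5][((lo >> 16) & 0xFF) as usize]
            ^ TABLES[4][(lo >> 24) as usize]
            ^ TABLES[3][(hi & 0xFF) as usize]
            ^ TABLES[2][((hi >> 8) & 0xFF) as usize]
            ^ TABLES[1][((hi >> 16) & 0xFF) as usize]
            ^ TABLES[0][(hi >> 24) as usize];
    }
    // Tail of fewer than eight bytes falls back to the bytewise step.
    for &b in chunks.remainder() {
        crc = (crc >> 8) ^ TABLES[0][((crc ^ b as u32) & 0xFF) as usize];
    }
    !crc
}
```

Keeping the bytewise loop around gives a cheap equivalence check of the kind the commit describes: both functions should agree on every prefix of a test buffer.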

The Normal-state decode path copied literal (non-FLAG) bytes one at a time through the state machine. Scan for the contiguous non-FLAG span bounded by input/output availability and copy_from_slice it in one memcpy, updating last/have_last from the span's final byte. Output is byte-identical; all rle90 tests pass. Bench decode (1 MiB): Lorem ~1268 -> ~4600 MB/s, Random ~1211 -> ~3750 MB/s. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
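A rough sketch of the span scan described above. The function name `copy_literal_span`, the `FLAG` value of 0x90, and the use of `Option<u8>` in place of the `last`/`have_last` pair are all assumptions made for illustration.

```rust
/// Assumed RLE90 repeat marker.
const FLAG: u8 = 0x90;

/// Copy the leading run of non-FLAG bytes from `input` into `output`,
/// bounded by whichever side runs out first. Returns the number of
/// bytes copied and records the final copied byte in `last`.
pub fn copy_literal_span(input: &[u8], output: &mut [u8], last: &mut Option<u8>) -> usize {
    let limit = input.len().min(output.len());
    // Length of the contiguous literal span before the next FLAG.
    let span = input[..limit]
        .iter()
        .position(|&b| b == FLAG)
        .unwrap_or(limit);
    // One bulk copy instead of one state-machine step per byte.
    output[..span].copy_from_slice(&input[..span]);
    if span > 0 {
        *last = Some(input[span - 1]);
    }
    span
}
```

The caller would then resume the per-byte state machine at `input[span]`, which is either a FLAG byte or the point where input or output ran out.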

Replace the per-byte overlap fallback in the inflate EmittingMatch hot loop (distance < remaining, e.g. distance-1 zero runs) with contiguous copy_within/copy_from_slice in non-wrapping spans, plus an expanding doubling copy that replicates the d-byte pattern instead of one byte at a time. Two modulos per byte become one wrap check per span.

Decode throughput (1 MiB, median of 3):
- deflate Zeros: 242 -> ~460 MB/s (+90%)
- zlib Zeros: 231 -> ~419 MB/s (+82%)
- gzip Zeros: 179 -> ~271 MB/s (+52%)
- deflate Lorem: 4751 -> ~5700 MB/s (+20%)
- zlib Lorem: 2483 -> ~2700 MB/s (+9%)

Round-trip + reference-fixture tests (system gzip, python zlib/deflate) all green; output is byte-identical.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
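The doubling copy can be illustrated on a flat buffer. This sketch deliberately ignores the circular-window wrap that the real decoder splits into spans, and `copy_match` is a hypothetical name.

```rust
/// Fill buf[pos..pos + len] with the match that starts `dist` bytes
/// behind `pos`. Assumes a flat (non-circular) buffer.
pub fn copy_match(buf: &mut [u8], pos: usize, dist: usize, len: usize) {
    assert!(dist >= 1 && dist <= pos && pos + len <= buf.len());
    let start = pos - dist;
    if dist >= len {
        // Source and destination do not overlap: one bulk copy.
        buf.copy_within(start..start + len, pos);
        return;
    }
    // Overlapping match: the output is the dist-byte pattern repeated.
    // Each round copies everything produced so far, so the source
    // region doubles and the loop runs O(log(len / dist)) times.
    let mut filled = 0;
    while filled < len {
        let n = (dist + filled).min(len - filled);
        buf.copy_within(start..start + n, pos + filled);
        filled += n;
    }
}
```

Because `filled` stays a multiple of `dist` until the final clamped round, every copy starts on a pattern boundary and the result matches the byte-at-a-time loop.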

window_pos advance used `% win_cap`; win_cap is a runtime value so this lowered to an integer division on every emitted literal. Swap for a single equality+reset branch and mark emit_byte #[inline]. Correctness unchanged (output byte-identical); removes a hardware divide from the literal hot path. Neutral-to-positive on the literal-heavy Lorem decode, no regression elsewhere. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
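As a small illustration of the change, assuming a hypothetical `Window` type (the field names are not taken from the crate):

```rust
/// Minimal circular output window with a runtime capacity.
pub struct Window {
    pub buf: Vec<u8>,
    pub pos: usize,
}

impl Window {
    pub fn new(cap: usize) -> Self {
        assert!(cap > 0);
        Self { buf: vec![0; cap], pos: 0 }
    }

    /// Store one byte and advance. The position only ever grows by one,
    /// so a compare-and-reset is equivalent to `(pos + 1) % cap` and
    /// avoids an integer division on a capacity the compiler cannot
    /// treat as a constant.
    #[inline]
    pub fn emit_byte(&mut self, b: u8) {
        self.buf[self.pos] = b;
        self.pos += 1;
        if self.pos == self.buf.len() {
            self.pos = 0;
        }
    }
}
```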

Mirror the deflate inflate optimization in the deflate64 decoder: copy each match run in contiguous, non-wrapping spans (one copy_within + copy_from_slice for non-overlapping spans, an expanding doubling copy for overlapping ones) instead of a per-byte fallback loop. deflate64's larger window and match length make long matches common, so the bulk copy is a big win. Decode throughput (1 MiB, median of 3): deflate64 Lorem: ~1459 -> ~10800 MB/s (long repetitive matches) deflate64 Zeros/Random: unchanged within noise Round-trip tests green; output byte-identical. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

The LZMA2 chunk decoder copied match bytes one at a time through dict_get/dict_put. For non-overlapping matches (distance+1 >= length) the source bytes already sit contiguously behind dict_pos, so we can copy_from_slice into the output and copy_within inside the dict in bulk, mirroring the dict_copy_match_bulk fast path already used by the .lzma decoder. The per-byte loop still handles overlapping matches and the circular-buffer wrap remainder, so decoder output is byte-identical.

Measured (1 MiB corpus, median of 3, release):
- xz Lorem decode 340 -> ~553 MB/s (+63%)
- xz Random decode 434 -> ~680 MB/s (+57%)
- xz Zeros decode 365 -> ~384 MB/s (+5%)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Overlapping matches (distance+1 < length, e.g. RLE-style runs over long zero/repeat regions) still fell through to the byte-by-byte loop. Add dict_copy_match_overlap: it replicates the dist1-byte source window forward inside the dict via doubling copy_within windows (each read hits bytes written by an earlier window), then copy_from_slice's the filled run into the output. Only the non-wrapping contiguous portion is bulked; the per-byte loop still handles the circular-dict wrap remainder, so decoder output stays byte-identical. Measured (1 MiB corpus, median of 3, release): xz Zeros decode ~384 -> ~570 MB/s (+48%) Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

The .lzma streaming decoder already bulk-copied non-overlapping matches but fell through to a byte-by-byte loop for overlapping runs (small distance, large length: the dominant pattern on RLE-heavy inputs like long zero runs). Add dict_copy_match_overlap mirroring the lzma2 path: replicate the dist1-byte source window forward via doubling copy_within, then copy_from_slice into the output. Both drain sites (the live Match outcome and the parked pending_match) get the new branch. The per-byte loop still covers the circular-dict wrap remainder and respects the uncompressed-size cap, so decoder output is byte-identical.

Measured (1 MiB corpus, median of 3, release):
- lzma Zeros decode ~860 -> ~5400 MB/s (~6x)
- lzma Lorem/Random decode unchanged (no overlapping runs)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

The decoder's per-symbol fast path called set_position() after every LUT hit, which zeroed the 64-bit bit accumulator and forced a fresh refill on the next decode. Add BitSource::consume() to advance within the buffered bits, plus peek_lut_bits() that refills once and reports how many bits are available without asserting on a short tail. The hot Huffman decode loop now resolves consecutive symbols out of registers. Decode throughput (median of 3, 1 MiB inputs): Random: 106 -> ~140 MB/s (+~32%) Lorem: 1030 -> ~1040 MB/s (within noise) cargo test --features "brotli std": green. clippy clean. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
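A minimal sketch of a persistent accumulator with a peek/consume pair, assuming LSB-first bit order. The method names follow the commit message, but the signatures and fields here are guesses rather than the crate's actual API.

```rust
/// Bit source that keeps unread bits buffered across calls.
pub struct BitSource<'a> {
    data: &'a [u8],
    pos: usize,  // next byte to load
    acc: u64,    // buffered bits, least significant bit first
    nbits: u32,  // number of valid bits in `acc`
}

impl<'a> BitSource<'a> {
    pub fn new(data: &'a [u8]) -> Self {
        Self { data, pos: 0, acc: 0, nbits: 0 }
    }

    /// Top up the accumulator a byte at a time while there is room.
    fn refill(&mut self) {
        while self.nbits <= 56 && self.pos < self.data.len() {
            self.acc |= (self.data[self.pos] as u64) << self.nbits;
            self.pos += 1;
            self.nbits += 8;
        }
    }

    /// Return the low `n` bits without consuming them, plus how many of
    /// those bits are actually backed by input (fewer than `n` only at
    /// the tail of the stream). Never panics on a short tail.
    pub fn peek_lut_bits(&mut self, n: u32) -> (u32, u32) {
        debug_assert!(n >= 1 && n <= 32);
        if self.nbits < n {
            self.refill();
        }
        let mask = (1u64 << n) - 1;
        ((self.acc & mask) as u32, self.nbits.min(n))
    }

    /// Drop `n` already-peeked bits; the rest stay in the accumulator.
    pub fn consume(&mut self, n: u32) {
        assert!(n <= self.nbits && n < 64);
        self.acc >>= n;
        self.nbits -= n;
    }
}
```

A decode loop built on this peeks the LUT width, looks up the symbol and its code length, and consumes only that length, so consecutive symbols come out of the same buffered word.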

When NTREESL == 1 the literal context map is all zeroes, so the per-byte context-id computation (context::literal_context plus the cmapl index) always selects tree 0. Specialize the insert-literal loop to decode straight from htree_l[0] in that case, hoisting the single tree reference out of the loop. Block-type switching still runs (it drives block_len_l) but no longer feeds an unused context lookup. Decode throughput (median of 3, 1 MiB): Random: ~140 -> ~235 MB/s (+~68% on top of the prior commit; +~120% vs the original 106 MB/s baseline) Lorem: unchanged (uses multiple context trees -> slow path) cargo test --features "brotli std": green. clippy clean. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

The primary lookup table now covers codes up to length 11 instead of 9, resolving more literal/distance symbols in a single indexed load before falling back to the per-bit canonical walk. The table grows to 2048 u32 (8 KiB) per tree, still L1-resident; build cost is paid once per tree per meta-block and is dwarfed by the per-symbol decode savings on 1 MiB+ inputs. Decode throughput (median of 3, 1 MiB): Random: ~235 -> ~255 MB/s Lorem: unchanged (~1030, within noise) cargo test --features "brotli std": green. clippy clean. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Replace the per-symbol read+unread reseed in HuffTable::decode with a peek_bits/consume pair on RevBitReader. The old path rebuilt the bit accumulator from memory on every literal (reseed_from_consumed); the new path peeks max_bits without consuming, indexes the lookup table, and consumes only the matched code length. #[inline] the bit-reader read. Decode micro-bench (4 MiB mixed-entropy text, Huffman+FSE heavy): ~314 -> ~330 MB/s median. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

FseState::advance now special-cases num_bits==0 (max-probability symbols) to avoid a RevBitReader::read call whose result is always 0, and inlines symbol()/advance(). A meaningful fraction of FSE table entries carry num_bits==0, so this removes a hot per-sequence function call. Decode micro-bench: ~330 -> ~350 MB/s median. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

read() is called up to 6x per sequence (FSE state advances + LL/OF/ML extra bits) and was a non-inlined ~30% hotspot. Mark the n<=56 fast path #[inline(always)] and move the rare 57..=64-bit wide-read branch into a #[cold] #[inline(never)] read_wide(). The hot small-read path now inlines directly into decode_sequences and FseState::advance, eliminating the call overhead and bounds-check duplication. Decode micro-bench: instruction count -16% (callgrind), wall-clock ~350 -> ~425 MB/s. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
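The hot/cold split can be sketched on a simplified reader. Note the assumptions: this is a forward, LSB-first reader rather than zstd's backward one, reads past the end yield zero bits, and `BitReader`, `take` and the field names are hypothetical.

```rust
pub struct BitReader<'a> {
    data: &'a [u8],
    pos: usize,
    acc: u64,
    nbits: u32,
}

impl<'a> BitReader<'a> {
    pub fn new(data: &'a [u8]) -> Self {
        Self { data, pos: 0, acc: 0, nbits: 0 }
    }

    /// Load bytes until more than 56 bits are buffered or input ends.
    #[inline(always)]
    fn refill(&mut self) {
        while self.nbits <= 56 && self.pos < self.data.len() {
            self.acc |= (self.data[self.pos] as u64) << self.nbits;
            self.pos += 1;
            self.nbits += 8;
        }
    }

    /// Small read (n <= 56): always satisfiable after a single refill.
    #[inline(always)]
    fn take(&mut self, n: u32) -> u64 {
        debug_assert!(n <= 56);
        self.refill();
        let v = self.acc & ((1u64 << n) - 1);
        self.acc >>= n;
        self.nbits = self.nbits.saturating_sub(n);
        v
    }

    /// Public entry point: the common case inlines into the caller.
    #[inline(always)]
    pub fn read(&mut self, n: u32) -> u64 {
        if n <= 56 { self.take(n) } else { self.read_wide(n) }
    }

    /// Rare 57..=64-bit read, kept out of line so it does not bloat
    /// every inlined call site.
    #[cold]
    #[inline(never)]
    fn read_wide(&mut self, n: u32) -> u64 {
        assert!(n <= 64);
        let lo = self.take(32);
        let hi = self.take(n - 32);
        lo | (hi << 32)
    }
}
```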

ll_base_extra/ml_base_extra rebuilt two 36/53-element stack arrays on every call (once per sequence). Replace with module-level const [(base, extra); N] tables indexed via .get(), so the hot sequence loop reads a single rodata table instead of re-materialising arrays. Tables verified element-for-element against the RFC 8478 LL/ML code tables. Decode micro-bench: instruction count -13% (callgrind), wall-clock ~425 -> ~470 MB/s. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
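For illustration, the literal-length half of such a table might look like the sketch below. The `(base, extra_bits)` values are transcribed from the Zstandard format's literal-length code table and are worth re-checking against the RFC; the name `LL_TABLE` and the `Option` return type are assumptions.

```rust
/// (baseline, number of extra bits) for literal-length codes 0..=35.
const LL_TABLE: [(u32, u8); 36] = [
    (0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (5, 0), (6, 0), (7, 0),
    (8, 0), (9, 0), (10, 0), (11, 0), (12, 0), (13, 0), (14, 0), (15, 0),
    (16, 1), (18, 1), (20, 1), (22, 1), (24, 2), (28, 2), (32, 3), (40, 3),
    (48, 4), (64, 6), (128, 7), (256, 8), (512, 9), (1024, 10), (2048, 11),
    (4096, 12), (8192, 13), (16384, 14), (32768, 15), (65536, 16),
];

/// Look up a code in read-only data; `None` for an out-of-range code
/// instead of a panic, so corrupt input can be reported as an error.
#[inline]
pub fn ll_base_extra(code: u8) -> Option<(u32, u8)> {
    LL_TABLE.get(code as usize).copied()
}
```

One cheap self-check for a table like this: from code 16 onward, each baseline should equal the previous baseline plus `1 << extra_bits` of the previous entry.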

…load) The sequence loop indexed each FSE table twice per state per sequence: once in symbol() and again in advance(). Add FseState::entry() to fetch the FseEntry once (yielding the symbol) and advance_with(entry, size) to reuse it, and hoist the loop-invariant table sizes. This cuts the per-sequence memory traffic on the three FSE tables. Decode micro-bench wall-clock: ~470 -> ~483 MB/s (consistent across runs). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Replace byte-at-a-time self-overlap copy loops with chunked extend_from_within: each round duplicates the offset-byte tail produced so far, doubling the source region, so the loop runs O(log len) rounds instead of one push per byte. Decoder output is byte-identical. Measured (1 MiB Lorem, decode MB/s): lz4: 1470 -> ~18000 (~12x) lzo: 2396 -> ~18000 (~7x) snappy/lz5 overlap-heavy paths similarly bulk-copy now. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
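A minimal sketch of the chunked self-overlap copy on a growing `Vec<u8>`, using the standard `Vec::extend_from_within`. The function name `append_match` is hypothetical.

```rust
/// Append `len` bytes to `out`, copied from `offset` bytes behind its
/// current end. When offset < len the match overlaps itself and the
/// result is the offset-byte tail repeated.
pub fn append_match(out: &mut Vec<u8>, offset: usize, len: usize) {
    assert!(offset >= 1 && offset <= out.len());
    let start = out.len() - offset;
    let mut remaining = len;
    while remaining > 0 {
        // Everything from `start` to the end is valid pattern data, and
        // it grows every round, so the chunk size doubles.
        let avail = out.len() - start;
        let n = avail.min(remaining);
        out.extend_from_within(start..start + n);
        remaining -= n;
    }
}
```

For a distance-1 run of length 1000 this takes about ten rounds instead of a thousand pushes.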

decode_string_to_emit_buf walked the prefix chain into a scratch Vec (reversing), then popped it into emit_buf (un-reversing) — two passes and a second buffer. Walk the chain straight into emit_buf and reverse just the written region in place: one walk + one tight in-place reverse, and the scratch stack field is removed. Decoder output is byte-identical. Measured (decode MB/s): Lorem 425 -> ~510, Zeros 641 -> ~950 (~1.45x). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
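The single-walk approach might look roughly like this. The dictionary layout (parallel `prefix`/`suffix` arrays with codes below 256 as literals) is an assumption for the sketch, and the guards against crafted cyclic chains that a real decoder needs are left out.

```rust
/// Expand LZW `code` onto the end of `emit_buf`.
/// Entry `i` of the tables describes code `256 + i` as
/// (prefix code, final byte); codes below 256 are plain literals.
pub fn decode_string(prefix: &[u16], suffix: &[u8], mut code: u16, emit_buf: &mut Vec<u8>) {
    let start = emit_buf.len();
    // Walking the chain yields the string back to front; write it
    // straight into the destination buffer.
    while code >= 256 {
        let i = (code - 256) as usize;
        emit_buf.push(suffix[i]);
        code = prefix[i];
    }
    emit_buf.push(code as u8);
    // Un-reverse only the region written by this call, in place.
    emit_buf[start..].reverse();
}
```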

On a miss, advance by a stride that grows with the consecutive-miss count (LZ4-style) instead of one byte at a time, so incompressible data is scanned in large strides. The first ~64 misses still step 1 byte, so compressible data keeps a dense hash table and its ratio/speed are unchanged; a hit resets the stride. Round-trip tests pass (decode output unchanged). Measured (features=lzo,factory,std; encode MB/s): Random: 495 -> ~3000 (~6x) Lorem: 1335 -> ~1290 (flat, within noise; output size unchanged) Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
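One way to express such a skip-step, as a sketch: the `Stride` type and the shift of 6 are assumptions chosen so that the first 64 misses advance by one byte, matching the behaviour described above.

```rust
/// Assumed shift: the stride grows by one for every 64 consecutive misses.
const SKIP_SHIFT: u32 = 6;

/// Tracks consecutive hash-table misses in a match finder.
pub struct Stride {
    misses: u32,
}

impl Stride {
    pub fn new() -> Self {
        Self { misses: 0 }
    }

    /// Record a miss and return how many bytes to advance before the
    /// next probe: 1 for the first 64 misses, then 2, 3, ...
    pub fn on_miss(&mut self) -> usize {
        let step = 1 + (self.misses >> SKIP_SHIFT) as usize;
        self.misses = self.misses.saturating_add(1);
        step
    }

    /// A hit means the data is compressible again: go back to dense probing.
    pub fn on_hit(&mut self) {
        self.misses = 0;
    }
}
```

In an encoder loop, `on_miss` would be called after each failed hash probe and `on_hit` whenever a match is emitted.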

On a miss, advance by a stride that grows with the consecutive-miss count (matching the reference encoder's bytes_between_hash_lookups), so incompressible regions are scanned in large strides instead of one byte at a time. A hit resets the stride. Round-trip tests pass and the >2x ratio test still holds (output stays well-compressed). Measured (features=snappy,factory,std; encode MB/s): Random: 804 -> ~4900 (~6x) Lorem: 2557 -> ~2760 (slightly up) Zeros: flat (within run-to-run noise) Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Reuse a single bucket scratch buffer across all induced-sort passes instead of allocating fresh bucket-start/-end Vecs each call, collect LMS positions once during type classification (removing the later rescan + lms_positions rebuild), and inline is_lms / bucket fills.

SA-IS build throughput on a 900 KB block (median of 3, --release):
- lorem 18.6 -> 19.2 MB/s
- zeros 31.8 -> 32.9 MB/s
- random 10.8 -> 13.2 MB/s (+22%)

Output unchanged: same induced-sort order => identical BWT+origin. Full test suite (round-trip, reference fixtures, bunzip2 cross-check) stays green.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

The reduced LMS text already lives in the trailing n1 cells of `sa` and the recursive sub-suffix-array is written into the leading n1 cells. Since no two adjacent positions can both be LMS, n1 <= n/2, so those two regions are disjoint halves of split_at_mut and can be borrowed (immutable text / mutable output) simultaneously, which removes the fresh reduced_text Vec allocated and filled at every recursion level.

SA-IS build throughput on a 900 KB block (median of 3, --release), relative to the previous commit:
- lorem 19.2 -> 21.2 MB/s
- zeros 32.9 -> 40.5 MB/s (+23%)
- random 13.2 -> 14.1 MB/s

Output unchanged (identical recursion, identical BWT+origin); full suite green.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
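The borrow pattern can be shown in isolation. In this sketch a naive suffix sort stands in for the recursive SA-IS call, and both function names are hypothetical.

```rust
/// Stand-in for the recursive call: sort the suffixes of `text` and
/// write their start indices into `out`.
fn naive_suffix_sort(text: &[u32], out: &mut [u32]) {
    for (i, slot) in out.iter_mut().enumerate() {
        *slot = i as u32;
    }
    out.sort_by(|&a, &b| text[a as usize..].cmp(&text[b as usize..]));
}

/// `sa` holds the reduced text in its trailing `n1` cells; the
/// sub-suffix-array is written into its leading `n1` cells without
/// copying the text out first.
pub fn solve_reduced_in_place(sa: &mut [u32], n1: usize) {
    // n1 <= n/2 guarantees the two regions cannot overlap.
    assert!(2 * n1 <= sa.len());
    let mid = sa.len() - n1;
    let (head, tail) = sa.split_at_mut(mid);
    let reduced_text: &[u32] = tail; // read-only view of the text
    let sub_sa: &mut [u32] = &mut head[..n1]; // writable output region
    naive_suffix_sort(reduced_text, sub_sa);
}
```

`split_at_mut` is what lets the borrow checker accept a shared view of one region alongside a mutable view of the other within the same slice.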

emit_byte() drained out_history back to MAX_DISTANCE on every emitted byte once the 64 KiB window filled, shifting the whole buffer per byte — quadratic over the stream and the dominant decode cost. Let the buffer grow to 2*MAX_DISTANCE and trim the oldest half only then; all reads are relative to len() and bounded by MAX_DISTANCE, so correctness is unchanged and decode stays byte-identical. Decode MB/s (1 MiB): Lorem 1.34→786, Zeros 1.46→588, Random 1.40→266. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
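A sketch of the amortised trim, assuming a hypothetical `History` wrapper around the buffer; `MAX_DISTANCE` is taken as 64 KiB per the description above.

```rust
const MAX_DISTANCE: usize = 65_536;

/// Output history for back-references of at most MAX_DISTANCE bytes.
pub struct History {
    pub buf: Vec<u8>,
}

impl History {
    pub fn new() -> Self {
        Self { buf: Vec::new() }
    }

    /// Append one byte. The buffer is allowed to reach 2 * MAX_DISTANCE
    /// before the oldest part is dropped, so the shift happens once per
    /// MAX_DISTANCE bytes instead of once per byte.
    pub fn emit_byte(&mut self, b: u8) {
        if self.buf.len() >= 2 * MAX_DISTANCE {
            let keep_from = self.buf.len() - MAX_DISTANCE;
            self.buf.copy_within(keep_from.., 0);
            self.buf.truncate(MAX_DISTANCE);
        }
        self.buf.push(b);
    }

    /// Byte `distance` positions back from the end (1 = most recent).
    /// Reads are relative to the end, so trimming never changes them.
    pub fn back(&self, distance: usize) -> Option<u8> {
        if distance == 0 || distance > MAX_DISTANCE || distance > self.buf.len() {
            return None;
        }
        Some(self.buf[self.buf.len() - distance])
    }
}
```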

decode_compressed_chunk emitted every match byte-by-byte. For the common non-overlapping case (offset >= length) the source range is already fully populated, so resize + copy_within does it in one shot. The overlapping run case (offset < length, run-length expansion) keeps the byte loop. Decode byte-identical. Decode MB/s (Zeros 1 MiB, match-heavy): ~1807 → ~1980 (+~10%). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Replace the bit-at-a-time canonical decode (one table probe per input bit) with a byte-wide finite-state machine over the canonical trie: one lookup per input byte, emitting 0..=8 symbols. Built per call but cheaply (composed from a per-nibble table), so even the fast/short-code case stays flat.

h2-huffman decode MB/s (1 MiB):
- Lorem: 385 -> 378 (flat, within noise)
- Zeros: 155 -> 202 (+30%)
- Random: 64 -> 93 (+45%)

All hpack + full-feature tests green; output byte-identical.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Decode previously wrote every output byte twice: pushed onto a scratch stack during the prefix-chain walk, then popped into emit_buf. Replace with a fixed-size reverse-assembly scratch (allocated once) filled back-to-front in one walk, then a single vectorised extend_from_slice into emit_buf. A length-1 literal (common on incompressible input) skips assembly entirely. crunch decode MB/s (1 MiB): Lorem: 322 -> 390 (+21%) Zeros: 686 -> 1078 (+57%) Random: 194 -> 207 (+7%) Crafted-stream guards preserved (i==0 rejects over-long/cyclic chains); all arc_crunch + full-feature tests green, output byte-identical. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Same optimization as arc_crunch: replace the push-to-stack / pop-to-emit_buf double write with a fixed-size reverse-assembly scratch filled in one walk and bulk-copied via extend_from_slice; bare literals skip assembly. squashed decode MB/s (1 MiB): Lorem: 384 -> 502 (+31%) Zeros: 669 -> 976 (+46%) Random: 194 -> 210 (+8%) Crafted-stream guards preserved; all arc_squash + full-feature tests green, output byte-identical. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

The per-byte loop kept a dist-byte ring with a modulo branch and a read-modify-write of history every byte, serialising the whole transform. Split into three phases: seed the first dist bytes through the ring (cross-call history), then run a flat recurrence over the bulk — encode reads input[i-dist] directly (read-only input → auto-vectorises), decode reads output[i-dist] — then refresh the ring from the tail. Streaming/chunk semantics unchanged (the 1-byte-chunk-vs-bulk equivalence test passes). delta encode MB/s (1 MiB, default dist=1): ~1680 -> ~25000 (≈15x; the read-only subtract vectorises) delta decode unchanged (dist=1 reconstruction is an inherently serial prefix sum); larger distances also speed up decode. Output byte-identical. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
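The flat recurrence for the bulk phase can be sketched as a one-shot transform. This omits the cross-call ring that the streaming implementation keeps, assumes zero initial history, and uses hypothetical function names.

```rust
/// One-shot delta encode: out[i] = in[i] - in[i - dist] (wrapping),
/// with the first `dist` bytes passed through unchanged.
pub fn delta_encode(input: &[u8], dist: usize) -> Vec<u8> {
    assert!(dist >= 1);
    let mut out = vec![0u8; input.len()];
    let head = dist.min(input.len());
    out[..head].copy_from_slice(&input[..head]);
    // Both operands come from the read-only input, so iterations are
    // independent of each other and the compiler is free to vectorise.
    for (o, (&cur, &prev)) in out[head..]
        .iter_mut()
        .zip(input[head..].iter().zip(input.iter()))
    {
        *o = cur.wrapping_sub(prev);
    }
    out
}

/// One-shot delta decode: each byte depends on an already-decoded byte
/// `dist` positions back, so dist = 1 is a serial prefix sum.
pub fn delta_decode(input: &[u8], dist: usize) -> Vec<u8> {
    assert!(dist >= 1);
    let mut out = input.to_vec();
    for i in dist..out.len() {
        out[i] = out[i].wrapping_add(out[i - dist]);
    }
    out
}
```

The asymmetry in the measured numbers follows from the data dependence: encode reads only the input, while decode reads its own earlier output.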

The LZSS match expansion copied byte-by-byte through the ring with two modulo ops per byte. Split by match geometry: non-overlapping matches copy in contiguous ring segments (straight-line loops, no per-byte wrap test); single-byte runs (distance 1) fill a constant byte directly; only genuinely overlapping matches fall back to the byte walk. The ring's space-prefill semantics are preserved, so output is byte-identical (shared by lh4/5/6/7). lh5 decode MB/s (1 MiB): Lorem: ~853 -> ~1400 (+64%) Zeros: ~890 -> ~1140 (+28%) lh4/lh6/lh7 improve comparably. All lha + full-feature tests green. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

All four RAR decoders expanded matches byte-by-byte through the sliding window. Apply the same geometry split used in lha: distance-1 runs fill a constant byte; non-overlapping matches copy in contiguous window segments (straight-line loops, no per-byte index recompute / mask test); only genuinely overlapping matches walk byte-by-byte. The window-prefill and truncation semantics are preserved exactly, so decoded output stays byte-identical — verified by each codec's reference-fixture tests (rar1 53, rar2 28, rar3 30, rar5 29 tests, all green). Decode-only codecs (no bench round-trip); correctness is fixture-validated and the transform mirrors the measured lha win (+28-64% decode). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

- zip_shrink (LZW): assemble the decoded string in the scratch buffer, then reverse once and extend_from_slice into emit_buf, instead of the per-byte pop/push round trip (each output byte was written twice).
- zip_reduce: split the DLE back-reference copy into a zero-fill prefix (refs before stream start), a distance-1 fill, a non-overlapping extend_from_within, and an overlapping byte walk.
- zip_implode: split the window match copy by geometry (distance-1 fill / contiguous non-overlap segments / overlapping byte walk) and apply the pending_len/output_left bookkeeping once per match.

All decode-only; output byte-identical, verified by each codec's reference-fixture tests (shrink 14, reduce 17, implode 18, all green).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

MagicalTux and others added 30 commits June 12, 2026 12:17

MagicalTux and others added 2 commits June 12, 2026 12:17

docs: changelog entry for codec throughput optimizations

3f6adb1

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

MagicalTux merged commit ef5a38b into master Jun 12, 2026
41 checks passed

MagicalTux merged 32 commits into master from optimize-codecs
