perf(lzma2/xz): linear-time match finder + faster optimal parse #105

Merged: MagicalTux merged 1 commit into master from perf/lzma2-xz-encoder on Jun 30, 2026
@MagicalTux


Problem

The xz/lzma2 optimal-parse encoder used a fixed 64 Ki-bucket hash head table. For incompressible or low-match input, distinct 3-byte prefixes collided into the same buckets as the input grew, so per-bucket chains lengthened with the input and every probe walked O(n/table) entries. Encode was effectively O(n²) until the max_chain cap engaged. Encoding 4 MiB of random data took ~6.7 s and kept worsening with size; native xz handles the same input in linear time.
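
As a rough illustration of this scaling (a sketch under stated assumptions, not the code from this PR): if the hash spreads entries uniformly, n inserted positions over B heads give an average chain of n/B entries. A fixed 64 Ki table therefore reaches about 64 entries per chain at 4 MiB, while a power-of-two table sized to the window stays near 1. The helper names `avg_chain_len` and `sized_head_table` and the hash constant are hypothetical; `hash3` here is a stand-in with the same name as the function mentioned in the description, and the real one may differ.

```rust
/// Average hash-chain length when `positions` entries are spread evenly
/// over `buckets` heads (idealized uniform-hash assumption).
fn avg_chain_len(positions: usize, buckets: usize) -> f64 {
    positions as f64 / buckets as f64
}

/// Hypothetical sizing rule: a power-of-two head table that covers the
/// match-finder window, returned together with its index mask.
fn sized_head_table(window: usize) -> (Vec<u32>, u32) {
    let len = window.max(1).next_power_of_two();
    (vec![0u32; len], (len - 1) as u32)
}

/// Stand-in 3-byte hash returning a full 32-bit mix.
/// The caller masks the result with the table's mask on each probe.
fn hash3(b: &[u8]) -> u32 {
    let v = u32::from(b[0]) | (u32::from(b[1]) << 8) | (u32::from(b[2]) << 16);
    let h = v.wrapping_mul(0x9E37_79B1); // illustrative constant only
    h ^ (h >> 15)
}
```

Calling `avg_chain_len(4 << 20, 64 << 10)` models the fixed table at 4 MiB, and `hash3(bytes) & mask` with the mask from `sized_head_table` gives a slot index that is always in range.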

Changes (compressed output is byte-for-byte unchanged at every level)

  1. Linear match finder: size the hash head table to the match-finder window (a Vec + head_mask), as liblzma sizes its hash to the dictionary, so average chain length stays O(1). hash3 now returns a full 32-bit mix that is masked per probe.
  2. Length-price cache: cache length-symbol prices per pos_state, refreshed every 128 committed decisions, instead of doing an 8-bit bittree walk per length per position.
  3. Distance-price hoist: the new-match distance price depends on length only through the dist-state bucket (saturates at DIST_STATES), so it is recomputed only when that bucket changes (one call for the common length≥5 band).
  4. Word-at-a-time match comparison: match_len_at compares 8 bytes per step via LE u64 + trailing_zeros.
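
The word-at-a-time comparison in item 4 can be sketched roughly as follows. This is a minimal illustration, not the PR's actual `match_len_at`: the signature, the `load_le64` helper, and the bounds handling are assumptions. It XORs two little-endian 8-byte words; if they differ, the number of trailing zero bits divided by 8 is the index of the first mismatching byte within the word.

```rust
/// Read 8 bytes starting at `at` as a little-endian u64 (hypothetical helper).
fn load_le64(data: &[u8], at: usize) -> u64 {
    let mut buf = [0u8; 8];
    buf.copy_from_slice(&data[at..at + 8]);
    u64::from_le_bytes(buf)
}

/// Length of the common prefix of `data[a..]` and `data[b..]`,
/// capped at `max_len` and at the end of the buffer.
fn match_len_at(data: &[u8], a: usize, b: usize, max_len: usize) -> usize {
    // Never read past the end of the buffer from the later position.
    let limit = max_len.min(data.len().saturating_sub(a.max(b)));
    let mut len = 0;
    // Compare 8 bytes per step while a full word fits.
    while len + 8 <= limit {
        let diff = load_le64(data, a + len) ^ load_le64(data, b + len);
        if diff != 0 {
            // In little-endian order the lowest set bit belongs to the
            // first differing byte.
            return len + (diff.trailing_zeros() / 8) as usize;
        }
        len += 8;
    }
    // Finish the remaining tail (fewer than 8 bytes) one byte at a time.
    while len < limit && data[a + len] == data[b + len] {
        len += 1;
    }
    len
}
```

For example, `match_len_at(b"abcdefghijXabcdefghijY", 0, 11, 273)` compares one full word and then two tail bytes before hitting the `X`/`Y` mismatch. Overlapping positions (as in runs of zeros) work too, since both reads come from the same buffer.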

Results

Deterministic instruction counts (2 MiB, before → after): 3.1× fewer instructions on text, 4.0× fewer on all-zeros, 1.6× fewer on mixed source code. Random encode is now linear and ~1.1× faster than native xz -6; realistic source code runs at ~0.85–0.9× native speed (was far slower).

Verification

  • Full suite: 61/61 test binaries pass.
  • 60-case randomized roundtrip fuzz (sizes incl. 64 KiB chunk boundaries, levels 0–9) decoded by both native xz and our own decoder.
  • Cross-checked output decodes under native xz for text/zeros/random/source.

🤖 Generated with Claude Code

The xz/lzma2 optimal-parse encoder had a fixed 64 Ki-bucket hash head
table, so for incompressible or low-match input the per-bucket chains
lengthened with the input and every probe walked work that scaled with
length — encoding 4 MiB of random data took ~6.7 s and kept worsening.

Size the hash head table to the match-finder window (as liblzma sizes
its hash to the dictionary) so chains stay O(1) and encode is linear.
Then cut constant factors in the optimal parser: cache length-symbol
prices per pos_state (refreshed periodically instead of an 8-bit bittree
walk per length per position), compute the new-match distance price once
per dist-state band, and compare match bytes eight at a time.

Deterministic instruction counts (2 MiB, before -> after): ~3x fewer on
text, ~4x on long-run data, ~1.6x on mixed source code; random encode is
now linear and ~1.1x faster than native xz. Compressed output is
byte-for-byte unchanged at every level.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@MagicalTux force-pushed the perf/lzma2-xz-encoder branch from 8bca525 to 6093af7 on June 30, 2026 08:55
@MagicalTux merged commit 47e11cf into master on Jun 30, 2026
42 checks passed
@MagicalTux deleted the perf/lzma2-xz-encoder branch on June 30, 2026 08:59