perf(lzma2/xz): linear-time match finder + faster optimal parse #105
Merged
The xz/lzma2 optimal-parse encoder had a fixed 64 Ki-bucket hash head table, so for incompressible or low-match input the per-bucket chains lengthened with the input and every probe walked work that scaled with length — encoding 4 MiB of random data took ~6.7 s and kept worsening. Size the hash head table to the match-finder window (as liblzma sizes its hash to the dictionary) so chains stay O(1) and encode is linear. Then cut constant factors in the optimal parser: cache length-symbol prices per pos_state (refreshed periodically instead of an 8-bit bittree walk per length per position), compute the new-match distance price once per dist-state band, and compare match bytes eight at a time. Deterministic instruction counts (2 MiB, before -> after): ~3x fewer on text, ~4x on long-run data, ~1.6x on mixed source code; random encode is now linear and ~1.1x faster than native xz. Compressed output is byte-for-byte unchanged at every level. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
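To illustrate the head-table sizing described above, here is a minimal sketch, not the code from this PR. It assumes the bucket count is the window size rounded up to a power of two with a floor of 64 Ki; that rounding rule, the `HashHeads` type, the `EMPTY` sentinel, the `insert` helper, and the mixing constants in `hash3` are hypothetical. Only the names `head_mask` and `hash3` and the idea of masking a full 32-bit mix per probe come from the description.

```rust
/// Sentinel meaning "no earlier position in this bucket" (assumed convention).
pub const EMPTY: u32 = u32::MAX;

/// Hash head table whose bucket count follows the match-finder window
/// instead of being fixed at 64 Ki. Hypothetical type for illustration.
pub struct HashHeads {
    head: Vec<u32>,
    head_mask: u32,
}

/// Mix three bytes into a full 32-bit value; the caller masks it per probe.
/// The multiplier and the xor-shift fold are illustrative, not the PR's.
pub fn hash3(b: [u8; 3]) -> u32 {
    let v = u32::from(b[0]) | (u32::from(b[1]) << 8) | (u32::from(b[2]) << 16);
    let h = v.wrapping_mul(0x9E37_79B1);
    // Fold the well-mixed high bits down so a low-bit mask still sees them.
    h ^ (h >> 15)
}

impl HashHeads {
    /// Assumed sizing rule: next power of two >= window, never below 64 Ki.
    /// Assumes the window is small enough that the mask fits in a u32.
    pub fn new(window_size: usize) -> Self {
        let buckets = window_size.next_power_of_two().max(1 << 16);
        HashHeads {
            head: vec![EMPTY; buckets],
            head_mask: (buckets - 1) as u32,
        }
    }

    /// Number of buckets in the head table.
    pub fn buckets(&self) -> usize {
        self.head.len()
    }

    /// Bucket index for a 3-byte prefix: full mix, masked to the table size.
    pub fn bucket(&self, b: [u8; 3]) -> usize {
        (hash3(b) & self.head_mask) as usize
    }

    /// Record `pos` as the newest position for this prefix and return the
    /// previous head, which a chain-based finder would link to.
    pub fn insert(&mut self, b: [u8; 3], pos: u32) -> u32 {
        let i = self.bucket(b);
        std::mem::replace(&mut self.head[i], pos)
    }
}
```

With the table growing alongside the window, the expected number of positions per bucket stays bounded for random input, which is what keeps each probe's chain walk from growing with the input length.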
Problem
The xz/lzma2 optimal-parse encoder used a fixed 64 Ki-bucket hash head table. For incompressible or low-match input, distinct 3-byte prefixes collided into the same buckets as the input grew, so per-bucket chains lengthened with the input and every probe walked O(n/table) work. Encode was effectively O(n²) until the `max_chain` cap engaged. An `xz` encode of 4 MiB of random data took ~6.7 s and kept worsening; native `xz` handles it in constant time.

Changes (compressed output is byte-for-byte unchanged at every level)
- Hash head table sized to the match-finder window (`Vec` + `head_mask`), as liblzma sizes its hash to the dictionary, so average chain length stays O(1). `hash3` now returns a full 32-bit mix masked per-probe.
- Length-symbol prices cached per `pos_state`, refreshed every 128 committed decisions, instead of an 8-bit bittree walk per length per position.
- New-match distance price computed once per dist-state band (`DIST_STATES`), so it's recomputed only when that bucket changes (one call for the common length≥5 band).
- `match_len_at` compares 8 bytes per step via LE `u64` + `trailing_zeros`.

Results
Deterministic instruction counts (2 MiB, before → after): text 3.1×, all-zeros 4.0×, mixed source code 1.6× fewer instructions. Random encode is now linear and ~1.1× faster than native `xz -6`; realistic source code is ~0.85–0.9× native (was far slower).

Verification
- … `xz` and our own decoder.
- … `xz` for text/zeros/random/source.

🤖 Generated with Claude Code
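The 8-bytes-per-step comparison listed under Changes can be sketched as follows. This is an illustration under assumptions rather than the PR's implementation: the signature of `match_len_at` (one buffer, two positions, and a length cap) and the `load_le_u64` helper are hypothetical. The technique itself is the one the description names: load 8 bytes little-endian from each side, XOR them, and use `trailing_zeros` to find the first differing byte.

```rust
/// Load 8 bytes starting at `i` as a little-endian u64 (hypothetical helper).
fn load_le_u64(buf: &[u8], i: usize) -> u64 {
    let mut w = [0u8; 8];
    w.copy_from_slice(&buf[i..i + 8]);
    u64::from_le_bytes(w)
}

/// Length of the common run starting at positions `a` and `b` of `buf`,
/// capped at `max_len`. The signature is assumed for this sketch.
pub fn match_len_at(buf: &[u8], a: usize, b: usize, max_len: usize) -> usize {
    // Never read past the end of the buffer from either position.
    let limit = max_len.min(buf.len().saturating_sub(a.max(b)));
    let mut len = 0;

    // Fast path: compare 8 bytes per step.
    while len + 8 <= limit {
        let diff = load_le_u64(buf, a + len) ^ load_le_u64(buf, b + len);
        if diff != 0 {
            // In a little-endian load the earliest byte is the lowest-order
            // byte, so trailing zero bits / 8 counts the matching bytes.
            return len + (diff.trailing_zeros() / 8) as usize;
        }
        len += 8;
    }

    // Tail: fewer than 8 bytes remain, compare one at a time.
    while len < limit && buf[a + len] == buf[b + len] {
        len += 1;
    }
    len
}
```

For example, `match_len_at(b"abcdefgh_abcXefgh_", 0, 9, 64)` finds the mismatch at the fourth byte inside the first 8-byte word and returns 3 without a per-byte loop. The little-endian load matters: with a big-endian load the first differing byte would have to be located with `leading_zeros` instead.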