perf(brotli,zstd): faster encode on low-redundancy input #109
Merged
Audit of the codecs against their official CLIs found two encoders that were dramatically slower than the reference on incompressible/low-match data, both linear but with pathological constants. (Interop and the streaming contract were otherwise clean across every codec with an official tool; lh1/lh2's Unsupported-without-length is documented, not a bug.)

brotli: the literal-context histogram clustering was O(contexts^3 * 256). It rescanned all cluster pairs and recomputed each cluster's cost from scratch on every merge, which exploded on dense histograms (~37k instructions/byte on random input). Cache per-cluster costs and the pairwise-delta matrix, updating only the merged cluster each round. The merge sequence and compressed output are byte-for-byte identical; incompressible encode is ~8x faster.

zstd: the match finder used a fixed 64 Ki-bucket hash table over an up-to-8 MiB window (load factor in the hundreds), so each probe walked a full chain of useless far links. Size the table to the window. Also build the per-block match index incrementally: the chains persist across blocks (the history prefix is byte-stable until a window trim) instead of re-indexing all of history every block, which was O(history) per block and quadratic over a stream. Output is unchanged on single-block inputs and equal-or-smaller on multi-block ones (0 ratio regressions observed); random encode is ~3x faster.

Verified: brotli output byte-identical across inputs x quality 0..11; zstd 0 ratio regressions and interop both ways with the zstd CLI; 50-case zstd fuzz (incl. >8 MiB trim path and block boundaries) round-trips through both our decoder and the CLI; full suite, clippy, and fmt clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
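To make the brotli change concrete, here is a minimal Rust sketch of greedy histogram clustering with cached costs. It is not the code from this PR: the Shannon-entropy cost model, the function bodies, and the `cluster` signature are assumptions made for illustration. Only the caching idea (per-cluster self-costs plus a pairwise-delta matrix, refreshing just the merged cluster's entries each round) is taken from the description above.

```rust
const SYMBOLS: usize = 256;
type Histogram = [u32; SYMBOLS];

/// Entropy estimate, in bits, of coding `h` with its own code.
/// (Assumed cost model; the real encoder's estimate may differ.)
fn histogram_bits(h: &Histogram) -> f64 {
    let total: u64 = h.iter().map(|&c| c as u64).sum();
    if total == 0 {
        return 0.0;
    }
    let total = total as f64;
    h.iter()
        .filter(|&&c| c > 0)
        .map(|&c| {
            let c = c as f64;
            c * (total / c).log2()
        })
        .sum()
}

/// Cost of coding `a` and `b` with one shared code.
fn merged_bits(a: &Histogram, b: &Histogram) -> f64 {
    let mut m = [0u32; SYMBOLS];
    for s in 0..SYMBOLS {
        m[s] = a[s] + b[s];
    }
    histogram_bits(&m)
}

/// Greedily merges the cheapest pair until at most `max_clusters` remain.
/// Returns, for each input histogram, the index of the cluster that owns it.
fn cluster(mut hists: Vec<Histogram>, max_clusters: usize) -> Vec<usize> {
    let n = hists.len();
    let mut owner: Vec<usize> = (0..n).collect();
    let mut alive = vec![true; n];

    // Cached once up front: each cluster's own cost ...
    let mut self_cost: Vec<f64> = hists.iter().map(histogram_bits).collect();
    // ... and, for i < j, the extra bits paid if i and j were merged.
    let mut delta = vec![vec![0.0f64; n]; n];
    for i in 0..n {
        for j in (i + 1)..n {
            delta[i][j] = merged_bits(&hists[i], &hists[j]) - self_cost[i] - self_cost[j];
        }
    }

    let mut live = n;
    while live > max_clusters.max(1) {
        // Pick the cheapest live pair from the cache; no cost is recomputed here.
        let mut best: Option<(usize, usize)> = None;
        for i in (0..n).filter(|&i| alive[i]) {
            for j in ((i + 1)..n).filter(|&j| alive[j]) {
                if best.map_or(true, |(bi, bj)| delta[i][j] < delta[bi][bj]) {
                    best = Some((i, j));
                }
            }
        }
        let (i, j) = match best {
            Some(pair) => pair,
            None => break,
        };

        // Merge cluster j into cluster i.
        let hj = hists[j];
        for s in 0..SYMBOLS {
            hists[i][s] += hj[s];
        }
        alive[j] = false;
        live -= 1;
        for o in owner.iter_mut().filter(|o| **o == j) {
            *o = i;
        }

        // Refresh only what the merge invalidated: i's self-cost and i's deltas.
        self_cost[i] = histogram_bits(&hists[i]);
        for k in (0..n).filter(|&k| k != i && alive[k]) {
            let (lo, hi) = if k < i { (k, i) } else { (i, k) };
            delta[lo][hi] = merged_bits(&hists[lo], &hists[hi]) - self_cost[lo] - self_cost[hi];
        }
    }
    owner
}
```

Each round still scans the cached deltas, but it performs only one 256-symbol cost evaluation per surviving cluster instead of one per pair. Because the cached values are the same numbers a from-scratch pass would compute, the merge order does not change, which is consistent with the byte-identical output reported above.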
Follow-up to the audit of our codecs vs their official CLIs. Two encoders were dramatically slower than the reference on incompressible / low-match data: both linear but with pathological constants, and one with a hidden quadratic over a stream.
(The rest of the audit came back clean: every codec with an official tool round-trips both ways, including zstd's default content-checksum and lzw, and no streaming-contract bugs remain. lh1/lh2 returning Unsupported without an out-of-band length is documented behaviour, not a bug.)

brotli: O(n³) histogram clustering → incremental
The literal-context clustering (encoder_ctx::cluster) was O(contexts³ · 256): it rescanned every cluster pair and recomputed each cluster's histogram_bits / merged_bits from scratch on every merge. On dense histograms (random/incompressible input, where every context spans all 256 symbols) this hit ~37,000 instructions/byte. Now it caches per-cluster self-costs and the pairwise-delta matrix and refreshes only the merged cluster's row each round.

zstd: hash table sizing + incremental indexing
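The match finder used a fixed 64 Ki-bucket hash table over a window of up to 8 MiB, so each probe walked a full max_chain of useless far-distance links. Now sized to the window (clamped), same idea as liblzma sizing its hash to the dictionary.

Verification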
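brotli output is byte-identical across inputs x quality 0..11. zstd shows 0 ratio regressions and interoperates both ways with the zstd CLI, and a 50-case fuzz (including the >8 MiB window-trim path and 128 KiB block boundaries) round-trips through both our decoder and the CLI.

🤖 Generated with Claude Code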
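As a rough illustration of the zstd table-sizing change described above, the Rust sketch below picks a bucket count from the window size and clamps it. The function names and the clamp bounds are hypothetical and do not come from the PR. The point is only the arithmetic: 64 Ki buckets over an 8 MiB window give an expected load factor of 128, while a table sized to the window brings it to about 1.

```rust
/// Returns log2 of the bucket count for a match-finder hash table covering
/// `window_size` bytes, clamped to `[min_log, max_log]`.
/// (Hypothetical helper; the bounds are chosen by the caller.)
fn hash_log_for_window(window_size: usize, min_log: u32, max_log: u32) -> u32 {
    // Smallest power of two that is >= the window, so there is roughly one
    // bucket per indexed position before clamping.
    let wanted = window_size.max(1).next_power_of_two().trailing_zeros();
    wanted.clamp(min_log, max_log)
}

/// Average number of window positions that land in each bucket.
fn expected_load_factor(window_size: usize, hash_log: u32) -> f64 {
    window_size as f64 / (1u64 << hash_log) as f64
}
```

For example, `hash_log_for_window(8 << 20, 16, 24)` yields 23, and `expected_load_factor(8 << 20, 16)` yields 128.0 for the old fixed-size table. The upper clamp is there to bound memory on very large windows; its exact value here is an assumption.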