perf(brotli,zstd): faster encode on low-redundancy input #109

Merged
MagicalTux merged 1 commit into master from perf/brotli-zstd-encode
Jun 30, 2026

@MagicalTux

Member

Follow-up to the audit of our codecs against their official CLIs. Two encoders were dramatically slower than the reference on incompressible / low-match data. Both are linear in input size but with pathological constants, and zstd additionally had a hidden quadratic cost over a multi-block stream.

(The rest of the audit came back clean: every codec with an official tool round-trips both ways — including zstd's default content-checksum and lzw — and no streaming-contract bugs remain. lh1/lh2 returning Unsupported without an out-of-band length is documented behaviour, not a bug.)

brotli — O(n³) histogram clustering → incremental

The literal-context clustering (encoder_ctx::cluster) was O(contexts³ · 256): it rescanned every cluster pair and recomputed each cluster's histogram_bits/merged_bits from scratch on every merge. On dense histograms (random/incompressible input, where every context spans all 256 symbols) this hit ~37,000 instructions/byte. Now it caches per-cluster self-costs and the pairwise-delta matrix and refreshes only the merged cluster's row each round.

  • Merge sequence is identical ⇒ compressed output is byte-for-byte identical (verified across 4 inputs × quality 0–11).
  • Incompressible encode ~8× faster (3.5 s → 0.42 s per MiB).
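
For anyone who wants the shape of the change without reading the diff, here is a minimal sketch of greedy clustering with cached costs. It is not the code in encoder_ctx::cluster. The entropy cost model behind `histogram_bits`, the `Vec<u32>` histograms, the helper names `merge`, `merge_delta` and `cluster_incremental`, and the "first lowest delta wins" tie-break are all assumptions made for illustration.

```rust
/// Ideal entropy-coded size of a histogram, in bits (assumed cost model).
fn histogram_bits(h: &[u32]) -> f64 {
    let total: u64 = h.iter().map(|&c| u64::from(c)).sum();
    if total == 0 {
        return 0.0;
    }
    let t = total as f64;
    h.iter()
        .filter(|&&c| c > 0)
        .map(|&c| f64::from(c) * (t / f64::from(c)).log2())
        .sum()
}

/// Element-wise sum of two histograms (hypothetical helper).
fn merge(a: &[u32], b: &[u32]) -> Vec<u32> {
    a.iter().zip(b).map(|(x, y)| x + y).collect()
}

/// Extra bits paid for coding clusters `a < b` with one shared histogram.
fn merge_delta(h: &[Vec<u32>], cost: &[f64], a: usize, b: usize) -> f64 {
    histogram_bits(&merge(&h[a], &h[b])) - cost[a] - cost[b]
}

/// Greedily merges histograms until at most `max_clusters` remain.
/// Returns, for each input histogram, the index of the cluster that owns it.
fn cluster_incremental(hists: &[Vec<u32>], max_clusters: usize) -> Vec<usize> {
    let n = hists.len();
    let mut h = hists.to_vec();
    let mut alive = vec![true; n];
    let mut owner: Vec<usize> = (0..n).collect();

    // Caches: per-cluster self-cost and the pairwise-delta matrix (i < j).
    let mut cost: Vec<f64> = h.iter().map(|x| histogram_bits(x)).collect();
    let mut delta = vec![vec![0.0_f64; n]; n];
    for i in 0..n {
        for j in i + 1..n {
            delta[i][j] = merge_delta(&h, &cost, i, j);
        }
    }

    let mut live = n;
    while live > max_clusters.max(1) {
        // The pair search only reads cached deltas; no histogram is touched.
        let mut best: Option<(usize, usize)> = None;
        for i in (0..n).filter(|&i| alive[i]) {
            for j in (i + 1..n).filter(|&j| alive[j]) {
                if best.map_or(true, |(bi, bj)| delta[i][j] < delta[bi][bj]) {
                    best = Some((i, j));
                }
            }
        }
        let Some((i, j)) = best else { break };

        // Merge cluster j into cluster i and retire j.
        let merged = merge(&h[i], &h[j]);
        h[i] = merged;
        alive[j] = false;
        live -= 1;
        for o in owner.iter_mut().filter(|o| **o == j) {
            *o = i;
        }

        // Refresh only the merged cluster's self-cost and its row of deltas.
        cost[i] = histogram_bits(&h[i]);
        for k in (0..n).filter(|&k| k != i && alive[k]) {
            let (a, b) = (i.min(k), i.max(k));
            delta[a][b] = merge_delta(&h, &cost, a, b);
        }
    }
    owner
}
```

With four two-symbol histograms `[10, 0]`, `[10, 0]`, `[0, 10]`, `[0, 10]` and `max_clusters = 2`, the sketch returns `[0, 0, 2, 2]`: the two identical pairs merge at zero cost. Each round recomputes one self-cost and one row of deltas rather than every pair, which is where the saving on dense 256-symbol histograms would come from. The pair search still scans the cached matrix every round.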

zstd — hash table sizing + incremental indexing

  1. The match finder used a fixed 64 Ki-bucket hash table over an up-to-8 MiB window (load factor in the hundreds), so each probe walked a full max_chain of useless far-distance links. Now sized to the window (clamped), same idea as liblzma sizing its hash to the dictionary.
  2. The per-block index was rebuilt over all of history every block — O(history)/block, i.e. quadratic over a stream. The chains now persist across blocks (the history prefix is byte-stable until a window trim) and each block only indexes the new positions; trims fall back to a full rebuild.
  • Output unchanged on single-block inputs, equal-or-smaller on multi-block (the larger table finds slightly better matches); 0 ratio regressions across a broad input/level sweep.
  • Random encode ~3× faster (0.118 s → 0.035 s per MiB); large-input scaling improved (the residual super-linearity at ≫8 MiB is cache-bound, noted for follow-up).
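
The two zstd changes can be sketched together as follows. This is an illustration under stated assumptions, not the match finder in this PR: `ChainIndex`, `hash_log_for_window`, the `MIN_HASH_LOG`/`MAX_HASH_LOG` clamp bounds, the multiplicative hash, and the 4-byte hash width are all hypothetical. For scale, the old fixed table gave 8 MiB / 64 Ki = 128 window positions per bucket at full window.

```rust
/// Hypothetical clamp bounds; the PR does not state the real ones.
const MIN_HASH_LOG: u32 = 10;
const MAX_HASH_LOG: u32 = 24;

/// Bucket-count exponent of roughly ceil(log2(window_size)), clamped.
fn hash_log_for_window(window_size: usize) -> u32 {
    let ceil_log2 = usize::BITS - (window_size.max(1) - 1).leading_zeros();
    ceil_log2.clamp(MIN_HASH_LOG, MAX_HASH_LOG)
}

/// Hash-chain index whose chains persist across blocks.
#[derive(Debug, PartialEq)]
struct ChainIndex {
    hash_log: u32,
    head: Vec<u32>, // per bucket: newest position + 1, or 0 if empty
    prev: Vec<u32>, // per position: previous same-bucket position + 1, or 0
    indexed: usize, // positions [0, indexed) are already in the chains
}

impl ChainIndex {
    fn new(window_size: usize) -> Self {
        let hash_log = hash_log_for_window(window_size);
        ChainIndex {
            hash_log,
            head: vec![0; 1usize << hash_log],
            prev: Vec::new(),
            indexed: 0,
        }
    }

    /// Multiplicative hash of 4 bytes into `hash_log` bits (assumed hash).
    fn bucket(&self, four: &[u8]) -> usize {
        let v = u32::from_le_bytes([four[0], four[1], four[2], four[3]]);
        (v.wrapping_mul(2_654_435_761) >> (32 - self.hash_log)) as usize
    }

    /// Indexes only the positions added since the previous call.
    /// Valid only while the already-indexed prefix of `history` is unchanged.
    fn extend(&mut self, history: &[u8]) {
        // A position is hashable once 4 bytes are available from it.
        let end = history.len().saturating_sub(3);
        if self.prev.len() < end {
            self.prev.resize(end, 0);
        }
        for pos in self.indexed..end {
            let b = self.bucket(&history[pos..pos + 4]);
            self.prev[pos] = self.head[b];
            self.head[b] = pos as u32 + 1;
        }
        self.indexed = self.indexed.max(end);
    }

    /// Fallback after a window trim: positions shifted, so start over.
    fn rebuild(&mut self, history: &[u8]) {
        self.head.fill(0);
        self.prev.clear();
        self.indexed = 0;
        self.extend(history);
    }
}
```

In this sketch the encoder would call `extend` once per block with the full history slice and `rebuild` after a trim. Feeding a buffer in two `extend` calls leaves the same `head` and `prev` as one `rebuild` over the whole buffer, so the per-block cost is proportional to the new positions rather than to all of history.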

Verification

  • brotli output byte-identical (inputs × quality 0–11).
  • zstd: 0 ratio regressions, interop both ways with the zstd CLI, and a 50-case fuzz (including the >8 MiB window-trim path and 128 KiB block boundaries) round-trips through both our decoder and the CLI.
  • Full suite (61 binaries), clippy, and fmt all clean.

🤖 Generated with Claude Code

Audit of the codecs against their official CLIs found two encoders that
were dramatically slower than the reference on incompressible/low-match
data — both linear but with pathological constants. (Interop and the
streaming contract were otherwise clean across every codec with an
official tool; lh1/lh2's Unsupported-without-length is documented, not a
bug.)

brotli: the literal-context histogram clustering was O(contexts^3 * 256)
— it rescanned all cluster pairs and recomputed each cluster's cost from
scratch on every merge — which exploded on dense histograms (~37k
instructions/byte on random input). Cache per-cluster costs and the
pairwise-delta matrix, updating only the merged cluster each round. The
merge sequence and compressed output are byte-for-byte identical;
incompressible encode is ~8x faster.

zstd: the match finder used a fixed 64 Ki-bucket hash table over an
up-to-8 MiB window (load factor in the hundreds), so each probe walked a
full chain of useless far links. Size the table to the window. Also build
the per-block match index incrementally — the chains persist across blocks
(the history prefix is byte-stable until a window trim) instead of
re-indexing all of history every block, which was O(history) per block and
quadratic over a stream. Output is unchanged on single-block inputs and
equal-or-smaller on multi-block ones (0 ratio regressions observed);
random encode is ~3x faster.

Verified: brotli output byte-identical across inputs x quality 0..11;
zstd 0 ratio regressions and interop both ways with the zstd CLI; 50-case
zstd fuzz (incl. >8 MiB trim path and block boundaries) round-trips
through both our decoder and the CLI; full suite, clippy, and fmt clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@MagicalTux force-pushed the perf/brotli-zstd-encode branch from f016c26 to 552f93e on June 30, 2026 11:12
@MagicalTux merged commit 0e48c56 into master on Jun 30, 2026
42 checks passed
@MagicalTux deleted the perf/brotli-zstd-encode branch on June 30, 2026 11:16