perf(brotli,zstd): faster encode on low-redundancy input by MagicalTux · Pull Request #109 · KarpelesLab/compcol

MagicalTux · 2026-06-30T11:11:48Z

Follow-up to the audit of our codecs vs their official CLIs. Two encoders were dramatically slower than the reference on incompressible / low-match data: both are linear in input size but with pathological constants, and zstd also had a hidden quadratic cost over a multi-block stream.

(The rest of the audit came back clean: every codec with an official tool round-trips both ways — including zstd's default content-checksum and lzw — and no streaming-contract bugs remain. lh1/lh2 returning Unsupported without an out-of-band length is documented behaviour, not a bug.)

brotli — O(n³) histogram clustering → incremental

The literal-context clustering (encoder_ctx::cluster) was O(contexts³ · 256): it rescanned every cluster pair and recomputed each cluster's histogram_bits/merged_bits from scratch on every merge. On dense histograms (random/incompressible input, where every context spans all 256 symbols) this hit ~37,000 instructions/byte. Now it caches per-cluster self-costs and the pairwise-delta matrix and refreshes only the merged cluster's row each round.

Merge sequence is identical ⇒ compressed output is byte-for-byte identical (verified across 4 inputs × quality 0–11).
Incompressible encode ~8× faster (3.5 s → 0.42 s per MiB).
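
For readers who want the shape of the change, here is a minimal sketch of greedy pairwise clustering done both ways. It is not the code in encoder_ctx::cluster: the entropy-based `bits` cost, the `Hist` type, and the function names are all assumptions made for illustration. The point it tries to show is that caching self-costs and the delta matrix, and refreshing only the merged cluster's row, produces the same merge sequence as recomputing everything each round.

```rust
const SYMS: usize = 256;
type Hist = [u32; SYMS];

/// Stand-in cost model (assumption): Shannon-entropy estimate of the bits for `h`.
fn bits(h: &Hist) -> f64 {
    let total: u64 = h.iter().map(|&c| c as u64).sum();
    if total == 0 {
        return 0.0;
    }
    let t = total as f64;
    h.iter()
        .filter(|&&c| c > 0)
        .map(|&c| c as f64 * (t / c as f64).log2())
        .sum()
}

fn merged(a: &Hist, b: &Hist) -> Hist {
    let mut m = *a;
    for (x, y) in m.iter_mut().zip(b) {
        *x += *y;
    }
    m
}

/// Extra bits paid for coding `a` and `b` with one shared histogram.
fn delta(a: &Hist, b: &Hist, cost_a: f64, cost_b: f64) -> f64 {
    bits(&merged(a, b)) - cost_a - cost_b
}

/// Old shape: every round recomputes every pair's costs from scratch.
fn cluster_naive(mut hs: Vec<Hist>, max: usize) -> Vec<(usize, usize)> {
    let n = hs.len();
    let (mut alive, mut live, mut merges) = (vec![true; n], n, Vec::new());
    while live > max {
        let mut best: Option<(f64, usize, usize)> = None;
        for i in (0..n).filter(|&i| alive[i]) {
            for j in (i + 1..n).filter(|&j| alive[j]) {
                let v = delta(&hs[i], &hs[j], bits(&hs[i]), bits(&hs[j]));
                if best.map_or(true, |(b, _, _)| v < b) {
                    best = Some((v, i, j));
                }
            }
        }
        let Some((_, i, j)) = best else { break };
        hs[i] = merged(&hs[i], &hs[j]);
        alive[j] = false;
        live -= 1;
        merges.push((i, j));
    }
    merges
}

/// New shape: cached self-costs and delta matrix; only row/column `i` is refreshed.
fn cluster_incremental(mut hs: Vec<Hist>, max: usize) -> Vec<(usize, usize)> {
    let n = hs.len();
    let (mut alive, mut live, mut merges) = (vec![true; n], n, Vec::new());
    let mut cost: Vec<f64> = hs.iter().map(bits).collect();
    let mut d = vec![vec![0.0f64; n]; n];
    for i in 0..n {
        for j in i + 1..n {
            d[i][j] = delta(&hs[i], &hs[j], cost[i], cost[j]);
        }
    }
    while live > max {
        // Picking the minimum only reads the cache; no histogram is touched.
        let mut best: Option<(f64, usize, usize)> = None;
        for i in (0..n).filter(|&i| alive[i]) {
            for j in (i + 1..n).filter(|&j| alive[j]) {
                let v = d[i][j];
                if best.map_or(true, |(b, _, _)| v < b) {
                    best = Some((v, i, j));
                }
            }
        }
        let Some((_, i, j)) = best else { break };
        hs[i] = merged(&hs[i], &hs[j]);
        cost[i] = bits(&hs[i]);
        alive[j] = false;
        live -= 1;
        for k in (0..n).filter(|&k| k != i && alive[k]) {
            let (lo, hi) = if k < i { (k, i) } else { (i, k) };
            d[lo][hi] = delta(&hs[lo], &hs[hi], cost[lo], cost[hi]);
        }
        merges.push((i, j));
    }
    merges
}
```

In this sketch each round of the naive version performs O(k²) histogram merges and cost evaluations of 256 symbols each, while the incremental version performs O(k) of them plus a cheap scan of cached floats. Because both versions evaluate the same expressions in the same order, ties break identically, which is the property that keeps the output byte-identical.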

zstd — hash table sizing + incremental indexing

The match finder used a fixed 64 Ki-bucket hash table over an up-to-8 MiB window (load factor in the hundreds), so each probe walked a full max_chain of useless far-distance links. Now sized to the window (clamped), same idea as liblzma sizing its hash to the dictionary.
The per-block index was rebuilt over all of history every block — O(history)/block, i.e. quadratic over a stream. The chains now persist across blocks (the history prefix is byte-stable until a window trim) and each block only indexes the new positions; trims fall back to a full rebuild.
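
To put numbers on the load-factor point, here is a small sketch of sizing the table from the window. The function names and both clamp bounds are assumptions for illustration; the PR does not state the actual clamp values.

```rust
/// Old fixed size from the description: 64 Ki buckets.
const MIN_HASH_LOG: u32 = 16;
/// Hypothetical upper clamp; the real bound is not given in this PR.
const MAX_HASH_LOG: u32 = 22;

/// Pick a power-of-two bucket count close to the window size, then clamp it.
fn hash_log_for_window(window_size: usize) -> u32 {
    // ceil(log2(window_size)): one bucket per window byte before clamping.
    let want = window_size.max(1).next_power_of_two().trailing_zeros();
    want.clamp(MIN_HASH_LOG, MAX_HASH_LOG)
}

/// Average number of window positions that land in each bucket.
fn load_factor(window_size: usize, hash_log: u32) -> f64 {
    window_size as f64 / (1u64 << hash_log) as f64
}
```

With an 8 MiB window, `load_factor(8 << 20, 16)` is 128 for the old fixed table, so a chain walk mostly visits unrelated far positions. Under the assumed clamp of 22 bits the same window gives a load factor of 2.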

Output unchanged on single-block inputs, equal-or-smaller on multi-block (the larger table finds slightly better matches); 0 ratio regressions across a broad input/level sweep.
Random encode ~3× faster (0.118 s → 0.035 s per MiB); large-input scaling improved (the residual super-linearity at ≫8 MiB is cache-bound, noted for follow-up).
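
The scaling improvement comes from the incremental index. A minimal sketch of the idea follows, assuming a plain hash-chain layout; `ChainIndex`, its fields, the 4-byte hash, and the multiplier are illustrative choices, not the names or constants used in the encoder. `extend` indexes only positions added since the last call, and `rebuild` is the fallback for a window trim.

```rust
const MIN_MATCH: usize = 4;

/// Hypothetical hash-chain index that persists across blocks.
struct ChainIndex {
    hash_log: u32,  // assumed to be in 1..=31
    head: Vec<u32>, // bucket -> most recent position + 1 (0 means empty)
    prev: Vec<u32>, // position -> previous position in the same bucket + 1
    indexed: usize, // positions [0, indexed) are already in the chains
}

impl ChainIndex {
    fn new(hash_log: u32) -> Self {
        ChainIndex { hash_log, head: vec![0; 1usize << hash_log], prev: Vec::new(), indexed: 0 }
    }

    fn hash(&self, b: &[u8]) -> usize {
        let v = u32::from_le_bytes([b[0], b[1], b[2], b[3]]);
        (v.wrapping_mul(2654435761) >> (32 - self.hash_log)) as usize
    }

    /// Index only the new positions. Requires that history[..indexed] is
    /// byte-identical to what was indexed before (no trim in between).
    /// Returns how many positions were inserted by this call.
    fn extend(&mut self, history: &[u8]) -> usize {
        let start = self.indexed;
        // The last MIN_MATCH - 1 bytes cannot be hashed until more data arrives.
        let end = history.len().saturating_sub(MIN_MATCH - 1).max(start);
        self.prev.resize(end.max(self.prev.len()), 0);
        for pos in start..end {
            let h = self.hash(&history[pos..pos + MIN_MATCH]);
            self.prev[pos] = self.head[h];
            self.head[h] = pos as u32 + 1;
        }
        self.indexed = end;
        end - start
    }

    /// Fallback after a window trim: positions shifted, so start over.
    fn rebuild(&mut self, history: &[u8]) -> usize {
        self.head.fill(0);
        self.prev.clear();
        self.indexed = 0;
        self.extend(history)
    }
}
```

Calling `extend` once per block touches each position once over the whole stream, whereas calling `rebuild` once per block re-inserts all of history every time, which is the quadratic behaviour described above. Both leave the same chains behind as long as no trim has happened.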

Verification

brotli output byte-identical (4 inputs × quality 0–11).
zstd: 0 ratio regressions, interop both ways with the zstd CLI, and a 50-case fuzz (including the >8 MiB window-trim path and 128 KiB block boundaries) round-trips through both our decoder and the CLI.
Full suite (61 binaries), clippy, and fmt all clean.

🤖 Generated with Claude Code

Audit of the codecs against their official CLIs found two encoders that were dramatically slower than the reference on incompressible/low-match data, both linear but with pathological constants. (Interop and the streaming contract were otherwise clean across every codec with an official tool; lh1/lh2's Unsupported-without-length is documented, not a bug.)

brotli: the literal-context histogram clustering was O(contexts^3 * 256): it rescanned all cluster pairs and recomputed each cluster's cost from scratch on every merge, which exploded on dense histograms (~37k instructions/byte on random input). Cache per-cluster costs and the pairwise-delta matrix, updating only the merged cluster each round. The merge sequence and compressed output are byte-for-byte identical; incompressible encode is ~8x faster.

zstd: the match finder used a fixed 64 Ki-bucket hash table over an up-to-8 MiB window (load factor in the hundreds), so each probe walked a full chain of useless far links. Size the table to the window. Also build the per-block match index incrementally: the chains persist across blocks (the history prefix is byte-stable until a window trim) instead of re-indexing all of history every block, which was O(history) per block and quadratic over a stream. Output is unchanged on single-block inputs and equal-or-smaller on multi-block ones (0 ratio regressions observed); random encode is ~3x faster.

Verified: brotli output byte-identical across inputs x quality 0..11; zstd 0 ratio regressions and interop both ways with the zstd CLI; 50-case zstd fuzz (incl. >8 MiB trim path and block boundaries) round-trips through both our decoder and the CLI; full suite, clippy, and fmt clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

MagicalTux force-pushed the perf/brotli-zstd-encode branch from f016c26 to 552f93e Compare June 30, 2026 11:12

MagicalTux merged commit 0e48c56 into master Jun 30, 2026
42 checks passed

MagicalTux deleted the perf/brotli-zstd-encode branch June 30, 2026 11:16

MagicalTux merged 1 commit into master from perf/brotli-zstd-encode
