Complete stubs: raw LZMA2 encoder + LZFSE bvx2 decoder (lz5 Huffman documented) #94

Merged: MagicalTux merged 5 commits into master from complete-stubs on Jun 14, 2026

@MagicalTux (Member)

Completes the achievable Error::Unsupported stubs flagged in the codec-gap review. Two are now real, validated implementations; the third is left honest with a precise rationale (it can't be validated in this environment).

✅ Raw LZMA2 encoder (lzma2)

compcol::lzma2::Lzma2 now encodes as well as decodes. It emits the raw 7-Zip LZMA2 chunk stream — full dict/props/state reset per chunk (so each chunk is independently decodable), an uncompressed-chunk fallback when compression would expand, and the 0x00 end marker — reusing the existing xz LZMA2 chunk codec (encode_lzma_chunk). The dictionary size is out of band (the 7z coder property); the encoder uses the 4 MiB default so a default-config decoder round-trips.

Validation: round-trip (empty / small / multi-chunk / incompressible-forcing-uncompressed-fallback / zeros, bulk + byte-at-a-time) and cross-decode of the raw output through the shared xz LZMA2 codec.
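
The chunk framing described above can be sketched as follows. This is a minimal illustration, not the crate's encoder: `CHUNK_CAP`, `compressed_chunk_header` and `store_raw_lzma2` are hypothetical names, the props byte is passed in because the value of `LZMA2_PROPS_BYTE` is not shown here, and the store-only writer exercises only the uncompressed fallback path (control 0x01) plus the 0x00 end marker.

```rust
/// Per-chunk payload cap used by this sketch (64 KiB, matching the PR text).
const CHUNK_CAP: usize = 64 * 1024;

/// Header of a full-reset compressed chunk (hypothetical helper).
/// `props` is the LZMA properties byte, supplied by the caller.
fn compressed_chunk_header(unpacked: usize, packed: usize, props: u8) -> [u8; 6] {
    assert!((1..=CHUNK_CAP).contains(&unpacked));
    assert!((1..=CHUNK_CAP).contains(&packed));
    let (u, p) = (unpacked - 1, packed - 1); // both sizes are stored minus one
    // 0xE0 = compressed chunk with dictionary reset, state reset and new props.
    // The low five bits carry bits 16..=20 of `u`, which are zero under a
    // 64 KiB cap, so the control byte comes out as exactly 0xE0.
    let control = 0xE0 | (((u >> 16) & 0x1F) as u8);
    [control, (u >> 8) as u8, u as u8, (p >> 8) as u8, p as u8, props]
}

/// Store-only raw LZMA2 stream (hypothetical helper): every chunk is an
/// uncompressed chunk with a dictionary reset (control 0x01), and the stream
/// ends with the 0x00 marker.
fn store_raw_lzma2(data: &[u8]) -> Vec<u8> {
    let mut out = Vec::new();
    for chunk in data.chunks(CHUNK_CAP) {
        let n = chunk.len() - 1; // size is stored minus one, big-endian
        out.push(0x01);
        out.push((n >> 8) as u8);
        out.push(n as u8);
        out.extend_from_slice(chunk);
    }
    out.push(0x00); // end marker; an empty input yields just this byte
    out
}
```

Because every chunk carries a dictionary reset, a decoder can start at any chunk boundary, which is the independence property the description relies on.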

✅ LZFSE bvx2 decoder (lzfse)

The core LZFSE v2 block type (LZ77 + Finite State Entropy) now decodes — full v2 header parse, 4-way interleaved literal FSE, three interleaved L/M/D FSE streams (reverse bitstreams), and LZ reconstruction with the repeat-distance rule. Critically, the FSE table builder was upgraded from a power-of-two-only simplification to Apple's general fse_init_decoder_table (k/k-1 split), so it handles the arbitrary frequency tables real .lzfse files use — not just its own encoder's output.

Validation: round-trip against an in-crate v2 encoder using general (non-dyadic) frequency normalization, a 1200-case fuzz loop, malformed/truncated rejection, and a frozen hand-written non-dyadic bvx2 block that decodes to a known string without invoking the encoder (guards against shared encoder/decoder bugs). No Apple lzfse tool is available in this environment, so real-stream interop is best-effort but follows the documented format precisely. bvx1 stays Unsupported.
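
The LZ reconstruction step with the repeat-distance rule can be sketched like this. It is an illustration under assumptions, not the crate's code: `Lmd` and `execute_lmd` are hypothetical names, the literals are assumed to be already FSE-decoded into one buffer, and the distance is validated only when a match is actually copied, which may be looser than the real decoder.

```rust
/// One decoded command: `l` literal bytes, then an `m`-byte match `d` bytes back.
/// (Hypothetical type for this sketch.)
#[derive(Debug, Clone, Copy)]
struct Lmd {
    l: usize,
    m: usize,
    d: usize,
}

/// Replays L/M/D commands into an output buffer of exactly `n_raw_bytes`.
fn execute_lmd(
    cmds: &[Lmd],
    literals: &[u8],
    n_raw_bytes: usize,
) -> Result<Vec<u8>, &'static str> {
    let mut out: Vec<u8> = Vec::new();
    let mut lit_pos = 0usize;
    let mut prev_d = 0usize; // 0 means "no distance seen yet"

    for c in cmds {
        // Reject the command up front if it would overshoot the declared size.
        let produced = c.l.checked_add(c.m).ok_or("command length overflow")?;
        if produced > n_raw_bytes - out.len() {
            return Err("output would exceed n_raw_bytes");
        }

        // Copy the literal run.
        let lit_end = lit_pos.checked_add(c.l).ok_or("literal offset overflow")?;
        if lit_end > literals.len() {
            return Err("literal run past end of literal buffer");
        }
        out.extend_from_slice(&literals[lit_pos..lit_end]);
        lit_pos = lit_end;

        // D == 0 reuses the previous match distance.
        let d = if c.d == 0 { prev_d } else { c.d };
        if c.m > 0 {
            if d == 0 || d > out.len() {
                return Err("match distance out of range");
            }
            // Byte-by-byte so that overlapping matches (d < m) replicate.
            for _ in 0..c.m {
                let b = out[out.len() - d];
                out.push(b);
            }
        }
        prev_d = d;
    }

    if out.len() != n_raw_bytes {
        return Err("block shorter than n_raw_bytes");
    }
    Ok(out)
}
```

For example, the commands `(L=3, M=6, D=3)` and `(L=1, M=3, D=0)` over the literals `abcx` expand to `abcabcabcxbcx`: the second command reuses distance 3.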

📝 lz5 (Lizard) Huffman — left Unsupported, now documented

Investigated reusing zstd's Huff0 decoder. Blockers: zstd's decoder is X1-only and module-private, while Lizard's HUF_decompress selects X1/X2 from (regenSize, comprLen) at runtime (never stored in the stream) — a conformant decoder needs both variants plus the HUF_selectDecoder heuristic. With no lizard tool, no fixtures, and a store-only lz5 encoder, it is unvalidatable here, so shipping a decoder would be the self-validating fiction the crate avoids (cf. lzham/sit13). Left honest, with the concrete reuse path recorded in the module docs for a future round. Doc-only change.

Checks

  • cargo test --all-features: 61 suites green, 0 failures (incl. the expanded tests/lzma2.rs and tests/lzfse.rs).
  • cargo fmt --check, cargo clippy --all-features --all-targets -- -D warnings, rustdoc -D warnings — clean.
  • lzma2 / lzfse / lz5 each build standalone under --no-default-features.

🤖 Generated with Claude Code

MagicalTux and others added 5 commits June 15, 2026 02:22
Replace the permanent Error::Unsupported encoder stub in src/lzma2 with a
real streaming encoder that emits a raw LZMA2 chunk stream:

- Mirrors the xz encoder's chunk framing but emits ONLY the raw chunk
  stream (no .xz container): full-reset chunks (control 0xE0 compressed /
  0x01 uncompressed) terminated by a single 0x00 end marker.
- Reuses the shared encode_lzma_chunk / EncoderParams / LZMA2_PROPS_BYTE
  from lzma2_internal::lzma2_encoder — no LZMA re-implementation.
- 64 KiB per-chunk cap; uncompressed-chunk fallback when compression would
  expand the data; 4 MiB dictionary (the xz default) bounding match
  distances, documented as the out-of-band dict-size contract so a
  default-config decoder round-trips.
- Gate lzma2_internal::lzma2_encoder on `lzma2` too so an lzma2-only
  (decode+encode) no_std build compiles.

Validation: module unit tests + tests/lzma2.rs cover empty/1-byte/small/
multi-chunk/incompressible-fallback/highly-compressible round-trips, encoder
reset reuse, factory wiring, and an xz cross-validation that wraps our raw
payload in a minimal .xz container and decodes it via the public xz decoder
(shared chunk codec) to prove framing parity.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Decode the core LZFSE v2 block type: 4-way interleaved FSE literals plus
L/M/D commands over reverse FSE bitstreams, executing the LZ77 copy (with
D==0 rep-distance reuse) and enforcing n_raw_bytes. Wire it into the
streaming decoder, buffering the variable-length header + both payload
streams before decoding in one shot, with DoS-bounded allocation and
Corrupt/UnexpectedEnd on malformed input (no panics).

Validated by round-trip against a spec-conformant test-only v2 encoder
(power-of-two FSE normalization, documented header/freq-table packing):
empty/small/text/repetitive/random/multi-block plus a 1200-case fuzz loop
and byte-at-a-time streaming. No Apple reference fixtures are available in
this environment, so Apple interop is best-effort but follows the
documented wire format precisely.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

The bvx2 FSE decode tables previously assigned a single bit-width per
symbol, which is only exact for power-of-two frequencies — so the decoder
could not handle real Apple LZFSE v2 streams (general frequencies) and
round-tripped only against an encoder restricted to dyadic tables.

Implement Apple's general `fse_init_decoder_table` algorithm in both
`build_literal_decoder` and `build_lmd_decoder`. With `L = log2(n_states)`
and `k = L - floor(log2(f))`, each symbol's `f` slots are split at
`j0 = ((2*n_states) >> k) - f`: slots `j < j0` read k bits and slots
`j >= j0` read k-1 bits. When `f` is a power of two, `j0 == f` and the
split degenerates to the old single-k form. The per-symbol power-of-two
restriction is gone; only the table SIZE (2^L) is still required to be a
power of two.
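
A minimal sketch of the k/k-1 split, written from the formulas in this commit message rather than copied from the crate: `FseEntry` and `build_fse_decoder_table` are hypothetical names, the `delta` values follow Apple's reference table layout as best recalled, and the 65536-state limit exists only to keep the arithmetic in `u32`. A decoder would move to the next state as `delta + read_bits(k)`.

```rust
/// One decoder slot: read `k` bits, add them to `delta` to get the next state.
/// (Hypothetical type for this sketch.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct FseEntry {
    symbol: u8,
    k: u32,
    delta: u32,
}

fn build_fse_decoder_table(n_states: u32, freqs: &[u16]) -> Result<Vec<FseEntry>, &'static str> {
    if !n_states.is_power_of_two() || n_states > (1 << 16) {
        return Err("table size must be a power of two no larger than 65536");
    }
    if freqs.len() > 256 {
        return Err("too many symbols");
    }
    let l = n_states.trailing_zeros(); // L = log2(n_states)
    let mut table = Vec::with_capacity(n_states as usize);
    let mut sum = 0u32;

    for (symbol, &f) in freqs.iter().enumerate() {
        let f = u32::from(f);
        if f == 0 {
            continue; // symbol never occurs, gets no slots
        }
        sum += f;
        if sum > n_states {
            return Err("frequencies exceed table size");
        }
        let k = l - (31 - f.leading_zeros()); // k = L - floor(log2(f))
        let j0 = ((2 * n_states) >> k) - f; // split point; j0 == f when f is 2^m

        for j in 0..f {
            let entry = if j < j0 {
                // k-bit region: covers the top of the state range.
                FseEntry { symbol: symbol as u8, k, delta: ((f + j) << k) - n_states }
            } else {
                // (k-1)-bit region: covers the bottom of the state range.
                // Only reached when j0 < f, which implies k >= 1.
                FseEntry { symbol: symbol as u8, k: k - 1, delta: (j - j0) << (k - 1) }
            };
            table.push(entry);
        }
    }
    Ok(table)
}
```

With literal frequencies 1000/24 in a 1024-state table, the first symbol gets `k = 1, j0 = 24` and the second gets `k = 6, j0 = 8`. In both cases the ranges `delta .. delta + 2^k` of a symbol's slots tile `0..1024` exactly once, which is the bijection property the new fse.rs tests check.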

The test-only v2 encoder now uses standard quantized (nearest)
normalization producing general non-power-of-two frequencies, with encode
slots that exactly invert the k/k-1 decode table (a slot in the i<j0 region
consumes k bits, else k-1).

New coverage that a regression to single-k would fail:
- fse.rs: per-symbol bijection/tiling checks for non-dyadic literal and LMD
  frequency tables (and a dyadic-still-works check).
- lzfse_v2.rs: round-trips over deliberately skewed non-dyadic literal
  distributions and small/singleton match-count histograms.
- A hand-frozen non-dyadic bvx2 block (literal freqs 1000/24) decoded
  through the public streaming decoder to a known string, independent of
  the encoder — guards against encoder+decoder sharing the same bug.

Module docs updated to drop the power-of-two limitation and state that FSE
table construction now matches Apple's general algorithm (Apple-stream
interop remains best-effort: still no reference fixture in this env, but the
real table-construction algorithm is now followed).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…hange)

Investigated implementing Lizard's Huffman entropy stage. Confirmed from
the reference source that Lizard frames each Huffman sub-stream as a
6-byte header (3-byte LE regen size + 3-byte LE compressed size) and
decodes it with the generic Huff0 HUF_decompress.
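
The 6-byte framing described above could be parsed as sketched below. This covers only the header split, not Huffman decoding. `HufSubstreamHeader`, `read_le24` and `split_huf_substream` are hypothetical names, and the sketch assumes the compressed size counts only the payload bytes that follow the header.

```rust
/// Sizes from the 6-byte sub-stream header (hypothetical type for this sketch).
#[derive(Debug, PartialEq, Eq)]
struct HufSubstreamHeader {
    regen_size: usize,      // size after Huffman decoding
    compressed_size: usize, // size of the Huff0 payload that follows
}

/// Reads a 3-byte little-endian integer from the start of `b`.
fn read_le24(b: &[u8]) -> usize {
    usize::from(b[0]) | (usize::from(b[1]) << 8) | (usize::from(b[2]) << 16)
}

/// Splits `input` into (header, Huff0 payload, remaining bytes).
fn split_huf_substream(input: &[u8]) -> Result<(HufSubstreamHeader, &[u8], &[u8]), &'static str> {
    if input.len() < 6 {
        return Err("truncated sub-stream header");
    }
    let header = HufSubstreamHeader {
        regen_size: read_le24(&input[0..3]),
        compressed_size: read_le24(&input[3..6]),
    };
    let rest = &input[6..];
    if header.compressed_size > rest.len() {
        return Err("truncated sub-stream payload");
    }
    let (payload, tail) = rest.split_at(header.compressed_size);
    Ok((header, payload, tail))
}
```

The two sizes returned here are exactly the inputs the X1/X2 selector would need, which is why the header alone is not enough to decode a block.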

Decision: keep Unsupported. The generic HUF_decompress selects between
X1 (single-symbol) and X2 (double-symbol) decode tables via
HUF_selectDecoder, and that choice is recomputed from the
regen/compressed sizes -- it is never stored in the stream. The crate's
Huff0 decoder (src/zstd/huffman.rs) is X1-only and private to the zstd
module; it has no X2 decoder and no size-driven selector. With no lizard
CLI and no Huff0 fixtures available, the only possible "test" is a
round-trip against a hand-written X1-only encoder, which would always
pick X1 and validate nothing about real (possibly X2) blocks. Per the
crate's lzham/sit13 policy, an unvalidatable decoder is worse than an
honest Unsupported.

Doc-only: expand the rationale in block.rs and mod.rs (Huff0 framing,
X1/X2 selector blocker, and a concrete reuse path for a future round);
also clarify the LIZv1 out-of-scope note. No behavior change; the
existing rejects_huffman_flag / rejects_lizv1_mode tests still hold.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@MagicalTux merged commit a9384f2 into master on Jun 14, 2026
41 checks passed