Complete stubs: raw LZMA2 encoder + LZFSE bvx2 decoder (lz5 Huffman documented) #94
Merged
Replace the permanent Error::Unsupported encoder stub in src/lzma2 with a real streaming encoder that emits a raw LZMA2 chunk stream:

- Mirrors the xz encoder's chunk framing but emits ONLY the raw chunk stream (no .xz container): full-reset chunks (control 0xE0 compressed / 0x01 uncompressed) terminated by a single 0x00 end marker.
- Reuses the shared encode_lzma_chunk / EncoderParams / LZMA2_PROPS_BYTE from lzma2_internal::lzma2_encoder; no LZMA re-implementation.
- 64 KiB per-chunk cap; uncompressed-chunk fallback when compression would expand the data; 4 MiB dictionary (the xz default) bounding match distances, documented as the out-of-band dict-size contract so a default-config decoder round-trips.
- Gate lzma2_internal::lzma2_encoder on `lzma2` too so an lzma2-only (decode+encode) no_std build compiles.

Validation: module unit tests + tests/lzma2.rs cover empty/1-byte/small/multi-chunk/incompressible-fallback/highly-compressible round-trips, encoder reset reuse, factory wiring, and an xz cross-validation that wraps our raw payload in a minimal .xz container and decodes it via the public xz decoder (shared chunk codec) to prove framing parity.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
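The framing in this commit can be illustrated with a minimal sketch that covers only the uncompressed-chunk fallback path. It assumes the standard LZMA2 layout for an uncompressed chunk (control `0x01`, a big-endian 16-bit `size - 1`, then the raw bytes) and the `0x00` end marker. `frame_stored` and `unframe_stored` are hypothetical names used for illustration, not the crate's API, and the compressed `0xE0` path is left out entirely.

```rust
/// Per-chunk cap taken from the commit message (64 KiB).
const CHUNK_CAP: usize = 64 * 1024;

/// Hypothetical helper: frame `input` as a raw LZMA2 stream made only of
/// uncompressed, dictionary-resetting chunks (control byte 0x01).
fn frame_stored(input: &[u8]) -> Vec<u8> {
    let mut out = Vec::new();
    for chunk in input.chunks(CHUNK_CAP) {
        // `chunks` never yields an empty slice, so `len() - 1` fits in u16.
        let size_minus_one = (chunk.len() - 1) as u16;
        out.push(0x01);
        out.extend_from_slice(&size_minus_one.to_be_bytes());
        out.extend_from_slice(chunk);
    }
    out.push(0x00); // single end marker
    out
}

/// Hypothetical inverse: accepts only 0x01 chunks followed by one 0x00.
fn unframe_stored(stream: &[u8]) -> Option<Vec<u8>> {
    let mut out = Vec::new();
    let mut pos = 0usize;
    loop {
        match *stream.get(pos)? {
            // The end marker must be the last byte of the stream.
            0x00 => return if pos + 1 == stream.len() { Some(out) } else { None },
            0x01 => {
                let hdr = stream.get(pos + 1..pos + 3)?;
                let len = u16::from_be_bytes([hdr[0], hdr[1]]) as usize + 1;
                let body = stream.get(pos + 3..pos + 3 + len)?;
                out.extend_from_slice(body);
                pos += 3 + len;
            }
            // Compressed chunks (0x80 and up) are out of scope for this sketch.
            _ => return None,
        }
    }
}
```

An empty input produces just the end marker, and every 64 KiB of input costs three bytes of chunk header, which is the bounded expansion the fallback relies on.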
Decode the core LZFSE v2 block type: 4-way interleaved FSE literals plus L/M/D commands over reverse FSE bitstreams, executing the LZ77 copy (with D==0 rep-distance reuse) and enforcing n_raw_bytes. Wire it into the streaming decoder, buffering the variable-length header + both payload streams before decoding in one shot, with DoS-bounded allocation and Corrupt/UnexpectedEnd on malformed input (no panics).

Validated by round-trip against a spec-conformant test-only v2 encoder (power-of-two FSE normalization, documented header/freq-table packing): empty/small/text/repetitive/random/multi-block plus a 1200-case fuzz loop and byte-at-a-time streaming.

No Apple reference fixtures are available in this environment, so Apple interop is best-effort but follows the documented wire format precisely.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
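The LZ77 execution step described in this commit can be sketched roughly as follows, with the FSE decoding already done. `Lmd`, `LzError`, and `execute_lmd` are hypothetical names, and two details are assumptions rather than facts from the commit: a `D == 0` command before any distance has been seen is treated as corrupt, and the previous distance is updated even when `M == 0`.

```rust
#[derive(Debug, PartialEq, Eq)]
enum LzError {
    Corrupt,
}

/// One decoded command: copy `l` literals, then `m` match bytes from distance `d`.
#[derive(Clone, Copy)]
struct Lmd {
    l: usize,
    m: usize,
    d: usize,
}

/// Hypothetical sketch of L/M/D execution with repeat-distance reuse.
fn execute_lmd(cmds: &[Lmd], literals: &[u8], n_raw_bytes: usize) -> Result<Vec<u8>, LzError> {
    let mut out = Vec::new();
    let mut lit_pos = 0usize;
    let mut prev_d: Option<usize> = None; // assumption: no valid distance at block start

    for c in cmds {
        // Bounds-check the literal run and the output budget before writing.
        let lit_end = lit_pos.checked_add(c.l).ok_or(LzError::Corrupt)?;
        let lits = literals.get(lit_pos..lit_end).ok_or(LzError::Corrupt)?;
        let total = out
            .len()
            .checked_add(c.l)
            .and_then(|n| n.checked_add(c.m))
            .ok_or(LzError::Corrupt)?;
        if total > n_raw_bytes {
            return Err(LzError::Corrupt);
        }
        out.extend_from_slice(lits);
        lit_pos = lit_end;

        // D == 0 means "reuse the previous distance".
        if c.d != 0 {
            prev_d = Some(c.d);
        }
        if c.m > 0 {
            let d = prev_d.ok_or(LzError::Corrupt)?;
            if d > out.len() {
                return Err(LzError::Corrupt);
            }
            // Byte-wise copy so overlapping matches (d < m) replicate correctly.
            for _ in 0..c.m {
                let b = out[out.len() - d];
                out.push(b);
            }
        }
    }
    // Enforce the declared raw size exactly.
    if out.len() != n_raw_bytes {
        return Err(LzError::Corrupt);
    }
    Ok(out)
}
```

Every failure path returns `Corrupt` instead of indexing out of bounds, which mirrors the no-panic requirement stated above.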
The bvx2 FSE decode tables previously assigned a single bit-width per symbol, which is only exact for power-of-two frequencies, so the decoder could not handle real Apple LZFSE v2 streams (general frequencies) and round-tripped only against an encoder restricted to dyadic tables.

Implement Apple's general `fse_init_decoder_table` algorithm in both `build_literal_decoder` and `build_lmd_decoder`: each symbol's `f` spread slots split at `j0 = (2*n_states >> k) - f` into a k-bit prefix and a (k-1)-bit suffix, with `k = L - floor(log2(f))`. When `f` is a power of two `j0 == f` and it degenerates to the old single-k form. The per-symbol power-of-two restriction is gone; only the table SIZE (2^L) is still required to be a power of two.

The test-only v2 encoder now uses standard quantized (nearest) normalization producing general non-power-of-two frequencies, with encode slots that exactly invert the k/k-1 decode table (a slot in the i<j0 region consumes k bits, else k-1).

New coverage that a regression to single-k would fail:

- fse.rs: per-symbol bijection/tiling checks for non-dyadic literal and LMD frequency tables (and a dyadic-still-works check).
- lzfse_v2.rs: round-trips over deliberately skewed non-dyadic literal distributions and small/singleton match-count histograms.
- A hand-frozen non-dyadic bvx2 block (literal freqs 1000/24) decoded through the public streaming decoder to a known string, independent of the encoder; this guards against encoder+decoder sharing the same bug.

Module docs updated to drop the power-of-two limitation and state that FSE table construction now matches Apple's general algorithm (Apple-stream interop remains best-effort: still no reference fixture in this env, but the real table-construction algorithm is now followed).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
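The k/k-1 split can be checked on paper with a small sketch. The `k` and `j0` formulas come from the commit message; the `delta` (base state) expressions are an assumption about how the reference algorithm derives the next-state base, and `Slot`, `symbol_slots`, and `tiles_state_space` are hypothetical names, not the crate's `build_literal_decoder` or `build_lmd_decoder`.

```rust
/// One decode-table slot: read `bits` bits and add them to `delta`
/// to obtain the next state.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Slot {
    bits: u32,
    delta: u32,
}

/// Hypothetical sketch: the `f` slots of one symbol with frequency `f`
/// in a table of `n_states = 2^L` states.
fn symbol_slots(f: u32, n_states: u32) -> Vec<Slot> {
    assert!(n_states.is_power_of_two() && f >= 1 && f <= n_states);
    // k = L - floor(log2(f)), computed from leading-zero counts.
    let k = f.leading_zeros() - n_states.leading_zeros();
    let j0 = ((2 * n_states) >> k) - f;
    (0..f)
        .map(|j| {
            if j < j0 {
                // k-bit region (assumed base formula).
                Slot { bits: k, delta: ((f + j) << k) - n_states }
            } else {
                // (k-1)-bit region (assumed base formula).
                Slot { bits: k - 1, delta: (j - j0) << (k - 1) }
            }
        })
        .collect()
}

/// True when the next-state ranges of all slots cover 0..n_states exactly once.
fn tiles_state_space(f: u32, n_states: u32) -> bool {
    let mut covered = vec![false; n_states as usize];
    for slot in symbol_slots(f, n_states) {
        for state in slot.delta..slot.delta + (1u32 << slot.bits) {
            let i = state as usize;
            if i >= covered.len() || covered[i] {
                return false;
            }
            covered[i] = true;
        }
    }
    covered.into_iter().all(|c| c)
}
```

For the frozen block's literal frequencies, assuming a 1024-state literal table: `f = 24` gives `k = 6` and `j0 = 8` (8 slots of 6 bits, 16 slots of 5 bits), and `f = 1000` gives `k = 1` and `j0 = 24` (24 one-bit slots, 976 zero-bit slots). Both sum to 1024 next states, which a single bit-width per symbol cannot achieve for these frequencies.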
…hange)

Investigated implementing Lizard's Huffman entropy stage. Confirmed from the reference source that Lizard frames each Huffman sub-stream as a 6-byte header (3-byte LE regen size + 3-byte LE compressed size) and decodes it with the generic Huff0 HUF_decompress.

Decision: keep Unsupported. The generic HUF_decompress selects between X1 (single-symbol) and X2 (double-symbol) decode tables via HUF_selectDecoder, and that choice is recomputed from the regen/compressed sizes -- it is never stored in the stream. The crate's Huff0 decoder (src/zstd/huffman.rs) is X1-only and private to the zstd module; it has no X2 decoder and no size-driven selector. With no lizard CLI and no Huff0 fixtures available, the only possible "test" is a round-trip against a hand-written X1-only encoder, which would always pick X1 and validate nothing about real (possibly X2) blocks. Per the crate's lzham/sit13 policy, an unvalidatable decoder is worse than an honest Unsupported.

Doc-only: expand the rationale in block.rs and mod.rs (Huff0 framing, X1/X2 selector blocker, and a concrete reuse path for a future round); also clarify the LIZv1 out-of-scope note. No behavior change; the existing rejects_huffman_flag / rejects_lizv1_mode tests still hold.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
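For the record, the 6-byte sub-stream framing described above is simple to parse even though the payload itself stays undecoded. The following is a sketch under the layout stated in this commit (3-byte little-endian regenerated size, then 3-byte little-endian compressed size); `HufSubStream` and `split_huf_substream` are hypothetical names and are not part of the crate.

```rust
/// Hypothetical view of one framed Huffman sub-stream.
struct HufSubStream<'a> {
    /// Number of bytes the payload would regenerate once decoded.
    regen_size: usize,
    /// The still-compressed Huff0 payload (not decoded here).
    payload: &'a [u8],
}

/// Read a 3-byte little-endian integer; the caller guarantees `b.len() >= 3`.
fn read_u24_le(b: &[u8]) -> usize {
    (b[0] as usize) | ((b[1] as usize) << 8) | ((b[2] as usize) << 16)
}

/// Split one framed sub-stream off the front of `input`, returning it together
/// with the remaining bytes, or `None` if the input is truncated.
fn split_huf_substream(input: &[u8]) -> Option<(HufSubStream<'_>, &[u8])> {
    let header = input.get(..6)?;
    let regen_size = read_u24_le(&header[0..3]);
    let comp_size = read_u24_le(&header[3..6]);
    let payload = input.get(6..6 + comp_size)?;
    Some((HufSubStream { regen_size, payload }, &input[6 + comp_size..]))
}
```

The two sizes recovered here are exactly the inputs the X1/X2 selector would need, which is why the framing alone does not remove the blocker.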
Completes the achievable `Error::Unsupported` stubs flagged in the codec-gap review. Two are now real, validated implementations; the third is left honest with a precise rationale (it can't be validated in this environment).

✅ Raw LZMA2 encoder (`lzma2`)

`compcol::lzma2::Lzma2` now encodes as well as decodes. It emits the raw 7-Zip LZMA2 chunk stream: full dict/props/state reset per chunk (so each chunk is independently decodable), an uncompressed-chunk fallback when compression would expand, and the `0x00` end marker. It reuses the existing xz LZMA2 chunk codec (`encode_lzma_chunk`). The dictionary size is out of band (the 7z coder property); the encoder uses the 4 MiB default so a default-config decoder round-trips.

Validation: round-trip (empty / small / multi-chunk / incompressible-forcing-uncompressed-fallback / zeros, bulk + byte-at-a-time) and cross-decode of the raw output through the shared xz LZMA2 codec.

✅ LZFSE `bvx2` decoder (`lzfse`)

The core LZFSE v2 block type (LZ77 + Finite State Entropy) now decodes: full v2 header parse, 4-way interleaved literal FSE, three interleaved L/M/D FSE streams (reverse bitstreams), and LZ reconstruction with the repeat-distance rule. Critically, the FSE table builder was upgraded from a power-of-two-only simplification to Apple's general `fse_init_decoder_table` (`k`/`k-1` split), so it handles the arbitrary frequency tables real `.lzfse` files use, not just its own encoder's output.

Validation: round-trip against an in-crate v2 encoder using general (non-dyadic) frequency normalization, a 1200-case fuzz loop, malformed/truncated rejection, and a frozen hand-written non-dyadic `bvx2` block that decodes to a known string without invoking the encoder (guards against shared encoder/decoder bugs). No Apple `lzfse` tool is available in this environment, so real-stream interop is best-effort but follows the documented format precisely. `bvx1` stays `Unsupported`.

📝 lz5 (Lizard) Huffman: left `Unsupported`, now documented

Investigated reusing zstd's Huff0 decoder. Blockers: zstd's decoder is X1-only and module-private, while Lizard's `HUF_decompress` selects X1/X2 from `(regenSize, comprLen)` at runtime (never stored in the stream), so a conformant decoder needs both variants plus the `HUF_selectDecoder` heuristic. With no `lizard` tool, no fixtures, and a store-only lz5 encoder, it is unvalidatable here, so shipping a decoder would be the self-validating fiction the crate avoids (cf. `lzham` / `sit13`). Left honest, with the concrete reuse path recorded in the module docs for a future round. Doc-only change.

Checks

- `cargo test --all-features`: 61 suites green, 0 failures (incl. the expanded `tests/lzma2.rs` and `tests/lzfse.rs`).
- `cargo fmt --check`, `cargo clippy --all-features --all-targets -D warnings`, rustdoc `-D warnings`: clean.
- `lzma2` / `lzfse` / `lz5` each build standalone under `--no-default-features`.

🤖 Generated with Claude Code