31 changes: 31 additions & 0 deletions CHANGELOG.md
@@ -7,6 +7,37 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

### Added

- **Raw LZMA2 encoder** (`lzma2`): `compcol::lzma2::Lzma2` now encodes as well
as decodes — it emits the raw 7-Zip LZMA2 chunk stream (full dict/props/state
reset per chunk, uncompressed-chunk fallback when compression would expand,
`0x00` end marker), reusing the xz LZMA2 chunk codec. The dictionary size is
out of band (the 7z coder property); the encoder uses the 4 MiB default so a
default-config decoder round-trips. Validated by round-trip and by decoding
the output through the shared xz LZMA2 codec.
- **LZFSE `bvx2` decoding** (`lzfse`): the core LZFSE v2 block type (LZ77 +
Finite State Entropy) now decodes — full v2 header parse, 4-way interleaved
literal FSE, three interleaved L/M/D FSE streams (reverse bitstreams), and LZ
reconstruction. The FSE table construction matches Apple's general
`fse_init_decoder_table` (the `k`/`k-1` split), so arbitrary frequency tables
are handled, not just power-of-two ones. Validated by round-trip against an
in-crate v2 encoder plus a frozen hand-written non-dyadic vector. No Apple
`lzfse` tool is available in the build environment, so interop with real
streams is unverified, although the decoder follows the documented format.
`bvx1` (v1) remains
`Unsupported`.
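
To make the `k`/`k-1` split in the LZFSE entry concrete, here is a minimal sketch of an FSE decoder-table build modeled on my reading of Apple's `fse_init_decoder_table`. The `FseEntry` layout and the name `build_fse_decoder_table` are illustrative assumptions, not the crate's actual types; the extra guards on `nstates` and the alphabet size are also mine.

```rust
/// Hypothetical decoder entry: read `k` bits, then add `delta` to get the next state.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct FseEntry {
    k: u8,
    symbol: u8,
    delta: i16,
}

/// Sketch of a general (non-dyadic) FSE decoder-table build.
/// Returns `None` when the frequencies cannot fit in `nstates` states.
fn build_fse_decoder_table(nstates: u32, freq: &[u16]) -> Option<Vec<FseEntry>> {
    // Assumed guards: power-of-two state count small enough for an i16 delta,
    // and an alphabet that fits in a u8 symbol.
    if !nstates.is_power_of_two() || nstates > (1 << 15) || freq.len() > 256 {
        return None;
    }
    let n_clz = nstates.leading_zeros();
    let mut table = Vec::with_capacity(nstates as usize);
    let mut sum = 0u32;
    for (sym, &f) in freq.iter().enumerate() {
        let f = u32::from(f);
        if f == 0 {
            continue;
        }
        sum += f;
        if sum > nstates {
            return None; // over-subscribed frequency table
        }
        // k is the shift that satisfies nstates <= (f << k) < 2 * nstates.
        let k = f.leading_zeros() - n_clz;
        // The first j0 states of this symbol read k bits; the rest read k - 1.
        let j0 = ((2 * nstates) >> k) - f;
        for j in 0..f {
            let (bits, delta) = if j < j0 {
                (k, ((f + j) << k) - nstates)
            } else {
                (k - 1, (j - j0) << (k - 1))
            };
            table.push(FseEntry {
                k: bits as u8,
                symbol: sym as u8,
                delta: delta as i16,
            });
        }
    }
    Some(table)
}
```

With `nstates = 8` and frequencies `[3, 5]`, symbol 0 gets one 2-bit entry and two 1-bit entries, and symbol 1 gets three 1-bit entries and two 0-bit entries. In both cases the reachable next states cover `0..8` exactly, which is the property a power-of-two-only table builder would miss.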

### Changed

- **lz5 (Lizard) Huffman sub-streams** stay `Unsupported`, now with a precise
rationale in the module docs: the Huff0 entropy stage selects X1/X2 from
`(regenSize, comprLen)` at runtime and there is no reference encoder or
fixture available to validate a decoder bit-exactly, so — consistent with the
crate's `lzham`/`sit13` policy — it is left honest rather than shipped blind.
The docs record the concrete reuse path (zstd's X1 Huff0 decoder + an X2
decoder + the `HUF_selectDecoder` heuristic) for a future round with fixtures.


### Added

- **HTTP/3 QPACK header compression** (RFC 9204) behind the new `qpack`
6 changes: 3 additions & 3 deletions README.md
@@ -47,14 +47,14 @@ flag, and a `compcol` binary turns the library into a Unix-style filter.
| LZW (`compress(1)` `.Z`) | `lzw` | `.lzw` | full | full | `compress(1)` / `uncompress(1)` |
| LZMA (legacy `.lzma`) | `lzma` | `.lzma` | full | full | `python3 -m lzma` (FORMAT_ALONE) |
| xz | `xz` | `.xz` | compressed-LZMA2 chunks + uncompressed fallback | full envelope + all reset variants | `xz(1)` both directions |
| Raw LZMA2 (7z coder 21) | `lzma2` | `.lzma2` | `Unsupported` (decode-only) | full (raw LZMA2 chunk stream; reuses the xz LZMA2 engine) | round-trip vs the xz LZMA2 encoder |
| Raw LZMA2 (7z coder 21) | `lzma2` | `.lzma2` | full (raw LZMA2 chunk stream; reuses the xz LZMA2 engine) | full (raw LZMA2 chunk stream; reuses the xz LZMA2 engine) | round-trip + cross-decode via the shared xz LZMA2 codec |
| Zstandard (RFC 8478) | `zstd` | `.zst` | LZ77 + Huffman literals + FSE_Compressed_Mode sequences + repeat offsets + RLE blocks | full Compressed_Block | `zstd(1)` both directions |
| Brotli (RFC 7932) | `brotli` | `.br` | LZ77 + length-limited Huffman + 704-symbol IC alphabet + static-dictionary refs | full (with 122 KiB static dictionary) | `brotli(1)` both directions |
| LZO (LZO1X-1) | `lzo` | `.lzo` | LZ77 hash matcher | full | `python3 -c "import lzo"` |
| LZX (Microsoft CAB / WIM) | `lzx` | `.lzx` | uncompressed blocks only | full (verbatim + aligned-offset + uncompressed; E8 filter) | — |
| Amiga LZX (original 1995 Forbes) | `amiga_lzx` | — (`.lzx` claimed by MS LZX) | uncompressed blocks only | full (verbatim + aligned + uncompressed; fixed 64 KiB window, no chunk reset, no E8 filter) | — |
| Quantum (Stac, old CAB) | `quantum` | `.q` | `Unsupported` (no public encoder exists) | full (libmspack-equivalent) | libmspack regression fixtures |
| LZFSE (Apple) | `lzfse` | `.lzfse` | `Unsupported` (decoder-only) | `bvx-` raw + `bvxn` (LZVN); `bvx2` returns `Unsupported` | hand-built fixtures (no Apple toolchain bundled) |
| LZFSE (Apple) | `lzfse` | `.lzfse` | `Unsupported` (decoder-only) | `bvx-` raw + `bvxn` (LZVN) + `bvx2` (LZ77 + FSE); `bvx1` returns `Unsupported` | round-trip (bvx2 vs own FSE encoder; no Apple toolchain bundled) |
| ADC (Apple DMG) | `adc` | `.adc` | LZSS-style greedy match-finder | full | hand-built fixtures |
| bzip2 | `bzip2` | `.bz2` | full (RLE-1 + SA-IS BWT + MTF + RLE-2 + dynamic Huffman) | full | `bzip2(1)` both directions |
| PPMd (Shkarin's PPMII variant H) | `ppmd` | `.ppmd` | `Unsupported` (decoder-only; PPM model is intricate) | full (used in 7z / RAR3+ / ZIP method 98) | `python3 ppmd-cffi` |
@@ -427,7 +427,7 @@ lzw = ["alloc"]
lzo = ["alloc"]
lzx = ["alloc"]
quantum = ["alloc"]
lzfse = ["alloc"] # decoder-only, bvx2 returns Unsupported
lzfse = ["alloc"] # decoder-only; bvx-/bvxn/bvx2, bvx1 Unsupported
adc = ["alloc"]
rar1 = ["alloc"]
rar2 = ["alloc"]
53 changes: 51 additions & 2 deletions src/lz5/block.rs
@@ -10,6 +10,19 @@
//! Only the LZ4-codeword sequence loop (levels 10..=19, 30..=39) with
//! all sub-streams stored raw (no Huffman entropy stage) is
//! implemented; everything else returns [`Error::Unsupported`].
//!
//! Two paths stay `Unsupported` for documented reasons (see the inline
//! comments at the `huffman_bits` and LIZv1 rejections below):
//!
//! * **Huff0 entropy stage** (any sub-stream flag bit set): Lizard's
//! generic `HUF_decompress` recomputes an X1-vs-X2 decoder choice
//! that is never carried in the stream; the crate has only an X1
//! Huff0 decoder (private to `zstd`), and there is no `lizard` CLI
//! or fixture here to validate an X2 decoder against. A round-trip
//! against our own X1-only encoder would prove nothing.
//! * **LIZv1 codewords** (levels 20..=29, 40..=49): a separate, larger
//! sequence format, out of scope for this round.

use alloc::vec::Vec;

@@ -61,6 +74,12 @@ pub fn decode_compressed_block(input: &[u8], out: &mut Vec<u8>, cap: usize) -> R
// Lizard groups levels by decompression strategy:
// 10..=19, 30..=39 → LZ4 codewords (this build supports)
// 20..=29, 40..=49 → LIZv1 codewords (not supported)
//
// LIZv1 is a distinct, larger sequence format (`Lizard_decompress_LIZv1`
// vs `Lizard_decompress_LZ4` in the reference): different token layout,
// explicit `lengths`/`offset16`/`offset24` streams, and a 24-bit offset
// path. Implementing it is a separate effort from the Huffman stage and
// is out of scope for this round, so it stays `Unsupported`.
let is_lz4_mode = matches!(clevel, 10..=19 | 30..=39);
if !is_lz4_mode {
return Err(Error::Unsupported);
@@ -96,8 +115,38 @@ pub fn decode_compressed_block(input: &[u8], out: &mut Vec<u8>, cap: usize) -> R
if res & FLAG_LEN != 0 {
return Err(Error::Corrupt);
}
// Any Huffman bit set on a sub-stream means we'd need to FSE-Huffman
// decode that stream. Out of scope.
// Any Huffman bit set on a sub-stream means the stream is entropy-coded
// with Huff0 (Yann Collet's FiniteStateEntropy library) and must be
// `HUF_decompress`'d before the sequence loop runs. Each such sub-stream
// is framed as a 6-byte header (3-byte LE regenerated size + 3-byte LE
// compressed size) followed by `compressed_size` bytes of Huff0 payload
// (`Lizard_readStream` → `HUF_decompress(op, regenSize, ip + 6, comprLen)`).
//
// This stays `Unsupported`. The decision is deliberate, not a TODO —
// there is no faithful way to *validate* such a decoder in this
// environment, and the crate's policy (see `lzham`, `sit13`) is to mark
// formats we cannot validate bit-exactly as `Unsupported` rather than
// ship a blind decoder. Concretely:
//
// * The crate already has a Huff0 decoder in `src/zstd/huffman.rs`, but
// it is (a) private to the `zstd` module (`mod huffman;`, not
// reachable from here without re-exporting it) and (b) implements
// only the **X1** (single-symbol) decode table that zstd's *literals*
// spec restricts itself to.
// * Lizard calls the *generic* `HUF_decompress`, which selects **X1 or
// X2** (double-symbol) at runtime via `HUF_selectDecoder`. That
// choice is **recomputed from (regenSize, comprLen)** and is **never
// stored in the stream**, so a conformant decoder must implement both
// X1 and X2 *and* reproduce `HUF_selectDecoder`'s timing heuristic
// exactly. The crate has no X2 decoder anywhere. (The 4-stream jump
// table — three LE u16 sizes — does match zstd's literals framing, so
// that part would be reusable; the X1/X2 split is the blocker.)
// * The lz5 encoder here is store-only, and there is no `lizard` CLI or
// Huff0 fixture in this environment. A round-trip against a
// hand-written X1-only encoder would always select X1 and "pass"
// while proving nothing about a real (possibly X2) Lizard block — a
// self-validating fiction. Absent a real fixture or reference
// encoder there is no honest round-trip, so we do not ship.
let huffman_bits = res & (FLAG_LITERALS | FLAG_FLAGS | FLAG_OFFSET16 | FLAG_OFFSET24);
if huffman_bits != 0 {
return Err(Error::Unsupported);
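
The 6-byte sub-stream framing described in the comment above (3-byte LE regenerated size, 3-byte LE compressed size, then the Huff0 payload) can be sketched independently of any Huffman decoding. The names `HuffSubStream` and `split_huff_substream` below are hypothetical and exist only for illustration. The sketch just splits the bytes; it decodes nothing, so it does not touch the X1/X2 question.

```rust
/// Hypothetical view of one Huff0-framed Lizard sub-stream.
#[derive(Debug, PartialEq, Eq)]
struct HuffSubStream<'a> {
    /// Size the payload should regenerate to after Huff0 decoding.
    regen_size: usize,
    /// The still-compressed Huff0 payload (`compressed size` bytes).
    payload: &'a [u8],
    /// Whatever follows this sub-stream in the block.
    rest: &'a [u8],
}

/// Read a 3-byte little-endian integer; the caller guarantees `b.len() >= 3`.
fn read_le24(b: &[u8]) -> usize {
    usize::from(b[0]) | (usize::from(b[1]) << 8) | (usize::from(b[2]) << 16)
}

/// Split off one framed sub-stream, or return `None` if the input is truncated.
fn split_huff_substream(input: &[u8]) -> Option<HuffSubStream<'_>> {
    if input.len() < 6 {
        return None;
    }
    let regen_size = read_le24(&input[0..3]);
    let compr_len = read_le24(&input[3..6]);
    let end = 6usize.checked_add(compr_len)?;
    if end > input.len() {
        return None;
    }
    Some(HuffSubStream {
        regen_size,
        payload: &input[6..end],
        rest: &input[end..],
    })
}
```

A parser like this would be the reusable part of a future Huffman path. The `(regen_size, payload.len())` pair it yields is exactly the input the comment says `HUF_selectDecoder` recomputes its choice from.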
31 changes: 28 additions & 3 deletions src/lz5/mod.rs
@@ -29,9 +29,34 @@
//! **Decoder**: implemented for the **LZ4 codeword path with all
//! sub-streams stored raw** (the most common shape produced by the
//! reference CLI at levels 10..=19 on non-tiny inputs). Frames whose
//! blocks use the LIZv1 sequence format (levels 20..=29) or any
//! Huffman-coded sub-stream (levels 30+) are rejected with
//! [`Error::Unsupported`]. The frame-level uncompressed block path
//! blocks use the LIZv1 sequence format (levels 20..=29, 40..=49) or any
//! Huffman-coded sub-stream are rejected with [`Error::Unsupported`].
//!
//! The Huffman path stays `Unsupported` for a concrete, validation-first
//! reason rather than mere absence of effort. Lizard's entropy stage is
//! Huff0 (`HUF_decompress` from Yann Collet's FiniteStateEntropy), the
//! same family as zstd's literals Huffman, and each Huffman sub-stream is
//! framed as a 6-byte header (3-byte LE regenerated size + 3-byte LE
//! compressed size) then the Huff0 payload. But the *generic*
//! `HUF_decompress` Lizard calls selects between **X1** (single-symbol)
//! and **X2** (double-symbol) decode tables via `HUF_selectDecoder`, and
//! that choice is **recomputed from the regenerated/compressed sizes,
//! never stored in the stream**. This crate's Huff0 decoder
//! (`src/zstd/huffman.rs`) is X1-only and is private to the `zstd`
//! module; it covers neither X2 nor the size-driven selector. With no
//! `lizard` CLI and no Huff0 fixtures in this environment, the only
//! "test" available would be a round-trip against a hand-written
//! X1-only encoder, which would always pick X1 and therefore validate
//! nothing about real (possibly X2) blocks. Per the crate's
//! `lzham`/`sit13` policy, an unvalidatable decoder is worse than an
//! honest `Unsupported`, so we do not ship one.
//!
//! A future round could lift this once validation is possible: expose
//! zstd's X1 Huff0 decoder as `pub(crate)`, add an X2 decoder plus the
//! `HUF_selectDecoder` heuristic, and validate against fixtures from the
//! `lizard` CLI (e.g. `lizard -30`). The 6-byte sub-stream header and the
//! 4-stream jump table (three LE u16 sizes) already match formats this
//! crate parses elsewhere. The frame-level uncompressed block path
//! (high bit on block-size word) is handled fully, so frames where
//! every block is stored raw decode without ever exercising the sequence
//! loop. Block checksums (FLG bit 4) and external dictionaries are
91 changes: 76 additions & 15 deletions src/lzfse/decoder.rs
@@ -59,10 +59,12 @@ enum State {
enum BlockKind {
Uncompressed,
Lzvn,
/// `bvx2` returns Unsupported once we've parsed its header far enough
/// to know we hit it; this variant exists so the state machine can
/// surface that decision uniformly with the other block kinds.
/// `bvx2` (LZFSE v2): FSE + LZ77. Decoded by [`lzfse_v2::decode_block`]
/// once the whole block (variable-length header + both payload streams)
/// is buffered.
V2,
/// `bvx1` (LZFSE v1, uncompressed-freq variant): not emitted by modern
/// encoders; returns [`Error::Unsupported`].
V1,
}

@@ -216,23 +218,56 @@ impl Decoder {
};
}
BlockKind::V2 => {
// We don't decode v2 in this build, but we need to
// skip past the block cleanly so callers don't
// confuse "block we can't decode" with "garbage".
// Parse the n_payload_bytes field from the header.
if self.input_buf.len() < lzfse_v2::V2_HEADER_FIXED_BYTES {
// The v2 header is variable-length (FSE frequency
// tables follow the fixed packed fields). Buffer the
// fixed 28 bytes (post-magic: n_raw + three u64 words)
// first so we can read `header_size` and the payload
// sizes, then arrange to buffer the whole block (header
// + payload) before decoding it in one shot.
let fixed = lzfse_v2::V2_HEADER_FIXED_BYTES;
if self.input_buf.len() < fixed {
return Ok(RawProgress {
consumed,
written,
done: false,
});
}
// We *could* skip past the v2 block, but the spec is
// explicit that the encoder may mix block types
// freely. Returning Unsupported here is the
// documented behaviour for v2 in this build.
self.poisoned = true;
return Err(Error::Unsupported);
let header_size = match lzfse_v2::parse_header_size(&self.input_buf) {
Ok(h) => h as usize,
Err(e) => {
self.poisoned = true;
return Err(e);
}
};
let n_payload = match lzfse_v2::parse_payload_size(&self.input_buf) {
Ok(n) => n as usize,
Err(e) => {
self.poisoned = true;
return Err(e);
}
};
// `header_size` includes the 4-byte magic we already
// dropped; remaining block bytes after the magic are
// `header_size - 4 + n_payload`.
let header_len = match header_size.checked_sub(4) {
Some(h) if h >= fixed => h,
_ => {
self.poisoned = true;
return Err(Error::Corrupt);
}
};
let block_len = match header_len.checked_add(n_payload) {
Some(b) => b,
None => {
self.poisoned = true;
return Err(Error::Corrupt);
}
};
self.state = State::AwaitPayload {
kind: BlockKind::V2,
payload_len: block_len,
decoded_size: 0,
};
}
BlockKind::V1 => {
self.poisoned = true;
@@ -287,7 +322,33 @@ impl Decoder {
self.input_buf.drain(..payload_len);
self.state = State::AwaitMagic;
}
BlockKind::V2 | BlockKind::V1 => {
BlockKind::V2 => {
// The whole block (header + both payload streams)
// now occupies the first `payload_len` bytes of
// `input_buf`. Decode it in one shot. `cap_hint` is a
// payload-derived bound on the up-front output
// reservation (an FSE block can expand more than
// LZVN, but is still bounded); the decoder enforces
// the exact `n_raw_bytes` internally.
let cap_hint = payload_len.saturating_mul(32).saturating_add(1 << 16);
let mut block_out = Vec::new();
match lzfse_v2::decode_block(
&self.input_buf[..payload_len],
&mut block_out,
cap_hint,
) {
Ok(consumed_block) => {
debug_assert_eq!(consumed_block, payload_len);
}
Err(e) => {
self.poisoned = true;
return Err(e);
}
}
self.output_buf.append(&mut block_out);
self.input_buf.drain(..payload_len);
self.state = State::AwaitMagic;
}
BlockKind::V1 => {
// Unreachable — header step would have errored.
self.poisoned = true;
return Err(Error::Unsupported);
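
The block-length arithmetic in the `BlockKind::V2` header step (`header_size - 4 + n_payload`, with both overflow checks) can be exercised in isolation. The sketch below is a standalone restatement under assumptions: `v2_block_len` and `MAGIC_BYTES` are hypothetical names, and the local `V2_HEADER_FIXED_BYTES` stands in for `lzfse_v2::V2_HEADER_FIXED_BYTES`, using the value 28 taken from the comment in the diff.

```rust
/// Stand-in for `lzfse_v2::V2_HEADER_FIXED_BYTES`: the post-magic fixed fields,
/// i.e. `n_raw_bytes` (u32) plus three packed u64 words.
const V2_HEADER_FIXED_BYTES: usize = 28;
/// The block magic that the decoder has already consumed.
const MAGIC_BYTES: usize = 4;

/// Bytes still to buffer after the magic for one `bvx2` block, or `None`
/// when the header fields are inconsistent (the diff maps this to `Corrupt`).
fn v2_block_len(header_size: usize, n_payload: usize) -> Option<usize> {
    // `header_size` counts the magic, so strip it first.
    let header_len = header_size.checked_sub(MAGIC_BYTES)?;
    // A header shorter than its own fixed fields cannot be valid.
    if header_len < V2_HEADER_FIXED_BYTES {
        return None;
    }
    // Header plus both payload streams; reject on overflow.
    header_len.checked_add(n_payload)
}
```

For example, the smallest acceptable `header_size` is 32 (4 magic bytes plus the 28 fixed bytes), so `v2_block_len(32, 100)` yields `Some(128)` while `v2_block_len(31, 100)` is rejected.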