Summary
The C reference can split blocks to use different entropy tables for different sections of data. This improves compression ratio when data characteristics change within a block (e.g., text followed by binary data).
C reference implementation
Superblock (zstd_compress_superblock.c)
- Analyzes symbol distribution within a block
- Splits into sub-blocks with independently optimized entropy tables
- Each sub-block gets its own Huffman/FSE tables tuned to its content
Pre-split (zstd_preSplit.c)
- Pre-splitting for parallelization and better entropy adaptation
- Detects content boundaries before compression
Literal compression decisions (zstd_compress_literals.c)
- RLE detection: allBytesIdentical() for repeated bytes
- Min threshold: btultra2 ≥6 bytes, dfast ≥256 bytes
- Stream count: <256 bytes → single stream, ≥256 → 4 streams
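A minimal Rust sketch of two of the decisions listed above, the RLE check and the stream-count choice. The names all_bytes_identical, StreamCount, and FOUR_STREAM_MIN are hypothetical and do not exist in the current Rust code. Returning false for empty input is an assumption. The strategy-dependent minimum size is left out.

```rust
/// Hypothetical Rust counterpart of the C helper `allBytesIdentical()`.
/// Returns true when every byte equals the first one.
/// Assumption: an empty slice is reported as "not RLE".
fn all_bytes_identical(src: &[u8]) -> bool {
    match src.split_first() {
        Some((first, rest)) => rest.iter().all(|b| b == first),
        None => false,
    }
}

/// Number of Huffman streams used for a literals section.
#[derive(Debug, PartialEq, Eq)]
enum StreamCount {
    Single,
    Four,
}

/// Threshold taken from the list above: below 256 bytes use a single stream.
const FOUR_STREAM_MIN: usize = 256;

/// Picks the stream count from the literals length alone.
fn stream_count(lit_len: usize) -> StreamCount {
    if lit_len < FOUR_STREAM_MIN {
        StreamCount::Single
    } else {
        StreamCount::Four
    }
}
```

An encoder would likely run the RLE check first, since a block of identical bytes needs no entropy table at all.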
Current Rust state
- Single block = single set of entropy tables
- No content-aware splitting
- No superblock optimization
What needs to be implemented
- Content analysis — detect entropy changes within block
- Split point detection — find optimal boundaries for table changes
- Per-sub-block encoding — independent Huffman/FSE tables per section
- Threshold tuning — when to split vs. keep unified
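One possible starting point for the four items above, sketched in Rust. It is not the algorithm used by the C reference. It estimates the Shannon cost of coding a block with one table versus two, charges a fixed per-table overhead, and keeps the split only if it is cheaper. The function names, the step parameter, and TABLE_OVERHEAD_BITS (512 is a placeholder, not a measured value) are all assumptions for illustration.

```rust
/// Counts occurrences of each byte value.
fn histogram(data: &[u8]) -> [u32; 256] {
    let mut counts = [0u32; 256];
    for &b in data {
        counts[b as usize] += 1;
    }
    counts
}

/// Estimated cost in bits of coding `data` with a table fitted to it
/// (zero-order Shannon entropy times length).
fn entropy_bits(data: &[u8]) -> f64 {
    if data.is_empty() {
        return 0.0;
    }
    let n = data.len() as f64;
    let hist = histogram(data);
    hist.iter()
        .filter(|&&c| c > 0)
        .map(|&c| {
            let c = c as f64;
            -c * (c / n).log2()
        })
        .sum()
}

/// Hypothetical flat cost of emitting one entropy table, in bits.
const TABLE_OVERHEAD_BITS: f64 = 512.0;

/// Tries split points at multiples of `step` and returns the cheapest one,
/// or None when a single table is estimated to be at least as cheap.
fn best_split(block: &[u8], step: usize) -> Option<usize> {
    let step = step.max(1);
    let unified = entropy_bits(block) + TABLE_OVERHEAD_BITS;
    let mut best: Option<(usize, f64)> = None;
    let mut pos = step;
    while pos < block.len() {
        let cost = entropy_bits(&block[..pos])
            + entropy_bits(&block[pos..])
            + 2.0 * TABLE_OVERHEAD_BITS;
        if cost < unified && best.map_or(true, |(_, c)| cost < c) {
            best = Some((pos, cost));
        }
        pos += step;
    }
    best.map(|(p, _)| p)
}
```

This rescans the block for every candidate, so it is quadratic in the number of candidates. A real implementation would probably maintain running histograms and replace the flat overhead with the actual serialized table size.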
Acceptance criteria
Dependencies
Time estimate
3d
Blocked by