Execution order (dependency-aware)
Last updated: 2026-04-12
Phase 1: Correctness & Core (P0-critical) — DONE ✅
These are blocking bugs and the most-used missing feature.
| Order | Issue | What | Est |
|---|---|---|---|
| 1.1 | ✅ #15 | fix: large literals block panic | 1d |
| 1.2 | ✅ #17 | feat: FSE table reuse + offset history (broken encoding) | 2d |
| 1.3 | ✅ #5 | feat: Default compression level (dfast, level 3) | 3d |
Dependency chain: ✅ #15 and ✅ #17 are independent of each other; both should be done before ✅ #5 (correct encoding foundation).
Phase 2: CoordiNode Critical Path (P1-high) — DONE ✅
Core features needed for the CoordiNode use case.
| Order | Issue | What | Blocked by | Est |
|---|---|---|---|---|
| 2.1 | ✅ #24 | test: benchmark suite | — | 2d |
| 2.2 | ✅ #14 | perf: encoder match interleaving | — | 2d |
| 2.3 | ✅ #12 | perf: sequence execution wildcopy | — | 3d |
| 2.4 | ✅ #8 | feat: dictionary compression | ✅ #17 | 3d |
| 2.5 | ✅ #9 | feat: streaming encoder | — | 2d |
| 2.6 | ✅ #6 | feat: Better level (lazy2, level 7) | ✅ #5 | 4d |
Parallelizable: ✅ #24, ✅ #14, ✅ #12, ✅ #9 are independent. ✅ #8 needs ✅ #17. ✅ #6 needs ✅ #5.
Phase 3: Performance Parity (P2-medium) — DONE ✅
Close the 1.4-3.5x decompression gap with C zstd.
| Order | Issue | What | Blocked by | Est |
|---|---|---|---|---|
| 3.1 | ✅ #10 | perf: Huffman 4-stream parallel | — | 3d |
| 3.2 | ✅ #11 | perf: FSE batched refill for state updates | — | 2d |
| 3.3 | ✅ #13 | perf: bitstream reader | — | 2d |
| 3.4 | ✅ #20 | perf: decoder pre-allocation | — | 4h |
| 3.5 | ✅ #16 | feat: frame content size | — | 1d |
| 3.6 | ✅ #7 | feat: Best level (btlazy2, level 11) | ✅ #6 | 4d |
| 3.7 | ✅ #21 | feat: numeric levels 1-22 | ✅ #5, ✅ #6, ✅ #7 | 1d |
| 3.8 | ✅ #25 | feat: FastCOVER dict builder | — | 3d |
| 3.9 | ✅ #56 | perf: packed FSE Entry layout (4-byte + bulk spread + cache alignment) | ✅ #11 | 2d 4h |
| 3.10 | ✅ #47 | perf: reuse encoded scratch buffer across streaming blocks | — | 1h |
| 3.11 | ✅ #51 | perf: rebase HC table positions (remove 4 GiB cutoff) | — | 2d |
| 3.12 | ✅ #67 | perf(encoding): row-based match finder for fast/dfast levels | — | 3d |
Parallelizable: ✅ #25, ✅ #56, ✅ #47, ✅ #51, ✅ #67 were independent.
Phase 4: SIMD & Hardware Acceleration (P1-high / P2-medium) — DONE ✅
Architecture-specific optimizations to reach C zstd parity. Ordered by impact.
| Order | Issue | What | Blocked by | Est |
|---|---|---|---|---|
| 4.1 | ✅ #68 | perf(decoding): SIMD wildcopy for literal+match memcpy | — | 3d |
| 4.2 | ✅ #69 | perf(decoding): branchless offset history + stride prefetch + BMI2 pext | — | 1d 4h |
| 4.3 | ✅ #66 | perf(decoding): SIMD HUF decode kernels (BMI2/AVX2/VBMI2/NEON) | ✅ #56 (recommended) | 4d |
| 4.4 | ✅ #88 | perf(encoding): eliminate default-level small-input dfast reset/table clear cliff | — | 2d |
| 4.5 | ✅ #70 | perf(encoding): SIMD match-length comparison (SSE2/AVX2/NEON) | — | 1d |
| 4.6 | ✅ #97 | perf(encoding): early incompressible fast-path for fastest/default encode + default decodecorpus dfast parity vs C | — | 2d 4h |
| 4.7 | ✅ #71 | perf: ARM platform optimizations (CRC32 hash, NEON, SVE2 histcnt) | ✅ #68 (NEON wildcopy) | 2d 4h |
Critical path update: ✅ #68 (PR #85, 2026-04-09), ✅ #69 (PR #90, 2026-04-09), ✅ #66 (PR #92, 2026-04-09), ✅ #70 (PR #96, 2026-04-09), ✅ #97 (PR #99, 2026-04-11), ✅ #88 (verified+closed, 2026-04-11), ✅ #71 (PR #104, 2026-04-11), and ✅ #86 (PR #105, 2026-04-12) are completed. Next highest-impact open item is #22.
Parallelizable: Phase 4B item #86 is completed; Phase 5 items remain independently schedulable by dependency order.
Expected combined impact:
Decode throughput: +60-100% (from 1.4-3.5x slower → ~1.0-1.5x slower vs C zstd)
Encode throughput: +15-30% at levels 5+ (SIMD match comparison + ARM CRC32 hash)
Encode throughput (incompressible/random): major latency reduction from early no-compress fast-path (✅ #97)
Encode latency (tiny payloads, default level): remove 100x-class cliff vs C for small-4k-log-lines path
Phase 4B: Dictionary Decode Hot Path (P2-medium) — DONE ✅
Latency follow-up immediately after SIMD/HW phase.
| Order | Issue | What | Blocked by | Est |
|---|---|---|---|---|
| 4B.1 | ✅ #86 | perf(decoding): pre-parsed dictionary handle for repeated dict decode | Phase 4 complete | 1d 4h |
Phase 5: Advanced Features (P3-low)
Full feature parity with C zstd.
| Order | Issue | What | Blocked by | Est |
|---|---|---|---|---|
| 5.1 | #22 | feat: optimal parsing (levels 16-22) | ✅ #7, ✅ #21 | 5d |
| 5.2 | #23 | feat: block splitting | ✅ #5 | 3d |
| 5.3 | #18 | feat: long distance matching | — | 3d |
| 5.4 | #26 | feat: magicless format | — | 4h |
| 5.5 | #27 | feat: configurable parameters API | ✅ #21 | 1d |
| 5.6 | #19 | feat: multi-threaded compression | ✅ #5 | 5d |
| 5.7 | #72 | perf(decoding): parallel block decompression | — | 3d |
Dependency graph
Phase 1-2 (DONE):
✅#15 ──→ ✅#5 ──→ ✅#6 ──→ ✅#7 ──→ ✅#21
✅#17 ──→ ✅#8
✅#5 ──→ #19, #23
✅#7 ──→ #22 (optimal parsing)
✅#21 ──→ #27 (params API), #22 (optimal parsing)
✅#9 ✅#12 ✅#14 ✅#24 ✅#10 ✅#11 ✅#13 ✅#20 ✅#16
Phase 3 (Performance Parity — DONE):
✅#56 (packed FSE) ──→ ✅#66 (SIMD HUF decode)
✅#25 (FastCOVER)
✅#47 (scratch reuse)
✅#51 (HC rebase)
✅#67 (row matcher)
Phase 4 (SIMD & HW Acceleration — DONE):
✅#68 (SIMD wildcopy) ──→ ✅#71 (ARM optimizations, NEON backend)
✅#69 (branchless offset + prefetch + pext)
✅#66 (SIMD HUF kernels, BMI2/AVX2/VBMI2/NEON)
✅#88 (default-level small-input dfast cliff)
✅#70 (SIMD match-length comparison)
✅#97 (early incompressible fast-path + default decodecorpus dfast parity vs C)
✅#86 (pre-parsed dictionary handle for repeated dict decode)
Phase 5 (Advanced Features):
#22 (optimal parsing)
#23 (block splitting)
#18 (LDM)
#26 (magicless)
#27 (params API)
#19 (multi-thread compress)
#72 (parallel block decompress)
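The dependency information above can also be checked mechanically. The following is a minimal sketch, not part of the project tooling: it transcribes the "Blocked by" columns into a dictionary and uses Python's standard `graphlib` module to group issues into batches that could be worked on in parallel. The `BLOCKED_BY` mapping and the `parallel_batches` helper are hypothetical names introduced for illustration. The blockers for #5 are taken from the Phase 1 dependency-chain note, the #66 → #56 edge is listed as "recommended" rather than strict, and issues without blockers are left out.

```python
from graphlib import TopologicalSorter

# Issue number -> set of issues it is blocked by (transcribed from the
# "Blocked by" columns; hypothetical structure, for illustration only).
BLOCKED_BY = {
    5: {15, 17},      # from the Phase 1 dependency-chain note
    8: {17},
    6: {5},
    7: {6},
    21: {5, 6, 7},
    56: {11},
    66: {56},         # listed as "recommended", treated as a hard edge here
    71: {68},
    22: {7, 21},
    23: {5},
    27: {21},
    19: {5},
}


def parallel_batches(blocked_by):
    """Group issues into batches; every issue in a batch has all blockers done."""
    sorter = TopologicalSorter(blocked_by)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


if __name__ == "__main__":
    for step, batch in enumerate(parallel_batches(BLOCKED_BY), start=1):
        print(step, ", ".join(f"#{n}" for n in batch))
```

With this mapping, #22 and #27 land in the last batch because both wait on #21, which matches the chain drawn in the graph.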
Recommended execution order (next actions)
Highest impact per effort — do these next:
#22 (optimal parsing compression: btopt/btultra/btultra2 strategies) — 5d — highest-impact remaining feature-parity item
#23 (block splitting for improved compression ratio) — 3d — follow-up ratio/feature parity improvement
#72 (parallel block decompression for multi-block frames) — 3d — next decode throughput multiplier after parity features
After these 3 items, decode should stay in the ~1.0-1.3x range vs C zstd on literals-heavy paths with additional headroom from parallel block decompression, while remaining encode parity work concentrates in optimal parsing + block splitting. The early incompressible fast-path item remains completed in ✅ #97.
Total estimate
Phase 3 remaining: ~0d
Phase 4 (new): ~0d
Phase 4B: ~0d
Phase 5: ~20d 4h
Total remaining: ~20d 4h (~20-21 working days) for full feature parity
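The Phase 5 total can be reproduced from the per-issue estimates. The sketch below assumes an 8-hour working day, which the roadmap does not state explicitly (it is inferred from "20d 4h" being described as 20-21 working days). `parse_estimate`, `format_estimate`, and `PHASE_5` are hypothetical names used only for this illustration.

```python
import re

# Assumption: one working day is 8 hours; the roadmap does not say so explicitly.
HOURS_PER_DAY = 8


def parse_estimate(text):
    """Convert an estimate such as '2d 4h', '5d' or '4h' into hours."""
    match = re.fullmatch(r"\s*(?:(\d+)d)?\s*(?:(\d+)h)?\s*", text)
    if match is None or not any(match.groups()):
        raise ValueError(f"unrecognized estimate: {text!r}")
    days, hours = (int(g) if g else 0 for g in match.groups())
    return days * HOURS_PER_DAY + hours


def format_estimate(total_hours):
    """Render a number of hours back into the roadmap's 'Nd Mh' notation."""
    days, hours = divmod(total_hours, HOURS_PER_DAY)
    return f"{days}d {hours}h" if hours else f"{days}d"


# Estimates copied from the Phase 5 table.
PHASE_5 = {
    "#22": "5d", "#23": "3d", "#18": "3d", "#26": "4h",
    "#27": "1d", "#19": "5d", "#72": "3d",
}

if __name__ == "__main__":
    total = sum(parse_estimate(e) for e in PHASE_5.values())
    print(format_estimate(total))  # prints 20d 4h
```

This is a serial sum; since every Phase 5 blocker is already done, the calendar time could be shorter if items are worked on in parallel.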
Roadmap tail backlog
#87 (research(decoding): evaluate advanced SIMD wildcopy paths beyond current baseline) — 2d 4h — evaluate post-Phase-4 SIMD candidates and benchmark go/no-go decisions