
Roadmap: structured-zstd feature parity with C zstd #28

@polaz


Execution order (dependency-aware)

Last updated: 2026-04-12

Phase 1: Correctness & Core (P0-critical) — DONE ✅

These are blocking bugs and the most-used missing feature.

| Order | Issue | What | Est |
|-------|-------|------|-----|
| 1.1 | #15 | fix: large literals block panic | 1d |
| 1.2 | #17 | feat: FSE table reuse + offset history (broken encoding) | 2d |
| 1.3 | #5 | feat: Default compression level (dfast, level 3) | 3d |

Dependency chain: ✅ #15 and ✅ #17 are independent, both should be done before ✅ #5 (correct encoding foundation).

Phase 2: CoordiNode Critical Path (P1-high) — DONE ✅

Core features needed for the CoordiNode use case.

| Order | Issue | What | Blocked by | Est |
|-------|-------|------|------------|-----|
| 2.1 | #24 | test: benchmark suite | | 2d |
| 2.2 | #14 | perf: encoder match interleaving | | 2d |
| 2.3 | #12 | perf: sequence execution wildcopy | | 3d |
| 2.4 | #8 | feat: dictionary compression | #17 | 3d |
| 2.5 | #9 | feat: streaming encoder | | 2d |
| 2.6 | #6 | feat: Better level (lazy2, level 7) | #5 | 4d |

Parallelizable: ✅ #24, ✅ #14, ✅ #12, ✅ #9 are independent. ✅ #8 needs ✅ #17. ✅ #6 needs ✅ #5.

Phase 3: Performance Parity (P2-medium) — DONE ✅

Close the 1.4-3.5x decompression gap with C zstd.

| Order | Issue | What | Blocked by | Est |
|-------|-------|------|------------|-----|
| 3.1 | #10 | perf: Huffman 4-stream parallel | | 3d |
| 3.2 | #11 | perf: FSE batched refill for state updates | | 2d |
| 3.3 | #13 | perf: bitstream reader | | 2d |
| 3.4 | #20 | perf: decoder pre-allocation | | 4h |
| 3.5 | #16 | feat: frame content size | | 1d |
| 3.6 | #7 | feat: Best level (btlazy2, level 11) | #6 | 4d |
| 3.7 | #21 | feat: numeric levels 1-22 | #5, ✅ #6, ✅ #7 | 1d |
| 3.8 | #25 | feat: FastCOVER dict builder | | 3d |
| 3.9 | #56 | perf: packed FSE Entry layout (4-byte + bulk spread + cache alignment) | #11 | 2d 4h |
| 3.10 | #47 | perf: reuse encoded scratch buffer across streaming blocks | | 1h |
| 3.11 | #51 | perf: rebase HC table positions (remove 4 GiB cutoff) | | 2d |
| 3.12 | #67 | perf(encoding): row-based match finder for fast/dfast levels | | 3d |

Parallelizable: ✅ #25, ✅ #56, ✅ #47, ✅ #51, ✅ #67 were independent.

Phase 4: SIMD & Hardware Acceleration (P1-high / P2-medium) — DONE ✅

Architecture-specific optimizations to reach C zstd parity. Ordered by impact.

| Order | Issue | What | Blocked by | Est |
|-------|-------|------|------------|-----|
| 4.1 | #68 | perf(decoding): SIMD wildcopy for literal+match memcpy | | 3d |
| 4.2 | #69 | perf(decoding): branchless offset history + stride prefetch + BMI2 pext | | 1d 4h |
| 4.3 | #66 | perf(decoding): SIMD HUF decode kernels (BMI2/AVX2/VBMI2/NEON) | #56 (recommended) | 4d |
| 4.4 | #88 | perf(encoding): eliminate default-level small-input dfast reset/table clear cliff | | 2d |
| 4.5 | #70 | perf(encoding): SIMD match-length comparison (SSE2/AVX2/NEON) | | 1d |
| 4.6 | #97 | perf(encoding): early incompressible fast-path for fastest/default encode + default decodecorpus dfast parity vs C | | 2d 4h |
| 4.7 | #71 | perf: ARM platform optimizations (CRC32 hash, NEON, SVE2 histcnt) | #68 (NEON wildcopy) | 2d 4h |

Critical path update: ✅ #68 (PR #85, 2026-04-09), ✅ #69 (PR #90, 2026-04-09), ✅ #66 (PR #92, 2026-04-09), ✅ #70 (PR #96, 2026-04-09), ✅ #97 (PR #99, 2026-04-11), ✅ #88 (verified+closed, 2026-04-11), ✅ #71 (PR #104, 2026-04-11), and ✅ #86 (PR #105, 2026-04-12) are completed. The next highest-impact open item is #22.

Parallelizable: Phase 4B item #86 is completed; Phase 5 items remain independently schedulable by dependency order.

Expected combined impact:

  • Decode throughput: +60-100% (from 1.4-3.5x slower → ~1.0-1.5x slower vs C zstd)
  • Encode throughput: +15-30% at levels 5+ (SIMD match comparison + ARM CRC32 hash)
  • Encode throughput (incompressible/random): major latency reduction from early no-compress fast-path (#97)
  • Encode latency (tiny payloads, default level): remove 100x-class cliff vs C for small-4k-log-lines path
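
For readers unfamiliar with what "match comparison" means in the encode bullet above, here is a minimal scalar sketch of the underlying idea: counting how many leading bytes two candidate regions share, eight bytes at a time. It is not the crate's code and not the SIMD implementation from #70; the function name `common_prefix_len` is hypothetical. A SIMD version widens the same compare-and-locate-first-difference step to 16 or 32 bytes per iteration.

```rust
use std::convert::TryInto;

/// Hypothetical helper: length of the common prefix of `a` and `b`.
/// Compares 8 bytes per step, then finishes byte by byte.
fn common_prefix_len(a: &[u8], b: &[u8]) -> usize {
    let max = a.len().min(b.len());
    let mut i = 0;
    while i + 8 <= max {
        // Little-endian loads put the first byte in the lowest 8 bits.
        let x = u64::from_le_bytes(a[i..i + 8].try_into().unwrap());
        let y = u64::from_le_bytes(b[i..i + 8].try_into().unwrap());
        let diff = x ^ y;
        if diff != 0 {
            // The lowest set bit marks the first differing byte.
            return i + (diff.trailing_zeros() / 8) as usize;
        }
        i += 8;
    }
    // Tail: fewer than 8 bytes left to compare.
    while i < max && a[i] == b[i] {
        i += 1;
    }
    i
}
```

The word-sized XOR replaces up to eight byte comparisons with one, which is why longer matches (more common at higher levels) benefit most.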

Phase 4B: Dictionary Decode Hot Path (P2-medium) — DONE ✅

Latency follow-up immediately after SIMD/HW phase.

Order Issue What Blocked by Est
4B.1 #86 perf(decoding): pre-parsed dictionary handle for repeated dict decode Phase 4 complete 1d 4h

Phase 5: Advanced Features (P3-low)

Full feature parity with C zstd.

Order Issue What Blocked by Est
5.1 #22 feat: optimal parsing (levels 16-22) #7, ✅ #21 5d
5.2 #23 feat: block splitting #5 3d
5.3 #18 feat: long distance matching 3d
5.4 #26 feat: magicless format 4h
5.5 #27 feat: configurable parameters API #21 1d
5.6 #19 feat: multi-threaded compression #5 5d
5.7 #72 perf(decoding): parallel block decompression 3d

Dependency graph

Phase 1-2 (DONE):
✅#15 ──→ ✅#5 ──→ ✅#6 ──→ ✅#7 ──→ ✅#21
✅#17 ──→ ✅#8      │         │        │
✅#9  ✅#12 ✅#14   │         │        ├──→ #27 (params API)
✅#24 ✅#10 ✅#11   │         │        └──→ #22 (optimal parsing)
✅#13 ✅#20 ✅#16   ├──→ #19  │
                   └──→ #23  └──→ #22

Phase 3 (Performance Parity — DONE):
  ✅#56 (packed FSE) ──→ ✅#66 (SIMD HUF decode)
  ✅#25 (FastCOVER)
  ✅#47 (scratch reuse)
  ✅#51 (HC rebase)
  ✅#67 (row matcher)

Phase 4 (SIMD & HW Acceleration — DONE):
  ✅#68 (SIMD wildcopy) ──→ ✅#71 (ARM optimizations, NEON backend)
  ✅#69 (branchless offset + prefetch + pext)
  ✅#66 (SIMD HUF kernels, BMI2/AVX2/VBMI2/NEON)
  ✅#88 (default-level small-input dfast cliff)
  ✅#70 (SIMD match-length comparison)
  ✅#97 (early incompressible fast-path + default decodecorpus dfast parity vs C)
  ✅#86 (pre-parsed dictionary handle for repeated dict decode)

Phase 5 (Advanced Features):
  #22 (optimal parsing)
  #23 (block splitting)
  #18 (LDM)
  #26 (magicless)
  #27 (params API)
  #19 (multi-thread compress)
  #72 (parallel block decompress)
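
The "Blocked by" columns and the graph above can be checked mechanically. The following is a small sketch, not part of the project, that derives one valid execution order with a simple topological sort. The function names are hypothetical, and the edge list is transcribed by hand from the phase tables; only issues that appear in a "Blocked by" relationship are included.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Returns issue numbers ordered so that every blocker precedes the issue
/// it blocks, or `None` if the dependencies contain a cycle.
fn execution_order(blocked_by: &BTreeMap<u32, Vec<u32>>) -> Option<Vec<u32>> {
    // Register every issue, whether it appears as a key or only as a blocker.
    let mut remaining: BTreeMap<u32, BTreeSet<u32>> = BTreeMap::new();
    for (&issue, blockers) in blocked_by {
        remaining.entry(issue).or_default().extend(blockers.iter().copied());
        for &blocker in blockers {
            remaining.entry(blocker).or_default();
        }
    }
    let mut order = Vec::with_capacity(remaining.len());
    while !remaining.is_empty() {
        // Pick the lowest-numbered issue with no unfinished blockers.
        // If none exists, the remaining issues form a cycle.
        let next = remaining
            .iter()
            .find(|(_, deps)| deps.is_empty())
            .map(|(&issue, _)| issue)?;
        remaining.remove(&next);
        for deps in remaining.values_mut() {
            deps.remove(&next);
        }
        order.push(next);
    }
    Some(order)
}

/// Edges transcribed from the "Blocked by" columns (issue -> blockers).
fn roadmap_edges() -> BTreeMap<u32, Vec<u32>> {
    BTreeMap::from([
        (5, vec![15, 17]),
        (6, vec![5]),
        (7, vec![6]),
        (8, vec![17]),
        (19, vec![5]),
        (21, vec![5, 6, 7]),
        (22, vec![7, 21]),
        (23, vec![5]),
        (27, vec![21]),
    ])
}
```

Calling `execution_order(&roadmap_edges())` yields an order in which #22 comes after both #7 and #21. Ties are broken by issue number here, not by impact, so the result is a consistency check and not a replacement for the prioritized list below.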

Recommended execution order (next actions)

Highest impact per effort — do these next:

  1. #22 feat: optimal parsing compression (btopt/btultra/btultra2 strategies). Est. 5d. Highest-impact remaining feature-parity item.
  2. #23 feat: block splitting for improved compression ratio. Est. 3d. Follow-up ratio/feature parity improvement.
  3. #72 perf(decoding): parallel block decompression for multi-block frames. Est. 3d. Next decode throughput multiplier after parity features.

Once these three items land, decode should stay in the ~1.0-1.3x range vs C zstd on literals-heavy paths, with additional headroom from parallel block decompression. The remaining encode parity work is concentrated in optimal parsing and block splitting. The early incompressible fast-path item is already complete (✅ #97).

Total estimate

  • Phase 3 remaining: ~0d
  • Phase 4 (new): ~0d
  • Phase 4B: ~0d
  • Phase 5: ~20d 4h
  • Total remaining: ~20d 4h (~20-21 working days) for full feature parity
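
The Phase 5 total can be reproduced from the per-item estimates in the Phase 5 table. This sketch assumes one working day equals 8 hours, which the roadmap does not state explicitly but which is consistent with "4h" being counted as half a day; the helper names are hypothetical.

```rust
/// Assumption: one working day is 8 hours.
const HOURS_PER_DAY: u32 = 8;

/// Parses estimates such as "5d", "4h", or "2d 4h" into hours.
/// Returns `None` for an unknown unit or a non-numeric value.
fn parse_estimate_hours(text: &str) -> Option<u32> {
    let mut total = 0;
    for part in text.split_whitespace() {
        let unit = part.chars().last()?;
        let digits = &part[..part.len() - unit.len_utf8()];
        let value: u32 = digits.parse().ok()?;
        total += match unit {
            'd' => value * HOURS_PER_DAY,
            'h' => value,
            _ => return None,
        };
    }
    Some(total)
}

/// Sums a list of estimates; `None` if any entry fails to parse.
fn total_hours(estimates: &[&str]) -> Option<u32> {
    estimates.iter().map(|e| parse_estimate_hours(e)).sum()
}

/// Formats hours back into the roadmap's "Nd Mh" notation.
fn format_estimate(hours: u32) -> String {
    match (hours / HOURS_PER_DAY, hours % HOURS_PER_DAY) {
        (0, h) => format!("{h}h"),
        (d, 0) => format!("{d}d"),
        (d, h) => format!("{d}d {h}h"),
    }
}
```

With the Phase 5 estimates `["5d", "3d", "3d", "4h", "1d", "5d", "3d"]`, `total_hours` returns 164 hours, which `format_estimate` renders as `20d 4h`, matching the figure above. The tail backlog item #87 is not part of that sum.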

Roadmap tail backlog

  1. #87 research(decoding): evaluate advanced SIMD wildcopy paths beyond current baseline. Est. 2d 4h. Evaluate post-Phase-4 SIMD candidates and benchmark go/no-go decisions.

Metadata

    Labels: P1-high, documentation, enhancement, performance
