Roadmap: structured-zstd feature parity with C zstd

## Execution order (dependency-aware)

> Last updated: 2026-04-12

### Phase 1: Correctness & Core (P0-critical) — DONE ✅
These are blocking bugs and the most-used missing feature.

| Order | Issue | What | Est |
|-------|-------|------|-----|
| 1.1 | ✅ #15 | fix: large literals block panic | 1d |
| 1.2 | ✅ #17 | feat: FSE table reuse + offset history (broken encoding) | 2d |
| 1.3 | ✅ #5 | feat: Default compression level (dfast, level 3) | 3d |

**Dependency chain**: ✅ #15 and ✅ #17 are independent of each other; both should be done before ✅ #5 (correct encoding foundation).

### Phase 2: CoordiNode Critical Path (P1-high) — DONE ✅
Core features needed for the CoordiNode use case.

| Order | Issue | What | Blocked by | Est |
|-------|-------|------|------------|-----|
| 2.1 | ✅ #24 | test: benchmark suite | — | 2d |
| 2.2 | ✅ #14 | perf: encoder match interleaving | — | 2d |
| 2.3 | ✅ #12 | perf: sequence execution wildcopy | — | 3d |
| 2.4 | ✅ #8 | feat: dictionary compression | ✅ #17 | 3d |
| 2.5 | ✅ #9 | feat: streaming encoder | — | 2d |
| 2.6 | ✅ #6 | feat: Better level (lazy2, level 7) | ✅ #5 | 4d |

**Parallelizable**: ✅ #24, ✅ #14, ✅ #12, ✅ #9 are independent. ✅ #8 needs ✅ #17. ✅ #6 needs ✅ #5.

### Phase 3: Performance Parity (P2-medium) — DONE ✅
Close the 1.4-3.5x decompression gap with C zstd.

| Order | Issue | What | Blocked by | Est |
|-------|-------|------|------------|-----|
| 3.1 | ✅ #10 | perf: Huffman 4-stream parallel | — | 3d |
| 3.2 | ✅ #11 | perf: FSE batched refill for state updates | — | 2d |
| 3.3 | ✅ #13 | perf: bitstream reader | — | 2d |
| 3.4 | ✅ #20 | perf: decoder pre-allocation | — | 4h |
| 3.5 | ✅ #16 | feat: frame content size | — | 1d |
| 3.6 | ✅ #7 | feat: Best level (btlazy2, level 11) | ✅ #6 | 4d |
| 3.7 | ✅ #21 | feat: numeric levels 1-22 | ✅ #5, ✅ #6, ✅ #7 | 1d |
| 3.8 | ✅ #25 | feat: FastCOVER dict builder | — | 3d |
| 3.9 | ✅ #56 | perf: packed FSE Entry layout (4-byte + bulk spread + cache alignment) | ✅ #11 | 2d 4h |
| 3.10 | ✅ #47 | perf: reuse encoded scratch buffer across streaming blocks | — | 1h |
| 3.11 | ✅ #51 | perf: rebase HC table positions (remove 4 GiB cutoff) | — | 2d |
| 3.12 | ✅ #67 | perf(encoding): row-based match finder for fast/dfast levels | — | 3d |

**Parallelizable**: ✅ #25, ✅ #56, ✅ #47, ✅ #51, ✅ #67 were independent.

### Phase 4: SIMD & Hardware Acceleration (P1-high / P2-medium) — DONE ✅
Architecture-specific optimizations to reach C zstd parity. **Ordered by impact.**

| Order | Issue | What | Blocked by | Est |
|-------|-------|------|------------|-----|
| 4.1 | ✅ **#68** | **perf(decoding): SIMD wildcopy for literal+match memcpy** | — | 3d |
| 4.2 | ✅ **#69** | **perf(decoding): branchless offset history + stride prefetch + BMI2 pext** | — | 1d 4h |
| 4.3 | ✅ **#66** | **perf(decoding): SIMD HUF decode kernels (BMI2/AVX2/VBMI2/NEON)** | ✅ #56 (recommended) | 4d |
| 4.4 | ✅ **#88** | **perf(encoding): eliminate default-level small-input dfast reset/table clear cliff** | — | 2d |
| 4.5 | ✅ **#70** | **perf(encoding): SIMD match-length comparison (SSE2/AVX2/NEON)** | — | 1d |
| 4.6 | ✅ **#97** | **perf(encoding): early incompressible fast-path for fastest/default encode + default decodecorpus dfast parity vs C** | — | 2d 4h |
| 4.7 | ✅ **#71** | **perf: ARM platform optimizations (CRC32 hash, NEON, SVE2 histcnt)** | ✅ #68 (NEON wildcopy) | 2d 4h |

**Critical path update:** ✅ #68 (PR #85, 2026-04-09), ✅ #69 (PR #90, 2026-04-09), ✅ #66 (PR #92, 2026-04-09), ✅ #70 (PR #96, 2026-04-09), ✅ #97 (PR #99, 2026-04-11), ✅ #88 (verified+closed, 2026-04-11), ✅ #71 (PR #104, 2026-04-11), and ✅ #86 (PR #105, 2026-04-12) are completed. Next highest-impact open item is #22.

**Parallelizable**: Phase 4B item #86 is completed; Phase 5 items remain independently schedulable by dependency order.

**Expected combined impact:**
- Decode throughput: **+60-100%** (from 1.4-3.5x slower → ~1.0-1.5x slower vs C zstd)
- Encode throughput: **+15-30%** at levels 5+ (SIMD match comparison + ARM CRC32 hash)
- Encode throughput (incompressible/random): major latency reduction from early no-compress fast-path (#97)
- Encode latency (tiny payloads, default level): removes the 100x-class cliff vs C zstd on the `small-4k-log-lines` path

### Phase 4B: Dictionary Decode Hot Path (P2-medium) — DONE ✅
Latency follow-up immediately after SIMD/HW phase.

| Order | Issue | What | Blocked by | Est |
|-------|-------|------|------------|-----|
| 4B.1 | ✅ **#86** | **perf(decoding): pre-parsed dictionary handle for repeated dict decode** | Phase 4 complete | 1d 4h |

### Phase 5: Advanced Features (P3-low)
Full feature parity with C zstd.

| Order | Issue | What | Blocked by | Est |
|-------|-------|------|------------|-----|
| 5.1 | #22 | feat: optimal parsing (levels 16-22) | ✅ #7, ✅ #21 | 5d |
| 5.2 | #23 | feat: block splitting | ✅ #5 | 3d |
| 5.3 | #18 | feat: long distance matching | — | 3d |
| 5.4 | #26 | feat: magicless format | — | 4h |
| 5.5 | #27 | feat: configurable parameters API | ✅ #21 | 1d |
| 5.6 | #19 | feat: multi-threaded compression | ✅ #5 | 5d |
| 5.7 | **#72** | **perf(decoding): parallel block decompression** | — | 3d |

## Dependency graph

```
Phases 1-3 (DONE) and their Phase 5 dependents:
✅#15 ──→ ✅#5 ──→ ✅#6 ──→ ✅#7 ──→ ✅#21
✅#17 ──→ ✅#8      │         │        │
✅#9  ✅#12 ✅#14   │         │        ├──→ #27 (params API)
✅#24 ✅#10 ✅#11   │         │        └──→ #22 (optimal parsing)
✅#13 ✅#20 ✅#16   ├──→ #19  │
                   └──→ #23  └──→ #22

Phase 3 (Performance Parity — DONE):
  ✅#56 (packed FSE) ──→ ✅#66 (SIMD HUF decode)
  ✅#25 (FastCOVER)
  ✅#47 (scratch reuse)
  ✅#51 (HC rebase)
  ✅#67 (row matcher)

Phase 4 (SIMD & HW Acceleration — DONE):
  ✅#68 (SIMD wildcopy) ──→ ✅#71 (ARM optimizations, NEON backend)
  ✅#69 (branchless offset + prefetch + pext)
  ✅#66 (SIMD HUF kernels, BMI2/AVX2/VBMI2/NEON)
  ✅#88 (default-level small-input dfast cliff)
  ✅#70 (SIMD match-length comparison)
  ✅#97 (early incompressible fast-path + default decodecorpus dfast parity vs C)
  ✅#86 (pre-parsed dictionary handle for repeated dict decode)

Phase 5 (Advanced Features):
  #22 (optimal parsing)
  #23 (block splitting)
  #18 (LDM)
  #26 (magicless)
  #27 (params API)
  #19 (multi-thread compress)
  #72 (parallel block decompress)
```
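The blocker relationships in the graph can be checked mechanically. The following is a minimal sketch, assuming the "Blocked by" column of the Phase 5 table is the complete edge list. The names `BLOCKED_BY`, `DONE`, `ready_items`, and `full_order` are hypothetical helpers for illustration and are not part of structured-zstd. It relies only on Python's standard `graphlib` module (Python 3.9+).

```python
from graphlib import TopologicalSorter

# Issue number -> set of blocking issue numbers, copied from the
# "Blocked by" column of the Phase 5 table (assumed to be complete).
BLOCKED_BY = {
    22: {7, 21},  # optimal parsing
    23: {5},      # block splitting
    18: set(),    # long distance matching
    26: set(),    # magicless format
    27: {21},     # configurable parameters API
    19: {5},      # multi-threaded compression
    72: set(),    # parallel block decompression
}

# Blockers the roadmap already marks as done.
DONE = {5, 7, 21}


def ready_items(blocked_by, done):
    """Return open issues whose blockers are all done, in ascending order."""
    return sorted(
        issue
        for issue, deps in blocked_by.items()
        if issue not in done and deps <= done
    )


def full_order(blocked_by):
    """Return one valid execution order in which blockers come first."""
    return list(TopologicalSorter(blocked_by).static_order())


if __name__ == "__main__":
    print(ready_items(BLOCKED_BY, DONE))
```

With the blockers above all marked done, every Phase 5 item is reported as ready, which matches the roadmap's note that Phase 5 items are independently schedulable.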

## Recommended execution order (next actions)

**Highest impact per effort — do these next:**

1. **#22** (optimal parsing compression) — 5d — highest-impact remaining feature-parity item
2. **#23** (block splitting) — 3d — follow-up ratio/feature parity improvement
3. **#72** (parallel block decompression) — 3d — next decode throughput multiplier after parity features

After these three items, decode should stay in the **~1.0-1.3x** range vs C zstd on literals-heavy paths, with additional headroom from parallel block decompression. The remaining encode parity work is concentrated in optimal parsing and block splitting. The early incompressible fast path is already done (✅ #97).

## Total estimate
- Phase 3: 0d remaining
- Phase 4: 0d remaining
- Phase 4B: 0d remaining
- Phase 5: ~20d 4h
- **Total remaining: ~20d 4h (~20-21 working days) for full feature parity**
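The Phase 5 figure can be reproduced by summing the "Est" column. This is a small sketch under one stated assumption: a working day is 8 hours, which the roadmap does not state explicitly. The helper names `parse_hours` and `format_estimate` are hypothetical.

```python
HOURS_PER_DAY = 8  # assumption: the roadmap does not define a day's length

# Estimates copied from the "Est" column of the Phase 5 table.
PHASE_5 = {
    "#22": "5d",
    "#23": "3d",
    "#18": "3d",
    "#26": "4h",
    "#27": "1d",
    "#19": "5d",
    "#72": "3d",
}


def parse_hours(estimate):
    """Convert an estimate such as '2d 4h' into a number of hours."""
    total = 0
    for part in estimate.split():
        value, unit = int(part[:-1]), part[-1]
        if unit == "d":
            total += value * HOURS_PER_DAY
        elif unit == "h":
            total += value
        else:
            raise ValueError(f"unknown unit in estimate: {part!r}")
    return total


def format_estimate(hours):
    """Format hours back into the roadmap's 'Nd Mh' notation."""
    days, rem = divmod(hours, HOURS_PER_DAY)
    parts = []
    if days:
        parts.append(f"{days}d")
    if rem:
        parts.append(f"{rem}h")
    return " ".join(parts) or "0d"


total_hours = sum(parse_hours(est) for est in PHASE_5.values())
print(format_estimate(total_hours))  # 20d 4h
```

The tail backlog item #87 is not part of the Phase 5 table and is therefore not included in this sum.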


## Roadmap tail backlog

7. **#87** (research: advanced SIMD wildcopy paths) — 2d 4h — evaluate post-Phase-4 SIMD candidates and benchmark go/no-go decisions



