TODO — Z80 Superoptimizer Roadmap

Last updated: 2026-04-01 (Day 7 — IX expansion, OFB, mul16c)

Legend: [x] done, [-] in progress, [ ] planned. Effort: S = hours, M = day, L = days, XL = week+.

1. Arithmetic Library (for MinZ compiler)

1.1 Multiply — COMPLETE

mul8: 254/254 constants, A×K→A — data/mulopt8_clobber.json (S)
mul16: 254/254 constants, A×K→HL — data/mulopt16_complete.json (M)
[-] mul16c: HL×K→HL (16-bit × constant) — cuda/z80_mulopt16c.cu built, running (M)
- 12-op pool: ADD HL,HL/BC/DE + 8×LD saves + EX DE,HL
- ~86/254 found (max-len=10). Structural limit: floor(log2(K))+hamming(K)≤9
- Not found (168/254): K values with high Hamming weight — need SBC HL,rr to add
- Next: gen_mul16c_table.py → pkg/mulopt/mul16c_table.go; Go wrapper Emit16c(k)

1.2 Division / Modulo — COMPLETE (u8)

div8: 254/254 constants, A÷K→A — data/div8_optimal.json v3 (M)
- 6 methods: shift(5), mul_shift(30), preshift_mul(36), mul_add256_shift(41), double_mul_shift(15), carry_compare(127)
- T-states: 8–188 (avg 79T). v1→v2→v3: 154→135→79T (−49%). All exhaustively verified.
- carry_compare (GPU-discovered): OR A; LD B,(256-K); ADC A,B; SBC A,A; AND 1 = 26T for K≥128
- PRESHIFT: (A>>P)×M>>S — exploit K's power-of-2 factor
- Cross-verified: z80-optimizer + MinZ + MinZ-VIR + MinZ-ABAP (4 systems)
mod8: 254/254 — data/mod8_optimal.json (S)
divmod8: 254/254 — data/divmod8_optimal.json (S)
div16n: HL÷K→A (narrowing division, quotient ≤ 255) (M)
- Approach: analytical multiply-and-shift on 16-bit, or lookup table
mod16n: HL%K→A (narrowing modulo) (M)

1.3 Branchless Primitives — COMPLETE (u8)

1.4 16-bit Arithmetic — PARTIAL

abs16: 11 ops, 44T — data/arith16_new.json
neg16: 6 ops, 27T
min16/max16: 5 ops, 41–46T (branch)
sign16: 7 ops, 20–34T
cmp16_zero: 2 ops, 8T
sat_add16 / sat_sub16: unsigned, 16-bit (S)
mul16×16→32: HL×DE→DEHL, needed for mul16c (L)
- Schoolbook: 4 partial products, 3 additions. ~200T estimate.

1.5 32-bit Arithmetic — PARTIAL (convention rework needed)

u32 library: 13 operations — data/u32_ops.json (DEHL convention)
- SHL32(34T), SHR32(32T), SAR32(32T), ADD32(54T), SUB32(58T), NEG32(57T), CMP32_ZERO(16T), ZEXT/SEXT, XOR32(100T), AND32(100T), ROTR32(32–40T)
SHA-256 round decomposition — data/sha256_round.json
- ~2570T/round, 64 rounds + message schedule = ~202K T = 58ms/block @3.5MHz
[-] Multi-convention u32: DEHL is suboptimal, HLIX and HLH'L' are faster (M)
- HLIX (H:L:IXH:IXL): SHL=30T, ADD=30T — best for fused ops (×10+A atoi)
- HLH'L' (H:L:H':L'): SHL=30T, save=4T(EXX!) — best for pure arithmetic (×K)
- DEHL (D:E:H:L): SHL=34T, ADD=54T — current, worst, but simplest
- HLH'L' wins 14/19 ×K benchmarks, saves ~74T avg vs DEHL
- Need: u32_ops.json v2 with all 3 conventions per operation
[-] u32 ×K constant multiply: search optimal per-convention (M)
- Pool: SHL32+SAVE+ADD32 composites per convention
- Key targets: ×3, ×5, ×7, ×9, ×10, ×100, ×1000
- ×10 hand-written HLIX = 178T (with +A fused!)
u32 ÷K constant divide: multiply-and-shift with 32-bit magic (L)
- ÷10 = ×0x1999999A>>32 (~426T via schoolbook u32×u32)
- Alternative: repeated subtraction of powers of 10 (~306T for itoa)
- Power-of-2: just SHR chain (28T × log2(K))
OR32, NOT32: trivial but not yet in table (S)
CMP32_unsigned: compare DEHL vs IX:IY (S)
SHA-256 Z80 implementation: full working .asm (XL)
- Have decomposition; need actual register allocation + memory layout
- HLH'L' convention could hold 2 working vars without RAM spills
- 4×32 vars possible: DEHL + BCIX + D'E'H'L' + B'C'IY

2. Peephole Superoptimizer

2.1 Rules — PARTIAL

len-2→len-1: 739K rules, complete — data/rules_l2.json (done)
[-] len-3→len-1: 37M rules found (~0.05% of 74.9B search space) (XL)
- GPU search running intermittently. Full sweep estimated weeks on 2× RTX 4060 Ti.
- Strategy: prioritize high-value target op ranges.
len-3→len-2: not yet started. Produces size-reducing rewrites. (XL)
len-4→len-2: STOKE only. Stochastic, not exhaustive. (L)
Dead-flags rules: rules valid when specific flags are dead post-sequence (M)
- Infrastructure exists (pkg/search/verifier.go). Need dedicated search run.

2.2 Reordering Optimizer

Dependency DAG: build RAW/WAW/WAR graph from opReads/opWrites (M)
- Primitives exist in pkg/search/pruner.go.
Pattern matcher with reorder: scan basic block, match rules across independent instructions (M)
Multi-pass fixpoint: repeat until no more rules apply (S)
Integration with MinZ: basic-block input/output format, JSON API (M)

2.3 Multi-Target Search

Approximate search: 13-op pool, 15 targets simultaneously (M)
- Day 2 prototype worked. Needs cleanup for production use.
Focused search cores: AF-only, HL+M-only for deeper search (L)

3. Register Allocation

3.1 Tables — COMPLETE + IX EXPANDED

83.6M shapes (≤6v): enumerated, enriched, compressed (78MB)
37.6M feasible: each with optimal assignment + 15 metrics
O(1) lookup: signature = (interference_shape, operation_bag) → hash
Enriched tables: 43% lack A, 21% lack HL, smart CALL save 17T avg
IX-expanded 5v: data/ix_expanded_5v.bin (60.9M entries, 79.2% feasible, 117s GPU)
Merged IX+5v: data/merged_ix_5v.bin (60.9M entries, 79.8% feasible, 382MB)
OFB sidecars: data/enriched_{4v,5v,6v_dense}.ofb — 32-bit feasibility bitmask per entry
- 15 bits: ALU, ptr ops, mul8-safe, DJNZ, EXX bridge, HLH'L' u32, ADC/SBC src validity
- 5v: 11.76M feasible; 6v dense: 25.77M feasible

3.2 Five-Level Pipeline — MOSTLY COMPLETE

Level 1: Cut vertex decomposition (free split, 87%)
Level 2: Enriched table O(1) lookup (37.6M entries, 79%)
Level 3: EXX 2-coloring (7–12v, 70% bipartite)
Level 4: GPU partition optimizer (≤18v exhaustive, <2 min)
Level 5: Z3 fallback (>18v, <0.5%)
[-] Backtracking solver (cmd/regalloc-enum/): alternative to Z3 for 7–15v (M)
- Basic implementation done. Needs benchmarking vs Z3.

3.3 Next Steps

7-variable table: ~500M shapes estimated, 4–12h compute (L)
- Feasibility: ~5% at 7v (phase cliff). Worth it for O(1) coverage.
MinZ integration testing: run pipeline on full 820-fn corpus (M)
- 246 unique signatures tested. Need end-to-end with codegen.
Spill cost model: when allocation fails, estimate spill cost (M)
- Currently: binary feasible/infeasible. Need: quantify spill penalty.

4. Image Search / Demoscene (ZX Spectrum)

4.1 CUDA Generators — DONE (3 methods)

Dual-layer evolutionary — cuda/prng_hybrid_gpu.cu
- 5 layers (3 additive OR + 2 subtractive AND NOT), island model
- Best: cat 4.9%, skull 14.7%, Einstein 15.1%
- 557K img/s dual-layer on RTX 4060 Ti
Segmented hierarchical LFSR — cuda/prng_segmented_search.cu
- 6 levels, brute-force 65536 seeds per segment
- Best: Che 15.0% (1194 bytes), 31.2% (170 bytes fits 256b intro)
BB Introspec port — cuda/bb_search.cu
- 24-bit Galois LFSR, 66 layers, 3 weighted masks
- 0.9s per s0 value, full sweep in ~4 minutes

4.2 Z80 Intros

[-] che_intro.asm — 351 bytes, builds with mza, LFSR mismatch produces noise (M)
- Root cause: CUDA search uses per-layer init, Z80 code uses fixed init
- Fix: re-run CUDA search matching exact Z80 LFSR parameters
- Or: rewrite Z80 code to match CUDA LFSR conventions
256-byte intro: Che or skull, minimal LFSR + segmented seeds (M)
- 85 seeds × 2 bytes = 170 bytes data + ~80 bytes code = fits!
512-byte intro: Che with 6 levels, LFSR playback (M)
1K intro: dual-layer with evolved genome playback (L)

4.3 Improvements

Multi-pass BB refinement: re-optimize early layers after first pass (M)
- Introspec's secret sauce. Each re-pass reduces error monotonically.
Better segmented search: levels 6–7, 32-bit seeds, multiple polynomials (M)
- Goal: <10% error from 256 bytes
Face-region masks: weighted fitness for eyes/mouth like Introspec (S)
New targets: Marilyn Monroe, Darth Vader, Einstein tongue (S)
BB + segmented hybrid: BB for base, segmented for detail (L)

5. GPU Infrastructure

5.1 CUDA Kernels — WORKING

z80_search_v2.cu: 3-stage pipeline (QC→Mid→Exhaustive), dual-GPU
z80_regalloc.cu: GPU allocator + CPU backtracking fallback (constrained enumeration fix: 97× speedup)
z80_mulopt_fast.cu: 14-op constant multiply (38× faster)
z80_divmod_fast.cu: 14-op division/modulo search
z80_mulopt16.cu: 16-bit multiply search (A×K→HL)
z80_mulopt16c.cu: HL×K→HL search, 12-op pool (new day 7)
z80_common.h: shared executor, flag tables, test vectors

5.2 Multi-Platform DSL — WORKING

pkg/gpugen/: ISA DSL → CUDA, Metal, OpenCL, Vulkan
3 ISA definitions: Z80 (394 ops), 6502, SM83 (Game Boy)
Cross-verified: 5 platforms, 4 APIs, 3 vendors, zero discrepancies

5.3 Next Steps

Unified search orchestrator: coordinate multi-GPU multi-kernel searches (L)
- Currently: manual per-kernel scripts. Need: single coordinator.
Remote GPU dispatch: ssh+scp to i5/i3, launch, collect results (M)
- Scripts exist (cuda/run_mulopt_remote.sh). Need: proper pipeline.
Vulkan compute on AMD: RX 580 via gpugen Vulkan backend (M)
- ROCm broken for gfx803. Vulkan works. Need: Vulkan search driver.

6. Compiler Integration (MinZ)

6.1 Go Packages — SHIPPED

pkg/mulopt/: Emit8(k), Emit16(k), Lookup8/16
pkg/regalloc/: LoadBinary(path), IndexOf(shape), Lookup(idx)
pkg/peephole/: Lookup(source) top500, LoadRules(path) full 739K
pkg/gpugen/: ISA DSL for multi-platform code generation
cmd/enrich-ofb/: Emit OFB sidecar — built and validated (day 7)
- Add to Go package exports for use by regalloc pipeline
pkg/mulopt/mul16c_table.go: Emit16c(k) HL×K→HL — pending mul16c JSON complete

6.2 Pending Integration

div8 inline expansion: MinZ codegen wiring for JP __div8 → inline (S)
- Pipeline issue: peephole transforms JP before div8 check. Fix: expansion before optimization.
Clobber-aware selection: pick mul/div variant by register pressure (M)
- Current: always use cheapest. Better: pick by what's live.
Pareto front export: (tstates, bytes, clobber_mask) triples per operation (S)
- MinZ can then pick optimal variant per call site.
Batch evaluation API: JSON-in, JSON-out for testing entire corpus (M)

7. Research & Documentation

7.1 Papers

[-] Register allocation paper — docs/paper_seed_superopt.md (L)
- 8 sections drafted. Needs: image search chapter, Introspec analysis.
[-] Book outline — docs/book_outline.md, 19 chapters (XL)
- Appendices K(minz), L(FP), M(BCD), N(LUT), O(meta) in progress.
Phase diagram publication: feasibility cliff 95.9%→0.9% (S)
- Data complete. Need: LaTeX figure + 2-page writeup.
21-instruction universal pool paper: 2.7% of ISA generates ALL arithmetic (S)
Dual-layer image search writeup: subtractive carving insight (M)

7.2 Documentation

CLAUDE.md: up to date with all packages and kernels
README.md: v1.0.0 release notes, architecture, results
Gallery: media/prng_images/README.md (29 experiments, 3 methods)
Update docs/NEXT.md: replace WebGPU section with CUDA v2 reality (S)
API documentation: godoc for pkg/mulopt, pkg/regalloc, pkg/peephole (M)

8. Cleanup & Tech Debt

Update README.md results section: div8 254/254, sign/sat ops, u32, SHA-256 (S)
Archive dead code: WebGPU shader (abandoned), CUDA v1 search (superseded) (S)
Consolidate data files: merge branchless_lib + sign_sat_ops + arith16_new into single arith8.json / arith16.json (S)
Test coverage: pkg/mulopt, pkg/regalloc need integration tests (M)
CI pipeline: GitHub Actions for Go build + test (S)
Release v1.1.0: div8, mod8, divmod8, sign/sat, u32 ops, SHA-256, arith16 (S)

Priority Matrix

Priority	Item	Effort	Blocker for
P0	~~div8/mod8/divmod8~~	M	~~MinZ codegen~~ DONE
P0	~~sign8/sat_add8/sat_sub8~~	S	~~MinZ stdlib~~ DONE
P0	~~abs16/neg16/min16/max16~~	S	~~MinZ u16 ops~~ DONE
P0	~~SHA-256 decomposition~~	S	~~MinZ feasibility~~ DONE
P0	~~div8 v3 + carry_compare~~	M	~~MinZ-VIR integrated~~ DONE
P1	div8 inline expansion fix	S	MinZ codegen pipeline
P1	Z80 intro LFSR alignment	M	Working demo
P1	mul16c (HL×K→HL)	M	MinZ u16 multiply
P2	Multi-pass BB refinement	M	Better image quality
P2	div16n/mod16n	M	MinZ u16 division
P2	len-3 peephole search	XL	Compiler quality
P3	Reordering optimizer	L	Real-world rule application
P3	7-variable regalloc table	L	O(1) coverage expansion
P3	256-byte ZX Spectrum intro	M	Demoscene deliverable
P4	SHA-256 full Z80 .asm	XL	Proof-of-concept
P4	approx sin/cos, popcount	M	MinZ math library
P4	Segmented search <10%	M	Research paper
P5	Paper / book completion	XL	Publication
P5	Vulkan compute on AMD	M	Multi-vendor story

Completed Milestones

Date	Milestone
2026-03-26	v1.0.0 release, 500+ arithmetic sequences, article published
2026-03-26	83.6M regalloc shapes, 97.7% infeasibility proof
2026-03-26	ISA DSL gpugen: 4 backends, 3 ISAs, cross-verified
2026-03-27	FP16 library, BCD arithmetic, multi-target search
2026-03-27	21-instruction universal pool thesis
2026-03-28	37.6M enriched shapes, O(1) regalloc lookup
2026-03-28	Branchless library: ABS/MIN/MAX/CMOV, gray_decode EXACT
2026-03-28	Z flag write-only proof, SBC A,A trick library
2026-03-29	Image search: 3 methods, cat 4.9%, Che 15%, Introspec port
2026-03-29	u32 library: 13 ops, SHL32/SHR32 proven optimal
2026-03-29	Segmented hierarchical LFSR (user's breakthrough idea)
2026-03-29	div8 254/254, mod8 254/254, divmod8 254/254
2026-03-29	sign8, sat_add8 (16T!), sat_sub8 — all exhaustively verified
2026-03-29	abs16, neg16, min16, max16, sign16, cmp16_zero
2026-03-29	SHA-256 round decomposition: 58ms/block realistic estimate
2026-03-29	div8 v3: carry_compare trick, avg 79T (−49%), GPU-discovered
2026-03-29	4× cross-verified: z80-optimizer + MinZ + MinZ-VIR + MinZ-ABAP
2026-03-29	MinZ-VIR integrated div8 IntrinsicTable (commit 8cfba219)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

TODO — Z80 Superoptimizer Roadmap

1. Arithmetic Library (for MinZ compiler)

1.1 Multiply — COMPLETE

1.2 Division / Modulo — COMPLETE (u8)

1.3 Branchless Primitives — COMPLETE (u8)

1.4 16-bit Arithmetic — PARTIAL

1.5 32-bit Arithmetic — PARTIAL (convention rework needed)

2. Peephole Superoptimizer

2.1 Rules — PARTIAL

2.2 Reordering Optimizer

2.3 Multi-Target Search

3. Register Allocation

3.1 Tables — COMPLETE + IX EXPANDED

3.2 Five-Level Pipeline — MOSTLY COMPLETE

3.3 Next Steps

4. Image Search / Demoscene (ZX Spectrum)

4.1 CUDA Generators — DONE (3 methods)

4.2 Z80 Intros

4.3 Improvements

5. GPU Infrastructure

5.1 CUDA Kernels — WORKING

5.2 Multi-Platform DSL — WORKING

5.3 Next Steps

6. Compiler Integration (MinZ)

6.1 Go Packages — SHIPPED

6.2 Pending Integration

7. Research & Documentation

7.1 Papers

7.2 Documentation

8. Cleanup & Tech Debt

Priority Matrix

Completed Milestones

FilesExpand file tree

TODO.md

Latest commit

History

TODO.md

File metadata and controls

TODO — Z80 Superoptimizer Roadmap

1. Arithmetic Library (for MinZ compiler)

1.1 Multiply — COMPLETE

1.2 Division / Modulo — COMPLETE (u8)

1.3 Branchless Primitives — COMPLETE (u8)

1.4 16-bit Arithmetic — PARTIAL

1.5 32-bit Arithmetic — PARTIAL (convention rework needed)

2. Peephole Superoptimizer

2.1 Rules — PARTIAL

2.2 Reordering Optimizer

2.3 Multi-Target Search

3. Register Allocation

3.1 Tables — COMPLETE + IX EXPANDED

3.2 Five-Level Pipeline — MOSTLY COMPLETE

3.3 Next Steps

4. Image Search / Demoscene (ZX Spectrum)

4.1 CUDA Generators — DONE (3 methods)

4.2 Z80 Intros

4.3 Improvements

5. GPU Infrastructure

5.1 CUDA Kernels — WORKING

5.2 Multi-Platform DSL — WORKING

5.3 Next Steps

6. Compiler Integration (MinZ)

6.1 Go Packages — SHIPPED

6.2 Pending Integration

7. Research & Documentation

7.1 Papers

7.2 Documentation

8. Cleanup & Tech Debt

Priority Matrix

Completed Milestones