Last updated: 2026-04-01 (Day 7 — IX expansion, OFB, mul16c)
Legend: [x] done, [-] in progress, [ ] planned.
Effort: S = hours, M = day, L = days, XL = week+.
- mul8: 254/254 constants, A×K→A — data/mulopt8_clobber.json (S)
- mul16: 254/254 constants, A×K→HL — data/mulopt16_complete.json (M)
- [-] mul16c: HL×K→HL (16-bit × constant) — cuda/z80_mulopt16c.cu built, running (M)
- 12-op pool: ADD HL,HL/BC/DE + 8×LD saves + EX DE,HL
- ~86/254 found (max-len=10). Structural limit: floor(log2(K))+hamming(K)≤9
- Not found (168/254): K values with high Hamming weight — need to add SBC HL,rr to the op pool
- Next: gen_mul16c_table.py → pkg/mulopt/mul16c_table.go; Go wrapper Emit16c(k)
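The structural limit is easy to check off-target: a naive MSB-first shift-and-add schedule for HL×K costs one ADD HL,HL per doubling (floor(log2 K)) plus one ADD HL,rr per extra set bit, so the K reachable under the budget satisfy floor(log2 K)+hamming(K) ≤ 9. A sketch (names are illustrative, not from the codebase):

```python
def shift_add_bound(k: int) -> int:
    """floor(log2(K)) doublings plus hamming(K) set bits for a naive
    MSB-first shift-and-add schedule computing HL*K."""
    return (k.bit_length() - 1) + bin(k).count("1")

# K values reachable under the max-len=10 budget (bound <= 9)
reachable = [k for k in range(2, 256) if shift_add_bound(k) <= 9]
```

Counting the predicate over 2..255 gives exactly 86 reachable constants, matching the ~86/254 found; the 168 misses are the high-Hamming-weight K noted above.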
- div8: 254/254 constants, A÷K→A — data/div8_optimal.json v3 (M)
- 6 methods: shift(5), mul_shift(30), preshift_mul(36), mul_add256_shift(41), double_mul_shift(15), carry_compare(127)
- T-states: 8–188 (avg 79T). v1→v2→v3: 154→135→79T (−49%). All exhaustively verified.
- carry_compare (GPU-discovered): OR A; LD B,(256-K); ADC A,B; SBC A,A; AND 1 = 26T for K≥128
- PRESHIFT: (A>>P)×M>>S — exploit K's power-of-2 factor
- Cross-verified: z80-optimizer + MinZ + MinZ-VIR + MinZ-ABAP (4 systems)
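The carry_compare trick rests on two facts: for K ≥ 128 the quotient A÷K is 0 or 1, and the carry out of ADC A,(256−K) is exactly the A ≥ K test. A minimal Python model of the five-op sequence (a sketch of the flag behavior, not the project's verifier):

```python
def carry_compare_div(a: int, k: int) -> int:
    """Model of: OR A / LD B,256-K / ADC A,B / SBC A,A / AND 1."""
    carry = (a + (256 - k)) > 0xFF   # ADC carry-out (carry-in cleared by OR A)
    acc = 0xFF if carry else 0x00    # SBC A,A leaves -carry in A
    return acc & 1                   # AND 1 -> the quotient bit

# Exhaustive over the K >= 128 range where the trick applies
assert all(carry_compare_div(a, k) == a // k
           for k in range(128, 256) for a in range(256))
```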
- mod8: 254/254 — data/mod8_optimal.json (S)
- divmod8: 254/254 — data/divmod8_optimal.json (S)
- div16n: HL÷K→A (narrowing division, quotient ≤ 255) (M)
- Approach: analytical multiply-and-shift on 16-bit, or lookup table
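For the multiply-and-shift route, a brute-force sketch of finding a magic (M, S) pair per K over the narrowing domain (all HL with quotient ≤ 255); function and variable names are illustrative, not from the codebase:

```python
def find_magic(k: int):
    """Smallest shift S with M = ceil(2^S / K) such that (x*M) >> S
    equals x // K for every x in the narrowing domain (x // K <= 255)."""
    limit = min(0xFFFF, 256 * k - 1)      # keep the quotient <= 255
    for s in range(1, 33):
        m = -(-(1 << s) // k)             # ceil(2^s / k)
        if all((x * m) >> s == x // k for x in range(limit + 1)):
            return m, s
    return None
```

For example, find_magic(10) returns a pair exact over all HL up to 2559, the full quotient-≤-255 domain for K=10.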
- mod16n: HL%K→A (narrowing modulo) (M)
- ABS(A): 6 ops, 24T — data/branchless_lib.json (done day 3)
- MIN/MAX(A,B): 8 ops, 32T — SBC A,A + bitwise select
- CMOV CY?B:C: 6 ops, 24T
- sign8: 9 ops, 43T — data/sign_sat_ops.json
- sat_add8: 4 ops, 16T — the OR C masterpiece
- sat_sub8: 5 ops, 20T
- sat_add8_signed: 7 ops (branch), needs branchless variant? (S)
- clamp8(A, lo, hi): A clamped to [lo,hi] range (S)
- Compose: max(lo, min(hi, A))? Or dedicated sequence.
- bit_reverse8: reverse bits of A (S)
- GPU search or manual via nibble-swap + lookup
- popcount8: count set bits of A → A (S)
- Classic: A = A - ((A>>1) & 0x55); etc. Adapt to Z80.
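The classic SWAR chain, narrowed to 8 bits, is worth writing out before adapting it to Z80 (where the shifts and masks have to be synthesized from SRL/AND sequences):

```python
def popcount8(a: int) -> int:
    """8-bit SWAR popcount: pair sums, nibble sums, then fold."""
    a = a - ((a >> 1) & 0x55)            # sums of adjacent bit pairs
    a = (a & 0x33) + ((a >> 2) & 0x33)   # sums within each nibble
    return (a + (a >> 4)) & 0x0F         # fold the two nibbles
```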
- approx sin8 / cos8: polynomial, allow ±2 error (M)
- abs16: 11 ops, 44T — data/arith16_new.json
- neg16: 6 ops, 27T
- min16/max16: 5 ops, 41–46T (branch)
- sign16: 7 ops, 20–34T
- cmp16_zero: 2 ops, 8T
- sat_add16 / sat_sub16: unsigned, 16-bit (S)
- mul16×16→32: HL×DE→DEHL, needed for mul16c (L)
- Schoolbook: 4 partial products, 3 additions. ~200T estimate.
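The schoolbook plan can be modeled directly: split HL and DE into bytes, form the four 8×8 partial products, and combine them with three shifted additions (a model of the decomposition, not the Z80 code):

```python
def mul16x16(hl: int, de: int) -> int:
    """HL*DE -> 32-bit via 4 partial products and 3 additions."""
    h, l = hl >> 8, hl & 0xFF
    d, e = de >> 8, de & 0xFF
    return ((h * d) << 16) + ((h * e) << 8) + ((l * d) << 8) + (l * e)
```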
- u32 library: 13 operations — data/u32_ops.json (DEHL convention)
- SHL32(34T), SHR32(32T), SAR32(32T), ADD32(54T), SUB32(58T), NEG32(57T), CMP32_ZERO(16T), ZEXT/SEXT, XOR32(100T), AND32(100T), ROTR32(32–40T)
- SHA-256 round decomposition — data/sha256_round.json
- ~2570T/round, 64 rounds + message schedule = ~202K T = 58ms/block @3.5MHz
- [-] Multi-convention u32: DEHL is suboptimal, HLIX and HLH'L' are faster (M)
- HLIX (H:L:IXH:IXL): SHL=30T, ADD=30T — best for fused ops (×10+A atoi)
- HLH'L' (H:L:H':L'): SHL=30T, save=4T(EXX!) — best for pure arithmetic (×K)
- DEHL (D:E:H:L): SHL=34T, ADD=54T — current, worst, but simplest
- HLH'L' wins 14/19 ×K benchmarks, saves ~74T avg vs DEHL
- Need: u32_ops.json v2 with all 3 conventions per operation
- [-] u32 ×K constant multiply: search optimal per-convention (M)
- Pool: SHL32+SAVE+ADD32 composites per convention
- Key targets: ×3, ×5, ×7, ×9, ×10, ×100, ×1000
- ×10 hand-written HLIX = 178T (with +A fused!)
- u32 ÷K constant divide: multiply-and-shift with 32-bit magic (L)
- ÷10 = ×0x1999999A>>32 (~426T via schoolbook u32×u32)
- Alternative: repeated subtraction of powers of 10 (~306T for itoa)
- Power-of-2: just SHR chain (28T × log2(K))
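One caveat worth flagging before committing to the 32-bit magic: checking ×0x1999999A>>32 in Python, the ceil-rounded constant is exact only up to about 2^30; for larger x with x ≡ 9 (mod 10) it overshoots by one. The usual full-range fix is the 33-bit magic (×0xCCCCCCCD, shift 35), at the cost of a wider product. A sketch of both:

```python
MAGIC32 = 0x1999999A   # ceil(2**32 / 10) -- exact only for small x
MAGIC33 = 0xCCCCCCCD   # ceil(2**35 / 10) -- exact for all 32-bit x

def div10_32(x: int) -> int:
    return (x * MAGIC32) >> 32

def div10_33(x: int) -> int:
    return (x * MAGIC33) >> 35

assert div10_32(2**30 - 1) == (2**30 - 1) // 10       # exact in range
assert div10_32(1073741839) == 1073741839 // 10 + 1   # off by one above ~2^30
assert all(div10_33(x) == x // 10
           for x in (0, 9, 10, 1073741839, 2**32 - 1))
```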
- OR32, NOT32: trivial but not yet in table (S)
- CMP32_unsigned: compare DEHL vs IX:IY (S)
- SHA-256 Z80 implementation: full working .asm (XL)
- Have decomposition; need actual register allocation + memory layout
- HLH'L' convention could hold 2 working vars without RAM spills
- 4×32 vars possible: DEHL + BCIX + D'E'H'L' + B'C'IY
- len-2→len-1: 739K rules, complete — data/rules_l2.json (done)
- [-] len-3→len-1: 37M rules found (~0.05% of 74.9B search space) (XL)
- GPU search running intermittently. Full sweep estimated weeks on 2× RTX 4060 Ti.
- Strategy: prioritize high-value target op ranges.
- len-3→len-2: not yet started. Produces size-reducing rewrites. (XL)
- len-4→len-2: STOKE only. Stochastic, not exhaustive. (L)
- Dead-flags rules: rules valid when specific flags are dead post-sequence (M)
- Infrastructure exists (pkg/search/verifier.go). Need dedicated search run.
- Dependency DAG: build RAW/WAW/WAR graph from opReads/opWrites (M)
- Primitives exist in pkg/search/pruner.go.
- Pattern matcher with reorder: scan basic block, match rules across independent instructions (M)
- Multi-pass fixpoint: repeat until no more rules apply (S)
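A minimal sketch of the fixpoint driver, assuming a hypothetical rule format of (pattern tuple → replacement tuple) over textual instructions; termination relies on rules that strictly shrink the block:

```python
def optimize_to_fixpoint(block, rules):
    """Repeat single-pass left-to-right rewriting until no rule fires."""
    changed = True
    while changed:
        changed = False
        out = []
        i = 0
        while i < len(block):
            for pat, repl in rules:
                if tuple(block[i:i + len(pat)]) == pat:
                    out.extend(repl)      # emit the replacement
                    i += len(pat)         # skip the matched window
                    changed = True
                    break
            else:
                out.append(block[i])      # no rule fired here
                i += 1
        block = out
    return block
```

Repeated passes matter because one rewrite can expose another match, e.g. two adjacent reductions whose outputs then cancel on the next pass.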
- Integration with MinZ: basic-block input/output format, JSON API (M)
- Approximate search: 13-op pool, 15 targets simultaneously (M)
- Day 2 prototype worked. Needs cleanup for production use.
- Focused search cores: AF-only, HL+M-only for deeper search (L)
- 83.6M shapes (≤6v): enumerated, enriched, compressed (78MB)
- 37.6M feasible: each with optimal assignment + 15 metrics
- O(1) lookup: signature = (interference_shape, operation_bag) → hash
- Enriched tables: 43% lack A, 21% lack HL, smart CALL save 17T avg
- IX-expanded 5v: data/ix_expanded_5v.bin (60.9M entries, 79.2% feasible, 117s GPU)
- Merged IX+5v: data/merged_ix_5v.bin (60.9M entries, 79.8% feasible, 382MB)
- OFB sidecars: data/enriched_{4v,5v,6v_dense}.ofb — 32-bit feasibility bitmask per entry
- 15 bits: ALU, ptr ops, mul8-safe, DJNZ, EXX bridge, HLH'L' u32, ADC/SBC src validity
- 5v: 11.76M feasible; 6v dense: 25.77M feasible
- Level 1: Cut vertex decomposition (free split, 87%)
- Level 2: Enriched table O(1) lookup (37.6M entries, 79%)
- Level 3: EXX 2-coloring (7–12v, 70% bipartite)
- Level 4: GPU partition optimizer (≤18v exhaustive, <2 min)
- Level 5: Z3 fallback (>18v, <0.5%)
- [-] Backtracking solver (cmd/regalloc-enum/): alternative to Z3 for 7–15v (M)
- Basic implementation done. Needs benchmarking vs Z3.
- 7-variable table: ~500M shapes estimated, 4–12h compute (L)
- Feasibility: ~5% at 7v (phase cliff). Worth it for O(1) coverage.
- MinZ integration testing: run pipeline on full 820-fn corpus (M)
- 246 unique signatures tested. Need end-to-end with codegen.
- Spill cost model: when allocation fails, estimate spill cost (M)
- Currently: binary feasible/infeasible. Need: quantify spill penalty.
- Dual-layer evolutionary — cuda/prng_hybrid_gpu.cu
- 5 layers (3 additive OR + 2 subtractive AND NOT), island model
- Best: cat 4.9%, skull 14.7%, Einstein 15.1%
- 557K img/s dual-layer on RTX 4060 Ti
- Segmented hierarchical LFSR — cuda/prng_segmented_search.cu
- 6 levels, brute-force 65536 seeds per segment
- Best: Che 15.0% (1194 bytes), 31.2% (170 bytes fits 256b intro)
- BB Introspec port — cuda/bb_search.cu
- 24-bit Galois LFSR, 66 layers, 3 weighted masks
- 0.9s per s0 value, full sweep in ~4 minutes
- [-] che_intro.asm — 351 bytes, builds with mza, LFSR mismatch produces noise (M)
- Root cause: CUDA search uses per-layer init, Z80 code uses fixed init
- Fix: re-run CUDA search matching exact Z80 LFSR parameters
- Or: rewrite Z80 code to match CUDA LFSR conventions
- 256-byte intro: Che or skull, minimal LFSR + segmented seeds (M)
- 85 seeds × 2 bytes = 170 bytes data + ~80 bytes code = fits!
- 512-byte intro: Che with 6 levels, LFSR playback (M)
- 1K intro: dual-layer with evolved genome playback (L)
- Multi-pass BB refinement: re-optimize early layers after first pass (M)
- Introspec's secret sauce. Each re-pass reduces error monotonically.
- Better segmented search: levels 6–7, 32-bit seeds, multiple polynomials (M)
- Goal: <10% error from 256 bytes
- Face-region masks: weighted fitness for eyes/mouth like Introspec (S)
- New targets: Marilyn Monroe, Darth Vader, Einstein tongue (S)
- BB + segmented hybrid: BB for base, segmented for detail (L)
- z80_search_v2.cu: 3-stage pipeline (QC→Mid→Exhaustive), dual-GPU
- z80_regalloc.cu: GPU allocator + CPU backtracking fallback (constrained enumeration fix: 97× speedup)
- z80_mulopt_fast.cu: 14-op constant multiply (38× faster)
- z80_divmod_fast.cu: 14-op division/modulo search
- z80_mulopt16.cu: 16-bit multiply search (A×K→HL)
- z80_mulopt16c.cu: HL×K→HL search, 12-op pool (new day 7)
- z80_common.h: shared executor, flag tables, test vectors
- pkg/gpugen/: ISA DSL → CUDA, Metal, OpenCL, Vulkan
- 3 ISA definitions: Z80 (394 ops), 6502, SM83 (Game Boy)
- Cross-verified: 5 platforms, 4 APIs, 3 vendors, zero discrepancies
- Unified search orchestrator: coordinate multi-GPU multi-kernel searches (L)
- Currently: manual per-kernel scripts. Need: single coordinator.
- Remote GPU dispatch: ssh+scp to i5/i3, launch, collect results (M)
- Scripts exist (cuda/run_mulopt_remote.sh). Need: proper pipeline.
- Vulkan compute on AMD: RX 580 via gpugen Vulkan backend (M)
- ROCm broken for gfx803. Vulkan works. Need: Vulkan search driver.
- pkg/mulopt/: Emit8(k), Emit16(k), Lookup8/16
- pkg/regalloc/: LoadBinary(path), IndexOf(shape), Lookup(idx)
- pkg/peephole/: Lookup(source) top500, LoadRules(path) full 739K
- pkg/gpugen/: ISA DSL for multi-platform code generation
- cmd/enrich-ofb/: Emit OFB sidecar — built and validated (day 7)
- Add to Go package exports for use by regalloc pipeline
- pkg/mulopt/mul16c_table.go: Emit16c(k) HL×K→HL — pending mul16c JSON completion
- div8 inline expansion: MinZ codegen wiring for JP __div8 → inline (S)
- Pipeline issue: peephole transforms JP before div8 check. Fix: expansion before optimization.
- Clobber-aware selection: pick mul/div variant by register pressure (M)
- Current: always use cheapest. Better: pick by what's live.
- Pareto front export: (tstates, bytes, clobber_mask) triples per operation (S)
- MinZ can then pick optimal variant per call site.
- Batch evaluation API: JSON-in, JSON-out for testing entire corpus (M)
- [-] Register allocation paper — docs/paper_seed_superopt.md (L)
- 8 sections drafted. Needs: image search chapter, Introspec analysis.
- [-] Book outline — docs/book_outline.md, 19 chapters (XL)
- Appendices K (MinZ), L (FP), M (BCD), N (LUT), O (meta) in progress.
- Phase diagram publication: feasibility cliff 95.9%→0.9% (S)
- Data complete. Need: LaTeX figure + 2-page writeup.
- 21-instruction universal pool paper: 2.7% of ISA generates ALL arithmetic (S)
- Dual-layer image search writeup: subtractive carving insight (M)
- CLAUDE.md: up to date with all packages and kernels
- README.md: v1.0.0 release notes, architecture, results
- Gallery: media/prng_images/README.md (29 experiments, 3 methods)
- Update docs/NEXT.md: replace WebGPU section with CUDA v2 reality (S)
- API documentation: godoc for pkg/mulopt, pkg/regalloc, pkg/peephole (M)
- Update README.md results section: div8 254/254, sign/sat ops, u32, SHA-256 (S)
- Archive dead code: WebGPU shader (abandoned), CUDA v1 search (superseded) (S)
- Consolidate data files: merge branchless_lib + sign_sat_ops + arith16_new into single arith8.json / arith16.json (S)
- Test coverage: pkg/mulopt, pkg/regalloc need integration tests (M)
- CI pipeline: GitHub Actions for Go build + test (S)
- Release v1.1.0: div8, mod8, divmod8, sign/sat, u32 ops, SHA-256, arith16 (S)
| Priority | Item | Effort | Blocker for |
|---|---|---|---|
| P0 | | | |
| P0 | | | |
| P0 | | | |
| P0 | | | |
| P0 | | | |
| P1 | div8 inline expansion fix | S | MinZ codegen pipeline |
| P1 | Z80 intro LFSR alignment | M | Working demo |
| P1 | mul16c (HL×K→HL) | M | MinZ u16 multiply |
| P2 | Multi-pass BB refinement | M | Better image quality |
| P2 | div16n/mod16n | M | MinZ u16 division |
| P2 | len-3 peephole search | XL | Compiler quality |
| P3 | Reordering optimizer | L | Real-world rule application |
| P3 | 7-variable regalloc table | L | O(1) coverage expansion |
| P3 | 256-byte ZX Spectrum intro | M | Demoscene deliverable |
| P4 | SHA-256 full Z80 .asm | XL | Proof-of-concept |
| P4 | approx sin/cos, popcount | M | MinZ math library |
| P4 | Segmented search <10% | M | Research paper |
| P5 | Paper / book completion | XL | Publication |
| P5 | Vulkan compute on AMD | M | Multi-vendor story |
| Date | Milestone |
|---|---|
| 2026-03-26 | v1.0.0 release, 500+ arithmetic sequences, article published |
| 2026-03-26 | 83.6M regalloc shapes, 97.7% infeasibility proof |
| 2026-03-26 | ISA DSL gpugen: 4 backends, 3 ISAs, cross-verified |
| 2026-03-27 | FP16 library, BCD arithmetic, multi-target search |
| 2026-03-27 | 21-instruction universal pool thesis |
| 2026-03-28 | 37.6M enriched shapes, O(1) regalloc lookup |
| 2026-03-28 | Branchless library: ABS/MIN/MAX/CMOV, gray_decode EXACT |
| 2026-03-28 | Z flag write-only proof, SBC A,A trick library |
| 2026-03-29 | Image search: 3 methods, cat 4.9%, Che 15%, Introspec port |
| 2026-03-29 | u32 library: 13 ops, SHL32/SHR32 proven optimal |
| 2026-03-29 | Segmented hierarchical LFSR (user's breakthrough idea) |
| 2026-03-29 | div8 254/254, mod8 254/254, divmod8 254/254 |
| 2026-03-29 | sign8, sat_add8 (16T!), sat_sub8 — all exhaustively verified |
| 2026-03-29 | abs16, neg16, min16, max16, sign16, cmp16_zero |
| 2026-03-29 | SHA-256 round decomposition: 58ms/block realistic estimate |
| 2026-03-29 | div8 v3: carry_compare trick, avg 79T (−49%), GPU-discovered |
| 2026-03-29 | 4× cross-verified: z80-optimizer + MinZ + MinZ-VIR + MinZ-ABAP |
| 2026-03-29 | MinZ-VIR integrated div8 IntrinsicTable (commit 8cfba219) |