From Z80 to Universal Computation Chains
- Compilers make suboptimal choices. Can we prove what's optimal?
- The key insight: Z80's 11-byte state is GPU-friendly
- 739K proven peephole rules in minutes, not months
- QuickCheck (8 test vectors, rejects 99.99% of candidates)
- MidCheck (32 vectors, BIT/SET/RES targeting)
- ExhaustiveCheck (full sweep, shared-memory early termination)
- Performance: 30× faster than CPU, 743K rules verified
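
The three-stage filter above can be sketched in a few lines. This is a minimal illustration, not the project's GPU kernel: the two-register 8-bit state, the concrete test vectors, and the names `QUICK`, `MID`, `full_sweep`, and `staged_equivalent` are all assumptions made for the example.

```python
from itertools import product

# Hypothetical toy state: two 8-bit registers (a, b), not the full Z80 state.
QUICK = [(0x00, 0x00), (0xFF, 0xFF), (0x01, 0xFF), (0x80, 0x80),
         (0x55, 0xAA), (0xAA, 0x55), (0x0F, 0xF0), (0x7F, 0x01)]            # 8 vectors
MID = QUICK + [((37 * i) & 0xFF, (101 * i + 13) & 0xFF) for i in range(24)]  # 32 vectors


def full_sweep():
    """Lazily yield every (a, b) pair so the sweep can stop at the first mismatch."""
    return product(range(256), repeat=2)


def staged_equivalent(target, candidate):
    """Return (equivalent, stage); stage names the stage that decided the result."""
    stages = (("quick", QUICK), ("mid", MID), ("exhaustive", full_sweep()))
    for stage, vectors in stages:
        for a, b in vectors:
            if target(a, b) != candidate(a, b):
                return False, stage          # early termination on a counterexample
    return True, "exhaustive"


add = lambda a, b: (a + b) & 0xFF
print(staged_equivalent(add, lambda a, b: (b + a) & 0xFF))   # (True, 'exhaustive')
print(staged_equivalent(add, lambda a, b: a ^ b))            # (False, 'quick')
```

Only a candidate that survives the cheap stages pays for the full sweep, which is why most of the work is done by the first few vectors.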
- Start with 21 ops, analyze which appear in solutions
- 7 ops never used → remove → 38× speedup
- Empirical pool reduction as a general technique
- The lesson: most instructions are useless for most computations
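
A rough sketch of empirical pool reduction follows. The op names, the sample solutions, and the cost model (candidates grow as pool size to the power of sequence length) are assumptions for illustration; the source does not state the search depth or how the speedup was measured.

```python
from collections import Counter


def reduce_pool(pool, solutions):
    """Keep only the ops that occur in at least one known solution."""
    used = Counter(op for seq in solutions for op in seq)
    return [op for op in pool if used[op] > 0]


def naive_speedup(full_size, reduced_size, depth):
    """Ratio of candidate counts, assuming cost ~ pool_size ** depth."""
    return (full_size / reduced_size) ** depth


# Hypothetical op names and solutions, for illustration only.
pool = ["ADD A,A", "ADD A,B", "SUB B", "NEG", "RLCA", "CPL"]
solutions = [["ADD A,A", "ADD A,B"], ["NEG"], ["ADD A,A", "ADD A,A", "SUB B"]]

print(reduce_pool(pool, solutions))        # ['ADD A,A', 'ADD A,B', 'SUB B', 'NEG']
print(round(naive_speedup(21, 14, 9), 1))  # 38.4
```

Under this toy model, dropping 7 of 21 ops at a hypothetical depth of 9 shrinks the candidate count by about 38×. The point is that the gain is exponential in sequence length, so even a modest pool cut pays off quickly.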
- 254/254 u8 constants solved (8× faster than shift-and-add)
- 254/254 u16 constants solved in 30 seconds (3-op basis!)
- Pool reduction: 23→3 ops for u16 = 13,600× speedup
- NEG trick: ×255 = 1 instruction (8T)
- Prefix sharing: 51% code compression, multiple entry points
- The reciprocal trick: n/K = (n × M) >> S
- Abstract chains guide GPU search: 6-op focused pool
- 118/120 divisors found in 11 seconds each
- div10 = 124T matches Hacker's Delight (found automatically!)
- Guided brute-force: abstract oracle → ISA-specific materialization
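
The reciprocal trick can be checked exhaustively for small word sizes. The sketch below is an assumption-laden illustration rather than the project's search: `find_reciprocal` is a hypothetical helper that tries M = ceil(2^S / K) for increasing S and verifies every unsigned 8-bit input.

```python
def find_reciprocal(k, bits=8, max_shift=24):
    """Return (M, S) with the smallest S such that, using M = ceil(2**S / k),
    (n * M) >> S == n // k for every unsigned `bits`-wide n. None if not found."""
    for s in range(max_shift + 1):
        m = -(-(1 << s) // k)                 # ceil(2**s / k)
        if all((n * m) >> s == n // k for n in range(1 << bits)):
            return m, s
    return None


m, s = find_reciprocal(10)
print(m, s, (200 * m) >> s)   # 205 11 20
```

For K = 10 on 8-bit inputs this yields M = 205, S = 11. The intermediate product needs 16 bits (255 × 205 = 52275), which is the part an ISA-specific search then has to realize with actual instructions.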
- 15 branchless idioms found: bool, abs, sign, not, lsb, ...
- ABS in 6 insts branchless: carry-to-mask trick
- Sign-extend in 3 insts (12T): ADC overflow → SBC mask
- CPL: the instruction we forgot (complement in 1 inst vs 2)
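
The carry-to-mask idea behind the branchless ABS can be modeled on plain 8-bit integers. This sketch shows only the arithmetic, not the 6-instruction Z80 sequence from the search; `abs8_branchless` is a hypothetical name, and the mapping to `SBC A,A` in the comments is the usual way to turn carry into a 0x00/0xFF mask.

```python
def abs8_branchless(x):
    """Absolute value of an 8-bit two's-complement byte, without branching."""
    carry = (x >> 7) & 1            # sign bit, as if shifted into the carry flag
    mask = (-carry) & 0xFF          # 0x00 or 0xFF, like SBC A,A after the shift
    return ((x ^ mask) - mask) & 0xFF   # conditional negate: (x XOR mask) - mask


print(abs8_branchless(0xFB))   # 5   (0xFB is -5)
print(abs8_branchless(0x07))   # 7
```

As with any two's-complement ABS, the input 0x80 (-128) maps to itself because +128 is not representable in 8 signed bits.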
- 83.6 million provably optimal allocations
- Feasibility phase transition: 96% feasible at 2v → 1% at 6v
- The Z80 register file "fills up" — a mathematical cliff
- 99.5% of random graphs decompose classically
- But compiler-generated graphs are denser (53.7% have treewidth ≥ 4)
- The honest result: theory doesn't match practice
- Five-level pipeline: table → composition → GPU → backtrack → Z3
- ZSQL: 31 functions, 5 profitable merges (210T saved)
- Island decomposition for 28v-37v functions
- Partition optimizer: bottom-up DP on call graph
- The merge decision: when CALL/RET overhead exceeds merge cost
- CPU fallback for GPU-intractable problems (>5T search space)
- Pattern-aware location masks: 1000-4000× pruning
- Constraint propagation + forward checking + most-constrained-first
- Why it works: sparse interference = most assignments infeasible early
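
The combination of forward checking and most-constrained-first ordering can be sketched as a small backtracking allocator. This is a generic illustration under stated assumptions, not the project's CPU fallback: `allocate`, the dict-based inputs, and the register names in the example are hypothetical, and the pattern-aware location masks are represented only as per-variable domains.

```python
def allocate(interference, domains):
    """Assign one location per variable so interfering variables never share one.

    interference: dict var -> set of interfering vars (assumed symmetric)
    domains:      dict var -> set of allowed locations (the "location mask")
    Returns a dict var -> location, or None if no assignment exists.
    """
    assignment = {}

    def solve(doms):
        todo = [v for v in doms if v not in assignment]
        if not todo:
            return True
        var = min(todo, key=lambda v: len(doms[v]))   # most-constrained-first
        for loc in sorted(doms[var]):
            pruned, dead_end = {}, False
            for n in interference[var]:
                if n in assignment:
                    continue
                left = doms[n] - {loc}                # forward checking
                if not left:
                    dead_end = True                   # a neighbour has no option left
                    break
                pruned[n] = left
            if dead_end:
                continue
            assignment[var] = loc
            if solve({**doms, **pruned}):
                return True
            del assignment[var]                       # backtrack
        return False

    return dict(assignment) if solve(domains) else None


triangle = {"x": {"y", "z"}, "y": {"x", "z"}, "z": {"x", "y"}}
print(allocate(triangle, {v: {"A", "B", "C"} for v in triangle}))
print(allocate(triangle, {v: {"A", "B"} for v in triangle}))   # None
```

Infeasible branches are cut as soon as any neighbour's domain empties, which matches the observation above that most assignments fail early.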
- ISA-independent: {dbl, add, sub, save, neg, shr}
- One search → materialize to Z80, 6502, RISC-V, ARM
- Modular arithmetic: NEG in chains vs NEG on hardware
- 254/254 multiply chains in 8 seconds on CPU
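
An abstract chain search over the ISA-independent ops can be sketched as a breadth-first search on coefficients modulo 256. The state model (an accumulator plus one saved value), the start state, and the function names are assumptions for this example. `shr` is left out because it is not linear in the coefficient, so this toy only covers multiplication.

```python
from collections import deque

# State is (acc, saved): both are coefficients of x, taken modulo 256.
OPS = {
    "dbl":  lambda a, s: ((2 * a) & 0xFF, s),
    "add":  lambda a, s: ((a + s) & 0xFF, s),
    "sub":  lambda a, s: ((a - s) & 0xFF, s),
    "save": lambda a, s: (a, a),
    "neg":  lambda a, s: ((-a) & 0xFF, s),
}


def shortest_chains():
    """Map each reachable multiplier to one shortest op chain producing it."""
    start = (1, 0)                       # acc holds 1*x, nothing saved yet
    seen = {start: ()}
    best = {1: ()}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for name, op in OPS.items():
            nxt = op(*state)
            if nxt not in seen:
                seen[nxt] = seen[state] + (name,)
                best.setdefault(nxt[0], seen[nxt])   # first hit in BFS is shortest
                queue.append(nxt)
    return best


def run_chain(chain):
    """Replay a chain and return the resulting multiplier."""
    state = (1, 0)
    for name in chain:
        state = OPS[name](*state)
    return state[0]


chains = shortest_chains()
print(chains[255])    # ('neg',)
print(chains[10], run_chain(chains[10]))
```

Because the arithmetic is modular, ×255 collapses to a single `neg`, the same effect as the NEG trick on hardware. Materializing a chain for a given ISA is a separate step that this sketch does not attempt.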
- Abstract chain predicts depth and structure
- GPU searches only the ISA-specific materialization space
- 6 ops instead of 37 = millions of times faster
- Division: abstract says mul(M)+shr(S), GPU finds exact Z80 sequence
- Shortest ≠ best for a compiler
- Pareto-optimal solutions: multiple sequences per constant
- B-preserving vs B-clobbering: compiler picks by liveness
- 14 B-safe multiplies, 150 B-clobbering (same lookup API)
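
Choosing among Pareto-optimal sequences by liveness could look like the sketch below. The `Variant` record, the `pareto` and `pick` helpers, and the T-state figures are hypothetical and chosen only to illustrate the selection rule; they are not entries from the shipped tables or the actual lookup API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    name: str
    tstates: int
    clobbers: frozenset      # registers destroyed besides the result register


def pareto(variants):
    """Drop variants that another variant beats on both cost and clobber set."""
    def dominated(v):
        return any(o is not v
                   and o.tstates <= v.tstates and o.clobbers <= v.clobbers
                   and (o.tstates < v.tstates or o.clobbers < v.clobbers)
                   for o in variants)
    return [v for v in variants if not dominated(v)]


def pick(variants, live):
    """Cheapest variant that leaves every live register intact, or None."""
    ok = [v for v in variants if not (v.clobbers & live)]
    return min(ok, key=lambda v: v.tstates) if ok else None


# Hypothetical entries for one constant; the T-state numbers are made up.
table = [
    Variant("fast, uses B as scratch", 30, frozenset({"B"})),
    Variant("A only", 41, frozenset()),
    Variant("slow and clobbers B", 50, frozenset({"B"})),
]
front = pareto(table)
print([v.name for v in front])                 # ['fast, uses B as scratch', 'A only']
print(pick(front, frozenset({"B"})).name)      # A only
print(pick(front, frozenset()).name)           # fast, uses B as scratch
```

Keeping the whole front rather than the single shortest sequence is what lets the compiler trade a few T-states for a preserved register when B is live.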
- The gpugen DSL: ISA → CUDA / Metal / OpenCL / Vulkan
- 250 lines per kernel, 95% shared logic
- Cross-vendor verification: NVIDIA × AMD × Apple = identical results
- RTX 4060 Ti (CUDA): the workhorse
- RTX 2070 (CUDA): the validator
- Radeon RX 580 (OpenCL + Vulkan): the AMD proof
- M2 MacBook Air (Metal): the Apple proof
- ROCm broken for gfx803 — but Mesa saves the day
- When ROCm fails: Mesa rusticl provides OpenCL 3.0
- No CUDA, no ROCm, no HIP — just Mesa and a GPU
- Verified: identical results to CUDA on same search
- The lesson: open drivers matter
- 83.6M regalloc entries (32 MB compressed, format spec + readers)
- 254 mul8 + 254 mul16 + 118 div + 15 idioms
- 739K peephole rules
- Everything in data/ with Python/Go reader examples
- Go packages: pkg/mulopt/, pkg/regalloc/, pkg/peephole/
- INCBIN for runtime tables, inline for compile-time
- The 5-level pipeline: table → composition → GPU → backtrack → Z3
- v0.23.0: 372 arithmetic sequences shipping in production
- 6502 constant multiplication (similar ISA, different pool)
- Meet-in-the-middle for deeper search
- 32-bit arithmetic via shadow registers (EXX)
- The self-hosting dream: Z80 allocating its own registers
- "Universal Computation Chains" as a standalone paper