From MIR2 to Z80 assembly in six stages, using ISLE term rewriting, Wave Function Collapse, and PBQP hints — with Z80-specific innovations no existing compiler has.
Z80 is one of the most constrained architectures still actively targeted:
- 7 general-purpose 8-bit registers (A, B, C, D, E, H, L)
- Accumulator-only ALU:
ADD A, r— destination is always A - Register pair asymmetry: 16-bit ADD only works on HL
- DD/FD prefix conflicts: IX-indexed ops can't use H or L
- No hardware multiply, no barrel shifter, no second accumulator
Traditional approaches (linear scan, graph coloring) struggle because Z80's constraints aren't just "these two vregs interfere" — they're structural (specific ops require specific registers) and hierarchical (IX halves survive calls but cost more than GPRs).
┌─────────────────────────────────────┐
│ MIR2 Function │
└──────────────┬──────────────────────┘
│
┌──────────────▼──────────────────────┐
│ ISLE (What to compute) │
│ S-expression rewrite rules │
│ "fuse two loads into load16_le" │
│ "mul(x, 4) → shl(x, 2)" │
└──────────────┬──────────────────────┘
│
┌──────────────▼──────────────────────┐
│ ISel (How to compute) │
│ Pattern table → LIR instructions │
│ "OpAdd width=8 → add_a_r" │
│ Insert setup moves for constraints │
└──────────────┬──────────────────────┘
│
┌────────────────────────▼────────────────────────────┐
│ │
┌──────▼──────┐ ┌────────▼────────┐
│ PBQP │──── hints (preferred locs) ──────>│ WFC │
│ Global │ │ Local │
│ Strategist │ │ Tactician │
│ │ │ │
│ Sees: full │ │ Sees: Z80- │
│ interference │ │ specific rules │
│ graph │ │ DD/FD prefix │
│ │ │ IXH/IXL call-safe│
│ Outputs: │ │ clobber pass │
│ vreg→phys │ │ │
└──────────────┘ │ Outputs: │
│ physical regs │
└────────┬────────┘
│
┌────────────────▼────────────────┐
│ Peephole + Emit │
│ LD r,r elimination │
│ Tail call: CALL+RET → JP │
│ Template expansion → Z80 asm │
└─────────────────────────────────┘
ISLE (Instruction Selection Lowering Expressions) is a term-rewriting engine inspired by Cranelift. It operates on MIROps as S-expression trees and applies prioritized rewrite rules:
;; FatFS ld_word: fuse two 8-bit loads + shift + or into one 16-bit LE load
(rule 20 (or (shl (load8 (add ?base (const 1))) (const 8))
(load8 ?base))
(load16_le ?base))
;; Strength reduction: multiply by power of 2 → shift
(rule 15 (mul ?x (const 2)) (add ?x ?x))
(rule 15 (mul ?x (const 4)) (shl ?x (const 2)))
;; Identity elimination
(rule (add ?x (const 0)) ?x)ISLE reduces instruction count before register allocation runs. For FatFS ld_word, it fuses 8 MIROps into 2.
PBQP (Partitioned Boolean Quadratic Programming) builds an interference graph of all virtual registers in a function. For each vreg, it knows which physical registers are possible and which other vregs conflict. It solves for globally optimal assignment.
Limitation: PBQP doesn't know about Z80-specific encoding constraints (DD/FD prefix conflicts, accumulator-only ALU, IXH/IXL availability).
Wave Function Collapse treats register allocation as constraint satisfaction. Each instruction operand starts as a superposition of possible physical locations (LocSet bitfield). Constraints propagate until each operand collapses to exactly one location.
Before propagation:
ADD dst={A,B,C,D,E,H,L} src0={A,B,C,D,E,H,L} src1={A,B,C,D,E,H,L}
After pattern constraint (add_a_r: dst=A, src0=A):
ADD dst={A} src0={A} src1={A,B,C,D,E,H,L}
After vreg consistency (src1 vreg was defined as param in C):
ADD dst={A} src0={A} src1={C}
Collapsed: ADD A, C
Five propagation passes:
| Pass | Direction | What it does |
|---|---|---|
| Forward | → | Def LocSet narrows use LocSet |
| Backward | ← | Use requirement narrows def LocSet |
| VReg consistency | ↔ | Same vreg = same physical everywhere |
| Clobber | ↓ | CALL clobbers → narrow to IXH/IXL/spill |
| Collapse | → | Sequential assignment with pickPreferred(hints) |
PBQP's allocation result is converted to hints — a map from vreg to preferred physical location. During WFC collapse, pickPreferred() checks if PBQP's choice is available in the current LocSet:
func (st *WFCState) pickPreferred(available LocSet, vreg int) LocSet {
if hinted, ok := st.Hints[vreg]; ok && available.Has(hinted) {
return Singleton(hinted) // use PBQP's choice
}
return pickFirst(available) // fallback
}Result: leaf functions produce output identical to production codegen.
Z80's undocumented index register halves (LD IXH, r via DD prefix) are not clobbered by function calls. This gives WFC 4 bytes of fast storage that survives CALL — cheaper than PUSH/POP (8T vs 11T) and doesn't touch the stack.
Resource hierarchy:
L0: Same register 0T (vreg consistency)
L1: GPR move (LD r, r') 4T (setup moves)
L2: IX half (LD IXH, r) 8T (call-safe!) ← NEW
L3: Shadow (EXX) 4T/batch (future)
L4: Memory (LD (nn), A) 13T (always safe)
WFC's clobberPass detects vregs live across calls and narrows their LocSet to callSafeLocs() = {IXH, IXL, IYH, IYL, spill0-3}.
No existing Z80 compiler uses IX/IY halves for register allocation.
Z80's accumulator ALU (ADD A, r overwrites A) creates a unique challenge: when multiple values need to be in A at different times, earlier values must be saved.
MIR2: x = a + b; y = a & b; return x + y
Naive codegen (WRONG):
ADD A, C ; A = a+b = x
AND C ; A = x&b ← wrong, should be a&b
With save-before-overwrite:
LD B, A ; save a to B
LD D, C ; save b to D
ADD A, D ; A = a+b = x
LD C, A ; save x to C
LD A, B ; restore a
AND D ; A = a&b = y
LD B, A ; save y to B
LD A, C ; restore x
ADD A, B ; A = x+y (CORRECT)
When the last instruction before RET is a CALL, replace with JP:
Before: LD A, 65; CALL putchar; RET (17T + 10T = 27T)
After: LD A, 65; JP putchar (10T, saves 17T)
Z80 has no hardware multiply. Non-constant multiplies are lowered to shared runtime routines:
__mul8: A = A * B (~80T, 8 iterations, shift-and-add)
__mul16: HL = HL * DE (~200T, 16 iterations)Emitted once per module, shared across all call sites. Tail-call optimized when possible.
Declarative block-level transformations applied after isel, before WFC:
┌──────────────────┐ loop_rotate ┌──────────────────┐
│ head: │ ─────────────> │ guard: │
│ branch cond │ │ branch cond │
│ @body, @exit │ │ @body, @exit │
├──────────────────┤ ├──────────────────┤
│ body: │ │ body: │
│ ...work... │ │ ...work... │
│ jump @head │ │ branch cond │
└──────────────────┘ │ @body, @exit │
└──────────────────┘
While loop Do-while (DJNZ-ready)
Six rules in priority order: DJNZ absorption (35) → loop rotation (30) → empty block elimination (20) → block merge (15) → branch-to-next (10) → branch inversion (5).
Four machine descriptors verify correctness across the constraint spectrum:
RISC32 RISC8 CISC Z80
32 regs 8 regs acc+pairs DD/FD prefix
symmetric symmetric asymmetric constrained
| | | |
+----+----+---------+---------+
|
Same observable result?
YES → lowering is correct
NO → bug in constraint handling
If a function passes on RISC32 but fails on Z80, the bug is in Z80-specific constraints. If it fails on all, the bug is in MIR2→LIR translation.
948/948 = 100% across all four frontends:
| Frontend | Functions | Rate | Notes |
|---|---|---|---|
| C89 | 720 (240 × 3 machines) | 100% | 36 source files |
| Nanz | 162 (54 × 3) | 100% | includes arena allocator |
| Lizp | 57 (19 × 3) | 100% | S-expression frontend |
| Lanz | 9 (3 × 3) | 100% | Lanz→HIR frontend |
Z80's EXX instruction swaps BC↔BC', DE↔DE', HL↔HL' in 4 T-states. For 3+ values live across a call:
EXX; CALL foo; EXX = 8T total
vs
PUSH BC; PUSH DE; PUSH HL; CALL foo; POP HL; POP DE; POP BC = 63T
WFC would model this as a group constraint: EXX regions are SESE (single-entry, single-exit), and the cost is amortized across all values saved.
This would be a novel contribution — no existing Z80 compiler uses shadow registers for register allocation.