You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Fusing multiple transformer layers into a single ANE eval eliminates XPC inter-process communication overhead. Data stays on-chip between layers instead of round-tripping through CPU.
Partial fusion (4-layer mega) achieves even higher ratios: 7.70× at D=768 (fewer remaining XPC round-trips dominate). No SRAM limit hit at any size (~162MB total weights for 12-layer D=768).
Key insight: ~160µs of each ANE eval is XPC overhead, not neural engine compute. Fusing N layers into one MIL program cuts N XPC round-trips to 1. Residual add ops and all intermediate computations happen inside the ANE — data never leaves the chip.
Backward pass is hardware-limited — CPU overhead is negligible (~4ms), ANE eval dominates
1. ACCUM_STEPS Optimization
The compiled ANE kernels are reused across ALL training steps within a batch. Increasing ACCUM_STEPS amortizes the ~10s compile cost with zero additional compile overhead.
Benchmarks (M1 Pro, stderr→file, same checkpoint)
ACCUM
Steps
ms/step
steps/s
Speedup
Compile overhead
10 (current)
50
208.6
0.66
1.0×
86.2%
50
200
175.4
2.56
3.86×
55.0%
100
200
169.4
3.15
4.74×
46.7%
500
279
168.7
~4.68*
~7.1×
~20%
*Compile time not captured for ACCUM=500 (partial batch); throughput range 4.45-4.80 based on compile times from other runs (10.3-15.3s range).
Note: The existing JSON telemetry only captures forward pass timing (~63ms). The backward pass (106ms = 63% of total) is completely un-instrumented. A CPU overhead probe confirmed the backward pass is hardware-limited (ANE eval + I/O dominate; malloc/memcpy/scalar ops add < 4ms total).
Cache warming effect
Steps into batch
ms/step
1 (cold)
~258ms
30
~175ms
136+
~169ms (converged)
ACCUM_STEPS=10 never reaches warm state (batch ends too early).
Recommended change
-#define ACCUM_STEPS 10+#define ACCUM_STEPS 100
Training quality note: ACCUM_STEPS=100 means gradients are averaged over 100 samples before weight update. Standard practice is to scale LR linearly with batch size (3e-4 → ~3e-3). For benchmarking with synthetic data this doesn't matter.
2. Compile Budget Myth — exec() Restart is Unnecessary
The code assumes ~72 compiles per process before ANE failure, triggering exec() restart. This is wrong on M1 Pro macOS 26.2.
Test: 312 compiles, no restart, no failure
ACCUM_STEPS=10, MAX_COMPILES=500, 50 steps (5 batches):
Batch 1: 72 compiles ← normal
Batch 2: 132 compiles ← still fine
Batch 3: 192 compiles ← normal
Batch 4: 252 compiles ← normal
Batch 5: 312 compiles ← still fine, no degradation
50 steps, 312 compiles, NO restart, NO failure
Also verified with ACCUM=50: 252 compiles across 4 batches, stable.
And standalone: 150 × 768-dim conv compiled+loaded → all pass.
Tested via standalone probe (probe_async_compile.m): compiling 768-dim kernels on background thread while evaluating on main thread:
Scenario
Eval avg
Eval max
Slowdown
Baseline (no compile)
0.337ms
0.514ms
—
During bg compile (20 kernels)
0.381ms
1.419ms
1.13×
ANE compile and eval can overlap with only 13% overhead. A double-buffered kernel pipeline could push throughput to ~5.24 steps/s (7.9× vs baseline).
Optimization stack
Optimization
steps/s
vs baseline
Status
Baseline (ACCUM=10)
0.66
1.0×
Current code
ACCUM=100
3.15
4.74×
Validated, single #define
+ async compile pipeline
5.24
7.9×
Validated feasible
Asymptote (no overhead)
5.93
9.0×
Theoretical max
5. CPU Overhead Probe — Backward Pass is Hardware-Limited
Measured all CPU-side operations in the backward pass:
Operation
Per step
Overhead
malloc+free (133 capture buffers)
144MB alloc'd
< 0.1ms
memcpy captures
144MB copied
3.4ms (inherent)
Scalar residual adds (24 loops)
4.7M elements
0.53ms (→0.22ms with vDSP)
IOSurface lock/unlock
228 pairs
0.14ms
Conclusion: The backward pass CPU code is well-optimized. The bottleneck is ANE eval latency (~67ms for 48 evals) and I/O conversion (~8-13ms for NEON fp16↔fp32). No practical CPU optimization would move the needle.
The current architecture executes 72 separate ANE evals per training step (24 forward + 48 backward). Each eval incurs ~160µs XPC overhead to the aned daemon, dwarfing the actual neural engine compute time (~3-270µs depending on model size). Between each layer, data round-trips: ANE→IOSurface→CPU (residual add, f16↔f32 conversion)→IOSurface→ANE.
The Solution
Fuse N transformer layers into a single MIL program. The add op for residual connections runs inside the ANE — intermediate activations never leave the chip:
Before: Layer0: CPU→XPC→ANE→XPC→CPU→ Layer1: CPU→XPC→ANE→XPC→CPU→ ... (N round-trips)
After: All N layers: CPU→XPC→ANE [N layers internally] →XPC→CPU (1 round-trip)
Results: FFN-Only Proxy
Config
1-layer
12-layer mega
12× separate
Speedup
Compile
D=64, H=128
183µs
247µs
2197µs
8.9×
209ms
D=128, H=256
160µs
344µs
1921µs
5.6×
145ms
D=288, H=768
218µs
705µs
2611µs
3.7×
244ms
D=288, H=768, SP=128
228µs
654µs
2740µs
4.2×
179ms
D=768, H=2048
429µs
1839µs
5153µs
2.8×
512ms
Results: Full Transformer Architecture (Definitive)
Fuses the COMPLETE transformer layer (RMSNorm + multi-head attention with SDPA + residual + RMSNorm + gated SiLU FFN + residual) into a single ANE eval.
Config
Mega-Kernel
Baseline (separate)
Speedup
Compile
stories15M (D=288, 6 layers)
728µs
3039µs (12 evals)
4.17×
1.0s
stories110M (D=768, 4L partial)
1978µs
~5076µs (8 evals)
2.57×
~1.0s
stories110M (D=768, 8L partial)
3515µs
~10150µs (16 evals)
2.89×
~2.5s
stories110M (D=768, 12 layers)
5081µs
15227µs (24 evals)
3.00×
4.2s
The full transformer speedup exceeds the FFN-only proxy (4.17× vs 3.7× at D=288) because attention ops pipeline efficiently on-chip while XPC overhead stays constant. Absolute savings: 10.15ms per forward pass at D=768.
Negative Results (Weight Mutability)
Weights MUST be const() in MIL — there is no escape from recompilation when weights change. Tested and failed:
Weights as function inputs: MIL parser rejects multi-input functions (desc=NULL)
Weight channel packing: conv requires const() weight; slice_by_size+reshape output rejected at parse time
File-based weight reload: ANE bakes weights at compile time; overwriting blob files has no effect
mil_gen_matmul: Dead code in ane_mil_gen.h — never called, would fail identically
Recompilation Strategy
Since weights must be const(), mega-kernels require recompilation on weight updates. With gradient accumulation of K steps:
Model
Kernel Type
Compile
Step
K to hide
stories15M (D=288)
FFN-only
244ms
~3ms
K≥82
stories15M (D=288)
Full transformer
1.0s
~3ms
K≥338
stories110M (D=768)
FFN-only
512ms
~8ms
K≥64
stories110M (D=768)
Full transformer
4.2s
~8ms
K≥520
A 4-layer partial fusion sweet spot exists: 7.70× speedup at D=768 with ~4× faster compile (~1s), needing only K≥125. A double-buffered approach (compile new mega-kernel on background thread while evaluating current one) makes this practical.
Training Impact
For stories15M (D=288, 6 layers, full transformer mega-kernel):
Full 1000-line findings document (security review, private framework exploration, ChainingRequest deep-dive, cross-validation with M5 results) available on request.
Summary
Systematic benchmarking on M1 Pro (macOS 26.2, PR #6 branch) with two categories of findings:
Architectural Breakthrough: Mega-Kernel Layer Fusion
Fusing multiple transformer layers into a single ANE eval eliminates XPC inter-process communication overhead. Data stays on-chip between layers instead of round-tripping through CPU.
FFN-Only Proxy (Simple Layers)
Full Transformer Architecture (Definitive Test)
Tested the complete forward pass fused into 1 eval: RMSNorm + QKV projections + SDPA attention (matmul, scale, causal mask, softmax) + output projection + residual + RMSNorm + gated SiLU FFN (W1, W3, W2) + residual.
Partial fusion (4-layer mega) achieves even higher ratios: 7.70× at D=768 (fewer remaining XPC round-trips dominate). No SRAM limit hit at any size (~162MB total weights for 12-layer D=768).
Key insight: ~160µs of each ANE eval is XPC overhead, not neural engine compute. Fusing N layers into one MIL program cuts N XPC round-trips to 1. Residual
addops and all intermediate computations happen inside the ANE — data never leaves the chip.Quick Wins: Configuration Optimizations
#definechange)exec()restart entirely (compile budget is a myth on macOS 26.2)1. ACCUM_STEPS Optimization
The compiled ANE kernels are reused across ALL training steps within a batch. Increasing
ACCUM_STEPSamortizes the ~10s compile cost with zero additional compile overhead.Benchmarks (M1 Pro, stderr→file, same checkpoint)
*Compile time not captured for ACCUM=500 (partial batch); throughput range 4.45-4.80 based on compile times from other runs (10.3-15.3s range).
Per-step breakdown (ACCUM=500, steady state)
Note: The existing JSON telemetry only captures forward pass timing (~63ms). The backward pass (106ms = 63% of total) is completely un-instrumented. A CPU overhead probe confirmed the backward pass is hardware-limited (ANE eval + I/O dominate; malloc/memcpy/scalar ops add < 4ms total).
Cache warming effect
ACCUM_STEPS=10 never reaches warm state (batch ends too early).
Recommended change
Training quality note: ACCUM_STEPS=100 means gradients are averaged over 100 samples before weight update. Standard practice is to scale LR linearly with batch size (3e-4 → ~3e-3). For benchmarking with synthetic data this doesn't matter.
2. Compile Budget Myth — exec() Restart is Unnecessary
The code assumes ~72 compiles per process before ANE failure, triggering
exec()restart. This is wrong on M1 Pro macOS 26.2.Test: 312 compiles, no restart, no failure
Also verified with ACCUM=50: 252 compiles across 4 batches, stable.
And standalone: 150 × 768-dim conv compiled+loaded → all pass.
Recommended change
This eliminates the
exec()checkpoint/restart cycle entirely, simplifying the training loop and avoiding ~1.7s checkpoint overhead per restart.aned Cache Discovery
The
aneddaemon caches compiled.hwxbinaries internally, persisting across processes:hexStringIdentifier=SHA-256(MIL) _ SHA-256(options) _ SHA-256(weights)compiledModelExistsreturns YES for cached kernels3. Terminal I/O Throughput Warning
Biggest benchmarking pitfall. Per-step JSON telemetry printed to terminal causes massive slowdown:
Likely caused by XPC communication with
anedbeing blocked by terminal I/O on the main thread. Always benchmark with2>/dev/nullor2>logfile.4. Async Compile+Eval Concurrency (Validated Feasible)
Tested via standalone probe (
probe_async_compile.m): compiling 768-dim kernels on background thread while evaluating on main thread:ANE compile and eval can overlap with only 13% overhead. A double-buffered kernel pipeline could push throughput to ~5.24 steps/s (7.9× vs baseline).
Optimization stack
5. CPU Overhead Probe — Backward Pass is Hardware-Limited
Measured all CPU-side operations in the backward pass:
Conclusion: The backward pass CPU code is well-optimized. The bottleneck is ANE eval latency (~67ms for 48 evals) and I/O conversion (~8-13ms for NEON fp16↔fp32). No practical CPU optimization would move the needle.
6. Mega-Kernel Layer Fusion — Architectural Breakthrough
The Problem
The current architecture executes 72 separate ANE evals per training step (24 forward + 48 backward). Each eval incurs ~160µs XPC overhead to the
aneddaemon, dwarfing the actual neural engine compute time (~3-270µs depending on model size). Between each layer, data round-trips: ANE→IOSurface→CPU (residual add, f16↔f32 conversion)→IOSurface→ANE.The Solution
Fuse N transformer layers into a single MIL program. The
addop for residual connections runs inside the ANE — intermediate activations never leave the chip:Results: FFN-Only Proxy
Results: Full Transformer Architecture (Definitive)
Fuses the COMPLETE transformer layer (RMSNorm + multi-head attention with SDPA + residual + RMSNorm + gated SiLU FFN + residual) into a single ANE eval.
The full transformer speedup exceeds the FFN-only proxy (4.17× vs 3.7× at D=288) because attention ops pipeline efficiently on-chip while XPC overhead stays constant. Absolute savings: 10.15ms per forward pass at D=768.
Negative Results (Weight Mutability)
Weights MUST be
const()in MIL — there is no escape from recompilation when weights change. Tested and failed:desc=NULL)convrequiresconst()weight;slice_by_size+reshapeoutput rejected at parse timemil_gen_matmul: Dead code inane_mil_gen.h— never called, would fail identicallyRecompilation Strategy
Since weights must be const(), mega-kernels require recompilation on weight updates. With gradient accumulation of K steps:
A 4-layer partial fusion sweet spot exists: 7.70× speedup at D=768 with ~4× faster compile (~1s), needing only K≥125. A double-buffered approach (compile new mega-kernel on background thread while evaluating current one) makes this practical.
Training Impact
For stories15M (D=288, 6 layers, full transformer mega-kernel):
For stories110M (D=768, 12 layers, full transformer mega-kernel):
Probe Files
probe_mega_scale.m— Scale test at toy dimensions (1→12 layers, 10.7× result)probe_mega_real_size.m— Scale test at real model dimensions, FFN-only (D=288, D=768)probe_full_mega.m— Definitive test: full transformer mega-kernel (RMSNorm + attention + FFN + residual)probe_ops_test.m— Systematic individual op testing (discovered blob format requirement)probe_mega_and_pack.m— Mega-kernel + weight packing attemptsprobe_paradigm_shift.m— Weight-as-input tests (all failed)Device / Environment
Reproduction
Full 1000-line findings document (security review, private framework exploration, ChainingRequest deep-dive, cross-validation with M5 results) available on request.