
Commit 050bc4f

maderix committed · 1 parent e986572

Add cross-generation ANE benchmark report from issue #3

Community-submitted results for M1 Pro/Max, M3 Pro, M4 Pro/Max, M5. Includes training performance, peak throughput, MIL compatibility matrix, and structured JSON data.

2 files changed: 241 additions & 0 deletions

File tree

benchmarks/ANE_BENCHMARK_REPORT.md

Lines changed: 135 additions & 0 deletions
# Apple Neural Engine — Cross-Generation Benchmark Report

Community-submitted benchmark data from [Issue #3](https://github.com/maderix/ANE/issues/3).
All results use Stories110M (12-layer transformer, 109M params, dim=768, seq=256).

## Training Performance (Static Pipeline)

```
Chip       ms/step   ANE ms   Compile/10  ANE TFLOPS  Util%    Contributor
─────────────────────────────────────────────────────────────────────────────────
M1 Pro     148-163   32-35    7.9-8.5s    0.57-0.63   3.6-4.0  @moriwang
M1 Max     143-167   35-45    ~7.1s       0.54-0.65   3.4-4.1  @andyg5000
M3 Ultra*  91        ~10      ~3.7s       0.88        5.6      (repo ref)
M4 Pro     69-73     8.9      ~3.5s       1.28        8.1      @srt54558
M4 Max     64        10.2     ~3.5s       1.45        9.2      @SethBurkart123
M5         101-120   9.1-9.8  3.2-3.4s    0.77-0.91   4.9-5.8  @GitBubble
```

*M3 Ultra = reference platform this project was developed on.
20+
21+
## Peak ANE Throughput (inmem_peak, 128x conv 512ch sp64)
22+
23+
```
24+
Chip TFLOPS Rated TOPS Utilization
25+
───────────────────────────────────────────────────
26+
M1 Pro FAIL 11 - (MIL compat issue)
27+
M1 Max FAIL 11 - (MIL compat issue)
28+
M3 Pro 9.98 15.8 63%
29+
M4 Pro 12.57 38 33%
30+
M4 Max 10.93 38 29%
31+
M5 12.17 ~19* 64%
32+
M5 (other) 12.44 ~19* 65%
33+
```
34+
35+
*M5 ANE TOPS not officially disclosed; ~19 TOPS estimated from measured peak.
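The utilization column is just measured peak over rated capability. A minimal sketch of that arithmetic (assuming rated TOPS and measured TFLOPS are directly comparable, i.e. both at the same precision):

```python
# Utilization = measured peak TFLOPS / rated ANE TOPS, as in the peak table above.
# Assumes the two figures use comparable precision (e.g. both FP16).
rated_tops = {"M3 Pro": 15.8, "M4 Pro": 38, "M4 Max": 38}
measured_tflops = {"M3 Pro": 9.98, "M4 Pro": 12.57, "M4 Max": 10.93}

for chip, peak in measured_tflops.items():
    util = 100 * peak / rated_tops[chip]
    print(f"{chip}: {util:.0f}%")  # → 63%, 33%, 29%
```

Running the same division backwards is how the M5 footnote arrives at ~19 TOPS: 12.17 TFLOPS at the observed 64% utilization implies roughly 19 TOPS of rated capability.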
## Comparative Chart

```
ANE Training Speed (ms/step, lower is better)
══════════════════════════════════════════════════════════════

M1 Pro    ████████████████████████████████████████░░░░  148-163 ms
M1 Max    ██████████████████████████████████████░░░░░░  143-167 ms
M3 Ultra  ██████████████████░░░░░░░░░░░░░░░░░░░░░░░░░░  91 ms
M4 Pro    ██████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░  69-73 ms
M4 Max    ████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░  64 ms
M5        ████████████████████████░░░░░░░░░░░░░░░░░░░░  101-120 ms

          0          50         100        150        200


Peak ANE Throughput (TFLOPS, higher is better)
══════════════════════════════════════════════════════════════

M1 Pro    FAIL (MIL compat)
M1 Max    FAIL (MIL compat)
M3 Pro    ████████████████████░░░░░░░░░░░░░░░░░░░░░░░░  9.98
M4 Pro    ████████████████████████████████░░░░░░░░░░░░  12.57
M4 Max    ██████████████████████░░░░░░░░░░░░░░░░░░░░░░  10.93
M5        █████████████████████████░░░░░░░░░░░░░░░░░░░  12.17

          0       3       6       9       12      15      18


ANE Sustained Throughput (TFLOPS, 5s window)
══════════════════════════════════════════════════════════════

M3 Pro    ██████████████████████████████████████████████  15.04 (95.2%)

          0       3       6       9       12      15      18
(Only M3 Pro submitted sustained benchmark)
```
75+
## Key Findings
76+
77+
### M1/M1 Pro/M1 Max
78+
- **Standalone benchmarks fail**`ane_mil_gen.h` single-blob weight format rejected
79+
- **Training works** via `stories_mil.h` (separate per-matrix weight blobs)
80+
- ANE compiler handles weight blobs differently from M4+
81+
- Training at 148-167 ms/step, ~0.6 TFLOPS
82+
83+
### M3 Pro
84+
- **Only ch=512 compiles** — 52 channel values tested (1-4096), only 512 accepted
85+
- Fixed 512-wide lane structure in SRAM tiling
86+
- **Peak: 16.77 TFLOPS** (106% of rated 15.8 TOPS) at 128x conv 512ch sp2048
87+
- **Sustained: 15.04 TFLOPS** over 5 seconds (95.2% utilization)
88+
- Spatial dimension is the key to peak throughput (sp64→sp2048 = 2x improvement)
89+
90+
### M4 Pro / M4 Max
91+
- Flexible channel support (256/384/512/768+)
92+
- M4 Pro: peak 12.57 TFLOPS, training at 72.5 ms/step
93+
- M4 Max: peak 10.93 TFLOPS, training at 64 ms/step (fastest overall)
94+
- `sram_probe` and `inmem_bench` fail on M4 Pro (same MIL compat issue)
95+
96+
### M5
97+
- Training works out of the box with existing `program(1.3)` MIL
98+
- Training speed 101-120 ms/step (slower than M4 Max, comparable to M3 Ultra)
99+
- Peak ANE throughput ~12.2-12.4 TFLOPS (similar to M4 Pro)
100+
- ANE appears to be same H16 family as M4
101+
- **M5 Pro/Max not yet benchmarked** — Fusion Architecture may change ANE behavior
102+
103+
### Cross-Generation MIL Compatibility
104+
105+
```
106+
Feature M1 M3 M4 M5
107+
─────────────────────────────────────────────────────────
108+
program(1.3) / ios18 PARTIAL YES YES YES
109+
Single-blob weights FAIL YES YES YES
110+
Per-matrix weight blobs YES YES YES YES
111+
Channel flexibility ? ch=512 FLEX FLEX
112+
BLOBFILE offset refs FAIL YES YES YES
113+
```
114+
115+
## macOS Compatibility Issues
116+
117+
- **macOS 26.x**`[MLModel compileModelAtURL:]` broken for standalone benchmarks
118+
(fixed in PR #27: switched to in-memory MIL compilation)
119+
- **macOS 15.x** — Works for all M-series with correct MIL format
120+
- M1 generation requires `stories_mil.h` path, not `ane_mil_gen.h`
121+
122+
## How to Contribute
123+
124+
Run on your hardware and post results to [Issue #3](https://github.com/maderix/ANE/issues/3):
125+
126+
```bash
127+
cd training && make train_large
128+
./train_large ane_stories110M_ckpt.bin 256 20 1e-4
129+
```
130+
131+
Include: chip model, macOS version, full output with JSON lines.
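If you want to pull just the machine-readable lines out of a captured run before posting, a simple filter works. This is a hedged sketch: the exact log format is an assumption, and the filter simply keeps any line that parses as a JSON object.

```python
import json

def extract_json_lines(log_text):
    """Return every log line that parses as a JSON object."""
    records = []
    for line in log_text.splitlines():
        line = line.strip()
        if line.startswith("{"):
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError:
                pass  # partial or non-JSON line; skip it
    return records

# Example on a fabricated two-line log excerpt (field names are illustrative):
demo = 'step 001 loss 2.31\n{"chip": "M5", "ms_per_step": 101}'
print(extract_json_lines(demo))  # → [{'chip': 'M5', 'ms_per_step': 101}]
```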
---

*Report compiled 2026-03-04 from community submissions.*
*Contributors: @SethBurkart123, @srt54558, @andyg5000, @moriwang, @D-Ogi, @GitBubble, @elijah-pelton*

benchmarks/community_results.json

Lines changed: 106 additions & 0 deletions
{
  "report_date": "2026-03-04",
  "source": "https://github.com/maderix/ANE/issues/3",
  "model": "Stories110M (12-layer transformer, 109M params)",
  "config": {"dim": 768, "hidden": 2048, "heads": 12, "seq": 256, "vocab": 32000, "layers": 12},
  "training_results": [
    {
      "chip": "M1 Pro",
      "cores": "10-core CPU",
      "ram_gb": 32,
      "macos": "15.0",
      "ms_per_step": [148, 163],
      "ane_ms": [32, 35],
      "compile_ms": [7900, 8500],
      "ane_tflops": [0.57, 0.63],
      "ane_util_pct": [3.6, 4.0],
      "benchmarks_pass": false,
      "notes": "Standalone benchmarks fail (MIL compat). Training works via stories_mil.h.",
      "contributor": "moriwang"
    },
    {
      "chip": "M1 Max",
      "cores": "10-core CPU",
      "ram_gb": 64,
      "macos": "15.6.1",
      "ms_per_step": [143, 167],
      "ane_ms": [35, 45],
      "compile_ms": [7100, 7100],
      "ane_tflops": [0.54, 0.65],
      "ane_util_pct": [3.4, 4.1],
      "benchmarks_pass": false,
      "notes": "Same MIL compat issue as M1 Pro.",
      "contributor": "andyg5000"
    },
    {
      "chip": "M3 Pro",
      "cores": "12-core CPU",
      "ram_gb": 36,
      "macos": "15.7.4",
      "peak_tflops": 16.77,
      "sustained_tflops": 15.04,
      "sustained_util_pct": 95.2,
      "channel_constraint": "ch=512 only",
      "notes": "Only ch=512 compiles. 52 values tested. Peak at 128x conv 512ch sp2048.",
      "contributor": "D-Ogi"
    },
    {
      "chip": "M4 Pro",
      "cores": "unknown",
      "ram_gb": null,
      "macos": null,
      "ms_per_step": [69, 73],
      "ane_ms": [8.9, 8.9],
      "compile_ms": [3465, 3465],
      "ane_tflops": [1.28, 1.28],
      "ane_util_pct": [8.1, 8.1],
      "peak_tflops_inmem": 12.57,
      "notes": "sram_probe and inmem_bench fail. inmem_peak and training work.",
      "contributor": "srt54558"
    },
    {
      "chip": "M4 Max",
      "cores": "unknown",
      "ram_gb": null,
      "macos": null,
      "ms_per_step": [64, 64],
      "ane_ms": [10.2, 10.2],
      "compile_ms": [3531, 3531],
      "ane_tflops": [1.45, 1.45],
      "ane_util_pct": [9.2, 9.2],
      "peak_tflops_inmem": 10.93,
      "notes": "Fastest training ms/step overall.",
      "contributor": "SethBurkart123"
    },
    {
      "chip": "M5",
      "cores": "10-core (4P+6E)",
      "ram_gb": 16,
      "macos": "26.3",
      "ms_per_step": [101, 120],
      "ane_ms": [9.1, 9.8],
      "compile_ms": [3200, 3400],
      "ane_tflops": [0.77, 0.91],
      "ane_util_pct": [4.9, 5.8],
      "peak_tflops_inmem": 12.44,
      "notes": "H16 ANE family (same as M4). Training works with existing program(1.3) MIL.",
      "contributor": "GitBubble"
    },
    {
      "chip": "M5",
      "cores": "unknown",
      "ram_gb": 32,
      "macos": "26.4",
      "peak_tflops_inmem": 12.17,
      "notes": "inmem_peak only, no training data submitted.",
      "contributor": "elijah-pelton"
    }
  ],
  "neural_engine_specs": {
    "M1": {"cores": 16, "rated_tops": 11},
    "M2": {"cores": 16, "rated_tops": 15.8},
    "M3": {"cores": 16, "rated_tops": 15.8},
    "M4": {"cores": 16, "rated_tops": 38},
    "M5": {"cores": 16, "rated_tops": null, "estimated_tops": 19}
  }
}
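A quick sanity check over the structured data: load the JSON and rank chips by the midpoint of their reported training range. The sketch below runs on an inline excerpt of the entries above; in a real checkout you would `json.load` the `benchmarks/community_results.json` file instead.

```python
import json

# Inline excerpt of benchmarks/community_results.json (training entries only).
data = json.loads("""
{"training_results": [
  {"chip": "M1 Pro", "ms_per_step": [148, 163]},
  {"chip": "M1 Max", "ms_per_step": [143, 167]},
  {"chip": "M4 Pro", "ms_per_step": [69, 73]},
  {"chip": "M4 Max", "ms_per_step": [64, 64]},
  {"chip": "M5",     "ms_per_step": [101, 120]}
]}
""")

# Rank by the midpoint of the reported [min, max] ms/step range (lower = faster).
# Entries without training data (e.g. inmem_peak-only submissions) are skipped.
ranked = sorted(
    (r for r in data["training_results"] if "ms_per_step" in r),
    key=lambda r: sum(r["ms_per_step"]) / 2,
)
for r in ranked:
    lo, hi = r["ms_per_step"]
    print(f'{r["chip"]}: {lo}-{hi} ms/step')
```

The ordering reproduces the report's conclusion: M4 Max first, M4 Pro second, with the M1 generation at the back.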
