@@ -166,6 +166,98 @@ No external dependencies. Uses only system frameworks + private ANE APIs resolve
This project uses Apple's private, undocumented APIs (`_ANEClient`, `_ANECompiler`, `_ANEInMemoryModelDescriptor`). These APIs are not covered by any public stability guarantee and may change or break with any macOS update. This is independent research into Apple Neural Engine architecture, using APIs discovered through runtime introspection for research and educational purposes under fair use and interoperability provisions (see *Sega v. Accolade*, 1992; DMCA §1201(f)). No Apple proprietary code or binaries are included in this repository. This project is not affiliated with or endorsed by Apple Inc. Use at your own risk.

## Hardware Characterization: Apple M5 (2026)

The M5 (Apple 10 family) introduces specific ANE behavioral constraints that differ from earlier M-series chips. This section documents the key findings from reverse-engineering efforts.

### Benchmark Methodology

**Hardware Configuration:**
- **Chip**: Apple M5 (base model, 16 NE cores)
- **macOS Version**: 26.3 (25D125) (Darwin 25.3.0)
- **Date Measured**: 2026-03-01
- **ANE Family**: H16 (same as M4)

**Measurement Approach:**
- Peak throughput was measured with 4096×4096 dynamic matmul operations using the [`m5_performance_suite.m`](training/m5_performance_suite.m) benchmark tool
- Weight update latency was measured as the time for a `memcpy` into the IOSurface plus one ANE evaluation
- All IOSurface buffers use 128-byte alignment (required for M5 ANE compatibility)
- Each measurement runs 1000 iterations after a 10-iteration warmup
- FLOPS are calculated as `2 × dim × dim` (multiply-add per output element)

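
The warmup-then-measure procedure and the FLOPS formula listed above can be sketched in portable C. This is a minimal sketch under stated assumptions, not the code in `m5_performance_suite.m`: the names `workload_fn`, `mean_seconds_per_iter`, `flops_per_eval`, `tflops`, and `count_call` are hypothetical, and the workload is passed in as a callback that stands in for one ANE evaluation.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical callback type standing in for one ANE evaluation. */
typedef void (*workload_fn)(void *ctx);

/* Wall-clock time in seconds (C11 timespec_get; not a monotonic clock). */
double now_seconds(void) {
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Run `warmup` untimed iterations, then time `iters` iterations and
 * return the mean number of seconds per timed iteration. */
double mean_seconds_per_iter(workload_fn fn, void *ctx, int warmup, int iters) {
    assert(iters > 0);
    for (int i = 0; i < warmup; i++) fn(ctx);
    double start = now_seconds();
    for (int i = 0; i < iters; i++) fn(ctx);
    return (now_seconds() - start) / (double)iters;
}

/* FLOP count per evaluation as stated in the methodology: 2 * dim * dim. */
double flops_per_eval(size_t dim) {
    return 2.0 * (double)dim * (double)dim;
}

/* Convert a FLOP count and an elapsed time in seconds into TFLOPS. */
double tflops(double flops, double seconds) {
    return flops / seconds / 1e12;
}

/* Example workload for illustration: counts how often it was invoked. */
void count_call(void *ctx) {
    (*(int *)ctx)++;
}
```

With the settings described above, a call would look like `mean_seconds_per_iter(fn, ctx, 10, 1000)`, and the result would feed `tflops(flops_per_eval(4096), seconds)`.
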
**Important Notes:**
- M5 Pro and M5 Max variants have **not yet been benchmarked**; results may differ
- The Fusion Architecture in Pro/Max models may change ANE behavior

### Key M5 ANE Constraints

| Constraint | Value | Notes |
|:---|:---|:---|
| **IOSurface Alignment** | 128 bytes | All input, output, and weight surfaces must be 128-byte aligned. Failure results in silent evaluation errors or compiler rejection. |
| **MIL Version** | program(1.5) | M5 is optimized for MIL 1.5 using static `BLOBFILE` weights. However, **any dynamic weight injection via input tensors must use `program(1.3)` and `<ios17>`** to bypass strict AST compiler validations. |
| **Max Dynamic Dimension** | 4096 × 4096 | Maximum dimension for dynamic weight tensors passed as inputs. |
| **Peak Throughput** | ~1.7 TFLOPS | Pure ANE compute for 4096-dim matmul operations (measured: 1.66-1.76 TFLOPS). |
| **Update Latency** | ~1.27 ms | CPU-to-IOSurface `memcpy` + ANE eval for weight updates at 4096 dims. |

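
One way to satisfy the 128-byte alignment constraint in the table is to round each surface's byte size up to the next multiple of 128. The sketch below shows that rounding in C. It assumes the constraint applies to the allocation size, which the table does not spell out, and `ANE_SURFACE_ALIGN`, `align_up`, and `aligned_tensor_bytes` are hypothetical names used for illustration, not the project's `ane_create_surface()` implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Alignment taken from the constraints table; the macro name is hypothetical. */
#define ANE_SURFACE_ALIGN ((size_t)128)

/* Round `bytes` up to the next multiple of `alignment`.
 * `alignment` must be a nonzero power of two. */
size_t align_up(size_t bytes, size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (bytes + alignment - 1) & ~(alignment - 1);
}

/* Byte size to request for a rows x cols tensor of `elem_size`-byte elements. */
size_t aligned_tensor_bytes(size_t rows, size_t cols, size_t elem_size) {
    return align_up(rows * cols * elem_size, ANE_SURFACE_ALIGN);
}
```

For example, a 3 × 5 tensor of 2-byte elements occupies 30 bytes and would be padded to 128, while a 4096 × 4096 tensor of 4-byte elements is already a multiple of 128 and is left unchanged.
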
### Dynamic Weight Injection

On M5, the traditional approach of baking weights into the compiled model (via `BLOBFILE`) does not support runtime updates, because the ANE snapshots weights into private memory at load time. The only viable path for real-time weight updates is:

**Treat weights as Input Tensors using the `matmul` operator.**

```objc
// MIL pattern for dynamic weights (M5 compatible)
```

- **Zero-copy weight swapping**: Update weights via `memcpy` into the input IOSurface
- **~22-94x faster updates** vs. recompile-and-load cycle (1.8 ms vs. 40-170 ms)
- **On-device training**: Foundation for gradient descent on ANE

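
The speedup implied by the latencies quoted above is the recompile-and-load time divided by the in-place update time. A minimal C sketch of that arithmetic follows; `update_speedup` is a hypothetical helper, and the inputs are the figures from the list above (1.8 ms per update, 40-170 ms per recompile-and-load cycle).

```c
#include <assert.h>

/* Ratio of a recompile-and-load cycle to an in-place weight update.
 * Both arguments are in milliseconds. */
double update_speedup(double recompile_ms, double update_ms) {
    assert(update_ms > 0.0);
    return recompile_ms / update_ms;
}
```

With those figures, `update_speedup(40.0, 1.8)` is about 22 and `update_speedup(170.0, 1.8)` is about 94.
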
### M5 Performance Benchmarks

Run the benchmark suite:

```bash
cd training
make m5_performance_suite
./m5_performance_suite
```

Expected output on M5 (measured on base M5, macOS 26.3):

```
Max Dynamic Dimension: 4096 x 4096
Peak Throughput: 1.02 TFLOPS
Weight Update Latency: 1.78 ms
Max Weight Tensor Size: 67.11 MB
```

> **Note**: These values are from actual M5 hardware measurements. M5 Pro/Max variants have not yet been tested; results may differ.

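
The `Max Weight Tensor Size` figure can be reproduced from the maximum dynamic dimension. The sketch below assumes 4-byte elements and decimal megabytes (10^6 bytes). Both are inferences from the reported 67.11 MB rather than something the benchmark output states, and `weight_tensor_megabytes` is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>

/* Size in decimal megabytes (1 MB = 1e6 bytes) of a dim x dim tensor
 * whose elements are `elem_size` bytes each. */
double weight_tensor_megabytes(size_t dim, size_t elem_size) {
    assert(elem_size > 0);
    return (double)dim * (double)dim * (double)elem_size / 1e6;
}
```

Under those assumptions, `weight_tensor_megabytes(4096, 4)` evaluates to 67.108864, which rounds to the 67.11 MB shown in the output; 2-byte elements would give about 33.55 MB instead.
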
### Implementation Notes

1. **Alignment Helper**: Use `ane_create_surface()`, which automatically applies 128-byte alignment and remains backward compatible with M3/M4.

2. **MIL Generation**: Use `mil_gen_dynamic_matmul()` from `ane_mil_gen.h` for M5-compatible dynamic weight layers.

3. **Weight Surface**: For large weights (>16 MB), use `ane_create_weights_surface()`, which adds `kIOSurfaceIsGlobal` for ANE hardware access.

4. **Matmul vs Conv**: For dynamic weights, `matmul` is more stable than `conv` on M5 due to flexible hardware tiling on the NCE (Neural Compute Engine).