Commit 0caf699

feat(m5): add Apple M5 ANE hardware support and performance suite
- Add 128-byte IOSurface alignment for M5 (Apple 10 family) compatibility
- Implement dynamic weight injection via matmul operator for real-time updates
- Add m5_performance_suite.m benchmark tool (4096-dim, ~1.0 TFLOPS, ~1.8ms latency)
- Update ane_runtime.h with weights surface support for dynamic weights
- Update ane_mil_gen.h with program(1.5) and mil_gen_dynamic_matmul()
- Document M5 hardware constraints in README.md

Tested: m5_performance_suite, test_dynamic_matmul, train_large_ane all pass
1 parent 443194b commit 0caf699

5 files changed

Lines changed: 631 additions & 13 deletions

README.md

Lines changed: 71 additions & 0 deletions
@@ -156,6 +156,77 @@ No external dependencies. Uses only system frameworks + private ANE APIs resolve
 
 This project uses Apple's private, undocumented APIs (`_ANEClient`, `_ANECompiler`, `_ANEInMemoryModelDescriptor`). These APIs are not covered by any public stability guarantee and may change or break with any macOS update. This is independent research into Apple Neural Engine architecture, using APIs discovered through runtime introspection for research and educational purposes under fair use and interoperability provisions (see *Sega v. Accolade*, 1992; DMCA §1201(f)). No Apple proprietary code or binaries are included in this repository. This project is not affiliated with or endorsed by Apple Inc. Use at your own risk.
 
+## Hardware Characterization: Apple M5 (2026)
+
+The M5 (Apple 10 family) introduces ANE behavioral constraints that differ from earlier M-series chips. This section documents the key findings from reverse engineering.
+
+### Key M5 ANE Constraints
+
+| Constraint | Value | Notes |
+|:---|:---|:---|
+| **IOSurface Alignment** | 128 bytes | All input, output, and weight surfaces must be 128-byte aligned. Misaligned surfaces cause silent evaluation errors or compiler rejection. |
+| **MIL Version** | program(1.5) | M5 is optimized for MIL 1.5. Use `ios17` or `ios18` function targets. For packed single-input formats, `program(1.3)` remains compatible. |
+| **Max Dynamic Dimension** | 4096 × 4096 | Maximum dimension for dynamic weight tensors passed as inputs. |
+| **Peak Throughput** | ~1.0 TFLOPS | Typical result for pure ANE compute on 4096-dim matmul operations (measured range: 0.86-1.53 TFLOPS). |
+| **Update Latency** | ~1.8 ms | CPU-to-IOSurface `memcpy` plus ANE eval for weight updates at 4096 dims (measured range: 1.7-1.9 ms). |
+
+### Dynamic Weight Injection
+
+On M5, baking weights into the compiled model (via `BLOBFILE`) does not support runtime updates, because the ANE snapshots weights into private memory at load time. The only viable path found for real-time weight updates is:
+
+**Treat weights as Input Tensors using the `matmul` operator.**
+
+```objc
+// MIL pattern for dynamic weights (M5 compatible)
+// Input 0: activations [1, 1, SEQ, IC]
+// Input 1: weights [1, 1, IC, OC] ← dynamic!
+// Output: [1, 1, SEQ, OC]
+
+NSString *mil = [NSString stringWithFormat:
+    @"program(1.5)\n"
+    "{\n"
+    " func main<ios17>(tensor<fp32, [1, 1, %d, %d]> x, tensor<fp32, [1, 1, %d, %d]> weights) {\n"
+    " // Cast to fp16, matmul, cast back to fp32\n"
+    " } -> (y);\n"
+    "}\n", seq, ic, ic, oc];
+```
+
+This approach enables:
+- **In-place weight swapping**: Update weights with a `memcpy` into the input IOSurface, with no recompile
+- **Roughly 20-95x faster updates** than a recompile-and-load cycle (1.8 ms vs. 40-170 ms)
+- **On-device training**: Foundation for gradient descent on ANE
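
When checking ANE results, the shapes in the MIL pattern above can be mirrored by a plain CPU reference. The following is a minimal sketch in C, not code from this commit: `ref_matmul` is a hypothetical helper that assumes row-major fp32 buffers and computes `y[SEQ, OC] = x[SEQ, IC] · W[IC, OC]`.

```c
#include <stddef.h>

/* CPU reference for the dynamic-weight layout:
 *   x: [seq, ic], w: [ic, oc], y: [seq, oc], all row-major fp32.
 * Hypothetical helper for validating ANE output; not part of the commit. */
void ref_matmul(const float *x, const float *w, float *y,
                size_t seq, size_t ic, size_t oc) {
    for (size_t s = 0; s < seq; s++) {
        for (size_t o = 0; o < oc; o++) {
            float acc = 0.0f;
            /* Dot product of row s of x with column o of w. */
            for (size_t i = 0; i < ic; i++)
                acc += x[s * ic + i] * w[i * oc + o];
            y[s * oc + o] = acc;
        }
    }
}
```

Because the MIL program casts both inputs to fp16 before the `matmul`, ANE output is best compared against this reference with a tolerance rather than for exact equality.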
+
+### M5 Performance Benchmarks
+
+Run the benchmark suite:
+
+```bash
+cd training
+make m5_performance_suite
+./m5_performance_suite
+```
+
+Expected output on M5:
+
+```
+Max Dynamic Dimension: 4096 x 4096
+Peak Throughput: 1.02 TFLOPS
+Weight Update Latency: 1.78 ms
+Max Weight Tensor Size: 67.11 MB
+```
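
The reported figures can be sanity-checked with simple arithmetic. A 4096 × 4096 tensor of 4-byte fp32 values occupies 67,108,864 bytes, which matches the 67.11 MB (decimal) line above. The sketch below is an assumption-labelled illustration in C: the function names are hypothetical, and the sequence length used by the benchmark is not stated here, so `seq` is left as a parameter.

```c
#include <stdint.h>

/* Bytes needed for an [ic, oc] weight tensor with elem_size-byte elements. */
uint64_t weight_bytes(uint64_t ic, uint64_t oc, uint64_t elem_size) {
    return ic * oc * elem_size;
}

/* FLOPs for y[seq, oc] = x[seq, ic] * w[ic, oc]:
 * one multiply and one add per accumulated term. */
double matmul_flops(uint64_t seq, uint64_t ic, uint64_t oc) {
    return 2.0 * (double)seq * (double)ic * (double)oc;
}

/* Throughput in TFLOPS for a given FLOP count and elapsed seconds. */
double tflops(double flops, double seconds) {
    return flops / seconds / 1e12;
}
```

For example, `weight_bytes(4096, 4096, 4)` gives the fp32 tensor size, and `tflops(matmul_flops(seq, 4096, 4096), elapsed)` converts a measured time into a throughput figure for whatever `seq` was actually used.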
+
+### Implementation Notes
+
+1. **Alignment Helper**: Use `ane_create_surface()`, which automatically applies 128-byte alignment and remains backward compatible with M3/M4.
+
+2. **MIL Generation**: Use `mil_gen_dynamic_matmul()` from `ane_mil_gen.h` for M5-compatible dynamic weight layers.
+
+3. **Weight Surface**: For large weights (>16MB), use `ane_create_weights_surface()`, which adds `kIOSurfaceIsGlobal` for ANE hardware access.
+
+4. **Matmul vs Conv**: For dynamic weights, `matmul` is more stable than `conv` on M5 due to flexible hardware tiling on the NCE (Neural Compute Engine).
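
The alignment rule in note 1 comes down to rounding a byte count up to the next multiple of 128. Here is a minimal sketch in C, where `align_up` is a hypothetical name and the alignment is assumed to be a power of two. For `alignment = 128` it gives the same result as the `((bytes + 127) / 128) * 128` expression used in `ane_create_surface()`.

```c
#include <assert.h>
#include <stddef.h>

/* Round bytes up to the next multiple of alignment.
 * Assumes alignment is a nonzero power of two (128 for the M5 case). */
size_t align_up(size_t bytes, size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (bytes + alignment - 1) & ~(alignment - 1);
}
```

A zero-byte request rounds to zero, which is the case `ane_create_weights_surface()` guards against by clamping the size to at least 128 bytes.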
+
+---
+
 ## License
 
 MIT — see [LICENSE](LICENSE)

training/Makefile

Lines changed: 4 additions & 1 deletion
@@ -36,13 +36,16 @@ test_qos_sweep: test_qos_sweep.m
 test_ane_advanced: test_ane_advanced.m
 	$(CC) $(CFLAGS) -o $@ $< $(LDFLAGS)
 
+m5_performance_suite: m5_performance_suite.m
+	$(CC) $(CFLAGS) -o $@ $< $(LDFLAGS)
+
 probes: $(PROBES)
 
 tokenize:
 	python3 tokenize.py
 
 clean:
-	rm -f train train_large train_large_ane $(PROBES) test_rmsnorm_bwd test_classifier
+	rm -f train train_large train_large_ane $(PROBES) test_rmsnorm_bwd test_classifier m5_performance_suite
 
 .PHONY: clean tokenize probes

training/ane_mil_gen.h

Lines changed: 48 additions & 4 deletions
@@ -1,4 +1,5 @@
 // ane_mil_gen.h — Generate MIL text for conv-based linear ops + weight blobs
+// M5 ANE optimized: Uses program(1.5) and supports dynamic weight injection
 #pragma once
 #import <Foundation/Foundation.h>
 #include <stdlib.h>
@@ -25,13 +26,26 @@ static NSData *mil_build_weight_blob(const float *weights_f32, int out_ch, int i
     return [NSData dataWithBytesNoCopy:buf length:total freeWhenDone:YES];
 }
 
+// Build raw FP16 weights without header (for dynamic weight injection via IOSurface)
+// weights_f32: source weights in row-major [out_ch, in_ch]
+// Returns NSData with just FP16 values, no headers
+static NSData *mil_build_raw_weights_fp16(const float *weights_f32, int out_ch, int in_ch) {
+    NSUInteger weightSize = (NSUInteger)out_ch * in_ch * sizeof(_Float16);
+    uint8_t *buf = (uint8_t*)malloc(weightSize);
+    _Float16 *fp16 = (_Float16*)buf;
+    for (NSUInteger i = 0; i < (NSUInteger)out_ch * in_ch; i++)
+        fp16[i] = (_Float16)weights_f32[i];
+    return [NSData dataWithBytesNoCopy:buf length:weightSize freeWhenDone:YES];
+}
+
 // Generate MIL for a single matmul: y = W @ x (using matmul op, weights as input)
 // Input x: [1, in_ch, spatial] fp32
 // Input W: [1, out_ch, in_ch] fp32
 // Output: [1, out_ch, spatial] fp32
+// Using program(1.5) for M5/macOS 15 native version
 static NSString *mil_gen_matmul(int in_ch, int out_ch, int spatial) {
     return [NSString stringWithFormat:
-        @"program(1.3)\n"
+        @"program(1.5)\n"
         "[buildInfo = dict<string, string>({{\"coremlc-component-MIL\", \"3510.2.1\"}, "
         "{\"coremlc-version\", \"3505.4.1\"}, {\"coremltools-component-milinternal\", \"\"}, "
         "{\"coremltools-version\", \"9.0\"}})]\n"
@@ -52,10 +66,40 @@ static NSString *mil_gen_matmul(int in_ch, int out_ch, int spatial) {
         out_ch, spatial, out_ch, spatial];
 }
 
+// Generate MIL for dynamic matmul with weights as input tensor (M5 optimized).
+// This is the preferred approach for dynamic weight injection on M5 ANE.
+// Input 0: tensor<fp32, [1, 1, SEQ, IC]> activations (transposed for matmul)
+// Input 1: tensor<fp32, [1, 1, IC, OC]> weights (dynamic)
+// Output: tensor<fp32, [1, 1, SEQ, OC]>
+// Uses program(1.5) and ios17 for M5 compatibility.
+static NSString *mil_gen_dynamic_matmul(int ic, int oc, int seq) {
+    return [NSString stringWithFormat:
+        @"program(1.5)\n"
+        "[buildInfo = dict<string, string>({{\"coremlc-component-MIL\", \"3510.2.1\"}, "
+        "{\"coremlc-version\", \"3505.4.1\"}, {\"coremltools-component-milinternal\", \"\"}, "
+        "{\"coremltools-version\", \"9.0\"}})]\n"
+        "{\n"
+        " func main<ios17>(tensor<fp32, [1, 1, %d, %d]> x, tensor<fp32, [1, 1, %d, %d]> weights) {\n"
+        " string to_fp16 = const()[name = string(\"to_fp16\"), val = string(\"fp16\")];\n"
+        " tensor<fp16, [1, 1, %d, %d]> x16 = cast(dtype = to_fp16, x = x)[name = string(\"cast_x\")];\n"
+        " tensor<fp16, [1, 1, %d, %d]> w16 = cast(dtype = to_fp16, x = weights)[name = string(\"cast_w\")];\n"
+        " bool tx = const()[name = string(\"tx\"), val = bool(false)];\n"
+        " bool ty = const()[name = string(\"ty\"), val = bool(false)];\n"
+        " tensor<fp16, [1, 1, %d, %d]> y16 = matmul(transpose_x = tx, transpose_y = ty, x = x16, y = w16)[name = string(\"matmul\")];\n"
+        " string to_fp32 = const()[name = string(\"to_fp32\"), val = string(\"fp32\")];\n"
+        " tensor<fp32, [1, 1, %d, %d]> y = cast(dtype = to_fp32, x = y16)[name = string(\"cast_out\")];\n"
+        " } -> (y);\n"
+        "}\n",
+        seq, ic, ic, oc,
+        seq, ic, ic, oc,
+        seq, oc, seq, oc];
+}
+
 // Keep the baked-weight version for reference (used in inference-only scenarios)
+// Using program(1.5) for M5/macOS 15 native version
 static NSString *mil_gen_conv(int in_ch, int out_ch, int spatial) {
     return [NSString stringWithFormat:
-        @"program(1.3)\n"
+        @"program(1.5)\n"
         "[buildInfo = dict<string, string>({{\"coremlc-component-MIL\", \"3510.2.1\"}, "
         "{\"coremlc-version\", \"3505.4.1\"}, {\"coremltools-component-milinternal\", \"\"}, "
         "{\"coremltools-version\", \"9.0\"}})]\n"
@@ -89,7 +133,7 @@ static NSString *mil_gen_conv(int in_ch, int out_ch, int spatial) {
 static NSString *mil_gen_qkv(int dim, int spatial) {
     NSUInteger cs = 64 + (NSUInteger)dim * dim * 2;
     return [NSString stringWithFormat:
-        @"program(1.3)\n"
+        @"program(1.5)\n"
         "[buildInfo = dict<string, string>({{\"coremlc-component-MIL\", \"3510.2.1\"}, "
         "{\"coremlc-version\", \"3505.4.1\"}, {\"coremltools-component-milinternal\", \"\"}, "
         "{\"coremltools-version\", \"9.0\"}})]\n"
@@ -174,7 +218,7 @@ static NSData *mil_build_ffn_up_weight_blob(const float *w1, const float *w3, in
 static NSString *mil_gen_ffn_up(int dim, int hidden_dim, int spatial) {
     NSUInteger cs = 64 + (NSUInteger)hidden_dim * dim * 2;
     return [NSString stringWithFormat:
-        @"program(1.3)\n"
+        @"program(1.5)\n"
         "[buildInfo = dict<string, string>({{\"coremlc-component-MIL\", \"3510.2.1\"}, "
         "{\"coremlc-version\", \"3505.4.1\"}, {\"coremltools-component-milinternal\", \"\"}, "
         "{\"coremltools-version\", \"9.0\"}})]\n"

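One layout detail in this file seems worth a note. `mil_build_raw_weights_fp16` documents its source weights as row-major `[out_ch, in_ch]`, while `mil_gen_dynamic_matmul` declares the weights input as `[1, 1, IC, OC]` with `transpose_y` set to false. If weights are stored as `[out_ch, in_ch]`, they would presumably need transposing before being written to the weights input (or `transpose_y` would need to be true). The sketch below shows such a transpose in plain C; `transpose_weights` is a hypothetical helper, not part of this commit, and it works on fp32 because the dynamic program declares fp32 inputs.

```c
#include <stddef.h>

/* Transpose row-major weights from [oc, ic] to [ic, oc].
 * Hypothetical helper; src and dst must not overlap. */
void transpose_weights(const float *src, float *dst, size_t oc, size_t ic) {
    for (size_t o = 0; o < oc; o++)
        for (size_t i = 0; i < ic; i++)
            dst[i * oc + o] = src[o * ic + i];
}
```
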
training/ane_runtime.h

Lines changed: 107 additions & 8 deletions
@@ -1,21 +1,32 @@
 // ane_runtime.h — Reusable ANE in-memory compile/load/eval wrapper
 // Uses _ANEInMemoryModel via private AppleNeuralEngine.framework
+//
+// M5 ANE Compatibility:
+// - 128-byte alignment for all IOSurface buffers (backward compatible)
+// - Dynamic weight support via weightsSurface parameter
+// - MIL 1.5 program version for optimal M5 performance
 #pragma once
 #import <Foundation/Foundation.h>
 #import <objc/runtime.h>
 #import <objc/message.h>
 #import <dlfcn.h>
 #import <IOSurface/IOSurface.h>
+#import <sys/mman.h>
+#import <sys/stat.h>
+#import <fcntl.h>
 
 typedef struct {
     id model;  // _ANEInMemoryModel
     IOSurfaceRef *ioInputs;
     IOSurfaceRef *ioOutputs;
+    IOSurfaceRef weightsSurface;  // Optional: dynamic weights IOSurface
+    id weightsBuffer;  // Optional: _ANEIOSurfaceObject for weights
     id request;  // _ANERequest
     NSString *tmpDir;
     int nInputs, nOutputs;
     size_t *inputBytes;
     size_t *outputBytes;
+    size_t weightsBytes;  // Size of weights surface
 } ANEKernel;
 
 static Class g_ANEDesc, g_ANEInMem, g_ANEReq, g_ANEIO;
@@ -31,24 +42,52 @@ static void ane_init(void) {
     g_ane_loaded = true;
 }
 
+// Create an IOSurface with 128-byte alignment for M5 ANE compatibility.
+// This alignment is required on M5 (Apple 10 family) and backward compatible
+// with older M-series chips.
 static IOSurfaceRef ane_create_surface(size_t bytes) {
+    // Round up to 128-byte boundary for M5 ANE compatibility
+    size_t aligned = ((bytes + 127) / 128) * 128;
     return IOSurfaceCreate((__bridge CFDictionaryRef)@{
-        (id)kIOSurfaceWidth: @(bytes),
+        (id)kIOSurfaceWidth: @(aligned),
         (id)kIOSurfaceHeight: @1,
         (id)kIOSurfaceBytesPerElement: @1,
-        (id)kIOSurfaceBytesPerRow: @(bytes),
-        (id)kIOSurfaceAllocSize: @(bytes),
+        (id)kIOSurfaceBytesPerRow: @(aligned),
+        (id)kIOSurfaceAllocSize: @(aligned),
         (id)kIOSurfacePixelFormat: @0
     });
 }
 
+// Create an IOSurface specifically for dynamic weights.
+// Uses the same 128-byte alignment as regular surfaces.
+static IOSurfaceRef ane_create_weights_surface(size_t bytes) {
+    size_t aligned = ((bytes + 127) / 128) * 128;
+    if (aligned < 128) aligned = 128;
+
+    NSMutableDictionary *props = [NSMutableDictionary dictionaryWithObjectsAndKeys:
+        @(aligned), (id)kIOSurfaceWidth,
+        @1, (id)kIOSurfaceHeight,
+        @1, (id)kIOSurfaceBytesPerElement,
+        @(aligned), (id)kIOSurfaceBytesPerRow,
+        @(aligned), (id)kIOSurfaceAllocSize,
+        @0, (id)kIOSurfacePixelFormat,
+        nil];
+
+    // Enable global access for ANE hardware
+    [props setObject:@YES forKey:(id)kIOSurfaceIsGlobal];
+
+    return IOSurfaceCreate((__bridge CFDictionaryRef)props);
+}
+
 // Compile a MIL graph with weight blob into an ANE kernel.
 // milText: NSData of MIL text
 // weightData: NSData of raw weight blob (can be nil)
 // inputSizes/outputSizes: arrays of byte sizes for each I/O tensor
+// weightsSurface: optional IOSurface for dynamic weights (can be NULL)
 static ANEKernel *ane_compile(NSData *milText, NSData *weightData,
         int nInputs, size_t *inputSizes,
-        int nOutputs, size_t *outputSizes) {
+        int nOutputs, size_t *outputSizes,
+        IOSurfaceRef weightsSurface) {
     ane_init();
     NSError *e = nil;
 
@@ -97,15 +136,26 @@ static ANEKernel *ane_compile(NSData *milText, NSData *weightData,
     memcpy(k->inputBytes, inputSizes, nInputs * sizeof(size_t));
     memcpy(k->outputBytes, outputSizes, nOutputs * sizeof(size_t));
 
-    // Create IOSurfaces
+    // Create IOSurfaces for inputs/outputs
     k->ioInputs = malloc(nInputs * sizeof(IOSurfaceRef));
     k->ioOutputs = malloc(nOutputs * sizeof(IOSurfaceRef));
     for (int i = 0; i < nInputs; i++)
         k->ioInputs[i] = ane_create_surface(inputSizes[i]);
     for (int i = 0; i < nOutputs; i++)
         k->ioOutputs[i] = ane_create_surface(outputSizes[i]);
 
-    // Build request
+    // Handle optional weights surface for dynamic weight injection
+    id weightsBufferObj = nil;
+    if (weightsSurface) {
+        k->weightsSurface = weightsSurface;
+        CFRetain(weightsSurface);
+        k->weightsBytes = IOSurfaceGetAllocSize(weightsSurface);
+        weightsBufferObj = ((id(*)(Class,SEL,IOSurfaceRef))objc_msgSend)(
+            g_ANEIO, @selector(objectWithIOSurface:), weightsSurface);
+        k->weightsBuffer = weightsBufferObj;
+    }
+
+    // Build request with optional weights buffer
     NSMutableArray *wIns = [NSMutableArray arrayWithCapacity:nInputs];
     NSMutableArray *iIdx = [NSMutableArray arrayWithCapacity:nInputs];
     for (int i = 0; i < nInputs; i++) {
@@ -122,11 +172,53 @@ static ANEKernel *ane_compile(NSData *milText, NSData *weightData,
     }
     k->request = ((id(*)(Class,SEL,id,id,id,id,id,id,id))objc_msgSend)(
         g_ANEReq, @selector(requestWithInputs:inputIndices:outputs:outputIndices:weightsBuffer:perfStats:procedureIndex:),
-        wIns, iIdx, wOuts, oIdx, nil, nil, @0);
+        wIns, iIdx, wOuts, oIdx, weightsBufferObj, nil, @0);
 
     return k;
 }
 
+// Legacy compile function (backward compatible wrapper)
+static ANEKernel *ane_compile_legacy(NSData *milText, NSData *weightData,
+        int nInputs, size_t *inputSizes,
+        int nOutputs, size_t *outputSizes) {
+    return ane_compile(milText, weightData, nInputs, inputSizes, nOutputs, outputSizes, NULL);
+}
+
+// Load weights data into the kernel's weights surface.
+// Returns 0 on success, -1 on failure.
+static int ane_load_weights(ANEKernel *k, const void *data, size_t bytes) {
+    if (!k || !k->weightsSurface) {
+        fprintf(stderr, "ane_load_weights: kernel has no weights surface\n");
+        return -1;
+    }
+
+    size_t surfaceSize = IOSurfaceGetAllocSize(k->weightsSurface);
+    if (bytes > surfaceSize) {
+        fprintf(stderr, "ane_load_weights: data size %zu exceeds surface size %zu\n",
+            bytes, surfaceSize);
+        return -1;
+    }
+
+    IOSurfaceLock(k->weightsSurface, 0, NULL);
+    memcpy(IOSurfaceGetBaseAddress(k->weightsSurface), data, bytes);
+    IOSurfaceUnlock(k->weightsSurface, 0, NULL);
+
+    return 0;
+}
+
+// Get pointer to weights surface for direct writing.
+// Caller MUST call ane_weights_unlock after writing.
+static void *ane_weights_lock(ANEKernel *k) {
+    if (!k || !k->weightsSurface) return NULL;
+    IOSurfaceLock(k->weightsSurface, 0, NULL);
+    return IOSurfaceGetBaseAddress(k->weightsSurface);
+}
+
+static void ane_weights_unlock(ANEKernel *k) {
+    if (!k || !k->weightsSurface) return;
+    IOSurfaceUnlock(k->weightsSurface, 0, NULL);
+}
+
 static void ane_write_input(ANEKernel *k, int idx, const void *data, size_t bytes) {
     IOSurfaceLock(k->ioInputs[idx], 0, NULL);
     memcpy(IOSurfaceGetBaseAddress(k->ioInputs[idx]), data, bytes);
@@ -141,9 +233,15 @@ static void ane_read_output(ANEKernel *k, int idx, void *data, size_t bytes) {
 
 static bool ane_eval(ANEKernel *k) {
     NSError *e = nil;
-    return ((BOOL(*)(id,SEL,unsigned int,id,id,NSError**))objc_msgSend)(
+    BOOL result = ((BOOL(*)(id,SEL,unsigned int,id,id,NSError**))objc_msgSend)(
         k->model, @selector(evaluateWithQoS:options:request:error:),
         21, @{}, k->request, &e);
+
+    if (!result && e) {
+        fprintf(stderr, "ANE evaluation failed: %s\n", [[e localizedDescription] UTF8String]);
+    }
+
+    return result;
 }
 
 static void ane_free(ANEKernel *k) {
@@ -153,6 +251,7 @@ static void ane_free(ANEKernel *k) {
         k->model, @selector(unloadWithQoS:error:), 21, &e);
     for (int i = 0; i < k->nInputs; i++) CFRelease(k->ioInputs[i]);
     for (int i = 0; i < k->nOutputs; i++) CFRelease(k->ioOutputs[i]);
+    if (k->weightsSurface) CFRelease(k->weightsSurface);
     [[NSFileManager defaultManager] removeItemAtPath:k->tmpDir error:nil];
     free(k->ioInputs); free(k->ioOutputs);
     free(k->inputBytes); free(k->outputBytes);
