Commit 6940b54
[feat][gpu] Q4 quantization, Metal GPU shaders, ANE kernel fusion, memory safety
Inference engine:
- Q4_0 quantization format (block_size=32, 20 bytes/block) with convert_weights.py
- Q4 AMX dequant-at-load path: same ~91 t/s as F32 (bandwidth-bound)
- Metal GPU compute shaders (matmul.metal): 15+ kernels including SIMD Q4
matmul, fused gate+up+SiLU FFN, batched attention, batched prefill
- GPU path reverted from generate() due to dispatch overhead making it
slower than CPU AMX for dim=896 (kept as dead code for larger models)
- ANE kernel fusion (GQA-aware QKV + Gate/Up) cuts the kernel count from
  184 to 112, fitting under the 119-kernel compile limit. Compiles, but
  ane_eval still fails on M4.
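The 20-bytes-per-block figure is consistent with one f32 scale plus 16 bytes of packed 4-bit values per 32-weight block. A minimal sketch of that layout, assuming the common Q4_0 scheme (max-magnitude weight mapped to -8, values stored offset by +8) rather than the repo's exact code:

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>

#define Q4_BLOCK 32  /* weights per block */

/* Assumed 20-byte block layout: one f32 scale + 16 bytes of packed nibbles. */
typedef struct {
    float scale;               /* 4 bytes */
    uint8_t qs[Q4_BLOCK / 2];  /* 16 bytes, two 4-bit values per byte */
} q4_block_t;

/* Quantize one block: pick the max-magnitude weight, scale it to -8, and
   round the rest onto [0, 15] (signed range [-8, 7] stored offset by +8). */
static void q4_quantize_block(const float *w, q4_block_t *b) {
    float amax = 0.0f, max = 0.0f;
    for (int i = 0; i < Q4_BLOCK; i++) {
        if (fabsf(w[i]) > amax) { amax = fabsf(w[i]); max = w[i]; }
    }
    float d = max / -8.0f;               /* scale so that max -> -8 */
    float id = d ? 1.0f / d : 0.0f;
    b->scale = d;
    for (int i = 0; i < Q4_BLOCK / 2; i++) {
        int lo = (int)(w[2 * i]     * id + 8.5f);
        int hi = (int)(w[2 * i + 1] * id + 8.5f);
        if (lo > 15) lo = 15; if (lo < 0) lo = 0;
        if (hi > 15) hi = 15; if (hi < 0) hi = 0;
        b->qs[i] = (uint8_t)(lo | (hi << 4));
    }
}

/* Dequantize back to f32 (the dequant-at-load path does this once per
   weight, then runs the existing F32 AMX matmul on the result). */
static void q4_dequantize_block(const q4_block_t *b, float *out) {
    for (int i = 0; i < Q4_BLOCK / 2; i++) {
        out[2 * i]     = (float)((b->qs[i] & 0x0F) - 8) * b->scale;
        out[2 * i + 1] = (float)((b->qs[i] >> 4)   - 8) * b->scale;
    }
}
```

Dequantizing at load time explains why Q4 matches F32 token throughput: the hot loop sees the same F32 weights either way, and generation is bandwidth-bound on the activations, not the weight format on disk.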
Memory safety:
- safe_malloc/qwen_calloc wrappers with OOM crash protection
- Moved per-layer malloc/free out of 24-iteration loop in prefill
(was allocating/freeing up to 17MB buffers 168 times per request)
- Removed GPU weight upload and --gpu/--bench-gpu flags
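The memory-safety wrappers above can be sketched as fail-fast allocators: abort with a clear message on OOM instead of returning NULL and crashing later on a wild pointer. Names match the commit, but the bodies are an assumed minimal form, not the repo's code:

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Fail-fast malloc: never returns NULL. */
static void *safe_malloc(size_t n) {
    void *p = malloc(n);
    if (!p) {
        fprintf(stderr, "fatal: out of memory allocating %zu bytes\n", n);
        abort();
    }
    return p;
}

/* Fail-fast calloc: zero-initialized, never returns NULL. */
static void *qwen_calloc(size_t count, size_t size) {
    void *p = calloc(count, size);
    if (!p) {
        fprintf(stderr, "fatal: out of memory allocating %zu x %zu bytes\n",
                count, size);
        abort();
    }
    return p;
}

/* Hoisting sketch for the prefill fix: allocate scratch once and reuse it
   across the 24 layers, instead of malloc/free inside the loop:
     float *scratch = safe_malloc(scratch_bytes);  // once, before the loop
     for (int layer = 0; layer < 24; layer++) { ... reuse scratch ... }
     free(scratch);                                // once, after
*/
```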
Benchmark:
- Multi-format comparison: F16, Q8, Q4_AMX, Q4_Metal, 3 LM Studio models
- Wall-clock round-trip timing for fair cross-engine comparison
Made-with: Cursor
1 parent abc9fa3
8 files changed: 3339 additions, 257 deletions
File tree
- inference
- training