Commit 6940b54

[feat][gpu] Q4 quantization, Metal GPU shaders, ANE kernel fusion, memory safety
Inference engine:
- Q4_0 quantization format (block_size=32, 20 bytes/block) with convert_weights.py
- Q4 AMX dequant-at-load path: same ~91 t/s as F32 (bandwidth-bound)
- Metal GPU compute shaders (matmul.metal): 15+ kernels including SIMD Q4 matmul, fused gate+up+SiLU FFN, batched attention, batched prefill
- GPU path reverted from generate() due to dispatch overhead making it slower than CPU AMX for dim=896 (kept as dead code for larger models)
- ANE kernel fusion (GQA-aware QKV + Gate/Up) reduces 184->112 kernels, fits under 119 compile limit. Compiles but ane_eval fails on M4.

Memory safety:
- safe_malloc/qwen_calloc wrappers with OOM crash protection
- Moved per-layer malloc/free out of 24-iteration loop in prefill (was allocating/freeing up to 17MB buffers 168 times per request)
- Removed GPU weight upload and --gpu/--bench-gpu flags

Benchmark:
- Multi-format comparison: F16, Q8, Q4_AMX, Q4_Metal, 3 LM Studio models
- Wall-clock round-trip timing for fair cross-engine comparison

Made-with: Cursor
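To make the "block_size=32, 20 bytes/block" figure concrete, here is a minimal sketch of one block layout that adds up to 20 bytes: a 4-byte F32 scale followed by 32 four-bit codes packed into 16 bytes. The commit message does not spell out the layout, so the struct `q4_block`, the helper names, and the symmetric rounding scheme below are all assumptions for illustration, not the code from this commit.

```c
#include <assert.h>
#include <stdint.h>

#define Q4_BLOCK_SIZE 32 /* weights per block, as stated in the commit message */

/* ASSUMED layout: 4-byte F32 scale + 16 bytes of packed nibbles = 20 bytes.
 * The commit only states the total size, not the field order or scale type. */
typedef struct {
    float   scale;                      /* per-block scale factor */
    uint8_t nibbles[Q4_BLOCK_SIZE / 2]; /* two 4-bit codes per byte */
} q4_block;

static_assert(sizeof(q4_block) == 20, "expected 20 bytes per Q4 block");

/* Round to nearest integer without pulling in libm. */
static int round_nearest(float v) {
    return (int)(v + (v >= 0.0f ? 0.5f : -0.5f));
}

/* Clamp a signed code to [-8, 7] and bias it into the unsigned range [0, 15]. */
static uint8_t to_nibble(int q) {
    if (q < -8) q = -8;
    if (q > 7) q = 7;
    return (uint8_t)(q + 8);
}

/* Quantize 32 floats into one block (hypothetical symmetric scheme). */
static void q4_quantize_block(const float *x, q4_block *out) {
    float amax = 0.0f;
    for (int i = 0; i < Q4_BLOCK_SIZE; i++) {
        float a = x[i] < 0.0f ? -x[i] : x[i];
        if (a > amax) amax = a;
    }
    out->scale = amax / 7.0f;
    float inv = out->scale > 0.0f ? 1.0f / out->scale : 0.0f;
    for (int i = 0; i < Q4_BLOCK_SIZE / 2; i++) {
        uint8_t lo = to_nibble(round_nearest(x[2 * i] * inv));
        uint8_t hi = to_nibble(round_nearest(x[2 * i + 1] * inv));
        out->nibbles[i] = (uint8_t)(lo | (hi << 4));
    }
}

/* Expand one block back to 32 floats, as a dequant-at-load path would. */
static void q4_dequantize_block(const q4_block *in, float *y) {
    for (int i = 0; i < Q4_BLOCK_SIZE / 2; i++) {
        y[2 * i]     = (float)((in->nibbles[i] & 0x0F) - 8) * in->scale;
        y[2 * i + 1] = (float)((in->nibbles[i] >> 4) - 8) * in->scale;
    }
}
```

Under this assumed layout each weight costs 20 / 32 = 0.625 bytes, compared with 4 bytes in F32. Because the dequant-at-load path expands blocks back to floats before inference, runtime memory traffic would stay at F32 levels, which is consistent with the commit's note that throughput matches F32.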
1 parent abc9fa3 commit 6940b54

8 files changed

Lines changed: 3339 additions & 257 deletions

.gitignore

Lines changed: 4 additions & 0 deletions
@@ -25,6 +25,9 @@ training/test_*
 # Inference binaries and runtime data
 inference/qwen_ane
 inference/qwen05b.bin
+inference/qwen05b_f32.bin
+inference/qwen05b_f16.bin
+inference/qwen05b_q8.bin
 inference/.venv/
 inference/benchmark_results.json
 
@@ -59,6 +62,7 @@ web/
 training/tinystories_data00.bin
 training/ane_stories110M_ckpt.bin
 *.bin
+*.metallib
 !training/download_data.sh
 
 # Secrets / env