llama.cpp fork with TurboQuant WHT-rotated KV-cache and weight compression, plus Gemma 4 MTP and Qwen 3.6 NextN speculative decoding (+30-50% throughput).
metal cpp vulkan cuda quantization mtp multimodal apple-silicon low-bit-quantization llama-cpp local-llm llm-inference gguf speculative-decoding kv-cache-compression qwen3 multi-token-prediction turboquant gemma4 nextn