⚡ TurboQuant

A data-oblivious KV-cache compressor. Random rotation + scalar Lloyd-Max quantizer that shrinks the KV cache to 3–4× smaller than FP16, no calibration data required. Useful when GPU memory is the binding constraint on your serving rig.

Open-source implementation of Google's TurboQuant (ICLR 2026). Provably within 2.7× of the information-theoretic distortion bound.

Reality check (May 2026)

A lot has happened since the ICLR poster. TurboQuant is not always the right answer and the community has converged on a more nuanced picture than the initial hype suggested. Read this table before you start:

Finding	Source	What it means for you
FP8 KV cache is the best default on Hopper/Blackwell — 2× capacity, no measurable accuracy loss, no throughput hit	Red Hat AI / vLLM blog, May 11 2026	Reach for `--kv-cache-dtype fp8` first. Only use TurboQuant when you need >2× compression and accept some throughput cost
The QJL residual step often hurts at low bit widths — MSE-only beats MSE+QJL on every model the community has tested	tonbistudio V3 README, scos-lab 8-model benchmark	Prefer the `*_nc` ("norm correction") variants; treat QJL as optional
3-bit modes are not production-ready for reasoning or >128K retrieval — ~20-point accuracy drops on AIME25 / LiveCodeBench at long context	Red Hat blog Figures 5–6	Use `turboquant_4bit_nc` for production, save `3bit_nc` for memory-bound edge
Skip the first and last 2 layers — they hold disproportionate signal; quantizing them costs more accuracy than the bits it saves	Red Hat blog	Verify your engine config does this (vLLM does it for `*_nc` variants by default)
K/V norm ratio predicts quality — Qwen-class models need more bits on K than V (8/4 or 6/3)	scos-lab #8	Profile your model before picking a uniform bit budget
Layer 0 has ~20% outlier channels, middle layers ~4-6%	scos-lab	Per-layer dynamic outlier thresholds beat a fixed global allocation

Where TurboQuant still wins

Scenario	Why TurboQuant beats FP8
Single-card serving of 70B+ at 128K+ context	3–4× KV reduction lets model + cache fit; FP8 only buys 2×
Consumer GPUs without FP8 attention (Ada, AMD older, Apple Silicon)	FP8 KV needs hardware FP8 attention to win; TurboQuant works everywhere
Long-context retrieval where memory bandwidth dominates	Reading ~52 bytes/token instead of 256 frees decode bandwidth
Edge / on-device with strict memory budgets	`4bit_nc` at <5% perplexity drop is hard to match

🎛️ Live demo on HuggingFace Space — see your model's KV cache shrink in real time
What's new — May 2026 • Why TurboQuant? • How it works • Quick start
Tips and tricks (new — from Red Hat eval + independent ports)
The 2026 KV compression landscape
Integrations (vLLM, SGLang, llama.cpp, KVPress, LMCache, MLX)
Hardware support • Decision guide • FAQ
Project structure • Citation • Credits

What's new — May 2026

TurboQuant landed in upstream vLLM, picked up several serious independent ports, and got its first comprehensive third-party evaluation. The headline story has shifted from "new SOTA, use it everywhere" to "useful in a well-defined regime, with caveats."

The May 2026 must-reads

May 11 — Red Hat AI / vLLM blog: "A First Comprehensive Study of TurboQuant: Accuracy and Performance" by Eldar Kurtić, Michael Goin, Alexandre Marques. The single most important data point in the ecosystem right now: rigorous evaluation across Llama-3.3-70B, Qwen3-30B-A3B, MiniMax-M2.7 on long-context retrieval (MRCR) and reasoning (AIME25, GPQA, MATH500, LiveCodeBench-v6). Conclusion: FP8 KV is the best default, TurboQuant 4bit_nc is the practical TQ variant, aggressive 3-bit modes drop accuracy meaningfully.
May 11–18 — Multiple independent evaluations (TeqVolt, AI Intensify) converge on the same finding: PolarQuant works, QJL hurts at 3 bits. Disabling QJL gives better practical accuracy at the same compression ratio. Tested on PyTorch, MLX, and Triton implementations.
May 13 — varjoranta/turboquant-vllm v0.13.5 adds end-to-end Qwen3.6-35B-A3B-TQ3-native support through vLLM 0.20.2 + FlashInfer CUTLASS MoE. First native-packed TQ checkpoint for a frontier MoE model — ~16 GB on disk (4.4× over BF16).
Apr 23 — varjoranta v0.13.0 ships block-diagonal WHT CUDA kernel: 10.1× decode speedup on Qwen3.6-35B-A3B-TQ3.
Late April — vLLM merged --kv-cache-dtype turboquant_{k8v4, 4bit_nc, k3v4_nc, 3bit_nc}. These dtype names that shipped supersede the earlier tq3 / tq4 / pq4 proposals.

Active community ports

Repo	Stars	Engine	Notable
0xSero/turboquant	1.1K	vLLM + Triton	First open-source vLLM integration; RTX 5090 + 8×RTX 3090 benchmarks; asymmetric 3-bit K / 2-bit V
tonbistudio/turboquant-pytorch	940+	Pure PyTorch	V3 algorithm that drops QJL based on community findings; documented `residual_window=0` benchmark bug + correction
varjoranta/turboquant-vllm (`turboquant-plus-vllm` on PyPI)	60+	vLLM + CUDA	Fused CUDA dequant kernels, MLX/Apple Silicon port, block-diagonal WHT for partial-rotary models, bs=1 GEMV kernel, kurtosis-aware mixed precision
AmesianX/TurboQuant	50+	llama.cpp	Only llama.cpp fork that supports head_dim=256+ (Qwen3.5, Qwen3.6, Gemma 4); v1.4.2 adds MMA tensor-core acceleration (+63% TG)
spiritbuun/llama-cpp-turboquant-cuda	300+	llama.cpp	head_dim=128 models only; fast on Llama-3 / Qwen3-14B-class
scos-lab/turboquant	14	Research	8-model benchmark suite; per-layer outlier analysis; published the K/V-norm-ratio finding

Native-packed checkpoints

varjosoft/Qwen3.6-35B-A3B-TQ3-native — pre-fused expert layout; loads via turboquant-plus-vllm on A100 80 GB or RTX PRO 6000.

Earlier (still relevant) — April 2026

LMCache blog (Apr 15) — best layperson explainer of the algorithm.
Towards AI (Apr 15) — consumer-GPU walkthrough; TurboQuant stacks on top of AWQ / GGUF / NVFP4.
TriAttention (arXiv 2604.04921) — 10.7× KV reduction on AIME25 via token selection; stacks with TurboQuant.
LRKV (Apr 9) — architectural 45-53% KV reduction at pretraining time.
Adaptive KV-Quant (arXiv 2604.04722) — learned per-token bit-width for edge.
SGLang PR #21419 — first --kv-cache-dtype turboquant attempt; still WIP, see INTEGRATIONS.md for the current SGLang status.
SGLang PR #21954 — NVFP4 KV strategy abstraction on Blackwell SM100/SM120; container that TurboQuant composes with.

ICLR 2026 poster (archive): Zandieh et al., Apr 25 2026.

📚 For the full 2026 landscape — including side-by-side comparisons with TriAttention, LRKV, MLA, KIVI, KVQuant, ParoQuant, NVFP4-KV, KVPress, KV Packet, SnapKV, H2O, and StreamingLLM — see LANDSCAPE_2026.md.

Why TurboQuant?

📄 Paper Results (Llama-3.1-8B-Instruct, LongBench — from the paper)

Method	KV Bits	LongBench Avg	Needle-in-Haystack
Full Precision	16	50.06	0.997
TurboQuant	3.5	50.06	0.997
TurboQuant	2.5	49.44	0.997
PolarQuant	3.9	49.78	0.995
KIVI	3	48.50	0.981
SnapKV	—	44.57	0.858

⚠️ Independent reproductions paint a more cautious picture than the paper. The Red Hat AI / vLLM evaluation in May 2026 found that 3-bit modes give meaningful accuracy drops on reasoning + very long context, especially without disabling QJL. Use 4-bit *_nc variants for production. See BENCHMARKS.md §Independent reproductions.

🔧 Our Implementation Results (Mistral-7B-Instruct-v0.3)

Mode	Logit Cosine	Top-1 Match	KV Key Cosine	KV Value Cosine	Compression
3.5-bit (default)	0.963	80% (4/5)	0.992	0.988	4.9×
2.5-bit	0.956	80% (4/5)	0.973	0.961	7.1×

Both modes use two independent rotations for outlier/regular channel subsets (Section 2.3) and online codebooks from actual data (Section 4.1).

Rotation modes: rotation_mode="hadamard" (default, O(d log d)) or rotation_mode="dense" (full random orthogonal via QR decomposition, O(d²)). Both satisfy P^T P = I exactly.

QJL is optional. Independent evaluations on PyTorch, MLX, and Triton consistently find that the 1-bit QJL correction adds variance that softmax amplifies, hurting attention quality at b ≤ 3. Pass qjl_score_weight=0.0 (or set enable_qjl=False) to fall back to pure MSE-only PolarQuant. See the discussion in IMPLEMENTATION_NOTES.md §QJL Score Weight.

How It Works

TurboQuant is a two-stage vector quantizer that achieves near-optimal compression:

Input KV vector (FP16, d=128)
         │
         ▼
┌─────────────────────┐
│  Random Rotation Π   │  Hadamard + random signs
│  y = Π · x          │  O(d log d), preserves norms
└─────────┬───────────┘
          │
          ▼
┌─────────────────────┐
│  Scalar Lloyd-Max    │  Each coordinate independently
│  idx = quantize(y)   │  b bits per coordinate
│                      │  Beta dist ≈ N(0, 1/d)
└─────────┬───────────┘
          │
          ▼
┌─────────────────────┐
│  QJL Residual        │  1-bit sign quantization
│  sign(S · residual)  │  Unbiased inner products
└─────────┬───────────┘
          │
          ▼
   Compressed: (b+1) bits/coord + FP16 norm
   = ~3.25 bits/value at b=2
   = ~4.9× compression vs FP16

Key insight: After random rotation, each coordinate follows a Beta distribution that's near-independent of other coordinates. This means scalar quantization per coordinate is near-optimal — no coupling, no error compounding through deep models.

Quick Start

# Install from PyPI
pip install turboquant

# Use it
from turboquant import TurboQuantCache

cache = TurboQuantCache(n_layers=32, n_heads=8, d=128, b_mse=3, mixed_precision=True)
# ... drop into your attention loop; see INTEGRATIONS.md for vLLM / SGLang / llama.cpp

Or install from source for hacking on it:

git clone https://github.com/OnlyTerp/turboquant.git
cd turboquant
pip install -e ".[dev]"

# Run demo (synthetic vectors, no GPU needed)
python src/demo.py

# Run real model validation (downloads TinyLlama or Nemotron-Nano-4B)
python src/test_real_model.py

Serving engines — see INTEGRATIONS.md for full setup of each:

# vLLM (merged upstream — recommended)
pip install "vllm>=0.20.2"
vllm serve meta-llama/Llama-3.3-70B-Instruct --kv-cache-dtype turboquant_4bit_nc

# Production-grade vLLM fork with CUDA dequant kernels (10× decode speedup on MoE)
pip install turboquant-plus-vllm
vllm serve varjosoft/Qwen3.6-35B-A3B-TQ3-native --kv-cache-dtype turboquant_3bit_native

# Our self-contained plugin (no vLLM patching, works on older vLLM)
pip install -e ".[vllm]"
vllm serve meta-llama/Llama-3.1-8B-Instruct --attention-backend turboquant

# SGLang (still WIP, multiple competing PRs — see INTEGRATIONS.md)
gh pr checkout 21419 --repo sgl-project/sglang
python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct \
    --kv-cache-dtype turboquant

# llama.cpp via AmesianX/TurboQuant (only fork with head_dim=256 support)
git clone https://github.com/AmesianX/TurboQuant && cd TurboQuant && make GGML_CUDA=1
./llama-cli -m model.gguf -ctk q4_0 -ctv q4_0 -fa -c 131072

Tips and tricks

Distilled from the Red Hat AI / vLLM evaluation (May 11, 2026), tonbistudio V3 retrospective, scos-lab 8-model benchmark, and runtime data from 0xSero, varjoranta, and AmesianX. Read this before you tune.

1. Try FP8 first

vllm serve <model> --kv-cache-dtype fp8

If FP8 fits your memory budget, you are done. It's the best-quality / highest-throughput KV format on every model the Red Hat team tested. TurboQuant only earns its keep when you need >2× compression.

2. If TurboQuant: pick `4bit_nc`, not `3bit_nc`

vllm serve <model> --kv-cache-dtype turboquant_4bit_nc

4bit_nc is the "norm correction" variant — it skips the QJL residual and uses a per-token norm rescale instead. Across Llama-3.3-70B, Qwen3-30B-A3B, and MiniMax-M2.7, it's within ~1 pt of FP8 on AIME25 / GPQA / MATH500 / LiveCodeBench while giving 2.6–3.1× KV reduction. 3bit_nc and k3v4_nc drop 15–25 pts on those same benchmarks at ≥128K context.

3. Disable QJL when working with the reference path

The qjl_score_weight=1.0 paper default is unbiased but high-variance. Softmax amplifies that variance, especially on heads with concentrated attention. Independent ports — PyTorch, MLX, Triton — consistently see MSE-only beat MSE+QJL at b ≤ 3. In code:

cache = TurboQuantCache(n_layers=..., n_heads=..., d=128, b_mse=3,
                        qjl_score_weight=0.0)  # disable QJL
# or set enable_qjl=False in newer revisions

4. Skip the first and last 2 layers

The first and last 2 layers carry disproportionate signal. The Red Hat evaluation found that quantizing them costs more accuracy than the bits they save. vLLM does this automatically for *_nc variants; if you roll your own attention backend, replicate it:

SKIP_LAYERS = set(list(range(2)) + list(range(model.config.num_hidden_layers - 2,
                                              model.config.num_hidden_layers)))
for layer_idx in range(model.config.num_hidden_layers):
    if layer_idx in SKIP_LAYERS:
        # keep FP16 / FP8 KV for this layer
        ...

5. Profile K vs V norms before picking bits

Qwen-class models have wildly asymmetric K/V norm distributions (e.g. Qwen3.5-7B sees a ~106× ratio between K and V channel norms; GPT-2-124M only ~6×). Models with extreme ratios benefit from more bits on K than V. The k8v4 and k3v4_nc variants exist for exactly this reason:

# Use more bits on K when K dominates the norm budget (Qwen-family)
vllm serve Qwen/Qwen3-30B-A3B --kv-cache-dtype turboquant_k8v4

To profile your own model, run scos-lab/turboquant benchmarks and inspect the per-channel norm histogram.

6. Use per-layer dynamic outlier thresholds

A fixed "top-k=32 outlier channels" allocation wastes bits in the middle layers (only ~4-6% of channels are actually outliers) and under-allocates layer 0 (~20% outliers). The scos-lab dynamic threshold (channels with RMS > 3× median get the extra bits) closes roughly 60% of the accuracy gap at the same average bpv.

This is the implementation policy in our mixed_precision=True path; just confirm your MixedPrecisionConfig is using mode="dynamic", not mode="fixed_topk".

7. For frontier MoE models, use the native-packed checkpoint

If you're serving Qwen3.6-35B-A3B-class models, don't quantize at runtime — use the pre-fused varjosoft/Qwen3.6-35B-A3B-TQ3-native checkpoint through turboquant-plus-vllm. It bakes the per-expert rotation into the weight layout so the runtime kernel never sees an unfused MoE expert, which is what lets it hit 10.1× decode speedup over BF16.

8. For llama.cpp, mind the head_dim

spiritbuun/llama-cpp-turboquant-cuda hard-codes head_dim=128. It works for Llama-3, Qwen2.5, Qwen3-14B and most GQA models — but Gemma 4, Qwen3.5, and Qwen3.6 use head_dim=256, where it fails silently with degraded output. Use AmesianX/TurboQuant (v1.3.0+) for those models — it has a P1→P5 detection cascade that finds the real head_dim even when n_embd / n_head lies (which it does on those architectures).

9. Stack with token selection for long reasoning

For >32K chain-of-thought (AIME, AIME25, math), token selection dominates precision. TriAttention, SnapKV, or NVIDIA's KVPress decide which tokens to retain; TurboQuant compresses each retained token. The two operate on different axes — combine them.

10. Don't trust headline compression numbers without verifying

The tonbistudio team retracted a "18/18 perfect generation at 5× compression" claim after discovering the eval loop was running with residual_window=0, which silently disabled compression. Always log the actual compressed token count after the run; needle-in-haystack at your target context length is the only honest end-to-end test.

The 2026 KV compression landscape

TurboQuant is the precision axis of KV compression. There are three axes — precision, selection, and container — and the best 2026 stacks combine all three. Summary (full analysis in LANDSCAPE_2026.md):

Method	Released	Axis	KV reduction	Quality at ratio	TQ relationship
FP8 KV (E4M3)	Hopper / Blackwell HW	Hardware FP8 container	2×	No measurable loss (Red Hat eval)	Strictly better than TQ when 2× is enough
TurboQuant `4bit_nc`	vLLM merged, May 2026	Precision (4 bpv, no QJL)	2.6–3.1×	Within ~1 pt of FP8 on long-context evals	The practical TQ mode
TurboQuant `3bit_nc`	vLLM merged, May 2026	Precision (3 bpv, no QJL)	3.5–4.0×	15–25 pt drops on AIME25 / LCB	Edge / memory-bound only
TurboQuant (paper config, w/ QJL)	Apr 2025 / ICLR'26	Precision (3.5 bpv + QJL)	4.9×	Identical FP16 on LongBench (paper); worse in practice	The published configuration
TriAttention	Apr 2026	Token selection	10.7× on AIME25 32K CoT	Matches Full Attn reasoning	Orthogonal — stack
Adaptive KV-Quant	Apr 2026	Per-token bit-width	Variable {2, 4, 8, 16}	+8% vs static on edge	Wraps TQ as backend
LRKV	Apr 2026	Architectural	45–53% vs MHA	Lower test loss vs MHA	Pretraining-time; multiplies
KV Packet	Apr 2026	Cache reuse	Recompute-free	TTFT ↓ for RAG	Orthogonal — caches TQ packets
ParoQuant	ICLR'26	Weight quant (INT4)	4× on weights	+2.4% over AWQ on reasoning	Complementary: W4 × KV3.5
NVFP4 KV	Apr 2026	HW container (FP4)	3.5× vs BF16	Native Blackwell	TQ lives inside NVFP4 blocks
KVPress	NVIDIA	Framework	Varies	Varies	KVPress picks, TQ compresses
MLA	DeepSeek-V3	Architectural	Latent down-projection	Shipped in production	TQ compresses the latent
KIVI	ICLR'24	Precision (2.25 bpv)	7.1×	LongBench 48.50	Earlier baseline TQ beats
KVQuant	NeurIPS'24	Precision + outlier	~4.5×	LongBench ~49.5	Needs calibration; TQ doesn't
SnapKV / H2O / StreamingLLM	2024	Token eviction	4–20×	Drops on long context	Orthogonal — stack

Integrations

TurboQuant runs under every major serving engine. Concrete commands in INTEGRATIONS.md; summary:

Engine	Status (May 2026)	Entry point
vLLM (upstream)	Merged — see docs	`--kv-cache-dtype turboquant_{k8v4, 4bit_nc, k3v4_nc, 3bit_nc}`
vLLM (production fork)	`turboquant-plus-vllm` on PyPI — CUDA dequant kernels, MoE support	`vllm serve varjosoft/Qwen3.6-35B-A3B-TQ3-native --kv-cache-dtype turboquant_3bit_native`
vLLM (our plugin)	`vllm_plugin/` — works on vLLM ≥ 0.4.0 without patching	`--attention-backend turboquant`
vLLM (Triton reference)	0xSero/turboquant — 1.1K stars, RTX 5090 + 8×RTX 3090 benchmarks	`gh repo clone 0xSero/turboquant && pip install -e .`
SGLang	WIP — competing PRs #21419, #22048, #23135; none merged as of May 2026	`--kv-cache-dtype turboquant` (on PR branch)
llama.cpp (head_dim=128)	spiritbuun/llama-cpp-turboquant-cuda	`-ctk tq3 -ctv tq3 -fa`
llama.cpp (any head_dim)	AmesianX/TurboQuant v1.4.2 — only fork supporting head_dim=256+	`-ctk tq3 -ctv tq3 -fa`
PyTorch reference (QJL-free)	tonbistudio/turboquant-pytorch V3	`pip install turboquant-pytorch`
MLX (Apple Silicon)	varjoranta MLX port (Metal kernels)	`python -m turboquant.mlx.chat --model <id>`
NVIDIA KVPress	0.4.0 — framework of "press" strategies	Stack TQ under ExpectedAttention / ThinK / AdaKV
LMCache	First-class; their explainer is still the best layperson intro	Store TQ-compressed K/V in distributed cache
Transformers (HF)	Monkey-patch via `past_key_values=TurboQuantCache(...)`	`src/test_real_model.py`

Hardware support

GPU	Arch	Status	Recommended stack
B100 / B200 / GB200	Blackwell SM100	First-class	NVFP4 weights + TurboQuant KV + FlashAttention-4
RTX PRO 6000 Blackwell 96 GB	Blackwell SM120	Working (some WSL2 workarounds)	NVFP4 weights + TurboQuant KV
RTX 5090	Blackwell SM120	Working with workarounds	NVFP4 weights + TurboQuant KV
H100 / H200	Hopper SM90	First-class	FP8 weights + TurboQuant KV
A100	Ampere SM80	Fully supported	INT8 weights + TurboQuant KV
RTX 4090 / 4080	Ada SM89	Fully supported	AWQ-INT4 + TurboQuant KV (+ TriAttention for 32K+ reasoning)
AMD MI300X / MI325X	CDNA3	Via PyTorch / ROCm	INT8 weights + TurboQuant KV
Apple M3 / M4 / M5	Apple Silicon	PyTorch MPS path	MLX-INT4 weights + TurboQuant KV
Jetson Orin / Thor	Edge	Adaptive KV-Quant preferred	TurboQuant 2.5-bit fallback

Which compressor do I actually want?

A one-shot decision table for "I need to serve X on Y hardware, what do I set up?". Updated May 2026 to reflect the Red Hat AI evaluation findings. Full discussion in LANDSCAPE_2026.md.

Scenario	Start with	Add on
Just give me something that works	FP8 KV (`--kv-cache-dtype fp8`)	—
Datacenter Hopper/Blackwell, memory-constrained at long context	FP8 KV; switch to `turboquant_4bit_nc` only if FP8 doesn't fit	SnapKV / KVPress beyond 1M
Datacenter Blackwell, max throughput	NVFP4 weights + FP8 KV	—
Single-card 70B+ at 128K+	`turboquant_4bit_nc` (FP8 won't fit)	Skip first/last 2 layers
Consumer Blackwell (RTX 5090 / RTX PRO 6000)	NVFP4 weights + FP8 KV if fits, else `turboquant_4bit_nc`	—
Consumer Ada (RTX 4090 / 4080) — no FP8 attention	AWQ-INT4 + `turboquant_4bit_nc`	TriAttention for 32K+ CoT
Apple Silicon	MLX-INT4 + TurboQuant via varjoranta MLX port	—
On-device / edge with strict memory	`turboquant_3bit_nc` (accept the accuracy hit) or Adaptive KV-Quant	Token eviction
RAG, high cache reuse	`turboquant_4bit_nc`	KV Packet / LMCache
Long CoT reasoning (>32K AIME-style)	FP8 KV (3-bit TQ drops accuracy too hard)	TriAttention for token selection
Qwen3 / Qwen3.5 / Qwen3.6 (asymmetric K/V norms)	`turboquant_k8v4` or `k3v4_nc`	head_dim=256? use AmesianX llama.cpp

FAQ

Common questions and misconceptions are answered in FAQ.md. Highlights:

"Should I use FP8 or TurboQuant?" FP8 is the better default in May 2026. Reach for TurboQuant only when you need >2× compression and accept some throughput cost.
"Does the QJL step actually help?" Usually not at b ≤ 3. MSE-only beats MSE+QJL on every model the community has tested. The *_nc variants exist for this reason.
"Is this a replacement for AWQ / GPTQ / GGUF?" No — TurboQuant compresses the KV cache at inference time, stacking on top of weight quantization.
"Why 3.5 bits?" It's a mode name from the paper. In practice, outlier channels get an extra MSE bit + 1-bit QJL residual; actual budget is ~3.25–4.6 bpv depending on the mode (see BENCHMARKS.md §Current Demo Results).
"Do I need to calibrate?" No — TurboQuant is data-oblivious (random rotation + Lloyd-Max codebook are fixed at init).
"Does it work with RoPE / GQA / MLA / FlashAttention?" Yes to all.
"Which layers should I skip from compression?" The first and last 2.
"How do I choose K vs V bits for my model?" Profile the K/V norm ratio. Symmetric for Llama-class, asymmetric (K > V) for Qwen-class.

Limitations

Reference implementation — Pure PyTorch, not optimized for production throughput. Triton kernels are experimental. Use varjoranta CUDA kernels or 0xSero Triton for real serving.
CPU attention is slow — The demo runs on CPU (~25× slower than FP16). GPU kernels needed for competitive speed.
3-bit modes are not production-ready — Independent evaluations consistently find unacceptable accuracy drops on reasoning + >128K retrieval. Use 4bit_nc or FP8 for production; reserve 3-bit for edge / memory-bound deployments.
QJL is on by default for backward compatibility — But you should turn it off (qjl_score_weight=0.0). See tip #3.
Mixed-precision is approximate — Our outlier channel detection differs from the paper's theoretically optimal two-independent-instances approach (see IMPLEMENTATION_NOTES.md). Dynamic per-layer thresholds (scos-lab finding) help close the gap.
Tested on 2 models — Mistral-7B-Instruct and Nemotron-Nano-4B. For frontier-model results, see Red Hat AI eval, 0xSero RTX 5090 benchmarks, and varjoranta Qwen3.6-35B-A3B results.
vLLM plugin is a scaffold — Not yet tested at scale with actual vLLM serving. For production, use upstream --kv-cache-dtype turboquant_* (merged) or the turboquant-plus-vllm PyPI package.

Algorithm Details

TurboQuant implements two algorithms from the paper:

Algorithm 1: TurboQuant_mse (MSE-optimal)

Random rotation: Multiply by randomized Hadamard matrix Π
Scalar quantization: Lloyd-Max codebook for Beta distribution, applied per coordinate
Store: b-bit index per coordinate + FP16 norm
Distortion bound: MSE ≤ √(3π/2) · 4^(-b)

Algorithm 2: TurboQuant_prod (inner product-optimal)

Apply TurboQuant_mse with (b-1) bits
Compute residual: r = x - DeQuant(Quant(x))
QJL: sign(S · r) where S has i.i.d. N(0,1) entries
Unbiased: E[⟨y, x̂⟩] = ⟨y, x⟩ (no systematic bias)
Total: b bits per coordinate

Why Not Recursive Polar Transform?

The related PolarQuant paper uses recursive polar coordinates, but TurboQuant deliberately avoids this. Recursive polar transforms couple coordinates through sin/cos operations at each level, causing errors to compound through deep models (7 levels for d=128). TurboQuant's scalar approach quantizes each coordinate independently — zero coupling, zero compounding.

Project Structure

turboquant/
├── src/
│   ├── cache.py              # Core algorithm (encode/decode/cache/attention)
│   ├── demo.py               # Synthetic benchmark
│   ├── test_real_model.py    # Real transformer model validation
│   ├── test_turboquant.py    # Unit tests (33 tests)
│   ├── kernels.py            # Triton GPU kernels (experimental)
│   └── lut_attention.py      # LUT-based attention (experimental)
├── vllm_plugin/              # vLLM integration scaffold
├── deploy/                   # Docker deployment assets
│
├── README.md                 # ← you are here
├── LANDSCAPE_2026.md         # Full 2026 KV-compression ecosystem survey
├── INTEGRATIONS.md           # vLLM / SGLang / llama.cpp / MLX / KVPress / LMCache
├── FAQ.md                    # Common questions & misconceptions
├── BENCHMARKS.md             # Memory tables, throughput targets, methodology
├── IMPLEMENTATION_NOTES.md   # Rotation modes, outlier channels, QJL residual
├── pseudocode.md             # Line-by-line paper pseudocode for re-implementers
├── LAUNCH.md                 # Launch kit: threads, HN post, blog outline
├── reports/
│   ├── 2026-03-31-build-report.md    # RTX 5090 Blackwell benchmarks
│   ├── 2026-04-17-demo-results.md    # Fresh mixed-precision demo results
│   ├── 2026-04-17-demo-results.json  # Raw JSON artifact
│   └── scripts/
│       ├── run_demo_modes.py         # Reproducible 2.5-bit / 3.5-bit demo runner
│       └── check_thresholds.py       # CI gate on cosine-similarity regression
├── .github/workflows/
│   ├── test.yml              # Multi-python CI (3.10 / 3.11 / 3.12)
│   └── link-check.yml        # Weekly + PR link rot detection
└── setup.py                  # Package installation (exposes turboquant)

Citation

@inproceedings{zandieh2026turboquant,
  title={TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate},
  author={Zandieh, Amir and Daliri, Majid and Hadian, Majid and Mirrokni, Vahab},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2026}
}

Credits & Attribution

This is an independent open-source implementation of the TurboQuant algorithm. All credit for the algorithm design, theoretical analysis, and original research belongs to the paper authors.

Paper: TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate — Published at ICLR 2026
Authors: Amir Zandieh (Google Research), Majid Daliri (NYU), Majid Hadian (Google DeepMind), Vahab Mirrokni (Google Research)
Related work: PolarQuant, QJL by overlapping authors
Implementation: Terp AI Labs

This implementation is not affiliated with or endorsed by Google Research, Google DeepMind, or NYU. We built it from the public paper to make TurboQuant accessible to the open-source community.

License

MIT — see LICENSE for details.

Name		Name	Last commit message	Last commit date
Latest commit History 46 Commits
.github		.github
assets		assets
deploy		deploy
notebooks		notebooks
reports		reports
spaces/turboquant-demo		spaces/turboquant-demo
src		src
vllm_plugin		vllm_plugin
.gitignore		.gitignore
BENCHMARKS.md		BENCHMARKS.md
CONTRIBUTING.md		CONTRIBUTING.md
FAQ.md		FAQ.md
IMPLEMENTATION_NOTES.md		IMPLEMENTATION_NOTES.md
INTEGRATIONS.md		INTEGRATIONS.md
LANDSCAPE_2026.md		LANDSCAPE_2026.md
LAUNCH.md		LAUNCH.md
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
README.md		README.md
RELEASING.md		RELEASING.md
pseudocode.md		pseudocode.md
pyproject.toml		pyproject.toml
setup.py		setup.py

Folders and files

Latest commit

History

Repository files navigation

⚡ TurboQuant

Reality check (May 2026)

Where TurboQuant still wins

Table of contents

What's new — May 2026

The May 2026 must-reads

Active community ports

Native-packed checkpoints

Earlier (still relevant) — April 2026

Why TurboQuant?

📄 Paper Results (Llama-3.1-8B-Instruct, LongBench — from the paper)

🔧 Our Implementation Results (Mistral-7B-Instruct-v0.3)

How It Works

Quick Start

Tips and tricks

1. Try FP8 first

2. If TurboQuant: pick 4bit_nc, not 3bit_nc

3. Disable QJL when working with the reference path

4. Skip the first and last 2 layers

5. Profile K vs V norms before picking bits

6. Use per-layer dynamic outlier thresholds

7. For frontier MoE models, use the native-packed checkpoint

8. For llama.cpp, mind the head_dim

9. Stack with token selection for long reasoning

10. Don't trust headline compression numbers without verifying

The 2026 KV compression landscape

Integrations

Hardware support

Which compressor do I actually want?

FAQ

Limitations

Algorithm Details

Algorithm 1: TurboQuant_mse (MSE-optimal)

Algorithm 2: TurboQuant_prod (inner product-optimal)

Why Not Recursive Polar Transform?

Project Structure

Citation

Credits & Attribution

License

About

Topics

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

2. If TurboQuant: pick `4bit_nc`, not `3bit_nc`

Packages