Unified KV Cache Compression Methods for Auto-Regressive Models
LLM KV cache compression made easy
Awesome-LLM-KV-Cache: A curated list of 📙Awesome LLM KV Cache Papers with Codes.
[NeurIPS'25 Oral] Query-agnostic KV cache eviction: 3–4× reduction in memory and 2× decrease in latency (Qwen3/2.5, Gemma3, LLaMA3)
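Query-agnostic eviction like the entry above drops low-importance cache entries without knowing future queries. A minimal sketch of the general idea, keeping the top-scoring tokens under a fixed budget; the function and scoring here are illustrative assumptions, not that repo's actual method:

```python
import numpy as np

def evict_kv(keys, values, scores, budget):
    """Keep only the `budget` cache entries with the highest
    accumulated importance scores (e.g. past attention mass).

    Illustrative sketch of query-agnostic KV eviction; real methods
    use more careful scoring and per-layer budgets.
    """
    keep = np.argsort(scores)[-budget:]  # indices of top-`budget` tokens
    keep.sort()                          # preserve original token order
    return keys[keep], values[keep]

# Hypothetical cache of 100 tokens, evicted down to 25 (4x reduction).
keys = np.random.randn(100, 64)
values = np.random.randn(100, 64)
scores = np.random.rand(100)
small_k, small_v = evict_kv(keys, values, scores, budget=25)
```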
Block Transformer: Global-to-Local Language Modeling for Fast Inference (NeurIPS 2024)
[ICLR 2025] Palu: Compressing KV-Cache with Low-Rank Projection
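Low-rank projection methods like the one above store a factored KV cache instead of the full matrix. A minimal truncated-SVD sketch of the storage trade-off; the function names and shapes are assumptions for illustration, not Palu's API:

```python
import numpy as np

def low_rank_compress(kv, rank):
    """Compress a (seq_len, head_dim) KV matrix via truncated SVD.

    Sketch of low-rank KV-cache compression: store two thin factors
    instead of the full matrix.
    """
    U, S, Vt = np.linalg.svd(kv, full_matrices=False)
    A = U[:, :rank] * S[:rank]  # (seq_len, rank)
    B = Vt[:rank, :]            # (rank, head_dim)
    return A, B                 # cache these instead of kv

def low_rank_decompress(A, B):
    return A @ B

kv = np.random.randn(128, 64)
A, B = low_rank_compress(kv, rank=16)
# Storage drops from 128*64 floats to 128*16 + 16*64 (~2.7x smaller).
approx = low_rank_decompress(A, B)
```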
PyTorch implementation of "Compressed Context Memory For Online Language Model Interaction" (ICLR'24)
This is the official repo of "QuickLLaMA: Query-aware Inference Acceleration for Large Language Models"
First open-source implementation of Google TurboQuant (ICLR 2026) -- near-optimal KV cache compression for LLM inference. 5x compression with near-zero quality loss.
xKV: Cross-Layer SVD for KV-Cache Compression
(ACL2025 oral) SCOPE: Optimizing KV Cache Compression in Long-context Generation
Accurate and fast KV cache compression with a gating mechanism
First open-source KVTC implementation (NVIDIA, ICLR 2026) -- 8-32x KV cache compression via PCA + adaptive quantization + entropy coding
LAVa: Layer-wise KV Cache Eviction with Dynamic Budget Allocation
Native Windows build of vLLM 0.19.0 — no WSL, no Docker. Pre-built wheels + 33-file Windows patch + Multi-TurboQuant KV cache compression (6 methods, 2x cache capacity). PyTorch 2.10 + CUDA 12.6 + Triton + Flash-Attention 2.
AI agent skill implementing Google's TurboQuant compression algorithm (ICLR 2026) — 6x KV cache memory reduction, 8x speedup, zero accuracy loss. Compatible with Claude Code, Codex CLI, and all Agent Skills-compatible tools.
Near-optimal vector quantization for LLM KV cache compression. Python implementation of TurboQuant (ICLR 2026) — PolarQuant + QJL for 3-bit quantization with minimal accuracy loss and up to 8x memory reduction.
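To make the 3-bit storage concrete, here is a minimal per-row uniform quantizer for a KV tensor. This is plain asymmetric uniform quantization, not PolarQuant or QJL; all names are illustrative:

```python
import numpy as np

def quantize_kv(x, bits=3):
    """Per-row asymmetric uniform quantization of a KV tensor.

    Sketch only: maps each row's range onto 2**bits - 1 levels and
    stores integer codes plus per-row (scale, zero-point).
    """
    levels = (1 << bits) - 1                        # 7 levels for 3-bit
    lo = x.min(axis=-1, keepdims=True)
    hi = x.max(axis=-1, keepdims=True)
    scale = np.where(hi > lo, (hi - lo) / levels, 1.0)
    q = np.round((x - lo) / scale).astype(np.uint8)  # codes in [0, levels]
    return q, scale, lo

def dequantize_kv(q, scale, lo):
    return q * scale + lo

x = np.random.randn(4, 64)
q, scale, lo = quantize_kv(x, bits=3)
x_hat = dequantize_kv(q, scale, lo)
# Reconstruction error per element is bounded by half a quantization step.
```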
Repository for the paper: https://arxiv.org/abs/2510.00231
Drop-in KV cache compression for MLX on Apple Silicon. Brings PolarQuant (Google, ICLR 2026) to mlx-lm with first-class Gemma 4 support: MatFormer, dual head_dim, hybrid sliding/global attention, cross-layer KV sharing. 3-bit → 4.8× smaller cache, 0.995 logit cosine @ 4-bit.
Test Google's new TurboQuant KV-cache compression (ICLR 2026) on your local machine — measure real speed, memory, and accuracy differences across compression modes.