Research Date: 2026-02-08 Status: Complete
Quantization reduces model size by converting weights from high-precision (FP16/FP32) to low-precision formats (INT4/INT8). This enables running large models on consumer hardware.
Key Insight: A 7B parameter model:
- FP16: ~14GB VRAM
- INT4: ~3.5GB VRAM (4x smaller!)
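The arithmetic behind these figures is simply parameter count × bits per weight; a minimal sketch (the helper name is illustrative, and the estimate ignores activations, KV cache, and runtime overhead):

```python
def vram_estimate_gb(n_params_billion: float, bits_per_weight: int) -> float:
    """Weight-only memory estimate: params x bits / 8.
    Ignores activations, KV cache, and runtime overhead."""
    return n_params_billion * bits_per_weight / 8

print(vram_estimate_gb(7, 16))  # FP16: 14.0 GB
print(vram_estimate_gb(7, 4))   # INT4: 3.5 GB
```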
| Method | Type | Target Hardware | Best For |
|---|---|---|---|
| GPTQ | Quantization algorithm | GPU | Cloud, production |
| AWQ | Quantization algorithm | GPU (edge) | Fast inference, edge |
| GGUF | File format + algorithm | CPU (+ GPU) | Local, laptops |
Important Distinction:
- GPTQ and AWQ are quantization algorithms
- GGUF is a file format (with its own quantization method)
GPTQ's core idea: quantize weights one at a time, then adjust the remaining weights to compensate for the quantization error.
For each weight:
1. Quantize the weight
2. Measure the quantization error
3. Adjust all remaining unquantized weights
4. Repeat
1. Build Sensitivity Map (Hessian): `H = 2 × X × X^T + λI`
   - High H value = "This weight matters, be careful!"
   - Low H value = "Quantize aggressively"
2. Error Calculation: `error = (original_weight - quantized_weight) / sensitivity`
3. Compensate Other Weights: `remaining_weights -= error × adjustment_factors`
- Uses Cholesky decomposition for numerical stability
- Group-wise quantization (typically group_size=128)
- Column-by-column processing for optimal compensation
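The quantize-and-compensate loop can be sketched as toy NumPy code. This is a simplified single-row version (real GPTQ works from a Cholesky factorization of the inverse Hessian and processes columns in groups, as noted above); `gptq_like_row` and `quantize_rtn` are illustrative names, not library APIs:

```python
import numpy as np

def quantize_rtn(w, step):
    """Round-to-nearest onto a uniform grid."""
    return np.round(w / step) * step

def gptq_like_row(w, H_inv, step):
    """Quantize one weight row column by column, spreading each
    column's quantization error onto the not-yet-quantized columns
    via the inverse Hessian (the GPTQ update rule)."""
    w = w.astype(float).copy()
    q = np.empty_like(w)
    for i in range(len(w)):
        q[i] = quantize_rtn(w[i], step)
        err = (w[i] - q[i]) / H_inv[i, i]
        w[i + 1:] -= err * H_inv[i, i + 1:]  # compensate remaining weights
    return q
```

With an identity inverse Hessian this degenerates to plain round-to-nearest; off-diagonal entries make the compensation visible.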
| Metric | Value |
|---|---|
| Compression | 4x (16-bit → 4-bit) |
| Speed | ~3.2x faster than FP16 |
| Quality | Minimal loss |
| Best for | Cloud deployment, A100/H100 |
AWQ's key insight: only ~1% of weights really matter. Identify them via activation patterns and protect their precision.
Traditional: Treat all weights equally → Critical weights lose precision
AWQ: Scale up important weights → Preserve precision where it matters
Ingredient A: Salt (25g, HIGH impact)
- Without AWQ: Rounds to 0g → Dish ruined!
- With AWQ: Scale 4× → 100g → Rounds to 107g → /4 = 26.75g ✓
Ingredient B: Flour (480g, LOW impact)
- Without AWQ: Perfect match
- With AWQ: Scale 0.5× → 240g → Rounds to 267g → ×2 = 534g (acceptable)
1. Find Activation Magnitudes: `s_x = mean(|X|, dim=0)` (per input channel)
2. Scale Factor with Alpha: `scale = s_x^α`
   - α = 0: No scaling (standard quantization)
   - α = 1: Maximum protection
   - 0 < α < 1: Optimal (found via grid search)
3. Apply Scaling: `Q(W × scale) × (X / scale) ≈ W × X`
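The three steps above can be condensed into a small NumPy sketch. This is a toy version with a made-up quantization step size; real AWQ grid-searches α and uses grouped 4-bit quantization:

```python
import numpy as np

def awq_forward(W, X, alpha=0.5, step=0.05):
    """Toy AWQ-style pass. W: (out, in), X: (batch, in).
    Salient input channels (large mean |X|) are scaled up before
    round-to-nearest quantization; the scale is divided back out
    of the activations, so the product stays ~unchanged."""
    s_x = np.abs(X).mean(axis=0)            # step 1: activation magnitudes
    scale = s_x ** alpha                    # step 2: per-channel scale
    Wq = np.round(W * scale / step) * step  # quantize the scaled weights
    return (X / scale) @ Wq.T               # step 3: Q(W*s) @ (X/s) ~ X @ W^T
```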
| Metric | Value |
|---|---|
| Compression | 4x |
| Speed | 22x faster than FP16 (with kernels) |
| Calibration | Only 16-32 samples needed |
| Best for | Edge devices, fast quantization |
| Aspect | GPTQ | AWQ |
|---|---|---|
| Approach | Compensate errors | Protect important weights |
| Calibration samples | 128-256 | 16-32 |
| Speed | Fast | Faster |
| Generalization | Good | Better |
GGUF lets you run LLMs on CPUs without GPUs. It is designed for consumer laptops and PCs.
1. Find Range: `min, max = weights.min(), weights.max()` and `step_size = (max - min) / (2^bits - 1)`
2. Quantize: `quantized = round((weight - min) / step_size)`
3. Dequantize (at inference): `weight ≈ quantized × step_size + min`
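A minimal sketch of this min/max scheme, per-tensor for simplicity (real GGUF quantizes in small blocks with per-block scales, and its K-quant types are more elaborate; function names are illustrative):

```python
import numpy as np

def quantize_block(weights, bits=4):
    """Affine min/max quantization to unsigned integers."""
    lo, hi = weights.min(), weights.max()
    step = (hi - lo) / (2**bits - 1)
    q = np.round((weights - lo) / step).astype(np.uint8)
    return q, step, lo

def dequantize_block(q, step, lo):
    """Reconstruct approximate weights at inference time."""
    return q * step + lo
```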
| Level | Bits/Weight | Size (7B) | Quality | Speed |
|---|---|---|---|---|
| Q8 | 8 | 7GB | Excellent | Fast |
| Q5 | 5 | 4.4GB | Very Good | Medium |
| Q4 | 4 | 3.5GB | Good | Medium |
| Q3 | 3 | 2.6GB | Acceptable | Slow |
| Q2 | 2 | 1.75GB | Degraded | Slow |
Most Popular: Q4 (best quality/size tradeoff)
- Multiple precision levels (choose your tradeoff)
- CPU-optimized (SIMD, AVX2/AVX-512)
- Zero calibration needed
- Fast conversion (~2-3 min for 7B)
Marlin is NOT a quantization algorithm — it's a CUDA kernel that runs GPTQ/AWQ models faster.
| Method | Without Marlin | With Marlin | Speedup |
|---|---|---|---|
| GPTQ | 276 tok/s | 712 tok/s | 2.6x |
| AWQ | 68 tok/s | 741 tok/s | 10.9x |
- Async Memory Copies — Compute while loading
- L1 Cache Bypass — Direct to shared memory
- Fused Dequantization — No intermediate storage
- Optimized L2 Usage — 80-95% cache hit rate
Quality (share of FP16 quality retained): AWQ (~95%) > GGUF (~92%) ≈ GPTQ (~90%)
Speed: AWQ+Marlin (741 tok/s) > GPTQ+Marlin (712 tok/s) > GGUF on CPU (varies)
| Scenario | Recommendation |
|---|---|
| Production (Cloud) | AWQ with Marlin |
| High accuracy needed | GPTQ |
| Local development | GGUF Q4 |
| Edge devices | AWQ |
| No GPU | GGUF (Q4-Q5) |
| Minimal RAM | GGUF Q2-Q3 |
Goal: Run a 7-8B model locally on João's machine (31GB RAM, no information on a dedicated GPU)
Recommended Path:
1. Start with GGUF Q4
   - Works on CPU
   - 3.5GB for a 7B model
   - Easy with llama.cpp or Ollama
   - Good quality for testing
2. If a GPU is available
   - Use AWQ with vLLM or text-generation-inference
   - Much faster inference
3. For fine-tuned models
   - Train with QLoRA (4-bit)
   - Export to GGUF for local inference
   - Or keep the AWQ format for GPU
| Model Size | GGUF Q4 RAM | AWQ GPU VRAM |
|---|---|---|
| 7B | ~4GB | ~4GB |
| 8B | ~4.5GB | ~4.5GB |
| 13B | ~8GB | ~8GB |
| 32B | ~18GB | ~18GB |
| Format | Inference Tools |
|---|---|
| GGUF | llama.cpp, Ollama, LM Studio, text-generation-webui |
| AWQ | vLLM, text-generation-inference, HuggingFace |
| GPTQ | vLLM, AutoGPTQ, exllama |
- Runtime-adaptive precision
- <10% more VRAM than standard 4-bit
- Better accuracy
- Affine transformations before quantization
- State-of-the-art W4A4 results
- More complex calibration
- Flexible bit-level operations
- Custom precision per layer
- Research stage
- GPTQ Paper — Frantar et al., 2022
- AWQ Paper — Lin et al., 2023 (MLSys 2024 Best Paper)
- Marlin Paper — Frantar et al., 2024
- GGUF Format
- vLLM Benchmarks
- Visual Guide
Analysis by Cláudio for Project Molting