You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
**Real-model validation**, **adaptive compression**, and **information-theoretic foundations**. Every theoretical claim is now backed by measured data from actual model inference.
14
+
15
+
### Added
16
+
17
+
#### Real-Model Validation (Phase A)
18
+
-**Perplexity pipeline** (`--ppl <file>`): Teacher-forced PPL measurement. Gemma 4B results: 1-bit K + Q4 V PPL = 36.00 vs FP16 PPL = 35.99 — **+0.03% degradation** (effectively lossless).
19
+
-**Formal unbiasedness** (`tests/test_unbiased.cpp`): 100K random vector pairs prove all TurboQuant types have < 0.2% relative bias. The "unbiased inner product" claim is empirically verified.
20
+
-**Activation profiling** (`--profile-kv`): Per-layer pre/post-RHT distribution statistics. RHT reduces kurtosis from 10-99 to 3.9-7.9 and eliminates skewness. Honest finding: post-RHT is not perfectly Gaussian.
21
+
-**Memory bandwidth benchmark** (`--bench-memory`): tok/s vs context length across KV types.
22
+
23
+
#### Adaptive Compression (Phase B)
24
+
-**Per-layer bit recommendation** (`--recommend`): Profiles activation kurtosis, recommends 1-bit or 3-bit per layer. Gemma 270M: average 2.0 bits (vs 3.0 uniform) → 33% memory savings potential.
-**V highres window** (`-V N`): Recent N tokens stored as FP16 alongside Q4/Q2 V. Test showed Q4 V already near-lossless (PPL +0.03%), so hybrid adds no measurable benefit.
27
+
-**Online codebook calibration** (`--calibrate`): Lloyd-Max iteration on real activation data. **MSE improved 49.7%** over default N(0,1) codebook — proves model-specific calibration matters.
28
+
29
+
#### Engine (Phase C)
30
+
-**Fused Q4 domain attention**: Weighted sum computed directly from packed nibbles without dequantize buffer. NEON `vfmaq_f32` path. Reduces memory traffic.
31
+
-**Prefill benchmark** (`--bench-prefill`): Measures KV quantization overhead during prompt processing.
32
+
-**CoW benchmark** (`bench/cow_bench.sh`): Analytical memory savings for shared-prefix serving.
0 commit comments