docs: v0.3.0 release notes + README analysis tools section

unamedkr · claude · unamedkr · commit c35040fcf8de · 2026-04-01T19:00:20.000+09:00
RELEASE_NOTES.md: v0.3.0 with all Phase A-D results
- PPL +0.03% (1b K + Q4 V), unbiasedness < 0.2%, calibration 49.7% MSE gain
README: Analysis Tools section, updated verification table (30 suites),
  perplexity and rate-distortion rows added

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
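
The relative figures quoted in this message can be recomputed from the raw values in the release notes in the diff below (PPL 35.99 baseline, 36.00 and 42.23 quantized; 2.0 vs 3.0 average bits). A minimal C sketch of that arithmetic follows. `rel_change_pct` and `print_release_deltas` are hypothetical helpers written for this illustration and are not part of the repository:

```c
#include <assert.h>
#include <math.h>
#include <stdio.h>

/* Hypothetical helper (not in the repository): percent change of
 * `value` relative to `baseline`. Positive means `value` is larger. */
double rel_change_pct(double baseline, double value) {
    return 100.0 * (value - baseline) / baseline;
}

/* Prints the relative changes for the figures quoted in the release notes. */
void print_release_deltas(void) {
    /* Perplexity: higher is worse, so a positive change is a degradation. */
    printf("1b K + Q4 V vs baseline PPL: %+.2f%%\n", rel_change_pct(35.99, 36.00));
    printf("1b K + Q2 V vs baseline PPL: %+.1f%%\n", rel_change_pct(35.99, 42.23));
    /* Bit budget: a negative change is a saving, so flip the sign. */
    printf("adaptive 2.0 vs uniform 3.0 bits: %.0f%% saving\n",
           -rel_change_pct(3.0, 2.0));
}
```

Calling `print_release_deltas()` from a `main` reproduces the +0.03%, +17.3%, and 33% figures after rounding, so the percentages are consistent with the raw numbers they are derived from.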
diff --git a/README.md b/README.md
@@ -4,7 +4,7 @@
 
 [![License](https://img.shields.io/badge/license-Apache%202.0-blue)]()
 [![Release](https://img.shields.io/github/v/release/quantumaikr/TurboQuant.cpp)]()
-[![Tests](https://img.shields.io/badge/tests-26%20suites-brightgreen)]()
+[![Tests](https://img.shields.io/badge/tests-30%20suites-brightgreen)]()
 
 ### Up to 7.1x total K+V compression. Quality preserved.
 
@@ -121,24 +121,54 @@ Multi-architecture: Qwen3.5 (DeltaNet hybrid) + Gemma 3 (sliding window). Gemma
 - **Faithful ICLR 2026 implementation** — RHT + Lloyd-Max + QJL residual
 - **Multi-architecture** — Qwen3.5 (DeltaNet) + Gemma 3 (sliding window + GeGLU)
 - **NEON vectorized** — matmul, attention, RHT butterfly, Hamming distance, Q4 dequant, FP16 conversion
-- **26 test suites** — KV roundtrip, attention distribution, codebook theory, NEON/scalar consistency, edge cases, Q2 weights
+- **Fused Q4 attention** — weighted sum directly from packed nibbles, no dequant buffer
+- **Adaptive compression** — per-layer bit recommendation, codebook calibration, attention entropy
+- **30 test suites** — KV roundtrip, attention distribution, codebook theory, NEON consistency, edge cases, unbiasedness, rate-distortion, cumulative error
 
 ### Verification Summary
 
 | Category | Tests | What's Verified |
 |----------|-------|-----------------|
+| Perplexity | `--ppl` | Gemma 4B: 1b K + Q4 V = PPL 36.00 (+0.03% vs FP16) |
+| Unbiasedness | 100K pairs | All types: relative bias < 0.2% |
 | NEON/scalar consistency | 14 | Every NEON path matches scalar reference (Q4, Q2, RHT, RoPE, matmul, RMSNorm, Hamming) |
 | Attention distribution | 8 | Cosine similarity, Spearman rank, top-k overlap vs FP32 reference |
 | Codebook theory | 5 | Lloyd-Max centroids match literature, MSE within 1.18x of info-theoretic optimal |
 | Edge cases | 29 | n=1, dim=0, NaN, Inf, all-same, all-zero, n=10000 |
-| ASan + UBSan | 26 | Full suite under sanitizers, zero memory errors |
+| Rate-distortion | 5 | Info-theoretic lower bound gap: Q4 2.41x, Lloyd-Max < 0.15 bits wasted |
+| Cumulative error | 3 | 16-layer cosine: 0.998 (Q4), errors grow sub-linearly |
+| ASan + UBSan | 30 | Full suite under sanitizers, zero memory errors |
 | Thread safety | mutex | Global workspace realloc protected against concurrent access |
 | Numerical stability | 4 | Overflow-safe norm (max-abs rescaling), NaN/Inf input guards |
 
 Full details: [docs/RELEASE_NOTES.md](docs/RELEASE_NOTES.md)
 
 ---
 
+## Analysis Tools
+
+```bash
+# Perplexity measurement
+./build/tq_run model.tqm --ppl input.txt -k turbo_kv_1b -v q4
+
+# Per-layer bit allocation recommendation
+./build/tq_run model.tqm --recommend -k turbo_kv_1b -p "calibration text"
+
+# Online codebook calibration (measures MSE improvement)
+./build/tq_run model.tqm --calibrate -k turbo_kv_1b -p "calibration text"
+
+# Activation distribution profiling (pre/post-RHT)
+./build/tq_run model.tqm --profile-kv -k turbo_kv_1b -p "text"
+
+# Attention entropy analysis
+./build/tq_run model.tqm --attn-entropy -k turbo_kv_1b -p "text"
+
+# Full auto-profile pipeline
+bash bench/auto_profile.sh model.tqm
+```
+
+---
+
 ## Benchmarks & Validation
 
 ### Ablation: Does TurboQuant Actually Help?
diff --git a/docs/RELEASE_NOTES.md b/docs/RELEASE_NOTES.md
@@ -6,6 +6,52 @@ Versioning follows [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
 ---
 
+## [v0.3.0] — 2026-04-01
+
+### Highlights
+
+**Real-model validation**, **adaptive compression**, and **information-theoretic foundations**. Every theoretical claim is now backed by measured data from actual model inference.
+
+### Added
+
+#### Real-Model Validation (Phase A)
+- **Perplexity pipeline** (`--ppl <file>`): Teacher-forced PPL measurement. Gemma 4B results: 1-bit K + Q4 V PPL = 36.00 vs FP16 PPL = 35.99 — **+0.03% degradation** (effectively lossless).
+- **Formal unbiasedness** (`tests/test_unbiased.cpp`): 100K random vector pairs show that all TurboQuant types have < 0.2% relative bias, which empirically verifies the "unbiased inner product" claim.
+- **Activation profiling** (`--profile-kv`): Per-layer pre/post-RHT distribution statistics. RHT reduces kurtosis from 10-99 to 3.9-7.9 and eliminates skewness. Honest finding: post-RHT is not perfectly Gaussian.
+- **Memory bandwidth benchmark** (`--bench-memory`): tok/s vs context length across KV types.
+
+#### Adaptive Compression (Phase B)
+- **Per-layer bit recommendation** (`--recommend`): Profiles activation kurtosis, recommends 1-bit or 3-bit per layer. Gemma 270M: average 2.0 bits (vs 3.0 uniform) → 33% memory savings potential.
+- **Attention entropy analysis** (`--attn-entropy`): Per-head Shannon entropy identifies sharp vs diffuse attention patterns.
+- **V highres window** (`-V N`): The most recent N tokens are stored as FP16 alongside Q4/Q2 V. Testing showed that Q4 V is already near-lossless (PPL +0.03%), so the hybrid adds no measurable benefit.
+- **Online codebook calibration** (`--calibrate`): Lloyd-Max iteration on real activation data. **MSE improved 49.7%** over the default N(0,1) codebook, which indicates that model-specific calibration matters.
+
+#### Engine (Phase C)
+- **Fused Q4 domain attention**: Weighted sum computed directly from packed nibbles without dequantize buffer. NEON `vfmaq_f32` path. Reduces memory traffic.
+- **Prefill benchmark** (`--bench-prefill`): Measures KV quantization overhead during prompt processing.
+- **CoW benchmark** (`bench/cow_bench.sh`): Analytical memory savings for shared-prefix serving.
+- **Auto compression profile** (`bench/auto_profile.sh`): Full pipeline: profile → recommend → calibrate → JSON output.
+
+#### Theory (Phase D)
+- **Rate-distortion bounds** (`tests/test_rate_distortion.cpp`): Computes info-theoretic minimum MSE at each bit-width. Q4 uniform: 2.41x gap. Lloyd-Max: < 0.15 bits wasted.
+- **Cumulative error analysis** (`tests/test_cumulative_error.cpp`): 16-layer simulation shows errors grow sub-linearly. Cosine similarity after 16 layers: 0.998 (Q4), 0.951 (Q2).
+
+### Measured Results
+
+| Metric | Value | Source |
+|--------|-------|--------|
+| Gemma 4B PPL (uniform_4b) | 35.99 | `--ppl` |
+| Gemma 4B PPL (1b K + Q4 V) | 36.00 (+0.03%) | `--ppl` |
+| Gemma 4B PPL (1b K + Q2 V) | 42.23 (+17.3%) | `--ppl` |
+| Unbiasedness (all types) | < 0.2% rel_bias | `test_unbiased` |
+| Post-RHT kurtosis range | 3.9 – 7.9 | `--profile-kv` |
+| Adaptive bit average | 2.0 bits (33% saving) | `--recommend` |
+| Calibrated codebook MSE improvement | 49.7% | `--calibrate` |
+| 16-layer cumulative cosine (Q4) | 0.998 | `test_cumulative_error` |
+| Rate-distortion gap (Q4 uniform) | 2.41x | `test_rate_distortion` |
+
+---
+
 ## [v0.2.0] — 2026-04-01
 
 ### Highlights
diff --git a/src/engine/tq_gguf_quants.c b/src/engine/tq_gguf_quants.c
@@ -153,96 +153,7 @@ typedef struct {
     int8_t   qs[32];
 } block_q8_1;
 
-/* ============================================================
- * Type size / block size / name utilities
- * ============================================================ */
-
-size_t tq_ggml_type_size(tq_ggml_dtype type) {
-    switch (type) {
-        case TQ_GGML_TYPE_F32:       return 4;
-        case TQ_GGML_TYPE_F16:       return 2;
-        case TQ_GGML_TYPE_BF16:      return 2;
-        case TQ_GGML_TYPE_Q4_0:      return sizeof(block_q4_0);    /* 18 */
-        case TQ_GGML_TYPE_Q4_1:      return sizeof(block_q4_1);    /* 20 */
-        case TQ_GGML_TYPE_Q5_0:      return sizeof(block_q5_0);    /* 22 */
-        case TQ_GGML_TYPE_Q5_1:      return sizeof(block_q5_1);    /* 24 */
-        case TQ_GGML_TYPE_Q8_0:      return sizeof(block_q8_0);    /* 34 */
-        case TQ_GGML_TYPE_Q8_1:      return sizeof(block_q8_1);    /* 36 */
-        case TQ_GGML_TYPE_Q2_K:      return sizeof(block_q2_K);    /* 84 */
-        case TQ_GGML_TYPE_Q3_K:      return sizeof(block_q3_K);    /* 110 */
-        case TQ_GGML_TYPE_Q4_K:      return sizeof(block_q4_K);    /* 144 */
-        case TQ_GGML_TYPE_Q5_K:      return sizeof(block_q5_K);    /* 176 */
-        case TQ_GGML_TYPE_Q6_K:      return sizeof(block_q6_K);    /* 210 */
-        case TQ_GGML_TYPE_Q8_K:      return 292;                   /* 256 + 2 + 32 + 2 */
-        case TQ_GGML_TYPE_IQ2_XXS:   return 66;
-        case TQ_GGML_TYPE_IQ2_XS:    return 74;
-        case TQ_GGML_TYPE_IQ3_XXS:   return 98;
-        case TQ_GGML_TYPE_IQ1_S:     return 50;
-        case TQ_GGML_TYPE_IQ4_NL:    return 18;
-        case TQ_GGML_TYPE_IQ3_S:     return 110;
-        case TQ_GGML_TYPE_IQ2_S:     return 82;
-        case TQ_GGML_TYPE_IQ4_XS:    return 36;
-        default:                     return 0;
-    }
-}
-
-int tq_ggml_type_blck(tq_ggml_dtype type) {
-    switch (type) {
-        case TQ_GGML_TYPE_F32:       return 1;
-        case TQ_GGML_TYPE_F16:       return 1;
-        case TQ_GGML_TYPE_BF16:      return 1;
-        case TQ_GGML_TYPE_Q4_0:      return 32;
-        case TQ_GGML_TYPE_Q4_1:      return 32;
-        case TQ_GGML_TYPE_Q5_0:      return 32;
-        case TQ_GGML_TYPE_Q5_1:      return 32;
-        case TQ_GGML_TYPE_Q8_0:      return 32;
-        case TQ_GGML_TYPE_Q8_1:      return 32;
-        case TQ_GGML_TYPE_Q2_K:      return 256;
-        case TQ_GGML_TYPE_Q3_K:      return 256;
-        case TQ_GGML_TYPE_Q4_K:      return 256;
-        case TQ_GGML_TYPE_Q5_K:      return 256;
-        case TQ_GGML_TYPE_Q6_K:      return 256;
-        case TQ_GGML_TYPE_Q8_K:      return 256;
-        case TQ_GGML_TYPE_IQ2_XXS:   return 256;
-        case TQ_GGML_TYPE_IQ2_XS:    return 256;
-        case TQ_GGML_TYPE_IQ3_XXS:   return 256;
-        case TQ_GGML_TYPE_IQ1_S:     return 256;
-        case TQ_GGML_TYPE_IQ4_NL:    return 32;
-        case TQ_GGML_TYPE_IQ3_S:     return 256;
-        case TQ_GGML_TYPE_IQ2_S:     return 256;
-        case TQ_GGML_TYPE_IQ4_XS:    return 32;
-        default:                     return 0;
-    }
-}
-
-const char* tq_ggml_type_name(tq_ggml_dtype type) {
-    switch (type) {
-        case TQ_GGML_TYPE_F32:       return "F32";
-        case TQ_GGML_TYPE_F16:       return "F16";
-        case TQ_GGML_TYPE_BF16:      return "BF16";
-        case TQ_GGML_TYPE_Q4_0:      return "Q4_0";
-        case TQ_GGML_TYPE_Q4_1:      return "Q4_1";
-        case TQ_GGML_TYPE_Q5_0:      return "Q5_0";
-        case TQ_GGML_TYPE_Q5_1:      return "Q5_1";
-        case TQ_GGML_TYPE_Q8_0:      return "Q8_0";
-        case TQ_GGML_TYPE_Q8_1:      return "Q8_1";
-        case TQ_GGML_TYPE_Q2_K:      return "Q2_K";
-        case TQ_GGML_TYPE_Q3_K:      return "Q3_K";
-        case TQ_GGML_TYPE_Q4_K:      return "Q4_K";
-        case TQ_GGML_TYPE_Q5_K:      return "Q5_K";
-        case TQ_GGML_TYPE_Q6_K:      return "Q6_K";
-        case TQ_GGML_TYPE_Q8_K:      return "Q8_K";
-        case TQ_GGML_TYPE_IQ2_XXS:   return "IQ2_XXS";
-        case TQ_GGML_TYPE_IQ2_XS:    return "IQ2_XS";
-        case TQ_GGML_TYPE_IQ3_XXS:   return "IQ3_XXS";
-        case TQ_GGML_TYPE_IQ1_S:     return "IQ1_S";
-        case TQ_GGML_TYPE_IQ4_NL:    return "IQ4_NL";
-        case TQ_GGML_TYPE_IQ3_S:     return "IQ3_S";
-        case TQ_GGML_TYPE_IQ2_S:     return "IQ2_S";
-        case TQ_GGML_TYPE_IQ4_XS:    return "IQ4_XS";
-        default:                     return "unknown";
-    }
-}
+/* Type size / block size / name — defined in tq_gguf.c, just declared in header */
 
 /* ============================================================
  * Per-type dequantization
diff --git a/tests/test_gguf_moe.cpp b/tests/test_gguf_moe.cpp