From 58d8ad66033992b1639dd09661bb3b1dc47f1eb0 Mon Sep 17 00:00:00 2001
From: quantumaikr
Date: Fri, 10 Apr 2026 23:52:59 +0900
Subject: [PATCH] docs(guide): add 'When to use which?' scenario table + C code in CTA

Address Reddit feedback: guide only showed KV compression benchmarks
vs llama.cpp but didn't explain when to use quant.cpp vs llama.cpp.

Changes:
1. Added "When to use which?" table after the PPL comparison with
   concrete scenarios (WASM 192KB, MCU, game engines, teaching) and
   explicit acknowledgment of llama.cpp strengths (GPU, models)
2. CTA now shows both Python AND C single-header code side by side,
   reinforcing the "one file" value proposition
3. Updated i18n strings for EN and KO

Co-Authored-By: Claude Opus 4.6 (1M context)
---
 site/index.html | 43 ++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 38 insertions(+), 5 deletions(-)

diff --git a/site/index.html b/site/index.html
index 68cc9c4..8624585 100644
--- a/site/index.html
+++ b/site/index.html
@@ -480,7 +480,7 @@

Compression vs Quality

-vs llama.cpp
+vs llama.cpp KV compression

Same 4-bit budget, 3.5x less quality degradation:

PPL Degradation at 4-bit (lower is better)
@@ -494,6 +494,23 @@

vs llama.cpp

+

When to use which?

+

llama.cpp is excellent. The difference is integration scope, not capability:

+
+ Scenario | quant.cpp | llama.cpp
+ WASM browser demo | 192 KB binary | Tensor graph too large
+ Microcontroller / RTOS | #include only | Needs build system
+ Game engine plugin | Drop one .h file | 250K LOC build
+ Learn in an afternoon | 16K LOC | 250K+ LOC
+ GPU throughput | Basic | Full Metal/CUDA
+ Model coverage | 7 architectures | 100+
+
+

Use llama.cpp for speed on a workstation. Use quant.cpp when you need to ship LLM inference inside something.

+

Context Length on 8GB Mac

@@ -572,12 +589,28 @@

Glossary

Try It Yourself

-Three lines of Python. No GPU, no API key, no setup.
-pip install quantcpp
+Python one-liner or C single-header. No GPU, no API key, no setup.

+
+
+
Python
+
pip install quantcpp
 
 from quantcpp import Model
 m = Model.from_pretrained("Llama-3.2-1B")
 print(m.ask("What is gravity?"))
+
+
+
C (single header)
+
#include "quant.h"
+
+int main() {
+    quant_model* m = quant_load("model.gguf");
+    quant_generate(quant_new(m, NULL),
+        "Hello!", print_token, NULL);
+}
+// cc app.c -lm -lpthread
+
+

GitHub PyPI
@@ -715,7 +748,7 @@

Try It Yourself

 'ch5.label':'Chapter 5','ch5.title':'Benchmarks','ch5.desc':'All measurements on Llama 3.2 1B Instruct (Q8_0 GGUF), Apple M1 Pro, 8 threads.',
 'ch6.label':'Chapter 6','ch6.title':'Research Foundations','ch6.desc':'Each technique in quant.cpp is grounded in peer-reviewed research:',
 'gl.label':'Reference','gl.title':'Glossary',
- 'cta.title':'Try It Yourself','cta.desc':'Three lines of Python. No GPU, no API key, no setup.',
+ 'cta.title':'Try It Yourself','cta.desc':'Python one-liner or C single-header. No GPU, no API key, no setup.',
 },
 ko: {
 'nav.problem':'문제점','nav.solution':'핵심 발견','nav.techniques':'4가지 기술',
@@ -748,7 +781,7 @@

Try It Yourself

 'ch5.label':'챕터 5','ch5.title':'벤치마크','ch5.desc':'모든 측정: Llama 3.2 1B Instruct (Q8_0 GGUF), Apple M1 Pro, 8 스레드.',
 'ch6.label':'챕터 6','ch6.title':'연구 기반','ch6.desc':'quant.cpp의 각 기술은 동료 심사를 거친 연구에 기반합니다:',
 'gl.label':'참조','gl.title':'용어집',
- 'cta.title':'직접 해보기','cta.desc':'Python 3줄. GPU도, API 키도, 설정도 필요 없습니다.',
+ 'cta.title':'직접 해보기','cta.desc':'Python 한 줄 또는 C 헤더 하나. GPU도, API 키도, 설정도 필요 없습니다.',
 }
 };