diff --git a/site/index.html b/site/index.html
index 03c4de7..4894289 100644
--- a/site/index.html
+++ b/site/index.html
@@ -522,7 +522,7 @@

Context Length on 8GB Mac

Movement

Beyond RAG

-
+
Chunking RAG was a workaround for small context windows.
The workaround became dogma.
Now context windows are big enough that we don't need the workaround.
@@ -531,7 +531,7 @@

Beyond RAG

Traditional RAG splits documents into 512-token chunks, embeds them in a vector database, and retrieves fragments. This was a reasonable engineering compromise when LLMs had 2K context windows. Now they have 128K. The compromise should have started disappearing.

-

It didn't. The infrastructure became dogma. Vector DBs became billion-dollar companies. "RAG pipeline" became something every AI engineer was expected to build, regardless of whether their use case actually needed one.

+

It didn't. The infrastructure became dogma. Vector DBs became billion-dollar companies. "RAG pipeline" became something every AI engineer was expected to build, regardless of whether their use case actually needed one.

Chunk-Level RAG vs Document-Level RAG
@@ -600,34 +600,34 @@

Read Once, Query Forever

- -

7/7 vs 0/7 — Verified

-

We compared three approaches on a synthetic 5-section document with 7 questions (4 single-hop, 3 multi-hop). Tested with Llama 3.2 3B Q8_0:

+ +

7/7 vs 0/7 — Verified

+

We compared three approaches on a synthetic 5-section document with 7 questions (4 single-hop, 3 multi-hop). Tested with Llama 3.2 3B Q8_0:

-
Fact Extraction Accuracy
+
Fact Extraction Accuracy
-
Chunk-RAG (wrong section retrieved) 0/7 — all hallucinated
+
Chunk-RAG (wrong section retrieved) 0/7 — all hallucinated
0%
-
Full Document (FP32 KV) 7/7
+
Full Document (FP32 KV) 7/7
100%
-
Full Document (6.4x KV compression) 7/7
-
100% — same as FP32
+
Full Document (6.4x KV compression) 7/7
+
100% — same as FP32
-

The Hallucination Problem

-

When chunk-RAG retrieved the wrong section, the model didn't say "I don't know" — it generated plausible-sounding lies:

+

The Hallucination Problem

+

When chunk-RAG retrieved the wrong section, the model didn't say "I don't know" — it generated plausible-sounding lies:

-
+
Q: Who is the CTO?
Chunk-RAG: "John Smith"   → truth: Maria Santos

@@ -639,28 +639,28 @@

The Hallucination Problem

-

This is the fundamental danger of chunk-RAG: retrieval failure becomes silent hallucination. KV compression makes it possible to load the entire document into context, eliminating this failure mode on consumer hardware.

+

This is the fundamental danger of chunk-RAG: retrieval failure becomes silent hallucination. KV compression makes it possible to load the entire document into context, eliminating this failure mode on consumer hardware.

-

KV Compression = Zero Quality Loss

-

FP32 7/7 = 6.4x compressed 7/7. The 6.4x memory savings cost nothing in fact extraction quality.

+

KV Compression = Zero Quality Loss

+

FP32 7/7 = 6.4x compressed 7/7. The 6.4x memory savings cost nothing in fact extraction quality.

🔗
-

Multi-Hop Reasoning Works

-

"What risk affects the growth region?" requires linking Section 3 (Asia growth) with Section 5 (Asia currency risk). Full-doc: ✓. Chunk-RAG: impossible.

+

Multi-Hop Reasoning Works

+

"What risk affects the growth region?" requires linking Section 3 (Asia growth) with Section 5 (Asia currency risk). Full-doc: ✓. Chunk-RAG: impossible.

💻
-

Runs on 16GB Mac

-

Llama 3.2 3B Q8_0, no GPU. 6.4x KV compression makes this practical on consumer hardware.

+

Runs on 16GB Mac

+

Llama 3.2 3B Q8_0, no GPU. 6.4x KV compression makes this practical on consumer hardware.
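
As a rough sanity check on the memory claim, the arithmetic behind a 6.4x KV reduction can be sketched as below. The model shape (28 layers, 8 KV heads, head dimension 128) is an assumption taken from the commonly published Llama 3.2 3B configuration and is not stated on this page. `kvCacheBytes` is a hypothetical helper, and the 128K token count is the context-window figure quoted earlier, not the length of the benchmark document.

```javascript
// Rough KV-cache sizing sketch.
// ASSUMPTION: model shape (28 layers, 8 KV heads, head dim 128) is the
// commonly published Llama 3.2 3B configuration, not a value from this page.
function kvCacheBytes({ layers, kvHeads, headDim, bytesPerValue, tokens }) {
  // Factor of 2: one tensor for keys and one for values per layer.
  return 2 * layers * kvHeads * headDim * bytesPerValue * tokens;
}

const shape = { layers: 28, kvHeads: 8, headDim: 128 };
const tokens = 128 * 1024; // a full 128K-token context

// FP32 stores 4 bytes per cached value.
const fp32 = kvCacheBytes({ ...shape, bytesPerValue: 4, tokens });

// Apply the 6.4x compression ratio quoted on the page.
const compressed = fp32 / 6.4;

const GiB = 1024 ** 3;
console.log((fp32 / GiB).toFixed(1), (compressed / GiB).toFixed(1));
// prints: 28.0 4.4
```

Under these assumptions an uncompressed FP32 cache for a full 128K context would not fit in 16GB, while the compressed one would leave room for the model weights. Actual memory use depends on the runtime and is not measured here.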

@@ -743,7 +743,7 @@

Try It Yourself

@@ -913,7 +913,30 @@

Try It Yourself

  "rag.card2.d": "Can't fit 100K documents in context. Prefill is slow. RAG narrows the search to 2-3 relevant documents that DO fit.",
  "rag.card3.t": "Read Once, Query Forever",
  "rag.card3.d": "Pre-process documents into .kv files (GPU, once). Load instantly on any laptop (0.5s). Query offline, unlimited, private.",
- "rag.pipeline.title": "Pre-computed KV Library Pattern"
+ "rag.pipeline.title": "Pre-computed KV Library Pattern",
+ "rag.quote": "Chunking RAG was a workaround for small context windows.
The workaround became dogma.
Now context windows are big enough that we don't need the workaround.
— Welcome to Beyond RAG.",
+ "rag.para2": "It didn't. The infrastructure became dogma. Vector DBs became billion-dollar companies. \"RAG pipeline\" became something every AI engineer was expected to build, regardless of whether their use case actually needed one.",
+ "verify.label": "Measured Result",
+ "verify.title": "7/7 vs 0/7 — Verified",
+ "verify.intro": "We compared three approaches on a synthetic 5-section document with 7 questions (4 single-hop, 3 multi-hop). Tested with Llama 3.2 3B Q8_0:",
+ "verify.viz.title": "Fact Extraction Accuracy",
+ "verify.bar1.label": "Chunk-RAG (wrong section retrieved)",
+ "verify.bar1.val": "0/7 — all hallucinated",
+ "verify.bar2.label": "Full Document (FP32 KV)",
+ "verify.bar3.label": "Full Document (6.4x KV compression)",
+ "verify.bar3.inner": "100% — same as FP32",
+ "verify.halluc.title": "The Hallucination Problem",
+ "verify.halluc.desc": "When chunk-RAG retrieved the wrong section, the model didn't say \"I don't know\" — it generated plausible-sounding lies:",
+ "verify.halluc.examples": "
Q: Who is the CTO?
Chunk-RAG: \"John Smith\"   → truth: Maria Santos

Q: What is the revenue?
Chunk-RAG: \"$1,000,000\"   → truth: 847 million

Q: What percent is R&D?
Chunk-RAG: \"15% of net income\"   → truth: 14% of revenue
",
+ "verify.halluc.summary": "This is the fundamental danger of chunk-RAG: retrieval failure becomes silent hallucination. KV compression makes it possible to load the entire document into context, eliminating this failure mode on consumer hardware.",
+ "verify.card1.t": "KV Compression = Zero Quality Loss",
+ "verify.card1.d": "FP32 7/7 = 6.4x compressed 7/7. The 6.4x memory savings cost nothing in fact extraction quality.",
+ "verify.card2.t": "Multi-Hop Reasoning Works",
+ "verify.card2.d": "\"What risk affects the growth region?\" requires linking Section 3 (Asia growth) with Section 5 (Asia currency risk). Full-doc: ✓. Chunk-RAG: impossible.",
+ "verify.card3.t": "Runs on 16GB Mac",
+ "verify.card3.d": "Llama 3.2 3B Q8_0, no GPU. 6.4x KV compression makes this practical on consumer hardware.",
+ "verify.cta": "Read the Beyond RAG Manifesto →",
+ "footer.text": "quant.cpp · Apache 2.0 · GitHub · Made by quantumaikr"
  },
  ko: {
  "nav.problem": "\uBB38\uC81C\uC810",
@@ -1077,7 +1100,30 @@

Try It Yourself

  "rag.card2.d": "100K 문서를 한 번에 컨텍스트에 넣을 수 없습니다. Prefill이 느립니다. RAG는 검색을 2-3개 관련 문서로 좁혀줍니다.",
  "rag.card3.t": "한 번 읽고, 영원히 질문",
  "rag.card3.d": "문서를 .kv 파일로 사전 처리 (GPU, 1회). 어떤 노트북에서든 즉시 로드 (0.5초). 오프라인, 무제한, 프라이빗 질문.",
- "rag.pipeline.title": "사전 계산된 KV 라이브러리 패턴"
+ "rag.pipeline.title": "사전 계산된 KV 라이브러리 패턴",
+ "rag.quote": "청킹 RAG는 작은 컨텍스트 윈도우에 대한 임시방편이었습니다.
그 임시방편이 정설이 됐습니다.
이제 컨텍스트 윈도우가 충분히 커져서 임시방편이 필요 없습니다.
— Beyond RAG에 오신 것을 환영합니다.",
+ "rag.para2": "사라지지 않았습니다. 인프라가 정설이 됐습니다. 벡터 DB는 수십억 달러 기업이 됐습니다. \"RAG 파이프라인\"은 실제 용도가 필요하든 아니든 모든 AI 엔지니어가 구축해야 할 무언가가 됐습니다.",
+ "verify.label": "측정 결과",
+ "verify.title": "7/7 vs 0/7 — 검증됨",
+ "verify.intro": "5개 섹션의 합성 문서와 7개 질문(4개 단일-hop, 3개 multi-hop)으로 세 가지 접근법을 비교했습니다. Llama 3.2 3B Q8_0으로 테스트:",
+ "verify.viz.title": "사실 추출 정확도",
+ "verify.bar1.label": "Chunk-RAG (잘못된 섹션 검색)",
+ "verify.bar1.val": "0/7 — 전부 환각",
+ "verify.bar2.label": "전체 문서 (FP32 KV)",
+ "verify.bar3.label": "전체 문서 (6.4배 KV 압축)",
+ "verify.bar3.inner": "100% — FP32와 동일",
+ "verify.halluc.title": "환각 문제",
+ "verify.halluc.desc": "Chunk-RAG가 잘못된 섹션을 검색했을 때, 모델은 \"모르겠습니다\"라고 말하지 않고 그럴듯한 거짓말을 생성했습니다:",
+ "verify.halluc.examples": "
Q: CTO는 누구인가요?
Chunk-RAG: \"John Smith\"   → 정답: Maria Santos

Q: 매출은 얼마인가요?
Chunk-RAG: \"$1,000,000\"   → 정답: 8억 4,700만

Q: R&D는 몇 퍼센트인가요?
Chunk-RAG: \"순이익의 15%\"   → 정답: 매출의 14%
",
+ "verify.halluc.summary": "이것이 chunk-RAG의 근본적 위험입니다: 검색 실패가 조용한 환각이 됩니다. KV 압축은 전체 문서를 컨텍스트에 로드할 수 있게 하여, 소비자 하드웨어에서 이 실패 모드를 제거합니다.",
+ "verify.card1.t": "KV 압축 = 품질 손실 0",
+ "verify.card1.d": "FP32 7/7 = 6.4배 압축 7/7. 6.4배 메모리 절감이 사실 추출 품질에 아무런 비용도 들이지 않습니다.",
+ "verify.card2.t": "Multi-Hop 추론 작동",
+ "verify.card2.d": "\"성장 지역에 영향을 미치는 위험은?\"은 섹션 3(아시아 성장)과 섹션 5(아시아 통화 위험)를 연결해야 합니다. 전체 문서: ✓. Chunk-RAG: 불가능.",
+ "verify.card3.t": "16GB Mac에서 실행",
+ "verify.card3.d": "Llama 3.2 3B Q8_0, GPU 없음. 6.4배 KV 압축으로 소비자 하드웨어에서 실용적이 됩니다.",
+ "verify.cta": "Beyond RAG 선언문 읽기 →",
+ "footer.text": "quant.cpp · Apache 2.0 · GitHub · 제작 quantumaikr"
  }
  };
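
Because this change adds the same set of keys to both the `en` and `ko` tables, a small parity check can catch a key that lands in one locale but not the other. The following is a minimal sketch, not part of the patch: `missingKeys` is a hypothetical helper, and the two tables are tiny stand-ins for the real ones.

```javascript
// Report keys present in `reference` but absent from `candidate`.
// `missingKeys` is a hypothetical helper for illustration only.
function missingKeys(reference, candidate) {
  return Object.keys(reference).filter((key) => !(key in candidate));
}

// Stand-in tables; the real ones are the `en` and `ko` objects in the page script.
const en = {
  "verify.label": "Measured Result",
  "verify.cta": "Read the Beyond RAG Manifesto →",
};
const ko = {
  "verify.label": "측정 결과",
};

console.log(missingKeys(en, ko)); // [ 'verify.cta' ]
console.log(missingKeys(ko, en)); // []
```

Running the check in both directions matters, since a key can be missing from either table.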