diff --git a/site/index.html b/site/index.html
index 03c4de7..4894289 100644
--- a/site/index.html
+++ b/site/index.html
@@ -522,7 +522,7 @@
+Chunking RAG was a workaround for small context windows.
The workaround became dogma.
Now context windows are big enough that we don't need the workaround.
@@ -531,7 +531,7 @@ Beyond RAG
Traditional RAG splits documents into 512-token chunks, embeds them in a vector database, and retrieves fragments. This was a reasonable engineering compromise when LLMs had 2K context windows. Now they have 128K. The compromise should have started disappearing.
-It didn't. The infrastructure became dogma. Vector DBs became billion-dollar companies. "RAG pipeline" became something every AI engineer was expected to build, regardless of whether their use case actually needed one.
+It didn't. The infrastructure became dogma. Vector DBs became billion-dollar companies. "RAG pipeline" became something every AI engineer was expected to build, regardless of whether their use case actually needed one.
Chunk-Level RAG vs Document-Level RAG
@@ -600,34 +600,34 @@ Read Once, Query Forever
-Measured Result
-7/7 vs 0/7 — Verified
-We compared three approaches on a synthetic 5-section document with 7 questions (4 single-hop, 3 multi-hop). Tested with Llama 3.2 3B Q8_0:
+Measured Result
+7/7 vs 0/7 — Verified
+We compared three approaches on a synthetic 5-section document with 7 questions (4 single-hop, 3 multi-hop). Tested with Llama 3.2 3B Q8_0:
-Fact Extraction Accuracy
+Fact Extraction Accuracy
-The Hallucination Problem
-When chunk-RAG retrieved the wrong section, the model didn't say "I don't know" — it generated plausible-sounding lies:
+The Hallucination Problem
+When chunk-RAG retrieved the wrong section, the model didn't say "I don't know" — it generated plausible-sounding lies:
-
@@ -743,7 +743,7 @@
+
-Q: Who is the CTO? Chunk-RAG: "John Smith" → truth: Maria Santos
@@ -639,28 +639,28 @@ The Hallucination Problem
-This is the fundamental danger of chunk-RAG: retrieval failure becomes silent hallucination. KV compression makes it possible to load the entire document into context, eliminating this failure mode on consumer hardware.
+This is the fundamental danger of chunk-RAG: retrieval failure becomes silent hallucination. KV compression makes it possible to load the entire document into context, eliminating this failure mode on consumer hardware.
-KV Compression = Zero Quality Loss
-FP32 7/7 = 6.4x compressed 7/7. The 6.4x memory savings cost nothing in fact extraction quality.
+KV Compression = Zero Quality Loss
+FP32 7/7 = 6.4x compressed 7/7. The 6.4x memory savings cost nothing in fact extraction quality.
-Multi-Hop Reasoning Works
-"What risk affects the growth region?" requires linking Section 3 (Asia growth) with Section 5 (Asia currency risk). Full-doc: ✓. Chunk-RAG: impossible.
+Multi-Hop Reasoning Works
+"What risk affects the growth region?" requires linking Section 3 (Asia growth) with Section 5 (Asia currency risk). Full-doc: ✓. Chunk-RAG: impossible.
-Runs on 16GB Mac
-Llama 3.2 3B Q8_0, no GPU. 6.4x KV compression makes this practical on consumer hardware.
+Runs on 16GB Mac
+Llama 3.2 3B Q8_0, no GPU. 6.4x KV compression makes this practical on consumer hardware.
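A rough back-of-the-envelope check of the memory claim in the card above can be sketched as follows. The model shape (28 layers, 8 KV heads, head dimension 128) is an assumption about Llama 3.2 3B that is not stated on the page, and the 128K-token window is an illustrative size, not the measured test setup. The FP32 baseline and the 6.4x ratio are the page's own figures. The names `model`, `kvBytesPerToken`, and `kvCacheGiB` are hypothetical.

```javascript
// Assumed model shape for Llama 3.2 3B (not given on the page; verify against the model config).
const model = { layers: 28, kvHeads: 8, headDim: 128 };

// Bytes of KV cache needed per token: one key and one value vector
// for every layer and every KV head.
function kvBytesPerToken({ layers, kvHeads, headDim }, bytesPerValue) {
  return 2 * layers * kvHeads * headDim * bytesPerValue;
}

// Total KV cache size in GiB for a given context length,
// optionally divided by a compression ratio.
function kvCacheGiB(shape, tokens, bytesPerValue, compressionRatio = 1) {
  const bytes = kvBytesPerToken(shape, bytesPerValue) * tokens;
  return bytes / compressionRatio / 2 ** 30;
}

const tokens = 128 * 1024;                            // illustrative 128K-token context
const fp32 = kvCacheGiB(model, tokens, 4);            // FP32 = 4 bytes per value
const compressed = kvCacheGiB(model, tokens, 4, 6.4); // the page's 6.4x ratio

console.log(`FP32 KV: ${fp32.toFixed(1)} GiB, compressed: ${compressed.toFixed(1)} GiB`);
```

Under these assumptions the uncompressed cache is about 28 GiB, which would not fit in 16GB of memory, while the compressed cache is about 4.4 GiB, which leaves room for the model weights.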
Try It Yourself
@@ -913,7 +913,30 @@ Try It Yourself
   "rag.card2.d": "Can't fit 100K documents in context. Prefill is slow. RAG narrows the search to 2-3 relevant documents that DO fit.",
   "rag.card3.t": "Read Once, Query Forever",
   "rag.card3.d": "Pre-process documents into .kv files (GPU, once). Load instantly on any laptop (0.5s). Query offline, unlimited, private.",
-  "rag.pipeline.title": "Pre-computed KV Library Pattern"
+  "rag.pipeline.title": "Pre-computed KV Library Pattern",
+  "rag.quote": "Chunking RAG was a workaround for small context windows. The workaround became dogma. Now context windows are big enough that we don't need the workaround. — Welcome to Beyond RAG.",
+  "rag.para2": "It didn't. The infrastructure became dogma. Vector DBs became billion-dollar companies. \"RAG pipeline\" became something every AI engineer was expected to build, regardless of whether their use case actually needed one.",
+  "verify.label": "Measured Result",
+  "verify.title": "7/7 vs 0/7 — Verified",
+  "verify.intro": "We compared three approaches on a synthetic 5-section document with 7 questions (4 single-hop, 3 multi-hop). Tested with Llama 3.2 3B Q8_0:",
+  "verify.viz.title": "Fact Extraction Accuracy",
+  "verify.bar1.label": "Chunk-RAG (wrong section retrieved)",
+  "verify.bar1.val": "0/7 — all hallucinated",
+  "verify.bar2.label": "Full Document (FP32 KV)",
+  "verify.bar3.label": "Full Document (6.4x KV compression)",
+  "verify.bar3.inner": "100% — same as FP32",
+  "verify.halluc.title": "The Hallucination Problem",
+  "verify.halluc.desc": "When chunk-RAG retrieved the wrong section, the model didn't say \"I don't know\" — it generated plausible-sounding lies:",
+  "verify.halluc.examples": "Q: Who is the CTO? Chunk-RAG: \"John Smith\" → truth: Maria Santos Q: What is the revenue? Chunk-RAG: \"$1,000,000\" → truth: 847 million Q: What percent is R&D? Chunk-RAG: \"15% of net income\" → truth: 14% of revenue",
+  "verify.halluc.summary": "This is the fundamental danger of chunk-RAG: retrieval failure becomes silent hallucination. KV compression makes it possible to load the entire document into context, eliminating this failure mode on consumer hardware.",
+  "verify.card1.t": "KV Compression = Zero Quality Loss",
+  "verify.card1.d": "FP32 7/7 = 6.4x compressed 7/7. The 6.4x memory savings cost nothing in fact extraction quality.",
+  "verify.card2.t": "Multi-Hop Reasoning Works",
+  "verify.card2.d": "\"What risk affects the growth region?\" requires linking Section 3 (Asia growth) with Section 5 (Asia currency risk). Full-doc: ✓. Chunk-RAG: impossible.",
+  "verify.card3.t": "Runs on 16GB Mac",
+  "verify.card3.d": "Llama 3.2 3B Q8_0, no GPU. 6.4x KV compression makes this practical on consumer hardware.",
+  "verify.cta": "Read the Beyond RAG Manifesto →",
+  "footer.text": "quant.cpp · Apache 2.0 · GitHub · Made by quantumaikr"
  },
  ko: {
   "nav.problem": "\uBB38\uC81C\uC810",
@@ -1077,7 +1100,30 @@ Try It Yourself
   "rag.card2.d": "100K 문서를 한 번에 컨텍스트에 넣을 수 없습니다. Prefill이 느립니다. RAG는 검색을 2-3개 관련 문서로 좁혀줍니다.",
   "rag.card3.t": "한 번 읽고, 영원히 질문",
   "rag.card3.d": "문서를 .kv 파일로 사전 처리 (GPU, 1회). 어떤 노트북에서든 즉시 로드 (0.5초). 오프라인, 무제한, 프라이빗 질문.",
-  "rag.pipeline.title": "사전 계산된 KV 라이브러리 패턴"
+  "rag.pipeline.title": "사전 계산된 KV 라이브러리 패턴",
+  "rag.quote": "청킹 RAG는 작은 컨텍스트 윈도우에 대한 임시방편이었습니다. 그 임시방편이 정설이 됐습니다. 이제 컨텍스트 윈도우가 충분히 커져서 임시방편이 필요 없습니다. — Beyond RAG에 오신 것을 환영합니다.",
+  "rag.para2": "사라지지 않았습니다. 인프라가 정설이 됐습니다. 벡터 DB는 수십억 달러 기업이 됐습니다. \"RAG 파이프라인\"은 실제 용도가 필요하든 아니든 모든 AI 엔지니어가 구축해야 할 무언가가 됐습니다.",
+  "verify.label": "측정 결과",
+  "verify.title": "7/7 vs 0/7 — 검증됨",
+  "verify.intro": "5개 섹션의 합성 문서와 7개 질문(4개 단일-hop, 3개 multi-hop)으로 세 가지 접근법을 비교했습니다. Llama 3.2 3B Q8_0으로 테스트:",
+  "verify.viz.title": "사실 추출 정확도",
+  "verify.bar1.label": "Chunk-RAG (잘못된 섹션 검색)",
+  "verify.bar1.val": "0/7 — 전부 환각",
+  "verify.bar2.label": "전체 문서 (FP32 KV)",
+  "verify.bar3.label": "전체 문서 (6.4배 KV 압축)",
+  "verify.bar3.inner": "100% — FP32와 동일",
+  "verify.halluc.title": "환각 문제",
+  "verify.halluc.desc": "Chunk-RAG가 잘못된 섹션을 검색했을 때, 모델은 \"모르겠습니다\"라고 말하지 않고 그럴듯한 거짓말을 생성했습니다:",
+  "verify.halluc.examples": "Q: CTO는 누구인가요? Chunk-RAG: \"John Smith\" → 정답: Maria Santos Q: 매출은 얼마인가요? Chunk-RAG: \"$1,000,000\" → 정답: 8억 4,700만 Q: R&D는 몇 퍼센트인가요? Chunk-RAG: \"순이익의 15%\" → 정답: 매출의 14%",
+  "verify.halluc.summary": "이것이 chunk-RAG의 근본적 위험입니다: 검색 실패가 조용한 환각이 됩니다. KV 압축은 전체 문서를 컨텍스트에 로드할 수 있게 하여, 소비자 하드웨어에서 이 실패 모드를 제거합니다.",
+  "verify.card1.t": "KV 압축 = 품질 손실 0",
+  "verify.card1.d": "FP32 7/7 = 6.4배 압축 7/7. 6.4배 메모리 절감이 사실 추출 품질에 아무런 비용도 들이지 않습니다.",
+  "verify.card2.t": "Multi-Hop 추론 작동",
+  "verify.card2.d": "\"성장 지역에 영향을 미치는 위험은?\"은 섹션 3(아시아 성장)과 섹션 5(아시아 통화 위험)를 연결해야 합니다. 전체 문서: ✓. Chunk-RAG: 불가능.",
+  "verify.card3.t": "16GB Mac에서 실행",
+  "verify.card3.d": "Llama 3.2 3B Q8_0, GPU 없음. 6.4배 KV 압축으로 소비자 하드웨어에서 실용적이 됩니다.",
+  "verify.cta": "Beyond RAG 선언문 읽기 →",
+  "footer.text": "quant.cpp · Apache 2.0 · GitHub · 제작 quantumaikr"
  }
 };
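The two hunks above add the same keys to the `en` and `ko` tables. How the page resolves those keys is not visible in this diff, so the following is only a minimal sketch of one plausible lookup, assuming a per-locale object and an English fallback. The names `translations` and `t` are hypothetical, and the fallback order is an assumption, not the site's confirmed behavior. The sample strings are copied from the hunks above.

```javascript
// Hypothetical excerpt of the translation tables; keys and strings come from the diff above.
const translations = {
  en: {
    "verify.label": "Measured Result",
    "verify.card3.t": "Runs on 16GB Mac",
  },
  ko: {
    "verify.label": "측정 결과",
    // "verify.card3.t" is deliberately left out here to exercise the fallback.
  },
};

// True only for keys defined directly on the table, not inherited ones.
function hasKey(table, key) {
  return Object.prototype.hasOwnProperty.call(table, key);
}

// Resolve a key for a locale: try the locale, then English, then return the key itself.
function t(locale, key) {
  const table = translations[locale] || {};
  if (hasKey(table, key)) return table[key];
  if (hasKey(translations.en, key)) return translations.en[key];
  return key;
}

console.log(t("ko", "verify.label"));   // Korean string from the ko table
console.log(t("ko", "verify.card3.t")); // falls back to the English string
```

Returning the key itself for unknown entries keeps a missing translation visible on the page instead of rendering an empty element, which makes gaps between the two tables easier to spot.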