Clarify Code Research as orchestration layer over base RAG tools

ofriw · claude · ofriw · commit 374353785539 · 2025-11-12T14:39:13.000+02:00
🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
diff --git a/src/content/docs/code-research.mdx b/src/content/docs/code-research.mdx
@@ -78,6 +78,14 @@ Code Research features won't work without LLM configuration. However:
 - **Refactoring prep** - Understand all dependencies before making changes
 - **Code archaeology** - Learn unfamiliar systems quickly
 
+## When to Use Direct Search Instead
+
+Code Research is designed for architectural exploration. For simpler queries, use the base search tools directly and reserve Code Research for architectural questions:
+
+- **Quick symbol lookups** - Use regex search to find all occurrences of a specific function or class name
+- **Known file/function** - Use semantic search when you know roughly what you're looking for
+- **Architectural questions** - Use Code Research to understand how components interact and why
+
 ---
 
 ## Usage
@@ -244,17 +252,34 @@ At each level, an LLM generates **context-aware follow-up questions** to explore
 
 ### Graph RAG Without the Graph
 
-Traditional Graph RAG systems build explicit knowledge graphs—extracting entities, mining relationships, and storing them in graph databases. ChunkHound achieves the same **multi-hop exploration** and **relationship-aware retrieval** without any graph construction, leveraging the semantic structure already captured during the cAST chunking pass.
+Traditional Graph RAG systems build explicit knowledge graphs—extracting entities, mining relationships, and storing them in graph databases. Code Research **approximates graph-like exploration** through orchestration, trading explicit relationship modeling for zero upfront cost and automatic scaling.
+
+#### How Orchestration Creates a Virtual Graph
 
-#### Virtual Graph Through cAST
+ChunkHound's base layer (cAST index + semantic/regex search) provides traditional RAG capabilities. The Code Research sub-agent orchestrates these tools to create Graph RAG behavior:
 
-The cAST (context-aware AST) chunking algorithm doesn't just split code into pieces—it creates a **virtual graph structure** where:
+**Base Layer Foundation:**
+- **Chunks as nodes**: cAST chunking preserves metadata (function names, class hierarchies, parameters, imports)
+- **Vector similarity as edges**: Semantic search finds conceptually related chunks via HNSW index
+- **Symbol references as edges**: Regex search finds all exact symbol occurrences
 
-- **Nodes**: Semantic chunks enriched with metadata (function names, class hierarchies, parameter lists, import paths)
-- **Edges**: Vector similarity + structural relationships (file proximity, AST nesting, import dependencies, symbol references)
-- **Traversal**: Multi-hop semantic search explores the vector space like a breadth-first graph walk
+**Orchestration Layer Creates the Graph:**
+- **BFS traversal**: Iteratively calls semantic search, starting from initial results and expanding through related chunks
+- **Query expansion**: Generates multiple semantic entry points, exploring different "neighborhoods" in parallel
+- **Symbol extraction + regex**: Pulls symbols from semantic results, triggers parallel regex to find all references
+- **Follow-up questions**: Creates targeted queries based on discovered code, recursively exploring architectural boundaries
+- **Convergence detection**: Monitors score degradation to prevent infinite traversal
 
-Because cAST preserves AST relationships during chunking (which functions call what, which classes inherit from where, which modules import others), the chunks already encode the code's architectural graph—**no separate extraction needed**. This virtual graph approach scales efficiently to multi-million LOC repositories without the computational overhead and maintenance burden of explicit graph structures, which become increasingly expensive at scale.
+Because cAST chunks preserve semantic boundaries, multi-hop expansion follows meaningful architectural connections rather than arbitrary text proximity. This structural awareness is why orchestration can approximate graph traversal—the base chunks already encode relationships that orchestration discovers through iterative search.
+
+The virtual graph emerges through **orchestrated tool use**, not pre-computed storage:
+- Initial semantic search → discovers conceptually relevant chunks
+- Multi-hop expansion → follows vector similarity "edges" through BFS
+- Symbol extraction → identifies key entities from high-relevance results
+- Regex search → finds all references, completing the "graph" of connections
+- Follow-ups → explores architectural relationships discovered in results
+
+This approach scales efficiently to multi-million LOC repositories because there's no explicit graph to maintain—the "graph" is the pattern of orchestrated search calls, adapted dynamically to each query's needs.
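
The orchestrated flow above can be illustrated with a minimal, self-contained sketch. Everything in it is hypothetical: the toy `CHUNKS` index, the word-overlap `semantic_search` (a stand-in for real vector search), and the `research` function are invented for illustration and are not ChunkHound's actual API.

```python
import re
from collections import deque

# Hypothetical toy index: chunk IDs and bodies are invented for illustration.
CHUNKS = {
    "auth.py:login": "def login(user): return verify_token(user.token)",
    "auth.py:verify_token": "def verify_token(token): return decode_jwt(token)",
    "jwt.py:decode_jwt": "def decode_jwt(token): ...",
    "billing.py:charge": "def charge(card): ...",
    "views.py:dashboard": "def dashboard(req): login(req.user)",
}
STOPWORDS = {"def", "return"}

def tokens(text):
    """Lowercased word set, minus keywords shared by every chunk."""
    return set(re.findall(r"\w+", text.lower())) - STOPWORDS

def semantic_search(query, k=2):
    """Stand-in for vector search: rank chunks by shared-word count."""
    q = tokens(query)
    scored = [(len(q & tokens(body)), cid) for cid, body in CHUNKS.items()]
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))
    return [cid for _, cid in ranked[:k]]

def regex_search(pattern):
    """Exact matching over every chunk body."""
    return [cid for cid, body in CHUNKS.items() if re.search(pattern, body)]

def extract_symbols(cid):
    """Every name that is defined or called in the chunk."""
    return re.findall(r"(\w+)\(", CHUNKS[cid])

def research(query, max_hops=2):
    seen = []  # ordered results, deduplicated by chunk ID
    frontier = deque((cid, 0) for cid in semantic_search(query))
    # BFS: each discovered chunk becomes the next semantic query.
    while frontier:
        cid, hop = frontier.popleft()
        if cid in seen:
            continue
        seen.append(cid)
        if hop < max_hops:
            frontier.extend((n, hop + 1) for n in semantic_search(CHUNKS[cid]))
    # Symbol extraction + regex: pick up references the BFS did not reach.
    for cid in list(seen):
        for symbol in extract_symbols(cid):
            for ref in regex_search(rf"\b{re.escape(symbol)}\("):
                if ref not in seen:
                    seen.append(ref)
    return seen

print(research("how does login work"))
```

In this toy run the semantic hops reach the login, dashboard, and token-verification chunks, the regex pass adds `jwt.py:decode_jwt` through the `decode_jwt` symbol, and the unrelated billing chunk is never visited. No graph is stored anywhere; the "edges" exist only as the sequence of search calls.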
 
 #### Hybrid Semantic + Symbol Search
 
@@ -268,15 +293,18 @@ The results are unified through simple deduplication by chunk ID. Semantic resul
 
 #### Why This Works
 
-Graph RAG's benefits come from following relationships between connected information. ChunkHound achieves this dynamically:
+Traditional semantic search finds conceptually similar code but misses architectural relationships. Knowledge graphs model these relationships explicitly but require expensive upfront extraction and ongoing maintenance.
+
+Code Research combines base search capabilities (semantic + regex) with intelligent orchestration:
 
-1. **No upfront cost**: No entity extraction, no graph construction, no graph maintenance
-2. **Query-adaptive**: Relationships discovered on-demand based on what each specific query needs
-3. **Language-agnostic**: Works across all 29 supported file types through universal AST patterns
-4. **Precise + conceptual**: Combines semantic understanding with exact symbol matching
-5. **Scales automatically**: Token budgets and traversal adapt to repository size
+1. **Query expansion** - Multiple semantic entry points discover different code neighborhoods
+2. **Multi-hop exploration** - BFS through semantic neighborhoods following architectural connections
+3. **Symbol extraction + regex** - Comprehensive coverage beyond semantic discovery
+4. **Follow-up generation** - Context-aware questions explore architectural boundaries
+5. **Adaptive scaling** - Token budgets (30k-150k) scale with codebase size
+6. **Map-reduce synthesis** - Parallel cluster synthesis with deterministic citation remapping
 
-The result: Graph RAG's architectural understanding and multi-hop discovery, achieved through the semantic knowledge already embedded in your cAST-parsed codebase.
+The virtual graph emerges through orchestrated tool use—no upfront construction, no separate storage, no synchronization overhead. Query-adaptive orchestration scales from quick searches to deep architectural exploration automatically.
 
 ### Adaptive Scaling
 
diff --git a/src/content/docs/index.mdx b/src/content/docs/index.mdx
@@ -24,23 +24,62 @@ Your AI assistant helps you code but lacks critical context:
 
 ## What is Deep Research for Code & Files?
 
-Just like Deep Research transforms how you explore the web, ChunkHound with the [Code Research](/code-research) tool brings Deep Research to your local codebase and files. **Search across 29 file types** including Python, JavaScript, TypeScript, Markdown, PDFs, and more.
+Just as Deep Research transforms how you explore the web, ChunkHound's [Code Research](/code-research) tool brings iterative, multi-hop exploration to your local codebase. Code Research is a **specialized orchestration sub-agent** that uses ChunkHound's semantic and regex search tools strategically, exploring your codebase the way an experienced engineer would. It works across **29 file types**, including Python, JavaScript, TypeScript, Markdown, PDFs, and more.
 
-- **Iterative Discovery** - Start with a concept, expand through code, markdown files, and comments
-- **Multi-hop Search** - Find connections between implementation and written knowledge
-- **Architecture Map** - Understand relationships across code files, README files, and design notes
-- **Complete Context** - Find that `validateEmail()` AND the markdown file explaining why it exists
+## Why ChunkHound is Different
+
+**Two-Layer Architecture: Best of Both Worlds**
+
+ChunkHound provides both traditional RAG capabilities AND intelligent orchestration for deep exploration:
+
+### Base Layer: Enhanced RAG
+Like traditional RAG systems, ChunkHound maintains an index and provides search tools—but with critical improvements:
+- **[cAST chunking](/under-the-hood#the-cast-algorithm)**: Structure-aware code segmentation (4.3 point gain on retrieval benchmarks vs fixed-size chunking)
+- **Semantic search**: Natural language queries via HNSW vector indexing
+- **Regex search**: Exact pattern matching for comprehensive symbol coverage
+
+### Orchestration Layer: Code Research Sub-Agent
+The [Code Research](/code-research) tool is a specialized orchestration layer that uses base search tools strategically:
+- **Multi-hop exploration**: BFS traversal discovering architectural relationships
+- **Query expansion**: Multiple semantic entry points to cast wider nets
+- **Follow-up generation**: Iterative questioning based on discovered code
+- **Adaptive scaling**: Token budgets scale automatically from 30k to 150k tokens based on repository size
+- **Map-reduce synthesis**: Handles millions of lines without context collapse
+
+**The result**: Virtual Graph RAG behavior through orchestration, not explicit graph construction.
+
+**What this means for you:**
+- **Use what you need**: Direct semantic/regex search for quick lookups, Code Research for architectural exploration
+- **Zero upfront cost**: No entity extraction, no graph database to maintain
+- **Query-adaptive**: Simple questions get fast answers, complex questions trigger deep exploration automatically
+- **Scales to monorepos**: Orchestration layer adapts exploration depth and synthesis budgets to codebase size
+
+**Compare approaches:**
+
+| Approach | Base Capability | Orchestration | Monorepo Scale | Maintenance |
+|----------|----------------|---------------|----------------|-------------|
+| **Keyword Search** | Exact matching | None | ✓ Fast | None |
+| **Traditional RAG** | Semantic search | None | ✓ Scales | Re-index files |
+| **Knowledge Graphs** | Relationship queries | Pre-computed | ✗ Expensive | Continuous sync |
+| **ChunkHound** | Semantic + Regex | Code Research sub-agent | ✓ Automatic | Automatic (incremental + realtime) |
 
 ## Production Ready
 
-**Battle-tested at scale:**
-- Handles codebases from thousands to **millions of lines**
-- **29 languages and formats** with structured parsing (22 programming languages, 5 config formats, 2 document formats)
-- **5 minutes** from installation to first search
-- **Zero** cloud dependencies - runs entirely local
+**Battle-tested at monorepo scale:**
+- **Millions of lines** across multi-language codebases
+- **29 languages and formats** with AST-aware parsing (Python, TypeScript, Go, Rust, C++, Java, and more)
+- **5 minutes** from installation to first deep research query
+- **Zero** cloud dependencies - your code stays local, searches stay fast
+- **Automatic scaling** - token budgets and exploration depth adapt to repository size
+
+**Ideal for:**
+- **Large monorepos** with cross-team dependencies and circular references
+- **Multi-language projects** requiring consistent search across all code
+- **Security-sensitive codebases** that can't use cloud-based code search
+- **Offline development** environments or air-gapped systems
 
 Built on proven foundations:<br />
-[Tree-sitter](https://tree-sitter.github.io/tree-sitter/) for parsing • [DuckDB](https://duckdb.org/) for local storage • [MCP](https://modelcontextprotocol.io/) for AI integration
+[Tree-sitter](https://tree-sitter.github.io/tree-sitter/) for parsing • [DuckDB](https://duckdb.org/) for local vector search • [MCP](https://modelcontextprotocol.io/) for AI integration
 
 **Stop recreating code. Start with deep understanding.**
 
diff --git a/src/content/docs/under-the-hood.mdx b/src/content/docs/under-the-hood.mdx
@@ -36,6 +36,84 @@ ChunkHound uses a local-first architecture with embedded databases and universal
 
 ChunkHound's local-first architecture provides key advantages: **Privacy** - Your code never leaves your machine. **Speed** - No network latency or API rate limits. **Reliability** - Works offline and in air-gapped environments. **Cost** - No per-token charges for indexing large codebases.
 
+## Design Philosophy
+
+ChunkHound's architecture is built on a **two-layer design** optimized for large monorepos and multi-language codebases:
+
+### Layer 1: Enhanced RAG Foundation
+
+ChunkHound provides traditional RAG capabilities with critical improvements:
+
+**cAST Chunking (Base Parsing):**
+- Structure-aware code segmentation preserving AST relationships
+- Function signatures, class hierarchies, imports, nesting maintained
+- 4.3 point gain on retrieval benchmarks vs fixed-size chunking
+
+**Semantic Search (Base Tool):**
+- HNSW vector indexing for fast nearest-neighbor retrieval
+- Optional multi-hop expansion with reranking (when provider supports it)
+- Finds conceptually similar code across all 29 supported file types
+
+**Regex Search (Base Tool):**
+- Pattern matching against indexed content
+- Exact symbol reference discovery
+- Zero API costs (local database operation)
+
+These base capabilities work like traditional RAG systems—maintain an index, support semantic and regex queries—but with structure-aware chunking that preserves code meaning.
+
+### Layer 2: Code Research Orchestration
+
+The [Code Research](/code-research) tool is a specialized sub-agent that orchestrates base tools strategically:
+
+**1. Query Expansion (LLM-driven)**
+- Generates multiple semantic entry points
+- Explores different "neighborhoods" in parallel
+- Casts wider net before narrowing to high-relevance results
+
+**2. Multi-hop Exploration (Iterative tool use)**
+- Calls semantic search repeatedly, following BFS through related chunks
+- Each hop discovers new semantic neighborhoods
+- Reranking maintains focus on original query
+
+**3. Symbol Extraction + Regex (Hybrid approach)**
+- Pulls symbols from high-relevance semantic results
+- Triggers parallel regex searches for comprehensive coverage
+- Combines conceptual discovery (semantic) with exhaustive matching (regex)
+
+**4. Follow-up Generation (Recursive exploration)**
+- Creates context-aware questions based on discovered code
+- Explores architectural boundaries iteratively
+- Maintains ancestor chain for coherent traversal
+
+**5. Adaptive Scaling (Codebase-aware budgets)**
+- Token budgets scale from 30k to 150k tokens based on repository size
+- [Convergence detection](#convergence-detection) prevents infinite loops
+- Map-reduce synthesis for large result sets
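
As a rough illustration of what a codebase-aware budget might look like, the sketch below interpolates between the 30k and 150k endpoints given above. The scaling curve (log-linear) and the `small`/`large` repository thresholds are assumptions invented for this example; the text does not state how ChunkHound actually maps repository size to a budget.

```python
import math

MIN_BUDGET = 30_000    # lower bound stated in the docs
MAX_BUDGET = 150_000   # upper bound stated in the docs

def token_budget(lines_of_code, small=10_000, large=1_000_000):
    """Hypothetical log-linear scaling between the two documented bounds.

    `small` and `large` are invented thresholds: repositories at or below
    `small` get the minimum budget, those at or above `large` the maximum.
    """
    position = (math.log10(max(lines_of_code, 1)) - math.log10(small)) / (
        math.log10(large) - math.log10(small)
    )
    position = min(1.0, max(0.0, position))  # clamp into [0, 1]
    return round(MIN_BUDGET + position * (MAX_BUDGET - MIN_BUDGET))

for loc in (5_000, 100_000, 50_000_000):
    print(loc, token_budget(loc))
```

Clamping keeps very small and very large repositories at the documented bounds, while a logarithmic scale avoids spending the whole range on the first few hundred thousand lines.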
+
+**Performance Impact**: Virtual graph behavior through orchestration, not explicit graph construction. Zero upfront extraction cost, zero graph storage, zero synchronization overhead.
+
+**For monorepos specifically**: "How does auth work?" in a 10K LOC project triggers a quick semantic search, while the same question in a 10M LOC monorepo automatically scales up the orchestration (deeper exploration, larger budgets, map-reduce synthesis) with no manual tuning required.
+
+---
+
+## Comparison with Alternative Approaches
+
+| Dimension | Traditional RAG | Explicit Knowledge Graphs | Agentic Search | ChunkHound Base | ChunkHound + Code Research |
+|-----------|----------------|---------------------------|----------------|-----------------|---------------------------|
+| **Discovery depth** | Single-hop semantic | Multi-hop (follow edges) | Iterative exploration | Semantic + Regex | Orchestrated multi-hop BFS |
+| **Setup time** | Minutes (indexing) | Hours (extraction + build) | Minutes (indexing) | Minutes (indexing) | Minutes (indexing) |
+| **Maintenance** | Re-index files | Re-extract + sync graph | Re-index files | Re-index files | Re-index files |
+| **Monorepo scale** | ✓ Fast | ✗ Expensive (quadratic) | ~ Depends on agent | ✓ Fast | ✓ Automatic scaling |
+| **Architecture understanding** | ✗ Limited | ✓ Explicit relationships | ✓ Through exploration | ~ Basic relationships | ✓ Virtual graph via orchestration |
+| **Query adaptation** | Fixed | Fixed | ✓ Adaptive | Fixed | ✓ Adaptive (budgets + convergence) |
+| **Language support** | Per-language tuning | Per-language extraction | Depends on tools | Universal (29 types) | Universal (29 types) |
+| **Relationship tracking** | None | Pre-computed | Tool-dependent | None | On-demand via orchestration |
+| **Synthesis** | None | None | LLM-dependent | None | Map-reduce with citations |
+
+**Key insight**: ChunkHound provides **both layers**—use base semantic/regex for quick queries, trigger Code Research orchestration for architectural exploration. Best of both worlds.
+
+---
+
 ## The cAST Algorithm
 
 When AI assistants search your codebase, they need code split into "chunks" - searchable pieces small enough to understand but large enough to be meaningful. The challenge: how do you split code without breaking its logic?
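
To make the question concrete, here is a deliberately simplified sketch of structure-aware splitting using Python's standard `ast` module. It is not the cAST algorithm itself: it only cuts at top-level definitions and ignores size budgets, and the `chunk_by_ast` helper and sample source are invented for illustration.

```python
import ast

SOURCE = '''
import os

def load(path):
    with open(path) as f:
        return f.read()

class Cache:
    def get(self, key):
        return None
'''

def chunk_by_ast(source):
    """Split source at top-level AST nodes so no chunk cuts through a definition."""
    chunks = []
    for node in ast.parse(source).body:
        chunks.append({
            # Functions and classes carry a name; other nodes fall back to their type.
            "name": getattr(node, "name", type(node).__name__),
            "kind": type(node).__name__,
            "lines": (node.lineno, node.end_lineno),
            "text": ast.get_source_segment(source, node),
        })
    return chunks

for chunk in chunk_by_ast(SOURCE):
    print(chunk["kind"], chunk["name"], chunk["lines"])
```

A fixed-size splitter could cut `load` between the `with` line and its body; splitting on AST boundaries keeps each function or class intact and attaches its name as metadata for later search.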
@@ -171,7 +249,12 @@ Multi-hop search implements several termination criteria to balance comprehensiv
 
 The system monitors three key signals for termination. First, it employs **rate-of-change monitoring** similar to early stopping in machine learning: when reranking scores degrade by more than 0.15 between iterations, indicating diminishing relevance returns. This derivative-based stopping criterion is common in optimization algorithms, effectively measuring the "convergence velocity" of score improvements. Second, it respects computational boundaries—both execution time (5 seconds maximum) and result volume (500 candidates maximum). Third, it detects resource exhaustion when fewer than 5 high-scoring candidates remain for productive expansion.
 
-This convergence detection creates a practical balance. The algorithm explores broadly enough to discover cross-domain relationships while terminating before semantic drift compromises result quality.
+**Monorepo-specific convergence behavior**: In large monorepos with circular dependencies and cross-cutting concerns, convergence detection prevents infinite loops. For example, searching for "logging" in a million-line codebase might discover:
+- Hop 1: Core logging modules (50 chunks)
+- Hop 2: Error handling that uses logging (200 chunks)
+- Hop 3: Business logic that uses error handling (1000+ chunks, but scores degrading)
+
+The rate-of-change stopping criterion recognizes when hop 3 adds volume without relevance and terminates before exploring every file that transitively imports logging. This creates a practical balance: the algorithm explores broadly enough to discover cross-domain relationships while terminating before semantic drift compromises result quality.
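
The termination signals described in this section can be sketched as a single check. The limits (0.15 score drop, 5 seconds, 500 candidates, fewer than 5 strong candidates) come from the text; the `should_stop` function, the use of one summary score per hop, and the `HIGH_SCORE` cutoff are assumptions made for this example, since the text does not define what counts as "high-scoring".

```python
SCORE_DROP_LIMIT = 0.15   # maximum tolerated score degradation between hops
TIME_LIMIT_S = 5.0        # maximum execution time in seconds
MAX_CANDIDATES = 500      # maximum number of accumulated candidates
MIN_HIGH_SCORING = 5      # minimum strong candidates needed to keep expanding
HIGH_SCORE = 0.5          # hypothetical cutoff, not taken from the text

def should_stop(prev_score, curr_score, elapsed_s, total_candidates, hop_scores):
    """Return the reason to stop expanding, or None to continue."""
    if prev_score - curr_score > SCORE_DROP_LIMIT:
        return "score degradation"
    if elapsed_s >= TIME_LIMIT_S:
        return "time limit"
    if total_candidates >= MAX_CANDIDATES:
        return "candidate limit"
    if sum(score >= HIGH_SCORE for score in hop_scores) < MIN_HIGH_SCORING:
        return "too few strong candidates"
    return None

# Invented numbers loosely following the "logging" example: scores hold up
# through hop 2, then fall sharply at hop 3.
print(should_stop(0.82, 0.74, 1.2, 250, [0.9] * 10))
print(should_stop(0.74, 0.51, 2.0, 300, [0.9] * 10))
```

With these made-up scores the second hop continues (a drop of about 0.08) and the third hop stops on score degradation (a drop of about 0.23), before the time or volume limits are even consulted.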
 
 <Aside type="tip" title="When Multi-Hop Activates">
 Multi-hop search automatically activates when you use providers with reranking support: