Commit 3743537

ofriw and claude committed
Clarify Code Research as orchestration layer over base RAG tools
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
1 parent cfe8ec6 commit 3743537

3 files changed

Lines changed: 176 additions & 26 deletions

src/content/docs/code-research.mdx

Lines changed: 42 additions & 14 deletions
@@ -78,6 +78,14 @@ Code Research features won't work without LLM configuration. However:
7878
- **Refactoring prep** - Understand all dependencies before making changes
7979
- **Code archaeology** - Learn unfamiliar systems quickly
8080

81+
## When to Use Direct Search Instead
82+
83+
Code Research is designed for architectural exploration. For simpler queries, use the base search tools directly and reserve Code Research for architectural questions:
84+
85+
- **Quick symbol lookups** - Use regex search to find all occurrences of a specific function or class name
86+
- **Known file/function** - Use semantic search when you know roughly what you're looking for
87+
- **Architectural questions** - Use Code Research to understand how components interact and why
88+
8189
---
8290

8391
## Usage
@@ -244,17 +252,34 @@ At each level, an LLM generates **context-aware follow-up questions** to explore
244252

245253
### Graph RAG Without the Graph
246254

247-
Traditional Graph RAG systems build explicit knowledge graphs—extracting entities, mining relationships, and storing them in graph databases. ChunkHound achieves the same **multi-hop exploration** and **relationship-aware retrieval** without any graph construction, leveraging the semantic structure already captured during the cAST chunking pass.
255+
Traditional Graph RAG systems build explicit knowledge graphs—extracting entities, mining relationships, and storing them in graph databases. Code Research **approximates graph-like exploration** through orchestration, trading explicit relationship modeling for zero upfront cost and automatic scaling.
256+
257+
#### How Orchestration Creates a Virtual Graph
248258

249-
#### Virtual Graph Through cAST
259+
ChunkHound's base layer (cAST index + semantic/regex search) provides traditional RAG capabilities. The Code Research sub-agent orchestrates these tools to create Graph RAG behavior:
250260

251-
The cAST (context-aware AST) chunking algorithm doesn't just split code into pieces—it creates a **virtual graph structure** where:
261+
**Base Layer Foundation:**
262+
- **Chunks as nodes**: cAST chunking preserves metadata (function names, class hierarchies, parameters, imports)
263+
- **Vector similarity as edges**: Semantic search finds conceptually related chunks via HNSW index
264+
- **Symbol references as edges**: Regex search finds all exact symbol occurrences
252265

253-
- **Nodes**: Semantic chunks enriched with metadata (function names, class hierarchies, parameter lists, import paths)
254-
- **Edges**: Vector similarity + structural relationships (file proximity, AST nesting, import dependencies, symbol references)
255-
- **Traversal**: Multi-hop semantic search explores the vector space like a breadth-first graph walk
266+
**Orchestration Layer Creates the Graph:**
267+
- **BFS traversal**: Iteratively calls semantic search, starting from initial results and expanding through related chunks
268+
- **Query expansion**: Generates multiple semantic entry points, exploring different "neighborhoods" in parallel
269+
- **Symbol extraction + regex**: Pulls symbols from semantic results, triggers parallel regex to find all references
270+
- **Follow-up questions**: Creates targeted queries based on discovered code, recursively exploring architectural boundaries
271+
- **Convergence detection**: Monitors score degradation to prevent infinite traversal
256272

257-
Because cAST preserves AST relationships during chunking (which functions call what, which classes inherit from where, which modules import others), the chunks already encode the code's architectural graph—**no separate extraction needed**. This virtual graph approach scales efficiently to multi-million LOC repositories without the computational overhead and maintenance burden of explicit graph structures, which become increasingly expensive at scale.
273+
Because cAST chunks preserve semantic boundaries, multi-hop expansion follows meaningful architectural connections rather than arbitrary text proximity. This structural awareness is why orchestration can approximate graph traversal—the base chunks already encode relationships that orchestration discovers through iterative search.
274+
275+
The virtual graph emerges through **orchestrated tool use**, not pre-computed storage:
276+
- Initial semantic search → discovers conceptually relevant chunks
277+
- Multi-hop expansion → follows vector similarity "edges" through BFS
278+
- Symbol extraction → identifies key entities from high-relevance results
279+
- Regex search → finds all references, completing the "graph" of connections
280+
- Follow-ups → explores architectural relationships discovered in results
281+
282+
This approach scales efficiently to multi-million LOC repositories because there's no explicit graph to maintain—the "graph" is the pattern of orchestrated search calls, adapted dynamically to each query's needs.
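
The orchestration steps listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not ChunkHound's implementation: the in-memory `CHUNKS` index, the token-overlap `semantic_search`, and the helpers `regex_search`, `extract_symbols`, and `research` are all hypothetical stand-ins for the real semantic and regex tools, and the hop limit is an arbitrary choice.

```python
import re
from collections import deque

# Hypothetical in-memory "index": chunk ID -> chunk text.
CHUNKS = {
    "auth.login": "def login(user): return verify_token(issue_token(user))",
    "auth.token": "def issue_token(user): return sign(user.id)",
    "auth.verify": "def verify_token(token): return check_signature(token)",
    "billing.invoice": "def make_invoice(order): return total(order)",
}


def tokens(text):
    """Lowercased word tokens; snake_case identifiers are split apart."""
    return set(re.findall(r"[a-z]+", text.lower()))


def semantic_search(query, k=2):
    """Toy stand-in for vector search: rank chunks by Jaccard token overlap."""
    q = tokens(query)
    scored = []
    for chunk_id, text in CHUNKS.items():
        t = tokens(text)
        score = len(q & t) / len(q | t)
        if score > 0:
            scored.append((score, chunk_id))
    scored.sort(reverse=True)
    return [chunk_id for _, chunk_id in scored[:k]]


def regex_search(pattern):
    """Exact pattern match over every indexed chunk."""
    return [cid for cid, text in CHUNKS.items() if re.search(pattern, text)]


def extract_symbols(text):
    """Pull function names that are defined or called in a chunk."""
    return set(re.findall(r"\b([a-z_]+)\(", text))


def research(query, max_hops=2):
    """BFS over search results: semantic hops plus regex symbol lookups."""
    seen = set()
    frontier = deque((cid, 0) for cid in semantic_search(query))
    while frontier:
        chunk_id, hop = frontier.popleft()
        if chunk_id in seen:
            continue
        seen.add(chunk_id)  # deduplicate by chunk ID
        if hop >= max_hops:
            continue
        text = CHUNKS[chunk_id]
        # Semantic "edges": chunks similar to the one just visited.
        neighbors = semantic_search(text)
        # Symbol "edges": every chunk that mentions a symbol from this chunk.
        for symbol in extract_symbols(text):
            neighbors += regex_search(rf"\b{re.escape(symbol)}\(")
        frontier.extend((n, hop + 1) for n in neighbors if n not in seen)
    return seen


print(sorted(research("login user token")))
```

In this toy index the initial semantic search only surfaces the login and token chunks; the verify chunk is reached through the symbol lookup, which is the part that plays the role of a graph edge. The unrelated billing chunk is never visited.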
258283

259284
#### Hybrid Semantic + Symbol Search
260285

@@ -268,15 +293,18 @@ The results are unified through simple deduplication by chunk ID. Semantic resul
268293

269294
#### Why This Works
270295

271-
Graph RAG's benefits come from following relationships between connected information. ChunkHound achieves this dynamically:
296+
Traditional semantic search finds conceptually similar code but misses architectural relationships. Knowledge graphs model these relationships explicitly but require expensive upfront extraction and ongoing maintenance.
297+
298+
Code Research combines base search capabilities (semantic + regex) with intelligent orchestration:
272299

273-
1. **No upfront cost**: No entity extraction, no graph construction, no graph maintenance
274-
2. **Query-adaptive**: Relationships discovered on-demand based on what each specific query needs
275-
3. **Language-agnostic**: Works across all 29 supported file types through universal AST patterns
276-
4. **Precise + conceptual**: Combines semantic understanding with exact symbol matching
277-
5. **Scales automatically**: Token budgets and traversal adapt to repository size
300+
1. **Query expansion** - Multiple semantic entry points discover different code neighborhoods
301+
2. **Multi-hop exploration** - BFS through semantic neighborhoods following architectural connections
302+
3. **Symbol extraction + regex** - Comprehensive coverage beyond semantic discovery
303+
4. **Follow-up generation** - Context-aware questions explore architectural boundaries
304+
5. **Adaptive scaling** - Token budgets (30k-150k) scale with codebase size
305+
6. **Map-reduce synthesis** - Parallel cluster synthesis with deterministic citation remapping
278306

279-
The result: Graph RAG's architectural understanding and multi-hop discovery, achieved through the semantic knowledge already embedded in your cAST-parsed codebase.
307+
The virtual graph emerges through orchestrated tool use—no upfront construction, no separate storage, no synchronization overhead. Query-adaptive orchestration scales from quick searches to deep architectural exploration automatically.
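
To make "deterministic citation remapping" concrete, here is a minimal sketch. It assumes each cluster summary cites its sources with local markers such as `[1]` and carries its own ordered source list; the function name `merge_cluster_summaries`, the `[n]` marker format, and the data shapes are illustrative assumptions, since the source does not describe the actual synthesis format.

```python
import re


def merge_cluster_summaries(clusters):
    """Merge per-cluster summaries, renumbering local [n] citations globally.

    `clusters` is a list of (summary_text, sources) pairs, where `sources`
    is the ordered list of file paths that the local markers [1], [2], ...
    refer to. Global numbers are assigned in first-seen order, so the same
    input always yields the same output.
    """
    global_index = {}  # source path -> global citation number
    merged = []
    for summary, sources in clusters:
        def remap(match, sources=sources):
            source = sources[int(match.group(1)) - 1]
            # setdefault assigns the next number only for unseen sources.
            number = global_index.setdefault(source, len(global_index) + 1)
            return f"[{number}]"

        merged.append(re.sub(r"\[(\d+)\]", remap, summary))
    ordered_sources = sorted(global_index, key=global_index.get)
    return " ".join(merged), ordered_sources


text, sources = merge_cluster_summaries([
    ("Login issues a token [1] and verifies it [2].",
     ["auth/token.py", "auth/verify.py"]),
    ("Invoices [1] check the token first [2].",
     ["billing/invoice.py", "auth/verify.py"]),
])
print(text)
print(sources)
```

The second cluster's local `[1]` becomes a new global number, while its `[2]` collapses onto the number already assigned to the same file by the first cluster. Cluster summaries could be produced in parallel; only this merge step needs to run in a fixed order.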
280308

281309
### Adaptive Scaling
282310

src/content/docs/index.mdx

Lines changed: 50 additions & 11 deletions
@@ -24,23 +24,62 @@ Your AI assistant helps you code but lacks critical context:
2424

2525
## What is Deep Research for Code & Files?
2626

27-
Just like Deep Research transforms how you explore the web, ChunkHound with the [Code Research](/code-research) tool brings Deep Research to your local codebase and files. **Search across 29 file types** including Python, JavaScript, TypeScript, Markdown, PDFs, and more.
27+
Just like Deep Research transforms how you explore the web, ChunkHound's [Code Research](/code-research) tool brings iterative, multi-hop exploration to your local codebase. Code Research is a **specialized orchestration sub-agent** that uses ChunkHound's semantic and regex search tools strategically, exploring your codebase like an experienced engineer across **29 file types** including Python, JavaScript, TypeScript, Markdown, PDFs, and more.
2828

29-
- **Iterative Discovery** - Start with a concept, expand through code, markdown files, and comments
30-
- **Multi-hop Search** - Find connections between implementation and written knowledge
31-
- **Architecture Map** - Understand relationships across code files, README files, and design notes
32-
- **Complete Context** - Find that `validateEmail()` AND the markdown file explaining why it exists
29+
## Why ChunkHound is Different
30+
31+
**Two-Layer Architecture: Best of Both Worlds**
32+
33+
ChunkHound provides both traditional RAG capabilities AND intelligent orchestration for deep exploration:
34+
35+
### Base Layer: Enhanced RAG
36+
Like traditional RAG systems, ChunkHound maintains an index and provides search tools—but with critical improvements:
37+
- **[cAST chunking](/under-the-hood#the-cast-algorithm)**: Structure-aware code segmentation (4.3 point gain on retrieval benchmarks)
38+
- **Semantic search**: Natural language queries via HNSW vector indexing
39+
- **Regex search**: Exact pattern matching for comprehensive symbol coverage
40+
41+
### Orchestration Layer: Code Research Sub-Agent
42+
The [Code Research](/code-research) tool is a specialized orchestration layer that uses base search tools strategically:
43+
- **Multi-hop exploration**: BFS traversal discovering architectural relationships
44+
- **Query expansion**: Multiple semantic entry points to cast wider nets
45+
- **Follow-up generation**: Iterative questioning based on discovered code
46+
- **Adaptive scaling**: Token budgets automatically scale from 30k-150k based on repository size
47+
- **Map-reduce synthesis**: Handles millions of lines without context collapse
48+
49+
**The result**: Virtual Graph RAG behavior through orchestration, not explicit graph construction.
50+
51+
**What this means for you:**
52+
- **Use what you need**: Direct semantic/regex search for quick lookups, Code Research for architectural exploration
53+
- **Zero upfront cost**: No entity extraction, no graph database to maintain
54+
- **Query-adaptive**: Simple questions get fast answers, complex questions trigger deep exploration automatically
55+
- **Scales to monorepos**: Orchestration layer adapts exploration depth and synthesis budgets to codebase size
56+
57+
**Compare approaches:**
58+
59+
| Approach | Base Capability | Orchestration | Monorepo Scale | Maintenance |
60+
|----------|----------------|---------------|----------------|-------------|
61+
| **Keyword Search** | Exact matching | None | ✓ Fast | None |
62+
| **Traditional RAG** | Semantic search | None | ✓ Scales | Re-index files |
63+
| **Knowledge Graphs** | Relationship queries | Pre-computed | ✗ Expensive | Continuous sync |
64+
| **ChunkHound** | Semantic + Regex | Code Research sub-agent | ✓ Automatic | Automatic (incremental + realtime) |
3365

3466
## Production Ready
3567

36-
**Battle-tested at scale:**
37-
- Handles codebases from thousands to **millions of lines**
38-
- **29 languages and formats** with structured parsing (22 programming languages, 5 config formats, 2 document formats)
39-
- **5 minutes** from installation to first search
40-
- **Zero** cloud dependencies - runs entirely local
68+
**Battle-tested at monorepo scale:**
69+
- **Millions of lines** across multi-language codebases
70+
- **29 languages and formats** with AST-aware parsing (Python, TypeScript, Go, Rust, C++, Java, and more)
71+
- **5 minutes** from installation to first deep research query
72+
- **Zero** cloud dependencies - your code stays local, searches stay fast
73+
- **Automatic scaling** - token budgets and exploration depth adapt to repository size
74+
75+
**Ideal for:**
76+
- **Large monorepos** with cross-team dependencies and circular references
77+
- **Multi-language projects** requiring consistent search across all code
78+
- **Security-sensitive codebases** that can't use cloud-based code search
79+
- **Offline development** environments or air-gapped systems
4180

4281
Built on proven foundations:<br />
43-
[Tree-sitter](https://tree-sitter.github.io/tree-sitter/) for parsing • [DuckDB](https://duckdb.org/) for local storage • [MCP](https://modelcontextprotocol.io/) for AI integration
82+
[Tree-sitter](https://tree-sitter.github.io/tree-sitter/) for parsing • [DuckDB](https://duckdb.org/) for local vector search • [MCP](https://modelcontextprotocol.io/) for AI integration
4483

4584
**Stop recreating code. Start with deep understanding.**
4685

src/content/docs/under-the-hood.mdx

Lines changed: 84 additions & 1 deletion
@@ -36,6 +36,84 @@ ChunkHound uses a local-first architecture with embedded databases and universal
3636

3737
ChunkHound's local-first architecture provides key advantages: **Privacy** - Your code never leaves your machine. **Speed** - No network latency or API rate limits. **Reliability** - Works offline and in air-gapped environments. **Cost** - No per-token charges for indexing large codebases.
3838

39+
## Design Philosophy
40+
41+
ChunkHound's architecture is built on a **two-layer design** optimized for large monorepos and multi-language codebases:
42+
43+
### Layer 1: Enhanced RAG Foundation
44+
45+
ChunkHound provides traditional RAG capabilities with critical improvements:
46+
47+
**cAST Chunking (Base Parsing):**
48+
- Structure-aware code segmentation preserving AST relationships
49+
- Function signatures, class hierarchies, imports, nesting maintained
50+
- 4.3 point gain on retrieval benchmarks vs fixed-size chunking
51+
52+
**Semantic Search (Base Tool):**
53+
- HNSW vector indexing for fast nearest-neighbor retrieval
54+
- Optional multi-hop expansion with reranking (when provider supports it)
55+
- Finds conceptually similar code across all 29 supported file types
56+
57+
**Regex Search (Base Tool):**
58+
- Pattern matching against indexed content
59+
- Exact symbol reference discovery
60+
- Zero API costs (local database operation)
61+
62+
These base capabilities work like traditional RAG systems—maintain an index, support semantic and regex queries—but with structure-aware chunking that preserves code meaning.
63+
64+
### Layer 2: Code Research Orchestration
65+
66+
The [Code Research](/code-research) tool is a specialized sub-agent that orchestrates base tools strategically:
67+
68+
**1. Query Expansion (LLM-driven)**
69+
- Generates multiple semantic entry points
70+
- Explores different "neighborhoods" in parallel
71+
- Casts wider net before narrowing to high-relevance results
72+
73+
**2. Multi-hop Exploration (Iterative tool use)**
74+
- Calls semantic search repeatedly, following BFS through related chunks
75+
- Each hop discovers new semantic neighborhoods
76+
- Reranking maintains focus on original query
77+
78+
**3. Symbol Extraction + Regex (Hybrid approach)**
79+
- Pulls symbols from high-relevance semantic results
80+
- Triggers parallel regex searches for comprehensive coverage
81+
- Combines conceptual discovery (semantic) with exhaustive matching (regex)
82+
83+
**4. Follow-up Generation (Recursive exploration)**
84+
- Creates context-aware questions based on discovered code
85+
- Explores architectural boundaries iteratively
86+
- Maintains ancestor chain for coherent traversal
87+
88+
**5. Adaptive Scaling (Codebase-aware budgets)**
89+
- Token budgets scale 30k-150k based on repository size
90+
- [Convergence detection](#convergence-detection) prevents infinite loops
91+
- Map-reduce synthesis for large result sets
92+
93+
**Performance Impact**: Virtual graph behavior through orchestration, not explicit graph construction. Zero upfront extraction cost, zero graph storage, zero synchronization overhead.
94+
95+
**For monorepos specifically**: "How does auth work?" in a 10KB project triggers quick semantic search, while the same question in a 10M LOC monorepo automatically scales orchestration—deeper exploration, larger budgets, map-reduce synthesis—no manual tuning required.
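
One way to picture the adaptive budget is a simple interpolation between the two documented bounds. The sketch below is an assumption-heavy illustration: only the 30k and 150k limits come from the text, while the function name `synthesis_budget`, the repository-size thresholds, and the log-scale interpolation are hypothetical choices made for this example.

```python
import math

MIN_BUDGET = 30_000    # lower bound quoted in the docs
MAX_BUDGET = 150_000   # upper bound quoted in the docs
SMALL_REPO_LOC = 10_000        # assumed: at or below this, use the minimum
LARGE_REPO_LOC = 10_000_000    # assumed: at or above this, use the maximum


def synthesis_budget(lines_of_code):
    """Interpolate a token budget on a log scale of repository size."""
    if lines_of_code <= SMALL_REPO_LOC:
        return MIN_BUDGET
    if lines_of_code >= LARGE_REPO_LOC:
        return MAX_BUDGET
    span = math.log10(LARGE_REPO_LOC) - math.log10(SMALL_REPO_LOC)
    fraction = (math.log10(lines_of_code) - math.log10(SMALL_REPO_LOC)) / span
    return round(MIN_BUDGET + fraction * (MAX_BUDGET - MIN_BUDGET))


for loc in (2_000, 100_000, 1_000_000, 10_000_000):
    print(loc, synthesis_budget(loc))
```

A log scale is used here because repository sizes span several orders of magnitude; a linear scale would give nearly every mid-sized repository a budget close to the minimum.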
96+
97+
---
98+
99+
## Comparison with Alternative Approaches
100+
101+
| Dimension | Traditional RAG | Explicit Knowledge Graphs | Agentic Search | ChunkHound Base | ChunkHound + Code Research |
102+
|-----------|----------------|---------------------------|----------------|-----------------|---------------------------|
103+
| **Discovery depth** | Single-hop semantic | Multi-hop (follow edges) | Iterative exploration | Semantic + Regex | Orchestrated multi-hop BFS |
104+
| **Setup time** | Minutes (indexing) | Hours (extraction + build) | Minutes (indexing) | Minutes (indexing) | Minutes (indexing) |
105+
| **Maintenance** | Re-index files | Re-extract + sync graph | Re-index files | Re-index files | Re-index files |
106+
| **Monorepo scale** | ✓ Fast | ✗ Expensive (quadratic) | ✓ Depends on agent | ✓ Fast | ✓ Automatic scaling |
107+
| **Architecture understanding** | ✗ Limited | ✓ Explicit relationships | ✓ Through exploration | ~ Basic relationships | ✓ Virtual graph via orchestration |
108+
| **Query adaptation** | Fixed | Fixed | ✓ Adaptive | Fixed | ✓ Adaptive (budgets + convergence) |
109+
| **Language support** | Per-language tuning | Per-language extraction | Depends on tools | Universal (29 types) | Universal (29 types) |
110+
| **Relationship tracking** | None | Pre-computed | Tool-dependent | None | On-demand via orchestration |
111+
| **Synthesis** | None | None | LLM-dependent | None | Map-reduce with citations |
112+
113+
**Key insight**: ChunkHound provides **both layers**—use base semantic/regex for quick queries, trigger Code Research orchestration for architectural exploration. Best of both worlds.
114+
115+
---
116+
39117
## The cAST Algorithm
40118

41119
When AI assistants search your codebase, they need code split into "chunks" - searchable pieces small enough to understand but large enough to be meaningful. The challenge: how do you split code without breaking its logic?
@@ -171,7 +249,12 @@ Multi-hop search implements several termination criteria to balance comprehensiv
171249

172250
The system monitors three key signals for termination. First, it employs **rate-of-change monitoring**, similar to early stopping in machine learning: expansion stops when reranking scores degrade by more than 0.15 between iterations, which indicates diminishing relevance returns. This derivative-based stopping criterion is common in optimization algorithms and effectively measures the "convergence velocity" of score improvements. Second, it respects computational boundaries on both execution time (5 seconds maximum) and result volume (500 candidates maximum). Third, it detects resource exhaustion when fewer than 5 high-scoring candidates remain for productive expansion.
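
The three termination signals can be expressed as a single check run after each hop. This is a minimal sketch, not the project's code: the thresholds are the ones quoted above, but the function name `should_stop`, its parameters, and the choice to compare the best reranking score of consecutive iterations are assumptions made for illustration.

```python
import time

SCORE_DROP_LIMIT = 0.15      # stop if the score falls by more than this
TIME_LIMIT_SECONDS = 5.0     # execution time cap
MAX_CANDIDATES = 500         # result volume cap
MIN_HIGH_SCORING = 5         # need at least this many candidates to expand


def should_stop(prev_best, curr_best, started_at, total_candidates,
                high_scoring_count, now=None):
    """Return the reason to stop expanding, or None to continue.

    `prev_best` and `curr_best` are the top reranking scores of the previous
    and current iteration (an assumed way to aggregate scores).
    """
    now = time.monotonic() if now is None else now
    if prev_best - curr_best > SCORE_DROP_LIMIT:
        return "score_degradation"
    if now - started_at >= TIME_LIMIT_SECONDS:
        return "time_limit"
    if total_candidates >= MAX_CANDIDATES:
        return "candidate_limit"
    if high_scoring_count < MIN_HIGH_SCORING:
        return "too_few_candidates"
    return None


start = time.monotonic()
print(should_stop(0.90, 0.70, start, total_candidates=120, high_scoring_count=12))
```

Returning a reason string rather than a boolean makes it easy to log why a traversal ended, which helps when tuning the thresholds.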
173251

174-
This convergence detection creates a practical balance. The algorithm explores broadly enough to discover cross-domain relationships while terminating before semantic drift compromises result quality.
252+
**Monorepo-specific convergence behavior**: In large monorepos with circular dependencies and cross-cutting concerns, convergence detection prevents infinite loops. For example, searching for "logging" in a million-line codebase might discover:
253+
- Hop 1: Core logging modules (50 chunks)
254+
- Hop 2: Error handling that uses logging (200 chunks)
255+
- Hop 3: Business logic that uses error handling (1000+ chunks, but scores degrading)
256+
257+
The gradient-based stopping recognizes when hop 3 adds volume without relevance, terminating before exploring every file that transitively imports logging. This creates a practical balance: the algorithm explores broadly enough to discover cross-domain relationships while terminating before semantic drift compromises result quality.
175258

176259
<Aside type="tip" title="When Multi-Hop Activates">
177260
Multi-hop search automatically activates when you use providers with reranking support:
