File: src/content/docs/code-research.mdx
Lines changed: 42 additions & 14 deletions
@@ -78,6 +78,14 @@ Code Research features won't work without LLM configuration. However:
 - **Refactoring prep** - Understand all dependencies before making changes
 - **Code archaeology** - Learn unfamiliar systems quickly

+## When to Use Direct Search Instead
+
+Code Research is designed for architectural exploration. For simpler queries, use the base search tools directly:
+
+- **Quick symbol lookups** - Use regex search to find all occurrences of a specific function or class name
+- **Known file/function** - Use semantic search when you know roughly what you're looking for
+- **Architectural questions** - Use Code Research to understand how components interact and why
+
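The routing guidance in the "When to Use Direct Search Instead" section can be read as a small lookup table. The sketch below is only an illustration of that decision: the `QueryKind` enum, the `choose_tool` helper, and the tool labels are hypothetical names invented here, not ChunkHound identifiers.

```python
from enum import Enum


class QueryKind(Enum):
    """Hypothetical categories mirroring the three bullets in the docs."""
    SYMBOL_LOOKUP = "symbol_lookup"    # find every occurrence of a known name
    KNOWN_TARGET = "known_target"      # roughly know what you are looking for
    ARCHITECTURAL = "architectural"    # how components interact and why


# Illustrative labels only; the real tool names may differ.
ROUTES = {
    QueryKind.SYMBOL_LOOKUP: "regex_search",
    QueryKind.KNOWN_TARGET: "semantic_search",
    QueryKind.ARCHITECTURAL: "code_research",
}


def choose_tool(kind: QueryKind) -> str:
    """Return the label of the tool suggested for this kind of query."""
    return ROUTES[kind]
```

For example, `choose_tool(QueryKind.SYMBOL_LOOKUP)` returns `"regex_search"`, matching the first bullet.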
 ---

 ## Usage
@@ -244,17 +252,34 @@ At each level, an LLM generates **context-aware follow-up questions** to explore

 ### Graph RAG Without the Graph

-Traditional Graph RAG systems build explicit knowledge graphs: extracting entities, mining relationships, and storing them in graph databases. ChunkHound achieves the same **multi-hop exploration** and **relationship-aware retrieval** without any graph construction, leveraging the semantic structure already captured during the cAST chunking pass.
+Traditional Graph RAG systems build explicit knowledge graphs: extracting entities, mining relationships, and storing them in graph databases. Code Research **approximates graph-like exploration** through orchestration, trading explicit relationship modeling for zero upfront cost and automatic scaling.
+
+#### How Orchestration Creates a Virtual Graph

-#### Virtual Graph Through cAST
+ChunkHound's base layer (cAST index + semantic/regex search) provides traditional RAG capabilities. The Code Research sub-agent orchestrates these tools to create Graph RAG behavior:

-The cAST (context-aware AST) chunking algorithm doesn't just split code into pieces; it creates a **virtual graph structure** where:
+**Base Layer Foundation:**
+- **Chunks as nodes**: cAST chunking preserves metadata (function names, class hierarchies, parameters, imports)
+- **Vector similarity as edges**: Semantic search finds conceptually related chunks via HNSW index
+- **Symbol references as edges**: Regex search finds all exact symbol occurrences

-- **Nodes**: Semantic chunks enriched with metadata (function names, class hierarchies, parameter lists, import paths)
-- **Traversal**: Multi-hop semantic search explores the vector space like a breadth-first graph walk
+**Orchestration Layer Creates the Graph:**
+- **BFS traversal**: Iteratively calls semantic search, starting from initial results and expanding through related chunks
+- **Query expansion**: Generates multiple semantic entry points, exploring different "neighborhoods" in parallel
+- **Symbol extraction + regex**: Pulls symbols from semantic results, triggers parallel regex to find all references
+- **Follow-up questions**: Creates targeted queries based on discovered code, recursively exploring architectural boundaries
+- **Convergence detection**: Monitors score degradation to prevent infinite traversal

-Because cAST preserves AST relationships during chunking (which functions call what, which classes inherit from where, which modules import others), the chunks already encode the code's architectural graph, with **no separate extraction needed**. This virtual graph approach scales efficiently to multi-million LOC repositories without the computational overhead and maintenance burden of explicit graph structures, which become increasingly expensive at scale.
+Because cAST chunks preserve semantic boundaries, multi-hop expansion follows meaningful architectural connections rather than arbitrary text proximity. This structural awareness is why orchestration can approximate graph traversal: the base chunks already encode relationships that orchestration discovers through iterative search.
+
+The virtual graph emerges through **orchestrated tool use**, not pre-computed storage:
+- Multi-hop expansion → follows vector similarity "edges" through BFS
+- Symbol extraction → identifies key entities from high-relevance results
+- Regex search → finds all references, completing the "graph" of connections
+- Follow-ups → explores architectural relationships discovered in results
+
+This approach scales efficiently to multi-million LOC repositories because there's no explicit graph to maintain: the "graph" is the pattern of orchestrated search calls, adapted dynamically to each query's needs.
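To make the "BFS traversal" idea above concrete, here is a minimal sketch of breadth-first expansion where edges are discovered by calling a search function on demand rather than read from a stored graph. Everything in it is an assumption for illustration: `TOY_NEIGHBORS`, `semantic_search`, `multi_hop`, and the `min_score` and `max_chunks` parameters are made-up stand-ins, not ChunkHound's API or its actual thresholds.

```python
from collections import deque

# Toy stand-in for a semantic index: chunk id -> [(neighbor id, similarity)].
TOY_NEIGHBORS = {
    "auth.login": [("auth.session", 0.9), ("auth.tokens", 0.85)],
    "auth.session": [("db.users", 0.8), ("auth.login", 0.9)],
    "auth.tokens": [("crypto.sign", 0.75)],
    "db.users": [("util.strings", 0.3)],
    "crypto.sign": [],
    "util.strings": [],
}


def semantic_search(chunk_id):
    """Hypothetical search call: returns related chunks with scores."""
    return TOY_NEIGHBORS.get(chunk_id, [])


def multi_hop(seed, min_score=0.5, max_chunks=50):
    """Breadth-first expansion over similarity 'edges' found on demand."""
    visited = {seed}
    order = [seed]
    queue = deque([seed])
    while queue:
        current = queue.popleft()
        for neighbor, score in semantic_search(current):
            # Skip weak matches and chunks already seen (this handles cycles).
            if score < min_score or neighbor in visited:
                continue
            visited.add(neighbor)
            order.append(neighbor)
            queue.append(neighbor)
            if len(order) >= max_chunks:
                return order  # hard cap on result volume
    return order
```

Calling `multi_hop("auth.login")` visits the login chunk, then its two neighbors, then their neighbors, and never reaches `util.strings` because its similarity falls below `min_score`. The `visited` set is what keeps the circular reference between `auth.login` and `auth.session` from looping.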

 #### Hybrid Semantic + Symbol Search

@@ -268,15 +293,18 @@ The results are unified through simple deduplication by chunk ID. Semantic resul

 #### Why This Works

-Graph RAG's benefits come from following relationships between connected information. ChunkHound achieves this dynamically:
+Traditional semantic search finds conceptually similar code but misses architectural relationships. Knowledge graphs model these relationships explicitly but require expensive upfront extraction and ongoing maintenance.
+
+Code Research combines base search capabilities (semantic + regex) with intelligent orchestration:

-1. **No upfront cost**: No entity extraction, no graph construction, no graph maintenance
-2. **Query-adaptive**: Relationships discovered on-demand based on what each specific query needs
-3. **Language-agnostic**: Works across all 29 supported file types through universal AST patterns
-4. **Precise + conceptual**: Combines semantic understanding with exact symbol matching
-5. **Scales automatically**: Token budgets and traversal adapt to repository size
+5. **Adaptive scaling** - Token budgets (30k-150k) scale with codebase size
+6. **Map-reduce synthesis** - Parallel cluster synthesis with deterministic citation remapping

-The result: Graph RAG's architectural understanding and multi-hop discovery, achieved through the semantic knowledge already embedded in your cAST-parsed codebase.
+The virtual graph emerges through orchestrated tool use: no upfront construction, no separate storage, no synchronization overhead. Query-adaptive orchestration scales from quick searches to deep architectural exploration automatically.
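The "Adaptive scaling" item above gives only the budget range (30k-150k tokens), not the scaling rule. As a purely illustrative sketch, one plausible rule is log-scale interpolation between two size thresholds. The thresholds `SMALL_LOC` and `LARGE_LOC`, the `token_budget` helper, and the choice of a logarithmic curve are all assumptions made here, not the documented behavior.

```python
import math

MIN_BUDGET = 30_000    # lower bound quoted in the docs
MAX_BUDGET = 150_000   # upper bound quoted in the docs

# Assumed size range for interpolation; these thresholds are illustrative only.
SMALL_LOC = 10_000
LARGE_LOC = 10_000_000


def token_budget(lines_of_code: int) -> int:
    """Interpolate the budget on a log scale, clamped to the quoted bounds."""
    if lines_of_code <= SMALL_LOC:
        return MIN_BUDGET
    if lines_of_code >= LARGE_LOC:
        return MAX_BUDGET
    span = math.log10(LARGE_LOC) - math.log10(SMALL_LOC)
    fraction = (math.log10(lines_of_code) - math.log10(SMALL_LOC)) / span
    return round(MIN_BUDGET + fraction * (MAX_BUDGET - MIN_BUDGET))
```

A log scale is used in this sketch because repository sizes span several orders of magnitude; a linear rule would leave almost every mid-sized project near the minimum budget.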
File: src/content/docs/index.mdx
Lines changed: 50 additions & 11 deletions
@@ -24,23 +24,62 @@ Your AI assistant helps you code but lacks critical context:

 ## What is Deep Research for Code & Files?

-Just like Deep Research transforms how you explore the web, ChunkHound with the [Code Research](/code-research) tool brings Deep Research to your local codebase and files. **Search across 29 file types** including Python, JavaScript, TypeScript, Markdown, PDFs, and more.
+Just like Deep Research transforms how you explore the web, ChunkHound's [Code Research](/code-research) tool brings iterative, multi-hop exploration to your local codebase. Code Research is a **specialized orchestration sub-agent** that uses ChunkHound's semantic and regex search tools strategically, exploring your codebase like an experienced engineer across **29 file types** including Python, JavaScript, TypeScript, Markdown, PDFs, and more.

-- **Iterative Discovery** - Start with a concept, expand through code, markdown files, and comments
-- **Multi-hop Search** - Find connections between implementation and written knowledge
-- **Architecture Map** - Understand relationships across code files, README files, and design notes
-- **Complete Context** - Find that `validateEmail()` AND the markdown file explaining why it exists
+## Why ChunkHound is Different
+
+**Two-Layer Architecture: Best of Both Worlds**
+
+ChunkHound provides both traditional RAG capabilities AND intelligent orchestration for deep exploration:
+
+### Base Layer: Enhanced RAG
+Like traditional RAG systems, ChunkHound maintains an index and provides search tools, but with critical improvements:
+- **[cAST chunking](/under-the-hood#the-cast-algorithm)**: Structure-aware code segmentation (4.3 point gain on retrieval benchmarks)
+- **Semantic search**: Natural language queries via HNSW vector indexing
+- **Regex search**: Exact pattern matching for comprehensive symbol coverage
+
+### Orchestration Layer: Code Research Sub-Agent
+The [Code Research](/code-research) tool is a specialized orchestration layer that uses base search tools strategically:
-- Handles codebases from thousands to **millions of lines**
-- **29 languages and formats** with structured parsing (22 programming languages, 5 config formats, 2 document formats)
-- **5 minutes** from installation to first search
-- **Zero** cloud dependencies - runs entirely local
+**Battle-tested at monorepo scale:**
+- **Millions of lines** across multi-language codebases
+- **29 languages and formats** with AST-aware parsing (Python, TypeScript, Go, Rust, C++, Java, and more)
+- **5 minutes** from installation to first deep research query
+- **Zero** cloud dependencies - your code stays local, searches stay fast
+- **Automatic scaling** - token budgets and exploration depth adapt to repository size
+
+**Ideal for:**
+- **Large monorepos** with cross-team dependencies and circular references
+- **Multi-language projects** requiring consistent search across all code
+- **Security-sensitive codebases** that can't use cloud-based code search
+- **Offline development** environments or air-gapped systems

 Built on proven foundations:<br />
-[Tree-sitter](https://tree-sitter.github.io/tree-sitter/) for parsing • [DuckDB](https://duckdb.org/) for local storage • [MCP](https://modelcontextprotocol.io/) for AI integration
+[Tree-sitter](https://tree-sitter.github.io/tree-sitter/) for parsing • [DuckDB](https://duckdb.org/) for local vector search • [MCP](https://modelcontextprotocol.io/) for AI integration

 **Stop recreating code. Start with deep understanding.**
File: src/content/docs/under-the-hood.mdx
Lines changed: 84 additions & 1 deletion
@@ -36,6 +36,84 @@ ChunkHound uses a local-first architecture with embedded databases and universal

 ChunkHound's local-first architecture provides key advantages: **Privacy** - Your code never leaves your machine. **Speed** - No network latency or API rate limits. **Reliability** - Works offline and in air-gapped environments. **Cost** - No per-token charges for indexing large codebases.

+## Design Philosophy
+
+ChunkHound's architecture is built on a **two-layer design** optimized for large monorepos and multi-language codebases:
+
+### Layer 1: Enhanced RAG Foundation
+
+ChunkHound provides traditional RAG capabilities with critical improvements:
+- Function signatures, class hierarchies, imports, nesting maintained
+- 4.3 point gain on retrieval benchmarks vs fixed-size chunking
+
+**Semantic Search (Base Tool):**
+- HNSW vector indexing for fast nearest-neighbor retrieval
+- Optional multi-hop expansion with reranking (when provider supports it)
+- Finds conceptually similar code across all 29 supported file types
+
+**Regex Search (Base Tool):**
+- Pattern matching against indexed content
+- Exact symbol reference discovery
+- Zero API costs (local database operation)
+
+These base capabilities work like traditional RAG systems (maintain an index, support semantic and regex queries) but with structure-aware chunking that preserves code meaning.
+
+### Layer 2: Code Research Orchestration
+
+The [Code Research](/code-research) tool is a specialized sub-agent that orchestrates base tools strategically:
+
+**1. Query Expansion (LLM-driven)**
+- Generates multiple semantic entry points
+- Explores different "neighborhoods" in parallel
+- Casts wider net before narrowing to high-relevance results
+
+**2. Multi-hop Exploration (Iterative tool use)**
+- Calls semantic search repeatedly, following BFS through related chunks
+- Each hop discovers new semantic neighborhoods
+- Reranking maintains focus on original query
+
+**3. Symbol Extraction + Regex (Hybrid approach)**
+- Pulls symbols from high-relevance semantic results
+- Triggers parallel regex searches for comprehensive coverage
+- Combines conceptual discovery (semantic) with exhaustive matching (regex)
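Step 3 above (symbol extraction followed by parallel regex) can be sketched with the standard library alone. This is a toy illustration under stated assumptions: the `CHUNKS` corpus, the `extract_symbols`, `regex_search`, and `find_references` helpers, and the naive `def`/`class` pattern are all invented here. ChunkHound's actual extraction works on parsed chunks and may look quite different.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Toy corpus standing in for indexed chunks: chunk id -> source text.
CHUNKS = {
    "auth.py:1": "def validate_token(token):\n    return decode(token)",
    "api.py:7": "user = validate_token(request.token)",
    "jobs.py:3": "class TokenRefresher:\n    def run(self):\n        validate_token(self.t)",
    "util.py:2": "def slugify(text):\n    return text.lower()",
}

# Naive pattern for Python definitions; real extraction would use the AST.
DEFINITION = re.compile(r"^\s*(?:def|class)\s+([A-Za-z_]\w*)", re.MULTILINE)


def extract_symbols(chunk_ids):
    """Collect function and class names defined in the given chunks."""
    symbols = set()
    for chunk_id in chunk_ids:
        symbols.update(DEFINITION.findall(CHUNKS[chunk_id]))
    return sorted(symbols)


def regex_search(symbol):
    """Return ids of every chunk containing the symbol as a whole word."""
    pattern = re.compile(rf"\b{re.escape(symbol)}\b")
    return sorted(cid for cid, text in CHUNKS.items() if pattern.search(text))


def find_references(semantic_hits):
    """Run one regex search per extracted symbol, in parallel."""
    symbols = extract_symbols(semantic_hits)
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(regex_search, symbols))
    return dict(zip(symbols, results))
```

Starting from a single semantic hit, `find_references(["auth.py:1"])` extracts `validate_token` and finds it in three chunks, including two call sites that a similarity search alone might rank too low to surface.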
+**Performance Impact**: Virtual graph behavior through orchestration, not explicit graph construction. Zero upfront extraction cost, zero graph storage, zero synchronization overhead.
+
+**For monorepos specifically**: "How does auth work?" in a 10KB project triggers quick semantic search, while the same question in a 10M LOC monorepo automatically scales orchestration (deeper exploration, larger budgets, map-reduce synthesis) with no manual tuning required.
+
+---
+
+## Comparison with Alternative Approaches
+
+| Dimension | Traditional RAG | Explicit Knowledge Graphs | Agentic Search | ChunkHound Base | ChunkHound + Code Research |
+**Key insight**: ChunkHound provides **both layers**: use base semantic/regex for quick queries, trigger Code Research orchestration for architectural exploration. Best of both worlds.
+
+---
+
 ## The cAST Algorithm

 When AI assistants search your codebase, they need code split into "chunks" - searchable pieces small enough to understand but large enough to be meaningful. The challenge: how do you split code without breaking its logic?
@@ -171,7 +249,12 @@ Multi-hop search implements several termination criteria to balance comprehensiv

 The system monitors three key signals for termination. First, it employs **rate-of-change monitoring** similar to early stopping in machine learning: the search stops when reranking scores degrade by more than 0.15 between iterations, which indicates diminishing relevance returns. This derivative-based stopping criterion is common in optimization algorithms, effectively measuring the "convergence velocity" of score improvements. Second, it respects computational boundaries: both execution time (5 seconds maximum) and result volume (500 candidates maximum). Third, it detects resource exhaustion when fewer than 5 high-scoring candidates remain for productive expansion.

-This convergence detection creates a practical balance. The algorithm explores broadly enough to discover cross-domain relationships while terminating before semantic drift compromises result quality.
+**Monorepo-specific convergence behavior**: In large monorepos with circular dependencies and cross-cutting concerns, convergence detection prevents infinite loops. For example, searching for "logging" in a million-line codebase might discover:
+- Hop 1: Core logging modules (50 chunks)
+- Hop 2: Error handling that uses logging (200 chunks)
+- Hop 3: Business logic that uses error handling (1000+ chunks, but scores degrading)
+
+The gradient-based stopping criterion recognizes when hop 3 adds volume without relevance, terminating before exploring every file that transitively imports logging. This creates a practical balance: the algorithm explores broadly enough to discover cross-domain relationships while terminating before semantic drift compromises result quality.
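The three termination signals described in this hunk can be combined into a single check. The sketch below uses the limits quoted in the text (0.15 score drop, 5 seconds, 500 candidates, fewer than 5 expandable candidates). The `should_stop` function, its parameters, and the reason strings are hypothetical, and which score is compared between hops (for instance the top or the mean reranking score) is an assumption the text does not settle.

```python
import time

SCORE_DROP_LIMIT = 0.15   # stop when scores degrade by more than this per hop
MAX_SECONDS = 5.0         # execution time ceiling
MAX_CANDIDATES = 500      # result volume ceiling
MIN_EXPANDABLE = 5        # fewer high-scoring candidates than this ends expansion


def should_stop(prev_score, curr_score, started_at, total_candidates,
                expandable, now=None):
    """Return the reason to stop as a string, or None to keep expanding."""
    now = time.monotonic() if now is None else now
    if prev_score - curr_score > SCORE_DROP_LIMIT:
        return "score_degradation"
    if now - started_at >= MAX_SECONDS:
        return "time_limit"
    if total_candidates >= MAX_CANDIDATES:
        return "candidate_limit"
    if expandable < MIN_EXPANDABLE:
        return "too_few_candidates"
    return None
```

In the "logging" example, a hop whose representative score fell from 0.9 to 0.7 would trip the first check, while the 1000+ chunks at hop 3 would also exceed the candidate ceiling on their own. The optional `now` argument exists only so the time check can be exercised deterministically.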

 <Aside type="tip" title="When Multi-Hop Activates">
 Multi-hop search automatically activates when you use providers with reranking support: