17 changes: 14 additions & 3 deletions CLAUDE.md
@@ -42,8 +42,10 @@ src/codegrok_mcp/
- **Embedding Model**: `nomic-ai/CodeRankEmbed` (768 dims, 8192 max tokens)
- **Chunk Strategy**: Symbol-based (each function/class/method = 1 chunk)
- **Max Chunk Size**: 4000 chars (~1000-1300 tokens)
- **Storage**: `.codegrok/` (chromadb/ + metadata.json + memory_metadata.json + checkpoint.json)
- **Parallelism**: CPU count - 1 workers (min 1, max 4 for parsing)
- **File Batch Size**: 500 files per parse batch (memory optimization)
- **Default Timeout**: 600s (configurable via `CODEGROK_TIMEOUT` env var or `timeout_seconds` param)
- **Memory TTLs**: session (24h), day, week, month, permanent

## Commands
@@ -59,13 +61,22 @@ mypy src/ # Type check
## Gotchas

1. **State is global singleton** - `state.py` holds SourceRetriever + MemoryRetriever across MCP calls
2. **Incremental reindex uses mtime** - File modification time comparison for change detection
3. **ChromaDB collections**: `codebase_chunks` (code) and `memories` (memory layer)
4. **No LLM code** - Removed from parent CodeGrok; source_retriever.py has no ask/rerank methods
5. **Tree-sitter node names vary by language** - language_configs.py normalizes them
6. **Embedding is cached** - LRU(1000) + batch processing in embedding_service.py
7. **Memory tags stored as CSV** - ChromaDB doesn't support list metadata; tags joined with commas
8. **All tools require `learn` first** - Except `list_supported_languages` (static data)
9. **discover_files respects .gitignore** - Uses `pathspec` + `os.walk()` with directory pruning; also respects nested `.gitignore` files
10. **Indexing uses upsert** - `collection.upsert()` instead of delete-recreate; stale chunks cleaned after embedding
11. **Checkpointing** - `.codegrok/checkpoint.json` saves progress every 1000 chunks; atomic writes via `os.replace()`; deleted on success
12. **max_files safety limit** - `discover_files()` stops at 200K files to prevent DoS (addresses SECURITY_REVIEW HIGH-003)
13. **Memory-optimized parsing** - Symbols converted to chunks per file batch (500 files), then freed; `gc.collect()` between batches; chunks list freed after embedding
14. **Worker cap** - Parallel parse workers capped at `MAX_PARSE_WORKERS=4` to limit memory from tree-sitter instances
15. **Configurable timeout** - `learn` tool has `timeout_seconds` param; also reads `CODEGROK_TIMEOUT` env var; defaults to 600s
16. **Background indexing** - `learn` returns immediately; indexing runs in `threading.Thread(daemon=True)`; client polls `get_stats()` for progress; `IndexingStatus` in `state.py` tracks progress thread-safely
17. **learn stateful responses** - Returns `indexing_started` (new), `indexing_in_progress` (already running), `complete` (done, clears result), or raises `ToolError` (failed, clears error for retry)

## Adding Languages

192 changes: 192 additions & 0 deletions docs/INDEXING_IMPROVEMENTS.md
@@ -0,0 +1,192 @@
# Indexing Improvements (v0.2.1)

Fixes for the `learn` tool hanging on large codebases with many folders/subfolders,
high memory consumption during indexing, and missing timeout protection.

## Changes

### 1. `.gitignore` Support

`discover_files()` now respects `.gitignore` patterns using the `pathspec` library.

- Uses `os.walk()` instead of `Path.rglob("*")` for directory pruning
- Loads root `.gitignore` and stacks nested `.gitignore` files as it descends
- Prunes ignored directories in-place (never descends into `node_modules/`, `build/`, etc.)
- Uses `followlinks=False` to prevent symlink loops
- Backward-compatible: `respect_gitignore=True` by default, can be disabled

### 2. Safety Limits

- `max_files=200_000` circuit breaker stops file discovery if exceeded
- Emits a warning when the limit is hit
- Addresses **SECURITY_REVIEW HIGH-003** (Unbounded Resource Consumption / DoS)
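
The traversal described in the two sections above can be sketched as follows. This is a minimal illustration, not the actual `discover_files()` implementation — function signature and variable names are assumptions, and the sketch degrades gracefully when `pathspec` is not installed:

```python
import os

try:
    import pathspec  # new dependency: pure-Python .gitignore matching
except ImportError:  # sketch still runs without it, minus ignore filtering
    pathspec = None


def discover_files(root, exts=(".py",), max_files=200_000, respect_gitignore=True):
    # os.walk with in-place directory pruning, a root .gitignore,
    # and a hard file-count cap (the HIGH-003 circuit breaker).
    spec = None
    gi_path = os.path.join(root, ".gitignore")
    if respect_gitignore and pathspec and os.path.exists(gi_path):
        with open(gi_path) as f:
            spec = pathspec.PathSpec.from_lines("gitwildmatch", f)
    found = []
    for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
        rel = os.path.relpath(dirpath, root)
        # Prune ignored directories in place so os.walk never descends
        # into node_modules/, build/, etc.
        dirnames[:] = [
            d for d in dirnames
            if not (spec and spec.match_file(
                os.path.normpath(os.path.join(rel, d)) + "/"))
        ]
        for name in filenames:
            if not name.endswith(tuple(exts)):
                continue
            if spec and spec.match_file(os.path.normpath(os.path.join(rel, name))):
                continue
            found.append(os.path.join(dirpath, name))
            if len(found) >= max_files:
                return found  # circuit breaker: stop, return partial results
    return found
```

The in-place mutation of `dirnames` is the key trick: `os.walk()` re-reads that list to decide which subdirectories to visit next.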

### 3. Upsert-Based Indexing

- `index_codebase()` now uses `get_or_create_collection()` + `collection.upsert()` instead of deleting and recreating the collection
- Chunk IDs are deterministic (`filepath:name:line_start`), making upsert idempotent
- Stale chunks (from deleted/renamed files) are cleaned up after the embedding loop
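
A minimal sketch of the two pieces that make upsert safe — the deterministic ID scheme from the bullet above, and the set difference used for stale-chunk cleanup (helper names are illustrative):

```python
def chunk_id(chunk: dict) -> str:
    # Deterministic ID: re-indexing the same symbol always yields the
    # same ID, which is what makes collection.upsert() idempotent.
    return f"{chunk['filepath']}:{chunk['name']}:{chunk['line_start']}"


def stale_ids(existing_ids, fresh_ids):
    # IDs present in the collection but not emitted by the current run
    # belong to deleted or renamed files; they are removed after embedding.
    return sorted(set(existing_ids) - set(fresh_ids))
```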

### 4. Resumable Checkpointing

- Saves progress to `.codegrok/checkpoint.json` every 1000 chunks
- Atomic writes via `os.replace()` (POSIX-safe)
- On restart, detects checkpoint and resumes from where it left off
- Checkpoint is deleted on successful completion

### 5. Improved Progress Reporting

- New `"discovery_progress"` event emitted every 1000 files during file traversal
- ETA added to embedding progress messages (e.g., "Embedding... (5000/10000 chunks, ~2.3m remaining)")
- MCP client now shows progress during the file discovery phase (0-5% range)
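
The ETA shown in the embedding messages can be computed by simple linear extrapolation — assume the observed chunks-per-second rate holds for the remainder (a sketch; the real helper may differ):

```python
def eta_seconds(done: int, total: int, elapsed: float):
    # Linear extrapolation of the observed rate over the remaining
    # chunks. Returns None before any progress has been made.
    if done <= 0 or elapsed <= 0:
        return None
    return (total - done) * (elapsed / done)
```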

### 6. Memory Optimizations

Reduces peak memory consumption during indexing:

- **Inline symbol-to-chunk conversion**: Symbols are converted to chunks per file batch and freed immediately, instead of accumulating all symbols in a separate list before chunking
- **File batch processing**: Files are parsed in batches of 500 (`FILE_BATCH_SIZE`), with `gc.collect()` between batches to free memory promptly
- **Worker cap**: Parallel parse workers capped at 4 (`MAX_PARSE_WORKERS`) to limit memory from tree-sitter parser instances (previously up to 32)
- **Post-embedding cleanup**: Chunks list is explicitly deleted and garbage collected after embedding completes
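
The batching and cleanup described above reduce to the following loop shape (a sketch with injected callables standing in for the real parse/chunk/embed steps):

```python
import gc

FILE_BATCH_SIZE = 500  # per the doc: files parsed per batch


def index_in_batches(files, parse_batch, symbols_to_chunks, embed):
    # Inline symbol-to-chunk conversion: each 500-file batch is parsed,
    # its symbols are converted to chunks immediately, and the symbol
    # list is freed before the next batch starts.
    chunks = []
    for i in range(0, len(files), FILE_BATCH_SIZE):
        symbols = parse_batch(files[i:i + FILE_BATCH_SIZE])
        chunks.extend(symbols_to_chunks(symbols))
        del symbols
        gc.collect()  # reclaim parser/symbol memory between batches
    embed(chunks)
    del chunks  # post-embedding cleanup
    gc.collect()
```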

### 7. Configurable MCP Timeout

Prevents the `learn` tool from running indefinitely on very large codebases:

- Default timeout: 600 seconds (10 minutes)
- **Environment variable**: Set `CODEGROK_TIMEOUT` in your MCP client config
- **Per-call override**: Pass `timeout_seconds` parameter to the `learn` tool
- Priority: `timeout_seconds` param > `CODEGROK_TIMEOUT` env var > default (600s)
- On timeout: checkpoint is preserved, re-running `learn` resumes from where it stopped

**Configuration example** (claude_desktop_config.json):
```json
{
  "mcpServers": {
    "codegrok": {
      "command": "codegrok-mcp",
      "env": {
        "CODEGROK_TIMEOUT": "1200"
      }
    }
  }
}
```
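
The priority chain (param > env var > default, with invalid, zero, and negative values ignored) can be sketched as a small resolver — the function name is illustrative, not necessarily what the server calls it:

```python
import os

DEFAULT_TIMEOUT = 600  # seconds


def resolve_timeout(timeout_seconds=None) -> float:
    # 1. Per-call override wins, if it is a positive number.
    if timeout_seconds is not None and timeout_seconds > 0:
        return float(timeout_seconds)
    # 2. CODEGROK_TIMEOUT env var, if set to a positive number.
    try:
        env_val = float(os.environ.get("CODEGROK_TIMEOUT", ""))
        if env_val > 0:
            return env_val
    except ValueError:
        pass  # non-numeric values fall through to the default
    # 3. Default.
    return float(DEFAULT_TIMEOUT)
```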

### 8. Background Indexing

Solves the MCP transport timeout: clients (Claude Desktop, etc.) enforce ~60-120s
transport-level timeouts that kill long-running tool calls. `learn` now returns
immediately, and indexing runs in a background thread.

- `learn` starts a `threading.Thread(daemon=True)` and returns `{"status": "indexing_started"}`
- `IndexingStatus` dataclass in `state.py` provides thread-safe progress tracking with `threading.Lock`
- `get_stats()` includes `indexing` field with `active`, `progress`, `message`, `error` when indexing is in progress
- Polling pattern: client calls `get_stats()` repeatedly to check progress
- If `learn` is called while indexing is active, returns `{"status": "indexing_in_progress"}`
- If `learn` is called after completion, returns the result once and clears it
- If previous indexing failed, raises `ToolError` with error details and clears for retry
- `load_only` mode remains synchronous (fast, no background needed)

**Workflow:**
```
1. learn(path="/project") → {"status": "indexing_started", "progress": 0}
2. get_stats() → {"indexing": {"active": true, "progress": 42, ...}}
3. get_stats() → {"loaded": true, "stats": {...}} (indexing done)
```
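
The thread-safe tracker behind this workflow can be sketched as a small dataclass — a simplified stand-in for the actual `IndexingStatus` in `state.py`, keeping only the state transitions described above:

```python
import threading
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class IndexingStatus:
    active: bool = False
    progress: int = 0
    message: str = ""
    error: Optional[str] = None
    result: Optional[dict] = None
    _lock: threading.Lock = field(default_factory=threading.Lock, repr=False)

    def start(self) -> None:
        with self._lock:  # every transition holds the lock
            self.active, self.progress = True, 0
            self.error = self.result = None  # clear stale failure/result

    def update(self, progress: int, message: str = "") -> None:
        with self._lock:
            self.progress = min(progress, 99)  # 100 is reserved for complete()
            self.message = message

    def complete(self, result: dict) -> None:
        with self._lock:
            self.active, self.progress, self.result = False, 100, result

    def fail(self, error) -> None:
        with self._lock:
            self.active, self.error = False, str(error)
```

With this shape, `learn` calls `start()` before spawning the daemon thread, the progress callback calls `update()`, and `get_stats()` reads a locked snapshot for the polling client.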

## New Dependencies

- `pathspec>=0.11.0` — Pure Python `.gitignore` pattern matching (used by `black`, `flake8`, etc.)

## Security Alignment

| Security Finding | How Addressed |
|-----------------|---------------|
| HIGH-003: Unbounded Resource Consumption | `max_files` limit + `.gitignore` filtering |
| LOW-009: Symlink Following | `followlinks=False` in `os.walk()` |

## MCP Tools Used in Development

This feature was planned and implemented using the following MCP tools:

| MCP Tool | How It Was Used |
|----------|----------------|
| **Sequential Thinking** (`mcp__sequential-thinking__sequentialthinking`) | 7-step chain-of-thought to plan execution order, identify risks (backward compatibility, atomic writes, symlink loops), decide to skip async writer thread, and design test strategy |
| **Snyk Code Scan** (`mcp__Snyk__snyk_code_scan`) | SAST scan on all modified files (`source_retriever.py`, `server.py`) — 0 issues found |

## Test Coverage

### New Unit Tests (`tests/unit/test_discover_files.py`)

| Test | What it verifies |
|------|-----------------|
| `test_discover_files_basic` | Finds .py files in simple directory |
| `test_discover_files_skip_dirs` | Skips `node_modules/`, `__pycache__/`, `.git/` even when nested |
| `test_discover_files_gitignore` | Respects root `.gitignore` patterns |
| `test_discover_files_nested_gitignore` | Handles `.gitignore` in subdirectories |
| `test_discover_files_max_files_limit` | Stops at `max_files` and returns partial results |
| `test_discover_files_no_gitignore` | Works when no `.gitignore` exists |
| `test_discover_files_respect_gitignore_false` | Opt-out disables gitignore filtering |
| `test_discover_files_progress_callback` | Callback mechanism works correctly |
| `test_file_batch_size` | FILE_BATCH_SIZE constant is 500 |
| `test_max_parse_workers` | MAX_PARSE_WORKERS constant is 4 |

### New Unit Tests (`tests/unit/test_timeout.py`)

| Test | What it verifies |
|------|-----------------|
| `test_default_timeout` | Returns 600s when no override or env var |
| `test_per_call_override` | Per-call override takes highest priority |
| `test_env_var_override` | CODEGROK_TIMEOUT env var overrides default |
| `test_env_var_invalid_ignored` | Invalid env var falls back to default |
| `test_env_var_zero_ignored` | Zero env var falls back to default |
| `test_env_var_negative_ignored` | Negative env var falls back to default |
| `test_override_zero_uses_env` | Override of 0/None falls through to env |
| `test_override_negative_uses_env` | Negative override falls through to env |
| `test_default_is_600` | Default constant is 600 seconds |

### New Integration Tests (`tests/integration/test_source_retriever.py`)

| Test | What it verifies |
|------|-----------------|
| `test_index_codebase_upsert_idempotent` | Re-indexing produces same chunk count |
| `test_stale_chunk_removal` | Old chunks removed after file deletion |
| `test_checkpoint_save_and_load` | Checkpoint round-trip |
| `test_checkpoint_load_missing_file` | Handles missing checkpoint |
| `test_checkpoint_load_corrupted` | Handles corrupted JSON |
| `test_checkpoint_cleanup_on_success` | Checkpoint deleted after success |
| `test_checkpoint_load_none_path` | Handles None path |
| `test_file_batch_processing_parallel` | Parallel batch processing produces correct results |
| `test_file_batch_processing_sequential` | Sequential inline chunking produces correct results |
| `test_worker_cap_respected` | Workers capped at MAX_PARSE_WORKERS |
| `test_explicit_max_workers_not_overridden` | Explicit max_workers used as-is |

### New Unit Tests (`tests/unit/test_background_indexing.py`)

| Test | What it verifies |
|------|-----------------|
| `test_initial_state` | IndexingStatus defaults are correct |
| `test_start` | start() sets active, clears error/result |
| `test_start_clears_previous_error` | Restart after failure clears error |
| `test_start_clears_previous_result` | Restart after success clears result |
| `test_update` | Progress and message update correctly |
| `test_update_caps_at_99` | Progress capped at 99 (100 = complete only) |
| `test_complete` | Complete sets progress=100, stores result |
| `test_fail` | Fail sets error, deactivates |
| `test_to_dict` | Dict output has correct fields |
| `test_to_dict_excludes_result` | Result not exposed in to_dict |
| `test_thread_safety` | 10 concurrent threads update without errors |
| `test_state_has_indexing_status` | MCPSessionState includes IndexingStatus |
| `test_singleton_state_has_indexing` | Singleton state has indexing field |
| `test_callback_updates_indexing_status` | Progress callback updates status |
| `test_callback_discovery_progress` | Discovery progress event handled |
| `test_callback_embedding_progress_with_eta` | ETA displayed in message |
| `test_learn_returns_in_progress_when_active` | learn rejects when already indexing |
| `test_learn_returns_completed_result` | learn returns result after completion |
| `test_learn_raises_on_previous_error` | learn raises ToolError on prior failure |
| `test_learn_starts_background_thread` | learn spawns daemon thread |
| `test_learn_auto_with_existing_uses_incremental` | auto mode uses incremental reindex |
| `test_learn_full_mode_uses_full_index` | full mode uses full index |
| `test_get_stats_includes_indexing_when_active` | get_stats shows indexing progress |
| `test_get_stats_no_indexing_when_idle` | get_stats omits indexing when idle |
1 change: 1 addition & 0 deletions pyproject.toml
@@ -32,6 +32,7 @@ dependencies = [
"sentence-transformers>=2.2.0",
"torch>=2.0.0",
"einops>=0.7.0",
"pathspec>=0.11.0",
]


8 changes: 8 additions & 0 deletions src/codegrok_mcp/__init__.py
@@ -24,27 +24,35 @@ def __getattr__(name: str):
"""Lazy import heavy modules only when accessed."""
if name == "SourceRetriever":
from codegrok_mcp.indexing.source_retriever import SourceRetriever

return SourceRetriever
elif name == "TreeSitterParser":
from codegrok_mcp.parsers.treesitter_parser import TreeSitterParser

return TreeSitterParser
elif name == "ThreadLocalParserFactory":
from codegrok_mcp.parsers.treesitter_parser import ThreadLocalParserFactory

return ThreadLocalParserFactory
elif name == "Symbol":
from codegrok_mcp.core.models import Symbol

return Symbol
elif name == "SymbolType":
from codegrok_mcp.core.models import SymbolType

return SymbolType
elif name == "ParsedFile":
from codegrok_mcp.core.models import ParsedFile

return ParsedFile
elif name == "CodebaseIndex":
from codegrok_mcp.core.models import CodebaseIndex

return CodebaseIndex
elif name == "IParser":
from codegrok_mcp.core.interfaces import IParser

return IParser
raise AttributeError(f"module {__name__!r} has no attribute {name!r}")

5 changes: 5 additions & 0 deletions src/codegrok_mcp/core/exceptions.py
@@ -19,6 +19,7 @@ class CodeGrokException(Exception):
All custom exceptions in CodeGrok inherit from this class,
allowing for broad exception catching when needed.
"""

pass


@@ -46,6 +47,7 @@ class IndexingError(CodeGrokException):
- Chunking failures
- ChromaDB storage errors
"""

pass


@@ -57,6 +59,7 @@ class EmbeddingError(CodeGrokException):
- Encoding errors
- Memory issues
"""

pass


@@ -68,6 +71,7 @@ class SearchError(CodeGrokException):
- Missing index
- Invalid query parameters
"""

pass


@@ -79,4 +83,5 @@ class ConfigurationError(CodeGrokException):
- Invalid file paths
- Missing required parameters
"""

pass