17 changes: 14 additions & 3 deletions CLAUDE.md
@@ -42,8 +42,10 @@ src/codegrok_mcp/
- **Embedding Model**: `nomic-ai/CodeRankEmbed` (768 dims, 8192 max tokens)
- **Chunk Strategy**: Symbol-based (each function/class/method = 1 chunk)
- **Max Chunk Size**: 4000 chars (~1000-1300 tokens)
- **Storage**: `.codegrok/` (chromadb/ + metadata.json + memory_metadata.json + checkpoint.json)
- **Parallelism**: CPU count - 1 workers (min 1, max 4 for parsing)
- **File Batch Size**: 500 files per parse batch (memory optimization)
- **Default Timeout**: 600s (configurable via `CODEGROK_TIMEOUT` env var or `timeout_seconds` param)
- **Memory TTLs**: session (24h), day, week, month, permanent

## Commands
@@ -59,13 +61,22 @@ mypy src/ # Type check
## Gotchas

1. **State is global singleton** - `state.py` holds SourceRetriever + MemoryRetriever across MCP calls
2. **Incremental reindex uses mtime** - File modification time comparison for change detection
3. **ChromaDB collections**: `codebase_chunks` (code) and `memories` (memory layer)
4. **No LLM code** - Removed from parent CodeGrok; source_retriever.py has no ask/rerank methods
5. **Tree-sitter node names vary by language** - language_configs.py normalizes them
6. **Embedding is cached** - LRU(1000) + batch processing in embedding_service.py
7. **Memory tags stored as CSV** - ChromaDB doesn't support list metadata; tags joined with commas
8. **All tools require `learn` first** - Except `list_supported_languages` (static data)
9. **discover_files respects .gitignore** - Uses `pathspec` + `os.walk()` with directory pruning; also respects nested `.gitignore` files
10. **Indexing uses upsert** - `collection.upsert()` instead of delete-recreate; stale chunks cleaned after embedding
11. **Checkpointing** - `.codegrok/checkpoint.json` saves progress every 1000 chunks; atomic writes via `os.replace()`; deleted on success
12. **max_files safety limit** - `discover_files()` stops at 200K files to prevent DoS (addresses SECURITY_REVIEW HIGH-003)
13. **Memory-optimized parsing** - Symbols converted to chunks per file batch (500 files), then freed; `gc.collect()` between batches; chunks list freed after embedding
14. **Worker cap** - Parallel parse workers capped at `MAX_PARSE_WORKERS=4` to limit memory from tree-sitter instances
15. **Configurable timeout** - `learn` tool has `timeout_seconds` param; also reads `CODEGROK_TIMEOUT` env var; defaults to 600s
16. **Background indexing** - `learn` returns immediately; indexing runs in `threading.Thread(daemon=True)`; client polls `get_stats()` for progress; `IndexingStatus` in `state.py` tracks progress thread-safely
17. **learn stateful responses** - Returns `indexing_started` (new), `indexing_in_progress` (already running), `complete` (done, clears result), or raises `ToolError` (failed, clears error for retry)

## Adding Languages

192 changes: 192 additions & 0 deletions docs/INDEXING_IMPROVEMENTS.md
@@ -0,0 +1,192 @@
# Indexing Improvements (v0.2.1)

Fixes for the `learn` tool hanging on large codebases with many folders/subfolders,
high memory consumption during indexing, and missing timeout protection.

## Changes

### 1. `.gitignore` Support

`discover_files()` now respects `.gitignore` patterns using the `pathspec` library.

- Uses `os.walk()` instead of `Path.rglob("*")` for directory pruning
- Loads root `.gitignore` and stacks nested `.gitignore` files as it descends
- Prunes ignored directories in-place (never descends into `node_modules/`, `build/`, etc.)
- Uses `followlinks=False` to prevent symlink loops
- Backward-compatible: `respect_gitignore=True` by default, can be disabled

### 2. Safety Limits

- `max_files=200_000` circuit breaker stops file discovery if exceeded
- Emits a warning when the limit is hit
- Addresses **SECURITY_REVIEW HIGH-003** (Unbounded Resource Consumption / DoS)
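
The traversal described in the two sections above can be sketched as follows. This is a minimal illustration, not the actual `discover_files()` implementation — function signature and variable names are assumptions, and the sketch degrades gracefully when `pathspec` is not installed:

```python
import os

try:
    import pathspec  # new dependency: pure-Python .gitignore matching
except ImportError:  # sketch still runs without it, minus ignore filtering
    pathspec = None


def discover_files(root, exts=(".py",), max_files=200_000, respect_gitignore=True):
    # os.walk with in-place directory pruning, a root .gitignore,
    # and a hard file-count cap (the HIGH-003 circuit breaker).
    spec = None
    gi_path = os.path.join(root, ".gitignore")
    if respect_gitignore and pathspec and os.path.exists(gi_path):
        with open(gi_path) as f:
            spec = pathspec.PathSpec.from_lines("gitwildmatch", f)
    found = []
    for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
        rel = os.path.relpath(dirpath, root)
        # Prune ignored directories in place so os.walk never descends
        # into node_modules/, build/, etc.
        dirnames[:] = [
            d for d in dirnames
            if not (spec and spec.match_file(
                os.path.normpath(os.path.join(rel, d)) + "/"))
        ]
        for name in filenames:
            if not name.endswith(tuple(exts)):
                continue
            if spec and spec.match_file(os.path.normpath(os.path.join(rel, name))):
                continue
            found.append(os.path.join(dirpath, name))
            if len(found) >= max_files:
                return found  # circuit breaker: stop, return partial results
    return found
```

The in-place mutation of `dirnames` is the key trick: `os.walk()` re-reads that list to decide which subdirectories to visit next.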

### 3. Upsert-Based Indexing

- `index_codebase()` now uses `get_or_create_collection()` + `collection.upsert()` instead of deleting and recreating the collection
- Chunk IDs are deterministic (`filepath:name:line_start`), making upsert idempotent
- Stale chunks (from deleted/renamed files) are cleaned up after the embedding loop
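
A minimal sketch of the two pieces that make upsert safe — the deterministic ID scheme from the bullet above, and the set difference used for stale-chunk cleanup (helper names are illustrative):

```python
def chunk_id(chunk: dict) -> str:
    # Deterministic ID: re-indexing the same symbol always yields the
    # same ID, which is what makes collection.upsert() idempotent.
    return f"{chunk['filepath']}:{chunk['name']}:{chunk['line_start']}"


def stale_ids(existing_ids, fresh_ids):
    # IDs present in the collection but not emitted by the current run
    # belong to deleted or renamed files; they are removed after embedding.
    return sorted(set(existing_ids) - set(fresh_ids))
```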

### 4. Resumable Checkpointing

- Saves progress to `.codegrok/checkpoint.json` every 1000 chunks
- Atomic writes via `os.replace()` (POSIX-safe)
- On restart, detects checkpoint and resumes from where it left off
- Checkpoint is deleted on successful completion

### 5. Improved Progress Reporting

- New `"discovery_progress"` event emitted every 1000 files during file traversal
- ETA added to embedding progress messages (e.g., "Embedding... (5000/10000 chunks, ~2.3m remaining)")
- MCP client now shows progress during the file discovery phase (0-5% range)
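
The ETA shown in the embedding messages can be computed by simple linear extrapolation — assume the observed chunks-per-second rate holds for the remainder (a sketch; the real helper may differ):

```python
def eta_seconds(done: int, total: int, elapsed: float):
    # Linear extrapolation of the observed rate over the remaining
    # chunks. Returns None before any progress has been made.
    if done <= 0 or elapsed <= 0:
        return None
    return (total - done) * (elapsed / done)
```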

### 6. Memory Optimizations

Reduces peak memory consumption during indexing:

- **Inline symbol-to-chunk conversion**: Symbols are converted to chunks per file batch and freed immediately, instead of accumulating all symbols in a separate list before chunking
- **File batch processing**: Files are parsed in batches of 500 (`FILE_BATCH_SIZE`), with `gc.collect()` between batches to free memory promptly
- **Worker cap**: Parallel parse workers capped at 4 (`MAX_PARSE_WORKERS`) to limit memory from tree-sitter parser instances (previously up to 32)
- **Post-embedding cleanup**: Chunks list is explicitly deleted and garbage collected after embedding completes
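
The batching and cleanup described above reduce to the following loop shape (a sketch with injected callables standing in for the real parse/chunk/embed steps):

```python
import gc

FILE_BATCH_SIZE = 500  # per the doc: files parsed per batch


def index_in_batches(files, parse_batch, symbols_to_chunks, embed):
    # Inline symbol-to-chunk conversion: each 500-file batch is parsed,
    # its symbols are converted to chunks immediately, and the symbol
    # list is freed before the next batch starts.
    chunks = []
    for i in range(0, len(files), FILE_BATCH_SIZE):
        symbols = parse_batch(files[i:i + FILE_BATCH_SIZE])
        chunks.extend(symbols_to_chunks(symbols))
        del symbols
        gc.collect()  # reclaim parser/symbol memory between batches
    embed(chunks)
    del chunks  # post-embedding cleanup
    gc.collect()
```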

### 7. Configurable MCP Timeout

Prevents the `learn` tool from running indefinitely on very large codebases:

- Default timeout: 600 seconds (10 minutes)
- **Environment variable**: Set `CODEGROK_TIMEOUT` in your MCP client config
- **Per-call override**: Pass `timeout_seconds` parameter to the `learn` tool
- Priority: `timeout_seconds` param > `CODEGROK_TIMEOUT` env var > default (600s)
- On timeout: checkpoint is preserved, re-running `learn` resumes from where it stopped

**Configuration example** (claude_desktop_config.json):
```json
{
  "mcpServers": {
    "codegrok": {
      "command": "codegrok-mcp",
      "env": {
        "CODEGROK_TIMEOUT": "1200"
      }
    }
  }
}
```
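
The priority chain (param > env var > default, with invalid, zero, and negative values ignored) can be sketched as a small resolver — the function name is illustrative, not necessarily what the server calls it:

```python
import os

DEFAULT_TIMEOUT = 600  # seconds


def resolve_timeout(timeout_seconds=None) -> float:
    # 1. Per-call override wins, if it is a positive number.
    if timeout_seconds is not None and timeout_seconds > 0:
        return float(timeout_seconds)
    # 2. CODEGROK_TIMEOUT env var, if set to a positive number.
    try:
        env_val = float(os.environ.get("CODEGROK_TIMEOUT", ""))
        if env_val > 0:
            return env_val
    except ValueError:
        pass  # non-numeric values fall through to the default
    # 3. Default.
    return float(DEFAULT_TIMEOUT)
```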

### 8. Background Indexing

Solves the MCP transport timeout: clients (Claude Desktop, etc.) enforce ~60-120s
transport-level timeouts that kill long-running tool calls. `learn` now returns
immediately, and indexing runs in a background thread.

- `learn` starts a `threading.Thread(daemon=True)` and returns `{"status": "indexing_started"}`
- `IndexingStatus` dataclass in `state.py` provides thread-safe progress tracking with `threading.Lock`
- `get_stats()` includes `indexing` field with `active`, `progress`, `message`, `error` when indexing is in progress
- Polling pattern: client calls `get_stats()` repeatedly to check progress
- If `learn` is called while indexing is active, returns `{"status": "indexing_in_progress"}`
- If `learn` is called after completion, returns the result once and clears it
- If previous indexing failed, raises `ToolError` with error details and clears for retry
- `load_only` mode remains synchronous (fast, no background needed)

**Workflow:**
```
1. learn(path="/project") → {"status": "indexing_started", "progress": 0}
2. get_stats() → {"indexing": {"active": true, "progress": 42, ...}}
3. get_stats() → {"loaded": true, "stats": {...}} (indexing done)
```
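
The thread-safe tracker behind this workflow can be sketched as a small dataclass — a simplified stand-in for the actual `IndexingStatus` in `state.py`, keeping only the state transitions described above:

```python
import threading
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class IndexingStatus:
    active: bool = False
    progress: int = 0
    message: str = ""
    error: Optional[str] = None
    result: Optional[dict] = None
    _lock: threading.Lock = field(default_factory=threading.Lock, repr=False)

    def start(self) -> None:
        with self._lock:  # every transition holds the lock
            self.active, self.progress = True, 0
            self.error = self.result = None  # clear stale failure/result

    def update(self, progress: int, message: str = "") -> None:
        with self._lock:
            self.progress = min(progress, 99)  # 100 is reserved for complete()
            self.message = message

    def complete(self, result: dict) -> None:
        with self._lock:
            self.active, self.progress, self.result = False, 100, result

    def fail(self, error) -> None:
        with self._lock:
            self.active, self.error = False, str(error)
```

With this shape, `learn` calls `start()` before spawning the daemon thread, the progress callback calls `update()`, and `get_stats()` reads a locked snapshot for the polling client.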

## New Dependencies

- `pathspec>=0.11.0` — Pure Python `.gitignore` pattern matching (used by `black`, `flake8`, etc.)

## Security Alignment

| Security Finding | How Addressed |
|-----------------|---------------|
| HIGH-003: Unbounded Resource Consumption | `max_files` limit + `.gitignore` filtering |
| LOW-009: Symlink Following | `followlinks=False` in `os.walk()` |

## MCP Tools Used in Development

This feature was planned and implemented using the following MCP tools:

| MCP Tool | How It Was Used |
|----------|----------------|
| **Sequential Thinking** (`mcp__sequential-thinking__sequentialthinking`) | 7-step chain-of-thought to plan execution order, identify risks (backward compatibility, atomic writes, symlink loops), decide to skip async writer thread, and design test strategy |
| **Snyk Code Scan** (`mcp__Snyk__snyk_code_scan`) | SAST scan on all modified files (`source_retriever.py`, `server.py`) — 0 issues found |

## Test Coverage

### New Unit Tests (`tests/unit/test_discover_files.py`)

| Test | What it verifies |
|------|-----------------|
| `test_discover_files_basic` | Finds .py files in simple directory |
| `test_discover_files_skip_dirs` | Skips `node_modules/`, `__pycache__/`, `.git/` even when nested |
| `test_discover_files_gitignore` | Respects root `.gitignore` patterns |
| `test_discover_files_nested_gitignore` | Handles `.gitignore` in subdirectories |
| `test_discover_files_max_files_limit` | Stops at `max_files` and returns partial results |
| `test_discover_files_no_gitignore` | Works when no `.gitignore` exists |
| `test_discover_files_respect_gitignore_false` | Opt-out disables gitignore filtering |
| `test_discover_files_progress_callback` | Callback mechanism works correctly |
| `test_file_batch_size` | FILE_BATCH_SIZE constant is 500 |
| `test_max_parse_workers` | MAX_PARSE_WORKERS constant is 4 |

### New Unit Tests (`tests/unit/test_timeout.py`)

| Test | What it verifies |
|------|-----------------|
| `test_default_timeout` | Returns 600s when no override or env var |
| `test_per_call_override` | Per-call override takes highest priority |
| `test_env_var_override` | CODEGROK_TIMEOUT env var overrides default |
| `test_env_var_invalid_ignored` | Invalid env var falls back to default |
| `test_env_var_zero_ignored` | Zero env var falls back to default |
| `test_env_var_negative_ignored` | Negative env var falls back to default |
| `test_override_zero_uses_env` | Override of 0/None falls through to env |
| `test_override_negative_uses_env` | Negative override falls through to env |
| `test_default_is_600` | Default constant is 600 seconds |

### New Integration Tests (`tests/integration/test_source_retriever.py`)

| Test | What it verifies |
|------|-----------------|
| `test_index_codebase_upsert_idempotent` | Re-indexing produces same chunk count |
| `test_stale_chunk_removal` | Old chunks removed after file deletion |
| `test_checkpoint_save_and_load` | Checkpoint round-trip |
| `test_checkpoint_load_missing_file` | Handles missing checkpoint |
| `test_checkpoint_load_corrupted` | Handles corrupted JSON |
| `test_checkpoint_cleanup_on_success` | Checkpoint deleted after success |
| `test_checkpoint_load_none_path` | Handles None path |
| `test_file_batch_processing_parallel` | Parallel batch processing produces correct results |
| `test_file_batch_processing_sequential` | Sequential inline chunking produces correct results |
| `test_worker_cap_respected` | Workers capped at MAX_PARSE_WORKERS |
| `test_explicit_max_workers_not_overridden` | Explicit max_workers used as-is |

### New Unit Tests (`tests/unit/test_background_indexing.py`)

| Test | What it verifies |
|------|-----------------|
| `test_initial_state` | IndexingStatus defaults are correct |
| `test_start` | start() sets active, clears error/result |
| `test_start_clears_previous_error` | Restart after failure clears error |
| `test_start_clears_previous_result` | Restart after success clears result |
| `test_update` | Progress and message update correctly |
| `test_update_caps_at_99` | Progress capped at 99 (100 = complete only) |
| `test_complete` | Complete sets progress=100, stores result |
| `test_fail` | Fail sets error, deactivates |
| `test_to_dict` | Dict output has correct fields |
| `test_to_dict_excludes_result` | Result not exposed in to_dict |
| `test_thread_safety` | 10 concurrent threads update without errors |
| `test_state_has_indexing_status` | MCPSessionState includes IndexingStatus |
| `test_singleton_state_has_indexing` | Singleton state has indexing field |
| `test_callback_updates_indexing_status` | Progress callback updates status |
| `test_callback_discovery_progress` | Discovery progress event handled |
| `test_callback_embedding_progress_with_eta` | ETA displayed in message |
| `test_learn_returns_in_progress_when_active` | learn rejects when already indexing |
| `test_learn_returns_completed_result` | learn returns result after completion |
| `test_learn_raises_on_previous_error` | learn raises ToolError on prior failure |
| `test_learn_starts_background_thread` | learn spawns daemon thread |
| `test_learn_auto_with_existing_uses_incremental` | auto mode uses incremental reindex |
| `test_learn_full_mode_uses_full_index` | full mode uses full index |
| `test_get_stats_includes_indexing_when_active` | get_stats shows indexing progress |
| `test_get_stats_no_indexing_when_idle` | get_stats omits indexing when idle |
1 change: 1 addition & 0 deletions pyproject.toml
@@ -32,6 +32,7 @@ dependencies = [
"sentence-transformers>=2.2.0",
"torch>=2.0.0",
"einops>=0.7.0",
"pathspec>=0.11.0",
]


8 changes: 8 additions & 0 deletions src/codegrok_mcp/__init__.py
@@ -24,27 +24,35 @@ def __getattr__(name: str):
"""Lazy import heavy modules only when accessed."""
if name == "SourceRetriever":
from codegrok_mcp.indexing.source_retriever import SourceRetriever

return SourceRetriever
elif name == "TreeSitterParser":
from codegrok_mcp.parsers.treesitter_parser import TreeSitterParser

return TreeSitterParser
elif name == "ThreadLocalParserFactory":
from codegrok_mcp.parsers.treesitter_parser import ThreadLocalParserFactory

return ThreadLocalParserFactory
elif name == "Symbol":
from codegrok_mcp.core.models import Symbol

return Symbol
elif name == "SymbolType":
from codegrok_mcp.core.models import SymbolType

return SymbolType
elif name == "ParsedFile":
from codegrok_mcp.core.models import ParsedFile

return ParsedFile
elif name == "CodebaseIndex":
from codegrok_mcp.core.models import CodebaseIndex

return CodebaseIndex
elif name == "IParser":
from codegrok_mcp.core.interfaces import IParser

return IParser
raise AttributeError(f"module {__name__!r} has no attribute {name!r}")

5 changes: 5 additions & 0 deletions src/codegrok_mcp/core/exceptions.py
@@ -19,6 +19,7 @@ class CodeGrokException(Exception):
All custom exceptions in CodeGrok inherit from this class,
allowing for broad exception catching when needed.
"""

pass


@@ -46,6 +47,7 @@ class IndexingError(CodeGrokException):
- Chunking failures
- ChromaDB storage errors
"""

pass


@@ -57,6 +59,7 @@ class EmbeddingError(CodeGrokException):
- Encoding errors
- Memory issues
"""

pass


@@ -68,6 +71,7 @@ class SearchError(CodeGrokException):
- Missing index
- Invalid query parameters
"""

pass


@@ -79,4 +83,5 @@ class ConfigurationError(CodeGrokException):
- Invalid file paths
- Missing required parameters
"""

pass