---
title: Evaluation Guide
layout: default
parent: Performance
nav_order: 2
---
codetect includes a comprehensive evaluation tool to measure the performance improvement of MCP tools vs. standard CLI tools (grep, find, etc.) when working with Claude Code.
The codetect-eval tool runs test cases against your codebase twice: once with MCP tools enabled, and once with only standard CLI tools. This provides quantitative data on the benefits of using codetect.
```sh
# Run all test cases on a target repository
codetect-eval run --repo /path/to/project

# Run only specific categories
codetect-eval run --repo /path/to/project --category search,navigate

# View the most recent results
codetect-eval report

# List available test cases
codetect-eval list --repo /path/to/project
```

Run evaluation test cases against a repository.

```sh
codetect-eval run [options]
```

Options:

- `--repo <path>` - Repository to evaluate (default: current directory)
- `--cases <dir>` - Test cases directory (default: `evals/cases`)
- `--output <dir>` - Output directory for results (default: `evals/results`)
- `--category <cat>` - Filter by category (search, navigate, understand)
- `--timeout <dur>` - Timeout per test case (default: 5m)
- `--verbose` - Verbose output
Examples:

```sh
# Evaluate current repository with all test cases
codetect-eval run

# Evaluate specific repository with only search tests
codetect-eval run --repo /path/to/project --category search

# Run with custom timeout and verbose output
codetect-eval run --repo /path/to/project --timeout 10m --verbose
```

Display a saved evaluation report.

```sh
codetect-eval report [options]
```

Options:

- `--results <path>` - Path to a specific results JSON file (default: most recent)
Examples:

```sh
# Show the most recent report
codetect-eval report

# Show a specific report
codetect-eval report --results ~/.codetect/projects/myrepo-a1b2c3d4/evals/results/2024-01-10-120000-results.json
```

List available test cases.

```sh
codetect-eval list [options]
```

Options:

- `--repo <path>` - Repository path (default: current directory)
- `--cases <dir>` - Test cases directory (default: `evals/cases`)
- `--category <cat>` - Filter by category
Examples:

```sh
# List all test cases in current directory
codetect-eval list

# List test cases for a specific repository
codetect-eval list --repo /path/to/project

# List only navigation test cases
codetect-eval list --category navigate
```

When you run evals on a repository without test cases, you'll see a helpful error message suggesting how to create them.
Eval data is stored in the centralized data directory:

```
~/.codetect/projects/your-project-a1b2c3d4/
└── evals/
    ├── cases/      # Test case JSONL files
    │   ├── search.jsonl
    │   ├── navigate.jsonl
    │   └── understand.jsonl
    ├── results/    # Evaluation results (JSON)
    │   └── 2024-01-10-120000-results.json
    └── logs/       # Raw Claude output logs
```
This approach:

- Keeps project directories clean (no `.codetect/` needed)
- Stores results centrally
- Allows different repos to have different test cases
Create the directory structure and add JSONL files:

```sh
# Find your project's data directory
ls ~/.codetect/projects/

# Then create cases directory
mkdir -p ~/.codetect/projects/your-project-<hash>/evals/cases
```

Create test case files (e.g., `evals/cases/search.jsonl`):

```jsonl
{"id":"search-001","category":"search","description":"Find error handling","prompt":"Find all error handling code in this repository","ground_truth":{"files":["internal/errors.go","pkg/handler.go"]},"difficulty":"easy"}
{"id":"search-002","category":"search","description":"Find HTTP handlers","prompt":"Find all HTTP request handlers","ground_truth":{"files":["internal/handlers/","pkg/api/"]},"difficulty":"medium"}
```

Start a Claude Code session in the target repository and paste this prompt:
```
Create eval test cases for the codetect MCP tool.

These test cases will be used by codetect-eval to measure MCP search performance
against this repository (without pre-indexing). Create JSONL files organized by
category:

- search.jsonl: keyword/regex searches, file pattern matching
- navigate.jsonl: finding definitions, references, call hierarchies
- understand.jsonl: code comprehension, architectural questions

Each line should be a JSON object with this structure:

{
  "id": "unique-id",
  "category": "search|navigate|understand",
  "description": "Brief description of what this tests",
  "prompt": "The actual question/search to ask",
  "difficulty": "easy|medium|hard",
  "ground_truth": {
    "files": ["expected/file/paths.go"],
    "symbols": ["expectedFunctionName"],
    "lines": {"file.go": [10, 20]},
    "content": ["expected snippets in output"]
  }
}

Create 5-10 test cases per category based on this repository's actual code structure.
Focus on queries that have clear, verifiable answers.
```
Each test case is a JSON object on a single line (JSONL format).

```json
{
  "id": "unique-test-id",
  "category": "search|navigate|understand",
  "description": "Brief description of what this tests",
  "prompt": "The actual prompt given to Claude",
  "ground_truth": {
    "files": ["expected/file1.go", "expected/file2.go"],
    "symbols": ["ExpectedSymbol1", "ExpectedSymbol2"],
    "lines": {"file.go": [10, 25, 42]},
    "content": ["expected code snippet"]
  },
  "difficulty": "easy|medium|hard"
}
```

- `id`: Unique identifier for the test case (e.g., `search-001`, `navigate-005`)
- `category`: Test category (`search`, `navigate`, or `understand`)
- `description`: Human-readable description for reports
- `prompt`: The exact prompt given to Claude Code
- `ground_truth`: Expected results for validation
  - `files`: List of file paths that should be found or mentioned
  - `symbols`: List of symbol names (functions, types, etc.)
  - `lines`: Map of file paths to expected line numbers
  - `content`: Expected code snippets or text
- `difficulty`: Subjective difficulty rating (`easy`, `medium`, or `hard`)
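Malformed JSONL lines are a common source of confusing eval failures, so it can help to sanity-check case files before a run. Below is a minimal validator sketch in Python; the required fields mirror the schema above, but this checker is illustrative, not part of codetect-eval itself.

```python
import json

REQUIRED = {"id", "category", "description", "prompt", "ground_truth", "difficulty"}
CATEGORIES = {"search", "navigate", "understand"}
DIFFICULTIES = {"easy", "medium", "hard"}

def validate_line(line: str) -> list[str]:
    """Return a list of problems found in one JSONL line (empty = valid)."""
    try:
        case = json.loads(line)
    except json.JSONDecodeError as e:
        return [f"invalid JSON: {e}"]
    if not isinstance(case, dict):
        return ["not a JSON object"]
    # Collect every missing required field, then check enum-style fields.
    problems = [f"missing field: {f}" for f in sorted(REQUIRED - case.keys())]
    if case.get("category") not in CATEGORIES:
        problems.append(f"bad category: {case.get('category')!r}")
    if case.get("difficulty") not in DIFFICULTIES:
        problems.append(f"bad difficulty: {case.get('difficulty')!r}")
    return problems
```

Run it over each line of a `*.jsonl` file and print any non-empty result along with the line number.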
Search Example:

```jsonl
{"id":"search-001","category":"search","description":"Find authentication logic","prompt":"Find all code related to user authentication","ground_truth":{"files":["internal/auth/","pkg/middleware/auth.go"],"symbols":["Authenticate","ValidateToken"]},"difficulty":"medium"}
```

Navigate Example:

```jsonl
{"id":"navigate-001","category":"navigate","description":"Find Server struct definition","prompt":"Find where the Server struct is defined","ground_truth":{"files":["internal/server/server.go"],"symbols":["Server"],"lines":{"internal/server/server.go":[15]}},"difficulty":"easy"}
```

Understand Example:

```jsonl
{"id":"understand-001","category":"understand","description":"Explain request flow","prompt":"Explain how HTTP requests are processed from router to handler","ground_truth":{"files":["internal/router/","internal/handlers/","pkg/middleware/"]},"difficulty":"hard"}
```

Finding code patterns, keywords, or concepts across the codebase.
Good prompts:
- "Find all error handling code"
- "Find where we make HTTP requests"
- "Search for database query code"
Ground truth: Files and symbols that match the search criteria
Locating specific symbols, definitions, or implementations.
Good prompts:
- "Find the definition of the Server struct"
- "Where is the HandleRequest function implemented?"
- "Find all implementations of the Handler interface"
Ground truth: Specific files, symbols, and line numbers
Comprehending code structure, relationships, or architecture.
Good prompts:
- "Explain the authentication flow"
- "How does the caching system work?"
- "What happens when a user logs in?"
Ground truth: Key files involved in the concept (harder to validate precisely)
Each test case runs twice (with and without MCP tools) and measures:
- Precision: Correct items / Total returned
- Recall: Correct items / Total expected
- F1 Score: Harmonic mean of precision and recall
Based on comparing results against the ground truth.
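The three accuracy metrics can be reproduced with a few lines of Python. This is a sketch over sets of file paths only; the actual scorer may also weigh symbols, lines, and content.

```python
def precision_recall_f1(returned: set[str], expected: set[str]) -> tuple[float, float, float]:
    """Score a result set against ground truth (e.g., file paths)."""
    correct = len(returned & expected)          # items that are both returned and expected
    precision = correct / len(returned) if returned else 0.0
    recall = correct / len(expected) if expected else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

For example, returning `{"a.go", "b.go", "c.go"}` against ground truth `{"a.go", "b.go", "d.go"}` yields precision, recall, and F1 all equal to 2/3.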
- Token usage: Input tokens, output tokens, cache reads, cache creation
- Latency: Time to complete the task
- Cost: Estimated API cost in USD
- Turns: Number of back-and-forth interactions
- Success rate: Percentage of test cases that completed successfully
- Error rate: Percentage that failed or timed out
After running evaluations, you'll see a summary report:

```
Evaluation Report
=================
Total Cases: 30

With MCP Tools:
  Avg Accuracy:  92.5%
  Avg Tokens:    12,450
  Avg Cost:      $0.15
  Avg Latency:   8.2s
  Success Rate:  96.7%

Without MCP Tools:
  Avg Accuracy:  78.3%
  Avg Tokens:    28,900
  Avg Cost:      $0.42
  Avg Latency:   15.6s
  Success Rate:  86.7%

Improvements:
  Accuracy:          +14.2%
  Token Reduction:   56.9%
  Cost Reduction:    64.3%
  Latency Reduction: 47.4%
```
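The improvement figures are easy to reproduce yourself: accuracy is an absolute delta, while tokens, cost, and latency are relative reductions. A sketch (the dictionary keys here are illustrative, not the actual results schema):

```python
def improvements(with_mcp: dict, without: dict) -> dict:
    """Accuracy is an absolute delta; the rest are percentage reductions."""
    reduction = lambda k: (without[k] - with_mcp[k]) / without[k] * 100
    return {
        "accuracy_delta": with_mcp["accuracy"] - without["accuracy"],
        "token_reduction": reduction("tokens"),
        "cost_reduction": reduction("cost"),
        "latency_reduction": reduction("latency"),
    }

# Plugging in the sample report's numbers reproduces its Improvements section.
imp = improvements(
    {"accuracy": 92.5, "tokens": 12450, "cost": 0.15, "latency": 8.2},
    {"accuracy": 78.3, "tokens": 28900, "cost": 0.42, "latency": 15.6},
)
```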
Higher accuracy means MCP tools help Claude find the right code more reliably.
Lower token usage means less context switching and more efficient searches, leading to lower costs.
Lower latency means faster responses, improving developer productivity.
Higher success rate means fewer failed attempts and timeout errors.
- Use realistic prompts - Write prompts you'd actually use during development
- Cover different difficulties - Include easy, medium, and hard test cases
- Test edge cases - Include ambiguous queries, large search spaces, etc.
- Keep ground truth accurate - Update when code changes
- Balance categories - Mix search, navigate, and understand tests
- Index first - Run `codetect index` before evaluating
- Use consistent versions - Don't upgrade mid-evaluation
- Run multiple times - Results can vary; average over multiple runs
- Document context - Note repo size, language, domain in reports
- Update with code changes - Keep ground truth in sync
- Add new tests - When you encounter interesting queries
- Remove stale tests - Delete tests for removed features
- Version control cases - Keep test case JSONL files in version control
- Results are centralized - Stored under `~/.codetect/projects/`
Error:

```
ERROR: No test cases found!
The eval runner could not find any test cases in: /path/to/cases/dir
```

Solution: Create test cases using the manual or AI-assisted approach above.

If test cases are timing out, increase the timeout:

```sh
codetect-eval run --repo /path/to/project --timeout 10m
```

If accuracy is low across the board:
- Check that ground truth is correct
- Verify the repo is indexed: `codetect stats`
- Try simpler prompts to isolate the issue
If MCP tools aren't reducing tokens:
- Check that MCP is actually being used (look for tool calls in verbose output)
- Verify `.mcp.json` is configured correctly
- Ensure codetect is running: `ps aux | grep codetect`
Organize test cases by feature or module:

```
evals/cases/
├── auth/
│   ├── search.jsonl
│   └── navigate.jsonl
├── api/
│   ├── search.jsonl
│   └── navigate.jsonl
└── database/
    └── search.jsonl
```

The eval tool will recursively find all `*.jsonl` files.
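If you want to lint or count cases across a nested layout like the one above, recursive discovery is easy to replicate. A sketch using `pathlib` (filtering on the file stem matching the category name is an assumption based on the `search.jsonl`/`navigate.jsonl` naming convention):

```python
from pathlib import Path
from typing import Optional

def find_case_files(cases_dir: str, category: Optional[str] = None) -> list[Path]:
    """Recursively find *.jsonl test case files, optionally filtered by category."""
    files = sorted(Path(cases_dir).rglob("*.jsonl"))
    if category:
        # Assumes files are named after their category, e.g. auth/search.jsonl
        files = [f for f in files if f.stem == category]
    return files
```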
Add eval runs to your CI pipeline:

```yaml
# .github/workflows/eval.yml
- name: Run codetect evals
  run: |
    codetect index
    codetect-eval run --repo . --output ci-results/
```

Run evals before and after upgrading codetect:
```sh
# Before upgrade
codetect-eval run --repo .
cp ~/.codetect/projects/<name>-<hash>/evals/results/latest.json before.json

# Upgrade
codetect update

# After upgrade
codetect-eval run --repo .
cp ~/.codetect/projects/<name>-<hash>/evals/results/latest.json after.json
```
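The comparison step can be scripted. A minimal sketch that pairs up summary metrics from the two files (the top-level field names are assumptions; check them against your actual results JSON):

```python
import json
from pathlib import Path

def compare(before_path: str, after_path: str) -> dict:
    """Pair up summary metrics from two results files for easy diffing."""
    before = json.loads(Path(before_path).read_text())
    after = json.loads(Path(after_path).read_text())
    keys = ("avg_accuracy", "avg_tokens", "avg_cost", "avg_latency")  # assumed field names
    return {k: (before.get(k), after.get(k)) for k in keys}
```

Print the returned pairs (or feed them into the improvement formula above) to spot regressions at a glance.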
Compare before.json and after.json manually or with a custom script.

Q: Do I need to create test cases for every project?
A: No, you can use the same test cases across similar projects, but custom cases will be more accurate.
Q: Can I run evals without semantic search enabled?
A: Yes, semantic search is optional. The eval tool will work with just keyword search.
Q: How long does it take to run evals?
A: Depends on the number of test cases and timeout settings. 30 test cases typically take 10-15 minutes.
Q: Are eval results deterministic?
A: No, Claude's responses can vary. Run multiple times and average results for reliability.
Q: Can I share eval results?
A: Yes, results are stored as JSON files. You can share them or aggregate across teams.
Q: What if my ground truth is wrong?
A: Update the JSONL file with the correct ground truth and re-run. Results are only as good as your ground truth.
- Installation Guide - Set up codetect before running evals
- Architecture - Understanding how MCP tools work
- MCP Compatibility - Supported tools and configurations