---
title: Evaluation Guide
layout: default
parent: Performance
nav_order: 2
---

Performance Evaluation Guide

codetect includes a comprehensive evaluation tool to measure the performance improvement of MCP tools vs. standard CLI tools (grep, find, etc.) when working with Claude Code.

Overview

The codetect-eval tool runs test cases against your codebase twice: once with MCP tools enabled, and once with only standard CLI tools. This provides quantitative data on the benefits of using codetect.

Quick Start

# Run all test cases on a target repository
codetect-eval run --repo /path/to/project

# Run only specific categories
codetect-eval run --repo /path/to/project --category search,navigate

# View the most recent results
codetect-eval report

# List available test cases
codetect-eval list --repo /path/to/project

Commands

run

Run evaluation test cases against a repository.

codetect-eval run [options]

Options:

  • --repo <path> - Repository to evaluate (default: current directory)
  • --cases <dir> - Test cases directory (default: evals/cases)
  • --output <dir> - Output directory for results (default: evals/results)
  • --category <cat> - Filter by category (search, navigate, understand)
  • --timeout <dur> - Timeout per test case (default: 5m)
  • --verbose - Verbose output

Examples:

# Evaluate current repository with all test cases
codetect-eval run

# Evaluate specific repository with only search tests
codetect-eval run --repo /path/to/project --category search

# Run with custom timeout and verbose output
codetect-eval run --repo /path/to/project --timeout 10m --verbose

report

Display a saved evaluation report.

codetect-eval report [options]

Options:

  • --results <path> - Path to specific results JSON file (default: most recent)

Examples:

# Show the most recent report
codetect-eval report

# Show a specific report
codetect-eval report --results ~/.codetect/projects/myrepo-a1b2c3d4/evals/results/2024-01-10-120000-results.json

list

List available test cases.

codetect-eval list [options]

Options:

  • --repo <path> - Repository path (default: current directory)
  • --cases <dir> - Test cases directory (default: evals/cases)
  • --category <cat> - Filter by category

Examples:

# List all test cases in current directory
codetect-eval list

# List test cases for a specific repository
codetect-eval list --repo /path/to/project

# List only navigation test cases
codetect-eval list --category navigate

Creating Eval Cases for Your Repository

When you run evals on a repository without test cases, you'll see a helpful error message suggesting how to create them.

Repository-Specific Storage

Eval data is stored in the centralized data directory:

~/.codetect/projects/your-project-a1b2c3d4/
└── evals/
    ├── cases/          # Test case JSONL files
    │   ├── search.jsonl
    │   ├── navigate.jsonl
    │   └── understand.jsonl
    ├── results/        # Evaluation results (JSON)
    │   └── 2024-01-10-120000-results.json
    └── logs/           # Raw Claude output logs

This approach:

  • Keeps project directories clean (no .codetect/ needed)
  • Stores results centrally
  • Allows different repos to have different test cases

Manual Creation

Create the directory structure and add JSONL files:

# Find your project's data directory
ls ~/.codetect/projects/
# Then create cases directory
mkdir -p ~/.codetect/projects/your-project-<hash>/evals/cases

Create test case files (e.g., evals/cases/search.jsonl):

{"id":"search-001","category":"search","description":"Find error handling","prompt":"Find all error handling code in this repository","ground_truth":{"files":["internal/errors.go","pkg/handler.go"]},"difficulty":"easy"}
{"id":"search-002","category":"search","description":"Find HTTP handlers","prompt":"Find all HTTP request handlers","ground_truth":{"files":["internal/handlers/","pkg/api/"]},"difficulty":"medium"}

AI-Assisted Creation

Start a Claude Code session in the target repository and paste this prompt:

Create eval test cases for the codetect MCP tool.

These test cases will be used by codetect-eval to measure MCP search performance
against this repository (without pre-indexing). Create JSONL files organized by
category:
- search.jsonl: keyword/regex searches, file pattern matching
- navigate.jsonl: finding definitions, references, call hierarchies
- understand.jsonl: code comprehension, architectural questions

Each line should be a JSON object with this structure:
{
  "id": "unique-id",
  "category": "search|navigate|understand",
  "description": "Brief description of what this tests",
  "prompt": "The actual question/search to ask",
  "difficulty": "easy|medium|hard",
  "ground_truth": {
    "files": ["expected/file/paths.go"],
    "symbols": ["expectedFunctionName"],
    "lines": {"file.go": [10, 20]},
    "content": ["expected snippets in output"]
  }
}

Create 5-10 test cases per category based on this repository's actual code structure.
Focus on queries that have clear, verifiable answers.

Test Case Format

Each test case is a JSON object on a single line (JSONL format).

Required Fields

{
  "id": "unique-test-id",
  "category": "search|navigate|understand",
  "description": "Brief description of what this tests",
  "prompt": "The actual prompt given to Claude",
  "ground_truth": {
    "files": ["expected/file1.go", "expected/file2.go"],
    "symbols": ["ExpectedSymbol1", "ExpectedSymbol2"],
    "lines": {"file.go": [10, 25, 42]},
    "content": ["expected code snippet"]
  },
  "difficulty": "easy|medium|hard"
}

Field Descriptions

  • id: Unique identifier for the test case (e.g., "search-001", "navigate-005")
  • category: Test category (search, navigate, or understand)
  • description: Human-readable description for reports
  • prompt: The exact prompt given to Claude Code
  • ground_truth: Expected results for validation
    • files: List of file paths that should be found/mentioned
    • symbols: List of symbol names (functions, types, etc.)
    • lines: Map of files to line numbers
    • content: Expected code snippets or text
  • difficulty: Subjective difficulty rating

Example Test Cases

Search Example:

{"id":"search-001","category":"search","description":"Find authentication logic","prompt":"Find all code related to user authentication","ground_truth":{"files":["internal/auth/","pkg/middleware/auth.go"],"symbols":["Authenticate","ValidateToken"]},"difficulty":"medium"}

Navigate Example:

{"id":"navigate-001","category":"navigate","description":"Find Server struct definition","prompt":"Find where the Server struct is defined","ground_truth":{"files":["internal/server/server.go"],"symbols":["Server"],"lines":{"internal/server/server.go":[15]}},"difficulty":"easy"}

Understand Example:

{"id":"understand-001","category":"understand","description":"Explain request flow","prompt":"Explain how HTTP requests are processed from router to handler","ground_truth":{"files":["internal/router/","internal/handlers/","pkg/middleware/"]},"difficulty":"hard"}

Test Case Categories

search

Finding code patterns, keywords, or concepts across the codebase.

Good prompts:

  • "Find all error handling code"
  • "Find where we make HTTP requests"
  • "Search for database query code"

Ground truth: Files and symbols that match the search criteria

navigate

Locating specific symbols, definitions, or implementations.

Good prompts:

  • "Find the definition of the Server struct"
  • "Where is the HandleRequest function implemented?"
  • "Find all implementations of the Handler interface"

Ground truth: Specific files, symbols, and line numbers

understand

Comprehending code structure, relationships, or architecture.

Good prompts:

  • "Explain the authentication flow"
  • "How does the caching system work?"
  • "What happens when a user logs in?"

Ground truth: Key files involved in the concept (harder to validate precisely)

Metrics Measured

Each test case runs twice (with and without MCP tools) and measures:

Accuracy

  • Precision: Correct items / Total returned
  • Recall: Correct items / Total expected
  • F1 Score: Harmonic mean of precision and recall

Based on comparing results against the ground truth.
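
To make the relationship concrete, here is a minimal, runnable sketch of set-based precision/recall/F1 over the files field. It is an illustration only; codetect-eval's actual scoring may also weigh symbols, lines, and content:

def precision_recall_f1(returned: set[str], expected: set[str]) -> tuple[float, float, float]:
    """Set-based scoring sketch; not codetect-eval's exact algorithm."""
    correct = len(returned & expected)
    precision = correct / len(returned) if returned else 0.0
    recall = correct / len(expected) if expected else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Ground truth expects two files; Claude mentions three, two of them correct.
p, r, f1 = precision_recall_f1(
    returned={"internal/errors.go", "pkg/handler.go", "cmd/main.go"},
    expected={"internal/errors.go", "pkg/handler.go"},
)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")  # precision=0.67 recall=1.00 f1=0.80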

Performance

  • Token usage: Input tokens, output tokens, cache reads, cache creation
  • Latency: Time to complete the task
  • Cost: Estimated API cost in USD
  • Turns: Number of back-and-forth interactions

Success

  • Success rate: Percentage of test cases that completed successfully
  • Error rate: Percentage that failed or timed out

Understanding Results

After running evaluations, you'll see a summary report:

Evaluation Report
=================
Total Cases: 30

With MCP Tools:
  Avg Accuracy: 92.5%
  Avg Tokens: 12,450
  Avg Cost: $0.15
  Avg Latency: 8.2s
  Success Rate: 96.7%

Without MCP Tools:
  Avg Accuracy: 78.3%
  Avg Tokens: 28,900
  Avg Cost: $0.42
  Avg Latency: 15.6s
  Success Rate: 86.7%

Improvements:
  Accuracy: +14.2%
  Token Reduction: 56.9%
  Cost Reduction: 64.3%
  Latency Reduction: 47.4%

Key Takeaways

Higher accuracy means MCP tools help Claude find the right code more reliably.

Lower token usage means less context is consumed per task and searches are more targeted, which translates directly into lower costs.

Lower latency means faster responses, improving developer productivity.

Higher success rate means fewer failed attempts and timeout errors.

Best Practices

Writing Good Test Cases

  1. Use realistic prompts - Write prompts you'd actually use during development
  2. Cover different difficulties - Include easy, medium, and hard test cases
  3. Test edge cases - Include ambiguous queries, large search spaces, etc.
  4. Keep ground truth accurate - Update when code changes
  5. Balance categories - Mix search, navigate, and understand tests

Running Evaluations

  1. Index first - Run codetect index before evaluating
  2. Use consistent versions - Don't upgrade mid-evaluation
  3. Run multiple times - Results can vary; average over multiple runs
  4. Document context - Note repo size, language, domain in reports

Maintaining Test Cases

  1. Update with code changes - Keep ground truth in sync
  2. Add new tests - When you encounter interesting queries
  3. Remove stale tests - Delete tests for removed features
  4. Version control cases - Keep test case JSONL files in version control
  5. Results are centralized - Stored under ~/.codetect/projects/

Troubleshooting

No test cases found

Error:

ERROR: No test cases found!
The eval runner could not find any test cases in: /path/to/cases/dir

Solution: Create test cases using the manual or AI-assisted approach above.

Timeout errors

If test cases are timing out, increase the timeout:

codetect-eval run --repo /path/to/project --timeout 10m

Low accuracy scores

If accuracy is low across the board:

  1. Check that ground truth is correct
  2. Verify the repo is indexed: codetect stats
  3. Try simpler prompts to isolate the issue

High token usage with MCP

If MCP tools aren't reducing tokens:

  1. Check that MCP is actually being used (look for tool calls in verbose output)
  2. Verify .mcp.json is configured correctly
  3. Ensure codetect is running: ps aux | grep codetect

Advanced Usage

Custom Test Suites

Organize test cases by feature or module:

evals/cases/
├── auth/
│   ├── search.jsonl
│   └── navigate.jsonl
├── api/
│   ├── search.jsonl
│   └── navigate.jsonl
└── database/
    └── search.jsonl

The eval tool will recursively find all *.jsonl files.
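
Discovery behaves like a recursive glob over that directory, as in this short illustration:

from pathlib import Path

# All test case files under the cases directory, including nested suites.
cases_dir = Path.home() / ".codetect" / "projects" / "your-project-a1b2c3d4" / "evals" / "cases"
case_files = sorted(cases_dir.rglob("*.jsonl"))
print(f"Found {len(case_files)} test case files")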

Continuous Evaluation

Add eval runs to your CI pipeline:

# .github/workflows/eval.yml
- name: Run codetect evals
  run: |
    codetect index
    codetect-eval run --repo . --output ci-results/

Comparing Versions

Run evals before and after upgrading codetect:

# Before upgrade
codetect-eval run --repo .
cp ~/.codetect/projects/<name>-<hash>/evals/results/latest.json before.json

# Upgrade
codetect update

# After upgrade
codetect-eval run --repo .
cp ~/.codetect/projects/<name>-<hash>/evals/results/latest.json after.json

# Compare (manually or with custom script)
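
A small script can diff the two summaries. The field names below are assumptions about the results JSON layout (adjust them to whatever your codetect-eval version actually writes); the script itself is a hypothetical helper, not part of codetect:

#!/usr/bin/env python3
"""Compare two eval result files (hypothetical helper; JSON field names are assumed)."""
import json
import sys

def load_summary(path: str) -> dict:
    with open(path) as f:
        data = json.load(f)
    # Assumed layout: a top-level "summary" object holding numeric averages.
    return data.get("summary", data)

before = load_summary(sys.argv[1])
after = load_summary(sys.argv[2])

for key in sorted(set(before) & set(after)):
    b, a = before[key], after[key]
    if isinstance(b, (int, float)) and isinstance(a, (int, float)):
        print(f"{key}: {b} -> {a} ({a - b:+})")

Usage: python compare_evals.py before.json after.json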

FAQ

Q: Do I need to create test cases for every project?

A: No. You can reuse the same test cases across similar projects, but cases tailored to a repository's actual code give more accurate results.

Q: Can I run evals without semantic search enabled?

A: Yes, semantic search is optional. The eval tool will work with just keyword search.

Q: How long does it take to run evals?

A: Depends on the number of test cases and timeout settings. 30 test cases typically take 10-15 minutes.

Q: Are eval results deterministic?

A: No, Claude's responses can vary. Run multiple times and average results for reliability.

Q: Can I share eval results?

A: Yes, results are stored as JSON files. You can share them or aggregate across teams.

Q: What if my ground truth is wrong?

A: Update the JSONL file with the correct ground truth and re-run. Results are only as good as your ground truth.

Related Documentation