| version | 1.1 |
|---|---|
| last_updated | 2025-12-26 |
The metrics system automatically measures code quality for agent submissions, tracking everything from lines of code to cyclomatic complexity to code clones.
When an agent completes a checkpoint, the metrics system analyzes the submitted code and generates:
- Line metrics: LOC, comments, total lines
- Lint metrics: Ruff errors and violations
- Complexity metrics: Cyclomatic complexity (A-F ratings), nesting depth
- Pattern violations: AST-grep rule violations across 7 categories
- Code quality: Waste detection (trivial wrappers, single-use functions), code clones
- Dependencies: Graph metrics for import relationships
Results are saved to JSON/JSONL files in each checkpoint's quality_analysis/ directory.
Checkpoint-level results (individual checkpoint):
- Want details on test results and quality metrics for a single checkpoint? See Checkpoint Results - covers evaluation.json and quality_analysis/
- Contains correctness (test results) and quality (code metrics) for that checkpoint
- Located in: `checkpoint_N/evaluation.json` and `checkpoint_N/quality_analysis/`
Run-level results (aggregated across all checkpoints):
- Comparing runs or analyzing trends across checkpoints? See Run-Level Results - aggregated statistics
- Contains solve rates, average costs, efficiency metrics, quality trends
- Located in: `checkpoint_results.jsonl` and `result.json` at run root
- New to metrics? Start with Interpreting Results - explains what each metric means
- Looking at output files? See Output Files Reference - file locations and formats
- Adjusting thresholds? Read Configuration Guide - thresholds and AST-grep rules
| Concept | Description |
|---|---|
| Checkpoint Results | Correctness (tests) + Quality (code metrics) for a single checkpoint |
| Run Summary | Aggregated statistics across all checkpoints and problems |
| LOC | Lines of code (source lines, excluding blanks) |
| Cyclomatic Complexity (CC) | Number of independent paths through code (A=1-5, F=41+) |
| Maintainability Index (MI) | Composite score of code maintainability (A >= 19) |
| AST-grep Violations | Pattern-based slop-rule violations from configs/slop_rules.yaml |
| Waste | Abstraction inefficiencies (trivial wrappers, single-use functions) |
| Clones | Duplicate code blocks detected via AST hashing |
| Delta Metrics | Percentage changes between checkpoints |
| Pass Rate | Percentage of tests passing by category (CORE, FUNCTIONALITY, etc.) |
| Solve Rate | Percentage of checkpoints/problems meeting success criteria |
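The table gives only the endpoints of the CC scale (A=1-5, F=41+). As a rough sketch, the helper below maps a complexity score to a letter rating. It assumes the intermediate bands follow the convention used by Radon (B=6-10, C=11-20, D=21-30, E=31-40); that assumption and the name `cc_rank` are illustrative and are not taken from the metrics code.

```python
def cc_rank(complexity: int) -> str:
    """Map a cyclomatic complexity score to an A-F letter rating.

    A (1-5) and F (41+) come from the concepts table; the B-E bands
    are an assumption borrowed from the Radon convention.
    """
    if complexity < 1:
        raise ValueError("cyclomatic complexity must be at least 1")
    # Inclusive upper bound of each band, in ascending order.
    bands = [(5, "A"), (10, "B"), (20, "C"), (30, "D"), (40, "E")]
    for upper, letter in bands:
        if complexity <= upper:
            return letter
    return "F"
```

Under these assumed bands, a function with a complexity of 12 would be rated C.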
- CC ratings: More A/B ratings, fewer D/E/F
- Lint errors: Lower is better
- AST-grep violations: Lower is better (especially safety/complexity categories)
- Waste metrics: Fewer trivial wrappers and single-use functions
- Clone ratio: Lower percentage means less duplication
Metrics are saved at two levels:
Checkpoint-level (detailed for single checkpoint):
outputs/run_name/problem_name/checkpoint_N/
├── evaluation.json # Test results
├── quality_analysis/
│ ├── overall_quality.json # Aggregated snapshot metrics
│ ├── files.jsonl # Per-file metrics
│ ├── symbols.jsonl # Per-function/class metrics
│ └── ast_grep.jsonl # Pattern violations
└── evaluation/
    └── stdout.txt, stderr.txt, report.json # Test artifacts
Run-level (aggregated across all checkpoints):
outputs/run_name/
├── checkpoint_results.jsonl # All checkpoint metrics in one file
└── result.json # Aggregated statistics and summaries
Use checkpoint-level files for detailed analysis of a specific checkpoint. Use run-level files for comparing runs or identifying trends.
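Both levels can be read with the standard library alone. The sketch below loads a single JSON document and streams a JSONL file record by record. The helper names `load_json` and `iter_jsonl` are hypothetical, and no field names are assumed because the record schema is not described here.

```python
import json
from pathlib import Path
from typing import Any, Iterator


def load_json(path: Path) -> Any:
    """Load a single JSON document such as overall_quality.json or result.json."""
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)


def iter_jsonl(path: Path) -> Iterator[Any]:
    """Yield one parsed record per non-empty line of a JSONL file."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)
```

For example, `list(iter_jsonl(Path("outputs/run_name/checkpoint_results.jsonl")))` would collect every checkpoint record for a run, while `load_json` suits the single-document files.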
Delta metrics (prefixed with delta.) show percentage changes:
- `delta.loc`: Lines of code change
- `delta.lint_errors`: Lint error change
- `delta.ast_grep_violations`: Violation change
- `delta.churn_ratio`: Code churn, computed as (lines added + lines removed) / prior total lines
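As a worked illustration of the delta definitions above, the sketch below computes a percentage change and a churn ratio. The function names, the handling of a zero prior value, and the choice to return churn as a fraction rather than a percentage are assumptions, not the documented behavior of the checkpoint module.

```python
from typing import Optional


def percent_change(previous: float, current: float) -> Optional[float]:
    """Return the percentage change from previous to current.

    Returns None when previous is 0, because the change is undefined
    (assumed behavior; the real implementation may differ).
    """
    if previous == 0:
        return None
    return (current - previous) / previous * 100.0


def churn_ratio(lines_added: int, lines_removed: int, prior_total: int) -> Optional[float]:
    """Return (lines added + lines removed) / prior total lines as a fraction."""
    if prior_total == 0:
        return None
    return (lines_added + lines_removed) / prior_total
```

With these definitions, growing from 200 to 250 lines gives a change of 25.0 percent, and adding 30 lines while removing 20 from a 200-line checkpoint gives a churn ratio of 0.25.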
- Quality metrics computation: `src/slop_code/metrics/`
  - `driver.py`: Main entry point for measuring quality
  - `languages/`: Language-specific parsers (Python, JavaScript, etc.)
  - `checkpoint/`: Checkpoint-level metrics extraction and delta computation
  - `summary/`: Run-level aggregation and summary statistics
- Evaluation (test results): `src/slop_code/evaluation/report.py`
  - `CorrectnessResults`: Test result model
  - `GroupType`: Test categorization (CORE, FUNCTIONALITY, REGRESSION, ERROR)
  - `PassPolicy`: Success criteria
- AST-grep rules: `configs/slop_rules.yaml`
- Main entry points:
  - Snapshot quality: `slop_code.metrics.driver.measure_snapshot_quality()`
  - Checkpoint metrics: `slop_code.metrics.checkpoint.driver.get_checkpoint_metrics()`
  - Run summary: `slop_code.metrics.summary.aggregators` module
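Pass rates by category can be illustrated without the real models. The sketch below uses a stand-in enum carrying the four category names listed above and plain `(category, passed)` pairs. It does not import or reproduce `GroupType`, `CorrectnessResults`, or `PassPolicy`, whose actual fields are not shown here, so `Category` and `pass_rates` are hypothetical names.

```python
from collections import Counter
from enum import Enum
from typing import Dict, Iterable, Tuple


class Category(Enum):
    """Stand-in for the test categories; not the real GroupType."""
    CORE = "CORE"
    FUNCTIONALITY = "FUNCTIONALITY"
    REGRESSION = "REGRESSION"
    ERROR = "ERROR"


def pass_rates(results: Iterable[Tuple[Category, bool]]) -> Dict[Category, float]:
    """Return the percentage of passing tests for each category present."""
    totals: Counter = Counter()
    passed: Counter = Counter()
    for category, ok in results:
        totals[category] += 1
        if ok:
            passed[category] += 1
    # Only categories with at least one test appear, so no division by zero.
    return {cat: passed[cat] / totals[cat] * 100.0 for cat in totals}
```

Categories with no tests are simply absent from the result, which avoids reporting a misleading 0 percent for a category that was never exercised.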
- v1.1 (2025-12-26): Added checkpoint-results.md and run-results.md documentation for two-level metrics hierarchy
- v1.0 (2025-12-17): Initial metrics documentation