This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
SlopCodeBench (SCBench) is a benchmark for evaluating coding agents under iterative specification refinement. Agents implement a spec, then extend their own code as the spec changes through checkpoints, exposing behaviors like path dependence, non-convergence, and trade-offs between explicit handling and structural stability.
```bash
# Install dependencies (Python 3.12+ required)
uv sync

# Run an agent on a problem
uv run slop-code run \
  --agent claude_code \
  --model anthropic/opus-4.5 \
  --environment configs/environments/docker-python3.12-uv.yaml \
  --prompt configs/prompts/just-solve.jinja \
  --problem file_backup \
  thinking=low \
  version=2.0.51
```
```bash
# Run multiple problems
uv run slop-code run --problem file_backup --problem execution_server ...
```

Run parameters:
- `thinking=none|low|medium|high` - Extended thinking budget (low=10k, medium=20k, high=40k tokens)
- `version=X.Y.Z` - Agent version to use
- Results saved to: `outputs/{model}/{agent}-{prompt}_{params}_{timestamp}/`
```bash
# Evaluate a run directory
slop-code eval outputs/<run-directory>/

# Grade code quality with LLM judge
slop-code metrics judge \
  --rubric configs/rubrics/llm_judge.jsonl \
  --model <model on openrouter> \
  --criteria-template configs/rubrics/templates/criteria_with_pn.j2 \
  --prefix-template configs/rubrics/templates/no_expl.j2
```

```bash
uv run pytest -q                          # Run all tests
uv run pytest tests/path/to/test_file.py  # Run specific test file
uv run pytest -xvs                        # Verbose with early exit on failure
```

Use the `/run-tests` skill instead of raw pytest for problem tests. This ensures tests run in the correct Docker environment with proper isolation.
```bash
# Use the skill:
/run-tests <snapshot_path> <problem_name> <checkpoint_index>

# Example:
/run-tests outputs/run_001/submissions/file_backup/checkpoint_2/snapshot file_backup checkpoint_2
```

The underlying command is:

```bash
slop-code --quiet eval-snapshot {snapshot_path} \
  -p {problem_name} \
  -c {checkpoint_index} \
  -e configs/environments/docker-python3.12-uv.yaml \
  -o /tmp/eval-output \
  --json
```

Results are saved to the output directory with `evaluation.json` (test results) and `quality_analysis/` (code metrics).
```bash
uv run ruff check .  # Lint
uv run isort .       # Format imports
```

`src/slop_code/` - Main library organized into:
- `agent_runner/` - Agent lifecycle management and execution
  - `agent.py` - Agent base class and protocol
  - `registry.py` - Agent and config registration system
  - `state.py` - Agent state management across checkpoints
  - `trajectory.py` - Execution history tracking
  - `agents/` - Agent implementations (claude_code, codex, gemini, miniswe, opencode, openhands)
  - `credentials.py` - API key and credential management
- `execution/` - Isolated execution environments
  - `session.py` - Session lifecycle: workspace + runtime coordination
  - `workspace.py` - Isolated directories, snapshots, file operations
  - `runtime.py` - `SubmissionRuntime` protocol for command execution
  - `docker_runtime/` - Docker container execution with networking and setup scripts
  - `snapshot.py` - Capturing workspace state and diffs between checkpoints
  - `assets.py` - Static asset resolution and placeholder substitution
  - `models.py` - `EnvironmentSpec` and related Pydantic models
- `evaluation/` - Pytest-based test execution and validation
  - `config.py` - `ProblemConfig`, `CheckpointConfig`, `MarkerConfig` definitions
  - `pytest_runner.py` - `PytestRunner` orchestrates pytest execution via uvx
  - `report.py` - `CorrectnessResults`, `TestResult`, and `GroupType` (CORE, FUNCTIONALITY, REGRESSION, ERROR)
- `metrics/` - Code quality measurement
  - `driver.py` - Quality metric orchestration
  - `grade.py` - Grading logic and scoring
  - `models.py` - Core data models for metrics
  - `checkpoint/` - Per-checkpoint quality tracking and extraction
  - `summary/` - Aggregation and run-level statistics
  - `languages/` - Language-specific parsers (Python, JavaScript, etc.)
  - `rubric/` - LLM judge templates and criteria
- `entrypoints/` - CLI and command handlers
  - `cli.py` - Main Typer application entry point (registered as `slop-code`)
  - `commands/` - Individual command implementations (run_agent, eval_*, docker, etc.)
  - `config/` - Run configuration loading and resolvers
  - `problem_runner/` - Problem execution driver and state management
  - `evaluation/` - Evaluation command drivers
- `dashboard/` - Dash-based visualization UI
  - `app.py` - Main Dash application
  - `pages/` - Individual dashboard pages (overview, checkpoints, quality, efficiency, etc.)
  - `graphs/` - Plotly graph components
- `common/` - Shared utilities
  - `common.py` - General helpers
  - `constants.py` - System-wide constants
  - `llms.py` - LLM API interactions via litellm
  - `paths.py` - Path resolution utilities
Session → Workspace → Runtime Flow:
- `Session` manages the overall execution lifecycle
- `Workspace` provides an isolated directory with file operations and snapshotting
- `SubmissionRuntime` (Docker or Local) executes commands and captures output
- Snapshots capture state between checkpoints for comparison
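The layering above can be sketched as a toy model. The class shapes below are simplified stand-ins for the real ones in `src/slop_code/execution/`, which add Docker support, diff-based snapshots, and asset resolution:

```python
import tempfile
from pathlib import Path


class Workspace:
    """Isolated directory with file operations and snapshotting."""
    def __init__(self):
        self.root = Path(tempfile.mkdtemp())

    def write(self, name, text):
        (self.root / name).write_text(text)

    def snapshot(self):
        # Real snapshots capture state and diffs; here we just list files.
        return sorted(p.name for p in self.root.iterdir())


class LocalRuntime:
    """Stand-in for SubmissionRuntime: bound to one workspace."""
    def __init__(self, workspace):
        self.workspace = workspace


class Session:
    """Owns the workspace and runtime for one execution lifecycle."""
    def __init__(self):
        self.workspace = Workspace()
        self.runtime = LocalRuntime(self.workspace)


session = Session()
session.workspace.write("main.py", "print('hi')")
print(session.workspace.snapshot())  # ['main.py']
```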
Agent Execution Flow:
- Agent registered via `register_agent()` and `register_agent_config()`
- `ProblemRunner` loads problem config and creates workspace
- For each checkpoint:
  - Agent receives spec and implements solution
  - `Session.spawn()` creates runtime
  - Evaluation runs test cases via adapters
  - Workspace snapshot captures final state
- Quality metrics computed via language parsers
- Results aggregated into `CorrectnessResults` and quality reports
Problem Evaluation Flow:
- `ProblemConfig` defines checkpoints with pytest markers
- Each checkpoint has `checkpoint_N.md` (spec) and `tests/test_checkpoint_N.py` (tests)
- `PytestRunner` copies tests to workspace, generates pytest.ini with markers
- Pytest runs via `uvx` (isolated from solution environment)
- Tests categorized by markers: unmarked=CORE, `@pytest.mark.functionality`=FUNCTIONALITY, `@pytest.mark.error`=ERROR, prior checkpoint tests=REGRESSION
- `PassPolicy` ("core-cases" or "all-non-error-cases") determines checkpoint success
- Metrics computed: correctness (pass/fail) + quality (complexity, duplication, etc.)
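The marker-to-group mapping above can be restated as a small function. This is an illustrative sketch of the rules, not the real implementation in `evaluation/report.py`; in particular, how a prior-checkpoint test with an `error` marker is handled is an assumption here:

```python
from enum import Enum


class GroupType(Enum):
    CORE = "core"
    FUNCTIONALITY = "functionality"
    REGRESSION = "regression"
    ERROR = "error"


def classify_test(markers, from_prior_checkpoint=False):
    """Map a test's pytest markers to a GroupType (illustrative)."""
    if from_prior_checkpoint:
        # Assumption: prior-checkpoint provenance wins over markers.
        return GroupType.REGRESSION
    if "error" in markers:
        return GroupType.ERROR
    if "functionality" in markers:
        return GroupType.FUNCTIONALITY
    return GroupType.CORE  # unmarked tests are CORE
```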
Configuration Hierarchy:
- `configs/agents/*.yaml` - Agent definitions
- `configs/models/*.yaml` - Model specifications
- `configs/environments/*.yaml` - Runtime environments (Docker, local)
- `configs/prompts/*.jinja` - Agent prompt templates
- `configs/runs/*.yaml` - Complete run configurations
- `problems/*/config.yaml` - Problem-specific settings (inline checkpoints)
Each problem in `problems/` follows this pattern:

```
problem_name/
├── config.yaml              # Problem metadata and inline checkpoint definitions
├── checkpoint_1.md          # Specification for checkpoint 1
├── checkpoint_2.md          # Specification for checkpoint 2
└── tests/
    ├── conftest.py          # Pytest configuration (entrypoint, checkpoint fixtures)
    ├── test_checkpoint_1.py # Tests for checkpoint 1
    ├── test_checkpoint_2.py # Tests for checkpoint 2
    ├── data/                # Test case data
    │   ├── checkpoint_1/
    │   │   ├── core/        # Core test cases (must pass)
    │   │   ├── hidden/      # Functionality tests (optional)
    │   │   └── errors/      # Error handling tests
    │   └── checkpoint_2/
    └── assets/              # Static test files
```
Modern `config.yaml` structure (checkpoints are inline, not separate files):

```yaml
name: problem_name
entry_file: main_command
timeout: 20
checkpoints:
  checkpoint_1:
    version: 1
    order: 1
    state: Core Tests
  checkpoint_2:
    version: 1
    order: 2
    state: Extended Features
test_dependencies:
  - pyyaml  # Additional packages for test environment
markers:  # Custom pytest markers beyond built-ins
  custom_marker:
    description: Custom test category
    group: Functionality
```

- All configuration uses Pydantic models with strict validation
- OmegaConf used for YAML loading with resolvers
- Checkpoints are now defined inline in `config.yaml`, not as separate `checkpoint_N/config.yaml` files
- Static assets support placeholder resolution (e.g., `%%%ENTRYPOINT:entry_file%%%`)
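To make the "strict validation" point concrete, here is a plain-Python sketch of the kind of invariants such a config implies. The real checks live in the Pydantic models; both the required keys and the "orders form 1..N" rule below are assumptions for illustration:

```python
def validate_checkpoints(config):
    """Check illustrative invariants on the inline checkpoint mapping.

    Assumptions: every checkpoint needs version/order/state, and the
    `order` values must form a gapless 1..N sequence.
    """
    orders = []
    for name, cp in config["checkpoints"].items():
        for key in ("version", "order", "state"):
            if key not in cp:
                raise ValueError(f"{name} missing required key: {key}")
        orders.append(cp["order"])
    if sorted(orders) != list(range(1, len(orders) + 1)):
        raise ValueError("checkpoint orders must be 1..N with no gaps")
    return True
```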
- Agents must implement the `Agent` protocol from `agent_runner/agent.py`
- Lifecycle methods: `setup()`, `run()`, `reset()`, `cleanup()`
- State preserved between checkpoints via `AgentState`
- Credentials loaded from environment variables via `credentials.py`
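A minimal sketch of what satisfying such a protocol looks like, assuming simplified signatures (the real methods in `agent_runner/agent.py` almost certainly take more arguments, e.g. the spec, workspace, and state):

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class Agent(Protocol):
    """Simplified stand-in for the real Agent protocol."""
    def setup(self) -> None: ...
    def run(self, spec: str) -> str: ...
    def reset(self) -> None: ...
    def cleanup(self) -> None: ...


class EchoAgent:
    """Toy agent: implements all four lifecycle methods."""
    def setup(self) -> None:
        pass

    def run(self, spec: str) -> str:
        return f"implemented: {spec}"

    def reset(self) -> None:
        pass

    def cleanup(self) -> None:
        pass
```

Because the protocol is `runtime_checkable`, `isinstance(EchoAgent(), Agent)` holds structurally without inheritance, which mirrors how duck-typed agent implementations can be registered.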
- First run builds Docker images (5-10 minutes), subsequent runs are fast
- Images cached per agent version
- Workspaces mounted into containers with isolated networking
- Setup commands run before each checkpoint
- GroupType: CORE (must pass), FUNCTIONALITY (optional), REGRESSION (from prior checkpoints), ERROR (expected failures)
- PassPolicy: `"core-cases"` (all CORE tests pass), `"all-non-error-cases"` (CORE + FUNCTIONALITY + REGRESSION all pass)
- Markers: Tests categorized by pytest markers (`@pytest.mark.error`, `@pytest.mark.functionality`)
- Isolation: Tests run via `uvx` for complete isolation from solution environment
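The two pass policies can be restated as a small decision function. This is a sketch of the rules as described above, not the real `PassPolicy` implementation, and the `(group, passed)` pair representation is an assumption:

```python
def checkpoint_passes(results, policy):
    """Decide checkpoint success from (group, passed) pairs.

    Illustrative restatement of the two documented policies:
    - "core-cases": every CORE test must pass
    - "all-non-error-cases": every non-ERROR test must pass
    """
    if policy == "core-cases":
        relevant = [passed for group, passed in results if group == "CORE"]
    elif policy == "all-non-error-cases":
        relevant = [passed for group, passed in results if group != "ERROR"]
    else:
        raise ValueError(f"unknown policy: {policy}")
    return all(relevant)
```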
- Language-specific parsers extract AST information
- Metrics: cyclomatic complexity, duplication, code churn, maintainability index
- ast-grep rules in `configs/ast-grep-rules/` for pattern-based analysis
- LLM judge for subjective quality assessment via rubric templates
- Uses `structlog` with structured logging throughout
- Set `verbose=True` in logger calls for detailed output
- Workspace snapshots enable diffing between checkpoints
- `outputs/` contains full execution artifacts per run
- Create agent class in `src/slop_code/agent_runner/agents/`
- Implement the `Agent` protocol (setup, run, reset, cleanup)
- Create config class extending `AgentConfigBase`
- Register with `register_agent()` and `register_agent_config()`
- Add YAML config to `configs/agents/`
- Document in `docs/agents/agents/`
- Design checkpoints and spec (see `docs/contributing-problems/`)
- Create directory structure in `problems/`
- Write `config.yaml` with inline checkpoint definitions
- Write `checkpoint_N.md` for each checkpoint specification
- Create `tests/conftest.py` with entrypoint/checkpoint fixtures
- Write `tests/test_checkpoint_N.py` for each checkpoint
- Add test data in `tests/data/checkpoint_N/{core,hidden,errors}/`
- Use pytest markers for test categorization (`@pytest.mark.error`, etc.)
- Validate with `slop-code eval` and submit PR
```bash
# Evaluate checkpoint without re-running agent
slop-code eval checkpoint \
  --workspace outputs/{run}/submissions/{problem}/checkpoint_N/ \
  --problem {problem} \
  --checkpoint checkpoint_N

# Run static analysis on checkpoint
slop-code metrics static \
  --workspace outputs/{run}/submissions/{problem}/checkpoint_N/
```

- Branch from main - Latest stable code on `main`
- Run tests - `uv run pytest -q` before committing
- Lint - `uv run ruff check .` and `uv run isort .`
- Commit style - Short, capitalized summaries (e.g., "Fix Docker runtime cleanup")
- PR - Include description, link issues, note tests run, add screenshots for UI changes
- Docker required - no pure local execution for benchmarks
- First run is slow due to image builds (cached afterward)
- Some agents (OpenHands) require additional dependencies (see `dependency-groups.openhands`)
- LLM judge quality depends on model capabilities
- Workspace diffs can be large for binary files
Skills are invoked with `/skill-name <args>`. See `.claude/skills/` for full documentation.

| Skill | Usage | Description |
|---|---|---|
| `/run-tests` | `/run-tests <snapshot> <problem> <checkpoint>` | Run problem tests in Docker using eval-snapshot |
| `/fix-solution` | `/fix-solution <snapshot> <problem> <checkpoint>` | Iteratively test and repair a solution until tests pass |
| `/edge-cases` | `/edge-cases <problem> <checkpoint>` | Analyze tests and suggest missing edge cases |
| `/validate-run` | `/validate-run <run_path> [problem]` | Validate all checkpoints from an agent run |
| `/test-ambiguity-detector` | `/test-ambiguity-detector <problem> <checkpoint>` | Find ambiguous test assumptions |
- `src/slop_code/agent_runner/agent.py` - Agent protocol definition
- `src/slop_code/execution/session.py` - Execution session management
- `src/slop_code/evaluation/pytest_runner.py` - Pytest execution orchestration
- `src/slop_code/entrypoints/problem_runner/driver.py` - Problem execution driver

- `docs/execution/README.md` - Execution architecture deep dive
- `docs/evaluation/README.md` - Evaluation system guide
- `docs/problems/tutorial.md` - Problem creation tutorial

- `docs/evaluation-tests/README.md` - How pytest evaluation works
- `docs/evaluation-tests/conftest-patterns.md` - Fixture patterns (session, module, factory)
- `docs/evaluation-tests/markers.md` - Test categorization (CORE, FUNCTIONALITY, ERROR, REGRESSION)
- `docs/evaluation-tests/test-data.md` - Organizing test data (inline vs external)
- `docs/evaluation-tests/runner-internals.md` - Technical reference for PytestRunner
- `docs/evaluation-tests/fixtures-advanced.md` - Advanced patterns (expensive resources, composition)
- `docs/evaluation-tests/stateful-testing.md` - State across checkpoints and modules
- `docs/evaluation-tests/complex-parametrization.md` - Loading cases from JSON, YAML, directories
- `docs/evaluation-tests/debugging-workflows.md` - Debugging test failures and workflows