This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
SlopCodeBench (SCBench) is a benchmark for evaluating coding agents under iterative specification refinement. Agents implement a spec, then extend their own code as the spec changes through checkpoints, exposing behaviors like path dependence, non-convergence, and trade-offs between explicit handling and structural stability.
```bash
# Install dependencies (Python 3.12+ required)
uv sync

# Run an agent on a problem
uv run slop-code run \
  --agent claude_code \
  --model anthropic/opus-4.5 \
  --environment configs/environments/docker-python3.12-uv.yaml \
  --prompt configs/prompts/just-solve.jinja \
  --problem file_backup \
  thinking=low \
  version=2.0.51
```
```bash
# Run multiple problems
uv run slop-code run --problem file_backup --problem execution_server ...
```

Run parameters:
- `thinking=none|low|medium|high` - Extended thinking budget (low=10k, medium=20k, high=40k tokens)
- `version=X.Y.Z` - Agent version to use
- Results saved to: `outputs/{model}/{agent}-{prompt}_{params}_{timestamp}/`
```bash
# Evaluate a run directory
slop-code eval outputs/<run-directory>/

# Grade code quality with LLM judge
slop-code metrics judge \
  --rubric configs/rubrics/llm_judge.jsonl \
  --model <model on openrouter> \
  --criteria-template configs/rubrics/templates/criteria_with_pn.j2 \
  --prefix-template configs/rubrics/templates/no_expl.j2
```

```bash
uv run pytest -q                          # Run all tests
uv run pytest tests/path/to/test_file.py  # Run specific test file
uv run pytest -xvs                        # Verbose with early exit on failure
```

Use the `/run-tests` skill instead of raw pytest for problem tests. This ensures tests run in the correct Docker environment with proper isolation.
```bash
# Use the skill:
/run-tests <snapshot_path> <problem_name> <checkpoint_index>

# Example:
/run-tests outputs/run_001/submissions/file_backup/checkpoint_2/snapshot file_backup checkpoint_2
```

The underlying command is:

```bash
slop-code --quiet eval-snapshot {snapshot_path} \
  -p {problem_name} \
  -c {checkpoint_index} \
  -e configs/environments/docker-python3.12-uv.yaml \
  -o /tmp/eval-output \
  --json
```

Results are saved to the output directory with `evaluation.json` (test results) and `quality_analysis/` (code metrics).
```bash
uv run ruff check .  # Lint
uv run isort .       # Format imports
```

`src/slop_code/` - Main library organized into:
- `agent_runner/` - Agent lifecycle management and execution
  - `agent.py` - Agent base class and protocol
  - `registry.py` - Agent and config registration system
  - `state.py` - Agent state management across checkpoints
  - `trajectory.py` - Execution history tracking
  - `agents/` - Agent implementations (claude_code, codex, gemini, miniswe, opencode, openhands)
  - `credentials.py` - API key and credential management
- `execution/` - Isolated execution environments
  - `session.py` - Session lifecycle: workspace + runtime coordination
  - `workspace.py` - Isolated directories, snapshots, file operations
  - `runtime.py` - `SubmissionRuntime` protocol for command execution
  - `docker_runtime/` - Docker container execution with networking and setup scripts
  - `snapshot.py` - Capturing workspace state and diffs between checkpoints
  - `assets.py` - Static asset resolution and placeholder substitution
  - `models.py` - `EnvironmentSpec` and related Pydantic models
- `evaluation/` - Pytest-based test execution and validation
  - `config.py` - `ProblemConfig`, `CheckpointConfig`, `MarkerConfig` definitions
  - `pytest_runner.py` - `PytestRunner` orchestrates pytest execution via uvx
  - `report.py` - `CorrectnessResults`, `TestResult`, and `GroupType` (CORE, FUNCTIONALITY, REGRESSION, ERROR)
- `metrics/` - Code quality measurement
  - `driver.py` - Quality metric orchestration
  - `grade.py` - Grading logic and scoring
  - `models.py` - Core data models for metrics
  - `checkpoint/` - Per-checkpoint quality tracking and extraction
  - `summary/` - Aggregation and run-level statistics
  - `languages/` - Language-specific parsers (Python, JavaScript, etc.)
  - `rubric/` - LLM judge templates and criteria
- `entrypoints/` - CLI and command handlers
  - `cli.py` - Main Typer application entry point (registered as `slop-code`)
  - `commands/` - Individual command implementations (run_agent, eval_*, docker, etc.)
  - `config/` - Run configuration loading and resolvers
  - `problem_runner/` - Problem execution driver and state management
  - `evaluation/` - Evaluation command drivers
- `dashboard/` - Dash-based visualization UI
  - `app.py` - Main Dash application
  - `pages/` - Individual dashboard pages (overview, checkpoints, quality, efficiency, etc.)
  - `graphs/` - Plotly graph components
- `common/` - Shared utilities
  - `common.py` - General helpers
  - `constants.py` - System-wide constants
  - `llms.py` - LLM API interactions via litellm
  - `paths.py` - Path resolution utilities
Session → Workspace → Runtime Flow:
- `Session` manages the overall execution lifecycle
- `Workspace` provides an isolated directory with file operations and snapshotting
- `SubmissionRuntime` (Docker or Local) executes commands and captures output
- Snapshots capture state between checkpoints for comparison
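The layering above can be sketched as a toy model. The class shapes below are simplified stand-ins for the real ones in `src/slop_code/execution/`, which add Docker support, diff-based snapshots, and asset resolution:

```python
import tempfile
from pathlib import Path


class Workspace:
    """Isolated directory with file operations and snapshotting."""
    def __init__(self):
        self.root = Path(tempfile.mkdtemp())

    def write(self, name, text):
        (self.root / name).write_text(text)

    def snapshot(self):
        # Real snapshots capture state and diffs; here we just list files.
        return sorted(p.name for p in self.root.iterdir())


class LocalRuntime:
    """Stand-in for SubmissionRuntime: bound to one workspace."""
    def __init__(self, workspace):
        self.workspace = workspace


class Session:
    """Owns the workspace and runtime for one execution lifecycle."""
    def __init__(self):
        self.workspace = Workspace()
        self.runtime = LocalRuntime(self.workspace)


session = Session()
session.workspace.write("main.py", "print('hi')")
print(session.workspace.snapshot())  # ['main.py']
```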
Agent Execution Flow:
- Agent registered via `register_agent()` and `register_agent_config()`
- `ProblemRunner` loads problem config and creates workspace
- For each checkpoint:
  - Agent receives spec and implements solution
  - `Session.spawn()` creates runtime
  - Evaluation runs test cases via adapters
  - Workspace snapshot captures final state
- Quality metrics computed via language parsers
- Results aggregated into `CorrectnessResults` and quality reports
Problem Evaluation Flow:
- `ProblemConfig` defines checkpoints with pytest markers
- Each checkpoint has `checkpoint_N.md` (spec) and `tests/test_checkpoint_N.py` (tests)
- `PytestRunner` copies tests to workspace, generates pytest.ini with markers
- Pytest runs via `uvx` (isolated from solution environment)
- Tests categorized by markers: unmarked=CORE, `@pytest.mark.functionality`=FUNCTIONALITY, `@pytest.mark.error`=ERROR, prior checkpoint tests=REGRESSION
- `PassPolicy` ("core-cases" or "all-non-error-cases") determines checkpoint success
- Metrics computed: correctness (pass/fail) + quality (complexity, duplication, etc.)
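The marker-to-group mapping above can be restated as a small function. This is an illustrative sketch of the rules, not the real implementation in `evaluation/report.py`; in particular, how a prior-checkpoint test with an `error` marker is handled is an assumption here:

```python
from enum import Enum


class GroupType(Enum):
    CORE = "core"
    FUNCTIONALITY = "functionality"
    REGRESSION = "regression"
    ERROR = "error"


def classify_test(markers, from_prior_checkpoint=False):
    """Map a test's pytest markers to a GroupType (illustrative)."""
    if from_prior_checkpoint:
        # Assumption: prior-checkpoint provenance wins over markers.
        return GroupType.REGRESSION
    if "error" in markers:
        return GroupType.ERROR
    if "functionality" in markers:
        return GroupType.FUNCTIONALITY
    return GroupType.CORE  # unmarked tests are CORE
```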
Configuration Hierarchy:
- `configs/agents/*.yaml` - Agent definitions
- `configs/models/*.yaml` - Model specifications
- `configs/environments/*.yaml` - Runtime environments (Docker, local)
- `configs/prompts/*.jinja` - Agent prompt templates
- `configs/runs/*.yaml` - Complete run configurations
- `problems/*/config.yaml` - Problem-specific settings (inline checkpoints)
Each problem in `problems/` follows this pattern:

```
problem_name/
├── config.yaml              # Problem metadata and inline checkpoint definitions
├── checkpoint_1.md          # Specification for checkpoint 1
├── checkpoint_2.md          # Specification for checkpoint 2
└── tests/
    ├── conftest.py          # Pytest configuration (entrypoint, checkpoint fixtures)
    ├── test_checkpoint_1.py # Tests for checkpoint 1
    ├── test_checkpoint_2.py # Tests for checkpoint 2
    ├── data/                # Test case data
    │   ├── checkpoint_1/
    │   │   ├── core/        # Core test cases (must pass)
    │   │   ├── hidden/      # Functionality tests (optional)
    │   │   └── errors/      # Error handling tests
    │   └── checkpoint_2/
    └── assets/              # Static test files
```
Modern `config.yaml` structure (checkpoints are inline, not separate files):

```yaml
name: problem_name
entry_file: main_command
timeout: 20
checkpoints:
  checkpoint_1:
    version: 1
    order: 1
    state: Core Tests
  checkpoint_2:
    version: 1
    order: 2
    state: Extended Features
test_dependencies:
  - pyyaml  # Additional packages for test environment
markers:  # Custom pytest markers beyond built-ins
  custom_marker:
    description: Custom test category
    group: Functionality
```

- All configuration uses Pydantic models with strict validation
- OmegaConf used for YAML loading with resolvers
- Checkpoints are now defined inline in `config.yaml`, not as separate `checkpoint_N/config.yaml` files
- Static assets support placeholder resolution (e.g., `%%%ENTRYPOINT:entry_file%%%`)
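To make the "strict validation" point concrete, here is a plain-Python sketch of the kind of invariants such a config implies. The real checks live in the Pydantic models; both the required keys and the "orders form 1..N" rule below are assumptions for illustration:

```python
def validate_checkpoints(config):
    """Check illustrative invariants on the inline checkpoint mapping.

    Assumptions: every checkpoint needs version/order/state, and the
    `order` values must form a gapless 1..N sequence.
    """
    orders = []
    for name, cp in config["checkpoints"].items():
        for key in ("version", "order", "state"):
            if key not in cp:
                raise ValueError(f"{name} missing required key: {key}")
        orders.append(cp["order"])
    if sorted(orders) != list(range(1, len(orders) + 1)):
        raise ValueError("checkpoint orders must be 1..N with no gaps")
    return True
```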
- Agents must implement the `Agent` protocol from `agent_runner/agent.py`
- Lifecycle methods: `setup()`, `run()`, `reset()`, `cleanup()`
- State preserved between checkpoints via `AgentState`
- Credentials loaded from environment variables via `credentials.py`
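A minimal sketch of what satisfying such a protocol looks like, assuming simplified signatures (the real methods in `agent_runner/agent.py` almost certainly take more arguments, e.g. the spec, workspace, and state):

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class Agent(Protocol):
    """Simplified stand-in for the real Agent protocol."""
    def setup(self) -> None: ...
    def run(self, spec: str) -> str: ...
    def reset(self) -> None: ...
    def cleanup(self) -> None: ...


class EchoAgent:
    """Toy agent: implements all four lifecycle methods."""
    def setup(self) -> None:
        pass

    def run(self, spec: str) -> str:
        return f"implemented: {spec}"

    def reset(self) -> None:
        pass

    def cleanup(self) -> None:
        pass
```

Because the protocol is `runtime_checkable`, `isinstance(EchoAgent(), Agent)` holds structurally without inheritance, which mirrors how duck-typed agent implementations can be registered.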
- First run builds Docker images (5-10 minutes), subsequent runs are fast
- Images cached per agent version
- Workspaces mounted into containers with isolated networking
- Setup commands run before each checkpoint
- GroupType: CORE (must pass), FUNCTIONALITY (optional), REGRESSION (from prior checkpoints), ERROR (expected failures)
- PassPolicy: `"core-cases"` (all CORE tests pass), `"all-non-error-cases"` (CORE + FUNCTIONALITY + REGRESSION all pass)
- Markers: Tests categorized by pytest markers (`@pytest.mark.error`, `@pytest.mark.functionality`)
- Isolation: Tests run via `uvx` for complete isolation from solution environment
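The two pass policies can be restated as a small decision function. This is a sketch of the rules as described above, not the real `PassPolicy` implementation, and the `(group, passed)` pair representation is an assumption:

```python
def checkpoint_passes(results, policy):
    """Decide checkpoint success from (group, passed) pairs.

    Illustrative restatement of the two documented policies:
    - "core-cases": every CORE test must pass
    - "all-non-error-cases": every non-ERROR test must pass
    """
    if policy == "core-cases":
        relevant = [passed for group, passed in results if group == "CORE"]
    elif policy == "all-non-error-cases":
        relevant = [passed for group, passed in results if group != "ERROR"]
    else:
        raise ValueError(f"unknown policy: {policy}")
    return all(relevant)
```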
- Language-specific parsers extract AST information
- Metrics: cyclomatic complexity, duplication, code churn, maintainability index
- ast-grep rules in `configs/ast-grep-rules/` for pattern-based analysis
- LLM judge for subjective quality assessment via rubric templates
- Uses `structlog` with structured logging throughout
- Set `verbose=True` in logger calls for detailed output
- Workspace snapshots enable diffing between checkpoints
- `outputs/` contains full execution artifacts per run
- Create agent class in `src/slop_code/agent_runner/agents/`
- Implement the `Agent` protocol (setup, run, reset, cleanup)
- Create config class extending `AgentConfigBase`
- Register with `register_agent()` and `register_agent_config()`
- Add YAML config to `configs/agents/`
- Document in `docs/agents/agents/`
- Design checkpoints and spec (see `docs/contributing-problems/`)
- Create directory structure in `problems/`
- Write `config.yaml` with inline checkpoint definitions
- Write `checkpoint_N.md` for each checkpoint specification
- Create `tests/conftest.py` with entrypoint/checkpoint fixtures
- Write `tests/test_checkpoint_N.py` for each checkpoint
- Add test data in `tests/data/checkpoint_N/{core,hidden,errors}/`
- Use pytest markers for test categorization (`@pytest.mark.error`, etc.)
- Validate with `slop-code eval` and submit PR
```bash
# Evaluate checkpoint without re-running agent
slop-code eval checkpoint \
  --workspace outputs/{run}/submissions/{problem}/checkpoint_N/ \
  --problem {problem} \
  --checkpoint checkpoint_N

# Run static analysis on checkpoint
slop-code metrics static \
  --workspace outputs/{run}/submissions/{problem}/checkpoint_N/
```

- Branch from main - Latest stable code on `main`
- Run tests - `uv run pytest -q` before committing
- Lint - `uv run ruff check .` and `uv run isort .`
- Commit style - Short, capitalized summaries (e.g., "Fix Docker runtime cleanup")
- PR - Include description, link issues, note tests run, add screenshots for UI changes
- Docker required - no pure local execution for benchmarks
- First run is slow due to image builds (cached afterward)
- Some agents (OpenHands) require additional dependencies (see `dependency-groups.openhands`)
- LLM judge quality depends on model capabilities
- Workspace diffs can be large for binary files
Skills are invoked with `/skill-name <args>`. See `.claude/skills/` for full documentation.

| Skill | Usage | Description |
|---|---|---|
| `/run-tests` | `/run-tests <snapshot> <problem> <checkpoint>` | Run problem tests in Docker using eval-snapshot |
| `/fix-solution` | `/fix-solution <snapshot> <problem> <checkpoint>` | Iteratively test and repair a solution until tests pass |
| `/edge-cases` | `/edge-cases <problem> <checkpoint>` | Analyze tests and suggest missing edge cases |
| `/validate-run` | `/validate-run <run_path> [problem]` | Validate all checkpoints from an agent run |
| `/test-ambiguity-detector` | `/test-ambiguity-detector <problem> <checkpoint>` | Find ambiguous test assumptions |
- `src/slop_code/agent_runner/agent.py` - Agent protocol definition
- `src/slop_code/execution/session.py` - Execution session management
- `src/slop_code/evaluation/pytest_runner.py` - Pytest execution orchestration
- `src/slop_code/entrypoints/problem_runner/driver.py` - Problem execution driver

- `docs/execution/README.md` - Execution architecture deep dive
- `docs/evaluation/README.md` - Evaluation system guide
- `docs/problems/tutorial.md` - Problem creation tutorial

- `docs/evaluation-tests/README.md` - How pytest evaluation works
- `docs/evaluation-tests/conftest-patterns.md` - Fixture patterns (session, module, factory)
- `docs/evaluation-tests/markers.md` - Test categorization (CORE, FUNCTIONALITY, ERROR, REGRESSION)
- `docs/evaluation-tests/test-data.md` - Organizing test data (inline vs external)
- `docs/evaluation-tests/runner-internals.md` - Technical reference for PytestRunner
- `docs/evaluation-tests/fixtures-advanced.md` - Advanced patterns (expensive resources, composition)
- `docs/evaluation-tests/stateful-testing.md` - State across checkpoints and modules
- `docs/evaluation-tests/complex-parametrization.md` - Loading cases from JSON, YAML, directories
- `docs/evaluation-tests/debugging-workflows.md` - Debugging test failures and workflows