amazon-science/StaminaBench

StaminaBench: Stress-Testing Coding Agents over 100 Interaction Turns

StaminaBench is a framework for evaluating LLM-powered coding agents on iterative software engineering tasks. The primary benchmark, Iterative REST Server Generation, asks an agent to implement a REST API from a natural-language specification, runs a test suite against the agent's server, feeds back failures, and iterates. After each turn the schema evolves (new entities, renamed fields, new guard conditions, etc.) and the agent must keep the server consistent with the updated spec.

How it fits together

There are three moving parts:

  1. Scenario data — a deterministic schema (entities, fields, actions, analytics) plus, per turn, a natural-language spec, a pytest suite, and a ground-truth Flask server. Generated either programmatically (offline, no LLM) or via LLM (richer, requires Bedrock).
  2. Agent harness — a thin Python wrapper around a CLI (Mini-SWE, OpenHands, OpenCode, …) that talks to the agent over stdin/stdout, records a trajectory, and runs the agent's command inside a Docker container.
  3. Evaluation loop — for each scenario, hand the agent the spec + tests + a working dir, let it produce a server, run the test suite, feed failures back, and either move on after the turn passes or retry up to attempt_limit times. Then evolve the schema and repeat for n_turns.

Scoring is passed_tests / total_tests on the final attempt of each turn, averaged across the turns of a scenario and then across scenarios.
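The evaluation loop described above can be sketched in a few lines. This is a minimal illustration, not the code in staminabench/framework/iterative_benchmark.py: `SuiteResult`, `run_scenario`, and the two callables `run_agent` and `run_tests` are hypothetical names introduced here, and the sketch assumes the failure list from one attempt is what gets fed back on the next.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SuiteResult:
    """Outcome of one test-suite run (hypothetical structure)."""
    passed: int
    total: int
    failures: List[str]

    @property
    def score(self) -> float:
        return self.passed / self.total if self.total else 0.0


def run_scenario(
    n_turns: int,
    attempt_limit: int,
    run_agent: Callable[[int, List[str]], None],
    run_tests: Callable[[int], SuiteResult],
) -> List[float]:
    """Return one score per turn, taken from the final attempt of that turn."""
    if attempt_limit < 1:
        raise ValueError("attempt_limit must be at least 1")
    turn_scores = []
    for turn in range(n_turns):
        feedback: List[str] = []              # nothing to report on the first attempt
        for _attempt in range(attempt_limit):
            run_agent(turn, feedback)         # agent writes or edits its server
            result = run_tests(turn)          # run this turn's test suite
            if result.passed == result.total:
                break                         # turn passes, move on
            feedback = result.failures        # feed failures back and retry
        turn_scores.append(result.score)
        # In the real benchmark the schema evolves here, so the next turn
        # comes with an updated spec and an updated test suite.
    return turn_scores
```

Passing the agent and test runner in as callables keeps the control flow visible without committing to any particular harness or Docker setup.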

Supported Agents

| Agent | Module | Docker image | Dockerfile |
| --- | --- | --- | --- |
| Mini-SWE | staminabench/agents/mini_swe_agent.py | agent-benchmarking:mini | staminabench/agents/docker/Dockerfile.mini |
| OpenHands | staminabench/agents/openhands_agent.py | agent-benchmarking:openhands | staminabench/agents/docker/Dockerfile.openhands |
| OpenCode | staminabench/agents/opencode_agent.py | agent-benchmarking:opencode | staminabench/agents/docker/Dockerfile.opencode |
| Qwen Code | staminabench/agents/qwen_code_agent.py | user-supplied | |
| Kimi CLI | staminabench/agents/kimi_cli_agent.py | user-supplied | |
| Vibe | staminabench/agents/vibe_agent.py | user-supplied | |
| Mock (testing) | staminabench/agents/mock_agent.py | | |

All CLI agents share the template-method runner in staminabench/agents/cli_agent.py. Adding a new agent means subclassing CLIBasedAgent and implementing _build_command, _parse_response, and _should_retry.
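As a rough picture of what such a subclass might look like, the sketch below defines its own stand-in base class so that it runs on its own. Only the three hook names come from this README; the stand-in class, the `step` method, the `max_retries` attribute, every argument and return type, and the `my-agent` CLI are assumptions made for illustration. Check staminabench/agents/cli_agent.py for the real signatures.

```python
from abc import ABC, abstractmethod
from typing import Callable, List


class CLIBasedAgentStandIn(ABC):
    """Stand-in for CLIBasedAgent. NOT the real class; signatures are assumed."""

    max_retries = 2  # hypothetical attribute

    def step(self, prompt: str, run: Callable[[List[str]], str]) -> str:
        # Template method: the base class owns the control flow and
        # delegates the agent-specific pieces to the three hooks.
        for attempt in range(self.max_retries + 1):
            raw = run(self._build_command(prompt))
            if not self._should_retry(raw, attempt):
                break
        return self._parse_response(raw)

    @abstractmethod
    def _build_command(self, prompt: str) -> List[str]: ...

    @abstractmethod
    def _parse_response(self, raw: str) -> str: ...

    @abstractmethod
    def _should_retry(self, raw: str, attempt: int) -> bool: ...


class MyAgent(CLIBasedAgentStandIn):
    """Hypothetical wrapper around an imaginary `my-agent` CLI."""

    def _build_command(self, prompt: str) -> List[str]:
        return ["my-agent", "--prompt", prompt]

    def _parse_response(self, raw: str) -> str:
        return raw.strip()

    def _should_retry(self, raw: str, attempt: int) -> bool:
        # Retry only when the CLI produced no output at all.
        return raw.strip() == "" and attempt < self.max_retries
```

The point of the template-method shape is that retry and bookkeeping logic live in one place, while each agent contributes only command construction, output parsing, and its own retry condition.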

Setup

1. Python environment

The project is managed with uv. Install uv (curl -LsSf https://astral.sh/uv/install.sh | sh), then from the repo root:

uv sync --extra dev

That creates a .venv/ with everything pinned by uv.lock. Run anything through uv run:

uv run python -m staminabench.run_eval ...
uv run pytest tests/

Some transitive deps (notably numpy) compile native extensions — uv's bundled Python ships its own libc, but you still need a C/C++ toolchain on the host (gcc ≥ 9.3, libc headers). On stock Ubuntu 22.04+ / Debian 12+ / macOS with Xcode CLT this is already there; older distros may need attention.

2. AWS / Bedrock

The default model backend is Bedrock. You need:

  • an AWS profile with Bedrock access (aws configure / ~/.aws/credentials), or an EC2 instance role with Bedrock permissions
  • the model IDs you plan to use enabled in the AWS console under Bedrock → Model access.
    • Default used by this repo: zai.glm-5 (Z.ai GLM-5 on Bedrock).
    • Region: us-east-1 (configurable via AWS_REGION).

For long-running jobs on EC2, prefer the instance role: unset AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_SESSION_TOKEN in your shell so the run does not depend on a temporary session token that can expire mid-run.

If you only want to run the programmatic benchmark (no LLM-sampled scenarios) you don't need Bedrock — the benchmark generator runs entirely offline.

3. Docker images

Each CLI agent runs inside its own container. All required CLIs are installed from the public registries (PyPI / npm); the build context is just src/staminabench/agents/docker/ and nothing outside the repo is needed.

From the repo root:

bash src/staminabench/agents/docker/build.sh                 # build all three images
bash src/staminabench/agents/docker/build.sh mini            # or build just one
bash src/staminabench/agents/docker/build.sh mini opencode   # or a subset
bash src/staminabench/agents/docker/test.sh                  # smoke-test all built images

Resulting images (~1.5–1.8 GB each):

| Tag | CLI | Source |
| --- | --- | --- |
| agent-benchmarking:mini | mini | mini-swe-agent on PyPI |
| agent-benchmarking:opencode | opencode | opencode-ai on npm |
| agent-benchmarking:openhands | openhands | openhands on PyPI |

Pass the tag as docker.image_name=agent-benchmarking:<tag> when running staminabench.run_eval.

For Qwen Code, Kimi CLI, and Vibe, supply your own image; those upstreams don't publish a PyPI/npm package that installs cleanly in a base Ubuntu container.

Env vars forwarded into the container

The harness automatically forwards the host's AWS credential chain into the container so Bedrock works inside: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN, AWS_BEARER_TOKEN_BEDROCK, AWS_REGION, AWS_PROFILE. Anything else needs to be added explicitly via docker.env_vars or docker.extra_docker_args.
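To make the forwarding concrete, here is a minimal sketch of how a list of `docker run -e NAME=VALUE` arguments could be assembled from the host environment. The function name `docker_env_args` and the `extra` mapping (standing in for whatever docker.env_vars holds) are assumptions; the harness's actual implementation may differ.

```python
import os
from typing import List, Mapping, Optional

# Variables the README lists as forwarded automatically.
FORWARDED_VARS = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_SESSION_TOKEN",
    "AWS_BEARER_TOKEN_BEDROCK",
    "AWS_REGION",
    "AWS_PROFILE",
)


def docker_env_args(
    host_env: Optional[Mapping[str, str]] = None,
    extra: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Build `-e NAME=VALUE` arguments for `docker run` (illustrative only)."""
    env = os.environ if host_env is None else host_env
    # Forward only the listed variables that are set and non-empty on the host.
    merged = {name: env[name] for name in FORWARDED_VARS if env.get(name)}
    # Anything else has to be requested explicitly.
    merged.update(extra or {})
    args: List[str] = []
    for name, value in merged.items():
        args += ["-e", f"{name}={value}"]
    return args
```

Unlisted host variables such as HOME are deliberately dropped here, which mirrors the rule that anything outside the AWS credential chain must be added explicitly.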

Host UID mismatch

The container runs with -u $(id -u):$(id -g) so files written into the mounted workspace are owned by the host user. If your host UID isn't in the image's /etc/passwd (common on shared dev machines / EC2 with custom UIDs), tools inside the container can't find a home directory and fail to write their config (e.g. OpenCode failing on ~/.local, OpenHands on its persistence dir).

Workaround: run as root inside the container by passing extra docker args:

docker.extra_docker_args="-u 0:0 -e HOME=/root -e XDG_CONFIG_HOME=/root/.config -e XDG_DATA_HOME=/root/.local/share"

The OpenCode image already ships its config under both /home/ubuntu/.config/opencode/ and /root/.config/opencode/ so this workaround works without a rebuild.

Running an Evaluation

Evaluation is config-driven via OmegaConf, with two input channels: --configs (YAML files) and --values (dotlist overrides).
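For readers unfamiliar with dotlists, the following standard-library sketch shows what an override such as `restful_server_gen_iterative.n_turns=100` means structurally: the dotted key names a path into the nested config. It is a simplified stand-in, not OmegaConf itself. The `apply_dotlist` helper is hypothetical, and unlike OmegaConf it leaves every value as a string instead of inferring types.

```python
from typing import Any, Dict, List


def apply_dotlist(config: Dict[str, Any], overrides: List[str]) -> Dict[str, Any]:
    """Apply `a.b.c=value` overrides to a nested dict, in place (illustrative)."""
    for item in overrides:
        dotted_key, _, raw_value = item.partition("=")
        *parents, leaf = dotted_key.split(".")
        node = config
        for part in parents:
            node = node.setdefault(part, {})   # create missing sections on the way
        node[leaf] = raw_value                 # later overrides win
    return config


# Values that might come from a YAML file passed via --configs.
cfg = {"agent": "MOCK", "restful_server_gen_iterative": {"n_turns": 3}}
apply_dotlist(cfg, ["agent=MINI_SWE", "restful_server_gen_iterative.n_turns=100"])
```

After the call, `cfg["agent"]` is `"MINI_SWE"` and the nested `n_turns` entry holds the string `"100"`.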

Quick smoke test (mock agent, no Docker)

uv run python -m staminabench.run_eval --values \
    agent=MOCK \
    benchmark=RESTFUL_SERVER_GEN_ITERATIVE \
    output_dir=/tmp/smoke \
    quick_debug=true

quick_debug=true overrides the iterative benchmark to 2 scenarios × 2 turns. Without it, the defaults are 10 scenarios × 3 turns × 5 changes/turn × 10 attempts (see staminabench/framework/config.py).

Real run

uv run python -m staminabench.run_eval --values \
    agent=MINI_SWE \
    benchmark=RESTFUL_SERVER_GEN_ITERATIVE \
    output_dir=/path/to/results \
    restful_server_gen_iterative.n_scenarios=20 \
    restful_server_gen_iterative.n_turns=100 \
    restful_server_gen_iterative.attempt_limit=10 \
    restful_server_gen_iterative.base_seed=42 \
    mini_swe.model_id="bedrock/zai.glm-5" \
    docker.image_name=agent-benchmarking:mini \
    batch_size=5 \
    use_docker=true

Rough timing

Useful for capacity-planning a sweep:

| Phase | Approx. wall time |
| --- | --- |
| Programmatic scenario gen | ~30 s per scenario per 10 turns |
| LLM-sampled scenario gen | ~30 s per scenario per turn (depends heavily on num_changes_per_turn) |
| Agent eval per turn (real CLI on Bedrock) | 1–10 min, dominated by the agent and its retries |
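These figures can be turned into a back-of-the-envelope estimate for a sweep. The sketch below assumes that the 1–10 min range already includes retries within a turn, that scenarios run in waves of batch_size (the number of scenarios evaluated in parallel), and that parallel scenarios do not slow each other down; `eval_wall_time_hours` is a hypothetical helper, not part of the package.

```python
import math


def eval_wall_time_hours(
    n_scenarios: int, n_turns: int, batch_size: int, minutes_per_turn: float
) -> float:
    """Rough wall-clock estimate for the agent-eval phase, in hours."""
    waves = math.ceil(n_scenarios / batch_size)   # batches run one after another
    return waves * n_turns * minutes_per_turn / 60


# 20 scenarios x 100 turns with batch_size=5, at 1 and 10 minutes per turn.
low = eval_wall_time_hours(20, 100, 5, 1)
high = eval_wall_time_hours(20, 100, 5, 10)
print(f"{low:.1f} to {high:.1f} hours")   # prints "6.7 to 66.7 hours"
```

The order-of-magnitude spread is the useful part: a 20 × 100 run is a multi-hour to multi-day job depending on how often the agent needs retries.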

batch_size=N parallelises across scenarios with a ThreadPoolExecutor; the practical limit is the number of Docker containers your host can run concurrently.
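The parallelism pattern itself is small. A minimal sketch, assuming each scenario can be evaluated by an independent callable (`run_one` and `run_scenarios_parallel` are hypothetical names, not functions from the package):

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def run_scenarios_parallel(
    scenario_ids: List[int],
    run_one: Callable[[int], float],
    batch_size: int,
) -> List[float]:
    """Evaluate scenarios with at most `batch_size` running at once."""
    with ThreadPoolExecutor(max_workers=batch_size) as pool:
        # map() returns results in the order of scenario_ids,
        # regardless of which scenario finishes first.
        return list(pool.map(run_one, scenario_ids))
```

Threads are a reasonable fit here because each worker mostly waits on a Docker container and a remote model, not on local CPU.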

Key config fields

Top-level (EvalConfig in staminabench/run_eval.py):

| Field | Purpose |
| --- | --- |
| agent | One of MINI_SWE, OPENHANDS, OPENCODE, QWEN_CODE, KIMI_CLI, VIBE, MOCK |
| benchmark | RESTFUL_SERVER_GEN (single-turn) or RESTFUL_SERVER_GEN_ITERATIVE (multi-turn) |
| output_dir | Where trajectories, logs, and reports are written |
| use_docker | Run the agent inside a container (usually true; false for the mock agent) |
| batch_size | Parallel scenarios (1 = sequential) |
| quick_debug | Override to 2 scenarios × 2 turns for smoke testing |
| scenarios_dir | Load pre-generated scenarios instead of generating on the fly |

Iterative-benchmark fields live under restful_server_gen_iterative.*. The interesting ones (with their defaults):

| Field | Default | Meaning |
| --- | --- | --- |
| n_scenarios | 10 | Number of independent scenarios |
| n_turns | 3 | Schema-evolution turns per scenario |
| num_changes_per_turn | 5 | Schema changes between turn N and turn N+1 |
| attempt_limit | 10 | Max retries within a turn before moving on |
| base_seed | 42 | Deterministic sampling root |
| enable_feedback | True | Show test-failure details to the agent on retry |
| test_fail_feedback_type | DETAILED | One of DETAILED, MEDIUM, MINIMAL |
| use_llm_sampler | False | If True, use Bedrock to fill schema changes; otherwise fully programmatic (no AWS needed) |
| llm_sampler_model_id | zai.glm-5 | Bedrock model for the sampler when use_llm_sampler=True |
| execute_episodes | None | e.g. "1,5,10" to run a subset |

What success looks like

After the run finishes, output_dir/ contains:

output_dir/
├── config.yaml              # Snapshot of the resolved config
├── report.json              # Top-level summary (success_rate, average_score, per-scenario results)
├── scenario_0/
│   ├── trajectory.json      # Full agent transcript + per-step token/cost data
│   ├── agent_code/          # Snapshots of the agent's working dir per turn/attempt
│   ├── ground_truth/        # Reference Flask servers, one per turn
│   ├── test_suites/         # Generated pytest suites, one per turn
│   ├── schemas/             # IR YAMLs per turn
│   ├── server_logs/         # stdout/stderr from the agent's running server
│   ├── test_results/        # Raw pytest output per attempt
│   ├── agent_logs/step_N.txt  # Verbatim CLI step logs
│   └── changes/             # Natural-language change descriptions per turn
├── scenario_1/
└── …

report.json looks roughly like:

{
  "agent": "mini_swe",
  "benchmark": "restful_server_gen_iterative",
  "success_rate": 0.85,
  "average_score": 0.92,
  "num_tasks": 20,
  "results": [...],
  "trajectory_summaries": [...]
}
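A quick way to inspect a finished run is to load report.json and print the headline numbers. The sketch below relies only on the top-level keys shown in the example above; `summarize_report` is a hypothetical helper, and the contents of `results` and `trajectory_summaries` are left untouched because their structure is not documented here.

```python
import json
from pathlib import Path


def summarize_report(output_dir: str) -> str:
    """Return a one-line summary of output_dir/report.json."""
    report = json.loads((Path(output_dir) / "report.json").read_text())
    return (
        f"{report['agent']} on {report['benchmark']}: "
        f"success_rate={report['success_rate']:.2f}, "
        f"average_score={report['average_score']:.2f} "
        f"over {report['num_tasks']} tasks"
    )
```

Calling `summarize_report("/path/to/results")` on a directory with the example report would yield a line starting with `mini_swe on restful_server_gen_iterative`.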

For per-attempt timing, error classification, and a markdown report:

uv run python -m staminabench.results_analysis.analyze_trajectories_with_attempts /path/to/results

Pre-Generating Scenarios

For sweeps it's worth generating scenario data once and reusing it across agents. There are two paths:

Programmatic (no AWS, fast)

Fully deterministic from base_seed. Schemas are sampled by RESTSpecSampler and changes by the random-weighted generator in schema_gen/changes/generator.py.

uv run python -m staminabench.generate_scenarios --values \
    benchmark=RESTFUL_SERVER_GEN_ITERATIVE \
    output_dir=/path/to/scenarios \
    restful_server_gen_iterative.use_llm_sampler=false \
    restful_server_gen_iterative.n_scenarios=20 \
    restful_server_gen_iterative.n_turns=100 \
    restful_server_gen_iterative.base_seed=42
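The determinism claim boils down to seeding every random draw from base_seed. The sketch below illustrates the idea with a seeded, weighted choice over change types. It is not the generator in schema_gen/changes/generator.py: the change-type names, the weights, and the way the seed is derived per scenario and turn are all assumptions.

```python
import random
from typing import List

# Hypothetical change types and weights, for illustration only.
CHANGE_WEIGHTS = {
    "add_entity": 1.0,
    "rename_field": 2.0,
    "add_guard_condition": 1.0,
}


def sample_changes(base_seed: int, scenario: int, turn: int, num_changes: int) -> List[str]:
    """Draw change types reproducibly for one (scenario, turn) pair."""
    # A private Random instance keeps the draw independent of global state;
    # the string seed combines the run-level seed with the position in the run.
    rng = random.Random(f"{base_seed}:{scenario}:{turn}")
    kinds = list(CHANGE_WEIGHTS)
    weights = list(CHANGE_WEIGHTS.values())
    return rng.choices(kinds, weights=weights, k=num_changes)
```

Because the generator state depends only on its arguments, calling `sample_changes(42, 0, 1, 5)` twice returns the same list, which is what makes pre-generated scenario data reusable across agents.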

LLM-sampled (Bedrock, richer domains)

The schema sampler picks a domain topic from domain_topics.txt and the LLM fills in realistic entity/field/action names; the change sampler picks change types programmatically, then asks the LLM to fill in the structured details.

unset AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN  # use instance role

uv run python -m staminabench.generate_scenarios --values \
    benchmark=RESTFUL_SERVER_GEN_ITERATIVE \
    output_dir=/path/to/scenarios \
    restful_server_gen_iterative.use_llm_sampler=true \
    restful_server_gen_iterative.n_scenarios=20 \
    restful_server_gen_iterative.n_turns=100 \
    restful_server_gen_iterative.num_changes_per_turn=5 \
    restful_server_gen_iterative.base_seed=42

Then point evaluations at the resulting directory:

uv run python -m staminabench.run_eval --values \
    agent=MINI_SWE \
    benchmark=RESTFUL_SERVER_GEN_ITERATIVE \
    scenarios_dir=/path/to/scenarios \
    output_dir=/path/to/results

Pre-generated data used in the paper

The exact scenario datasets used for the experiments are shipped in data/:

| File | Sampler | Scenarios × turns | Notes |
| --- | --- | --- | --- |
| data/programmatic.zip | Programmatic (no LLM) | 20 × 100 | base_seed=0, fully deterministic |
| data/llm.zip | LLM-sampled | 20 × 100 | base_seed=42, model us.anthropic.claude-haiku-4-5-20251001-v1:0 |

The accompanying data/generation_config_*.yaml files record the full generation config for each. Unzip and point scenarios_dir at the extracted directory:

unzip data/programmatic.zip -d /path/to/scenarios
uv run python -m staminabench.run_eval --values \
    agent=MINI_SWE \
    benchmark=RESTFUL_SERVER_GEN_ITERATIVE \
    scenarios_dir=/path/to/scenarios \
    output_dir=/path/to/results

Layout

.
├── pyproject.toml                        # Packaging (src layout, name = staminabench)
├── configs/                              # Example YAML configs
├── tests/                                # Unit + integration tests
└── src/staminabench/                     # Importable package
    ├── run_eval.py                       # Entry point (python -m staminabench.run_eval)
    ├── generate_scenarios.py             # Pre-generate scenario data
    ├── configs.py, agent.py, benchmark.py, environment.py, user.py, utils.py
    ├── docker_environment.py             # Docker-based environment
    ├── data_structures.py, trajectory_info.py
    │
    ├── agents/                           # Agent CLI wrappers
    │   └── docker/                       # Dockerfiles and build scripts
    │
    ├── framework/                        # Benchmark orchestration
    │   ├── interfaces.py                 # SpecSampler, ChangeSampler, etc.
    │   ├── iterative_benchmark.py        # Multi-turn loop, turn evaluator
    │   ├── scenario.py                   # ScenarioStaticData, BenchmarkGenerator
    │   └── config.py                     # IterativeBenchmarkConfig
    │
    ├── benchmarks/
    │   └── restful_server_generation/
    │       ├── iterative_benchmark.py    # Main benchmark (multi-turn)
    │       ├── benchmark.py              # Single-turn benchmark
    │       ├── rest_benchmark.py         # Domain implementations
    │       ├── llm_samplers.py           # LLM-driven spec + change sampling
    │       ├── schema_gen/               # IR, change types, codegen, test gen
    │       └── templates/                # Flask code templates
    │
    └── results_analysis/                 # Scripts for analysing trajectories

Tests

# Fast unit tests (no Docker)
uv run pytest tests/ --ignore=tests/test_docker_environment.py

# Full suite (requires Docker images)
uv run pytest tests/

Scoring

Each scenario produces a trajectory with per-turn scores:

  • Per-turn score: passed_tests / total_tests on the final attempt of the turn
  • Per-scenario score: mean of per-turn scores
  • Overall score: mean across scenarios
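The three levels of averaging can be written out directly. In this sketch each scenario is a list of (passed_tests, total_tests) pairs taken from the final attempt of each turn; the function name and the input shape are illustrative assumptions, not the format used by the analysis script.

```python
from statistics import mean
from typing import List, Tuple


def overall_score(scenarios: List[List[Tuple[int, int]]]) -> float:
    """Mean over scenarios of the mean per-turn pass ratio."""
    scenario_scores = []
    for turns in scenarios:
        turn_scores = [passed / total for passed, total in turns]
        scenario_scores.append(mean(turn_scores))
    return mean(scenario_scores)


# Scenario A: turns score 0.8 and 1.0, so its mean is 0.9.
# Scenario B: a single turn scoring 0.5.
example = [[(8, 10), (10, 10)], [(5, 10)]]
score = overall_score(example)   # (0.9 + 0.5) / 2, i.e. about 0.7
```

Note that this is a mean of means: every scenario counts equally, so the result (about 0.7) differs from pooling all tests together (23 / 30, about 0.77).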

See staminabench/results_analysis/analyze_trajectories_with_attempts.py for the canonical analysis script.


This code is being released solely for academic and scientific reproducibility purposes, in support of the methods and findings described in the associated publication. Pull requests are not being accepted in order to maintain the code exactly as it was used in the paper.
