StaminaBench: Stress-Testing Coding Agents over 100 Interaction Turns

StaminaBench is a framework for evaluating LLM-powered coding agents on iterative software engineering tasks. The primary benchmark, Iterative REST Server Generation, asks an agent to implement a REST API from a natural-language specification, runs a test suite against the agent's server, feeds back failures, and iterates. After each turn the schema evolves (new entities, renamed fields, new guard conditions, etc.) and the agent must keep the server consistent with the updated spec.

How it fits together

There are three moving parts:

- Scenario data: a deterministic schema (entities, fields, actions, analytics) plus, per turn, a natural-language spec, a pytest suite, and a ground-truth Flask server. It is generated either programmatically (offline, no LLM) or via an LLM (richer, requires Bedrock).
- Agent harness: a thin Python wrapper around a CLI (Mini-SWE, OpenHands, OpenCode, …) that talks to the agent over stdin/stdout, records a trajectory, and runs the agent's command inside a Docker container.
- Evaluation loop: for each scenario, hand the agent the spec, the tests, and a working dir; let it produce a server; run the test suite; and, if tests fail, feed the failures back and retry up to attempt_limit times. Once the turn passes or the attempts run out, evolve the schema and move to the next turn, until n_turns turns have run.

The score for a turn is passed_tests / total_tests; turn scores are averaged within each scenario, and scenario scores are then averaged across scenarios.
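The control flow of the evaluation loop can be sketched roughly as follows. This is a minimal illustration, not the harness code: `run_scenario`, `TurnResult`, and the `run_attempt` callback are hypothetical names, and resetting the feedback at the start of each turn is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class TurnResult:
    turn: int
    attempts: int  # attempts actually used in this turn
    passed: int    # tests passed on the final attempt
    total: int     # tests in this turn's suite


# Hypothetical callback: (turn, attempt, feedback) -> (passed, total, new_feedback)
AttemptFn = Callable[[int, int, str], Tuple[int, int, str]]


def run_scenario(n_turns: int, attempt_limit: int, run_attempt: AttemptFn) -> List[TurnResult]:
    results = []
    for turn in range(n_turns):
        feedback = ""  # assumption: feedback is reset at the start of each turn
        passed = total = attempts = 0
        for attempts in range(1, attempt_limit + 1):
            passed, total, feedback = run_attempt(turn, attempts, feedback)
            if passed == total:
                break  # turn passed, stop retrying
        results.append(TurnResult(turn, attempts, passed, total))
        # In the real harness the schema would evolve here before the next turn.
    return results


def fake_attempt(turn: int, attempt: int, feedback: str) -> Tuple[int, int, str]:
    """Toy agent: turn t needs t + 1 attempts before all 10 tests pass."""
    passed = 10 if attempt >= turn + 1 else 7
    return passed, 10, ("" if passed == 10 else "3 tests failed")


results = run_scenario(n_turns=3, attempt_limit=2, run_attempt=fake_attempt)
for r in results:
    print(r)
```

With an attempt limit of 2, the toy agent passes the first two turns and runs out of attempts on the third, so the loop records the final attempt's 7/10 and moves on.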

Supported Agents

Agent	Module	Docker image	Dockerfile
Mini-SWE	`staminabench/agents/mini_swe_agent.py`	`agent-benchmarking:mini`	`staminabench/agents/docker/Dockerfile.mini`
OpenHands	`staminabench/agents/openhands_agent.py`	`agent-benchmarking:openhands`	`staminabench/agents/docker/Dockerfile.openhands`
OpenCode	`staminabench/agents/opencode_agent.py`	`agent-benchmarking:opencode`	`staminabench/agents/docker/Dockerfile.opencode`
Qwen Code	`staminabench/agents/qwen_code_agent.py`	user-supplied	—
Kimi CLI	`staminabench/agents/kimi_cli_agent.py`	user-supplied	—
Vibe	`staminabench/agents/vibe_agent.py`	user-supplied	—
Mock (testing)	`staminabench/agents/mock_agent.py`	—	—

All CLI agents share the template-method runner in staminabench/agents/cli_agent.py. Adding a new agent means subclassing CLIBasedAgent and implementing _build_command, _parse_response, and _should_retry.
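To illustrate the template-method shape, here is a self-contained sketch. The base class below is a stand-in written for this example, not the real CLIBasedAgent: the three hook names come from the text above, but their signatures, the `step` method, and `max_retries` are assumptions, so check cli_agent.py for the actual interface.

```python
import subprocess
import sys
from abc import ABC, abstractmethod
from typing import List


class CLIBasedAgentSketch(ABC):
    """Stand-in for CLIBasedAgent; every signature here is an assumption."""

    max_retries = 2  # hypothetical knob

    def step(self, prompt: str) -> str:
        # Template method: the skeleton is fixed, the hooks vary per agent.
        for _ in range(self.max_retries + 1):
            proc = subprocess.run(
                self._build_command(prompt), capture_output=True, text=True
            )
            if not self._should_retry(proc.returncode, proc.stdout):
                break
        return self._parse_response(proc.stdout)

    @abstractmethod
    def _build_command(self, prompt: str) -> List[str]: ...

    @abstractmethod
    def _parse_response(self, stdout: str) -> str: ...

    @abstractmethod
    def _should_retry(self, returncode: int, stdout: str) -> bool: ...


class EchoAgent(CLIBasedAgentSketch):
    """Toy agent whose 'CLI' is a Python one-liner that echoes the prompt."""

    def _build_command(self, prompt: str) -> List[str]:
        script = "import sys; print('REPLY: ' + sys.argv[1])"
        return [sys.executable, "-c", script, prompt]

    def _parse_response(self, stdout: str) -> str:
        return stdout.strip().removeprefix("REPLY: ")

    def _should_retry(self, returncode: int, stdout: str) -> bool:
        # Retry on a non-zero exit code or an empty reply.
        return returncode != 0 or not stdout.strip()


reply = EchoAgent().step("implement GET /items")
print(reply)
```

A real subclass would build the agent's actual CLI invocation in `_build_command` and extract the agent's answer from its output format in `_parse_response`.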

Setup

1. Python environment

The project is managed with uv. Install uv (curl -LsSf https://astral.sh/uv/install.sh | sh), then from the repo root:

uv sync --extra dev

That creates a .venv/ with everything pinned by uv.lock. Run anything through uv run:

uv run python -m staminabench.run_eval ...
uv run pytest tests/

Some transitive deps (notably numpy) compile native extensions — uv's bundled Python ships its own libc, but you still need a C/C++ toolchain on the host (gcc ≥ 9.3, libc headers). On stock Ubuntu 22.04+ / Debian 12+ / macOS with Xcode CLT this is already there; older distros may need attention.

2. AWS / Bedrock

The default model backend is Bedrock. You need:

- an AWS profile with Bedrock access (aws configure / ~/.aws/credentials), or an EC2 instance role with Bedrock permissions
- the model IDs you plan to use, enabled in the AWS console under Bedrock → Model access.
- Default used by this repo: zai.glm-5 (Z.ai GLM-5 on Bedrock).
- Region: us-east-1 (configurable via AWS_REGION).

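For long-running jobs on EC2, prefer the instance role and unset AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN. Credentials exported in the environment take precedence over the instance role, so a leftover session token would otherwise expire mid-run.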

If you only want to run the programmatic benchmark (no LLM-sampled scenarios) you don't need Bedrock — the benchmark generator runs entirely offline.

3. Docker images

Each CLI agent runs inside its own container. All required CLIs are installed from the public registries (PyPI / npm); the build context is just src/staminabench/agents/docker/ and nothing outside the repo is needed.

From the repo root:

bash src/staminabench/agents/docker/build.sh                 # build all three images
bash src/staminabench/agents/docker/build.sh mini            # or build just one
bash src/staminabench/agents/docker/build.sh mini opencode   # or a subset
bash src/staminabench/agents/docker/test.sh                  # smoke-test all built images

Resulting images (~1.5–1.8 GB each):

Tag	CLI	Source
`agent-benchmarking:mini`	`mini`	`mini-swe-agent` on PyPI
`agent-benchmarking:opencode`	`opencode`	`opencode-ai` on npm
`agent-benchmarking:openhands`	`openhands`	`openhands` on PyPI

Pass the tag as docker.image_name=agent-benchmarking:<tag> when running staminabench.run_eval.

For Qwen Code, Kimi CLI, and Vibe, supply your own image: those upstreams don't publish a PyPI/npm package that installs cleanly in a base Ubuntu container.

Env vars forwarded into the container

The harness automatically forwards the host's AWS credential chain into the container so Bedrock works inside: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN, AWS_BEARER_TOKEN_BEDROCK, AWS_REGION, AWS_PROFILE. Anything else needs to be added explicitly via docker.env_vars or docker.extra_docker_args.
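One plausible way to implement this kind of forwarding is to emit a `-e NAME` flag for each listed variable that is set on the host. The sketch below is an illustration under that assumption, not the harness code; `docker_env_args` and its `extra` parameter are hypothetical names.

```python
import os
from typing import List, Mapping, Optional, Sequence

# Variable names taken from the list above.
FORWARDED_AWS_VARS = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_SESSION_TOKEN",
    "AWS_BEARER_TOKEN_BEDROCK",
    "AWS_REGION",
    "AWS_PROFILE",
)


def docker_env_args(
    env: Optional[Mapping[str, str]] = None, extra: Sequence[str] = ()
) -> List[str]:
    """Build `docker run` flags for every forwarded variable that is set."""
    env = os.environ if env is None else env
    args: List[str] = []
    for name in (*FORWARDED_AWS_VARS, *extra):
        if name in env:
            # `-e NAME` (no value) makes docker copy NAME from the environment
            # of the docker client process, keeping secrets off the command line.
            args += ["-e", name]
    return args


host_env = {"AWS_REGION": "us-east-1", "AWS_PROFILE": "bench", "HOME": "/home/me"}
print(docker_env_args(host_env))
print(docker_env_args(host_env, extra=["HOME"]))
```

Variables that are unset on the host are skipped, so the container does not receive empty values that could shadow other credential sources.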

Host UID mismatch

The container runs with -u $(id -u):$(id -g) so files written into the mounted workspace are owned by the host user. If your host UID isn't in the image's /etc/passwd (common on shared dev machines / EC2 with custom UIDs), tools inside the container can't find a home directory and fail to write their config (e.g. OpenCode failing on ~/.local, OpenHands on its persistence dir).

Workaround: run as root inside the container by passing extra docker args:

docker.extra_docker_args="-u 0:0 -e HOME=/root -e XDG_CONFIG_HOME=/root/.config -e XDG_DATA_HOME=/root/.local/share"

The OpenCode image already ships its config under both /home/ubuntu/.config/opencode/ and /root/.config/opencode/ so this workaround works without a rebuild.
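Whether the workaround is needed can be checked ahead of time by looking for the host UID in the image's passwd file. The helper below is a hypothetical sketch: it assumes you have already captured the file's text (for example with `docker run --rm <image> cat /etc/passwd`) and only decides which extra args to pass.

```python
ROOT_FALLBACK = (
    "-u 0:0 -e HOME=/root "
    "-e XDG_CONFIG_HOME=/root/.config -e XDG_DATA_HOME=/root/.local/share"
)


def uid_in_passwd(passwd_text: str, uid: int) -> bool:
    """Return True if any passwd entry has the given numeric UID."""
    for line in passwd_text.splitlines():
        fields = line.split(":")
        # passwd format: name:password:UID:GID:gecos:home:shell
        if len(fields) >= 3 and fields[2] == str(uid):
            return True
    return False


def extra_docker_args_for(passwd_text: str, uid: int) -> str:
    """Use the root fallback only when the UID is unknown inside the image."""
    return "" if uid_in_passwd(passwd_text, uid) else ROOT_FALLBACK


# Sample passwd content, invented for this example.
image_passwd = (
    "root:x:0:0:root:/root:/bin/bash\n"
    "ubuntu:x:1000:1000:Ubuntu:/home/ubuntu:/bin/bash\n"
)
print(repr(extra_docker_args_for(image_passwd, 1000)))
print(repr(extra_docker_args_for(image_passwd, 54321)))
```

On the host, `os.getuid()` gives the UID to test; the returned string is what you would pass as docker.extra_docker_args.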

Running an Evaluation

Runs are config-driven via OmegaConf, with two input channels: --configs (YAML files) and --values (dotlist overrides).
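To make the dotlist idea concrete, the stdlib-only sketch below mimics how `key.subkey=value` overrides layer on top of a base config. It is an approximation for illustration and not how OmegaConf is implemented; in particular, parsing values as JSON is an assumption, and `apply_dotlist` is a hypothetical name.

```python
import copy
import json
from typing import Any, Dict, Iterable


def parse_value(raw: str) -> Any:
    """Interpret '100' as int and 'true' as bool; fall back to the raw string."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return raw


def apply_dotlist(base: Dict[str, Any], dotlist: Iterable[str]) -> Dict[str, Any]:
    """Return a copy of `base` with each 'a.b.c=value' override applied."""
    merged = copy.deepcopy(base)
    for item in dotlist:
        key, _, raw = item.partition("=")
        *parents, leaf = key.split(".")
        node = merged
        for part in parents:
            node = node.setdefault(part, {})  # create nested sections on demand
        node[leaf] = parse_value(raw)
    return merged


base = {
    "agent": "MOCK",
    "restful_server_gen_iterative": {"n_scenarios": 10, "n_turns": 3},
}
overrides = [
    "agent=MINI_SWE",
    "restful_server_gen_iterative.n_turns=100",
    "use_docker=true",
]
config = apply_dotlist(base, overrides)
print(config)
```

Overrides touch only the keys they name, so `n_scenarios` keeps its base value while `n_turns` is replaced.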

Quick smoke test (mock agent, no Docker)

uv run python -m staminabench.run_eval --values \
    agent=MOCK \
    benchmark=RESTFUL_SERVER_GEN_ITERATIVE \
    output_dir=/tmp/smoke \
    quick_debug=true

quick_debug=true overrides the iterative benchmark to 2 scenarios × 2 turns. Without it, the defaults are 10 scenarios × 3 turns × 5 changes/turn × 10 attempts (see staminabench/framework/config.py).

Real run

uv run python -m staminabench.run_eval --values \
    agent=MINI_SWE \
    benchmark=RESTFUL_SERVER_GEN_ITERATIVE \
    output_dir=/path/to/results \
    restful_server_gen_iterative.n_scenarios=20 \
    restful_server_gen_iterative.n_turns=100 \
    restful_server_gen_iterative.attempt_limit=10 \
    restful_server_gen_iterative.base_seed=42 \
    mini_swe.model_id="bedrock/zai.glm-5" \
    docker.image_name=agent-benchmarking:mini \
    batch_size=5 \
    use_docker=true

Rough timing

Useful for capacity-planning a sweep:

Phase	Approx. wall time
Programmatic scenario gen	~30 s per scenario per 10 turns
LLM-sampled scenario gen	~30 s per scenario per turn (depends heavily on `num_changes_per_turn`)
Agent eval per turn (real CLI on Bedrock)	1–10 min, dominated by the agent and its retries

batch_size=N parallelises across scenarios with a ThreadPoolExecutor; the practical limit is the number of Docker containers your host can run concurrently.
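The pattern, together with a back-of-envelope wall-time estimate built from the table above, might look like this. It is a sketch under stated assumptions: `run_scenarios`, `estimate_minutes`, and `fake_scenario` are hypothetical names, and the estimate assumes scenarios finish in even waves, which a thread pool does not guarantee.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def run_scenarios(
    ids: Sequence[int], run_one: Callable[[int], str], batch_size: int = 1
) -> List[str]:
    # At most `batch_size` scenarios are in flight; map() keeps input order.
    with ThreadPoolExecutor(max_workers=batch_size) as pool:
        return list(pool.map(run_one, ids))


def estimate_minutes(
    n_scenarios: int, n_turns: int, batch_size: int, minutes_per_turn: float
) -> float:
    # Scenarios run in ceil(n / batch) waves; turns within a scenario are sequential.
    return math.ceil(n_scenarios / batch_size) * n_turns * minutes_per_turn


def fake_scenario(scenario_id: int) -> str:
    time.sleep(0.01)  # stands in for a long agent run inside a container
    return f"scenario_{scenario_id}: done"


results = run_scenarios(range(4), fake_scenario, batch_size=2)
low = estimate_minutes(20, 100, 5, 1)    # 1 min per turn
high = estimate_minutes(20, 100, 5, 10)  # 10 min per turn
print(results, low, high)
```

For 20 scenarios × 100 turns at batch_size=5, this gives between 400 and 4000 minutes of agent time, before counting scenario generation.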

Key config fields

Top-level (EvalConfig in staminabench/run_eval.py):

Field	Purpose
`agent`	One of `MINI_SWE`, `OPENHANDS`, `OPENCODE`, `QWEN_CODE`, `KIMI_CLI`, `VIBE`, `MOCK`
`benchmark`	`RESTFUL_SERVER_GEN` (single-turn) or `RESTFUL_SERVER_GEN_ITERATIVE` (multi-turn)
`output_dir`	Where trajectories, logs, and reports are written
`use_docker`	Run the agent inside a container (usually `true`; `false` for the mock agent)
`batch_size`	Parallel scenarios (1 = sequential)
`quick_debug`	Override to 2 scenarios × 2 turns for smoke testing
`scenarios_dir`	Load pre-generated scenarios instead of generating on the fly

Iterative-benchmark fields live under restful_server_gen_iterative.*. The interesting ones (with their defaults):

Field	Default	Meaning
`n_scenarios`	10	Number of independent scenarios
`n_turns`	3	Schema-evolution turns per scenario
`num_changes_per_turn`	5	Schema changes between turn N and turn N+1
`attempt_limit`	10	Max retries within a turn before moving on
`base_seed`	42	Deterministic sampling root
`enable_feedback`	`True`	Show test-failure details to the agent on retry
`test_fail_feedback_type`	`DETAILED`	One of `DETAILED`, `MEDIUM`, `MINIMAL`
`use_llm_sampler`	`False`	If `True`, use Bedrock to fill schema changes; otherwise fully programmatic (no AWS needed)
`llm_sampler_model_id`	`zai.glm-5`	Bedrock model for the sampler when `use_llm_sampler=True`
`execute_episodes`	`None`	e.g. `"1,5,10"` to run a subset

What success looks like

After the run finishes, output_dir/ contains:

output_dir/
├── config.yaml              # Snapshot of the resolved config
├── report.json              # Top-level summary (success_rate, average_score, per-scenario results)
├── scenario_0/
│   ├── trajectory.json      # Full agent transcript + per-step token/cost data
│   ├── agent_code/          # Snapshots of the agent's working dir per turn/attempt
│   ├── ground_truth/        # Reference Flask servers, one per turn
│   ├── test_suites/         # Generated pytest suites, one per turn
│   ├── schemas/             # IR YAMLs per turn
│   ├── server_logs/         # stdout/stderr from the agent's running server
│   ├── test_results/        # Raw pytest output per attempt
│   ├── agent_logs/step_N.txt  # Verbatim CLI step logs
│   └── changes/             # Natural-language change descriptions per turn
├── scenario_1/
└── …

report.json looks roughly like:

{
  "agent": "mini_swe",
  "benchmark": "restful_server_gen_iterative",
  "success_rate": 0.85,
  "average_score": 0.92,
  "num_tasks": 20,
  "results": [...],
  "trajectory_summaries": [...]
}
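A short script can pull the headline numbers out of the report. The sketch below assumes the key names shown in the example above, which the text describes only as approximate; `summarize_report` is a hypothetical helper, and the demo writes a stand-in report to a temporary directory.

```python
import json
import tempfile
from pathlib import Path


def summarize_report(output_dir: str) -> str:
    """Format the top-level fields of output_dir/report.json as one line."""
    report = json.loads((Path(output_dir) / "report.json").read_text())
    return (
        f"{report['agent']} on {report['benchmark']}: "
        f"success_rate={report['success_rate']:.2f}, "
        f"average_score={report['average_score']:.2f}, "
        f"tasks={report['num_tasks']}"
    )


# Demo with a stand-in report mirroring the example above.
with tempfile.TemporaryDirectory() as tmp:
    sample = {
        "agent": "mini_swe",
        "benchmark": "restful_server_gen_iterative",
        "success_rate": 0.85,
        "average_score": 0.92,
        "num_tasks": 20,
        "results": [],
        "trajectory_summaries": [],
    }
    (Path(tmp) / "report.json").write_text(json.dumps(sample))
    summary = summarize_report(tmp)

print(summary)
```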

For per-attempt timing, error classification, and a markdown report:

uv run python -m staminabench.results_analysis.analyze_trajectories_with_attempts /path/to/results

Pre-Generating Scenarios

For sweeps it's worth generating scenario data once and reusing it across agents. There are two paths:

Programmatic (no AWS, fast)

Fully deterministic from base_seed. Schemas are sampled by RESTSpecSampler and changes by the random-weighted generator in schema_gen/changes/generator.py.

uv run python -m staminabench.generate_scenarios --values \
    benchmark=RESTFUL_SERVER_GEN_ITERATIVE \
    output_dir=/path/to/scenarios \
    restful_server_gen_iterative.use_llm_sampler=false \
    restful_server_gen_iterative.n_scenarios=20 \
    restful_server_gen_iterative.n_turns=100 \
    restful_server_gen_iterative.base_seed=42
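The property that matters here is that the same base_seed always yields the same sequence of changes. A minimal sketch of seeded, weighted sampling is shown below; the change-type names, the weights, and the way the seed is derived per scenario and turn are all invented for illustration and are not taken from generator.py.

```python
import random
from typing import List

# Hypothetical change types and weights, for illustration only.
CHANGE_WEIGHTS = {
    "add_entity": 3,
    "rename_field": 2,
    "add_guard_condition": 1,
}


def sample_changes(base_seed: int, scenario: int, turn: int, num_changes: int) -> List[str]:
    """Draw `num_changes` change types, reproducibly for a given seed."""
    # A private Random instance avoids touching the global RNG state.
    rng = random.Random(f"{base_seed}:{scenario}:{turn}")
    kinds = list(CHANGE_WEIGHTS)
    weights = [CHANGE_WEIGHTS[k] for k in kinds]
    return rng.choices(kinds, weights=weights, k=num_changes)


first = sample_changes(base_seed=42, scenario=0, turn=1, num_changes=5)
again = sample_changes(base_seed=42, scenario=0, turn=1, num_changes=5)
print(first)
print(first == again)
```

Deriving a separate seed for each scenario and turn means any single turn can be regenerated without replaying the ones before it, which is one reason to prefer this over a single shared generator.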

LLM-sampled (Bedrock, richer domains)

Each schema starts from a domain topic drawn from domain_topics.txt, and the LLM fills in realistic entity/field/action names. The change sampler picks change types programmatically, then asks the LLM to fill in the structured details.

unset AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN  # use instance role

uv run python -m staminabench.generate_scenarios --values \
    benchmark=RESTFUL_SERVER_GEN_ITERATIVE \
    output_dir=/path/to/scenarios \
    restful_server_gen_iterative.use_llm_sampler=true \
    restful_server_gen_iterative.n_scenarios=20 \
    restful_server_gen_iterative.n_turns=100 \
    restful_server_gen_iterative.num_changes_per_turn=5 \
    restful_server_gen_iterative.base_seed=42

Then point evaluations at the resulting directory:

uv run python -m staminabench.run_eval --values \
    agent=MINI_SWE \
    benchmark=RESTFUL_SERVER_GEN_ITERATIVE \
    scenarios_dir=/path/to/scenarios \
    output_dir=/path/to/results

Pre-generated data used in the paper

The exact scenario datasets used for the experiments are shipped in data/:

File	Sampler	Scenarios × turns	Notes
`data/programmatic.zip`	Programmatic (no LLM)	20 × 100	`base_seed=0`, fully deterministic
`data/llm.zip`	LLM-sampled	20 × 100	`base_seed=42`, model `us.anthropic.claude-haiku-4-5-20251001-v1:0`

The accompanying data/generation_config_*.yaml files record the full generation config for each. Unzip and point scenarios_dir at the extracted directory:

unzip data/programmatic.zip -d /path/to/scenarios
uv run python -m staminabench.run_eval --values \
    agent=MINI_SWE \
    benchmark=RESTFUL_SERVER_GEN_ITERATIVE \
    scenarios_dir=/path/to/scenarios \
    output_dir=/path/to/results

Layout

.
├── pyproject.toml                        # Packaging (src layout, name = staminabench)
├── configs/                              # Example YAML configs
├── tests/                                # Unit + integration tests
└── src/staminabench/                     # Importable package
    ├── run_eval.py                       # Entry point (python -m staminabench.run_eval)
    ├── generate_scenarios.py             # Pre-generate scenario data
    ├── configs.py, agent.py, benchmark.py, environment.py, user.py, utils.py
    ├── docker_environment.py             # Docker-based environment
    ├── data_structures.py, trajectory_info.py
    │
    ├── agents/                           # Agent CLI wrappers
    │   └── docker/                       # Dockerfiles and build scripts
    │
    ├── framework/                        # Benchmark orchestration
    │   ├── interfaces.py                 # SpecSampler, ChangeSampler, etc.
    │   ├── iterative_benchmark.py        # Multi-turn loop, turn evaluator
    │   ├── scenario.py                   # ScenarioStaticData, BenchmarkGenerator
    │   └── config.py                     # IterativeBenchmarkConfig
    │
    ├── benchmarks/
    │   └── restful_server_generation/
    │       ├── iterative_benchmark.py    # Main benchmark (multi-turn)
    │       ├── benchmark.py              # Single-turn benchmark
    │       ├── rest_benchmark.py         # Domain implementations
    │       ├── llm_samplers.py           # LLM-driven spec + change sampling
    │       ├── schema_gen/               # IR, change types, codegen, test gen
    │       └── templates/                # Flask code templates
    │
    └── results_analysis/                 # Scripts for analysing trajectories

Tests

# Fast unit tests (no Docker)
uv run pytest tests/ --ignore=tests/test_docker_environment.py

# Full suite (requires Docker images)
uv run pytest tests/

Scoring

Each scenario produces a trajectory with per-turn scores:

- Per-turn score: passed_tests / total_tests on the final attempt of the turn
- Per-scenario score: mean of per-turn scores
- Overall score: mean across scenarios
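The three levels of aggregation can be written out as a short worked example. This is a sketch of the arithmetic only, not the canonical analysis script: the function names are hypothetical, and scoring a turn with zero tests as 0.0 is an assumption.

```python
from statistics import mean
from typing import Dict, List, Tuple

# (passed_tests, total_tests) on the final attempt of each turn.
Turn = Tuple[int, int]


def turn_score(passed: int, total: int) -> float:
    return passed / total if total else 0.0  # assumption: empty suite scores 0


def scenario_score(turns: List[Turn]) -> float:
    return mean(turn_score(p, t) for p, t in turns)


def overall_score(scenarios: Dict[str, List[Turn]]) -> float:
    return mean(scenario_score(turns) for turns in scenarios.values())


runs = {
    "scenario_0": [(10, 10), (8, 10)],  # turn scores 1.0 and 0.8
    "scenario_1": [(5, 10), (10, 10)],  # turn scores 0.5 and 1.0
}
per_scenario = {name: scenario_score(turns) for name, turns in runs.items()}
overall = overall_score(runs)
print(per_scenario, overall)
```

Because each scenario is averaged first, a scenario counts equally toward the overall score regardless of how many tests its suites contain.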

See staminabench/results_analysis/analyze_trajectories_with_attempts.py for the canonical analysis script.

This code is being released solely for academic and scientific reproducibility purposes, in support of the methods and findings described in the associated publication. Pull requests are not being accepted in order to maintain the code exactly as it was used in the paper.
