A coding agent that edits real repositories with a sandboxed workspace tool layer (read, write, search, run), persists every step into a JSON trace with token accounting, and renders runs on a live dashboard.
$ python3 -m agentops.cli run \
--scenario "Refactor duplicated coupon validation across web and API checkout layers" \
--workspace .scratch --runs-dir runs --backend dryrun

Every screenshot above is generated from a real agentops CLI run, not hardcoded. The dashboard reads runs/*.json produced by the agent.
- Python coding agent with a ReAct-style tool-use loop
- Workspace tools: list_files, read_file, search_code, write_file, run_cmd
- LLM client supporting OpenAI, Anthropic, and a deterministic dry-run backend
- Token accounting persisted into each run trace
- Example checkout codebase with duplicated coupon validation that the agent refactors into a shared module
- Tests covering both the tool layer and an end-to-end agent run
- Static dashboard that loads runs/*.json produced by the agent
agent-workbench/
├── agentops/ # Python package
│ ├── agent.py # ReAct loop
│ ├── cli.py # `agentops run`, `agentops report`
│ ├── llm.py # OpenAI / Anthropic / dryrun backends
│ ├── tools.py # Workspace tool layer
│ └── trace.py # Run trace persistence
├── examples/checkout/ # Toy production-style codebase
│ ├── api_checkout.py
│ ├── web_checkout.py
│ └── tests/test_existing.py
├── runs/ # Real trace JSON + manifest
├── tests/ # Pytest for agent + tools
├── web/ # Dashboard (HTML/CSS/JS, no build step)
├── pyproject.toml
└── README.md
# 1. Install pytest for development (optional)
pip install -e ".[test]"
# 2. Run the agent on a copy of the example codebase (offline, no API key)
cp -r examples/checkout .scratch
python3 -m agentops.cli run \
--scenario "Refactor duplicated coupon validation across web and API checkout layers" \
--workspace .scratch \
--runs-dir runs \
--backend dryrun
# 3. Confirm the existing checkout tests still pass after the agent refactor
cd .scratch && python3 -m pytest -q tests && cd ..
# 4. View the dashboard (renders runs/*.json)
python3 -m http.server 4173
open http://localhost:4173/web/

The dry-run backend is deterministic and offline. To run the agent against a real model, set the matching API key and pick a backend.
export OPENAI_API_KEY=sk-...
python3 -m agentops.cli run \
--scenario "Refactor duplicated coupon validation" \
--workspace .scratch \
--backend openai \
--model gpt-4o-mini
export ANTHROPIC_API_KEY=sk-ant-...
python3 -m agentops.cli run \
--scenario "Refactor duplicated coupon validation" \
--workspace .scratch \
--backend anthropic \
--model claude-3-5-sonnet-latest

Token usage from real backends is recorded in each run JSON and shown in the dashboard.
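
As a rough illustration of how the recorded token usage could be totalled outside the dashboard, here is a minimal sketch. The trace schema is an assumption: the `steps` and `usage` keys and the `prompt_tokens` / `completion_tokens` field names are hypothetical and may not match what `agentops` actually writes. Only the runs/run-*.json file pattern comes from this README.

```python
import json
from pathlib import Path
from typing import Dict, Iterable, List


def load_traces(runs_dir: str) -> List[dict]:
    """Load every run trace in runs_dir (the manifest is skipped by the pattern)."""
    paths = sorted(Path(runs_dir).glob("run-*.json"))
    return [json.loads(p.read_text(encoding="utf-8")) for p in paths]


def sum_usage(traces: Iterable[dict]) -> Dict[str, int]:
    """Add up token counts across all steps of all traces.

    Assumed (hypothetical) shape of one trace:
        {"steps": [{"usage": {"prompt_tokens": 10, "completion_tokens": 4}}]}
    Steps without usage data, as a dry run might produce, count as zero.
    """
    totals = {"prompt_tokens": 0, "completion_tokens": 0}
    for trace in traces:
        for step in trace.get("steps", []):
            usage = step.get("usage") or {}
            for key in totals:
                totals[key] += int(usage.get(key, 0))
    return totals
```

Calling `sum_usage(load_traces("runs"))` would then give one total per token type. Adjust the key names to the real trace format before relying on the numbers.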
python3 -m pytest -q

Covers:
- Workspace tool boundaries (path escape protection, file IO, search, list)
- End-to-end dry-run agent run on the example codebase
- Existing checkout tests still pass after the agent refactor
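
The path escape protection mentioned in the list above can be sketched as follows. This is not the code in agentops/tools.py; `resolve_in_workspace` and `WorkspaceError` are hypothetical names, and the approach shown (resolve the path, then check it is still under the workspace root) is one common way to enforce such a boundary.

```python
from pathlib import Path


class WorkspaceError(Exception):
    """Raised when a requested path falls outside the workspace root."""


def resolve_in_workspace(root: Path, relative: str) -> Path:
    """Resolve `relative` under `root`, rejecting anything that escapes it."""
    root = root.resolve()
    # resolve() collapses ".." segments and follows symlinks, so the check
    # below sees the location the file operation would actually touch.
    candidate = (root / relative).resolve()
    # is_relative_to (Python 3.9+) compares whole path components, so a
    # sibling such as "/work-evil" is not mistaken for a child of "/work".
    if not candidate.is_relative_to(root):
        raise WorkspaceError(f"path escapes workspace: {relative}")
    return candidate
```

With this shape, `resolve_in_workspace(Path(".scratch"), "web_checkout.py")` returns an absolute path inside the workspace, while `"../secrets.txt"` or an absolute path such as `"/etc/passwd"` raises `WorkspaceError`.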
The toy codebase intentionally duplicates coupon validation across web and API layers. The agent:
- Scans the repository
- Searches for the duplicated validator
- Reads each implementation
- Writes a shared coupon_validator module
- Patches both call sites to use it
- Runs the existing pytest suite to verify behavior is preserved
This mirrors a real engineering task: identify duplication, extract a shared module, keep behavior stable, and verify with tests.
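
The step sequence above follows the usual ReAct pattern: choose an action, run the tool, record the observation, repeat. The sketch below shows that loop with a scripted stand-in for the model, in the spirit of the deterministic dry-run backend. It is an illustration under assumptions rather than the contents of agentops/agent.py: `run_agent`, `scripted_backend`, and the action dictionary shape are hypothetical.

```python
from typing import Any, Callable, Dict, List

# Hypothetical action shape: {"tool": name, "args": {...}}.
# An action of {"tool": "finish"} ends the run.
Action = Dict[str, Any]


def run_agent(next_action: Callable[[List[dict]], Action],
              tools: Dict[str, Callable[..., str]],
              max_steps: int = 10) -> List[dict]:
    """Alternate between choosing an action and observing its result."""
    history: List[dict] = []
    for _ in range(max_steps):
        action = next_action(history)
        if action["tool"] == "finish":
            break
        observation = tools[action["tool"]](**action.get("args", {}))
        history.append({"action": action, "observation": observation})
    return history


def scripted_backend(script: List[Action]) -> Callable[[List[dict]], Action]:
    """Deterministic stand-in for a model: replays a fixed list of actions."""
    def next_action(history: List[dict]) -> Action:
        # One action is consumed per recorded step; finish when the script ends.
        if len(history) < len(script):
            return script[len(history)]
        return {"tool": "finish"}
    return next_action
```

A script such as `[{"tool": "list_files"}, {"tool": "read_file", "args": {"path": "web_checkout.py"}}]`, paired with a `tools` dictionary that maps those names to callables, yields a two-entry history. A real backend would replace `scripted_backend` with a call to a model that picks the next action from the history.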
- runs/manifest.json and runs/run-*.json are real artifacts you can attach to an application form as proof of agent execution.
- The dashboard panels (steps, tokens, scenario summary, run history) read directly from those JSON files. No mock data.
- For production-grade evidence, run the agent with --backend openai or --backend anthropic and submit the resulting trace plus billing screenshot.
MIT License. See LICENSE for details.

