AgentOps Lab

A coding agent that edits real repositories with a sandboxed workspace tool layer (read, write, search, run), persists every step into a JSON trace with token accounting, and renders runs on a live dashboard.

$ python3 -m agentops.cli run \
    --scenario "Refactor duplicated coupon validation across web and API checkout layers" \
    --workspace .scratch --runs-dir runs --backend dryrun

Every screenshot above is generated from a real agentops CLI run, not hardcoded. The dashboard reads runs/*.json produced by the agent.

What is real here

Python coding agent with a ReAct-style tool-use loop
Workspace tools: list_files, read_file, search_code, write_file, run_cmd
LLM client supporting OpenAI, Anthropic, and a deterministic dry-run backend
Token accounting persisted into each run trace
Example checkout codebase with duplicated coupon validation that the agent refactors into a shared module
Tests covering both the tool layer and an end-to-end agent run
Static dashboard that loads runs/*.json produced by the agent
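
As a rough illustration of the ReAct-style tool-use loop listed above, here is a minimal sketch. It is not the code in agentops/agent.py. The in-memory file table, the scripted_model stand-in, and the action dictionary shape are all assumptions made for this example. Only the tool names list_files and read_file come from this README.

```python
from typing import Callable

# Hypothetical in-memory workspace; the real tool layer works on files on disk.
FILES = {"web_checkout.py": "def validate_coupon(code): ...\n"}


def list_files() -> str:
    return "\n".join(sorted(FILES))


def read_file(path: str) -> str:
    return FILES.get(path, f"error: {path} not found")


# Tool registry keyed by the name the model asks for.
TOOLS: dict[str, Callable[..., str]] = {
    "list_files": list_files,
    "read_file": read_file,
}


def scripted_model(history: list[dict]) -> dict:
    """Deterministic stand-in for an LLM: the next action depends only on the step count."""
    script = [
        {"tool": "list_files", "args": {}},
        {"tool": "read_file", "args": {"path": "web_checkout.py"}},
        {"final": "Found validate_coupon in web_checkout.py"},
    ]
    return script[min(len(history), len(script) - 1)]


def run_agent(model: Callable[[list[dict]], dict], max_steps: int = 8) -> tuple[str, list[dict]]:
    """Ask the model for an action, run the tool, feed the observation back, repeat."""
    history: list[dict] = []
    for _ in range(max_steps):
        action = model(history)
        if "final" in action:
            return action["final"], history
        tool = TOOLS.get(action["tool"])
        if tool is None:
            observation = f"error: unknown tool {action['tool']}"
        else:
            observation = tool(**action["args"])
        history.append({"action": action, "observation": observation})
    return "stopped: step budget exhausted", history


answer, steps = run_agent(scripted_model)
```

The step budget (max_steps) is a common guard against a model that never emits a final answer. Whether the real agent uses one is not stated here.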

Project layout

agent-workbench/
├── agentops/                  # Python package
│   ├── agent.py               # ReAct loop
│   ├── cli.py                 # `agentops run`, `agentops report`
│   ├── llm.py                 # OpenAI / Anthropic / dryrun backends
│   ├── tools.py               # Workspace tool layer
│   └── trace.py               # Run trace persistence
├── examples/checkout/         # Toy production-style codebase
│   ├── api_checkout.py
│   ├── web_checkout.py
│   └── tests/test_existing.py
├── runs/                      # Real trace JSON + manifest
├── tests/                     # Pytest for agent + tools
├── web/                       # Dashboard (HTML/CSS/JS, no build step)
├── pyproject.toml
└── README.md

Quickstart

# 1. Install the package in editable mode with the test extras (optional, adds pytest)
pip install -e ".[test]"

# 2. Run the agent on a copy of the example codebase (offline, no API key)
cp -r examples/checkout .scratch
python3 -m agentops.cli run \
  --scenario "Refactor duplicated coupon validation across web and API checkout layers" \
  --workspace .scratch \
  --runs-dir runs \
  --backend dryrun

# 3. Confirm the existing checkout tests still pass after the agent refactor
(cd .scratch && python3 -m pytest -q tests)

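# 4. Serve the dashboard (renders runs/*.json), then open it in a browser.
#    http.server runs in the foreground, so open the URL from a second terminal
#    (`open` on macOS, `xdg-open` on Linux) or paste it into the browser.
python3 -m http.server 4173
open http://localhost:4173/web/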

Use a real LLM backend

The dry-run backend is deterministic and offline. To run the agent against a real model, set the matching API key and pick a backend.

export OPENAI_API_KEY=sk-...
python3 -m agentops.cli run \
  --scenario "Refactor duplicated coupon validation" \
  --workspace .scratch \
  --backend openai \
  --model gpt-4o-mini

export ANTHROPIC_API_KEY=sk-ant-...
python3 -m agentops.cli run \
  --scenario "Refactor duplicated coupon validation" \
  --workspace .scratch \
  --backend anthropic \
  --model claude-3-5-sonnet-latest

Token usage from real backends is recorded in each run JSON and shown in the dashboard.
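
To make the token accounting concrete, the sketch below sums per-step usage from a run trace. The trace schema is not documented in this README, so the field names (steps, usage, prompt_tokens, completion_tokens) and the sample values are assumptions for illustration only. Check an actual runs/run-*.json file for the real layout.

```python
import json


def total_tokens(trace: dict) -> dict:
    """Sum token usage over all steps; steps without usage (e.g. tool calls) count as zero."""
    totals = {"prompt_tokens": 0, "completion_tokens": 0}
    for step in trace.get("steps", []):
        usage = step.get("usage") or {}
        for key in totals:
            totals[key] += int(usage.get(key, 0))
    totals["total_tokens"] = totals["prompt_tokens"] + totals["completion_tokens"]
    return totals


# Hypothetical trace with made-up numbers, shaped like the assumed schema.
example_trace = json.loads("""
{
  "scenario": "Refactor duplicated coupon validation",
  "steps": [
    {"tool": "list_files", "usage": {"prompt_tokens": 120, "completion_tokens": 30}},
    {"tool": "read_file"},
    {"tool": "write_file", "usage": {"prompt_tokens": 200, "completion_tokens": 45}}
  ]
}
""")

summary = total_tokens(example_trace)
```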

Tests

python3 -m pytest -q

Covers:

Workspace tool boundaries (path escape protection, file IO, search, list)
End-to-end dry-run agent run on the example codebase
Existing checkout tests still pass after the agent refactor
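
Path escape protection, the first item above, is commonly implemented by resolving the requested path and checking that it stays under the workspace root. The sketch below shows one way to do that. The names resolve_in_workspace and WorkspaceError are hypothetical, and the actual check in agentops/tools.py may differ. Path.is_relative_to requires Python 3.9 or later.

```python
from pathlib import Path


class WorkspaceError(ValueError):
    """Raised when a requested path would leave the workspace root."""


def resolve_in_workspace(root: str, relative: str) -> Path:
    """Return the absolute path for `relative`, refusing anything outside `root`."""
    base = Path(root).resolve()
    # resolve() collapses ".." segments and follows symlinks before the check.
    # Joining with an absolute path discards `base`, so that case is rejected too.
    target = (base / relative).resolve()
    if not target.is_relative_to(base):
        raise WorkspaceError(f"{relative!r} escapes workspace {str(base)!r}")
    return target


inside = resolve_in_workspace(".scratch", "tests/test_existing.py")

try:
    resolve_in_workspace(".scratch", "../README.md")
    escaped = True
except WorkspaceError:
    escaped = False
```

Resolving before comparing matters: a plain string prefix check would accept a path such as `.scratch/../README.md`.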

What the example demonstrates

The toy codebase intentionally duplicates coupon validation across web and API layers. The agent:

Scans the repository
Searches for the duplicated validator
Reads each implementation
Writes a shared coupon_validator module
Patches both call sites to use it
Runs the existing pytest suite to verify behavior is preserved
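
The end state of the steps above can be pictured with the sketch below: one shared validator and two thin call sites. The validation rule and the function names web_apply_coupon and api_apply_coupon are invented for illustration. The real rules live in examples/checkout and are not reproduced in this README.

```python
# coupon_validator.py (shared module; the rule below is a placeholder assumption)
def validate_coupon(code: str) -> bool:
    """Illustrative rule only: non-empty, alphanumeric, at most 12 characters."""
    normalized = code.strip().upper()
    return bool(normalized) and normalized.isalnum() and len(normalized) <= 12


# web_checkout.py would import validate_coupon instead of keeping its own copy.
def web_apply_coupon(code: str) -> str:
    return "applied" if validate_coupon(code) else "rejected"


# api_checkout.py would do the same, so both layers share one definition.
def api_apply_coupon(code: str) -> dict:
    return {"ok": validate_coupon(code)}
```

With a single definition, a rule change lands in one place, and the existing tests for both layers exercise the same code path.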

This mirrors a real engineering task: identify duplication, extract a shared module, keep behavior stable, and verify with tests.

Notes for application use

runs/manifest.json and runs/run-*.json are real artifacts you can attach to an application form as proof of agent execution.
The dashboard panels (steps, tokens, scenario summary, run history) read directly from those JSON files. No mock data.
For production-grade evidence, run the agent with --backend openai or --backend anthropic and submit the resulting trace plus billing screenshot.

License

MIT License. See LICENSE for details.
