hirarano/agent-workbench

AgentOps Lab

A coding agent that edits real repositories through a sandboxed workspace tool layer (read, write, search, run), persists every step to a JSON trace with token accounting, and renders runs on a live dashboard.

[Screenshot: AgentOps Lab dashboard]

$ python3 -m agentops.cli run \
    --scenario "Refactor duplicated coupon validation across web and API checkout layers" \
    --workspace .scratch --runs-dir runs --backend dryrun

[Screenshot: Agent CLI run]

Every screenshot above is generated from a real agentops CLI run, not hardcoded. The dashboard reads runs/*.json produced by the agent.

What is real here

  • Python coding agent with a ReAct-style tool-use loop
  • Workspace tools: list_files, read_file, search_code, write_file, run_cmd
  • LLM client supporting OpenAI, Anthropic, and a deterministic dry-run backend
  • Token accounting persisted into each run trace
  • Example checkout codebase with duplicated coupon validation that the agent refactors into a shared module
  • Tests covering both the tool layer and an end-to-end agent run
  • Static dashboard that loads runs/*.json produced by the agent
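
To illustrate the first three items, here is a minimal sketch of a ReAct-style tool-use loop driven by a scripted, deterministic backend. It is not the code in agentops/agent.py: the names `ScriptedBackend` and `run_agent`, the action dictionary shape, and the in-memory tools are all assumptions made for this example.

```python
from typing import Callable, Dict, List


class ScriptedBackend:
    """Hypothetical deterministic backend: replays a fixed list of actions."""

    def __init__(self, actions: List[dict]) -> None:
        self._actions = list(actions)

    def next_action(self, history: List[dict]) -> dict:
        # A real backend would send `history` to a model; this one ignores it.
        if self._actions:
            return self._actions.pop(0)
        return {"tool": "finish", "args": {}}


def run_agent(backend, tools: Dict[str, Callable[..., str]], max_steps: int = 10) -> List[dict]:
    """Ask for an action, run the matching tool, record the observation, repeat."""
    history: List[dict] = []
    for _ in range(max_steps):
        action = backend.next_action(history)
        if action["tool"] == "finish":
            break
        tool = tools.get(action["tool"])
        if tool is None:
            observation = f"unknown tool: {action['tool']}"
        else:
            observation = tool(**action["args"])
        history.append({"action": action, "observation": observation})
    return history


# In-memory stand-ins for two workspace tools (signatures are assumed).
files = {"web_checkout.py": "def validate(code): ..."}


def read_file(path: str) -> str:
    return files.get(path, f"not found: {path}")


def write_file(path: str, content: str) -> str:
    files[path] = content
    return f"wrote {len(content)} chars to {path}"


tools = {"read_file": read_file, "write_file": write_file}
script = [
    {"tool": "read_file", "args": {"path": "web_checkout.py"}},
    {"tool": "write_file", "args": {"path": "coupon_validator.py", "content": "# shared"}},
]
trace = run_agent(ScriptedBackend(script), tools)
for step in trace:
    print(step["action"]["tool"], "->", step["observation"])
```

Because the backend only replays a script, repeated runs produce the same history, which is the property that makes an offline dry-run mode useful for tests.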

Project layout

agent-workbench/
├── agentops/                  # Python package
│   ├── agent.py               # ReAct loop
│   ├── cli.py                 # `agentops run`, `agentops report`
│   ├── llm.py                 # OpenAI / Anthropic / dryrun backends
│   ├── tools.py               # Workspace tool layer
│   └── trace.py               # Run trace persistence
├── examples/checkout/         # Toy production-style codebase
│   ├── api_checkout.py
│   ├── web_checkout.py
│   └── tests/test_existing.py
├── runs/                      # Real trace JSON + manifest
├── tests/                     # Pytest for agent + tools
├── web/                       # Dashboard (HTML/CSS/JS, no build step)
├── pyproject.toml
└── README.md

Quickstart

# 1. Install the package in editable mode with the test extras (optional; pulls in pytest)
pip install -e ".[test]"

# 2. Run the agent on a copy of the example codebase (offline, no API key)
cp -r examples/checkout .scratch
python3 -m agentops.cli run \
  --scenario "Refactor duplicated coupon validation across web and API checkout layers" \
  --workspace .scratch \
  --runs-dir runs \
  --backend dryrun

# 3. Confirm the existing checkout tests still pass after the agent refactor
cd .scratch && python3 -m pytest -q tests && cd ..

# 4. View the dashboard (renders runs/*.json); the server blocks, so open the URL from a second terminal
python3 -m http.server 4173
open http://localhost:4173/web/   # macOS; on Linux, use xdg-open

Use a real LLM backend

The dry-run backend is deterministic and offline. To run the agent against a real model, set the matching API key and pick a backend.

export OPENAI_API_KEY=sk-...
python3 -m agentops.cli run \
  --scenario "Refactor duplicated coupon validation" \
  --workspace .scratch \
  --backend openai \
  --model gpt-4o-mini

export ANTHROPIC_API_KEY=sk-ant-...
python3 -m agentops.cli run \
  --scenario "Refactor duplicated coupon validation" \
  --workspace .scratch \
  --backend anthropic \
  --model claude-3-5-sonnet-latest

Token usage from real backends is recorded in each run JSON and shown in the dashboard.
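
A minimal sketch of how per-step token usage could be accumulated and written into a run trace. The `RunTrace` and `Step` classes and the field names (`steps`, `prompt_tokens`, `completion_tokens`, `totals`) are assumptions for illustration; the actual schema is whatever agentops/trace.py writes. The token counts below are made-up inputs, not measurements.

```python
import json
import tempfile
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import List


@dataclass
class Step:
    tool: str
    prompt_tokens: int
    completion_tokens: int


@dataclass
class RunTrace:
    scenario: str
    backend: str
    steps: List[Step] = field(default_factory=list)

    def add_step(self, tool: str, prompt_tokens: int, completion_tokens: int) -> None:
        self.steps.append(Step(tool, prompt_tokens, completion_tokens))

    def totals(self) -> dict:
        prompt = sum(s.prompt_tokens for s in self.steps)
        completion = sum(s.completion_tokens for s in self.steps)
        return {
            "prompt_tokens": prompt,
            "completion_tokens": completion,
            "total_tokens": prompt + completion,
        }

    def save(self, path: Path) -> None:
        # asdict() converts the nested Step dataclasses to plain dicts.
        payload = asdict(self)
        payload["totals"] = self.totals()
        path.write_text(json.dumps(payload, indent=2), encoding="utf-8")


run = RunTrace(scenario="Refactor duplicated coupon validation", backend="dryrun")
run.add_step("search_code", prompt_tokens=120, completion_tokens=30)
run.add_step("write_file", prompt_tokens=200, completion_tokens=80)

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "run-example.json"
    run.save(out)
    saved = json.loads(out.read_text(encoding="utf-8"))

print(saved["totals"])
```

Storing the totals next to the per-step counts lets a reader of the file check one against the other without rerunning the agent.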

Tests

python3 -m pytest -q

The suite covers:

  • Workspace tool boundaries (path escape protection, file IO, search, list)
  • End-to-end dry-run agent run on the example codebase
  • Existing checkout tests still pass after the agent refactor
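
The path escape protection mentioned in the first item can be sketched as follows. This is one common way to confine file access to a workspace root using `pathlib`, offered as an assumption about the general technique rather than a copy of agentops/tools.py; `resolve_in_workspace` and `WorkspaceEscapeError` are hypothetical names.

```python
import tempfile
from pathlib import Path


class WorkspaceEscapeError(ValueError):
    """Raised when a requested path resolves outside the workspace root."""


def resolve_in_workspace(root: Path, relative: str) -> Path:
    """Resolve `relative` under `root`, rejecting anything that lands outside it."""
    root = root.resolve()
    candidate = (root / relative).resolve()
    try:
        # relative_to raises ValueError unless candidate is inside root.
        candidate.relative_to(root)
    except ValueError:
        raise WorkspaceEscapeError(f"{relative!r} escapes the workspace") from None
    return candidate


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    allowed = resolve_in_workspace(root, "pkg/module.py")
    allowed_rel = allowed.relative_to(root.resolve()).as_posix()

    blocked = []
    for attempt in ("../outside.txt", "/etc/passwd", "pkg/../../outside.txt"):
        try:
            resolve_in_workspace(root, attempt)
        except WorkspaceEscapeError:
            blocked.append(attempt)

print(allowed_rel)
print(blocked)
```

Resolving both paths before comparing them matters: a plain string prefix check would miss `..` segments and symlinks that point outside the root.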

What the example demonstrates

The toy codebase intentionally duplicates coupon validation across web and API layers. The agent:

  1. Scans the repository
  2. Searches for the duplicated validator
  3. Reads each implementation
  4. Writes a shared coupon_validator module
  5. Patches both call sites to use it
  6. Runs the existing pytest suite to verify behavior is preserved

This mirrors a real engineering task: identify duplication, extract a shared module, keep behavior stable, and verify with tests.
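
As a rough picture of the end state, the sketch below shows a shared validator with two call sites delegating to it. The coupon table, the validation rule, and the function names are invented for illustration and are not taken from examples/checkout; only the shape of the refactor (one shared module, two thin callers) follows the steps above.

```python
from typing import Optional

# --- coupon_validator.py (hypothetical shared module) ---

# Illustrative coupon table: code -> percent discount. Not from the repository.
COUPONS = {"SAVE10": 10, "HALF": 50}


def validate_coupon(code: Optional[str]) -> int:
    """Return the percent discount for a code, or 0 if it is missing or unknown."""
    if not code:
        return 0
    return COUPONS.get(code.strip().upper(), 0)


# --- call sites: both layers delegate instead of duplicating the rule ---

def web_checkout_total(subtotal_cents: int, coupon: Optional[str]) -> int:
    discount = validate_coupon(coupon)
    return subtotal_cents * (100 - discount) // 100


def api_checkout_total(payload: dict) -> int:
    discount = validate_coupon(payload.get("coupon"))
    return payload["subtotal_cents"] * (100 - discount) // 100


print(web_checkout_total(2000, " save10 "))
print(api_checkout_total({"subtotal_cents": 2000, "coupon": "HALF"}))
```

With the rule in one place, a change to normalization or to the coupon table reaches both layers at once, and the existing tests act as the check that behavior did not drift.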

Notes for application use

  • runs/manifest.json and runs/run-*.json are real artifacts you can attach to an application form as proof of agent execution.
  • The dashboard panels (steps, tokens, scenario summary, run history) read directly from those JSON files. No mock data.
  • For production-grade evidence, run the agent with --backend openai or --backend anthropic and submit the resulting trace plus a billing screenshot.
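
To sanity-check those artifacts before attaching them, a small script along these lines could summarize a runs directory. The keys `steps` and `total_tokens` are assumed for illustration and may not match the real trace schema, so adjust them to whatever your run-*.json files contain. The example writes a sample file to a temporary directory so that it runs on its own.

```python
import json
import tempfile
from pathlib import Path
from typing import List


def summarize_runs(runs_dir: Path) -> List[dict]:
    """Collect a one-line summary for every run-*.json file in runs_dir."""
    summaries = []
    for path in sorted(runs_dir.glob("run-*.json")):
        data = json.loads(path.read_text(encoding="utf-8"))
        summaries.append({
            "file": path.name,
            "steps": len(data.get("steps", [])),          # assumed key
            "total_tokens": data.get("total_tokens", 0),  # assumed key
        })
    return summaries


with tempfile.TemporaryDirectory() as tmp:
    runs = Path(tmp)
    # Sample trace with invented content, standing in for a real run file.
    sample = {"steps": [{"tool": "list_files"}, {"tool": "read_file"}], "total_tokens": 0}
    (runs / "run-sample.json").write_text(json.dumps(sample), encoding="utf-8")
    # manifest.json is skipped because it does not match the run-*.json pattern.
    (runs / "manifest.json").write_text("{}", encoding="utf-8")
    summary = summarize_runs(runs)

print(summary)
```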

License

MIT License. See LICENSE for details.
