Understudy is a scenario-driven testing framework for AI agents that simulates realistic multi-turn users, runs those scenes against an agent through a simple app adapter, records a structured execution trace of messages, tool calls, and handoffs, and then evaluates behavior with deterministic checks, optional LLM judges, and run reports.
Testing with understudy takes four steps:
- Wrap your agent — Adapt it (ADK, LangGraph, or HTTP) to understudy's interface
- Mock your tools — Register handlers that return test data instead of calling real services
- Write scenes — YAML files defining what the simulated user wants and what you expect
- Run and assert — Execute simulations, check traces, generate reports
The key insight: assert against the trace, not the prose. Don't check what the agent said—check what it did (tool calls).
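To make the contrast concrete, here is a minimal sketch of the two assertion styles. `FakeTrace` is a hypothetical in-memory stand-in, not understudy's actual trace API; it only illustrates why behavioral checks survive prompt and model changes that rewording breaks:

```python
from dataclasses import dataclass, field


@dataclass
class FakeTrace:
    """Hypothetical stand-in for an execution trace (illustration only)."""
    tool_calls: list = field(default_factory=list)  # (tool_name, args) pairs
    final_text: str = ""

    def called(self, tool_name: str) -> bool:
        # True if any recorded tool call used this tool
        return any(name == tool_name for name, _ in self.tool_calls)


trace = FakeTrace(
    tool_calls=[
        ("lookup_order", {"order_id": "ORD-10031"}),
        ("create_return", {"order_id": "ORD-10031", "reason": "too small"}),
    ],
    final_text="Your return has been created!",
)

# Brittle: exact phrasing changes between prompts and model versions.
assert "return" in trace.final_text.lower()

# Robust: the behavior (which tools ran) is stable even when wording is not.
assert trace.called("create_return")
assert not trace.called("issue_refund")
```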
See real examples:
- Example scene — YAML defining a test scenario
- ADK test file — pytest assertions against traces
- LangGraph test file — same tests, different framework
- Example report — HTML report with metrics and transcripts
Install:

```shell
pip install understudy[all]
```

Wrap your agent:

```python
from understudy.adk import ADKApp

from my_agent import agent

app = ADKApp(agent=agent)
```

Your agent has tools that call external services. Mock them for testing:
```python
from understudy.mocks import MockToolkit

mocks = MockToolkit()

@mocks.handle("lookup_order")
def lookup_order(order_id: str) -> dict:
    return {"order_id": order_id, "items": [...], "status": "delivered"}

@mocks.handle("create_return")
def create_return(order_id: str, item_sku: str, reason: str) -> dict:
    return {"return_id": "RET-001", "status": "created"}
```

Create `scenes/return_backpack.yaml`:
```yaml
id: return_eligible_backpack
description: Customer wants to return a backpack
starting_prompt: "I'd like to return an item please."
conversation_plan: |
  Goal: Return the hiking backpack from order ORD-10031.
  - Provide order ID when asked
  - Return reason: too small
persona: cooperative
max_turns: 15
expectations:
  required_tools:
    - lookup_order
    - create_return
  forbidden_tools:
    - issue_refund
```

Run the scene and assert against the trace:

```python
from understudy import Scene, run

scene = Scene.from_file("scenes/return_backpack.yaml")
trace = run(app, scene, mocks=mocks)

assert trace.called("lookup_order")
assert trace.called("create_return")
assert not trace.called("issue_refund")
```

Or with pytest (define `app` and `mocks` fixtures in `conftest.py`):
```shell
pytest test_returns.py -v
```

Run multiple scenes with multiple simulations per scene:
```python
from understudy import Suite, RunStorage

suite = Suite.from_directory("scenes/")
storage = RunStorage()

# Run each scene 3 times and tag for comparison
results = suite.run(
    app,
    mocks=mocks,
    storage=storage,
    n_sims=3,
    tags={"version": "v1"},
)

print(f"{results.pass_count}/{len(results.results)} passed")
```

Understudy separates simulation (generating traces) from evaluation (checking traces). Use them together or separately:
```shell
understudy run \
  --app mymodule:agent_app \
  --scene ./scenes/ \
  --n-sims 3 \
  --junit results.xml
```

Generate traces only:

```shell
understudy simulate \
  --app mymodule:agent_app \
  --scenes ./scenes/ \
  --output ./traces/ \
  --n-sims 3
```

Evaluate existing traces:

```shell
understudy evaluate \
  --traces ./traces/ \
  --output ./results/ \
  --junit results.xml
```

Python API:
```python
from understudy import simulate_batch, evaluate_batch

# Generate traces
traces = simulate_batch(
    app=agent_app,
    scenes="./scenes/",
    n_sims=3,
    output="./traces/",
)

# Evaluate later
results = evaluate_batch(
    traces="./traces/",
    output="./results/",
)
```

```shell
# Run simulations
understudy run --app mymodule:app --scene ./scenes/
understudy simulate --app mymodule:app --scenes ./scenes/
understudy evaluate --traces ./traces/

# View results
understudy list
understudy show <run_id>
understudy summary

# Compare runs by tag
understudy compare --tag version --before v1 --after v2

# Generate reports
understudy report -o report.html
understudy compare --tag version --before v1 --after v2 --html comparison.html

# Interactive browser
understudy serve --port 8080

# HTTP simulator server (for browser/UI testing)
understudy serve-api --port 8000

# Cleanup
understudy delete <run_id>
understudy clear
```

For qualities that can't be checked deterministically, use an LLM judge:
```python
from understudy.judges import Judge

empathy_judge = Judge(
    rubric="The agent acknowledged frustration and was empathetic while enforcing policy.",
    samples=5,
)

result = empathy_judge.evaluate(trace)
assert result.score == 1
```

Built-in rubrics:

```python
from understudy.judges import (
    TOOL_USAGE_CORRECTNESS,
    POLICY_COMPLIANCE,
    TONE_EMPATHY,
    ADVERSARIAL_ROBUSTNESS,
    TASK_COMPLETION,
)
```

The `understudy summary` command shows:
- Pass rate — percentage of scenes that passed all expectations
- Avg turns — average conversation length
- Tool usage — distribution of tool calls across runs
- Agents — which agents were invoked
The HTML report (`understudy report`) includes:
- All metrics above
- Full conversation transcripts
- Tool call details with arguments
- Expectation check results
- Judge evaluation results (when used)
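The summary metrics above are simple aggregates over recorded runs. A rough sketch of how they could be computed, using a made-up per-run result shape rather than understudy's actual objects:

```python
from collections import Counter

# Hypothetical per-run results (illustration only; field names are assumptions).
runs = [
    {"scene": "return_eligible_backpack", "passed": True, "turns": 6,
     "tools": ["lookup_order", "create_return"]},
    {"scene": "return_eligible_backpack", "passed": True, "turns": 8,
     "tools": ["lookup_order", "create_return"]},
    {"scene": "refund_denied", "passed": False, "turns": 11,
     "tools": ["lookup_order", "issue_refund"]},
]

# Pass rate: fraction of runs that satisfied all expectations.
pass_rate = sum(r["passed"] for r in runs) / len(runs)

# Avg turns: mean conversation length across runs.
avg_turns = sum(r["turns"] for r in runs) / len(runs)

# Tool usage: distribution of tool calls across all runs.
tool_usage = Counter(tool for r in runs for tool in r["tools"])

print(f"pass rate: {pass_rate:.0%}, avg turns: {avg_turns:.1f}")
print(tool_usage.most_common())
```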
See the full documentation for:
- Installation guide
- Writing scenes
- ADK integration
- LangGraph integration
- HTTP client for deployed agents
- API reference
License: MIT