Regression testing for AI agents.
Snapshot behavior, detect regressions, block broken agents before production.
EvalView sends test queries to your agent, records everything (tool calls, parameters, sequence, output, cost, latency), and diffs it against a golden baseline. When something changes, you know immediately.
```
✓ login-flow       PASSED
⚠ refund-request   TOOLS_CHANGED
    - lookup_order → check_policy → process_refund
    + lookup_order → check_policy → process_refund → escalate_to_human
✗ billing-dispute  REGRESSION  -30 pts
    Score: 85 → 55    Output similarity: 35%
```
Normal tests catch crashes. Tracing shows what happened after the fact. EvalView catches the harder class: the agent returns 200 but silently takes the wrong tool path, skips a clarification, or degrades output quality after a model update.
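The core of that detection is a diff of the recorded tool sequence against the golden baseline. A minimal sketch of that idea (illustrative only, not EvalView's actual implementation):

```python
from difflib import ndiff

def diff_tool_sequence(baseline, current):
    """Compare a recorded tool-call sequence against a golden baseline.

    Returns None when the sequences match, otherwise the +/- lines
    (in the spirit of the TOOLS_CHANGED diff shown above).
    """
    if baseline == current:
        return None
    return [line for line in ndiff(baseline, current) if line[0] in "+-"]

baseline = ["lookup_order", "check_policy", "process_refund"]
current = baseline + ["escalate_to_human"]
print(diff_tool_sequence(baseline, current))  # ['+ escalate_to_human']
```

The agent still returns 200 in both runs; only the sequence diff reveals the new `escalate_to_human` step.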
```
pip install evalview
```

Already have a local agent running?

```
evalview init      # Detect agent, create starter suite
evalview snapshot  # Save current behavior as baseline
evalview check     # Catch regressions after every change
```

No agent yet?

```
evalview demo      # See regression detection live (~30 seconds, no API key)
```

Other entry paths:
```
# Generate tests from a live agent
evalview generate --agent http://localhost:8000

# Generate from existing logs
evalview generate --from-log traffic.jsonl

# Capture real user flows via proxy
evalview capture --agent http://localhost:8000/invoke
```

```
┌────────────┐      ┌──────────┐      ┌───────────────┐
│ Test Cases │ ──→  │ EvalView │ ──→  │  Your Agent   │
│   (YAML)   │      │          │ ←──  │ local / cloud │
└────────────┘      └──────────┘      └───────────────┘
```
- `evalview init` — detects your running agent, creates a starter test suite
- `evalview snapshot` — runs tests, saves traces as golden baselines
- `evalview check` — replays tests, diffs against baselines, flags regressions
- `evalview monitor` — runs checks continuously with optional Slack alerts
Your data stays local. Nothing is sent to EvalView servers.
| Status | Meaning | Action |
|---|---|---|
| ✅ PASSED | Behavior matches baseline | Ship with confidence |
| ⚠️ TOOLS_CHANGED | Different tools called | Review the diff |
| ⚠️ OUTPUT_CHANGED | Same tools, output shifted | Review the diff |
| ❌ REGRESSION | Score dropped significantly | Fix before shipping |
| Layer | What it checks | Needs API key? | Cost |
|---|---|---|---|
| Tool calls + sequence | Exact tool names, order, parameters | No | Free |
| Code-based checks | Regex, JSON schema, contains/not_contains | No | Free |
| Semantic similarity | Output meaning via embeddings | `OPENAI_API_KEY` | ~$0.00004/test |
| LLM-as-judge | Output quality scored by GPT | `OPENAI_API_KEY` | ~$0.01/test |
The first two layers alone catch most regressions — fully offline, zero cost.
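The semantic layer reduces to a cosine similarity between embedding vectors. A hedged sketch of just the comparison step (the vectors themselves would come from an embeddings API; the 0.85 threshold is illustrative, not EvalView's default):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def outputs_similar(vec_baseline, vec_current, threshold=0.85):
    # Flag drift when the new output's embedding strays from the baseline's.
    return cosine_similarity(vec_baseline, vec_current) >= threshold

print(outputs_similar([1.0, 0.0], [1.0, 0.0]))  # True: identical direction
```

A 35% similarity like the `billing-dispute` example above would fall far below any reasonable threshold.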
```yaml
name: refund-needs-order-number
turns:
  - query: "I want a refund"
    expected:
      output:
        contains: ["order number"]
  - query: "Order 4812"
    expected:
      tools: ["lookup_order", "check_policy"]
thresholds:
  min_score: 70
```

If the agent stops asking for the order number or takes a different tool path on the follow-up, EvalView flags it.
| Feature | Description | Docs |
|---|---|---|
| Golden baseline diffing | Tool call + parameter + output regression detection | Docs |
| Multi-turn testing | Sequential turns with context injection | Docs |
| Multi-reference goldens | Up to 5 variants for non-deterministic agents | Docs |
| `forbidden_tools` | Safety contracts — hard-fail on any violation | Docs |
| Semantic similarity | Embedding-based output comparison | Docs |
| Production monitoring | `evalview monitor` with Slack alerts and JSONL history | Docs |
| A/B comparison | `evalview compare --v1 <url> --v2 <url>` | Docs |
| Test generation | `evalview generate` — auto-create test suites | Docs |
| Silent model detection | Alerts when the LLM provider updates the model version | Docs |
| Gradual drift detection | Trend analysis across check history | Docs |
| Statistical mode (pass@k) | Run N times, require a pass rate | Docs |
| HTML trace replay | Step-by-step forensic debugging | Docs |
| Pytest plugin | `evalview_check` fixture for standard pytest | Docs |
| Git hooks | Pre-push regression blocking, zero CI config | Docs |
| LLM judge caching | ~80% cost reduction in statistical mode | Docs |
| Skills testing | E2E testing for Claude Code, Codex, OpenClaw | Docs |
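Statistical mode exists because agents are non-deterministic: one run proves little, so each test runs N times and the suite gates on the observed pass rate. A minimal sketch of that gate (names are illustrative, not EvalView's API):

```python
def passes_statistical_gate(results, min_pass_rate=0.8):
    """results: one boolean per repeated run of the same test case."""
    pass_rate = sum(results) / len(results)
    return pass_rate >= min_pass_rate

# 4 of 5 runs passed -> 0.8 pass rate, meets the illustrative 0.8 gate
print(passes_statistical_gate([True, True, False, True, True]))  # True
```

This is where the LLM judge caching above pays off: repeated runs of identical outputs can reuse cached judge scores instead of paying for N judgments.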
Works with LangGraph, CrewAI, OpenAI, Claude, Mistral, HuggingFace, Ollama, MCP, and any HTTP API.
| Agent | E2E Testing | Trace Capture |
|---|---|---|
| LangGraph | ✅ | ✅ |
| CrewAI | ✅ | ✅ |
| OpenAI Assistants | ✅ | ✅ |
| Claude Code | ✅ | ✅ |
| Ollama | ✅ | ✅ |
| Any HTTP API | ✅ | ✅ |
Framework details → | Starter examples →
```
evalview install-hooks   # Pre-push regression blocking, zero config
```

Or in GitHub Actions:

```yaml
# .github/workflows/evalview.yml
name: Agent Health Check
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hidai25/eval-view@v0.5.1
        with:
          openai-api-key: ${{ secrets.OPENAI_API_KEY }}
          command: check
          fail-on: 'REGRESSION'
```

```
evalview monitor                         # Check every 5 min
evalview monitor --interval 60           # Every minute
evalview monitor --slack-webhook https://hooks.slack.com/services/...
evalview monitor --history monitor.jsonl # JSONL for dashboards
```

New regressions trigger Slack alerts. Recoveries send all-clear. No spam on persistent failures.
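That no-spam behavior is edge-triggered alerting: notify on state transitions, stay silent while a known failure persists. A minimal sketch of the logic (illustrative, not EvalView's implementation):

```python
def alert_on_transition(prev_failing, now_failing):
    """Return the alert to send for this monitor cycle, or None."""
    if not prev_failing and now_failing:
        return "regression"  # new failure -> alert once
    if prev_failing and not now_failing:
        return "recovered"   # back to green -> all-clear
    return None              # steady state (incl. persistent failure) -> silence

states = [False, True, True, False]  # pass/fail over four monitor cycles
alerts = [alert_on_transition(a, b) for a, b in zip(states, states[1:])]
print(alerts)  # ['regression', None, 'recovered']
```

The persistent-failure cycle in the middle produces no alert, which is exactly the anti-spam property described above.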
```python
def test_weather_regression(evalview_check):
    diff = evalview_check("weather-lookup")
    assert diff.overall_severity.value != "regression", diff.summary()
```

```
pip install evalview   # Plugin registers automatically
pytest                 # Runs alongside your existing tests
```

```
claude mcp add --transport stdio evalview -- evalview mcp serve
```

8 tools: create_test, run_snapshot, run_check, list_tests, validate_skill, generate_skill_tests, run_skill_test, generate_visual_report
MCP setup details
```
# 1. Install
pip install evalview

# 2. Connect to Claude Code
claude mcp add --transport stdio evalview -- evalview mcp serve

# 3. Make Claude Code proactive
cp CLAUDE.md.example CLAUDE.md
```

Then just ask Claude: "did my refactor break anything?" and it runs `run_check` inline.
| | LangSmith | Braintrust | Promptfoo | EvalView |
|---|---|---|---|---|
| Primary focus | Observability | Scoring | Prompt comparison | Regression detection |
| Tool call + parameter diffing | — | — | — | Yes |
| Golden baseline regression | — | Manual | — | Automatic |
| Works without API keys | No | No | Partial | Yes |
| Production monitoring | Tracing | — | — | Check loop + Slack |
| Getting Started | Core Features | Integrations |
|---|---|---|
| Getting Started | Golden Traces | CI/CD |
| CLI Reference | Evaluation Metrics | MCP Contracts |
| FAQ | Test Generation | Skills Testing |
| YAML Schema | Statistical Mode | Chat Mode |
| Framework Support | Behavior Coverage | Debugging |
- Bug or feature request? Run `evalview feedback` or open an issue
- Questions? GitHub Discussions
- Setup help? Email hidai@evalview.com
- Contributing? See CONTRIBUTING.md
License: Apache 2.0
