Regression testing for AI agents.
Snapshot behavior, detect regressions, block broken agents before production.
EvalView sends test queries to your agent, records everything (tool calls, parameters, sequence, output, cost, latency), and diffs it against a golden baseline. When something changes, you know immediately.
```
✓ login-flow       PASSED
⚠ refund-request   TOOLS_CHANGED
    - lookup_order → check_policy → process_refund
    + lookup_order → check_policy → process_refund → escalate_to_human
✗ billing-dispute  REGRESSION  -30 pts
    Score: 85 → 55    Output similarity: 35%
```
Normal tests catch crashes. Tracing shows what happened after the fact. EvalView catches the harder class: the agent returns 200 but silently takes the wrong tool path, skips a clarification, or degrades output quality after a model update.
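The core of that detection is a diff of the recorded tool sequence against the golden baseline. A minimal sketch of that idea (illustrative only, not EvalView's actual implementation):

```python
from difflib import ndiff

def diff_tool_sequence(baseline, current):
    """Compare a recorded tool-call sequence against a golden baseline.

    Returns None when the sequences match, otherwise the +/- lines
    (in the spirit of the TOOLS_CHANGED diff shown above).
    """
    if baseline == current:
        return None
    return [line for line in ndiff(baseline, current) if line[0] in "+-"]

baseline = ["lookup_order", "check_policy", "process_refund"]
current = baseline + ["escalate_to_human"]
print(diff_tool_sequence(baseline, current))  # ['+ escalate_to_human']
```

The agent still returns 200 in both runs; only the sequence diff reveals the new `escalate_to_human` step.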
```
pip install evalview
```

Already have a local agent running?

```
evalview init      # Detect agent, create starter suite
evalview snapshot  # Save current behavior as baseline
evalview check     # Catch regressions after every change
```

No agent yet?

```
evalview demo      # See regression detection live (~30 seconds, no API key)
```

Other entry paths:
```
# Generate tests from a live agent
evalview generate --agent http://localhost:8000

# Generate from existing logs
evalview generate --from-log traffic.jsonl

# Capture real user flows via proxy
evalview capture --agent http://localhost:8000/invoke
```

```
┌────────────┐      ┌──────────┐      ┌───────────────┐
│ Test Cases │ ──→  │ EvalView │ ──→  │  Your Agent   │
│   (YAML)   │      │          │ ←──  │ local / cloud │
└────────────┘      └──────────┘      └───────────────┘
```
- `evalview init` — detects your running agent, creates a starter test suite
- `evalview snapshot` — runs tests, saves traces as golden baselines
- `evalview check` — replays tests, diffs against baselines, flags regressions
- `evalview monitor` — runs checks continuously with optional Slack alerts
Your data stays local. Nothing is sent to EvalView servers.
| Status | Meaning | Action |
|---|---|---|
| ✅ PASSED | Behavior matches baseline | Ship with confidence |
| ⚠️ TOOLS_CHANGED | Different tools called | Review the diff |
| ⚠️ OUTPUT_CHANGED | Same tools, output shifted | Review the diff |
| ❌ REGRESSION | Score dropped significantly | Fix before shipping |
| Layer | What it checks | Needs API key? | Cost |
|---|---|---|---|
| Tool calls + sequence | Exact tool names, order, parameters | No | Free |
| Code-based checks | Regex, JSON schema, contains/not_contains | No | Free |
| Semantic similarity | Output meaning via embeddings | `OPENAI_API_KEY` | ~$0.00004/test |
| LLM-as-judge | Output quality scored by GPT | `OPENAI_API_KEY` | ~$0.01/test |
The first two layers alone catch most regressions — fully offline, zero cost.
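The semantic layer reduces to a cosine similarity between embedding vectors. A hedged sketch of just the comparison step (the vectors themselves would come from an embeddings API; the 0.85 threshold is illustrative, not EvalView's default):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def outputs_similar(vec_baseline, vec_current, threshold=0.85):
    # Flag drift when the new output's embedding strays from the baseline's.
    return cosine_similarity(vec_baseline, vec_current) >= threshold

print(outputs_similar([1.0, 0.0], [1.0, 0.0]))  # True: identical direction
```

A 35% similarity like the `billing-dispute` example above would fall far below any reasonable threshold.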
```yaml
name: refund-needs-order-number
turns:
  - query: "I want a refund"
    expected:
      output:
        contains: ["order number"]
  - query: "Order 4812"
    expected:
      tools: ["lookup_order", "check_policy"]
thresholds:
  min_score: 70
```

If the agent stops asking for the order number or takes a different tool path on the follow-up, EvalView flags it.
| Feature | Description | Docs |
|---|---|---|
| Golden baseline diffing | Tool call + parameter + output regression detection | Docs |
| Multi-turn testing | Sequential turns with context injection | Docs |
| Multi-reference goldens | Up to 5 variants for non-deterministic agents | Docs |
| `forbidden_tools` | Safety contracts — hard-fail on any violation | Docs |
| Semantic similarity | Embedding-based output comparison | Docs |
| Production monitoring | `evalview monitor` with Slack alerts and JSONL history | Docs |
| A/B comparison | `evalview compare --v1 <url> --v2 <url>` | Docs |
| Test generation | `evalview generate` — auto-create test suites | Docs |
| Silent model detection | Alerts when the LLM provider updates the model version | Docs |
| Gradual drift detection | Trend analysis across check history | Docs |
| Statistical mode (pass@k) | Run N times, require a pass rate | Docs |
| HTML trace replay | Step-by-step forensic debugging | Docs |
| Pytest plugin | `evalview_check` fixture for standard pytest | Docs |
| Git hooks | Pre-push regression blocking, zero CI config | Docs |
| LLM judge caching | ~80% cost reduction in statistical mode | Docs |
| Skills testing | E2E testing for Claude Code, Codex, OpenClaw | Docs |
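Statistical mode exists because agents are non-deterministic: one run proves little, so each test runs N times and the suite gates on the observed pass rate. A minimal sketch of that gate (names are illustrative, not EvalView's API):

```python
def passes_statistical_gate(results, min_pass_rate=0.8):
    """results: one boolean per repeated run of the same test case."""
    pass_rate = sum(results) / len(results)
    return pass_rate >= min_pass_rate

# 4 of 5 runs passed -> 0.8 pass rate, meets the illustrative 0.8 gate
print(passes_statistical_gate([True, True, False, True, True]))  # True
```

This is where the LLM judge caching above pays off: repeated runs of identical outputs can reuse cached judge scores instead of paying for N judgments.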
Works with LangGraph, CrewAI, OpenAI, Claude, Mistral, HuggingFace, Ollama, MCP, and any HTTP API.
| Agent | E2E Testing | Trace Capture |
|---|---|---|
| LangGraph | ✅ | ✅ |
| CrewAI | ✅ | ✅ |
| OpenAI Assistants | ✅ | ✅ |
| Claude Code | ✅ | ✅ |
| Ollama | ✅ | ✅ |
| Any HTTP API | ✅ | ✅ |
Framework details → | Starter examples →
```
evalview install-hooks   # Pre-push regression blocking, zero config
```

Or in GitHub Actions:

```yaml
# .github/workflows/evalview.yml
name: Agent Health Check
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hidai25/eval-view@v0.5.1
        with:
          openai-api-key: ${{ secrets.OPENAI_API_KEY }}
          command: check
          fail-on: 'REGRESSION'
```

```
evalview monitor                         # Check every 5 min
evalview monitor --interval 60           # Every minute
evalview monitor --slack-webhook https://hooks.slack.com/services/...
evalview monitor --history monitor.jsonl # JSONL for dashboards
```

New regressions trigger Slack alerts. Recoveries send all-clear. No spam on persistent failures.
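That no-spam behavior is edge-triggered alerting: notify on state transitions, stay silent while a known failure persists. A minimal sketch of the logic (illustrative, not EvalView's implementation):

```python
def alert_on_transition(prev_failing, now_failing):
    """Return the alert to send for this monitor cycle, or None."""
    if not prev_failing and now_failing:
        return "regression"  # new failure -> alert once
    if prev_failing and not now_failing:
        return "recovered"   # back to green -> all-clear
    return None              # steady state (incl. persistent failure) -> silence

states = [False, True, True, False]  # pass/fail over four monitor cycles
alerts = [alert_on_transition(a, b) for a, b in zip(states, states[1:])]
print(alerts)  # ['regression', None, 'recovered']
```

The persistent-failure cycle in the middle produces no alert, which is exactly the anti-spam property described above.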
```python
def test_weather_regression(evalview_check):
    diff = evalview_check("weather-lookup")
    assert diff.overall_severity.value != "regression", diff.summary()
```

```
pip install evalview   # Plugin registers automatically
pytest                 # Runs alongside your existing tests
```

```
claude mcp add --transport stdio evalview -- evalview mcp serve
```

8 tools: create_test, run_snapshot, run_check, list_tests, validate_skill, generate_skill_tests, run_skill_test, generate_visual_report
MCP setup details
```
# 1. Install
pip install evalview

# 2. Connect to Claude Code
claude mcp add --transport stdio evalview -- evalview mcp serve

# 3. Make Claude Code proactive
cp CLAUDE.md.example CLAUDE.md
```

Then just ask Claude: "did my refactor break anything?" and it runs `run_check` inline.
| | LangSmith | Braintrust | Promptfoo | EvalView |
|---|---|---|---|---|
| Primary focus | Observability | Scoring | Prompt comparison | Regression detection |
| Tool call + parameter diffing | — | — | — | Yes |
| Golden baseline regression | — | Manual | — | Automatic |
| Works without API keys | No | No | Partial | Yes |
| Production monitoring | Tracing | — | — | Check loop + Slack |
| Getting Started | Core Features | Integrations |
|---|---|---|
| Getting Started | Golden Traces | CI/CD |
| CLI Reference | Evaluation Metrics | MCP Contracts |
| FAQ | Test Generation | Skills Testing |
| YAML Schema | Statistical Mode | Chat Mode |
| Framework Support | Behavior Coverage | Debugging |
- Bug or feature request? Run `evalview feedback` or open an issue
- Questions? GitHub Discussions
- Setup help? Email hidai@evalview.com
- Contributing? See CONTRIBUTING.md
License: Apache 2.0
