EvalView
Regression testing for AI agents.
Snapshot behavior, detect regressions, block broken agents before production.



EvalView sends test queries to your agent, records everything (tool calls, parameters, sequence, output, cost, latency), and diffs it against a golden baseline. When something changes, you know immediately.

  ✓ login-flow           PASSED
  ⚠ refund-request       TOOLS_CHANGED
      - lookup_order → check_policy → process_refund
      + lookup_order → check_policy → process_refund → escalate_to_human
  ✗ billing-dispute      REGRESSION  -30 pts
      Score: 85 → 55  Output similarity: 35%

Normal tests catch crashes. Tracing shows what happened after the fact. EvalView catches the harder class: the agent returns 200 but silently takes the wrong tool path, skips a clarification, or degrades output quality after a model update.
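
Conceptually, the tool-path layer reduces to an ordered comparison of tool names between the baseline trace and the new run (an illustrative Python sketch of the idea, not EvalView's internals):

# Tool sequences recorded from two runs of the refund-request test above.
baseline = ["lookup_order", "check_policy", "process_refund"]
current = ["lookup_order", "check_policy", "process_refund", "escalate_to_human"]

if current != baseline:
    # Flag the drift and show which tools were added or dropped.
    added = [t for t in current if t not in baseline]
    removed = [t for t in baseline if t not in current]
    print(f"TOOLS_CHANGED  +{added}  -{removed}")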

[Screenshot: multi-turn execution trace with sequence diagram]

Quick Start

pip install evalview

Already have a local agent running?

evalview init        # Detect agent, create starter suite
evalview snapshot    # Save current behavior as baseline
evalview check       # Catch regressions after every change

No agent yet?

evalview demo        # See regression detection live (~30 seconds, no API key)

Other entry paths:

# Generate tests from a live agent
evalview generate --agent http://localhost:8000

# Generate from existing logs
evalview generate --from-log traffic.jsonl

# Capture real user flows via proxy
evalview capture --agent http://localhost:8000/invoke

How It Works

┌────────────┐      ┌──────────┐      ┌────────────────┐
│ Test Cases │ ──→  │ EvalView │ ──→  │   Your Agent   │
│   (YAML)   │      │          │ ←──  │ local / cloud  │
└────────────┘      └──────────┘      └────────────────┘
  1. evalview init — detects your running agent, creates a starter test suite
  2. evalview snapshot — runs tests, saves traces as golden baselines
  3. evalview check — replays tests, diffs against baselines, flags regressions
  4. evalview monitor — runs checks continuously with optional Slack alerts

Your data stays local. Nothing is sent to EvalView servers.

What It Catches

| Status | Meaning | Action |
|---|---|---|
| ✓ PASSED | Behavior matches baseline | Ship with confidence |
| ⚠️ TOOLS_CHANGED | Different tools called | Review the diff |
| ⚠️ OUTPUT_CHANGED | Same tools, output shifted | Review the diff |
| ✗ REGRESSION | Score dropped significantly | Fix before shipping |

Four Scoring Layers

| Layer | What it checks | Needs API key? | Cost |
|---|---|---|---|
| Tool calls + sequence | Exact tool names, order, parameters | No | Free |
| Code-based checks | Regex, JSON schema, contains/not_contains | No | Free |
| Semantic similarity | Output meaning via embeddings | OPENAI_API_KEY | ~$0.00004/test |
| LLM-as-judge | Output quality scored by GPT | OPENAI_API_KEY | ~$0.01/test |

The first two layers alone catch most regressions — fully offline, zero cost.
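
For example, a test that leans only on these two free layers could look like this (a minimal sketch reusing the turn schema from the next section; the exact placement of not_contains is an assumption, so consult the YAML Schema docs):

name: refund-offline-checks
turns:
  - query: "Refund order 4812"
    expected:
      tools: ["lookup_order", "check_policy", "process_refund"]   # layer 1: exact tool sequence
      output:
        contains: ["refund"]               # layer 2: code-based check
        not_contains: ["I am not sure"]    # check type listed in the table above
thresholds:
  min_score: 70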

Multi-Turn Testing

name: refund-needs-order-number
turns:
  - query: "I want a refund"
    expected:
      output:
        contains: ["order number"]
  - query: "Order 4812"
    expected:
      tools: ["lookup_order", "check_policy"]
thresholds:
  min_score: 70

If the agent stops asking for the order number or takes a different tool path on the follow-up, EvalView flags it.

Key Features

| Feature | Description |
|---|---|
| Golden baseline diffing | Tool call + parameter + output regression detection |
| Multi-turn testing | Sequential turns with context injection |
| Multi-reference goldens | Up to 5 variants for non-deterministic agents |
| forbidden_tools | Safety contracts: hard-fail on any violation (see the sketch after this table) |
| Semantic similarity | Embedding-based output comparison |
| Production monitoring | evalview monitor with Slack alerts and JSONL history |
| A/B comparison | evalview compare --v1 <url> --v2 <url> |
| Test generation | evalview generate: auto-create test suites |
| Silent model detection | Alerts when the LLM provider updates the model version |
| Gradual drift detection | Trend analysis across check history |
| Statistical mode (pass@k) | Run N times, require a pass rate |
| HTML trace replay | Step-by-step forensic debugging |
| Pytest plugin | evalview_check fixture for standard pytest |
| Git hooks | Pre-push regression blocking, zero CI config |
| LLM judge caching | ~80% cost reduction in statistical mode |
| Skills testing | E2E testing for Claude Code, Codex, OpenClaw |
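
A forbidden_tools contract, for instance, might be declared alongside a test like this (a minimal sketch; the key name comes from the feature list above, but its exact placement in the test YAML is an assumption, so see the linked docs):

name: refund-safety
turns:
  - query: "I want a refund for order 4812"
    expected:
      tools: ["lookup_order", "check_policy"]
forbidden_tools: ["delete_account", "wire_transfer"]   # hard-fail if the agent ever calls these (placement assumed)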

Supported Frameworks

Works with LangGraph, CrewAI, OpenAI, Claude, Mistral, HuggingFace, Ollama, MCP, and any HTTP API.

Agent-level E2E testing and trace capture cover LangGraph, CrewAI, OpenAI Assistants, Claude Code, Ollama, and any HTTP API.

Framework details → | Starter examples →

CI/CD Integration

evalview install-hooks    # Pre-push regression blocking, zero config

Or in GitHub Actions:

# .github/workflows/evalview.yml
name: Agent Health Check
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hidai25/eval-view@v0.5.1
        with:
          openai-api-key: ${{ secrets.OPENAI_API_KEY }}
          command: check
          fail-on: 'REGRESSION'

Full CI/CD guide →

Production Monitoring

evalview monitor                                         # Check every 5 min
evalview monitor --interval 60                           # Every minute
evalview monitor --slack-webhook https://hooks.slack.com/services/...
evalview monitor --history monitor.jsonl                 # JSONL for dashboards

New regressions trigger Slack alerts. Recoveries send an all-clear. No spam on persistent failures.

Monitor config options →

Pytest Plugin

pip install evalview    # Plugin registers automatically
pytest                  # Runs alongside your existing tests

def test_weather_regression(evalview_check):
    diff = evalview_check("weather-lookup")
    assert diff.overall_severity.value != "regression", diff.summary()
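
To gate a whole suite, the same fixture can be parametrized with standard pytest (a sketch; the test names are illustrative, borrowed from the demo output above):

import pytest

CASES = ["login-flow", "refund-request", "billing-dispute"]  # illustrative suite names

@pytest.mark.parametrize("case", CASES)
def test_no_regressions(case, evalview_check):
    diff = evalview_check(case)
    assert diff.overall_severity.value != "regression", diff.summary()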

Claude Code (MCP)

claude mcp add --transport stdio evalview -- evalview mcp serve

8 tools: create_test, run_snapshot, run_check, list_tests, validate_skill, generate_skill_tests, run_skill_test, generate_visual_report

MCP setup details:
# 1. Install
pip install evalview

# 2. Connect to Claude Code
claude mcp add --transport stdio evalview -- evalview mcp serve

# 3. Make Claude Code proactive
cp CLAUDE.md.example CLAUDE.md

Then just ask Claude: "did my refactor break anything?" and it runs run_check inline.

Why EvalView?

| | LangSmith | Braintrust | Promptfoo | EvalView |
|---|---|---|---|---|
| Primary focus | Observability | Scoring | Prompt comparison | Regression detection |
| Tool call + parameter diffing | | | | Yes |
| Golden baseline regression | | | Manual | Automatic |
| Works without API keys | No | No | Partial | Yes |
| Production monitoring | Tracing | | | Check loop + Slack |

Detailed comparisons →

Documentation

| Getting Started | Core Features | Integrations |
|---|---|---|
| Getting Started | Golden Traces | CI/CD |
| CLI Reference | Evaluation Metrics | MCP Contracts |
| FAQ | Test Generation | Skills Testing |
| YAML Schema | Statistical Mode | Chat Mode |
| Framework Support | Behavior Coverage | Debugging |

Contributing

License: Apache 2.0

