This FAQ covers the most common questions about EvalView, the open-source AI agent testing framework. If your question isn't answered here, check GitHub Discussions.
EvalView is an open-source, pytest-style testing and regression detection framework for AI agents. It detects when your agent's behavior changes after you modify prompts, swap models, or update tools. Install with `pip install evalview`.
AI agents break silently. When you change a prompt, swap a model, or update a tool, the agent might degrade without any error. Traditional unit tests don't work because LLM outputs are non-deterministic. EvalView solves this by capturing a golden baseline of known-good behavior and automatically detecting when behavior drifts.
- AI agent developers building with LangGraph, CrewAI, OpenAI, or custom frameworks
- Prompt engineers who need to test changes without breaking production
- MLOps/DevOps teams setting up CI/CD for AI agents
- Teams maintaining SKILL.md workflows for Claude Code or OpenAI Codex
- Anyone who needs reproducible, automated testing for LLM-powered applications
Yes. EvalView is free and open source under the Apache 2.0 license. You pay only for LLM API calls if you use optional LLM-as-judge evaluation. Use Ollama for completely free, fully offline evaluation.
They answer different questions:
- `evalview check` checks your agent against golden baselines you recorded. Use it when you want to know whether a code change, prompt change, or model swap broke behavior in your system.
- `evalview model-check` checks the closed model underneath your agent against a fixed canary suite. Use it when you want to know whether `claude-opus-4-5` or `gpt-5.4` itself silently changed behavior, independent of anything in your code.

Both commands share drift-classification machinery and complement each other. A failing `check` plus a drifted `model-check` on the same day strongly suggests the provider updated the model; a failing `check` with a clean `model-check` points back at your own changes. See `docs/MODEL_CHECK.md` for the full rationale.
LangSmith is for observability and tracing — it shows you what your agent did. EvalView is for testing and regression detection — it tells you whether your agent broke. They're complementary tools. Use LangSmith to see what happened, use EvalView to prove it didn't break.
Braintrust is an evaluation platform that scores agent quality. EvalView focuses specifically on regression detection — detecting when behavior changes. EvalView does this automatically through golden baseline diffing, while Braintrust requires manual comparison. EvalView is also fully free and open source.
Promptfoo is primarily a prompt testing and comparison tool. EvalView is an agent testing framework with native adapters for agent frameworks (LangGraph, CrewAI, OpenAI Assistants), tool call verification, golden baseline diffing, and statistical mode. EvalView tests agent behavior (tools called, sequence, cost, latency) not just prompt outputs.
Yes, that's a good analogy. EvalView provides YAML-based test cases, assertions on tool calls and output quality, CI/CD integration with exit codes, and regression detection through golden baselines. It's the testing layer that AI agent development has been missing.
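To make the pytest analogy concrete, a YAML test case might look like the sketch below. The field names (`tests`, `input`, `expect`, `output_contains`) are illustrative assumptions, not EvalView's confirmed schema; check the CLI Reference for the actual format.

```yaml
# Hypothetical test-case sketch; keys are illustrative assumptions,
# not EvalView's documented schema.
tests:
  - name: refund-lookup
    input: "What is the refund status for order 1234?"
    expect:
      tools:
        - lookup_order          # the agent must call this tool
        - check_refund_status   # and then this one
      output_contains: "refund" # basic output-quality assertion
```

As with pytest, the point is that each case is declarative, version-controlled, and fails with a non-zero exit code in CI.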
Yes. EvalView has a dedicated langgraph adapter with native thread tracking and streaming support. See examples/langgraph/.
Yes. EvalView has a dedicated crewai adapter for task-based execution and multi-agent crews. See examples/crewai/.
Yes. EvalView has a dedicated openai-assistants adapter with function calling and code interpreter support.
Yes. EvalView has a dedicated anthropic adapter. See examples/anthropic/.
Yes. EvalView supports HuggingFace Spaces (Gradio-based agents) and can use HuggingFace models as the LLM-as-judge for free evaluation.
Yes. EvalView supports Ollama for both testing agents and as a free, fully offline LLM-as-judge.
Yes. Any agent that exposes an HTTP API works with EvalView's generic `http` adapter. EvalView also supports JSONL streaming APIs.
Yes. EvalView can test MCP servers directly and also provides MCP contract testing to detect interface drift in external MCP servers.
Yes. EvalView's core regression detection (golden baseline diffing, tool accuracy, sequence correctness) works without any API keys. The optional LLM-as-judge scoring requires an API key, but you can use Ollama for completely free local evaluation:
```
evalview run --judge-provider ollama --judge-model llama3.2
```

Yes. EvalView works fully offline when using Ollama as the LLM-as-judge and testing a locally-running agent.
Yes. EvalView has a GitHub Action (`hidai25/eval-view@v0.6.1`), proper exit codes, JSON output mode, and PR comment support. It also works with GitLab CI, CircleCI, and any CI system that runs Python. See CI/CD Integration.
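A minimal workflow wiring this into a PR gate might look like the sketch below. The action name and version come from the docs above; the step inputs and job layout are illustrative assumptions, not the action's confirmed interface.

```yaml
# Hypothetical CI sketch. Only the action reference (hidai25/eval-view@v0.6.1)
# is from the docs; everything else is an illustrative assumption.
name: agent-regression
on: [pull_request]
jobs:
  evalview:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hidai25/eval-view@v0.6.1
        # A non-zero exit code fails the job, blocking the PR on regressions.
```

Because failures surface through exit codes, the same pattern transfers to GitLab CI or CircleCI with an ordinary `evalview run` step.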
No. EvalView runs without any database. Results print to console and save as JSON files. No external dependencies required.
Yes. EvalView detects hallucinations by verifying the agent called the expected tools (catching "didn't look it up" hallucinations) and comparing agent output against tool results (catching "misinterpreted the data" hallucinations).
```yaml
checks:
  hallucination: true
```

Yes. Use `evalview skill validate` for structure validation and `evalview skill test` for behavior testing. EvalView catches skills that exceed Claude Code's 15k character budget — a common silent failure. See Skills Testing.
Yes. Codex CLI uses the same SKILL.md format as Claude Code. Your tests work for both platforms.
No. `evalview skill validate` runs locally without any API calls. Only `evalview skill test` requires an Anthropic API key.
Yes. EvalView provides multiple approaches:
- Multi-reference goldens: Save up to 5 acceptable variants per test
- Statistical mode: Run tests N times with pass@k reliability metrics
- Flexible matching: `subsequence` mode allows extra tools between expected ones
- Tool categories: Match by intent instead of exact tool names
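Combined in one test case, those knobs might look like the sketch below. The key names (`mode`, `sequence`, `max_references`) are illustrative assumptions, not EvalView's confirmed schema; only the concepts (subsequence matching, tool categories, multi-reference goldens) come from the docs.

```yaml
# Hypothetical sketch of the non-determinism knobs; key names are
# illustrative assumptions, not EvalView's documented schema.
tests:
  - name: summarize-report
    input: "Summarize today's sales report."
    expect:
      tools:
        mode: subsequence   # extra tools between expected ones are allowed
        sequence:
          - file_read       # a category: matches read_file, bash cat, ...
          - summarize
    golden:
      max_references: 5     # keep up to 5 acceptable golden variants
```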
Yes. MCP contract testing captures a snapshot of a server's tool definitions and detects breaking changes (removed tools, new required params, type changes) before they break your agent. See MCP Contracts.
Use Statistical Mode (`evalview run --runs 10`). It runs tests multiple times and provides pass@k reliability metrics, flakiness scores, and statistical confidence intervals.
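For intuition on what pass@k reports, the snippet below computes the standard unbiased pass@k estimator: given n runs of a test with c passes, the probability that at least one of k sampled runs passes. This is the widely used formula; EvalView's internal implementation may differ in details.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k runs,
    sampled from n total runs of which c passed, is a pass.
    Standard estimator; EvalView's exact computation may differ."""
    if n - c < k:
        # Fewer failures than samples: every k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# 10 runs, 7 passes: probability a single sampled run passes is 7/10.
print(pass_at_k(10, 7, 1))
```

A flaky test shows a large gap between pass@1 and pass@5; a stable one has pass@1 close to 1.0 already.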
Run with `evalview run --verbose` or `DEBUG=1 evalview run` to see raw API responses, parsed traces, and scoring breakdowns. See the Debugging Guide.
Use Tool Categories to match by intent rather than exact tool names. For example, `file_read` matches `read_file`, `bash cat`, and `text_editor`.
Use multi-reference goldens (up to 5 variants per test), statistical mode (`--runs N`), or flexible sequence matching (`subsequence` mode). See Tutorials.
- Getting Started — Install and run your first test in 5 minutes
- Framework Support — Adapter guides for each framework
- Golden Traces — Regression detection with golden baselines
- CLI Reference — Complete command reference
- Debugging Guide — Troubleshooting common issues
- Tutorials — Step-by-step guides for advanced features