Lightweight evaluation framework for AI agents. Measure accuracy, cost, latency, and safety across any agent architecture.
Works with any agent that accepts a string and returns a string — LangChain, CrewAI, AutoGen, OpenAI Assistants, or plain functions.
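As a minimal illustration of that interface, a plain-function agent might look like this (the function body is a toy for demonstration, not part of agenteval):

```python
def my_agent(prompt: str) -> str:
    # Any callable(str) -> str qualifies as an agent -- no framework required.
    if "2+2" in prompt:
        return "4"
    return "I don't know."

my_agent("What is 2+2?")  # "4"
```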
Install:

```bash
pip install agenteval
```

```python
from agenteval import AgentEvaluator, TaskSuite

# Define tasks
suite = TaskSuite.from_list([
    {"name": "math", "prompt": "What is 2+2?", "expected": "4", "category": "math"},
    {"name": "capital", "prompt": "Capital of France?", "expected": "Paris", "category": "geo"},
    {"name": "code", "prompt": "Write hello world in Python", "expected": "print", "category": "code"},
])

# Evaluate agents
evaluator = AgentEvaluator(
    agents={
        "agent_a": my_agent_a,  # any callable(str) -> str
        "agent_b": my_agent_b,
    },
    runs_per_task=3,  # run each task 3x for reliability measurement
)
results = evaluator.run(suite)

# Check metrics
print(results["agent_a"].metrics.accuracy)
print(results["agent_a"].metrics.latency_p95)
print(results["agent_a"].safety.safety_score)

# Compare side-by-side
evaluator.compare_results(results).print_table()
# agent   | accuracy | success_rate | latency_mean | latency_p95 | tokens_mean | cost_per_run
# agent_a | 91.1%    | 100.0%       | 2800ms       | 3200ms      | 450         | $0.0135
# agent_b | 87.3%    | 97.5%        | 3100ms       | 4500ms      | 520         | $0.0156
# Winner: agent_a
```

| Module | Metrics |
|---|---|
| Accuracy | Exact match, containment match, custom judge functions |
| Latency | Mean, p50, p95, p99 (per-run, in ms) |
| Cost | Token-based cost estimation (configurable per-model pricing) |
| Reliability | Success rate across runs, error categorization |
| Safety | PII leakage (email, phone, SSN, credit card), prompt injection detection, custom forbidden patterns |
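To illustrate the cost module's approach, token-based estimation multiplies tokens used by a configured per-model rate. This sketch uses a made-up price table, not agenteval's actual pricing:

```python
# Hypothetical per-1K-token prices; in agenteval, pricing is configurable per model.
PRICING_PER_1K_TOKENS = {"gpt-4o-mini": 0.00015}

def estimate_cost(model: str, tokens_used: int) -> float:
    # cost = (tokens / 1000) * price per 1K tokens
    return tokens_used / 1000 * PRICING_PER_1K_TOKENS[model]

estimate_cost("gpt-4o-mini", 450)  # cost for a 450-token run at the hypothetical rate
```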
Define tasks in code, JSON, or YAML:
```yaml
# tasks.yaml
name: customer_support
tasks:
  - name: greeting
    prompt: "Hi, I need help with my order"
    expected: "help"
    category: greeting
  - name: refund
    prompt: "I want a refund for order #1234"
    expected: "refund"
    category: transactions
```

```python
suite = TaskSuite.from_yaml("tasks.yaml")
```

```python
from agenteval import SafetyChecker

checker = SafetyChecker(
    check_pii=True,        # emails, phones, SSNs, credit cards, IPs
    check_injection=True,  # prompt injection leak detection
    forbidden_patterns=[   # custom regex patterns
        r"SECRET_KEY",
        r"password\s*[:=]",
    ],
)
report = checker.check(run_results)
print(report.safety_score)  # 0.0 - 1.0
print(report.violations)    # list of SafetyViolation
```

Built-in judges for different matching strategies:
```python
from agenteval.judges import exact_match, contains_match, numeric_match

exact_match("Paris", "Paris")                     # True
contains_match("The answer is 42", "42")          # True
numeric_match("3.14159", "3.14", tolerance=0.01)  # True
```

For semantic evaluation using OpenAI or Anthropic:
```python
from agenteval.judges import llm_judge, anthropic_judge

# OpenAI
evaluator = AgentEvaluator(
    agents={"my_agent": agent_fn},
    judge_fn=llm_judge(model="gpt-4o-mini"),
)

# Anthropic
evaluator = AgentEvaluator(
    agents={"my_agent": agent_fn},
    judge_fn=anthropic_judge(model="claude-sonnet-4-20250514"),
)
```

Install LLM support: `pip install agenteval[openai]` or `pip install agenteval[anthropic]`.
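A judge can also be a plain function. The exact `judge_fn` signature is not spelled out above, so this sketch assumes an `(output, expected) -> bool` contract mirroring the built-in judges; treat it as an illustration, not agenteval's documented API:

```python
import re

def regex_judge(output: str, expected: str) -> bool:
    # Hypothetical custom judge: treat `expected` as a regex
    # and search the agent's output for a match.
    return re.search(expected, output) is not None

regex_judge("Refund issued for order #1234", r"order #\d+")  # True
```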
```bash
# Validate a task suite
agenteval validate tasks.yaml

# Show task suite info
agenteval info tasks.yaml

# Version
agenteval version
```

To develop locally:

```bash
git clone https://github.com/atharvajoshi01/agenteval.git
cd agenteval
pip install -e ".[dev]"
pytest
```

License: MIT