Lightweight evaluation framework for AI agents. Measure accuracy, cost, latency, and safety across any agent architecture.
Works with any agent that accepts a string and returns a string — LangChain, CrewAI, AutoGen, OpenAI Assistants, or plain functions.
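As a minimal illustration of that interface, a plain-function agent might look like this (the function body is a toy for demonstration, not part of agenteval):

```python
def my_agent(prompt: str) -> str:
    # Any callable(str) -> str qualifies as an agent -- no framework required.
    if "2+2" in prompt:
        return "4"
    return "I don't know."

my_agent("What is 2+2?")  # "4"
```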
Install:

```bash
pip install agenteval
```

```python
from agenteval import AgentEvaluator, TaskSuite

# Define tasks
suite = TaskSuite.from_list([
    {"name": "math", "prompt": "What is 2+2?", "expected": "4", "category": "math"},
    {"name": "capital", "prompt": "Capital of France?", "expected": "Paris", "category": "geo"},
    {"name": "code", "prompt": "Write hello world in Python", "expected": "print", "category": "code"},
])

# Evaluate agents
evaluator = AgentEvaluator(
    agents={
        "agent_a": my_agent_a,  # any callable(str) -> str
        "agent_b": my_agent_b,
    },
    runs_per_task=3,  # run each task 3x for reliability measurement
)
results = evaluator.run(suite)

# Check metrics
print(results["agent_a"].metrics.accuracy)
print(results["agent_a"].metrics.latency_p95)
print(results["agent_a"].safety.safety_score)

# Compare side-by-side
evaluator.compare_results(results).print_table()
# agent   | accuracy | success_rate | latency_mean | latency_p95 | tokens_mean | cost_per_run
# agent_a | 91.1%    | 100.0%       | 2800ms       | 3200ms      | 450         | $0.0135
# agent_b | 87.3%    | 97.5%        | 3100ms       | 4500ms      | 520         | $0.0156
# Winner: agent_a
```

| Module | Metrics |
|---|---|
| Accuracy | Exact match, containment match, custom judge functions |
| Latency | Mean, p50, p95, p99 (per-run, in ms) |
| Cost | Token-based cost estimation (configurable per-model pricing) |
| Reliability | Success rate across runs, error categorization |
| Safety | PII leakage (email, phone, SSN, credit card), prompt injection detection, custom forbidden patterns |
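To illustrate the cost module's approach, token-based estimation multiplies tokens used by a configured per-model rate. This sketch uses a made-up price table, not agenteval's actual pricing:

```python
# Hypothetical per-1K-token prices; in agenteval, pricing is configurable per model.
PRICING_PER_1K_TOKENS = {"gpt-4o-mini": 0.00015}

def estimate_cost(model: str, tokens_used: int) -> float:
    # cost = (tokens / 1000) * price per 1K tokens
    return tokens_used / 1000 * PRICING_PER_1K_TOKENS[model]

estimate_cost("gpt-4o-mini", 450)  # cost for a 450-token run at the hypothetical rate
```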
Define tasks in code, JSON, or YAML:
```yaml
# tasks.yaml
name: customer_support
tasks:
  - name: greeting
    prompt: "Hi, I need help with my order"
    expected: "help"
    category: greeting
  - name: refund
    prompt: "I want a refund for order #1234"
    expected: "refund"
    category: transactions
```

```python
suite = TaskSuite.from_yaml("tasks.yaml")
```

```python
from agenteval import SafetyChecker

checker = SafetyChecker(
    check_pii=True,        # emails, phones, SSNs, credit cards, IPs
    check_injection=True,  # prompt injection leak detection
    forbidden_patterns=[   # custom regex patterns
        r"SECRET_KEY",
        r"password\s*[:=]",
    ],
)
report = checker.check(run_results)
print(report.safety_score)  # 0.0 - 1.0
print(report.violations)    # list of SafetyViolation
```

Built-in judges for different matching strategies:
```python
from agenteval.judges import exact_match, contains_match, numeric_match

exact_match("Paris", "Paris")                     # True
contains_match("The answer is 42", "42")          # True
numeric_match("3.14159", "3.14", tolerance=0.01)  # True
```

For semantic evaluation using OpenAI or Anthropic:
```python
from agenteval.judges import llm_judge, anthropic_judge

# OpenAI
evaluator = AgentEvaluator(
    agents={"my_agent": agent_fn},
    judge_fn=llm_judge(model="gpt-4o-mini"),
)

# Anthropic
evaluator = AgentEvaluator(
    agents={"my_agent": agent_fn},
    judge_fn=anthropic_judge(model="claude-sonnet-4-20250514"),
)
```

Install LLM support: `pip install agenteval[openai]` or `pip install agenteval[anthropic]`.
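A judge can also be a plain function. The exact `judge_fn` signature is not spelled out above, so this sketch assumes an `(output, expected) -> bool` contract mirroring the built-in judges; treat it as an illustration, not agenteval's documented API:

```python
import re

def regex_judge(output: str, expected: str) -> bool:
    # Hypothetical custom judge: treat `expected` as a regex
    # and search the agent's output for a match.
    return re.search(expected, output) is not None

regex_judge("Refund issued for order #1234", r"order #\d+")  # True
```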
```bash
# Validate a task suite
agenteval validate tasks.yaml

# Show task suite info
agenteval info tasks.yaml

# Version
agenteval version
```

To develop locally:

```bash
git clone https://github.com/atharvajoshi01/agenteval.git
cd agenteval
pip install -e ".[dev]"
pytest
```

License: MIT