Run your AI system against a dataset. Capture traces. Score with evaluators. Compare runs. Ship.
eval-harness is a config-driven evaluation framework for AI systems — agents, RAG pipelines, code-modifying tools, multi-turn assistants, raw LLM endpoints. You describe the run in one YAML file. The harness dispatches the matrix of cases × variants, captures structured traces, runs evaluators, persists results, and produces a comparable summary.
It is not a benchmark suite. It is the harness that runs your benchmarks.
📦 On PyPI: https://pypi.org/project/eval-harness/ 📚 Source: https://github.com/regokan/evalh
```bash
pip install eval-harness                              # core
pip install 'eval-harness[anthropic]'                 # + Claude LLM-judge backend
pip install 'eval-harness[anthropic,langfuse,otel]'   # + observability mirrors
pip install 'eval-harness[all]'                       # everything
```

Python 3.11+. The core install pulls only Pydantic v2, httpx, click, rich, jsonpath-ng, pyyaml, python-dotenv, jsonschema, and fsspec. LLM SDKs and platform clients ship as optional extras, so you only pay for what you use.
```bash
# 1. Install
pip install 'eval-harness[anthropic]'

# 2. Drop your Anthropic key into the smoke fixture
echo "ANTHROPIC_API_KEY=sk-ant-..." > examples/tiny_demo/.env

# 3. Run
evalh run examples/tiny_demo/eval.yaml

# 4. Inspect a case
evalh inspect runs/<run_id> --case tiny_demo_001

# 5. Compare two runs
evalh compare runs/<run_a> runs/<run_b>
```

That run produces a `runs/<run_id>/` directory with:
```text
config.yaml     # exact config used, secrets masked
traces.jsonl    # one Trace per (case, variant)
results.jsonl   # one EvaluationResult per (case, variant, evaluator)
summary.yaml    # per-variant pass-rates + baseline comparison
report.md       # human-readable summary
```

These five files are the durable surface. Everything else (drift reports, inspect output, webhook posts) is derived from them.
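Because the artifacts are plain JSONL and YAML, they are easy to consume directly, e.g. for a custom dashboard. A minimal sketch, assuming each `results.jsonl` record carries `variant` and `passed` fields (the real schema lives in docs/DataModel.md):

```python
import json
from collections import Counter
from pathlib import Path

run_dir = Path("runs") / "<run_id>"  # substitute a real run id

passes: Counter[str] = Counter()
totals: Counter[str] = Counter()
for line in (run_dir / "results.jsonl").read_text().splitlines():
    record = json.loads(line)  # one EvaluationResult per line
    totals[record["variant"]] += 1
    passes[record["variant"]] += bool(record["passed"])

for variant, total in totals.items():
    print(f"{variant}: {passes[variant]}/{total} evaluator checks passed")
```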
| Concern | What you get |
|---|---|
| System under test | Plug in any HTTP service, Python function, CLI subprocess, branch checkout, Docker image, multi-turn user simulator, or replay-from-historical-trace. |
| Evaluators | 13 built-ins covering text checks, tool-call assertions, LLM-as-judge (nl-assertions + rubric), schema validation, latency/cost gates, thinking-token rules, semantic similarity, git-diff checks, command exit codes. Plus a clean extension API. |
| Trace storage | Local JSONL (default), SQLite, Postgres, OTel collector, Langfuse, Phoenix, Arize, Braintrust, Slack / Discord / Linear webhooks. |
| Dataset sources | YAML, JSONL, plus production-traffic pulls from Langfuse, Phoenix, Arize, Helicone, Braintrust (with embed_full_trace for replay-style evaluation). |
| Workspace isolation | Tempdir snapshot (default), git worktree, Docker volume (sandboxed; can't read host $HOME). |
| Variants | One run dispatches the full matrix: N cases × M system configurations. Use it for A/B testing, fleet evals, branch comparison, and stochastic sampling. |
| Distribution | Local async (default), Ray, Modal, Celery, Kubernetes Jobs. Same code, different executor. |
| Drift detection | Promote a run as the baseline; `evalh drift` surfaces regressions vs. the baseline. Wire to Slack via webhook on a daily cron. |
| Reports | Markdown summary, baseline ComparisonReport, per-evaluator rollup, regressions/improvements case-by-case. |
```bash
evalh run <eval.yaml>                                 # execute an eval
evalh run --retry-only-failed <run_dir>               # re-run cells that errored
evalh re-evaluate <run_dir> [--add <evaluator>]       # re-score existing traces offline
evalh inspect <run_dir> [--case <id>] [--failed]      # view a case + its results + filesystem artifacts
evalh compare <run_a> <run_b>                         # diff two runs (regressions / improvements)
evalh promote <run_dir>                               # mark a run as the eval's baseline
evalh drift <run_dir> [--exit-nonzero-on-regression]  # compare against baseline; CI gate
```
```yaml
eval:
  name: listing_price

  dataset:
    type: yaml
    path: cases.yaml

  systems:                     # one entry per variant
    - name: agent_main
      adapter: http
      endpoint: http://localhost:8000/chat
      response_mapping:
        final_answer: $.answer
        tool_calls: $.tool_calls
    - name: agent_experimental
      adapter: http
      endpoint: http://localhost:8000/chat
      query_params: { variant: experimental }
      response_mapping: { final_answer: $.answer, tool_calls: $.tool_calls }

  evaluators:
    - name: must_call_listing_tool
      type: tool_called
      config: { tool_name: get_listing_details }
    - name: answer_quality
      type: llm_judge
      config:
        model: claude-4-7
        nl_assertions:
          - "The answer mentions the listing's suburb."
          - "The answer compares the listing price to the suburb average."
        pass_when: all

  run:
    max_concurrency: 4
    baseline_variant: agent_main
    cost_limit_usd: 5.00

  output:
    - { type: local_files, path: runs/ }
```

See docs/ConfigSchema.md for every field.
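For intuition, `response_mapping` values are JSONPath expressions applied to the service's response body. A rough illustration using jsonpath-ng (already a core dependency); the harness's real extraction code may differ (see docs/Adapters.md):

```python
from jsonpath_ng import parse

# A pretend HTTP response body from the agent service above.
response_body = {
    "answer": "Listed at $910k, above the Richmond median.",
    "tool_calls": [{"name": "get_listing_details"}],
}
mapping = {"final_answer": "$.answer", "tool_calls": "$.tool_calls"}

# Apply each JSONPath and keep the first match, mirroring what a
# response_mapping conceptually does when building the Trace.
extracted = {
    field: parse(path).find(response_body)[0].value
    for field, path in mapping.items()
}
print(extracted["final_answer"])
```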
Four runnable references live under `examples/`:

- `tiny_demo/` — self-contained smoke test against Claude. Needs only `ANTHROPIC_API_KEY`. Finishes in under a minute.
- `listing_price/` — realistic-shape eval: HTTP agent service, two variants, LLM judge. Plug your service in.
- `online_eval/` — replay-style evaluation. The fixture adapter ships embedded historical traces; the `replay` SystemAdapter scores them. Swap the fixture for Langfuse / Phoenix / Arize to score production traffic.
- `coding_agent/` — workspace-mutating agent. Claude patches a fixture repo; the `command` evaluator runs pytest in the artifact directory.
The default `LocalExecutor` uses `asyncio.gather` + a semaphore — perfect for thousands of cases on one box. For larger fleets, plug in another executor:
```yaml
run:
  executor:
    type: ray                # or modal, celery, kubernetes
    address: auto            # or your cluster address
    object_store_memory: 2_147_483_648
```

The cell is the unit of distribution. Workers rebuild adapters + evaluators from your `eval.yaml` and the entry-point registry — config travels, code doesn't, so your custom evaluators work on Ray workers without pickling pitfalls. See docs/Executors.md.
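To make the "config travels, code doesn't" idea concrete: a worker holding only the evaluator's `type` string can re-import the class through the standard entry-point machinery. This loader is illustrative, not eval-harness's internal code; only the group name (shown in the next section) comes from the docs:

```python
from importlib.metadata import entry_points

def load_evaluator_class(type_name: str):
    """Resolve an evaluator class by entry-point name, importing it fresh
    on the worker instead of pickling a live object across the wire."""
    for ep in entry_points(group="eval_harness.evaluators"):
        if ep.name == type_name:
            return ep.load()
    raise KeyError(f"no evaluator registered as {type_name!r}")
```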
When the built-ins don't cover your domain (e.g., "the SQL the agent generated returns the same rowset as the reference SQL"), write your own and register it via Python entry-points — no fork of eval-harness required:
```toml
# your-package/pyproject.toml
[project.entry-points."eval_harness.evaluators"]
sql_equivalent = "your_package.evaluators.sql_equivalent:SqlEquivalentEvaluator"
```

```yaml
# your eval.yaml
evaluators:
  - name: query_correctness
    type: sql_equivalent
    config: { reference_sql: "SELECT id FROM listings WHERE suburb='Richmond'" }
```

The same extension pattern works for system adapters, dataset adapters, trace stores, workspace adapters, embedder backends, and LLM-judge backends. See docs/Evaluators.md and docs/Adapters.md.
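For completeness, a hedged sketch of the evaluator class referenced above. The base-class contract, `evaluate()` signature, and result shape are assumptions (check docs/Evaluators.md for the real interface); the rowset comparison itself is plain `sqlite3`:

```python
# your_package/evaluators/sql_equivalent.py
import sqlite3

class SqlEquivalentEvaluator:
    """Pass when the agent's generated SQL returns the same rowset as the
    reference SQL. Method names here are illustrative assumptions, not
    eval-harness's actual evaluator contract."""

    def __init__(self, config: dict):
        self.reference_sql = config["reference_sql"]
        self.db_path = config.get("db_path", ":memory:")  # hypothetical option

    def evaluate(self, trace) -> dict:
        generated_sql = trace.final_answer  # assumed Trace field
        with sqlite3.connect(self.db_path) as conn:
            expected = sorted(conn.execute(self.reference_sql).fetchall())
            actual = sorted(conn.execute(generated_sql).fetchall())
        return {"passed": expected == actual}
```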
eval-harness coexists with your existing observability stack — it doesn't replace it. The local `runs/<run_id>/` directory stays canonical; remote sinks are mirrors. Failed mirror writes don't abort the run; they land in `summary.yaml` under `sink_errors`.
```yaml
output:
  - { type: local_files, path: runs/ }                              # canonical
  - { type: otel, endpoint: "https://api.honeycomb.io" }            # mirror to Honeycomb
  - { type: langfuse, api_key: "${LANGFUSE_API_KEY}", host: "..." } # mirror to Langfuse UI
  - { type: webhook, platform: slack, url: "${SLACK_WEBHOOK_URL}" } # daily summary post
```

Backends shipped: OTel (Honeycomb / Datadog / Tempo / Grafana / Phoenix-OTLP / self-hosted Langfuse), Langfuse, Phoenix, Arize, Braintrust, Helicone (dataset only), Slack / Discord / Linear (webhook). See docs/Observability.md.
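Since failed mirror writes land in `summary.yaml` rather than aborting the run, a post-run health check can be as simple as the sketch below (the `sink_errors` key comes from the note above; the shape of each entry is an assumption):

```python
from pathlib import Path

import yaml  # pyyaml, a core dependency

run_dir = Path("runs") / "<run_id>"  # substitute a real run id
summary = yaml.safe_load((run_dir / "summary.yaml").read_text())
for err in summary.get("sink_errors", []):
    print("mirror write failed:", err)
```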
Two reference workflows live under `templates/`:

- `templates/eval.yml` — on every PR, run the eval against the PR head, compare with `main`'s baseline, and post a markdown summary back to the PR comments.
- `templates/eval-daily.yml` — on a schedule (or `workflow_dispatch`), run the eval, compute drift vs. the saved baseline, and post regressions to a webhook channel.
Walkthrough in docs/CI.md.
| Topic | Doc |
|---|---|
| Why the project exists | docs/PRD.md |
| End-to-end design | docs/Architecture.md |
| Trace / Case / Result / Summary models | docs/DataModel.md |
| eval.yaml and cases.yaml field reference | docs/ConfigSchema.md |
| System / Dataset / TraceStore / Workspace / Enricher contracts | docs/Adapters.md |
| Built-in evaluators + writing your own | docs/Evaluators.md |
| The variant matrix (A/B, branch, fleet, sampling) | docs/Variants.md |
| Filesystem artifacts + sandboxed workspaces | docs/Filesystem.md |
| Concurrency model + executor abstraction | docs/Concurrency.md |
| Distributed executors (Ray, Modal, Celery, K8s) | docs/Executors.md |
| Observability platform integrations | docs/Observability.md |
| Drift detection + CLI surface | docs/CLI.md |
| GitHub Actions recipes | docs/CI.md |
| Project layout + plugin packaging | docs/RepositoryStructure.md |
| Milestone-by-milestone history | CHANGELOG.md, docs/Roadmap.md |
All planned milestones are shipped — v0 through v2. The project covers what the roadmap set out to do and nothing beyond it (hosted SaaS, web dashboard, auth, and built-in dataset libraries are explicitly out of scope; see docs/Roadmap.md > Forever-maybe).
Snapshot: 132 source files · 657+ tests · `ruff` and `mypy --strict` clean · 6 adapter families · 5 executor backends · 8 observability platform integrations.
Issues and PRs welcome. See CONTRIBUTING.md for setup, testing, and submission guidelines. The architectural rails the project is built against live under .claude/rules/ — read those before substantive PRs.
MIT. Copyright © 2026 eval-harness contributors.