AegisEval

Reliable infrastructure for agentic knowledge-work evals.

AegisEval is a local-first Python harness for building, running, grading, replaying, and auditing realistic knowledge-work agent tasks. It is designed around one premise: an agent eval score is only useful if the task, environment, grader, trace, and final artifacts can be inspected and reproduced.

What it does

Defines versioned task specs for knowledge-work environments.
Creates deterministic local workspaces per trial.
Captures JSONL traces for lifecycle events, model calls, artifacts, and grader results.
Grades final environment artifacts rather than trusting agent self-reports.
Runs multi-task, multi-trial suites with flake visibility.
Generates static HTML reports with per-trial trace drilldowns and artifact previews.
Produces eval-quality audits with a human transcript review queue.
Includes red-team scanners for fabricated citations and artifact spoofing.
Supports dummy, subprocess, OpenAI-compatible, and Anthropic model adapters.
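
To make the trace bullet above concrete, here is a minimal sketch of appending events to a JSONL file and reading them back. It is not AegisEval's implementation: the helper names (`append_event`, `read_events`, `demo_trace`) and the event fields (`ts`, `kind`, and the payload keys) are assumptions chosen for illustration, and the real trace schema may differ.

```python
import json
import tempfile
import time
from pathlib import Path


def append_event(trace_path: Path, kind: str, **payload) -> None:
    """Append one event as a single JSON line (hypothetical schema)."""
    event = {"ts": time.time(), "kind": kind, **payload}
    with trace_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(event, sort_keys=True) + "\n")


def read_events(trace_path: Path) -> list[dict]:
    """Load every event back in file order, skipping blank lines."""
    with trace_path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def demo_trace() -> list[str]:
    """Write a tiny trace to a temp dir and return the event kinds in order."""
    with tempfile.TemporaryDirectory() as tmp:
        trace = Path(tmp) / "trace.jsonl"
        append_event(trace, "trial_started", task_id="doc_synthesis_001")
        append_event(trace, "artifact_written", path="final.md")
        append_event(trace, "grader_result", passed=True)
        return [event["kind"] for event in read_events(trace)]
```

One JSON object per line keeps the trace append-only and easy to stream, which is what makes replay and per-trial drilldowns cheap to build on top of it.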

Why this exists

Agent evals are easy to make impressive and hard to make trustworthy. A pass rate can hide brittle graders, contaminated tasks, missing controls, API failures, reward hacking, and models that merely claim success. AegisEval treats reliability infrastructure as part of the eval itself.

The harness is informed by Anthropic's public writing on agent evals, long-running agent harnesses, reward hacking, and automated auditing. See docs/anthropic-eval-harness-notes.md. Security boundaries are documented in SECURITY.md and docs/threat-model.md.

Quickstart

git clone https://github.com/stchakwdev/aegiseval.git
cd aegiseval
python3 -m venv .venv
.venv/bin/pip install -e . pytest
.venv/bin/pytest -q

Run one deterministic task:

.venv/bin/aegis run tasks/doc_synthesis_001 --agent dummy --out runs/doc-demo
.venv/bin/aegis replay runs/doc-demo
.venv/bin/aegis report runs/doc-demo

Run a full local suite:

.venv/bin/aegis suite tasks/doc_synthesis_001 tasks/data_analysis_001 \
  --trials 2 \
  --agent dummy \
  --out runs/suite-demo

.venv/bin/aegis audit-suite runs/suite-demo
.venv/bin/aegis redteam --out runs/redteam-demo
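
Running more than one trial per task is what makes flakes visible. The sketch below shows one way to summarize that, assuming each trial reduces to a `(task_id, passed)` pair. The function name `summarize_flakes` and the output keys are hypothetical, not part of AegisEval's API or its `suite_result.json` schema.

```python
from collections import defaultdict


def summarize_flakes(trials: list[tuple[str, bool]]) -> dict[str, dict]:
    """Group trial outcomes by task and flag tasks with mixed results."""
    grouped: dict[str, list[bool]] = defaultdict(list)
    for task_id, passed in trials:
        grouped[task_id].append(passed)

    summary = {}
    for task_id, outcomes in grouped.items():
        passes = sum(outcomes)
        summary[task_id] = {
            "trials": len(outcomes),
            "pass_rate": passes / len(outcomes),
            # A task is flaky when some, but not all, trials passed.
            "flaky": 0 < passes < len(outcomes),
        }
    return summary
```

A task that passes one trial and fails the next has a 0.5 pass rate and is flagged as flaky, whereas a single-trial run would have reported it as a clean pass or a clean fail.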

Model-backed agents

Model adapters write files into the same workspace as scripted agents. The model must respond with JSON only, in this shape:

{"files": [{"path": "final.md", "content": "..."}]}
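
As an illustration of how a response in this shape could be turned into workspace files, here is a minimal sketch. It is an assumption about how such an adapter might behave, not AegisEval's actual code: `write_response_files` is a hypothetical helper, and the rule that rejects paths outside the workspace is a design choice made for this example.

```python
import json
from pathlib import Path


def write_response_files(raw: str, workspace: Path) -> list[str]:
    """Parse a {"files": [...]} response and write each file under workspace.

    Raises ValueError on malformed input or on paths that escape the workspace.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"response is not valid JSON: {exc}") from exc
    if not isinstance(data, dict) or not isinstance(data.get("files"), list):
        raise ValueError('response must be an object with a "files" list')

    root = workspace.resolve()
    written = []
    for entry in data["files"]:
        if not isinstance(entry, dict):
            raise ValueError("each file entry must be an object")
        path, content = entry.get("path"), entry.get("content")
        if not isinstance(path, str) or not isinstance(content, str):
            raise ValueError('each file needs string "path" and "content"')
        target = (root / path).resolve()
        # Reject absolute paths, "..", and the workspace root itself.
        if target == root or not target.is_relative_to(root):
            raise ValueError(f"path escapes workspace: {path!r}")
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content, encoding="utf-8")
        written.append(target.relative_to(root).as_posix())
    return written
```

Resolving each target before writing means a model cannot place files outside the trial workspace, so graders only ever see artifacts created inside it. `Path.is_relative_to` requires Python 3.9 or later.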

OpenAI-compatible endpoints, such as OpenAI, OpenRouter, Z.ai-compatible gateways, or a local vLLM server:

OPENROUTER_API_KEY=... .venv/bin/aegis run tasks/doc_synthesis_001 \
  --agent openai-compatible \
  --model z-ai/glm-5.1 \
  --base-url https://openrouter.ai/api/v1 \
  --api-key-env OPENROUTER_API_KEY \
  --out runs/openrouter-demo

Anthropic Messages API:

ANTHROPIC_API_KEY=... .venv/bin/aegis run tasks/doc_synthesis_001 \
  --agent anthropic \
  --model claude-sonnet-4-5 \
  --api-key-env ANTHROPIC_API_KEY \
  --out runs/anthropic-demo

Overnight evals

For longer real-model runs:

PYTHONPATH=src python scripts/run_overnight_eval.py \
  --trials 25 \
  --model z-ai/glm-5.1 \
  --base-url https://openrouter.ai/api/v1 \
  --api-key-env OPENROUTER_API_KEY \
  --out runs/overnight/glm-5.1-$(date -u +%Y%m%dT%H%M%SZ)

See docs/overnight-evals.md for background execution, artifacts, and review checklist.

Example report

A curated static report is included at examples/reports/portfolio-demo/report.html. It covers five deterministic knowledge-work tasks, each run for two trials, and includes trace drilldowns plus an eval-quality audit.

Output artifacts

A suite run writes:

runs/suite-demo/
├── suite_result.json
├── report.html
├── eval_audit.json
├── eval_audit.md
└── trials/
    └── <task_id>/trial-001/
        ├── result.json
        ├── trace.jsonl
        ├── trace.html
        └── workspace/

The key portfolio artifacts are report.html, trace.html, artifact previews, and eval_audit.md.
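
A quick way to sanity-check a finished run is to compare the directory against the layout above. The following sketch does only that: it checks that the listed files and directories exist and makes no assumption about their contents. The names `missing_outputs`, `SUITE_FILES`, and `TRIAL_ENTRIES` are hypothetical and not part of AegisEval.

```python
from pathlib import Path

# File names taken from the layout shown above.
SUITE_FILES = ("suite_result.json", "report.html", "eval_audit.json", "eval_audit.md")
TRIAL_ENTRIES = ("result.json", "trace.jsonl", "trace.html", "workspace")


def missing_outputs(suite_dir: Path) -> list[str]:
    """Return relative paths from the documented layout that are absent."""
    missing = [name for name in SUITE_FILES if not (suite_dir / name).is_file()]

    trials_root = suite_dir / "trials"
    trial_dirs = sorted(trials_root.glob("*/trial-*")) if trials_root.is_dir() else []
    if not trial_dirs:
        missing.append("trials/<task_id>/trial-NNN/")

    for trial in trial_dirs:
        for name in TRIAL_ENTRIES:
            if not (trial / name).exists():
                missing.append((trial / name).relative_to(suite_dir).as_posix())
    return missing
```

Calling `missing_outputs(Path("runs/suite-demo"))` after a suite run should return an empty list if every expected artifact was written.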

Role fit

AegisEval is built to demonstrate production-grade ownership of agentic evaluation infrastructure: canonical task specs, stable runs, instrumentation, outcome grading, replayability, reliability metrics, release gates, and eval-integrity failure analysis.

Development

python3 -m pip install -e '.[dev]'
python3 -m pytest -q
ruff check .
PYTHONPATH=src python scripts/e2e_smoke.py
python3 scripts/render_brand_assets.py

AegisEval ships a py.typed marker and keeps runtime dependencies small. ruff is configured for local quality checks but is intentionally not required at runtime.
