AI agent evaluation harness for @worlds/client. Runs deterministic assertion
checks and live model trials against a seeded in-memory LibSQL world.
This repository is a consumer of the @worlds/client package. It tests
whether an AI agent can successfully use the client's public API (search,
sparql, import) through AI SDK tool adapters.
This stays a targeted smoke harness, not a general eval framework. It verifies tool-use behavior, SPARQL handoff quality, step budgets, and read-only guard enforcement through deterministic code checks rather than LLM judging.
    # Install dependencies (Bun runs npm lifecycle scripts for @worlds/client TensorFlow hooks)
    bun install
    # Run unit tests (no API key needed)
    bun run test
    # Run live evals (requires GOOGLE_GENERATIVE_AI_API_KEY in .env)
    bun run evals
    # Smoke one case after assertion changes
    bun run evals --filter search-miss-unknown-label

Node.js users can run the CLI with `bunx tsx src/cli/run.ts` (requires a local
`.env` or exported environment variables).

| Variable | Required | Default |
|---|---|---|
| `GOOGLE_GENERATIVE_AI_API_KEY` | Yes (live) | (none) |
| `EVAL_PROVIDER_ID` | No | `google` |
| `EVAL_MODEL_ID` | No | `gemini-3.1-flash-lite` |

Unit tests (tests/**/*.test.ts) do not use these variables and run without an
API key.

| Flag | Description |
|---|---|
| `--list` | Print matching case ids and descriptions, then exit |
| `--filter <pattern>` | Test-runner-like filter on case id or description (literal or `/regex/i`) |
| `--permit-no-files` | Exit 0 when the filter matches no cases (default: error) |
| `--trials <N>` | Run each selected case N times and aggregate pass rates (default 1) |
| `--min-pass-rate` | With `--trials`, require each case pass rate ≥ threshold (0–1); default requires 100% |
| `--tool-config <id>` | Run with a named tool configuration (default `baseline`) |
| `--compare <a,b>` | Run the same selected cases against multiple tool configs and write a diff artifact |

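The `--filter` behavior described in the table can be pictured with a small sketch. This is not the harness's actual code: `CaseSummary`, `buildFilter`, and `selectCases` are hypothetical names, and treating a literal pattern as a case-sensitive substring match is an assumption.

```typescript
// Hypothetical case shape; the real catalog lives in src/cases/index.ts.
interface CaseSummary {
  id: string;
  description: string;
}

// Turn a --filter argument into a predicate over text.
// "/pattern/" or "/pattern/i" is compiled as a regular expression;
// anything else is matched as a literal substring (an assumption).
function buildFilter(pattern: string): (text: string) => boolean {
  const regexForm = /^\/(.+)\/(i?)$/.exec(pattern);
  if (regexForm) {
    const re = new RegExp(regexForm[1], regexForm[2]);
    return (text) => re.test(text);
  }
  return (text) => text.includes(pattern);
}

// Keep every case whose id or description matches the pattern.
function selectCases(cases: CaseSummary[], pattern: string): CaseSummary[] {
  const matches = buildFilter(pattern);
  return cases.filter((c) => matches(c.id) || matches(c.description));
}
```

Under this sketch, an empty result from `selectCases` is the situation that `--permit-no-files` turns from an error into exit code 0.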

- Summary: per-case pass/fail, step count, tool names, assertion lines
- Local scratch: `results/latest.json`, `results/stats-latest.json`, and `results/compare-*.json` (gitignored)
- Exit code: `0` when all cases pass; `1` on failure; `2` on fatal API abort before a full result is available

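For scripts that wrap the CLI, the exit-code contract above can be written as a single mapping. The sketch below is illustrative only: `SuiteOutcome` and `exitCodeFor` are hypothetical names, not the harness's own types.

```typescript
// Hypothetical outcome type: either the suite produced a full result,
// or a fatal API error aborted it first.
type SuiteOutcome =
  | { kind: "completed"; failedCases: number }
  | { kind: "aborted"; reason: string };

// Map an outcome to the documented exit codes:
// 0 = all cases pass, 1 = at least one failure, 2 = fatal API abort.
function exitCodeFor(outcome: SuiteOutcome): 0 | 1 | 2 {
  if (outcome.kind === "aborted") return 2;
  return outcome.failedCases === 0 ? 0 : 1;
}
```

Keeping the abort code distinct from the failure code lets CI treat an aborted run as an operational signal rather than a failed eval.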

| Layer | Command | API key | When |
|---|---|---|---|
| Unit tests | `bun run ci` | No | Every push and pull request (GitHub CI workflow) |
| Live agent evals | `bun run evals` | Yes | Local dev; GitHub Agent evals workflow (manual dispatch) |
| Scheduled baseline | `--trials 10` | Yes | Weekly (Mon 06:00 UTC), skipped if no harness commits in 7 days; uploads `results/*.json` as a workflow artifact |
| Manual dispatch | configurable | Yes | Same workflow artifact flow as the scheduled baseline |

| Path | Role |
|---|---|
| `src/cli/run.ts` | CLI entry, filtering, suite execution |
| `src/runner/` | Agent execution, system prompt, trajectory |
| `src/tool-configs/` | Named tool-set registry for tool iteration |
| `src/tools/` | Eval-isolated tools and SPARQL read-only guard |
| `src/assertions/assertion-registry.ts` | Composable assertion kinds (`runAssertionSpecs`) |
| `src/assertions/trajectory-reducers.ts` | Shared trajectory extractors and diagnostics |
| `src/cases/index.ts` | Eval case catalog (prompts + assertion specs) |
| `src/cases/test-fixtures.ts` | Golden trajectories and outputs for case tests |
| `src/fixtures/index.ts` | Seeded world fixture registry |
| `src/runner/run-eval-suite.ts` | Suite orchestration, stats, and compare assembly |
| `src/results/result-store.ts` | Result artifact paths and JSON persistence |
| `src/reporting/markdown-table.ts` | Shared Markdown table helpers for CI reports |
| `src/fixtures/primary-world.ts` | Primary fixture (work -> protagonist -> house) |
| `src/fixtures/scholar-world.ts` | Scholar fixture (paper -> author/venue/year) |
| `tests/` | Deterministic unit tests |
| `results/` | Local run output (gitignored) |

The Agent evals workflow uploads results/*.json
as a workflow artifact for each credentialed run. Generated trajectories are not
committed and do not open pull requests. The workflow also publishes a
best-effort GitHub Discussion in the Evals category with links to the workflow
run and artifact.
Repository setup: enable GitHub Discussions and create a dedicated category
named Evals. If the category is missing, CI still uploads artifacts and skips
discussion publication with a warning.
- Full eval-driven iteration loop: AGENTS.md — Eval iteration.
- Deterministic assertions are the pass/fail gate. Prefer proofs (SPARQL guard,
  tool descriptions, `@worlds/client` invariants) over new eval code; when tests
  are needed, add a registry `kind` once and wire cases in `src/cases/index.ts`.
  See AGENTS.md for the full proof-vs-test policy.
- Generated trajectories are external artifacts, not source-controlled history.
- Incomplete, rate-limited, or credential-skipped live runs are operational signals only; do not cite them as benchmark evidence.
- Add dogfooding failures by reusing an existing assertion `kind` on a new or
  existing case; extend the registry only when no composable kind fits.

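The idea of reusing an assertion `kind` across cases can be sketched as a discriminated union plus one checker. Everything here is hypothetical: the `maxSteps` and `calledTool` kinds, the `Trajectory` shape, and the function names are illustrations, not the contents of `src/assertions/assertion-registry.ts`.

```typescript
// Hypothetical trajectory shape: an ordered list of tool calls.
interface ToolCall {
  toolName: string;
}
interface Trajectory {
  steps: ToolCall[];
}

// Each spec names a registered kind plus its parameters.
type AssertionSpec =
  | { kind: "maxSteps"; limit: number }
  | { kind: "calledTool"; toolName: string };

interface AssertionResult {
  kind: string;
  pass: boolean;
  message: string;
}

// One deterministic check per kind; no model is consulted.
function checkSpec(spec: AssertionSpec, t: Trajectory): AssertionResult {
  switch (spec.kind) {
    case "maxSteps": {
      const pass = t.steps.length <= spec.limit;
      return { kind: spec.kind, pass, message: `${t.steps.length} steps (limit ${spec.limit})` };
    }
    case "calledTool": {
      const pass = t.steps.some((s) => s.toolName === spec.toolName);
      return { kind: spec.kind, pass, message: `${spec.toolName} ${pass ? "was" : "was not"} called` };
    }
  }
}

// A case composes existing kinds instead of adding new eval code.
function checkAll(specs: AssertionSpec[], t: Trajectory): AssertionResult[] {
  return specs.map((spec) => checkSpec(spec, t));
}
```

In this shape, a new dogfooding failure becomes another entry in a case's spec list, and the union grows only when no existing kind can express the check.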
The default baseline tool config maps the discovery role to searchWorld and
the query role to executeSparql. Case prompts use semantic placeholders such
as {{discovery}} and {{query}}, so new tool configs can swap tool names
without rewriting the scenario catalog.
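
A minimal sketch of how role placeholders might be resolved against a tool config follows. The role names and the baseline tool names come from the text above; `ToolConfig`, `renderPrompt`, and the choice to throw on an unknown placeholder are assumptions for illustration.

```typescript
// Roles named in the case prompts.
type ToolRole = "discovery" | "query";

// A tool config maps each role to a concrete tool name.
type ToolConfig = Record<ToolRole, string>;

// The baseline mapping described in this README.
const baseline: ToolConfig = {
  discovery: "searchWorld",
  query: "executeSparql",
};

// Replace every {{role}} placeholder with the configured tool name.
// Unknown placeholders throw so a typo in a prompt fails loudly (an assumption).
function renderPrompt(template: string, config: ToolConfig): string {
  return template.replace(/\{\{(\w+)\}\}/g, (whole: string, role: string) => {
    if (Object.prototype.hasOwnProperty.call(config, role)) {
      return config[role as ToolRole];
    }
    throw new Error(`Unknown placeholder ${whole}`);
  });
}
```

For example, `renderPrompt("Use {{discovery}} first, then {{query}}.", baseline)` yields `"Use searchWorld first, then executeSparql."`, and a second config with different tool names would reuse the same template.
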

To test one config:

    bun run evals --tool-config baseline --trials 10 --min-pass-rate 0.7

To compare configs after registering another config in `src/tool-configs/`:

    bun run evals --compare baseline,experimental --trials 10 --min-pass-rate 0.7
    bun run scripts/render-compare-report.ts

Comparison runs write per-config outputs plus a side-by-side
`results/compare-*.json` artifact for spotting case-level regressions and
assertion-level near misses.
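
The `--trials 10 --min-pass-rate 0.7` gate used in these commands reduces to simple arithmetic per case. The sketch below shows that calculation under assumed names (`CaseStats`, `summarizeTrials`); the real aggregation lives in the suite runner and may record more fields.

```typescript
// Hypothetical per-case aggregate over repeated trials.
interface CaseStats {
  caseId: string;
  passes: number;
  trials: number;
  passRate: number;
  meetsThreshold: boolean;
}

// Aggregate trial outcomes for one case. The default threshold of 1
// mirrors the documented default of requiring a 100% pass rate.
function summarizeTrials(caseId: string, outcomes: boolean[], minPassRate = 1): CaseStats {
  const trials = outcomes.length;
  const passes = outcomes.filter(Boolean).length;
  // A case with no trials cannot meet any positive threshold.
  const passRate = trials === 0 ? 0 : passes / trials;
  return { caseId, passes, trials, passRate, meetsThreshold: passRate >= minPassRate };
}
```

With 7 passes out of 10 trials the pass rate is 0.7, which meets a 0.7 threshold but not the default; 6 out of 10 fails both.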