worlds-client-evals

AI agent evaluation harness for @worlds/client. Runs deterministic assertion checks and live model trials against a seeded in-memory LibSQL world.

This repository is a consumer of the @worlds/client package. It tests whether an AI agent can successfully use the client's public API (search, sparql, import) through AI SDK tool adapters.

Design direction

This stays a targeted smoke harness, not a general eval framework. It verifies tool-use behavior, SPARQL handoff quality, step budgets, and read-only guard enforcement through deterministic code checks rather than LLM judging.
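To make "deterministic code checks" concrete, here is a minimal sketch of what a read-only SPARQL check could look like. The names `isReadOnlySparql` and `stripLiteralsAndComments` and the keyword list are assumptions made for illustration; the harness's actual guard lives under src/tools/ and may work differently.

```typescript
// Hypothetical sketch, not the harness's actual guard implementation.
// Keywords that start SPARQL 1.1 Update operations.
const UPDATE_KEYWORDS = [
  "INSERT", "DELETE", "LOAD", "CLEAR", "CREATE", "DROP", "COPY", "MOVE", "ADD",
];

/** Blank out string literals, IRIs, and comments so keywords inside them are ignored. */
function stripLiteralsAndComments(query: string): string {
  return query
    .replace(/"(?:[^"\\]|\\.)*"|'(?:[^'\\]|\\.)*'/g, '""') // string literals
    .replace(/<[^<>\s]*>/g, "<>") // IRIs (done before comments because IRIs may contain '#')
    .replace(/#[^\n]*/g, ""); // line comments
}

/** Returns true when no update keyword appears as a standalone token. */
function isReadOnlySparql(query: string): boolean {
  const cleaned = stripLiteralsAndComments(query);
  // The lookarounds skip variables (?add), prefixed names (ex:add), and longer words (payload).
  const pattern = new RegExp(
    `(?<![:?$\\w])(?:${UPDATE_KEYWORDS.join("|")})(?![\\w:])`,
    "i",
  );
  return !pattern.test(cleaned);
}
```

A check like this is conservative by design: it may reject an unusual read query, but it needs no model judgment and gives the same answer on every run.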

Quickstart

# Install dependencies (Bun runs npm lifecycle scripts for @worlds/client TensorFlow hooks)
bun install

# Run unit tests (no API key needed)
bun run test

# Run live evals (requires GOOGLE_GENERATIVE_AI_API_KEY in .env)
bun run evals

# Smoke one case after assertion changes
bun run evals --filter search-miss-unknown-label

Node.js users can run the CLI with bunx tsx src/cli/run.ts (requires a local .env or exported environment variables).

Environment

| Variable | Required | Default |
| --- | --- | --- |
| `GOOGLE_GENERATIVE_AI_API_KEY` | Yes (live) | (none) |
| `EVAL_PROVIDER_ID` | No | `google` |
| `EVAL_MODEL_ID` | No | `gemini-3.1-flash-lite` |

Unit tests (tests/**/*.test.ts) do not use these variables and run without an API key.
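The defaults above can be pictured with a small sketch. The `resolveEvalEnv` function and the `EvalEnv` shape are hypothetical and only illustrate the fallback behavior; the environment is passed in as a plain object so the sketch stays testable without touching the real process environment.

```typescript
// Hypothetical sketch of resolving eval settings; not the harness's actual code.
interface EvalEnv {
  providerId: string;
  modelId: string;
  apiKey: string | undefined;
}

type EnvSource = Record<string, string | undefined>;

function resolveEvalEnv(env: EnvSource, live: boolean): EvalEnv {
  const apiKey = env.GOOGLE_GENERATIVE_AI_API_KEY;
  // Only live runs need credentials; unit tests pass live = false.
  if (live && !apiKey) {
    throw new Error("GOOGLE_GENERATIVE_AI_API_KEY is required for live evals");
  }
  return {
    // `||` treats an empty string the same as an unset variable.
    providerId: env.EVAL_PROVIDER_ID || "google",
    modelId: env.EVAL_MODEL_ID || "gemini-3.1-flash-lite",
    apiKey,
  };
}
```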

Flags

| Flag | Description |
| --- | --- |
| `--list` | Print matching case ids and descriptions, then exit |
| `--filter <pattern>` | Select cases whose `id` or `description` matches the pattern (literal or `/regex/i`), as in a test runner |
| `--permit-no-files` | Exit 0 when the filter matches no cases (default: error) |
| `--trials <N>` | Run each selected case `N` times and aggregate pass rates (default `1`) |
| `--min-pass-rate <rate>` | With `--trials`, require each case's pass rate to be at least `<rate>` (0–1); without it, every trial must pass |
| `--tool-config <id>` | Run with a named tool configuration (default `baseline`) |
| `--compare <a,b>` | Run the same selected cases against multiple tool configs and write a diff artifact |
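The `--filter` and `--permit-no-files` semantics can be sketched as follows. The function names are hypothetical, and treating a literal pattern as a substring match is an assumption; the real CLI may match differently.

```typescript
// Hypothetical sketch of case filtering; not the harness's actual code.
interface EvalCaseSummary {
  id: string;
  description: string;
}

function buildCaseMatcher(pattern: string): (c: EvalCaseSummary) => boolean {
  // A pattern written as /source/flags is treated as a regular expression.
  const regexForm = /^\/(.+)\/([a-z]*)$/.exec(pattern);
  if (regexForm) {
    const [, source = "", flags = ""] = regexForm;
    // Drop g/y so repeated test() calls do not carry lastIndex state.
    const re = new RegExp(source, flags.replace(/[gy]/g, ""));
    return (c) => re.test(c.id) || re.test(c.description);
  }
  // Assumption: literal patterns match as substrings.
  return (c) => c.id.includes(pattern) || c.description.includes(pattern);
}

function selectCases(
  cases: EvalCaseSummary[],
  pattern: string | undefined,
  permitNoFiles: boolean,
): EvalCaseSummary[] {
  const selected = pattern ? cases.filter(buildCaseMatcher(pattern)) : cases;
  if (selected.length === 0 && !permitNoFiles) {
    throw new Error(`No cases match filter: ${pattern}`);
  }
  return selected;
}
```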

Output

Summary: per-case pass/fail, step count, tool names, assertion lines
Local scratch: results/latest.json, results/stats-latest.json, and results/compare-*.json (gitignored)
Exit code: 0 when all cases pass; 1 on failure; 2 on fatal API abort before a full result is available
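The exit-code rule above amounts to a small decision function. This is a sketch under assumed types (`SuiteOutcome` and `exitCodeFor` are hypothetical names), not the CLI's actual implementation.

```typescript
// Hypothetical sketch of the exit-code mapping described above.
type SuiteOutcome =
  | { kind: "completed"; results: { id: string; passed: boolean }[] }
  | { kind: "aborted"; reason: string };

function exitCodeFor(outcome: SuiteOutcome): 0 | 1 | 2 {
  // A fatal API abort means no full result exists, so it is reported separately from failure.
  if (outcome.kind === "aborted") return 2;
  return outcome.results.every((r) => r.passed) ? 0 : 1;
}
```

Keeping code 2 distinct from code 1 lets CI tell "the agent failed a case" apart from "the run never produced a full result".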

CI

| Layer | Command | API key | When |
| --- | --- | --- | --- |
| Unit tests | `bun run ci` | No | Every push and pull request (GitHub `CI` workflow) |
| Live agent evals | `bun run evals` | Yes | Local dev; GitHub `Agent evals` workflow (manual dispatch) |
| Scheduled baseline | `--trials 10` | Yes | Weekly (Mon 06:00 UTC), skipped if no harness commits in 7 days; uploads `results/*.json` as a workflow artifact |
| Manual dispatch | configurable | Yes | Same workflow artifact flow as the scheduled baseline |

Layout

| Path | Role |
| --- | --- |
| `src/cli/run.ts` | CLI entry, filtering, suite execution |
| `src/runner/` | Agent execution, system prompt, trajectory |
| `src/tool-configs/` | Named tool-set registry for tool iteration |
| `src/tools/` | Eval-isolated tools and SPARQL read-only guard |
| `src/assertions/assertion-registry.ts` | Composable assertion kinds (`runAssertionSpecs`) |
| `src/assertions/trajectory-reducers.ts` | Shared trajectory extractors and diagnostics |
| `src/cases/index.ts` | Eval case catalog (prompts + assertion specs) |
| `src/cases/test-fixtures.ts` | Golden trajectories and outputs for case tests |
| `src/fixtures/index.ts` | Seeded world fixture registry |
| `src/runner/run-eval-suite.ts` | Suite orchestration, stats, and compare assembly |
| `src/results/result-store.ts` | Result artifact paths and JSON persistence |
| `src/reporting/markdown-table.ts` | Shared Markdown table helpers for CI reports |
| `src/fixtures/primary-world.ts` | Primary fixture (work -> protagonist -> house) |
| `src/fixtures/scholar-world.ts` | Scholar fixture (paper -> author/venue/year) |
| `tests/` | Deterministic unit tests |
| `results/` | Local run output (gitignored) |

Eval artifacts from CI

The Agent evals workflow uploads results/*.json as a workflow artifact for each credentialed run. Generated trajectories are not committed and do not open pull requests. The workflow also publishes a best-effort GitHub Discussion in the Evals category with links to the workflow run and artifact.

Repository setup: enable GitHub Discussions and create a dedicated category named Evals. If the category is missing, CI still uploads artifacts and skips discussion publication with a warning.

Evaluation policy

Full eval-driven iteration loop: AGENTS.md — Eval iteration.
Deterministic assertions are the pass/fail gate. Prefer proofs (SPARQL guard, tool descriptions, @worlds/client invariants) over new eval code; when tests are needed, add a registry kind once and wire cases in src/cases/index.ts. See AGENTS.md for the full proof-vs-test policy.
Generated trajectories are external artifacts, not source-controlled history.
Incomplete, rate-limited, or credential-skipped live runs are operational signals only; do not cite them as benchmark evidence.
Add dogfooding failures by reusing an existing assertion kind on a new or existing case; extend the registry only when no composable kind fits.
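The "reuse a kind, extend only when nothing fits" policy can be illustrated with a small sketch of composable assertion specs. The spec kinds, the `Trajectory` shape, and the `checkSpec`/`checkSpecs` helpers are all hypothetical; the real registry in src/assertions/assertion-registry.ts may define different kinds and signatures.

```typescript
// Hypothetical sketch of composable assertion kinds; not the real registry.
interface Trajectory {
  steps: { toolName: string }[];
  finalText: string;
}

type AssertionSpec =
  | { kind: "maxSteps"; limit: number }
  | { kind: "calledTool"; toolName: string }
  | { kind: "finalTextIncludes"; text: string };

interface AssertionResult {
  kind: string;
  passed: boolean;
  message: string;
}

function checkSpec(spec: AssertionSpec, t: Trajectory): AssertionResult {
  switch (spec.kind) {
    case "maxSteps":
      return {
        kind: spec.kind,
        passed: t.steps.length <= spec.limit,
        message: `${t.steps.length} steps (limit ${spec.limit})`,
      };
    case "calledTool":
      return {
        kind: spec.kind,
        passed: t.steps.some((s) => s.toolName === spec.toolName),
        message: `expected a call to ${spec.toolName}`,
      };
    case "finalTextIncludes":
      return {
        kind: spec.kind,
        passed: t.finalText.includes(spec.text),
        message: `expected final text to include "${spec.text}"`,
      };
  }
}

// A case is just data: a list of specs evaluated against one trajectory.
function checkSpecs(specs: AssertionSpec[], t: Trajectory): AssertionResult[] {
  return specs.map((spec) => checkSpec(spec, t));
}
```

Under this shape, a new dogfooding failure becomes a new list of specs on a case, and only a genuinely new kind of check requires touching `checkSpec`.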

Tool configuration iteration

The default baseline tool config maps the discovery role to searchWorld and the query role to executeSparql. Case prompts use semantic placeholders such as {{discovery}} and {{query}}, so new tool configs can swap tool names without rewriting the scenario catalog.
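A minimal sketch of how such placeholders might be resolved is shown below. The `ToolConfig` type and `renderPrompt` function are hypothetical; only the role names and the baseline tool names come from the paragraph above.

```typescript
// Hypothetical sketch of placeholder substitution; not the harness's actual code.
type ToolRole = "discovery" | "query";
type ToolConfig = Record<ToolRole, string>;

// Mapping taken from the description of the baseline config.
const baseline: ToolConfig = {
  discovery: "searchWorld",
  query: "executeSparql",
};

function renderPrompt(template: string, config: ToolConfig): string {
  return template.replace(/\{\{(\w+)\}\}/g, (match: string, role: string) => {
    // Fail loudly on unknown roles instead of leaving a raw placeholder in the prompt.
    if (Object.prototype.hasOwnProperty.call(config, role)) {
      return config[role as ToolRole];
    }
    throw new Error(`Unknown placeholder ${match}`);
  });
}
```

With this approach, a prompt such as "Use {{discovery}} first, then {{query}}." is written once in the case catalog and rendered per config at run time.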

To test one config:

bun run evals --tool-config baseline --trials 10 --min-pass-rate 0.7
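For the command above, the pass-rate gate can be pictured with the following sketch. The `TrialResult` and `CaseStats` shapes and the `aggregatePassRates` function are assumptions for illustration; the actual stats written to results/stats-latest.json may be structured differently.

```typescript
// Hypothetical sketch of per-case pass-rate aggregation; not the harness's actual code.
interface TrialResult {
  caseId: string;
  passed: boolean;
}

interface CaseStats {
  caseId: string;
  trials: number;
  passes: number;
  passRate: number;
  meetsThreshold: boolean;
}

// minPassRate defaults to 1, i.e. every trial must pass.
function aggregatePassRates(results: TrialResult[], minPassRate = 1): CaseStats[] {
  const byCase = new Map<string, { trials: number; passes: number }>();
  for (const r of results) {
    const entry = byCase.get(r.caseId) ?? { trials: 0, passes: 0 };
    entry.trials += 1;
    if (r.passed) entry.passes += 1;
    byCase.set(r.caseId, entry);
  }
  return Array.from(byCase.entries()).map(([caseId, { trials, passes }]) => {
    const passRate = passes / trials;
    return { caseId, trials, passes, passRate, meetsThreshold: passRate >= minPassRate };
  });
}
```

The threshold applies per case rather than to the suite average, so one flaky case cannot hide behind several stable ones.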

To compare configs after registering another config in src/tool-configs/:

bun run evals --compare baseline,experimental --trials 10 --min-pass-rate 0.7
bun run scripts/render-compare-report.ts

Comparison runs write per-config outputs plus a side-by-side results/compare-*.json artifact for spotting case-level regressions and assertion-level near misses.
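As a rough picture of what a case-level comparison involves, the sketch below diffs per-case pass rates between two configs. The `PassRates` and `CaseDiff` shapes and the `diffPassRates` function are hypothetical; the real results/compare-*.json schema is not described here and may carry more detail, such as assertion-level results.

```typescript
// Hypothetical sketch of a case-level comparison; not the real compare artifact format.
type PassRates = Record<string, number>; // caseId -> pass rate in [0, 1]

interface CaseDiff {
  caseId: string;
  base: number | undefined;
  candidate: number | undefined;
  delta: number | undefined; // candidate - base, when both configs ran the case
  regressed: boolean;
}

function diffPassRates(base: PassRates, candidate: PassRates): CaseDiff[] {
  const ids = Array.from(
    new Set([...Object.keys(base), ...Object.keys(candidate)]),
  ).sort();
  return ids.map((caseId) => {
    const b = base[caseId];
    const c = candidate[caseId];
    const delta = b !== undefined && c !== undefined ? c - b : undefined;
    return {
      caseId,
      base: b,
      candidate: c,
      delta,
      regressed: delta !== undefined && delta < 0,
    };
  });
}
```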
