feat(eval-harness): minimal live eval harness v0 by ankitdas-volgapartners · Pull Request #45 · affandar/PilotSwarm

ankitdas-volgapartners · 2026-06-01T19:34:04Z

Summary

A minimal live eval harness for PilotSwarm durable-runtime checks. Runs JSON
scenarios against a managed PilotSwarm worker, captures CMS/tool evidence,
and grades responses with deterministic checks plus an optional provider-backed
LLM judge.

The package is purely additive — nothing in packages/sdk/, packages/portal/,
packages/cli/, packages/mcp-server/, or packages/ui-* is touched.

v0 scope (locked)

- One driver (live, managed worker against real PostgreSQL + GitHub token)
- One reporter (console, in-place progress)
- 18 scenarios across live/durable/multi-turn/safety
- 3 run plans: live-smoke, live-critical-path, live-all
- Provider-backed LLM judge (Copilot, gpt-5.4) on opt-in scenarios
- 3 default test tools: delete_agent (safety fixture), test_add (minimal
  scaffolding), test_untrusted_status (indirect-injection fixture)

Every scenario exercises a real PilotSwarm runtime feature: CMS event capture,
durable waits, worker-restart chaos, multi-turn session memory, judge
integration, or adversarial safety behavior.

Future expansion happens through documented plugin extension points
(registerTool, registerDriver, registerReporter, registerScenarioKind).
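
To make the extension-point idea concrete, here is a minimal sketch of what a name-keyed tool registry could look like. The `ToolRegistry` class, the `ToolHandler` type, and the `registerTool(name, handler)` signature are assumptions for illustration only; the harness's actual plugin API is documented in the package and may differ.

```typescript
// Hypothetical sketch: the real registerTool signature is not shown in this PR.
type ToolHandler = (args: Record<string, unknown>) => unknown;

class ToolRegistry {
  private readonly tools = new Map<string, ToolHandler>();

  // Register a tool by name; reject duplicates so a plugin cannot
  // silently shadow an already-registered fixture.
  registerTool(name: string, handler: ToolHandler): void {
    if (this.tools.has(name)) {
      throw new Error(`tool already registered: ${name}`);
    }
    this.tools.set(name, handler);
  }

  // Invoke a registered tool, failing loudly on unknown names.
  call(name: string, args: Record<string, unknown>): unknown {
    const handler = this.tools.get(name);
    if (!handler) {
      throw new Error(`unknown tool: ${name}`);
    }
    return handler(args);
  }
}

// Usage: a stand-in for the test_add scaffolding tool (behavior assumed).
const registry = new ToolRegistry();
registry.registerTool("test_add", (args) => Number(args.a) + Number(args.b));
```

Rejecting duplicate names is one possible design choice; it keeps a downstream plugin from accidentally replacing a safety fixture such as delete_agent.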

Out of scope (deferred deliberately)

Meta scenarios, prompt variants, ablations, model sweeps, expanded reporters,
post-run trajectory summaries, broad platform positioning, additional chaos
types beyond worker-restart.

Diff size

84 files, +6,797 / −28 lines. New package under packages/eval-harness/ plus
small root touches: workspace build wiring, test gate, .gitignore for
generated eval artifacts, and a proposal record.

What's NOT changing

- packages/sdk/, packages/portal/, packages/cli/, packages/mcp-server/,
  packages/ui-* are untouched
- No SDK API changes, no CMS schema changes, no migrations
- Existing tests under packages/sdk/test/local/ and packages/mcp-server/
  are unchanged

The only repo-level deltas are 4 opt-out-compatible additions in commit 2:

| Surface | Change | Opt-out |
| --- | --- | --- |
| Root `package.json` build chain | Appends `--workspace=pilotswarm-eval-harness` | n/a (additive) |
| `scripts/run-tests.sh` | Adds `run_eval_harness_tests` before SDK suites | `SKIP_EVAL_HARNESS_TESTS=1` |
| Root `.gitignore` | Adds `*.eval-results/` | n/a |
| `package-lock.json` | Registers new workspace | n/a |

Testing

- npm run build --workspace=pilotswarm-eval-harness: clean (tsc --noEmit)
- npm test --workspace=pilotswarm-eval-harness: 11 files / 59 tests pass
  in ~1.2s
- Live integration run of live-all: 18/18 PASS end-to-end with the
  provider-backed Copilot judge

Live runs need DATABASE_URL, GITHUB_TOKEN, and a reachable PostgreSQL.
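
As an illustration of that requirement, a preflight check along these lines could fail fast before any worker starts. This is a sketch, not the harness's code: `missingLiveEnv` and `REQUIRED_ENV` are hypothetical names, and the function only checks that the variables are set, not that PostgreSQL is actually reachable.

```typescript
// Environment variables the PR says live runs need.
const REQUIRED_ENV = ["DATABASE_URL", "GITHUB_TOKEN"] as const;

// Return the names of required variables that are unset or blank.
// The env map is passed in so the function is easy to test;
// in a Node script you would typically pass process.env.
function missingLiveEnv(env: Record<string, string | undefined>): string[] {
  return REQUIRED_ENV.filter((name) => (env[name] ?? "").trim() === "");
}

// Example: only GITHUB_TOKEN is reported as missing here.
const missing = missingLiveEnv({ DATABASE_URL: "postgres://localhost/evals" });
```

A caller could print the returned names and exit non-zero when the list is non-empty.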

How to try it

npm install
npm run build --workspace=pilotswarm-eval-harness

# Live smoke (needs DATABASE_URL + GITHUB_TOKEN + Postgres)
set -a; source .env; set +a
packages/eval-harness/bin/run-eval.sh --run=live-smoke

# Full v0 corpus
packages/eval-harness/bin/run-eval.sh --run=live-all

For downstream extension, see packages/eval-harness/docs/DOWNSTREAM-GUIDE.md.

Reviewer guidance

Suggested review order:

- packages/eval-harness/src/schema/{config,manifest,scenario,check-types}.ts: the contract surface
- packages/eval-harness/src/engine/managed-live-runner.ts + chaos-controller.ts + managed-live-support.ts: worker pool ownership, chaos injection during durable waits, dehydration cycle
- packages/eval-harness/src/engine/run-manifest.ts + discover.ts: orchestration entry, manifest cycle handling
- packages/eval-harness/src/checks/llm-judge.ts: provider routing, llmJudgeRequired enforcement, budget reservation/refund-on-error, OPENAI_BASE_URL warning
- packages/eval-harness/src/checks/index.ts: built-in check evaluators (schema refinements prevent vacuous-pass shapes)
- packages/eval-harness/src/reporters/output.ts: redaction patterns
- packages/eval-harness/src/index.ts: public API surface
- packages/eval-harness/bin/run-eval.{sh,ts}: CLI flags, exit codes, plugin loader
- packages/eval-harness/scenarios/: 18 JSON scenarios (each is a one-page contract)
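
For reviewers reading llm-judge.ts, the "budget reservation/refund-on-error" pattern can be sketched roughly as follows. This is an illustrative reconstruction under assumptions, not the code in the PR: `JudgeBudget`, `judgeWithReservation`, and the USD estimate parameter are hypothetical names; only the `llmJudgeReservedUsd` field name comes from the description.

```typescript
// Hypothetical sketch of reserve-then-refund budgeting for judge calls.
class JudgeBudget {
  private reservedUsd = 0;
  private readonly limitUsd: number;

  constructor(limitUsd: number) {
    this.limitUsd = limitUsd;
  }

  // Reserve an estimated cost up front; refuse if it would exceed the limit.
  reserve(amountUsd: number): boolean {
    if (this.reservedUsd + amountUsd > this.limitUsd) return false;
    this.reservedUsd += amountUsd;
    return true;
  }

  // Give a reservation back, e.g. when the judge call never completed.
  refund(amountUsd: number): void {
    this.reservedUsd = Math.max(0, this.reservedUsd - amountUsd);
  }

  // Reports reservations, not measured spend.
  get llmJudgeReservedUsd(): number {
    return this.reservedUsd;
  }
}

// Wrap a judge call: reserve first, refund only if the call throws.
async function judgeWithReservation<T>(
  budget: JudgeBudget,
  estimateUsd: number,
  callJudge: () => Promise<T>,
): Promise<T> {
  if (!budget.reserve(estimateUsd)) {
    throw new Error("judge budget exhausted");
  }
  try {
    return await callJudge();
  } catch (err) {
    budget.refund(estimateUsd);
    throw err;
  }
}
```

Because the reservation is kept on success and returned on error, the reported total reflects what was set aside for completed calls, which is consistent with the field being named Reserved rather than Spent.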

Design notes (honest gating)

- No silent-PASS judge fallback. When llmJudge.enabled=false, an explicit
  llm-judge check errors loudly instead of returning fabricated PASS.
- No fake cost/token telemetry. The live runner has no per-call cost/token
  measurement, so costUsd/tokensIn/tokensOut on ObservedResult are
  optional; judge evidence renders "unmeasured" when absent rather than a
  misleading 0.
- No vacuous-pass schema shapes. Zod refinements on response-contains,
  tool-call-count, cms-event-count reject configurations that would pass
  unconditionally.
- No overfit prompts. Scenarios use realistic phrasing; assertions test
  PilotSwarm features (CMS events, dehydration cycles, session memory) rather
  than prompt-coercion.
- llmJudgeReservedUsd (not Spent): honestly named to reflect that v0
  reports reservations, not measured spend.
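
To show what a "vacuous-pass" configuration means in practice, here is a small dependency-free sketch. The PR implements this with Zod refinements; the plain function below is a stand-in, and the `value`, `min`, and `max` field names are assumptions, since the real check schemas are not shown here.

```typescript
// Hypothetical check shapes; the actual schema fields may differ.
type CheckConfig =
  | { kind: "response-contains"; value: string }
  | { kind: "tool-call-count" | "cms-event-count"; min?: number; max?: number };

// Return reasons a check would pass unconditionally (empty list = acceptable).
function vacuousPassErrors(check: CheckConfig): string[] {
  if (check.kind === "response-contains") {
    // Every string contains the empty string, so this could never fail.
    return check.value.length === 0
      ? ["response-contains: empty value matches any response"]
      : [];
  }
  const errors: string[] = [];
  const noLowerBound = check.min === undefined || check.min <= 0;
  if (noLowerBound && check.max === undefined) {
    // Counts are never negative, so "at least 0" with no maximum always holds.
    errors.push(`${check.kind}: no effective bound, any count would pass`);
  }
  if (check.min !== undefined && check.max !== undefined && check.min > check.max) {
    // The opposite problem: a range no count can satisfy.
    errors.push(`${check.kind}: min exceeds max`);
  }
  return errors;
}
```

For example, `vacuousPassErrors({ kind: "tool-call-count" })` reports a problem, while `{ kind: "tool-call-count", min: 1 }` and `{ kind: "cms-event-count", max: 0 }` are accepted because each can actually fail.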

Commits

- feat(eval-harness): minimal live eval harness v0 (the product)
- chore: wire eval-harness into workspace build + test gate (root integration)
- docs(proposals-impl): add eval-harness proposal record (design context)

A managed live PilotSwarm eval harness that runs JSON scenarios against a real worker, captures CMS/tool evidence, and grades responses with deterministic checks plus an optional provider-backed LLM judge.

v0 scope:
- One driver (live, managed PilotSwarm worker against real PostgreSQL)
- One reporter (console with in-place progress)
- 18 scenarios across live/durable/multi-turn/safety, each exercising a real PilotSwarm runtime feature (CMS event capture, durable waits, worker-restart chaos, multi-turn session memory, judge integration, adversarial safety behavior)
- 3 bundled run plans: live-smoke, live-critical-path, live-all
- Provider-backed LLM judge (Copilot, gpt-5.4) for scenarios that opt in
- 3 default test tools (delete_agent safety fixture, test_add minimal scaffolding, test_untrusted_status for indirect-injection testing)

Out of scope for v0: meta scenarios, prompt variants, ablations, model sweeps, expanded reporters, post-run trajectory summaries.

- Add pilotswarm-eval-harness to root build chain
- Add run_eval_harness_tests stage to scripts/run-tests.sh with SKIP_EVAL_HARNESS_TESTS=1 opt-out
- Update package-lock.json for the new workspace
- Add *.eval-results/ to root .gitignore for generated eval artifacts

Captures the locked v0 scope, run-config/manifest/scenario ownership model, and the kept live path. Active package docs live under packages/eval-harness/.

ankitdas-volgapartners added 3 commits June 2, 2026 00:55

docs(proposals-impl): add eval-harness proposal record

d051ec1


ankitdas-volgapartners wants to merge 3 commits into affandar:main from volga-partners:feat/eval-harness-v0
