feat(eval-harness): minimal live eval harness v0 #45

Open
ankitdas-volgapartners wants to merge 3 commits into affandar:main from
volga-partners:feat/eval-harness-v0

Summary

A minimal live eval harness for PilotSwarm durable-runtime checks. Runs JSON
scenarios against a managed PilotSwarm worker, captures CMS/tool evidence,
and grades responses with deterministic checks plus an optional provider-backed
LLM judge.

The package is purely additive — nothing in packages/sdk/, packages/portal/,
packages/cli/, packages/mcp-server/, or packages/ui-* is touched.

v0 scope (locked)

  • One driver (live, managed worker against real PostgreSQL + GitHub token)
  • One reporter (console, in-place progress)
  • 18 scenarios across live/durable/multi-turn/safety
  • 3 run plans: live-smoke, live-critical-path, live-all
  • Provider-backed LLM judge (Copilot, gpt-5.4) on opt-in scenarios
  • 3 default test tools: delete_agent (safety fixture), test_add (minimal
    scaffolding), test_untrusted_status (indirect-injection fixture)

Every scenario exercises a real PilotSwarm runtime feature: CMS event capture,
durable waits, worker-restart chaos, multi-turn session memory, judge
integration, or adversarial safety behavior.

Future expansion happens through documented plugin extension points
(registerTool, registerDriver, registerReporter, registerScenarioKind).

Out of scope (deferred deliberately)

Meta scenarios, prompt variants, ablations, model sweeps, expanded reporters,
post-run trajectory summaries, broad platform positioning, additional chaos
types beyond worker-restart.

Diff size

84 files, +6,797 / −28 lines. New package under packages/eval-harness/ plus
small root touches: workspace build wiring, test gate, .gitignore for
generated eval artifacts, and a proposal record.

What's NOT changing

  • packages/sdk/, packages/portal/, packages/cli/, packages/mcp-server/,
    packages/ui-* — untouched
  • No SDK API changes, no CMS schema changes, no migrations
  • Existing tests under packages/sdk/test/local/ and packages/mcp-server/
    are unchanged

The only repo-level deltas are 4 opt-out-compatible additions in commit 2:

| Surface | Change | Opt-out |
| --- | --- | --- |
| Root package.json build chain | Appends --workspace=pilotswarm-eval-harness | n/a (additive) |
| scripts/run-tests.sh | Adds run_eval_harness_tests before SDK suites | SKIP_EVAL_HARNESS_TESTS=1 |
| Root .gitignore | Adds *.eval-results/ | n/a |
| package-lock.json | Registers new workspace | n/a |

Testing

  • npm run build --workspace=pilotswarm-eval-harness — clean (tsc --noEmit)
  • npm test --workspace=pilotswarm-eval-harness — 11 files / 59 tests pass
    in ~1.2s
  • Live integration run (live-all plan): 18/18 scenarios PASS end-to-end
    with the provider-backed Copilot judge

Live runs need DATABASE_URL, GITHUB_TOKEN, and a reachable PostgreSQL.
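
As an illustration of the prerequisite above, a preflight check could fail fast before a live run starts. This is a minimal sketch, not code from this PR: `missingLiveEnv` and `assertLiveEnv` are hypothetical names, and only the two variable names come from the description.

```typescript
// Hypothetical preflight helper; not part of the eval-harness API.
const REQUIRED_ENV = ["DATABASE_URL", "GITHUB_TOKEN"] as const;

type Env = Record<string, string | undefined>;

// Returns the names of required variables that are unset or blank.
function missingLiveEnv(env: Env): string[] {
  return REQUIRED_ENV.filter((name) => {
    const value = env[name];
    return value === undefined || value.trim() === "";
  });
}

// Throws one error that lists every missing variable at once.
function assertLiveEnv(env: Env): void {
  const missing = missingLiveEnv(env);
  if (missing.length > 0) {
    throw new Error(`live eval run needs: ${missing.join(", ")}`);
  }
}
```

In a Node process this would be called as `assertLiveEnv(process.env)`. It only checks that the variables are present; it does not verify that PostgreSQL is actually reachable.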

How to try it

npm install
npm run build --workspace=pilotswarm-eval-harness

# Live smoke (needs DATABASE_URL + GITHUB_TOKEN + Postgres)
set -a; source .env; set +a
packages/eval-harness/bin/run-eval.sh --run=live-smoke

# Full v0 corpus
packages/eval-harness/bin/run-eval.sh --run=live-all

Downstream extension is documented in packages/eval-harness/docs/DOWNSTREAM-GUIDE.md.

Reviewer guidance

Suggested review order:

  1. packages/eval-harness/src/schema/{config,manifest,scenario,check-types}.ts — the contract surface
  2. packages/eval-harness/src/engine/managed-live-runner.ts + chaos-controller.ts + managed-live-support.ts — worker pool ownership, chaos injection during durable waits, dehydration cycle
  3. packages/eval-harness/src/engine/run-manifest.ts + discover.ts — orchestration entry, manifest cycle handling
  4. packages/eval-harness/src/checks/llm-judge.ts — provider routing, llmJudgeRequired enforcement, budget reservation/refund-on-error, OPENAI_BASE_URL warning
  5. packages/eval-harness/src/checks/index.ts — built-in check evaluators (schema refinements prevent vacuous-pass shapes)
  6. packages/eval-harness/src/reporters/output.ts — redaction patterns
  7. packages/eval-harness/src/index.ts — public API surface
  8. packages/eval-harness/bin/run-eval.{sh,ts} — CLI flags, exit codes, plugin loader
  9. packages/eval-harness/scenarios/ — 18 JSON scenarios (each is a one-page contract)
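
For reviewers looking at item 4, the reservation/refund-on-error idea can be pictured roughly as follows. This is a sketch under assumptions, not the code in llm-judge.ts: `JudgeBudget`, `judgeWithBudget`, and the per-call estimate are hypothetical names chosen for illustration.

```typescript
// Hypothetical budget tracker; names are illustrative, not the harness API.
class JudgeBudget {
  private reservedUsd = 0;
  private readonly limitUsd: number;

  constructor(limitUsd: number) {
    this.limitUsd = limitUsd;
  }

  get reserved(): number {
    return this.reservedUsd;
  }

  // Reserve an estimated cost up front; refuse if it would exceed the limit.
  reserve(estimateUsd: number): boolean {
    if (this.reservedUsd + estimateUsd > this.limitUsd) return false;
    this.reservedUsd += estimateUsd;
    return true;
  }

  // Release a reservation, never dropping below zero.
  refund(estimateUsd: number): void {
    this.reservedUsd = Math.max(0, this.reservedUsd - estimateUsd);
  }
}

// Reserves before the judge call and refunds only if the call throws.
async function judgeWithBudget<T>(
  budget: JudgeBudget,
  estimateUsd: number,
  callJudge: () => Promise<T>,
): Promise<T> {
  if (!budget.reserve(estimateUsd)) {
    throw new Error("judge budget exhausted");
  }
  try {
    return await callJudge();
  } catch (err) {
    budget.refund(estimateUsd); // no verdict was produced, so release the reservation
    throw err;
  }
}
```

Because the sketch tracks reservations rather than measured spend, the running total corresponds to a "reserved" figure, which matches the naming choice of llmJudgeReservedUsd described under the design notes.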

Design notes (honest gating)

  • No silent-PASS judge fallback. When llmJudge.enabled=false, an explicit
    llm-judge check errors loudly instead of returning a fabricated PASS.
  • No fake cost/token telemetry. The live runner has no per-call cost/token
    measurement, so costUsd/tokensIn/tokensOut on ObservedResult are
    optional; judge evidence renders "unmeasured" when absent rather than a
    misleading 0.
  • No vacuous-pass schema shapes. Zod refinements on response-contains,
    tool-call-count, cms-event-count reject configurations that would pass
    unconditionally.
  • No overfit prompts. Scenarios use realistic phrasing; assertions test
    PilotSwarm features (CMS events, dehydration cycles, session memory) rather
    than prompt-coercion.
  • llmJudgeReservedUsd (not Spent) — honestly named to reflect that v0
    reports reservations, not measured spend.
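
Two of the notes above (no vacuous-pass shapes, no fake telemetry) can be illustrated with a short sketch. It is written in plain TypeScript as a stand-in for the Zod refinements the PR describes; the `ToolCallCountCheck` shape, its `min`/`max` fields, and both function names are assumptions for illustration, not the harness schema.

```typescript
// Hypothetical check config; field names are assumptions, not the harness schema.
interface ToolCallCountCheck {
  type: "tool-call-count";
  tool: string;
  min?: number;
  max?: number;
}

// Returns a list of problems; an empty list means the check is able to fail.
function validateToolCallCount(check: ToolCallCountCheck): string[] {
  const problems: string[] = [];
  const { min, max } = check;
  // Call counts are never negative, so "min <= 0" constrains nothing.
  const hasLowerBound = min !== undefined && min > 0;
  const hasUpperBound = max !== undefined;
  if (!hasLowerBound && !hasUpperBound) {
    problems.push("vacuous: every call count satisfies this check");
  }
  if (min !== undefined && max !== undefined && min > max) {
    problems.push("unsatisfiable: min is greater than max");
  }
  return problems;
}

// Renders an optional cost; a missing measurement is not shown as zero.
function formatCostUsd(costUsd: number | undefined): string {
  return costUsd === undefined ? "unmeasured" : `$${costUsd.toFixed(4)}`;
}
```

Under these assumptions, a check with `max: 0` on a tool such as delete_agent is accepted because it can fail, while a check with no bounds (or only `min: 0`) is rejected at validation time instead of passing silently at run time.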

Commits

  1. feat(eval-harness): minimal live eval harness v0 — the product
  2. chore: wire eval-harness into workspace build + test gate — root integration
  3. docs(proposals-impl): add eval-harness proposal record — design context

feat(eval-harness): minimal live eval harness v0

A managed live PilotSwarm eval harness that runs JSON scenarios against
a real worker, captures CMS/tool evidence, and grades responses with
deterministic checks plus an optional provider-backed LLM judge.

v0 scope:
- One driver (live, managed PilotSwarm worker against real PostgreSQL)
- One reporter (console with in-place progress)
- 18 scenarios across live/durable/multi-turn/safety, each exercising
  a real PilotSwarm runtime feature (CMS event capture, durable waits,
  worker-restart chaos, multi-turn session memory, judge integration,
  adversarial safety behavior)
- 3 bundled run plans: live-smoke, live-critical-path, live-all
- Provider-backed LLM judge (Copilot, gpt-5.4) for scenarios that opt in
- 3 default test tools (delete_agent safety fixture, test_add minimal
  scaffolding, test_untrusted_status for indirect-injection testing)

Out of scope for v0: meta scenarios, prompt variants, ablations, model
sweeps, expanded reporters, post-run trajectory summaries.

chore: wire eval-harness into workspace build + test gate

- Add pilotswarm-eval-harness to root build chain
- Add run_eval_harness_tests stage to scripts/run-tests.sh with
  SKIP_EVAL_HARNESS_TESTS=1 opt-out
- Update package-lock.json for the new workspace
- Add *.eval-results/ to root .gitignore for generated eval artifacts

docs(proposals-impl): add eval-harness proposal record

Captures the locked v0 scope, run-config/manifest/scenario ownership
model, and the kept live path. Active package docs live under
packages/eval-harness/.