feat(eval-harness): minimal live eval harness v0 #45
Open
ankitdas-volgapartners wants to merge 3 commits into
A managed live PilotSwarm eval harness that runs JSON scenarios against a real worker, captures CMS/tool evidence, and grades responses with deterministic checks plus an optional provider-backed LLM judge.

v0 scope:
- One driver (live, managed PilotSwarm worker against real PostgreSQL)
- One reporter (console with in-place progress)
- 18 scenarios across live/durable/multi-turn/safety, each exercising a real PilotSwarm runtime feature (CMS event capture, durable waits, worker-restart chaos, multi-turn session memory, judge integration, adversarial safety behavior)
- 3 bundled run plans: live-smoke, live-critical-path, live-all
- Provider-backed LLM judge (Copilot, gpt-5.4) for scenarios that opt in
- 3 default test tools (delete_agent safety fixture, test_add minimal scaffolding, test_untrusted_status for indirect-injection testing)

Out of scope for v0: meta scenarios, prompt variants, ablations, model sweeps, expanded reporters, post-run trajectory summaries.
- Add pilotswarm-eval-harness to root build chain
- Add run_eval_harness_tests stage to scripts/run-tests.sh with SKIP_EVAL_HARNESS_TESTS=1 opt-out
- Update package-lock.json for the new workspace
- Add *.eval-results/ to root .gitignore for generated eval artifacts
Captures the locked v0 scope, run-config/manifest/scenario ownership model, and the kept live path. Active package docs live under packages/eval-harness/.
Summary
A minimal live eval harness for PilotSwarm durable-runtime checks. Runs JSON
scenarios against a managed PilotSwarm worker, captures CMS/tool evidence,
and grades responses with deterministic checks plus an optional provider-backed
LLM judge.
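As a rough illustration of what "deterministic checks plus an optional judge" can mean in practice, the sketch below shows a substring check that refuses a configuration that could never fail, and a guard that errors when a judge check is requested while the judge is disabled. All names and shapes here (`ResponseContainsCheck`, `allOf`, `assertJudgeEnabled`) are assumptions made for illustration; the real contract lives in the package's schema files and may differ.

```typescript
// Hypothetical shapes for illustration only; not the harness's actual schema.
type CheckResult = { status: "pass" | "fail"; detail: string };

interface ResponseContainsCheck {
  type: "response-contains";
  allOf: string[]; // every substring must appear in the response
}

// Reject configurations that would pass unconditionally:
// an empty list passes vacuously, and "" is contained in every string.
function validateResponseContains(check: ResponseContainsCheck): void {
  if (check.allOf.length === 0 || check.allOf.some((s) => s.length === 0)) {
    throw new Error("response-contains needs at least one non-empty substring");
  }
}

// Deterministic grading: the same response always yields the same result.
function runResponseContains(
  check: ResponseContainsCheck,
  response: string,
): CheckResult {
  validateResponseContains(check);
  const missing = check.allOf.filter((needle) => !response.includes(needle));
  return missing.length === 0
    ? { status: "pass", detail: "all expected substrings present" }
    : { status: "fail", detail: `missing: ${missing.join(", ")}` };
}

// An explicit judge check with the judge disabled is an error, never a PASS.
function assertJudgeEnabled(judgeEnabled: boolean): void {
  if (!judgeEnabled) {
    throw new Error("llm-judge check requested but llmJudge.enabled=false");
  }
}
```

Throwing at validation time, rather than returning a result, keeps a misconfigured scenario from being counted as either a pass or a fail.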
The package is purely additive: nothing in packages/sdk/, packages/portal/, packages/cli/, packages/mcp-server/, or packages/ui-* is touched.

v0 scope (locked)
- One driver (live, managed worker against real PostgreSQL + GitHub token)
- One reporter (console, in-place progress)
- 18 scenarios across live/durable/multi-turn/safety
- 3 run plans: live-smoke, live-critical-path, live-all
- Provider-backed LLM judge (Copilot, gpt-5.4) for scenarios that opt in
- 3 default test tools: delete_agent (safety fixture), test_add (minimal scaffolding), test_untrusted_status (indirect-injection fixture)

Every scenario exercises a real PilotSwarm runtime feature: CMS event capture, durable waits, worker-restart chaos, multi-turn session memory, judge integration, or adversarial safety behavior.
Future expansion happens through documented plugin extension points (registerTool, registerDriver, registerReporter, registerScenarioKind).

Out of scope (deferred deliberately)
Meta scenarios, prompt variants, ablations, model sweeps, expanded reporters, post-run trajectory summaries, broad platform positioning, additional chaos types beyond worker-restart.

Diff size
84 files, +6,797 / −28 lines. New package under packages/eval-harness/ plus small root touches: workspace build wiring, test gate, .gitignore for generated eval artifacts, and a proposal record.
What's NOT changing
- packages/sdk/, packages/portal/, packages/cli/, packages/mcp-server/, packages/ui-*: untouched
- packages/sdk/test/local/ and packages/mcp-server/ are unchanged
The only repo-level deltas are 4 opt-out-compatible additions in commit 2:
- package.json: build chain adds --workspace=pilotswarm-eval-harness
- scripts/run-tests.sh: runs run_eval_harness_tests before SDK suites; opt out with SKIP_EVAL_HARNESS_TESTS=1
- .gitignore: adds *.eval-results/
- package-lock.json: updated for the new workspace

Testing
- npm run build --workspace=pilotswarm-eval-harness: clean (tsc --noEmit)
- npm test --workspace=pilotswarm-eval-harness: 11 files / 59 tests pass in ~1.2s
- live-all integration: 18/18 PASS end-to-end with provider-backed Copilot judge

Live runs need DATABASE_URL, GITHUB_TOKEN, and a reachable PostgreSQL.

How to try it
Downstream extension via packages/eval-harness/docs/DOWNSTREAM-GUIDE.md.

Reviewer guidance
Suggested review order:
1. packages/eval-harness/src/schema/{config,manifest,scenario,check-types}.ts: the contract surface
2. packages/eval-harness/src/engine/managed-live-runner.ts + chaos-controller.ts + managed-live-support.ts: worker pool ownership, chaos injection during durable waits, dehydration cycle
3. packages/eval-harness/src/engine/run-manifest.ts + discover.ts: orchestration entry, manifest cycle handling
4. packages/eval-harness/src/checks/llm-judge.ts: provider routing, llmJudgeRequired enforcement, budget reservation/refund-on-error, OPENAI_BASE_URL warning
5. packages/eval-harness/src/checks/index.ts: built-in check evaluators (schema refinements prevent vacuous-pass shapes)
6. packages/eval-harness/src/reporters/output.ts: redaction patterns
7. packages/eval-harness/src/index.ts: public API surface
8. packages/eval-harness/bin/run-eval.{sh,ts}: CLI flags, exit codes, plugin loader
9. packages/eval-harness/scenarios/: 18 JSON scenarios (each is a one-page contract)

Design notes (honest gating)
- When llmJudge.enabled=false, an explicit llm-judge check errors loudly instead of returning a fabricated PASS.
- v0 has no cost or token measurement, so costUsd/tokensIn/tokensOut on ObservedResult are optional; judge evidence renders "unmeasured" when absent rather than a misleading 0.
- response-contains, tool-call-count, and cms-event-count reject configurations that would pass unconditionally.
- Scenarios exercise real PilotSwarm features (CMS events, dehydration cycles, session memory) rather than prompt-coercion.
- llmJudgeReservedUsd (not Spent): named to reflect that v0 reports reservations, not measured spend.
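The reservation-versus-spend distinction, together with the refund-on-error behavior mentioned in the reviewer guidance, could look roughly like the sketch below. This is a minimal illustration under assumptions: `JudgeBudget`, `judgeWithBudget`, and the flat per-call estimate are hypothetical names and simplifications, not the code in checks/llm-judge.ts.

```typescript
// Hypothetical sketch: JudgeBudget and judgeWithBudget are illustrative names,
// not the harness's actual API.
class JudgeBudget {
  private readonly limitUsd: number;
  private reservedUsd = 0;

  constructor(limitUsd: number) {
    this.limitUsd = limitUsd;
  }

  // Reserve an estimated cost before the judge call is made.
  reserve(estimateUsd: number): void {
    if (!(estimateUsd > 0)) {
      throw new Error("estimate must be a positive number");
    }
    if (this.reservedUsd + estimateUsd > this.limitUsd) {
      throw new Error("LLM judge budget exhausted");
    }
    this.reservedUsd += estimateUsd;
  }

  // Give a reservation back, e.g. when the provider call failed.
  refund(estimateUsd: number): void {
    this.reservedUsd = Math.max(0, this.reservedUsd - estimateUsd);
  }

  // Reported as "reserved", since nothing here measures actual spend.
  get llmJudgeReservedUsd(): number {
    return this.reservedUsd;
  }
}

async function judgeWithBudget<T>(
  budget: JudgeBudget,
  estimateUsd: number,
  callJudge: () => Promise<T>,
): Promise<T> {
  budget.reserve(estimateUsd);
  try {
    return await callJudge();
  } catch (err) {
    budget.refund(estimateUsd); // failed calls should not consume budget
    throw err;
  }
}
```

In this sketch a successful call keeps its reservation, so the reported total is an upper-bound estimate rather than a measured cost, which matches the Reserved naming.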
Commits
1. feat(eval-harness): minimal live eval harness v0 (the product)
2. chore: wire eval-harness into workspace build + test gate (root integration)
3. docs(proposals-impl): add eval-harness proposal record (design context)