From 690572fe4755e09b618c6c3542b6a36cb8342b00 Mon Sep 17 00:00:00 2001
From: miguel
Date: Fri, 15 May 2026 14:36:33 -0700
Subject: [PATCH] docs(verifier): add README for the new rubric verifier

Documents the verifier system landing across PRs #2129-#2138: pipeline
architecture, Approach A vs B tradeoffs, env knobs, on-disk trajectory
layout, offline `bench verify` usage, external harness adapter integration,
prompts library, error taxonomy, and known limitations.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 packages/core/lib/v3/verifier/README.md | 336 ++++++++++++++++++++++++
 1 file changed, 336 insertions(+)
 create mode 100644 packages/core/lib/v3/verifier/README.md

diff --git a/packages/core/lib/v3/verifier/README.md b/packages/core/lib/v3/verifier/README.md
new file mode 100644
index 000000000..2bbac7149
--- /dev/null
+++ b/packages/core/lib/v3/verifier/README.md
@@ -0,0 +1,336 @@
# Stagehand Verifier

Rubric-grounded trajectory verifier for Stagehand agent runs. Produces a structured `Verdict` from a recorded `Trajectory` and a `TaskSpec` — no live browser required.

The verifier is a TypeScript port of the Microsoft FARA paper's `MMRubricAgent` ([arXiv:2511.19663](https://arxiv.org/abs/2511.19663)), restructured to fit Stagehand's evals framework and trimmed of the per-criterion fan-out that is necessary for small-context-window models but counterproductive for frontier multimodal LLMs.

**Status:** Opt-in backend for `V3Evaluator`, enabled via `STAGEHAND_EVALUATOR_BACKEND=verifier`. The legacy YES/NO `V3Evaluator.ask()` path remains the default until benchmarking validates the new pipeline end-to-end.

---

## What it produces

```ts
interface Verdict {
  outcomeSuccess: boolean;          // Did the agent accomplish the task?
  processScore: number;             // [0, 1] — Σ earned / Σ max across rubric criteria
  perCriterion: CriterionScore[];   // Per-rubric-item breakdown
  taskValidity: TaskValidity;       // { isAmbiguous, isInvalid, ambiguityReason?, invalidReason? }
  evidenceInsufficient: string[];   // Criteria the verifier couldn't ground confidently
  findings?: VerifierFinding[];     // Advisory diagnostics (agent_strategy, rubric_quality, etc.)
  firstPointOfFailure?: {           // Earliest taxonomy-classified failure when outcomeSuccess=false
    stepIndex: number;
    errorCode: string;              // e.g., "1.4" — Selection: wrong values
    category: string;
    description: string;
  };
  rawSteps?: Record<string, unknown>; // primaryIntent, reasoning, approach, optionalsMode, totals
}
```

The `outcomeSuccess` flag and `processScore` are deliberately independent. An agent can:

- Follow the right steps but get blocked by an uncontrollable factor (high `processScore`, `outcomeSuccess=false`)
- Reach the answer through an unexpected path (variable `processScore`, `outcomeSuccess=true`)

**Outcome vs process distinction (load-bearing):** `processScore` rewards effort and grants credit when an uncontrollable blocker prevented a sub-step (FARA's "blocker → full credit" rule). `outcomeSuccess` asks "did the agent deliver the user's actual ask?" — render failures, Access Denied walls, and calendar widgets that don't accept clicks all count *against* outcome, regardless of cause. Exception: tasks worded "indicate if no available rooms / no flights" — reporting unavailability *is* the deliverable.
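
The two axes combine into four diagnostic buckets. A minimal sketch of consumer-side triage, assuming only the `Verdict` shape above (the `triage` helper, the bucket names, and the 0.8 cutoff are illustrative assumptions, not part of the API):

```ts
type RunBucket =
  | "clean-success"        // right steps, right deliverable
  | "lucky-success"        // deliverable reached off-rubric
  | "blocked-but-diligent" // right steps, uncontrollable blocker
  | "process-failure";     // wrong steps, no deliverable

// Illustrative cutoff: treat processScore >= 0.8 as "followed the rubric".
function triage(v: Verdict): RunBucket {
  if (v.outcomeSuccess) {
    return v.processScore >= 0.8 ? "clean-success" : "lucky-success";
  }
  return v.processScore >= 0.8 ? "blocked-but-diligent" : "process-failure";
}
```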
+ +--- + +## Pipeline at a glance + +```mermaid +flowchart TB + classDef llm fill:#fde8e2,stroke:#c0584a,color:#141413; + classDef cheap fill:#e8efe3,stroke:#788c5d,color:#141413; + classDef out fill:#fff,stroke:#141413,stroke-width:2px,color:#141413; + + IN[("Trajectory + TaskSpec")]:::cheap + + S0["Step 0a — Rubric resolution
precomputed | cached | generated
0–1 LLM call"]:::cheap + IN --> S0 + + S1["Step 1 — Canonical evidence
SSIM/MSE dedup · 0.7× downscale
+ ariaTree, tool outputs as text
(no LLM)"]:::cheap + S0 --> S1 + + S2["Step 2 — Batched relevance
B evidence × all N criteria per call
⌈M/B⌉ LLM calls (B=4 default)"]:::llm + S1 --> S2 + + S3["Step 3 — Top-K per criterion
(no LLM)"]:::cheap + S2 --> S3 + + CH{{"VERIFIER_APPROACH
default: b"}}:::cheap + S3 --> CH + + subgraph APPA["Approach A — fault-isolated"] + A1["scorePerCriterion
N parallel calls"]:::llm + A2["processScore
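
Step 1's dedup gate is deterministic and LLM-free. A minimal sketch of the gating logic under the documented defaults (the function and its signature are illustrative assumptions; the real module may differ):

```ts
// Illustrative sketch: thresholds mirror VERIFIER_MSE_THRESHOLD / VERIFIER_SSIM_THRESHOLD.
function isDuplicateFrame(
  mse: number,        // mean squared error vs the previous kept frame
  ssim: () => number, // SSIM computed lazily; it is the costlier metric
  mseThreshold = 30,
  ssimThreshold = 0.75,
): boolean {
  // MSE short-circuit: near-identical pixels, skip the SSIM computation entirely.
  if (mse < mseThreshold) return true;
  // Structural similarity: a frame this similar adds no new evidence.
  return ssim() >= ssimThreshold;
}
```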

Per-trajectory totals at default settings (M = evidence points after dedup, N = rubric criteria):

| Approach | LLM calls | Wall time (parallel) | Mean cost (Gemini Flash) |
|---|---|---|---|
| A (fault-isolated) | `⌈M/4⌉ + N + 1` ≈ 28–50 | ~270s | $0.13 |
| B (fused) | `⌈M/4⌉ + 1` ≈ 14–44 | ~310s | $0.12 |

Compared to the original FARA fan-out (~130 calls per verification), both approaches are 3–10× cheaper.
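
The call counts in the table fall straight out of the batch math. A small worked sketch (the function is ours, for illustration only):

```ts
// ⌈M/B⌉ relevance calls, then N + 1 calls (Approach A) or 1 fused call (Approach B).
function estimateLlmCalls(
  evidencePoints: number, // M: evidence points after dedup
  criteria: number,       // N: rubric criteria
  approach: "a" | "b",
  batchSize = 4,          // B: VERIFIER_RELEVANCE_BATCH_SIZE
): number {
  const relevanceCalls = Math.ceil(evidencePoints / batchSize);
  return approach === "a" ? relevanceCalls + criteria + 1 : relevanceCalls + 1;
}

// e.g. M = 60, N = 12: Approach A → 28 calls, Approach B → 16 calls.
```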

---

## The two approaches

### Approach A — per-criterion (fault-isolated)

One LLM call per rubric criterion sees only that criterion's top-K evidence. `processScore` is computed deterministically from the per-criterion earned points (see the sketch after these two descriptions); a final outcome call consumes the pre-scored rubric.

```
[Top-K] → [N per-criterion calls] → [processScore = Σ/max] → [1 outcome call]
```

**Use when:** you need fault isolation. If a single criterion's scoring is suspect, you can re-run just that call. Per-criterion contexts are smaller, so individual LLM calls are faster.

### Approach B — fused (default)

One multimodal call sees the full rubric + per-criterion top-K evidence + action history + final answer, and returns per-criterion scores *plus* outcome in one structured response. Failure analysis and task validity are folded into the same response by default.

```
[Top-K] → [1 fused call] → Verdict (incl. failure_point + task_validity)
```

**Use when:** you want the cheapest pipeline. B's single call is dramatically faster at concurrency=1 (a few seconds vs minutes for A's N sequential calls). It is the project default, and it matches or beats A on the slices we've tested (see the accuracy comparison below).
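
The deterministic aggregation step in Approach A is just the Σ earned / Σ max from the `Verdict` definition. A sketch, assuming `CriterionScore` exposes earned and maximum points (the field names here are our assumption, not the actual type):

```ts
// Assumed shape; the real CriterionScore may carry more fields.
interface CriterionScoreLike {
  earned: number; // points granted (an uncontrollable blocker earns full credit)
  max: number;    // maximum points for the criterion
}

// processScore = Σ earned / Σ max, which lands in [0, 1] by construction.
function computeProcessScore(scores: CriterionScoreLike[]): number {
  const earned = scores.reduce((sum, s) => sum + s.earned, 0);
  const max = scores.reduce((sum, s) => sum + s.max, 0);
  return max === 0 ? 0 : earned / max;
}
```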

### Accuracy comparison

On a varied 30-trajectory ground-truth slice (10 categories × 3 difficulty levels):

| | Approach A | Approach B |
|---|---|---|
| Outcome accuracy vs GT | 80% | **87%** |
| `isInvalid` accuracy | 95–100% | 95–100% |

On the same data, the two approaches' `processScore` values correlate at r = 0.80.

B's 5–7pp edge traces to one consistent failure mode in A: over-application of the FARA "uncontrollable blocker → full credit" rule to `outcomeSuccess`. B correctly separates effort credit (process) from deliverable verification (outcome).

---

## Configuration

All configuration is env-var based; set a variable at runtime to override the default behavior.

| Variable | Default | Purpose |
|---|---|---|
| `VERIFIER_APPROACH` | `b` | `a` or `b`. Picks the per-criterion vs fused pipeline. |
| `VERIFIER_OPTIONAL_STEPS` | `folded` | `folded` (cheap), `separate` (legacy 9a+10 calls), `skip` (no validity/failure). |
| `VERIFIER_RELEVANCE_BATCH_SIZE` | `4` | Evidence points per relevance-grading LLM call. |
| `VERIFIER_MAX_PARALLEL` | `8` | Concurrency cap on parallel LLM calls within the verifier. |
| `VERIFIER_TOP_K` | `5` | Top-K evidence points selected per criterion after Step 2. |
| `VERIFIER_SSIM_THRESHOLD` | `0.75` | Frames with SSIM ≥ this are deduped. |
| `VERIFIER_MSE_THRESHOLD` | `30` | Frames with MSE < this short-circuit as duplicates. |
| `VERIFIER_IMAGE_RESIZE` | `0.7` | Downscale factor before relevance grading. |
| `VERIFIER_DISABLE_RUBRIC_CACHE` | unset | Set `1` to skip `.rubric-cache/` and always regenerate via Step 0a. |
| `VERIFIER_PERSIST_TRAJECTORIES` | follows env | Force the trajectory recorder on/off (`1`/`0`); defaults to on locally, off in CI. |
| `STAGEHAND_EVALUATOR_BACKEND` | `legacy` | Set `verifier` to route `V3Evaluator.verify()` and `.generateRubric()` to the new pipeline. |
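
For a reproducible benchmark run, pin the knobs you care about before constructing the evaluator. A sketch using only the variables documented above (the values shown match the table, so most of these lines only guard against a dirty environment):

```ts
// Pin the fused pipeline with folded optional steps, and keep trajectories in CI.
process.env.STAGEHAND_EVALUATOR_BACKEND = "verifier"; // route V3Evaluator to the new pipeline
process.env.VERIFIER_APPROACH = "b";                  // fused single-call judgment
process.env.VERIFIER_OPTIONAL_STEPS = "folded";       // validity + failure folded into the call
process.env.VERIFIER_TOP_K = "5";                     // evidence points per criterion
process.env.VERIFIER_PERSIST_TRAJECTORIES = "1";      // override the CI-off default
```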

---

## Public API

The verifier is internal to `@browserbasehq/stagehand`. The public surface is on `V3Evaluator`:

```ts
import { V3Evaluator } from "@browserbasehq/stagehand";

const evaluator = new V3Evaluator(v3);

// Generate a rubric from a task spec (Step 0a). Used when no precomputedRubric is provided.
const rubric = await evaluator.generateRubric({
  id: "my-task",
  instruction: "Search for flights from SF to NY",
  initUrl: "https://example.com",
});

// Verify a trajectory against a task spec. Returns a Verdict.
const verdict = await evaluator.verify(trajectory, taskSpec);
```

Hands-off integration via the evals adapter:

```ts
import { runWithVerifier, verdictToSuccess } from "@browserbasehq/stagehand-evals/framework/verifierAdapter";

const { verdict, trajectory, trajectoryDir } = await runWithVerifier({
  v3, agent,
  taskSpec: { id, instruction, initUrl, precomputedRubric },
  dataset: "agent-custom",
  agentOptions: { maxSteps: 50 },
});

return {
  _success: verdictToSuccess(verdict, "outcome"),
  outcomeSuccess: verdict.outcomeSuccess,
  processScore: verdict.processScore,
};
```

---

## On-disk trajectory layout

Each agent run that records a trajectory persists to:

```
.trajectories/<dataset>/<taskId>/
├── task_data.json       # TaskSpec + status + finalAnswer + latest verdict
├── trajectory.json      # steps[] with action history + evidence refs
├── times.json           # timing + token usage + stepCount
├── core.log             # human-readable per-step summary
├── screenshots/
│   ├── probe/<step>.png # tier-2: post-action probe screenshot per step
│   └── agent/<step>.png # tier-1: image the model received per step
└── scores/
    ├── mmrubric_v1.json # initial verdict from the agent run
    └── mmrubric_