
feat(evals): migrate remaining verifier datasets #2136

Open
miguelg719 wants to merge 5 commits into miguelgonzalez/verifier-07-evals-adapter from miguelgonzalez/verifier-08-dataset-migrations

Conversation

miguelg719 (Collaborator) commented May 15, 2026

Why

Now that WebTailBench is wired up, the remaining eval datasets and custom agent tasks need to use the same verifier-backed scoring path so the benchmark matrix runs consistently across suites.

What Changed

  • Migrated OnlineMind2Web and WebVoyager task wrappers to verifier-backed scoring.
  • Migrated custom agent bench tasks to runWithVerifier (see the sketch after this list).
  • Added adHocRubric helper for custom task rubrics.
  • Backfilled WebTailBench rubric data.
  • Removed repeated unsafe EVAL_SUCCESS_MODE casts now that success-mode validation is centralized.
  • Preserved task-level success reporting through verifier verdicts.
  • Removed upstream verifier references from comments.
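
For orientation, here is a minimal sketch of what a migrated custom agent task might look like after this change. The names runWithVerifier and adHocRubric come from this PR, but the import path, argument shapes, and return fields shown here are assumptions rather than the exact API:

// Hypothetical sketch of a custom agent task on the verifier-backed
// scoring path. The import path, option names, and the TaskArgs
// stand-in are assumptions; only runWithVerifier and adHocRubric are
// named in this PR.
import { runWithVerifier, adHocRubric } from "../framework/verifierAdapter";

type TaskArgs = { v3: unknown; agent: unknown }; // stand-in for the repo's task args

export async function amazonShoesCart({ v3, agent }: TaskArgs) {
  const taskSpec = {
    id: "amazon_shoes_cart",
    initUrl: "https://www.amazon.com",
    instruction: "Find a pair of running shoes and add one to the cart.",
    // Custom tasks supply a precomputed rubric, so the adapter skips
    // Step 0a rubric generation (see the architecture diagram below).
    precomputedRubric: adHocRubric(
      "A running-shoe product page was opened",
      "The shoes were added to the cart",
    ),
  };
  const { verdict, trajectoryDir } = await runWithVerifier({
    v3,
    agent,
    taskSpec,
    dataset: "agent-custom",
  });
  // Real tasks map the verdict through verdictToSuccess(verdict, successMode);
  // outcomeSuccess is used directly here only to keep the sketch short.
  return { _success: verdict.outcomeSuccess, trajectoryDir };
}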

Tests

  • pnpm --filter @browserbasehq/stagehand-evals run typecheck
  • pnpm --dir packages/evals exec vitest run tests/framework/rubricCache.test.ts tests/framework/verifierAdapter.test.ts tests/tui/commandTree.test.ts tests/framework/claudeCodeRunner.test.ts tests/framework/codexRunner.test.ts tests/cli.test.ts
  • pnpm -w exec prettier --check packages/core/lib/v3/verifier packages/core/tests/unit/verifier-failure-step-parser.test.ts packages/evals/cli.ts packages/evals/framework packages/evals/tests/framework/rubricCache.test.ts packages/evals/tests/framework/verifierAdapter.test.ts packages/evals/tests/tui/commandTree.test.ts packages/evals/tui packages/evals/tasks/bench/agent
  • git diff --check

changeset-bot (bot) commented May 15, 2026

⚠️ No Changeset found

Latest commit: bf49158

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types


cubic-dev-ai (bot, Contributor) left a comment


1 issue found across 45 files

Confidence score: 4/5

  • This PR is likely safe to merge with minimal risk: the reported issue is moderate (5/10) and appears limited to an edge case around environment configuration.
  • In packages/evals/tasks/bench/agent/amazon_shoes_cart.ts, casting process.env.EVAL_SUCCESS_MODE without validation can allow unrecognized values, and verdictToSuccess may return undefined because its switch has no default branch.
  • The impact is mainly incorrect eval success handling when misconfigured env values are provided, rather than a broad runtime break across normal flows.
  • Pay close attention to packages/evals/tasks/bench/agent/amazon_shoes_cart.ts - validate EVAL_SUCCESS_MODE and ensure verdictToSuccess has a safe fallback/default behavior.
Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="packages/evals/tasks/bench/agent/amazon_shoes_cart.ts">

<violation number="1" location="packages/evals/tasks/bench/agent/amazon_shoes_cart.ts:40">
P2: Unsafe cast of `process.env.EVAL_SUCCESS_MODE` can cause `verdictToSuccess` to return `undefined` if the env var is set to an unrecognized value (the switch in `verdictToSuccess` has no default case). Consider validating the value or providing a fallback after the cast.</violation>
</file>
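
A minimal way to address the flagged cast, assuming the mode union is exactly "outcome" | "process" | "both" as in the diff further below (the helper name parseSuccessMode is illustrative):

type SuccessMode = "outcome" | "process" | "both";

// Validate the env var instead of casting it, and fall back to "outcome"
// for unset or unrecognized values, so verdictToSuccess never receives a
// mode outside the union.
function parseSuccessMode(raw: string | undefined): SuccessMode {
  return raw === "outcome" || raw === "process" || raw === "both"
    ? raw
    : "outcome";
}

const successMode = parseSuccessMode(process.env.EVAL_SUCCESS_MODE);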
Architecture diagram
sequenceDiagram
    participant Runner as Eval Runner
    participant Task as Agent Task
    participant Verifier as runWithVerifier (verifierAdapter)
    participant AdHoc as adHocRubric
    participant Cache as Rubric Cache
    participant VerifierLLM as V3Evaluator.verify()
    participant Trajectory as TrajectoryRecorder

    Note over Runner,Trajectory: NEW: Verifier-Backed Scoring Pipeline

    Runner->>Task: execute() with test params
    Task->>Task: Create initUrl, instruction, expectedAnswer
    opt Custom tasks (agent-custom) OR gaia dataset
        Task->>AdHoc: adHocRubric(criteria1, criteria2, ...)
        AdHoc-->>Task: Rubric w/ 1-point items
        Task->>Task: Set precomputedRubric on TaskSpec
    end
    opt OnlineMind2Web / WebVoyager
        Task->>Task: No precomputedRubric (will use Step 0a)
    end
    Task->>Verifier: runWithVerifier({ v3, agent, taskSpec, dataset })
    Verifier->>Trajectory: Start recording agent steps
    Verifier->>Verifier: agent.execute(taskSpec.instruction)
    Verifier->>Trajectory: Stop recording, get trajectory
    alt Has precomputedRubric
        Verifier->>Verifier: Use taskSpec.precomputedRubric
    else No precomputedRubric (OnlineMind2Web / WebVoyager)
        Verifier->>Cache: Lookup rubric for task_id + dataset
        alt Cached rubric exists
            Cache-->>Verifier: Hydrated rubric
        else No cached rubric
            Verifier->>VerifierLLM: Step 0a: Generate rubric from task + trajectory
            VerifierLLM-->>Verifier: Generated rubric
            Verifier->>Cache: Store generated rubric
        end
    end
    Verifier->>VerifierLLM: verify(trajectory, rubric) -> Step 6 rescore + Step 8 outcome
    VerifierLLM-->>Verifier: verdict { outcomeSuccess, processScore, evidenceInsufficient, rawSteps }
    Verifier-->>Task: { verdict, trajectoryDir, rubric }
    Task->>Task: verdictToSuccess(verdict, EVAL_SUCCESS_MODE)
    alt successMode = "outcome"
        Task->>Task: Use verdict.outcomeSuccess
    else successMode = "process"
        Task->>Task: Use processScore threshold
    else successMode = "both"
        Task->>Task: Combine outcome + process
    end
    Task-->>Runner: { _success, outcomeSuccess, processScore, trajectoryDir, criterionCount, stepCount, debugUrl, sessionUrl }
    opt Error path
        Task-->>Runner: { _success: false, error, trajectoryDir, debugUrl, sessionUrl }
    end
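To make the success-mode branching at the bottom of the diagram concrete, here is an illustrative verdictToSuccess. The verdict fields follow the diagram above; the process-score threshold is an assumption, not a value taken from this PR:

interface Verdict {
  outcomeSuccess: boolean;
  processScore: number; // assumed normalized to 0..1
}

// Illustrative threshold; the PR does not specify the real cutoff.
const PROCESS_THRESHOLD = 0.8;

// Exhaustive over the validated union, so there is no undefined
// fallthrough of the kind flagged in the review above.
function verdictToSuccess(
  verdict: Verdict,
  mode: "outcome" | "process" | "both",
): boolean {
  switch (mode) {
    case "outcome":
      return verdict.outcomeSuccess;
    case "process":
      return verdict.processScore >= PROCESS_THRESHOLD;
    case "both":
      return verdict.outcomeSuccess && verdict.processScore >= PROCESS_THRESHOLD;
  }
}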


In packages/evals/tasks/bench/agent/amazon_shoes_cart.ts (around line 40):

console.log(`reasoning: ${reasoning}`);

const success = evaluation === "YES";
const successMode =
cubic-dev-ai (bot) commented May 15, 2026


P2: Unsafe cast of process.env.EVAL_SUCCESS_MODE can cause verdictToSuccess to return undefined if the env var is set to an unrecognized value (the switch in verdictToSuccess has no default case). Consider validating the value or providing a fallback after the cast.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At packages/evals/tasks/bench/agent/amazon_shoes_cart.ts, line 40:

<comment>Unsafe cast of `process.env.EVAL_SUCCESS_MODE` can cause `verdictToSuccess` to return `undefined` if the env var is set to an unrecognized value (the switch in `verdictToSuccess` has no default case). Consider validating the value or providing a fallback after the cast.</comment>

<file context>
@@ -1,69 +1,61 @@
-      console.log(`reasoning: ${reasoning}`);
-
-      const success = evaluation === "YES";
+      const successMode =
+        (process.env.EVAL_SUCCESS_MODE as "outcome" | "process" | "both") ||
+        "outcome";
</file context>

miguelg719 force-pushed the miguelgonzalez/verifier-08-dataset-migrations branch from f7d8f2d to 9ffa4e9 on May 15, 2026 at 21:23
miguelg719 force-pushed the miguelgonzalez/verifier-07-evals-adapter branch 2 times, most recently from 105b13c to f7df0cf on May 15, 2026 at 21:45
miguelg719 force-pushed the miguelgonzalez/verifier-08-dataset-migrations branch from 9ffa4e9 to 2a24116 on May 15, 2026 at 21:45
miguelg719 force-pushed the miguelgonzalez/verifier-07-evals-adapter branch from f7df0cf to a476ada on May 15, 2026 at 22:33
miguelg719 force-pushed the miguelgonzalez/verifier-08-dataset-migrations branch from 2a24116 to 71ed229 on May 15, 2026 at 22:33
miguelg719 force-pushed the miguelgonzalez/verifier-07-evals-adapter branch from a476ada to 8ec81fd on May 15, 2026 at 23:27
miguelg719 force-pushed the miguelgonzalez/verifier-08-dataset-migrations branch from 71ed229 to 2f80063 on May 15, 2026 at 23:27
miguelg719 force-pushed the miguelgonzalez/verifier-07-evals-adapter branch from 8ec81fd to cce3a9b on May 16, 2026 at 04:40
miguelg719 force-pushed the miguelgonzalez/verifier-08-dataset-migrations branch from 2f80063 to 1b60889 on May 16, 2026 at 04:40
miguelg719 force-pushed the miguelgonzalez/verifier-08-dataset-migrations branch from 1b60889 to bf49158 on May 16, 2026 at 05:50
miguelg719 force-pushed the miguelgonzalez/verifier-07-evals-adapter branch from cce3a9b to 47dc1d5 on May 16, 2026 at 05:50
