
feat(evals): migrate remaining verifier datasets #2136

Open
miguelg719 wants to merge 5 commits into miguelgonzalez/verifier-07-evals-adapter from miguelgonzalez/verifier-08-dataset-migrations

Conversation

miguelg719 (Collaborator) commented May 15, 2026

Why

Now that WebTailBench is wired up, the remaining eval datasets and custom agent tasks need to use the same verifier-backed scoring path so the benchmark matrix runs consistently across suites.

What Changed

  • Migrated OnlineMind2Web and WebVoyager task wrappers to verifier-backed scoring.
  • Migrated custom agent bench tasks to runWithVerifier (see the sketch after this list).
  • Added adHocRubric helper for custom task rubrics.
  • Backfilled WebTailBench rubric data.
  • Removed repeated unsafe EVAL_SUCCESS_MODE casts now that success-mode validation is centralized.
  • Preserved task-level success reporting through verifier verdicts.
  • Removed upstream verifier references from comments.
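
For orientation, here is a minimal sketch of what a migrated custom agent task might look like after this change. The names runWithVerifier and adHocRubric come from this PR, but the import path, argument shapes, and return fields shown here are assumptions rather than the exact API:

// Hypothetical sketch of a custom agent task on the verifier-backed
// scoring path. The import path, option names, and the TaskArgs
// stand-in are assumptions; only runWithVerifier and adHocRubric are
// named in this PR.
import { runWithVerifier, adHocRubric } from "../framework/verifierAdapter";

type TaskArgs = { v3: unknown; agent: unknown }; // stand-in for the repo's task args

export async function amazonShoesCart({ v3, agent }: TaskArgs) {
  const taskSpec = {
    id: "amazon_shoes_cart",
    initUrl: "https://www.amazon.com",
    instruction: "Find a pair of running shoes and add one to the cart.",
    // Custom tasks supply a precomputed rubric, so the adapter skips
    // Step 0a rubric generation (see the architecture diagram below).
    precomputedRubric: adHocRubric(
      "A running-shoe product page was opened",
      "The shoes were added to the cart",
    ),
  };
  const { verdict, trajectoryDir } = await runWithVerifier({
    v3,
    agent,
    taskSpec,
    dataset: "agent-custom",
  });
  // Real tasks map the verdict through verdictToSuccess(verdict, successMode);
  // outcomeSuccess is used directly here only to keep the sketch short.
  return { _success: verdict.outcomeSuccess, trajectoryDir };
}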

Tests

  • pnpm --filter @browserbasehq/stagehand-evals run typecheck
  • pnpm --dir packages/evals exec vitest run tests/framework/rubricCache.test.ts tests/framework/verifierAdapter.test.ts tests/tui/commandTree.test.ts tests/framework/claudeCodeRunner.test.ts tests/framework/codexRunner.test.ts tests/cli.test.ts
  • pnpm -w exec prettier --check packages/core/lib/v3/verifier packages/core/tests/unit/verifier-failure-step-parser.test.ts packages/evals/cli.ts packages/evals/framework packages/evals/tests/framework/rubricCache.test.ts packages/evals/tests/framework/verifierAdapter.test.ts packages/evals/tests/tui/commandTree.test.ts packages/evals/tui packages/evals/tasks/bench/agent
  • git diff --check

changeset-bot (bot) commented May 15, 2026

⚠️ No Changeset found

Latest commit: bf49158

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types


cubic-dev-ai (bot, Contributor) left a comment


1 issue found across 45 files

Confidence score: 4/5

  • This PR is likely safe to merge with minimal risk: the reported issue is moderate (5/10) and appears limited to an edge case around environment configuration.
  • In packages/evals/tasks/bench/agent/amazon_shoes_cart.ts, casting process.env.EVAL_SUCCESS_MODE without validation can allow unrecognized values, and verdictToSuccess may return undefined because its switch has no default branch.
  • The impact is mainly incorrect eval success handling when misconfigured env values are provided, rather than a broad runtime break across normal flows.
  • Pay close attention to packages/evals/tasks/bench/agent/amazon_shoes_cart.ts - validate EVAL_SUCCESS_MODE and ensure verdictToSuccess has a safe fallback/default behavior.
Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="packages/evals/tasks/bench/agent/amazon_shoes_cart.ts">

<violation number="1" location="packages/evals/tasks/bench/agent/amazon_shoes_cart.ts:40">
P2: Unsafe cast of `process.env.EVAL_SUCCESS_MODE` can cause `verdictToSuccess` to return `undefined` if the env var is set to an unrecognized value (the switch in `verdictToSuccess` has no default case). Consider validating the value or providing a fallback after the cast.</violation>
</file>
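
A minimal way to address the flagged cast, assuming the mode union is exactly "outcome" | "process" | "both" as in the diff further below (the helper name parseSuccessMode is illustrative):

type SuccessMode = "outcome" | "process" | "both";

// Validate the env var instead of casting it, and fall back to "outcome"
// for unset or unrecognized values, so verdictToSuccess never receives a
// mode outside the union.
function parseSuccessMode(raw: string | undefined): SuccessMode {
  return raw === "outcome" || raw === "process" || raw === "both"
    ? raw
    : "outcome";
}

const successMode = parseSuccessMode(process.env.EVAL_SUCCESS_MODE);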
Architecture diagram
sequenceDiagram
    participant Runner as Eval Runner
    participant Task as Agent Task
    participant Verifier as runWithVerifier (verifierAdapter)
    participant AdHoc as adHocRubric
    participant Cache as Rubric Cache
    participant VerifierLLM as V3Evaluator.verify()
    participant Trajectory as TrajectoryRecorder

    Note over Runner,Trajectory: NEW: Verifier-Backed Scoring Pipeline

    Runner->>Task: execute() with test params
    Task->>Task: Create initUrl, instruction, expectedAnswer
    opt Custom tasks (agent-custom) OR gaia dataset
        Task->>AdHoc: adHocRubric(criteria1, criteria2, ...)
        AdHoc-->>Task: Rubric w/ 1-point items
        Task->>Task: Set precomputedRubric on TaskSpec
    end
    opt OnlineMind2Web / WebVoyager
        Task->>Task: No precomputedRubric (will use Step 0a)
    end
    Task->>Verifier: runWithVerifier({ v3, agent, taskSpec, dataset })
    Verifier->>Trajectory: Start recording agent steps
    Verifier->>Verifier: agent.execute(taskSpec.instruction)
    Verifier->>Trajectory: Stop recording, get trajectory
    alt Has precomputedRubric
        Verifier->>Verifier: Use taskSpec.precomputedRubric
    else No precomputedRubric (OnlineMind2Web / WebVoyager)
        Verifier->>Cache: Lookup rubric for task_id + dataset
        alt Cached rubric exists
            Cache-->>Verifier: Hydrated rubric
        else No cached rubric
            Verifier->>VerifierLLM: Step 0a: Generate rubric from task + trajectory
            VerifierLLM-->>Verifier: Generated rubric
            Verifier->>Cache: Store generated rubric
        end
    end
    Verifier->>VerifierLLM: verify(trajectory, rubric) -> Step 6 rescore + Step 8 outcome
    VerifierLLM-->>Verifier: verdict { outcomeSuccess, processScore, evidenceInsufficient, rawSteps }
    Verifier-->>Task: { verdict, trajectoryDir, rubric }
    Task->>Task: verdictToSuccess(verdict, EVAL_SUCCESS_MODE)
    alt successMode = "outcome"
        Task->>Task: Use verdict.outcomeSuccess
    else successMode = "process"
        Task->>Task: Use processScore threshold
    else successMode = "both"
        Task->>Task: Combine outcome + process
    end
    Task-->>Runner: { _success, outcomeSuccess, processScore, trajectoryDir, criterionCount, stepCount, debugUrl, sessionUrl }
    opt Error path
        Task-->>Runner: { _success: false, error, trajectoryDir, debugUrl, sessionUrl }
    end
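To make the success-mode branching at the bottom of the diagram concrete, here is an illustrative verdictToSuccess. The verdict fields follow the diagram above; the process-score threshold is an assumption, not a value taken from this PR:

interface Verdict {
  outcomeSuccess: boolean;
  processScore: number; // assumed normalized to 0..1
}

// Illustrative threshold; the PR does not specify the real cutoff.
const PROCESS_THRESHOLD = 0.8;

// Exhaustive over the validated union, so there is no undefined
// fallthrough of the kind flagged in the review above.
function verdictToSuccess(
  verdict: Verdict,
  mode: "outcome" | "process" | "both",
): boolean {
  switch (mode) {
    case "outcome":
      return verdict.outcomeSuccess;
    case "process":
      return verdict.processScore >= PROCESS_THRESHOLD;
    case "both":
      return verdict.outcomeSuccess && verdict.processScore >= PROCESS_THRESHOLD;
  }
}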


In packages/evals/tasks/bench/agent/amazon_shoes_cart.ts (around line 40):

console.log(`reasoning: ${reasoning}`);

const success = evaluation === "YES";
const successMode =
cubic-dev-ai (bot) commented May 15, 2026


P2: Unsafe cast of process.env.EVAL_SUCCESS_MODE can cause verdictToSuccess to return undefined if the env var is set to an unrecognized value (the switch in verdictToSuccess has no default case). Consider validating the value or providing a fallback after the cast.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At packages/evals/tasks/bench/agent/amazon_shoes_cart.ts, line 40:

<comment>Unsafe cast of `process.env.EVAL_SUCCESS_MODE` can cause `verdictToSuccess` to return `undefined` if the env var is set to an unrecognized value (the switch in `verdictToSuccess` has no default case). Consider validating the value or providing a fallback after the cast.</comment>

<file context>
@@ -1,69 +1,61 @@
-      console.log(`reasoning: ${reasoning}`);
-
-      const success = evaluation === "YES";
+      const successMode =
+        (process.env.EVAL_SUCCESS_MODE as "outcome" | "process" | "both") ||
+        "outcome";
</file context>

miguelg719 force-pushed the miguelgonzalez/verifier-08-dataset-migrations branch from f7d8f2d to 9ffa4e9 on May 15, 2026 at 21:23
miguelg719 force-pushed the miguelgonzalez/verifier-07-evals-adapter branch 2 times, most recently from 105b13c to f7df0cf on May 15, 2026 at 21:45
miguelg719 force-pushed the miguelgonzalez/verifier-08-dataset-migrations branch from 9ffa4e9 to 2a24116 on May 15, 2026 at 21:45
miguelg719 force-pushed the miguelgonzalez/verifier-07-evals-adapter branch from f7df0cf to a476ada on May 15, 2026 at 22:33
miguelg719 force-pushed the miguelgonzalez/verifier-08-dataset-migrations branch from 2a24116 to 71ed229 on May 15, 2026 at 22:33
miguelg719 force-pushed the miguelgonzalez/verifier-07-evals-adapter branch from a476ada to 8ec81fd on May 15, 2026 at 23:27
miguelg719 force-pushed the miguelgonzalez/verifier-08-dataset-migrations branch from 71ed229 to 2f80063 on May 15, 2026 at 23:27
miguelg719 force-pushed the miguelgonzalez/verifier-07-evals-adapter branch from 8ec81fd to cce3a9b on May 16, 2026 at 04:40
miguelg719 force-pushed the miguelgonzalez/verifier-08-dataset-migrations branch from 2f80063 to 1b60889 on May 16, 2026 at 04:40
miguelg719 force-pushed the miguelgonzalez/verifier-08-dataset-migrations branch from 1b60889 to bf49158 on May 16, 2026 at 05:50
miguelg719 force-pushed the miguelgonzalez/verifier-07-evals-adapter branch from cce3a9b to 47dc1d5 on May 16, 2026 at 05:50
