feat(evals): migrate remaining verifier datasets #2136
Conversation
1 issue found across 45 files
Confidence score: 4/5
- This PR is likely safe to merge with minimal risk: the reported issue is moderate (5/10) and appears limited to an edge case around environment configuration.
- In `packages/evals/tasks/bench/agent/amazon_shoes_cart.ts`, casting `process.env.EVAL_SUCCESS_MODE` without validation can allow unrecognized values, and `verdictToSuccess` may return `undefined` because its switch has no default branch.
- The impact is mainly incorrect eval success handling when misconfigured env values are provided, rather than a broad runtime break across normal flows.
- Pay close attention to `packages/evals/tasks/bench/agent/amazon_shoes_cart.ts`: validate `EVAL_SUCCESS_MODE` and ensure `verdictToSuccess` has a safe fallback/default behavior.
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="packages/evals/tasks/bench/agent/amazon_shoes_cart.ts">
<violation number="1" location="packages/evals/tasks/bench/agent/amazon_shoes_cart.ts:40">
P2: Unsafe cast of `process.env.EVAL_SUCCESS_MODE` can cause `verdictToSuccess` to return `undefined` if the env var is set to an unrecognized value (the switch in `verdictToSuccess` has no default case). Consider validating the value or providing a fallback after the cast.</violation>
</file>
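One way to address the violation above is to validate the env var up front instead of casting it. A minimal sketch, assuming the three modes `"outcome" | "process" | "both"` and a default of `"outcome"` (both taken from the diff context in this PR); `parseSuccessMode` is a hypothetical helper name, not code from the PR:

```typescript
// Hypothetical validation helper for EVAL_SUCCESS_MODE (names are illustrative).
const SUCCESS_MODES = ["outcome", "process", "both"] as const;
type SuccessMode = (typeof SUCCESS_MODES)[number];

function parseSuccessMode(raw: string | undefined): SuccessMode {
  // Unset or empty falls back to the default mode from the diff context.
  if (raw === undefined || raw === "") return "outcome";
  if ((SUCCESS_MODES as readonly string[]).includes(raw)) {
    return raw as SuccessMode;
  }
  // Fail loudly on misconfiguration instead of silently returning undefined later.
  throw new Error(
    `Invalid EVAL_SUCCESS_MODE "${raw}"; expected one of: ${SUCCESS_MODES.join(", ")}`,
  );
}

const successMode = parseSuccessMode(process.env.EVAL_SUCCESS_MODE);
```

Throwing on a bad value is one option; silently coercing to the default is another. Either way the cast can no longer smuggle an unrecognized string into `verdictToSuccess`.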
Architecture diagram
sequenceDiagram
participant Runner as Eval Runner
participant Task as Agent Task
participant Verifier as runWithVerifier (verifierAdapter)
participant AdHoc as adHocRubric
participant Cache as Rubric Cache
participant VerifierLLM as V3Evaluator.verify()
participant Trajectory as TrajectoryRecorder
Note over Runner,Trajectory: NEW: Verifier-Backed Scoring Pipeline
Runner->>Task: execute() with test params
Task->>Task: Create initUrl, instruction, expectedAnswer
opt Custom tasks (agent-custom) OR gaia dataset
Task->>AdHoc: adHocRubric(criteria1, criteria2, ...)
AdHoc-->>Task: Rubric w/ 1-point items
Task->>Task: Set precomputedRubric on TaskSpec
end
opt OnlineMind2Web / WebVoyager
Task->>Task: No precomputedRubric (will use Step 0a)
end
Task->>Verifier: runWithVerifier({ v3, agent, taskSpec, dataset })
Verifier->>Trajectory: Start recording agent steps
Verifier->>Verifier: agent.execute(taskSpec.instruction)
Verifier->>Trajectory: Stop recording, get trajectory
alt Has precomputedRubric
Verifier->>Verifier: Use taskSpec.precomputedRubric
else No precomputedRubric (OnlineMind2Web / WebVoyager)
Verifier->>Cache: Lookup rubric for task_id + dataset
alt Cached rubric exists
Cache-->>Verifier: Hydrated rubric
else No cached rubric
Verifier->>VerifierLLM: Step 0a: Generate rubric from task + trajectory
VerifierLLM-->>Verifier: Generated rubric
Verifier->>Cache: Store generated rubric
end
end
Verifier->>VerifierLLM: verify(trajectory, rubric) -> Step 6 rescore + Step 8 outcome
VerifierLLM-->>Verifier: verdict { outcomeSuccess, processScore, evidenceInsufficient, rawSteps }
Verifier-->>Task: { verdict, trajectoryDir, rubric }
Task->>Task: verdictToSuccess(verdict, EVAL_SUCCESS_MODE)
alt successMode = "outcome"
Task->>Task: Use verdict.outcomeSuccess
else successMode = "process"
Task->>Task: Use processScore threshold
else successMode = "both"
Task->>Task: Combine outcome + process
end
Task-->>Runner: { _success, outcomeSuccess, processScore, trajectoryDir, criterionCount, stepCount, debugUrl, sessionUrl }
opt Error path
Task-->>Runner: { _success: false, error, trajectoryDir, debugUrl, sessionUrl }
end
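The success-mode branch in the diagram above can be sketched as an exhaustive switch. This is a hedged illustration, not the PR's actual implementation: the `Verdict` shape follows the diagram's fields (`outcomeSuccess`, `processScore`), while the threshold value and the runtime fallback are assumptions:

```typescript
// Illustrative sketch of a verdictToSuccess with a safe default branch.
type SuccessMode = "outcome" | "process" | "both";

interface Verdict {
  outcomeSuccess: boolean;
  processScore: number; // assumed to be normalized to 0..1
}

const PROCESS_THRESHOLD = 0.8; // assumed threshold, not from the PR

function verdictToSuccess(verdict: Verdict, mode: SuccessMode): boolean {
  switch (mode) {
    case "outcome":
      return verdict.outcomeSuccess;
    case "process":
      return verdict.processScore >= PROCESS_THRESHOLD;
    case "both":
      return verdict.outcomeSuccess && verdict.processScore >= PROCESS_THRESHOLD;
    default:
      // Runtime fallback: an unvalidated cast can still pass an unknown string,
      // so treat unrecognized modes as failure rather than returning undefined.
      return false;
  }
}
```

With a `default` branch present, even a bad cast of `EVAL_SUCCESS_MODE` yields a boolean, which is the fallback behavior the review comment asks for.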
- console.log(`reasoning: ${reasoning}`);
-
- const success = evaluation === "YES";
+ const successMode =
P2: Unsafe cast of process.env.EVAL_SUCCESS_MODE can cause verdictToSuccess to return undefined if the env var is set to an unrecognized value (the switch in verdictToSuccess has no default case). Consider validating the value or providing a fallback after the cast.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At packages/evals/tasks/bench/agent/amazon_shoes_cart.ts, line 40:
<comment>Unsafe cast of `process.env.EVAL_SUCCESS_MODE` can cause `verdictToSuccess` to return `undefined` if the env var is set to an unrecognized value (the switch in `verdictToSuccess` has no default case). Consider validating the value or providing a fallback after the cast.</comment>
<file context>
@@ -1,69 +1,61 @@
- console.log(`reasoning: ${reasoning}`);
-
- const success = evaluation === "YES";
+ const successMode =
+ (process.env.EVAL_SUCCESS_MODE as "outcome" | "process" | "both") ||
+ "outcome";
</file context>
Why
After WebTailBench is wired, the remaining eval datasets and custom agent tasks need to use the same verifier-backed scoring path so the benchmark matrix can run consistently across suites.
What Changed
- runWithVerifier.
- adHocRubric helper for custom task rubrics.
- EVAL_SUCCESS_MODE casts now that success-mode validation is centralized.
Tests
pnpm --filter @browserbasehq/stagehand-evals run typecheck
pnpm --dir packages/evals exec vitest run tests/framework/rubricCache.test.ts tests/framework/verifierAdapter.test.ts tests/tui/commandTree.test.ts tests/framework/claudeCodeRunner.test.ts tests/framework/codexRunner.test.ts tests/cli.test.ts
pnpm -w exec prettier --check packages/core/lib/v3/verifier packages/core/tests/unit/verifier-failure-step-parser.test.ts packages/evals/cli.ts packages/evals/framework packages/evals/tests/framework/rubricCache.test.ts packages/evals/tests/framework/verifierAdapter.test.ts packages/evals/tests/tui/commandTree.test.ts packages/evals/tui packages/evals/tasks/bench/agent
git diff --check
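The adHocRubric helper named under What Changed is internal to this PR, so its real signature may differ. As a rough illustration only, a hypothetical helper that turns each criterion into a 1-point rubric item (matching the diagram's "Rubric w/ 1-point items"):

```typescript
// Hypothetical shape of an adHocRubric-style helper; names and types are
// illustrative, not the PR's actual API.
interface RubricItem {
  criterion: string;
  points: number;
}

interface Rubric {
  items: RubricItem[];
  maxScore: number;
}

function adHocRubric(...criteria: string[]): Rubric {
  // Each criterion becomes a single 1-point item, so the max score is
  // simply the number of criteria supplied.
  const items = criteria.map((criterion) => ({ criterion, points: 1 }));
  return { items, maxScore: items.length };
}
```

A custom task would then precompute its rubric once, e.g. `adHocRubric("shoes added to cart", "correct size selected")`, and attach it to the task spec so the verifier skips rubric generation (Step 0a).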