feat(verifier): add verifier evaluator shell and types #2130
Open
miguelg719 wants to merge 14 commits into
🦋 Changeset detected
Latest commit: 60e4321
The changes in this PR will be included in the next version bump. This PR includes changesets to release 4 packages.
1 issue found across 6 files
Confidence score: 4/5
- This PR is likely safe to merge, with only a minor export-consistency risk rather than a broad functional regression.
- In `packages/core/lib/v3/index.ts`, `loadTrajectoryFromDisk` and `nextVerdictFilename` are missing from the `StagehandDefault` object, which can cause `import Stagehand from '@browserbasehq/st...'` consumers to see missing members on the default import.
- Severity is moderate-low (4/10) and the impact appears limited to default-export access patterns, so this looks non-blocking but worth fixing soon.
- Pay close attention to `packages/core/lib/v3/index.ts` - ensure `StagehandDefault` includes all intended value exports to avoid default-import API gaps.
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="packages/core/lib/v3/index.ts">
<violation number="1" location="packages/core/lib/v3/index.ts:87">
P2: `loadTrajectoryFromDisk` and `nextVerdictFilename` are not added to the `StagehandDefault` default export object, unlike every other value export in this file. Consumers using `import Stagehand from '@browserbasehq/stagehand'` won't have access to these utilities.</violation>
</file>
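The export gap flagged above can be caught mechanically. The following is a minimal sketch, assuming the default export is a plain object that mirrors the module's named value exports. The function bodies are placeholders, and `namedValueExports`, `defaultBefore`, `defaultAfter`, and `missingFromDefault` are hypothetical names introduced only for illustration; the real contents of `packages/core/lib/v3/index.ts` are not shown on this page.

```typescript
// Placeholder stand-ins: the real signatures and bodies are not shown in this PR page.
function loadTrajectoryFromDisk(dir: string): string {
  return `placeholder:${dir}`;
}
function nextVerdictFilename(): string {
  return "placeholder.json";
}
class V3Evaluator {}

// Hypothetical model of the module's named value exports.
const namedValueExports: Record<string, unknown> = {
  V3Evaluator,
  loadTrajectoryFromDisk,
  nextVerdictFilename,
};

// Default object as the review describes it: the two new utilities are absent.
const defaultBefore: Record<string, unknown> = { V3Evaluator };

// Default object after the suggested fix.
const defaultAfter: Record<string, unknown> = {
  ...defaultBefore,
  loadTrajectoryFromDisk,
  nextVerdictFilename,
};

// Lists every named value export that the default object does not expose.
function missingFromDefault(
  named: Record<string, unknown>,
  def: Record<string, unknown>,
): string[] {
  return Object.keys(named)
    .filter((key) => !(key in def))
    .sort();
}

console.log(missingFromDefault(namedValueExports, defaultBefore));
console.log(missingFromDefault(namedValueExports, defaultAfter));
```

A check along these lines could live in a public-API unit test so that future value exports cannot be added without also appearing on the default object.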
Architecture diagram
sequenceDiagram
participant Client as Consumer (CLI/Tests)
participant V3Eval as V3Evaluator
participant LegEval as LegacyV3Evaluator
participant TrajLoader as loadTrajectoryFromDisk
participant FileSys as File System
Note over Client,FileSys: NEW: Verifier Public Contract Flow
Client->>V3Eval: verify(trajectory, taskSpec)
alt backend === "legacy"
V3Eval->>V3Eval: assertVerifierInput()<br/>validates id + trajectory
V3Eval->>V3Eval: collectLegacyScreenshots()<br/>extracts probe then agent images
V3Eval->>V3Eval: renderLegacyAgentReasoning()<br/>builds step-by-step text
alt has screenshots or final answer
V3Eval->>LegEval: ask(question, screenshot, answer, agentReasoning)
LegEval-->>V3Eval: EvaluationResult
V3Eval->>V3Eval: legacyEvaluationToVerdict()<br/>maps YES/NO/INVALID to Verdict
V3Eval-->>Client: Verdict
else no evidence
V3Eval->>V3Eval: legacyInsufficientEvidenceVerdict()
V3Eval-->>Client: Verdict<br/>(processScore=0,<br/>evidenceInsufficient=true)
end
else backend === "verifier"
V3Eval->>V3Eval: unavailableVerifierBackend()<br/>throws error
end
Note over Client,FileSys: NEW: generateRubric Flow
Client->>V3Eval: generateRubric(taskSpec)
V3Eval->>V3Eval: validate taskSpec.id present
alt backend === "legacy"
V3Eval->>V3Eval: legacyTaskCompletionCriterion()<br/>single-criterion rubric
V3Eval-->>Client: Rubric with 1 item
else backend === "verifier"
V3Eval->>V3Eval: unavailableVerifierBackend()<br/>throws error
end
Note over Client,FileSys: NEW: Offline Trajectory Loading<br/>(for re-scoring saved runs)
Client->>TrajLoader: loadTrajectoryFromDisk(dir)
TrajLoader->>FileSys: readFile(trajectory.json)
FileSys-->>TrajLoader: raw JSON
TrajLoader->>TrajLoader: parse + iterate steps
loop each step
alt probeEvidence.screenshotPath set<br/>and probeEvidence.screenshot absent
TrajLoader->>FileSys: readFile(screenshotPath)
FileSys-->>TrajLoader: Buffer
TrajLoader->>TrajLoader: assign to probeEvidence.screenshot
else file missing
TrajLoader->>TrajLoader: skip (leaves undefined)
end
alt agentEvidence has image modality<br/>with bytesBase64
TrajLoader->>TrajLoader: Buffer.from(bytesBase64, "base64")
TrajLoader->>TrajLoader: replace with bytes field
end
end
TrajLoader-->>Client: Hydrated Trajectory
Note over Client,FileSys: NEW: Public Type Exports<br/>(Trajectory, Verdict, Verifier, etc.)
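The `verify()` branch of the diagram can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's code: `SketchVerdict`, `SketchTrajectory`, `AskLegacy`, and every function name here are hypothetical, and the way YES/NO/INVALID map onto a verdict is a guess. Only `processScore = 0` and `evidenceInsufficient = true` for the no-evidence branch come from the diagram.

```typescript
// All names and shapes below are illustrative assumptions, not the PR's actual types.
type Backend = "legacy" | "verifier";
type LegacyAnswer = "YES" | "NO" | "INVALID";

interface SketchVerdict {
  success: boolean | null; // null when no judgment could be made (assumed encoding)
  processScore?: number;
  evidenceInsufficient: boolean;
}

interface SketchTrajectory {
  screenshots: Uint8Array[];
  finalAnswer?: string;
}

// Stand-in for the legacy evaluator's ask() call.
type AskLegacy = (
  question: string,
  trajectory: SketchTrajectory,
) => Promise<LegacyAnswer>;

// "has screenshots or final answer" from the diagram.
function hasEvidence(t: SketchTrajectory): boolean {
  return t.screenshots.length > 0 || Boolean(t.finalAnswer);
}

// Assumed mapping of the legacy answer onto a verdict.
function legacyAnswerToVerdict(answer: LegacyAnswer): SketchVerdict {
  switch (answer) {
    case "YES":
      return { success: true, evidenceInsufficient: false };
    case "NO":
      return { success: false, evidenceInsufficient: false };
    case "INVALID":
      return { success: null, evidenceInsufficient: false };
  }
}

// No-evidence branch: values taken from the diagram.
function insufficientEvidenceVerdict(): SketchVerdict {
  return { success: null, processScore: 0, evidenceInsufficient: true };
}

async function verifySketch(
  backend: Backend,
  taskId: string,
  question: string,
  trajectory: SketchTrajectory,
  ask: AskLegacy,
): Promise<SketchVerdict> {
  if (backend === "verifier") {
    // The diagram shows this backend throwing until the judgment engine lands.
    throw new Error("verifier backend is not available");
  }
  if (!taskId) {
    throw new Error("task id is required");
  }
  if (!hasEvidence(trajectory)) {
    return insufficientEvidenceVerdict();
  }
  return legacyAnswerToVerdict(await ask(question, trajectory));
}
```

Passing the legacy evaluator in as a function keeps the sketch testable without a model call; the real facade may wire this differently.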
d7d2c59 to 2765781
miguelg719 commented May 15, 2026
de209d6 to 18265ca
Why
The verifier pipeline needs a stable public contract before the judgment engine lands. This PR introduces the verifier-facing API and types while keeping runtime behavior reviewable and compatible with the legacy backend.
What Changed
- `packages/core/lib/v3/verifier/types.ts` with public re-exports only from `verifier/index.ts`.
- `verifier.ts` file and `trajectory.ts` type re-exports.
- `RubricCriterion.earnedPoints` numeric-only; serialized empty-string values are normalized away at the IO boundary.
- `screenshotPath` cannot escape the trajectory directory during offline loading.
- `V3Evaluator.verify()` and `V3Evaluator.generateRubric()` facade methods.
- `verify()` over saved trajectories.
- `stub-verifier` reason.

Tests

- `pnpm --filter @browserbasehq/stagehand run typecheck`
- `pnpm --filter @browserbasehq/stagehand run build:esm`
- `pnpm --filter @browserbasehq/stagehand run test:core -- packages/core/dist/esm/tests/unit/public-api/v3-core.test.js`
- `pnpm --filter @browserbasehq/stagehand run test:core -- packages/core/dist/esm/tests/unit/verifier-trajectory.test.js`
- `git diff --check`
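Two of the What Changed items describe checks at the IO boundary: `screenshotPath` must stay inside the trajectory directory, and a serialized empty-string `earnedPoints` is normalized away. A minimal sketch of both follows, assuming Node's `path` module. `resolveInsideDir` and `normalizeEarnedPoints` are hypothetical helper names, and returning `undefined` for a rejected path is an assumption; the PR may throw or skip instead.

```typescript
import * as path from "node:path";

// Hypothetical helper: resolves screenshotPath against the trajectory directory
// and returns undefined if the result would land outside that directory.
function resolveInsideDir(
  trajectoryDir: string,
  screenshotPath: string,
): string | undefined {
  const root = path.resolve(trajectoryDir);
  const candidate = path.resolve(root, screenshotPath);
  const rel = path.relative(root, candidate);
  const escapes =
    rel === "" || // points at the directory itself, not a file
    rel === ".." ||
    rel.startsWith(".." + path.sep) ||
    path.isAbsolute(rel); // e.g. a different drive on Windows
  return escapes ? undefined : candidate;
}

// Hypothetical helper: keeps earnedPoints only when it is a finite number,
// so a serialized "" (or any other non-number) becomes undefined.
function normalizeEarnedPoints(raw: unknown): number | undefined {
  return typeof raw === "number" && Number.isFinite(raw) ? raw : undefined;
}

console.log(resolveInsideDir("/tmp/run", "shots/1.png"));
console.log(resolveInsideDir("/tmp/run", "../secret.png"));
console.log(normalizeEarnedPoints(""));
```

Comparing the relative path rather than string prefixes avoids treating a sibling such as `/tmp/run-other` as if it were inside `/tmp/run`.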