feat(verifier): add verifier evaluator shell and types#2130

Open
miguelg719 wants to merge 14 commits into miguelgonzalez/verifier-01-evaluator-compat from miguelgonzalez/verifier-02-backend-routing

Conversation

miguelg719 (Collaborator) commented May 15, 2026

Why

The verifier pipeline needs a stable public contract before the judgment engine lands. This PR introduces the verifier-facing API and types while keeping runtime behavior reviewable and compatible with the legacy backend.

What Changed

  • Added public verifier types for trajectories, task specs, rubrics, criteria, verdicts, findings, and evidence.
  • Consolidated the verifier-facing type surface through packages/core/lib/v3/verifier/types.ts with public re-exports only from verifier/index.ts.
  • Removed implementation-level proxy type barrels, including the unused verifier.ts file and trajectory.ts type re-exports.
  • Kept public RubricCriterion.earnedPoints numeric-only; serialized empty-string values are normalized away at the IO boundary.
  • Constrained trajectory asset paths so screenshotPath cannot escape the trajectory directory during offline loading.
  • Added V3Evaluator.verify() and V3Evaluator.generateRubric() facade methods.
  • Added legacy compatibility behavior for verify() over saved trajectories.
  • Exported verifier types and helpers from the public v3 entrypoint.
  • Replaced the temporary stub reason with the stable stub-verifier reason.
  • Added public API and trajectory utility coverage for the new verifier surface.
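
Two of the IO-boundary items above (asset-path containment and earnedPoints normalization) can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in this PR: `resolveTrajectoryAsset`, `normalizeCriterion`, and the `id` and `maxPoints` fields are hypothetical names, and only `screenshotPath` and `RubricCriterion.earnedPoints` come from the description.

```typescript
import * as path from "node:path";

// Hypothetical helper: resolve an asset path read from trajectory.json and
// reject anything that would land outside the trajectory directory.
export function resolveTrajectoryAsset(trajectoryDir: string, assetPath: string): string {
  const root = path.resolve(trajectoryDir);
  const resolved = path.resolve(root, assetPath);
  const relative = path.relative(root, resolved);
  const escapes =
    relative === "" || // the directory itself, not a file inside it
    relative === ".." ||
    relative.startsWith(".." + path.sep) || // climbs above the root
    path.isAbsolute(relative); // e.g. a different drive on Windows
  if (escapes) {
    throw new Error(`Asset path escapes trajectory directory: ${assetPath}`);
  }
  return resolved;
}

// Hypothetical shapes: the serialized form may carry "" for earnedPoints,
// while the public form is numeric-only.
interface SerializedCriterion {
  id: string;
  maxPoints: number;
  earnedPoints?: number | "";
}

interface PublicCriterion {
  id: string;
  maxPoints: number;
  earnedPoints?: number;
}

// Drop non-numeric earnedPoints values when reading from disk.
export function normalizeCriterion(raw: SerializedCriterion): PublicCriterion {
  const { earnedPoints, ...rest } = raw;
  return typeof earnedPoints === "number" ? { ...rest, earnedPoints } : rest;
}
```

Comparing the resolved path against the resolved root (rather than scanning the raw string for `..`) also covers absolute paths and mixed separators. A production loader would likely also need to consider symlinks, which this sketch does not.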

Tests

  • pnpm --filter @browserbasehq/stagehand run typecheck
  • pnpm --filter @browserbasehq/stagehand run build:esm
  • pnpm --filter @browserbasehq/stagehand run test:core -- packages/core/dist/esm/tests/unit/public-api/v3-core.test.js
  • pnpm --filter @browserbasehq/stagehand run test:core -- packages/core/dist/esm/tests/unit/verifier-trajectory.test.js
  • git diff --check

changeset-bot Bot commented May 15, 2026

🦋 Changeset detected

Latest commit: 60e4321

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 4 packages
| Name | Type |
| --- | --- |
| @browserbasehq/stagehand | Patch |
| @browserbasehq/stagehand-evals | Patch |
| @browserbasehq/stagehand-server-v3 | Patch |
| @browserbasehq/stagehand-server-v4 | Patch |

cubic-dev-ai Bot (Contributor) left a comment

1 issue found across 6 files

Confidence score: 4/5

  • This PR is likely safe to merge, with only a minor export-consistency risk rather than a broad functional regression.
  • In packages/core/lib/v3/index.ts, loadTrajectoryFromDisk and nextVerdictFilename are missing from the StagehandDefault object, which can cause import Stagehand from '@browserbasehq/st...' consumers to see missing members on the default import.
  • Severity is moderate-low (4/10) and the impact appears limited to default-export access patterns, so this looks non-blocking but worth fixing soon.
  • Pay close attention to packages/core/lib/v3/index.ts - ensure StagehandDefault includes all intended value exports to avoid default-import API gaps.
Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="packages/core/lib/v3/index.ts">

<violation number="1" location="packages/core/lib/v3/index.ts:87">
P2: `loadTrajectoryFromDisk` and `nextVerdictFilename` are not added to the `StagehandDefault` default export object, unlike every other value export in this file. Consumers using `import Stagehand from '@browserbasehq/stagehand'` won't have access to these utilities.</violation>
</file>
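
One way to catch this class of gap is a small check that compares a module's named value exports against its default-export object. The sketch below is an assumption-laden illustration: `missingFromDefault` is a hypothetical helper, and the stand-in objects only mimic the shape of the real entrypoint.

```typescript
// Hypothetical helper: list the runtime (value) exports that are absent from
// the default-export object. Type-only exports never appear at runtime, so
// they are not part of either argument.
export function missingFromDefault(
  named: Record<string, unknown>,
  defaultObj: Record<string, unknown>,
): string[] {
  return Object.keys(named)
    .filter((key) => key !== "default")
    .filter((key) => !(key in defaultObj))
    .sort();
}

// Stand-ins for the module namespace and its default export; the function
// bodies are placeholders, not the real implementations.
const namedExports: Record<string, unknown> = {
  loadTrajectoryFromDisk: () => undefined,
  nextVerdictFilename: () => undefined,
  V3Evaluator: class {},
};
const stagehandDefaultStandIn: Record<string, unknown> = {
  V3Evaluator: namedExports.V3Evaluator,
};

// The two utilities flagged in the review show up as missing.
export const gaps = missingFromDefault(namedExports, stagehandDefaultStandIn);
```

In a real test the first argument would presumably be the namespace import of the package and the second its default import, so any new value export that is not mirrored on the default object fails the check.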
Architecture diagram
sequenceDiagram
    participant Client as Consumer (CLI/Tests)
    participant V3Eval as V3Evaluator
    participant LegEval as LegacyV3Evaluator
    participant TrajLoader as loadTrajectoryFromDisk
    participant FileSys as File System
    
    Note over Client,FileSys: NEW: Verifier Public Contract Flow
    
    Client->>V3Eval: verify(trajectory, taskSpec)
    
    alt backend === "legacy"
        V3Eval->>V3Eval: assertVerifierInput()<br/>validates id + trajectory
        
        V3Eval->>V3Eval: collectLegacyScreenshots()<br/>extracts probe then agent images
        
        V3Eval->>V3Eval: renderLegacyAgentReasoning()<br/>builds step-by-step text
        
        alt has screenshots or final answer
            V3Eval->>LegEval: ask(question, screenshot, answer, agentReasoning)
            LegEval-->>V3Eval: EvaluationResult
            
            V3Eval->>V3Eval: legacyEvaluationToVerdict()<br/>maps YES/NO/INVALID to Verdict
            V3Eval-->>Client: Verdict
        else no evidence
            V3Eval->>V3Eval: legacyInsufficientEvidenceVerdict()
            V3Eval-->>Client: Verdict<br/>(processScore=0,<br/>evidenceInsufficient=true)
        end
    else backend === "verifier"
        V3Eval->>V3Eval: unavailableVerifierBackend()<br/>throws error
    end
    
    Note over Client,FileSys: NEW: generateRubric Flow
    
    Client->>V3Eval: generateRubric(taskSpec)
    V3Eval->>V3Eval: validate taskSpec.id present
    
    alt backend === "legacy"
        V3Eval->>V3Eval: legacyTaskCompletionCriterion()<br/>single-criterion rubric
        V3Eval-->>Client: Rubric with 1 item
    else backend === "verifier"
        V3Eval->>V3Eval: unavailableVerifierBackend()<br/>throws error
    end
    
    Note over Client,FileSys: NEW: Offline Trajectory Loading<br/>(for re-scoring saved runs)
    
    Client->>TrajLoader: loadTrajectoryFromDisk(dir)
    TrajLoader->>FileSys: readFile(trajectory.json)
    FileSys-->>TrajLoader: raw JSON
    TrajLoader->>TrajLoader: parse + iterate steps
    
    loop each step
        alt probeEvidence.screenshotPath set<br/>and probeEvidence.screenshot absent
            TrajLoader->>FileSys: readFile(screenshotPath)
            FileSys-->>TrajLoader: Buffer
            TrajLoader->>TrajLoader: assign to probeEvidence.screenshot
        else file missing
            TrajLoader->>TrajLoader: skip (leaves undefined)
        end
        
        alt agentEvidence has image modality<br/>with bytesBase64
            TrajLoader->>TrajLoader: Buffer.from(bytesBase64, "base64")
            TrajLoader->>TrajLoader: replace with bytes field
        end
    end
    
    TrajLoader-->>Client: Hydrated Trajectory
    
    Note over Client,FileSys: NEW: Public Type Exports<br/>(Trajectory, Verdict, Verifier, etc.)
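
The verify() branch of the diagram can be approximated in code as below. This is a rough sketch of the routing only: `verifySketch`, `SketchVerdict`, `SketchInput`, the `passed` field, and the error message are hypothetical, and the real Verdict type and legacy `ask()` signature are not shown in this PR description.

```typescript
type Backend = "legacy" | "verifier";
type LegacyEvaluation = "YES" | "NO" | "INVALID";

// Hypothetical, simplified verdict; only processScore and
// evidenceInsufficient are named in the diagram.
interface SketchVerdict {
  evaluation: LegacyEvaluation | null;
  passed: boolean;
  processScore?: number;
  evidenceInsufficient: boolean;
}

interface SketchInput {
  question: string;
  screenshots: Uint8Array[];
  finalAnswer?: string;
}

// Stand-in for the legacy evaluator call.
type LegacyAsk = (input: SketchInput) => Promise<LegacyEvaluation>;

export async function verifySketch(
  backend: Backend,
  input: SketchInput,
  ask: LegacyAsk,
): Promise<SketchVerdict> {
  if (backend === "verifier") {
    // The new backend is not wired up yet, so the call fails.
    throw new Error("verifier backend is not available");
  }
  const hasEvidence = input.screenshots.length > 0 || Boolean(input.finalAnswer);
  if (!hasEvidence) {
    // No screenshots and no final answer: skip the legacy call entirely.
    return { evaluation: null, passed: false, processScore: 0, evidenceInsufficient: true };
  }
  const evaluation = await ask(input);
  // Assumption: only YES counts as a pass; NO and INVALID do not.
  return { evaluation, passed: evaluation === "YES", evidenceInsufficient: false };
}
```

Passing the legacy evaluator in as a function keeps the routing testable without a model call; the actual facade may well hold the evaluator as an instance member instead.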


Comment thread packages/core/lib/v3/index.ts
miguelg719 force-pushed the miguelgonzalez/verifier-02-backend-routing branch from d7d2c59 to 2765781 on May 15, 2026 21:23
Comment thread packages/core/lib/v3/verifier/trajectory.ts Outdated
miguelg719 changed the title from "feat(verifier): add verifier evaluator shell" to "feat(verifier): add verifier evaluator shell and types" on May 16, 2026
miguelg719 force-pushed the miguelgonzalez/verifier-02-backend-routing branch from de209d6 to 18265ca on May 16, 2026 05:28