feat(verifier): normalize canonical evidence#2132

Open
miguelg719 wants to merge 5 commits into miguelgonzalez/verifier-03-trajectory-recorder from miguelgonzalez/verifier-04-evidence-normalization

Collaborator

@miguelg719 miguelg719 commented May 15, 2026

Why

The verifier needs a mode-neutral evidence layer before scoring. DOM/Hybrid tasks can only be judged correctly if text, JSON, and tool-output evidence are considered alongside screenshots and ARIA evidence. Separately, core installs should not pick up image-processing runtime dependencies.

What Changed

  • Added canonical screenshot loading, deduplication, resizing, and step-index mapping.
  • Added canonical text evidence for ARIA snippets, agent text, agent JSON, and native tool output.
  • Added combined chronological canonical evidence collection.
  • Moved canonical evidence and evidence-load result types into verifier/types.ts; implementation files import those types directly.
  • Kept image reduction behind a dynamic sharp import so core remains dependency-light; sharp is owned by evals where verifier tooling runs.
  • Added a smoke check proving probe-aria, agent-text, agent-json, and tool-output evidence are collected.
  • Removed upstream verifier references from comments.
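
A minimal sketch of the optional-dependency pattern described in the list above, assuming the loader falls back to a no-dedup mode when the image library cannot be resolved. The helper names `loadOptionalModule` and `chooseReductionMode` and the `"reduce"` label are hypothetical and are not taken from this PR; `"no-dedup"` mirrors the kept reason used for the fallback path.

```typescript
// Hypothetical helper: resolves to the named module if it is installed, otherwise null.
async function loadOptionalModule(name: string): Promise<object | null> {
  try {
    // The specifier is a variable, so resolution happens at call time
    // rather than at build time.
    return await import(name);
  } catch {
    // Module missing or failed to load: signal "unavailable" to the caller.
    return null;
  }
}

// Hypothetical caller: pick the processing mode based on availability.
async function chooseReductionMode(
  moduleName: string = "sharp",
): Promise<"reduce" | "no-dedup"> {
  const mod = await loadOptionalModule(moduleName);
  return mod === null ? "no-dedup" : "reduce";
}
```

With this shape, a package that never calls the loader never needs the image library installed, and a caller without it still gets every frame back at native size.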

Tests

  • pnpm --filter @browserbasehq/stagehand run typecheck
  • pnpm -w exec prettier --check packages/core/lib/v3/verifier packages/core/tests/unit/verifier-failure-step-parser.test.ts packages/core/tests/unit/verifier-trajectory.test.ts packages/evals/cli.ts packages/evals/framework packages/evals/tests/framework/rubricCache.test.ts packages/evals/tests/framework/verifierAdapter.test.ts packages/evals/tests/tui/commandTree.test.ts packages/evals/tui packages/evals/tasks/bench/agent
  • Canonical evidence smoke: collectCanonicalEvidence produced probe-aria,agent-text,agent-json,tool-output
  • git diff --check
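
The smoke check above could look roughly like the following sketch. The four source labels come from this PR; the `StepLike` field names and the `collectTextEvidence` helper are illustrative assumptions and do not reflect the actual trajectory types or the real `collectCanonicalEvidence` signature.

```typescript
type TextSource = "probe-aria" | "agent-text" | "agent-json" | "tool-output";

// Hypothetical step shape; the real trajectory step type may differ.
interface StepLike {
  ariaSnippet?: string;
  agentText?: string;
  agentJson?: unknown;
  toolOutput?: string;
}

interface TextEvidence {
  stepIndex: number;
  source: TextSource;
  text: string;
}

// Walk the steps in order and emit one evidence item per populated field.
function collectTextEvidence(steps: StepLike[]): TextEvidence[] {
  const out: TextEvidence[] = [];
  steps.forEach((step, stepIndex) => {
    if (step.ariaSnippet) {
      out.push({ stepIndex, source: "probe-aria", text: step.ariaSnippet });
    }
    if (step.agentText) {
      out.push({ stepIndex, source: "agent-text", text: step.agentText });
    }
    if (step.agentJson !== undefined) {
      out.push({ stepIndex, source: "agent-json", text: JSON.stringify(step.agentJson) });
    }
    if (step.toolOutput) {
      out.push({ stepIndex, source: "tool-output", text: step.toolOutput });
    }
  });
  return out;
}

// Smoke data covering all four sources across two steps.
const smokeSteps: StepLike[] = [
  { ariaSnippet: "button 'Submit'", agentText: "Clicked submit." },
  { agentJson: { total: 3 }, toolOutput: "exit 0" },
];

const smokeSources = collectTextEvidence(smokeSteps)
  .map((e) => e.source)
  .join(",");
console.log(smokeSources); // prints "probe-aria,agent-text,agent-json,tool-output"
```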


changeset-bot Bot commented May 15, 2026

🦋 Changeset detected

Latest commit: 4e9c26e

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 4 packages:

| Name | Type |
| --- | --- |
| @browserbasehq/stagehand | Patch |
| @browserbasehq/stagehand-evals | Patch |
| @browserbasehq/stagehand-server-v3 | Patch |
| @browserbasehq/stagehand-server-v4 | Patch |

Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

1 issue found across 3 files

Confidence score: 2/5

  • High-confidence, high-severity risk: adding sharp in packages/core/package.json introduces a binary/node-gyp-backed dependency in core, which violates the stated core-lib constraint and is likely to cause build/portability regressions.
  • Because this is a concrete policy and runtime/build compatibility concern (severity 9/10, confidence 10/10), this is not a low-risk merge as-is.
  • Pay close attention to packages/core/package.json - sharp should be removed, replaced, or isolated outside core to avoid binary dependency risk.
Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="packages/core/package.json">

<violation number="1" location="packages/core/package.json:113">
P0: Custom agent: **Flag any imports of packages that use node-gyp or embed binaries (e.g. sharp)**

`sharp` (a binary/node-gyp-backed package) was added to `packages/core/package.json` dependencies, violating the core-library restriction on native/binary npm packages.</violation>
</file>
Architecture diagram
sequenceDiagram
    participant Verifier as Evidence Verifier
    participant Loader as loadAndReduceScreenshots
    participant Sharp as Sharp (Optional)
    participant TextCollector as collectCanonicalEvidence
    participant Trajectory as Trajectory Data

    Note over Verifier,Trajectory: Canonical Evidence Collection Flow

    Verifier->>Trajectory: Read trajectory steps
    Verifier->>Loader: Call loadAndReduceScreenshots(trajectory)

    Loader->>Loader: Parse env thresholds (SSIM, MSE, resize)

    loop Each trajectory step
        Loader->>Trajectory: Check probeEvidence.screenshot buffer
        alt Screenshot exists
            Loader->>Loader: Add to rawFrames list
        else No screenshot
            Loader->>Loader: Skip step
        end
    end

    alt Sharp available
        Loader->>Sharp: Dynamic import sharp
        Sharp-->>Loader: sharp instance

        loop Dedup and resize frames
            Loader->>Sharp: calculateMSE(prev.bytes, frame.bytes)
            Sharp-->>Loader: MSE value

            alt MSE >= threshold
                Loader->>Sharp: calculateSSIM(prev.bytes, frame.bytes)
                Sharp-->>Loader: SSIM value

                alt SSIM < threshold
                    Loader->>Loader: Keep frame (keptReason: "diverges")
                else SSIM >= threshold
                    Loader->>Loader: Drop duplicate frame
                end
            else MSE < threshold
                Loader->>Loader: Drop duplicate frame (fast path)
            end

            alt First or last frame
                Loader->>Loader: Always keep (keptReason: "first"/"last")
            end
        end

        Loader->>Sharp: Resize kept frames (resizeFactor)
        Sharp-->>Loader: Resized Buffer
    else Sharp unavailable
        Loader->>Loader: Keep all frames native size (keptReason: "no-dedup")
    end

    Loader-->>Verifier: EvidenceLoadResult (canonical screenshots)

    Verifier->>TextCollector: Call collectCanonicalEvidence(trajectory)

    loop Each trajectory step
        TextCollector->>Trajectory: Extract aria snippet
        TextCollector->>Trajectory: Extract agent text
        TextCollector->>Trajectory: Extract agent JSON
        TextCollector->>Trajectory: Extract tool output

        alt Text evidence exists
            TextCollector->>TextCollector: Create CanonicalTextEvidence with source type
        end
    end

    TextCollector-->>Verifier: Array<CanonicalTextEvidence>

    Verifier->>Verifier: Merge screenshots + text into chronological CanonicalEvidence[]
    
    Note over Verifier: StepIndex → CanonicalIndex mapping preserved

    Verifier-->>Verifier: Evidence ready for relevance scoring (Step 2)
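
The dedup branch of the diagram can be sketched as below. This is an illustration under stated assumptions, not the PR's implementation: frames are treated as raw pixel arrays, each frame is compared against the most recently kept frame, and the SSIM function is injected by the caller (the usage example passes a toy stand-in that is not real SSIM). The names `dedupFrames`, `meanSquaredError`, and `toySimilarity` are hypothetical.

```typescript
type KeptReason = "first" | "last" | "diverges";

interface Frame {
  stepIndex: number;
  pixels: Uint8Array;
}

interface KeptFrame extends Frame {
  keptReason: KeptReason;
}

type Similarity = (a: Uint8Array, b: Uint8Array) => number;

// Mean squared error over raw bytes; mismatched sizes count as maximally different.
function meanSquaredError(a: Uint8Array, b: Uint8Array): number {
  if (a.length !== b.length || a.length === 0) return Number.POSITIVE_INFINITY;
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    const d = a[i] - b[i];
    sum += d * d;
  }
  return sum / a.length;
}

function dedupFrames(
  frames: Frame[],
  mseThreshold: number,
  ssimThreshold: number,
  ssim: Similarity,
): KeptFrame[] {
  const kept: KeptFrame[] = [];
  frames.forEach((frame, i) => {
    // First and last frames are always kept.
    if (i === 0) {
      kept.push({ ...frame, keptReason: "first" });
      return;
    }
    if (i === frames.length - 1) {
      kept.push({ ...frame, keptReason: "last" });
      return;
    }
    const prev = kept[kept.length - 1];
    // Fast path: a low MSE means a near-duplicate, so drop without computing SSIM.
    if (meanSquaredError(prev.pixels, frame.pixels) < mseThreshold) return;
    // Slow path: keep only if structural similarity falls below the threshold.
    if (ssim(prev.pixels, frame.pixels) < ssimThreshold) {
      kept.push({ ...frame, keptReason: "diverges" });
    }
  });
  return kept;
}

// Toy stand-in for SSIM (NOT real SSIM): 1 for identical frames, near 0 for very different ones.
const toySimilarity: Similarity = (a, b) => 1 / (1 + meanSquaredError(a, b));

const demoFrames: Frame[] = [
  { stepIndex: 0, pixels: new Uint8Array([0, 0, 0, 0]) },
  { stepIndex: 1, pixels: new Uint8Array([0, 0, 0, 1]) },
  { stepIndex: 2, pixels: new Uint8Array([100, 100, 100, 100]) },
  { stepIndex: 3, pixels: new Uint8Array([100, 100, 100, 100]) },
];

const demoKept = dedupFrames(demoFrames, 1, 0.9, toySimilarity);

// Step index to canonical index mapping over the kept frames.
const stepToCanonical = new Map<number, number>(
  demoKept.map((f, canonicalIndex): [number, number] => [f.stepIndex, canonicalIndex]),
);
```

In this example step 1 is dropped on the MSE fast path, step 2 is kept as `"diverges"`, and steps 0 and 3 are kept as `"first"` and `"last"`, so step 2 maps to canonical index 1.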


Comment thread packages/core/package.json Outdated
@miguelg719 miguelg719 force-pushed the miguelgonzalez/verifier-04-evidence-normalization branch from 3fee3cb to 7d010ed Compare May 15, 2026 21:23
@miguelg719 miguelg719 force-pushed the miguelgonzalez/verifier-03-trajectory-recorder branch 2 times, most recently from da0c152 to d77e596 Compare May 15, 2026 21:45
@miguelg719 miguelg719 force-pushed the miguelgonzalez/verifier-04-evidence-normalization branch 3 times, most recently from a152252 to fc5a9f7 Compare May 15, 2026 23:27
@miguelg719 miguelg719 force-pushed the miguelgonzalez/verifier-03-trajectory-recorder branch from 83b4e86 to fd043bc Compare May 16, 2026 04:40
@miguelg719 miguelg719 force-pushed the miguelgonzalez/verifier-04-evidence-normalization branch from fc5a9f7 to be24a26 Compare May 16, 2026 04:40
@miguelg719 miguelg719 force-pushed the miguelgonzalez/verifier-03-trajectory-recorder branch from fd043bc to 635b3d2 Compare May 16, 2026 05:50
@miguelg719 miguelg719 force-pushed the miguelgonzalez/verifier-04-evidence-normalization branch from be24a26 to 4e9c26e Compare May 16, 2026 05:50