feat(evals): add offline verifier CLI#2134

Open
miguelg719 wants to merge 7 commits into
miguelgonzalez/verifier-05-core-engine from
miguelgonzalez/verifier-06-offline-cli

Conversation

Collaborator

@miguelg719 miguelg719 commented May 15, 2026

Why

Verifier iteration should not require rerunning browser automation. This PR adds offline saved-trajectory rescoring so prompts, approaches, and scoring behavior can be compared against the same trajectory artifacts without changing the existing CLI architecture.

What Changed

  • Added evals verify <trajectory-dir> for offline trajectory rescoring through the existing command-tree dispatch.
  • Preserved REPL quiet handling, first-run state, welcome behavior, and existing command-tree routing.
  • Added rubric cache utilities for generated rubrics.
  • Hardened rubric cache reads to verify both taskId and instruction hash before returning cached data.
  • Added live/offline verifier scripts.
  • Added TUI command parsing and help support for the verify command.
  • Ensured offline verification explicitly uses the verifier backend.
  • Removed upstream verifier references from comments.
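
To illustrate the hardened cache read, here is a minimal sketch, not the PR's actual code. The `TaskSpec` and `CacheEntry` shapes, the in-memory `store`, the `sanitizeId` pattern, and the non-cryptographic `hashInstruction` stand-in are all assumptions made so the example runs without imports. A real cache would read files from disk and likely use a cryptographic hash.

```typescript
// Hypothetical shapes; the real types in rubricCache.ts are not shown in this PR page.
interface TaskSpec {
  id: string;
  instruction: string;
}

interface CacheEntry {
  taskId: string;
  instructionHash: string;
  rubric: unknown;
}

// Assumed sanitization: characters such as ":" and "/" collapse to "_",
// so "a:b" and "a/b" end up under the same cache key.
function sanitizeId(id: string): string {
  return id.replace(/[^a-zA-Z0-9._-]/g, "_");
}

// Stand-in hash (32-bit FNV-1a) used only to keep the sketch import-free.
function hashInstruction(instruction: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < instruction.length; i++) {
    h ^= instruction.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h.toString(16).padStart(8, "0");
}

// In-memory stand-in for the cache files on disk.
const store = new Map<string, string>();

function writeRubric(task: TaskSpec, rubric: unknown): void {
  const entry: CacheEntry = {
    taskId: task.id,
    instructionHash: hashInstruction(task.instruction),
    rubric,
  };
  store.set(sanitizeId(task.id), JSON.stringify(entry));
}

function readRubric(task: TaskSpec): unknown {
  const raw = store.get(sanitizeId(task.id));
  if (raw === undefined) return undefined;
  const parsed = JSON.parse(raw) as CacheEntry;
  // Both checks must pass: the entry belongs to this exact task ID...
  if (parsed.taskId !== task.id) return undefined;
  // ...and it was generated from the current instruction text.
  if (parsed.instructionHash !== hashInstruction(task.instruction)) return undefined;
  return parsed.rubric;
}
```

With both checks in place, a sanitized-ID collision or an edited instruction is treated as a cache miss and the rubric is regenerated, rather than a mismatched rubric being returned.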

Tests

  • pnpm --filter @browserbasehq/stagehand run typecheck
  • pnpm --filter @browserbasehq/stagehand-evals run typecheck
  • pnpm --dir packages/evals exec vitest run tests/framework/rubricCache.test.ts tests/framework/verifierAdapter.test.ts tests/tui/commandTree.test.ts tests/framework/claudeCodeRunner.test.ts tests/framework/codexRunner.test.ts tests/cli.test.ts
  • pnpm -w exec prettier --check packages/core/lib/v3/verifier packages/core/tests/unit/verifier-failure-step-parser.test.ts packages/evals/cli.ts packages/evals/framework packages/evals/tests/framework/rubricCache.test.ts packages/evals/tests/framework/verifierAdapter.test.ts packages/evals/tests/tui/commandTree.test.ts packages/evals/tui packages/evals/tasks/bench/agent
  • git diff --check

changeset-bot Bot commented May 15, 2026

⚠️ No Changeset found

Latest commit: 4f141e7

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

3 issues found across 9 files

Confidence score: 3/5

  • There is some concrete regression risk here: packages/evals/framework/rubricCache.ts read does not confirm parsed.taskId === taskSpec.id, so sanitized ID collisions can return the wrong cached rubric data when hashes align.
  • Two CLI/runtime behaviors are likely to confuse users but are straightforward to fix: packages/evals/tui/commands/verify.ts silently ignores trailing --model/--label without values, and packages/evals/scripts/verify-live-trajectory.ts passes timeoutMs to page.goto() (ignored), causing fallback to Playwright’s default timeout.
  • Given one medium-severity correctness issue plus two medium input/timeout handling issues, this looks mergeable with caution after targeted fixes rather than a hard block.
  • Pay close attention to packages/evals/framework/rubricCache.ts, packages/evals/tui/commands/verify.ts, and packages/evals/scripts/verify-live-trajectory.ts - cache key/task ID validation, missing flag-value errors, and ignored navigation timeout options need verification.
Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="packages/evals/tui/commands/verify.ts">

<violation number="1" location="packages/evals/tui/commands/verify.ts:89">
P2: Missing validation when `--model` or `--label` is passed without a following value. If either is the last argument, `args[++i]` is `undefined` and the flag is silently ignored rather than producing an error.</violation>
</file>

<file name="packages/evals/framework/rubricCache.ts">

<violation number="1" location="packages/evals/framework/rubricCache.ts:95">
P2: The `read` method does not verify `parsed.taskId` matches `taskSpec.id`. Since `entryPath` sanitizes characters (`:`, `/`, etc.) to `_`, distinct task IDs can map to the same file. When instruction hashes also happen to match, a stale/wrong rubric is served silently. Add a `taskId` equality check alongside the `instructionHash` check.</violation>
</file>

<file name="packages/evals/scripts/verify-live-trajectory.ts">

<violation number="1" location="packages/evals/scripts/verify-live-trajectory.ts:38">
P2: Playwright's `page.goto()` accepts `timeout`, not `timeoutMs`. This option is silently ignored, so the navigation falls back to the default 30s timeout instead of the intended 60s.</violation>
</file>
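
The timeout finding can be illustrated with a small stand-in, which is a sketch and not Playwright itself. `effectiveTimeout` is a hypothetical helper that models an options bag in which only the `timeout` key is read and the default is 30 seconds, as the comment above describes.

```typescript
// Default navigation timeout named in the review comment above (30s).
const DEFAULT_TIMEOUT_MS = 30000;

// Hypothetical stand-in for an API that only reads `timeout` from its options.
// Unknown keys such as `timeoutMs` are never looked at, so they have no effect.
function effectiveTimeout(options: Record<string, unknown> = {}): number {
  const value = options["timeout"];
  return typeof value === "number" ? value : DEFAULT_TIMEOUT_MS;
}

// Misnamed key: silently falls back to the default.
const withWrongKey = effectiveTimeout({ timeoutMs: 60000 });

// Correct key: the intended 60s is honored.
const withRightKey = effectiveTimeout({ timeout: 60000 });

console.log({ withWrongKey, withRightKey });
```

The likely fix in the script is to rename the option to `timeout`. A typed options object would normally flag the misnamed key at compile time, which suggests the call site is loosely typed.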
Architecture diagram
sequenceDiagram
    participant CLI as CLI / terminal
    participant CmdRouter as Command Router (cli.ts)
    participant VerifyCmd as verify Command
    participant Trajectory as Trajectory Dir (disk)
    participant RubricCache as RubricCache
    participant V3Eval as V3Evaluator (verifier backend)
    participant V3 as V3 instance (headless)
    participant TrajectoryRecorder as TrajectoryRecorder (live)

    Note over CLI,V3: NEW: Offline verify path (red arrow) vs existing live run (blue arrows)

    CLI->>CmdRouter: evals verify <trajectory-dir> [options]
    CmdRouter->>VerifyCmd: handleVerify(args)
    VerifyCmd->>Trajectory: read trajectory.json + task_data.json
    Trajectory-->>VerifyCmd: Trajectory + TaskSpec
    VerifyCmd->>V3Eval: new V3Evaluator(v3, {backend:"verifier"})
    Note over VerifyCmd,V3Eval: No browser launched — V3 constructed without init()
    VerifyCmd->>V3Eval: verify(trajectory, taskSpec)
    V3Eval->>RubricCache: getOrGenerate(taskSpec, evaluator)
    alt Cache hit (same instruction hash)
        RubricCache-->>V3Eval: cached Rubric
    else Cache miss or hash drift
        RubricCache->>RubricCache: hashInstruction(taskSpec.instruction)
        V3Eval->>V3Eval: generateRubric(taskSpec) — Step 0a
        V3Eval->>RubricCache: write(taskSpec, rubric)
        RubricCache-->>V3Eval: freshly generated Rubric
    end
    V3Eval->>V3Eval: score trajectory against rubric — Step 8
    V3Eval-->>VerifyCmd: Verdict (outcomeSuccess, processScore, perCriterion)
    alt --json flag
        VerifyCmd->>CLI: JSON stringified Verdict to stdout
    else default (human summary)
        VerifyCmd->>CLI: colored summary (score, criteria, findings)
        alt --dry-run not set
            VerifyCmd->>Trajectory: write scores/mmrubric_<label>.json
        end
    end

    Note over CLI,Trajectory: Live run path (unchanged, shown for context)
    CLI->>CmdRouter: evals run <target>
    CmdRouter->>V3: agent.execute(instruction)
    V3->>TrajectoryRecorder: start() — subscribe to bus events
    V3->>V3: perform browser automation steps
    V3->>TrajectoryRecorder: capture step events (screenshots, URLs, evidence)
    V3-->>CmdRouter: agent result
    TrajectoryRecorder->>Trajectory: persist() — write trajectory.json, screenshots, task_data.json, times.json
    CmdRouter->>CLI: run summary

    Note over CmdRouter,V3: Success mode plumbing (--success flag)
    CmdRouter->>CmdRouter: resolve successMode from --success / EVAL_SUCCESS_MODE / "outcome"
    CmdRouter->>V3: envOverrides.EVAL_SUCCESS_MODE = successMode

parsed.json = true;
} else if (a === "--dry-run") {
parsed.dryRun = true;
} else if (a === "--model") {

@cubic-dev-ai cubic-dev-ai Bot May 15, 2026

P2: Missing validation when --model or --label is passed without a following value. If either is the last argument, args[++i] is undefined and the flag is silently ignored rather than producing an error.
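
One way to address this is sketched below, under stated assumptions. The `VerifyArgs` shape, the `takeValue` helper, and the error messages are hypothetical, and rejecting a value that itself starts with `--` is an extra choice that goes beyond what the comment asks for.

```typescript
// Hypothetical argument shape; the real one in verify.ts may differ.
interface VerifyArgs {
  trajectoryDir?: string;
  json: boolean;
  dryRun: boolean;
  model?: string;
  label?: string;
}

// Returns the value following a flag, or throws if it is missing.
// Treating a following "--flag" as missing is an assumption of this sketch.
function takeValue(args: string[], flagIndex: number, flag: string): string {
  const value = args[flagIndex + 1];
  if (value === undefined || value.startsWith("--")) {
    throw new Error(`${flag} requires a value`);
  }
  return value;
}

function parseVerifyArgs(args: string[]): VerifyArgs {
  const parsed: VerifyArgs = { json: false, dryRun: false };
  for (let i = 0; i < args.length; i++) {
    const a = args[i] as string;
    if (a === "--json") {
      parsed.json = true;
    } else if (a === "--dry-run") {
      parsed.dryRun = true;
    } else if (a === "--model") {
      parsed.model = takeValue(args, i, a);
      i++; // skip the consumed value
    } else if (a === "--label") {
      parsed.label = takeValue(args, i, a);
      i++; // skip the consumed value
    } else if (parsed.trajectoryDir === undefined) {
      // First bare argument is taken as the trajectory directory.
      parsed.trajectoryDir = a;
    } else {
      throw new Error(`Unexpected argument: ${a}`);
    }
  }
  return parsed;
}
```

With this shape, `evals verify <trajectory-dir> --model` fails with an explicit error instead of running with the flag silently dropped.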

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At packages/evals/tui/commands/verify.ts, line 89:

<comment>Missing validation when `--model` or `--label` is passed without a following value. If either is the last argument, `args[++i]` is `undefined` and the flag is silently ignored rather than producing an error.</comment>

<file context>
@@ -0,0 +1,238 @@
+      parsed.json = true;
+    } else if (a === "--dry-run") {
+      parsed.dryRun = true;
+    } else if (a === "--model") {
+      parsed.model = args[++i];
+    } else if (a === "--label") {
</file context>

Comment thread packages/evals/framework/rubricCache.ts
Comment thread packages/evals/scripts/verify-live-trajectory.ts Outdated
@miguelg719 miguelg719 force-pushed the miguelgonzalez/verifier-05-core-engine branch from 163db47 to ebe60bf Compare May 15, 2026 21:23
@miguelg719 miguelg719 force-pushed the miguelgonzalez/verifier-06-offline-cli branch 2 times, most recently from 4923ce6 to dcc5bfc Compare May 15, 2026 21:45
@miguelg719 miguelg719 force-pushed the miguelgonzalez/verifier-05-core-engine branch from ebe60bf to 191904b Compare May 15, 2026 21:45
@miguelg719 miguelg719 force-pushed the miguelgonzalez/verifier-06-offline-cli branch from dcc5bfc to d736522 Compare May 15, 2026 22:33
@miguelg719 miguelg719 force-pushed the miguelgonzalez/verifier-05-core-engine branch from 191904b to 62cb8db Compare May 15, 2026 22:33
@miguelg719 miguelg719 force-pushed the miguelgonzalez/verifier-06-offline-cli branch from d736522 to cd1f8f4 Compare May 15, 2026 23:27
@miguelg719 miguelg719 force-pushed the miguelgonzalez/verifier-05-core-engine branch 2 times, most recently from a6ee702 to 2e7ff0f Compare May 16, 2026 04:40
@miguelg719 miguelg719 force-pushed the miguelgonzalez/verifier-06-offline-cli branch from cd1f8f4 to 95ada04 Compare May 16, 2026 04:40
@miguelg719 miguelg719 force-pushed the miguelgonzalez/verifier-05-core-engine branch from 2e7ff0f to b725247 Compare May 16, 2026 05:50
@miguelg719 miguelg719 force-pushed the miguelgonzalez/verifier-06-offline-cli branch from 95ada04 to 4f141e7 Compare May 16, 2026 05:50