docs(verifier): add README for the new rubric verifier #2139

Draft: miguelg719 wants to merge 1 commit into main from miguelgonzalez/verifier-readme


miguelg719 (Collaborator) commented May 15, 2026

Summary

Adds a README at packages/core/lib/v3/verifier/README.md that documents the rubric verifier landing across PRs #2129-#2138. Pure documentation — no code changes.

Covers:

  • What it produces: the Verdict shape and the outcome-vs-process distinction (load-bearing rule: render failures count against outcomeSuccess regardless of cause; effort credit goes to processScore)
  • Pipeline diagram (mermaid): Step 0a rubric → Step 1 canonical evidence (SSIM/MSE dedup + downscale) → Step 2 batched relevance → Top-K → Approach A or Approach B
  • Approach A vs B: fault-isolated per-criterion path vs single fused multimodal call, with accuracy + cost tradeoffs
  • Configuration: every env var (VERIFIER_APPROACH, VERIFIER_OPTIONAL_STEPS, VERIFIER_RELEVANCE_BATCH_SIZE, VERIFIER_MAX_PARALLEL, VERIFIER_TOP_K, dedup thresholds, cache controls, STAGEHAND_EVALUATOR_BACKEND)
  • Public API: V3Evaluator.verify() / .generateRubric() and the hands-off runWithVerifier adapter
  • On-disk trajectory layout: fara-compatible task_data.json + trajectory.json + screenshots/{agent,probe}/N.png + times.json + scores/mmrubric_*.json
  • Offline re-scoring: evals verify <dir> usage for ~30s prompt iteration cycles
  • External harness adapters: Codex / Claude Code path with tier-1-only evidence
  • Prompts library + error taxonomy: one line per file/category describing what each does
  • Performance characteristics: 28-50 LLM calls / ~$0.12 per run / 3-10× cheaper than the FARA original
  • Known limitations: Step 0a parse robustness, tier-1 image dedup gap, processScore non-determinism, Browserbase quota interaction
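
To make the outcome-vs-process rule concrete, here is a minimal TypeScript sketch. Only the names Verdict, outcomeSuccess, and processScore come from this description. The CriterionResult type, its fields, the toVerdict helper, and the aggregation (all criteria must pass; process score is the mean effort credit) are hypothetical and may not match the actual verifier.

```typescript
// Hypothetical per-criterion result; these field names are illustrative only.
interface CriterionResult {
  id: string;
  met: boolean;          // did the final state satisfy the criterion?
  effortCredit: number;  // 0..1 credit for how far the agent got
  renderFailed: boolean; // evidence for this criterion failed to render
}

// Only outcomeSuccess and processScore are named in the PR description.
interface Verdict {
  outcomeSuccess: boolean;
  processScore: number;
}

// Assumed aggregation: every criterion must pass for outcome success,
// and processScore is the mean of the per-criterion effort credit.
function toVerdict(results: CriterionResult[]): Verdict {
  // A render failure counts against the outcome regardless of its cause.
  const outcomeSuccess =
    results.length > 0 && results.every((r) => r.met && !r.renderFailed);

  // Effort credit is tracked separately, so a failed outcome can still
  // carry a non-zero process score.
  const processScore =
    results.length === 0
      ? 0
      : results.reduce((sum, r) => sum + r.effortCredit, 0) / results.length;

  return { outcomeSuccess, processScore };
}
```

Under these assumptions, a run whose criteria are all met but where one check hit a render failure would report outcomeSuccess as false while keeping whatever processScore the effort credit yields.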

Why standalone

Independent of the 10 implementation PRs so it can land at any stack position without rebase churn. Targets the same directory where the verifier source files will live (packages/core/lib/v3/verifier/) — readers will find it naturally once the rest merges.

Test plan

🤖 Generated with Claude Code


Summary by cubic

Adds a README for the rubric-based verifier at packages/core/lib/v3/verifier/, documenting outputs, pipeline (Approach A vs B), configuration, public API, on-disk layout, offline re-scoring, adapters, prompts, taxonomy, performance, and limitations. Docs only; no code changes.

  • Migration
    • To route V3Evaluator.verify() through the new verifier, set STAGEHAND_EVALUATOR_BACKEND=verifier.
    • To re-score saved trajectories without running an agent, use evals verify <trajectory-dir>.
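
As an illustration of the backend switch, the sketch below reads a few of the environment variables listed in this PR from a plain object. The variable names and the value verifier for STAGEHAND_EVALUATOR_BACKEND come from the description. The readVerifierEnv helper, the VerifierEnv type, and the assumption that the numeric knobs are positive integers are hypothetical. No defaults are shown because the description does not state them.

```typescript
type Env = Record<string, string | undefined>;

// Hypothetical parsed view of the verifier-related environment variables.
interface VerifierEnv {
  useVerifier: boolean;
  approach?: string;
  relevanceBatchSize?: number;
  maxParallel?: number;
  topK?: number;
}

// Assumption: numeric knobs are positive integers; anything else is treated as unset.
function parsePositiveInt(raw: string | undefined): number | undefined {
  if (raw === undefined || raw.trim() === "") return undefined;
  const n = Number(raw);
  return Number.isInteger(n) && n > 0 ? n : undefined;
}

function readVerifierEnv(env: Env): VerifierEnv {
  return {
    // Per the migration note, "verifier" routes V3Evaluator.verify() to the new backend.
    useVerifier: env.STAGEHAND_EVALUATOR_BACKEND === "verifier",
    approach: env.VERIFIER_APPROACH,
    relevanceBatchSize: parsePositiveInt(env.VERIFIER_RELEVANCE_BATCH_SIZE),
    maxParallel: parsePositiveInt(env.VERIFIER_MAX_PARALLEL),
    topK: parsePositiveInt(env.VERIFIER_TOP_K),
  };
}
```

In a Node process, this could be called as readVerifierEnv(process.env). Taking the environment as a parameter keeps the helper easy to test.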

Written for commit 690572f. Summary will update on new commits.

Documents the verifier system landing across PRs #2129-#2138: pipeline
architecture, Approach A vs B tradeoffs, env knobs, on-disk trajectory
layout, offline `bench verify` usage, external harness adapter integration,
prompts library, error taxonomy, and known limitations.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

changeset-bot Bot commented May 15, 2026

⚠️ No Changeset found

Latest commit: 690572f

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types
