docs(verifier): add README for the new rubric verifier #2139
Draft
miguelg719 wants to merge 1 commit into
Documents the verifier system landing across PRs #2129-#2138: pipeline architecture, Approach A vs B tradeoffs, env knobs, on-disk trajectory layout, offline `bench verify` usage, external harness adapter integration, prompts library, error taxonomy, and known limitations.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Adds a README at `packages/core/lib/v3/verifier/README.md` that documents the rubric verifier landing across PRs #2129-#2138. Pure documentation, no code changes.

Covers:
- `Verdict` shape and the outcome-vs-process distinction (load-bearing rule: render failures count against `outcomeSuccess` regardless of cause; effort credit goes to `processScore`)
- Env knobs (`VERIFIER_APPROACH`, `VERIFIER_OPTIONAL_STEPS`, `VERIFIER_RELEVANCE_BATCH_SIZE`, `VERIFIER_MAX_PARALLEL`, `VERIFIER_TOP_K`, dedup thresholds, cache controls, `STAGEHAND_EVALUATOR_BACKEND`)
- Public API: `V3Evaluator.verify()` / `.generateRubric()` and the hands-off `runWithVerifier` adapter
- On-disk trajectory layout: `task_data.json` + `trajectory.json` + `screenshots/{agent,probe}/N.png` + `times.json` + `scores/mmrubric_*.json`
- Offline `evals verify <dir>` usage for ~30s prompt iteration cycles

Why standalone
Independent of the 10 implementation PRs, so it can land at any stack position without rebase churn. Targets the same directory where the verifier source files will live (`packages/core/lib/v3/verifier/`), so readers will find it naturally once the rest merges.

Test plan
- `gh pr view <num>` renders the mermaid block correctly
- References to `packages/evals/framework/verifierAdapter.ts` etc. point at the right files post-merge

🤖 Generated with Claude Code
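For reviewers who want a concrete picture of the on-disk trajectory layout named in the summary, here is a minimal TypeScript sketch that checks a list of relative paths against those entries. The helper name `missingTrajectoryEntries` is hypothetical, and two things are assumptions rather than facts from the verifier source: that every listed entry is required, and that `N` in `N.png` is a numeric index.

```typescript
// Hypothetical helper for illustration; not part of the verifier source.
// It encodes only the file names listed in this PR description and
// ASSUMES that every one of them must be present.
const REQUIRED_FILES = ["task_data.json", "trajectory.json", "times.json"];

// ASSUMPTION: "N" in screenshots/{agent,probe}/N.png is a numeric index.
const PATTERNS: Array<[string, RegExp]> = [
  ["screenshots/agent/N.png", /^screenshots\/agent\/\d+\.png$/],
  ["screenshots/probe/N.png", /^screenshots\/probe\/\d+\.png$/],
  ["scores/mmrubric_*.json", /^scores\/mmrubric_[^/]*\.json$/],
];

// Returns the layout entries that have no match in `relativePaths`
// (paths relative to the trajectory directory, using "/" separators).
function missingTrajectoryEntries(relativePaths: string[]): string[] {
  const present = new Set(relativePaths);
  const missing = REQUIRED_FILES.filter((file) => !present.has(file));
  for (const [label, pattern] of PATTERNS) {
    if (!relativePaths.some((p) => pattern.test(p))) {
      missing.push(label);
    }
  }
  return missing;
}
```

A check like this could run before pointing `evals verify <dir>` at a directory, so an incomplete trajectory is caught before any scoring starts.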
Summary by cubic
Adds a README for the rubric-based verifier at `packages/core/lib/v3/verifier/`, documenting outputs, pipeline (Approach A vs B), configuration, public API, on-disk layout, offline re-scoring, adapters, prompts, taxonomy, performance, and limitations. Docs only; no code changes.
- To route `V3Evaluator.verify()` through the new verifier, set `STAGEHAND_EVALUATOR_BACKEND=verifier`.
- Re-score offline with `evals verify <trajectory-dir>`.

Written for commit 690572f. Summary will update on new commits.
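As a rough illustration of the backend switch mentioned in this summary, the sketch below reads `STAGEHAND_EVALUATOR_BACKEND` from an environment map and lists which of the documented verifier knobs are set. The helper names and the `EnvMap` type are hypothetical. The knob names come from the PR description; their accepted values and defaults are not stated there, so the sketch leaves them uninterpreted.

```typescript
// Hypothetical helpers for illustration; not the verifier's own config code.
type EnvMap = Record<string, string | undefined>;

// Knob names as listed in the PR description. Values and defaults are not
// interpreted here because the description does not state them.
const VERIFIER_KNOBS = [
  "VERIFIER_APPROACH",
  "VERIFIER_OPTIONAL_STEPS",
  "VERIFIER_RELEVANCE_BATCH_SIZE",
  "VERIFIER_MAX_PARALLEL",
  "VERIFIER_TOP_K",
  "STAGEHAND_EVALUATOR_BACKEND",
] as const;

// True when the environment selects the new verifier backend, per the
// `STAGEHAND_EVALUATOR_BACKEND=verifier` setting quoted in the summary.
function usesVerifierBackend(env: EnvMap): boolean {
  return env.STAGEHAND_EVALUATOR_BACKEND === "verifier";
}

// Returns only the documented knobs that are set to a non-empty string.
function setVerifierKnobs(env: EnvMap): Record<string, string> {
  const out: Record<string, string> = {};
  for (const name of VERIFIER_KNOBS) {
    const value = env[name];
    if (value !== undefined && value !== "") {
      out[name] = value;
    }
  }
  return out;
}
```

In Node, both helpers could be called with `process.env`, for example to log the active knobs at the start of an eval run.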