feat(evals): add verifier benchmark instrumentation #2138
base: miguelgonzalez/verifier-09-harness-adapters
Changes from all commits
0eb6685
06d5a4f
457754b
a04286f
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,42 @@ | ||
| # Verifier Benchmark Matrix | ||
|
|
||
| Use this matrix before changing `STAGEHAND_EVALUATOR_BACKEND` defaults. | ||
| `STAGEHAND_EVALUATOR_BACKEND` selects the public evaluator backend; `VERIFIER_*` | ||
| flags tune the verifier internals once that backend is selected. | ||
|
|
||
| ```bash | ||
| STAGEHAND_EVALUATOR_BACKEND=legacy | ||
| STAGEHAND_EVALUATOR_BACKEND=verifier VERIFIER_APPROACH=outcome-only | ||
| STAGEHAND_EVALUATOR_BACKEND=verifier VERIFIER_APPROACH=a | ||
| STAGEHAND_EVALUATOR_BACKEND=verifier VERIFIER_APPROACH=b | ||
| ``` | ||
|
|
||
| Use `VERIFIER_APPROACH=outcome-only` as the verifier default for benchmarks | ||
| without curated rubrics. Use approaches `a` and `b` when evaluating the rubric | ||
| pipeline itself or datasets with trusted precomputed rubrics. | ||
|
|
||
| For saved trajectories, run verifier approaches against the same agent outputs | ||
| so verifier quality is isolated from agent variance: | ||
|
|
||
| ```bash | ||
| TRAJECTORY_GLOB=".trajectories/<run-prefix>*" scripts/cross-verify-parallel.sh | ||
| ``` | ||
|
|
||
| Optional environment: | ||
|
|
||
| ```bash | ||
| EVALS_ENV_FILE=~/.envs/prod-evals.env | ||
| PARALLEL=8 | ||
| VERIFIER_OPTIONAL_STEPS=folded | ||
| ``` | ||
|
|
||
| Report at least: | ||
|
|
||
| - accuracy against manually reviewed labels | ||
| - false positives and false negatives | ||
| - invalid or ambiguous task handling | ||
| - evidence-insufficient count | ||
| - latency and model cost | ||
|
|
||
| Do not flip the default backend until verifier results beat or match legacy on | ||
| the target datasets and failure analysis is reviewed. |
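The reporting checklist above can be sketched as a small tally over reviewed labels. This is a minimal sketch, not part of the PR: the tab-separated input format (`task`, manual `label`, verifier `verdict`, each `pass` or `fail`) and the `summarize_verdicts` name are assumptions made for illustration, and it covers only accuracy, false positives, and false negatives.

```bash
#!/usr/bin/env bash
# Hypothetical input: one "task<TAB>label<TAB>verdict" row per trajectory.
# label is the manual review, verdict is the verifier's call; both columns
# are assumed to hold "pass" or "fail". Rows with other values are ignored.
summarize_verdicts() {
  awk -F'\t' '
    $2 == "pass" && $3 == "pass" { tp++ }
    $2 == "fail" && $3 == "fail" { tn++ }
    $2 == "fail" && $3 == "pass" { fp++ }   # verifier passed a run the reviewer failed
    $2 == "pass" && $3 == "fail" { fn++ }   # verifier failed a run the reviewer passed
    END {
      total = tp + tn + fp + fn
      if (total == 0) { print "no rows"; exit 1 }
      printf "accuracy=%.2f fp=%d fn=%d n=%d\n", (tp + tn) / total, fp, fn, total
    }
  '
}

# Four made-up rows, one per outcome; prints: accuracy=0.50 fp=1 fn=1 n=4
printf 'task-1\tpass\tpass\ntask-2\tfail\tpass\ntask-3\tfail\tfail\ntask-4\tpass\tfail\n' \
  | summarize_verdicts
```

Running the same tally once per approach label (`cross-outcome-only`, `cross-a`, `cross-b`) against one set of reviewed labels keeps the comparison on identical agent outputs.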
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,64 @@ | ||
| #!/usr/bin/env bash | ||
| # Parallel cross-verify: up to $PARALLEL (default 8) verifier processes in | ||
| # flight at once across outcome-only plus the rubric approaches. | ||
|
|
||
| set -e | ||
| cd "$(dirname "$0")/.." | ||
|
|
||
| if [[ -n "${EVALS_ENV_FILE:-}" && -f "$EVALS_ENV_FILE" ]]; then | ||
| set -a | ||
| source "$EVALS_ENV_FILE" | ||
| set +a | ||
| fi | ||
|
|
||
| PARALLEL=${PARALLEL:-8} | ||
| TRAJECTORY_GLOB=${TRAJECTORY_GLOB:-.trajectories/*} | ||
|
|
||
| DIRS=() | ||
| while IFS= read -r d; do | ||
| DIRS+=("$d") | ||
| done < <(find $TRAJECTORY_GLOB -mindepth 1 -maxdepth 1 -type d | sort) | ||
|
|
||
| echo "[$(date +%H:%M:%S)] Found ${#DIRS[@]} trajectory dirs; parallelism=$PARALLEL" | ||
|
|
||
| run_one() { | ||
| local dir="$1" | ||
| local approach="$2" | ||
| local label="cross-${approach}" | ||
| local out_file="$dir/scores/result_${label}.json" | ||
| local task | ||
| task=$(basename "$dir") | ||
| if [[ -f "$out_file" ]]; then | ||
| echo "[$(date +%H:%M:%S)] [$approach] $task: skip (exists)" | ||
| return 0 | ||
| fi | ||
| local start | ||
| start=$(date +%s) | ||
| if VERIFIER_APPROACH=$approach VERIFIER_OPTIONAL_STEPS=folded \ | ||
| pnpm exec tsx packages/evals/cli.ts verify "$dir" --label "$label" > /tmp/verify-$$-$task-$approach.log 2>&1; then | ||
| echo "[$(date +%H:%M:%S)] [$approach] $task: done in $(( $(date +%s) - start ))s" | ||
| else | ||
| echo "[$(date +%H:%M:%S)] [$approach] $task: FAILED in $(( $(date +%s) - start ))s; see /tmp/verify-$$-$task-$approach.log" | ||
| fi | ||
| } | ||
| export -f run_one | ||
| export PARALLEL | ||
|
|
||
| # Build (dir, approach) job list and feed to xargs -P. | ||
| JOBS=() | ||
| for d in "${DIRS[@]}"; do | ||
| JOBS+=("$d|outcome-only") | ||
| done | ||
| for d in "${DIRS[@]}"; do | ||
| JOBS+=("$d|b") | ||
| done | ||
| for d in "${DIRS[@]}"; do | ||
| JOBS+=("$d|a") | ||
| done | ||
|
|
||
| printf '%s\n' "${JOBS[@]}" | xargs -I {} -P "$PARALLEL" bash -c ' | ||
|
Contributor
P2: Use null-delimited input ( |
||
| IFS="|" read -r dir approach <<< "$1" | ||
| run_one "$dir" "$approach" | ||
| ' _ {} | ||
|
|
||
| echo "[$(date +%H:%M:%S)] All cross-verifications complete." | ||
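The review comment on the `xargs` line asks for null-delimited input. With newline-delimited input, `xargs` gives special meaning to quotes and backslashes, so a trajectory path containing an apostrophe can break the job list. The sketch below shows one way this could look, assuming an `xargs` that supports `-0` (GNU and BSD both do). The `dispatch_jobs` name and the echo-only `run_one` are stand-ins for illustration, not the script's real code.

```bash
#!/usr/bin/env bash
# Stand-in for the script's run_one: it only reports what it was given.
run_one() {
  printf 'dir=%s approach=%s\n' "$1" "$2"
}
export -f run_one

# Hypothetical helper: takes "dir|approach" jobs as arguments and runs
# them in parallel. Each job is NUL-terminated and handed to bash as $1,
# so xargs does not interpret quotes, backslashes, or spaces in a path.
dispatch_jobs() {
  printf '%s\0' "$@" | xargs -0 -n 1 -P "${PARALLEL:-2}" bash -c '
    IFS="|" read -r dir approach <<< "$1"
    run_one "$dir" "$approach"
  ' _
}

# Output order varies between runs because the jobs run concurrently.
dispatch_jobs "traj/it's a task|outcome-only" "traj/plain|b"
```

Passing the job as a positional argument instead of substituting it with `-I {}` also avoids combining `-I` with `-n`, which POSIX describes as mutually exclusive.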
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,44 @@ | ||
| #!/usr/bin/env bash | ||
| # Re-verify each stored trajectory under each verifier approach via `bench verify`. | ||
| # Lets us isolate verifier disagreement from agent variance. | ||
| # | ||
| # Inputs: every trajectory dir matched by TRAJECTORY_GLOB. | ||
| # Outputs: scores/result_cross-{outcome-only,a,b}.json next to each trajectory. | ||
|
|
||
| set -e | ||
| cd "$(dirname "$0")/.." | ||
|
|
||
| if [[ -n "${EVALS_ENV_FILE:-}" && -f "$EVALS_ENV_FILE" ]]; then | ||
| set -a | ||
| source "$EVALS_ENV_FILE" | ||
| set +a | ||
| fi | ||
|
|
||
| # Collect trajectory dirs from persisted verifier runs. | ||
| TRAJECTORY_GLOB=${TRAJECTORY_GLOB:-.trajectories/*} | ||
| DIRS=() | ||
| while IFS= read -r d; do | ||
| DIRS+=("$d") | ||
| done < <(find $TRAJECTORY_GLOB -mindepth 1 -maxdepth 1 -type d | sort) | ||
|
Contributor
P2: Quote |
||
|
|
||
| echo "Found ${#DIRS[@]} trajectory dirs" | ||
| for d in "${DIRS[@]}"; do | ||
| task=$(basename "$d") | ||
| echo "=== $(basename "$(dirname "$d")")/$task ===" | ||
| for approach in outcome-only b a; do | ||
| label="cross-${approach}" | ||
| out_file="$d/scores/result_${label}.json" | ||
| if [[ -f "$out_file" ]]; then | ||
| echo " [$approach] already exists, skipping" | ||
| continue | ||
| fi | ||
| echo " [$approach] verifying..." | ||
| start=$(date +%s) | ||
| VERIFIER_APPROACH=$approach VERIFIER_OPTIONAL_STEPS=folded \ | ||
| pnpm exec tsx packages/evals/cli.ts verify "$d" --label "$label" > /dev/null 2>&1 | ||
| end=$(date +%s) | ||
| echo " [$approach] done in $((end - start))s" | ||
| done | ||
| done | ||
|
|
||
| echo "All cross-verifications complete." | ||
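Both scripts pass `$TRAJECTORY_GLOB` to `find` unquoted so that the shell expands the pattern. The same expansion also word-splits the result, so a run directory with a space in its name would be split into two arguments. A minimal sketch of one whitespace-safe alternative follows. The `collect_dirs` helper is hypothetical, and names containing newlines are still unsupported because the output is one path per line.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the immediate subdirectories of every
# directory matching the glob pattern, one per line, sorted.
collect_dirs() {
  local pattern="$1" parent
  local -a parents
  local IFS=              # empty IFS: the expansion below is globbed but not word-split
  parents=( $pattern )    # intentionally unquoted so the glob expands
  for parent in "${parents[@]}"; do
    # Skip plain files and the literal pattern left behind when nothing matches.
    [[ -d "$parent" ]] || continue
    find "$parent" -mindepth 1 -maxdepth 1 -type d
  done | sort
}

# Prints nothing if no directory matches the pattern.
collect_dirs "${TRAJECTORY_GLOB:-.trajectories/*}"
```

The output can feed the existing `while IFS= read -r d` loop unchanged, since it is still one directory per line.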