eval(auto-pilot): pilot batch — 3 skills, 3/3 success #47
Open
Zhaiyuqing2003 wants to merge 3 commits into
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…coverage 0.54→0.86
Builds eval suite for anthropics/skills/pdf (document-producer):
- 4 cases: extract-pdf-facts, split-customer-packet, build-briefing-pdf, no-pdf-skill-needed
- 3-model matrix: claude-sonnet-4.6, gpt-5-mini, gemini-2.5-pro
- Baseline: 36/36 trials PASS (100% pass rate), no iteration needed
- No upstream changes proposed: skill guides all models correctly

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
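The 36-trial baseline is consistent with the matrix in this commit message: 4 cases × 3 models is 12 cells, so 36 trials works out to 3 trials per cell. That per-cell count is inferred from the numbers, not stated in the commit, and `buildTrialMatrix` and `passRate` are hypothetical helper names. A minimal sketch of the bookkeeping:

```javascript
// Case and model names are taken from the commit message.
const CASES = [
  "extract-pdf-facts",
  "split-customer-packet",
  "build-briefing-pdf",
  "no-pdf-skill-needed",
];
const MODELS = ["claude-sonnet-4.6", "gpt-5-mini", "gemini-2.5-pro"];
// Assumption: 3 trials per (case, model) cell, inferred from 36 / (4 * 3).
const TRIALS_PER_CELL = 3;

// Enumerate every (case, model, trial) combination in the suite.
function buildTrialMatrix(cases, models, trialsPerCell) {
  return cases.flatMap((caseId) =>
    models.flatMap((model) =>
      Array.from({ length: trialsPerCell }, (_, i) => ({
        caseId,
        model,
        trial: i + 1,
      }))
    )
  );
}

// Fraction of trials whose `passed` flag is true (0 for an empty list).
function passRate(results) {
  if (results.length === 0) return 0;
  return results.filter((r) => r.passed).length / results.length;
}

const trials = buildTrialMatrix(CASES, MODELS, TRIALS_PER_CELL);
console.log(trials.length); // 36
```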
Pull request overview
This PR adds the first auto-pilot “pilot batch” artifacts under examples/workbench/, including three new/updated workbench eval suites (tool-use, code-reviewer, document-producer) with vendored skill snapshots, seeded cases, graders, and run records. It’s stacked on the auto-improve wrapper work, and primarily serves as reproducible evidence + packaging of the pilot runs and any proposed upstream diffs.
Changes:
- Added a new workbench eval suite for `supabase-postgres-best-practices` (fixtures, graders, vendored references, run analysis, and upstream proposal).
- Updated the existing `pdf` workbench suite (models + vendored skill snapshot + docs + negative-control grader path).
- Added a new `agent-browser` workbench eval suite including a deterministic mock CLI, graders, vendored skill stub + core reference, run analysis, and an upstream proposal.
Reviewed changes
Copilot reviewed 67 out of 68 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| examples/workbench/supabase-postgres-best-practices/workspace/schema.sql | Seed SQL schema fixture for the supabase-postgres-best-practices review case |
| examples/workbench/supabase-postgres-best-practices/workspace/rls_policies.sql | Seed SQL RLS fixture for the supabase-postgres-best-practices review case |
| examples/workbench/supabase-postgres-best-practices/suite.yml | Workbench suite definition (models, cases, graders) for supabase-postgres-best-practices |
| examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/SKILL.md | Vendored skill snapshot used by the eval |
| examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/security-rls-performance.md | Vendored rule reference (RLS performance) |
| examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/security-rls-basics.md | Vendored rule reference (RLS basics) |
| examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/security-privileges.md | Vendored rule reference (privileges) |
| examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/schema-primary-keys.md | Vendored rule reference (primary keys) |
| examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/schema-partitioning.md | Vendored rule reference (partitioning) |
| examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/schema-lowercase-identifiers.md | Vendored rule reference (identifier casing) |
| examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/schema-foreign-key-indexes.md | Vendored rule reference (FK indexes) |
| examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/schema-data-types.md | Vendored rule reference (data types) |
| examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/schema-constraints.md | Vendored rule reference (constraints / migrations) |
| examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/query-partial-indexes.md | Vendored rule reference (partial indexes) |
| examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/query-missing-indexes.md | Vendored rule reference (missing indexes) |
| examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/query-index-types.md | Vendored rule reference (index types) |
| examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/query-covering-indexes.md | Vendored rule reference (covering indexes) |
| examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/query-composite-indexes.md | Vendored rule reference (composite indexes) |
| examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/monitor-vacuum-analyze.md | Vendored rule reference (VACUUM/ANALYZE) |
| examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/monitor-pg-stat-statements.md | Vendored rule reference (pg_stat_statements) |
| examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/monitor-explain-analyze.md | Vendored rule reference (EXPLAIN ANALYZE) |
| examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/lock-skip-locked.md | Vendored rule reference (SKIP LOCKED) |
| examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/lock-short-transactions.md | Vendored rule reference (short transactions) |
| examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/lock-deadlock-prevention.md | Vendored rule reference (deadlocks) |
| examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/lock-advisory.md | Vendored rule reference (advisory locks) |
| examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/data-upsert.md | Vendored rule reference (UPSERT) |
| examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/data-pagination.md | Vendored rule reference (pagination) |
| examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/data-n-plus-one.md | Vendored rule reference (N+1) |
| examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/data-batch-inserts.md | Vendored rule reference (batch inserts) |
| examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/conn-prepared-statements.md | Vendored rule reference (prepared statements) |
| examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/conn-pooling.md | Vendored rule reference (pooling) |
| examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/conn-limits.md | Vendored rule reference (connection limits) |
| examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/conn-idle-timeout.md | Vendored rule reference (idle timeouts) |
| examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/advanced-jsonb-indexing.md | Vendored rule reference (JSONB indexing) |
| examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/advanced-full-text-search.md | Vendored rule reference (FTS) |
| examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/_template.md | Reference authoring template |
| examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/_sections.md | Rule section/category definitions |
| examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/_contributing.md | Reference writing guidelines |
| examples/workbench/supabase-postgres-best-practices/README.md | Suite documentation and how-to-run notes |
| examples/workbench/supabase-postgres-best-practices/proposed-upstream-changes/supabase-agent-skills/before-SKILL.md | Captured upstream baseline for proposed change |
| examples/workbench/supabase-postgres-best-practices/proposed-upstream-changes/supabase-agent-skills/after-SKILL.md | Proposed upstream SKILL.md change |
| examples/workbench/supabase-postgres-best-practices/proposed-upstream-changes/README.md | Rationale + instructions for upstream change |
| examples/workbench/supabase-postgres-best-practices/checks/grade-schema-findings.mjs | Grader for schema review findings |
| examples/workbench/supabase-postgres-best-practices/checks/grade-rls-findings.mjs | Grader for RLS review findings |
| examples/workbench/supabase-postgres-best-practices/checks/_grader-utils.mjs | Shared grading utilities for the suite |
| examples/workbench/supabase-postgres-best-practices/analysis.md | Auto-pilot run record for the supabase skill |
| examples/workbench/pdf/suite.yml | Updated pdf eval suite metadata (name/models) |
| examples/workbench/pdf/references/pdf/SKILL.md | Vendored upstream pdf skill snapshot |
| examples/workbench/pdf/references/pdf-skill/SKILL.md | Removed old demo pdf skill stub |
| examples/workbench/pdf/README.md | Updated pdf suite documentation |
| examples/workbench/pdf/proposed-upstream-changes/README.md | Notes that no upstream changes were needed |
| examples/workbench/pdf/proposed-upstream-changes/anthropics-skills/before-SKILL.md | Captured upstream baseline for pdf |
| examples/workbench/pdf/proposed-upstream-changes/anthropics-skills/after-SKILL.md | Captured upstream comparison for pdf |
| examples/workbench/pdf/checks/no-pdf-skill.mjs | Updated negative-control grader to new SKILL path |
| examples/workbench/pdf/analysis.md | Auto-pilot run record for the pdf skill |
| examples/workbench/agent-browser/suite.yml | New agent-browser eval suite definition |
| examples/workbench/agent-browser/references/agent-browser/SKILL.md | Vendored agent-browser SKILL stub (local core reference path) |
| examples/workbench/agent-browser/references/agent-browser/core.md | Vendored agent-browser core workflow reference |
| examples/workbench/agent-browser/README.md | Suite documentation and how-to-run notes |
| examples/workbench/agent-browser/proposed-upstream-changes/vercel-labs-agent-browser/before-SKILL.md | Captured upstream baseline for agent-browser stub |
| examples/workbench/agent-browser/proposed-upstream-changes/vercel-labs-agent-browser/after-SKILL.md | Proposed upstream stub change (quick task reference) |
| examples/workbench/agent-browser/proposed-upstream-changes/README.md | Rationale + instructions for upstream change |
| examples/workbench/agent-browser/checks/grade-search-screenshot-findings.mjs | Grader for search + screenshot case |
| examples/workbench/agent-browser/checks/grade-extract-stories-findings.mjs | Grader for HN story extraction case |
| examples/workbench/agent-browser/checks/grade-capture-homepage-findings.mjs | Grader for example.com screenshot + title case |
| examples/workbench/agent-browser/checks/_grader-utils.mjs | Shared grading utilities (currently unused in these graders) |
| examples/workbench/agent-browser/bin/agent-browser | Deterministic mock agent-browser CLI for the eval |
| examples/workbench/agent-browser/analysis.md | Auto-pilot run record for the agent-browser skill |
Comment on lines +13 to +19
```sql
CREATE TABLE orders (
  id bigint generated always as identity primary key,
  customer_id bigint references customers(id) on delete cascade,
  status text not null default 'pending',
  created_at timestamptz default now(),
  total numeric(10,2)
);
```
Comment on lines +11 to +16
```sql
-- Orders table
-- VIOLATION (schema-foreign-key-indexes): customer_id FK column has no index
CREATE TABLE orders (
  id bigint generated always as identity primary key,
  customer_id bigint references customers(id) on delete cascade,
  status text not null default 'pending',
```
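The seeded violation in this fixture (a foreign-key column with no index) can be sanity-checked mechanically. The sketch below is not one of the PR's graders; `unindexedForeignKeys` is a hypothetical helper, and it assumes inline `references` clauses, unquoted identifiers, and table bodies that end with `);` on their own line. It is a regex scan, not a SQL parser.

```javascript
// Hypothetical helper: report "table.column" for every inline foreign key
// that has no CREATE INDEX whose first indexed column is that column.
function unindexedForeignKeys(sql) {
  // Drop "--" line comments so commented-out SQL is not counted.
  const code = sql.replace(/--.*$/gm, "");
  const missing = [];
  const tableRe = /CREATE\s+TABLE\s+(\w+)\s*\(([\s\S]*?)\);/gi;
  for (const [, table, body] of code.matchAll(tableRe)) {
    // A column definition that contains REFERENCES before its trailing comma.
    const fkRe = /^\s*(\w+)\s+[^,]*?\breferences\b/gim;
    for (const [, column] of body.matchAll(fkRe)) {
      const indexRe = new RegExp(
        `CREATE\\s+(?:UNIQUE\\s+)?INDEX\\s+[^;]*?\\bON\\s+${table}\\s*\\(\\s*${column}\\b`,
        "i"
      );
      if (!indexRe.test(code)) missing.push(`${table}.${column}`);
    }
  }
  return missing;
}

const schema = `
CREATE TABLE orders (
  id bigint generated always as identity primary key,
  customer_id bigint references customers(id) on delete cascade,
  status text not null default 'pending'
);
`;
console.log(unindexedForeignKeys(schema)); // [ 'orders.customer_id' ]
```

Appending `CREATE INDEX orders_customer_id_idx ON orders (customer_id);` (an illustrative index name) to the schema makes the function return an empty list.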
Comment on lines +4 to +8
```sql
-- === RLS enabled, but FORCE not applied ===
-- VIOLATION (security-rls-basics): table owner can bypass RLS; FORCE not set
ALTER TABLE orders ENABLE ROW LEVEL SECURITY;
-- FIX: also run ALTER TABLE orders FORCE ROW LEVEL SECURITY;
```
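The rule behind this fixture (RLS enabled without `FORCE`, so the table owner bypasses policies) is easy to check on the SQL text itself. The sketch below is a hypothetical helper, not part of the PR; it strips `--` comments first, because the fixture's own `-- FIX:` comment contains the `FORCE` statement and would otherwise hide the violation. It assumes unquoted or simply quoted table names and does not handle `--` inside string literals.

```javascript
// Hypothetical helper: list tables that ENABLE row level security
// but are never given FORCE ROW LEVEL SECURITY in the same script.
function tablesMissingForceRls(sql) {
  // Remove "--" line comments so commented-out fixes are not counted.
  const code = sql.replace(/--.*$/gm, "");
  const enabled = new Set();
  const forced = new Set();
  const re =
    /ALTER\s+TABLE\s+(?:ONLY\s+)?([\w."]+)\s+(ENABLE|FORCE)\s+ROW\s+LEVEL\s+SECURITY/gi;
  for (const [, table, action] of code.matchAll(re)) {
    const target = action.toUpperCase() === "FORCE" ? forced : enabled;
    target.add(table.toLowerCase());
  }
  return [...enabled].filter((table) => !forced.has(table));
}

const fixture = `
-- VIOLATION (security-rls-basics): table owner can bypass RLS; FORCE not set
ALTER TABLE orders ENABLE ROW LEVEL SECURITY;
-- FIX: also run ALTER TABLE orders FORCE ROW LEVEL SECURITY;
`;
console.log(tablesMissingForceRls(fixture)); // [ 'orders' ]
```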
… `references/supabase-postgres-best-practices/references/` and place `SKILL.md`
at `references/supabase-postgres-best-practices/SKILL.md`. The relative paths
in `SKILL.md` resolve correctly under the workbench's `/work` layout.
Diff vs upstream is zero (no path changes needed).
Comment on lines +1 to +11
```js
// Shared grader logic for web-design-guidelines eval cases.
//
// Each finding is assumed to be one line in findings.txt that references
// "<File>.tsx:<line>" (line numbers come from the agent — they're often
// off by ±1-2 due to LLM line-counting). A violation is considered "found"
// when at least one finding line:
//   (a) references a line number within the violation's accepted range, AND
//   (b) contains at least one of the violation's distinguishing keywords.
//
// This per-finding-line check prevents spurious cross-matches (e.g. the
// keyword "label" from a different finding being credited to a paste rule).
```
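The matching rule described in that header comment can be sketched as follows. This is not the contents of `_grader-utils.mjs`, which are not shown here; the violation shape (`file`, `minLine`, `maxLine`, `keywords`) and the function names are assumptions made for illustration.

```javascript
// Escape regex metacharacters so a file name can be embedded in a pattern.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical violation shape: { file, minLine, maxLine, keywords }.
// A violation counts as found only when a SINGLE finding line satisfies both
// (a) the line-range check and (b) the keyword check.
function isViolationFound(findingsText, violation) {
  const { file, minLine, maxLine, keywords } = violation;
  const refRe = new RegExp(escapeRegExp(file) + ":(\\d+)");
  return findingsText.split("\n").some((line) => {
    const ref = line.match(refRe);
    if (!ref) return false;
    const lineNo = Number(ref[1]);
    const inRange = lineNo >= minLine && lineNo <= maxLine;
    const lower = line.toLowerCase();
    const hasKeyword = keywords.some((k) => lower.includes(k.toLowerCase()));
    return inRange && hasKeyword;
  });
}

const findings = [
  "Form.tsx:12 input is missing a label",
  "Form.tsx:40 password field blocks paste",
].join("\n");

// Accepted range 38-42 absorbs the ±1-2 line drift mentioned above.
const pasteRule = { file: "Form.tsx", minLine: 38, maxLine: 42, keywords: ["paste"] };
console.log(isViolationFound(findings, pasteRule)); // true
```

Checking both conditions on the same line is what blocks the cross-match: a rule keyed on "label" for lines 38-42 is not satisfied here, because the only line mentioning "label" points at line 12.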
Zhaiyuqing2003 pushed a commit that referenced this pull request on May 8, 2026
Three changes informed by the 3-skill pilot batch (PR #47):

1. **"Always: write analysis.md AND commit" merged into a single atomic step.** Pilots #1b and #2 wrote analysis.md but ran out of budget before reaching the separate commit step, leaving case files uncommitted. The merged section explicitly tells the agent to skip everything else if budget is low and finish this section first.
2. **Default --max-budget-usd bumped 3.50 → 10.00.** Pilot #1's first real-data attempt died at the cap mid-modification. Pilot #1c at --budget 15 settled at $3.15 with full success. The prompt's Phase-4 self-cap also moved from $3.00 to $7.00 to leave a $2-3 buffer for the analysis.md + commit cleanup below the wrapper hard cap.
3. **New tools/auto-improve-skill-lessons.md**: living doc the prompt reads as Phase-4 prior. Captures recipes A-E (two-pass workflow, verify-tool-installed, per-element checklists, BAD/GOOD examples, rationale + bug-story) and grader-reliability patterns G1-G6 (line tolerance, hyphen regex, per-finding-line matching, keyword variants, set-semantics, verbosity floor) with empirical evidence from the manual web-design-guidelines run + the 3 auto pilots. Phase 4 of the prompt now references the recipes by letter so the auto-pilot doesn't rediscover patterns from scratch each run.

Also fixes a slug-parsing regression introduced by the --budget flag (when --budget was absent, the filter wrongly skipped argv[0]).

Smoke tests pass: bare invocation prints usage, "nope" gives bad-slug, existing dir gets refused, --budget validates input.
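The wrapper's argument parsing is not shown in this commit message, so the following is only a sketch of how a slug plus an optional `--budget` flag can be parsed without the argv[0] problem. One way such a regression can arise (an assumption, not a description of the real code) is filtering out the flag by index: `indexOf` returns -1 when the flag is absent, so "index + 1" points at argv[0]. Walking the arguments once avoids that. `parseArgs` and the usage string are hypothetical; the default of 10.00 comes from the commit message, and `argv` here means the arguments after the script name.

```javascript
// Hypothetical CLI shape: <slug> [--budget <usd>]
function parseArgs(argv, defaultBudget = 10.0) {
  let budget = defaultBudget;
  const positional = [];
  for (let i = 0; i < argv.length; i++) {
    if (argv[i] === "--budget") {
      const raw = argv[i + 1];
      const value = Number(raw);
      // Rejects a missing value, non-numeric input, zero, and negatives.
      if (raw === undefined || !Number.isFinite(value) || value <= 0) {
        throw new Error("--budget expects a positive number");
      }
      budget = value;
      i++; // skip the value that belongs to --budget
    } else {
      positional.push(argv[i]);
    }
  }
  if (positional.length !== 1) {
    throw new Error("usage: <slug> [--budget <usd>]");
  }
  return { slug: positional[0], budget };
}

console.log(parseArgs(["some-skill"]));
// { slug: 'some-skill', budget: 10 }
console.log(parseArgs(["--budget", "15", "some-skill"]));
// { slug: 'some-skill', budget: 15 }
```

Slug format validation (the "nope" gives bad-slug case) is left out because the accepted slug format is not stated in the commit message.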
Stacked on PR #46 (`feat/auto-improve-skill`).

First batch run of the auto-improve-skill pilot. Orchestrator drove the wrapper on 3 skills of varied shape (one tool-use, one code-reviewer, one document-producer). All 3 succeeded with distinct outcomes: the pipeline distinguished "needs work" from "needs grader-fix" from "already optimal" without operator input.

Results

- `vercel-labs/agent-browser/agent-browser`
- `supabase/agent-skills/supabase-postgres-best-practices`
- `anthropics/skills/pdf`

Total OpenRouter spend: ~$6.60 across 3 pilots. Wall clock: ~50 min for 3 in parallel via `git worktree`.

Capabilities validated

v1 issues to address before scaling

- … `claude -p` invocations.
- `--max-budget-usd 3.50` too tight for runs with real iteration. v2 default: $7-10.
- … `_grader-utils.mjs` would help.

What this PR contains

3 commits (one per pilot) cherry-picked onto `feat/auto-improve-skill`:

```
f95e948 eval(auto-pilot): pdf — status=success, coverage 1.00→1.00
0c932aa eval(auto-pilot): supabase-postgres-best-practices — status=success, coverage 0.54→0.86
37f4cb5 eval(auto-pilot): agent-browser — status=success, coverage 0.56→1.00
```

Each pilot dir has:

- `analysis.md` (the run-record),
- `suite.yml` + cases + graders,
- `references/<skill-id>/` (vendored upstream),
- `proposed-upstream-changes/` (only when auto-pilot proposed real changes).

🤖 Generated with Claude Code