eval(auto-pilot): pilot batch — 3 skills, 3/3 success#47

Open
Zhaiyuqing2003 wants to merge 3 commits into feat/auto-improve-skill from eval/auto-pilot/batch-2026-05-08

@Zhaiyuqing2003

Stacked on PR #46 (feat/auto-improve-skill).

First batch run of the auto-improve-skill pilot. Orchestrator drove the wrapper on 3 skills of varied shape (one tool-use, one code-reviewer, one document-producer). All 3 succeeded with distinct outcomes — the pipeline distinguished "needs work" from "needs grader-fix" from "already optimal" without operator input.

Results

| Skill | Classification | Status | Baseline → Final | Uplift | Notes |
| --- | --- | --- | --- | --- | --- |
| vercel-labs/agent-browser/agent-browser | tool-use | success | 0.56 → 1.00 | +0.44 | mostly grader correction; tiny additive SKILL.md proposal |
| supabase/agent-skills/supabase-postgres-best-practices | code-reviewer | success | 0.54 → 0.86 | +0.32 | real two-pass workflow added to SKILL.md |
| anthropics/skills/pdf | document-producer | success | 1.00 → 1.00 | none | auto-pilot triggered "≥0.95 → exit clean" path |
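
The uplift figures above are simply final coverage minus baseline coverage. A minimal sketch of that arithmetic follows; the `uplift` helper and the `pilots` array are hypothetical names used only for illustration, and the rounding step is there because subtracting these decimals in floating point does not always give an exact two-digit result.

```javascript
// Hypothetical helper: coverage uplift rounded to two decimals.
function uplift(baseline, final) {
  // Round to hundredths so floating-point noise does not leak into the report.
  return Math.round((final - baseline) * 100) / 100;
}

// Baseline and final coverage as reported in the Results table.
const pilots = [
  { skill: "agent-browser", baseline: 0.56, final: 1.0 },
  { skill: "supabase-postgres-best-practices", baseline: 0.54, final: 0.86 },
  { skill: "pdf", baseline: 1.0, final: 1.0 },
];

for (const p of pilots) {
  console.log(`${p.skill}: +${uplift(p.baseline, p.final).toFixed(2)}`);
}
```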

Total OpenRouter spend: ~$6.60 across 3 pilots. Wall clock: ~50 min for 3 in parallel via git worktree.

Capabilities validated

  1. Correct skill-shape classification on all 3 (tool-use / code-reviewer / document-producer)
  2. Self-correction of own grader bugs before drawing skill conclusions (happened in 2 of 3 pilots)
  3. Pattern transfer — auto-pilot independently rediscovered the "two-pass workflow for absence-type rules" insight on supabase, the same insight we found manually for web-design-guidelines
  4. Clean exit on already-good skills (pdf had 36/36 trials pass at baseline; auto-pilot did NOT manufacture changes)
  5. Distinguishing skill problem from grader problem (agent-browser caught grader-over-specification)
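
The clean-exit behaviour in item 4 can be pictured as a threshold check on baseline coverage. The sketch below assumes the 0.95 threshold quoted in the Results section; the function names and the `"exit-clean"` / `"iterate"` labels are hypothetical and are not taken from the auto-pilot prompt itself.

```javascript
// Threshold from the "≥0.95 → exit clean" path mentioned in the results.
const CLEAN_EXIT_THRESHOLD = 0.95;

// Hypothetical helper: fraction of trials that passed at baseline.
function passRate(passed, total) {
  return total === 0 ? 0 : passed / total;
}

// Hypothetical decision step: skip iteration when the skill is already good.
function nextStep(baselineCoverage) {
  return baselineCoverage >= CLEAN_EXIT_THRESHOLD ? "exit-clean" : "iterate";
}

console.log(nextStep(passRate(36, 36))); // pdf: 36/36 trials passed at baseline
console.log(nextStep(0.56));             // agent-browser baseline coverage
```

Under these assumptions the pdf pilot takes the clean-exit branch and the agent-browser pilot goes on to iterate.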

v1 issues to address before scaling

  • "Always: commit" step unreliable: pilots #1b and #2 didn't reach it (had to commit manually). Fix in v2: hoist commit earlier or split into two claude -p invocations.
  • --max-budget-usd 3.50 too tight for runs with real iteration. v2 default: $7-10.
  • Phase-4 grader-fix iteration eats one of the two iteration slots; pre-baking known grader-tuning patterns into _grader-utils.mjs would help.

What this PR contains

3 commits (one per pilot) cherry-picked onto feat/auto-improve-skill:

```
f95e948 eval(auto-pilot): pdf — status=success, coverage 1.00→1.00
0c932aa eval(auto-pilot): supabase-postgres-best-practices — status=success, coverage 0.54→0.86
37f4cb5 eval(auto-pilot): agent-browser — status=success, coverage 0.56→1.00
```

Each pilot dir has: analysis.md (the run-record), suite.yml + cases + graders, references/<skill-id>/ (vendored upstream), proposed-upstream-changes/ (only when auto-pilot proposed real changes).

🤖 Generated with Claude Code

Yuqing Zhai and others added 3 commits May 8, 2026 13:43
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Builds eval suite for anthropics/skills/pdf (document-producer):
- 4 cases: extract-pdf-facts, split-customer-packet, build-briefing-pdf,
  no-pdf-skill-needed
- 3-model matrix: claude-sonnet-4.6, gpt-5-mini, gemini-2.5-pro
- Baseline: 36/36 trials PASS (100% pass rate), no iteration needed
- No upstream changes proposed — skill guides all models correctly

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 8, 2026 18:45

Copilot AI left a comment

Pull request overview

This PR adds the first auto-pilot “pilot batch” artifacts under examples/workbench/, including three new/updated workbench eval suites (tool-use, code-reviewer, document-producer) with vendored skill snapshots, seeded cases, graders, and run records. It’s stacked on the auto-improve wrapper work, and primarily serves as reproducible evidence + packaging of the pilot runs and any proposed upstream diffs.

Changes:

  • Added a new workbench eval suite for supabase-postgres-best-practices (fixtures, graders, vendored references, run analysis, and upstream proposal).
  • Updated the existing pdf workbench suite (models + vendored skill snapshot + docs + negative-control grader path).
  • Added a new agent-browser workbench eval suite including a deterministic mock CLI, graders, vendored skill stub + core reference, run analysis, and an upstream proposal.

Reviewed changes

Copilot reviewed 67 out of 68 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| `examples/workbench/supabase-postgres-best-practices/workspace/schema.sql` | Seed SQL schema fixture for the supabase-postgres-best-practices review case |
| `examples/workbench/supabase-postgres-best-practices/workspace/rls_policies.sql` | Seed SQL RLS fixture for the supabase-postgres-best-practices review case |
| `examples/workbench/supabase-postgres-best-practices/suite.yml` | Workbench suite definition (models, cases, graders) for supabase-postgres-best-practices |
| `examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/SKILL.md` | Vendored skill snapshot used by the eval |
| `examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/security-rls-performance.md` | Vendored rule reference (RLS performance) |
| `examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/security-rls-basics.md` | Vendored rule reference (RLS basics) |
| `examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/security-privileges.md` | Vendored rule reference (privileges) |
| `examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/schema-primary-keys.md` | Vendored rule reference (primary keys) |
| `examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/schema-partitioning.md` | Vendored rule reference (partitioning) |
| `examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/schema-lowercase-identifiers.md` | Vendored rule reference (identifier casing) |
| `examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/schema-foreign-key-indexes.md` | Vendored rule reference (FK indexes) |
| `examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/schema-data-types.md` | Vendored rule reference (data types) |
| `examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/schema-constraints.md` | Vendored rule reference (constraints / migrations) |
| `examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/query-partial-indexes.md` | Vendored rule reference (partial indexes) |
| `examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/query-missing-indexes.md` | Vendored rule reference (missing indexes) |
| `examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/query-index-types.md` | Vendored rule reference (index types) |
| `examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/query-covering-indexes.md` | Vendored rule reference (covering indexes) |
| `examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/query-composite-indexes.md` | Vendored rule reference (composite indexes) |
| `examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/monitor-vacuum-analyze.md` | Vendored rule reference (VACUUM/ANALYZE) |
| `examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/monitor-pg-stat-statements.md` | Vendored rule reference (pg_stat_statements) |
| `examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/monitor-explain-analyze.md` | Vendored rule reference (EXPLAIN ANALYZE) |
| `examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/lock-skip-locked.md` | Vendored rule reference (SKIP LOCKED) |
| `examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/lock-short-transactions.md` | Vendored rule reference (short transactions) |
| `examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/lock-deadlock-prevention.md` | Vendored rule reference (deadlocks) |
| `examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/lock-advisory.md` | Vendored rule reference (advisory locks) |
| `examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/data-upsert.md` | Vendored rule reference (UPSERT) |
| `examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/data-pagination.md` | Vendored rule reference (pagination) |
| `examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/data-n-plus-one.md` | Vendored rule reference (N+1) |
| `examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/data-batch-inserts.md` | Vendored rule reference (batch inserts) |
| `examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/conn-prepared-statements.md` | Vendored rule reference (prepared statements) |
| `examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/conn-pooling.md` | Vendored rule reference (pooling) |
| `examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/conn-limits.md` | Vendored rule reference (connection limits) |
| `examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/conn-idle-timeout.md` | Vendored rule reference (idle timeouts) |
| `examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/advanced-jsonb-indexing.md` | Vendored rule reference (JSONB indexing) |
| `examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/advanced-full-text-search.md` | Vendored rule reference (FTS) |
| `examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/_template.md` | Reference authoring template |
| `examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/_sections.md` | Rule section/category definitions |
| `examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/_contributing.md` | Reference writing guidelines |
| `examples/workbench/supabase-postgres-best-practices/README.md` | Suite documentation and how-to-run notes |
| `examples/workbench/supabase-postgres-best-practices/proposed-upstream-changes/supabase-agent-skills/before-SKILL.md` | Captured upstream baseline for proposed change |
| `examples/workbench/supabase-postgres-best-practices/proposed-upstream-changes/supabase-agent-skills/after-SKILL.md` | Proposed upstream SKILL.md change |
| `examples/workbench/supabase-postgres-best-practices/proposed-upstream-changes/README.md` | Rationale + instructions for upstream change |
| `examples/workbench/supabase-postgres-best-practices/checks/grade-schema-findings.mjs` | Grader for schema review findings |
| `examples/workbench/supabase-postgres-best-practices/checks/grade-rls-findings.mjs` | Grader for RLS review findings |
| `examples/workbench/supabase-postgres-best-practices/checks/_grader-utils.mjs` | Shared grading utilities for the suite |
| `examples/workbench/supabase-postgres-best-practices/analysis.md` | Auto-pilot run record for the supabase skill |
| `examples/workbench/pdf/suite.yml` | Updated pdf eval suite metadata (name/models) |
| `examples/workbench/pdf/references/pdf/SKILL.md` | Vendored upstream pdf skill snapshot |
| `examples/workbench/pdf/references/pdf-skill/SKILL.md` | Removed old demo pdf skill stub |
| `examples/workbench/pdf/README.md` | Updated pdf suite documentation |
| `examples/workbench/pdf/proposed-upstream-changes/README.md` | Notes that no upstream changes were needed |
| `examples/workbench/pdf/proposed-upstream-changes/anthropics-skills/before-SKILL.md` | Captured upstream baseline for pdf |
| `examples/workbench/pdf/proposed-upstream-changes/anthropics-skills/after-SKILL.md` | Captured upstream comparison for pdf |
| `examples/workbench/pdf/checks/no-pdf-skill.mjs` | Updated negative-control grader to new SKILL path |
| `examples/workbench/pdf/analysis.md` | Auto-pilot run record for the pdf skill |
| `examples/workbench/agent-browser/suite.yml` | New agent-browser eval suite definition |
| `examples/workbench/agent-browser/references/agent-browser/SKILL.md` | Vendored agent-browser SKILL stub (local core reference path) |
| `examples/workbench/agent-browser/references/agent-browser/core.md` | Vendored agent-browser core workflow reference |
| `examples/workbench/agent-browser/README.md` | Suite documentation and how-to-run notes |
| `examples/workbench/agent-browser/proposed-upstream-changes/vercel-labs-agent-browser/before-SKILL.md` | Captured upstream baseline for agent-browser stub |
| `examples/workbench/agent-browser/proposed-upstream-changes/vercel-labs-agent-browser/after-SKILL.md` | Proposed upstream stub change (quick task reference) |
| `examples/workbench/agent-browser/proposed-upstream-changes/README.md` | Rationale + instructions for upstream change |
| `examples/workbench/agent-browser/checks/grade-search-screenshot-findings.mjs` | Grader for search + screenshot case |
| `examples/workbench/agent-browser/checks/grade-extract-stories-findings.mjs` | Grader for HN story extraction case |
| `examples/workbench/agent-browser/checks/grade-capture-homepage-findings.mjs` | Grader for example.com screenshot + title case |
| `examples/workbench/agent-browser/checks/_grader-utils.mjs` | Shared grading utilities (currently unused in these graders) |
| `examples/workbench/agent-browser/bin/agent-browser` | Deterministic mock agent-browser CLI for the eval |
| `examples/workbench/agent-browser/analysis.md` | Auto-pilot run record for the agent-browser skill |

Comment on lines +13 to +19
CREATE TABLE orders (
id bigint generated always as identity primary key,
customer_id bigint references customers(id) on delete cascade,
status text not null default 'pending',
created_at timestamptz default now(),
total numeric(10,2)
);
Comment on lines +11 to +16
-- Orders table
-- VIOLATION (schema-foreign-key-indexes): customer_id FK column has no index
CREATE TABLE orders (
id bigint generated always as identity primary key,
customer_id bigint references customers(id) on delete cascade,
status text not null default 'pending',
Comment on lines +4 to +8
-- === RLS enabled, but FORCE not applied ===
-- VIOLATION (security-rls-basics): table owner can bypass RLS; FORCE not set
ALTER TABLE orders ENABLE ROW LEVEL SECURITY;
-- FIX: also run ALTER TABLE orders FORCE ROW LEVEL SECURITY;

`references/supabase-postgres-best-practices/references/` and place `SKILL.md`
at `references/supabase-postgres-best-practices/SKILL.md`. The relative paths
in `SKILL.md` resolve correctly under the workbench's `/work` layout.
Diff vs upstream is zero (no path changes needed).
Comment on lines +1 to +11
// Shared grader logic for web-design-guidelines eval cases.
//
// Each finding is assumed to be one line in findings.txt that references
// "<File>.tsx:<line>" (line numbers come from the agent — they're often
// off by ±1-2 due to LLM line-counting). A violation is considered "found"
// when at least one finding line:
// (a) references a line number within the violation's accepted range, AND
// (b) contains at least one of the violation's distinguishing keywords.
//
// This per-finding-line check prevents spurious cross-matches (e.g. the
// keyword "label" from a different finding being credited to a paste rule).
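
The matching rule described in the header comment above can be sketched roughly as follows. This is an illustrative reconstruction, not the code in `_grader-utils.mjs`: the function name `isViolationFound` and the violation shape `{ file, minLine, maxLine, keywords }` are assumptions, and the sketch only assumes that each finding line cites a location in the `<File>.tsx:<line>` form the comment mentions.

```javascript
// Hypothetical violation shape: { file, minLine, maxLine, keywords }.
// A violation counts as found when a SINGLE finding line both cites a line
// number inside the accepted range and contains one of the keywords.
function isViolationFound(findingsText, violation) {
  return findingsText.split(/\r?\n/).some((line) => {
    // Collect every "<File>.tsx:<line>" reference on this finding line.
    const refs = [...line.matchAll(/([\w.-]+\.tsx):(\d+)/g)];
    const inRange = refs.some(
      ([, file, n]) =>
        file === violation.file &&
        Number(n) >= violation.minLine &&
        Number(n) <= violation.maxLine
    );
    if (!inRange) return false;
    // The keyword must appear on the same line as the in-range reference.
    const lower = line.toLowerCase();
    return violation.keywords.some((k) => lower.includes(k.toLowerCase()));
  });
}

const findings = [
  "Form.tsx:12 input is missing a label",
  "Form.tsx:40 paste is blocked on the password field",
].join("\n");

// Accepted range widened by ±2 around line 40 to absorb LLM line-count drift.
const pasteRule = { file: "Form.tsx", minLine: 38, maxLine: 42, keywords: ["paste"] };
console.log(isViolationFound(findings, pasteRule));
```

Checking line range and keyword on the same finding line is what blocks the cross-match case: a rule keyed on "label" with range 38-42 would not be credited here, because the only line mentioning "label" cites line 12.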
Zhaiyuqing2003 pushed a commit that referenced this pull request May 8, 2026
Three changes informed by the 3-skill pilot batch (PR #47):

1. **"Always: write analysis.md AND commit" merged into a single atomic
   step.** Pilots #1b and #2 wrote analysis.md but ran out of budget
   before reaching the separate commit step, leaving case files
   uncommitted. The merged section explicitly tells the agent to skip
   everything else if budget is low and finish this section first.

2. **Default --max-budget-usd bumped 3.50 → 10.00.** Pilot #1's first
   real-data attempt died at the cap mid-modification. Pilot #1c at
   --budget 15 settled at $3.15 with full success. The prompt's Phase-4
   self-cap also moved from $3.00 to $7.00 to leave a $2-3 buffer for
   the analysis.md + commit cleanup below the wrapper hard cap.

3. **New tools/auto-improve-skill-lessons.md** — living doc the prompt
   reads as Phase-4 prior. Captures recipes A-E (two-pass workflow,
   verify-tool-installed, per-element checklists, BAD/GOOD examples,
   rationale + bug-story) and grader-reliability patterns G1-G6 (line
   tolerance, hyphen regex, per-finding-line matching, keyword variants,
   set-semantics, verbosity floor) with empirical evidence from the
   manual web-design-guidelines run + the 3 auto pilots. Phase 4 of the
   prompt now references the recipes by letter so the auto-pilot doesn't
   rediscover patterns from scratch each run.
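
The relationship between the two budget limits in item 2 can be sketched as a simple guard. The dollar figures come from the commit message above; the function names and the idea of passing an estimated per-iteration cost are hypothetical, since the real self-cap is an instruction in the prompt rather than code.

```javascript
const WRAPPER_HARD_CAP_USD = 10.0; // default --max-budget-usd after this change
const PHASE4_SELF_CAP_USD = 7.0;   // Phase-4 self-cap stated in the prompt

// Hypothetical guard: start another Phase-4 iteration only if it is expected
// to stay under the self-cap, so the cleanup step still has budget left.
function mayStartIteration(spentUsd, estimatedIterationUsd) {
  return spentUsd + estimatedIterationUsd <= PHASE4_SELF_CAP_USD;
}

// Budget reserved below the hard cap for writing analysis.md and committing.
function cleanupBufferUsd() {
  return WRAPPER_HARD_CAP_USD - PHASE4_SELF_CAP_USD;
}

console.log(mayStartIteration(3.15, 2.0)); // well under the self-cap
console.log(mayStartIteration(6.5, 1.0));  // would cross the self-cap
console.log(cleanupBufferUsd());
```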

Also fixes a slug-parsing regression introduced by the --budget flag
(when --budget was absent, the filter wrongly skipped argv[0]).

Smoke tests pass: bare invocation prints usage, "nope" gives bad-slug,
existing dir gets refused, --budget validates input.
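
One way to avoid the kind of slug-parsing regression described above is to separate flags from positionals in a single pass, so the slug position does not depend on whether `--budget` is present. The sketch below is an assumption-heavy illustration: the wrapper's actual language and code are not shown in this PR, the `parseArgs` name and error labels are hypothetical, and the three-segment slug pattern is inferred only from the skill names in the Results section.

```javascript
// Hypothetical parser; argv is the argument list after the script name,
// e.g. process.argv.slice(2) in Node.
function parseArgs(argv) {
  let budget = 10.0; // default taken from the new --max-budget-usd default
  const positional = [];
  for (let i = 0; i < argv.length; i++) {
    if (argv[i] === "--budget") {
      const value = Number(argv[i + 1]);
      if (!Number.isFinite(value) || value <= 0) return { error: "bad-budget" };
      budget = value;
      i++; // consume the flag's value so it is not treated as the slug
    } else {
      positional.push(argv[i]);
    }
  }
  if (positional.length === 0) return { error: "usage" };
  const slug = positional[0];
  // Assumed slug shape: three slash-separated segments.
  if (!/^[\w.-]+\/[\w.-]+\/[\w.-]+$/.test(slug)) return { error: "bad-slug" };
  return { slug, budget };
}

console.log(parseArgs(["anthropics/skills/pdf"]));
console.log(parseArgs(["--budget", "15", "anthropics/skills/pdf"]));
```

Because positionals are collected independently of flags, the first positional is the slug whether or not `--budget` appears, and regardless of which side of the slug it appears on.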