eval(auto-pilot): pilot batch — 3 skills, 3/3 success by Zhaiyuqing2003 · Pull Request #47 · fastxyz/skill-optimizer

Zhaiyuqing2003 · 2026-05-08T18:45:56Z

Stacked on PR #46 (feat/auto-improve-skill).

First batch run of the auto-improve-skill pilot. Orchestrator drove the wrapper on 3 skills of varied shape (one tool-use, one code-reviewer, one document-producer). All 3 succeeded with distinct outcomes — the pipeline distinguished "needs work" from "needs grader-fix" from "already optimal" without operator input.

Results

| Skill | Classification | Status | Baseline → Final | Uplift |
| --- | --- | --- | --- | --- |
| `vercel-labs/agent-browser/agent-browser` | tool-use | success | 0.56 → 1.00 | +0.44 (mostly grader correction; tiny additive SKILL.md proposal) |
| `supabase/agent-skills/supabase-postgres-best-practices` | code-reviewer | success | 0.54 → 0.86 | +0.32 (real two-pass workflow added to SKILL.md) |
| `anthropics/skills/pdf` | document-producer | success | 1.00 → 1.00 | none (auto-pilot triggered the "≥0.95 → exit clean" path) |
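
As a quick check on the Uplift column, the arithmetic can be reproduced from the baseline and final coverage figures in the table. The sketch below is illustrative only: the `results` array and the `uplift` helper are hypothetical names and are not part of the pilot tooling.

```javascript
// Baseline and final coverage copied from the Results table.
const results = [
  { skill: "agent-browser", baseline: 0.56, finalScore: 1.0 },
  { skill: "supabase-postgres-best-practices", baseline: 0.54, finalScore: 0.86 },
  { skill: "pdf", baseline: 1.0, finalScore: 1.0 },
];

// Hypothetical helper: uplift is final minus baseline, rounded to two
// decimals so floating-point noise does not leak into the report.
function uplift({ baseline, finalScore }) {
  return Number((finalScore - baseline).toFixed(2));
}

for (const row of results) {
  console.log(`${row.skill}: +${uplift(row).toFixed(2)}`);
}
```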

Total OpenRouter spend: ~$6.60 across 3 pilots. Wall clock: ~50 min for 3 in parallel via git worktree.

Capabilities validated

- Correct skill-shape classification on all 3 (tool-use / code-reviewer / document-producer)
- Self-correction of own grader bugs before drawing skill conclusions (happened in 2 of 3 pilots)
- Pattern transfer: auto-pilot independently rediscovered the "two-pass workflow for absence-type rules" insight on supabase, the same insight we found manually for web-design-guidelines
- Clean exit on already-good skills (pdf had 36/36 trials pass at baseline; auto-pilot did NOT manufacture changes)
- Distinguishing skill problem from grader problem (agent-browser caught grader-over-specification)
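
The three outcomes above ("needs work", "needs grader-fix", "already optimal") can be pictured as a small decision function. This is only a sketch under stated assumptions: the 0.95 threshold comes from the "≥0.95 → exit clean" path in the Results section, while the `graderSuspect` flag and the function name `nextStep` are hypothetical. How the auto-pilot actually detects a grader problem is not described in this PR.

```javascript
// Threshold taken from the "≥0.95 → exit clean" path in the Results section.
const EXIT_CLEAN_THRESHOLD = 0.95;

// Hypothetical decision helper. `graderSuspect` stands in for whatever
// signal tells the auto-pilot that failures come from the grader rather
// than from the skill; that detection logic is assumed, not shown.
function nextStep({ baselineCoverage, graderSuspect }) {
  if (baselineCoverage >= EXIT_CLEAN_THRESHOLD) {
    return "exit-clean"; // already optimal: do not manufacture changes
  }
  if (graderSuspect) {
    return "fix-grader"; // correct the grader before judging the skill
  }
  return "improve-skill"; // real skill gap: propose SKILL.md changes
}

console.log(nextStep({ baselineCoverage: 1.0, graderSuspect: false }));
console.log(nextStep({ baselineCoverage: 0.56, graderSuspect: true }));
console.log(nextStep({ baselineCoverage: 0.54, graderSuspect: false }));
```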

v1 issues to address before scaling

- "Always: commit" step unreliable: pilots #1b and #2 didn't reach it (had to commit manually). Fix in v2: hoist commit earlier or split into two `claude -p` invocations.
- `--max-budget-usd 3.50` is too tight for runs with real iteration. v2 default: $7-10.
- Phase-4 grader-fix iteration eats one of the two iteration slots; pre-baking known grader-tuning patterns into `_grader-utils.mjs` would help.
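
One way to picture the budget problem is a guard that holds back a reserve for the final write-up and commit before starting another iteration. The sketch below is an assumption-laden illustration, not the wrapper's actual logic: `canStartIteration`, its parameter names, and the dollar values in the usage lines are all hypothetical.

```javascript
// Hypothetical guard: only start another iteration if its estimated cost
// still leaves `reserveUsd` untouched below the hard cap, so the final
// "write analysis.md and commit" step can always run.
function canStartIteration({ spentUsd, estimatedCostUsd, hardCapUsd, reserveUsd }) {
  return spentUsd + estimatedCostUsd <= hardCapUsd - reserveUsd;
}

// Illustrative values only.
console.log(
  canStartIteration({ spentUsd: 2.5, estimatedCostUsd: 1.5, hardCapUsd: 3.5, reserveUsd: 0.5 })
);
console.log(
  canStartIteration({ spentUsd: 2.5, estimatedCostUsd: 1.5, hardCapUsd: 10, reserveUsd: 3 })
);
```

With a tight cap the guard refuses the extra iteration, which is the behaviour that would have left room for the commit step.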

What this PR contains

3 commits (one per pilot) cherry-picked onto feat/auto-improve-skill:

```
f95e948 eval(auto-pilot): pdf — status=success, coverage 1.00→1.00
0c932aa eval(auto-pilot): supabase-postgres-best-practices — status=success, coverage 0.54→0.86
37f4cb5 eval(auto-pilot): agent-browser — status=success, coverage 0.56→1.00
```

Each pilot dir has: analysis.md (the run-record), suite.yml + cases + graders, references/<skill-id>/ (vendored upstream), proposed-upstream-changes/ (only when auto-pilot proposed real changes).

🤖 Generated with Claude Code

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…coverage 0.54→0.86

Builds eval suite for anthropics/skills/pdf (document-producer):

- 4 cases: extract-pdf-facts, split-customer-packet, build-briefing-pdf, no-pdf-skill-needed
- 3-model matrix: claude-sonnet-4.6, gpt-5-mini, gemini-2.5-pro
- Baseline: 36/36 trials PASS (100% pass rate), no iteration needed
- No upstream changes proposed; skill guides all models correctly

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Copilot

Pull request overview

This PR adds the first auto-pilot “pilot batch” artifacts under examples/workbench/, including three new/updated workbench eval suites (tool-use, code-reviewer, document-producer) with vendored skill snapshots, seeded cases, graders, and run records. It’s stacked on the auto-improve wrapper work, and primarily serves as reproducible evidence + packaging of the pilot runs and any proposed upstream diffs.

Changes:

- Added a new workbench eval suite for supabase-postgres-best-practices (fixtures, graders, vendored references, run analysis, and upstream proposal).
- Updated the existing pdf workbench suite (models + vendored skill snapshot + docs + negative-control grader path).
- Added a new agent-browser workbench eval suite including a deterministic mock CLI, graders, vendored skill stub + core reference, run analysis, and an upstream proposal.

Reviewed changes

Copilot reviewed 67 out of 68 changed files in this pull request and generated 6 comments.


File	Description
examples/workbench/supabase-postgres-best-practices/workspace/schema.sql	Seed SQL schema fixture for the supabase-postgres-best-practices review case
examples/workbench/supabase-postgres-best-practices/workspace/rls_policies.sql	Seed SQL RLS fixture for the supabase-postgres-best-practices review case
examples/workbench/supabase-postgres-best-practices/suite.yml	Workbench suite definition (models, cases, graders) for supabase-postgres-best-practices
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/SKILL.md	Vendored skill snapshot used by the eval
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/security-rls-performance.md	Vendored rule reference (RLS performance)
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/security-rls-basics.md	Vendored rule reference (RLS basics)
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/security-privileges.md	Vendored rule reference (privileges)
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/schema-primary-keys.md	Vendored rule reference (primary keys)
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/schema-partitioning.md	Vendored rule reference (partitioning)
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/schema-lowercase-identifiers.md	Vendored rule reference (identifier casing)
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/schema-foreign-key-indexes.md	Vendored rule reference (FK indexes)
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/schema-data-types.md	Vendored rule reference (data types)
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/schema-constraints.md	Vendored rule reference (constraints / migrations)
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/query-partial-indexes.md	Vendored rule reference (partial indexes)
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/query-missing-indexes.md	Vendored rule reference (missing indexes)
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/query-index-types.md	Vendored rule reference (index types)
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/query-covering-indexes.md	Vendored rule reference (covering indexes)
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/query-composite-indexes.md	Vendored rule reference (composite indexes)
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/monitor-vacuum-analyze.md	Vendored rule reference (VACUUM/ANALYZE)
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/monitor-pg-stat-statements.md	Vendored rule reference (pg_stat_statements)
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/monitor-explain-analyze.md	Vendored rule reference (EXPLAIN ANALYZE)
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/lock-skip-locked.md	Vendored rule reference (SKIP LOCKED)
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/lock-short-transactions.md	Vendored rule reference (short transactions)
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/lock-deadlock-prevention.md	Vendored rule reference (deadlocks)
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/lock-advisory.md	Vendored rule reference (advisory locks)
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/data-upsert.md	Vendored rule reference (UPSERT)
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/data-pagination.md	Vendored rule reference (pagination)
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/data-n-plus-one.md	Vendored rule reference (N+1)
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/data-batch-inserts.md	Vendored rule reference (batch inserts)
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/conn-prepared-statements.md	Vendored rule reference (prepared statements)
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/conn-pooling.md	Vendored rule reference (pooling)
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/conn-limits.md	Vendored rule reference (connection limits)
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/conn-idle-timeout.md	Vendored rule reference (idle timeouts)
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/advanced-jsonb-indexing.md	Vendored rule reference (JSONB indexing)
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/advanced-full-text-search.md	Vendored rule reference (FTS)
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/_template.md	Reference authoring template
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/_sections.md	Rule section/category definitions
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/_contributing.md	Reference writing guidelines
examples/workbench/supabase-postgres-best-practices/README.md	Suite documentation and how-to-run notes
examples/workbench/supabase-postgres-best-practices/proposed-upstream-changes/supabase-agent-skills/before-SKILL.md	Captured upstream baseline for proposed change
examples/workbench/supabase-postgres-best-practices/proposed-upstream-changes/supabase-agent-skills/after-SKILL.md	Proposed upstream SKILL.md change
examples/workbench/supabase-postgres-best-practices/proposed-upstream-changes/README.md	Rationale + instructions for upstream change
examples/workbench/supabase-postgres-best-practices/checks/grade-schema-findings.mjs	Grader for schema review findings
examples/workbench/supabase-postgres-best-practices/checks/grade-rls-findings.mjs	Grader for RLS review findings
examples/workbench/supabase-postgres-best-practices/checks/_grader-utils.mjs	Shared grading utilities for the suite
examples/workbench/supabase-postgres-best-practices/analysis.md	Auto-pilot run record for the supabase skill
examples/workbench/pdf/suite.yml	Updated pdf eval suite metadata (name/models)
examples/workbench/pdf/references/pdf/SKILL.md	Vendored upstream pdf skill snapshot
examples/workbench/pdf/references/pdf-skill/SKILL.md	Removed old demo pdf skill stub
examples/workbench/pdf/README.md	Updated pdf suite documentation
examples/workbench/pdf/proposed-upstream-changes/README.md	Notes that no upstream changes were needed
examples/workbench/pdf/proposed-upstream-changes/anthropics-skills/before-SKILL.md	Captured upstream baseline for pdf
examples/workbench/pdf/proposed-upstream-changes/anthropics-skills/after-SKILL.md	Captured upstream comparison for pdf
examples/workbench/pdf/checks/no-pdf-skill.mjs	Updated negative-control grader to new SKILL path
examples/workbench/pdf/analysis.md	Auto-pilot run record for the pdf skill
examples/workbench/agent-browser/suite.yml	New agent-browser eval suite definition
examples/workbench/agent-browser/references/agent-browser/SKILL.md	Vendored agent-browser SKILL stub (local core reference path)
examples/workbench/agent-browser/references/agent-browser/core.md	Vendored agent-browser core workflow reference
examples/workbench/agent-browser/README.md	Suite documentation and how-to-run notes
examples/workbench/agent-browser/proposed-upstream-changes/vercel-labs-agent-browser/before-SKILL.md	Captured upstream baseline for agent-browser stub
examples/workbench/agent-browser/proposed-upstream-changes/vercel-labs-agent-browser/after-SKILL.md	Proposed upstream stub change (quick task reference)
examples/workbench/agent-browser/proposed-upstream-changes/README.md	Rationale + instructions for upstream change
examples/workbench/agent-browser/checks/grade-search-screenshot-findings.mjs	Grader for search + screenshot case
examples/workbench/agent-browser/checks/grade-extract-stories-findings.mjs	Grader for HN story extraction case
examples/workbench/agent-browser/checks/grade-capture-homepage-findings.mjs	Grader for example.com screenshot + title case
examples/workbench/agent-browser/checks/_grader-utils.mjs	Shared grading utilities (currently unused in these graders)
examples/workbench/agent-browser/bin/agent-browser	Deterministic mock agent-browser CLI for the eval
examples/workbench/agent-browser/analysis.md	Auto-pilot run record for the agent-browser skill


+CREATE TABLE orders (
+  id bigint generated always as identity primary key,
+  customer_id bigint references customers(id) on delete cascade,
+  status text not null default 'pending',
+  created_at timestamptz default now(),
+  total numeric(10,2)
+);


+-- Orders table
+-- VIOLATION (schema-foreign-key-indexes): customer_id FK column has no index
+CREATE TABLE orders (
+  id bigint generated always as identity primary key,
+  customer_id bigint references customers(id) on delete cascade,
+  status text not null default 'pending',


+-- === RLS enabled, but FORCE not applied ===
+-- VIOLATION (security-rls-basics): table owner can bypass RLS; FORCE not set
+ALTER TABLE orders ENABLE ROW LEVEL SECURITY;
+-- FIX: also run ALTER TABLE orders FORCE ROW LEVEL SECURITY;
+


+`references/supabase-postgres-best-practices/references/` and place `SKILL.md`
+at `references/supabase-postgres-best-practices/SKILL.md`. The relative paths
+in `SKILL.md` resolve correctly under the workbench's `/work` layout.
+Diff vs upstream is zero (no path changes needed).


+// Shared grader logic for web-design-guidelines eval cases.
+//
+// Each finding is assumed to be one line in findings.txt that references
+// "<File>.tsx:<line>" (line numbers come from the agent — they're often
+// off by ±1-2 due to LLM line-counting). A violation is considered "found"
+// when at least one finding line:
+//   (a) references a line number within the violation's accepted range, AND
+//   (b) contains at least one of the violation's distinguishing keywords.
+//
+// This per-finding-line check prevents spurious cross-matches (e.g. the
+// keyword "label" from a different finding being credited to a paste rule).
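
The per-finding-line rule described in the comment above can be sketched roughly as follows. This is a minimal illustration, not the contents of `_grader-utils.mjs`: the `isViolationFound` name and the violation shape (`file`, `minLine`, `maxLine`, `keywords`) are assumptions made for the example.

```javascript
// Hypothetical violation shape: { file, minLine, maxLine, keywords }.
// A violation counts as found only if a SINGLE finding line satisfies
// both conditions (a) and (b) from the comment above.
function isViolationFound(findingsText, violation) {
  const { file, minLine, maxLine, keywords } = violation;
  return findingsText.split("\n").some((line) => {
    // (a) the line cites "<file>:<n>" with n inside the accepted range
    const refs = [...line.matchAll(/([\w.-]+):(\d+)/g)];
    const inRange = refs.some(
      ([, f, n]) => f === file && Number(n) >= minLine && Number(n) <= maxLine
    );
    // (b) the same line contains at least one distinguishing keyword
    const lower = line.toLowerCase();
    const hasKeyword = keywords.some((k) => lower.includes(k.toLowerCase()));
    return inRange && hasKeyword;
  });
}

const findings = [
  "Form.tsx:12 input is missing a label",
  "Form.tsx:40 paste is blocked on the password field",
].join("\n");

// The paste rule accepts lines 38-42 to tolerate small line-count drift.
const pasteRule = { file: "Form.tsx", minLine: 38, maxLine: 42, keywords: ["paste"] };
console.log(isViolationFound(findings, pasteRule));
```

Because both checks run against the same line, the word "label" on line 12 cannot be credited to a rule whose accepted range is 38-42.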


Three changes informed by the 3-skill pilot batch (PR #47):

1. **"Always: write analysis.md AND commit" merged into a single atomic step.** Pilots #1b and #2 wrote analysis.md but ran out of budget before reaching the separate commit step, leaving case files uncommitted. The merged section explicitly tells the agent to skip everything else if budget is low and finish this section first.
2. **Default --max-budget-usd bumped 3.50 → 10.00.** Pilot #1's first real-data attempt died at the cap mid-modification. Pilot #1c at --budget 15 settled at $3.15 with full success. The prompt's Phase-4 self-cap also moved from $3.00 to $7.00 to leave a $2-3 buffer for the analysis.md + commit cleanup below the wrapper hard cap.
3. **New tools/auto-improve-skill-lessons.md**: a living doc the prompt reads as Phase-4 prior. Captures recipes A-E (two-pass workflow, verify-tool-installed, per-element checklists, BAD/GOOD examples, rationale + bug-story) and grader-reliability patterns G1-G6 (line tolerance, hyphen regex, per-finding-line matching, keyword variants, set-semantics, verbosity floor) with empirical evidence from the manual web-design-guidelines run + the 3 auto pilots. Phase 4 of the prompt now references the recipes by letter so the auto-pilot doesn't rediscover patterns from scratch each run.

Also fixes a slug-parsing regression introduced by the --budget flag (when --budget was absent, the filter wrongly skipped argv[0]).

Smoke tests pass: bare invocation prints usage, "nope" gives bad-slug, existing dir gets refused, --budget validates input.
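
The wrapper's source is not shown here, so the following is only a guess at how a regression like the slug-parsing one can arise, together with a guarded alternative. Both function names and the argument layout (a slug followed by an optional `--budget <usd>` pair) are hypothetical.

```javascript
// Hypothetical buggy version: when "--budget" is absent, indexOf returns -1,
// so the filter also drops index 0 (because -1 + 1 === 0) and the slug is lost.
function parseArgsBuggy(argv) {
  const i = argv.indexOf("--budget");
  const positional = argv.filter((_, idx) => idx !== i && idx !== i + 1);
  return { slug: positional[0], budget: i >= 0 ? Number(argv[i + 1]) : undefined };
}

// Guarded version: strip the flag and its value only when the flag is present,
// and validate the value before using it.
function parseArgs(argv) {
  const i = argv.indexOf("--budget");
  if (i === -1) {
    return { slug: argv[0], budget: undefined };
  }
  const budget = Number(argv[i + 1]);
  if (!Number.isFinite(budget) || budget <= 0) {
    throw new Error("--budget requires a positive number");
  }
  const positional = argv.filter((_, idx) => idx !== i && idx !== i + 1);
  return { slug: positional[0], budget };
}

console.log(parseArgsBuggy(["owner/repo/skill"]).slug); // slug is lost
console.log(parseArgs(["owner/repo/skill"]).slug);
console.log(parseArgs(["owner/repo/skill", "--budget", "15"]).budget);
```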

Yuqing Zhai and others added 3 commits May 8, 2026 13:43

eval(auto-pilot): agent-browser — status=success, coverage 0.56→1.00

37f4cb5

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

eval(auto-pilot): supabase-postgres-best-practices — status=success, …

0c932aa

…coverage 0.54→0.86

Copilot AI review requested due to automatic review settings May 8, 2026 18:45

Copilot started reviewing on behalf of Zhaiyuqing2003 May 8, 2026 18:46

Copilot AI reviewed May 8, 2026


Zhaiyuqing2003 mentioned this pull request May 11, 2026

docs(pilot-runs): publish batch-1 + batch-2 pilot summaries #49

Open

Zhaiyuqing2003 wants to merge 3 commits into feat/auto-improve-skill from eval/auto-pilot/batch-2026-05-08
