eval(auto-pilot): batch 2 — 10 skills, 8/10 success, 0 failures #48

Open
Zhaiyuqing2003 wants to merge 10 commits into feat/auto-improve-skill from eval/auto-pilot/batch-2-2026-05-09

Conversation

@Zhaiyuqing2003

Stacked on PR #46 (feat/auto-improve-skill v1.1 + #3).

Second batch run of the auto-improve-skill pilot. 10 skills (ranks 5–14 from the prioritized top-N list), fired in parallel via git worktree. 8/10 success, 2/10 uplift-too-small, 0 blocked or budget-exceeded.

Results

| Skill | Class | Status | Coverage | Mods |
|---|---|---|---|---|
| pptx | document-producer | | 0.85 → 0.85 | 0 |
| next-best-practices | code-reviewer | | 0.80 → 0.975 | 0 (grader cal only) |
| firebase-auth-basics | code-reviewer | | 1.00 → 1.00 | 0 |
| firebase-hosting-basics | code-patterns | | 0.89 → 1.00 | 1 |
| building-native-ui | code-patterns | | 0.99 → 0.99 | 0 |
| shadcn-ui | code-patterns | | 0.82 → 0.89 | 1 |
| native-data-fetching | code-reviewer | | 1.00 → 1.00 | 0 |
| firecrawl-build-scrape | code-patterns | ⚠️ | 0.84 → 0.89 | 2 |
| next-upgrade | code-reviewer | ⚠️ | 0.83 → 0.76 (regression) | 2 |
| prd | document-producer | | 1.00 → 1.00 | 0 |

v1.1 + #3 validation

  • All 10 inner agents committed cleanly. Atomic write-and-commit fix worked — no manual recovery needed (vs batch 1 where 2 of 3 needed it).
  • Pilots cited Recipe A / D / E by letter from tools/auto-improve-skill-lessons.md. The lessons doc is doing real work; agents aren't rediscovering patterns.
  • The "grader-vs-skill check first" step ran on 6 of 10 pilots: an iteration-0 grader calibration that happens before anything counts against the iteration budget. It saved meaningful budget on pilots 2, 4, and 6.
  • looseRange / tolerantKeyword pre-baked helpers were used in the agent-written graders. Several pilots widened specifically for gpt-4o-mini drift — new signal for v1.2.
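The two helpers named above could look roughly like the following. This is a minimal sketch: the real `looseRange` / `tolerantKeyword` signatures are not shown in this PR, so the parameter shapes, the `matchesFinding` wrapper, and the sample finding are all assumptions made for illustration.

```javascript
// Hypothetical shapes; the actual helpers in _grader-utils.mjs may differ.

// Inclusive [min, max] line window around the seeded line, clamped at line 1.
function looseRange(line, tolerance) {
  return [Math.max(1, line - tolerance), line + tolerance];
}

// True if the text contains any accepted spelling of a keyword (case-insensitive).
function tolerantKeyword(text, variants) {
  const haystack = text.toLowerCase();
  return variants.some((v) => haystack.includes(v.toLowerCase()));
}

// A finding passes when its reported line is inside the window
// and its text names the issue in some accepted form.
function matchesFinding(finding, expected, tolerance) {
  const [min, max] = looseRange(expected.line, tolerance);
  return (
    finding.line >= min &&
    finding.line <= max &&
    tolerantKeyword(finding.text, expected.keywords)
  );
}

// Illustrative data, not taken from any suite in this PR.
const expected = { line: 40, keywords: ['searchParams', 'search params'] };
const finding = { line: 49, text: 'Sync access to search params must be awaited' };

console.log(matchesFinding(finding, expected, 3));  // false: 49 is outside 37..43
console.log(matchesFinding(finding, expected, 12)); // true: 49 is inside 28..52
```

Widening the window only changes the `tolerance` argument, which is why a pilot can widen for one model without touching the keyword logic.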

Cost

  • OpenRouter: $21.30 across 10 pilots ($2.13/pilot)
  • Wall clock: ~50 min (vs ~150 min sequential)

What to PR upstream

Three pilots produced real additive proposals:

  • firebase-hosting-basics (+0.11, → firebase/agent-skills)
  • shadcn-ui (+0.07, → google-labs-code/stitch-skills)
  • firecrawl-build-scrape (+0.05, → firecrawl/skills) — bubble case at exactly the threshold

next-upgrade regressed (-0.07) and is not safe to PR — see the run's analysis.md for the new failure modes (CLI fabrication, bash-commands-for-small-models anti-pattern).
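The uplift gate behind these calls can be sketched as follows. It assumes the rule is a strict "best coverage exceeds baseline + 0.05" comparison, which is the wording used in the next-upgrade commit message; `exceedsUplift` is a hypothetical name, not a function from this repo, and the sketch does not model how pilots with zero modifications are counted as successes.

```javascript
// Assumption: the threshold is 0.05 and the comparison is strict (">"),
// so a pilot that lands exactly on the threshold does not pass.
const UPLIFT_THRESHOLD = 0.05;

// Compare in integer thousandths: in floating point, 0.89 - 0.84 is slightly
// above 0.05, which would wrongly pass the boundary case.
function exceedsUplift(baseline, best, threshold = UPLIFT_THRESHOLD) {
  const toMilli = (x) => Math.round(x * 1000);
  return toMilli(best) - toMilli(baseline) > toMilli(threshold);
}

// Coverage values copied from the Results table above.
console.log(exceedsUplift(0.89, 1.0));  // firebase-hosting-basics: true
console.log(exceedsUplift(0.82, 0.89)); // shadcn-ui: true
console.log(exceedsUplift(0.84, 0.89)); // firecrawl-build-scrape: false (exactly at threshold)
console.log(exceedsUplift(0.83, 0.76)); // next-upgrade: false (regression)
```

Under these assumptions the two ⚠️ rows are exactly the two that fail the gate.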

What this surfaces for v1.2 of the prompt

  1. New anti-pattern: don't add bash commands to skills aimed at small models — gpt-4o-mini tries to execute them
  2. New failure mode: CLI fabrication on "upgrade-style" skills (different from curl fallback)
  3. New grader pattern: per-model line tolerance — gpt-4o-mini drifts 6-15 lines, sonnet/gemini drift 0-3
  4. Repo path variant: expo/skills uses plugins/expo/skills/<id>/SKILL.md (not canonical layout)
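Item 3 could be expressed as a per-model lookup in the grader. The sketch below is one possible shape, not the v1.2 design: the tolerance values are illustrative numbers sized to cover the drift ranges quoted above, the fallback value is an arbitrary conservative choice, and `lineTolerance` / `lineMatches` are hypothetical names.

```javascript
// Illustrative tolerances only: wide enough for the drift ranges quoted above
// (gpt-4o-mini 6-15 lines, sonnet/gemini 0-3 lines). Model IDs follow the
// form used in the suite configs.
const LINE_TOLERANCE_BY_MODEL = {
  'openrouter/openai/gpt-4o-mini': 15,
  'openrouter/anthropic/claude-sonnet-4-6': 3,
  'openrouter/google/gemini-2.5-pro': 3,
};

// Assumption: unknown models get the widest window rather than a hard failure.
const FALLBACK_LINE_TOLERANCE = 15;

function lineTolerance(model) {
  return LINE_TOLERANCE_BY_MODEL[model] ?? FALLBACK_LINE_TOLERANCE;
}

// True when the reported line is within the model's allowed drift of the seeded line.
function lineMatches(reportedLine, seededLine, model) {
  return Math.abs(reportedLine - seededLine) <= lineTolerance(model);
}

console.log(lineMatches(52, 40, 'openrouter/openai/gpt-4o-mini'));          // true: drift 12 <= 15
console.log(lineMatches(52, 40, 'openrouter/anthropic/claude-sonnet-4-6')); // false: drift 12 > 3
```

Keeping the table separate from the comparison means a matrix change only touches the lookup.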

Detailed write-up at docs/superpowers/pilot-runs/2026-05-09-auto-improve-batch-2-summary.md (gitignored, local-only).

🤖 Generated with Claude Code

Yuqing Zhai and others added 10 commits May 8, 2026 15:50
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…→0.975

Baseline raw coverage 0.80; after grader calibration (±12 tolerance for
late-file violations, explicit write-tool instruction in task prompts)
coverage reached 0.975 (79/81 trials). No skill modifications needed —
skill already covers the seeded violations well.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…0→1.00

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…0.89→1.00

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…0.99

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…0→1.00

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…overage 0.84→0.89

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…83→0.76

Eval suite for vercel-labs/next-skills/next-upgrade skill.

- Classified as code-reviewer (findings.txt output shape)
- 1 case (review-starter-app), 6 violations across 4 files:
  package.json (v14 version), app/page.tsx (viewport + sync searchParams),
  app/[id]/page.tsx (sync params), app/api/route.ts (sync cookies + headers)
- 3-model matrix: claude-sonnet-4-5, gpt-4o-mini, gemini-2.5-pro, 3 trials each
- Baseline coverage 0.83 after grader calibration (pkg range 1-25, not looseRange)
- Iteration 1 (grep checklist): 0.685 — bash cmds caused GPT to try executing them
- Iteration 2 (BAD/GOOD examples only): 0.759 — GPT still CLI-fixated on some trials
- status: uplift-too-small (2 iterations done, neither exceeded baseline+0.05)
- Lessons: seed comments must not contain fix patterns; pkg.json line=file-level

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 8, 2026 20:51
Contributor

Copilot AI left a comment


Pull request overview

This PR adds the second “auto-improve-skill” pilot batch outputs into the repo: new workbench eval suites (cases, graders, fixtures, analyses) for multiple skills, plus an update to the lessons-learned doc capturing new prompt/grader patterns discovered during the runs.

Changes:

  • Added several new examples/workbench/<skill>/ eval suites (suite configs, seeded workspaces, graders, analyses, and proposed-upstream-change notes).
  • Extended tools/auto-improve-skill-lessons.md with new pilot learnings (notably around bash-command anti-patterns and grader calibration).
  • Vendored additional skill/reference snapshots used by the workbench suites.

Reviewed changes

Copilot reviewed 134 out of 134 changed files in this pull request and generated 23 comments.

File Description
tools/auto-improve-skill-lessons.md Adds new batch-2 pilot lessons (anti-patterns + grader calibration guidance).
examples/workbench/shadcn-ui/workspace/UserCard.tsx Seeded fixture for shadcn/ui review eval.
examples/workbench/shadcn-ui/workspace/StatusBadge.tsx Seeded fixture for shadcn/ui review eval.
examples/workbench/shadcn-ui/suite.yml Workbench suite config for shadcn-ui eval.
examples/workbench/shadcn-ui/README.md Documents shadcn-ui eval cases and how to run the suite.
examples/workbench/shadcn-ui/proposed-upstream-changes/README.md Summarizes suggested upstream SKILL.md additions for shadcn-ui.
examples/workbench/shadcn-ui/checks/grade-usercard-findings.mjs Grader for shadcn-ui UserCard findings output.
examples/workbench/shadcn-ui/checks/grade-statusbadge-findings.mjs Grader for shadcn-ui StatusBadge findings output.
examples/workbench/shadcn-ui/checks/_grader-utils.mjs Shared grader helpers for shadcn-ui eval.
examples/workbench/shadcn-ui/analysis.md Records the shadcn-ui auto-pilot run results and calibration notes.
examples/workbench/prd/workspace/brief-api-gateway.md Seed brief input for PRD-writing eval case.
examples/workbench/prd/workspace/brief-ai-search.md Seed brief input for PRD-writing eval case.
examples/workbench/prd/suite.yml Workbench suite config for prd eval.
examples/workbench/prd/references/prd/SKILL.md Vendored PRD skill snapshot used by the suite.
examples/workbench/prd/README.md Documents PRD eval cases and run instructions.
examples/workbench/prd/proposed-upstream-changes/README.md Notes that no upstream PRD skill changes are proposed.
examples/workbench/prd/proposed-upstream-changes/github-awesome-copilot/before-SKILL.md Baseline upstream snapshot for PRD skill.
examples/workbench/prd/proposed-upstream-changes/github-awesome-copilot/after-SKILL.md Post-run snapshot for PRD skill (unchanged).
examples/workbench/prd/checks/grade-api-gateway-prd.mjs Grader for API-gateway PRD structural checks.
examples/workbench/prd/checks/grade-ai-search-prd.mjs Grader for AI-search PRD structural checks.
examples/workbench/prd/checks/_grader-utils.mjs Shared keyword helpers (and findings grader utility) for PRD graders.
examples/workbench/prd/analysis.md Records PRD pilot outcomes and notes about excluded trials.
examples/workbench/pptx/suite.yml Workbench suite config for pptx eval.
examples/workbench/pptx/README.md Documents pptx eval cases and how to run them.
examples/workbench/pptx/checks/no-pptx-skill.mjs Control-case grader ensuring pptx skill isn’t unnecessarily read.
examples/workbench/pptx/checks/extract-pptx-facts.mjs Grader for PPTX extraction case output JSON.
examples/workbench/pptx/checks/create-product-deck.mjs Grader validating generated PPTX contains required strings.
examples/workbench/pptx/checks/create-inputs.py Generates a deterministic input PPTX for extraction tests.
examples/workbench/pptx/checks/_trace.mjs Trace reader used to detect forbidden reads in control cases.
examples/workbench/pptx/checks/_grader-utils.mjs Shared grader helpers for pptx eval.
examples/workbench/pptx/analysis.md Records pptx pilot outcomes and grader calibration note.
examples/workbench/next-upgrade/workspace/starter-app/package.json Seed Next.js 14 app fixture for upgrade-review eval.
examples/workbench/next-upgrade/workspace/starter-app/app/page.tsx Seed file for Next.js async API + metadata viewport checks.
examples/workbench/next-upgrade/workspace/starter-app/app/api/route.ts Seed route handler fixture for async cookies/headers checks.
examples/workbench/next-upgrade/workspace/starter-app/app/[id]/page.tsx Seed dynamic route fixture for async params checks.
examples/workbench/next-upgrade/suite.yml Workbench suite config for next-upgrade eval.
examples/workbench/next-upgrade/references/next-upgrade/SKILL.md Vendored next-upgrade skill snapshot used by the suite.
examples/workbench/next-upgrade/README.md Documents next-upgrade eval cases, graders, and run instructions.
examples/workbench/next-upgrade/checks/grade-starter-route.mjs Grader for workspace-mutation route handler fixes.
examples/workbench/next-upgrade/checks/grade-starter-pages.mjs Grader for workspace-mutation page fixes.
examples/workbench/next-upgrade/checks/grade-starter-package.mjs Grader for workspace-mutation dependency version fix.
examples/workbench/next-upgrade/checks/grade-route-findings.mjs Findings.txt grader for route handler issues.
examples/workbench/next-upgrade/checks/grade-page-findings.mjs Findings.txt grader for page.tsx issues.
examples/workbench/next-upgrade/checks/grade-package-findings.mjs Findings.txt grader for package.json upgrade issues.
examples/workbench/next-upgrade/checks/grade-id-page-findings.mjs Findings.txt grader for dynamic route params issue.
examples/workbench/next-upgrade/checks/_grader-utils.mjs Shared grader helpers for next-upgrade findings graders.
examples/workbench/next-upgrade/analysis.md Records next-upgrade pilot outcomes and failure-mode analysis.
examples/workbench/next-best-practices/workspace/components/HeroSection.tsx Seed fixture for next-best-practices code-review eval.
examples/workbench/next-best-practices/workspace/app/dashboard/page.tsx Seed fixture for next-best-practices code-review eval.
examples/workbench/next-best-practices/suite.yml Workbench suite config for next-best-practices eval.
examples/workbench/next-best-practices/references/next-best-practices/suspense-boundaries.md Vendored reference doc used by the skill.
examples/workbench/next-best-practices/references/next-best-practices/SKILL.md Vendored next-best-practices skill snapshot.
examples/workbench/next-best-practices/references/next-best-practices/scripts.md Vendored reference doc used by the skill.
examples/workbench/next-best-practices/references/next-best-practices/runtime-selection.md Vendored reference doc used by the skill.
examples/workbench/next-best-practices/references/next-best-practices/route-handlers.md Vendored reference doc used by the skill.
examples/workbench/next-best-practices/references/next-best-practices/hydration-error.md Vendored reference doc used by the skill.
examples/workbench/next-best-practices/references/next-best-practices/functions.md Vendored reference doc used by the skill.
examples/workbench/next-best-practices/references/next-best-practices/file-conventions.md Vendored reference doc used by the skill.
examples/workbench/next-best-practices/references/next-best-practices/directives.md Vendored reference doc used by the skill.
examples/workbench/next-best-practices/references/next-best-practices/debug-tricks.md Vendored reference doc used by the skill.
examples/workbench/next-best-practices/references/next-best-practices/async-patterns.md Vendored reference doc used by the skill.
examples/workbench/next-best-practices/README.md Documents next-best-practices eval cases and running.
examples/workbench/next-best-practices/checks/grade-herosection-findings.mjs Findings grader for HeroSection violations.
examples/workbench/next-best-practices/checks/grade-dashboard-findings.mjs Findings grader for dashboard page violations.
examples/workbench/next-best-practices/checks/_grader-utils.mjs Shared grader helpers for next-best-practices eval.
examples/workbench/next-best-practices/analysis.md Records next-best-practices pilot outcomes and calibration note.
examples/workbench/native-data-fetching/workspace/screens/DashboardScreen.tsx Seed fixture for Expo networking review eval.
examples/workbench/native-data-fetching/workspace/api/client.ts Seed fixture for Expo networking review eval.
examples/workbench/native-data-fetching/suite.yml Workbench suite config for native-data-fetching eval.
examples/workbench/native-data-fetching/README.md Documents native-data-fetching eval cases and running.
examples/workbench/native-data-fetching/checks/grade-dashboard-findings.mjs Findings grader for DashboardScreen violations.
examples/workbench/native-data-fetching/checks/grade-client-findings.mjs Findings grader for api/client.ts violations.
examples/workbench/native-data-fetching/checks/_grader-utils.mjs Shared grader helpers for native-data-fetching eval.
examples/workbench/native-data-fetching/analysis.md Records native-data-fetching pilot outcomes.
examples/workbench/firecrawl-build-scrape/workspace/ScrapeService.ts Seed fixture for Firecrawl scrape integration review eval.
examples/workbench/firecrawl-build-scrape/suite.yml Workbench suite config for firecrawl-build-scrape eval.
examples/workbench/firecrawl-build-scrape/references/firecrawl-build-scrape/SKILL.md Vendored Firecrawl skill snapshot.
examples/workbench/firecrawl-build-scrape/references/firecrawl-build-scrape/node-docs.md Vendored Node.js “source of truth” reference for Firecrawl.
examples/workbench/firecrawl-build-scrape/README.md Documents firecrawl-build-scrape eval cases and running.
examples/workbench/firecrawl-build-scrape/checks/grade-scrape-service-findings.mjs Findings grader for Firecrawl integration violations.
examples/workbench/firecrawl-build-scrape/checks/_grader-utils.mjs Shared grader helpers for firecrawl-build-scrape eval.
examples/workbench/firecrawl-build-scrape/analysis.md Records firecrawl-build-scrape pilot outcomes.
examples/workbench/firebase-hosting-basics/workspace/firebase-app/firebase.json Seed misconfigured firebase.json fixture.
examples/workbench/firebase-hosting-basics/suite.yml Workbench suite config for firebase-hosting-basics eval.
examples/workbench/firebase-hosting-basics/references/firebase-hosting-basics/SKILL.md Vendored firebase-hosting-basics skill snapshot.
examples/workbench/firebase-hosting-basics/references/firebase-hosting-basics/deploying.md Vendored deployment reference doc.
examples/workbench/firebase-hosting-basics/references/firebase-hosting-basics/configuration.md Vendored configuration reference doc.
examples/workbench/firebase-hosting-basics/README.md Documents firebase-hosting-basics eval cases and running.
examples/workbench/firebase-hosting-basics/proposed-upstream-changes/README.md Summarizes proposed upstream SKILL.md additions.
examples/workbench/firebase-hosting-basics/proposed-upstream-changes/firebase-agent-skills/before-SKILL.md Baseline upstream snapshot for firebase-hosting-basics skill.
examples/workbench/firebase-hosting-basics/proposed-upstream-changes/firebase-agent-skills/after-SKILL.md Post-run snapshot including proposed additive section.
examples/workbench/firebase-hosting-basics/checks/grade-firebase-config-findings.mjs Findings grader for firebase.json violations.
examples/workbench/firebase-hosting-basics/checks/_grader-utils.mjs Shared grader helpers for firebase-hosting-basics eval.
examples/workbench/firebase-hosting-basics/analysis.md Records firebase-hosting-basics pilot outcomes.
examples/workbench/firebase-auth-basics/workspace/src/auth.js Seed fixture for Firebase Auth JS review case.
examples/workbench/firebase-auth-basics/workspace/firestore.rules Seed fixture for Firestore rules review case.
examples/workbench/firebase-auth-basics/suite.yml Workbench suite config for firebase-auth-basics eval.
examples/workbench/firebase-auth-basics/references/firebase-auth-basics/SKILL.md Vendored firebase-auth-basics skill snapshot.
examples/workbench/firebase-auth-basics/references/firebase-auth-basics/references/security_rules.md Vendored security rules reference doc.
examples/workbench/firebase-auth-basics/references/firebase-auth-basics/references/flutter_setup.md Vendored Flutter auth reference doc.
examples/workbench/firebase-auth-basics/references/firebase-auth-basics/references/client_sdk_android.md Vendored Android auth reference doc.
examples/workbench/firebase-auth-basics/README.md Documents firebase-auth-basics eval cases and running.
examples/workbench/firebase-auth-basics/checks/grade-rules-findings.mjs Findings grader for Firestore rules violations.
examples/workbench/firebase-auth-basics/checks/grade-auth-findings.mjs Findings grader for auth.js violations.
examples/workbench/firebase-auth-basics/checks/_grader-utils.mjs Shared grader helpers for firebase-auth-basics eval.
examples/workbench/firebase-auth-basics/analysis.md Records firebase-auth-basics pilot outcomes.
examples/workbench/building-native-ui/workspace/SettingsScreen.tsx Seed fixture for Expo UI review case.
examples/workbench/building-native-ui/workspace/MediaPlayerScreen.tsx Seed fixture for Expo UI review case.
examples/workbench/building-native-ui/suite.yml Workbench suite config for building-native-ui eval.
examples/workbench/building-native-ui/README.md Documents building-native-ui eval cases and running.
examples/workbench/building-native-ui/checks/grade-settings-screen-findings.mjs Findings grader for SettingsScreen violations.
examples/workbench/building-native-ui/checks/grade-media-player-findings.mjs Findings grader for MediaPlayerScreen violations.
examples/workbench/building-native-ui/checks/_grader-utils.mjs Shared grader helpers for building-native-ui eval.
examples/workbench/building-native-ui/analysis.md Records building-native-ui pilot outcomes.


Comment on lines +3 to +6
models:
- openrouter/anthropic/claude-sonnet-4-6
- openrouter/openai/gpt-4o-mini
- openrouter/google/gemini-2.5-pro
Comment on lines +13 to +16
task: |
  Review UserCard.tsx for shadcn/ui best-practice violations.
  Use the skill guidance in your references/shadcn-ui/SKILL.md.

name: building-native-ui-eval
references: ./references
models:
- openrouter/anthropic/claude-sonnet-4-6
Comment on lines +16 to +19
You are reviewing a React Native / Expo screen component for guideline violations.
The Expo UI guidelines are in `references/building-native-ui/SKILL.md`.

Review the file `MediaPlayerScreen.tsx` against every section of the guidelines:
Comment on lines +4 to +7
models:
- openrouter/anthropic/claude-sonnet-4-6
- openrouter/openai/gpt-4o-mini
- openrouter/google/gemini-2.5-pro
Comment on lines +21 to +29
function findPrd(dir) {
  let files;
  try { files = readdirSync(dir); } catch { return null; }
  const md = files.filter((f) => f.endsWith('.md'));
  const preferred = md.find((f) => /prd|product.req|requirement/i.test(f));
  return preferred
    ? join(dir, preferred)
    : md.length > 0 ? join(dir, md[0]) : null;
}
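To see what the selection logic in `findPrd` does, it can be isolated from the filesystem. The sketch below is a hypothetical pure variant (`pickPrd` is not in the PR) that takes the directory listing as an array and returns a file name instead of a joined path; the regex and preference order are copied from the snippet above, and the file names are illustrative.

```javascript
// Hypothetical pure variant of findPrd: same preference order, no fs access.
function pickPrd(files) {
  const md = files.filter((f) => f.endsWith('.md'));
  // Prefer a markdown file whose name looks like a PRD.
  const preferred = md.find((f) => /prd|product.req|requirement/i.test(f));
  if (preferred) return preferred;
  // Otherwise fall back to the first markdown file in the listing, if any.
  return md.length > 0 ? md[0] : null;
}

console.log(pickPrd(['notes.md', 'checkout-PRD.md']));   // 'checkout-PRD.md'
console.log(pickPrd(['brief-ai-search.md', 'out.txt'])); // 'brief-ai-search.md' (fallback)
console.log(pickPrd(['out.txt']));                       // null
```

The second call shows the fallback branch: when no name matches the pattern, whichever `.md` file comes first in the listing is returned.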
Comment on lines +15 to +19
task: |
  A teammate submitted a pull request adding `api/client.ts`. Review this file for
  networking and data-fetching issues using the Expo skill at
  `references/native-data-fetching/SKILL.md`.

Comment on lines +27 to +30
task: |
  Review StatusBadge.tsx for shadcn/ui best-practice violations.
  Use the skill guidance in your references/shadcn-ui/SKILL.md.

Comment on lines +30 to +34
- name: review-settings-screen
  task: |
    You are reviewing a React Native / Expo screen component for guideline violations.
    The Expo UI guidelines are in `references/building-native-ui/SKILL.md`.

Comment on lines +30 to +34
task: |
  A teammate submitted a pull request adding `screens/DashboardScreen.tsx`. Review this
  file for networking and data-fetching issues using the Expo skill at
  `references/native-data-fetching/SKILL.md`.

Zhaiyuqing2003 pushed a commit that referenced this pull request May 11, 2026
Model matrix change driven by batch-2 pilot results (PR #48):

- gpt-4o-mini consistently dragged scores across 10 pilots via:
  - 3–4 line verbosity floor (rules below the floor were under-reported)
  - 6–15 line drift in findings.txt (vs sonnet/gemini's 0–3 line drift)
  - CLI fabrication on "upgrade-style" skills (hallucinated `npx
    next-upgrade`, ran it, wrote the not-found error as findings)
- Replaced with `openai/gpt-5` — same tier as sonnet-4.6 / gemini-2.5-pro

Lessons.md v1.2 additions:

- New anti-pattern: "Don't add bash commands to skills aimed at small
  models" — they will execute them rather than read them as docs.
  Source: next-upgrade pilot regression (0.83 → 0.76).
- New failure mode: "CLI fabrication on upgrade-style skills" —
  distinct from Recipe B's "reaches-for-fallback curl" pattern.
- New section: "Some upstream repos use non-canonical SKILL.md paths"
  (e.g., `plugins/<owner>/skills/<id>/SKILL.md` in expo's repo).
- G1 updated to reflect new matrix: default ±8 is calibrated for
  sonnet-4.6/gpt-5/gemini-2.5-pro. Smaller models need ±12+.
- Run-record protocol: appended batch-2 entries (10 pilots) +
  added a "Model-matrix history" subsection tracking matrix changes.

Wrapper script unchanged.
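The non-canonical path note could be handled by trying a short list of layout templates in order. This is a sketch under stated assumptions: the plugin template mirrors the one quoted above for expo's repo, the `skills/<id>/SKILL.md` template is a guess because the canonical layout is not spelled out in this thread, and `candidateSkillPaths` / `resolveSkillPath` are hypothetical names rather than functions from the wrapper script.

```javascript
// Layout templates tried in order. The first entry is an assumed "canonical"
// layout; the second is the plugin layout quoted in the commit message.
const LAYOUT_TEMPLATES = [
  'skills/<id>/SKILL.md',
  'plugins/<owner>/skills/<id>/SKILL.md',
];

// Expand every template for a given owner and skill id.
function candidateSkillPaths(owner, id) {
  return LAYOUT_TEMPLATES.map((t) => t.replace('<owner>', owner).replace('<id>', id));
}

// `exists` is injected so the caller decides how to check for a file
// (local filesystem, a fetched repo tree, and so on).
function resolveSkillPath(owner, id, exists) {
  return candidateSkillPaths(owner, id).find((p) => exists(p)) ?? null;
}

// Illustrative repo listing, not fetched from any real repository.
const repoFiles = new Set(['plugins/expo/skills/building-native-ui/SKILL.md']);
const found = resolveSkillPath('expo', 'building-native-ui', (p) => repoFiles.has(p));
console.log(found); // 'plugins/expo/skills/building-native-ui/SKILL.md'
```

Returning `null` when nothing matches leaves it to the caller to report the missing skill rather than guessing a path.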