eval(auto-pilot): batch 2 — 10 skills, 8/10 success, 0 failures#48
Open
Zhaiyuqing2003 wants to merge 10 commits into
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…→0.975 Baseline raw coverage 0.80; after grader calibration (±12 tolerance for late-file violations, explicit write-tool instruction in task prompts) coverage reached 0.975 (79/81 trials). No skill modifications needed — skill already covers the seeded violations well. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
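The calibration described in this commit message (a line tolerance for findings, then coverage as passing trials over total trials) can be sketched roughly as follows. The helper names and the example line numbers are hypothetical and not taken from the PR's graders; only the ±12 tolerance and the 79/81 figure come from the message above.

```javascript
// Hypothetical helpers for illustration; not the PR's actual grader code.

// A reported line counts as a hit if it lands within `tolerance` lines
// of the seeded violation's line.
function withinTolerance(reportedLine, seededLine, tolerance) {
  return Math.abs(reportedLine - seededLine) <= tolerance;
}

// Coverage = passing trials / total trials.
function coverage(passed, total) {
  if (total <= 0) throw new RangeError('total must be positive');
  return passed / total;
}

console.log(withinTolerance(58, 47, 12)); // true: 11 lines of drift
console.log(withinTolerance(60, 47, 12)); // false: 13 lines of drift
console.log(coverage(79, 81).toFixed(3)); // 0.975
```

Widening the tolerance trades precision for robustness: a looser window forgives line drift late in a file, but it also makes it easier for an unrelated finding to be counted as a match.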
…0→1.00 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…0.89→1.00 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…0.99 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…0→1.00 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…overage 0.84→0.89 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…83→0.76 Eval suite for vercel-labs/next-skills/next-upgrade skill.
- Classified as code-reviewer (findings.txt output shape)
- 1 case (review-starter-app), 6 violations across 4 files: package.json (v14 version), app/page.tsx (viewport + sync searchParams), app/[id]/page.tsx (sync params), app/api/route.ts (sync cookies + headers)
- 3-model matrix: claude-sonnet-4-5, gpt-4o-mini, gemini-2.5-pro, 3 trials each
- Baseline coverage 0.83 after grader calibration (pkg range 1-25, not looseRange)
- Iteration 1 (grep checklist): 0.685; bash cmds caused GPT to try executing them
- Iteration 2 (BAD/GOOD examples only): 0.759; GPT still CLI-fixated on some trials
- status: uplift-too-small (2 iterations done, neither exceeded baseline+0.05)
- Lessons: seed comments must not contain fix patterns; pkg.json line=file-level
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
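The `uplift-too-small` status above can be read as a simple threshold check against the baseline. A minimal sketch, assuming the best iteration must reach baseline + 0.05: the function name, the epsilon, and the inclusive comparison are assumptions, since the message only says neither iteration "exceeded" the threshold. It covers only the uplift check, not the blocked or budget-exceeded statuses.

```javascript
// Hypothetical sketch; name, epsilon, and inclusive comparison are assumptions.
const EPSILON = 1e-9; // absorbs floating-point noise near the threshold

function classifyRun(baseline, iterationScores, minUplift = 0.05) {
  // With no iterations, Math.max() is -Infinity, so the run is "uplift-too-small".
  const best = Math.max(...iterationScores);
  return best - baseline >= minUplift - EPSILON ? 'success' : 'uplift-too-small';
}

// Numbers from the commit message: baseline 0.83, iterations 0.685 and 0.759.
console.log(classifyRun(0.83, [0.685, 0.759])); // uplift-too-small
```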
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Pull request overview
This PR adds the second “auto-improve-skill” pilot batch outputs into the repo: new workbench eval suites (cases, graders, fixtures, analyses) for multiple skills, plus an update to the lessons-learned doc capturing new prompt/grader patterns discovered during the runs.
Changes:
- Added several new `examples/workbench/<skill>/` eval suites (suite configs, seeded workspaces, graders, analyses, and proposed-upstream-change notes).
- Extended `tools/auto-improve-skill-lessons.md` with new pilot learnings (notably around bash-command anti-patterns and grader calibration).
- Vendored additional skill/reference snapshots used by the workbench suites.
Reviewed changes
Copilot reviewed 134 out of 134 changed files in this pull request and generated 23 comments.
| File | Description |
|---|---|
| tools/auto-improve-skill-lessons.md | Adds new batch-2 pilot lessons (anti-patterns + grader calibration guidance). |
| examples/workbench/shadcn-ui/workspace/UserCard.tsx | Seeded fixture for shadcn/ui review eval. |
| examples/workbench/shadcn-ui/workspace/StatusBadge.tsx | Seeded fixture for shadcn/ui review eval. |
| examples/workbench/shadcn-ui/suite.yml | Workbench suite config for shadcn-ui eval. |
| examples/workbench/shadcn-ui/README.md | Documents shadcn-ui eval cases and how to run the suite. |
| examples/workbench/shadcn-ui/proposed-upstream-changes/README.md | Summarizes suggested upstream SKILL.md additions for shadcn-ui. |
| examples/workbench/shadcn-ui/checks/grade-usercard-findings.mjs | Grader for shadcn-ui UserCard findings output. |
| examples/workbench/shadcn-ui/checks/grade-statusbadge-findings.mjs | Grader for shadcn-ui StatusBadge findings output. |
| examples/workbench/shadcn-ui/checks/_grader-utils.mjs | Shared grader helpers for shadcn-ui eval. |
| examples/workbench/shadcn-ui/analysis.md | Records the shadcn-ui auto-pilot run results and calibration notes. |
| examples/workbench/prd/workspace/brief-api-gateway.md | Seed brief input for PRD-writing eval case. |
| examples/workbench/prd/workspace/brief-ai-search.md | Seed brief input for PRD-writing eval case. |
| examples/workbench/prd/suite.yml | Workbench suite config for prd eval. |
| examples/workbench/prd/references/prd/SKILL.md | Vendored PRD skill snapshot used by the suite. |
| examples/workbench/prd/README.md | Documents PRD eval cases and run instructions. |
| examples/workbench/prd/proposed-upstream-changes/README.md | Notes that no upstream PRD skill changes are proposed. |
| examples/workbench/prd/proposed-upstream-changes/github-awesome-copilot/before-SKILL.md | Baseline upstream snapshot for PRD skill. |
| examples/workbench/prd/proposed-upstream-changes/github-awesome-copilot/after-SKILL.md | Post-run snapshot for PRD skill (unchanged). |
| examples/workbench/prd/checks/grade-api-gateway-prd.mjs | Grader for API-gateway PRD structural checks. |
| examples/workbench/prd/checks/grade-ai-search-prd.mjs | Grader for AI-search PRD structural checks. |
| examples/workbench/prd/checks/_grader-utils.mjs | Shared keyword helpers (and findings grader utility) for PRD graders. |
| examples/workbench/prd/analysis.md | Records PRD pilot outcomes and notes about excluded trials. |
| examples/workbench/pptx/suite.yml | Workbench suite config for pptx eval. |
| examples/workbench/pptx/README.md | Documents pptx eval cases and how to run them. |
| examples/workbench/pptx/checks/no-pptx-skill.mjs | Control-case grader ensuring pptx skill isn’t unnecessarily read. |
| examples/workbench/pptx/checks/extract-pptx-facts.mjs | Grader for PPTX extraction case output JSON. |
| examples/workbench/pptx/checks/create-product-deck.mjs | Grader validating generated PPTX contains required strings. |
| examples/workbench/pptx/checks/create-inputs.py | Generates a deterministic input PPTX for extraction tests. |
| examples/workbench/pptx/checks/_trace.mjs | Trace reader used to detect forbidden reads in control cases. |
| examples/workbench/pptx/checks/_grader-utils.mjs | Shared grader helpers for pptx eval. |
| examples/workbench/pptx/analysis.md | Records pptx pilot outcomes and grader calibration note. |
| examples/workbench/next-upgrade/workspace/starter-app/package.json | Seed Next.js 14 app fixture for upgrade-review eval. |
| examples/workbench/next-upgrade/workspace/starter-app/app/page.tsx | Seed file for Next.js async API + metadata viewport checks. |
| examples/workbench/next-upgrade/workspace/starter-app/app/api/route.ts | Seed route handler fixture for async cookies/headers checks. |
| examples/workbench/next-upgrade/workspace/starter-app/app/[id]/page.tsx | Seed dynamic route fixture for async params checks. |
| examples/workbench/next-upgrade/suite.yml | Workbench suite config for next-upgrade eval. |
| examples/workbench/next-upgrade/references/next-upgrade/SKILL.md | Vendored next-upgrade skill snapshot used by the suite. |
| examples/workbench/next-upgrade/README.md | Documents next-upgrade eval cases, graders, and run instructions. |
| examples/workbench/next-upgrade/checks/grade-starter-route.mjs | Grader for workspace-mutation route handler fixes. |
| examples/workbench/next-upgrade/checks/grade-starter-pages.mjs | Grader for workspace-mutation page fixes. |
| examples/workbench/next-upgrade/checks/grade-starter-package.mjs | Grader for workspace-mutation dependency version fix. |
| examples/workbench/next-upgrade/checks/grade-route-findings.mjs | Findings.txt grader for route handler issues. |
| examples/workbench/next-upgrade/checks/grade-page-findings.mjs | Findings.txt grader for page.tsx issues. |
| examples/workbench/next-upgrade/checks/grade-package-findings.mjs | Findings.txt grader for package.json upgrade issues. |
| examples/workbench/next-upgrade/checks/grade-id-page-findings.mjs | Findings.txt grader for dynamic route params issue. |
| examples/workbench/next-upgrade/checks/_grader-utils.mjs | Shared grader helpers for next-upgrade findings graders. |
| examples/workbench/next-upgrade/analysis.md | Records next-upgrade pilot outcomes and failure-mode analysis. |
| examples/workbench/next-best-practices/workspace/components/HeroSection.tsx | Seed fixture for next-best-practices code-review eval. |
| examples/workbench/next-best-practices/workspace/app/dashboard/page.tsx | Seed fixture for next-best-practices code-review eval. |
| examples/workbench/next-best-practices/suite.yml | Workbench suite config for next-best-practices eval. |
| examples/workbench/next-best-practices/references/next-best-practices/suspense-boundaries.md | Vendored reference doc used by the skill. |
| examples/workbench/next-best-practices/references/next-best-practices/SKILL.md | Vendored next-best-practices skill snapshot. |
| examples/workbench/next-best-practices/references/next-best-practices/scripts.md | Vendored reference doc used by the skill. |
| examples/workbench/next-best-practices/references/next-best-practices/runtime-selection.md | Vendored reference doc used by the skill. |
| examples/workbench/next-best-practices/references/next-best-practices/route-handlers.md | Vendored reference doc used by the skill. |
| examples/workbench/next-best-practices/references/next-best-practices/hydration-error.md | Vendored reference doc used by the skill. |
| examples/workbench/next-best-practices/references/next-best-practices/functions.md | Vendored reference doc used by the skill. |
| examples/workbench/next-best-practices/references/next-best-practices/file-conventions.md | Vendored reference doc used by the skill. |
| examples/workbench/next-best-practices/references/next-best-practices/directives.md | Vendored reference doc used by the skill. |
| examples/workbench/next-best-practices/references/next-best-practices/debug-tricks.md | Vendored reference doc used by the skill. |
| examples/workbench/next-best-practices/references/next-best-practices/async-patterns.md | Vendored reference doc used by the skill. |
| examples/workbench/next-best-practices/README.md | Documents next-best-practices eval cases and running. |
| examples/workbench/next-best-practices/checks/grade-herosection-findings.mjs | Findings grader for HeroSection violations. |
| examples/workbench/next-best-practices/checks/grade-dashboard-findings.mjs | Findings grader for dashboard page violations. |
| examples/workbench/next-best-practices/checks/_grader-utils.mjs | Shared grader helpers for next-best-practices eval. |
| examples/workbench/next-best-practices/analysis.md | Records next-best-practices pilot outcomes and calibration note. |
| examples/workbench/native-data-fetching/workspace/screens/DashboardScreen.tsx | Seed fixture for Expo networking review eval. |
| examples/workbench/native-data-fetching/workspace/api/client.ts | Seed fixture for Expo networking review eval. |
| examples/workbench/native-data-fetching/suite.yml | Workbench suite config for native-data-fetching eval. |
| examples/workbench/native-data-fetching/README.md | Documents native-data-fetching eval cases and running. |
| examples/workbench/native-data-fetching/checks/grade-dashboard-findings.mjs | Findings grader for DashboardScreen violations. |
| examples/workbench/native-data-fetching/checks/grade-client-findings.mjs | Findings grader for api/client.ts violations. |
| examples/workbench/native-data-fetching/checks/_grader-utils.mjs | Shared grader helpers for native-data-fetching eval. |
| examples/workbench/native-data-fetching/analysis.md | Records native-data-fetching pilot outcomes. |
| examples/workbench/firecrawl-build-scrape/workspace/ScrapeService.ts | Seed fixture for Firecrawl scrape integration review eval. |
| examples/workbench/firecrawl-build-scrape/suite.yml | Workbench suite config for firecrawl-build-scrape eval. |
| examples/workbench/firecrawl-build-scrape/references/firecrawl-build-scrape/SKILL.md | Vendored Firecrawl skill snapshot. |
| examples/workbench/firecrawl-build-scrape/references/firecrawl-build-scrape/node-docs.md | Vendored Node.js “source of truth” reference for Firecrawl. |
| examples/workbench/firecrawl-build-scrape/README.md | Documents firecrawl-build-scrape eval cases and running. |
| examples/workbench/firecrawl-build-scrape/checks/grade-scrape-service-findings.mjs | Findings grader for Firecrawl integration violations. |
| examples/workbench/firecrawl-build-scrape/checks/_grader-utils.mjs | Shared grader helpers for firecrawl-build-scrape eval. |
| examples/workbench/firecrawl-build-scrape/analysis.md | Records firecrawl-build-scrape pilot outcomes. |
| examples/workbench/firebase-hosting-basics/workspace/firebase-app/firebase.json | Seed misconfigured firebase.json fixture. |
| examples/workbench/firebase-hosting-basics/suite.yml | Workbench suite config for firebase-hosting-basics eval. |
| examples/workbench/firebase-hosting-basics/references/firebase-hosting-basics/SKILL.md | Vendored firebase-hosting-basics skill snapshot. |
| examples/workbench/firebase-hosting-basics/references/firebase-hosting-basics/deploying.md | Vendored deployment reference doc. |
| examples/workbench/firebase-hosting-basics/references/firebase-hosting-basics/configuration.md | Vendored configuration reference doc. |
| examples/workbench/firebase-hosting-basics/README.md | Documents firebase-hosting-basics eval cases and running. |
| examples/workbench/firebase-hosting-basics/proposed-upstream-changes/README.md | Summarizes proposed upstream SKILL.md additions. |
| examples/workbench/firebase-hosting-basics/proposed-upstream-changes/firebase-agent-skills/before-SKILL.md | Baseline upstream snapshot for firebase-hosting-basics skill. |
| examples/workbench/firebase-hosting-basics/proposed-upstream-changes/firebase-agent-skills/after-SKILL.md | Post-run snapshot including proposed additive section. |
| examples/workbench/firebase-hosting-basics/checks/grade-firebase-config-findings.mjs | Findings grader for firebase.json violations. |
| examples/workbench/firebase-hosting-basics/checks/_grader-utils.mjs | Shared grader helpers for firebase-hosting-basics eval. |
| examples/workbench/firebase-hosting-basics/analysis.md | Records firebase-hosting-basics pilot outcomes. |
| examples/workbench/firebase-auth-basics/workspace/src/auth.js | Seed fixture for Firebase Auth JS review case. |
| examples/workbench/firebase-auth-basics/workspace/firestore.rules | Seed fixture for Firestore rules review case. |
| examples/workbench/firebase-auth-basics/suite.yml | Workbench suite config for firebase-auth-basics eval. |
| examples/workbench/firebase-auth-basics/references/firebase-auth-basics/SKILL.md | Vendored firebase-auth-basics skill snapshot. |
| examples/workbench/firebase-auth-basics/references/firebase-auth-basics/references/security_rules.md | Vendored security rules reference doc. |
| examples/workbench/firebase-auth-basics/references/firebase-auth-basics/references/flutter_setup.md | Vendored Flutter auth reference doc. |
| examples/workbench/firebase-auth-basics/references/firebase-auth-basics/references/client_sdk_android.md | Vendored Android auth reference doc. |
| examples/workbench/firebase-auth-basics/README.md | Documents firebase-auth-basics eval cases and running. |
| examples/workbench/firebase-auth-basics/checks/grade-rules-findings.mjs | Findings grader for Firestore rules violations. |
| examples/workbench/firebase-auth-basics/checks/grade-auth-findings.mjs | Findings grader for auth.js violations. |
| examples/workbench/firebase-auth-basics/checks/_grader-utils.mjs | Shared grader helpers for firebase-auth-basics eval. |
| examples/workbench/firebase-auth-basics/analysis.md | Records firebase-auth-basics pilot outcomes. |
| examples/workbench/building-native-ui/workspace/SettingsScreen.tsx | Seed fixture for Expo UI review case. |
| examples/workbench/building-native-ui/workspace/MediaPlayerScreen.tsx | Seed fixture for Expo UI review case. |
| examples/workbench/building-native-ui/suite.yml | Workbench suite config for building-native-ui eval. |
| examples/workbench/building-native-ui/README.md | Documents building-native-ui eval cases and running. |
| examples/workbench/building-native-ui/checks/grade-settings-screen-findings.mjs | Findings grader for SettingsScreen violations. |
| examples/workbench/building-native-ui/checks/grade-media-player-findings.mjs | Findings grader for MediaPlayerScreen violations. |
| examples/workbench/building-native-ui/checks/_grader-utils.mjs | Shared grader helpers for building-native-ui eval. |
| examples/workbench/building-native-ui/analysis.md | Records building-native-ui pilot outcomes. |
Comment on lines +3 to +6

    models:
      - openrouter/anthropic/claude-sonnet-4-6
      - openrouter/openai/gpt-4o-mini
      - openrouter/google/gemini-2.5-pro
Comment on lines +13 to +16

    task: |
      Review UserCard.tsx for shadcn/ui best-practice violations.
      Use the skill guidance in your references/shadcn-ui/SKILL.md.

    name: building-native-ui-eval
    references: ./references
    models:
      - openrouter/anthropic/claude-sonnet-4-6

Comment on lines +16 to +19

    You are reviewing a React Native / Expo screen component for guideline violations.
    The Expo UI guidelines are in `references/building-native-ui/SKILL.md`.

    Review the file `MediaPlayerScreen.tsx` against every section of the guidelines:
Comment on lines +4 to +7

    models:
      - openrouter/anthropic/claude-sonnet-4-6
      - openrouter/openai/gpt-4o-mini
      - openrouter/google/gemini-2.5-pro
Comment on lines +21 to +29

    function findPrd(dir) {
      let files;
      try { files = readdirSync(dir); } catch { return null; }
      const md = files.filter((f) => f.endsWith('.md'));
      const preferred = md.find((f) => /prd|product.req|requirement/i.test(f));
      return preferred
        ? join(dir, preferred)
        : md.length > 0 ? join(dir, md[0]) : null;
    }
Comment on lines +15 to +19

    task: |
      A teammate submitted a pull request adding `api/client.ts`. Review this file for
      networking and data-fetching issues using the Expo skill at
      `references/native-data-fetching/SKILL.md`.
Comment on lines +27 to +30

    task: |
      Review StatusBadge.tsx for shadcn/ui best-practice violations.
      Use the skill guidance in your references/shadcn-ui/SKILL.md.
Comment on lines +30 to +34

    - name: review-settings-screen
      task: |
        You are reviewing a React Native / Expo screen component for guideline violations.
        The Expo UI guidelines are in `references/building-native-ui/SKILL.md`.
Comment on lines +30 to +34

    task: |
      A teammate submitted a pull request adding `screens/DashboardScreen.tsx`. Review this
      file for networking and data-fetching issues using the Expo skill at
      `references/native-data-fetching/SKILL.md`.
Zhaiyuqing2003 pushed a commit that referenced this pull request on May 11, 2026
Model matrix change driven by batch-2 pilot results (PR #48):
- gpt-5-mini consistently dragged scores across 10 pilots via:
  - 3–4 line verbosity floor (rules below the floor were under-reported)
  - 6–15 line drift in findings.txt (vs sonnet/gemini's 0–3 line drift)
  - CLI fabrication on "upgrade-style" skills (hallucinated `npx next-upgrade`, ran it, wrote the not-found error as findings)
- Replaced with `openai/gpt-5` (same tier as sonnet-4.6 / gemini-2.5-pro)

Lessons.md v1.2 additions:
- New anti-pattern: "Don't add bash commands to skills aimed at small models": they will execute them rather than read them as docs. Source: next-upgrade pilot regression (0.83 → 0.76).
- New failure mode: "CLI fabrication on upgrade-style skills", distinct from Recipe B's "reaches-for-fallback curl" pattern.
- New section: "Some upstream repos use non-canonical SKILL.md paths" (e.g., `plugins/<owner>/skills/<id>/SKILL.md` in expo's repo).
- G1 updated to reflect new matrix: default ±8 is calibrated for sonnet-4.6/gpt-5/gemini-2.5-pro. Smaller models need ±12+.
- Run-record protocol: appended batch-2 entries (10 pilots) + added a "Model-matrix history" subsection tracking matrix changes.

Wrapper script unchanged.
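The G1 guidance in this commit message (±8 for the three larger models, ±12 or more for smaller ones) could be expressed as a per-model lookup. The sketch below is an assumption-laden illustration: the function names are hypothetical, and 12 is used as the fallback only because the message gives "±12+" as a lower bound.

```javascript
// Hypothetical lookup; model names come from the commit message, the rest is assumed.
const DEFAULT_TOLERANCE_MODELS = ['sonnet-4.6', 'gpt-5', 'gemini-2.5-pro'];

// Returns the line tolerance to apply when grading a given model's findings.
function lineToleranceFor(model) {
  return DEFAULT_TOLERANCE_MODELS.includes(model) ? 8 : 12;
}

// True if a finding that drifted `drift` lines would still be accepted.
function driftIsTolerated(model, drift) {
  return Math.abs(drift) <= lineToleranceFor(model);
}

console.log(lineToleranceFor('gpt-5'));          // 8
console.log(lineToleranceFor('gpt-5-mini'));     // 12
console.log(driftIsTolerated('gpt-5-mini', 15)); // false
```

Under these assumed values, the upper end of the 6–15 line drift reported above would still fall outside a ±12 window, which is consistent with the message's decision to change the model matrix instead of only widening the tolerance.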
Stacked on PR #46 (`feat/auto-improve-skill` v1.1 + #3).

Second batch run of the auto-improve-skill pilot. 10 skills (ranks 5–14 from the prioritized top-N list), fired in parallel via `git worktree`. 8/10 success, 2/10 uplift-too-small, 0 blocked or budget-exceeded.

Results

`pptx`, `next-best-practices`, `firebase-auth-basics`, `firebase-hosting-basics`, `building-native-ui`, `shadcn-ui`, `native-data-fetching`, `firecrawl-build-scrape`, `next-upgrade`, `prd`

v1.1 + #3 validation

- …`tools/auto-improve-skill-lessons.md`. The lessons doc is doing real work; agents aren't rediscovering patterns.
- `looseRange`/`tolerantKeyword` pre-baked helpers were used in the agent-written graders. Several pilots widened specifically for gpt-4o-mini drift, a new signal for v1.2.

Cost

$21.30 across 10 pilots ($2.13/pilot)

What to PR upstream

Three pilots produced real additive proposals:

- `firebase-hosting-basics` (+0.11, → firebase/agent-skills)
- `shadcn-ui` (+0.07, → google-labs-code/stitch-skills)
- `firecrawl-build-scrape` (+0.05, → firecrawl/skills): bubble case at exactly the threshold

`next-upgrade` regressed (-0.07) and is not safe to PR; see the run's analysis.md for the new failure modes (CLI fabrication, bash-commands-for-small-models anti-pattern).

What this surfaces for v1.2 of the prompt

- …`curl` fallback)
- `expo/skills` uses `plugins/expo/skills/<id>/SKILL.md` (not canonical layout)

Detailed write-up at `docs/superpowers/pilot-runs/2026-05-09-auto-improve-batch-2-summary.md` (gitignored, local-only).

🤖 Generated with Claude Code