docs(pilot-runs): publish batch-1 + batch-2 pilot summaries by Zhaiyuqing2003 · Pull Request #49 · fastxyz/skill-optimizer

Zhaiyuqing2003 · 2026-05-11T11:45:27Z

Stacked on PR #46 (feat/auto-improve-skill).

Publishes the auto-improve-skill pilot summaries that were previously
local-only (in the gitignored docs/superpowers/pilot-runs/). Moves
them to tracked docs/pilot-runs/ so the team can review the actual
run records, not just the per-PR commit messages.

What's in here

| File | Skills | Results |
| --- | --- | --- |
| `2026-05-08-auto-improve-pilot-summary.md` | 3 (agent-browser, supabase, pdf) | 3/3 success |
| `2026-05-09-auto-improve-batch-2-summary.md` | 10 (pptx, next-best-practices, firebase-auth-basics, firebase-hosting-basics, building-native-ui, shadcn-ui, native-data-fetching, firecrawl-build-scrape, next-upgrade, prd) | 8/10 success, 0 failures |
| `README.md` | n/a | directory index + reproduction recipe |

Why these are worth seeing

- Pattern transfer validated: auto-pilot rediscovered the "two-pass workflow for absence-type rules" insight on supabase (batch 1), then cited Recipe A/D/E from lessons.md by letter in batch-2 pilots 4, 6, 8.
- Already-good detection works: 5 of 13 total pilots correctly exited clean without proposing changes.
- One honest regression captured: next-upgrade went 0.83 → 0.76. Auto-pilot reported it without dressing it up. The run surfaces new failure modes (CLI fabrication, "don't add bash for small models").
- Reproducibility: worktree-per-pilot batching + hardlinked node_modules works at N=10 parallel. About $21 of OpenRouter spend for 10 pilots, ~50 min wall clock.
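The worktree-per-pilot batching with hardlinked node_modules can be sketched roughly as below. This is an illustration only, not the recipe from docs/pilot-runs/README.md: the helper name `make_pilot_worktrees`, the `pilot/<skill>` branch naming, and the worktree root layout are assumptions, and `cp -al` assumes GNU coreutils with both trees on the same filesystem.

```bash
# Hypothetical helper: one git worktree per skill, with node_modules
# hardlinked from the main checkout instead of reinstalled.
# usage: make_pilot_worktrees <repo-dir> <worktree-root> <skill>...
make_pilot_worktrees() {
  local repo root skill
  # Resolve both directories to absolute paths, because `git -C` would
  # otherwise interpret a relative worktree path relative to the repo.
  repo=$(cd "$1" && pwd) || return 1
  mkdir -p "$2" || return 1
  root=$(cd "$2" && pwd) || return 1
  shift 2
  for skill in "$@"; do
    # A fresh branch and working tree per pilot, so parallel runs
    # never share an index or a checkout.
    git -C "$repo" worktree add -b "pilot/$skill" "$root/$skill" HEAD || return 1
    # cp -al recreates the directory tree using hardlinks rather than copies.
    if [ -d "$repo/node_modules" ]; then
      cp -al "$repo/node_modules" "$root/$skill/node_modules" || return 1
    fi
  done
}
```

A call such as `make_pilot_worktrees . ../pilots pptx prd` would prepare two worktrees; each pilot could then be started in the background with `&` and collected with `wait`.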

Per-skill eval artifacts live on eval/auto-pilot/<skill-id> branches (PRs #47, #48); this PR is just the human-readable digest.

🤖 Generated with Claude Code

Moves auto-improve-skill pilot summaries from gitignored docs/superpowers/pilot-runs/ to tracked docs/pilot-runs/ so the team can review them in-tree.

Includes:

- docs/pilot-runs/README.md: directory index + reproduction recipe
- 2026-05-08-auto-improve-pilot-summary.md: batch 1 (3 skills, 3/3 success: agent-browser, supabase, pdf)
- 2026-05-09-auto-improve-batch-2-summary.md: batch 2 (10 skills, 8/10 success, 0 failures: pptx, next-best-practices, firebase-auth-basics, firebase-hosting-basics, building-native-ui, shadcn-ui, native-data-fetching, firecrawl-build-scrape, next-upgrade, prd)

Per-skill eval artifacts and proposed-upstream-changes live on eval/auto-pilot/<skill-id> branches and the consolidated batch branches (eval/auto-pilot/batch-2026-05-08, eval/auto-pilot/batch-2-2026-05-09).

Copilot

Pull request overview

Publishes previously local-only auto-improve-skill pilot run summaries into tracked docs/pilot-runs/ so the team can review actual batch outcomes, patterns, and reproduction steps in-repo.

Changes:

Adds a docs/pilot-runs/ README with an index and a suggested batching workflow.
Adds batch-1 (2026-05-08) and batch-2 (2026-05-09) human-readable pilot summaries, including results, patterns, and decision points.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| docs/pilot-runs/README.md | Adds directory index + a short recipe for running pilots in parallel via worktrees. |
| docs/pilot-runs/2026-05-08-auto-improve-pilot-summary.md | Documents batch-1 outcomes, costs, and follow-up improvements to the auto-pilot. |
| docs/pilot-runs/2026-05-09-auto-improve-batch-2-summary.md | Documents batch-2 outcomes, new patterns/failure modes, and reproduction notes. |


+
+```bash
+# This batch can be reproduced from a fresh checkout of feat/auto-improve-skill:
+cd /home/yuqing/Documents/Code/skill-optimizer


+- Per-pilot avg: **$2.13** (well under the $3.50 budgeted)
+- Plan-token spend (inner `claude -p`): each pilot reported between $0.00 and $1.00 — no pilot hit the $10 wrapper cap


+- **Wrapper version:** v1.1 + #3 (atomic write-and-commit, $10 default budget, lessons.md, pre-baked grader helpers)
+- **Skills:** ranks 5–14 from the prioritized top-N list (skips the 4 already covered in batch 1: web-design-guidelines, agent-browser, supabase, pdf)
+- **Parallelism:** 10 git worktrees, hardlinked `node_modules`, fired simultaneously
+- **Wall clock:** ~50 min (slowest pilot to longest), down from estimated ~150 min sequential


+## Reproducing the pilots
+
+```bash
+cd /home/yuqing/Documents/Code/skill-optimizer


Operational guide for submitting skill-improvement PRs to the four repos we're currently working with (vercel-labs/agent-skills, vercel-labs/web-interface-guidelines, vercel-labs/agent-browser, supabase/agent-skills). Per repo: title format, body convention, CI gates, CLA status, merge style, scope guidance, and any gotchas discovered by reading AGENTS.md/CONTRIBUTING.md/workflow files plus the last 5–10 merged PRs. Future batches: append new repos as their conventions become known.

Polished PR drafts ready for operator review + submission to upstream. Each draft contains:

- Target repo + base branch
- Title in the repo's preferred convention (see upstream-pr-conventions.md)
- PR body matching the repo's style (formal/casual/terse)
- File diff or path to the full proposed file in our repo
- Caveats and gotchas specific to the repo
- Operator copy-paste shell snippet for fork → branch → commit → push → gh pr create

The 4 PRs cover 3 skills (web-design-guidelines spans 2 repos):

1. vercel-labs/agent-skills: web-design-guidelines SKILL.md two-pass workflow
2. vercel-labs/web-interface-guidelines: per-element checklist + 5 BAD/GOOD examples
3. vercel-labs/agent-browser: Pre-flight section (retargeted to skill-data/core/SKILL.md per AGENTS.md)
4. supabase/agent-skills: two-pass review reference (reformulated as a new references/ file per CONTRIBUTING.md, not a SKILL.md edit)

Sources:

- PR 1 + 2: manual web-design-guidelines run (eval/web-design-guidelines)
- PR 3: agent-browser v1.2 re-run (the small additive Pre-flight)
- PR 4: supabase batch-1 result (0.54 → 0.86, content reformulated to fit repo convention)

Adds a `--context <path>` flag to the auto-pilot wrapper that reads a markdown file and injects it into the prompt as a "Constraints" section Phase 4 must respect. Enables steering pilots toward upstream-specific targets (e.g. fetched rules docs instead of skill SKILL.md) and encoding architecture intent (additive-only, no restructure, etc.) as hard constraints.

Phase 4 + Phase 5 updated to honor target-file overrides from the constraints (e.g. edit `command.md` instead of `SKILL.md` when the context says so; package files as `before-/after-command.md` under the correct upstream-repo directory).

Includes the first context file: `tools/auto-improve-contexts/vercel-web-interface-guidelines.md`, encoding the vercel research findings: `command.md` is the canonical source distributed to 7 tools + 10 downstream consumers, restructure risk is HIGH, additive-only PRs are the merged norm (PR #23 precedent), and the AGENTS.md / README.md mirrors happen at PR-draft time.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
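The `--context` behaviour described in this commit message could look roughly like the sketch below. It is not the wrapper's actual code: the function name `build_prompt`, the exact heading text, and the error handling are assumptions made for illustration; only the flag name and the idea of injecting the file as a "Constraints" section come from the commit message.

```bash
# Hypothetical sketch: print a base prompt and, when --context <path>
# is given, append the file's contents under a "Constraints" heading.
# usage: build_prompt <base-prompt> [--context <path>]
build_prompt() {
  local base=$1 context_file=""
  shift
  while [ $# -gt 0 ]; do
    case $1 in
      --context)
        [ $# -ge 2 ] || { echo "--context needs a path" >&2; return 2; }
        context_file=$2
        shift 2
        ;;
      *)
        echo "unknown option: $1" >&2
        return 2
        ;;
    esac
  done

  printf '%s\n' "$base"
  if [ -n "$context_file" ]; then
    # Refuse to run with a missing context file instead of silently
    # dropping the constraints.
    [ -r "$context_file" ] || { echo "cannot read $context_file" >&2; return 1; }
    printf '\n## Constraints\n\n'
    cat "$context_file"
  fi
}
```

Failing hard on an unreadable context file is a deliberate choice in this sketch: a pilot that runs without its constraints would look like a normal run but could edit the wrong target file.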

Encodes upstream conventions discovered via gh-CLI research:

- All 28 existing references in this skill are single-rule SQL anti-pattern fixes with **Incorrect**/**Correct** SQL blocks; meta-workflow guidance is shape-novel (MEDIUM-HIGH risk of "fit the convention" pushback from gregnr/Rodriguespn).
- Prefixes locked to the 8 in `_sections.md` (`query-`, `conn-`, `security-`, `schema-`, `lock-`, `data-`, `monitor-`, `advanced-`); a `review-` prefix would require modifying `_sections.md` which is not additive-only.
- Required reshape: pick a single concrete SQL anti-pattern that two-pass review catches and frame around it (Incorrect = single-pass miss, Correct = two-pass catch). If reshape feels contrived, surface needs-discussion signal instead of shipping borderline PR.
- Frontmatter spec corrected: 4 fields (`title`, `impact`, `impactDescription`, `tags`); previous research missed `impactDescription`. `tags` is comma-separated string, not YAML list.
- pnpm test:sanity does NOT validate frontmatter (corrected prior note); convention is enforced by maintainer review only.
- Release Please owns metadata.version; do not bump manually (causes merge conflicts with bot's release PR).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
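Since the commit message notes that frontmatter is enforced only by maintainer review, a local pre-submission check along these lines might help. This is a sketch under assumptions: the function name `check_frontmatter` is hypothetical, the frontmatter is assumed to sit between the first pair of `---` lines, and only the two rules stated above are checked (four required fields, `tags` as a comma-separated string).

```bash
# Hypothetical local check for the 4-field frontmatter convention.
# usage: check_frontmatter <file.md>
check_frontmatter() {
  local file=$1 fm key status=0
  # Collect the lines between the first pair of '---' delimiters.
  fm=$(awk '/^---[[:space:]]*$/ { n++; next } n == 1 { print } n >= 2 { exit }' "$file")
  for key in title impact impactDescription tags; do
    if ! printf '%s\n' "$fm" | grep -q "^$key:"; then
      echo "$file: missing frontmatter field '$key'" >&2
      status=1
    fi
  done
  # Reject a YAML list for tags: either a flow list ("tags: [a, b]")
  # or an empty value followed by block-list items.
  if printf '%s\n' "$fm" | grep -Eq '^tags:[[:space:]]*(\[|$)'; then
    echo "$file: tags should be a comma-separated string, not a YAML list" >&2
    status=1
  fi
  return "$status"
}
```

The function returns non-zero when any rule fails, so it could be looped over a references/ directory before opening a PR.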

…-browser Carry over existing Tier-0 eval (navigate-and-report, screenshot-capture) as the starting point for deeper Tier-1 work.


- Add 4 cases (ref-based-search, ref-disambiguation, output-correctness, multi-step-state) that grade snapshot-driven @en ref discipline, ambiguous-element resolution, content correctness, and full state-machine traversal, none of which the v1 baseline covered.
- Upgrade bin/agent-browser to a stateful playback CLI: URL match -> page, per-page transitions.txt drives state changes, snapshot emits the recorded accessibility-tree fixture for current (page, state). Falls back to the legacy generic snapshot for Tier-0 continuity. Adds AB_WORK override so the CLI can be smoke-tested outside Docker.
- Add hand-fabricated recordings for 4 pages (wikipedia, signin-signup, blog-article, multistep-form) under references/agent-browser/recordings/.
- Add checks/smoke-graders.mjs running 14 GOOD/BAD assertions against hand-crafted ab-calls.log + output-file fixtures; all pass without Docker or models.

…er-1 pilot

Encodes constraints for the auto-pilot to run against the hand-built Tier-1 deeper eval (4 new cases: ref-based-search, ref-disambiguation, output-correctness, multi-step-state) without rebuilding the workbench.

Key directives:

- Workbench is already built: skip Phase 2 entirely
- Optimization target = references/agent-browser/agent-browser-core.md (the workflow content), NOT references/agent-browser/SKILL.md (the discovery stub)
- Upstream packaging target = skill-data/core/SKILL.md per AGENTS.md
- Apache-2.0 + conventional commits + ctate same-day merges for clean docs-only PRs

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The wrapper-skill PR target (`vercel-labs/agent-skills/.../web-design-guidelines/SKILL.md`) is dropped: it's a thin Claude-Code-specific adapter that WebFetches the rules doc, and editing it is low-leverage. All value lives in `vercel-labs/web-interface-guidelines/command.md` and its two stylistic siblings (`AGENTS.md`, `README.md`).

The consolidated draft at #1 carries:

- The auto-pilot's measured 22-line `command.md` insert (eval 0.92→1.00, 18 trials × 3 frontier models, 6 absence-type misses → 0)
- A MUST/SHOULD/NEVER mirror for `AGENTS.md` (style-faithful, not independently measured)
- A prose mirror for `README.md` (style-faithful, not independently measured)
- A qualitative pitch as the headline + eval data as supporting evidence (matches PR #23 precedent in this repo, which has zero quantitative evidence in any merged PR)

Old drafts moved to `superseded/` with a README explaining why each was retired. Repo PR-drafts README updated to reflect the new canonical numbering.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Captures the two structural lessons from the v1.2.1 pilot session:

1. Research-first context is mandatory (Phase 0): the auto-pilot is good at finding what to change, bad at fitting upstream conventions. Without a researched context file, output requires manual reformulation.
2. Two-loop iteration on eval AND skill (Phase 3.5): the current pipeline can't escape ceiling (>= 0.95) or floor (< 0.50) eval baselines because it only iterates the skill, treating the eval as fixed.

Backwards compatible: v1.2.1's --context flag continues to work; v1.3 phases are opt-in via --research and --auto-eval flags until validated.

Note: this commit lands on the supabase--v1-shallow branch because the agent-browser pilot is concurrently active on the main worktree; branch hygiene (move to docs/auto-pilot-runs) deferred until pilots finish.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
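The ceiling and floor thresholds named in this commit message can be expressed as a small classifier. The sketch below is illustrative only: the function name `classify_baseline` and the output labels are assumptions, and only the two cut-offs (>= 0.95 and < 0.50) come from the text.

```bash
# Hypothetical classifier for an eval baseline score in [0, 1].
# "ceiling" and "floor" are the cases where iterating only the skill
# cannot help, so the eval itself would need to change.
# usage: classify_baseline <score>
classify_baseline() {
  # awk handles the floating-point comparison that plain [ ] cannot.
  awk -v s="$1" 'BEGIN {
    if (s + 0 >= 0.95)     print "ceiling"
    else if (s + 0 < 0.50) print "floor"
    else                   print "in-range"
  }'
}
```

For example, `classify_baseline 0.97` prints `ceiling`, while `classify_baseline 0.54` prints `in-range`.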

The agent-browser deeper-eval pilot timed out at the 90-min wrapper cap mid-baseline (50/54 trials complete; no Phase 5 commit). However, the supabase v2 pilot's Phase 4 instruction to append a run-record entry to lessons.md DID complete and wrote a useful observation about the 'calibrated graders cause baseline ceiling' pattern. Salvaging that entry here even though the parent agent-browser pilot didn't finalize. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

#3 (agent-browser): updated to acknowledge that the v1.2.1 deeper-eval pilot was attempted but timed out at the wrapper's 90-min hard cap mid-baseline (50/54 trials complete, no Phase 5 commit). Ships the original v1.0 Pre-flight diff (baseline 0.97; 1/9 Gemini trial used curl). Partial baseline data preserved at .results/20260512-101220/ for future analysis.

#4 (supabase): replaced the batch-1 draft with the v1.2.1 v2 result. The auto-pilot reshaped the proposal exactly per the upstream context file (filename monitor-two-pass-review.md, monitor- prefix, 4-field frontmatter, **Incorrect**/**Correct** SQL blocks, ~50 lines) so the file is convention-perfect. Honest framing: per-case breakdown shows update-without-where at 77.8% (the targeted failure pattern) but overall 0.97 baseline meant no iteration; auto-pilot's exit logic uses overall average rather than per-case minimum (v1.3 will fix).

README index updated with evidence-strength column.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
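The gap between an overall average and a per-case minimum, which this commit message identifies in the exit logic, can be shown with a small sketch. The function name `summarize_cases` and the input format (one `<case-id> <score>` pair per line) are assumptions; the real per-case data is not reproduced here.

```bash
# Hypothetical summary of per-case scores read from stdin.
# Each input line: "<case-id> <score>", with score in [0, 1].
# Prints the overall average, the minimum, and the worst case id.
summarize_cases() {
  awk '
    NF == 2 {
      v = $2 + 0            # force numeric comparison
      sum += v; n++
      if (n == 1 || v < min) { min = v; worst = $1 }
    }
    END {
      if (n == 0) { print "no cases" > "/dev/stderr"; exit 1 }
      printf "avg=%.3f min=%.3f worst=%s\n", sum / n, min, worst
    }'
}
```

With made-up input of one case at 0.778 and seven cases at 1.0, the output is `avg=0.972 min=0.778 worst=update-without-where` (if the low case carries that id): an exit rule keyed on the average sees a ceiling baseline, while one keyed on the minimum still sees a case worth iterating on.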

The shadcn-ui v1.3 dispatch with the gpt-5 frontier matrix produced clean measured uplift (+0.222 per-case-min) from a single Recipe D iteration that strengthened the file-location rule with an explicit StatusBadge BAD/GOOD example + added a Code Review Checklist section.

Draft includes:

- Honest per-case-min framing (0.667 → 0.889 on frontier matrix)
- Diff against actual upstream (verified via gh API)
- Caveats: Google CLA required, cosmetic whitespace fixes from markdownlint that should be manually reverted before submission for strict additive-only

PR target: google-labs-code/stitch-skills, file skills/shadcn-ui/SKILL.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds #6 firebase-hosting-basics (the first full end-to-end v1.3 orchestrator demo with Phase 3.5 eval-iteration). Measured uplift 0.89 → 1.00 (+0.11) on frontier matrix; orchestrator added 2 harder cases via add-harder direction before applying Recipe C.

Removes #3 agent-browser and #4 supabase from the canonical set: both ended with null/soft evidence (#3 timed out after frontier-matrix re-fire showed uplift-too-small; #4's per-case finding was identified but no measured uplift). Keeping them for internal reference only, not for team submission. They remain in PR #49 (the older bloated PR) for traceability.

Updates README to reflect 3-strong canonical set with Google CLA note for #5 and #6 (both Google-org repos: stitch-skills and firebase).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI review requested due to automatic review settings May 11, 2026 11:45

Copilot started reviewing on behalf of Zhaiyuqing2003 May 11, 2026 11:46

Copilot AI reviewed May 11, 2026

Yuqing Zhai and others added 12 commits May 11, 2026 20:15

chore(agent-browser-eval): import baseline from eval/auto-pilot/agent…

bdb4ed0


Zhaiyuqing2003 mentioned this pull request May 12, 2026: docs(pilot-runs): 3 strong upstream PR drafts ready for team greenlight #51 (Open, 6 tasks)

Zhaiyuqing2003 wants to merge 13 commits into feat/auto-improve-skill from docs/auto-pilot-runs
