docs(pilot-runs): publish batch-1 + batch-2 pilot summaries #49

Open
Zhaiyuqing2003 wants to merge 13 commits into feat/auto-improve-skill from docs/auto-pilot-runs

@Zhaiyuqing2003

Stacked on PR #46 (feat/auto-improve-skill).

Publishes the auto-improve-skill pilot summaries that were previously
local-only (in the gitignored docs/superpowers/pilot-runs/). Moves
them to tracked docs/pilot-runs/ so the team can review the actual
run records, not just the per-PR commit messages.

What's in here

| File | Skills | Results |
| --- | --- | --- |
| 2026-05-08-auto-improve-pilot-summary.md | 3 (agent-browser, supabase, pdf) | 3/3 success |
| 2026-05-09-auto-improve-batch-2-summary.md | 10 (pptx, next-best-practices, firebase-auth-basics, firebase-hosting-basics, building-native-ui, shadcn-ui, native-data-fetching, firecrawl-build-scrape, next-upgrade, prd) | 8/10 success, 0 failures |
| README.md | n/a | directory index + reproduction recipe |

Why these are worth seeing

  • Pattern transfer validated: auto-pilot rediscovered the "two-pass workflow for absence-type rules" insight on supabase (batch 1), then cited Recipe A/D/E from lessons.md by letter in batch-2 pilots 4, 6, 8.
  • Already-good detection works: 5 of 13 total pilots correctly exited clean without proposing changes.
  • One honest regression captured: next-upgrade went 0.83 → 0.76. Auto-pilot reported it without dressing it up. Surfaces new failure modes (CLI fabrication, "don't add bash for small models").
  • Reproducibility: worktree-per-pilot batching + hardlinked node_modules works at N=10 parallel. ~$21 OpenRouter for 10 pilots, ~50 min wall clock.
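
For reviewers who want a feel for the worktree-per-pilot batching, here is a minimal sketch that only builds the commands for each pilot. It is not the wrapper's actual code: the worktree directory naming and the use of GNU `cp -al` for hardlinking `node_modules` are assumptions, and `pilot_worktree_commands` is a hypothetical helper name. Only the `eval/auto-pilot/<skill-id>` branch naming comes from this PR.

```python
from pathlib import PurePosixPath


def pilot_worktree_commands(skill_ids, repo_root, base_branch="feat/auto-improve-skill"):
    """Build (but do not run) the commands that set up one worktree per pilot.

    Assumptions, not taken from the PR: the sibling-directory layout and
    hardlinking node_modules with GNU `cp -al` instead of reinstalling.
    """
    repo_root = PurePosixPath(repo_root)
    commands = []
    for skill_id in skill_ids:
        worktree = repo_root.parent / f"{repo_root.name}-pilot-{skill_id}"
        branch = f"eval/auto-pilot/{skill_id}"
        # One isolated checkout per pilot, each on its own branch.
        commands.append(
            ["git", "-C", str(repo_root), "worktree", "add", "-b", branch, str(worktree), base_branch]
        )
        # Hardlink the dependency tree so all worktrees share one copy on disk.
        commands.append(
            ["cp", "-al", str(repo_root / "node_modules"), str(worktree / "node_modules")]
        )
    return commands


if __name__ == "__main__":
    for argv in pilot_worktree_commands(["pptx", "prd"], "/tmp/skill-optimizer"):
        print(" ".join(argv))
```

Returning argument lists rather than running them keeps the sketch side-effect free; a real batch runner would pass each list to `subprocess.run` and then launch the pilots in parallel.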

Per-skill eval artifacts live on eval/auto-pilot/<skill-id> branches (PRs #47, #48); this PR is just the human-readable digest.

🤖 Generated with Claude Code

Moves auto-improve-skill pilot summaries from gitignored
docs/superpowers/pilot-runs/ to tracked docs/pilot-runs/ so the team
can review them in-tree.

Includes:

- docs/pilot-runs/README.md — directory index + reproduction recipe
- 2026-05-08-auto-improve-pilot-summary.md — batch 1 (3 skills, 3/3
  success: agent-browser, supabase, pdf)
- 2026-05-09-auto-improve-batch-2-summary.md — batch 2 (10 skills,
  8/10 success, 0 failures: pptx, next-best-practices, firebase-auth-basics,
  firebase-hosting-basics, building-native-ui, shadcn-ui, native-data-fetching,
  firecrawl-build-scrape, next-upgrade, prd)

Per-skill eval artifacts and proposed-upstream-changes live on
eval/auto-pilot/<skill-id> branches and the consolidated batch branches
(eval/auto-pilot/batch-2026-05-08, eval/auto-pilot/batch-2-2026-05-09).
Copilot AI review requested due to automatic review settings May 11, 2026 11:45

Copilot AI left a comment

Pull request overview

Publishes previously local-only auto-improve-skill pilot run summaries into tracked docs/pilot-runs/ so the team can review actual batch outcomes, patterns, and reproduction steps in-repo.

Changes:

  • Adds a docs/pilot-runs/ README with an index and a suggested batching workflow.
  • Adds batch-1 (2026-05-08) and batch-2 (2026-05-09) human-readable pilot summaries, including results, patterns, and decision points.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| docs/pilot-runs/README.md | Adds directory index + a short recipe for running pilots in parallel via worktrees. |
| docs/pilot-runs/2026-05-08-auto-improve-pilot-summary.md | Documents batch-1 outcomes, costs, and follow-up improvements to the auto-pilot. |
| docs/pilot-runs/2026-05-09-auto-improve-batch-2-summary.md | Documents batch-2 outcomes, new patterns/failure modes, and reproduction notes. |


```bash
# This batch can be reproduced from a fresh checkout of feat/auto-improve-skill:
cd /home/yuqing/Documents/Code/skill-optimizer
```

Comment on lines +30 to +31

- Per-pilot avg: **$2.13** (well under the $3.50 budgeted)
- Plan-token spend (inner `claude -p`): each pilot reported between $0.00 and $1.00; no pilot hit the $10 wrapper cap

- **Wrapper version:** v1.1 + #3 (atomic write-and-commit, $10 default budget, lessons.md, pre-baked grader helpers)
- **Skills:** ranks 5–14 from the prioritized top-N list (skips the 4 already covered in batch 1: web-design-guidelines, agent-browser, supabase, pdf)
- **Parallelism:** 10 git worktrees, hardlinked `node_modules`, fired simultaneously
- **Wall clock:** ~50 min (slowest pilot to longest), down from estimated ~150 min sequential

## Reproducing the pilots

```bash
cd /home/yuqing/Documents/Code/skill-optimizer
```

Yuqing Zhai and others added 12 commits May 11, 2026 20:15
Operational guide for submitting skill-improvement PRs to the four
repos we're currently working with (vercel-labs/agent-skills,
vercel-labs/web-interface-guidelines, vercel-labs/agent-browser,
supabase/agent-skills).

Per repo: title format, body convention, CI gates, CLA status, merge
style, scope guidance, and any gotchas discovered by reading
AGENTS.md/CONTRIBUTING.md/workflow files plus the last 5–10 merged
PRs.

Future batches: append new repos as their conventions become known.

Polished PR drafts ready for operator review + submission to upstream.
Each draft contains:

- Target repo + base branch
- Title in the repo's preferred convention (see upstream-pr-conventions.md)
- PR body matching the repo's style (formal/casual/terse)
- File diff or path to the full proposed file in our repo
- Caveats and gotchas specific to the repo
- Operator copy-paste shell snippet for fork → branch → commit → push → gh pr create

The 4 PRs cover 3 skills (web-design-guidelines spans 2 repos):

1. vercel-labs/agent-skills — web-design-guidelines SKILL.md two-pass workflow
2. vercel-labs/web-interface-guidelines — per-element checklist + 5 BAD/GOOD examples
3. vercel-labs/agent-browser — Pre-flight section (retargeted to
   skill-data/core/SKILL.md per AGENTS.md)
4. supabase/agent-skills — two-pass review reference (reformulated as a
   new references/ file per CONTRIBUTING.md, not a SKILL.md edit)

Sources:
- PR 1 + 2: manual web-design-guidelines run (eval/web-design-guidelines)
- PR 3: agent-browser v1.2 re-run (the small additive Pre-flight)
- PR 4: supabase batch-1 result (0.54 → 0.86, content reformulated to
  fit repo convention)

Adds a `--context <path>` flag to the auto-pilot wrapper that reads a
markdown file and injects it into the prompt as a "Constraints" section
Phase 4 must respect. Enables steering pilots toward upstream-specific
targets (e.g. fetched rules docs instead of skill SKILL.md) and
encoding architecture intent (additive-only, no restructure, etc.) as
hard constraints.

Phase 4 + Phase 5 updated to honor target-file overrides from the
constraints (e.g. edit `command.md` instead of `SKILL.md` when the
context says so; package files as `before-/after-command.md` under the
correct upstream-repo directory).

Includes the first context file:
`tools/auto-improve-contexts/vercel-web-interface-guidelines.md`,
encoding the vercel research findings — `command.md` is the canonical
source distributed to 7 tools + 10 downstream consumers, restructure
risk is HIGH, additive-only PRs are the merged norm (PR #23 precedent),
and the AGENTS.md / README.md mirrors happen at PR-draft time.
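
A minimal sketch of what the injection step could look like, assuming the wrapper assembles its prompt as a plain string. The function name `inject_constraints`, the heading level, and appending the section at the end of the prompt are all assumptions for illustration; only the "Constraints" section name comes from this commit message.

```python
def inject_constraints(base_prompt: str, context_markdown: str) -> str:
    """Append a context file's markdown to the prompt under a Constraints heading.

    Hypothetical helper: the real wrapper may place or format the section
    differently. An empty context leaves the prompt unchanged.
    """
    context = context_markdown.strip()
    if not context:
        return base_prompt
    return f"{base_prompt.rstrip()}\n\n## Constraints\n\n{context}\n"


if __name__ == "__main__":
    prompt = inject_constraints(
        "Run Phase 4 on the target skill.",
        "- Edit `command.md`, not `SKILL.md`.\n- Additive-only changes.\n",
    )
    print(prompt)
```

Keeping the constraints in a clearly delimited section at the end makes it easy for later phases to quote them back when checking that a proposed change respects the target-file override.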

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Encodes upstream conventions discovered via gh-CLI research:
- All 28 existing references in this skill are single-rule SQL
  anti-pattern fixes with **Incorrect**/**Correct** SQL blocks; meta-workflow
  guidance is shape-novel (MEDIUM-HIGH risk of "fit the convention"
  pushback from gregnr/Rodriguespn).
- Prefixes locked to the 8 in `_sections.md` (`query-`, `conn-`,
  `security-`, `schema-`, `lock-`, `data-`, `monitor-`, `advanced-`); a
  `review-` prefix would require modifying `_sections.md` which is not
  additive-only.
- Required reshape: pick a single concrete SQL anti-pattern that
  two-pass review catches and frame around it (Incorrect = single-pass
  miss, Correct = two-pass catch). If reshape feels contrived, surface
  needs-discussion signal instead of shipping borderline PR.
- Frontmatter spec corrected: 4 fields (`title`, `impact`,
  `impactDescription`, `tags`); previous research missed
  `impactDescription`. `tags` is comma-separated string, not YAML list.
- pnpm test:sanity does NOT validate frontmatter (corrected prior note);
  convention is enforced by maintainer review only.
- Release Please owns metadata.version; do not bump manually (causes
  merge conflicts with bot's release PR).
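
Since the note above says frontmatter is enforced by maintainer review only, a local pre-submission check may be useful. The following is a rough sketch under stated assumptions: `check_reference` and `parse_frontmatter` are hypothetical helpers, the parser only handles simple `key: value` lines (it is not a YAML parser), and the sample field values in the demo are invented. The prefix list and the four field names are the ones listed in this commit message.

```python
ALLOWED_PREFIXES = (
    "query-", "conn-", "security-", "schema-",
    "lock-", "data-", "monitor-", "advanced-",
)
REQUIRED_FIELDS = ("title", "impact", "impactDescription", "tags")


def parse_frontmatter(text):
    """Parse a leading '---' delimited block of simple 'key: value' lines."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def check_reference(filename, text):
    """Return a list of convention problems; an empty list means none found."""
    problems = []
    if not filename.startswith(ALLOWED_PREFIXES):
        problems.append(f"prefix not in _sections.md: {filename}")
    fields = parse_frontmatter(text)
    for name in REQUIRED_FIELDS:
        # A YAML block list under `tags:` leaves the value empty and lands here.
        if not fields.get(name):
            problems.append(f"missing frontmatter field: {name}")
    if fields.get("tags", "").startswith("["):
        problems.append("tags should be a comma-separated string, not a YAML list")
    return problems


if __name__ == "__main__":
    sample = (
        "---\n"
        "title: Two-pass review\n"          # invented sample values
        "impact: HIGH\n"
        "impactDescription: catches a missing WHERE clause\n"
        "tags: review, update\n"
        "---\n"
    )
    print(check_reference("monitor-two-pass-review.md", sample))
    print(check_reference("review-two-pass.md", sample))
```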

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…-browser

Carry over existing Tier-0 eval (navigate-and-report, screenshot-capture)
as the starting point for deeper Tier-1 work.
- Add 4 cases (ref-based-search, ref-disambiguation, output-correctness,
  multi-step-state) that grade snapshot-driven @en ref discipline,
  ambiguous-element resolution, content correctness, and full
  state-machine traversal — none of which the v1 baseline covered.
- Upgrade bin/agent-browser to a stateful playback CLI: URL match -> page,
  per-page transitions.txt drives state changes, snapshot emits the
  recorded accessibility-tree fixture for current (page, state). Falls
  back to the legacy generic snapshot for Tier-0 continuity. Adds AB_WORK
  override so the CLI can be smoke-tested outside Docker.
- Add hand-fabricated recordings for 4 pages (wikipedia, signin-signup,
  blog-article, multistep-form) under references/agent-browser/recordings/.
- Add checks/smoke-graders.mjs running 14 GOOD/BAD assertions against
  hand-crafted ab-calls.log + output-file fixtures; all pass without
  Docker or models.
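
The playback behavior described above (URL match selects a page, transitions drive state, snapshot returns the fixture for the current page and state) can be sketched roughly as follows. This is an illustration, not the contents of `bin/agent-browser`: the `state | action | next_state` line format for `transitions.txt`, the `Playback` class, the `initial` state name, and the fixture strings are all assumptions.

```python
GENERIC_SNAPSHOT = "- document"  # stand-in for the legacy generic snapshot


def parse_transitions(text):
    """Parse lines of 'state | action | next_state' (a hypothetical format)."""
    table = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # allow trailing comments
        if not line:
            continue
        state, action, next_state = (part.strip() for part in line.split("|"))
        table[(state, action)] = next_state
    return table


class Playback:
    def __init__(self, pages):
        # pages maps a URL prefix to {"transitions": {...}, "snapshots": {state: text}}
        self.pages = pages
        self.page = None
        self.state = "initial"

    def open(self, url):
        """Select the recorded page whose URL prefix matches, resetting state."""
        for prefix, page in self.pages.items():
            if url.startswith(prefix):
                self.page, self.state = page, "initial"
                return True
        self.page = None
        return False

    def act(self, action):
        """Advance state if the action is a recorded transition; otherwise stay put."""
        if self.page is not None:
            self.state = self.page["transitions"].get((self.state, action), self.state)

    def snapshot(self):
        """Return the fixture for (page, state), falling back to the generic snapshot."""
        if self.page is None:
            return GENERIC_SNAPSHOT
        return self.page["snapshots"].get(self.state, GENERIC_SNAPSHOT)
```

Treating unrecorded actions as no-ops keeps the fixture deterministic: a model that clicks the wrong ref simply sees the same snapshot again, which the graders can then detect from the call log.
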
…er-1 pilot

Encodes constraints for the auto-pilot to run against the hand-built
Tier-1 deeper eval (4 new cases: ref-based-search, ref-disambiguation,
output-correctness, multi-step-state) without rebuilding the workbench.

Key directives:
- Workbench is already built — skip Phase 2 entirely
- Optimization target = references/agent-browser/agent-browser-core.md
  (the workflow content), NOT references/agent-browser/SKILL.md (the
  discovery stub)
- Upstream packaging target = skill-data/core/SKILL.md per AGENTS.md
- Apache-2.0 + conventional commits + ctate same-day merges for clean
  docs-only PRs

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The wrapper-skill PR target (`vercel-labs/agent-skills/.../web-design-guidelines/SKILL.md`)
is dropped — it's a thin Claude-Code-specific adapter that
WebFetches the rules doc, and editing it is low-leverage. All value
lives in `vercel-labs/web-interface-guidelines/command.md` and its
two stylistic siblings (`AGENTS.md`, `README.md`).

The consolidated draft at #1 carries:
- The auto-pilot's measured 22-line `command.md` insert (eval 0.92→1.00,
  18 trials × 3 frontier models, 6 absence-type misses → 0)
- A MUST/SHOULD/NEVER mirror for `AGENTS.md` (style-faithful, not
  independently measured)
- A prose mirror for `README.md` (style-faithful, not independently
  measured)
- A qualitative pitch as the headline + eval data as supporting
  evidence (matches PR #23 precedent in this repo, which has zero
  quantitative evidence in any merged PR)

Old drafts moved to `superseded/` with a README explaining why each
was retired. Repo PR-drafts README updated to reflect the new
canonical numbering.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Captures the two structural lessons from the v1.2.1 pilot session:
1. Research-first context is mandatory (Phase 0): the auto-pilot is
   good at finding what to change, bad at fitting upstream conventions.
   Without a researched context file, output requires manual reformulation.
2. Two-loop iteration on eval AND skill (Phase 3.5): the current
   pipeline can't escape ceiling (>= 0.95) or floor (< 0.50) eval
   baselines because it only iterates the skill, treating the eval as
   fixed.
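
To make the second lesson concrete, here is a small sketch of how a baseline score could route to one loop or the other. The 0.95 and 0.50 cutoffs are the ones stated above; the function name `next_loop` and the returned labels are hypothetical, and what the eval loop should do at the floor is an assumption (this message only describes the problem, not the remedy).

```python
CEILING = 0.95  # at or above: no headroom left to measure skill uplift
FLOOR = 0.50    # below: the eval itself is suspect, per the lesson above


def next_loop(baseline: float) -> str:
    """Pick which loop to run next for a given baseline score (illustrative)."""
    if baseline >= CEILING:
        return "iterate-eval:add-harder"
    if baseline < FLOOR:
        return "iterate-eval:revisit-cases"  # assumed remedy, not from the source
    return "iterate-skill"


if __name__ == "__main__":
    for score in (0.97, 0.54, 0.40):
        print(score, next_loop(score))
```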

Backwards compatible — v1.2.1's --context flag continues to work; v1.3
phases are opt-in via --research and --auto-eval flags until validated.

Note: this commit lands on the supabase--v1-shallow branch because the
agent-browser pilot is concurrently active on the main worktree;
branch hygiene (move to docs/auto-pilot-runs) deferred until pilots
finish.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The agent-browser deeper-eval pilot timed out at the 90-min wrapper cap
mid-baseline (50/54 trials complete; no Phase 5 commit). However, the
supabase v2 pilot's Phase 4 instruction to append a run-record entry to
lessons.md DID complete and wrote a useful observation about the
'calibrated graders cause baseline ceiling' pattern. Salvaging that
entry here even though the parent agent-browser pilot didn't finalize.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

#3 (agent-browser): updated to acknowledge that the v1.2.1 deeper-eval
pilot was attempted but timed out at the wrapper's 90-min hard cap
mid-baseline (50/54 trials complete, no Phase 5 commit). Ships the
original v1.0 Pre-flight diff (baseline 0.97; 1/9 Gemini trial used
curl). Partial baseline data preserved at .results/20260512-101220/
for future analysis.

#4 (supabase): replaced the batch-1 draft with the v1.2.1 v2 result.
The auto-pilot reshaped the proposal exactly per the upstream context
file (filename monitor-two-pass-review.md, monitor- prefix, 4-field
frontmatter, **Incorrect**/**Correct** SQL blocks, ~50 lines) so the
file is convention-perfect. Honest framing: per-case breakdown shows
update-without-where at 77.8% (the targeted failure pattern) but
overall 0.97 baseline meant no iteration; auto-pilot's exit logic
uses overall average rather than per-case minimum (v1.3 will fix).
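
The gap between the two exit rules can be shown with a short sketch. Only the `update-without-where` score (77.8%) and the 0.97 overall figure come from this commit; the seven other cases, their perfect scores, and the 0.95 exit threshold are assumptions chosen so the average lands near the reported 0.97.

```python
from statistics import mean

EXIT_THRESHOLD = 0.95  # assumed cutoff; the wrapper's real value is not given here


def should_iterate_on_average(case_scores):
    """Exit rule described above for v1.2.1: look only at the overall average."""
    return mean(case_scores.values()) < EXIT_THRESHOLD


def should_iterate_on_minimum(case_scores):
    """Alternative rule: iterate if any single case is below the threshold."""
    return min(case_scores.values()) < EXIT_THRESHOLD


# One real data point plus seven hypothetical cases at 1.0.
scores = {"update-without-where": 0.778}
scores.update({f"hypothetical-case-{i}": 1.0 for i in range(1, 8)})

if __name__ == "__main__":
    print(f"average={mean(scores.values()):.3f}, minimum={min(scores.values()):.3f}")
    print("iterate (average rule):", should_iterate_on_average(scores))
    print("iterate (minimum rule):", should_iterate_on_minimum(scores))
```

Under these assumed scores the average rule exits while the minimum rule keeps iterating on the weak case, which is the behavior the v1.3 fix is meant to produce.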

README index updated with evidence-strength column.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The shadcn-ui v1.3 dispatch with the gpt-5 frontier matrix produced
clean measured uplift (+0.222 per-case-min) from a single Recipe D
iteration that strengthened the file-location rule with an explicit
StatusBadge BAD/GOOD example + added a Code Review Checklist section.

Draft includes:
- Honest per-case-min framing (0.667 → 0.889 on frontier matrix)
- Diff against actual upstream (verified via gh API)
- Caveats: Google CLA required, cosmetic whitespace fixes from
  markdownlint that should be manually reverted before submission
  for strict additive-only

PR target: google-labs-code/stitch-skills, file skills/shadcn-ui/SKILL.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Zhaiyuqing2003 pushed a commit that referenced this pull request May 12, 2026
Adds #6 firebase-hosting-basics (the first full end-to-end v1.3
orchestrator demo with Phase 3.5 eval-iteration). Measured uplift
0.89 → 1.00 (+0.11) on frontier matrix; orchestrator added 2 harder
cases via add-harder direction before applying Recipe C.

Removes #3 agent-browser and #4 supabase from the canonical set —
both ended with null/soft evidence (#3 timed out after frontier-matrix
re-fire showed uplift-too-small; #4's per-case finding was identified
but no measured uplift). Keeping them for internal reference only,
not for team submission. They remain in PR #49 (the older bloated PR)
for traceability.

Updates README to reflect 3-strong canonical set with Google CLA note
for #5 and #6 (both Google-org repos: stitch-skills and firebase).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>