docs(pilot-runs): publish batch-1 + batch-2 pilot summaries #49
Open
Zhaiyuqing2003 wants to merge 13 commits into
Moves auto-improve-skill pilot summaries from gitignored docs/superpowers/pilot-runs/ to tracked docs/pilot-runs/ so the team can review them in-tree.

Includes:
- docs/pilot-runs/README.md: directory index + reproduction recipe
- 2026-05-08-auto-improve-pilot-summary.md: batch 1 (3 skills, 3/3 success: agent-browser, supabase, pdf)
- 2026-05-09-auto-improve-batch-2-summary.md: batch 2 (10 skills, 8/10 success, 0 failures: pptx, next-best-practices, firebase-auth-basics, firebase-hosting-basics, building-native-ui, shadcn-ui, native-data-fetching, firecrawl-build-scrape, next-upgrade, prd)

Per-skill eval artifacts and proposed-upstream-changes live on eval/auto-pilot/<skill-id> branches and the consolidated batch branches (eval/auto-pilot/batch-2026-05-08, eval/auto-pilot/batch-2-2026-05-09).
Pull request overview
Publishes previously local-only auto-improve-skill pilot run summaries into tracked docs/pilot-runs/ so the team can review actual batch outcomes, patterns, and reproduction steps in-repo.
Changes:
- Adds a docs/pilot-runs/README with an index and a suggested batching workflow.
- Adds batch-1 (2026-05-08) and batch-2 (2026-05-09) human-readable pilot summaries, including results, patterns, and decision points.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| docs/pilot-runs/README.md | Adds directory index + a short recipe for running pilots in parallel via worktrees. |
| docs/pilot-runs/2026-05-08-auto-improve-pilot-summary.md | Documents batch-1 outcomes, costs, and follow-up improvements to the auto-pilot. |
| docs/pilot-runs/2026-05-09-auto-improve-batch-2-summary.md | Documents batch-2 outcomes, new patterns/failure modes, and reproduction notes. |
```bash
# This batch can be reproduced from a fresh checkout of feat/auto-improve-skill:
cd /home/yuqing/Documents/Code/skill-optimizer
```
Comment on lines +30 to +31
- Per-pilot avg: **$2.13** (well under the $3.50 budgeted)
- Plan-token spend (inner `claude -p`): each pilot reported between $0.00 and $1.00; no pilot hit the $10 wrapper cap
- **Wrapper version:** v1.1 + #3 (atomic write-and-commit, $10 default budget, lessons.md, pre-baked grader helpers)
- **Skills:** ranks 5–14 from the prioritized top-N list (skips the 4 already covered in batch 1: web-design-guidelines, agent-browser, supabase, pdf)
- **Parallelism:** 10 git worktrees, hardlinked `node_modules`, fired simultaneously
- **Wall clock:** ~50 min (set by the slowest pilot), down from an estimated ~150 min sequential
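The parallel setup in the list above could be sketched roughly as follows. This is an illustration under stated assumptions, not the wrapper's actual script: the `worktrees/<skill-id>` layout, the helper names, and the `DRY_RUN` switch are hypothetical, and `cp -al` (hardlink copy) requires GNU coreutils.

```bash
# Sketch: one git worktree per skill, sharing node_modules via hardlinks.
# Hypothetical layout: worktrees land in ./worktrees/<skill-id>.

# Print the command instead of running it when DRY_RUN=1 (illustrative switch).
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

setup_worktree() {
  local skill="$1"
  local dir="worktrees/${skill}"
  # New branch per skill, checked out in its own directory.
  run git worktree add -b "eval/auto-pilot/${skill}" "${dir}"
  # Hardlink-copy dependencies instead of reinstalling (GNU cp only).
  run cp -al node_modules "${dir}/node_modules"
}

setup_all() {
  local skill
  for skill in "$@"; do
    setup_worktree "${skill}"
  done
}
```

Running `DRY_RUN=1 setup_all pptx prd` prints the four commands without touching the repository, which is a cheap way to review the plan before creating ten worktrees.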
## Reproducing the pilots

```bash
cd /home/yuqing/Documents/Code/skill-optimizer
```
Operational guide for submitting skill-improvement PRs to the four repos we're currently working with (vercel-labs/agent-skills, vercel-labs/web-interface-guidelines, vercel-labs/agent-browser, supabase/agent-skills). Per repo: title format, body convention, CI gates, CLA status, merge style, scope guidance, and any gotchas discovered by reading AGENTS.md/CONTRIBUTING.md/workflow files plus the last 5–10 merged PRs. Future batches: append new repos as their conventions become known.
Polished PR drafts ready for operator review + submission to upstream. Each draft contains:
- Target repo + base branch
- Title in the repo's preferred convention (see upstream-pr-conventions.md)
- PR body matching the repo's style (formal/casual/terse)
- File diff or path to the full proposed file in our repo
- Caveats and gotchas specific to the repo
- Operator copy-paste shell snippet for fork → branch → commit → push → gh pr create

The 4 PRs cover 3 skills (web-design-guidelines spans 2 repos):
1. vercel-labs/agent-skills: web-design-guidelines SKILL.md two-pass workflow
2. vercel-labs/web-interface-guidelines: per-element checklist + 5 BAD/GOOD examples
3. vercel-labs/agent-browser: Pre-flight section (retargeted to skill-data/core/SKILL.md per AGENTS.md)
4. supabase/agent-skills: two-pass review reference (reformulated as a new references/ file per CONTRIBUTING.md, not a SKILL.md edit)

Sources:
- PR 1 + 2: manual web-design-guidelines run (eval/web-design-guidelines)
- PR 3: agent-browser v1.2 re-run (the small additive Pre-flight)
- PR 4: supabase batch-1 result (0.54 → 0.86, content reformulated to fit repo convention)
Adds a `--context <path>` flag to the auto-pilot wrapper that reads a markdown file and injects it into the prompt as a "Constraints" section Phase 4 must respect. Enables steering pilots toward upstream-specific targets (e.g. fetched rules docs instead of skill SKILL.md) and encoding architecture intent (additive-only, no restructure, etc.) as hard constraints.

Phase 4 + Phase 5 updated to honor target-file overrides from the constraints (e.g. edit `command.md` instead of `SKILL.md` when the context says so; package files as `before-/after-command.md` under the correct upstream-repo directory).

Includes the first context file, `tools/auto-improve-contexts/vercel-web-interface-guidelines.md`, encoding the vercel research findings: `command.md` is the canonical source distributed to 7 tools + 10 downstream consumers, restructure risk is HIGH, additive-only PRs are the merged norm (PR #23 precedent), and the AGENTS.md / README.md mirrors happen at PR-draft time.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
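A minimal sketch of the injection step described in this commit, assuming the wrapper simply appends the context file under a "Constraints" heading. `build_prompt` and its argument order are hypothetical names for illustration; the real wrapper may assemble the prompt differently.

```bash
# Sketch: append a context file to a base prompt as a "Constraints" section.
# build_prompt and its argument order are hypothetical, not the wrapper's API.
build_prompt() {
  local base_prompt="$1"
  local context_file="${2:-}"

  # Fail before emitting anything if a context path was given but is unreadable.
  if [ -n "${context_file}" ] && [ ! -r "${context_file}" ]; then
    echo "context file not readable: ${context_file}" >&2
    return 1
  fi

  printf '%s\n' "${base_prompt}"
  # No context given: the base prompt goes out unchanged.
  if [ -z "${context_file}" ]; then
    return 0
  fi
  printf '\n## Constraints\n\n'
  cat "${context_file}"
}
```

Keeping the constraints in a separate trailing section means a pilot run without a context file sees exactly the same prompt as before, which matches the backwards-compatible intent of an optional flag.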
Encodes upstream conventions discovered via gh-CLI research:
- All 28 existing references in this skill are single-rule SQL anti-pattern fixes with **Incorrect**/**Correct** SQL blocks; meta-workflow guidance is shape-novel (MEDIUM-HIGH risk of "fit the convention" pushback from gregnr/Rodriguespn).
- Prefixes locked to the 8 in `_sections.md` (`query-`, `conn-`, `security-`, `schema-`, `lock-`, `data-`, `monitor-`, `advanced-`); a `review-` prefix would require modifying `_sections.md`, which is not additive-only.
- Required reshape: pick a single concrete SQL anti-pattern that two-pass review catches and frame around it (Incorrect = single-pass miss, Correct = two-pass catch). If reshape feels contrived, surface needs-discussion signal instead of shipping borderline PR.
- Frontmatter spec corrected: 4 fields (`title`, `impact`, `impactDescription`, `tags`); previous research missed `impactDescription`. `tags` is a comma-separated string, not a YAML list.
- pnpm test:sanity does NOT validate frontmatter (corrected prior note); convention is enforced by maintainer review only.
- Release Please owns metadata.version; do not bump manually (causes merge conflicts with bot's release PR).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
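Since the commit notes that frontmatter is checked only by maintainer review, a local pre-submission check could look roughly like this. It is a sketch: `check_frontmatter` is a hypothetical helper, not part of the upstream repo, and it assumes the frontmatter sits between the first two `---` lines of the file.

```bash
# Sketch: check a reference file for the four frontmatter fields named above.
# check_frontmatter is a hypothetical helper, not part of the upstream repo.
check_frontmatter() {
  local file="$1"
  local fm field status=0

  # Frontmatter = lines between the first two '---' delimiters (assumed layout).
  fm="$(awk '/^---[[:space:]]*$/ { n++; next } n == 1 { print } n >= 2 { exit }' "${file}")"

  for field in title impact impactDescription tags; do
    if ! printf '%s\n' "${fm}" | grep -q "^${field}:"; then
      echo "missing field: ${field}" >&2
      status=1
    fi
  done

  # tags must be a comma-separated string, so reject inline or block YAML lists.
  if printf '%s\n' "${fm}" | grep -Eq '^tags:[[:space:]]*(\[|$)'; then
    echo "tags should be a comma-separated string, not a YAML list" >&2
    status=1
  fi
  return "${status}"
}
```

The function returns non-zero on any violation, so it can gate a commit hook or a draft-packaging step without parsing YAML in full.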
…-browser Carry over existing Tier-0 eval (navigate-and-report, screenshot-capture) as the starting point for deeper Tier-1 work.
- Add 4 cases (ref-based-search, ref-disambiguation, output-correctness, multi-step-state) that grade snapshot-driven @en ref discipline, ambiguous-element resolution, content correctness, and full state-machine traversal, none of which the v1 baseline covered.
- Upgrade bin/agent-browser to a stateful playback CLI: URL match -> page, per-page transitions.txt drives state changes, snapshot emits the recorded accessibility-tree fixture for current (page, state). Falls back to the legacy generic snapshot for Tier-0 continuity. Adds AB_WORK override so the CLI can be smoke-tested outside Docker.
- Add hand-fabricated recordings for 4 pages (wikipedia, signin-signup, blog-article, multistep-form) under references/agent-browser/recordings/.
- Add checks/smoke-graders.mjs running 14 GOOD/BAD assertions against hand-crafted ab-calls.log + output-file fixtures; all pass without Docker or models.
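The playback idea in the second bullet (a per-page transitions file driving which recorded snapshot is served) can be sketched as below. The file format and fixture layout here are assumptions made for illustration; the commit does not specify them, and the real bin/agent-browser may differ.

```bash
# Sketch of a transitions lookup. ASSUMED file format (not from the source):
# each line of transitions.txt is "<current-state> <action> <next-state>".
next_state() {
  local transitions_file="$1" state="$2" action="$3"
  # Print the next state for the first matching rule; stay put if none matches.
  awk -v s="${state}" -v a="${action}" '
    $1 == s && $2 == a { print $3; found = 1; exit }
    END { if (!found) print s }
  ' "${transitions_file}"
}

# ASSUMED fixture layout: <recordings-dir>/<page>/<state>.snapshot.txt
snapshot_path() {
  local recordings_dir="$1" page="$2" state="$3"
  printf '%s/%s/%s.snapshot.txt\n' "${recordings_dir}" "${page}" "${state}"
}
```

Keeping the state machine in a flat text file means a new recorded page needs only fixtures and a transitions file, with no change to the CLI itself.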
…er-1 pilot

Encodes constraints for the auto-pilot to run against the hand-built Tier-1 deeper eval (4 new cases: ref-based-search, ref-disambiguation, output-correctness, multi-step-state) without rebuilding the workbench.

Key directives:
- Workbench is already built; skip Phase 2 entirely
- Optimization target = references/agent-browser/agent-browser-core.md (the workflow content), NOT references/agent-browser/SKILL.md (the discovery stub)
- Upstream packaging target = skill-data/core/SKILL.md per AGENTS.md
- Apache-2.0 + conventional commits + ctate same-day merges for clean docs-only PRs

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The wrapper-skill PR target (`vercel-labs/agent-skills/.../web-design-guidelines/SKILL.md`) is dropped: it's a thin Claude-Code-specific adapter that WebFetches the rules doc, and editing it is low-leverage. All value lives in `vercel-labs/web-interface-guidelines/command.md` and its two stylistic siblings (`AGENTS.md`, `README.md`).

The consolidated draft at #1 carries:
- The auto-pilot's measured 22-line `command.md` insert (eval 0.92→1.00, 18 trials × 3 frontier models, 6 absence-type misses → 0)
- A MUST/SHOULD/NEVER mirror for `AGENTS.md` (style-faithful, not independently measured)
- A prose mirror for `README.md` (style-faithful, not independently measured)
- A qualitative pitch as the headline + eval data as supporting evidence (matches PR #23 precedent in this repo, which has zero quantitative evidence in any merged PR)

Old drafts moved to `superseded/` with a README explaining why each was retired. Repo PR-drafts README updated to reflect the new canonical numbering.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Captures the two structural lessons from the v1.2.1 pilot session:
1. Research-first context is mandatory (Phase 0): the auto-pilot is good at finding what to change, bad at fitting upstream conventions. Without a researched context file, output requires manual reformulation.
2. Two-loop iteration on eval AND skill (Phase 3.5): the current pipeline can't escape ceiling (>= 0.95) or floor (< 0.50) eval baselines because it only iterates the skill, treating the eval as fixed.

Backwards compatible: v1.2.1's --context flag continues to work; v1.3 phases are opt-in via --research and --auto-eval flags until validated.

Note: this commit lands on the supabase--v1-shallow branch because the agent-browser pilot is concurrently active on the main worktree; branch hygiene (move to docs/auto-pilot-runs) deferred until pilots finish.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
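The ceiling and floor thresholds quoted in this commit can be expressed as a small classifier. This is a sketch only: `classify_baseline` and the label strings are hypothetical, and what the orchestrator does with each label is not specified here.

```bash
# Sketch: classify an eval baseline using the thresholds quoted above
# (ceiling >= 0.95, floor < 0.50). classify_baseline is a hypothetical helper.
classify_baseline() {
  local score="$1"
  awk -v s="${score}" 'BEGIN {
    if (s + 0 >= 0.95) { print "ceiling" }      # skill edits cannot show uplift
    else if (s + 0 < 0.50) { print "floor" }    # skill edits unlikely to register
    else { print "in-range" }                   # baseline leaves room to measure
  }'
}
```

For example, `classify_baseline 0.97` prints `ceiling` and `classify_baseline 0.86` prints `in-range`; only the in-range case leaves headroom for a skill-only iteration loop.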
The agent-browser deeper-eval pilot timed out at the 90-min wrapper cap mid-baseline (50/54 trials complete; no Phase 5 commit). However, the supabase v2 pilot's Phase 4 instruction to append a run-record entry to lessons.md DID complete and wrote a useful observation about the 'calibrated graders cause baseline ceiling' pattern. Salvaging that entry here even though the parent agent-browser pilot didn't finalize. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
#3 (agent-browser): updated to acknowledge that the v1.2.1 deeper-eval pilot was attempted but timed out at the wrapper's 90-min hard cap mid-baseline (50/54 trials complete, no Phase 5 commit). Ships the original v1.0 Pre-flight diff (baseline 0.97; 1/9 Gemini trial used curl). Partial baseline data preserved at .results/20260512-101220/ for future analysis.

#4 (supabase): replaced the batch-1 draft with the v1.2.1 v2 result. The auto-pilot reshaped the proposal exactly per the upstream context file (filename monitor-two-pass-review.md, monitor- prefix, 4-field frontmatter, **Incorrect**/**Correct** SQL blocks, ~50 lines) so the file is convention-perfect. Honest framing: per-case breakdown shows update-without-where at 77.8% (the targeted failure pattern) but overall 0.97 baseline meant no iteration; auto-pilot's exit logic uses overall average rather than per-case minimum (v1.3 will fix).

README index updated with evidence-strength column.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
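The difference between exiting on the overall average and exiting on the per-case minimum can be shown with two small helpers. The function names are hypothetical and the scores in the note below are illustrative, not the pilot's actual per-case data.

```bash
# Sketch: compare an exit rule based on the overall mean with one based on
# the per-case minimum. Function names and inputs are illustrative only.
overall_mean() {
  printf '%s\n' "$@" | awk '{ sum += $1 } END { printf "%.3f\n", sum / NR }'
}

per_case_min() {
  printf '%s\n' "$@" | awk '
    NR == 1 || $1 + 0 < min { min = $1 + 0 }
    END { printf "%.3f\n", min }
  '
}
```

With seven cases at 1.0 and one at 0.778 (illustrative numbers), `overall_mean` gives 0.972 while `per_case_min` gives 0.778. A ceiling check on the mean would skip iteration even though one case is still failing, which is the behavior the commit says v1.3 will change.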
The shadcn-ui v1.3 dispatch with the gpt-5 frontier matrix produced clean measured uplift (+0.222 per-case-min) from a single Recipe D iteration that strengthened the file-location rule with an explicit StatusBadge BAD/GOOD example + added a Code Review Checklist section.

Draft includes:
- Honest per-case-min framing (0.667 → 0.889 on frontier matrix)
- Diff against actual upstream (verified via gh API)
- Caveats: Google CLA required, cosmetic whitespace fixes from markdownlint that should be manually reverted before submission for strict additive-only PR

target: google-labs-code/stitch-skills, file skills/shadcn-ui/SKILL.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Zhaiyuqing2003 pushed a commit that referenced this pull request May 12, 2026
Adds #6 firebase-hosting-basics (the first full end-to-end v1.3 orchestrator demo with Phase 3.5 eval-iteration). Measured uplift 0.89 → 1.00 (+0.11) on frontier matrix; orchestrator added 2 harder cases via add-harder direction before applying Recipe C.

Removes #3 agent-browser and #4 supabase from the canonical set: both ended with null/soft evidence (#3 timed out after frontier-matrix re-fire showed uplift-too-small; #4's per-case finding was identified but no measured uplift). Keeping them for internal reference only, not for team submission. They remain in PR #49 (the older bloated PR) for traceability.

Updates README to reflect 3-strong canonical set with Google CLA note for #5 and #6 (both Google-org repos: stitch-skills and firebase).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Stacked on PR #46 (`feat/auto-improve-skill`).

Publishes the auto-improve-skill pilot summaries that were previously local-only (in the gitignored `docs/superpowers/pilot-runs/`). Moves them to tracked `docs/pilot-runs/` so the team can review the actual run records, not just the per-PR commit messages.

What's in here
- `2026-05-08-auto-improve-pilot-summary.md`
- `2026-05-09-auto-improve-batch-2-summary.md`
- `README.md`

Why these are worth seeing
- `lessons.md` by letter in batch-2 pilots 4, 6, 8.
- `next-upgrade` went 0.83 → 0.76. Auto-pilot reported it without dressing it up. Surfaces new failure modes (CLI fabrication, "don't add bash for small models").

Per-skill eval artifacts live on `eval/auto-pilot/<skill-id>` branches (PRs #47, #48); this PR is just the human-readable digest.

🤖 Generated with Claude Code