feat(auto-pilot): /auto-improve-skill wrapper + prompt template #46
Open
Zhaiyuqing2003 wants to merge 6 commits into
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Reviewer flagged that the prompt referenced case-source files that don't exist on this branch (web-design-guidelines/checks/, find-skills/). Make the prompt self-sufficient:
- Inline _grader-utils.mjs content under Phase 2 step 4
- Soften 'mirror <path>' references to advisory
- Add minimal Cases-table README skeleton in Phase 2 step 6
- Explicit file list in commit step so .run.log can never sneak in

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Default 3.50 (unchanged). Pilot #1 (agent-browser) hit the original 3.50 cap mid-iteration before reaching the "Always: commit" step, losing the run record. With --budget 15 the same pilot completed cleanly: 0.56 → 1.00, +0.44 uplift, $3.15 actual spend.

Operator usage: `node tools/auto-improve-skill.mjs <slug> --budget 15`
Pull request overview
Adds an “auto-pilot” workflow for generating and iterating on workbench eval cases for a given public skill slug by running a claude -p inner agent against a bundled prompt template, and capturing run logs.
Changes:
- Introduces `tools/auto-improve-skill.mjs` to spawn `claude -p`, tee output to console + per-case `.run.log`, and enforce a 90-minute wall-clock timeout.
- Adds `tools/auto-improve-skill-prompt.md`, a multi-phase prompt template that directs the inner agent to discover a skill, build an eval suite, baseline it, iterate up to 2 times, and package proposed upstream changes.
- Ignores per-run `.run.log` files under `examples/workbench/*/`.
Reviewed changes
Copilot reviewed 2 out of 3 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| tools/auto-improve-skill.mjs | New Node wrapper that runs claude -p with a templated prompt, logs output, and enforces a timeout. |
| tools/auto-improve-skill-prompt.md | Prompt template defining the autonomous 5-phase skill improvement loop and expected artifacts. |
| .gitignore | Ignores auto-pilot wrapper log files under examples/workbench/*/.run.log. |
Comment on lines +54 to +64
| if (existsSync(caseDir) && !FORCE) { | ||
| console.error(`refusing: ${caseDir} already exists. Pass --force to overwrite.`); | ||
| process.exit(2); | ||
| } | ||
| mkdirSync(caseDir, { recursive: true }); | ||
|
|
||
| const promptTemplate = readFileSync(PROMPT_PATH, 'utf-8'); | ||
| const prompt = promptTemplate.replace(/\$\{SLUG\}/g, slug).replace(/\$\{SKILL_ID\}/g, skillId); | ||
|
|
||
| const logPath = join(caseDir, '.run.log'); | ||
| const logStream = createWriteStream(logPath, { flags: 'a' }); |
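
A note on the placeholder substitution in this hunk: `String.prototype.replace` with a string replacement interprets sequences such as `$&` inside the replacement value. The sketch below shows one way to avoid that by passing a replacer function. `renderPrompt` and the example values are hypothetical names for illustration, not code from this PR.

```javascript
// Sketch only: renderPrompt is a hypothetical helper, not part of the PR.
function renderPrompt(template, values) {
  // A replacer function inserts its return value literally, so "$"
  // sequences inside a slug (for example "$&") are not expanded.
  return template.replace(/\$\{([A-Z_]+)\}/g, (match, name) => {
    // Placeholders we do not know about are left untouched, since the
    // prompt may legitimately contain shell-style ${VAR} text.
    if (!Object.prototype.hasOwnProperty.call(values, name)) {
      return match;
    }
    return String(values[name]);
  });
}

const rendered = renderPrompt(
  'Improve ${SLUG} in examples/workbench/${SKILL_ID}/',
  { SLUG: 'owner/repo/skill', SKILL_ID: 'skill' },
);
console.log(rendered);
```

A single pass over the template also means a substituted value is never rescanned for further placeholders.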
Comment on lines +78 to +82
| const child = spawn('claude', claudeArgs, { | ||
| cwd: REPO_ROOT, | ||
| env: childEnv, | ||
| stdio: ['ignore', 'pipe', 'pipe'], | ||
| }); |
Comment on lines +95 to +106
| logStream.end(); | ||
| if (timedOut) { | ||
| console.error(`\n[wrapper] claude -p exceeded ${PER_CALL_TIMEOUT_MS / 60000}-min timeout`); | ||
| process.exit(124); | ||
| } | ||
| const analysisPath = join(caseDir, 'analysis.md'); | ||
| if (existsSync(analysisPath)) { | ||
| console.log(`\n[wrapper] analysis.md: ${analysisPath}`); | ||
| } else { | ||
| console.error(`\n[wrapper] no analysis.md was written; check ${logPath}`); | ||
| } | ||
| process.exit(code ?? 1); |
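
The exit-status decision in this hunk can be read as a small pure function, which makes it easier to test without spawning a child process. This is a minimal sketch mirroring the hunk's behavior; `exitCodeFor` and `TIMEOUT_EXIT_CODE` are hypothetical names.

```javascript
// Sketch only: exitCodeFor is a hypothetical helper, not part of the PR.
const TIMEOUT_EXIT_CODE = 124; // value the hunk passes to process.exit on timeout

function exitCodeFor({ timedOut, code }) {
  if (timedOut) return TIMEOUT_EXIT_CODE;
  // `code` is null when the child was terminated by a signal;
  // the hunk maps that case to a generic failure status of 1.
  return code ?? 1;
}

console.log(exitCodeFor({ timedOut: true, code: null }));
console.log(exitCodeFor({ timedOut: false, code: 0 }));
```

124 is also the status GNU `timeout` reports when a command times out, so shell callers can treat both the same way. As in the hunk, a missing `analysis.md` does not change the status in this sketch.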
| 1. From the case directory, run: | ||
|
|
||
| ```bash | ||
| set -a; . ./.env; set +a |
Comment on lines +80 to +95
| sample, sharing `checks/_grader-utils.mjs`. Write the following | ||
| file content to `examples/workbench/${SKILL_ID}/checks/_grader-utils.mjs` | ||
| (verbatim): | ||
|
|
||
| ```js | ||
| // Shared grader logic for web-design-guidelines eval cases. | ||
| // | ||
| // Each finding is assumed to be one line in findings.txt that references | ||
| // "<File>.tsx:<line>" (line numbers come from the agent — they're often | ||
| // off by ±1-2 due to LLM line-counting). A violation is considered "found" | ||
| // when at least one finding line: | ||
| // (a) references a line number within the violation's accepted range, AND | ||
| // (b) contains at least one of the violation's distinguishing keywords. | ||
| // | ||
| // This per-finding-line check prevents spurious cross-matches (e.g. the | ||
| // keyword "label" from a different finding being credited to a paste rule). |
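
The per-finding-line rule described in the comment above can be sketched as follows. This is an illustration under assumptions, not the inlined `_grader-utils.mjs`: the violation shape (`file`, `range`, `keywords`) and the name `isViolationFound` are hypothetical.

```javascript
// Sketch only: the violation shape and function name are assumptions.
function isViolationFound(findingsText, violation) {
  const [lo, hi] = violation.range;
  return findingsText.split('\n').some((line) => {
    // Collect every "<File>.tsx:<line>" reference on this finding line.
    const refs = [...line.matchAll(/([\w.-]+\.tsx):(\d+)/g)];
    const inRange = refs.some(
      ([, file, num]) =>
        file === violation.file && Number(num) >= lo && Number(num) <= hi,
    );
    // Keywords are expected to be non-global regexes, so .test() keeps no state.
    const hasKeyword = violation.keywords.some((kw) => kw.test(line));
    // Both conditions must hold on the SAME line to credit the violation.
    return inRange && hasKeyword;
  });
}

const findings = [
  'Form.tsx:12 input is missing a label',
  'Form.tsx:40 paste is blocked on the password field',
].join('\n');

console.log(isViolationFound(findings, {
  file: 'Form.tsx', range: [38, 42], keywords: [/paste/i],
}));
```

Because the range and keyword checks are evaluated per line, the word "label" on line 12 cannot be credited to a rule whose accepted range is 38 to 42.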
| - Else loop. | ||
|
|
||
| **Cost guard:** sum `metrics.cost.total` from each run's `result.json`. | ||
| If cumulative cost > $3.00, exit `status: budget-exceeded` immediately. |
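
The cost guard in this hunk amounts to a sum and a comparison. The sketch below works on already-parsed `result.json` objects; the function names, the in-memory `results` array, and the `'ok'` status are assumptions for illustration, and the 3.00 cap is the value quoted in this hunk.

```javascript
// Sketch only: names and the 'ok' status are hypothetical; the real loop
// would load each run's result.json from disk before calling this.
const COST_CAP_USD = 3.0; // cap quoted in the prompt hunk

function cumulativeCost(results) {
  // Runs with no recorded cost contribute 0 rather than NaN.
  return results.reduce((sum, r) => sum + (r?.metrics?.cost?.total ?? 0), 0);
}

function budgetStatus(results, capUsd = COST_CAP_USD) {
  const spent = cumulativeCost(results);
  // Strictly greater than the cap, matching "cumulative cost > $3.00".
  return { spent, status: spent > capUsd ? 'budget-exceeded' : 'ok' };
}

const runs = [
  { metrics: { cost: { total: 1.5 } } },
  { metrics: { cost: { total: 1.75 } } },
];
console.log(budgetStatus(runs).status);
```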
Comment on lines +257 to +265
| 2. **Modify** — write a *minimal additive* edit: | ||
| - Add a per-element checklist entry to the rules doc. | ||
| - Add a BAD/GOOD code example for a missed rule. | ||
| - Add a two-pass-workflow nudge to the SKILL.md. | ||
| - Tighten ambiguous rule wording. | ||
|
|
||
| Edits must be additive: no rule deletions, no wording changes to | ||
| existing rules. | ||
|
|
Comment on lines +157 to +164
| 5. Write `suite.yml` with the standard 3-model matrix: | ||
|
|
||
| ```yaml | ||
| models: | ||
| - openrouter/anthropic/claude-sonnet-4.6 | ||
| - openrouter/openai/gpt-5-mini | ||
| - openrouter/google/gemini-2.5-pro | ||
| env: |
added 2 commits
May 8, 2026 14:31
Three changes informed by the 3-skill pilot batch (PR #47):

1. **"Always: write analysis.md AND commit" merged into a single atomic step.** Pilots #1b and #2 wrote analysis.md but ran out of budget before reaching the separate commit step, leaving case files uncommitted. The merged section explicitly tells the agent to skip everything else if budget is low and finish this section first.
2. **Default --max-budget-usd bumped 3.50 → 10.00.** Pilot #1's first real-data attempt died at the cap mid-modification. Pilot #1c at --budget 15 settled at $3.15 with full success. The prompt's Phase-4 self-cap also moved from $3.00 to $7.00 to leave a $2-3 buffer for the analysis.md + commit cleanup below the wrapper hard cap.
3. **New tools/auto-improve-skill-lessons.md**: a living doc the prompt reads as Phase-4 prior. Captures recipes A-E (two-pass workflow, verify-tool-installed, per-element checklists, BAD/GOOD examples, rationale + bug-story) and grader-reliability patterns G1-G6 (line tolerance, hyphen regex, per-finding-line matching, keyword variants, set-semantics, verbosity floor) with empirical evidence from the manual web-design-guidelines run + the 3 auto pilots. Phase 4 of the prompt now references the recipes by letter so the auto-pilot doesn't rediscover patterns from scratch each run.

Also fixes a slug-parsing regression introduced by the --budget flag (when --budget was absent, the filter wrongly skipped argv[0]).

Smoke tests pass: bare invocation prints usage, "nope" gives bad-slug, existing dir gets refused, --budget validates input.
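
For reference, one way to parse the arguments so that the `--budget` value is never confused with the slug is to consume the flag's value inside the loop. This is a sketch under assumptions, not the wrapper's actual code: `parseArgs`, its return shape, the error labels, and the slug rule (at least one `/` between non-empty parts) are all hypothetical. The default of 10.00 is the value from this commit.

```javascript
// Sketch only: parseArgs, its return shape, and the slug rule are assumptions.
function parseArgs(argv, defaultBudget = 10.0) {
  let budget = defaultBudget;
  let force = false;
  const positional = [];
  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    if (arg === '--force') {
      force = true;
    } else if (arg === '--budget') {
      // Consume the value here so it is never treated as a positional.
      const raw = argv[++i];
      budget = Number(raw);
      if (raw === undefined || !Number.isFinite(budget) || budget <= 0) {
        return { error: 'bad-budget' };
      }
    } else {
      positional.push(arg);
    }
  }
  if (positional.length === 0) return { error: 'usage' };
  const slug = positional[0];
  // Assumed rule: non-empty parts separated by "/", no whitespace.
  if (!/^[^/\s]+(\/[^/\s]+)+$/.test(slug)) return { error: 'bad-slug' };
  return { slug, budget, force };
}

console.log(parseArgs(['owner/repo/skill', '--budget', '15']));
```

With this shape the slug is the first positional regardless of whether `--budget` appears before it, after it, or not at all.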
Adds three grader-helper utilities to the inlined `_grader-utils.mjs`
content the auto-pilot writes to each new case in Phase 2:
- looseRange(N, tolerance=8) — centered range with default ±8 line
tolerance. Replaces hand-rolling `range(N-3, N+3)`. Default absorbs
the LLM line-counting drift seen across all 4 prior pilots.
- fuzzyKeyword(phrase) — hyphen-and-space-tolerant regex builder.
fuzzyKeyword('empty state') matches "empty state", "empty-state",
"emptystate". Replaces hand-rolling `/empty[-\s]+state/`.
- tolerantKeyword(stem) — word-stem prefix matcher. tolerantKeyword('cover')
matches "covering", "covered", "does not cover" but NOT "discovery"
(word boundary). Replaces alternation regexes for common phrasing
variants.
Also updates lessons.md G1 / G2 / G4 to reference the helpers in their
recipes, so the auto-pilot's Phase-4 reading naturally guides it to use
them rather than rediscovering by hand.
Verified end-to-end: extracted the inlined block from the prompt, ran
each helper, confirmed expected behavior on the canonical patterns from
prior pilots.
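
A minimal sketch of how the three helpers described above could behave. The return shapes and regex details are assumptions (for example, `looseRange` returning a `[lo, hi]` pair clamped at line 1, and case-insensitive matching); the inlined `_grader-utils.mjs` may differ. `escapeRegex` is a hypothetical internal helper.

```javascript
// Sketch only: return shapes and regex details are assumptions.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Centered [lo, hi] line range; the lower bound is clamped to line 1.
function looseRange(n, tolerance = 8) {
  return [Math.max(1, n - tolerance), n + tolerance];
}

// Words of the phrase may be joined by spaces, hyphens, or nothing.
function fuzzyKeyword(phrase) {
  const words = phrase.trim().split(/[-\s]+/).map(escapeRegex);
  return new RegExp(words.join('[-\\s]*'), 'i');
}

// Matches the stem at the start of a word, followed by any word characters.
function tolerantKeyword(stem) {
  return new RegExp(`\\b${escapeRegex(stem)}\\w*`, 'i');
}

console.log(looseRange(20));
console.log(fuzzyKeyword('empty state').test('emptystate'));
console.log(tolerantKeyword('cover').test('discovery'));
```

The leading `\b` in `tolerantKeyword` is what rejects "discovery": there is no word boundary between "dis" and "cover".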
This was referenced May 8, 2026
Model matrix change driven by batch-2 pilot results (PR #48):
- gpt-5-mini consistently dragged scores across 10 pilots via:
  - 3–4 line verbosity floor (rules below the floor were under-reported)
  - 6–15 line drift in findings.txt (vs sonnet/gemini's 0–3 line drift)
  - CLI fabrication on "upgrade-style" skills (hallucinated `npx next-upgrade`, ran it, wrote the not-found error as findings)
- Replaced with `openai/gpt-5`, same tier as sonnet-4.6 / gemini-2.5-pro

Lessons.md v1.2 additions:
- New anti-pattern: "Don't add bash commands to skills aimed at small models". They will execute them rather than read them as docs. Source: next-upgrade pilot regression (0.83 → 0.76).
- New failure mode: "CLI fabrication on upgrade-style skills", distinct from Recipe B's "reaches-for-fallback curl" pattern.
- New section: "Some upstream repos use non-canonical SKILL.md paths" (e.g., `plugins/<owner>/skills/<id>/SKILL.md` in expo's repo).
- G1 updated to reflect new matrix: default ±8 is calibrated for sonnet-4.6/gpt-5/gemini-2.5-pro. Smaller models need ±12+.
- Run-record protocol: appended batch-2 entries (10 pilots) + added a "Model-matrix history" subsection tracking matrix changes.

Wrapper script unchanged.
Stacked on PR #44 (`fix/workbench-linux-docker-permissions`).

Adds a wrapper script + prompt template for autonomous skill improvement: operator says "optimize `<slug>`" → orchestrator runs the wrapper → inner `claude -p` agent does the entire find → eval → diagnose → improve → package loop → writes `examples/workbench/<skill-id>/analysis.md`.

What ships
- `tools/auto-improve-skill.mjs` (~110 lines): wrapper that spawns `claude -p` with the templated prompt, tees output, enforces a 90-min wall-clock timeout. Mirrors the existing `tools/skill-explorer/_setup-cost.mjs` pattern.
- `tools/auto-improve-skill-prompt.md` (~280 lines): 5-phase prompt body (Discover / Build suite / Baseline / Iterate / Package). Self-sufficient: inlines `_grader-utils.mjs` content so it doesn't depend on case-source files from other branches.
- `.gitignore`: adds `examples/workbench/*/.run.log`.

CLI
```bash
node tools/auto-improve-skill.mjs // [--force] [--budget ]
```
`--budget` defaults to 3.50; bumped to 15 for runs that need real Phase-4 iteration. `--force` overwrites an existing case dir.

Validation
3-skill pilot in PR #(below); see `eval/auto-pilot/batch-2026-05-08` for the actual runs and results.

🤖 Generated with Claude Code