feat(auto-pilot): /auto-improve-skill wrapper + prompt template by Zhaiyuqing2003 · Pull Request #46 · fastxyz/skill-optimizer

Zhaiyuqing2003 · 2026-05-08T18:45:26Z

Stacked on PR #44 (fix/workbench-linux-docker-permissions).

Adds a wrapper script + prompt template for autonomous skill improvement: the operator says "optimize <slug>", the orchestrator runs the wrapper, and the inner claude -p agent runs the entire find → eval → diagnose → improve → package loop, then writes examples/workbench/<skill-id>/analysis.md.

What ships

tools/auto-improve-skill.mjs (~110 lines) — wrapper that spawns claude -p with the templated prompt, tees output, enforces a 90-min wall-clock timeout. Mirrors the existing tools/skill-explorer/_setup-cost.mjs pattern.
tools/auto-improve-skill-prompt.md (~280 lines) — 5-phase prompt body (Discover / Build suite / Baseline / Iterate / Package). Self-sufficient: inlines _grader-utils.mjs content so it doesn't depend on case-source files from other branches.
.gitignore — adds examples/workbench/*/.run.log.
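
The templating step can be sketched roughly as follows. This is an illustration, not the wrapper's code: `renderPrompt` and the example slug are hypothetical, and only the `${SLUG}` / `${SKILL_ID}` placeholder names come from the wrapper excerpt quoted in the review. It uses a replacer function so that any `$`-sequences in a value are inserted literally.

```javascript
// Hypothetical helper (not from the PR): fill ${NAME} placeholders in a prompt template.
function renderPrompt(template, values) {
  return template.replace(/\$\{([A-Z_]+)\}/g, (match, name) =>
    // A replacer function inserts the value literally, so "$&"-style
    // sequences inside a value are not treated as replacement patterns.
    // Unknown placeholders are left untouched.
    Object.prototype.hasOwnProperty.call(values, name) ? String(values[name]) : match
  );
}

// Made-up template and slug, for illustration only.
const template =
  'Optimize ${SLUG}; write examples/workbench/${SKILL_ID}/analysis.md. ${UNKNOWN}';
const rendered = renderPrompt(template, {
  SLUG: 'example-owner/example-skill',
  SKILL_ID: 'example-skill',
});
console.log(rendered);
```

Leaving unknown placeholders in place makes a typo in the template visible in the rendered prompt instead of silently producing an empty string.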

CLI

```bash
node tools/auto-improve-skill.mjs <slug> [--force] [--budget <usd>]
```

--budget defaults to 3.50 (USD); pass 15 for runs that need real Phase-4 iteration. --force overwrites an existing case dir.

Validation

3-skill pilot in PR #(below) — see eval/auto-pilot/batch-2026-05-08 for the actual runs and results.

🤖 Generated with Claude Code

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Reviewer flagged that the prompt referenced case-source files that don't exist on this branch (web-design-guidelines/checks/, find-skills/). Make the prompt self-sufficient:

- Inline _grader-utils.mjs content under Phase 2 step 4
- Soften 'mirror <path>' references to advisory
- Add minimal Cases-table README skeleton in Phase 2 step 6
- Explicit file list in commit step so .run.log can never sneak in

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Default 3.50 (unchanged). Pilot #1 (agent-browser) hit the original 3.50 cap mid-iteration before reaching the "Always: commit" step, losing the run record. With --budget 15 the same pilot completed cleanly: 0.56 → 1.00, +0.44 uplift, $3.15 actual spend.

Operator usage: node tools/auto-improve-skill.mjs <slug> --budget 15

Copilot

Pull request overview

Adds an “auto-pilot” workflow for generating and iterating on workbench eval cases for a given public skill slug by running a claude -p inner agent against a bundled prompt template, and capturing run logs.

Changes:

Introduces tools/auto-improve-skill.mjs to spawn claude -p, tee output to console + per-case .run.log, and enforce a 90-minute wall-clock timeout.
Adds tools/auto-improve-skill-prompt.md, a multi-phase prompt template that directs the inner agent to discover a skill, build an eval suite, baseline it, iterate up to 2 times, and package proposed upstream changes.
Ignores per-run .run.log files under examples/workbench/*/.

Reviewed changes

Copilot reviewed 2 out of 3 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| tools/auto-improve-skill.mjs | New Node wrapper that runs `claude -p` with a templated prompt, logs output, and enforces a timeout. |
| tools/auto-improve-skill-prompt.md | Prompt template defining the autonomous 5-phase skill improvement loop and expected artifacts. |
| .gitignore | Ignores auto-pilot wrapper log files under `examples/workbench/*/.run.log`. |


+if (existsSync(caseDir) && !FORCE) {
+  console.error(`refusing: ${caseDir} already exists. Pass --force to overwrite.`);
+  process.exit(2);
+}
+mkdirSync(caseDir, { recursive: true });
+
+const promptTemplate = readFileSync(PROMPT_PATH, 'utf-8');
+const prompt = promptTemplate.replace(/\$\{SLUG\}/g, slug).replace(/\$\{SKILL_ID\}/g, skillId);
+
+const logPath = join(caseDir, '.run.log');
+const logStream = createWriteStream(logPath, { flags: 'a' });


+const child = spawn('claude', claudeArgs, {
+  cwd: REPO_ROOT,
+  env: childEnv,
+  stdio: ['ignore', 'pipe', 'pipe'],
+});


+  logStream.end();
+  if (timedOut) {
+    console.error(`\n[wrapper] claude -p exceeded ${PER_CALL_TIMEOUT_MS / 60000}-min timeout`);
+    process.exit(124);
+  }
+  const analysisPath = join(caseDir, 'analysis.md');
+  if (existsSync(analysisPath)) {
+    console.log(`\n[wrapper] analysis.md: ${analysisPath}`);
+  } else {
+    console.error(`\n[wrapper] no analysis.md was written; check ${logPath}`);
+  }
+  process.exit(code ?? 1);


+1. From the case directory, run:
+
+   ```bash
+   set -a; . ./.env; set +a


+   sample, sharing `checks/_grader-utils.mjs`. Write the following
+   file content to `examples/workbench/${SKILL_ID}/checks/_grader-utils.mjs`
+   (verbatim):
+
+   ```js
+   // Shared grader logic for web-design-guidelines eval cases.
+   //
+   // Each finding is assumed to be one line in findings.txt that references
+   // "<File>.tsx:<line>" (line numbers come from the agent — they're often
+   // off by ±1-2 due to LLM line-counting). A violation is considered "found"
+   // when at least one finding line:
+   //   (a) references a line number within the violation's accepted range, AND
+   //   (b) contains at least one of the violation's distinguishing keywords.
+   //
+   // This per-finding-line check prevents spurious cross-matches (e.g. the
+   // keyword "label" from a different finding being credited to a paste rule).


+    - Else loop.
+
+**Cost guard:** sum `metrics.cost.total` from each run's `result.json`.
+If cumulative cost > $3.00, exit `status: budget-exceeded` immediately.
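
A minimal sketch of the cost guard described in this hunk, assuming each run's `result.json` has already been parsed into an object. Only the `metrics.cost.total` field and the `budget-exceeded` status come from the prompt text; `cumulativeCost`, `checkBudget`, the `'ok'` status, and the sample numbers are hypothetical.

```javascript
// Sum metrics.cost.total across parsed result.json objects.
// A run with no recorded cost is counted as 0 (assumption).
function cumulativeCost(results) {
  return results.reduce((sum, r) => sum + (r?.metrics?.cost?.total ?? 0), 0);
}

// Compare the running total against the cap (USD) quoted in the prompt.
function checkBudget(results, capUsd = 3.0) {
  const spent = cumulativeCost(results);
  return { spent, status: spent > capUsd ? 'budget-exceeded' : 'ok' };
}

// Made-up run records, for illustration only.
const runs = [
  { metrics: { cost: { total: 1.25 } } },
  { metrics: { cost: { total: 1.5 } } },
  { metrics: {} },
];
const underCap = checkBudget(runs);
const overCap = checkBudget([...runs, { metrics: { cost: { total: 0.5 } } }]);
console.log(underCap, overCap);
```

Keeping the cap as a parameter means the same check still applies if the self-cap value changes.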


+2. **Modify** — write a *minimal additive* edit:
+    - Add a per-element checklist entry to the rules doc.
+    - Add a BAD/GOOD code example for a missed rule.
+    - Add a two-pass-workflow nudge to the SKILL.md.
+    - Tighten ambiguous rule wording.
+
+   Edits must be additive: no rule deletions, no wording changes to
+   existing rules.
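
One way the "additive only" constraint could be checked mechanically is sketched below. This is not part of the PR: `isAdditiveEdit` is a hypothetical helper that treats an edit as additive when every non-blank original line survives unchanged and in order, and the sample rules are invented.

```javascript
// Hypothetical check: every non-blank line of `before` must appear in `after`,
// unchanged and in the same order (i.e. `before` is a subsequence of `after`).
function isAdditiveEdit(before, after) {
  const kept = before.split('\n').filter((line) => line.trim() !== '');
  let next = 0; // index of the next original line still to be found
  for (const line of after.split('\n')) {
    if (next < kept.length && line === kept[next]) next += 1;
  }
  return next === kept.length;
}

// Invented rules doc, for illustration only.
const original = '- Rule A: label every input\n- Rule B: allow paste';
const added =
  '- Rule A: label every input\n  - BAD/GOOD example for Rule A\n- Rule B: allow paste';
const reworded = '- Rule A: label all inputs\n- Rule B: allow paste';

console.log(isAdditiveEdit(original, added));    // true
console.log(isAdditiveEdit(original, reworded)); // false
```

A line-level comparison is deliberately strict: "tighten ambiguous rule wording" would fail this check, so a real gate would need to decide how that bullet squares with the no-wording-changes rule.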
+


+5. Write `suite.yml` with the standard 3-model matrix:
+
+   ```yaml
+   models:
+     - openrouter/anthropic/claude-sonnet-4.6
+     - openrouter/openai/gpt-5-mini
+     - openrouter/google/gemini-2.5-pro
+   env:


Three changes informed by the 3-skill pilot batch (PR #47):

1. **"Always: write analysis.md AND commit" merged into a single atomic step.** Pilots #1b and #2 wrote analysis.md but ran out of budget before reaching the separate commit step, leaving case files uncommitted. The merged section explicitly tells the agent to skip everything else if budget is low and finish this section first.
2. **Default --max-budget-usd bumped 3.50 → 10.00.** Pilot #1's first real-data attempt died at the cap mid-modification. Pilot #1c at --budget 15 settled at $3.15 with full success. The prompt's Phase-4 self-cap also moved from $3.00 to $7.00 to leave a $2-3 buffer for the analysis.md + commit cleanup below the wrapper hard cap.
3. **New tools/auto-improve-skill-lessons.md**: living doc the prompt reads as Phase-4 prior. Captures recipes A-E (two-pass workflow, verify-tool-installed, per-element checklists, BAD/GOOD examples, rationale + bug-story) and grader-reliability patterns G1-G6 (line tolerance, hyphen regex, per-finding-line matching, keyword variants, set-semantics, verbosity floor) with empirical evidence from the manual web-design-guidelines run + the 3 auto pilots. Phase 4 of the prompt now references the recipes by letter so the auto-pilot doesn't rediscover patterns from scratch each run.

Also fixes a slug-parsing regression introduced by the --budget flag (when --budget was absent, the filter wrongly skipped argv[0]).

Smoke tests pass: bare invocation prints usage, "nope" gives bad-slug, existing dir gets refused, --budget validates input.
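
For illustration, one plausible shape of the slug-parsing fix is sketched below. The flag names and the smoke-test outcomes come from this commit message; `parseArgs`, the error codes as return values, and the "slug must contain a `/`" rule are assumptions, not the wrapper's actual logic.

```javascript
// Hypothetical argument parser; argv here excludes the node binary and script path.
function parseArgs(argv, defaultBudget = 10.0) {
  const force = argv.includes('--force');
  const budgetIdx = argv.indexOf('--budget');
  let budget = defaultBudget;
  if (budgetIdx !== -1) {
    budget = Number(argv[budgetIdx + 1]);
    if (!Number.isFinite(budget) || budget <= 0) return { error: 'bad-budget' };
  }
  // Skip the value after --budget only when the flag is present. Without this
  // guard, indexOf() returns -1 and "budgetIdx + 1" would point at argv[0],
  // which is exactly where the slug usually sits.
  const valueIdx = budgetIdx === -1 ? -1 : budgetIdx + 1;
  const positional = argv.filter((arg, i) => !arg.startsWith('--') && i !== valueIdx);
  if (positional.length === 0) return { error: 'usage' };
  const slug = positional[0];
  // Assumption: a valid slug contains at least one "/" separator.
  if (!slug.includes('/')) return { error: 'bad-slug' };
  return { slug, force, budget };
}

console.log(parseArgs(['example-owner/example-skill']));
console.log(parseArgs(['example-owner/example-skill', '--budget', '15', '--force']));
console.log(parseArgs(['nope']));
```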

Adds three grader-helper utilities to the inlined `_grader-utils.mjs` content the auto-pilot writes to each new case in Phase 2:

- looseRange(N, tolerance=8): centered range with default ±8 line tolerance. Replaces hand-rolling `range(N-3, N+3)`. Default absorbs the LLM line-counting drift seen across all 4 prior pilots.
- fuzzyKeyword(phrase): hyphen-and-space-tolerant regex builder. fuzzyKeyword('empty state') matches "empty state", "empty-state", "emptystate". Replaces hand-rolling `/empty[-\s]+state/`.
- tolerantKeyword(stem): word-stem prefix matcher. tolerantKeyword('cover') matches "covering", "covered", "does not cover" but NOT "discovery" (word boundary). Replaces alternation regexes for common phrasing variants.

Also updates lessons.md G1 / G2 / G4 to reference the helpers in their recipes, so the auto-pilot's Phase-4 reading naturally guides it to use them rather than rediscovering by hand.

Verified end-to-end: extracted the inlined block from the prompt, ran each helper, confirmed expected behavior on the canonical patterns from prior pilots.
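
The helpers' described behavior can be reproduced with a sketch like the one below. The helper names and the matching examples come from this commit message; the return shapes, the clamp to line 1, case-insensitivity, `escapeRegex`, and `violationFound` are assumptions. `violationFound` follows the per-finding-line rule from the `_grader-utils.mjs` header comment (in-range line number and keyword on the same line) and ignores the file name for brevity.

```javascript
// Escape regex metacharacters so keywords are matched literally.
const escapeRegex = (s) => s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');

// Centered line range: N ± tolerance, clamped so it never drops below line 1.
function looseRange(n, tolerance = 8) {
  return { min: Math.max(1, n - tolerance), max: n + tolerance };
}

// Words may be separated by spaces, hyphens, or nothing at all.
function fuzzyKeyword(phrase) {
  const words = phrase.trim().split(/[-\s]+/).map(escapeRegex);
  return new RegExp(words.join('[-\\s]*'), 'i');
}

// Match the stem at a word boundary, followed by any word characters.
function tolerantKeyword(stem) {
  return new RegExp(`\\b${escapeRegex(stem)}\\w*`, 'i');
}

// A violation counts as found when a single finding line carries both
// an in-range line number and at least one keyword.
function violationFound(findingsText, range, keywords) {
  return findingsText.split('\n').some((line) => {
    const ref = line.match(/\.tsx:(\d+)/);
    if (!ref) return false;
    const lineNo = Number(ref[1]);
    const inRange = lineNo >= range.min && lineNo <= range.max;
    return inRange && keywords.some((kw) => kw.test(line));
  });
}

// Invented findings, for illustration only.
const findings = [
  'Form.tsx:45 - input is missing a label',
  'List.tsx:120 - no empty-state shown when the list is empty',
].join('\n');

console.log(violationFound(findings, looseRange(118), [fuzzyKeyword('empty state')])); // true
console.log(violationFound(findings, looseRange(40), [fuzzyKeyword('empty state')]));  // false
```

The second call returns false even though line 45 is in range and "empty-state" appears elsewhere in the file, which is the cross-match the per-finding-line rule is meant to prevent.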

Model matrix change driven by batch-2 pilot results (PR #48):

- gpt-5-mini consistently dragged scores across 10 pilots via:
  - 3–4 line verbosity floor (rules below the floor were under-reported)
  - 6–15 line drift in findings.txt (vs sonnet/gemini's 0–3 line drift)
  - CLI fabrication on "upgrade-style" skills (hallucinated `npx next-upgrade`, ran it, wrote the not-found error as findings)
- Replaced with `openai/gpt-5`, the same tier as sonnet-4.6 / gemini-2.5-pro

Lessons.md v1.2 additions:

- New anti-pattern: "Don't add bash commands to skills aimed at small models"; they will execute them rather than read them as docs. Source: next-upgrade pilot regression (0.83 → 0.76).
- New failure mode: "CLI fabrication on upgrade-style skills", distinct from Recipe B's "reaches-for-fallback curl" pattern.
- New section: "Some upstream repos use non-canonical SKILL.md paths" (e.g., `plugins/<owner>/skills/<id>/SKILL.md` in expo's repo).
- G1 updated to reflect new matrix: default ±8 is calibrated for sonnet-4.6/gpt-5/gemini-2.5-pro. Smaller models need ±12+.
- Run-record protocol: appended batch-2 entries (10 pilots) + added a "Model-matrix history" subsection tracking matrix changes.

Wrapper script unchanged.

Yuqing Zhai and others added 3 commits May 8, 2026 13:43

feat(auto-pilot): tools/auto-improve-skill.mjs + prompt template

6054a09

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Copilot AI review requested due to automatic review settings May 8, 2026 18:45

Zhaiyuqing2003 mentioned this pull request May 8, 2026

eval(auto-pilot): pilot batch — 3 skills, 3/3 success #47

Open

Copilot started reviewing on behalf of Zhaiyuqing2003 May 8, 2026 18:45

Copilot AI reviewed May 8, 2026

Yuqing Zhai added 2 commits May 8, 2026 14:31

This was referenced May 8, 2026

eval(auto-pilot): batch 2 — 10 skills, 8/10 success, 0 failures #48

Open

docs(pilot-runs): publish batch-1 + batch-2 pilot summaries #49

Open

Zhaiyuqing2003 wants to merge 6 commits into fix/workbench-linux-docker-permissions from feat/auto-improve-skill
