feat(auto-pilot): /auto-improve-skill wrapper + prompt template #46

Open
Zhaiyuqing2003 wants to merge 6 commits into fix/workbench-linux-docker-permissions from feat/auto-improve-skill

@Zhaiyuqing2003

Stacked on PR #44 (fix/workbench-linux-docker-permissions).

Adds a wrapper script + prompt template for autonomous skill improvement: operator says "optimize <slug>" → orchestrator runs the wrapper → inner claude -p agent does the entire find → eval → diagnose → improve → package loop → writes examples/workbench/<skill-id>/analysis.md.

What ships

  • tools/auto-improve-skill.mjs (~110 lines) — wrapper that spawns claude -p with the templated prompt, tees output, enforces a 90-min wall-clock timeout. Mirrors the existing tools/skill-explorer/_setup-cost.mjs pattern.
  • tools/auto-improve-skill-prompt.md (~280 lines) — 5-phase prompt body (Discover / Build suite / Baseline / Iterate / Package). Self-sufficient: inlines _grader-utils.mjs content so it doesn't depend on case-source files from other branches.
  • .gitignore — adds examples/workbench/*/.run.log.

CLI

```bash
node tools/auto-improve-skill.mjs <slug> [--force] [--budget <usd>]
```

`--budget` defaults to 3.50 (USD); pass `--budget 15` for runs that need real Phase-4 iteration. `--force` overwrites an existing case dir.

Validation

3-skill pilot in PR #(below) — see eval/auto-pilot/batch-2026-05-08 for the actual runs and results.

🤖 Generated with Claude Code

Yuqing Zhai and others added 3 commits May 8, 2026 13:43
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Reviewer flagged that the prompt referenced case-source files that
don't exist on this branch (web-design-guidelines/checks/, find-skills/).
Make the prompt self-sufficient:

- Inline _grader-utils.mjs content under Phase 2 step 4
- Soften 'mirror <path>' references to advisory
- Add minimal Cases-table README skeleton in Phase 2 step 6
- Explicit file list in commit step so .run.log can never sneak in

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Default 3.50 (unchanged). Pilot #1 (agent-browser) hit the original
3.50 cap mid-iteration before reaching the "Always: commit" step,
losing the run record. With --budget 15 the same pilot completed
cleanly: 0.56 → 1.00, +0.44 uplift, $3.15 actual spend.

Operator usage:
  node tools/auto-improve-skill.mjs <slug> --budget 15
Contributor

Copilot AI left a comment

Pull request overview

Adds an “auto-pilot” workflow for generating and iterating on workbench eval cases for a given public skill slug by running a claude -p inner agent against a bundled prompt template, and capturing run logs.

Changes:

  • Introduces tools/auto-improve-skill.mjs to spawn claude -p, tee output to console + per-case .run.log, and enforce a 90-minute wall-clock timeout.
  • Adds tools/auto-improve-skill-prompt.md, a multi-phase prompt template that directs the inner agent to discover a skill, build an eval suite, baseline it, iterate up to 2 times, and package proposed upstream changes.
  • Ignores per-run .run.log files under examples/workbench/*/.

Reviewed changes

Copilot reviewed 2 out of 3 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| tools/auto-improve-skill.mjs | New Node wrapper that runs claude -p with a templated prompt, logs output, and enforces a timeout. |
| tools/auto-improve-skill-prompt.md | Prompt template defining the autonomous 5-phase skill improvement loop and expected artifacts. |
| .gitignore | Ignores auto-pilot wrapper log files under examples/workbench/*/.run.log. |

Comment on lines +54 to +64
if (existsSync(caseDir) && !FORCE) {
  console.error(`refusing: ${caseDir} already exists. Pass --force to overwrite.`);
  process.exit(2);
}
mkdirSync(caseDir, { recursive: true });

const promptTemplate = readFileSync(PROMPT_PATH, 'utf-8');
const prompt = promptTemplate.replace(/\$\{SLUG\}/g, slug).replace(/\$\{SKILL_ID\}/g, skillId);

const logPath = join(caseDir, '.run.log');
const logStream = createWriteStream(logPath, { flags: 'a' });
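
One detail of the substitution in this hunk: when the second argument to `String.prototype.replace` is a string, sequences such as `$&` or `$1` inside it are treated as replacement patterns, so a slug containing `$` could be altered on the way into the prompt. A minimal sketch of a literal-safe variant follows. The `fillTemplate` helper and its behavior for unknown placeholders are assumptions for illustration, not code from this PR.

```javascript
// Hypothetical helper (not part of the PR): fill ${NAME} placeholders literally.
// A replacer function's return value is inserted as-is, so "$&" in a value
// is not expanded the way it would be in a string replacement.
function fillTemplate(template, vars) {
  return template.replace(/\$\{([A-Z_]+)\}/g, (match, name) =>
    Object.prototype.hasOwnProperty.call(vars, name) ? String(vars[name]) : match
  );
}

// Unknown placeholders are left untouched (an assumed design choice).
const filled = fillTemplate('slug=${SLUG} id=${SKILL_ID} other=${UNKNOWN}', {
  SLUG: 'owner/repo/$&skill',
  SKILL_ID: 'skill',
});
console.log(filled);

// For contrast: a string replacement expands "$&" to the matched text.
const expanded = '${SLUG}'.replace(/\$\{SLUG\}/g, 'a$&b');
console.log(expanded);
```

Whether this matters in practice depends on which characters a valid slug may contain, which the excerpt does not show.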
Comment on lines +78 to +82
const child = spawn('claude', claudeArgs, {
  cwd: REPO_ROOT,
  env: childEnv,
  stdio: ['ignore', 'pipe', 'pipe'],
});
Comment on lines +95 to +106
logStream.end();
if (timedOut) {
  console.error(`\n[wrapper] claude -p exceeded ${PER_CALL_TIMEOUT_MS / 60000}-min timeout`);
  process.exit(124);
}
const analysisPath = join(caseDir, 'analysis.md');
if (existsSync(analysisPath)) {
  console.log(`\n[wrapper] analysis.md: ${analysisPath}`);
} else {
  console.error(`\n[wrapper] no analysis.md was written; check ${logPath}`);
}
process.exit(code ?? 1);
1. From the case directory, run:

```bash
set -a; . ./.env; set +a
```
Comment on lines +80 to +95
sample, sharing `checks/_grader-utils.mjs`. Write the following
file content to `examples/workbench/${SKILL_ID}/checks/_grader-utils.mjs`
(verbatim):

```js
// Shared grader logic for web-design-guidelines eval cases.
//
// Each finding is assumed to be one line in findings.txt that references
// "<File>.tsx:<line>" (line numbers come from the agent — they're often
// off by ±1-2 due to LLM line-counting). A violation is considered "found"
// when at least one finding line:
// (a) references a line number within the violation's accepted range, AND
// (b) contains at least one of the violation's distinguishing keywords.
//
// This per-finding-line check prevents spurious cross-matches (e.g. the
// keyword "label" from a different finding being credited to a paste rule).
```
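
The per-finding-line rule described in that header comment can be sketched as below. This is an illustration under stated assumptions, not the inlined `_grader-utils.mjs`: the function name `isViolationFound` and the violation shape `{ file, range: [lo, hi], keywords }` are hypothetical.

```javascript
// Sketch only: names and the violation shape are assumed, not taken from the PR.
// keywords must be RegExp objects WITHOUT the "g" flag, because .test() on a
// global regex keeps state in lastIndex between calls.
function isViolationFound(findingsText, violation) {
  const [lo, hi] = violation.range;
  return findingsText.split('\n').some((line) => {
    // (a) the line cites "<File>.tsx:<line>" with a line number in range
    const refs = [...line.matchAll(/([\w.-]+\.tsx):(\d+)/g)];
    const inRange = refs.some(
      ([, file, num]) =>
        file === violation.file && Number(num) >= lo && Number(num) <= hi
    );
    // (b) the SAME line contains a distinguishing keyword
    const hasKeyword = violation.keywords.some((kw) => kw.test(line));
    return inRange && hasKeyword;
  });
}

const findings = [
  'Form.tsx:42 input is missing a label',
  'Form.tsx:88 paste is blocked on the password field',
].join('\n');

// Line 88 mentions "paste" and falls inside 86..90.
const pasteRule = { file: 'Form.tsx', range: [86, 90], keywords: [/paste/i] };
// "label" appears only on the line-42 finding, so it must not be credited here.
const crossMatch = { file: 'Form.tsx', range: [86, 90], keywords: [/label/i] };

console.log(isViolationFound(findings, pasteRule));
console.log(isViolationFound(findings, crossMatch));
```

Checking both conditions on the same line is what prevents a keyword from one finding being combined with a line number from another.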
Comment thread tools/auto-improve-skill-prompt.md Outdated
- Else loop.

**Cost guard:** sum `metrics.cost.total` from each run's `result.json`.
If cumulative cost > $3.00, exit `status: budget-exceeded` immediately.
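
The cost guard is an instruction to the inner agent, but the arithmetic it asks for can be sketched as follows. The `checkBudget` name, the `'ok'` status string, and the treatment of runs with no cost field as zero are assumptions; only `metrics.cost.total`, the $3.00 cap, and `budget-exceeded` come from the prompt excerpt.

```javascript
// Sketch only: results is an array of parsed result.json objects,
// assumed to look like { metrics: { cost: { total: <number> } } }.
function checkBudget(results, capUsd = 3.0) {
  const spent = results.reduce(
    (sum, r) => sum + (r?.metrics?.cost?.total ?? 0), // missing cost counts as 0 (assumption)
    0
  );
  // The prompt says "> $3.00", so spending exactly the cap does not trip the guard.
  return { spent, status: spent > capUsd ? 'budget-exceeded' : 'ok' };
}

const runs = [
  { metrics: { cost: { total: 1.25 } } },
  { metrics: { cost: { total: 1.0 } } },
  { metrics: { cost: { total: 1.0 } } },
];
console.log(checkBudget(runs).status);
console.log(checkBudget(runs, 7.0).status);
```

Passing the cap as a parameter keeps the sketch usable if the self-cap value changes.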
Comment thread tools/auto-improve-skill-prompt.md Outdated
Comment on lines +257 to +265
2. **Modify** — write a *minimal additive* edit:
   - Add a per-element checklist entry to the rules doc.
   - Add a BAD/GOOD code example for a missed rule.
   - Add a two-pass-workflow nudge to the SKILL.md.
   - Tighten ambiguous rule wording.

   Edits must be additive: no rule deletions, no wording changes to
   existing rules.

Comment on lines +157 to +164
5. Write `suite.yml` with the standard 3-model matrix:

```yaml
models:
- openrouter/anthropic/claude-sonnet-4.6
- openrouter/openai/gpt-5-mini
- openrouter/google/gemini-2.5-pro
env:
```
Yuqing Zhai added 2 commits May 8, 2026 14:31
Three changes informed by the 3-skill pilot batch (PR #47):

1. **"Always: write analysis.md AND commit" merged into a single atomic
   step.** Pilots #1b and #2 wrote analysis.md but ran out of budget
   before reaching the separate commit step, leaving case files
   uncommitted. The merged section explicitly tells the agent to skip
   everything else if budget is low and finish this section first.

2. **Default --max-budget-usd bumped 3.50 → 10.00.** Pilot #1's first
   real-data attempt died at the cap mid-modification. Pilot #1c at
   --budget 15 settled at $3.15 with full success. The prompt's Phase-4
   self-cap also moved from $3.00 to $7.00 to leave a $2-3 buffer for
   the analysis.md + commit cleanup below the wrapper hard cap.

3. **New tools/auto-improve-skill-lessons.md** — living doc the prompt
   reads as Phase-4 prior. Captures recipes A-E (two-pass workflow,
   verify-tool-installed, per-element checklists, BAD/GOOD examples,
   rationale + bug-story) and grader-reliability patterns G1-G6 (line
   tolerance, hyphen regex, per-finding-line matching, keyword variants,
   set-semantics, verbosity floor) with empirical evidence from the
   manual web-design-guidelines run + the 3 auto pilots. Phase 4 of the
   prompt now references the recipes by letter so the auto-pilot doesn't
   rediscover patterns from scratch each run.

Also fixes a slug-parsing regression introduced by the --budget flag
(when --budget was absent, the filter wrongly skipped argv[0]).

Smoke tests pass: bare invocation prints usage, "nope" gives bad-slug,
existing dir gets refused, --budget validates input.
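
The slug-parsing regression and the smoke tests above can be illustrated with a small argument parser. This is a sketch, not the wrapper's code: the `parseArgs` name, the error strings, and the rule that a slug must contain a `/` are assumptions. It also shows one plausible way such a bug arises: `indexOf` returns -1 when `--budget` is absent, so unconditionally skipping index `budgetIdx + 1` drops `argv[0]`.

```javascript
// Sketch only: flag names come from the PR; the slug rule and error strings are assumed.
function parseArgs(argv, defaultBudget = 10) {
  const force = argv.includes('--force');
  const budgetIdx = argv.indexOf('--budget');
  let budget = defaultBudget;
  if (budgetIdx !== -1) {
    budget = Number(argv[budgetIdx + 1]);
    if (!Number.isFinite(budget) || budget <= 0) return { error: 'bad-budget' };
  }
  // Skip the --budget value only when the flag is present. Without the
  // `budgetIdx !== -1` check, the filter would skip index 0 (-1 + 1).
  const positional = argv.filter(
    (arg, i) => !arg.startsWith('--') && !(budgetIdx !== -1 && i === budgetIdx + 1)
  );
  if (positional.length === 0) return { error: 'usage' };
  const slug = positional[0];
  if (!slug.includes('/')) return { error: 'bad-slug' }; // assumed slug rule
  return { slug, force, budget };
}

console.log(parseArgs(['owner/repo/skill']));
console.log(parseArgs(['owner/repo/skill', '--budget', '15', '--force']));
console.log(parseArgs(['nope']));
```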
Adds three grader-helper utilities to the inlined `_grader-utils.mjs`
content the auto-pilot writes to each new case in Phase 2:

- looseRange(N, tolerance=8) — centered range with default ±8 line
  tolerance. Replaces hand-rolling `range(N-3, N+3)`. Default absorbs
  the LLM line-counting drift seen across all 4 prior pilots.

- fuzzyKeyword(phrase) — hyphen-and-space-tolerant regex builder.
  fuzzyKeyword('empty state') matches "empty state", "empty-state",
  "emptystate". Replaces hand-rolling `/empty[-\s]+state/`.

- tolerantKeyword(stem) — word-stem prefix matcher. tolerantKeyword('cover')
  matches "covering", "covered", "does not cover" but NOT "discovery"
  (word boundary). Replaces alternation regexes for common phrasing
  variants.

Also updates lessons.md G1 / G2 / G4 to reference the helpers in their
recipes, so the auto-pilot's Phase-4 reading naturally guides it to use
them rather than rediscovering by hand.

Verified end-to-end: extracted the inlined block from the prompt, ran
each helper, confirmed expected behavior on the canonical patterns from
prior pilots.
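
A minimal sketch of how the three helpers described above could be written, checked against the examples in this commit message. The helper names and default tolerance come from the message; the bodies, the `[lo, hi]` return shape of `looseRange`, the clamp at line 1, and case-insensitive matching are assumptions and may differ from the inlined `_grader-utils.mjs`.

```javascript
// Escape regex metacharacters so a phrase is matched literally.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Centered line range with +/- tolerance; returned as [lo, hi] (assumed shape).
function looseRange(n, tolerance = 8) {
  return [Math.max(1, n - tolerance), n + tolerance];
}

// Words of the phrase may be joined by hyphens, whitespace, or nothing.
function fuzzyKeyword(phrase) {
  const parts = phrase.trim().split(/[-\s]+/).map(escapeRegExp);
  return new RegExp(parts.join('[-\\s]*'), 'i');
}

// Stem must start at a word boundary; any word characters may follow.
function tolerantKeyword(stem) {
  return new RegExp(`\\b${escapeRegExp(stem)}\\w*`, 'i');
}

console.log(looseRange(42));
console.log(['empty state', 'empty-state', 'emptystate'].map((s) => fuzzyKeyword('empty state').test(s)));
console.log(['covering', 'covered', 'does not cover', 'discovery'].map((s) => tolerantKeyword('cover').test(s)));
```

The regexes are built without the `g` flag so repeated `.test()` calls do not carry `lastIndex` state between findings lines.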
Model matrix change driven by batch-2 pilot results (PR #48):

- gpt-5-mini consistently dragged scores across 10 pilots via:
  - 3–4 line verbosity floor (rules below the floor were under-reported)
  - 6–15 line drift in findings.txt (vs sonnet/gemini's 0–3 line drift)
  - CLI fabrication on "upgrade-style" skills (hallucinated `npx
    next-upgrade`, ran it, wrote the not-found error as findings)
- Replaced with `openai/gpt-5` — same tier as sonnet-4.6 / gemini-2.5-pro

Lessons.md v1.2 additions:

- New anti-pattern: "Don't add bash commands to skills aimed at small
  models" — they will execute them rather than read them as docs.
  Source: next-upgrade pilot regression (0.83 → 0.76).
- New failure mode: "CLI fabrication on upgrade-style skills" —
  distinct from Recipe B's "reaches-for-fallback curl" pattern.
- New section: "Some upstream repos use non-canonical SKILL.md paths"
  (e.g., `plugins/<owner>/skills/<id>/SKILL.md` in expo's repo).
- G1 updated to reflect new matrix: default ±8 is calibrated for
  sonnet-4.6/gpt-5/gemini-2.5-pro. Smaller models need ±12+.
- Run-record protocol: appended batch-2 entries (10 pilots) +
  added a "Model-matrix history" subsection tracking matrix changes.

Wrapper script unchanged.