feat(skill-optimizer): v1.4 chain implementation (draft, per RFC #52) #53
Draft
Zhaiyuqing2003 wants to merge 50 commits into
Approved design (brainstormed 2026-05-12) converting the v1.3 monolithic auto-improve-orchestrator into 7 independent Claude Code skills that chain via the superpowers plugin pattern.

Motivation: team review of v1.3 PR drafts surfaced 4 critiques — v1.3 optimizes for incremental numerical uplift without validating test case quality, grader correctness, or improvement principledness. The firecrawl iteration regression (1.0 → 0.44 from piling on Recipe A+D simultaneously to chase a small uplift) is the canonical "ducktape-by-monolithic-orchestrator" case study.

Key architectural shifts:
- Skills (not slash commands) per superpowers convention
- Convention-pathed reports at docs/skill-optimizer/<slug>/... (visible + committable, like docs/superpowers/specs/)
- Strict limited-context subagents for all generative work (writer, analyzer, optimizer, validator) — prevents tunnel-vision into ducktape patches
- Validator subagent after every improvement (internal + optional external consistency)
- Auto-pilot = natural chained invocation, not a separate orchestrator

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
User clarification: the agent asks 'optimize for PR submission?' at step 1 (when the user first provides an upstream skill), not at the end of step 7. This decision determines whether step 3 (investigate-submissions) runs at all.

Changes:
- 'Two contexts, one workflow' rewritten: PR decision at step 1, recorded in 01-functionality.md frontmatter as pr_submission_intent
- Step 1 behavior: explicit PR question for upstream sources
- Step 2 handoff: reads pr_submission_intent to decide whether to invoke step 3
- Step 3 'skipped when': now keyed on pr_submission_intent: false
- Step 7 handoff: three branches (local / upstream-no-PR / upstream-yes-PR), no late prompts

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
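The routing this commit describes can be sketched in a few lines. This is only an illustration under stated assumptions: the `source` field and both function names are hypothetical, while `pr_submission_intent` and the three branch labels come from the commit message.

```python
from dataclasses import dataclass


@dataclass
class FunctionalityReport:
    # Hypothetical stand-in for the frontmatter of 01-functionality.md.
    source: str                 # "local" or "upstream" (assumed field name)
    pr_submission_intent: bool  # recorded at step 1


def should_run_step_3(report: FunctionalityReport) -> bool:
    """Step 3 (investigate-submissions) is skipped when the intent is false."""
    return report.pr_submission_intent


def step_7_branch(report: FunctionalityReport) -> str:
    """Pick one of the three step-7 handoff branches without a late prompt."""
    if report.source == "local":
        return "local"
    return "upstream-yes-PR" if report.pr_submission_intent else "upstream-no-PR"
```

Because the decision is read from the report rather than asked again, step 7 never needs to prompt the user.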
15 tasks across 5 phases:
- Phase A (4 tasks, mechanical): skill dir shells, subagents/refs dirs, v1.3 deprecation, recipes.md seed
- Phase B (7 tasks, INTERACTIVE via writing-skills): one per SKILL.md, NOT subagent-driven per user request
- Phase C (6 tasks, mechanical): subagent prompt templates with limited-context constraints
- Phase D (1 task, mechanical): references/workflow.md chain diagram
- Phase E (2 tasks, E2E validation): local skill + firecrawl re-run (the v1.3 regression case)

Plan explicitly marks Phase B as not-for-subagent-driven-development; the user explicitly stated 'writing good skills is HARD' and wants interactive creation per skill.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…+ writing-skills

Phases B–D produce v1.4's load-bearing artifacts (description-routed SKILL.md files, subagent prompt templates that hold the anti-ducktape constraints, and the workflow reference doc). Per operator direction, none of these should be subagent-driven — "writing good skills is HARD" and "everything to be precise".

Tool assignment:
- Phase B (7 SKILL.md): skill-creator as outer loop (description-routing + eval iteration), superpowers:writing-skills as inner loop when behavioral compliance issues surface during eval
- Phase C (6 subagent prompts): superpowers:writing-skills only — these are compliance documents and the limited-context constraint must hold under adversarial pressure, pure TDD-pressure-scenario territory
- Phase D (workflow.md): no skill tool; direct interactive authoring with section-by-section operator review (matches handoffs from B + constraints from C, which may have shifted during interactive iteration)

Only Phase A (file scaffolding) and Phase E (real eval runs) sit outside the interactive flow.
Body content for each SKILL.md will be written interactively in Phase B via superpowers:writing-skills, with the interface contract from docs/skill-optimizer-v1.4-spec.md as the brief. This commit just lays down the structural skeleton and discoverable frontmatter.
Pulled from feat/auto-improve-skill-v1.3:skills/auto-improve-orchestrator/ references/lessons.md (the v1.3 orchestrator never landed on development, so we source from the experimental branch). Adds v1.4-specific header explaining how the analyzer (step 6) and optimizer (step 7) subagents will use this file. Replaces v1.3 "auto-improve-skill" / "Phase 4" framing with v1.4 step-numbering and subagent-naming. Content body is the v1.3 Recipe A-E + grader patterns G1-G6 + run-record protocol verbatim — that's the cumulative knowledge v1.4 inherits.
…ts dir, move philosophy to docs/

Changes to the v1.4 plan-as-executed, all in one place:
- Move skill-writing-philosophy.md from skills/references/ to docs/. The philosophy doc is contributor-facing — we'll distill the load-bearing rules into the optimizer subagent's prompt directly rather than have it load this doc at runtime.
- Delete skills/references/recipes.md (and the now-empty references/ dir). The raw seed copied from v1.3 lessons.md is case-study-shaped (accumulated per-pilot observations), which contradicts the "generalize-from-feedback" principle in the philosophy doc. The curated abstract-pattern version needs real end-to-end observations to ground it in — deferred to a follow-up after the chain ships.
- Rename skills/subagents/ → skills/skill-optimizer-subagents/ so the dir name is scoped to this plugin and can't collide with subagents that other plugins might ship under a generic name.
- Add a "Bootstrapping limit" section to the philosophy doc. The skill-optimizer chain runs an empirical loop on target skills, but that loop can't validate itself — same shape as Thompson's "Reflections on Trusting Trust". The seven meta-skills get authored from philosophy + best judgment; their test is Phase-E real-world runs, not eval data for the meta-skills themselves.
- Spec + plan updated: file tree, Task A2 (revised), Task A3 (skipped), Task A4 (deferred), Phase C task file paths (subagents path rename), Phase D location (workflow.md goes to docs/ rather than skills/references/), acceptance criteria #4 (recipes.md deferred), coexistence section (v1.3 orchestrator never landed on development), open questions (recipes.md location TBD on real production).
- Add note about post-v1.4 cleanup of the original skills/skill-optimizer/ skill (now redundant with the chain's skill-optimizer-run-bench step). Deferred to its own PR because it touches the public plugin API.
…sification taxonomy + local-with-PR-intent path

Two changes to the B1 draft, plus a matching spec sync:

1. Classification: replaced the ad-hoc closed list (code-reviewer | document-producer | tool-use | code-patterns | other) with the v3 taxonomy actually used in the prioritization work — tool-use, code-patterns, document, prose-guidance, meta, interactive — and explicitly grants the subagent sovereignty to write a short descriptive label of its own when none fits, rather than collapsing to "other". A specific label gives downstream steps a real handle to work with.
2. Local-skill PR intent: the prior rule was "local = always pr_submission_intent: false". Revised to default false but treat PR-intent as live when the user explicitly says they want to send the local skill back upstream — in which case we capture the upstream guidelines location (URL / CONTRIBUTING.md path / Slack channel) into the report body as a "PR submission notes" subsection that step 3 (investigate-submissions) reads as its starting point.

Also: while reviewing, dropped the v1.3/v1.4 framing from the "Why limited-context dispatch matters" section (version is historical metadata, not skill-functional content) and updated subagent path links to the renamed skill-optimizer-subagents/ directory.

Spec §"The 7 skills" #1 step 2 + step 6 (classification field description) updated to match.
…r B2
Spec changes:
- New ## Iteration patterns section between ## Subagent constraints
and ## The skills. Covers: backtrack-trigger table, per-report
versioning mechanism (version + inputs frontmatter, with direct-
upstream-only staleness checks), re-entry contract with the
${OPERATOR_DIRECTIVES} slot, latest-plus-archive convention, and
the transitive-staleness policy (user judgment trusted; auto-pilot
re-runs on direct-upstream version mismatch).
- ## The 7 skills → ## The skills. Added one-line "Iteration
behavior" notes to each of the existing seven subsections.
- Step 2 (investigate-test-case) now dispatches a test-case-designer
subagent rather than running in the operator session. Reasoning:
cross-iteration contamination — on iter 2+ the operator has seen
prior failures and biases test selection toward "what just failed"
instead of comprehensive coverage. Isolated subagent fixes this.
- New ### 8. skill-optimizer-autopilot subsection. Walks 1→7,
consults version mechanism per step (skip-or-dispatch), applies
defaults for the three human-gate points (B1 PR-intent flag, B2
pick-top-N, B7 log-and-exit on validator-rejected), bounds at
max-iterations-per-step. Replaces the old "## Auto-pilot mode"
section (which said "no separate skill, just chained invocation").
- Subagent constraints table updated: test-case-designer added;
every reasoning subagent's "Does NOT see" column now includes
prior drafts of its own output; every Sees column includes
${OPERATOR_DIRECTIVES}.
- Acceptance criteria expanded to 9 items (was 7): new #1 covers
8 SKILL.md files, new #5 covers iteration-mechanism E2E, new #8
covers auto-pilot smoke test.
Plan changes:
- Phase B count 7 → 8 tasks. New Task B8 (skill-optimizer-autopilot
SKILL.md) added with its own A1-equivalent dir-creation step
inline (the dir wasn't part of Phase A scope).
- Phase C count 6 → 7 tasks. New Task C1b (test-case-designer
subagent prompt) inserted between C1 and C2 with a full draft
template; suffixed "1b" rather than renumbering existing C2-C6
to keep cross-references stable.
- Phase C section header adds a preamble explaining the
${OPERATOR_DIRECTIVES} requirement that applies to every
reasoning-subagent prompt template.
…h iteration patterns

Five updates to match the spec's new iteration mechanism (§"Iteration patterns" added in commit 0009066):

1. Report frontmatter now includes `version` field. Step 1 has no upstream reports, so `inputs:` is omitted; downstream steps will add it as they're authored.
2. New workflow step 5 — "Handle iteration: archive prior version + collect directives." Checks for existing report at the canonical path; if present, moves it to archive/01-functionality-v<N>.md and bumps version. Also where the operator pre-digests cross-iteration learnings into the OPERATOR_DIRECTIVES bulleted list (atomic new requirements, not a context dump).
3. Step 6 (renumbered from old step 5) dispatches with two new templated inputs: ${VERSION} and ${OPERATOR_DIRECTIVES}. The subagent's "does NOT see" list explicitly includes prior 01-functionality.md drafts under archive/ to prevent the re-derivation from being contaminated by its own past output.
4. New "## Iteration behavior" section after "## Edge cases". Covers: when to re-run, what happens to vendored-skill/ on re-run, the archive convention, and how downstream cascade self-corrects via the version-mismatch check on next invocation.
5. Edge-case for "vendored-skill/ already exists" tightened: default to reuse (same source), re-fetch only on source URL change or explicit user ask. (Previously asked the user every time.)

Sets the template for B2-B8.
Authoring the same ~40 lines of archive + version-bump + directives-collection logic into seven chain skills (B1-B7) would mean ~280 duplicated lines. Factor the mechanics into a single shared operational reference that every chain skill loads explicitly.

New file: skills/skill-optimizer-shared/iteration-protocol.md (187 lines). Covers: versioning convention, the per-invocation decision tree (no existing report → v1; existing + inputs match → current; existing + mismatch → archive + bump), collecting OPERATOR_DIRECTIVES (atomic new requirements, not context dumps), subagent constraints under iteration (ignore own prior output, read upstream at latest), cascading staleness policy (direct-upstream-only checks), archive table, bootstrapping case, and what the protocol explicitly does NOT cover (bench-results timestamping, auto-pilot summaries, vendored-skill cache).

B1 changes:
- Step 5 stripped from ~25 lines of inline mechanics to ~10 lines with an explicit "Read this file now" instruction pointing at the protocol doc. The "Read this now" framing is load-bearing — the whole point is that the agent must consult the protocol, not improvise.
- "Iteration behavior" section trimmed from ~25 lines of general mechanics to ~10 lines listing only step-1-specific re-run triggers, with a pointer back to the protocol for the general mechanics.
- Net: 225 → 208 lines.

This sets the pattern for B2-B8: each chain skill's "Handle iteration" step will be a brief pointer-and-step-specific-notes combo, with the heavy mechanics centralized.

Spec changes:
- Architecture overview file tree: skill-optimizer-shared/ added.
- Acceptance criterion 2b added: shared protocol doc exists and is referenced explicitly from each chain skill's iteration step.

Plan changes:
- File tree updated (skill-optimizer-shared/ + autopilot/ stub).
- New Task A5 added: author iteration-protocol.md. Documented as "as-executed (added mid-execution)" because it emerged from B1's revision rather than the initial plan.
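The per-invocation decision tree named in this commit (no existing report → v1; existing + inputs match → current; existing + mismatch → archive + bump) might look roughly like the sketch below. The function name, the action labels, and the dict shape for recorded inputs are assumptions made for illustration; the protocol doc itself is not reproduced here.

```python
from typing import Dict, Optional, Tuple


def decide_iteration(
    existing_version: Optional[int],
    recorded_inputs: Dict[str, int],
    current_upstream: Dict[str, int],
) -> Tuple[str, int]:
    """Return (action, version) for one chain-skill invocation.

    recorded_inputs: upstream versions stored in the existing report.
    current_upstream: versions of the direct upstream reports right now.
    """
    if existing_version is None:
        # No report at the canonical path yet: first derivation.
        return ("write", 1)
    if recorded_inputs == current_upstream:
        # Report is still current with respect to its direct upstreams.
        return ("current", existing_version)
    # Direct-upstream mismatch: archive the old report, then bump.
    return ("archive-and-bump", existing_version + 1)
```

Only direct upstreams are compared, which matches the "direct-upstream-only checks" staleness policy the commit mentions.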
…sophy
Three issues that the protocol doc was a load-bearing reference for —
so they would have propagated into every chain skill's invocation —
all caught on read-through:
1. "B1-B7" was project-internal jargon (those are plan task labels,
not part of the skill vocabulary). Replaced with "every skill in
the chain". The shipped doc should be timeless; task labels live
in the plan, not the protocol.
2. The "On every chain-skill invocation" section used an ASCII box-
drawing decision tree. Per Anthropic and writing-skills guidance
("use flowcharts ONLY for non-obvious decision points"), this
logic is straightforward conditional flow — bullets carry it
more cleanly and match standard markdown rendering. Rewritten as
nested bullets.
3. The "Bad examples" of operator directives were marked with ❌.
Per the global instruction "Only use emojis if the user
explicitly requests it. Avoid adding emojis to files unless
asked", these don't belong. Replaced with explicit "Examples
that count" / "Examples that do NOT count" section labels.
Net: 187 → 180 lines.
…mbering + front-load load-bearing context

Two related fixes:

1. Internal workflow step numbering (1-7) collided with chain step numbering (1-7 referring to other skills in the chain). Same word, two meanings — an agent reading "step 5" could plausibly think either "this skill's Handle-iteration step" or "the run-bench skill in the chain". Relabeled internal workflow steps to (a) through (g) so they're visually distinct from chain step references. Cross-references inside the doc updated accordingly. A note in the new "Before you start" section declares the convention explicitly so future readers don't have to infer it.
2. The "Read iteration-protocol.md" instruction was buried at internal step (e) — middle of a 7-step workflow. An agent reading sequentially might glance past it once they're in execution momentum. Added a "## Before you start" section right after the intro paragraph, listing the two load-bearing things to know up front: (1) read the iteration protocol now, (2) you will dispatch a subagent — you don't do the research yourself. Step (e) still requires reading the protocol — the "Before you start" section primes the agent so step (e) becomes reinforcement rather than first contact.

229 lines total (up from 208; "Before you start" earns its place as the discipline frame).

Sets the pattern for B2-B8 — each will similarly relabel its workflow steps with (a)-(g) and front-load the two load-bearing context items.
…visibility moved from step 4-blocked to step 4-allowed

Two coupled changes — B2's SKILL.md draft and a spec update that makes the source-visibility split consistent across the chain.

Spec changes (Subagent constraints table):
- Step 2 (test-case designer) — explicitly added "skill source content" to "does NOT see". Coverage design happens at the responsibility level here; concrete fixture details enter at step 4. Without this constraint, designers could gerrymander test cases around the source's literal phrasing instead of reasoning from stated responsibilities.
- Step 4 (test writer) — removed "the skill's content" from "does NOT see"; added "skill source content" to "sees". Updated the rationale: fixture writing needs concrete patterns and violation examples, which come from source. The per-case test spec from step 2 constrains what the fixture should test, so the gerrymandering risk is bounded. Still blocks other test cases (prevents copying across the suite) and grader internals (prevents grader-leak hacking).

Architecture: source enters the chain at step 1 (research), exits at step 2 (responsibility-level design needs no source), enters at step 4 (fixture writing needs source detail), stays present for step 6 (analyze failures) and step 7 (optimize/validate). The spec's step-4 body description updated to match.

B2 SKILL.md:
- New file, follows B1's template: front-loaded "Before you start" section, lettered workflow steps (a)-(f), bold discipline markers at action sites, why-this-matters rationale, edge cases, iteration behavior section.
- 6 internal steps: (a) confirm prerequisites, (b) handle iteration, (c) dispatch designer subagent, (d) confirm subagent output, (e) user gate (present + collect picks), (f) conditional handoff based on pr_submission_intent.
- "picked" frontmatter field is the operator's responsibility — subagent writes proposal with picked: []; operator fills picked after user gate.
- Three-response handling in user gate: pick subset/all, ask for revision, pick zero.
- 215 lines, description 479 chars.
Step 3 of the chain — OPTIONAL, runs only when 01-functionality.md has pr_submission_intent: true. Researches the upstream repo's PR conventions and writes 03-submissions.md as the verbatim-pastable context block the validator (step 7) uses for external consistency.

Follows the B1/B2 template:
- Front-loaded "Before you start" with iteration-protocol pointer + dispatch discipline + step-numbering convention
- Frontmatter with version, inputs.step_1_functionality, plus step-specific fields: upstream_repo, upstream_branch_target, license, requires_cla
- 5 lettered workflow steps: (a) confirm prerequisites including pr_submission_intent gate, (b) handle iteration, (c) dispatch subagent, (d) confirm subagent output with blocker flagging, (e) hand off
- Why limited-context dispatch matters: rationale specific to step 3 — the validator's external consistency check depends on the submissions report being neutral upstream facts, not advocacy for the proposed change
- Edge cases: private repos, non-GitHub hosts, empty PR history, copyleft license
- Iteration behavior: explicitly note this rarely re-runs; upstream conventions change slowly

235 lines, description 578 chars.
…otocol + apply in B2 A chain skill never invokes another chain skill on its own. When a step finds its upstream input unsatisfactory (thin functionality report, too-easy bench, missing test case, wrong-target PR conventions), it surfaces the finding and stops — the user (or auto-pilot driver) decides whether to re-run upstream, accept the situation, or abandon. This was implicit in the architecture but not stated as a rule. Adding it explicitly to iteration-protocol.md as a new section between "Cascading staleness" and "Archive convention". Same section covers the forward-handoff exception (those are normal chain flow, not backward triggers). Updated B2's edge case for "Subagent's proposal has fewer cases than expected" — was "Either re-run step 1 with directives or accept the small proposal", now "Surface this to the user with two options ... Do not re-invoke step 1 yourself — re-runs require an active signal from the user", with reference back to the protocol doc's new section. Audit: only B2 had a wording suggesting backward auto-trigger; B1 and B3 don't suggest re-running upstream steps. Same rule applies to all future chain skills (B4-B8) — the protocol doc is the chain-wide source of truth.
…eview

1. upstream_branch_target placeholder syntax. Was "<main | next | other>" which reads as a closed enum. The actual value is the literal branch name (could be `develop`, `release-2024`, etc.), so the placeholder should describe the field meaning, not enumerate three literal options. Now: "<branch name — usually `main`, sometimes `next` for new-skill repos, or a specific release branch>".
2. Removed the "frontmatter spec includes fields not currently in the vendored skill's frontmatter" blocker. It described a real fact in the report (upstream uses fields the vendored skill lacks) but isn't a blocker — the optimizer (step 7) reads 03-submissions.md and would add missing fields naturally. Doesn't need a separate operator alert.
3. Removed the "License is GPL or other copyleft" blocker. Copyleft licenses don't mechanically block PR submission — the upstream is whatever-licensed; a PR becomes part of that codebase under that license. Organization-policy concerns about contributing to copyleft projects exist but are contributor-side decisions, not chain-level blockers.

Step (d) reframed from "flag these blockers" to "verify file + only the CLA fact needs explicit mention since it requires operator-side work outside the chain". Other frontmatter fields (license, repo, branch target) are facts downstream steps consume directly.

Edge case for "license is copyleft" replaced with one for unusual frontmatter conventions — that's the actual blocker shape (subagent can't extract a consistent spec, so the optimizer has to make a judgment call).
Per operator review: B2's re-run was modeled as fresh derivation,
which loses continuity. The user's `picked` choices and existing
case names should survive a re-run; otherwise the user has to re-pick
everything every time they add coverage. The "archive as inert audit
trail" rule means continuity has to come from the current canonical
file, not from peeking at archived prior versions.
New concept: chain steps come in two kinds.
- Fresh-derivation steps (1, 3, 6, 7): subagent never sees its own
step's prior or current file. The "anti-ducktape" rule applies in
full — optimizer must not see prior attempts, analyzer must not see
prior analyses.
- Maintenance steps (2, 4): subagent reads the current canonical file
as load-bearing input and produces an extended version of it.
Existing entries the user has invested in (picks, manually-added
cases) are preserved unless directives explicitly say to revise.
Archive still happens for audit; archive is still inert.
iteration-protocol changes:
- New "Step kinds: fresh-derivation vs maintenance" section
classifying each step.
- "On every chain-skill invocation" decision tree updated to branch
on step kind: fresh-derivation copies-to-archive then derives from
scratch; maintenance copies-to-archive then dispatches with current
file as input.
- "Subagent constraints under iteration" restructured to make the
third rule kind-dependent (do/don't read your own canonical file).
B2 changes:
- Step (b) Handle iteration: explicitly invokes the maintenance
pattern. Notes the special case where step 1's version bumped
(responsibility set changed) — user decides whether to extend or
start fresh by deleting the canonical file before dispatch.
- Step (c) Dispatch: new templated input ${EXISTING_CASES_PATH}
(empty on iteration 1, current 02-test-case.md on re-runs).
Subagent's "sees" list adds the current file with explicit
preserve-existing semantics. "Does NOT see" list now correctly
excludes archive only (was incorrectly excluding all prior drafts).
- Step (e) User gate: response #2 "user wants additions or revisions"
now reflects maintenance — user does not need to re-pick everything
since existing picks are preserved.
Spec changes:
- Subagent constraints table: test-case-designer row updated to
reflect maintenance pattern. Sees-list adds "current 02-test-case.md
(when extending in maintenance mode)"; does-NOT-see-list correctly
scoped to "archived prior drafts" only; why-rationale updated.
Note: B4 (write-tests) when drafted will also follow the maintenance
pattern (workbench/ accumulates per-case files). B6 and B7 stay as
fresh-derivation — that's the load-bearing anti-ducktape constraint.
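The two step kinds introduced above can be captured as a small lookup. This is a hedged sketch rather than anything taken from iteration-protocol.md: the set names and the function are hypothetical, and only the step numbers and the read/don't-read rule come from the commit message.

```python
# Step numbers as classified in the commit message above.
FRESH_DERIVATION_STEPS = {1, 3, 6, 7}
MAINTENANCE_STEPS = {2, 4}


def subagent_reads_own_canonical(step: int) -> bool:
    """True when the step's subagent takes its current canonical file as input.

    Maintenance steps extend the current file; fresh-derivation steps never
    see their own prior or current output. Archived versions are inert in
    both cases, so they are not modelled here.
    """
    if step in MAINTENANCE_STEPS:
        return True
    if step in FRESH_DERIVATION_STEPS:
        return False
    raise ValueError(f"step {step} is not classified as either kind")
```

Steps outside both sets raise instead of defaulting, so an unclassified step cannot silently inherit either behavior.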
Step 4 of the chain — takes the picked cases from 02-test-case.md and dispatches test-writer subagents in parallel (one per case) to build concrete workspace files + graders. Runs a smoke check against hand-crafted GOOD/BAD/EMPTY fixtures before declaring done.

Follows the established template (B1/B2/B3 shape) with several B4-specific additions:
- **Maintenance step** (per the recent iteration-protocol update). workbench/ accumulates as new picks get built. The (b) iteration step diffs picked vs prior built_cases and only dispatches test-writers for NEW picks or for cases that directives flag for revision. Existing builds are preserved verbatim.
- **Parallel per-case dispatch** in step (d). All test-writers emitted in a single message so they run concurrently. Each sees only its case spec + 01-functionality.md + skill source; does NOT see other cases, other graders, prior failures, or anything under archive/. The per-case isolation prevents both cross-fixture homogenization and grader-leak hacking.
- **Source content access** (per the recent spec update). Test writers are the one step where the skill source is in-scope for a generative subagent. Bounded by the per-case spec from step 2.
- **Smoke check** (step (e)). New responsibility not in other steps. Verifies each grader against GOOD/BAD/EMPTY fixtures the test-writer also produces. last_smoke_check_passed in frontmatter is only set when all graders pass.
- **Two outputs**: 04-tests-plan.md (the meta-report) AND the workbench/ directory (the artifacts the run-bench step executes against). Maintenance protocol covers both via archive/04-tests-plan-vN.md and archive/workbench-vN/.
- **User-gate at step (c)** for the planned workbench structure before dispatching. Three response cases (approve / add revisions via directives / reject and abandon-or-replan).
- **6 internal steps**: (a) confirm prerequisites, (b) handle iteration with diff logic, (c) plan + user gate, (d) parallel dispatch, (e) smoke check, (f) assemble + commit + hand off.

350 lines, description 492 chars. Larger than B1/B2/B3 because the extra responsibilities (parallel dispatch, smoke check, maintenance diff) each earn their lines.
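The diff in step (b) amounts to a set comparison. A minimal sketch, assuming case names are plain strings and that directives have already been reduced to a list of cases flagged for revision (both assumptions; the function name is hypothetical):

```python
from typing import Iterable, List


def cases_to_dispatch(
    picked: Iterable[str],
    built_cases: Iterable[str],
    flagged_for_revision: Iterable[str] = (),
) -> List[str]:
    """Return the picked cases that need a test-writer dispatch.

    A case is dispatched when it is newly picked (not yet built) or when a
    directive flags it for revision. Everything else is preserved verbatim.
    Order follows the picked list.
    """
    built = set(built_cases)
    flagged = set(flagged_for_revision)
    return [case for case in picked if case not in built or case in flagged]
```

For example, with picks `["a", "b", "c"]`, prior builds `["a", "b"]`, and `"b"` flagged, only `"b"` and `"c"` get a test-writer.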
…o centralized script exists

The prior wording referenced a path that didn't exist (`skills/skill-optimizer/references/scripts/smoke-check.mjs`). The actual codebase doesn't have a centralized smoke-check runner — each existing workbench has its own `checks/smoke-graders.mjs` specific to its graders (see e.g. examples/workbench/agent-browser/checks/smoke-graders.mjs).

Rewrote step (e) to be honest about this:
- Smoke check is per-workbench, not centralized.
- The test-writer subagent produces the smoke-check artifact alongside its grader. Shape follows the workbench schema docs.
- Two reasonable shapes described (per-case runner vs workbench-level aggregator); the test-writer prompt template (Phase C) will pick one and apply consistently.
- Reference example pointed at the actual existing agent-browser/checks/smoke-graders.mjs.

Conceptual contract unchanged (GOOD passes, BAD/EMPTY fail); only the execution mechanics corrected.
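The conceptual contract (GOOD passes, BAD/EMPTY fail) is independent of how the runner is packaged. The sketch below states it in Python purely for illustration; the real smoke checks are `.mjs` files whose shape is not shown in this thread, and the grader-as-callable interface here is an assumption.

```python
from typing import Callable, Dict

# Expected grader verdict for each hand-crafted fixture.
EXPECTED = {"GOOD": True, "BAD": False, "EMPTY": False}


def smoke_check(grader: Callable[[str], bool], fixtures: Dict[str, str]) -> bool:
    """Return True only if the grader passes GOOD and fails BAD and EMPTY.

    grader: hypothetical callable that takes a fixture and returns pass/fail.
    fixtures: mapping with exactly the GOOD, BAD and EMPTY entries.
    """
    return all(bool(grader(fixtures[name])) is want for name, want in EXPECTED.items())
```

A grader that passes everything fails this check on the BAD fixture, which is the point: the smoke check catches graders that cannot discriminate.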
Step 5 of the skill-optimizer chain. Thin operator-driven CLI step — no subagent dispatch. Two outputs: timestamped raw bench data under 05-bench-results/<ts>/ (preserved naturally, exempt from the standard archive flow) and a versioned 05-bench-summary.md (follows the iteration protocol; archive-and-version on re-run). The summary is the entry point step 6 reads: per-model and per-case pass rates plus a failed-case pointer list into the raw output. The analyzer subagent at step 6 walks that list for trace and findings detail; the summary itself stays short. Handoff branches on overall pass rate: failures route to analyze-result; all-pass surfaces the choice to the user (accept that the picked set didn't expose a weakness, or re-run step 2 with a "make it harder" directive). Chain skills don't auto-invoke.
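As a rough illustration of what the summary carries and how the handoff branches, consider the sketch below. The trial record shape (`model`, `case`, `passed`, `raw_path`), the function names, and the `"ask-user"` label are all assumptions; only the per-model and per-case pass rates, the failed-case pointer list, and the two-way branch come from the description above.

```python
from collections import defaultdict
from typing import Dict, List


def summarize(trials: List[dict]) -> Dict[str, object]:
    """Aggregate raw trials into pass rates plus pointers to failed cases."""
    per_model = defaultdict(list)
    per_case = defaultdict(list)
    failed = []
    for trial in trials:
        per_model[trial["model"]].append(trial["passed"])
        per_case[trial["case"]].append(trial["passed"])
        if not trial["passed"]:
            # Pointer into the timestamped raw output, not the trace itself.
            failed.append(trial["raw_path"])

    def rate(results):
        return sum(results) / len(results)

    return {
        "per_model": {name: rate(r) for name, r in per_model.items()},
        "per_case": {name: rate(r) for name, r in per_case.items()},
        "failed": failed,
    }


def next_action(summary: Dict[str, object]) -> str:
    """Failures route to analyze-result; an all-pass run goes back to the user."""
    return "analyze-result" if summary["failed"] else "ask-user"
```

Keeping only pointers in the summary is what lets it stay short while the analyzer at step 6 still reaches the full traces.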
Two related coordination changes across the chain, plus a deferred limitation note. Rename convention for revised cases at step 2: when a directive asks to revise an existing case (rather than add a new one), the test-case-designer subagent appends a version suffix — `case-x` → `case-x-v2` → `case-x-v3`. Bare name is implicit v1. This gives step 4's diff logic a deterministic signal that the revised case needs a fresh test-writer dispatch (without the rename, the name-match would say "already built" and skip rebuilding the case whose spec actually changed). Encoded in B2's user-gate step and referenced from B4's diff logic so the v-suffixed names are not surprising downstream. Partial re-bench at step 5: deferred. CLI's run-suite does not accept a case filter, so step 5 always re-measures the full workbench. Documented as a known limitation with a roadmap pointer. Operator escape hatch (manual run-case + splice into 05-bench-results/<ts>/) noted as outside the chain.
Replace the version-field + archive-folder iteration model with
git as the history mechanism and the filesystem itself as the
current state. The prior protocol was reimplementing git in
frontmatter — version: ints, inputs.step_N: lineage tracking,
archive/<NN>-name-v<N>.md copies — and adding accidental complexity
across the chain.
Iteration-protocol rewrite:
- Drop version field, archive/ subdir, inputs.step_N int tracking
- Staleness detection: git log -1 --format=%ct mtime comparison
- Two step kinds preserved (fresh-derivation vs maintenance), with
the constraint reframed in terms of "does the subagent read its
own canonical file" — fresh-derivation says no (anti-ducktape),
maintenance says yes (filesystem IS the state)
- Subagent constraint added: don't walk git history of any tree
file — for fresh-derivation this is the anti-ducktape guarantee,
for maintenance this prevents reasoning from prior states
- Safe destructive edits: operator session commits a checkpoint
before maintenance-step rebuilds so git history has a clean
before/after breakpoint
Spec doc updates:
- State layout: tests/<functionality>/<test>/{spec.yaml, workspace,
grader, smoke} replaces 02-test-case.md + 04-tests-plan.md +
workbench/. Filesystem-as-state — no picked: [] or built_cases: []
arrays anywhere
- B2 output: 00-test-proposals.md (audit) + tests/<func>/spec.yaml
per functionality with picked: true|false in each
- B4 output: tests/<func>/<test>/ probe folders + generated
tests/suite.yml. Per-probe parallel test-writer dispatch
- B5 input: tests/suite.yml. Two outputs: timestamped raw +
single-canonical 05-bench-summary.md
- Subagent constraints table updated for every reasoning subagent
to reflect git-history-off-limits rule
- Per-step iteration-behavior sections rewritten for the new model
Undoes the -v2 rename convention added 30 min ago (no longer
needed — filesystem-as-state means revising a spec.yaml in place
is naturally detected, and B4 can just rebuild on directive
without any special name mangling).
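The git-based staleness check could be sketched as below. `git log -1 --format=%ct -- <path>` is a real invocation that prints the committer timestamp of the last commit touching a path; the function names and the treatment of never-committed files are assumptions for illustration, not the protocol's wording.

```python
import subprocess
from typing import Optional


def last_commit_time(path: str, repo: str = ".") -> Optional[int]:
    """Unix committer time of the last commit touching path, or None."""
    result = subprocess.run(
        ["git", "log", "-1", "--format=%ct", "--", path],
        cwd=repo,
        capture_output=True,
        text=True,
        check=True,
    )
    stamp = result.stdout.strip()
    return int(stamp) if stamp else None


def is_stale(own_time: Optional[int], upstream_time: Optional[int]) -> bool:
    """A report is stale when its direct upstream was committed after it.

    Assumption: a report with no commit yet always needs deriving, and an
    upstream with no commit cannot make a committed report stale.
    """
    if own_time is None:
        return True
    if upstream_time is None:
        return False
    return upstream_time > own_time
```

A step such as B3 might then call `is_stale(last_commit_time("03-submissions.md"), last_commit_time("01-functionality.md"))`, with the paths resolved relative to the skill's report directory.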
Apply the filesystem-as-state + git-native iteration redesign across all five existing chain SKILL.md files. Drops version: frontmatter fields and archive/ directory references everywhere; reframes the subagent constraints in terms of "don't read own canonical / git history" (fresh-derivation) vs "do read current tree as state" (maintenance).

B1 (investigate-functionality, fresh-derivation): drop version field; subagent constraint becomes "don't read own canonical or git history of it" instead of "don't read archive/".

B2 (investigate-test-case, maintenance — major rewrite): output shape changed entirely. Was a single 02-test-case.md with picked: [] frontmatter array; now produces 00-test-proposals.md (one-time audit report) plus tests/<functionality>/spec.yaml per proposed functionality, each with picked: true|false in its own frontmatter. User gate is "edit picked: in each spec.yaml" rather than "tell me which names to pick". Subagent reads existing tests/ tree as load-bearing state per the maintenance rule.

B3 (investigate-submissions, fresh-derivation): drop version and inputs.step_1_functionality fields; staleness now via git mtime against 01-functionality.md.

B4 (write-tests, maintenance — major rewrite): replaced 04-tests-plan.md + workbench/ with tests/<functionality>/<test>/ probe folders. State is implicit: probe folder + grader.mjs present = built; no built_cases: [] array. Per-probe parallel test-writer dispatch; one probe = one test-writer subagent. Step generates tests/suite.yml from picked-functionality probes. Destructive-edit checkpoint pattern: operator commits before rebuilding existing probes.

B5 (run-bench, fresh-derivation summary + timestamped raw): drop version and inputs.step_4_tests fields; reads tests/suite.yml as input. Summary 05-bench-summary.md is single-canonical with prior state in git; raw 05-bench-results/<ts>/ stays naturally accumulated and outside the protocol.
Net: less bookkeeping, simpler mental model, fewer ways to get state out of sync. Filesystem IS the state across the chain.
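A minimal sketch of the "probe folder + grader.mjs present = built" rule, assuming the tests/<functionality>/<test>/ layout described above. The function names and the demo probe names are hypothetical, not taken from the chain.

```python
from pathlib import Path
import tempfile


def built_probes(tests_root: Path) -> list[str]:
    """List probe folders (tests/<functionality>/<test>/) that contain grader.mjs.

    No built_cases array is consulted: the filesystem is the only state.
    """
    return [
        grader.parent.relative_to(tests_root).as_posix()
        for grader in sorted(tests_root.glob("*/*/grader.mjs"))
    ]


def demo() -> list[str]:
    """Build a throwaway tests/ tree with one built and one unbuilt probe."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp) / "tests"
        (root / "search" / "basic").mkdir(parents=True)
        (root / "search" / "basic" / "grader.mjs").write_text("// grader\n")
        (root / "search" / "paging").mkdir(parents=True)  # folder exists, no grader yet
        return built_probes(root)
```

Here demo() reports only search/basic as built; search/paging would be picked up by the next write-tests run without any bookkeeping field being edited.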
Two wording fixes to the fresh-derivation subagent constraint:
1. The prior wording "no reference to what was produced before" read as forbidding even the directive mechanism, which is wrong. The operator session DOES read prior outputs (that's part of its job between iterations) and distills lessons into atomic new requirements. The subagent then satisfies those distilled requirements as fresh constraints without seeing the raw prior content. This separation is what keeps the new derivation from rationalizing the prior one while still letting the chain converge across iterations.
2. Step 7 has a specific carve-out worth stating explicitly: the SKILL itself (the improvement target) is upstream input, not the subagent's own canonical. The optimizer reads the current skill (which may include modifications from prior step-7 runs) and proposes new improvements on top. The "own canonical" that's off-limits is 07-improvement-proposal.md (the reasoning report), not the skill file. The skill accumulates improvements across iterations; the proposal reports do not.
Triggered by review of the prior wording on iteration-protocol.md.
Replace the descriptive "there is no version: field" wording with a prescriptive "do not add version-tracking metadata" rule plus a practical decision aid for future SKILL.md authors: When in doubt about whether a field belongs: ask whether a chain skill needs to READ it to do its job right now (yes -> keep), or whether you're recording it for future-debugging / future-audit purposes (no -> that's git's job). The prior wording could be read as describing the current state without prohibiting reintroduction. The new wording makes the prohibition explicit so B6/B7/B8 (yet to be drafted) don't accidentally bring back version-tracking metadata.
B6 (analyze-result): the chain's anti-ducktape gate. Fresh-derivation step. Dispatches an analyzer subagent that reads bench summary + raw trial data + probe specs + skill content, but NOT the test inputs themselves (forces principle-thinking over solution-thinking).
Output: 06-analysis.md with has_structural_weakness: true|false in frontmatter and per-weakness sections containing:
- Pattern
- Hypothesized cause
- Connects to skill section
- What WOULD address this (general principle)
- What WOULD NOT address this (the explicit anti-ducktape list step 7's optimizer must reckon with)
Honest refusal is built in: if no weakness can be articulated, the report says so and sets has_structural_weakness: false, which step 7 will refuse to fire on. Forced weakness-naming when the analyzer found nothing is the ducktape failure mode this step exists to prevent.
B7 (improve-skill): the terminal generative step. Two subagents (optimizer + validator), both fresh-derivation. Refuses to fire if 06-analysis has has_structural_weakness: false (the anti-ducktape gate's downstream half).
- Optimizer: reads 06-analysis, 01-functionality, current skill, 03-submissions (if PR-bound). Does NOT see raw trials, grader internals, test inputs, or prior proposals. Must apply the analyzer's general principle and self-check against the anti-pattern list explicitly.
- Validator: reads skill BEFORE + AFTER + 01-functionality + the proposal artifact + 03-submissions (if PR-bound). Internal consistency + external consistency checks. Does NOT see the optimizer's reasoning trace, prior verdicts, or raw trial data.
- Bounded loop: max 2 revision rounds. If validator still says needs-revision after round 2, surface honestly with three realistic paths; do not loop indefinitely.
- Three handoff branches: local skill (modify in place), upstream + PR=false (modify vendored copy), upstream + PR=true (write 07-pr-draft.md with operator-steps-to-submit; the chain does NOT submit the PR).
Outputs three or four artifacts: 07-improvement-proposal.md, 07-validator-verdict.md, the modified skill file (on approve), and 07-pr-draft.md (PR-bound + approve). Frontmatter on both reports carries runtime-relevant facts only (verdict, addresses_weaknesses, diff_target) per the iteration protocol's discipline rule. Both files follow the established structural template (front-loaded "Before you start", lettered workflow steps, edge cases, iteration behavior section). Pending user review of B1-B7 before B8 and the subagent prompt templates.
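A minimal sketch of the step-7 gate, assuming 06-analysis.md carries flat key: value frontmatter between --- fences. Both helper names are hypothetical, and the parser is deliberately tiny rather than a full YAML reader.

```python
def read_frontmatter(text: str) -> dict[str, str]:
    """Parse flat `key: value` pairs between the leading '---' fences."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields: dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def may_improve(analysis_text: str) -> bool:
    """Step 7 fires only when the analyzer named a structural weakness."""
    return read_frontmatter(analysis_text).get("has_structural_weakness") == "true"
```

A missing or malformed field is treated as "do not fire", which errs on the side of the honest-refusal behavior rather than forcing an improvement.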
…rompt
B6 SKILL.md was carrying a full markdown body template (per-weakness section template with placeholders, non-structural noise example, honest-refusal wording verbatim). That's the subagent's concern: the subagent writes the body per its prompt template; the operator session reads only the frontmatter for handoff branching.
Keep in the SKILL.md:
- Frontmatter contract (operator reads has_structural_weakness for step 7's gate)
- Enumeration of the five required parts per weakness entry (the operator session verifies these in step (d))
- The architecture-level rationale on why the anti-pattern list is load-bearing (this is design-decision content, not subagent-side prose; it explains WHY the constraint exists for future authors)
- Pointer to the subagent prompt template for the full template + reasoning protocol
Same audit pass on B1/B2/B4/B7: their What-you-produce sections show structural contracts (frontmatter schemas, file-tree shape) that the operator session actually reads, not narrative body templates the subagent fills in, so they stay as-is.
Two architectural fixes per review:
1. Drop PR draft from B7. PR composition is a separate downstream concern that consumes B7's proposal + 03-submissions.md; the auto-pilot (step 8) or a dedicated composer can handle it if pr_submission_intent: true. B7's job is just to improve the skill; packaging it as a PR is not what improve-skill does.
Removed: 07-pr-draft.md as an artifact, the three-branch handoff (local / upstream + PR=false / upstream + PR=true), the operator-steps-to-submit checklist. Collapses to a single handoff message regardless of PR intent.
2. Never modify the original skill. The improved version lives at docs/skill-optimizer/<slug>/improved-skill/, a separate location that accumulates improvements across iterations. The vendored upstream copy stays frozen; the local source file stays untouched. Git tracks improved-skill/ history. On re-run, the optimizer reads improved-skill/ if it exists (the accumulated state) and proposes the next improvement on top; iteration 1 reads the original source instead. Original is always recoverable; improved evolves under git.
Removed: "Written in place for local skills" / "Written to vendored-skill/ for upstream skills" (both wrong now).
Updated B7's "Before you start" carve-out summary, step (c) and (d) input descriptions (SKILL_CURRENT_PATH replaces SKILL_SOURCE_PATH; validator BEFORE = optimizer's input), step (f) write outputs (materialize improved-skill/; don't touch source), step (g) handoff (single message). Iteration-protocol's step-7 carve-out updated to match (improved-skill/ is the accumulated state; source stays frozen). Spec doc layout adds improved-skill/ alongside vendored-skill/; B7 section rewritten; subagent constraints table rows for optimizer + validator updated.
1. Rename 00-test-proposals.md -> 02-test-proposals.md. The 00- prefix was inconsistent with the chain's per-step numbering convention (B2's outputs should start with 02-). Renamed in both the SKILL.md and the spec doc layout. 2. Soften the "don't auto-flip" wording. The intent is "no proactive flipping without user direction", not "user must edit every spec.yaml by hand". If the user explicitly says "flip these to true" or "pick X, Y, Z", the operator session does it for them and confirms what was set. 3. Single path for user-added functionalities. The prior wording offered two paths (manual spec.yaml creation by operator OR subagent re-dispatch with directive). The first violates the architecture's no-operator-generative-writing rule — only the subagent writes test-design content. Collapsed to the single correct path: treat user's description as a directive and re-dispatch per step (e)(2).
…ounts B3 had hard-coded "last 10 merged PRs and last 5 closed-without-merge PRs" both in the SKILL.md "subagent sees" list and in the spec doc "Behavior" line. Two problems: 1. Redundant: the line above in SKILL.md already says "PR list" as part of the gh-CLI access, so the specific-counts bullet was restating with extra constraints. 2. Over-prescriptive: 10/5 are arbitrary; the subagent should sample enough recent PRs to identify shape patterns and rejection signals, but the exact counts are operational judgment not architecture. The subagent prompt template (Phase C) can recommend a starting point; SKILL.md and the spec shouldn't pin it. Collapsed both to a brief mention that the PR list covers both merged and closed-without-merge for shape patterns + rejection signals.
B7 had grown to ~500 lines covering both the optimizer and the validator with an in-step revision loop. Splitting into two single-shot steps cleans the architecture:
B7 (improve-skill, ~280 lines):
- Reads 06-analysis + 01-functionality + current skill state (improved-skill/ if exists, else original source) + 03-submissions if PR-bound
- Dispatches optimizer subagent
- Writes 07-improvement-proposal.md ONLY
- Does NOT materialize improved-skill/ (that's step 8's job after approval)
- Single-shot per invocation; 7->8->7 revision cycle is operator-driven, not in-step
- Handoff: "invoke validate-improvement"
B8 (validate-improvement, new, ~340 lines):
- Reads 07-improvement-proposal.md + current skill + 01-functionality + 03-submissions if PR-bound
- Dispatches validator subagent
- Writes 08-validator-verdict.md
- On verdict: approve: materializes improved-skill/ by applying the diff to a copy of the current state; original source stays frozen
- Three handoff branches by verdict (approve / needs-revision / reject); does not auto-invoke step 7 on needs-revision
- Single-shot; if needs-revision, operator distills and re-invokes step 7 then step 8
B9 (autopilot, renumbered from 8):
- Walks 1->8 (was 1->7)
- Handles the 7->8->7 revision loop bounded by max_iterations_per_step
- Spec doc + autopilot section updated accordingly
Other changes propagated:
- Rename 07-validator-verdict.md -> 08-validator-verdict.md in spec layout, B2/B6 cross-references, subagent constraints table
- Iteration-protocol step-kinds: add step 8 to fresh-derivation list; expand step-7-specific carve-out to cover both step 7 and step 8
- "step 1 through step 7" -> "step 1 through step 9" in all chain SKILL.md headers
- B3 "validator (step 7)" -> "validator (step 8)" (3 instances); "optimizer (step 7)" stays correct
- Eliminated the in-step bounded revision loop entirely: each chain skill is now genuinely single-shot per invocation, aligning with the "chain skills don't auto-invoke other chain skills" rule. The bounded loop survives as a cross-step pattern in auto-pilot (B9).
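A rough sketch of how an auto-pilot driver might bound the 7->8->7 revision cycle, assuming max_iterations_per_step counts improve-then-validate rounds (the commit does not spell out its exact semantics). The callables stand in for invoking the two chain skills and are hypothetical.

```python
from typing import Callable


def revision_loop(
    improve: Callable[[list[str]], str],
    validate: Callable[[str], tuple[str, str]],
    max_iterations_per_step: int,
) -> tuple[str, int]:
    """Alternate improve (step 7) and validate (step 8) until a final verdict.

    `improve` receives the distilled directives so far and returns a proposal.
    `validate` returns (verdict, feedback), where verdict is one of
    "approve", "needs-revision", or "reject".
    """
    directives: list[str] = []
    verdict = "needs-revision"
    rounds = 0
    while rounds < max_iterations_per_step:
        rounds += 1
        proposal = improve(directives)
        verdict, feedback = validate(proposal)
        if verdict != "needs-revision":
            break  # approve or reject ends the cycle
        directives.append(feedback)  # distilled lesson for the next round
    return verdict, rounds
```

If the budget runs out while the verdict is still needs-revision, the loop returns that verdict unchanged so the driver can surface it instead of retrying indefinitely.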
Per-skill verbosity audit identified four trim patterns:
1. "Before you start" preambles duplicated workflow step (b)/(c) content
2. "Why limited-context dispatch matters" sections lived far from the dispatch they explain; philosophy doc says "give a why-this-matters paragraph nearby"
3. "Confirm subagent output" boilerplate restated across all 7 subagent-dispatching skills
4. Verbatim multi-line user-dialogue and handoff templates were over-prescriptive (operator can phrase the exact words from the intent statement)
Applied to all 8 chain skills:
- Dropped "Before you start" sections entirely (~160L saved)
- Inlined dispatch rationale at workflow step (c)/(d) as a short "Why this matters" paragraph (~170L saved net)
- Tightened "Confirm subagent output" steps to one or two sentences (~50L)
- Collapsed verbatim dialogue blocks to intent statements (~100L)
- Examples lists trimmed from 4-5 to 2 (one to establish, one to show variation)
Also split the shared iteration-protocol into three named docs:
- iteration-protocol.md (~130L, was ~284L): iteration mechanics only (step kinds, staleness, destructive-edit checkpoints, cascading, bootstrapping). Loaded at "Handle iteration" step.
- subagent-dispatch.md (new, ~120L): subagent constraints, operator-directives concept, templated dispatch inputs, no-auto-invocation rule. Loaded at "Dispatch subagent" step.
- frontmatter-discipline.md (new, ~40L): runtime facts vs history rule, decision aid. Referenced at "What you produce" section.
Each chain skill loads only what it needs at the workflow step that needs it (lazy loading rather than front-loading everything at "Before you start"). Most skills need all three; B5 (no subagent dispatch) needs only iteration-protocol + frontmatter-discipline.
Final line counts (all chain skills now at or under 200 lines):
B1 investigate-functionality 227 -> 148 (-79)
B2 investigate-test-case 296 -> 191 (-105)
B3 investigate-submissions 240 -> 156 (-84)
B4 write-tests 368 -> 200 (-168)
B5 run-bench 233 -> 154 (-79)
B6 analyze-result 302 -> 165 (-137)
B7 improve-skill 317 -> 181 (-136)
B8 validate-improvement 337 -> 189 (-148)
Shared iteration-protocol 284 -> 130 + 120 + 40 = 290
Total: 2604 -> 1674 lines (-930, ~36% reduction).
Aligns with the project's docs/skill-writing-philosophy.md: "Bias toward 'Claude is smart' — pruning beats adding. If the skill restates what Claude already knows, removing the restatement is often a more principled fix than adding new rules."
…footer, add workflow doc
Two coupled cleanups per the philosophy doc's closeness principle:
1. Inlined step-specific edge cases into their workflow steps. Most edge cases were just elaborations of "Confirm prerequisites" or "Confirm subagent output"; they belong INSIDE those steps, not in a trailing section. The small remainder (cross-cutting concerns like filesystem-as-state observations) stays in a tiny "Edge cases" footer where it earns its space.
2. Dropped "Iteration behavior" sections from each chain skill. The re-run triggers were skill-scheduling info (when to invoke this skill), not in-skill workflow content. Moved them into a new operator-facing docs/skill-optimizer-workflow.md as a cross-skill matrix, closer to where an operator deciding what to run next would look. Added a one-line "**Fresh-derivation step.**" or "**Maintenance step.**" classification near the top of each skill (since this affects subagent behavior and is short).
The new docs/skill-optimizer-workflow.md (Phase D, ~110 lines) covers cross-skill concerns: the 9-step chain table, per-step re-run triggers, backward triggers (when a step surfaces a problem with an earlier step), state layout, pointers to the shared docs.
Final line counts:
B1 investigate-functionality 148 -> 131 (-17)
B2 investigate-test-case 191 -> 174 (-17)
B3 investigate-submissions 156 -> 142 (-14)
B4 write-tests 200 -> 182 (-18)
B5 run-bench 154 -> 131 (-23)
B6 analyze-result 165 -> 148 (-17)
B7 improve-skill 181 -> 155 (-26)
B8 validate-improvement 189 -> 165 (-24)
workflow.md (new) 0 -> 110
Chain net: -156 lines from chain skills, +110 lines for the workflow doc that captures the cross-skill content previously duplicated across each skill's "Iteration behavior". Net per-file size is smaller, and the per-skill files now follow the closeness principle: edge cases live next to the step they refine; re-run triggers live in the operator-facing reference, not in the per-skill workflow.
Combined with the prior verbosity sweep: chain SKILL.md files have gone from 2320 to 1228 lines (-1092, -47%) since this morning.
Top-level docs/ should hold project-wide reading; chain-specific
design docs belong elsewhere:
- docs/skill-optimizer-v1.4-spec.md
-> docs/superpowers/specs/2026-05-19-skill-optimizer-v1.4-design.md
(matches the superpowers brainstorming skill's convention:
docs/superpowers/specs/YYYY-MM-DD-<topic>-design.md)
- docs/skill-optimizer-v1.4-plan.md
-> docs/superpowers/plans/2026-05-19-skill-optimizer-v1.4.md
(matches the superpowers writing-plans skill's convention:
docs/superpowers/plans/YYYY-MM-DD-<topic>.md)
- docs/skill-optimizer-workflow.md
-> skills/skill-optimizer-shared/workflow.md
(chain-specific operator reference belongs alongside the chain
it documents; co-located with iteration-protocol.md +
subagent-dispatch.md + frontmatter-discipline.md)
Top-level docs/ now holds only:
workbench.md # project-wide workbench engine guide
README.codex.md # install
README.opencode.md # install
skill-writing-philosophy.md # project-wide authoring guidance
images/ # project-wide assets
superpowers/ # superpowers-plugin-managed dir
pilot-runs/ # project-wide
Internal references updated:
- The spec doc's "Companion docs" section now reflects the new
layout (project-wide vs chain-specific)
- Bulk sed across spec + plan to point at new paths
- workflow.md's relative paths fixed for its new location (../skills/
-> ./ since it now lives in skills/skill-optimizer-shared/)
Date chosen (2026-05-19) is the original git creation date of both
the spec and plan files, matching the YYYY-MM-DD convention.
The legacy canonical skill (skills/skill-optimizer/SKILL.md) was a
v1.3-era artifact — a direct workbench-CLI wrapper. Its role under
the v1.4 chain is filled by skill-optimizer-run-bench. The v1.4
spec already called for this removal "in a separate cleanup PR"
after validation; doing it now while the chain layout is being
restructured anyway.
- Moved skills/skill-optimizer/references/workbench.md
-> skills/skill-optimizer-shared/workbench.md
(load-bearing — B4 references the workbench schema reference;
co-located with the chain's other shared docs)
- Updated B4's pointer to the new location
- Deleted skills/skill-optimizer/ (folder)
Git history preserves the deleted SKILL.md content if needed.
NOT updated in this commit (separate cleanup needed before merge):
- Plugin metadata still references skills/skill-optimizer/SKILL.md
in .claude-plugin/, .codex-plugin/, .cursor-plugin/, .opencode/,
gemini-extension.json
- CLAUDE.md mentions skills/skill-optimizer/SKILL.md as canonical
- README.md and CONTRIBUTING.md may reference it
These references need updating to point at the v1.4 chain
(or the chain's entry point) before this work merges to
development. Flagged here so they're not forgotten.
…tion
Two related trims per review:
1. Dropped redundant nouns from 3 chain skill names where the noun just restated the namespace (whole namespace is skill-optimizer, so "skill"/"result"/"improvement" added no information):
skill-optimizer-analyze-result -> skill-optimizer-analyze
skill-optimizer-improve-skill -> skill-optimizer-improve
skill-optimizer-validate-improvement -> skill-optimizer-validate
The verbs stay because they actually distinguish what each step does. Other 5 skills keep their full names (functionality, test-case, submissions, tests, bench are meaningful nouns).
2. Dropped the "Throughout this document, 'step 1' through 'step 9' (no parens) refer to skills in the chain. Internal workflow steps within THIS skill are labelled '(a)' through '(g)'." paragraph from each chain skill. Workflow steps use letters; chain steps use numbers; the distinction is self-explanatory from context. Ranged references like "step 1 through step 9" risked confusing the agent (per review: "sometimes the agent might not know what that means").
Updated cross-references throughout: chain skills' handoffs, spec doc's architecture overview, plan doc's task descriptions, workflow doc's chain table.
Also fixed the spec doc layout: removed stale skill-optimizer/ folder entry (nuked previously), realigned column comments, added shared/ entries for workflow.md and workbench.md that weren't previously listed.
Final chain SKILL.md sizes (8 files, 1196 total lines, avg 150):
B1 investigate-functionality 131 -> 127
B2 investigate-test-case 174 -> 170
B3 investigate-submissions 142 -> 138
B4 write-tests 182 -> 178
B5 run-bench 131 -> 127
B6 analyze-result -> analyze 148 -> 144
B7 improve-skill -> improve 155 -> 151
B8 validate-improvement -> validate 165 -> 161
The chain previously had a gap: B4's smoke check verified grader/
fixture syntactic consistency (the test-writer wrote both the
grader AND the smoke fixtures, so the smoke check is
self-validation), but there was no independent semantic check that
the probes actually probe what they claim to. v1.3 ran into this:
grader bugs propagated to misleading bench results and
ducktape-shaped improvements. This step closes that gap.
New skill: skill-optimizer-validate-tests (step 5, fresh-derivation)
- Dispatches test-validator subagents in parallel (one per probe)
- Each judges: does workspace exercise the parent functionality?
is the grader correct + fair? do smoke fixtures truly
distinguish (vs. coincidentally match)?
- Writes 05-tests-verdict.md (aggregate + per-probe verdicts)
- Step 6 (run-bench) refuses to fire unless all_probes_approved: true
- Parallel to the step 9 validator for improvement proposals;
both are anti-ducktape gates
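A minimal sketch of how the aggregate in 05-tests-verdict.md could be derived from per-probe verdicts. The verdict string "approve", the blocking_probes field, and the treatment of an empty probe set as not approved are assumptions for illustration, not details from the spec.

```python
def aggregate_verdicts(per_probe: dict[str, str]) -> dict[str, object]:
    """Collapse per-probe verdicts into the gate field run-bench reads.

    `per_probe` maps a probe path (e.g. "search/basic") to its verdict.
    """
    blocking = sorted(probe for probe, verdict in per_probe.items() if verdict != "approve")
    return {
        # An empty suite is treated as not approved (assumption): nothing was validated.
        "all_probes_approved": bool(per_probe) and not blocking,
        "blocking_probes": blocking,
    }
```

Listing the blocking probes next to the boolean would let the operator see which probes to rebuild before run-bench can fire.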
Downstream renumbering (steps 5-9 -> 6-10):
Step Skill File
6 skill-optimizer-run-bench 06-bench-{results,summary}
7 skill-optimizer-analyze 07-analysis.md
8 skill-optimizer-improve 08-improvement-proposal.md
9 skill-optimizer-validate 09-validator-verdict.md
10 skill-optimizer-autopilot autopilot-summary-<ts>.md
Bulk renames executed:
- File paths: 05-bench-* -> 06-bench-*, 06-analysis.md -> 07-,
07-improvement-proposal.md -> 08-, 08-validator-verdict.md -> 09-
- Step number references in all chain SKILL.md + shared docs
(reverse-order sed to avoid collision: 9->10, 8->9, 7->8, 6->7,
5->6)
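The reverse-order trick above can be sketched as follows. This is an illustrative Python equivalent of the sed pass, assuming references are written as "step N"; the function name is hypothetical.

```python
import re


def renumber_steps(text: str, first: int, last: int) -> str:
    """Shift every 'step N' with first <= N <= last up by one.

    Working from the highest number down means a freshly written
    'step 6' is never picked up again by the later 6 -> 7 pass.
    """
    for n in range(last, first - 1, -1):
        # \b after the number keeps 'step 1' from matching inside 'step 10'.
        text = re.sub(rf"\bstep {n}\b", f"step {n + 1}", text)
    return text
```

Running the passes in ascending order instead would turn "step 5" into "step 6" and then keep shifting it on every later pass, which is the collision the commit avoids.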
Spec doc updates:
- Goal: "seven independent skills" -> "nine independent skills +
auto-pilot driver"
- Architecture layout: insert validate-tests at #5; rename
autopilot to #10
- State layout: insert 05-tests-verdict.md
- Subagent constraints table: add test-validator row
- Per-step sections: insert ### 5. validate-tests; renumber
existing ### 5-9 to ### 6-10
- Autopilot: "Walks 1→8" -> "Walks 1→9"; "eight steps" -> "nine"
Iteration-protocol + subagent-dispatch shared docs: step-kind
table and fresh-derivation enumeration both updated for the new
step list.
Workflow.md: full rewrite of chain table, re-run triggers matrix,
backward triggers, state layout — added validate-tests row in
each.
Plan doc updated via bulk sed only (it's historical implementation
record; precise per-task accuracy not required at this stage).
Net: chain has 10 entries now (9 chain steps + autopilot).
Phase C of the v1.4 implementation. Each chain skill that
dispatches a subagent loads its prompt template at the dispatch
step; this commit creates the 8 templates.
Each prompt is structured consistently:
- Title + role (what step dispatches it, what it produces)
- Inputs (templated by the operator session with ${VAR}
placeholders matching the chain skill's substitution list)
- What you see / What you do NOT see (the constraints — these
mirror what each chain skill says in its Dispatch step, but
load-bearing for the subagent to internalize)
- Output specification (frontmatter + body shape)
- Reasoning protocol (lettered or numbered steps)
- Edge cases (typically BLOCKED conditions to surface)
- Return summary (what the subagent reports back to the operator)
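A minimal sketch of the operator-side substitution, assuming the ${VAR} placeholders follow the syntax that Python's string.Template happens to share. OUTPUT_PATH and the template text are hypothetical, used only for illustration.

```python
from string import Template


def render_prompt(template_text: str, values: dict[str, str]) -> str:
    """Fill ${VAR} placeholders; raises KeyError if any placeholder is unfilled."""
    return Template(template_text).substitute(values)


EXAMPLE_TEMPLATE = "Read the skill at ${SKILL_SOURCE_PATH}; write your report to ${OUTPUT_PATH}."

EXAMPLE_PROMPT = render_prompt(
    EXAMPLE_TEMPLATE,
    {"SKILL_SOURCE_PATH": "vendored-skill/", "OUTPUT_PATH": "01-functionality.md"},
)
```

Using substitute rather than safe_substitute makes a missing input fail loudly before dispatch. One caveat: a literal dollar sign in a prompt body would need to be written as $$ under this scheme.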
Files written:
1. research-functionality.md (127L) — B1 functionality researcher
- Reads source skill + targeted web; produces 01-functionality.md
- Doesn't see prior analyses, tests, or proposals
2. test-case-designer.md (156L) — B2 test-case designer
- Reads 01-functionality + current tests/ tree; produces
02-test-proposals.md + tests/<func>/spec.yaml per functionality
- Doesn't see skill source (forces design from STATED
responsibilities)
3. research-submissions.md (133L) — B3 submission researcher
- Reads upstream repo via gh CLI; produces 03-submissions.md
- Doesn't see proposed change (preserves validator's
independence)
4. test-writer.md (166L) — B4 test writer (dispatched per probe)
- Reads probe spec + parent functionality + skill source;
produces probe folder contents
- Doesn't see other probes (per-probe isolation prevents
homogenization + grader-leak hacking)
5. test-validator.md (186L) — B5 test validator (NEW, per probe)
- Reads probe contents + parent functionality + skill source;
produces per-probe verdict
- Doesn't see other probes, test-writer's reasoning, prior
verdicts
- Four judgment dimensions: workspace fairness, grader
correctness, smoke fixture distinguishing power, fairness
across reasonable agent outputs
6. analyzer.md (183L) — B7 analyzer (anti-ducktape gate)
- Reads bench results + probe specs (intent only) + skill
source; produces 07-analysis.md
- Doesn't see workspace files (forces SKILL-thinking not
SOLUTION-thinking)
- Each weakness must include five required parts (Pattern,
Hypothesized cause, Connects to skill section, What WOULD
address this, What WOULD NOT address this)
7. optimizer.md (177L) — B8 optimizer (anti-ducktape critical)
- Reads 07-analysis + 01-functionality + current skill + (PR
03-submissions); produces 08-improvement-proposal.md
- Doesn't see raw trials, grader internals, prior proposals,
prior verdicts
- Required self-check section against analyzer's anti-pattern
list
8. validator.md (196L) — B9 validator (anti-ducktape gate)
- Reads BEFORE + AFTER skill + proposal artifact + 01-functionality
+ (PR 03-submissions); produces 09-validator-verdict.md
- Doesn't see optimizer's reasoning trace or prior verdicts
- Internal consistency check + external consistency check (if
PR-bound)
Total: 8 prompts, 1324 lines. Each prompt is self-contained and
testable independently of the parent chain skill.
Subagent-prompt files live in skills/skill-optimizer-subagents/.
Next steps: B10 autopilot SKILL.md; fix each chain SKILL.md's
dispatch step to point at the now-existing prompt file (most
references are already pointing at the right path; this is a
verification pass).
…ss of source
B1 now vendors the source skill to vendored-skill/ unconditionally, not just for upstream sources. Downstream steps stop branching on "local vs upstream": they always read vendored-skill/ as the single canonical input, and the user's original local file is never touched by the chain.
Why this is cleaner:
- One code path through the chain (no local/upstream conditional)
- vendored-skill/ is THE input; the original is just the source we copied from (path recorded in 01-functionality.md's skill_source frontmatter)
- Stability: long-running chain runs don't break if the user edits the local file mid-flight
- improved-skill/ vs vendored-skill/ is the clean before/after pair for both source types; B8 and B9 just say "improved-skill/ if exists, else vendored-skill/" without local-file special cases
B1 SKILL.md updates:
- "What you produce" paragraph: vendoring is unconditional; user's original local file is not touched by the chain
- Workflow step (c) renamed from "Vendor the source (upstream only)" to "Vendor the source" with explicit upstream/local copy paths (gh api fetch / cp -r)
- Re-vendor triggers documented: source URL change (upstream), or user explicitly asks (local edits, upstream new commits)
Updated all downstream references:
- B6 (run-bench): "vendored-skill/ should exist" no longer says "for upstream skills"
- B8 (improve), B9 (validate): SKILL_CURRENT_PATH simplified to "improved-skill/ if exists, else vendored-skill/"
- iteration-protocol's "what this protocol does NOT cover": vendored-skill/ described as canonical input regardless of upstream/local
- subagent-dispatch's step-7+8 carve-out: same simplification. Also fixed pre-existing typo (validator was labeled step 8; should be step 9)
- 6 subagent prompts (research-functionality, test-writer, test-validator, analyzer, optimizer, validator): SKILL_SOURCE_PATH / SKILL_CURRENT_PATH / SKILL_BEFORE_PATH all reference vendored-skill/ unconditionally
- workflow.md state-layout comment: "vendored-skill/ (always)"
- spec doc state-layout comment: same
Trim of an architecture branch that wasn't pulling its weight.
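A minimal sketch of the simplified SKILL_CURRENT_PATH rule ("improved-skill/ if exists, else vendored-skill/"), assuming both folders sit side by side under the run directory docs/skill-optimizer/<slug>/. The function names are hypothetical.

```python
from pathlib import Path
import tempfile


def skill_current_path(run_dir: Path) -> Path:
    """Pick the skill state that improve/validate should read."""
    improved = run_dir / "improved-skill"
    # improved-skill/ holds accumulated improvements; vendored-skill/ is the frozen input.
    return improved if improved.is_dir() else run_dir / "vendored-skill"


def demo() -> tuple[str, str]:
    """Show the choice before and after the first approved improvement."""
    with tempfile.TemporaryDirectory() as tmp:
        run_dir = Path(tmp)
        (run_dir / "vendored-skill").mkdir()
        before = skill_current_path(run_dir).name
        (run_dir / "improved-skill").mkdir()
        after = skill_current_path(run_dir).name
        return before, after
```

Because vendoring is unconditional, the fallback never needs a local-versus-upstream branch.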
Real-world context from prior pilot runs: sometimes the user points at a SKILL.md that's just a thin wrapper referencing the actual content elsewhere. Common pattern: a multi-agent plugin has one canonical agent-agnostic content file and several agent-flavored SKILL.md wrappers (one per agent target) that all reference it. The PR target is the canonical content, not the wrapper; a change to the canonical may affect multiple wrappers.
Updated research-submissions.md subagent prompt:
New frontmatter fields the subagent emits:
- entry_file_pattern: true|false
- canonical_target_repo: <owner>/<repo>
- canonical_target_path: <path to actual content>
- linked_consumers: [<repo>:<path>, ...]
New body sections:
1. Source structure: entry-file vs canonical content; if entry-file, document the relationship + linked consumers
2. Suggested PR target: based on (1), recommend which file(s) to modify and whether the change affects other consumers
New reasoning protocol steps:
1. Detect entry-file pattern (read source SKILL.md, look for thin-wrapper signals: short body of "see X" pointers, frontmatter fields like reference: / source: / canonical:, multi-agent plugin layout). Follow the pointer to find the canonical content if detected.
2. Find linked consumers (search for other SKILL.md files referencing the same canonical content)
7. Suggest the PR target based on the above
The license / CLA / frontmatter / conventions research now applies to the CANONICAL CONTENT'S repo, which may differ from the entry file's repo.
Note: B1 (functionality researcher) likely also needs awareness of this pattern. If the user pointed at an entry file, downstream chain steps test/analyze/optimize the wrapper rather than the actual skill. Flagged as follow-up but not addressed in this commit (the user asked specifically about B3).
B1 now handles the entry-file pattern symmetrically with B3 —
detected at vendor time so downstream chain steps test/analyze/
optimize the actual skill content rather than a thin wrapper.
Previously: if the user pointed at an entry-file SKILL.md (a
wrapper referencing the actual content elsewhere), B1 vendored
the wrapper and all downstream steps operated on it. The whole
optimization run would miss its real target.
Now (B1 SKILL.md):
- New step (c.1) "Detect entry-file pattern + user gate" runs
after the initial vendor at (c). Operator reads
vendored-skill/SKILL.md for thin-wrapper signals (short "see X"
body; frontmatter reference:/source:/canonical: fields;
multi-agent plugin layout)
- If detected, surfaces to user with a clear choice: optimize
the wrapper, or re-vendor the canonical content and optimize
that. The common case (canonical) re-vendors; the rare case
(wrapper-specific change) keeps the existing vendored content
but records the relationship as an operator directive
- Either way, 01-functionality.md frontmatter records
entry_file_pattern: true|false and canonical_source: <path>
for downstream steps
Updated research-functionality.md subagent prompt:
- New inputs: ${ENTRY_FILE_PATTERN}, ${CANONICAL_SOURCE}
- New frontmatter fields: entry_file_pattern, canonical_source
- New body section 9: "Entry-file relationship" (only if
pattern detected) — notes the relationship, the vendored copy
content, any agent-specific adaptations
Updated research-submissions.md (B3) subagent prompt:
- Reasoning step 1 now reads 01-functionality.md's entry-file
fields as primary input. B3 trusts B1's detection; falls back
to its own detection only if 01-functionality.md was written
before this feature (defensive). If B3 detects a pattern B1
missed, surface to operator — suggests B1 needs a re-run
The whole chain now consistently knows which is the wrapper and
which is the canonical content from B1 onward.
Subagent prompts shouldn't reference chain-skill internal workflow labels (the lettered (a)-(g) steps inside each chain SKILL.md) — subagents only see their dispatched inputs + their prompt, never the chain skill itself. Two references to "step c.1" (B1's internal entry-file detection step) were invisible noise. Rephrased to describe what happened functionally — "the operator session confirmed with the user and re-vendored" — without naming the step label. Chain-step number references (step 1 through step 10) stay, because those are stable role descriptors the subagent understands as "another role in the chain", not internal workflow lettering. Files touched: - skills/skill-optimizer-subagents/research-functionality.md - skills/skill-optimizer-subagents/research-submissions.md
…al check
Previous design had B1's operator session do mechanical wrapper
detection at workflow step (c.1) — read SKILL.md, check for
heuristic signals, surface to user. Wrong design: the subagent is
already reading the source for research, has LLM judgment that
beats a mechanical check, and the heuristics ("body under 30 lines",
"frontmatter has reference:") would miss real cases.
Restructured: detection is the subagent's judgment, reported in
its return summary. Operator reacts by surfacing to user, who
picks the resolution.
B1 SKILL.md changes:
- Removed workflow step (c.1) — no more operator-side detection
- Step (g) renamed "Confirm + handle wrapper detection + hand off"
— reads the subagent's `likely_wrapper` frontmatter field; if
true, surfaces to user with three realistic responses:
1. Re-vendor the referenced content and re-research (common)
2. Proceed treating the wrapper as the skill (rare)
3. Cancel and provide a different source
- Frontmatter field rename: `entry_file_pattern` -> `likely_wrapper`
(more honest — it's a judgment, not a binary classification)
and `canonical_source` -> `wrapper_points_to` (descriptive
rather than presuming a canonical/wrapper hierarchy)
research-functionality.md (B1 subagent prompt):
- Removed ${ENTRY_FILE_PATTERN} and ${CANONICAL_SOURCE} inputs —
these were operator-pre-detected fields. Detection now lives in
the subagent's reasoning protocol.
- New "Wrapper detection" section in the reasoning protocol —
explicit patterns to look for, explicit instruction to record
`likely_wrapper` + `wrapper_points_to` in frontmatter when
judged true, and explicit instruction to NOT follow the pointer
or re-vendor itself (operator's job after user gate)
- Body section 9 renamed to "Wrapper observation" — describes
signals + confidence rather than asserting a canonical/wrapper
relationship
- Return summary now includes the wrapper finding so operator can
trigger the user gate
research-submissions.md (B3 subagent prompt):
- Frontmatter field rename: `canonical_target_repo` /
`canonical_target_path` -> `pr_target_repo` / `pr_target_path`
(cleaner — what the PR composer needs is the PR target, not a
taxonomy of canonical-vs-wrapper)
- Reasoning step 1 reads B1's `likely_wrapper` / `wrapper_points_to`
as authoritative; falls back to own judgment only if
01-functionality predates this feature
Net: wrapper-detection happens once (at B1, by the subagent), and
the result flows through frontmatter to downstream steps. Operator
sessions handle the user gates; no operator does LLM-style
judgment work.
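The split described above (subagent judges, operator only gates) might look roughly like this on the operator side. Only the field names `likely_wrapper` and `wrapper_points_to` and the three user responses come from the commit message. The types and the `wrapperGate` function are hypothetical.

```typescript
// Sketch under assumptions: the frontmatter is already parsed into an object.
// Only `likely_wrapper` and `wrapper_points_to` are names from the commit
// message; everything else here is illustrative.
type WrapperChoice = "re-vendor" | "proceed-as-skill" | "cancel";

interface FunctionalityFrontmatter {
  likely_wrapper?: boolean;
  wrapper_points_to?: string;
}

type GateAction =
  | { kind: "hand-off" }
  | { kind: "ask-user"; pointsTo: string | undefined; options: WrapperChoice[] };

function wrapperGate(fm: FunctionalityFrontmatter): GateAction {
  // The operator does no detection of its own; it only reacts to the
  // subagent's recorded judgment.
  if (fm.likely_wrapper !== true) return { kind: "hand-off" };
  return {
    kind: "ask-user",
    pointsTo: fm.wrapper_points_to,
    options: ["re-vendor", "proceed-as-skill", "cancel"],
  };
}
```

A frontmatter written before this feature has neither field, so it falls through to a plain hand-off. The commit leaves that case to the downstream fallback.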
Step 2 is now skill-optimizer-investigate-submissions (PR-bound only); step 3 is the renamed skill-optimizer-design-tests (was skill-optimizer-investigate-test-case). PR research now logically follows step 1 immediately when PR-bound, before test design.
Renamed:
- skills/skill-optimizer-investigate-test-case/ → skill-optimizer-design-tests/
- skills/skill-optimizer-subagents/test-case-designer.md → test-designer.md
- 02-test-proposals.md ↔ 02-submissions.md (file numbers follow new step numbers)
Also fixed pre-existing step-header bugs from the validate-tests insertion (run-bench, analyze, improve, validate had stale "Step N" headers off by one).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
B-prefix shorthand (B1, B2, ...) was internal brainstorming notation that leaked into shipped SKILL files. Sweeping it out so the chain documentation is self-explanatory to anyone who hasn't seen the v1.4 design discussions.
Also fixed three additional stale step-number references found during the audit:
- frontmatter-discipline.md gating-field summary was off by one to two steps (predated the validate-tests insertion)
- analyze SKILL.md "Step 7 will refuse" handoff message named the wrong gating step (should be step 8, improve)
- analyze SKILL.md "step-5 problem" should be "step-6 problem" for malformed bench output (run-bench is step 6)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The wrapper-vs-target gymnastics in research-submissions was
over-engineered. The correct model: step 1's user gate is where
the wrapper question gets resolved. If the user opts to optimize
the underlying content, step 1 re-vendors with an updated
SKILL_SOURCE and the new 01-functionality.md has likely_wrapper=
false. Downstream steps just read skill_source as the PR target —
no conditional logic, no re-litigation.
Changes:
- step 1 SKILL.md (g): make the SKILL_SOURCE update on re-vendor
explicit, and document what each user choice means for what
downstream steps see
- research-submissions.md frontmatter description: pr_target_*
derives directly from skill_source; no special wrapper case
- Drop body sections 1 ("Source structure") and 2 ("PR target")
— they were redundant with frontmatter and re-litigated the
step-1 decision
- Reasoning protocol: replace the wrapper-handling step with a
one-line read of skill_source; reorder so PR-shape research
comes before linked-consumer search
- Linked consumers stays as a coordination hint for the PR
composer, but as its own body section rather than woven into
the (now-removed) wrapper analysis
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The smoke-skill-distribution test was reading paths under the nuked skills/skill-optimizer/ folder. Updated to:
- Walk all 9 chain skill directories and verify each has valid SKILL.md frontmatter (replaces the single-canonical check that predates the v1.4 chain decomposition)
- Point workbench reference checks at skills/skill-optimizer-shared/workbench.md (the new shared location)
Also removed an empty skill-optimizer-validate-improvement/ directory that wasn't cleaned up when the skill was renamed to skill-optimizer-validate.
Known v1.4 debt still on the cleanup list: .claude-plugin/marketplace.json points at ./skills/skill-optimizer (the nuked path). The test passes because it does string compare without existence check; needs follow-up in the plugin metadata sweep.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
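The per-skill frontmatter check mentioned in this commit could be as small as the sketch below. It assumes that "valid" means a leading `---` block declaring non-empty `name` and `description` keys. The actual test in tests/smoke-skill-distribution.ts may check more or differently, and `hasValidSkillFrontmatter` is a hypothetical name.

```typescript
// Sketch under assumptions (hypothetical helper, not the PR's test code):
// "valid frontmatter" here means a leading `---` fenced block that declares
// non-empty `name` and `description` keys.
function hasValidSkillFrontmatter(skillMd: string): boolean {
  const lines = skillMd.split(/\r?\n/);
  if (lines[0]?.trim() !== "---") return false;
  // Find the closing fence, skipping the opening one at index 0.
  const end = lines.findIndex((line, i) => i > 0 && line.trim() === "---");
  if (end === -1) return false;
  const block = lines.slice(1, end);
  const hasKey = (key: string): boolean =>
    block.some((line) => new RegExp(`^${key}:\\s*\\S`).test(line));
  return hasKey("name") && hasKey("description");
}
```

A caller would read each chain skill's SKILL.md and pass the text in. Keeping the check a pure function of the file contents makes it easy to exercise without a fixture tree.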
After the v1.4 redesign nuked the monolithic skills/skill-optimizer/ in favor of 9 chain skills under skills/skill-optimizer-*/, the plugin manifests, install docs, and Gemini context file still pointed at the dead paths. The smoke test caught the workbench reference but not the marketplace pointer (string compare only).
Fixed across all surfaces:
- .claude-plugin/marketplace.json: skills array now lists all 9 chain skill paths instead of the dead ./skills/skill-optimizer
- GEMINI.md: @imports point at shared/workflow.md (chain overview) + shared/workbench.md instead of the deleted monolithic SKILL.md and references/workbench.md
- README.md, CONTRIBUTING.md, AGENTS.md, CLAUDE.md, docs/README.{codex,opencode}.md, .cursor/INSTALL.md, .codex/INSTALL.md, .opencode/INSTALL.md: replaced "canonical skill" framing with the 9-skill chain description; updated --skill flag examples to enumerate each chain skill explicitly
- tests/smoke-skill-distribution.ts: marketplace test now asserts all 9 chain skills are listed AND verifies each path has a real SKILL.md on disk (closes the string-compare-only loophole that let the old dead path pass). Gemini test asserts the new @import targets.
All 11 tests pass; typecheck clean.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
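The loophole closed here (a listed path that matches a string but does not exist) can be illustrated with a short existence check. This is a sketch, not the code in tests/smoke-skill-distribution.ts. `missingSkillFiles` and its parameters are hypothetical, and the caller is assumed to have already extracted the skill paths from the manifest.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Sketch (hypothetical helper): returns the listed skill paths that lack a
// SKILL.md under `repoRoot`. `exists` is injectable so the check can be
// exercised without touching disk; it defaults to fs.existsSync.
function missingSkillFiles(
  skillPaths: string[],
  repoRoot: string,
  exists: (file: string) => boolean = fs.existsSync,
): string[] {
  return skillPaths.filter(
    (skillPath) => !exists(path.join(repoRoot, skillPath, "SKILL.md")),
  );
}
```

With a check like this, a stale entry such as ./skills/skill-optimizer would be reported as missing instead of passing on string equality alone.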
Summary
Implementation of the v1.4 skill-optimizer chain architecture proposed in RFC #52. Decomposes the v1.3 monolithic orchestrator subagent into 9 independent chain skills + 1 auto-pilot driver, each dispatching narrow-context subagents that produce human-reviewable artifacts at convention paths.
Draft because B10 (autopilot SKILL.md), plugin metadata updates, and end-to-end validation runs are still pending. Opening now for early review of the architecture as implemented.
What's in this PR
9 chain skill SKILL.md + 1 auto-pilot pending (B10)
- investigate-functionality
- investigate-test-case
- investigate-submissions (optional)
- write-tests
- validate-tests
- run-bench
- run-suite
- analyze
- improve
- validate
- autopilot
8 subagent prompt templates (Phase C, just completed)
skills/skill-optimizer-subagents/: each chain skill that dispatches a subagent loads the corresponding prompt template at its dispatch step. Each prompt has consistent structure: role, inputs, what-it-sees / what-it-doesn't-see (the anti-ducktape constraints), output spec, reasoning protocol, edge cases, return summary.
4 shared docs
skills/skill-optimizer-shared/:
- iteration-protocol.md: iteration mechanics (step kinds, staleness, destructive-edit checkpoints)
- subagent-dispatch.md: dispatch architecture (constraints, directives, no-auto-invocation rule)
- frontmatter-discipline.md: runtime facts vs history (git replaces version+archive bookkeeping)
- workflow.md: operator-facing chain reference (chain table, re-run triggers, backward triggers)
- workbench.md: workbench schema (moved from legacy skills/skill-optimizer/references/; needs rewrite, flagged)
Architecture
- tests/<functionality>/<probe>/{spec.yaml, workspace/, grader.mjs, smoke/, checks/}. No `picked: []` or `built_cases: []` arrays: `picked: true|false` lives in the per-functionality spec.yaml; built state is implicit (probe folder + grader.mjs present).
- No `version:` fields, no `archive/` directories. Git already content-addresses every prior state; `git log` mtime gives staleness signal for auto-pilot.
- `vendored-skill/` stays frozen; `improved-skill/` accumulates approved improvements via step 9.
Spec + plan docs
Moved to the superpowers convention:
- docs/superpowers/specs/2026-05-19-skill-optimizer-v1.4-design.md
- docs/superpowers/plans/2026-05-19-skill-optimizer-v1.4.md
What's NOT in this PR (deferred)
- skill-optimizer-autopilot/SKILL.md: chain driver still pending
- workbench.md rewrite: currently in legacy shape from the v1.3 era; flagged as future work
- skills/auto-improve-orchestrator/ removal: separate cleanup
Stats
Test plan
This PR is structurally complete but not yet validated end-to-end. Validation pending in a follow-up:
- `npx tsx tests/smoke-skill-distribution.ts`, `npm pack --dry-run`
Reference
Builds on RFC #52, which carries the architectural rationale, evidence base, and concerns-addressed sections. This PR is the implementation.
🤖 Generated with Claude Code