
feat(skill-optimizer): v1.4 chain implementation (draft, per RFC #52) #53

Draft
Zhaiyuqing2003 wants to merge 50 commits into development from feat/skill-optimizer-v1.4


@Zhaiyuqing2003

Summary

Implementation of the v1.4 skill-optimizer chain architecture proposed in RFC #52. Decomposes the v1.3 monolithic orchestrator subagent into 9 independent chain skills + 1 auto-pilot driver, each dispatching narrow-context subagents that produce human-reviewable artifacts at convention paths.

Draft because B10 (autopilot SKILL.md), plugin metadata updates, and end-to-end validation runs are still pending. Opening now for early review of the architecture as implemented.

What's in this PR

9 chain skill SKILL.md + 1 auto-pilot pending (B10)

| # | Skill | Kind | What it does |
|---|-------|------|--------------|
| 1 | investigate-functionality | fresh-derivation | Research what the skill does |
| 2 | investigate-test-case | maintenance | Design functionalities to test |
| 3 | investigate-submissions (optional) | fresh-derivation | Research upstream PR conventions |
| 4 | write-tests | maintenance | Build per-probe workspace + grader + smoke fixtures |
| 5 | validate-tests | fresh-derivation | Independent semantic check on probes (anti-ducktape gate) |
| 6 | run-bench | fresh-derivation (summary) | CLI wrapper for run-suite |
| 7 | analyze | fresh-derivation | Cluster failures into named structural weaknesses |
| 8 | improve | fresh-derivation | Optimizer drafts principled fix |
| 9 | validate | fresh-derivation | Independent verdict on the improvement |
| 10 (pending) | autopilot | chain driver | Walks 1→9 end-to-end |

8 subagent prompt templates (Phase C, just completed)

skills/skill-optimizer-subagents/: each chain skill that dispatches a subagent loads the corresponding prompt template at its dispatch step. Each prompt has consistent structure: role, inputs, what-it-sees / what-it-doesn't-see (the anti-ducktape constraints), output spec, reasoning protocol, edge cases, return summary.

4 shared docs, plus workbench.md moved from legacy

skills/skill-optimizer-shared/:

  • iteration-protocol.md — iteration mechanics (step kinds, staleness, destructive-edit checkpoints)
  • subagent-dispatch.md — dispatch architecture (constraints, directives, no-auto-invocation rule)
  • frontmatter-discipline.md — runtime facts vs history (git replaces version+archive bookkeeping)
  • workflow.md — operator-facing chain reference (chain table, re-run triggers, backward triggers)
  • workbench.md — workbench schema (moved from legacy skills/skill-optimizer/references/; needs rewrite, flagged)

Architecture

  • Filesystem-as-state for the test tree: tests/<functionality>/<probe>/{spec.yaml, workspace/, grader.mjs, smoke/, checks/}. No picked: [] or built_cases: [] arrays — picked: true|false lives per-functionality spec.yaml; built state is implicit (probe folder + grader.mjs present).
  • Git-native iteration: no version: fields, no archive/ directories. Git already content-addresses every prior state; git log mtime gives staleness signal for auto-pilot.
  • Two anti-ducktape gates:
    • Step 5 (validate-tests) → step 6: can't measure against bad probes.
    • Step 7 (analyze) → step 8 (improve), and step 9 (validate) → materialize improved-skill/: can't ship a ducktape patch. The analyzer must name a weakness with both a general principle AND an explicit anti-pattern list; the validator checks the optimizer's self-check against that list.
  • Original skill is never modified: vendored-skill/ stays frozen; improved-skill/ accumulates approved improvements via step 9.
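
A minimal sketch of how the filesystem-as-state rule could be read back, assuming the caller already has a flat list of repo-relative paths and a map of picked flags parsed from spec.yaml (YAML parsing is left out). The names ProbeState, deriveProbeStates, and pickedByFunctionality are hypothetical and do not come from this PR.

```typescript
// Hypothetical sketch: derive probe state from repo-relative paths alone.
// Layout assumed from the PR text: tests/<functionality>/<probe>/grader.mjs

interface ProbeState {
  functionality: string;
  probe: string;
  picked: boolean; // supplied by the caller after reading spec.yaml (assumption)
  built: boolean; // implicit: probe folder exists and grader.mjs is present
}

function deriveProbeStates(
  paths: string[],
  pickedByFunctionality: Record<string, boolean>,
): ProbeState[] {
  const probes = new Map<string, ProbeState>();
  for (const path of paths) {
    const parts = path.split("/");
    // Skip anything that is not tests/<functionality>/<probe>/<file...>
    if (parts.length < 4 || parts[0] !== "tests") continue;
    const functionality = parts[1] ?? "";
    const probe = parts[2] ?? "";
    const file = parts.slice(3).join("/");
    const key = `${functionality}/${probe}`;
    const state: ProbeState = probes.get(key) ?? {
      functionality,
      probe,
      picked: pickedByFunctionality[functionality] ?? false,
      built: false,
    };
    // No built_cases array: the presence of grader.mjs is the whole signal.
    if (file === "grader.mjs") state.built = true;
    probes.set(key, state);
  }
  return Array.from(probes.values());
}
```

Under these assumptions a probe with only spec.yaml on disk reports built: false, so no separate bookkeeping array can drift out of sync with the tree.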

Spec + plan docs

Moved to the superpowers convention:

  • docs/superpowers/specs/2026-05-19-skill-optimizer-v1.4-design.md
  • docs/superpowers/plans/2026-05-19-skill-optimizer-v1.4.md

What's NOT in this PR (deferred)

  • B10 skill-optimizer-autopilot/SKILL.md — chain driver still pending
  • Plugin metadata + CLAUDE.md + README updates pointing at v1.4 (separate cleanup PR per spec)
  • workbench.md rewrite — currently in legacy shape from the v1.3 era; flagged as future work
  • End-to-end validation runs on real skills (firecrawl regression case, etc.)
  • Legacy skills/auto-improve-orchestrator/ removal — separate cleanup

Stats

  • 27 files changed, +6129 / -210 lines
  • 9 chain SKILL.md (avg 150 lines each)
  • 8 subagent prompts (avg 166 lines each, 1324 total)
  • 4 shared docs (avg ~100 lines each)
  • 40 commits, all conventional-format

Test plan

This PR is structurally complete but not yet validated end-to-end. Validation pending in a follow-up:

  • Manual walk-through of the chain on a small local skill — verify each chain skill dispatches correctly, each subagent prompt produces the expected output shape
  • Re-run firecrawl (v1.3 regression case) under v1.4; expect the optimizer to either produce a principled fix OR honestly refuse (no regression shipped)
  • Auto-pilot smoke test on a local skill (after B10 lands)
  • Plugin metadata + CLAUDE.md/README pointer updates so the v1.4 chain is discoverable
  • Cross-platform plugin packaging verification (npx tsx tests/smoke-skill-distribution.ts, npm pack --dry-run)

Reference

Builds on RFC #52. That PR carries the architectural rationale, evidence base, and concerns-addressed sections; this PR is the implementation.

🤖 Generated with Claude Code

Yuqing Zhai and others added 30 commits May 19, 2026 06:01
Approved design (brainstormed 2026-05-12) converting the v1.3
monolithic auto-improve-orchestrator into 7 independent Claude Code
skills that chain via the superpowers plugin pattern.

Motivation: team review of v1.3 PR drafts surfaced 4 critiques —
v1.3 optimizes for incremental numerical uplift without validating
test case quality, grader correctness, or improvement principledness.
The firecrawl iteration regression (1.0 → 0.44 from piling on
Recipe A+D simultaneously to chase a small uplift) is the canonical
"ducktape-by-monolithic-orchestrator" case study.

Key architectural shifts:
- Skills (not slash commands) per superpowers convention
- Convention-pathed reports at docs/skill-optimizer/<slug>/...
  (visible + committable, like docs/superpowers/specs/)
- Strict limited-context subagents for all generative work (writer,
  analyzer, optimizer, validator) — prevents tunnel-vision into
  ducktape patches
- Validator subagent after every improvement (internal + optional
  external consistency)
- Auto-pilot = natural chained invocation, not a separate orchestrator

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
User clarification: the agent asks 'optimize for PR submission?' at
step 1 (when the user first provides an upstream skill), not at the
end of step 7. This decision determines whether step 3
(investigate-submissions) runs at all.

Changes:
- 'Two contexts, one workflow' rewritten: PR decision at step 1,
  recorded in 01-functionality.md frontmatter as pr_submission_intent
- Step 1 behavior: explicit PR question for upstream sources
- Step 2 handoff: reads pr_submission_intent to decide whether to
  invoke step 3
- Step 3 'skipped when': now keyed on pr_submission_intent: false
- Step 7 handoff: three branches (local / upstream-no-PR /
  upstream-yes-PR), no late prompts
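
The routing described in this commit can be sketched roughly as follows. Only pr_submission_intent and the three branch names come from the commit message; the source field and both function names are assumptions made for illustration.

```typescript
// Hypothetical shapes; only pr_submission_intent is named in the source.
type SkillSource = "local" | "upstream";

interface FunctionalityFrontmatter {
  source: SkillSource; // assumed field, for illustration only
  pr_submission_intent: boolean; // recorded at step 1
}

type HandoffBranch = "local" | "upstream-no-PR" | "upstream-yes-PR";

// Step 2 handoff: step 3 (investigate-submissions) runs only on PR intent.
function shouldRunStep3(fm: FunctionalityFrontmatter): boolean {
  return fm.pr_submission_intent;
}

// Step 7 handoff: the branch is fully determined by step-1 data,
// so no late prompt is needed.
function step7Handoff(fm: FunctionalityFrontmatter): HandoffBranch {
  if (fm.source === "local") return "local";
  return fm.pr_submission_intent ? "upstream-yes-PR" : "upstream-no-PR";
}
```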

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
20 tasks across 5 phases:
- Phase A (4 tasks, mechanical): skill dir shells, subagents/refs dirs, v1.3 deprecation, recipes.md seed
- Phase B (7 tasks, INTERACTIVE via writing-skills): one per SKILL.md, NOT subagent-driven per user request
- Phase C (6 tasks, mechanical): subagent prompt templates with limited-context constraints
- Phase D (1 task, mechanical): references/workflow.md chain diagram
- Phase E (2 tasks, E2E validation): local skill + firecrawl re-run (the v1.3 regression case)

Plan explicitly marks Phase B as not-for-subagent-driven-development;
the user explicitly stated 'writing good skills is HARD' and wants
interactive creation per skill.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…+ writing-skills

Phases B–D produce v1.4's load-bearing artifacts (description-routed SKILL.md
files, subagent prompt templates that hold the anti-ducktape constraints, and
the workflow reference doc). Per operator direction, none of these should be
subagent-driven — "writing good skills is HARD" and "everything to be precise".

Tool assignment:
- Phase B (7 SKILL.md): skill-creator as outer loop (description-routing +
  eval iteration), superpowers:writing-skills as inner loop when behavioral
  compliance issues surface during eval
- Phase C (6 subagent prompts): superpowers:writing-skills only — these are
  compliance documents and the limited-context constraint must hold under
  adversarial pressure, pure TDD-pressure-scenario territory
- Phase D (workflow.md): no skill tool; direct interactive authoring with
  section-by-section operator review (matches handoffs from B + constraints
  from C, which may have shifted during interactive iteration)

Only Phase A (file scaffolding) and Phase E (real eval runs) sit outside the
interactive flow.
Body content for each SKILL.md will be written interactively in Phase B
via superpowers:writing-skills, with the interface contract from
docs/skill-optimizer-v1.4-spec.md as the brief. This commit just lays
down the structural skeleton and discoverable frontmatter.
Pulled from feat/auto-improve-skill-v1.3:skills/auto-improve-orchestrator/
references/lessons.md (the v1.3 orchestrator never landed on development,
so we source from the experimental branch). Adds v1.4-specific header
explaining how the analyzer (step 6) and optimizer (step 7) subagents will
use this file. Replaces v1.3 "auto-improve-skill" / "Phase 4" framing
with v1.4 step-numbering and subagent-naming.

Content body is the v1.3 Recipe A-E + grader patterns G1-G6 + run-record
protocol verbatim — that's the cumulative knowledge v1.4 inherits.
…ts dir, move philosophy to docs/

Changes to the v1.4 plan-as-executed, all in one place:

- Move skill-writing-philosophy.md from skills/references/ to docs/. The
  philosophy doc is contributor-facing — we'll distill the load-bearing
  rules into the optimizer subagent's prompt directly rather than have
  it load this doc at runtime.

- Delete skills/references/recipes.md (and the now-empty references/
  dir). The raw seed copied from v1.3 lessons.md is case-study-shaped
  (accumulated per-pilot observations), which contradicts the
  "generalize-from-feedback" principle in the philosophy doc. The
  curated abstract-pattern version needs real end-to-end observations
  to ground it in — deferred to a follow-up after the chain ships.

- Rename skills/subagents/ → skills/skill-optimizer-subagents/ so the
  dir name is scoped to this plugin and can't collide with subagents
  that other plugins might ship under a generic name.

- Add a "Bootstrapping limit" section to the philosophy doc. The
  skill-optimizer chain runs an empirical loop on target skills, but
  that loop can't validate itself — same shape as Thompson's
  "Reflections on Trusting Trust". The seven meta-skills get authored
  from philosophy + best judgment; their test is Phase-E real-world
  runs, not eval data for the meta-skills themselves.

- Spec + plan updated: file tree, Task A2 (revised), Task A3 (skipped),
  Task A4 (deferred), Phase C task file paths (subagents path
  rename), Phase D location (workflow.md goes to docs/ rather than
  skills/references/), acceptance criteria #4 (recipes.md deferred),
  coexistence section (v1.3 orchestrator never landed on development),
  open questions (recipes.md location TBD on real production).

- Add note about post-v1.4 cleanup of the original
  skills/skill-optimizer/ skill (now redundant with the chain's
  skill-optimizer-run-bench step). Deferred to its own PR because it
  touches the public plugin API.
…sification taxonomy + local-with-PR-intent path

Two changes to the B1 draft, plus a matching spec sync:

1. Classification: replaced the ad-hoc closed list
   (code-reviewer | document-producer | tool-use | code-patterns |
   other) with the v3 taxonomy actually used in the prioritization
   work — tool-use, code-patterns, document, prose-guidance, meta,
   interactive — and explicitly grants the subagent sovereignty to
   write a short descriptive label of its own when none fits, rather
   than collapsing to "other". A specific label gives downstream
   steps a real handle to work with.

2. Local-skill PR intent: the prior rule was "local = always
   pr_submission_intent: false". Revised to default false but treat
   PR-intent as live when the user explicitly says they want to send
   the local skill back upstream — in which case we capture the
   upstream guidelines location (URL / CONTRIBUTING.md path / Slack
   channel) into the report body as a "PR submission notes"
   subsection that step 3 (investigate-submissions) reads as its
   starting point.

Also: while reviewing, dropped the v1.3/v1.4 framing from the "Why
limited-context dispatch matters" section (version is historical
metadata, not skill-functional content) and updated subagent path
links to the renamed skill-optimizer-subagents/ directory.

Spec §"The 7 skills" #1 step 2 + step 6 (classification field
description) updated to match.
…r B2

Spec changes:

- New ## Iteration patterns section between ## Subagent constraints
  and ## The skills. Covers: backtrack-trigger table, per-report
  versioning mechanism (version + inputs frontmatter, with direct-
  upstream-only staleness checks), re-entry contract with the
  ${OPERATOR_DIRECTIVES} slot, latest-plus-archive convention, and
  the transitive-staleness policy (user judgment trusted; auto-pilot
  re-runs on direct-upstream version mismatch).

- ## The 7 skills → ## The skills. Added one-line "Iteration
  behavior" notes to each of the existing seven subsections.

- Step 2 (investigate-test-case) now dispatches a test-case-designer
  subagent rather than running in the operator session. Reasoning:
  cross-iteration contamination — on iter 2+ the operator has seen
  prior failures and biases test selection toward "what just failed"
  instead of comprehensive coverage. Isolated subagent fixes this.

- New ### 8. skill-optimizer-autopilot subsection. Walks 1→7,
  consults version mechanism per step (skip-or-dispatch), applies
  defaults for the three human-gate points (B1 PR-intent flag, B2
  pick-top-N, B7 log-and-exit on validator-rejected), bounds at
  max-iterations-per-step. Replaces the old "## Auto-pilot mode"
  section (which said "no separate skill, just chained invocation").

- Subagent constraints table updated: test-case-designer added;
  every reasoning subagent's "Does NOT see" column now includes
  prior drafts of its own output; every Sees column includes
  ${OPERATOR_DIRECTIVES}.

- Acceptance criteria expanded to 9 items (was 7): new #1 covers
  8 SKILL.md files, new #5 covers iteration-mechanism E2E, new #8
  covers auto-pilot smoke test.

Plan changes:

- Phase B count 7 → 8 tasks. New Task B8 (skill-optimizer-autopilot
  SKILL.md) added with its own A1-equivalent dir-creation step
  inline (the dir wasn't part of Phase A scope).

- Phase C count 6 → 7 tasks. New Task C1b (test-case-designer
  subagent prompt) inserted between C1 and C2 with a full draft
  template; suffixed "1b" rather than renumbering existing C2-C6
  to keep cross-references stable.

- Phase C section header adds a preamble explaining the
  ${OPERATOR_DIRECTIVES} requirement that applies to every
  reasoning-subagent prompt template.
…h iteration patterns

Five updates to match the spec's new iteration mechanism (§"Iteration
patterns" added in commit 0009066):

1. Report frontmatter now includes `version` field. Step 1 has no
   upstream reports, so `inputs:` is omitted; downstream steps will
   add it as they're authored.

2. New workflow step 5 — "Handle iteration: archive prior version +
   collect directives." Checks for existing report at the canonical
   path; if present, moves it to archive/01-functionality-v<N>.md
   and bumps version. Also where the operator pre-digests
   cross-iteration learnings into the OPERATOR_DIRECTIVES bulleted
   list (atomic new requirements, not a context dump).

3. Step 6 (renumbered from old step 5) dispatches with two new
   templated inputs: ${VERSION} and ${OPERATOR_DIRECTIVES}. The
   subagent's "does NOT see" list explicitly includes prior
   01-functionality.md drafts under archive/ to prevent the
   re-derivation from being contaminated by its own past output.

4. New "## Iteration behavior" section after "## Edge cases". Covers:
   when to re-run, what happens to vendored-skill/ on re-run, the
   archive convention, and how downstream cascade self-corrects via
   the version-mismatch check on next invocation.

5. Edge-case for "vendored-skill/ already exists" tightened: default
   to reuse (same source), re-fetch only on source URL change or
   explicit user ask. (Previously asked the user every time.)

Sets the template for B2-B8.
Authoring the same ~40 lines of archive + version-bump + directives-
collection logic into seven chain skills (B1-B7) would mean ~280
duplicated lines. Factor the mechanics into a single shared
operational reference that every chain skill loads explicitly.

New file: skills/skill-optimizer-shared/iteration-protocol.md (187
lines). Covers: versioning convention, the per-invocation decision
tree (no existing report → v1; existing + inputs match → current;
existing + mismatch → archive + bump), collecting OPERATOR_DIRECTIVES
(atomic new requirements, not context dumps), subagent constraints
under iteration (ignore own prior output, read upstream at latest),
cascading staleness policy (direct-upstream-only checks), archive
table, bootstrapping case, and what the protocol explicitly does
NOT cover (bench-results timestamping, auto-pilot summaries,
vendored-skill cache).
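
A rough sketch of the per-invocation decision tree as this commit describes it, assuming the frontmatter has already been parsed. StepReport, IterationDecision, and decideIteration are hypothetical names; the archive path pattern follows the archive/01-functionality-v<N>.md convention mentioned earlier in the thread.

```typescript
// Hypothetical shape for a report's parsed frontmatter.
interface StepReport {
  version: number;
  inputs: Record<string, number>; // direct-upstream report name -> version consumed
}

type IterationDecision =
  | { action: "create"; version: number }
  | { action: "current"; version: number }
  | { action: "archive-and-bump"; version: number; archiveAs: string };

function decideIteration(
  reportName: string,
  existing: StepReport | null,
  upstreamVersions: Record<string, number>,
): IterationDecision {
  // No existing report: first derivation, v1.
  if (existing === null) return { action: "create", version: 1 };

  // Direct-upstream-only check: compare each recorded input version.
  const recorded = existing.inputs;
  const inputsMatch = Object.keys(upstreamVersions).every(
    (key) => recorded[key] === upstreamVersions[key],
  );
  if (inputsMatch) return { action: "current", version: existing.version };

  // Mismatch: archive the prior file under its old version, then bump.
  return {
    action: "archive-and-bump",
    version: existing.version + 1,
    archiveAs: `archive/${reportName}-v${existing.version}.md`,
  };
}
```

The function only decides; moving files and dispatching a subagent would stay with the calling skill.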

B1 changes:
- Step 5 stripped from ~25 lines of inline mechanics to ~10 lines
  with an explicit "Read this file now" instruction pointing at the
  protocol doc. The "Read this now" framing is load-bearing — the
  whole point is that the agent must consult the protocol, not
  improvise.
- "Iteration behavior" section trimmed from ~25 lines of general
  mechanics to ~10 lines listing only step-1-specific re-run
  triggers, with a pointer back to the protocol for the general
  mechanics.
- Net: 225 → 208 lines.

This sets the pattern for B2-B8: each chain skill's "Handle
iteration" step will be a brief pointer-and-step-specific-notes
combo, with the heavy mechanics centralized.

Spec changes:
- Architecture overview file tree: skill-optimizer-shared/ added.
- Acceptance criterion 2b added: shared protocol doc exists and is
  referenced explicitly from each chain skill's iteration step.

Plan changes:
- File tree updated (skill-optimizer-shared/ + autopilot/ stub).
- New Task A5 added: author iteration-protocol.md. Documented as
  "as-executed (added mid-execution)" because it emerged from B1's
  revision rather than the initial plan.
…sophy

Three issues in the protocol doc, all caught on read-through. Because the
doc is a load-bearing reference, each would have propagated into every
chain skill's invocation:

1. "B1-B7" was project-internal jargon (those are plan task labels,
   not part of the skill vocabulary). Replaced with "every skill in
   the chain". The shipped doc should be timeless; task labels live
   in the plan, not the protocol.

2. The "On every chain-skill invocation" section used an ASCII box-
   drawing decision tree. Per Anthropic and writing-skills guidance
   ("use flowcharts ONLY for non-obvious decision points"), this
   logic is straightforward conditional flow — bullets carry it
   more cleanly and match standard markdown rendering. Rewritten as
   nested bullets.

3. The "Bad examples" of operator directives were marked with ❌.
   Per the global instruction "Only use emojis if the user
   explicitly requests it. Avoid adding emojis to files unless
   asked", these don't belong. Replaced with explicit "Examples
   that count" / "Examples that do NOT count" section labels.

Net: 187 → 180 lines.
…mbering + front-load load-bearing context

Two related fixes:

1. Internal workflow step numbering (1-7) collided with chain step
   numbering (1-7 referring to other skills in the chain). Same word,
   two meanings — an agent reading "step 5" could plausibly think
   either "this skill's Handle-iteration step" or "the run-bench
   skill in the chain". Relabeled internal workflow steps to (a)
   through (g) so they're visually distinct from chain step
   references. Cross-references inside the doc updated accordingly.

   A note in the new "Before you start" section declares the
   convention explicitly so future readers don't have to infer it.

2. The "Read iteration-protocol.md" instruction was buried at
   internal step (e) — middle of a 7-step workflow. An agent
   reading sequentially might glance past it once they're in
   execution momentum. Added a "## Before you start" section right
   after the intro paragraph, listing the two load-bearing things
   to know up front: (1) read the iteration protocol now, (2) you
   will dispatch a subagent — you don't do the research yourself.

   Step (e) still requires reading the protocol — the "Before you
   start" section primes the agent so step (e) becomes reinforcement
   rather than first contact.

229 lines total (up from 208; "Before you start" earns its place as
the discipline frame). Sets the pattern for B2-B8 — each will
similarly relabel its workflow steps with (a)-(g) and front-load
the two load-bearing context items.
…visibility moved from step 4-blocked to step 4-allowed

Two coupled changes — B2's SKILL.md draft and a spec update that
makes the source-visibility split consistent across the chain.

Spec changes (Subagent constraints table):

- Step 2 (test-case designer) — explicitly added "skill source
  content" to "does NOT see". Coverage design happens at the
  responsibility level here; concrete fixture details enter at
  step 4. Without this constraint, designers could gerrymander
  test cases around the source's literal phrasing instead of
  reasoning from stated responsibilities.

- Step 4 (test writer) — removed "the skill's content" from "does
  NOT see"; added "skill source content" to "sees". Updated the
  rationale: fixture writing needs concrete patterns and violation
  examples, which come from source. The per-case test spec from
  step 2 constrains what the fixture should test, so the
  gerrymandering risk is bounded. Still blocks other test cases
  (prevents copying across the suite) and grader internals
  (prevents grader-leak hacking).

Architecture: source enters the chain at step 1 (research), exits
at step 2 (responsibility-level design needs no source), enters at
step 4 (fixture writing needs source detail), stays present for
step 6 (analyze failures) and step 7 (optimize/validate). The
spec's step-4 body description updated to match.
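
One way to picture the source-visibility split is a small lookup table plus an input filter. This is a sketch under assumptions: step numbers follow this commit's 7-step chain, steps the text does not mention are left out rather than guessed, and filterInputs is a hypothetical helper.

```typescript
type SourceVisibility = "sees" | "does-not-see";

// Only the steps named in the commit message are listed.
const SOURCE_VISIBILITY: Partial<Record<number, SourceVisibility>> = {
  1: "sees", // research
  2: "does-not-see", // responsibility-level design
  4: "sees", // fixture writing
  6: "sees", // analyze failures
  7: "sees", // optimize/validate
};

// Hypothetical filter: strip skill-source paths from a subagent's inputs
// when the step is not allowed to see them.
function filterInputs(step: number, inputs: string[], sourceDir: string): string[] {
  const rule = SOURCE_VISIBILITY[step];
  if (rule === undefined) {
    throw new Error(`no source-visibility rule stated for step ${step}`);
  }
  if (rule === "sees") return inputs;
  return inputs.filter((p) => !p.startsWith(sourceDir));
}
```

Throwing for unlisted steps is a design choice of the sketch: it avoids silently granting or denying source access where the spec is silent.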

B2 SKILL.md:

- New file, follows B1's template: front-loaded "Before you start"
  section, lettered workflow steps (a)-(f), bold discipline markers
  at action sites, why-this-matters rationale, edge cases,
  iteration behavior section.
- 6 internal steps: (a) confirm prerequisites, (b) handle iteration,
  (c) dispatch designer subagent, (d) confirm subagent output,
  (e) user gate (present + collect picks), (f) conditional handoff
  based on pr_submission_intent.
- "picked" frontmatter field is the operator's responsibility —
  subagent writes proposal with picked: []; operator fills picked
  after user gate.
- Three-response handling in user gate: pick subset/all, ask for
  revision, pick zero.
- 215 lines, description 479 chars.
Step 3 of the chain — OPTIONAL, runs only when 01-functionality.md
has pr_submission_intent: true. Researches the upstream repo's PR
conventions and writes 03-submissions.md as the verbatim-pastable
context block the validator (step 7) uses for external consistency.

Follows the B1/B2 template:
- Front-loaded "Before you start" with iteration-protocol pointer +
  dispatch discipline + step-numbering convention
- Frontmatter with version, inputs.step_1_functionality, plus
  step-specific fields: upstream_repo, upstream_branch_target,
  license, requires_cla
- 5 lettered workflow steps: (a) confirm prerequisites including
  pr_submission_intent gate, (b) handle iteration, (c) dispatch
  subagent, (d) confirm subagent output with blocker flagging,
  (e) hand off
- Why limited-context dispatch matters: rationale specific to step
  3 — the validator's external consistency check depends on the
  submissions report being neutral upstream facts, not advocacy for
  the proposed change
- Edge cases: private repos, non-GitHub hosts, empty PR history,
  copyleft license
- Iteration behavior: explicitly note this rarely re-runs;
  upstream conventions change slowly

235 lines, description 578 chars.
…otocol + apply in B2

A chain skill never invokes another chain skill on its own. When a
step finds its upstream input unsatisfactory (thin functionality
report, too-easy bench, missing test case, wrong-target PR
conventions), it surfaces the finding and stops — the user (or
auto-pilot driver) decides whether to re-run upstream, accept the
situation, or abandon.

This was implicit in the architecture but not stated as a rule.
Adding it explicitly to iteration-protocol.md as a new section
between "Cascading staleness" and "Archive convention". Same
section covers the forward-handoff exception (those are normal
chain flow, not backward triggers).
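
The no-auto-invocation rule can be sketched as a step that only returns an outcome, with the re-run decision injected by whoever drives the chain. All type and function names here (StepOutcome, UserDecision, NextMove, resolveOutcome) are hypothetical.

```typescript
type StepOutcome =
  | { kind: "done"; reportPath: string }
  | { kind: "upstream-unsatisfactory"; upstreamStep: number; finding: string };

type UserDecision = "re-run-upstream" | "accept" | "abandon";

interface NextMove {
  rerunStep: number | null; // set only when the user (or driver) asks for it
  proceed: boolean;
}

// The chain skill never calls an upstream skill itself; it surfaces the
// finding, and `ask` stands in for the user or the auto-pilot driver.
function resolveOutcome(
  outcome: StepOutcome,
  ask: (finding: string, upstreamStep: number) => UserDecision,
): NextMove {
  if (outcome.kind === "done") return { rerunStep: null, proceed: true };
  const decision = ask(outcome.finding, outcome.upstreamStep);
  if (decision === "re-run-upstream") {
    return { rerunStep: outcome.upstreamStep, proceed: false };
  }
  return { rerunStep: null, proceed: decision === "accept" };
}
```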

Updated B2's edge case for "Subagent's proposal has fewer cases
than expected" — was "Either re-run step 1 with directives or
accept the small proposal", now "Surface this to the user with two
options ... Do not re-invoke step 1 yourself — re-runs require an
active signal from the user", with reference back to the protocol
doc's new section.

Audit: only B2 had wording that suggested a backward auto-trigger; B1
and B3 don't suggest re-running upstream steps. The same rule applies
to all future chain skills (B4-B8): the protocol doc is the
chain-wide source of truth.
…eview

1. upstream_branch_target placeholder syntax. Was "<main | next |
   other>" which reads as a closed enum. The actual value is the
   literal branch name (could be `develop`, `release-2024`, etc.),
   so the placeholder should describe the field meaning, not enumerate
   three literal options. Now: "<branch name — usually `main`,
   sometimes `next` for new-skill repos, or a specific release branch>".

2. Removed the "frontmatter spec includes fields not currently in the
   vendored skill's frontmatter" blocker. It described a real fact in
   the report (upstream uses fields the vendored skill lacks) but
   isn't a blocker — the optimizer (step 7) reads 03-submissions.md
   and would add missing fields naturally. Doesn't need a separate
   operator alert.

3. Removed the "License is GPL or other copyleft" blocker. Copyleft
   licenses don't mechanically block PR submission: the upstream keeps
   whatever license it already has, and a PR becomes part of that
   codebase under that license. Organization-policy concerns about
   contributing to copyleft projects exist but are contributor-side
   decisions, not chain-level blockers.

Step (d) reframed from "flag these blockers" to "verify file + only
the CLA fact needs explicit mention since it requires operator-side
work outside the chain". Other frontmatter fields (license, repo,
branch target) are facts downstream steps consume directly.

Edge case for "license is copyleft" replaced with one for unusual
frontmatter conventions — that's the actual blocker shape (subagent
can't extract a consistent spec, so the optimizer has to make a
judgment call).
Per operator review: B2's re-run was modeled as fresh derivation,
which loses continuity. The user's `picked` choices and existing
case names should survive a re-run; otherwise the user has to re-pick
everything every time they add coverage. The "archive as inert audit
trail" rule means continuity has to come from the current canonical
file, not from peeking at archived prior versions.

New concept: chain steps come in two kinds.

- Fresh-derivation steps (1, 3, 6, 7): subagent never sees its own
  step's prior or current file. The "anti-ducktape" rule applies in
  full — optimizer must not see prior attempts, analyzer must not see
  prior analyses.

- Maintenance steps (2, 4): subagent reads the current canonical file
  as load-bearing input and produces an extended version of it.
  Existing entries the user has invested in (picks, manually-added
  cases) are preserved unless directives explicitly say to revise.
  Archive still happens for audit; archive is still inert.
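
The two step kinds could be encoded as below. The classification table copies the step numbers stated in this commit message; ownFileInput and its parameters are hypothetical, and the sketch assumes the caller already knows whether the canonical file exists.

```typescript
type StepKind = "fresh-derivation" | "maintenance";

// Classification as stated in this commit message (7-step numbering).
const STEP_KIND: Partial<Record<number, StepKind>> = {
  1: "fresh-derivation",
  2: "maintenance",
  3: "fresh-derivation",
  4: "maintenance",
  6: "fresh-derivation",
  7: "fresh-derivation",
};

// Returns the step's own canonical file if the subagent may read it, or
// null if it must derive from scratch. Archive paths are never returned:
// the archive stays inert for both kinds.
function ownFileInput(
  step: number,
  canonicalPath: string,
  canonicalExists: boolean,
): string | null {
  const kind = STEP_KIND[step];
  if (kind === undefined) throw new Error(`step ${step} has no stated kind`);
  if (kind === "fresh-derivation") return null;
  return canonicalExists ? canonicalPath : null; // nothing to extend on iteration 1
}
```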

iteration-protocol changes:

- New "Step kinds: fresh-derivation vs maintenance" section
  classifying each step.
- "On every chain-skill invocation" decision tree updated to branch
  on step kind: fresh-derivation copies-to-archive then derives from
  scratch; maintenance copies-to-archive then dispatches with current
  file as input.
- "Subagent constraints under iteration" restructured to make the
  third rule kind-dependent (do/don't read your own canonical file).

B2 changes:

- Step (b) Handle iteration: explicitly invokes the maintenance
  pattern. Notes the special case where step 1's version bumped
  (responsibility set changed) — user decides whether to extend or
  start fresh by deleting the canonical file before dispatch.
- Step (c) Dispatch: new templated input ${EXISTING_CASES_PATH}
  (empty on iteration 1, current 02-test-case.md on re-runs).
  Subagent's "sees" list adds the current file with explicit
  preserve-existing semantics. "Does NOT see" list now correctly
  excludes archive only (was incorrectly excluding all prior drafts).
- Step (e) User gate: response #2 "user wants additions or revisions"
  now reflects maintenance — user does not need to re-pick everything
  since existing picks are preserved.

Spec changes:

- Subagent constraints table: test-case-designer row updated to
  reflect maintenance pattern. Sees-list adds "current 02-test-case.md
  (when extending in maintenance mode)"; does-NOT-see-list correctly
  scoped to "archived prior drafts" only; why-rationale updated.

Note: B4 (write-tests) when drafted will also follow the maintenance
pattern (workbench/ accumulates per-case files). B6 and B7 stay as
fresh-derivation — that's the load-bearing anti-ducktape constraint.
Step 4 of the chain — takes the picked cases from 02-test-case.md
and dispatches test-writer subagents in parallel (one per case) to
build concrete workspace files + graders. Runs a smoke check
against hand-crafted GOOD/BAD/EMPTY fixtures before declaring done.

Follows the established template (B1/B2/B3 shape) with several
B4-specific additions:

- **Maintenance step** (per the recent iteration-protocol update).
  workbench/ accumulates as new picks get built. The (b) iteration
  step diffs picked vs prior built_cases and only dispatches
  test-writers for NEW picks or for cases that directives flag for
  revision. Existing builds are preserved verbatim.

- **Parallel per-case dispatch** in step (d). All test-writer dispatches
  are emitted in a single message so they run concurrently. Each sees
  only its case spec + 01-functionality.md + skill source; does
  NOT see other cases, other graders, prior failures, or anything
  under archive/. The per-case isolation prevents both
  cross-fixture homogenization and grader-leak hacking.

- **Source content access** (per the recent spec update). Test
  writers are the one step where the skill source is in-scope for
  a generative subagent. Bounded by the per-case spec from step 2.

- **Smoke check** (step (e)). New responsibility not in other
  steps. Verifies each grader against GOOD/BAD/EMPTY fixtures the
  test-writer also produces. last_smoke_check_passed in frontmatter
  is only set when all graders pass.

- **Two outputs**: 04-tests-plan.md (the meta-report) AND the
  workbench/ directory (the artifacts the run-bench step executes
  against). Maintenance protocol covers both via archive/04-tests-plan-vN.md
  and archive/workbench-vN/.

- **User-gate at step (c)** for the planned workbench structure
  before dispatching. Three response cases (approve / add revisions
  via directives / reject and abandon-or-replan).

- **6 internal steps**: (a) confirm prerequisites, (b) handle
  iteration with diff logic, (c) plan + user gate, (d) parallel
  dispatch, (e) smoke check, (f) assemble + commit + hand off.

350 lines, description 492 chars. Larger than B1/B2/B3 because the
extra responsibilities (parallel dispatch, smoke check, maintenance
diff) each earn their lines.
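
The step (b) diff described above can be sketched as a filter over the picked list. This is only an illustration in Python, assuming picks and prior builds are plain lists of case names; `cases_to_dispatch` and its parameters are hypothetical names, not something the SKILL.md defines.

```python
def cases_to_dispatch(picked, built_cases, flagged_for_revision=()):
    """Return the cases that need a test-writer dispatch, in pick order.

    Hypothetical helper: the names and list-of-strings shape are assumptions.
    """
    built = set(built_cases)
    flagged = set(flagged_for_revision)
    # New picks have no existing build; flagged cases are rebuilt on directive.
    # Everything else is left alone, so existing builds are preserved verbatim.
    return [case for case in picked if case not in built or case in flagged]


todo = cases_to_dispatch(
    picked=["case-a", "case-b", "case-c"],
    built_cases=["case-a", "case-b"],
    flagged_for_revision=["case-b"],
)
print(todo)
```

Here `case-a` is skipped because it is already built and not flagged, which is the "preserved verbatim" behavior.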
…o centralized script exists

The prior wording referenced a path that didn't exist
(`skills/skill-optimizer/references/scripts/smoke-check.mjs`). The
actual codebase doesn't have a centralized smoke-check runner —
each existing workbench has its own `checks/smoke-graders.mjs`
specific to its graders (see e.g.
examples/workbench/agent-browser/checks/smoke-graders.mjs).

Rewrote step (e) to be honest about this:

- Smoke check is per-workbench, not centralized.
- The test-writer subagent produces the smoke-check artifact
  alongside its grader. Shape follows the workbench schema docs.
- Two reasonable shapes described (per-case runner vs workbench-level
  aggregator); the test-writer prompt template (Phase C) will pick
  one and apply consistently.
- Reference example pointed at the actual existing
  agent-browser/checks/smoke-graders.mjs.

Conceptual contract unchanged (GOOD passes, BAD/EMPTY fail); only
the execution mechanics corrected.
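
The conceptual contract (GOOD passes, BAD/EMPTY fail) can be illustrated with a small check. This is a sketch in Python rather than the workbench's own `.mjs` runners; treating a grader as a callable that returns a boolean and fixtures as plain strings are assumptions made for the example.

```python
def smoke_check(grader, fixtures):
    """Return the fixture names whose grade contradicts the contract."""
    # Contract from the step (e) description: GOOD passes, BAD and EMPTY fail.
    expected = {"GOOD": True, "BAD": False, "EMPTY": False}
    return [
        name
        for name, want in expected.items()
        if bool(grader(fixtures[name])) != want
    ]


def example_grader(output):
    # Hypothetical grader: passes when the output mentions "done".
    return "done" in output


fixtures = {"GOOD": "task done", "BAD": "task failed", "EMPTY": ""}
violations = smoke_check(example_grader, fixtures)
always_pass_violations = smoke_check(lambda output: True, fixtures)
print(violations, always_pass_violations)
```

An empty list means the grader honors the contract. A grader that passes everything is reported against both BAD and EMPTY, which is the failure the smoke check exists to catch.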
Step 5 of the skill-optimizer chain. Thin operator-driven CLI step —
no subagent dispatch. Two outputs: timestamped raw bench data under
05-bench-results/<ts>/ (preserved naturally, exempt from the standard
archive flow) and a versioned 05-bench-summary.md (follows the
iteration protocol; archive-and-version on re-run).

The summary is the entry point step 6 reads: per-model and per-case
pass rates plus a failed-case pointer list into the raw output. The
analyzer subagent at step 6 walks that list for trace and findings
detail; the summary itself stays short.

Handoff branches on overall pass rate: failures route to
analyze-result; all-pass surfaces the choice to the user (accept
that the picked set didn't expose a weakness, or re-run step 2 with
a "make it harder" directive). Chain skills don't auto-invoke.

Two related coordination changes across the chain, plus a deferred
limitation note.

Rename convention for revised cases at step 2: when a directive asks
to revise an existing case (rather than add a new one), the
test-case-designer subagent appends a version suffix —
`case-x` → `case-x-v2` → `case-x-v3`. Bare name is implicit v1.
This gives step 4's diff logic a deterministic signal that the
revised case needs a fresh test-writer dispatch (without the rename,
the name-match would say "already built" and skip rebuilding the
case whose spec actually changed). Encoded in B2's user-gate step
and referenced from B4's diff logic so the v-suffixed names are not
surprising downstream.
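
A minimal sketch of the suffix rule, assuming names follow the `case-x` / `case-x-vN` shape shown above; `next_revision_name` is a hypothetical helper, not part of B2 or B4.

```python
import re


def next_revision_name(name):
    """Return the name a revised case would get under the -vN convention."""
    match = re.fullmatch(r"(.+)-v(\d+)", name)
    if match is None:
        # A bare name is implicit v1, so its first revision is v2.
        return f"{name}-v2"
    base, version = match.group(1), int(match.group(2))
    return f"{base}-v{version + 1}"


print(next_revision_name("case-x"))
print(next_revision_name("case-x-v2"))
```

Because the revised name never equals a previously built name, a plain name-match diff treats it as a new pick.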

Partial re-bench at step 5: deferred. CLI's run-suite does not
accept a case filter, so step 5 always re-measures the full
workbench. Documented as a known limitation with a roadmap pointer.
Operator escape hatch (manual run-case + splice into
05-bench-results/<ts>/) noted as outside the chain.

Replace the version-field + archive-folder iteration model with
git as the history mechanism and the filesystem itself as the
current state. The prior protocol was reimplementing git in
frontmatter — version: ints, inputs.step_N: lineage tracking,
archive/<NN>-name-v<N>.md copies — and adding accidental complexity
across the chain.

Iteration-protocol rewrite:
- Drop version field, archive/ subdir, inputs.step_N int tracking
- Staleness detection: compare commit timestamps from
  git log -1 --format=%ct (used in place of file mtime)
- Two step kinds preserved (fresh-derivation vs maintenance), with
  the constraint reframed in terms of "does the subagent read its
  own canonical file" — fresh-derivation says no (anti-ducktape),
  maintenance says yes (filesystem IS the state)
- Subagent constraint added: don't walk git history of any tree
  file — for fresh-derivation this is the anti-ducktape guarantee,
  for maintenance this prevents reasoning from prior states
- Safe destructive edits: operator session commits a checkpoint
  before maintenance-step rebuilds so git history has a clean
  before/after breakpoint
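
The staleness bullet above could look roughly like the following in Python. `git log -1 --format=%ct -- <path>` prints the Unix committer timestamp of the last commit touching a path; the helper names and the "missing output counts as stale" rule are assumptions for this sketch, not wording from iteration-protocol.md.

```python
import subprocess


def last_commit_time(path, repo="."):
    """Unix committer timestamp of the last commit touching path, or None."""
    out = subprocess.run(
        ["git", "log", "-1", "--format=%ct", "--", path],
        cwd=repo, capture_output=True, text=True, check=True,
    ).stdout.strip()
    # Empty output means the path has never been committed.
    return int(out) if out else None


def is_stale(output_time, input_time):
    """Assumed rule: an output is stale if missing or older than its input."""
    if output_time is None:
        return True
    if input_time is None:
        return False
    return output_time < input_time


# Example call (needs a git repository, so it is left commented out):
# is_stale(last_commit_time("03-submissions.md"),
#          last_commit_time("01-functionality.md"))
```

One consequence of reading timestamps from git is that uncommitted edits are invisible to the check, so it is only meaningful once each step's output has been committed.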

Spec doc updates:
- State layout: tests/<functionality>/<test>/{spec.yaml, workspace,
  grader, smoke} replaces 02-test-case.md + 04-tests-plan.md +
  workbench/. Filesystem-as-state — no picked: [] or built_cases: []
  arrays anywhere
- B2 output: 00-test-proposals.md (audit) + tests/<func>/spec.yaml
  per functionality with picked: true|false in each
- B4 output: tests/<func>/<test>/ probe folders + generated
  tests/suite.yml. Per-probe parallel test-writer dispatch
- B5 input: tests/suite.yml. Two outputs: timestamped raw +
  single-canonical 05-bench-summary.md
- Subagent constraints table updated for every reasoning subagent
  to reflect git-history-off-limits rule
- Per-step iteration-behavior sections rewritten for the new model

Undoes the -v2 rename convention added 30 min ago (no longer
needed — filesystem-as-state means revising a spec.yaml in place
is naturally detected, and B4 can just rebuild on directive
without any special name mangling).

Apply the filesystem-as-state + git-native iteration redesign across
all five existing chain SKILL.md files. Drops version: frontmatter
fields and archive/ directory references everywhere; reframes the
subagent constraints in terms of "don't read own canonical / git
history" (fresh-derivation) vs "do read current tree as state"
(maintenance).

B1 (investigate-functionality, fresh-derivation): drop version
field; subagent constraint becomes "don't read own canonical or
git history of it" instead of "don't read archive/".

B2 (investigate-test-case, maintenance — major rewrite): output
shape changed entirely. Was a single 02-test-case.md with picked:
[] frontmatter array; now produces 00-test-proposals.md (one-time
audit report) plus tests/<functionality>/spec.yaml per proposed
functionality, each with picked: true|false in its own frontmatter.
User gate is "edit picked: in each spec.yaml" rather than "tell me
which names to pick". Subagent reads existing tests/ tree as
load-bearing state per the maintenance rule.

B3 (investigate-submissions, fresh-derivation): drop version and
inputs.step_1_functionality fields; staleness now via git commit
timestamp compared against 01-functionality.md.

B4 (write-tests, maintenance — major rewrite): replaced
04-tests-plan.md + workbench/ with tests/<functionality>/<test>/
probe folders. State is implicit: probe folder + grader.mjs
present = built; no built_cases: [] array. Per-probe parallel
test-writer dispatch; one probe = one test-writer subagent. Step
generates tests/suite.yml from picked-functionality probes.
Destructive-edit checkpoint pattern: operator commits before
rebuilding existing probes.

B5 (run-bench, fresh-derivation summary + timestamped raw): drop
version and inputs.step_4_tests fields; reads tests/suite.yml as
input. Summary 05-bench-summary.md is single-canonical with prior
state in git; raw 05-bench-results/<ts>/ stays naturally
accumulated and outside the protocol.

Net: less bookkeeping, simpler mental model, fewer ways to get
state out of sync. Filesystem IS the state across the chain.
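
To make "probe folder + grader.mjs present = built" concrete, here is a small Python sketch that derives the built set from the tree instead of from a built_cases array. It assumes the tests/<functionality>/<test>/ layout described above; `built_probes` is a hypothetical helper and the demo tree is made up.

```python
import tempfile
from pathlib import Path


def built_probes(tests_root):
    """Return 'functionality/test' for each probe folder with a grader.mjs."""
    root = Path(tests_root)
    return sorted(
        grader.parent.relative_to(root).as_posix()
        for grader in root.glob("*/*/grader.mjs")
    )


# Demo on a throwaway tree (folder names are invented for illustration).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "func-a" / "probe-1").mkdir(parents=True)
    (root / "func-a" / "probe-1" / "grader.mjs").write_text("// grader\n")
    (root / "func-a" / "probe-2").mkdir(parents=True)  # folder only, no grader
    found = built_probes(root)

print(found)
```

Nothing has to be kept in sync here: deleting a grader or a probe folder changes the answer on the next scan.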
Two wording fixes to the fresh-derivation subagent constraint:

1. The prior wording "no reference to what was produced before"
   read as forbidding even the directive mechanism — which is
   wrong. The operator session DOES read prior outputs (that's
   part of its job between iterations) and distills lessons into
   atomic new requirements. The subagent then satisfies those
   distilled requirements as fresh constraints without seeing the
   raw prior content. This separation is what keeps the new
   derivation from rationalizing the prior one while still letting
   the chain converge across iterations.

2. Step 7 has a specific carve-out worth stating explicitly: the
   SKILL itself (the improvement target) is upstream input, not
   the subagent's own canonical. The optimizer reads the current
   skill — which may include modifications from prior step-7 runs
   — and proposes new improvements on top. The "own canonical"
   that's off-limits is 07-improvement-proposal.md (the reasoning
   report), not the skill file. The skill accumulates improvements
   across iterations; the proposal reports do not.

Triggered by review of the prior wording on iteration-protocol.md.

Replace the descriptive "there is no version: field" wording with
a prescriptive "do not add version-tracking metadata" rule plus a
practical decision aid for future SKILL.md authors:

  When in doubt about whether a field belongs: ask whether a chain
  skill needs to READ it to do its job right now (yes -> keep), or
  whether you're recording it for future-debugging / future-audit
  purposes (no -> that's git's job).

The prior wording could be read as describing the current state
without prohibiting reintroduction. The new wording makes the
prohibition explicit so B6/B7/B8 (yet to be drafted) don't
accidentally bring back version-tracking metadata.

B6 (analyze-result): the chain's anti-ducktape gate. Fresh-derivation
step. Dispatches an analyzer subagent that reads bench summary +
raw trial data + probe specs + skill content — but NOT the test
inputs themselves (forces principle-thinking over solution-thinking).
Output: 06-analysis.md with has_structural_weakness: true|false in
frontmatter and per-weakness sections containing Pattern,
Hypothesized cause, Connects to skill section, What WOULD address
this (general principle), and What WOULD NOT address this (the
explicit anti-ducktape list step 7's optimizer must reckon with).

Honest refusal is built in: if no weakness can be articulated, the
report says so and has_structural_weakness: false, which step 7
will refuse to fire on. Forced weakness-naming when the analyzer
found nothing is the ducktape failure mode this step exists to
prevent.

B7 (improve-skill): the terminal generative step. Two subagents
(optimizer + validator), both fresh-derivation. Refuses to fire if
06-analysis has has_structural_weakness: false (the anti-ducktape
gate's downstream half).
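
The downstream half of the gate can be sketched as a frontmatter read. This Python example assumes the report opens with a `---` delimited block of flat `key: value` lines; the parser and `may_run_improve` are hypothetical and deliberately ignore nested YAML.

```python
def read_frontmatter(text):
    """Parse flat 'key: value' frontmatter between leading '---' lines."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def may_run_improve(analysis_text):
    # Refuse unless the analyzer explicitly reported a structural weakness.
    fields = read_frontmatter(analysis_text)
    return fields.get("has_structural_weakness") == "true"


with_weakness = "---\nhas_structural_weakness: true\n---\n# Analysis\n"
honest_refusal = "---\nhas_structural_weakness: false\n---\n# Analysis\n"
print(may_run_improve(with_weakness), may_run_improve(honest_refusal))
```

A report with no frontmatter at all is treated the same as an honest refusal, so a malformed analysis cannot accidentally open the gate.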

Optimizer: reads 06-analysis, 01-functionality, current skill,
03-submissions (if PR-bound). Does NOT see raw trials, grader
internals, test inputs, or prior proposals. Must apply the
analyzer's general principle and self-check against the
anti-pattern list explicitly.

Validator: reads skill BEFORE + AFTER + 01-functionality + the
proposal artifact + 03-submissions (if PR-bound). Internal
consistency + external consistency checks. Does NOT see the
optimizer's reasoning trace, prior verdicts, or raw trial data.

Bounded loop: max 2 revision rounds. If validator still says
needs-revision after round 2, surface honestly with three realistic
paths; do not loop indefinitely.

Three handoff branches: local skill (modify in place), upstream +
PR=false (modify vendored copy), upstream + PR=true (write
07-pr-draft.md with operator-steps-to-submit; the chain does NOT
submit the PR).

Outputs three or four artifacts: 07-improvement-proposal.md,
07-validator-verdict.md, the modified skill file (on approve), and
07-pr-draft.md (PR-bound + approve). Frontmatter on both reports
carries runtime-relevant facts only (verdict, addresses_weaknesses,
diff_target) per the iteration protocol's discipline rule.

Both files follow the established structural template (front-loaded
"Before you start", lettered workflow steps, edge cases, iteration
behavior section). Pending user review of B1-B7 before B8 and the
subagent prompt templates.
…rompt

B6 SKILL.md was carrying a full markdown body template (per-weakness
section template with placeholders, non-structural noise example,
honest-refusal wording verbatim). That's the subagent's concern —
the subagent writes the body per its prompt template; the operator
session reads only the frontmatter for handoff branching.

Keep in the SKILL.md:
- Frontmatter contract (operator reads has_structural_weakness for
  step 7's gate)
- Enumeration of the five required parts per weakness entry (the
  operator session verifies these in step (d))
- The architecture-level rationale on why the anti-pattern list is
  load-bearing (this is design-decision content, not subagent-side
  prose — explains WHY the constraint exists for future authors)
- Pointer to the subagent prompt template for the full template +
  reasoning protocol

Same audit pass on B1/B2/B4/B7: their What-you-produce sections
show structural contracts (frontmatter schemas, file-tree shape)
that the operator session actually reads, not narrative body
templates the subagent fills in, so they stay as-is.

Two architectural fixes per review:

1. Drop PR draft from B7. PR composition is a separate downstream
   concern that consumes B7's proposal + 03-submissions.md; the
   auto-pilot (step 8) or a dedicated composer can handle it if
   pr_submission_intent: true. B7's job is just to improve the
   skill — packaging it as a PR is not what improve-skill does.

   Removed: 07-pr-draft.md as an artifact, the three-branch
   handoff (local / upstream + PR=false / upstream + PR=true),
   the operator-steps-to-submit checklist. Collapses to a single
   handoff message regardless of PR intent.

2. Never modify the original skill. The improved version lives at
   docs/skill-optimizer/<slug>/improved-skill/ — a separate
   location that accumulates improvements across iterations. The
   vendored upstream copy stays frozen; the local source file
   stays untouched. Git tracks improved-skill/ history.

   On re-run, the optimizer reads improved-skill/ if it exists
   (the accumulated state) and proposes the next improvement on
   top; iteration 1 reads the original source instead. Original
   is always recoverable; improved evolves under git.

   Removed: "Written in place for local skills" / "Written to
   vendored-skill/ for upstream skills" — both wrong now.

Updated B7's "Before you start" carve-out summary, step (c) and
(d) input descriptions (SKILL_CURRENT_PATH replaces
SKILL_SOURCE_PATH; validator BEFORE = optimizer's input), step (f)
write outputs (materialize improved-skill/; don't touch source),
step (g) handoff (single message). Iteration-protocol's step-7
carve-out updated to match (improved-skill/ is the accumulated
state; source stays frozen). Spec doc layout adds improved-skill/
alongside vendored-skill/; B7 section rewritten; subagent
constraints table rows for optimizer + validator updated.

Yuqing Zhai and others added 20 commits May 21, 2026 07:31

1. Rename 00-test-proposals.md -> 02-test-proposals.md. The 00-
   prefix was inconsistent with the chain's per-step numbering
   convention (B2's outputs should start with 02-). Renamed in
   both the SKILL.md and the spec doc layout.

2. Soften the "don't auto-flip" wording. The intent is "no
   proactive flipping without user direction", not "user must
   edit every spec.yaml by hand". If the user explicitly says
   "flip these to true" or "pick X, Y, Z", the operator session
   does it for them and confirms what was set.

3. Single path for user-added functionalities. The prior wording
   offered two paths (manual spec.yaml creation by operator OR
   subagent re-dispatch with directive). The first violates the
   architecture's no-operator-generative-writing rule — only the
   subagent writes test-design content. Collapsed to the single
   correct path: treat user's description as a directive and
   re-dispatch per step (e)(2).
…ounts

B3 had hard-coded "last 10 merged PRs and last 5 closed-without-merge
PRs" both in the SKILL.md "subagent sees" list and in the spec doc
"Behavior" line. Two problems:

1. Redundant: the line above in SKILL.md already says "PR list" as
   part of the gh-CLI access, so the specific-counts bullet was
   restating with extra constraints.

2. Over-prescriptive: 10/5 are arbitrary; the subagent should
   sample enough recent PRs to identify shape patterns and
   rejection signals, but the exact counts are operational
   judgment not architecture. The subagent prompt template
   (Phase C) can recommend a starting point; SKILL.md and the
   spec shouldn't pin it.

Collapsed both to a brief mention that the PR list covers both
merged and closed-without-merge for shape patterns + rejection
signals.

B7 had grown to ~500 lines covering both the optimizer and the
validator with an in-step revision loop. Splitting into two
single-shot steps cleans the architecture:

B7 (improve-skill, ~280 lines):
- Reads 06-analysis + 01-functionality + current skill state
  (improved-skill/ if exists, else original source) + 03-submissions
  if PR-bound
- Dispatches optimizer subagent
- Writes 07-improvement-proposal.md ONLY
- Does NOT materialize improved-skill/ (that's step 8's job after
  approval)
- Single-shot per invocation; 7->8->7 revision cycle is operator-
  driven, not in-step
- Handoff: "invoke validate-improvement"

B8 (validate-improvement, new, ~340 lines):
- Reads 07-improvement-proposal.md + current skill + 01-functionality
  + 03-submissions if PR-bound
- Dispatches validator subagent
- Writes 08-validator-verdict.md
- On verdict: approve: materializes improved-skill/ by applying the
  diff to a copy of the current state; original source stays frozen
- Three handoff branches by verdict (approve / needs-revision /
  reject); does not auto-invoke step 7 on needs-revision
- Single-shot; if needs-revision, operator distills and re-invokes
  step 7 then step 8

B9 (autopilot, renumbered from 8):
- Walks 1->8 (was 1->7)
- Handles the 7->8->7 revision loop bounded by max_iterations_per_step
- Spec doc + autopilot section updated accordingly

Other changes propagated:
- Rename 07-validator-verdict.md -> 08-validator-verdict.md in spec
  layout, B2/B6 cross-references, subagent constraints table
- Iteration-protocol step-kinds: add step 8 to fresh-derivation list;
  expand step-7-specific carve-out to cover both step 7 and step 8
- "step 1 through step 7" -> "step 1 through step 9" in all chain
  SKILL.md headers
- B3 "validator (step 7)" -> "validator (step 8)" (3 instances);
  "optimizer (step 7)" stays correct
- Eliminated the in-step bounded revision loop entirely — each chain
  skill is now genuinely single-shot per invocation, aligning with
  the "chain skills don't auto-invoke other chain skills" rule. The
  bounded loop survives as a cross-step pattern in auto-pilot (B9).
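
One way to picture the cross-step loop that moves to auto-pilot is below. This is a Python sketch under stated assumptions: `improve` and `validate` stand in for invoking steps 7 and 8, the verdict strings follow the three branches listed above, and everything else (function names, the directives list, the return shape) is hypothetical.

```python
def run_revision_loop(improve, validate, max_iterations_per_step):
    """Drive improve -> validate until a final verdict or the bound is hit."""
    directives = []
    for attempt in range(1, max_iterations_per_step + 1):
        proposal = improve(directives)
        verdict, feedback = validate(proposal)
        if verdict != "needs-revision":
            return verdict, attempt  # approve or reject ends the loop
        # The driver distills feedback into a directive for the next round.
        directives = directives + [feedback]
    return "needs-revision", max_iterations_per_step


def fake_improve(directives):
    # Stand-in for step 7: a real run would dispatch the optimizer.
    return f"proposal after {len(directives)} directive(s)"


def fake_validate(proposal):
    # Stand-in for step 8: asks for one revision, then approves.
    if proposal.startswith("proposal after 0"):
        return "needs-revision", "tighten the principle"
    return "approve", ""


result = run_revision_loop(fake_improve, fake_validate, max_iterations_per_step=2)
print(result)
```

Each call to `improve` or `validate` is still single-shot; only the driver knows there is a loop, which matches the "chain skills don't auto-invoke" rule.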
Per-skill verbosity audit identified four trim patterns:
1. "Before you start" preambles duplicated workflow step (b)/(c) content
2. "Why limited-context dispatch matters" sections lived far from
   the dispatch they explain; philosophy doc says "give a
   why-this-matters paragraph nearby"
3. "Confirm subagent output" boilerplate restated across all 7
   subagent-dispatching skills
4. Verbatim multi-line user-dialogue and handoff templates were
   over-prescriptive (operator can phrase the exact words from the
   intent statement)

Applied to all 8 chain skills:
- Dropped "Before you start" sections entirely (~160L saved)
- Inlined dispatch rationale at workflow step (c)/(d) as a short
  "Why this matters" paragraph (~170L saved net)
- Tightened "Confirm subagent output" steps to one or two sentences
  (~50L)
- Collapsed verbatim dialogue blocks to intent statements (~100L)
- Examples lists trimmed from 4-5 to 2 (one to establish, one to
  show variation)

Also split the shared iteration-protocol into three named docs:
- iteration-protocol.md (~130L, was ~284L) — iteration mechanics
  only: step kinds, staleness, destructive-edit checkpoints,
  cascading, bootstrapping. Loaded at "Handle iteration" step.
- subagent-dispatch.md (new, ~120L) — subagent constraints,
  operator-directives concept, templated dispatch inputs,
  no-auto-invocation rule. Loaded at "Dispatch subagent" step.
- frontmatter-discipline.md (new, ~40L) — runtime facts vs history
  rule, decision aid. Referenced at "What you produce" section.

Each chain skill loads only what it needs at the workflow step
that needs it (lazy loading rather than front-loading everything
at "Before you start"). Most skills need all three; B5 (no
subagent dispatch) needs only iteration-protocol +
frontmatter-discipline.

Final line counts (all chain skills now at or under 200 lines):
  B1 investigate-functionality   227 -> 148  (-79)
  B2 investigate-test-case       296 -> 191  (-105)
  B3 investigate-submissions     240 -> 156  (-84)
  B4 write-tests                 368 -> 200  (-168)
  B5 run-bench                   233 -> 154  (-79)
  B6 analyze-result              302 -> 165  (-137)
  B7 improve-skill               317 -> 181  (-136)
  B8 validate-improvement        337 -> 189  (-148)
  Shared iteration-protocol      284 -> 130 + 120 + 40 = 290

Total: 2604 -> 1674 lines (-930, ~36% reduction).

Aligns with the project's docs/skill-writing-philosophy.md:
"Bias toward 'Claude is smart' — pruning beats adding. If the
skill restates what Claude already knows, removing the
restatement is often a more principled fix than adding new
rules."
…footer, add workflow doc

Two coupled cleanups per the philosophy doc's closeness principle:

1. Inlined step-specific edge cases into their workflow steps.
   Most edge cases were just elaborations of "Confirm prerequisites"
   or "Confirm subagent output" — they belong INSIDE those steps,
   not in a trailing section. The small remainder (cross-cutting
   concerns like filesystem-as-state observations) stays in a tiny
   "Edge cases" footer where it earns its space.

2. Dropped "Iteration behavior" sections from each chain skill.
   The re-run triggers were skill-scheduling info (when to invoke
   this skill), not in-skill workflow content. Moved them into a
   new operator-facing docs/skill-optimizer-workflow.md as a
   cross-skill matrix — closer to where an operator deciding what
   to run next would look.

Added a one-line "**Fresh-derivation step.**" or "**Maintenance
step.**" classification near the top of each skill (since this
affects subagent behavior and is short).

The new docs/skill-optimizer-workflow.md (Phase D, ~110 lines)
covers cross-skill concerns: the 9-step chain table, per-step
re-run triggers, backward triggers (when a step surfaces a problem
with an earlier step), state layout, pointers to the shared docs.

Final line counts:
  B1 investigate-functionality   148 -> 131  (-17)
  B2 investigate-test-case       191 -> 174  (-17)
  B3 investigate-submissions     156 -> 142  (-14)
  B4 write-tests                 200 -> 182  (-18)
  B5 run-bench                   154 -> 131  (-23)
  B6 analyze-result              165 -> 148  (-17)
  B7 improve-skill               181 -> 155  (-26)
  B8 validate-improvement        189 -> 165  (-24)
  workflow.md (new)              0   -> 110

Chain net: -156 lines from chain skills, +110 lines for the
workflow doc that captures the cross-skill content previously
duplicated across each skill's "Iteration behavior". Net per-file
size is smaller, and the per-skill files now follow the closeness
principle: edge cases live next to the step they refine; re-run
triggers live in the operator-facing reference, not in the
per-skill workflow.

Combined with the prior verbosity sweep: chain SKILL.md files have
gone from 2320 to 1228 lines (-1092, -47%) since this morning.

Top-level docs/ should hold project-wide reading; chain-specific
design docs belong elsewhere:

- docs/skill-optimizer-v1.4-spec.md
    -> docs/superpowers/specs/2026-05-19-skill-optimizer-v1.4-design.md
    (matches the superpowers brainstorming skill's convention:
    docs/superpowers/specs/YYYY-MM-DD-<topic>-design.md)

- docs/skill-optimizer-v1.4-plan.md
    -> docs/superpowers/plans/2026-05-19-skill-optimizer-v1.4.md
    (matches the superpowers writing-plans skill's convention:
    docs/superpowers/plans/YYYY-MM-DD-<topic>.md)

- docs/skill-optimizer-workflow.md
    -> skills/skill-optimizer-shared/workflow.md
    (chain-specific operator reference belongs alongside the chain
    it documents; co-located with iteration-protocol.md +
    subagent-dispatch.md + frontmatter-discipline.md)

Top-level docs/ now holds only:
  workbench.md                # project-wide workbench engine guide
  README.codex.md             # install
  README.opencode.md          # install
  skill-writing-philosophy.md # project-wide authoring guidance
  images/                     # project-wide assets
  superpowers/                # superpowers-plugin-managed dir
  pilot-runs/                 # project-wide

Internal references updated:
- The spec doc's "Companion docs" section now reflects the new
  layout (project-wide vs chain-specific)
- Bulk sed across spec + plan to point at new paths
- workflow.md's relative paths fixed for its new location (../skills/
  -> ./ since it now lives in skills/skill-optimizer-shared/)

Date chosen (2026-05-19) is the original git creation date of both
the spec and plan files, matching the YYYY-MM-DD convention.

The legacy canonical skill (skills/skill-optimizer/SKILL.md) was a
v1.3-era artifact — a direct workbench-CLI wrapper. Its role under
the v1.4 chain is filled by skill-optimizer-run-bench. The v1.4
spec already called for this removal "in a separate cleanup PR"
after validation; doing it now while the chain layout is being
restructured anyway.

- Moved skills/skill-optimizer/references/workbench.md
    -> skills/skill-optimizer-shared/workbench.md
  (load-bearing — B4 references the workbench schema reference;
  co-located with the chain's other shared docs)

- Updated B4's pointer to the new location

- Deleted skills/skill-optimizer/ (folder)

Git history preserves the deleted SKILL.md content if needed.

NOT updated in this commit (separate cleanup needed before merge):
- Plugin metadata still references skills/skill-optimizer/SKILL.md
  in .claude-plugin/, .codex-plugin/, .cursor-plugin/, .opencode/,
  gemini-extension.json
- CLAUDE.md mentions skills/skill-optimizer/SKILL.md as canonical
- README.md and CONTRIBUTING.md may reference it

These references need updating to point at the v1.4 chain
(or the chain's entry point) before this work merges to
development. Flagged here so they're not forgotten.
…tion

Two related trims per review:

1. Dropped redundant nouns from 3 chain skill names where the noun
   just restated the namespace (whole namespace is skill-optimizer,
   so "skill"/"result"/"improvement" added no information):

   skill-optimizer-analyze-result       -> skill-optimizer-analyze
   skill-optimizer-improve-skill        -> skill-optimizer-improve
   skill-optimizer-validate-improvement -> skill-optimizer-validate

   The verbs stay because they actually distinguish what each step
   does. Other 5 skills keep their full names (functionality,
   test-case, submissions, tests, bench are meaningful nouns).

2. Dropped the "Throughout this document, 'step 1' through
   'step 9' (no parens) refer to skills in the chain. Internal
   workflow steps within THIS skill are labelled '(a)' through
   '(g)'." paragraph from each chain skill. Workflow steps use
   letters; chain steps use numbers; the distinction is
   self-explanatory from context. Ranged references like "step 1
   through step 9" risked confusing the agent (per review:
   "sometimes the agent might not know what that means").

Updated cross-references throughout: chain skills' handoffs,
spec doc's architecture overview, plan doc's task descriptions,
workflow doc's chain table.

Also fixed the spec doc layout: removed stale skill-optimizer/
folder entry (nuked previously), realigned column comments,
added shared/ entries for workflow.md and workbench.md that
weren't previously listed.

Final chain SKILL.md sizes (8 files, 1196 total lines, avg 150):
  B1 investigate-functionality   131 -> 127
  B2 investigate-test-case       174 -> 170
  B3 investigate-submissions     142 -> 138
  B4 write-tests                 182 -> 178
  B5 run-bench                   131 -> 127
  B6 analyze-result -> analyze   148 -> 144
  B7 improve-skill -> improve    155 -> 151
  B8 validate-improvement -> validate  165 -> 161

The chain previously had a gap: B4's smoke check verified grader/
fixture syntactic consistency (the test-writer wrote both the
grader AND the smoke fixtures, so the smoke check is
self-validation), but nothing independently checked that the
probes actually probe what they claim to. v1.3 ran into this:
grader bugs propagated into misleading bench results and
ducktape-shaped improvements. This step closes that gap.

New skill: skill-optimizer-validate-tests (step 5, fresh-derivation)
- Dispatches test-validator subagents in parallel (one per probe)
- Each judges: does workspace exercise the parent functionality?
  is the grader correct + fair? do smoke fixtures truly
  distinguish (vs. coincidentally match)?
- Writes 05-tests-verdict.md (aggregate + per-probe verdicts)
- Step 6 (run-bench) refuses to fire unless all_probes_approved: true
- Parallel to the step 9 validator for improvement proposals;
  both are anti-ducktape gates
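
The run-bench gate amounts to an all-of check over per-probe verdicts. A minimal Python sketch, assuming verdicts arrive as a mapping from probe path to a verdict string; the `"approve"` value and treating an empty mapping as not approved are assumptions, not something 05-tests-verdict.md is known to specify.

```python
def all_probes_approved(verdicts):
    """True only when there is at least one probe and every one is approved."""
    # Assumption: an empty verdict set should not open the gate.
    return bool(verdicts) and all(v == "approve" for v in verdicts.values())


clean = {"func-a/probe-1": "approve", "func-a/probe-2": "approve"}
mixed = {"func-a/probe-1": "approve", "func-a/probe-2": "needs-revision"}
print(all_probes_approved(clean), all_probes_approved(mixed))
```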

Downstream renumbering (steps 5-9 -> 6-10):
  Step  Skill                                 File
  6     skill-optimizer-run-bench             06-bench-{results,summary}
  7     skill-optimizer-analyze               07-analysis.md
  8     skill-optimizer-improve               08-improvement-proposal.md
  9     skill-optimizer-validate              09-validator-verdict.md
  10    skill-optimizer-autopilot             autopilot-summary-<ts>.md

Bulk renames executed:
- File paths: 05-bench-* -> 06-bench-*, 06-analysis.md -> 07-,
  07-improvement-proposal.md -> 08-, 08-validator-verdict.md -> 09-
- Step number references in all chain SKILL.md + shared docs
  (reverse-order sed to avoid collision: 9->10, 8->9, 7->8, 6->7,
  5->6)
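
Why the substitution order matters, shown in Python rather than sed: renumbering upward in ascending order lets an already-renamed step match the next pattern. The `renumber_steps` helper and the sample sentence are made up for this illustration.

```python
import re


def renumber_steps(text, numbers):
    """Apply 'step N' -> 'step N+1' for each N, in the order given."""
    for n in numbers:
        text = re.sub(rf"\bstep {n}\b", f"step {n + 1}", text)
    return text


before = "step 5 feeds step 6"
safe = renumber_steps(before, [9, 8, 7, 6, 5])      # reverse order
collided = renumber_steps(before, [5, 6, 7, 8, 9])  # ascending order
print(safe)
print(collided)
```

Reverse order yields "step 6 feeds step 7". Ascending order cascades every reference up to "step 10", because each freshly written number is picked up by the following substitution.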

Spec doc updates:
- Goal: "seven independent skills" -> "nine independent skills +
  auto-pilot driver"
- Architecture layout: insert validate-tests at #5; rename
  autopilot to #10
- State layout: insert 05-tests-verdict.md
- Subagent constraints table: add test-validator row
- Per-step sections: insert ### 5. validate-tests; renumber
  existing ### 5-9 to ### 6-10
- Autopilot: "Walks 1→8" -> "Walks 1→9"; "eight steps" -> "nine"

Iteration-protocol + subagent-dispatch shared docs: step-kind
table and fresh-derivation enumeration both updated for the new
step list.

Workflow.md: full rewrite of chain table, re-run triggers matrix,
backward triggers, state layout — added validate-tests row in
each.

Plan doc updated via bulk sed only (it's historical implementation
record; precise per-task accuracy not required at this stage).

Net: chain has 10 entries now (9 chain steps + autopilot).

Phase C of the v1.4 implementation. Each chain skill that
dispatches a subagent loads its prompt template at the dispatch
step; this commit creates the 8 templates.

Each prompt is structured consistently:
- Title + role (what step dispatches it, what it produces)
- Inputs (templated by the operator session with ${VAR}
  placeholders matching the chain skill's substitution list)
- What you see / What you do NOT see (the constraints — these
  mirror what each chain skill says in its Dispatch step, but
  load-bearing for the subagent to internalize)
- Output specification (frontmatter + body shape)
- Reasoning protocol (lettered or numbered steps)
- Edge cases (typically BLOCKED conditions to surface)
- Return summary (what the subagent reports back to the operator)
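
The ${VAR} templating in the Inputs section maps directly onto Python's `string.Template`, which uses the same placeholder syntax. A sketch only: how the operator session actually substitutes is not specified here, and `OUTPUT_PATH` plus both example paths are hypothetical (SKILL_CURRENT_PATH is the one name taken from the B7 notes).

```python
from string import Template


def render_prompt(template_text, substitutions):
    """Fill ${VAR} placeholders; raises KeyError if any value is missing."""
    # substitute() is strict, so an unfilled placeholder cannot reach
    # the subagent unnoticed.
    return Template(template_text).substitute(substitutions)


template_text = "Read ${SKILL_CURRENT_PATH} and write ${OUTPUT_PATH}."
rendered = render_prompt(
    template_text,
    {
        "SKILL_CURRENT_PATH": "skills/example/SKILL.md",  # hypothetical path
        "OUTPUT_PATH": "08-improvement-proposal.md",
    },
)
print(rendered)
```

Using the strict `substitute` rather than `safe_substitute` is a design choice for the sketch: a mismatch between the template and the chain skill's substitution list fails at dispatch time.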

Files written:

1. research-functionality.md (127L) — B1 functionality researcher
   - Reads source skill + targeted web; produces 01-functionality.md
   - Doesn't see prior analyses, tests, or proposals

2. test-case-designer.md (156L) — B2 test-case designer
   - Reads 01-functionality + current tests/ tree; produces
     02-test-proposals.md + tests/<func>/spec.yaml per functionality
   - Doesn't see skill source (forces design from STATED
     responsibilities)

3. research-submissions.md (133L) — B3 submission researcher
   - Reads upstream repo via gh CLI; produces 03-submissions.md
   - Doesn't see proposed change (preserves validator's
     independence)

4. test-writer.md (166L) — B4 test writer (dispatched per probe)
   - Reads probe spec + parent functionality + skill source;
     produces probe folder contents
   - Doesn't see other probes (per-probe isolation prevents
     homogenization + grader-leak hacking)

5. test-validator.md (186L) — B5 test validator (NEW, per probe)
   - Reads probe contents + parent functionality + skill source;
     produces per-probe verdict
   - Doesn't see other probes, test-writer's reasoning, prior
     verdicts
   - Four judgment dimensions: workspace fairness, grader
     correctness, smoke fixture distinguishing power, fairness
     across reasonable agent outputs

6. analyzer.md (183L) — B7 analyzer (anti-ducktape gate)
   - Reads bench results + probe specs (intent only) + skill
     source; produces 07-analysis.md
   - Doesn't see workspace files (forces SKILL-thinking not
     SOLUTION-thinking)
   - Each weakness must include five required parts (Pattern,
     Hypothesized cause, Connects to skill section, What WOULD
     address this, What WOULD NOT address this)

7. optimizer.md (177L) — B8 optimizer (anti-ducktape critical)
   - Reads 07-analysis + 01-functionality + current skill + (if
     PR-bound) 03-submissions; produces 08-improvement-proposal.md
   - Doesn't see raw trials, grader internals, prior proposals,
     prior verdicts
   - Required self-check section against analyzer's anti-pattern
     list

8. validator.md (196L) — B9 validator (anti-ducktape gate)
   - Reads BEFORE + AFTER skill + proposal artifact + 01-functionality
     + (if PR-bound) 03-submissions; produces 09-validator-verdict.md
   - Doesn't see optimizer's reasoning trace or prior verdicts
   - Internal consistency check + external consistency check (if
     PR-bound)

Total: 8 prompts, 1324 lines. Each prompt is self-contained and
testable independently of the parent chain skill.

Subagent-prompt files live in skills/skill-optimizer-subagents/.

Next steps: B10 autopilot SKILL.md; fix each chain SKILL.md's
dispatch step to point at the now-existing prompt file (most
references are already pointing at the right path; this is a
verification pass).
…ss of source

B1 now vendors the source skill to vendored-skill/ unconditionally,
not just for upstream sources. Downstream steps stop branching on
"local vs upstream" — they always read vendored-skill/ as the
single canonical input, and the user's original local file is
never touched by the chain.

Why this is cleaner:
- One code path through the chain (no local/upstream conditional)
- vendored-skill/ is THE input; the original is just the source we
  copied from (path recorded in 01-functionality.md's skill_source
  frontmatter)
- Stability: long-running chain runs don't break if the user edits
  the local file mid-flight
- improved-skill/ vs vendored-skill/ is the clean before/after pair
  for both source types; B8 and B9 just say "improved-skill/ if
  exists, else vendored-skill/" without local-file special cases
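
The "improved-skill/ if exists, else vendored-skill/" rule can be sketched as a single lookup. The helper name, the `runDir` layout, and the injected `exists` predicate are all assumptions for illustration; only the two directory names come from the commit.

```typescript
// Hypothetical helper. `exists` is injected so the rule can be exercised without
// touching a real filesystem; in Node it could be fs.existsSync.
function resolveSkillCurrentPath(
  runDir: string,
  exists: (path: string) => boolean,
): string {
  const improved = `${runDir}/improved-skill`;
  const vendored = `${runDir}/vendored-skill`;
  if (exists(improved)) return improved;
  if (exists(vendored)) return vendored;
  // With unconditional vendoring, reaching here means the source was never vendored.
  throw new Error(`No vendored-skill/ under ${runDir}`);
}

const onDisk = new Set(["run/vendored-skill"]);
const current = resolveSkillCurrentPath("run", (p) => onDisk.has(p));
```

There is no branch on local versus upstream sources anywhere in the lookup, which is the simplification the commit describes.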

B1 SKILL.md updates:
- "What you produce" paragraph: vendoring is unconditional;
  user's original local file is not touched by the chain
- Workflow step (c) renamed from "Vendor the source (upstream
  only)" to "Vendor the source" with explicit upstream/local
  copy paths (gh api fetch / cp -r)
- Re-vendor triggers documented: source URL change (upstream),
  or user explicitly asks (local edits, upstream new commits)

Updated all downstream references:
- B6 (run-bench): "vendored-skill/ should exist" no longer says
  "for upstream skills"
- B8 (improve), B9 (validate): SKILL_CURRENT_PATH simplified —
  "improved-skill/ if exists, else vendored-skill/"
- iteration-protocol's "what this protocol does NOT cover":
  vendored-skill/ described as canonical input regardless of
  upstream/local
- subagent-dispatch's step-7+8 carve-out: same simplification.
  Also fixed pre-existing typo (validator was labeled step 8;
  should be step 9)
- 6 subagent prompts (research-functionality, test-writer,
  test-validator, analyzer, optimizer, validator): SKILL_SOURCE_PATH
  / SKILL_CURRENT_PATH / SKILL_BEFORE_PATH all reference
  vendored-skill/ unconditionally
- workflow.md state-layout comment: "vendored-skill/ (always)"
- spec doc state-layout comment: same

Trim of an architecture branch that wasn't pulling its weight.

Real-world context from prior pilot runs: sometimes the user
points at a SKILL.md that's just a thin wrapper referencing the
actual content elsewhere. Common pattern — a multi-agent plugin
has one canonical agent-agnostic content file and several
agent-flavored SKILL.md wrappers (one per agent target) that all
reference it. The PR target is the canonical content, not the
wrapper; a change to the canonical may affect multiple wrappers.

Updated research-submissions.md subagent prompt:

New frontmatter fields the subagent emits:
- entry_file_pattern: true|false
- canonical_target_repo: <owner>/<repo>
- canonical_target_path: <path to actual content>
- linked_consumers: [<repo>:<path>, ...]

New body sections:
1. Source structure — entry-file vs canonical content; if
   entry-file, document the relationship + linked consumers
2. Suggested PR target — based on (1), recommend which file(s)
   to modify and whether the change affects other consumers

New reasoning protocol steps:
1. Detect entry-file pattern (read source SKILL.md, look for
   thin-wrapper signals: short body of "see X" pointers,
   frontmatter fields like reference: / source: / canonical:,
   multi-agent plugin layout). Follow the pointer to find the
   canonical content if detected.
2. Find linked consumers (search for other SKILL.md files
   referencing the same canonical content)
7. Suggest the PR target based on the above

The license / CLA / frontmatter / conventions research now applies
to the CANONICAL CONTENT'S repo, which may differ from the entry
file's repo.

Note: B1 (functionality researcher) likely also needs awareness of
this pattern — if the user pointed at an entry file, downstream
chain steps test/analyze/optimize the wrapper rather than the
actual skill. Flagged as follow-up but not addressed in this
commit (the user asked specifically about B3).

B1 now handles the entry-file pattern symmetrically with B3,
detected at vendor time so downstream chain steps test/analyze/
optimize the actual skill content rather than a thin wrapper.

Previously: if the user pointed at an entry-file SKILL.md (a
wrapper referencing the actual content elsewhere), B1 vendored
the wrapper and all downstream steps operated on it. The whole
optimization run would miss its real target.

Now (B1 SKILL.md):
- New step (c.1) "Detect entry-file pattern + user gate" runs
  after the initial vendor at (c). Operator reads
  vendored-skill/SKILL.md for thin-wrapper signals (short "see X"
  body; frontmatter reference:/source:/canonical: fields;
  multi-agent plugin layout)
- If detected, surfaces to user with a clear choice: optimize
  the wrapper, or re-vendor the canonical content and optimize
  that. The common case (canonical) re-vendors; the rare case
  (wrapper-specific change) keeps the existing vendored content
  but records the relationship as an operator directive
- Either way, 01-functionality.md frontmatter records
  entry_file_pattern: true|false and canonical_source: <path>
  for downstream steps

Updated research-functionality.md subagent prompt:
- New inputs: ${ENTRY_FILE_PATTERN}, ${CANONICAL_SOURCE}
- New frontmatter fields: entry_file_pattern, canonical_source
- New body section 9: "Entry-file relationship" (only if
  pattern detected) — notes the relationship, the vendored copy
  content, any agent-specific adaptations

Updated research-submissions.md (B3) subagent prompt:
- Reasoning step 1 now reads 01-functionality.md's entry-file
  fields as primary input. B3 trusts B1's detection; falls back
  to its own detection only if 01-functionality.md was written
  before this feature (defensive). If B3 detects a pattern B1
  missed, surface to operator — suggests B1 needs a re-run

The whole chain now consistently knows which is the wrapper and
which is the canonical content from B1 onward.

Subagent prompts shouldn't reference chain-skill internal workflow
labels (the lettered (a)-(g) steps inside each chain SKILL.md) —
subagents only see their dispatched inputs + their prompt, never
the chain skill itself. Two references to "step c.1" (B1's
internal entry-file detection step) were invisible noise.

Rephrased to describe what happened functionally — "the operator
session confirmed with the user and re-vendored" — without naming
the step label.

Chain-step number references (step 1 through step 10) stay, because
those are stable role descriptors the subagent understands as
"another role in the chain", not internal workflow lettering.

Files touched:
- skills/skill-optimizer-subagents/research-functionality.md
- skills/skill-optimizer-subagents/research-submissions.md
…al check

Previous design had B1's operator session do mechanical wrapper
detection at workflow step (c.1) — read SKILL.md, check for
heuristic signals, surface to user. Wrong design: the subagent is
already reading the source for research, has LLM judgment that
beats a mechanical check, and the heuristics ("body under 30 lines",
"frontmatter has reference:") would miss real cases.

Restructured: detection is the subagent's judgment, reported in
its return summary. Operator reacts by surfacing to user, who
picks the resolution.

B1 SKILL.md changes:
- Removed workflow step (c.1) — no more operator-side detection
- Step (g) renamed "Confirm + handle wrapper detection + hand off"
  — reads the subagent's `likely_wrapper` frontmatter field; if
  true, surfaces to user with three realistic responses:
    1. Re-vendor the referenced content and re-research (common)
    2. Proceed treating the wrapper as the skill (rare)
    3. Cancel and provide a different source
- Frontmatter field rename: `entry_file_pattern` -> `likely_wrapper`
  (more honest — it's a judgment, not a binary classification)
  and `canonical_source` -> `wrapper_points_to` (descriptive
  rather than presuming a canonical/wrapper hierarchy)
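
The step (g) gate described above might look roughly like this. The type names, the `askUser` callback, and the action shapes are hypothetical; only the `likely_wrapper`, `wrapper_points_to`, and `skill_source` field names and the three user responses come from the commit.

```typescript
// Frontmatter fields named in the commit; everything else here is illustrative.
interface FunctionalityFrontmatter {
  skill_source: string;
  likely_wrapper: boolean;
  wrapper_points_to?: string;
}

type WrapperChoice = "re-vendor" | "proceed" | "cancel";

type OperatorAction =
  | { kind: "hand-off" }
  | { kind: "re-vendor-and-re-research"; newSource: string }
  | { kind: "stop" };

// The operator only routes the subagent's judgment to the user; it does no
// detection of its own.
function handleWrapperGate(
  fm: FunctionalityFrontmatter,
  askUser: (fm: FunctionalityFrontmatter) => WrapperChoice,
): OperatorAction {
  if (!fm.likely_wrapper) return { kind: "hand-off" };
  const choice = askUser(fm);
  if (choice === "re-vendor") {
    if (fm.wrapper_points_to === undefined) {
      throw new Error("likely_wrapper is true but wrapper_points_to is missing");
    }
    return { kind: "re-vendor-and-re-research", newSource: fm.wrapper_points_to };
  }
  if (choice === "cancel") return { kind: "stop" };
  return { kind: "hand-off" }; // "proceed": treat the wrapper as the skill
}
```

Keeping the user prompt behind a callback means the gate itself stays a plain decision table that can be checked without an interactive session.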

research-functionality.md (B1 subagent prompt):
- Removed ${ENTRY_FILE_PATTERN} and ${CANONICAL_SOURCE} inputs —
  these were operator-pre-detected fields. Detection now lives in
  the subagent's reasoning protocol.
- New "Wrapper detection" section in the reasoning protocol —
  explicit patterns to look for, explicit instruction to record
  `likely_wrapper` + `wrapper_points_to` in frontmatter when
  judged true, and explicit instruction to NOT follow the pointer
  or re-vendor itself (operator's job after user gate)
- Body section 9 renamed to "Wrapper observation" — describes
  signals + confidence rather than asserting a canonical/wrapper
  relationship
- Return summary now includes the wrapper finding so operator can
  trigger the user gate

research-submissions.md (B3 subagent prompt):
- Frontmatter field rename: `canonical_target_repo` /
  `canonical_target_path` -> `pr_target_repo` / `pr_target_path`
  (cleaner — what the PR composer needs is the PR target, not a
  taxonomy of canonical-vs-wrapper)
- Reasoning step 1 reads B1's `likely_wrapper` / `wrapper_points_to`
  as authoritative; falls back to own judgment only if
  01-functionality predates this feature

Net: wrapper-detection happens once (at B1, by the subagent), and
the result flows through frontmatter to downstream steps. Operator
sessions handle the user gates; no operator does LLM-style
judgment work.

Step 2 is now skill-optimizer-investigate-submissions (PR-bound
only); step 3 is the renamed skill-optimizer-design-tests (was
skill-optimizer-investigate-test-case). PR research now logically
follows step 1 immediately when PR-bound, before test design.

Renamed:
- skills/skill-optimizer-investigate-test-case/ → skill-optimizer-design-tests/
- skills/skill-optimizer-subagents/test-case-designer.md → test-designer.md
- 02-test-proposals.md → 03-test-proposals.md; 03-submissions.md → 02-submissions.md (file numbers follow new step numbers)

Also fixed pre-existing step-header bugs from the validate-tests
insertion (run-bench, analyze, improve, validate had stale
"Step N" headers off by one).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

B-prefix shorthand (B1, B2, ...) was internal brainstorming
notation that leaked into shipped SKILL files. Sweeping it out
so the chain documentation is self-explanatory to anyone who
hasn't seen the v1.4 design discussions.

Also fixed three additional stale step-number references found
during the audit:
- frontmatter-discipline.md gating-field summary was off by one
  to two steps (predated the validate-tests insertion)
- analyze SKILL.md "Step 7 will refuse" handoff message named
  the wrong gating step (should be step 8, improve)
- analyze SKILL.md "step-5 problem" should be "step-6 problem"
  for malformed bench output (run-bench is step 6)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The wrapper-vs-target gymnastics in research-submissions was
over-engineered. The correct model: step 1's user gate is where
the wrapper question gets resolved. If the user opts to optimize
the underlying content, step 1 re-vendors with an updated
SKILL_SOURCE and the new 01-functionality.md has
likely_wrapper=false. Downstream steps just read skill_source as
the PR target: no conditional logic, no re-litigation.

Changes:
- step 1 SKILL.md (g): make the SKILL_SOURCE update on re-vendor
  explicit, and document what each user choice means for what
  downstream steps see
- research-submissions.md frontmatter description: pr_target_*
  derives directly from skill_source; no special wrapper case
- Drop body sections 1 ("Source structure") and 2 ("PR target")
  — they were redundant with frontmatter and re-litigated the
  step-1 decision
- Reasoning protocol: replace the wrapper-handling step with a
  one-line read of skill_source; reorder so PR-shape research
  comes before linked-consumer search
- Linked consumers stays as a coordination hint for the PR
  composer, but as its own body section rather than woven into
  the (now-removed) wrapper analysis

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The smoke-skill-distribution test was reading paths under the
nuked skills/skill-optimizer/ folder. Updated to:

- Walk all 9 chain skill directories and verify each has valid
  SKILL.md frontmatter (replaces the single-canonical check that
  predates the v1.4 chain decomposition)
- Point workbench reference checks at skills/skill-optimizer-shared/
  workbench.md (the new shared location)
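
A per-skill frontmatter check could be sketched as below. What the real test counts as "valid" is not stated in the commit; requiring a `---` delimited block with `name` and `description` keys is an assumption made for this example, as is the function name.

```typescript
// Hypothetical validity rule: a leading `---` block containing `name:` and
// `description:` keys. The actual smoke test may check more or less than this.
function hasValidFrontmatter(skillMd: string): boolean {
  const match = /^---\r?\n([\s\S]*?)\r?\n---(\r?\n|$)/.exec(skillMd);
  if (match === null) return false;
  const block = match[1] ?? "";
  const keys = block
    .split(/\r?\n/)
    .map((line) => /^([A-Za-z_][\w-]*):/.exec(line)?.[1])
    .filter((key): key is string => key !== undefined);
  return ["name", "description"].every((required) => keys.includes(required));
}

const sample = "---\nname: example\ndescription: demo skill\n---\n# Body\n";
const sampleOk = hasValidFrontmatter(sample);
```

The walk over the 9 chain skill directories would then read each `SKILL.md` and apply a check of this kind to its contents.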

Also removed an empty skill-optimizer-validate-improvement/
directory that wasn't cleaned up when the skill was renamed to
skill-optimizer-validate.

Known v1.4 debt still on the cleanup list: .claude-plugin/
marketplace.json points at ./skills/skill-optimizer (the nuked
path). The test passes because it does string compare without
existence check; needs follow-up in the plugin metadata sweep.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

After the v1.4 redesign nuked the monolithic skills/skill-optimizer/
in favor of 9 chain skills under skills/skill-optimizer-*/, the
plugin manifests, install docs, and Gemini context file still
pointed at the dead paths. The smoke test caught the workbench
reference but not the marketplace pointer (string compare only).

Fixed across all surfaces:

- .claude-plugin/marketplace.json: skills array now lists all 9
  chain skill paths instead of the dead ./skills/skill-optimizer
- GEMINI.md: @imports point at shared/workflow.md (chain overview)
  + shared/workbench.md instead of the deleted monolithic SKILL.md
  and references/workbench.md
- README.md, CONTRIBUTING.md, AGENTS.md, CLAUDE.md,
  docs/README.{codex,opencode}.md, .cursor/INSTALL.md,
  .codex/INSTALL.md, .opencode/INSTALL.md: replaced "canonical skill" framing with
  the 9-skill chain description; updated --skill flag examples
  to enumerate each chain skill explicitly
- tests/smoke-skill-distribution.ts: marketplace test now asserts
  all 9 chain skills are listed AND verifies each path has a real
  SKILL.md on disk (closes the string-compare-only loophole that
  let the old dead path pass). Gemini test asserts the new
  @import targets.
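
The difference between the old string-compare check and the new existence check can be sketched like this. The function name, the flat list of skill paths, and the injected `exists` predicate are assumptions for illustration; the real test in tests/smoke-skill-distribution.ts is not shown in this PR text.

```typescript
// Hypothetical check: every expected skill must be listed, and every listed
// path must have a SKILL.md on disk. `exists` is injected so the logic can be
// exercised without a filesystem; in Node it could be fs.existsSync.
function checkMarketplaceSkills(
  listed: string[],
  expected: string[],
  exists: (path: string) => boolean,
): string[] {
  const problems: string[] = [];
  for (const skill of expected) {
    if (!listed.includes(skill)) problems.push(`not listed: ${skill}`);
  }
  for (const skill of listed) {
    const skillMd = `${skill.replace(/\/+$/, "")}/SKILL.md`;
    if (!exists(skillMd)) problems.push(`missing on disk: ${skillMd}`);
  }
  return problems;
}

// The dead monolithic path passes a pure string compare against itself but
// fails as soon as existence is checked.
const problems = checkMarketplaceSkills(
  ["./skills/skill-optimizer"],
  ["./skills/skill-optimizer-validate"],
  () => false,
);
```

An empty result means the manifest and the tree agree; any entry in the returned list names the exact path that drifted.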

All 11 tests pass; typecheck clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>