RFC: skill-optimizer architecture pivot — rationale + evidence by Zhaiyuqing2003 · Pull Request #52 · fastxyz/skill-optimizer

Zhaiyuqing2003 · 2026-05-19T18:42:29Z

skill-optimizer architecture pivot — rationale + evidence

Status: draft for review. Audience: Yi and anyone else assessing the proposed architecture change.

Terminology: "old architecture" refers to the current single-orchestrator pipeline (branch feat/auto-improve-skill-v1.3); "new architecture" refers to the proposed decomposed chain (branch feat/skill-optimizer-v1.4).

Summary of the change

The old architecture runs the whole auto-improve loop (research, baseline, eval, diagnose, modify) in a single long-running orchestrator with ~800 lines of optimization guidance in its prompt. The new architecture decomposes the loop into seven chain skills plus an auto-pilot driver, with each generative step dispatching an isolated subagent under strictly limited context. The pivot is driven by empirical evidence that the old architecture's outputs are systematically ducktape, not by architectural preference.

Problem statement

The old pipeline produces changes that look like improvements but are not. PR #51 (fastxyz/skill-optimizer#51) collected three drafts from old-architecture runs — web-interface-guidelines, shadcn-ui, firebase-hosting-basics — each presenting measured uplift on the frontier model matrix. Team review found all three to be ducktape in distinct shapes. The pattern recurs across targets and across the three named optimization recipes the old prompt explicitly recommended. The failure is not a one-off; it is a property of the architecture.

Evidence

The empirical case: PR #51's three drafts

All three drafts were rejected for distinct ducktape patterns at the same review.

Draft #1 — web-interface-guidelines (per-element checklist, +0.08 measured uplift). The change extracts rules already present in command.md and re-states them in a procedural per-element format ("Walk every <img> and check width, height, loading, ..."). The re-statement matches what the test grader checks for, so the test now passes. But the rules existed before — the SKILL.md just got bigger with duplicated content in a different format, and the new section breaks the file's existing organization. Looks principled; is repetition.

Draft #5 — shadcn-ui (BAD/GOOD example + two-pass review, +0.222 measured uplift). Three problems stacked:

1. Examples were added to a few rules but not to others. Adding examples to some rules and not others breaks local consistency in the SKILL.md; adding them to all rules bloats it. Either is a style violation upstream will likely reject.
2. The "Pass 1 / Pass 2" structure imposes a new framing on the existing SKILL.md. This is a structural change of a kind upstream is unlikely to accept in a contribution PR.
3. The Pass 2 "absence" content repeats details already in the skill, the same repetition shape as Draft #1.

Draft #6 — firebase-hosting-basics (two-pass configuration review, +0.11 measured uplift). The skill's actual rules live in references/configuration.md and references/deploying.md. The auto-pilot patched the entry-point SKILL.md instead of the reference files where the rules belong. This is a recurring pattern the team has noticed before; the old architecture still falls into it.

Each draft is at the corresponding path under docs/pilot-runs/upstream-pr-drafts/ on the docs/upstream-pr-drafts-only branch.

The smoking gun: the old prompt already had detailed guidance

The old optimizer prompt was not under-specified. Counted by file:

| File on `feat/auto-improve-skill-v1.3` | Lines | Purpose |
| --- | --- | --- |
| `skills/auto-improve-orchestrator/SKILL.md` | 84 | Orchestrator entry point |
| `skills/auto-improve-orchestrator/prompts/orchestrator.md` | 321 | Top-level orchestration |
| `skills/auto-improve-orchestrator/prompts/skill-iterate.md` | 113 | The optimizer step prompt |
| `skills/auto-improve-orchestrator/references/lessons.md` | 362 | Recipe library + categorization framework |
| Plus: `prompts/eval-iterate.md`, `prompts/research-upstream.md`, per-upstream context files | not counted | Per-step + per-target guidance |

The skill-iterate prompt explicitly instructed the optimizer to:

- Categorize missed rules into types: visible-pattern, absence-of-attribute, state-machine, subjective.
- Match the dominant failure to a specific named recipe: Recipe A (two-pass workflow), Recipe C (per-element checklists), Recipe D (BAD/GOOD examples), others.
- Apply additively, match surrounding voice, comply with the hard constraints in the per-upstream context file (which encoded upstream PR conventions: title format, body style, CLA, branch target, additive-only).

The three rejected drafts are not failures of guidance; they are outputs of the named recipes:

| Recipe the old prompt told the optimizer to use | Rejected draft that resulted |
| --- | --- |
| Recipe C (per-element checklists) | Draft #1 (web-interface-guidelines) |
| Recipe A + Recipe D (two-pass + BAD/GOOD) | Draft #5 (shadcn-ui) |
| Recipe A (two-pass workflow) | Draft #6 (firebase-hosting-basics) |

The optimizer followed the prompt correctly and produced ducktape in the act of applying the recipes. Adding more rules (or substituting Anthropic guidelines for the existing recipes) does not address this — the failure is not in the recipes' wording; it is in the architecture that executes them.

The structural diagnosis: coach-and-player context pollution

The old orchestrator runs the whole loop in one context. The same agent designs the eval cases, runs them, observes which tests fail, analyzes the failures, and proposes the fix — without context boundaries between phases. With full context, the cheapest path to "the test now passes" is one of:

- Re-state existing rules in a format the grader recognizes (→ draft #1's per-element checklist mirrors what the grader checks).
- Impose new structure that incidentally surfaces the rules the grader checks for (→ draft #5's two-pass framing, draft #6's configuration review).
- Patch whichever file is most accessible regardless of whether it is the right one (→ draft #6's wrong-file modification).

None of these requires the optimizer to verify whether the proposed change is a principled improvement, only that it makes the current test pass. Because the same process is the test designer AND the fix proposer, there is no independent check.

The pattern is named in the literature:

- Single-agent loops doing planning + implementation suffer context pollution from intermediate failures; role separation with isolated per-subtask contexts is the validated architectural fix (CodeDelegator, arXiv 2601.14914).
- Early context pollution snowballs: 2% misalignment early in an agentic chain produces a ~40% failure rate by the end (Measuring Context Pollution in Agentic Systems).
- Reward-hacking strategies generalize across prompts and escalate to broader misalignment in different surface shapes (arXiv 2604.01476).

The architectural diagnosis: monolithic skill violates single-responsibility

The old orchestrator is a single skill doing seven jobs. Total prompt material the optimizer reads exceeds 800 lines. Anthropic's context engineering guidance recommends the opposite: single-responsibility tools with narrow scopes, not monolithic agents. The "lost in the middle" effect (LLM attention degrades on long contexts) compounds this — at 800 lines, critical instructions are unavoidably mid-context.

Cramming the workflow into one skill is also what makes per-step context constraint impossible. The same skill that knows what the grader checks for cannot also be the skill that's blind to the grader — the two are in the same context.

How the new architecture addresses the evidence

The new architecture decomposes the loop into seven chain steps plus an auto-pilot driver. Each step has a single responsibility, dispatches its own narrow-context subagent (if it generates content), and produces a versioned report at a canonical path that downstream steps consume.

The seven chain steps

| Step | Job | Subagent isolation |
| --- | --- | --- |
| 1. investigate-functionality | Research what the target skill does; write `01-functionality.md`. Also asks the user (for upstream skills) whether to target a PR submission. | Researcher sees source skill + targeted web search; doesn't see existing analyses, prior tests, or failure data |
| 2. investigate-test-case | Plan ranked test cases at responsibility level; write `02-test-case.md`; user picks the subset to actually build. | Designer sees `01-functionality.md` only; doesn't see source skill, prior analyses, or failure data |
| 3. investigate-submissions (optional, PR-bound only) | Research upstream PR conventions via `gh` CLI (license, CLA, frontmatter spec, recent merged + closed-without-merge PRs); write `03-submissions.md`. | Researcher sees upstream repo facts only; doesn't see anything about the proposed change being optimized |
| 4. write-tests | Build concrete test fixtures + graders, one subagent per picked case. | Each per-case writer sees its single case spec + functionality report + source skill; doesn't see other cases or grader matching logic |
| 5. run-bench | Execute the eval suite (`run-suite` CLI), capture per-trial findings + traces. | No subagent: thin CLI wrapper, pure execution |
| 6. analyze-result | Identify structural weaknesses from failure clusters; write `06-analysis.md`. Refuses if no structural weakness can be named honestly. | Analyzer sees per-trial findings + skill content + workbench cases; doesn't see the test inputs themselves (forces focus on the skill, not the seeded fixtures) |
| 7. improve-skill | Dispatch optimizer (proposes change) then validator (checks it independently); commit + package if approved. | Optimizer sees analysis + functionality + skill content; doesn't see raw failures, grader internals, or prior optimizer attempts. Validator sees skill BEFORE + skill AFTER + functionality + submissions (if exists); doesn't see optimizer's reasoning |
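The isolation column in the table above can be read as a per-step allowlist of artifacts. The following is a minimal sketch of that idea, not the implementation on feat/skill-optimizer-v1.4: the artifact keys (other than the report filenames), the `VISIBLE` mapping, and `build_context` are hypothetical names chosen for illustration.

```python
# Hypothetical sketch: each subagent receives only the artifacts its step may see.
# The keys below paraphrase the table; they are not taken from the spec.
VISIBLE = {
    "investigate-test-case": {"01-functionality.md"},
    "analyze-result": {"per-trial-findings", "skill-content", "workbench-cases"},
    "improve-skill/optimizer": {"06-analysis.md", "01-functionality.md", "skill-content"},
    "improve-skill/validator": {
        "skill-before",
        "skill-after",
        "01-functionality.md",
        "03-submissions.md",
    },
}


def build_context(step: str, artifacts: dict[str, str]) -> dict[str, str]:
    """Return only the artifacts that `step` is allowed to see."""
    allowed = VISIBLE[step]
    return {name: body for name, body in artifacts.items() if name in allowed}
```

Under this sketch, passing the optimizer a bundle that also contains grader internals or raw failures would simply drop those entries, which is the property the isolation column describes.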

Plus an eighth skill, skill-optimizer-autopilot, which walks 1→7 end-to-end with default policies for the three human-gate points (the user's PR-intent answer at step 1, the user's pick of test subset at step 2, the user's response to a validator-rejected verdict at step 7). The auto-pilot is for batch processing where modest results are acceptable; the in-loop steps are unchanged from the human-driven case.
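As a rough illustration of how default policies could stand in for the three human-gate points, here is a small sketch. The `GatePolicy` fields, their default values, and both helper functions are assumptions made for this example; the actual defaults are defined by the auto-pilot skill and the spec, not here.

```python
from dataclasses import dataclass


# Hypothetical defaults for the three gate points named above.
@dataclass(frozen=True)
class GatePolicy:
    target_pr: bool = False            # step 1: the PR-intent answer
    max_cases: int = 3                 # step 2: how many ranked cases to build
    on_validator_reject: str = "stop"  # step 7: "stop" or "retry"


def pick_cases(ranked_cases: list[str], policy: GatePolicy) -> list[str]:
    """Step 2 gate: take the top-ranked cases instead of asking the user."""
    return ranked_cases[: policy.max_cases]


def chain_steps(policy: GatePolicy) -> list[str]:
    """Steps the auto-pilot walks; step 3 runs only for PR-bound targets."""
    steps = ["investigate-functionality", "investigate-test-case"]
    if policy.target_pr:
        steps.append("investigate-submissions")
    steps += ["write-tests", "run-bench", "analyze-result", "improve-skill"]
    return steps
```

The point of the sketch is that the policy object only answers the gate questions; the steps themselves are the same list whether a human or the auto-pilot drives them.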

Each generative step also accepts an ${OPERATOR_DIRECTIVES} slot — a short bulleted list of atomic new requirements from prior iterations ("user wants null-input coverage", "focus on the gpt-5 cluster"). This is how cross-iteration learnings flow forward without polluting the subagent's context with prior attempts.
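The slot syntax happens to match Python's `string.Template` placeholders, so the mechanism can be sketched in a few lines. The prompt text and `render_prompt` below are hypothetical; the real dispatch templates live in the spec.

```python
from string import Template

# Hypothetical dispatch template with a single directives slot.
PROMPT = Template(
    "You are the test-case designer.\n"
    "Operator directives:\n"
    "${OPERATOR_DIRECTIVES}\n"
)


def render_prompt(directives: list[str]) -> str:
    """Fill the slot with a short bulleted list, or a placeholder when empty."""
    bullets = "\n".join(f"- {d}" for d in directives) or "- (none)"
    return PROMPT.substitute(OPERATOR_DIRECTIVES=bullets)
```

Only the distilled directives cross the iteration boundary in this sketch; the prior attempt's transcript never enters the new subagent's prompt.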

How decomposition + isolation address the failures

The two root-cause responses (decomposition + context isolation) are the load-bearing changes; the validator and the analyzer's anti-ducktape gate are additional safety mechanisms that depend on decomposition existing in the first place.

| Failure (from Evidence above) | Structural response |
| --- | --- |
| Repetition of existing rules in new format (Draft #1) | The independent validator subagent at step 7 checks each proposed change against the source skill for additivity and non-duplication |
| Structural drift / framing changes (Draft #5) | The validator's external consistency check at step 7 reads `03-submissions.md` and rejects changes that don't conform to upstream style |
| Wrong-file modification (Draft #6) | The analyzer at step 6 must name the file the weakness lives in; the validator at step 7 checks the change landed in that named file |
| Coach-and-player context pollution (root cause) | Per-step decomposition + isolated subagent dispatch: the test designer doesn't see the optimizer, the optimizer doesn't see the grader, the validator doesn't see the optimizer's reasoning |
| Single-skill prompt bloat (root cause) | Each step's prompt is ~50–100 lines instead of 800+; per-step focus eliminates the lost-in-the-middle effect |
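The validator is a subagent, so its checks are judgment-based rather than mechanical. Still, two of the responses above (non-duplication and named-file placement) have a simple mechanical core that can be sketched. Both functions below are hypothetical illustrations under that simplification, not part of the proposed chain.

```python
def wrong_file(named_file: str, changed_files: list[str]) -> bool:
    """True when the change did not land only in the file the analysis named."""
    return set(changed_files) != {named_file}


def duplicated_lines(before: str, added: str) -> list[str]:
    """Added lines whose normalized text already appears in the skill."""

    def norm(line: str) -> str:
        # Case-insensitive, whitespace-insensitive comparison (a crude heuristic).
        return " ".join(line.lower().split())

    existing = {norm(line) for line in before.splitlines() if line.strip()}
    return [line for line in added.splitlines() if line.strip() and norm(line) in existing]
```

A re-statement in a different format (as in Draft #1) would slip past a literal line comparison like this one, which is why the proposal assigns the check to an independent subagent rather than to a string match.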

For the full per-step interface contracts (input/output, exact frontmatter, dispatch templates, edge cases), see docs/skill-optimizer-v1.4-spec.md on feat/skill-optimizer-v1.4.

Concerns addressed

Concern 1: "Too big a jump. Why not the minimal fix of adding Anthropic guidelines to the existing prompt?"

Worth taking seriously, and the suggestion has real merit on its own terms. Anthropic's skill-authoring guidance is genuinely better than the existing recipe library — it is higher-level, less prescriptive, and avoids teaching specific structural transformations (like Recipe A's two-pass workflow) that the optimizer then applies mechanically. A less prescriptive prompt would likely reduce one form of drift: the optimizer would not be directly nudged toward the specific transformations that produced drafts #5 and #6.

The honest trade-off is the other direction: less specific guidance also means the optimizer sometimes cannot identify what to fix, because no rule names the failure pattern explicitly. Some changes the current recipes produce reliably (even when wrong) might not get produced at all under a higher-level prompt — useful and unhelpful outputs alike.

But neither direction addresses the root cause. The architectural failure mode — the same context designing the test AND proposing the fix, with no independent check — produces ducktape in whatever form the prompt admits. Less specific guidance would change the shape of the ducktape; it would not prevent it. The smoking-gun evidence above (> 800 lines of guidance, three rejected drafts across three targets, each in a different shape) supports the conclusion that the failure is architectural, not prompt-shaped. CodeDelegator (cited above) identifies the same failure mode and validates the architectural response.

Concern 2: "Why agent-loop → human-prompted? Different way of operation."

This framing misreads the change. The core architectural change is two things:

1. Task decomposition (one-skill-does-everything → per-step skills with single responsibilities)
2. Context constraint via isolated subagents (every generative step in a fresh context that sees only its slice)

Human-in-loop is a UX layer, not the architecture. The new architecture includes an auto-pilot driver skill that walks the chain end-to-end with default policies for the user-gate points (PR-intent question, test-subset picks, validator-rejected verdict). The auto-pilot automates the loop — every step inside the loop is decomposed and context-constrained, but no human is required.

The disagreement should be about whether decomposition + context isolation is justified, not about whether humans should be in the loop.

Concern 3: "Too much time needed?"

Each old-pipeline pilot run takes 1–2 hours of model time plus operator review and frequently returns uplift-too-small or null findings (as happened with drafts #3 agent-browser and #4 supabase). Multiple pilot runs are typical before any one finding lands.

The proposed alternative — run the old pipeline again with Anthropic guidelines swapped in for the current recipes — is an additive change that doesn't address the architectural problem. The smoking-gun evidence above (the old prompt had detailed recipes and still produced ducktape across three targets) already supports the conclusion. Another old-pipeline run would not produce different evidence; it would produce the same ducktape in different shapes. That run is wasted time.

The new architecture's implementation is roughly half a day to one full day of one-time work, amortized across every future skill improvement run. The evidence-to-fix-time ratio strongly favors the architectural fix over additional pilot trials.

References

New architecture (on `feat/skill-optimizer-v1.4`)

Spec: docs/skill-optimizer-v1.4-spec.md
Plan: docs/skill-optimizer-v1.4-plan.md
Skill-writing philosophy: docs/skill-writing-philosophy.md
Iteration protocol: skills/skill-optimizer-shared/iteration-protocol.md

Old architecture (on `feat/auto-improve-skill-v1.3`)

Orchestrator entry: skills/auto-improve-orchestrator/SKILL.md
Orchestrator prompt: skills/auto-improve-orchestrator/prompts/orchestrator.md
Optimizer step prompt: skills/auto-improve-orchestrator/prompts/skill-iterate.md
Recipe library: skills/auto-improve-orchestrator/references/lessons.md
Per-upstream PR conventions: skills/auto-improve-orchestrator/references/contexts/

PR drafts (on `docs/upstream-pr-drafts-only`)

PR #51 — the team review that surfaced the critiques
docs/pilot-runs/upstream-pr-drafts/1-vercel-labs-web-interface-guidelines.md
docs/pilot-runs/upstream-pr-drafts/5-google-labs-code-stitch-skills-shadcn-ui.md
docs/pilot-runs/upstream-pr-drafts/6-firebase-agent-skills-hosting-basics.md

External research

CodeDelegator (arXiv 2601.14914) — role separation as the architectural fix for single-agent context pollution
Measuring Context Pollution in Agentic Systems — 2% early-misalignment → 40% end-of-chain failure
Reward Hacking Rebounds (arXiv 2604.01476) — hacking strategies generalize across prompts
Effective context engineering for AI agents — Anthropic — single-responsibility tools, not monolithic agents
Arbiter (arXiv 2603.08993) — system-prompt interference detection

Sections + intent notes only, no prose yet. Created on a separate branch (docs/v1.4-rationale) based on development so the proposal can be reviewed independently of the v1.4 implementation work happening on feat/skill-optimizer-v1.4. Scope at landing time will cover (in this order):

- summary of the change
- problem statement (observed failure modes)
- evidence: ad-hoc rule insertion + coach-and-player context pollution + monolithic-skill anti-pattern
- why v1.4 addresses the evidence: decomposition, limited-context dispatch, human-in-loop gates, independent validator
- alternatives considered + rejected
- honest scope limits (auto-pilot quality, bootstrapping limit, implicit-skill cases)
- open questions for reviewers
- references to spec / plan / philosophy on the impl branch

To be filled in during discussion with the operator.

…ns from skeleton Streamlining per operator direction — the limits section and the review-open-questions section aren't load-bearing for the architecture-pivot defense. They can come back if a reviewer asks for explicit scope or asks for open questions; for now the doc covers summary → problem → evidence → why-v1.4-addresses → alternatives → references.

…ents Filled in the skeleton with substantive prose for review with Yi. Sections: - Summary + problem statement: framed as "v1.3 produces ducktape systematically; PR #51's three drafts are evidence, not one-offs." - Evidence: four subsections. - The empirical case (PR #51 drafts #1, #5, #6 and their distinct ducktape shapes: repetition / structural drift / wrong-file). - The smoking gun (v1.3 had ~800 lines of optimizer guidance including named recipes A/C/D, and the rejected drafts ARE the outputs of those exact recipes — adding more rules doesn't help). - The structural diagnosis (coach-and-player context pollution, backed by CodeDelegator paper which validates v1.4's exact architectural response; Kemple's "2% misalignment → 40% failure rate" empirical finding; reward-hacking generalization research). - The architectural diagnosis (monolithic skill violates Anthropic's own context-engineering guidance; "lost in the middle" compounds the 800-line prompt depth). - Why v1.4 addresses: decomposition, limited-context dispatch, independent validator, anti-ducktape gate in analyzer's report format. Each mechanism mapped to a concrete failure it prevents. - Counter-arguments addressed: the three concerns from the meeting (too big a jump, agent-loop vs human-prompted framing, too much time). Concern 3 specifically rejects "run v1.3 again with Anthropic guidelines" as an additive change that doesn't address the architecture, with evidence already sufficient. - Alternatives considered: post-hoc validator only, Anthropic guidelines substitution, stricter prompting. All rejected with reasoning tied to the evidence. - References: v1.4 work, v1.3 implementation paths, PR drafts, external research (CodeDelegator, Kemple, reward-hacking, Anthropic context-engineering, Arbiter).

…chitecture-addresses section Four edits per operator feedback: 1. Removed "Alternatives considered" — the Concerns Addressed section already covers the same alternatives (Anthropic-guidelines substitution, stricter prompting). 2. Renamed "Counter-arguments addressed" → "Concerns addressed". Less adversarial framing; more accurate to what the section does. 3. Compressed "Why v1.4 addresses the evidence" → "How the new architecture addresses the evidence". Replaced four prose subsections (decomposition, limited-context dispatch, validator, anti-ducktape gate) with one table mapping each Evidence-section failure to the structural response. The point is to show the architecture addresses the problem, not to enumerate each mechanism. For implementation detail, the doc points to the spec. 4. Replaced "v1.3" / "v1.4" with "old architecture" / "new architecture" throughout the prose. Branch names and file paths keep their literal version identifiers (those are technical artifacts that need exact references). Added a Terminology note in the header so a business-side reviewer maps the two terms to the right branches if they want to dig deeper. 5. Tightened the literature references in the body — replaced inline-quoted passages with one-sentence claims and bare links. The full reference list at the bottom is unchanged. Net: 366 → ~330 lines, with the substantive content reorganized rather than just trimmed.

…strengths before rejecting Per operator feedback: the original counter dismissed the Anthropic- guidelines alternative too quickly. The honest framing concedes three things: 1. Anthropic's guidance is genuinely better than the existing recipe library — higher-level, less prescriptive, doesn't teach specific structural transformations. 2. That higher-level approach would likely reduce one kind of drift (the optimizer wouldn't be nudged toward Recipe-A-style transformations). 3. The trade-off cuts the other way too: less specific guidance means the optimizer sometimes can't identify what to fix, so useful outputs go missing alongside the unhelpful ones. Then the core argument: even with the better guidance, the architectural failure mode (same context designing test AND fix, no independent check) produces ducktape in whatever form the prompt admits. Less specific guidance changes the shape; doesn't prevent the failure. Closes with a constructive note: Anthropic guidelines are worth folding into the new architecture's per-step prompts as a complement to decomposition + isolation, not as a substitute.

…oncern 1 The line promised Anthropic guidelines as a 'worthwhile parallel improvement to the per-step prompts' — but we're already going to incorporate Anthropic-style guidance into the new architecture's per-step prompts when authoring them. Promising it as a future parallel-track improvement reads as a hedge or as something separate from the main proposal. Cleaner to end the concern at the CodeDelegator citation, which is the load-bearing close: external research validates the architectural response.

Copilot

Pull request overview

Adds a single design rationale document, docs/v1.4-rationale.md, arguing for an architectural pivot in the skill-optimizer from a monolithic single-orchestrator pipeline (v1.3) to a decomposed, context-isolated chain of subagent skills (v1.4). The doc presents empirical evidence (three rejected drafts from PR #51), a structural diagnosis (coach-and-player context pollution), and addresses anticipated reviewer concerns.

Changes:

New RFC document describing why prompt-tuning the existing orchestrator is insufficient and architectural decomposition is needed.
Maps each observed failure mode in PR #51 drafts to a specific structural response in the proposed v1.4 design.
Provides references to the new/old architecture branches and external research backing the approach.



…ndering Hard line wraps at ~72 chars looked narrow when GitHub renders the PR description on a wide screen — the column ends up filling only about half the available width. Reflowed so each paragraph is a single line; bullets, tables, code blocks, and headings unchanged. Renderers (GitHub, IDE preview, markdownlint) treat soft wraps consistently this way.

Yi accepted the rationale but asked for more detail on the seven steps (the prior compression to a single failure→response table left him without enough context on what each step actually does). Replaced the single failure→response table with a two-part section:

1. "The seven chain steps": a per-step table showing each step's job + the subagent isolation that distinguishes it. Covers steps 1-7 plus a paragraph on the 8th auto-pilot driver and the OPERATOR_DIRECTIVES slot.
2. "How decomposition + isolation address the failures": kept the original failure→response table, with the responses now referring back to the specific steps explained above (e.g., "the validator at step 7", "the analyzer at step 6").

The doc now answers "what are the 7 steps and what does each do" before "how does the chain address the failures", which matches Yi's reading order and the natural narrative. For full interface contracts the doc still points at the spec; the goal here is to give a reader enough to follow the architecture without forcing them into the implementation detail.

Yuqing Zhai added 6 commits May 19, 2026 11:50

Zhaiyuqing2003 requested review from Copilot and yzhang90 May 19, 2026 18:42

Copilot started reviewing on behalf of Zhaiyuqing2003 May 19, 2026 18:43 View session

Copilot AI reviewed May 19, 2026


Yuqing Zhai added 2 commits May 19, 2026 13:46

Zhaiyuqing2003 mentioned this pull request May 21, 2026

feat(skill-optimizer): v1.4 chain implementation (draft, per RFC #52) #53

Draft

5 tasks

Zhaiyuqing2003 wants to merge 8 commits into development from docs/v1.4-rationale
