RFC: skill-optimizer architecture pivot — rationale + evidence#52
Open
Zhaiyuqing2003 wants to merge 8 commits into
added 6 commits
May 19, 2026 11:50
Sections + intent notes only, no prose yet. Created on a separate branch (docs/v1.4-rationale) based on development so the proposal can be reviewed independently of the v1.4 implementation work happening on feat/skill-optimizer-v1.4.

Scope at landing time will cover (in this order):
- summary of the change
- problem statement (observed failure modes)
- evidence: ad-hoc rule insertion + coach-and-player context pollution + monolithic-skill anti-pattern
- why v1.4 addresses the evidence: decomposition, limited-context dispatch, human-in-loop gates, independent validator
- alternatives considered + rejected
- honest scope limits (auto-pilot quality, bootstrapping limit, implicit-skill cases)
- open questions for reviewers
- references to spec / plan / philosophy on the impl branch

To be filled in during discussion with the operator.
…ns from skeleton Streamlining per operator direction — the limits section and the review-open-questions section aren't load-bearing for the architecture-pivot defense. They can come back if a reviewer asks for explicit scope or asks for open questions; for now the doc covers summary → problem → evidence → why-v1.4-addresses → alternatives → references.
…ents

Filled in the skeleton with substantive prose for review with Yi. Sections:
- Summary + problem statement: framed as "v1.3 produces ducktape systematically; PR #51's three drafts are evidence, not one-offs."
- Evidence: four subsections.
  - The empirical case (PR #51 drafts #1, #5, #6 and their distinct ducktape shapes: repetition / structural drift / wrong-file).
  - The smoking gun (v1.3 had ~800 lines of optimizer guidance including named recipes A/C/D, and the rejected drafts ARE the outputs of those exact recipes; adding more rules doesn't help).
  - The structural diagnosis (coach-and-player context pollution, backed by CodeDelegator paper which validates v1.4's exact architectural response; Kemple's "2% misalignment → 40% failure rate" empirical finding; reward-hacking generalization research).
  - The architectural diagnosis (monolithic skill violates Anthropic's own context-engineering guidance; "lost in the middle" compounds the 800-line prompt depth).
- Why v1.4 addresses: decomposition, limited-context dispatch, independent validator, anti-ducktape gate in analyzer's report format. Each mechanism mapped to a concrete failure it prevents.
- Counter-arguments addressed: the three concerns from the meeting (too big a jump, agent-loop vs human-prompted framing, too much time). Concern 3 specifically rejects "run v1.3 again with Anthropic guidelines" as an additive change that doesn't address the architecture, with evidence already sufficient.
- Alternatives considered: post-hoc validator only, Anthropic guidelines substitution, stricter prompting. All rejected with reasoning tied to the evidence.
- References: v1.4 work, v1.3 implementation paths, PR drafts, external research (CodeDelegator, Kemple, reward-hacking, Anthropic context-engineering, Arbiter).
…chitecture-addresses section

Five edits per operator feedback:
1. Removed "Alternatives considered": the Concerns Addressed section already covers the same alternatives (Anthropic-guidelines substitution, stricter prompting).
2. Renamed "Counter-arguments addressed" → "Concerns addressed". Less adversarial framing; more accurate to what the section does.
3. Compressed "Why v1.4 addresses the evidence" → "How the new architecture addresses the evidence". Replaced four prose subsections (decomposition, limited-context dispatch, validator, anti-ducktape gate) with one table mapping each Evidence-section failure to the structural response. The point is to show the architecture addresses the problem, not to enumerate each mechanism. For implementation detail, the doc points to the spec.
4. Replaced "v1.3" / "v1.4" with "old architecture" / "new architecture" throughout the prose. Branch names and file paths keep their literal version identifiers (those are technical artifacts that need exact references). Added a Terminology note in the header so a business-side reviewer maps the two terms to the right branches if they want to dig deeper.
5. Tightened the literature references in the body: replaced inline-quoted passages with one-sentence claims and bare links. The full reference list at the bottom is unchanged.

Net: 366 → ~330 lines, with the substantive content reorganized rather than just trimmed.
…strengths before rejecting

Per operator feedback: the original counter dismissed the Anthropic-guidelines alternative too quickly. The honest framing concedes three things:
1. Anthropic's guidance is genuinely better than the existing recipe library: higher-level, less prescriptive, doesn't teach specific structural transformations.
2. That higher-level approach would likely reduce one kind of drift (the optimizer wouldn't be nudged toward Recipe-A-style transformations).
3. The trade-off cuts the other way too: less specific guidance means the optimizer sometimes can't identify what to fix, so useful outputs go missing alongside the unhelpful ones.

Then the core argument: even with the better guidance, the architectural failure mode (same context designing test AND fix, no independent check) produces ducktape in whatever form the prompt admits. Less specific guidance changes the shape; doesn't prevent the failure. Closes with a constructive note: Anthropic guidelines are worth folding into the new architecture's per-step prompts as a complement to decomposition + isolation, not as a substitute.
…oncern 1 The line promised Anthropic guidelines as a 'worthwhile parallel improvement to the per-step prompts' — but we're already going to incorporate Anthropic-style guidance into the new architecture's per-step prompts when authoring them. Promising it as a future parallel-track improvement reads as a hedge or as something separate from the main proposal. Cleaner to end the concern at the CodeDelegator citation, which is the load-bearing close: external research validates the architectural response.
Pull request overview
Adds a single design rationale document, docs/v1.4-rationale.md, arguing for an architectural pivot in the skill-optimizer from a monolithic single-orchestrator pipeline (v1.3) to a decomposed, context-isolated chain of subagent skills (v1.4). The doc presents empirical evidence (three rejected drafts from PR #51), a structural diagnosis (coach-and-player context pollution), and addresses anticipated reviewer concerns.
Changes:
- New RFC document describing why prompt-tuning the existing orchestrator is insufficient and architectural decomposition is needed.
- Maps each observed failure mode in PR #51 drafts to a specific structural response in the proposed v1.4 design.
- Provides references to the new/old architecture branches and external research backing the approach.
added 2 commits
May 19, 2026 13:46
…ndering Hard line wraps at ~72 chars looked narrow when GitHub renders the PR description on a wide screen — the column ends up filling only about half the available width. Reflowed so each paragraph is a single line; bullets, tables, code blocks, and headings unchanged. Renderers (GitHub, IDE preview, markdownlint) treat soft wraps consistently this way.
Yi accepted the rationale but asked for more detail on the seven steps (the prior compression to a single failure→response table left him without enough context on what each step actually does). Replaced the single failure→response table with a two-part section:
1. "The seven chain steps": a per-step table showing each step's job + the subagent isolation that distinguishes it. Covers steps 1-7 plus a paragraph on the 8th auto-pilot driver and the OPERATOR_DIRECTIVES slot.
2. "How decomposition + isolation address the failures": kept the original failure→response table, with the responses now referring back to the specific steps explained above (e.g., "the validator at step 7", "the analyzer at step 6").

The doc now answers "what are the 7 steps and what does each do" before "how does the chain address the failures", which matches Yi's reading order and the natural narrative. For full interface contracts the doc still points at the spec; the goal here is to give a reader enough to follow the architecture without forcing them into the implementation detail.
skill-optimizer architecture pivot — rationale + evidence
Summary of the change
The old architecture runs the whole auto-improve loop (research, baseline, eval, diagnose, modify) in a single long-running orchestrator with ~800 lines of optimization guidance in its prompt. The new architecture decomposes the loop into seven chain skills plus an auto-pilot driver, with each generative step dispatching an isolated subagent under strict limited context. The pivot is driven by empirical evidence that the old architecture's outputs are systematically ducktape — not by architectural preference.
Problem statement
The old pipeline produces changes that look like improvements but are not. PR #51 (fastxyz/skill-optimizer#51) collected three drafts from old-architecture runs — web-interface-guidelines, shadcn-ui, firebase-hosting-basics — each presenting measured uplift on the frontier model matrix. Team review found all three to be ducktape in distinct shapes. The pattern recurs across targets and across the three named optimization recipes the old prompt explicitly recommended. The failure is not a one-off; it is a property of the architecture.
Evidence
The empirical case: PR #51's three drafts
All three drafts were rejected for distinct ducktape patterns at the same review.
Draft #1: web-interface-guidelines (per-element checklist, +0.08 measured uplift). The change extracts rules already present in `command.md` and re-states them in a procedural per-element format ("Walk every `<img>` and check width, height, loading, ..."). The re-statement matches what the test grader checks for, so the test now passes. But the rules existed before; the SKILL.md just got bigger with duplicated content in a different format, and the new section breaks the file's existing organization. Looks principled; is repetition.

Draft #5: shadcn-ui (BAD/GOOD example + two-pass review, +0.222 measured uplift). Three problems stacked:

Draft #6: firebase-hosting-basics (two-pass configuration review, +0.11 measured uplift). The skill's actual rules live in `references/configuration.md` and `references/deploying.md`. The auto-pilot patched the entry-point `SKILL.md` instead of the reference files where the rules belong. This is a recurring pattern the team has noticed before; the old architecture still falls into it.

Each draft is at the corresponding path under `docs/pilot-runs/upstream-pr-drafts/` on the `docs/upstream-pr-drafts-only` branch.

The smoking gun: the old prompt already had detailed guidance
The old optimizer prompt was not under-specified. Counted by file:

On `feat/auto-improve-skill-v1.3`:
- `skills/auto-improve-orchestrator/SKILL.md`
- `skills/auto-improve-orchestrator/prompts/orchestrator.md`
- `skills/auto-improve-orchestrator/prompts/skill-iterate.md`
- `skills/auto-improve-orchestrator/references/lessons.md`
- `prompts/eval-iterate.md`, `prompts/research-upstream.md`, per-upstream context files

The skill-iterate prompt explicitly instructed the optimizer to:
The three rejected drafts are not failures of guidance; they are outputs of the named recipes:
The optimizer followed the prompt correctly and produced ducktape in the act of applying the recipes. Adding more rules (or substituting Anthropic guidelines for the existing recipes) does not address this — the failure is not in the recipes' wording; it is in the architecture that executes them.
The structural diagnosis: coach-and-player context pollution
The old orchestrator runs the whole loop in one context. The same agent designs the eval cases, runs them, observes which tests fail, analyzes the failures, and proposes the fix — without context boundaries between phases. With full context, the cheapest path to "the test now passes" is one of:
None of these requires the optimizer to verify whether the proposed change is a principled improvement, only that it makes the current test pass. Because the same process is the test designer AND the fix proposer, there is no independent check.
The pattern is named in the literature:
The architectural diagnosis: monolithic skill violates single-responsibility
The old orchestrator is a single skill doing seven jobs. Total prompt material the optimizer reads exceeds 800 lines. Anthropic's context engineering guidance recommends the opposite: single-responsibility tools with narrow scopes, not monolithic agents. The "lost in the middle" effect (models use information placed in the middle of a long context less reliably than information near its start or end) compounds this: in an 800-line prompt, some critical instructions unavoidably sit mid-context.
Cramming the workflow into one skill is also what makes per-step context constraint impossible. The same skill that knows what the grader checks for cannot also be the skill that's blind to the grader — the two are in the same context.
How the new architecture addresses the evidence
The new architecture decomposes the loop into seven chain steps plus an auto-pilot driver. Each step has a single responsibility, dispatches its own narrow-context subagent (if it generates content), and produces a versioned report at a canonical path that downstream steps consume.
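The limited-context idea in this paragraph can be sketched as an allow-list: a step's subagent receives only the reports the step declares, and nothing else from the run directory. This is a minimal sketch under stated assumptions, not the spec's interface. `ChainStep`, `build_context`, and the step name used in the usage note are hypothetical names introduced for illustration; only the report file names come from this RFC.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class ChainStep:
    """One chain step: a single job, an allow-list of inputs, one report."""
    name: str               # hypothetical step name, not taken from the spec
    reads: tuple[str, ...]  # reports this step's subagent is allowed to see
    writes: str             # canonical report this step produces


def build_context(step: ChainStep, run_dir: Path) -> dict[str, str]:
    """Load only the allow-listed reports; other files in run_dir stay invisible."""
    context: dict[str, str] = {}
    for report in step.reads:
        path = run_dir / report
        if not path.is_file():
            raise FileNotFoundError(f"{step.name} needs {report}, which is missing")
        context[report] = path.read_text(encoding="utf-8")
    return context
```

For example, a hypothetical test-design step declared as `ChainStep("test-design", ("01-functionality.md",), "02-test-case.md")` would get a context containing `01-functionality.md` and nothing else, even if `06-analysis.md` from an earlier iteration sits in the same directory.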
The seven chain steps
- … `01-functionality.md`. Also asks the user (for upstream skills) whether to target a PR submission.
- … `02-test-case.md`; user picks the subset to actually build.
- … `01-functionality.md` only; doesn't see source skill, prior analyses, or failure data
- … `gh` CLI (license, CLA, frontmatter spec, recent merged + closed-without-merge PRs); write `03-submissions.md`.
- … `run-suite` CLI), capture per-trial findings + traces.
- … `06-analysis.md`. Refuses if no structural weakness can be named honestly.

Plus an eighth skill, skill-optimizer-autopilot, which walks 1→7 end-to-end with default policies for the three human-gate points (the user's PR-intent answer at step 1, the user's pick of test subset at step 2, the user's response to a validator-rejected verdict at step 7). The auto-pilot is for batch processing where modest results are acceptable; the in-loop steps are unchanged from the human-driven case.
Each generative step also accepts an `${OPERATOR_DIRECTIVES}` slot: a short bulleted list of atomic new requirements from prior iterations ("user wants null-input coverage", "focus on the gpt-5 cluster"). This is how cross-iteration learnings flow forward without polluting the subagent's context with prior attempts.

How decomposition + isolation address the failures
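One concrete isolation mechanism is the `${OPERATOR_DIRECTIVES}` slot: the subagent gets a short list of requirements, never the transcripts that produced them. The sketch below fills such a slot with Python's `string.Template`, which happens to use the same `${name}` placeholder syntax. The prompt wording, `render_directives`, and `render_prompt` are assumptions made for illustration; only the slot name and the two example directives come from the RFC.

```python
from string import Template

# Hypothetical per-step prompt. Only the ${OPERATOR_DIRECTIVES} slot name
# is taken from the RFC; the surrounding wording is invented.
STEP_PROMPT = Template(
    "You are designing test cases for the skill described below.\n\n"
    "Operator directives:\n${OPERATOR_DIRECTIVES}\n"
)


def render_directives(directives: list[str]) -> str:
    """Format directives as a short bulleted list, or a placeholder if empty."""
    cleaned = [d.strip() for d in directives if d.strip()]
    if not cleaned:
        return "- (none)"
    return "\n".join(f"- {d}" for d in cleaned)


def render_prompt(directives: list[str]) -> str:
    """Fill the slot; prior attempts and analyses never enter the prompt."""
    return STEP_PROMPT.substitute(OPERATOR_DIRECTIVES=render_directives(directives))
```

Calling `render_prompt(["user wants null-input coverage", "focus on the gpt-5 cluster"])` yields a prompt whose only trace of earlier iterations is those two bullets.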
The two root-cause responses (decomposition + context isolation) are the load-bearing changes; the validator and the analyzer's anti-ducktape gate are additional safety mechanisms that depend on decomposition existing in the first place.
… `03-submissions.md` and rejects changes that don't conform to upstream style

For the full per-step interface contracts (input/output, exact frontmatter, dispatch templates, edge cases), see `docs/skill-optimizer-v1.4-spec.md` on `feat/skill-optimizer-v1.4`.

Concerns addressed
Concern 1: "Too big a jump. Why not the minimal fix of adding Anthropic guidelines to the existing prompt?"
Worth taking seriously, and the suggestion has real merit on its own terms. Anthropic's skill-authoring guidance is genuinely better than the existing recipe library — it is higher-level, less prescriptive, and avoids teaching specific structural transformations (like Recipe A's two-pass workflow) that the optimizer then applies mechanically. A less prescriptive prompt would likely reduce one form of drift: the optimizer would not be directly nudged toward the specific transformations that produced drafts #5 and #6.
The honest trade-off is the other direction: less specific guidance also means the optimizer sometimes cannot identify what to fix, because no rule names the failure pattern explicitly. Some changes the current recipes produce reliably (even when wrong) might not get produced at all under a higher-level prompt — useful and unhelpful outputs alike.
But neither direction addresses the root cause. The architectural failure mode — the same context designing the test AND proposing the fix, with no independent check — produces ducktape in whatever form the prompt admits. Less specific guidance would change the shape of the ducktape; it would not prevent it. The smoking-gun evidence above (> 800 lines of guidance, three rejected drafts across three targets, each in a different shape) supports the conclusion that the failure is architectural, not prompt-shaped. CodeDelegator (cited above) identifies the same failure mode and validates the architectural response.
Concern 2: "Why agent-loop → human-prompted? Different way of operation."
This framing misreads the change. The core architectural change is two things:
Human-in-loop is a UX layer, not the architecture. The new architecture includes an auto-pilot driver skill that walks the chain end-to-end with default policies for the user-gate points (PR-intent question, test-subset picks, validator-rejected verdict). The auto-pilot automates the loop — every step inside the loop is decomposed and context-constrained, but no human is required.
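The separation between the chain and its gates can be sketched as a policy object: the same chain runs either way, and only the answers at the three gate points differ. This is an illustrative sketch under stated assumptions. `GatePolicy`, `AUTOPILOT`, `run_chain`, and the default answers are hypothetical; the RFC names the three gate points but does not state what the auto-pilot's defaults are.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class GatePolicy:
    """Answers for the three user-gate points named in the RFC."""
    pr_intent: Callable[[], bool]                  # target a PR submission?
    pick_tests: Callable[[list[str]], list[str]]   # which proposed tests to build
    on_rejected: Callable[[str], str]              # what to do on a rejected verdict


# Hypothetical defaults, chosen only to make the sketch concrete.
AUTOPILOT = GatePolicy(
    pr_intent=lambda: False,
    pick_tests=lambda proposed: list(proposed),
    on_rejected=lambda reason: "stop",
)


def run_chain(policy: GatePolicy, proposed_tests: list[str], verdict: str) -> dict:
    """Walk the gate points in order; the steps between them are stubbed out."""
    target_pr = policy.pr_intent()             # gate at step 1
    tests = policy.pick_tests(proposed_tests)  # gate at step 2
    action = "accept"
    if verdict != "accepted":
        action = policy.on_rejected(verdict)   # gate at step 7
    return {"target_pr": target_pr, "tests": tests, "action": action}
```

A human-driven run would pass a `GatePolicy` whose callables prompt the user; `run_chain` itself does not change, which is the sense in which human-in-loop is a layer on top of the architecture.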
The disagreement should be about whether decomposition + context isolation is justified, not about whether humans should be in the loop.
Concern 3: "Too much time needed?"
Each old-pipeline pilot run takes 1–2 hours of model time plus operator review and frequently returns `uplift-too-small` or null findings (as happened with drafts #3 agent-browser and #4 supabase). Multiple pilot runs are typical before any one finding lands.

The proposed alternative (run the old pipeline again with Anthropic guidelines swapped in for the current recipes) is an additive change that doesn't address the architectural problem. The smoking-gun evidence above (the old prompt had detailed recipes and still produced ducktape across three targets) already supports the conclusion. Another old-pipeline run would not produce different evidence; it would produce the same ducktape in different shapes. That run is wasted time.
The new architecture's implementation is roughly half a day to one full day of one-time work, amortized across every future skill improvement run. The evidence-to-fix-time ratio strongly favors the architectural fix over additional pilot trials.
References
New architecture (on `feat/skill-optimizer-v1.4`)
- `docs/skill-optimizer-v1.4-spec.md`
- `docs/skill-optimizer-v1.4-plan.md`
- `docs/skill-writing-philosophy.md`
- `skills/skill-optimizer-shared/iteration-protocol.md`

Old architecture (on `feat/auto-improve-skill-v1.3`)
- `skills/auto-improve-orchestrator/SKILL.md`
- `skills/auto-improve-orchestrator/prompts/orchestrator.md`
- `skills/auto-improve-orchestrator/prompts/skill-iterate.md`
- `skills/auto-improve-orchestrator/references/lessons.md`
- `skills/auto-improve-orchestrator/references/contexts/`

PR drafts (on `docs/upstream-pr-drafts-only`)
- `docs/pilot-runs/upstream-pr-drafts/1-vercel-labs-web-interface-guidelines.md`
- `docs/pilot-runs/upstream-pr-drafts/5-google-labs-code-stitch-skills-shadcn-ui.md`
- `docs/pilot-runs/upstream-pr-drafts/6-firebase-agent-skills-hosting-basics.md`

External research