
RFC: skill-optimizer architecture pivot — rationale + evidence#52

Open
Zhaiyuqing2003 wants to merge 8 commits into development from docs/v1.4-rationale

Zhaiyuqing2003 commented May 19, 2026

skill-optimizer architecture pivot — rationale + evidence

Status: draft for review. Audience: Yi and anyone else assessing the proposed architecture change.

Terminology: "old architecture" refers to the current single-orchestrator pipeline (branch feat/auto-improve-skill-v1.3); "new architecture" refers to the proposed decomposed chain (branch feat/skill-optimizer-v1.4).

Summary of the change

The old architecture runs the whole auto-improve loop (research, baseline, eval, diagnose, modify) in a single long-running orchestrator with ~800 lines of optimization guidance in its prompt. The new architecture decomposes the loop into seven chain skills plus an auto-pilot driver, with each generative step dispatching an isolated subagent under a strictly limited context. The pivot is driven by empirical evidence that the old architecture's outputs are systematically ducktape, not by architectural preference.

Problem statement

The old pipeline produces changes that look like improvements but are not. PR #51 (fastxyz/skill-optimizer#51) collected three drafts from old-architecture runs — web-interface-guidelines, shadcn-ui, firebase-hosting-basics — each presenting measured uplift on the frontier model matrix. Team review found all three to be ducktape in distinct shapes. The pattern recurs across targets and across the three named optimization recipes the old prompt explicitly recommended. The failure is not a one-off; it is a property of the architecture.

Evidence

The empirical case: PR #51's three drafts

All three drafts were rejected for distinct ducktape patterns at the same review.

Draft #1 — web-interface-guidelines (per-element checklist, +0.08 measured uplift). The change extracts rules already present in command.md and re-states them in a procedural per-element format ("Walk every <img> and check width, height, loading, ..."). The re-statement matches what the test grader checks for, so the test now passes. But the rules existed before — the SKILL.md just got bigger with duplicated content in a different format, and the new section breaks the file's existing organization. Looks principled; is repetition.

Draft #5 — shadcn-ui (BAD/GOOD example + two-pass review, +0.222 measured uplift). Three problems stacked:

  • Examples added to a few rules but not to others. Adding examples to one rule and not others breaks local consistency in the SKILL.md; adding to all bloats. Either is a style violation upstream will likely reject.
  • "Pass 1 / Pass 2" structure imposes a new framing on the existing SKILL.md. This is a structural change of a kind upstream is unlikely to accept in a contribution PR.
  • The Pass 2 "absence" content repeats details already in the skill: the same repetition shape as Draft #1.

Draft #6 — firebase-hosting-basics (two-pass configuration review, +0.11 measured uplift). The skill's actual rules live in references/configuration.md and references/deploying.md. The auto-pilot patched the entry-point SKILL.md instead of the reference files where the rules belong. This is a recurring pattern the team has noticed before; the old architecture still falls into it.

Each draft is at the corresponding path under docs/pilot-runs/upstream-pr-drafts/ on the docs/upstream-pr-drafts-only branch.

The smoking gun: the old prompt already had detailed guidance

The old optimizer prompt was not under-specified. Counted by file:

| File on feat/auto-improve-skill-v1.3 | Lines | Purpose |
| --- | --- | --- |
| skills/auto-improve-orchestrator/SKILL.md | 84 | Orchestrator entry point |
| skills/auto-improve-orchestrator/prompts/orchestrator.md | 321 | Top-level orchestration |
| skills/auto-improve-orchestrator/prompts/skill-iterate.md | 113 | The optimizer step prompt |
| skills/auto-improve-orchestrator/references/lessons.md | 362 | Recipe library + categorization framework |
| Plus: prompts/eval-iterate.md, prompts/research-upstream.md, per-upstream context files | | Per-step + per-target guidance |
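As a quick check on the "~800 lines" figure, the four counted files can be summed directly. The short Python sketch below uses only the line counts from the table; the file-name keys are abbreviated for readability, and the uncounted "Plus" row would only add to the total.

```python
# Line counts copied from the table above (branch feat/auto-improve-skill-v1.3).
LINE_COUNTS = {
    "SKILL.md": 84,
    "prompts/orchestrator.md": 321,
    "prompts/skill-iterate.md": 113,
    "references/lessons.md": 362,
}


def total_lines(counts: dict[str, int]) -> int:
    """Sum the per-file line counts."""
    return sum(counts.values())


print(total_lines(LINE_COUNTS))  # 880
```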

The skill-iterate prompt explicitly instructed the optimizer to:

  1. Categorize missed rules into types: visible-pattern, absence-of-attribute, state-machine, subjective.
  2. Match the dominant failure to a specific named recipe — Recipe A (two-pass workflow), Recipe C (per-element checklists), Recipe D (BAD/GOOD examples), others.
  3. Apply additively, match surrounding voice, comply with the hard constraints in the per-upstream context file (which encoded upstream PR conventions: title format, body style, CLA, branch target, additive-only).

The three rejected drafts are not failures of guidance; they are outputs of the named recipes:

| Recipe the old prompt told the optimizer to use | Rejected draft that resulted |
| --- | --- |
| Recipe C: per-element checklists | Draft #1 (web-interface-guidelines) |
| Recipe A + Recipe D: two-pass + BAD/GOOD | Draft #5 (shadcn-ui) |
| Recipe A: two-pass workflow | Draft #6 (firebase-hosting-basics) |

The optimizer followed the prompt correctly and produced ducktape in the act of applying the recipes. Adding more rules (or substituting Anthropic guidelines for the existing recipes) does not address this — the failure is not in the recipes' wording; it is in the architecture that executes them.

The structural diagnosis: coach-and-player context pollution

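The old orchestrator runs the whole loop in one context. The same agent designs the eval cases, runs them, observes which tests fail, analyzes the failures, and proposes the fix, with no context boundaries between phases. With full context, the cheapest path to "the test now passes" is one of the shortcuts the three drafts above took: restating existing rules in the form the grader checks for (Draft #1), layering a new structure over the skill (Draft #5), or patching the entry-point file instead of the file where the rules live (Draft #6).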

None of these requires the optimizer to verify whether the proposed change is a principled improvement, only that it makes the current test pass. Because the same process is the test designer AND the fix proposer, there is no independent check.

The pattern is named in the literature:

  • Single-agent loops doing planning + implementation suffer context pollution from intermediate failures; role separation with isolated per-subtask contexts is the validated architectural fix — CodeDelegator (arXiv 2601.14914).
  • Early context pollution snowballs: 2% misalignment early in an agentic chain produces ~40% failure rate by the end — Measuring Context Pollution in Agentic Systems.
  • Reward-hacking strategies generalize across prompts and escalate to broader misalignment in different surface shapes — arXiv 2604.01476.

The architectural diagnosis: monolithic skill violates single-responsibility

The old orchestrator is a single skill doing seven jobs. Total prompt material the optimizer reads exceeds 800 lines. Anthropic's context engineering guidance recommends the opposite: single-responsibility tools with narrow scopes, not monolithic agents. The "lost in the middle" effect (models use information in the middle of a long context less reliably than information near the start or end) compounds this: at 800+ lines, many critical instructions necessarily sit mid-context.

Cramming the workflow into one skill is also what makes per-step context constraint impossible. The same skill that knows what the grader checks for cannot also be the skill that's blind to the grader — the two are in the same context.

How the new architecture addresses the evidence

The new architecture decomposes the loop into seven chain steps plus an auto-pilot driver. Each step has a single responsibility, dispatches its own narrow-context subagent (if it generates content), and produces a versioned report at a canonical path that downstream steps consume.

The seven chain steps

| Step | Job | Subagent isolation |
| --- | --- | --- |
| 1. investigate-functionality | Research what the target skill does; write 01-functionality.md. Also asks the user (for upstream skills) whether to target a PR submission. | Researcher sees source skill + targeted web search; doesn't see existing analyses, prior tests, or failure data |
| 2. investigate-test-case | Plan ranked test cases at responsibility level; write 02-test-case.md; user picks the subset to actually build. | Designer sees 01-functionality.md only; doesn't see source skill, prior analyses, or failure data |
| 3. investigate-submissions (optional, PR-bound only) | Research upstream PR conventions via gh CLI (license, CLA, frontmatter spec, recent merged + closed-without-merge PRs); write 03-submissions.md. | Researcher sees upstream repo facts only; doesn't see anything about the proposed change being optimized |
| 4. write-tests | Build concrete test fixtures + graders, one subagent per picked case. | Each per-case writer sees its single case spec + functionality report + source skill; doesn't see other cases or grader matching logic |
| 5. run-bench | Execute the eval suite (run-suite CLI), capture per-trial findings + traces. | No subagent: thin CLI wrapper, pure execution |
| 6. analyze-result | Identify structural weaknesses from failure clusters; write 06-analysis.md. Refuses if no structural weakness can be named honestly. | Analyzer sees per-trial findings + skill content + workbench cases; doesn't see the test inputs themselves (forces focus on the skill, not the seeded fixtures) |
| 7. improve-skill | Dispatch optimizer (proposes change) then validator (checks it independently); commit + package if approved. | Optimizer sees analysis + functionality + skill content; doesn't see raw failures, grader internals, or prior optimizer attempts. Validator sees skill BEFORE + skill AFTER + functionality + submissions (if exists); doesn't see optimizer's reasoning |
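The isolation column can be read as an allow-list per subagent. The following is a minimal Python sketch of that idea, not the implementation on feat/skill-optimizer-v1.4: the role names, the artifact keys (`source_skill`, `functionality_report`, and so on), and the `build_context` helper are hypothetical names chosen for illustration. Only the sees / doesn't-see relationships come from the table.

```python
# Hypothetical allow-lists, one per subagent role named in the table.
# Role and artifact names are illustrative, not taken from the spec.
VISIBLE = {
    "researcher": {"source_skill", "web_search"},
    "test_designer": {"functionality_report"},
    "submissions_researcher": {"upstream_repo_facts"},
    "test_writer": {"case_spec", "functionality_report", "source_skill"},
    "analyzer": {"trial_findings", "source_skill", "workbench_cases"},
    "optimizer": {"analysis_report", "functionality_report", "source_skill"},
    "validator": {"skill_before", "skill_after", "functionality_report",
                  "submissions_report"},
}


def build_context(role: str, artifacts: dict[str, str]) -> dict[str, str]:
    """Return only the artifacts the given role is allowed to see."""
    allowed = VISIBLE[role]
    return {name: body for name, body in artifacts.items() if name in allowed}


artifacts = {
    "source_skill": "...",
    "functionality_report": "...",
    "analysis_report": "...",
    "trial_findings": "...",
    "grader_internals": "...",
}
optimizer_view = build_context("optimizer", artifacts)
print(sorted(optimizer_view))
# ['analysis_report', 'functionality_report', 'source_skill']
```

Because nothing outside the allow-list is ever copied into the subagent's input, the optimizer in this sketch cannot read raw failures or grader internals even if they exist on disk.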

Plus an eighth skill, skill-optimizer-autopilot, which walks 1→7 end-to-end with default policies for the three human-gate points (the user's PR-intent answer at step 1, the user's pick of test subset at step 2, the user's response to a validator-rejected verdict at step 7). The auto-pilot is for batch processing where modest results are acceptable; the in-loop steps are unchanged from the human-driven case.
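One way to picture the auto-pilot is as a loop over the chain with a policy object answering the three human-gate questions. This is a sketch under assumptions: the `GatePolicy` fields, their default values, and the `run_chain` signature are invented for illustration and are not the defaults the auto-pilot skill actually uses.

```python
from dataclasses import dataclass
from typing import Callable

STEPS = [
    "investigate-functionality",
    "investigate-test-case",
    "investigate-submissions",  # optional, PR-bound only
    "write-tests",
    "run-bench",
    "analyze-result",
    "improve-skill",
]


@dataclass
class GatePolicy:
    """Hypothetical defaults for the three human-gate points."""
    target_pr: bool = False             # step 1: PR-intent answer
    max_test_cases: int = 3             # step 2: how many ranked cases to build
    on_validator_reject: str = "stop"   # step 7: response to a rejected verdict


def planned_steps(policy: GatePolicy) -> list[str]:
    """Steps the driver would walk; step 3 runs only for PR-bound targets."""
    return [s for s in STEPS
            if s != "investigate-submissions" or policy.target_pr]


def run_chain(policy: GatePolicy,
              run_step: Callable[[str, GatePolicy], None]) -> list[str]:
    """Walk the planned steps in order, delegating each one to run_step."""
    done = []
    for step in planned_steps(policy):
        run_step(step, policy)
        done.append(step)
    return done


walked = run_chain(GatePolicy(), lambda step, policy: None)
print(len(walked))  # 6: step 3 is skipped when no PR is targeted
```

The per-step work stays behind `run_step`, which mirrors the point that the in-loop steps are the same whether a human or the auto-pilot answers the gates.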

Each generative step also accepts an ${OPERATOR_DIRECTIVES} slot — a short bulleted list of atomic new requirements from prior iterations ("user wants null-input coverage", "focus on the gpt-5 cluster"). This is how cross-iteration learnings flow forward without polluting the subagent's context with prior attempts.
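The `${OPERATOR_DIRECTIVES}` placeholder happens to match the syntax of Python's `string.Template`, so the slot can be illustrated with it. The prompt text and the `render_prompt` helper below are made up for the example; only the slot name and the two sample directives come from this document.

```python
from string import Template

# Illustrative prompt skeleton; the real per-step prompts live on the
# implementation branch and are not reproduced here.
PROMPT = Template(
    "You are the test designer.\n"
    "Operator directives:\n"
    "${OPERATOR_DIRECTIVES}\n"
)


def render_prompt(directives: list[str]) -> str:
    """Fill the slot with a short bulleted list, or '(none)' when empty."""
    bullets = "\n".join(f"- {d}" for d in directives) or "(none)"
    return PROMPT.substitute(OPERATOR_DIRECTIVES=bullets)


print(render_prompt([
    "user wants null-input coverage",
    "focus on the gpt-5 cluster",
]))
```

Passing only the distilled directives, rather than transcripts of earlier iterations, is what keeps prior attempts out of the subagent's context in this sketch.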

How decomposition + isolation address the failures

The two root-cause responses (decomposition + context isolation) are the load-bearing changes; the validator and the analyzer's anti-ducktape gate are additional safety mechanisms that depend on decomposition existing in the first place.

| Failure (from Evidence above) | Structural response |
| --- | --- |
| Repetition of existing rules in new format (Draft #1) | The independent validator subagent at step 7 checks each proposed change against the source skill for additivity and non-duplication |
| Structural drift / framing changes (Draft #5) | The validator's external consistency check at step 7 reads 03-submissions.md and rejects changes that don't conform to upstream style |
| Wrong-file modification (Draft #6) | The analyzer at step 6 must name the file the weakness lives in; the validator at step 7 checks the change landed in that named file |
| Coach-and-player context pollution (root cause) | Per-step decomposition + isolated subagent dispatch: the test designer doesn't see the optimizer, the optimizer doesn't see the grader, the validator doesn't see the optimizer's reasoning |
| Single-skill prompt bloat (root cause) | Each step's prompt is ~50–100 lines instead of 800+; per-step focus eliminates the lost-in-the-middle effect |
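Two of the validator checks in the table are mechanical enough to sketch. The code below is a hedged illustration, not the validator prompt: the function names, the `changed_files` and `named_file` parameters, and the exact-line duplicate test are all assumptions, and a real validator subagent would judge duplication semantically rather than by string equality.

```python
from collections import Counter


def landed_in_named_file(named_file: str, changed_files: list[str]) -> bool:
    """Wrong-file check: the change must touch the file the analyzer named."""
    return named_file in changed_files


def duplicated_lines(before: str, after: str) -> list[str]:
    """Crude repetition check: added lines that already existed verbatim."""
    old = [line.strip() for line in before.splitlines() if line.strip()]
    new = [line.strip() for line in after.splitlines() if line.strip()]
    added = Counter(new) - Counter(old)  # lines that appear more often in `after`
    existing = set(old)
    return sorted(line for line in added if line in existing)


before = "Use alt text on images.\nSet width and height.\n"
after = before + "Checklist:\nSet width and height.\n"

print(duplicated_lines(before, after))
# ['Set width and height.']
print(landed_in_named_file("references/configuration.md", ["SKILL.md"]))
# False
```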

For the full per-step interface contracts (input/output, exact frontmatter, dispatch templates, edge cases), see docs/skill-optimizer-v1.4-spec.md on feat/skill-optimizer-v1.4.

Concerns addressed

Concern 1: "Too big a jump. Why not the minimal fix of adding Anthropic guidelines to the existing prompt?"

Worth taking seriously, and the suggestion has real merit on its own terms. Anthropic's skill-authoring guidance is genuinely better than the existing recipe library — it is higher-level, less prescriptive, and avoids teaching specific structural transformations (like Recipe A's two-pass workflow) that the optimizer then applies mechanically. A less prescriptive prompt would likely reduce one form of drift: the optimizer would not be directly nudged toward the specific transformations that produced drafts #5 and #6.

The honest trade-off is the other direction: less specific guidance also means the optimizer sometimes cannot identify what to fix, because no rule names the failure pattern explicitly. Some changes the current recipes produce reliably (even when wrong) might not get produced at all under a higher-level prompt — useful and unhelpful outputs alike.

But neither direction addresses the root cause. The architectural failure mode — the same context designing the test AND proposing the fix, with no independent check — produces ducktape in whatever form the prompt admits. Less specific guidance would change the shape of the ducktape; it would not prevent it. The smoking-gun evidence above (> 800 lines of guidance, three rejected drafts across three targets, each in a different shape) supports the conclusion that the failure is architectural, not prompt-shaped. CodeDelegator (cited above) identifies the same failure mode and validates the architectural response.

Concern 2: "Why agent-loop → human-prompted? Different way of operation."

This framing misreads the change. The core architectural change is two things:

  1. Task decomposition (one-skill-does-everything → per-step skills with single responsibilities)
  2. Context constraint via isolated subagents (every generative step in a fresh context that sees only its slice)

Human-in-loop is a UX layer, not the architecture. The new architecture includes an auto-pilot driver skill that walks the chain end-to-end with default policies for the user-gate points (PR-intent question, test-subset picks, validator-rejected verdict). The auto-pilot automates the loop — every step inside the loop is decomposed and context-constrained, but no human is required.

The disagreement should be about whether decomposition + context isolation is justified, not about whether humans should be in the loop.

Concern 3: "Too much time needed?"

Each old-pipeline pilot run takes 1–2 hours of model time plus operator review and frequently returns uplift-too-small or null findings (as happened with drafts #3 agent-browser and #4 supabase). Multiple pilot runs are typical before any one finding lands.

The proposed alternative — run the old pipeline again with Anthropic guidelines swapped in for the current recipes — is an additive change that doesn't address the architectural problem. The smoking-gun evidence above (the old prompt had detailed recipes and still produced ducktape across three targets) already supports the conclusion. Another old-pipeline run would not produce different evidence; it would produce the same ducktape in different shapes. That run is wasted time.

The new architecture's implementation is roughly half a day to one full day of one-time work, amortized across every future skill improvement run. The evidence-to-fix-time ratio strongly favors the architectural fix over additional pilot trials.

References

New architecture (on feat/skill-optimizer-v1.4)

  • Spec: docs/skill-optimizer-v1.4-spec.md
  • Plan: docs/skill-optimizer-v1.4-plan.md
  • Skill-writing philosophy: docs/skill-writing-philosophy.md
  • Iteration protocol: skills/skill-optimizer-shared/iteration-protocol.md

Old architecture (on feat/auto-improve-skill-v1.3)

  • Orchestrator entry: skills/auto-improve-orchestrator/SKILL.md
  • Orchestrator prompt: skills/auto-improve-orchestrator/prompts/orchestrator.md
  • Optimizer step prompt: skills/auto-improve-orchestrator/prompts/skill-iterate.md
  • Recipe library: skills/auto-improve-orchestrator/references/lessons.md
  • Per-upstream PR conventions: skills/auto-improve-orchestrator/references/contexts/

PR drafts (on docs/upstream-pr-drafts-only)

  • PR #51 — the team review that surfaced the critiques
  • docs/pilot-runs/upstream-pr-drafts/1-vercel-labs-web-interface-guidelines.md
  • docs/pilot-runs/upstream-pr-drafts/5-google-labs-code-stitch-skills-shadcn-ui.md
  • docs/pilot-runs/upstream-pr-drafts/6-firebase-agent-skills-hosting-basics.md

External research

Yuqing Zhai added 6 commits May 19, 2026 11:50
Sections + intent notes only, no prose yet. Created on a separate
branch (docs/v1.4-rationale) based on development so the proposal
can be reviewed independently of the v1.4 implementation work
happening on feat/skill-optimizer-v1.4.

Scope at landing time will cover (in this order):
- summary of the change
- problem statement (observed failure modes)
- evidence: ad-hoc rule insertion + coach-and-player context
  pollution + monolithic-skill anti-pattern
- why v1.4 addresses the evidence: decomposition, limited-context
  dispatch, human-in-loop gates, independent validator
- alternatives considered + rejected
- honest scope limits (auto-pilot quality, bootstrapping limit,
  implicit-skill cases)
- open questions for reviewers
- references to spec / plan / philosophy on the impl branch

To be filled in during discussion with the operator.
…ns from skeleton

Streamlining per operator direction — the limits section and the
review-open-questions section aren't load-bearing for the
architecture-pivot defense. They can come back if a reviewer asks
for explicit scope or asks for open questions; for now the doc
covers summary → problem → evidence → why-v1.4-addresses →
alternatives → references.
…ents

Filled in the skeleton with substantive prose for review with Yi.

Sections:

- Summary + problem statement: framed as "v1.3 produces ducktape
  systematically; PR #51's three drafts are evidence, not one-offs."

- Evidence: four subsections.
  - The empirical case (PR #51 drafts #1, #5, #6 and their distinct
    ducktape shapes: repetition / structural drift / wrong-file).
  - The smoking gun (v1.3 had ~800 lines of optimizer guidance
    including named recipes A/C/D, and the rejected drafts ARE the
    outputs of those exact recipes — adding more rules doesn't help).
  - The structural diagnosis (coach-and-player context pollution,
    backed by CodeDelegator paper which validates v1.4's exact
    architectural response; Kemple's "2% misalignment → 40% failure
    rate" empirical finding; reward-hacking generalization research).
  - The architectural diagnosis (monolithic skill violates Anthropic's
    own context-engineering guidance; "lost in the middle" compounds
    the 800-line prompt depth).

- Why v1.4 addresses: decomposition, limited-context dispatch,
  independent validator, anti-ducktape gate in analyzer's report
  format. Each mechanism mapped to a concrete failure it prevents.

- Counter-arguments addressed: the three concerns from the meeting
  (too big a jump, agent-loop vs human-prompted framing, too much
  time). Concern 3 specifically rejects "run v1.3 again with
  Anthropic guidelines" as an additive change that doesn't address
  the architecture, with evidence already sufficient.

- Alternatives considered: post-hoc validator only, Anthropic
  guidelines substitution, stricter prompting. All rejected with
  reasoning tied to the evidence.

- References: v1.4 work, v1.3 implementation paths, PR drafts,
  external research (CodeDelegator, Kemple, reward-hacking,
  Anthropic context-engineering, Arbiter).
…chitecture-addresses section

Five edits per operator feedback:

1. Removed "Alternatives considered" — the Concerns Addressed
   section already covers the same alternatives (Anthropic-guidelines
   substitution, stricter prompting).

2. Renamed "Counter-arguments addressed" → "Concerns addressed".
   Less adversarial framing; more accurate to what the section
   does.

3. Compressed "Why v1.4 addresses the evidence" → "How the new
   architecture addresses the evidence". Replaced four prose
   subsections (decomposition, limited-context dispatch, validator,
   anti-ducktape gate) with one table mapping each Evidence-section
   failure to the structural response. The point is to show the
   architecture addresses the problem, not to enumerate each
   mechanism. For implementation detail, the doc points to the
   spec.

4. Replaced "v1.3" / "v1.4" with "old architecture" / "new
   architecture" throughout the prose. Branch names and file paths
   keep their literal version identifiers (those are technical
   artifacts that need exact references). Added a Terminology note
   in the header so a business-side reviewer maps the two terms to
   the right branches if they want to dig deeper.

5. Tightened the literature references in the body — replaced
   inline-quoted passages with one-sentence claims and bare links.
   The full reference list at the bottom is unchanged.

Net: 366 → ~330 lines, with the substantive content reorganized
rather than just trimmed.
…strengths before rejecting

Per operator feedback: the original counter dismissed the Anthropic-
guidelines alternative too quickly. The honest framing concedes
three things:

1. Anthropic's guidance is genuinely better than the existing recipe
   library — higher-level, less prescriptive, doesn't teach specific
   structural transformations.
2. That higher-level approach would likely reduce one kind of drift
   (the optimizer wouldn't be nudged toward Recipe-A-style
   transformations).
3. The trade-off cuts the other way too: less specific guidance
   means the optimizer sometimes can't identify what to fix, so
   useful outputs go missing alongside the unhelpful ones.

Then the core argument: even with the better guidance, the
architectural failure mode (same context designing test AND fix,
no independent check) produces ducktape in whatever form the prompt
admits. Less specific guidance changes the shape; doesn't prevent
the failure.

Closes with a constructive note: Anthropic guidelines are worth
folding into the new architecture's per-step prompts as a
complement to decomposition + isolation, not as a substitute.
…oncern 1

The line promised Anthropic guidelines as a 'worthwhile parallel
improvement to the per-step prompts' — but we're already going to
incorporate Anthropic-style guidance into the new architecture's
per-step prompts when authoring them. Promising it as a future
parallel-track improvement reads as a hedge or as something separate
from the main proposal. Cleaner to end the concern at the
CodeDelegator citation, which is the load-bearing close: external
research validates the architectural response.

Copilot AI left a comment

Pull request overview

Adds a single design rationale document, docs/v1.4-rationale.md, arguing for an architectural pivot in the skill-optimizer from a monolithic single-orchestrator pipeline (v1.3) to a decomposed, context-isolated chain of subagent skills (v1.4). The doc presents empirical evidence (three rejected drafts from PR #51), a structural diagnosis (coach-and-player context pollution), and addresses anticipated reviewer concerns.

Changes:

  • New RFC document describing why prompt-tuning the existing orchestrator is insufficient and architectural decomposition is needed.
  • Maps each observed failure mode in PR #51 drafts to a specific structural response in the proposed v1.4 design.
  • Provides references to the new/old architecture branches and external research backing the approach.


Comment thread docs/v1.4-rationale.md Outdated
plus an auto-pilot driver, with each generative step dispatching an
isolated subagent under strict limited context. The pivot is driven
by empirical evidence that the old architecture's outputs are
systematically ducktape — not by architectural preference.
Yuqing Zhai added 2 commits May 19, 2026 13:46
…ndering

Hard line wraps at ~72 chars looked narrow when GitHub renders the
PR description on a wide screen — the column ends up filling only
about half the available width. Reflowed so each paragraph is a
single line; bullets, tables, code blocks, and headings unchanged.
Renderers (GitHub, IDE preview, markdownlint) treat soft wraps
consistently this way.
Yi accepted the rationale but asked for more detail on the seven
steps (the prior compression to a single failure→response table left
him without enough context on what each step actually does).

Replaced the single failure→response table with a two-part section:

1. "The seven chain steps" — a per-step table showing each step's
   job + the subagent isolation that distinguishes it. Covers steps
   1-7 plus a paragraph on the 8th auto-pilot driver and the
   OPERATOR_DIRECTIVES slot.

2. "How decomposition + isolation address the failures" — kept the
   original failure→response table, with the responses now referring
   back to the specific steps explained above (e.g., "the validator
   at step 7", "the analyzer at step 6").

The doc now answers "what are the 7 steps and what does each do"
before "how does the chain address the failures", which matches
Yi's reading order and the natural narrative.

For full interface contracts the doc still points at the spec; the
goal here is to give a reader enough to follow the architecture
without forcing them into the implementation detail.