From dbd6a11eb33f140b4cd7a6096a5f90730ae2796d Mon Sep 17 00:00:00 2001 From: wbj010101 Date: Mon, 11 May 2026 20:41:16 +0800 Subject: [PATCH 01/16] docs(catalogue): codify F19 (install-test) + F20 (constitution-vs-workflow) + F21 (identity overload) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Three new failure modes promoted from CANDIDATE to confirmed entries: F19 — Public-facing onboarding text written but never independently install-tested. Founding evidence: Cobrust v0.1.0 (M10 hallucinated SHAs, 13/14 CI red), v0.1.1 (cargo install + curl URL 404), v0.1.2 (release-readiness audit catching -fsSL flag friction, back-ported to v0.1.1 — first validated closure cycle BLOCK → fix → GO). F20 — Constitution mandate written but workflow never aligned. Founding evidence: CLAUDE.md §6 test-first declared, fact-violated for 9 days (P7 wrote impl+test same commit). Owner-spotted 2026-05-11. Resolution: D0-D5 difficulty matrix + dev/test pair workflow codified same-day into cto_operations_runbook.md + ADSD dispatch-prompt-p9.md template + memory. Validation: W2 sprint executed with TDD step 1 / dev step ordering in commit log (commits ca4c37c → 2eb4fca + d337cf0 → 0145e8b). F21 — Cross-session AI agent identity overload. Founding evidence: Cobrust 2026-05-11 — claude-desktop drafted Cobrust Studio handoff signing "— review-claude" while a separate Claude Code session (4bb35f43) was concurrently active under the same handle. Audit trail collapsed across sessions. Recovery: session-ID attribution convention adopted (e.g. "review-claude (session 4bb35f43, 2026-05-11)"). All three are F1 Sediment Family sub-forms — declared-without-enforcement applied to install-tests / constitution-mandates / agent-handles. 
Co-Authored-By: Claude Opus 4.7 (1M context) --- .../reference/failure-modes-catalogue.md | 213 ++++++++++++++++++ 1 file changed, 213 insertions(+) diff --git a/plugins/adsd/skills/agent-driven-development/reference/failure-modes-catalogue.md b/plugins/adsd/skills/agent-driven-development/reference/failure-modes-catalogue.md index 9909285..e3f2c9a 100644 --- a/plugins/adsd/skills/agent-driven-development/reference/failure-modes-catalogue.md +++ b/plugins/adsd/skills/agent-driven-development/reference/failure-modes-catalogue.md @@ -1065,6 +1065,219 @@ When declaring attribution/ownership policies: --- +## F19 — Public-facing onboarding text written but never independently install-tested (F1 Sediment Family, install-test sub-form) + +> **F1 sub-form, confirmed**. Same family as F1.1 (declared invariants without enforcement). The text claims an install path; the install path is never executed in a clean shell by the writer. F19 is high-blast-radius: failures land on every new user's first impression, not on internal dev productivity. + +### Definition + +A release artifact (README quickstart, release notes, GitHub Release body, install script, `cargo install` command, `curl -L` URL) ships to public users without anyone running the documented commands in a clean shell before publish. The text passes review by being read; it fails reality by being run. 
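One way to make "run, don't read" concrete is to pull the documented commands out of the text mechanically, so the checker executes exactly what the doc says. The following is a minimal sketch, assuming install commands live in fenced blocks tagged `bash`, `sh`, or `shell`; the function name `extract_shell_commands` and the sample README are hypothetical and not part of ADSD.

```python
import re

# Build the fence marker programmatically so this example nests cleanly in Markdown.
FENCE = "`" * 3

# Matches fenced blocks tagged bash/sh/shell; the tag set is an assumption.
FENCE_RE = re.compile(FENCE + r"(?:bash|sh|shell)\n(.*?)" + FENCE, re.DOTALL)


def extract_shell_commands(markdown: str) -> list[str]:
    """Return every non-empty, non-comment line found in shell code fences."""
    commands = []
    for block in FENCE_RE.findall(markdown):
        for line in block.splitlines():
            stripped = line.strip()
            if stripped and not stripped.startswith("#"):
                # Drop a leading "$ " prompt so the command can be run verbatim.
                commands.append(stripped.removeprefix("$ "))
    return commands


# Hypothetical README fragment used only for illustration.
SAMPLE_README = "\n".join(
    [
        "## Quickstart",
        "",
        FENCE + "bash",
        "# install from source",
        "$ cargo install foo-cli",
        "foo --version",
        FENCE,
        "",
    ]
)

if __name__ == "__main__":
    for command in extract_shell_commands(SAMPLE_README):
        print(command)
```

The extracted list is what a clean-shell checker would execute one command at a time; anything the extractor misses (commands in prose, untagged fences) still needs a human or agent pass.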
+ +### Symptoms + +- `README.md` says `cargo install foo-cli` but the package is not on crates.io (path-deps not published) → user gets `error: could not find foo-cli in registry` +- Release notes list `cobrust-v0.1.1-x86_64-apple-darwin.tar.gz` but `release.yml` never built that target → user gets HTTP 404 +- README's dynamic URL builder via `$(uname -sm | tr ' ' '-')` builds `Darwin-arm64.tar.gz` but the actual asset is named `aarch64-apple-darwin` → 404 on Apple Silicon Macs +- GitHub Release body references action SHAs that don't resolve (typo'd / hallucinated commit hash) → CI red on the release branch +- Quick-start uses `curl -L ... | tar xz`, but plain `-L` lacks `-f`, so an HTTP error body can be piped into `tar` while curl still exits 0 → empty extraction, opaque error + +### Root cause + +Two compounding patterns: + +1. **Author writes from intent**: "this is how install should work" reflects the writer's mental model of the asset naming convention or registry state, not the actual file on disk / artifact in release. +2. **No clean-shell verification step**: standard PR review reads the README diff in GitHub UI. Reviewer does not `cd /tmp && bash <(paste commands)`. So the text passes review by being plausible, not by being executable. + +Same structural pattern as F1.1: declaration (install path) + missing verification step (run in clean shell). F17 (self-report fidelity) is the report analog; F19 is the user-onboarding analog. Both are F1 family. + +### Evidence + +Cobrust 24-hour window 2026-05-10 → 2026-05-11, three consecutive instances: + +1. **M10 hallucinated SHA pins (v0.1.0 tag)**: M10 sub-agent SHA-pinned 4 GitHub Actions to fake 40-char hex with confident `# v4.2.2` comments. 13/14 CI jobs failed at action resolution, leaving v0.1.0 tag with red CI for ~4 hours until the user spotted it. Recovery: revert to tag form (`@stable`, `@v4`). +2.
**v0.1.1 install path 404 (release notes)**: release notes listed `cargo install cobrust-cli` (package not on crates.io, path-deps unpublished) and curl URL `cobrust-v0.1.1-x86_64-apple-darwin.tar.gz` (release.yml never built x86_64-apple-darwin). Mei persona audit + Layer 3 review-claude curl test caught both. Recovery: change install command to `cargo install --git ...`, remove non-existent asset URL. +3. **v0.1.2 release-readiness audit (mechanism validated)**: §A.3 dispatched a release-readiness sub-agent before public announcement. Agent ran the documented curl commands in clean shell, surfaced friction (`curl -L` without `-fsSL` left empty extractions on some platforms). Friction was fixed pre-release **and back-ported to v0.1.1's notes** (`4baea69 docs(release): back-port -fsSL curl flag to v0.1.1 release notes`). This is the closure cycle: BLOCK → fix → re-test → GO. **First validated execution of F19's prevention mechanism in the wild.** + +### Rule of thumb + +> **Any public-facing install / quickstart / release command must pass independent execution in a clean shell before publishing.** +> +> Mandatory release-readiness gate: +> ``` +> # In a /tmp/release-test- directory with no env vars from dev box: +> 1. Run each `cargo install` / `curl -L` / install command verbatim from the doc +> 2. For each URL: curl -fsSL -o /tmp/check.tar.gz ; echo "curl exit $?" +> 3. For each command: confirm exit 0 + expected stdout +> 4. Block merge if any command fails +> ``` +> Spawn a dedicated **release-readiness agent** for this, not the same agent that wrote the docs (avoid F1.1 self-attestation pattern). + +### Recovery + +1. **Immediate**: if a 404 install URL is in a published release, edit the release body via `gh release edit --notes-file .md` and force-push a docs commit on main. Tag itself is immutable, but body + asset uploads are not. +2.
**Workspace version**: if `Cargo.toml` workspace.package.version was not bumped before tag, the prebuilt binary will report wrong version → user files bug → confidence damaged. Bump version BEFORE tag in every release SOP. +3. **Backport friction fixes** to prior releases: if you find `curl -L` should have been `curl -fsSL`, back-port via `gh release edit v0.1.1` so old release pages also fix. Don't leave half the user base hitting a known-fixed friction. + +### Prevention going forward + +In every project adopting ADSD: + +1. **Add release-readiness as a tier-0 verification step** in `cto_operations_runbook.md` §"Dispatching a new P9": for any commit touching `README.md` / `docs/releases/*.md` / GitHub Release body / `release.yml`, spawn a P7 sonnet release-readiness agent that runs install commands in a clean shell and reports `[P7-RELEASE-READY-VERDICT] GO / BLOCK`. +2. **Release-readiness agent prompt template** in `templates/dispatch-prompt-p7.md` (release-readiness flavor): the agent's job is to be skeptical of the docs it's auditing, run every command verbatim, paste raw exit codes + sizes as evidence. +3. **CI lint** (stretch): a release-time CI gate that resolves each URL/asset listed in release notes via `gh api` and fails if any returns 404. This is the F1 family enforcement mechanism. + +### Closure: BLOCK → fix → GO cycle as validation + +The validation that F19's mitigation works is itself empirical: v0.1.2's release-readiness audit produced a BLOCK verdict, the friction was fixed (back-port + new asset naming), the next audit returned GO. The system worked. Any project adopting this pattern should expect: first 1-2 releases produce BLOCK verdicts; over time, BLOCKs become rare because the writing convention internalizes the verification step. + +--- + +## F20 — Constitution mandate written but workflow never aligned (F1 Sediment Family, mandate-vs-workflow sub-form) + +> **F1 sub-form, confirmed**. 
A project constitution declares a binding rule ("test-first development", "atomic commits", "no `unwrap()` in non-test code"). The dispatch SOP, daily workflow, and reviewer checklist never align to enforce the rule. The constitution becomes aspirational marketing; the workflow runs unconstrained. + +### Definition + +The project's foundational document (CLAUDE.md, constitution, README §Principles) states a binding development rule. Implementation of that rule requires a corresponding step in the workflow: dispatch prompt template field, CI gate, pre-commit hook, reviewer checklist item. The workflow step is missing or unenforced. Code continues to be written without violating the constitution textually (no one disputes the rule), but the rule is never actually exercised. + +### Symptoms + +- Constitution §"Test-first": "failing test before implementation" — but every sprint's commits show `feat(X): implementation + tests in same commit`. Test-first ordering is impossible to verify from the diff. +- Constitution §"Atomic commits, code + tests + docs same commit" — but findings get added in separate doc-cleanup commits days later. +- Constitution §"No `unwrap()` in non-test code; use `expect("rationale")` instead" — `grep -r 'unwrap()' crates/*/src/` returns N hits, none with rationale. +- Project memory `feedback_subagent_model_tier.md` says "Opus for hard / sonnet for easy / haiku NEVER" — but P9 dispatch prompts consistently use sonnet without difficulty assessment, occasional spawns of haiku for trivial doc rewrites. + +### Root cause + +Two compounding patterns: + +1. **Mandate is text-level, not workflow-level**: writing the rule in CLAUDE.md feels like enforcing it. But the rule exists in agents' context only at session start; after compaction or sub-agent spawn, the rule is not re-asserted. +2. 
**No enforcement scaffold built alongside the rule**: when the constitution is drafted, the corresponding CI lint / dispatch prompt field / commit hook is not built in the same PR. The rule is declared; the enforcement is "we'll add it later." + +This is the meta-pattern of F1 Sediment Family applied to the project's own ground rules. Every other F1 sub-form (F1.0 schema invariants, F1.1 declared without CI, F16 identity preamble in skill not memory, F17 self-report fidelity, F18 attribution scope, F19 install commands) is an instance of this F20 meta-pattern: declared without enforcement at the right layer. + +### Evidence + +Cobrust 9-day pre-2026-05-11 period: CLAUDE.md §6 stated "Test-first for compiler internals: failing test before implementation." Every P9 sprint from M3 through M12 used a single P7 sonnet agent writing impl + tests in the same commit. No commit log shows tests committed before impl. The constitution mandate was violated in practice for 9 consecutive days without anyone (including review-claude) spotting it. + +Discovery: 2026-05-11, the project owner raised the concern (translated from Chinese) "It's not great that the CTO only manages development and not testing; every developer under the CTO should be paired with a sonnet tester." The constitution gap was owner-spotted, not agent-spotted. Review-claude's analysis (this catalogue's parallel session) confirmed: CLAUDE.md §6 mandate without dispatch-prompt workflow alignment = F20 instance.
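The observation that no commit log shows tests landing before implementation can be checked mechanically rather than by eye. The following is a minimal sketch, assuming commit subjects are supplied oldest-first and follow a conventional `type(scope): ...` prefix in which test-corpus commits start with `test` or `tests`; the function name `tdd_order_violations` and the sample subjects are hypothetical.

```python
def tdd_order_violations(subjects_oldest_first: list[str]) -> list[str]:
    """Return feat commits that were not preceded by a fresh test commit."""
    violations = []
    # True once a test commit has landed and no feat commit has consumed it yet.
    test_pending = False
    for subject in subjects_oldest_first:
        if subject.startswith(("tests(", "test(", "tests:", "test:")):
            test_pending = True
        elif subject.startswith("feat"):
            if not test_pending:
                violations.append(subject)
            # Each feat commit consumes the pending test corpus.
            test_pending = False
    return violations


# Hypothetical histories: one paired (test, then feat), one with impl + tests combined.
PAIRED = [
    "tests(parser): failing corpus",
    "feat(parser): implement",
    "tests(cli): failing corpus",
    "feat(cli): implement",
]
COMBINED = [
    "feat(parser): implementation + tests",
    "feat(cli): implementation + tests",
]

if __name__ == "__main__":
    print(tdd_order_violations(PAIRED))
    print(tdd_order_violations(COMBINED))
```

In practice the subject list could come from something like `git log --reverse --format=%s` over the sprint's commit range; the check only inspects ordering, so it cannot tell whether the test commit actually failed before the implementation landed.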
+ +Resolution: 2026-05-11 same-day codification of D0-D5 difficulty matrix + mandatory dev/test pair workflow (separate test agent + dev agent, test-first ordering, P9 reviews corpus between) into: + +- `feedback_subagent_model_tier.md` §"Extension 2026-05-11" (memory enforcement) +- `cto_operations_runbook.md` §"Dispatching a new P9" + §"Dev/test pair pattern" (SOP enforcement) +- ADSD `templates/dispatch-prompt-p9.md` (template enforcement) + +Validation: Cobrust W2 sprint (the first sprint after codification) executed with TDD ordering visible in commit log: + +``` +ca4c37c tests(adr-0044): W2 Phase 2 failing test corpus per ADR-0044 (TDD step 1) +2eb4fca feat(stdlib+codegen+cli+types): wire source-level input/read_line/argv per ADR-0044 W2 Phase 2 (TDD dev step) + +d337cf0 tests(adr-0044): W2 Phase 3 LeetCode oracle-match corpus (TDD step 1) +0145e8b feat(examples): W2 Phase 3 — 10 LeetCode .cb programs (TDD dev step, ADR-0044 stdin/argv usage) +``` + +The TDD step 1 commits land before TDD dev step commits in temporal order. **First executed test-first sprint after 9 days of constitution mandate.** F20 is closed for Cobrust via execution evidence, not just documentation. + +### Rule of thumb + +> **Every binding constitution rule must have a paired enforcement step in the same PR that introduces it.** +> +> Enforcement layers in ascending strength: +> 1. Mandate appears in dispatch prompt template (workflow text) +> 2. Mandate appears in auto-loaded project memory (survives compaction) +> 3. Mandate has a CI lint / commit hook / pre-commit check +> 4. Mandate is enforced by the tool itself (e.g. `cobrust build` rejects code with `unwrap()`) +> +> Aim for layer 3+ on critical rules. Layer 1 alone = F20 instance waiting to happen. + +### Recovery + +When discovering an F20 instance: + +1. **Locate the mandate text**: which paragraph in which doc? +2. **Identify the workflow gap**: which dispatch prompt template / SOP / CI file should enforce this? +3.
**Add the enforcement in the next PR**, not "later". Same-day codification is the minimum. +4. **Backfill validation**: after enforcement is added, run one sprint that exercises the enforced path; verify the enforcement actually fires (e.g. CI rejects bad commit). + +### Prevention going forward + +In every new constitution / CLAUDE.md / project rules document: + +1. After each rule, add a `**Enforced by**: ` line. +2. If `Enforced by: N/A — aspirational` appears, flag for future codification or downgrade the rule to "guidance" rather than "mandate". +3. Periodic constitution audit (quarterly): grep every mandate, verify each has a working enforcement mechanism. + +This is itself a meta-application of ADSD: the project's own development discipline must be ADSD-managed. + +--- + +## F21 — Cross-session AI agent identity overload (F1 Sediment Family, identity-namespace sub-form) + +> **F1 sub-form, confirmed**. A symbolic agent handle ("review-claude", "the CTO", "studio-reviewer") is used across multiple distinct AI sessions/contexts as if it were a stable identity. Audit trail becomes ambiguous: claims attributed to handle X may originate from session A, B, or C, each with different context and authority. + +### Definition + +A natural-language handle is adopted as the de-facto name for a role (audit reviewer, CTO, tech lead). Multiple distinct AI sessions assume the same handle when fulfilling that role at different times or in parallel. Cross-session artifacts (documents, findings, commit messages) attribute work to "review-claude" without disambiguating which session. Future readers cannot distinguish whether a claim came from a session with deep project context vs. a fresh session with shallow context. 
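One way to make this ambiguity detectable is to require every signature to carry a session qualifier and to flag the ones that do not. The following is a minimal sketch, assuming signatures are written on their own line as a dash, a handle, and optionally `(session <hex id>, <date>)`; that format, the regexes, and the function name `bare_signatures` are illustrative assumptions rather than an established ADSD tool.

```python
import re

# A signature line: an em dash (U+2014), a handle, then anything else on the line.
SIGNATURE_RE = re.compile(
    r"^\u2014[ \t]*(?P<handle>[\w-]+)(?P<rest>.*)$", re.MULTILINE
)
# A session qualifier such as "(session 4bb35f43"; the hex-ID shape is an assumption.
SESSION_RE = re.compile(r"\(session\s+[0-9a-f]{6,}")


def bare_signatures(text: str) -> list[str]:
    """Return signature lines that name a handle but carry no session ID."""
    bare = []
    for match in SIGNATURE_RE.finditer(text):
        if not SESSION_RE.search(match.group("rest")):
            bare.append(match.group(0).strip())
    return bare


# Hypothetical document with one bare and one session-stamped signature.
SAMPLE_DOC = (
    "Analysis A.\n"
    "\u2014 review-claude, 2026-05-11\n"
    "\n"
    "Analysis B.\n"
    "\u2014 review-claude (session 4bb35f43, 2026-05-11)\n"
)

if __name__ == "__main__":
    for line in bare_signatures(SAMPLE_DOC):
        print(line)
```

A check like this only detects missing qualifiers; it cannot recover which session wrote an already-ambiguous artifact.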
+ +### Symptoms + +- Document signed "— review-claude, 2026-05-11" appears in a directory; another document also signed "— review-claude, 2026-05-11" appears with conflicting analysis +- A handoff doc claims "review-claude audited the project across 7+ review rounds" — but the actual author was a different session that synthesized the prior rounds from transcripts, not the session that performed them +- Commit message says `Co-Authored-By: review-claude` — git log cannot distinguish which session +- Project memory references "review-claude" as if it were a single persistent agent, when in practice it's been multiple sessions with different context depths + +### Root cause + +Three compounding patterns: + +1. **Symbolic-handle reuse**: humans naming AI roles (review-claude, CTO, tech-lead-p9) creates an implicit identity. Distinct sessions, when assigned that role, adopt the handle as their own. +2. **No session-ID attribution in artifacts**: documents/commits sign with the handle, not with `handle (session XYZ)` or `handle (timestamp)`. Audit trail collapses across sessions. +3. **Cross-session learning illusion**: readers assume "review-claude knows" things from prior sessions because the handle is consistent. But each session has fresh context unless explicitly fed prior artifacts. + +This is F1 family because: the role is declared (review-claude is the auditor), the identity is not enforced (no scheme to distinguish session-A's review-claude from session-B's). Audit attribution drifts. + +### Evidence + +Cobrust 2026-05-11 evening: project owner asked claude-desktop to draft a Cobrust Studio handoff. Claude-desktop drafted ~2,800-line document signing it "— review-claude, 2026-05-11". A separate Claude Code session (the parallel one auditing Cobrust live, session ID `4bb35f43...`) was also active that day and had been signing its own artifacts "review-claude". 
The Studio handoff was claimed to be "synthesized from a multi-turn external review-claude session" — but the original session that performed those reviews did not write the handoff; claude-desktop did, citing the parallel session's prior work. + +Result: future readers of the Studio handoff cannot tell which review-claude session authored each claim, when, with what context. The handle "review-claude" became identity-overloaded between at least 2 concurrent sessions on the same day. + +Recovery in same session: appended §0.5.1 "Identity hygiene (F21)" + §12.8 "When in doubt, ask the parallel review-claude session" to the Studio handoff, prescribing session-ID-stamped attribution going forward. + +### Rule of thumb + +> **Symbolic AI role handles must carry session-ID or timestamp attribution in any persistent artifact.** +> +> Naming convention: +> - In documents: `— review-claude (session 4bb35f43, 2026-05-11)` +> - In commits: `Co-Authored-By: review-claude-session-4bb35f43 ` +> - In findings: frontmatter `discovered_by: review-claude (session 4bb35f43)` +> - Reserve plain "review-claude" for the abstract role; never use it bare in attribution. +> +> Stronger: when spawning a new internal review agent, give it a distinct handle (e.g. `studio-reviewer-001`) rather than reusing "review-claude". Reserve "review-claude" for the originating external audit window. + +### Recovery + +When discovering an F21 instance in existing artifacts: + +1. Audit document signatures: identify which actually came from which session. +2. Where ambiguous: leave the original signature, append `(provenance: see commit for session metadata)`. +3. Going forward, prefix new artifacts with explicit session ID. + +### Prevention going forward + +In every ADSD project: + +1. At the start of a session that will produce persistent artifacts, declare the session ID. Stamp every commit / finding / ADR with that ID. +2. Distinct roles get distinct handles. 
"review-claude" is the role; "review-claude (session 4bb35f43)" is the agent instance. Documents reference the latter. +3. If multiple sessions of the same role are concurrent: choose disambiguating suffixes (`review-claude-A`, `review-claude-B`, or session-ID). + +This convention applies to any AI agent role that produces persistent artifacts in a multi-session project. The cost is one extra string per signature; the benefit is unambiguous audit trail forever. + +--- + ## Catalogue maintenance This catalogue is alive — add to it as you encounter new failure modes. From 5c7c2a2f5c16ed4704fde5bf862580c6d7b25946 Mon Sep 17 00:00:00 2001 From: wbj010101 Date: Tue, 12 May 2026 00:33:46 +0800 Subject: [PATCH 02/16] feat(v1.2.0): add 5 cross-pollination references from Anthropic + OpenAI guidance MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Distills 5 of 12 v1.2.0 gap candidates identified by review-claude self-audit. Adopts established industry practices into ADSD discipline: 1. reference/evals-first-development.md — Anthropic "evals are the moat" applied to ADSD. 6th gate (eval delta non-regression) closes F20 systemically. Highest-leverage v1.2.0 addition. 2. reference/context-window-strategy.md — Positive practices for long agent sessions, complementing F16 negative form. Three-tier model (persistent/session-scoped/transient) + bootstrap-from-cold template. 3. reference/cross-session-memory-architecture.md — Four-layer storage model + decision tree for "where does this go?". Codifies ADSD's memory file discipline that was previously implicit. 4. reference/prompt-engineering-patterns.md — 9 patterns (P1-P5 core, PT1-PT9 specific) from Anthropic + OpenAI prompt guides, adapted to ADSD sub-agent dispatch. Cross-references F13/F17/F19/F21. 5. 
reference/cost-monitoring-discipline.md — Three-tier budget model (per-sprint/per-release/per-project) + cost as diagnostic signal + Anthropic caching + OpenAI structured outputs as cost levers. SKILL.md cross-references section updated to include all 5 with explicit Anthropic+OpenAI provenance + remaining-7-gaps notice for v1.3.0 planning. Remaining v1.3.0 candidates: skills architecture, agent specialization roles, HITL decision tree, RCA/post-mortem template, MCP integration patterns, calibrated confidence, structured-output schema enforcement. Co-Authored-By: Claude Opus 4.7 (1M context) --- .../reference/context-window-strategy.md | 200 +++++++++++++ .../reference/cost-monitoring-discipline.md | 247 +++++++++++++++ .../cross-session-memory-architecture.md | 227 ++++++++++++++ .../reference/evals-first-development.md | 212 +++++++++++++ .../reference/prompt-engineering-patterns.md | 280 ++++++++++++++++++ 5 files changed, 1166 insertions(+) create mode 100644 plugins/adsd/skills/agent-driven-development/reference/context-window-strategy.md create mode 100644 plugins/adsd/skills/agent-driven-development/reference/cost-monitoring-discipline.md create mode 100644 plugins/adsd/skills/agent-driven-development/reference/cross-session-memory-architecture.md create mode 100644 plugins/adsd/skills/agent-driven-development/reference/evals-first-development.md create mode 100644 plugins/adsd/skills/agent-driven-development/reference/prompt-engineering-patterns.md diff --git a/plugins/adsd/skills/agent-driven-development/reference/context-window-strategy.md b/plugins/adsd/skills/agent-driven-development/reference/context-window-strategy.md new file mode 100644 index 0000000..396df30 --- /dev/null +++ b/plugins/adsd/skills/agent-driven-development/reference/context-window-strategy.md @@ -0,0 +1,200 @@ +--- +name: Context-window strategy for long agent sessions +description: Positive practices for organizing a multi-hour / multi-week agent session so the context stays useful 
across compaction events. Complements F16 (post-compaction identity drift) by codifying what should be in memory, what should be re-derivable, what's transient. +type: reference +version: 1.0.0 +date: 2026-05-12 +status: active +relates_to: [skill:SKILL.md §"Snapshot discipline", reference:failure-modes-catalogue.md F16, reference:cross-session-memory-architecture.md] +--- + +# Context-window strategy + +> Long sessions degrade. Compaction is automatic but lossy. The agent that survives the compaction is the one whose **identity, current state, and operative rules are in the persistent layer**, not the transcript. This reference codifies what goes where. + +## When this applies + +- Any session expected to run > 50K tokens or > 4 hours wall-clock +- Any agent role that should survive context compaction (CTO, P9 tech lead, review-claude) +- Any project where memory files / snapshot files / handoff docs exist + +If you're a one-shot agent (< 5 tool calls, single deliverable), this reference is overkill — just execute and return. + +## Three-tier model + +Adopt three explicit context tiers. 
Every piece of information lives in exactly one tier: + +``` +┌────────────────────────────────────────────────────────────────────────┐ +│ TIER 1: Persistent (auto-memory, repo, version control) │ +│ Survives compaction + session restart + machine change │ +│ - Identity preamble (you are P9 / CTO / review-claude) │ +│ - Operative rules (D-matrix, dev/test pair, F1-Fxx awareness) │ +│ - Project snapshot (HEAD, ADR roster, finding ledger) │ +│ - Cross-references (memory file → other memory files) │ +├────────────────────────────────────────────────────────────────────────┤ +│ TIER 2: Session-scoped (this conversation's context) │ +│ Survives within session but not across sessions │ +│ - Current sprint's working state (which Tx in progress) │ +│ - Files read this session (don't re-read what you already have) │ +│ - Decisions made this turn (will go to ADR or memory at end of sprint) │ +├────────────────────────────────────────────────────────────────────────┤ +│ TIER 3: Transient (one tool call at a time) │ +│ Doesn't need to persist; if needed again, re-fetch │ +│ - Bash output of intermediate verification │ +│ - Read-tool output for files not central to decision │ +│ - Search results, grep outputs │ +└────────────────────────────────────────────────────────────────────────┘ +``` + +The discipline: **Tier 1 must be sufficient to bootstrap a fresh session.** A new agent reading only Tier 1 + the current user prompt should be able to make a correct decision about what to do next. + +## Anthropic-pattern adoption + +### "If you can't answer 'what's my role this session?' in 1 sentence: you've drifted" + +From Claude Code docs on subagents: identity must be re-asserted at compaction boundaries. ADSD encodes this in `feedback_p10_post_compaction_identity_recovery.md` (F16 mitigation). 
+ +Concrete check: when you receive a message and the prior turn was > 30 turns ago, ask yourself the three questions in `feedback_p10_post_compaction_identity_recovery.md §"Self-check trigger"` before acting. + +### Memory file is read on every session start (auto-load) + +Anthropic Claude Code auto-loads `MEMORY.md` index every session. Use this: + +- `MEMORY.md` = the table of contents (one-line per memory file with hook) +- Each memory file = self-contained chapter +- Read order matters: put the most critical file at line 1 (e.g. identity recovery) + +ADSD example: Cobrust's `MEMORY.md` has 14 entries, identity-recovery first, snapshot second, runbook third. New session in Cobrust dir reads index, knows where to look. + +### Skill description is the trigger + +Anthropic skills auto-activate when description keywords match user prompt. So: + +- `description` field = precise + keyword-rich + scoped (NOT generic) +- A skill named "agent dispatch" with description "general agent stuff" won't trigger usefully +- A skill named "agent dispatch" with description "multi-agent dispatch planning, P9 tech lead role, dev/test pair pattern" triggers on the right turns + +Keep skill descriptions tight (~30 words), keyword-dense. + +## OpenAI-pattern adoption + +### Conversation summary turn (Anthropic also uses this) + +When approaching context limit, take a deliberate "summary turn": + +- List of decisions made this session +- Files modified this session +- Open questions +- Next action + +This synthetic message becomes the bootstrap for compaction. Better than letting the system auto-compact a random middle chunk. + +Mechanism: just write a paragraph or YAML block titled "## Session checkpoint " with the structure above. + +### Cache-friendly ordering (cost optimization) + +OpenAI + Anthropic both cache prefix tokens.
Strategy: + +- Put unchanging context (system prompt, project preamble, tool definitions) FIRST +- Put changing context (recent messages, current task) LAST +- Don't shuffle the order; let cache hit + +For ADSD: the auto-loaded memory + skill content sits at session start → cached. New user prompts append → small delta. This is already the right shape. + +### Don't re-read files you've already read + +OpenAI guidance: assume tool result outputs stay in context for the rest of the session. Don't `Read` the same file twice unless you wrote to it. + +ADSD anti-pattern: re-reading SKILL.md or constitution every turn out of nervous-habit wastes context. Trust the agent's memory of recent reads. + +## ADSD integration with existing patterns + +### Snapshot.md as Tier 1 checkpoint + +ADSD's `project_state_snapshot.md` is the canonical Tier 1 checkpoint. It contains: + +- HEAD SHA +- ADR roster +- Finding ledger +- Phase F milestones +- Binary verification claim + +A fresh session reads snapshot.md and bootstraps situational awareness in ~200 lines. Don't replicate this in transient context. + +### Handoff cover-letter as Tier 1 cross-session + +When ending a sprint, write a handoff cover-letter (template in `templates/handoff-cover-letter.md`) that becomes the bootstrap for the receiving session. Don't rely on transcript transfer. + +### F16 mitigation: identity recovery preamble + +Identity is Tier 1. The skill description triggers; the memory file confirms; the operative rules guide. If identity drifts post-compaction → re-read the identity recovery memory. + +### Long-session bookkeeping rhythm + +Every ~30 tool calls or hourly (whichever first), explicitly: + +1. Update snapshot.md with latest HEAD + new ADRs/findings +2. Commit any in-progress work (don't let it rot in working tree) +3. Write a session-checkpoint paragraph (per OpenAI pattern above) +4. 
Run snapshot-lint to verify the Tier 1 invariants + +This rhythm prevents the "20-tool-call-no-checkpoint" cliff where compaction loses critical state. + +## Concrete templates + +### Session-checkpoint format (insert as message in long sessions) + +```yaml +## Session checkpoint + +decisions_this_session: + - + - + +files_modified: + - : + +open_questions: + - + - + +next_action: + who: + what: + blocking_on: +``` + +### Bootstrap-from-cold prompt (for fresh session resuming work) + +When a fresh Claude Code session starts on an in-flight project: + +``` +First action (mandatory before any tool): +1. Read MEMORY.md (table of contents) +2. Read project_state_snapshot.md (HEAD + roster + ledger) +3. Read cto_operations_runbook.md (SOPs) +4. Read feedback_subagent_model_tier.md (D-matrix) +5. Then look at user's prompt + decide what to do +``` + +If MEMORY.md doesn't exist in the project, refuse to act until user clarifies role / project. + +## Pitfalls + +| Pitfall | Symptom | Recovery | +|---|---|---| +| Re-reading the same file 10× per session | Tool call waste, slow turns | Track read files in working memory; trust prior reads | +| Putting transient bash output in memory | Memory file grows unbounded | Memory is for stable facts; transient goes to scratch | +| Identity in skill description only (F16) | Post-compaction drift to executor mode | Mirror identity preamble in auto-memory; F16 mitigation | +| Tier 1 file never updated as project evolves | Snapshot becomes lying narrative | Pre-commit hook runs snapshot-lint | +| Cache-busting by shuffling system prompt order | Token cost inflates | Lock system prompt order; mutations go to user-turn | + +## Cross-references + +- `reference/cross-session-memory-architecture.md` — what goes in memory vs ADR vs finding vs snapshot +- `reference/failure-modes-catalogue.md` F16 — post-compaction identity drift (the negative form) +- `templates/snapshot-template.md` — Tier 1 bootstrap doc +- 
`templates/handoff-cover-letter.md` — cross-session handoff +- Anthropic Claude Code docs: subagents, MEMORY.md auto-load, plan mode +- OpenAI: cache optimization guidance + summary turn pattern diff --git a/plugins/adsd/skills/agent-driven-development/reference/cost-monitoring-discipline.md b/plugins/adsd/skills/agent-driven-development/reference/cost-monitoring-discipline.md new file mode 100644 index 0000000..da8b2e1 --- /dev/null +++ b/plugins/adsd/skills/agent-driven-development/reference/cost-monitoring-discipline.md @@ -0,0 +1,247 @@ +--- +name: Cost monitoring + budget gate discipline +description: Practical patterns for tracking LLM cost across ADSD sprints, setting budget gates per sprint and per release, and recognizing when cost is signaling a deeper problem (loop / drift / over-spawn). +type: reference +version: 1.0.0 +date: 2026-05-12 +status: active +relates_to: [skill:SKILL.md §"Wave + Tx pattern", reference:failure-modes-catalogue.md F12] +--- + +# Cost monitoring discipline + +> "Token cost is not a constraint" (per Cobrust constitution) does NOT mean "ignore cost." It means cost is not the primary correctness gate. Cost is still a **signal**: a sprint costing 10× the expected budget is telling you something — either the work is harder than estimated, or an agent is looping, or someone spawned 10× more sub-agents than needed. + +## When this applies + +- Any multi-sprint project running parallel agents +- Any sprint exceeding ~$5 in LLM cost +- Any sprint where consensus mode or stress sweeps fire (10×+ multipliers) +- Any release where you want to defend "we shipped at $X cost" + +If you're a one-shot agent with one tool call, this reference is overkill. + +## Three budget tiers + +``` +┌─────────────────────────────────────────────────────────────────────────┐ +│ TIER-A: Per-sprint budget │ +│ Set BEFORE dispatch. Stop and escalate if exceeded. 
│ +│ Typical: $1-$5 for sonnet sprint, $5-$15 for opus sprint │ +├─────────────────────────────────────────────────────────────────────────┤ +│ TIER-B: Per-release budget │ +│ Sum across all sprints leading to a tag. │ +│ Typical: $10-$50 per v0.X release; $50-$200 per v1.X major │ +├─────────────────────────────────────────────────────────────────────────┤ +│ TIER-C: Per-project lifetime budget │ +│ Track for sanity / ROI conversation with funding source. │ +│ Typical: $100-$1000 for a research-grade project; $1K-$10K for products │ +└─────────────────────────────────────────────────────────────────────────┘ +``` + +Each tier escalates differently: + +- TIER-A breach → STOP, report, ask user before continuing +- TIER-B breach → publish a finding documenting why; reassess release scope +- TIER-C breach → strategic review (ROI / pivot question) + +## Cost ledger (ADSD pattern) + +Every project running parallel agents must maintain a per-dispatch ledger. ADSD's recommended schema (codified in `cobrust-llm-router` if using a custom router, or in `.adsd/ledger.jsonl` as append-only log): + +``` +{ + "timestamp_utc": "2026-05-12T03:45:00Z", + "sprint_id": "lc100-stress-sweep", + "agent_role": "P7-sonnet-test-B1", + "session_id": "abc12345", + "provider": "anthropic", + "model": "claude-sonnet-4-6", + "prompt_tokens": 12345, + "completion_tokens": 6789, + "total_tokens": 19134, + "cost_micro_usd": 142500, + "cache_hit": false, + "task_tag": "test_corpus_generation", + "outcome": "ok" +} +``` + +Append on every API call. Materialize SQLite index for fast queries. 
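The append-then-materialize flow above can be sketched roughly as follows. This is a minimal illustration, not ADSD tooling: it assumes the JSONL log lives at `.adsd/ledger.jsonl` and is mirrored into a SQLite table named `ledger` (the name used by the queries in this reference). The helper names `append_record` and `materialize` are hypothetical.

```python
import json
import sqlite3
from pathlib import Path

# Columns mirror the ledger schema shown above. Helper names are illustrative only.
COLUMNS = [
    "timestamp_utc", "sprint_id", "agent_role", "session_id", "provider",
    "model", "prompt_tokens", "completion_tokens", "total_tokens",
    "cost_micro_usd", "cache_hit", "task_tag", "outcome",
]


def append_record(ledger_path: Path, record: dict) -> None:
    """Append one API-call record as a single JSON line (append-only log)."""
    missing = [c for c in COLUMNS if c not in record]
    if missing:
        raise ValueError(f"ledger record missing fields: {missing}")
    ledger_path.parent.mkdir(parents=True, exist_ok=True)
    with ledger_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, sort_keys=True) + "\n")


def materialize(ledger_path: Path, db_path: str) -> sqlite3.Connection:
    """Rebuild the SQLite `ledger` table from the JSONL log for fast queries."""
    conn = sqlite3.connect(db_path)
    conn.execute("DROP TABLE IF EXISTS ledger")
    conn.execute(f"CREATE TABLE ledger ({', '.join(COLUMNS)})")
    placeholders = ", ".join("?" for _ in COLUMNS)
    with ledger_path.open(encoding="utf-8") as fh:
        rows = [
            tuple(json.loads(line)[c] for c in COLUMNS)
            for line in fh
            if line.strip()
        ]
    conn.executemany(f"INSERT INTO ledger VALUES ({placeholders})", rows)
    conn.commit()
    return conn
```

Rebuilding the table from the log keeps the JSONL file as the single source of truth. An incremental index would be faster on large ledgers, but that is a design choice this sketch does not make.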
+ +### Useful ledger queries + +```sql +-- Cost per sprint +SELECT sprint_id, sum(cost_micro_usd)/1e6 as usd +FROM ledger +GROUP BY sprint_id +ORDER BY usd DESC; + +-- Cost per model +SELECT model, count(*) as calls, sum(total_tokens) as tokens, sum(cost_micro_usd)/1e6 as usd +FROM ledger +WHERE timestamp_utc > date('now', '-7 days') +GROUP BY model; + +-- Cache hit rate (savings) +SELECT cache_hit, count(*) as calls, sum(total_tokens) as tokens +FROM ledger +WHERE timestamp_utc > date('now', '-7 days') +GROUP BY cache_hit; +``` + +## Pre-sprint budget estimation + +Before dispatching a sprint, write the budget estimate in the dispatch prompt itself: + +``` +BUDGET ESTIMATE (must include in P9 dispatch): +- Phase 1 (P9 opus ADR drafting): ~30K prompt + 5K completion = ~$2 +- Phase 2 (4 × P7 sonnet pairs, ~25 problems each): + - Per pair: 5 reads × 10K + 10 writes × 5K = ~$1 + - 4 pairs × 2 agents = ~$8 +- Phase 3 (P9 opus triage): ~10K + 5K = ~$1 +- Phase 4 (decision report): ~5K = ~$0.5 + +TOTAL ESTIMATE: $11.50 ± 30% = $8-$15 range +TIER-A BUDGET: $20 (~30% headroom) + +Escalate at $15 actual if Phase 2 still in progress. +``` + +Estimation accuracy improves with practice. Track estimate vs actual across 10+ sprints to calibrate. + +## In-flight monitoring + +For long-running sprints (> 4 hr wall-clock or > $5 budget), check the ledger every ~1 hr: + +```bash +# Quick health check +sqlite3 .adsd/ledger.db " + SELECT + sprint_id, + count(*) as calls, + sum(total_tokens) as tokens, + sum(cost_micro_usd)/1e6 as usd + FROM ledger + WHERE sprint_id = '' +" +``` + +If actual ≥ 70% of TIER-A budget and the sprint is < 50% complete → escalate early. Don't wait for the breach. + +## Cost as a signal + +Cost is not just expense — it's a diagnostic indicator: + +### High cost without progress = loop + +If a sprint is at $10 with 0 new commits, the agent is likely in a loop. 
Symptoms: + +- Same files re-read 5+ times in ledger +- Same tool sequence repeating +- No new test cases / no new ADR sections + +Recovery: kill the sprint, audit the prompt for ambiguity, re-dispatch with sharper scope. + +### High cache miss rate = context shuffling + +If cache hit rate < 30% on Anthropic/OpenAI, the prompt structure is changing per-call. Likely cause: system prompt or memory file being mutated mid-sprint. + +Recovery: lock memory updates to inter-sprint boundaries. Don't edit memory while a sprint is running. + +### Cost spike at specific phase = under-estimated scope + +If Phase 2 of a 4-phase sprint costs 3× the budget for that phase, the work was scoped wrong. The next dispatch should split Phase 2 into 2a + 2b. + +This is a productive finding — write it up as a finding entry under `docs/agent/findings/sprint--cost-overrun.md`. + +## Anthropic-pattern adoption + +### Prompt caching reduces cost dramatically + +Anthropic caches stable prefixes (system prompt, project preamble) at ~10% of full cost. + +For ADSD: structure agent prompts so: + +1. System role + project preamble (cached) — top +2. Required-reads + RFC fragments (cached) — middle +3. User-turn / sprint-specific context — bottom + +Don't shuffle the order — that breaks the cache. ADSD memory files + dispatch-prompt templates already shape this. + +### Model selection by D-rating + +Anthropic explicitly recommends "use the cheapest model that passes your eval." ADSD's D0-D5 matrix is the practical implementation: + +- D0/D1 sonnet: ~5-10× cheaper than opus, generally sufficient +- D2 sonnet (with eval pair): fine if test corpus catches edge cases +- D3+ opus: pay the premium when the task requires it + +Don't default to opus for every task — that's overspending. Don't default to sonnet for D3+ tasks — that's underspending leading to F20. 
+ +## OpenAI-pattern adoption + +### Structured outputs reduce iteration + +OpenAI's structured-outputs (JSON schema) feature reduces "re-prompt for fix" cycles. Each correct-format reply saves 1× the call cost. + +ADSD shape: P7/P9 completion reports include YAML block (per prompt-engineering-patterns PT4). Saves ~20-30% across a typical multi-call sprint vs free-text reports. + +### Streaming saves wall-clock but not token cost + +OpenAI streaming saves user-perceived latency but not token cost. ADSD should use streaming for UX where it helps (release-readiness audit feedback to user) but understand it doesn't reduce $. + +## ADSD integration with existing patterns + +### Dispatch prompt budget block + +Add to `templates/dispatch-prompt-p9.md` § just below DIFFICULTY-RATING: + +``` +BUDGET ESTIMATE (must include): +- Phase-by-phase cost estimate +- TIER-A budget with ~30% headroom +- Early escalation threshold + +If actual cost exceeds estimate by 50% mid-sprint, STOP and report. +``` + +### Release-readiness ledger snapshot + +Include in `[P7-RELEASE-READY-VERDICT]`: + +``` +Cost snapshot at release tag: +- This sprint: $X.YY +- Prior sprints to this tag: $Z.WW +- Total release-bearing cost: $A.BB +``` + +Defensible "we shipped at $X" claim. + +### Cost as F-pattern detector + +The F-pattern catalogue should include cost-anomaly as a diagnostic. Add to dispatch: + +> If actual cost > 2× estimate, flag as potential F-pattern occurrence (likely F13 plan-vs-execute, F17 self-report, or unidentified). Findings entry mandatory. 
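The thresholds scattered through this reference (early escalation at 70% of the TIER-A budget with under 50% of the sprint done, stop at 50% over estimate mid-sprint, F-pattern flag above 2× estimate) can be collected into one check. The sketch below is an assumption-laden illustration: the `SprintCost` type, the signal names, and the idea of expressing sprint progress as a single fraction are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SprintCost:
    estimate_usd: float       # pre-sprint estimate from the dispatch prompt
    tier_a_budget_usd: float  # TIER-A budget (estimate plus headroom)
    actual_usd: float         # running total read from the ledger
    fraction_complete: float  # 0.0-1.0, a judgment call (phases or commits done)


def cost_signals(s: SprintCost) -> list[str]:
    """Return the cost signals this reference describes, in escalation order."""
    signals = []
    if s.actual_usd > s.tier_a_budget_usd:
        signals.append("TIER_A_BREACH")        # STOP, report, ask user
    if s.actual_usd >= 0.7 * s.tier_a_budget_usd and s.fraction_complete < 0.5:
        signals.append("ESCALATE_EARLY")       # don't wait for the breach
    if s.actual_usd > 1.5 * s.estimate_usd and s.fraction_complete < 1.0:
        signals.append("STOP_AND_REPORT")      # 50% over estimate mid-sprint
    if s.actual_usd > 2.0 * s.estimate_usd:
        signals.append("POSSIBLE_F_PATTERN")   # findings entry mandatory
    return signals
```

For the worked estimate earlier in this reference ($11.50 estimate, $20 TIER-A budget), a sprint at $15 actual and 40% complete would return only the early-escalation signal.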
+ +## Pitfalls + +| Pitfall | Symptom | Fix | +|---|---|---| +| No estimate, no monitoring | Bill shock at month end | Pre-sprint estimate + ledger | +| Ignore cost as "not a constraint" | Drift to over-spawning sub-agents | Cost as signal, not constraint | +| Cache miss not measured | Cost stays high after prompt-engineering optimization | Track cache hit rate per sprint | +| Over-using opus | Sonnet would suffice; 5-10× overspend | D-matrix rigor (PT7 in prompt patterns) | +| Cost ledger stale | Decisions made on outdated data | Append on every API call, not batched | + +## Cross-references + +- `templates/dispatch-prompt-p9.md` — budget estimate block (add per this reference) +- `reference/prompt-engineering-patterns.md` PT7 — D-rating drives cost +- `reference/evals-first-development.md` — eval delta lets you compare cost across optimizations +- `reference/failure-modes-catalogue.md` — F12 (model output starvation), cost signal for diagnosis +- Anthropic prompt caching docs: https://docs.anthropic.com/claude/docs/prompt-caching +- OpenAI structured outputs: https://platform.openai.com/docs/guides/structured-outputs diff --git a/plugins/adsd/skills/agent-driven-development/reference/cross-session-memory-architecture.md b/plugins/adsd/skills/agent-driven-development/reference/cross-session-memory-architecture.md new file mode 100644 index 0000000..a7d2aef --- /dev/null +++ b/plugins/adsd/skills/agent-driven-development/reference/cross-session-memory-architecture.md @@ -0,0 +1,227 @@ +--- +name: Cross-session memory architecture +description: ADSD's distinction between auto-memory, project artifacts, scratch context, and ephemeral state. Codifies what survives session boundaries and how to design new memory entries that don't decay. 
+type: reference +version: 1.0.0 +date: 2026-05-12 +status: active +relates_to: [skill:SKILL.md §"Snapshot discipline", reference:context-window-strategy.md, reference:failure-modes-catalogue.md F1 family + F16 + F17] +--- + +# Cross-session memory architecture + +> ADSD's hard-won memory discipline: not everything that lasts deserves to last; not everything ephemeral should be ephemeral. Four storage layers, each with a different persistence contract. + +## When this applies + +- You're about to write something down and don't know where it goes +- You're designing a new memory file or template +- You're auditing why a piece of project knowledge keeps getting re-derived + +If you're producing one commit and done, just commit it. This reference is for projects with state. + +## Four storage layers + +``` +┌──────────────────────────────────────────────────────────────────────────┐ +│ LAYER 1: Auto-memory (~/.claude/projects//memory/) │ +│ Survives: all sessions, all hosts (if synced) │ +│ Auto-loaded at session start via MEMORY.md index │ +│ Contains: identity preamble, operative rules, cross-session SOPs │ +│ Mutation policy: edit in-place; index entries are one-line hooks │ +├──────────────────────────────────────────────────────────────────────────┤ +│ LAYER 2: Project artifacts (repo's docs/ + ADR + findings + snapshot) │ +│ Survives: as long as the repo does │ +│ Contains: decisions (ADR), negative results (finding), state (snapshot) │ +│ Mutation policy: ADR immutable once accepted; finding append-only; │ +│ snapshot updated atomically with HEAD │ +├──────────────────────────────────────────────────────────────────────────┤ +│ LAYER 3: Session scratch (this conversation's working notes) │ +│ Survives: within session only │ +│ Contains: in-progress reasoning, intermediate computations, working set │ +│ Mutation policy: free-form; nothing committed unless promoted to L1/L2 │ +├──────────────────────────────────────────────────────────────────────────┤ +│ LAYER 
4: Ephemeral (single tool call output) │ +│ Survives: only as long as tool result is in context window │ +│ Contains: bash stdout, file read contents, grep results │ +│ Mutation policy: re-fetch if needed; don't memorialize │ +└──────────────────────────────────────────────────────────────────────────┘ +``` + +## Decision tree: where does this go? + +``` +Is it identity, role, or operative rule? ─yes─→ Layer 1 (auto-memory) + - new file under memory/ + - one-line hook in MEMORY.md + - frontmatter: type=feedback + │ + no + ▼ +Is it a binding decision affecting ≥2 files? ─yes─→ Layer 2 (ADR) + docs/agent/adr/NNNN-*.md + │ + no + ▼ +Is it a negative result / surprise / failure? ─yes─→ Layer 2 (finding) + docs/agent/findings/*.md + │ + no + ▼ +Is it a state fact (HEAD, version, count)? ─yes─→ Layer 2 (snapshot.md) + project_state_snapshot.md + │ + no + ▼ +Is it in-progress reasoning this sprint? ─yes─→ Layer 3 (session scratch — comment, message) + │ + no + ▼ +Is it a one-time output that can be re-fetched? ─yes─→ Layer 4 (ephemeral) + no memorialization needed +``` + +When in doubt, **default to Layer 3 scratch**. Promotion to L1/L2 happens deliberately at sprint end, not in the moment. + +## Layer 1 (auto-memory) deep dive + +### File naming convention + +- `feedback_.md` — operative rules / SOPs / user-mandated guidance (e.g. `feedback_subagent_model_tier.md`, `feedback_p10_post_compaction_identity_recovery.md`) +- `reference_.md` — pointers to external systems (e.g. `reference_proxy_config.md`) +- `project_state_snapshot.md` — current canonical state (single file, mutated atomically) +- `MEMORY.md` — index (one-line hooks pointing to files above) + +### Frontmatter contract + +```yaml +--- +name: +description: +type: feedback | reference | snapshot +originSessionId: +last_verified_date: +related_memory: [.md, ...] +--- +``` + +### MEMORY.md index format + +``` +- [](file.md) — +``` + +Top entries are read-first. 
Place identity-recovery / role-clarifying files at the top. + +### Mutation discipline + +Auto-memory mutates in-place (no git history). Therefore: + +- Date your edits — `## Extension 2026-05-12: ...` rather than overwriting silently +- Don't delete past sections; mark them `## Deprecated 2026-05-12: was X, now Y because Z` +- One-line description in MEMORY.md must stay accurate; update it when content shifts + +### When to add a new memory file vs extend an existing one + +Add new file when: +- New topic area not covered by existing file +- File would grow > 200 lines +- The rule applies to a different agent role than existing file's audience + +Extend existing file when: +- New sub-rule of an existing rule +- Refinement / amendment to existing operative practice +- The "## Extension :" pattern keeps mutations auditable + +## Layer 2 (project artifacts) deep dive + +### ADR vs Finding distinction + +ADRs are **forward-looking decisions** ("we will do X going forward"). Findings are **backward-looking observations** ("we hit Y; here's what we learned"). They're not interchangeable. + +A failure observation (finding) → drives a future decision (ADR). The finding doesn't bind anyone; the ADR does. Don't put binding rules in findings; don't put incident history in ADRs. + +### Snapshot.md responsibility + +Single source of truth for current project state: + +- Current HEAD SHA (auto-updated post-merge) +- ADR roster table (each accepted ADR listed) +- Finding ledger (each finding with status: open / closed) +- Phase / milestone progress +- Binary verification claim (e.g. "cobrust build hello.cb passes at HEAD") + +Snapshot has its own enforcement: `scripts/snapshot-lint.sh` validates the invariants are met. Without snapshot-lint, snapshot drifts (F1.1 — declared invariant without enforcement). 
+ +## Layer 3 vs Layer 4 boundary + +Most agent failures come from **misplacing Layer 3 facts into Layer 4 (forgetting useful state) or Layer 4 facts into Layer 3 (cluttering working memory)**. Examples: + +- ❌ Re-reading the same source file 5 times (Layer 3 fact treated as Layer 4 — already had it, should re-use) +- ❌ Writing intermediate bash output to a memory file (Layer 4 promoted to Layer 1 — bloat) +- ✅ Keeping a running list of "files I've read this turn" in scratch (Layer 3 working set) +- ✅ Discarding `grep -c` count after using it (Layer 4 ephemeral) + +## Anthropic-pattern adoption + +### MEMORY.md auto-load contract + +Anthropic Claude Code auto-loads MEMORY.md at session start. ADSD uses this: + +- Memory files are the agent's "world model" at boot time +- Index hooks must be precise — the agent decides which to read based on hooks +- A line-1 entry like `[Identity recovery SOP] — read if post-compaction or fresh session` ensures the right file gets opened + +### Anti-pattern: stale memory + +Anthropic warns: memory is point-in-time. Don't trust years-old memory entries blindly. ADSD codifies this: + +- Frontmatter `last_verified_date` field +- Pre-action verification when memory makes a claim about file paths or current state +- Stale memory entries get marked deprecated, not silently re-relied-upon + +## OpenAI-pattern adoption + +### Vector store + retrieval (NOT YET in ADSD) + +OpenAI's Assistants API does retrieval over uploaded files. ADSD currently relies on the agent's context window + memory; retrieval is not adopted. + +For ADSD v1.3.0+: consider retrieval if memory + repo content together exceed context budget. Until then, the four-layer model is sufficient. + +### Threads (session scoping) + +OpenAI threads are persistent multi-session conversations. ADSD's analog: per-project memory folder. Same idea — bound the persistence to the project, not the global model. 
+ +## ADSD integration with existing patterns + +### Snapshot-lint enforcement loop + +Layer 2 snapshot.md has invariants (HEAD freshness, ADR roster completeness). `scripts/snapshot-lint.sh` runs these as Inv 1-4. Pre-commit hook fires snapshot-lint, blocking commits that violate invariants. This is the F1.1 closure mechanism. + +### CTO operations runbook is the Layer 1 cookbook + +`cto_operations_runbook.md` codifies P9 dispatch SOPs, conflict resolution, gates. It's auto-memory because it must survive session boundaries — every new session running CTO role reads it on bootstrap. + +### Identity recovery memory closes F16 + +`feedback_p10_post_compaction_identity_recovery.md` lives in Layer 1 specifically because identity must survive compaction. The corresponding F-pattern is the negative form of why this memory exists. + +## Pitfalls + +| Pitfall | Layer confusion | Recovery | +|---|---|---| +| Memory file holds in-progress sprint notes | L3 → L1 leak | Move to a scratch message, promote permanent rule to L1 if it's actually a rule | +| ADR captures incident history | L2-finding → L2-ADR confusion | Rewrite as finding; if a decision was made, separate ADR linking the finding | +| Snapshot.md not updated post-merge | L2 staleness | snapshot-lint pre-commit hook (F1.1 mitigation) | +| "I'll remember to do X" (Layer 3) becomes binding | L3 informal → expected L1 | Either codify in memory or accept it'll be forgotten | +| Re-reading same memory file every turn | L4-style use of L1 | Trust L1 was loaded at session start; don't re-fetch | +| MEMORY.md hook is generic ("various rules") | Index loses dispatch value | Rewrite as specific keyword-dense one-liner | + +## Cross-references + +- `reference/context-window-strategy.md` — what to put in context (different question than where to put facts) +- `reference/failure-modes-catalogue.md` F1 family + F16 + F17 — anti-patterns this architecture mitigates +- `templates/snapshot-template.md` — Layer 2 snapshot template +- 
`SKILL.md` §"Snapshot discipline" — the operative discipline this architecture supports +- Anthropic Claude Code memory docs — MEMORY.md auto-load contract +- OpenAI Assistants API — threads + retrieval (for ADSD future consideration) diff --git a/plugins/adsd/skills/agent-driven-development/reference/evals-first-development.md b/plugins/adsd/skills/agent-driven-development/reference/evals-first-development.md new file mode 100644 index 0000000..048f733 --- /dev/null +++ b/plugins/adsd/skills/agent-driven-development/reference/evals-first-development.md @@ -0,0 +1,212 @@ +--- +name: Evals-first development discipline +description: Build the evaluation harness before the feature. Anthropic's central claim "evals are the moat" applied to ADSD. Every public capability gets a falsifiable test before implementation, not after. +type: reference +version: 1.0.0 +date: 2026-05-12 +status: active +relates_to: [skill:SKILL.md §"Wave + Tx pattern", reference:failure-modes-catalogue.md F19+F20] +--- + +# Evals-first development + +> **Anthropic central claim**: "Evals are the moat. Better evals beat better models." +> ADSD adopts this: every Cobrust public capability has a falsifiable acceptance signal **before** the impl Tx fires. This is the positive form of F20 (constitution-vs-workflow alignment) — workflow enforces the rule by requiring the eval to fail-then-pass. 
+ +## When this applies + +For any task type: + +- Adding a new public API surface (CLI flag, language feature, stdlib fn) +- Translating a Python library (every entrypoint = its own eval slice) +- Migrating an internal API (eval guards behavior preservation) +- Performance-claim release (eval = repeatable benchmark) +- LLM-driven anything (eval is the only way to detect prompt drift) + +Skip evals only for: + +- One-shot scripts with no future maintenance +- Doc-only changes (release-readiness verify is its own form, see F19) +- Strict refactors with full test coverage already + +## Anthropic-pattern adoption + +### Eval as code, not document + +Anthropic's evals are runnable artifacts — `pytest`-style scripts, JSON-line inputs/outputs, scoring functions. Not markdown narratives. + +ADSD shape: + +``` +project-root/ +├── evals/ +│ ├── / +│ │ ├── cases.jsonl # one (input, expected) per line +│ │ ├── score.py # scoring fn (exact / fuzzy / LLM-judge) +│ │ ├── run.sh # entrypoint +│ │ └── REPORT.md # last-run summary, machine-updated +│ └── README.md # eval directory index +``` + +### Eval categories (Anthropic taxonomy) + +| Category | When | Example for Cobrust | +|---|---|---| +| **Exact match** | Deterministic output | `cobrust build hello.cb` exit 0 + stdout `hello, world` | +| **Fuzzy match** | Allows whitespace / order drift | TOML round-trip; output equivalent under canonicalization | +| **Regex / structural** | Format known, content variable | Compiler error messages match pattern `/error\[\w+\]:/` | +| **LLM-judge** | Open-ended (docs, NL output) | Translated library's docstrings preserve original meaning | +| **Differential** | Compare against oracle | `cobrust-tomli.parse(s)` == `cpython tomllib.loads(s)` for 1000+ fuzz inputs | + +Differential evals are ADSD's strongest pattern — already baked in for tomli T1.1. Generalize. 
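A differential eval reduces to running the same inputs through a candidate and an oracle and recording disagreements. The harness below is a generic sketch under stated assumptions: `differential_eval` and `_run` are hypothetical names, and the candidate is any Python-callable wrapper around the implementation under test. For the tomli case the oracle would be CPython's `tomllib.loads` (Python 3.11+).

```python
from typing import Any, Callable, Iterable


def differential_eval(
    candidate: Callable[[str], Any],
    oracle: Callable[[str], Any],
    inputs: Iterable[str],
) -> list[dict]:
    """Return one mismatch record per input where candidate and oracle disagree."""
    mismatches = []
    for text in inputs:
        expected = _run(oracle, text)
        actual = _run(candidate, text)
        if expected != actual:
            mismatches.append(
                {"input": text, "expected": expected, "actual": actual}
            )
    return mismatches


def _run(fn: Callable[[str], Any], text: str) -> tuple:
    """Normalize a call into ("ok", value) or ("error", None).

    Both sides rejecting an input counts as agreement; error types and
    messages are deliberately not compared in this sketch.
    """
    try:
        return ("ok", fn(text))
    except Exception:
        return ("error", None)
```

An empty result list is the pass condition, so a `run.sh` wrapper could exit non-zero whenever the list is non-empty. Comparing only accept/reject on errors is a simplification. A stricter eval might also compare error categories.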
+ +### Minimum eval bar (Anthropic guideline) + +- **≥ 50 cases** per public capability +- **≥ 10 adversarial cases** (boundary conditions, malformed input, edge encodings) +- **Reproducible**: `bash evals//run.sh` exits non-zero on regression +- **Cheap**: full eval suite runs in < 5 min, fuzz suite in < 30 min + +### Eval delta as merge gate + +The Anthropic moat is enforced via: **PR must report eval delta**. + +``` +[P9-COMPLETION] eval-delta block (required for any merge touching public surface): +- evals//cases.jsonl: +N cases (was M, now M+N) +- pass rate before: / → after: / +- adversarial cases: +K new (was J, now J+K) +- regression check: 0 prior cases newly failing +- if ANY prior case newly fails → BLOCK merge until justified or fixed +``` + +CTO gatekeeping protocol: spot-check the eval delta. Don't merge sprints that touch public surface without an eval-delta block. + +## OpenAI-pattern adoption + +### Function/tool eval (structured-output enforcement) + +OpenAI's strongest practice: **if your agent emits structured output, eval the structure**. + +For ADSD: any sub-agent reporting `[P9-COMPLETION]` should emit a machine-parseable YAML block. The CTO can machine-parse it and verify that the required fields are present. + +``` +[P9-LC100-COMPLETION] +```yaml +phase_1_adr: { sha: 3839742, status: accepted } +phase_2_buckets: + - { name: B1, pass: 27, compile_fail: 2, runtime_fail: 1 } + - { name: B2, pass: 24, compile_fail: 4, runtime_fail: 2 } + - { name: B3, pass: 22, compile_fail: 5, runtime_fail: 3 } + - { name: B4, pass: 9, compile_fail: 1, runtime_fail: 0 } +total_pass: 82 +total_fail: 18 +ramp_recommendation: GO_TIER_B +bug_patterns_top5: + - { signature: "i8/i64 mismatch in nested if", count: 4, finding: lc100-i8-i64-nested-if } + - ... +``` + +The CTO parses the YAML with `yq` (or `jq` after conversion to JSON), verifies the fields, and spot-checks 3 random bugs. 
+ +### OpenAI Evals framework (open source) + +OpenAI Evals repo (github.com/openai/evals) is well-documented. 
ADSD shouldn't reinvent — adopt their core types: + +- `match` — exact substring +- `fuzzy_match` — token / whitespace tolerant +- `model_graded` — LLM-as-judge with own evaluator model +- `code_run` — execute generated code, compare output + +ADSD's `score.py` per-eval-folder can wrap OpenAI Evals primitives. + +## ADSD integration with existing patterns + +### Wave + Tx + eval delta + +Existing pattern: every Wave merge has 5-gate green (fmt / clippy / build / test / doc-coverage). + +Add 6th gate for any public-surface Wave: **eval delta non-regression**. New cases land + 0 prior cases newly failing. + +This 6th gate is the systemic closure of F20 (constitution mandate without workflow alignment). It makes "evals first" a binding mandate, not aspiration. + +### Eval-first vs TDD-pair (already in ADSD) + +These are complementary, not competing: + +- **TDD-pair** (Phase 2 in dispatch): test agent writes test corpus first, dev agent implements to pass. Per-feature TDD. +- **Eval-first** (Phase 0 in sprint): eval harness exists before the feature is dispatched. Per-public-surface lifetime guard. + +TDD pair tests that the impl matches the test corpus this sprint. Evals catch that impl still matches behavioral contract across sprint history. + +### Finding ↔ eval bidirectional + +When a finding is discovered (e.g. `lc100-i8-i64-nested-if`), the **same sprint that fixes the bug must add an eval case that catches it next time**. This is the prevention layer beyond the documentation layer. + +`docs/agent/findings/.md` must have a §"Eval case added" section listing the line in `evals//cases.jsonl` that catches this specific failure. + +## Concrete template + +`templates/eval-template.md`: + +``` +--- +name: -evals +description: +date: +last_verified_commit: +case_count: +adversarial_count: +oracle: +--- + +# Evals: + +## Behavior under eval + +<2-3 sentences. 
The falsifiable claim about what the feature does.> + +## Eval suite layout + +- `cases.jsonl` — N input cases with expected output +- `score.py` — scoring function (cite category: exact / fuzzy / regex / model_graded / code_run) +- `run.sh` — entrypoint, exits non-zero on regression + +## Adversarial cases (subset of cases.jsonl) + + + +## Last run + +| Field | Value | +|---|---| +| Date | | +| Commit | | +| Pass | / | +| New failures vs prior | | +| Regression status | | + +## Pitfalls + +- LLM-judge eval drifts if evaluator model changes. Pin evaluator model in `score.py`. +- Differential evals need pinned oracle version. Document oracle version in frontmatter. +``` + +## Pitfalls + +| Pitfall | Symptom | Recovery | +|---|---|---| +| Evals as documentation, not runnable code | `cases.jsonl` exists but no `run.sh` | Promote to runnable in 1 PR or delete | +| Eval coverage cliff: tons of easy cases, no adversarial | All cases pass on first try | Demand `adversarial_count ≥ N/5` in frontmatter | +| LLM-judge instability | Same case gives different verdict on rerun | Pin evaluator model + temperature 0 + cache responses | +| Differential eval oracle drift | Oracle library version bumps and evals silently re-baseline | Pin oracle version in frontmatter + verify in CI | +| Eval delta forgotten in PR | Sub-agent completion report omits eval-delta block | Make `[P9-COMPLETION]` template require the block | + +## Cross-references + +- `SKILL.md` §"Wave + Tx commit tags" — eval delta is the 6th gate +- `reference/failure-modes-catalogue.md` F19 (release install-not-tested) — eval-first is the systemic prevention +- `reference/failure-modes-catalogue.md` F20 (constitution-vs-workflow) — eval-first IS the workflow that enforces "test-first" mandate +- `templates/eval-template.md` — runnable template per feature +- Anthropic: https://www.anthropic.com/engineering (search "evals") +- OpenAI Evals: https://github.com/openai/evals diff --git 
a/plugins/adsd/skills/agent-driven-development/reference/prompt-engineering-patterns.md b/plugins/adsd/skills/agent-driven-development/reference/prompt-engineering-patterns.md new file mode 100644 index 0000000..1bd6a7c --- /dev/null +++ b/plugins/adsd/skills/agent-driven-development/reference/prompt-engineering-patterns.md @@ -0,0 +1,280 @@ +--- +name: Prompt engineering patterns for sub-agent dispatch +description: Distilled prompt engineering patterns from Anthropic and OpenAI public guidance, adapted for ADSD sub-agent dispatch context. Covers chain-of-thought, few-shot, structured output, role priming, anti-hallucination guards. +type: reference +version: 1.0.0 +date: 2026-05-12 +status: active +relates_to: [skill:SKILL.md §"Two-phase dispatch", templates:dispatch-prompt-p7.md + dispatch-prompt-p9.md] +--- + +# Prompt engineering patterns + +> When you spawn a sub-agent, the prompt is your only lever. A sub-agent with a poorly written prompt cannot recover at runtime — there's no second chance. This reference codifies the patterns Anthropic and OpenAI publicly recommend, adapted to ADSD's sub-agent dispatch context. + +## When this applies + +- Writing any P9 or P7 dispatch prompt +- Designing a new sub-agent role +- Diagnosing why a sub-agent went off the rails +- Auditing existing dispatch templates for gaps + +Not for: writing user-facing docs, marketing copy, or release notes (different audience, different goals). + +## Core principles (Anthropic + OpenAI consensus) + +### P1 — Explicit role + scope first + +Start every sub-agent prompt with: + +``` +You are delivering . +Your deliverable is . +You do NOT do . +Time budget: . +``` + +Role priming concentrates the agent's behavior. Without it, generic Claude / GPT defaults take over — verbose, hedging, broad-scoped. + +### P2 — Required-reads section before mission + +Sub-agents have no prior context. 
List the exact files they must read before starting work: + +``` +REQUIRED READS (read all before any tool call): +- /abs/path/to/relevant/ADR.md +- /abs/path/to/spec.md +- /abs/path/to/existing/test_surface.rs +``` + +Absolute paths. Not "look in the docs folder." + +### P3 — Mission expressed as a verifiable claim + +Bad: "Implement stdin support." + +Good: "Implement `input(prompt: str) -> str` such that the test corpus in `crates/cobrust-stdlib/tests/input_corpus.rs` passes with 0 failures." + +A verifiable claim has: a specific surface (`input(...)`), an acceptance signal (test corpus), and a measurable outcome (0 failures). The agent can self-check progress against this. + +### P4 — Anti-hallucination guards + +Three guards Anthropic specifically calls out: + +1. **"Cite or admit"**: "When making a quantitative claim (test count, file count, SHA), include the verifying command in the same response. If you don't have the command result, say 'unverified'." +2. **"No phantom paths"**: "If you reference a file path, only reference paths returned by an actual tool call this session. Don't invent plausible-looking paths." +3. **"Match-or-mismatch"**: "When the user provides a value and you echo it back, ensure character-for-character match. If your output differs even by case or whitespace, flag the discrepancy explicitly." + +### P5 — Output structure first, content second + +OpenAI's structured-output discipline: define the output schema before describing what goes in each field. + +Bad: "Return a completion report with all the details." + +Good: +``` +Report format (must include these exact section headers): + +[P7-MISSION-COMPLETION] +- Branch: +- Final SHA: <40-char hex> +- Gate verdicts: + - fmt: + - clippy: + - build: + - test: + - doc-coverage: +- Empirical evidence: + - : +- Followups: +- Escalations: +``` + +The agent fills the slots. Structure resists drift. 
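Because the report format in P5 is fixed, its structure can be checked mechanically before a human reads the content. The following is a minimal sketch, assuming the report arrives as plain text with the exact headers shown above. The function name `check_report` and the problem messages are hypothetical.

```python
import re

# Top-level fields and gate names taken from the P5 report format above.
REQUIRED_FIELDS = [
    "Branch", "Final SHA", "Gate verdicts",
    "Empirical evidence", "Followups", "Escalations",
]
REQUIRED_GATES = ["fmt", "clippy", "build", "test", "doc-coverage"]


def check_report(report: str) -> list[str]:
    """Return a list of structural problems; an empty list means the shape is intact."""
    problems = []
    if "[P7-MISSION-COMPLETION]" not in report:
        problems.append("missing [P7-MISSION-COMPLETION] header")
    for field in REQUIRED_FIELDS:
        if not re.search(rf"^- {re.escape(field)}:", report, flags=re.MULTILINE):
            problems.append(f"missing field: {field}")
    for gate in REQUIRED_GATES:
        # Gate verdicts are indented sub-bullets under "Gate verdicts".
        if not re.search(rf"^[ \t]+- {re.escape(gate)}:", report, flags=re.MULTILINE):
            problems.append(f"missing gate verdict: {gate}")
    sha = re.search(r"^- Final SHA:\s*(\S+)", report, flags=re.MULTILINE)
    if sha and not re.fullmatch(r"[0-9a-f]{40}", sha.group(1)):
        problems.append("Final SHA is not 40-char lowercase hex")
    return problems
```

This checks shape only. Whether the SHA exists or the gates actually passed still requires independent verification.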
+ +## Pattern catalogue + +### PT1 — Chain-of-thought elicitation + +Anthropic + OpenAI: explicit "think step by step" works on hard tasks but hurts on simple ones. + +Use for: design decisions, debugging, ambiguous specs. +Skip for: well-scoped impl tasks (TDD pair handles the structure). + +Form: +``` +Before writing code, write 3-5 sentences answering: +1. What does the spec actually require? +2. What's the simplest implementation that meets the spec? +3. What edge cases must the impl handle? +4. What's an alternative implementation, and why was it rejected? +5. What's the test that would catch a regression here? + +Then write the code. +``` + +### PT2 — Few-shot examples (for output format) + +When you want the sub-agent's output to follow a specific format, **show 1-2 examples in the prompt itself**. + +Form: +``` +Example completion report (do NOT copy these literal values; use this STRUCTURE): + +[P7-EXAMPLE-COMPLETION] +- Branch: feature/foo-bar +- Final SHA: abcd1234abcd1234abcd1234abcd1234abcd1234 +- Gate verdicts: + - fmt: pass (0 diff) + - clippy: pass (0 warnings) + - ... +``` + +Anti-pattern: telling without showing. "Return a structured report" without an example produces freeform prose. + +### PT3 — Role priming with negative example + +Form: +``` +You are a P9 tech lead. Your deliverable is Task Prompts for P7 sub-agents. + +You do NOT: +- Edit source files yourself (that's P7 work) +- Run cargo test on feature branches (that's P7 work) +- Push to remote on feature branches (that's P7's deliverable) +- Ask the user about decisions covered by the constitution + +You DO: +- Draft ADRs for design decisions (~3 hr opus solo work) +- Spawn P7 sub-agents for impl +- Review their completion reports +- Merge cleanly after independent gate verification +``` + +The negative blacklist concentrates the agent's behavior more reliably than the positive whitelist alone. 
+ +### PT4 — Structured output via JSON / YAML block + +When downstream parsing is needed (CTO will `yq` the result): + +```` +After the human-readable report, append a YAML block with these fields: + +```yaml +status: success | partial | failed +final_sha: <40-char hex> +gates: + fmt: { pass: true, count: 0 } + clippy: { pass: true, count: 0 } + build: { pass: true } + test: { pass: true, total: 2611, failed: 0 } + doc_coverage: { pass: true } +followups: + - +escalations: + - +``` +```` + +Both human-readable and machine-parseable. + +### PT5 — Refusal / escalation conditions + +Tell the agent when to STOP and report instead of continuing: + +``` +STOP and report to CTO if any of: +- The ADR's "Done means" is unreachable with the spec as written +- The spec contradicts another ADR — escalate the conflict +- 600s+ stream-idle on cargo test (likely environment issue) +- > 50 retry attempts on any single failing test (root cause is deeper) + +In these cases, report partial work + ask for guidance. Don't loop indefinitely. +``` + +### PT6 — Self-verification block + +Before submitting a completion report, the agent must verify its own claims (Anthropic anti-hallucination): + +``` +VERIFICATION (run these commands and paste raw output before submitting): +- git log --oneline main..HEAD | head -5 +- cargo test --workspace --locked 2>&1 | tail -3 +- bash scripts/doc-coverage.sh 2>&1 | tail -3 +- grep -cE '^## F[0-9]+' reference/failure-modes-catalogue.md (if claiming N entries) +``` + +The agent's claim only counts if the verification command output is pasted alongside it. + +## ADSD-specific patterns + +### PT7 — Difficulty self-rating (per D-matrix) + +Every P9 dispatch must include: + +``` +DIFFICULTY-RATING (mandatory): +- D-RATING: D0 / D1 / D2 / D3 / D4 / D5 +- RATIONALE: <2-3 sentences citing specific crates/files/edge cases> +- MODEL-DEV: sonnet | opus +- MODEL-TEST: sonnet | opus | n/a +- PAIR: yes (D1/D2/D3/D5) | no (D0/D4) +``` + +This pattern catches model-tier mismatches before agent spawn. 
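The PAIR rule in the PT7 template (pair for D1/D2/D3/D5, solo for D0/D4) is simple enough to lint before a dispatch goes out. Below is a small Python sketch under stated assumptions: `check_difficulty_block` and its error strings are hypothetical, and the parser assumes each field sits on its own `- NAME: value` line exactly as in the template.

```python
import re

# PAIR rule copied from the PT7 template: pair for D1/D2/D3/D5, solo for D0/D4.
PAIR_REQUIRED = {"D0": False, "D1": True, "D2": True, "D3": True, "D4": False, "D5": True}
REQUIRED_FIELDS = ("D-RATING", "RATIONALE", "MODEL-DEV", "MODEL-TEST", "PAIR")
# Assumption: one "- NAME: value" field per line, with a non-empty value.
FIELD_RE = re.compile(r"^- ([A-Z-]+):[ \t]*(\S.*?)[ \t]*$", re.MULTILINE)


def check_difficulty_block(block: str) -> list[str]:
    """Return problems found in a DIFFICULTY-RATING block (empty list = consistent)."""
    fields = dict(FIELD_RE.findall(block))
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in fields]

    rating = fields.get("D-RATING")
    if rating is not None and rating not in PAIR_REQUIRED:
        problems.append(f"unknown rating: {rating}")

    pair = fields.get("PAIR")
    if rating in PAIR_REQUIRED and pair is not None:
        declared = pair.lower().startswith("yes")
        if declared != PAIR_REQUIRED[rating]:
            expected = "yes" if PAIR_REQUIRED[rating] else "no"
            problems.append(f"PAIR mismatch: {rating} expects {expected}")
    return problems
```

The sketch deliberately leaves the model-tier choice (sonnet vs opus) unchecked, since the template does not state a fixed mapping from rating to model.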
+ +### PT8 — Identity hygiene (F21 closure) + +For agents producing persistent artifacts: + +``` +Sign commits and documents with your SESSION ID, not your role handle alone. + +Wrong: `Co-Authored-By: review-claude` +Right: `Co-Authored-By: review-claude (session 4bb35f43)` + +Wrong: "— CTO, 2026-05-12" +Right: "— CTO session XYZ, 2026-05-12" +``` + +### PT9 — Release-readiness guard (F19 closure) + +For any commit touching a user-facing artifact: + +``` +Before declaring this Tx done, spawn a P7 sonnet release-readiness agent +to clean-shell-verify install commands in this commit's changes. See +cto_operations_runbook.md §"Release-readiness agent". + +Do NOT self-attest "the install command works" without independent +verification. F17/F19 closure mechanism. +``` + +## Pitfalls + +| Pitfall | Symptom | Fix | +|---|---|---| +| Generic role ("you are a helpful AI") | Sub-agent over-explains, hedges, asks unnecessary questions | Replace with specific role + scope (P1) | +| Mission as a verb without scope | Sub-agent expands work indefinitely | Reframe as verifiable claim (P3) | +| No required-reads list | Sub-agent makes up plausible-but-wrong file paths | Required-reads with absolute paths (P2) | +| "Be thorough" | Long, low-density output | Demand structured output (P5 / PT4) | +| No escalation conditions | Sub-agent retries forever | PT5 explicit STOP conditions | +| No verification block | Claims drift from reality (F17) | PT6 mandatory verification | +| No difficulty rating | Model tier mismatch (F20 family) | PT7 mandatory D-rating | +| Generic sign-off "review-claude" | Cross-session identity overload (F21) | PT8 session-ID stamping | + +## Anti-patterns (cross-reference to F-patterns) + +- **F13 (plan-vs-execute coherence gap)**: prompt says "do X carefully" but doesn't specify what "carefully" looks like in execution. Fix: PT5 + PT6 explicit verification. +- **F17 (KPI self-report fidelity)**: agent claims completed work without verification. 
Fix: PT6 mandatory verification block. +- **F19 (install-not-tested)**: prompt asks agent to write docs but doesn't require execution verification. Fix: PT9 release-readiness guard. +- **F21 (cross-session identity overload)**: agent signs with bare role handle. Fix: PT8 session-ID stamping. + +## Cross-references + +- `templates/dispatch-prompt-p9.md` — P9 template applies these patterns +- `templates/dispatch-prompt-p7.md` — P7 template applies these patterns +- `reference/failure-modes-catalogue.md` — anti-patterns these prompts mitigate +- `reference/evals-first-development.md` — verification block ↔ eval delta +- Anthropic prompt engineering guide: https://www.anthropic.com/engineering +- OpenAI prompt engineering best practices: https://platform.openai.com/docs/guides From 97c8da25a7f7d5e625030f5efa27891cd333f5ec Mon Sep 17 00:00:00 2001 From: wbj010101 Date: Tue, 12 May 2026 00:34:18 +0800 Subject: [PATCH 03/16] docs(skill): add cross-pollination references section to SKILL.md (v1.2.0 menu) Adds explicit menu pointing at the 5 new reference files (evals-first-development, context-window-strategy, cross-session-memory-architecture, prompt-engineering-patterns, cost-monitoring-discipline) with Anthropic+OpenAI provenance noted and remaining-7-gaps-for-v1.3.0 forward-pointer. This was the SKILL.md menu update intended in a8d6a0f; landed separately because Edit-after-Bash-tail-read precondition required explicit Read pass. Co-Authored-By: Claude Opus 4.7 (1M context) --- .../adsd/skills/agent-driven-development/SKILL.md | 14 ++++++++++++++ 1 file changed, 14 insertions(+) diff --git a/plugins/adsd/skills/agent-driven-development/SKILL.md b/plugins/adsd/skills/agent-driven-development/SKILL.md index 7492bbe..0f75a2a 100644 --- a/plugins/adsd/skills/agent-driven-development/SKILL.md +++ b/plugins/adsd/skills/agent-driven-development/SKILL.md @@ -895,6 +895,8 @@ Everything else is adaptable. 
## Cross-references (within this skill) +### Originals (distilled from Cobrust 9-week run) + - Part 1 Topology details: `reference/role-topology.md` - Part 2 Two-phase dispatch deep dive: `reference/two-phase-dispatch.md` - Part 3 Snapshot discipline: `reference/snapshot-discipline.md` @@ -902,6 +904,18 @@ Everything else is adaptable. - Templates: `templates/*.md` - Cobrust case study: `case-study/cobrust-multi-agent-experience.md` +### Cross-pollination from Anthropic + OpenAI public guidance (v1.2.0) + +These references adopt established industry practices into ADSD discipline, distilled and adapted to ADSD's sub-agent dispatch context. Read in priority order: + +- **Evals-first development** (`reference/evals-first-development.md`) — Anthropic's "evals are the moat" claim applied to ADSD. Every public capability gets a falsifiable test corpus before implementation. The 6th gate (eval delta non-regression) closes F20 systemically. **Highest leverage of all v1.2.0 additions.** +- **Context-window strategy** (`reference/context-window-strategy.md`) — Positive practices for long agent sessions, complementing F16 (post-compaction identity drift). Three-tier model: persistent / session-scoped / transient. Bootstrap-from-cold prompt template. +- **Cross-session memory architecture** (`reference/cross-session-memory-architecture.md`) — Four-layer storage model (auto-memory / project artifacts / session scratch / ephemeral) with decision tree for "where does this go?". Codifies ADSD's hard-won memory file discipline. +- **Prompt engineering patterns** (`reference/prompt-engineering-patterns.md`) — Distilled patterns from Anthropic + OpenAI prompt guides: role priming, anti-hallucination guards, structured output, refusal/escalation conditions, ADSD-specific patterns (D-rating, identity hygiene, release-readiness guard). 
+- **Cost monitoring discipline** (`reference/cost-monitoring-discipline.md`) — Practical patterns for tracking LLM cost per sprint / per release / per project. Cost as a diagnostic signal for loops + drift + scope misestimation. Anthropic prompt caching + OpenAI structured outputs as cost-reduction levers. + +These five represent **5 of 12 v1.2.0 gap candidates** identified by review-claude self-audit. Remaining 7 (skills architecture, agent specialization, HITL tree, RCA template, MCP patterns, calibrated confidence, structured-output enforcement) are queued for v1.3.0. + ## Origin & lineage This skill is distilled from the Cobrust project (2026-04-30 to From 2a7fd97d3bce9a30972586722b3a5a466888aab0 Mon Sep 17 00:00:00 2001 From: wbj010101 Date: Tue, 12 May 2026 00:52:02 +0800 Subject: [PATCH 04/16] =?UTF-8?q?feat(v1.2.1):=20bilingual=20user=20docs?= =?UTF-8?q?=20+=20doc-coverage=20gate=20(ADSD=20=C2=A73=20dogfood)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ADSD §3 documentation mandate: every public item gets zh+en+agent docs. ADSD repo previously fact-violated its own §3 mandate by shipping SKILL.md (agent-facing) without docs/human/{zh,en}/ user-facing docs. This commit closes that F20 dogfood instance. New artifacts: 1. docs/human/{zh,en}/getting-started.md — 30-min onboarding, parallel zh+en, 5-step practice section from CLAUDE.md scaffold through first sub-agent dispatch with D-matrix. 2. docs/human/{zh,en}/concept-map.md — 6 mermaid diagrams covering top-level flow / 3-layer abstraction / F1 Sediment Family / 4-layer storage / dev-test pair sequence / release closure with release-readiness. Parallel zh+en. 3. docs/agent/conventions.md — meta-conventions for THIS repo: binding repo structure, frontmatter contracts, bilingual mandate enforcement scope, commit message format, F21 identity hygiene. 4. 
scripts/doc-coverage.sh — machine-enforces ADSD §3 mandate: - Inv 1: zh ⟺ en parity (parallel filenames) - Inv 2: reference/*.md have YAML frontmatter - Inv 3: ADR files zero-padded monotonic Bug caught by Inv 2 on first run: failure-modes-catalogue.md was using markdown heading frontmatter (`# Title`) instead of YAML (`---`). Fixed in same commit — F20 systemic prevention worked immediately (one of those rare moments where the gate catches its target on commit 1, validating the enforcement layer). README.md updated with §Documentation section pointing at new docs + §"F1–F18" reference updated to "F1–F21" matching current catalogue. Co-Authored-By: Claude Opus 4.7 (session 4bb35f43) --- README.md | 27 ++- docs/agent/conventions.md | 185 ++++++++++++++++++ docs/human/en/concept-map.md | 149 ++++++++++++++ docs/human/en/getting-started.md | 146 ++++++++++++++ docs/human/zh/concept-map.md | 149 ++++++++++++++ docs/human/zh/getting-started.md | 146 ++++++++++++++ .../reference/failure-modes-catalogue.md | 10 + scripts/doc-coverage.sh | 130 ++++++++++++ 8 files changed, 941 insertions(+), 1 deletion(-) create mode 100644 docs/agent/conventions.md create mode 100644 docs/human/en/concept-map.md create mode 100644 docs/human/en/getting-started.md create mode 100644 docs/human/zh/concept-map.md create mode 100644 docs/human/zh/getting-started.md create mode 100755 scripts/doc-coverage.sh diff --git a/README.md b/README.md index d8e73f4..7b9abbf 100644 --- a/README.md +++ b/README.md @@ -100,11 +100,36 @@ agent-driven-development/ ## Quick start (after install) 1. Read [`SKILL.md`](./plugins/adsd/skills/agent-driven-development/SKILL.md) for the full methodology (~36 KB, 30 min). -2. Read [`reference/failure-modes-catalogue.md`](./plugins/adsd/skills/agent-driven-development/reference/failure-modes-catalogue.md) for the F1–F18 anti-patterns you'll likely hit. Don't re-derive them. +2. 
Read [`reference/failure-modes-catalogue.md`](./plugins/adsd/skills/agent-driven-development/reference/failure-modes-catalogue.md) for the F1–F21 anti-patterns you'll likely hit. Don't re-derive them. 3. Read [`case-study/cobrust-multi-agent-experience.md`](./plugins/adsd/skills/agent-driven-development/case-study/cobrust-multi-agent-experience.md) to see ADSD applied in practice (warts and all). 4. Copy the relevant template from [`templates/`](./plugins/adsd/skills/agent-driven-development/templates/) into your project's `docs/agent/` tree. 5. Start writing ADRs as decisions actually happen — not speculatively. +## Documentation + +User-facing docs are in [`docs/human/`](./docs/human/) (zh + en parallel per ADSD §3 bilingual mandate). Agent-facing meta-conventions for this repo are in [`docs/agent/`](./docs/agent/). + +### Bilingual user docs + +| Topic | 中文 | English | +|---|---|---| +| Getting started — 30-min onboarding | [`docs/human/zh/getting-started.md`](./docs/human/zh/getting-started.md) | [`docs/human/en/getting-started.md`](./docs/human/en/getting-started.md) | +| Concept map — mermaid diagrams + narrative | [`docs/human/zh/concept-map.md`](./docs/human/zh/concept-map.md) | [`docs/human/en/concept-map.md`](./docs/human/en/concept-map.md) | + +### Agent-facing meta-conventions + +- [`docs/agent/conventions.md`](./docs/agent/conventions.md) — repo structure, frontmatter contracts, bilingual mandate enforcement, commit message format, identity hygiene (F21) + +### Doc coverage gate + +`scripts/doc-coverage.sh` enforces ADSD §3 bilingual mandate on this repo itself: every `docs/human/zh/*.md` MUST have a parallel `docs/human/en/*.md`. Run locally before commits: + +```sh +bash scripts/doc-coverage.sh +``` + +The script also verifies reference files have YAML frontmatter and ADR files are zero-padded monotonic. This closes ADSD §3 mandate as F20 systemic prevention applied to ADSD itself. 
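To make the parity rule concrete, here is a minimal Python sketch of the zh/en filename check only. It is an illustration, not a port of `scripts/doc-coverage.sh` (whose contents are not shown here): the function name `parity_gaps` and the return shape are assumptions, while the `docs/human/zh` and `docs/human/en` paths come from the text above.

```python
from pathlib import Path


def parity_gaps(root: Path) -> dict[str, list[str]]:
    """Compare *.md filenames under docs/human/zh and docs/human/en.

    Hypothetical helper: returns the filenames missing on each side.
    Filenames are compared case-sensitively, matching the stated rule.
    """
    zh = {p.name for p in (root / "docs" / "human" / "zh").glob("*.md")}
    en = {p.name for p in (root / "docs" / "human" / "en").glob("*.md")}
    return {
        "missing_en": sorted(zh - en),  # present in zh, absent in en
        "missing_zh": sorted(en - zh),  # present in en, absent in zh
    }
```

A caller could treat any non-empty list as a gate failure, for example by exiting non-zero in CI.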
+ ## Origin ADSD was extracted from the [Cobrust](https://github.com/Cobrust-lang/cobrust) diff --git a/docs/agent/conventions.md b/docs/agent/conventions.md new file mode 100644 index 0000000..651a46c --- /dev/null +++ b/docs/agent/conventions.md @@ -0,0 +1,185 @@ +--- +name: ADSD repo conventions +description: Meta-conventions for this repo itself. ADSD codifies how to manage AI-agent projects; this file applies ADSD discipline to the ADSD methodology itself. Agents contributing to this repo should read this first. +type: convention +version: 1.0.0 +date: 2026-05-12 +status: active +relates_to: [SKILL.md §"Documentation Discipline", README.md §Contributing, CONTRIBUTING.md] +--- + +# ADSD repo conventions + +> ADSD is a methodology for managing AI-agent software projects. This repo IS such a project. Therefore, **ADSD applies to ADSD**. This file captures the meta-conventions specific to this repo's contributors (humans and AI agents). + +## Repo structure (binding) + +``` +agent-driven-development/ +├── .claude-plugin/marketplace.json # Plugin marketplace catalog +├── plugins/ +│ └── adsd/ +│ └── skills/ +│ └── agent-driven-development/ # The skill — auto-loaded by Claude Code +│ ├── SKILL.md # Main methodology (~36 KB) +│ ├── reference/ # Deep-dive references (F-patterns, evals, prompts, etc.) +│ ├── case-study/ # Founding case study (Cobrust N=1) +│ └── templates/ # Templates for ADR / finding / dispatch / snapshot / handoff +├── docs/ +│ ├── human/ +│ │ ├── zh/ # Chinese user docs — 与 en 一一对应 +│ │ └── en/ # English user docs — 1:1 parity with zh +│ └── agent/ +│ ├── conventions.md # This file +│ ├── adr/ # Meta-ADRs for ADSD itself +│ └── findings/ # Findings about ADSD's evolution +├── scripts/ +│ └── doc-coverage.sh # Enforces zh+en parity per ADSD §3 mandate +├── CONTRIBUTING.md # Human-facing contribution guide +├── LICENSE-APACHE / LICENSE-MIT +└── README.md # Entry point +``` + +**Binding constraints**: + +1. 
Every file in `docs/human/zh/` MUST have a parallel file at `docs/human/en/` with the same filename. Enforced by `scripts/doc-coverage.sh`. +2. Every reference under `plugins/adsd/skills/agent-driven-development/reference/` MUST have YAML frontmatter (`name`, `description`, `type`, `version`, `date`, `status`, `relates_to`). +3. The SKILL.md `description` field is the auto-activation trigger — keep it keyword-dense and specific. +4. ADRs in `docs/agent/adr/` are zero-padded sequential (`0001-*.md`, `0002-*.md`, ...). Once accepted, an ADR is immutable; supersede via a new ADR. + +## Frontmatter contracts + +### Reference files (in `plugins/adsd/skills/agent-driven-development/reference/`) + +```yaml +--- +name: +description: +type: reference +version: +date: +status: active | deprecated | candidate +relates_to: [skill:SKILL.md §section, reference:other-file.md, ...] +--- +``` + +### Meta-ADRs (in `docs/agent/adr/`) + +```yaml +--- +doc_kind: adr +adr_id: +title: +status: proposed | accepted | superseded | deprecated +date: +last_verified_commit: +supersedes: [, ...] +superseded_by: [, ...] +relates_to: [, , ...] +--- +``` + +### Findings (in `docs/agent/findings/`) + +```yaml +--- +doc_kind: finding +finding_id: +last_verified_commit: +status: open | closed | partial +discovered_by: +dependencies: [adr:, finding:, ...] +--- +``` + +## Bilingual docs mandate (ADSD §3 dogfood) + +The skill's SKILL.md §3 mandates that every public item gets entries in: + +- `docs/human/zh/.md` +- `docs/human/en/.md` (1:1 parity) +- Agent-facing schema (in this repo: SKILL.md + reference/) + +This rule applies to ADSD itself. `scripts/doc-coverage.sh` enforces zh+en parity. + +**Operative checks** (run by `doc-coverage.sh`): + +1. Every `docs/human/zh/*.md` has a parallel `docs/human/en/*.md` +2. Every `docs/human/en/*.md` has a parallel `docs/human/zh/*.md` +3. Parallel files have identical filenames (case-sensitive) +4. 
(Future) Section headers are 1:1 between zh and en + +CI fails if any check fails. + +## When to add a new ADR vs amend SKILL.md vs add a finding + +| Change type | Where | Trigger | +|---|---|---| +| New methodology rule | `docs/agent/adr/NNNN-.md` | The rule affects ≥2 reference files, templates, or SKILL.md sections | +| Refine existing reference | edit the reference file directly + note in commit | Single-file refinement | +| Document an ADSD evolution event | `docs/agent/findings/.md` | Real-world ADSD use surfaced a gap or worked unexpectedly well | +| Update SKILL.md | edit SKILL.md + cross-reference an ADR if it's a binding rule | Adds a new "Part N" or modifies an existing one | +| New cross-pollination ref (Anthropic / OpenAI / other) | `plugins/.../reference/.md` | New industry pattern worth adopting | + +## When NOT to add an ADR + +- Bug fix in a reference doc (typo, broken link) +- Updating frontmatter date / last_verified +- Adding an example to an existing section +- Re-organizing within a single file +- Translation update (zh ⟵→ en sync) + +Per ADSD §"ADR vs Finding distinction": ADRs are forward-looking decisions; small refinements don't need them. 
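For the ADRs that do get written, the naming rule from the binding constraints (zero-padded, sequential `0001-*.md`, `0002-*.md`, ...) can be checked mechanically as well. The Python sketch below is illustrative only: `check_adr_names` is a hypothetical helper, and the four-digit prefix, lowercase slug pattern, "no gaps" reading of "sequential", and the skipping of underscore-prefixed files such as a template are all assumptions rather than documented behavior of `scripts/doc-coverage.sh`.

```python
import re

# Assumption: four-digit zero-padded prefix followed by a lowercase slug.
ADR_NAME_RE = re.compile(r"^(\d{4})-[a-z0-9-]+\.md$")


def check_adr_names(names: list[str]) -> list[str]:
    """Return naming problems for ADR filenames (empty list = all good)."""
    problems = []
    numbers = []
    for name in sorted(names):
        if name.startswith("_"):
            continue  # assumed convention: helper files like _template.md are not ADRs
        match = ADR_NAME_RE.match(name)
        if match is None:
            problems.append(f"bad name: {name}")
            continue
        numbers.append(int(match.group(1)))
    # Assumed reading of "sequential": numbering starts at 1 with no gaps or repeats.
    for expected, actual in enumerate(numbers, start=1):
        if actual != expected:
            problems.append(f"expected ADR {expected:04d}, found {actual:04d}")
            break
    return problems
```

Only the first numbering break is reported, since every later number would be flagged for the same underlying gap.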
+ +## Commit message format + +``` +(): [vX.Y.Z] +``` + +- ``: `feat`, `docs`, `fix`, `refactor`, `chore` +- ``: `skill`, `reference`, `case-study`, `templates`, `docs-zh`, `docs-en`, `meta` +- Include `[vX.Y.Z]` semver if the change is release-worthy + +Examples: + +``` +feat(reference): add evals-first-development.md (v1.2.0) +docs(zh): translate getting-started.md to match en parity +fix(skill): correct cross-reference path after plugin layout migration +chore(meta): bump SKILL.md description for trigger keyword coverage +``` + +Sign with session ID per F21 (Cross-session identity overload): + +``` +Co-Authored-By: Claude Opus 4.7 (session XYZ) +``` + +## Identity hygiene (F21 closure) + +Per F21 codification: + +- Do NOT sign as bare "review-claude" or "ADSD-author" in commits or files +- Use session-stamped attribution: `review-claude (session 4bb35f43)` or `ADSD-author (session XYZ)` +- Reserve plain handles for the abstract role in narrative prose only + +## Versioning policy + +- **v1.0.x** — initial release, plugin migration +- **v1.1.x** — F19/F20/F21 codification +- **v1.2.x** — cross-pollination references (Anthropic + OpenAI) +- **v1.3.x** — bilingual docs + remaining G3/G5/G6/G8/G9/G10/G12 gaps + +Semver bumps follow SemVer 2.0: + +- MAJOR: breaking change to skill format, plugin layout, or canonical paths +- MINOR: new reference file, new template, new ADR +- PATCH: refinement, typo, frontmatter update, translation sync + +## Cross-references + +- `CONTRIBUTING.md` — human-facing contribution flow +- `plugins/adsd/skills/agent-driven-development/SKILL.md` §"Documentation Discipline" — methodology origin of these rules +- `plugins/adsd/skills/agent-driven-development/reference/failure-modes-catalogue.md` F1 family + F19 + F20 + F21 — the failure modes these conventions prevent +- `scripts/doc-coverage.sh` — machine enforcement of zh+en parity diff --git a/docs/human/en/concept-map.md b/docs/human/en/concept-map.md new file mode 100644 index 
0000000..d0bb669 --- /dev/null +++ b/docs/human/en/concept-map.md @@ -0,0 +1,149 @@ +# ADSD concept map + +> Mermaid diagrams + short prose to unpack the full ADSD concept landscape at once. + +## Top-level view + +```mermaid +flowchart TB + Constitution[CLAUDE.md constitution] --> Decisions{Need a decision?} + Decisions -->|Yes, affects ≥2 files| ADR[ADR — decision record] + Decisions -->|No, single file| InCode[Just code it] + + Implementation[Implementation work] --> Failure{Did something break?} + Failure -->|Yes| Finding[Finding — failure record] + Failure -->|No| Continue[Continue] + + State[Project state] --> Snapshot[snapshot.md — state snapshot] + + ADR --> Sprint[Sprint = Wave + Tx] + Sprint --> Dispatch[Dispatch P9/P7 sub-agent] + + Dispatch --> Drating{D-Matrix assessment} + Drating -->|D0 doc-only| Sonnet[sonnet solo] + Drating -->|D1-D3 multi complexity| Pair[dev/test pair TDD] + Drating -->|D4 ADR| OpusSolo[opus solo, P9 personal] + Drating -->|D5 real LLM/consensus| OpusPair[opus dev + opus test] + + Pair --> CommitWave[Atomic commit + Wave merge] + OpusPair --> CommitWave + + CommitWave --> ReleaseGate{Release artifact?} + ReleaseGate -->|Yes| ReleaseReady[Release-readiness agent independent verify] + ReleaseReady -->|GO| Tag[git tag v0.X.Y] + ReleaseReady -->|BLOCK| Fix[fix-pack sprint] + Fix --> ReleaseReady +``` + +## Three abstraction layers (slow → fast) + +```mermaid +flowchart LR + Strategy[Strategy — month-scale] --> Tactical[Tactical — week-scale] + Tactical --> Execution[Execution — hour/day-scale] + + Strategy -.includes.-> Constitution[Constitution] & Wedge[Wedge / strategic direction] & Roadmap[Milestone roadmap] + Tactical -.includes.-> ADRs[ADRs] & Findings[Findings] & Waves[Waves] & PreMortem[Pre-mortem] + Execution -.includes.-> Dispatch[Sub-agent Dispatch] & Tx[Tx commits] & Gates[5-gate + 6th eval-gate] & Release[Release-readiness] + + Strategy -.through.- Tactical -.through.- Execution +``` + +- **Strategy layer**: CLAUDE.md 
rarely changes; month-scale decisions. Changing it = major project pivot. +- **Tactical layer**: ADR + Finding added weekly; milestone checkpoints. +- **Execution layer**: daily sprints, sub-agent dispatch, gate enforcement, atomic commits. + +## Failure modes (F1 Sediment Family) panorama + +```mermaid +flowchart TB + F1[F1 Sediment Family — declared-without-enforcement] --> F1_0[F1.0 schema invariant] + F1 --> F1_1[F1.1 snapshot HEAD freshness] + F1 --> F1_2[F1.2 ADR roster completeness] + F1 --> F16[F16 post-compaction identity drift] + F1 --> F17[F17 sub-agent KPI self-report] + F1 --> F18[F18 attribution policy scope] + F1 --> F19[F19 install-not-tested] + F1 --> F20[F20 constitution-vs-workflow] + F1 --> F21[F21 cross-session identity overload] + + F1_0 -.via.-> Enforce0[snapshot-lint Inv] + F1_1 -.via.-> Enforce1[pre-commit hook] + F16 -.via.-> Enforce16[auto-memory identity preamble] + F17 -.via.-> Enforce17[verification commands block in completion report] + F19 -.via.-> Enforce19[release-readiness agent in clean shell] + F20 -.via.-> Enforce20[D-matrix + dev/test pair workflow] + F21 -.via.-> Enforce21[session-ID stamping convention] +``` + +Each F-pattern has a corresponding enforcement mechanism. The F1 Family core lesson: **declaring rules isn't enough; you must have machine / workflow enforcement**. + +## Four-layer storage model (memory decision) + +```mermaid +flowchart TB + NewInfo{Where to write new info?} --> Type{What category?} + Type -->|Identity / operative rule / SOP| L1[L1 auto-memory
~/.claude/projects//memory/] + Type -->|Cross-file decision| L2A[L2 ADR
docs/agent/adr/] + Type -->|Failure / surprise / dead-end| L2B[L2 Finding
docs/agent/findings/] + Type -->|Project state fact| L2C[L2 Snapshot
project_state_snapshot.md] + Type -->|Mid-sprint working state| L3[L3 session scratch
notes in messages] + Type -->|Re-fetchable ephemeral output| L4[L4 ephemeral
don't store] + + L1 -.auto-load.-> Session[Session start] + L2A -.persists in.-> Repo[git history] + L2B -.persists in.-> Repo + L2C -.persists in.-> Repo + L3 -.persists until.-> SessionEnd[Session end] +``` + +When unsure, **default to L3 scratch**. Promotion to L1/L2 is a deliberate decision at sprint-end, not in-flight. + +## Dispatch protocol (dev/test pair pattern) + +```mermaid +sequenceDiagram + participant P9 as P9 Tech Lead + participant Test as P7 Test Agent + participant Dev as P7 Dev Agent + + P9->>P9: Assess D-rating (D1-D3 / D5 → pair) + P9->>Test: spawn (TDD step 1 — write failing test corpus) + Test-->>P9: [P7-TEST-CORPUS-READY] N tests, K fail + P9->>P9: review test corpus (10 min) + P9->>Dev: spawn (TDD dev step — implement + pass corpus) + Dev-->>P9: [P7-DEV-COMPLETION] cargo test 0 fail + P9->>P9: verify gate + atomic commit + P9-->>CTO: [P9-MILESTONE-COMPLETION] +``` + +**Why a separate test agent + dev agent is mandatory**: a single agent writing impl + test has confirmation bias — the test verifies what the agent intended, not what the spec demands. Separate test agent eliminates the bias. + +## Release closure (with release-readiness) + +```mermaid +flowchart LR + Code[Code Ready] --> Gate5[5-gate Green
fmt+clippy+build+test+doc-cov] + Gate5 --> Gate6[6th gate — Eval Delta Non-Regression] + Gate6 --> ReleaseFile[Edit Release Notes / README] + ReleaseFile --> ReleaseAgent[Spawn Release-readiness Agent
clean shell + curl + cargo install --dry-run] + ReleaseAgent --> Decision{GO or BLOCK?} + Decision -->|GO| Tag[git tag v0.X.Y] + Decision -->|BLOCK| Fix[Fix root cause] + Fix --> ReleaseAgent +``` + +**F19 closure key**: don't let the agent that wrote the docs self-verify the docs. **Independent release-readiness agent in a clean shell** is the only robust F19 defense. + +## Turning these diagrams into practice + +Each diagram is a "practice script": + +- Top-level view → follow this flow for a new project +- Three abstraction layers → team cadence, what to do daily/weekly/monthly +- F1 Family → consult this when you hit a wall, find the missing enforcement +- Storage four-layer → consult the decision tree before writing +- Dispatch protocol → P9 follows this sequence when initiating a sprint +- Release closure → mandatory path before any tag + +See [`getting-started.md`](./getting-started.md) 5-step practice section to map these diagrams to concrete commands. diff --git a/docs/human/en/getting-started.md b/docs/human/en/getting-started.md new file mode 100644 index 0000000..71995bb --- /dev/null +++ b/docs/human/en/getting-started.md @@ -0,0 +1,146 @@ +# Getting started + +> **Goal**: in 30 minutes, an engineer unfamiliar with ADSD has the ADRs + findings + sub-agent dispatch discipline running on their own project. + +## Who should read this + +- You're managing a project with **multi-agent parallelism** (≥3 AI agents working concurrently) +- You want to avoid the multi-agent endemic ailments: sediment / drift / silent regression +- You already use Claude Code / Cursor / similar IDE-agent tools at a basic level +- You have a git project to apply this methodology to + +If you're writing a single-agent small script, ADSD is overkill. Skip. + +## 30-second overview + +ADSD is the multi-agent working discipline distilled from 9 weeks of Cobrust project, codifying: + +1. **Decision capture** — every cross-file decision becomes an ADR (Architecture Decision Record) +2. 
**Failure capture** — every "this broke / surprised / dead-ended" becomes a Finding (negative result) +3. **Dispatch discipline** — D0-D5 difficulty matrix + dev/test pair TDD protocol + +Plus **bilingual docs mandate** + **wave + Tx atomic commits** + **F1-F21 anti-pattern catalogue** + **release-readiness pre-publish independent verification**. That's the full picture. + +Detailed architecture: [`concept-map.md`](./concept-map.md) + +## Three install paths + +### Method 1 (recommended) — Claude Code plugin + +``` +/plugin marketplace add Cobrust-lang/agent-driven-development +/plugin install adsd@adsd +``` + +Once installed, when a prompt mentions "multi-agent dispatch / ADR drafting / F1-F21 failure modes" etc., Claude auto-activates the ADSD skill. + +### Method 2 — Personal skill directory (fallback) + +```sh +mkdir -p ~/.claude/skills +git clone --depth 1 https://github.com/Cobrust-lang/agent-driven-development.git /tmp/adsd-src +cp -r /tmp/adsd-src/plugins/adsd/skills/agent-driven-development ~/.claude/skills/ +rm -rf /tmp/adsd-src +``` + +### Method 3 — Read-only (no install, just markdown) + +Read [`plugins/adsd/skills/agent-driven-development/SKILL.md`](https://github.com/Cobrust-lang/agent-driven-development/blob/main/plugins/adsd/skills/agent-driven-development/SKILL.md) top-to-bottom (~30 min) for the full methodology. No install required to learn. + +## First real use — 5 steps + +Assume you have a project at `~/my-project/` and want to start with ADSD. + +### Step 1: Create the project `CLAUDE.md` (constitution) + +Write a ~30-line project constitution at `~/my-project/CLAUDE.md` with at minimum: + +- **Project identity** — one-line pitch (what + who uses it) +- **What you keep** (good properties borrowed from other tools / languages / workflows) +- **What you drop** (explicit anti-patterns) +- **Engineering standards** — Elegant / Scientific / Efficient with 3-5 concrete rules each +- **Milestone roadmap** — M0 (scaffold) → M1 → ... 
6-12 months out + +Reference: ADSD's own SKILL.md "Engineering standards" section is a template. + +### Step 2: Create `docs/agent/` + `docs/human/{zh,en}/` skeleton + +```sh +cd ~/my-project +mkdir -p docs/agent/adr docs/agent/findings docs/agent/modules +mkdir -p docs/human/zh docs/human/en +``` + +Copy ADSD's `templates/adr-template.md` to `docs/agent/adr/_template.md` as your ADR drafting template. Same for finding-template, snapshot-template. + +### Step 3: Write ADR-0001 (license choice) + +Every project's first ADR is typically the license choice (Apache+MIT dual, or BSL-1.1, or ...). This is **the start of mandatory ADR flow** — one cross-multifile decision running through the complete process: Context → Options → Decision → Consequences → Cross-references. + +### Step 4: Build `MEMORY.md` index (Claude Code auto-memory) + +If you use Claude Code, project-level memory lives in `~/.claude/projects//memory/`. Create the `MEMORY.md` index with one-line hooks: + +``` +- [Project identity preamble](identity.md) — read first when resuming a session +- [Subagent model tier rule](subagent_tiers.md) — D0-D5 matrix per ADSD +- [CTO operations runbook](runbook.md) — dispatch SOPs +``` + +See [`reference/cross-session-memory-architecture.md`](https://github.com/Cobrust-lang/agent-driven-development/blob/main/plugins/adsd/skills/agent-driven-development/reference/cross-session-memory-architecture.md). + +### Step 5: First sub-agent dispatch (using ADSD D-matrix) + +Use Claude Code's Agent tool to dispatch a concrete task. **The prompt MUST include difficulty self-rating**: + +``` +DIFFICULTY-RATING: D2 (multi-fn stdlib API new, single crate, ADR clear) +MODEL-DEV: sonnet +MODEL-TEST: sonnet +PAIR: yes + +MISSION: implement such that all passes. 
+ +REQUIRED READS: +- /abs/path/to/ADR-0XXX.md +- /abs/path/to/test_corpus.rs +- see reference/prompt-engineering-patterns.md PT2 (few-shot output format) + +REPORT FORMAT: [P7-COMPLETION] with verification block (paste raw cargo test output, no paraphrase) +``` + +See [`reference/prompt-engineering-patterns.md`](https://github.com/Cobrust-lang/agent-driven-development/blob/main/plugins/adsd/skills/agent-driven-development/reference/prompt-engineering-patterns.md). + +## Verify you installed correctly + +Run these two checks: + +```sh +# 1. Verify plugin activated +/plugin status adsd + +# 2. In Claude Code, ask a question with ADSD keywords +"I need to plan a multi-agent dispatch, how do I use the D-matrix to assess difficulty?" +``` + +If Claude auto-references ADSD's `reference/` files, you installed correctly. If Claude answers from general knowledge, the skill didn't activate. + +## Next steps + +- Read [`concept-map.md`](./concept-map.md) for the complete ADSD concept diagram +- Read [`why-adsd.md`](./why-adsd.md) for motivation / value prop (if not yet convinced) +- When you hit a wall, write a finding. Don't hide it. F1-F21 catalogue is at [`reference/failure-modes-catalogue.md`](https://github.com/Cobrust-lang/agent-driven-development/blob/main/plugins/adsd/skills/agent-driven-development/reference/failure-modes-catalogue.md); you may have hit the same one + +## FAQ + +**Q: My project is small. Do I really need ADRs?** +A: Only for decisions affecting ≥2 files. Single-file modifications don't write ADRs. Bug fixes don't write ADRs (but do write findings). + +**Q: Bilingual docs feel burdensome?** +A: ADSD mandates this because it addresses the real "Chinese teams are natively multilingual" problem. Single-language projects can relax this; but the README + getting-started bilingual pair is recommended. + +**Q: D-matrix is tedious, do I need to evaluate every time?** +A: Manual evaluation for the first 5 times; after that it becomes muscle memory. 
Skipping costs you model-tier mismatch (F20 family) — projects that hit it think it's worth it. + +**Q: I use OpenAI not Anthropic?** +A: ADSD is LLM-agnostic. D-matrix / dev-test pair / evals-first are all vendor-neutral. The Claude Code plugin is just a distribution channel; the methodology itself doesn't bind to Anthropic. diff --git a/docs/human/zh/concept-map.md b/docs/human/zh/concept-map.md new file mode 100644 index 0000000..44a1265 --- /dev/null +++ b/docs/human/zh/concept-map.md @@ -0,0 +1,149 @@ +# ADSD 概念图 + +> 用 mermaid 图表 + 简短文字, 把 ADSD 全套概念一图打散. + +## 顶层视图 + +```mermaid +flowchart TB + Constitution[CLAUDE.md 宪法] --> Decisions{需要决策吗?} + Decisions -->|是, 跨 ≥2 文件| ADR[ADR — 决策记录] + Decisions -->|否, 单文件| InCode[就在代码里改] + + Implementation[实施工作] --> Failure{出问题了吗?} + Failure -->|是| Finding[Finding — 失败记录] + Failure -->|否| Continue[继续] + + State[项目状态] --> Snapshot[snapshot.md — 状态快照] + + ADR --> Sprint[Sprint = Wave + Tx] + Sprint --> Dispatch[Dispatch P9/P7 sub-agent] + + Dispatch --> Drating{D-Matrix 评估} + Drating -->|D0 doc-only| Sonnet[sonnet solo] + Drating -->|D1-D3 多复杂度| Pair[dev/test pair TDD] + Drating -->|D4 ADR| OpusSolo[opus solo, P9 亲笔] + Drating -->|D5 真 LLM/consensus| OpusPair[opus dev + opus test] + + Pair --> CommitWave[原子 commit + Wave merge] + OpusPair --> CommitWave + + CommitWave --> ReleaseGate{Release artifact?} + ReleaseGate -->|是| ReleaseReady[Release-readiness agent 独立验证] + ReleaseReady -->|GO| Tag[git tag v0.X.Y] + ReleaseReady -->|BLOCK| Fix[fix-pack sprint] + Fix --> ReleaseReady +``` + +## 三层抽象 (从慢到快) + +```mermaid +flowchart LR + Strategy[战略层 — 月级别] --> Tactical[战术层 — 周级别] + Tactical --> Execution[执行层 — 小时/天级别] + + Strategy -.包含.-> Constitution[宪法] & Wedge[Wedge / 战略方向] & Roadmap[Milestone 路线] + Tactical -.包含.-> ADRs[ADRs] & Findings[Findings] & Waves[Waves] & PreMortem[Pre-mortem] + Execution -.包含.-> Dispatch[Sub-agent Dispatch] & Tx[Tx commits] & Gates[5-gate + 6th eval-gate] & Release[Release-readiness] + + Strategy -.通过.- 
Tactical -.通过.- Execution +``` + +- **战略层**: CLAUDE.md 不常改, 月级别决策. 改 = 项目重大转向. +- **战术层**: ADR + Finding 每周新增, milestone 检查点. +- **执行层**: 每日 sprint, sub-agent 派活, gate 通过, atomic commit. + +## 失败模式 (F1 Sediment Family) 全景 + +```mermaid +flowchart TB + F1[F1 Sediment Family — declared-without-enforcement] --> F1_0[F1.0 schema invariant] + F1 --> F1_1[F1.1 snapshot HEAD freshness] + F1 --> F1_2[F1.2 ADR roster completeness] + F1 --> F16[F16 post-compaction identity drift] + F1 --> F17[F17 sub-agent KPI self-report] + F1 --> F18[F18 attribution policy scope] + F1 --> F19[F19 install-not-tested] + F1 --> F20[F20 constitution-vs-workflow] + F1 --> F21[F21 cross-session identity overload] + + F1_0 -.通过.-> Enforce0[snapshot-lint Inv] + F1_1 -.通过.-> Enforce1[pre-commit hook] + F16 -.通过.-> Enforce16[auto-memory identity preamble] + F17 -.通过.-> Enforce17[verification commands block in completion report] + F19 -.通过.-> Enforce19[release-readiness agent in clean shell] + F20 -.通过.-> Enforce20[D-matrix + dev/test pair workflow] + F21 -.通过.-> Enforce21[session-ID stamping convention] +``` + +每个 F-pattern 都有对应的 enforcement 机制. F1 Family 的核心 lesson: **声明规则不够, 必须有机器/工作流强制**. + +## 四层 storage 模型 (memory 决策) + +```mermaid +flowchart TB + NewInfo{新信息要写哪?} --> Type{是哪类?} + Type -->|身份 / 操作规则 / SOP| L1[L1 auto-memory
~/.claude/projects//memory/] + Type -->|跨文件决策| L2A[L2 ADR
docs/agent/adr/] + Type -->|失败 / 意外 / 死胡同| L2B[L2 Finding
docs/agent/findings/] + Type -->|项目状态事实| L2C[L2 Snapshot
project_state_snapshot.md] + Type -->|本 sprint 工作中| L3[L3 session scratch
消息中的笔记] + Type -->|可再 fetch 的临时输出| L4[L4 ephemeral
不存] + + L1 -.auto-load.-> Session[Session start] + L2A -.持续到.-> Repo[git history] + L2B -.持续到.-> Repo + L2C -.持续到.-> Repo + L3 -.持续到.-> SessionEnd[Session end] +``` + +不确定就**默认 L3 scratch**. 升级到 L1/L2 是 sprint 收尾时**主动决策**, 不在过程中. + +## Dispatch 协议 (dev/test pair pattern) + +```mermaid +sequenceDiagram + participant P9 as P9 Tech Lead + participant Test as P7 Test Agent + participant Dev as P7 Dev Agent + + P9->>P9: 评估 D-rating (D1-D3 / D5 → pair) + P9->>Test: spawn (TDD step 1 — 写 failing 测试集) + Test-->>P9: [P7-TEST-CORPUS-READY] N tests, K fail + P9->>P9: review test corpus (10 min) + P9->>Dev: spawn (TDD dev step — 实现 + 通过 corpus) + Dev-->>P9: [P7-DEV-COMPLETION] cargo test 0 fail + P9->>P9: verify gate + atomic commit + P9-->>CTO: [P9-MILESTONE-COMPLETION] +``` + +**为什么必须独立 test agent + dev agent**: 同一个 agent 写 impl + test 会有 confirmation bias — test 验证的是它自己想做的, 不是 spec 要求的. 独立 test agent 消除偏见. + +## Release 闭环 (含 release-readiness) + +```mermaid +flowchart LR + Code[Code Ready] --> Gate5[5-gate Green
fmt+clippy+build+test+doc-cov] + Gate5 --> Gate6[6th gate — Eval Delta Non-Regression] + Gate6 --> ReleaseFile[Edit Release Notes / README] + ReleaseFile --> ReleaseAgent[Spawn Release-readiness Agent
clean shell + curl + cargo install --dry-run] + ReleaseAgent --> Decision{GO or BLOCK?} + Decision -->|GO| Tag[git tag v0.X.Y] + Decision -->|BLOCK| Fix[Fix root cause] + Fix --> ReleaseAgent +``` + +**F19 闭环关键**: 不让写文档的 agent 自验文档. **独立 release-readiness agent 在 clean shell 跑** 是 F19 唯一 robust 防御. + +## 怎么把这些图变成实战 + +每张图都是一种"实战剧本": + +- 顶层视图 → 起新项目时按这条流程 +- 三层抽象 → 团队节奏感, 每天/每周/每月各做什么 +- F1 Family → 撞坑时查这张图, 哪个 enforcement 缺了 +- Storage 四层 → 写东西前对照决策树 +- Dispatch 协议 → P9 发起 sprint 时按此 sequence +- Release 闭环 → tag 前必走这条 path + +参考 [`getting-started.md`](./getting-started.md) 的 5 步实战, 把这些图落到具体命令. diff --git a/docs/human/zh/getting-started.md b/docs/human/zh/getting-started.md new file mode 100644 index 0000000..129190e --- /dev/null +++ b/docs/human/zh/getting-started.md @@ -0,0 +1,146 @@ +# 入门指南 + +> **目标**: 30 分钟内让一个不熟悉 ADSD 的工程师在自己项目里开始用 ADRs + findings + sub-agent 派活的规范. + +## 谁该读这份文档 + +- 你正在管理一个**多 agent 并行**的软件项目 (≥3 个 AI agent 同时干活) +- 你想避免 sediment / drift / silent regression 这些**多 agent 顽疾** +- 你已经会用 Claude Code / Cursor / 类似 IDE-agent 工具的基本操作 +- 你有一个 git 项目可以套这套方法论 + +如果你只是写一个单 agent 的小脚本, ADSD 是 overkill, 跳过. + +## 30 秒概览 + +ADSD 是从 Cobrust 项目 9 周实战提炼的**多 agent 工作纪律**, 把以下三件事做硬: + +1. **决策捕获** — 每个跨文件的决定都写 ADR (Architecture Decision Record) +2. **失败捕获** — 每次"翻车 / 意外 / 死胡同"都写 Finding (负向结果) +3. **派活有谱** — D0-D5 难度矩阵 + dev/test pair 的 TDD 派活协议 + +加上**双语文档强制** + **wave + Tx 原子提交** + **F1-F21 反模式目录** + **release-readiness 上线前独立验证**, 就是 ADSD 全貌. + +详细架构: [`concept-map.md`](./concept-map.md) + +## 三种安装方式 + +### 方式 1 (推荐) — Claude Code plugin + +``` +/plugin marketplace add Cobrust-lang/agent-driven-development +/plugin install adsd@adsd +``` + +装完后, 命中"multi-agent dispatch / ADR drafting / F1-F21 failure mode"等关键词时, Claude 会自动激活 ADSD skill. 
+ +### 方式 2 — 个人 skill 目录 (回退方案) + +```sh +mkdir -p ~/.claude/skills +git clone --depth 1 https://github.com/Cobrust-lang/agent-driven-development.git /tmp/adsd-src +cp -r /tmp/adsd-src/plugins/adsd/skills/agent-driven-development ~/.claude/skills/ +rm -rf /tmp/adsd-src +``` + +### 方式 3 — 只读 (不装, 看 markdown) + +直接读 [`plugins/adsd/skills/agent-driven-development/SKILL.md`](https://github.com/Cobrust-lang/agent-driven-development/blob/main/plugins/adsd/skills/agent-driven-development/SKILL.md), 30 分钟读完核心方法论. 不装也能学. + +## 第一次实战 — 5 步落地 + +假设你有一个项目 `~/my-project/`, 想开始用 ADSD. + +### 步骤 1: 创建项目 `CLAUDE.md` (宪法) + +在 `~/my-project/CLAUDE.md` 写下 30 行的项目宪法, 至少包含: + +- **项目身份** — 一行 pitch (是什么 + 谁用) +- **要保留的东西** (从其他语言 / 工具 / 工作流借鉴的良性属性) +- **要丢弃的东西** (明确反模式) +- **工程标准** — Elegant / Scientific / Efficient 各 3-5 条具体规定 +- **里程碑表** — M0 (脚手架) → M1 → ... 现在 + 未来 6-12 个月 + +参考: ADSD 自己的 SKILL.md "Engineering standards" 段是模板. + +### 步骤 2: 创建 `docs/agent/` + `docs/human/{zh,en}/` 目录骨架 + +```sh +cd ~/my-project +mkdir -p docs/agent/adr docs/agent/findings docs/agent/modules +mkdir -p docs/human/zh docs/human/en +``` + +把 ADSD 的 `templates/adr-template.md` 复制到 `docs/agent/adr/_template.md` 作为 ADR 起草模板. 同理 finding-template, snapshot-template. + +### 步骤 3: 写 ADR-0001 (license 选择) + +每个项目第一个 ADR 通常是 license 选择 (Apache+MIT dual, 或 BSL-1.1, 或 ...). 这是**强制走 ADR 流程**的开始 — 一次跨多文件的决定, 走完整流程: Context → Options → Decision → Consequences → Cross-references. + +### 步骤 4: 建立 `MEMORY.md` 索引 (Claude Code auto-memory) + +如果你用 Claude Code, 项目级 memory 在 `~/.claude/projects//memory/`. 
创建 `MEMORY.md` 索引, 一行一条: + +``` +- [Project identity preamble](identity.md) — read first when resuming a session +- [Subagent model tier rule](subagent_tiers.md) — D0-D5 matrix per ADSD +- [CTO operations runbook](runbook.md) — dispatch SOPs +``` + +详见 [`reference/cross-session-memory-architecture.md`](https://github.com/Cobrust-lang/agent-driven-development/blob/main/plugins/adsd/skills/agent-driven-development/reference/cross-session-memory-architecture.md). + +### 步骤 5: 第一次 sub-agent 派活 (用 ADSD D-matrix) + +用 Claude Code 的 Agent tool 派一个具体任务. **prompt 必须含 difficulty self-rating**: + +``` +DIFFICULTY-RATING: D2 (multi-fn stdlib API new, single crate, ADR clear) +MODEL-DEV: sonnet +MODEL-TEST: sonnet +PAIR: yes + +MISSION: 实现 使得 全部通过. + +REQUIRED READS: +- /abs/path/to/ADR-0XXX.md +- /abs/path/to/test_corpus.rs +- 见 reference/prompt-engineering-patterns.md PT2 (few-shot 输出格式) + +REPORT FORMAT: [P7-COMPLETION] with verification block (paste raw cargo test output, no paraphrase) +``` + +详见 [`reference/prompt-engineering-patterns.md`](https://github.com/Cobrust-lang/agent-driven-development/blob/main/plugins/adsd/skills/agent-driven-development/reference/prompt-engineering-patterns.md). + +## 验证你装对了 + +跑这两条命令: + +```sh +# 1. 验证 plugin 已激活 +/plugin status adsd + +# 2. 在 Claude Code 里问个问题, 含 ADSD 关键词 +"我需要 plan 一个 multi-agent dispatch, 怎么用 D-matrix 评估难度?" +``` + +如果 Claude 自动引到 ADSD 的 reference, 装对了. 如果 Claude 用通用知识回答, skill 没激活. + +## 下一步 + +- 读 [`concept-map.md`](./concept-map.md) 看 ADSD 完整概念图 +- 读 [`why-adsd.md`](./why-adsd.md) 看为什么需要这套方法论 (如果还没被说服) +- 撞坑了写 finding, 不要藏起来. F1-F21 catalogue 在 [`reference/failure-modes-catalogue.md`](https://github.com/Cobrust-lang/agent-driven-development/blob/main/plugins/adsd/skills/agent-driven-development/reference/failure-modes-catalogue.md), 你可能撞上同一个 + +## 常见问题 + +**Q: 我项目很小, 真的需要 ADR 吗?** +A: 跨 ≥2 文件的决定才写. 单文件修改不写. 修 bug 不写 (但写 finding). + +**Q: zh + en 双语文档负担太重?** +A: ADSD 强制是因为它解决了"中国团队天然 multi-lingual"的真实问题. 
单语项目可以放宽, 但 README + getting-started 双语建议保持. + +**Q: D-matrix 太繁琐, 我每次都得想一遍?** +A: 头 5 次手动评估, 之后就成肌肉记忆. 跳过的代价是 model tier 错配 (F20 family) — 真撞坑过的项目觉得值. + +**Q: 我用 OpenAI 不用 Anthropic?** +A: ADSD 是 LLM-agnostic. D-matrix / dev-test pair / evals-first 都 vendor-neutral. Claude Code plugin 部分只是发行渠道, 方法论本身不绑 Anthropic. diff --git a/plugins/adsd/skills/agent-driven-development/reference/failure-modes-catalogue.md b/plugins/adsd/skills/agent-driven-development/reference/failure-modes-catalogue.md index e3f2c9a..402a563 100644 --- a/plugins/adsd/skills/agent-driven-development/reference/failure-modes-catalogue.md +++ b/plugins/adsd/skills/agent-driven-development/reference/failure-modes-catalogue.md @@ -1,3 +1,13 @@ +--- +name: ADSD failure modes catalogue (F1-F21) +description: Concrete failure modes encountered in real ADSD projects with empirical evidence, root cause analysis, recovery patterns, and prevention mechanisms. F1 Sediment Family + F2-F21 individual entries. Add F22+ as your project hits new failure modes. 
+type: reference +version: 1.2.0 +date: 2026-05-12 +status: active +relates_to: [skill:SKILL.md §"Failure modes catalogue", case-study:cobrust-multi-agent-experience.md, reference:evals-first-development.md, reference:context-window-strategy.md, reference:cross-session-memory-architecture.md] +--- + # Failure modes catalogue > Concrete failure modes encountered in real ADSD projects, with diff --git a/scripts/doc-coverage.sh b/scripts/doc-coverage.sh new file mode 100755 index 0000000..26c23ea --- /dev/null +++ b/scripts/doc-coverage.sh @@ -0,0 +1,130 @@ +#!/usr/bin/env bash +# scripts/doc-coverage.sh — ADSD repo doc-coverage gate +# +# Enforces ADSD §3 documentation mandate on this repo itself: +# - Every docs/human/zh/*.md has a parallel docs/human/en/*.md (and vice versa) +# - Parallel files have matching filenames +# - Reference files in plugins/adsd/skills/agent-driven-development/reference/ +# have YAML frontmatter +# +# Exits non-zero on coverage failure. Pre-commit hook + CI both should run this. + +set -euo pipefail + +REPO_ROOT="${1:-$(git rev-parse --show-toplevel)}" +cd "$REPO_ROOT" + +# Color output (skip if not a TTY) +if [ -t 1 ]; then + RED='\033[0;31m' + GREEN='\033[0;32m' + YELLOW='\033[1;33m' + NC='\033[0m' +else + RED='' GREEN='' YELLOW='' NC='' +fi + +errors=0 + +echo "ADSD doc-coverage gate" +echo "----------------------" + +# ---------------------------------------------------------------------------- +# Inv 1: docs/human/zh/ ⟺ docs/human/en/ parity +# ---------------------------------------------------------------------------- +echo "" +echo "[Inv 1] Bilingual parity (zh ⟺ en)" + +if [ ! -d docs/human/zh ] || [ ! 
-d docs/human/en ]; then + echo -e " ${YELLOW}Warning: docs/human/{zh,en} missing; skipping parity check${NC}" +else + # Build sorted lists + zh_files=$(find docs/human/zh -maxdepth 2 -name '*.md' -exec basename {} \; | sort) + en_files=$(find docs/human/en -maxdepth 2 -name '*.md' -exec basename {} \; | sort) + + # Diff zh against en + while IFS= read -r f; do + [ -z "$f" ] && continue + if [ ! -f "docs/human/en/$f" ]; then + echo -e " ${RED}error${NC}: docs/human/zh/$f has no parallel docs/human/en/$f" + errors=$((errors + 1)) + fi + done <<< "$zh_files" + + # Diff en against zh + while IFS= read -r f; do + [ -z "$f" ] && continue + if [ ! -f "docs/human/zh/$f" ]; then + echo -e " ${RED}error${NC}: docs/human/en/$f has no parallel docs/human/zh/$f" + errors=$((errors + 1)) + fi + done <<< "$en_files" + + if [ "$errors" -eq 0 ]; then + zh_count=$(echo "$zh_files" | grep -c . || true) + en_count=$(echo "$en_files" | grep -c . || true) + echo -e " ${GREEN}OK${NC}: $zh_count zh + $en_count en files, all parallel" + fi +fi + +# ---------------------------------------------------------------------------- +# Inv 2: reference files have YAML frontmatter +# ---------------------------------------------------------------------------- +echo "" +echo "[Inv 2] Reference file frontmatter" + +ref_dir="plugins/adsd/skills/agent-driven-development/reference" +if [ -d "$ref_dir" ]; then + # Count Inv 2 failures separately so the OK line reflects this invariant only + inv2_errors=0 + for f in "$ref_dir"/*.md; do + [ -f "$f" ] || continue + first_line=$(head -1 "$f") + if [ "$first_line" != "---" ]; then + echo -e " ${RED}error${NC}: $f missing YAML frontmatter (first line not '---')" + errors=$((errors + 1)) + inv2_errors=$((inv2_errors + 1)) + fi + done + + # Print OK only when no reference file failed the frontmatter check + if [ "$inv2_errors" -eq 0 ]; then + ref_count=$(find "$ref_dir" -name '*.md' | wc -l | tr -d ' ') + echo -e " ${GREEN}OK${NC}: $ref_count reference file(s) all have frontmatter" + fi +fi + +# ---------------------------------------------------------------------------- +# Inv 3: ADR files (if any) zero-padded monotonic +# 
---------------------------------------------------------------------------- +echo "" +echo "[Inv 3] ADR numbering (zero-padded monotonic)" + +adr_dir="docs/agent/adr" +if [ -d "$adr_dir" ]; then + adr_count=$(find "$adr_dir" -name '[0-9][0-9][0-9][0-9]-*.md' | wc -l | tr -d ' ') + if [ "$adr_count" -gt 0 ]; then + # Just verify each ADR filename starts with 4 digits + bad_count=$(find "$adr_dir" -name '*.md' -not -name '_*' \ + | grep -cv '/[0-9][0-9][0-9][0-9]-' || true) + if [ "$bad_count" -gt 0 ]; then + echo -e " ${RED}error${NC}: $bad_count ADR file(s) not zero-padded 4-digit prefixed" + errors=$((errors + 1)) + else + echo -e " ${GREEN}OK${NC}: $adr_count ADR file(s) properly numbered" + fi + else + echo -e " ${YELLOW}info${NC}: no ADRs yet (acceptable for a fresh repo)" + fi +else + echo -e " ${YELLOW}info${NC}: docs/agent/adr/ doesn't exist (acceptable)" +fi + +# ---------------------------------------------------------------------------- +# Summary +# ---------------------------------------------------------------------------- +echo "" +echo "----------------------" +if [ "$errors" -eq 0 ]; then + echo -e "${GREEN}doc-coverage: PASS${NC}" + exit 0 +else + echo -e "${RED}doc-coverage: FAIL ($errors errors)${NC}" + exit 1 +fi From 89a09f89555ed5fe424a8537ddd23fde94f1c0d8 Mon Sep 17 00:00:00 2001 From: wbj010101 Date: Tue, 12 May 2026 10:32:47 +0800 Subject: [PATCH 05/16] docs(catalogue): codify F22 (coverage-fix-cadence mitigation validated) + F23-A (oracle-without-verify confirmed) + F23-B (distribution drift candidate) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Three new F-pattern entries codified from Cobrust LC-100 Tier A stress sweep empirical evidence (2026-05-12): F22 — Coverage drive without bug-fix cadence (F1 family, suppression sub-form). ADR-0047 LeetCode coverage strategy was authored as the explicit F22 mitigation. 
Empirical validation: P9 + review-claude recommended Option H (fix-pack first) at 77/100 rather than ramping to Tier B 500-题 with the same defect distribution. Option H closed at 99/100 stable; ADR-0047 SKIP-at-90% triggered. F22 NOT fired because the mitigation existed and was followed — reverse-evidence case. F23-A — Oracle authorship without independent verification (F1 family, oracle-verify sub-form, confirmed). LC-100 Tier A surfaced 23 initial failures; 15/23 = 65% were test corpus oracle defects (not language gaps): coin-change DP mistraces, BFS level off-by-one, Roman-to-int arithmetic errors, climbing-stairs base-case off-by-one. All derivable by running reference Python implementations. Codified mitigation: ADR-0047a verify.py mandate (every Tier B program ships verify.py reference impl that runs against test.toml before DEV phase). F23-B — Synthetic stress test distribution drift from real-world (F1 family, distribution-coverage sub-form, candidate UNMEASURED). Predicted but awaits empirical measurement post-T1.1 real-LLM E2E on msgpack/dateutil/requests/click. Hypothesis: pattern overlap between LC-100 synthetic distribution and real Python lib translation will be < 60%. Promotion to confirmed when overlap measurement lands. Catalogue now F1-F23 (23 entries; F1 Sediment Family has 9+ sub-forms). 
Co-Authored-By: Claude Opus 4.7 (session 4bb35f43) --- .../reference/failure-modes-catalogue.md | 199 ++++++++++++++++++ 1 file changed, 199 insertions(+) diff --git a/plugins/adsd/skills/agent-driven-development/reference/failure-modes-catalogue.md b/plugins/adsd/skills/agent-driven-development/reference/failure-modes-catalogue.md index 402a563..39284a7 100644 --- a/plugins/adsd/skills/agent-driven-development/reference/failure-modes-catalogue.md +++ b/plugins/adsd/skills/agent-driven-development/reference/failure-modes-catalogue.md @@ -1288,6 +1288,205 @@ This convention applies to any AI agent role that produces persistent artifacts --- +## F22 — Coverage drive without bug-fix cadence (mitigation pattern validated, F1 Sediment Family suppression sub-form) + +> **F1 sub-form, candidate → validated-as-suppressed**. F22 is the negative pattern an ADSD project hits when it scales a stress-test corpus (N → 5N → 10N programs) without applying fix-pack between scales. ADR-0047 (LeetCode coverage strategy) was authored as the explicit F22 mitigation, and the LC-100 → Option H decision was the empirical validation that the mitigation works. + +### Definition + +The temptation to run "all 3816 LeetCode problems" / "all 500 test cases" / "all N stress-test inputs" *before* triaging and fixing bugs from the first batch. The result: each subsequent batch hits the same N bug-patterns as the first, multiplying the surface defect count without surfacing new failure modes. Bug-pattern density per batch saturates after ~100 programs; the next 3700 are mostly re-discovery of the same gaps. + +### Symptoms + +- A coverage-drive sprint exits with 3000+ test programs but only 5-7 distinct bug-patterns +- Triage time grows quadratically with batch size (more programs to classify into same patterns) +- "Pass rate" stays roughly flat across scales (e.g. 
77% at N=100 stays 75-80% at N=500 absent fix-pack) +- Fix-pack debt accumulates: each unfixed pattern blocks ~N/k programs per round, where k is roughly the number of distinct patterns + +### Root cause + +Coverage-as-throughput optimism: the assumption that running more cases surfaces more bugs. In practice, the bug-pattern distribution is heavy-tailed: the first ~100 programs of any reasonable sample surface ~80% of patterns. Continuing past saturation is re-discovery. + +ADR-0047 codified the **ramp gate**: pass rate < 70% → HOLD (fix-pack), 70-90% → conditional GO (fix-pack-OR-ramp evidence-driven), ≥ 90% → SKIP back to other work (gap-saturated, no Tier B ROI). + +### Evidence + +**LC-100 Tier A discovery sweep (2026-05-12)**: P9 opus + 4 P7 sonnet TDD pairs ran 100 programs across 10 algorithm categories. Initial result 77/100 with 3 distinct failure patterns (Pattern A codegen rodata literals, Pattern B list[str] type gap, Pattern C test corpus oracle defects). + +**ADR-0047 ramp logic predicted**: 77% falls in the conditional zone, so the choice was Option G (immediate Tier B) vs Option H (fix-pack first). P9 + review-claude both recommended Option H based on the F22 mitigation principle: don't ramp the same defect distribution to 5N scale. + +**Option H executed** at commits `2d952e0` (Sprint 1 Pattern C fix, +15 programs) + `2a8bdc0` (Sprint 2 Pattern A C-ABI fix, +7 programs) = 99/100 stable. Post-fix-pack pass rate 99/100 = 99% triggers ADR-0047's SKIP-back-to-W1 gate: Tier A is gap-saturated, Tier B has no ROI. + +**Validation**: F22 was NOT fired because the mitigation existed and was followed. The reverse evidence is counterfactual: had ADR-0047 not existed, P9 likely would have ramped to 500 programs and re-discovered the same 3 patterns at ~75-defect scale, wasting ~5-10× agent-time. 
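
The ramp gate quoted under "Root cause" can be sketched in a few lines. This is only an illustration under stated assumptions: the names `RampDecision` and `ramp_gate` are hypothetical (they do not come from ADR-0047), and the boundary handling (exactly 70% counts as conditional GO, exactly 90% counts as SKIP) is one reading of the "< 70%" and "≥ 90%" thresholds.

```python
from enum import Enum


class RampDecision(Enum):
    """Hypothetical labels for the three gate outcomes quoted from ADR-0047."""

    HOLD = "hold"                      # fix-pack first, then re-baseline
    CONDITIONAL_GO = "conditional_go"  # fix-pack or ramp, decided on evidence
    SKIP = "skip"                      # gap-saturated, no ROI in ramping


def ramp_gate(passed: int, total: int) -> RampDecision:
    """Map a batch result to a ramp decision.

    Assumed thresholds: < 70% HOLD, 70% up to (but excluding) 90%
    conditional GO, >= 90% SKIP.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= passed <= total:
        raise ValueError("passed must be between 0 and total")

    # Integer comparison avoids floating-point surprises at the boundaries.
    if passed * 100 < 70 * total:
        return RampDecision.HOLD
    if passed * 100 < 90 * total:
        return RampDecision.CONDITIONAL_GO
    return RampDecision.SKIP


if __name__ == "__main__":
    # The two LC-100 Tier A data points from the evidence above.
    print(ramp_gate(77, 100))  # initial sweep
    print(ramp_gate(99, 100))  # after the Option H fix-pack
```

With the numbers from the evidence, the initial 77/100 sweep lands in the conditional zone and the post-fix-pack 99/100 result lands in SKIP, which matches the decisions described above. A real gate would presumably also carry the time-cap and the fix-pack cost estimate; both are left out here to keep the sketch small.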
+ +### Rule of thumb + +> **Stress-test corpus growth (N → 5N) MUST be gated by current-batch pass rate.** +> +> Decision logic: +> - < 70% pass: HOLD; fix-pack the patterns surfaced; re-baseline at N before ramping +> - 70-90% pass: conditional GO with bug-fix-cost check — if fix-pack > 1 day, ramp anyway; if ≤ 1 day, fix first +> - ≥ 90% pass: SKIP — corpus is gap-saturated for this language area; ramping has no ROI + +Time-cap the discovery sweep at the same time: ADR-0047 capped Tier A at 1-2 day. Without a time-cap, F22 manifests as "ramp anyway because the test feels useful". The cap forces the gate decision. + +### Recovery + +If F22 has already fired (you ramped before fixing): + +1. **Triage all failures into pattern groups** (ADR-0047 Phase 3-style). Aim for 3-7 distinct patterns; if more, the test corpus is noisy. +2. **Identify the high-multiplicity patterns**: which 2-3 patterns account for ≥ 80% of failures? Fix those first. +3. **Re-baseline at the smaller scale (e.g. N) after the fix-pack**. Confirm pass rate ≥ 90% before considering further ramp. +4. **Document the cost lesson** as a finding — "we ramped to 5N before fix-pack and lost ~K agent-hours to repeated triage." + +### Prevention going forward + +For any future stress-test corpus design: + +1. **ADR the ramp strategy before generating the corpus** (per ADR-0047 template). Include the gate thresholds. +2. **Build the time-cap into the dispatch prompt**. P9 sub-agents that exceed cap MUST escalate, not auto-ramp. +3. **Track bug-pattern density per batch**. When it falls below ~1 new pattern per 50 programs, you've hit saturation. +4. **Reverse-evidence is real evidence**. When F22 doesn't fire, document the counterfactual cost saved. + +--- + +## F23-A — Oracle authorship without independent verification (F1 Sediment Family, oracle-verify sub-form) + +> **F1 sub-form, confirmed**. 
Same family as F1.1 (declared invariants without enforcement) and F17 (sub-agent KPI self-report fidelity), but specific to **the test oracle itself** rather than the implementation under test. The pattern: the agent authoring the test corpus mentally executes the algorithm and writes both the algorithm description AND the expected output. Without an independent verifier (a reference implementation), arithmetic / DP-trace / tree-encoding mistakes get encoded directly into the oracle — silently invalidating the test gate. + +### Definition + +A P7-TEST sonnet agent produces a test corpus (`test.toml` cases + algorithm paraphrase in README) by mental execution of the algorithm. The expected output field is the agent's mental computation result — no independent verification path runs. Bugs in the agent's mental execution become bugs in the oracle. + +The downstream effect: a P7-DEV agent's `solution.cb` may be algorithmically correct, but fails the oracle because the oracle itself is wrong. Triage misclassifies this as a "language gap" instead of a "test corpus defect", wasting language-implementation effort on a test-author mistake. + +### Symptoms + +- Algorithm-style stress-test corpus shows 15-30% failures concentrated in arithmetic / DP-trace / graph-traversal categories +- DEV agent's failing solutions look algorithmically reasonable on careful read +- Quick reference-implementation check (running the algorithm in Python by hand) confirms DEV output is correct and the oracle is wrong +- "Pattern: test corpus defects" emerges as a primary failure class in triage + +### Root cause + +Mental execution is unreliable for non-trivial algorithms. 
Even high-quality LLM agents have non-zero error rate when computing: +- DP transitions for sequences > ~10 elements +- BFS / DFS over trees with > ~5 levels +- Modular arithmetic chains +- Bit manipulation edge cases +- String parsing with escape sequences + +The author of the algorithm description and the author of the expected output are the same agent in the same session — confirmation bias guarantees the oracle agrees with the agent's mental model, not with reality. + +### Evidence + +**LC-100 Tier A failure triage**: 15 of 23 initial failures (65%) were oracle-authorship defects, not language gaps. Concrete examples (from `lc100-pattern-c-test-corpus-defects.md`): + +- coin-change DP: agent computed DP[5] = 2 mentally; actual algorithm returns 1 +- BFS level-count: agent encoded "depth = 3" for a tree where actual BFS returns 4 (off-by-one on root) +- Roman-to-int: agent's mental arithmetic on "MCMXCIV" yielded 1995 instead of 1994 (subtraction-rule miscount) +- Climbing-stairs: agent encoded fib(N+1) instead of fib(N) (off-by-one on base case) + +15 corrections were derivable post-hoc by running reference Python implementations against the same inputs. The author's mental execution had been the sole oracle source — no second pass. + +**Codified mitigation: ADR-0047a verify.py mandate** (2026-05-12). Every Tier B program must ship with a `verify.py` reference Python implementation that runs against the `test.toml` corpus and confirms the oracle before the DEV phase begins. + +### Rule of thumb + +> **The test oracle author MUST run an independent verification (different code path, ideally different agent) before declaring the corpus ready.** +> +> Concrete forms: +> 1. **Reference-implementation pattern (lightweight, default)**: P7-TEST authors a `verify.py` reference Python impl in the same sprint; runs it against test cases; commits only when all match. +> 2. 
**CPython differential pattern (heavyweight, for numerical / library translations)**: oracle is computed by an authoritative external implementation; agent encodes the input + the differential check, not the expected output. +> 3. **Hand-verified pattern (lowest scale, ≤ 5 cases)**: human reviewer hand-traces each case; works only at small N. + +For algorithm-style corpora (LeetCode shape), Form 1 (verify.py) is the empirically validated default. + +### Recovery + +When F23-A fires (oracle defects discovered post-hoc): + +1. **Triage**: separate corpus-defect failures from language-gap failures. The corpus-defect class shows DEV output looking algorithmically reasonable. +2. **Author reference impls** (Python, Rust, or pseudocode) for the affected cases; run them against the corpus. +3. **Fix the corpus, not the implementation**, for any case where reference impl confirms DEV output. +4. **Re-run the full corpus** post-fix; confirm pass rate change matches the corrected-defect count. + +### Prevention going forward + +In any future stress-test corpus dispatch: + +1. **Update dispatch templates**: P7-TEST prompt MUST include verify.py authoring as a step before test.toml finalization (per ADR-0047a pattern). +2. **Sprint exit gate**: `[P7-TEST-CORPUS-READY]` report MUST include per-program `verify_py_matches: yes/no` rows. +3. **CI extension (stretch)**: a release-readiness-style harness re-runs verify.py against test.toml at corpus-edit time, catching oracle drift between sprints. + +--- + +## F23-B — Synthetic stress test distribution drift from real-world (F1 Sediment Family, distribution-coverage sub-form) [CANDIDATE, UNMEASURED] + +> **F1 sub-form, candidate**. A stress-test corpus is hand-picked or algorithmically generated to exercise a specific surface (e.g. "10 algorithm categories × 10 programs each"). The resulting bug-pattern distribution may diverge from what real-world programs in the same language would surface. 
The corpus's coverage claim ("we tested 100 programs") may not generalize to "the language handles 100% of similar real-world programs." + +### Definition + +A discovery sweep's bug-distribution is a function of the corpus's input-distribution. If the corpus's distribution differs from production-distribution, the bug-set found is unrepresentative — both falsely confident (missing bugs that real programs would surface) and falsely alarming (surfacing bugs that real programs never trigger). + +For Cobrust LC-100: 10 algorithm categories × 10 paraphrased programs each is a synthetic distribution. Real-world Python programs (e.g. tomli, msgpack, dateutil) have very different structure — heavy on string parsing, library boilerplate, error-handling, less on DP/graph/numerical algorithms. + +### Symptoms (predicted, not yet validated) + +- Stress-test discovery surfaces N bug-patterns; real Python lib translation later surfaces M ≠ N bug-patterns +- Bug-pattern overlap between synthetic and real-world is < 70% +- "Pass rate at N synthetic programs ≥ 90%" does NOT imply "pass rate on real Python libs ≥ 90%" + +### Root cause + +Distribution mismatch: + +- **Synthetic-leaning bias**: algorithm-style problems exercise control flow + arithmetic + small data structures. Real Python programs exercise string manipulation + I/O + library interop more heavily. +- **Length distribution**: LeetCode programs typically 20-100 LOC. Real Python files are 200-2000 LOC with multi-module imports. +- **Error-handling absence**: algorithm-style problems usually have well-defined inputs; real programs need defensive error handling, validation, malformed input recovery. + +A 99/100 pass rate on synthetic corpus doesn't bound the failure rate on production-distribution programs. + +### Evidence + +**Unmeasured at LC-100 Tier A close (2026-05-12)**. 
Empirical validation requires running translated real Python libraries (T1.1 tomli, msgpack, dateutil) against the same Cobrust compiler that achieves 99/100 on LC-100, then comparing bug-pattern overlap. + +Hypothesis: pattern overlap will be < 60%. Real-Python translation will surface string-handling + library-interop bugs that LC-100 doesn't probe; LC-100 surfaces algorithmic-edge bugs that real Python rarely hits. + +The candidate becomes "confirmed F23-B" when this measurement happens. + +### Rule of thumb + +> **A stress-test pass rate is a function of the test corpus distribution. To bound real-world failure rates, run additional probes on the actual production distribution (or a sample of it).** +> +> Practical forms: +> 1. **Cross-distribution validation**: after a synthetic corpus closes, run a smaller (~10-30) real-distribution sample. Compare bug-pattern overlap. +> 2. **Real-distribution prioritization**: if real-world coverage is the goal, prioritize real-distribution corpus design over synthetic. +> 3. **Cite distribution explicitly**: marketing / release messaging "Cobrust passes N/M LeetCode" must qualify "(synthetic algorithm-style corpus; real Python lib translation rates vary)". + +### Recovery + +If F23-B is suspected (synthetic pass rate is high but real-world deployment has issues): + +1. **Build a real-distribution sample**: ~30 representative programs from production code or real libraries. +2. **Run against the same compiler**; classify failures. +3. **Pattern-overlap analysis**: which patterns appear in both? Which only in synthetic? Which only in real-world? +4. **Update marketing / release messaging** to cite the appropriate distribution. + +### Prevention going forward + +When designing future stress-test corpora: + +1. **Declare the corpus distribution in the dispatch ADR**. "10 algorithm categories × 10 programs each" is a synthetic-leaning distribution and must be acknowledged as such. +2. 
**Add a real-distribution sample at Phase 4** of any large coverage sweep. Even 10-20 real programs validate the synthetic pass rate's generalizability. +3. **Marketing copy must qualify**: "99/100 on synthetic algorithm corpus" not "99% language coverage." F8 (marketing overreach) prevention. + +### Status + +**Candidate**, awaiting empirical measurement post-T1.1 real-LLM E2E on msgpack / dateutil / requests / click. When pattern-overlap data lands, this entry promotes to confirmed. + +--- + ## Catalogue maintenance This catalogue is alive — add to it as you encounter new failure modes. From d00792eca1b991f07882ab419f5621929acf53e2 Mon Sep 17 00:00:00 2001 From: wbj010101 Date: Tue, 12 May 2026 11:03:52 +0800 Subject: [PATCH 06/16] docs(catalogue): codify F24 (primitive-as-everything-simulation, F1 family coverage-fidelity sub-form) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Discovered 2026-05-12 by user audit-by-skepticism of Cobrust LC-100 Tier A 99/100 pass rate. user observation: linked-list / tree / set problems all use list[i64] flat-array simulation; Cobrust has no source-level recursive struct support nor exposed stdlib generics for LinkedList/TreeNode/Set. F24 codifies the systemic anti-pattern: a stress-test corpus's pass rate measures algorithmic correctness on whatever data structure the implementation chose, not whether the implementation actually exercised the feature category claimed by the corpus design. 
Founding evidence: - examples/leetcode-stress/045-linked-list-palindrome/solution.cb — "store all values in an array, then two-pointer compare" - examples/leetcode-stress/047-merge-k-sorted-lists/solution.cb — "store all lists in a flat array, then selection-sort via K pointers" - examples/leetcode-stress/050-rotate-linked-list/solution.cb — "values in array, rotate by index" Mitigation: type-asserting pass condition + feature-category audit at P9 Phase 3 triage + counterfactual sample (1-2 programs deliberately using the claimed type per category to confirm it actually compiles). Recovery: explicit tech debt with pre-tag blocker codified in a follow-up ADR — pattern parallel to ADR-0045 user-traction milestone gate but at the per-category coverage surface. Catalogue now F1-F24 (24 entries; F1 Sediment Family has 10+ sub-forms). Composes with F19 (install-not-tested) — both are gaps between artifact claim and verified reality. Co-Authored-By: Claude Opus 4.7 (session 4bb35f43) --- .../reference/failure-modes-catalogue.md | 90 +++++++++++++++++++ 1 file changed, 90 insertions(+) diff --git a/plugins/adsd/skills/agent-driven-development/reference/failure-modes-catalogue.md b/plugins/adsd/skills/agent-driven-development/reference/failure-modes-catalogue.md index 39284a7..224b7b5 100644 --- a/plugins/adsd/skills/agent-driven-development/reference/failure-modes-catalogue.md +++ b/plugins/adsd/skills/agent-driven-development/reference/failure-modes-catalogue.md @@ -1487,6 +1487,96 @@ When designing future stress-test corpora: --- +## F24 — Stress-test pass via primitive-as-everything simulation (F1 Sediment Family, coverage-fidelity sub-form) + +> **F1 sub-form, confirmed**. Related to F23-A (oracle-without-verify) and F23-B (distribution drift) but distinct: F24 is about **what the implementation under test actually exercises**. 
The pass rate metric becomes semantically vacuous when programs route around a missing language feature using a primitive type (list as linked-list / dict as tree / list-as-stack-as-queue) — the language passes the test but doesn't actually implement the structure the test category claims to cover.
+
+### Definition
+
+A stress-test corpus organized by feature category (e.g. "10 linked-list problems / 10 tree problems / 10 hash-set problems") shows a high pass rate. But inspection of the actual program implementations reveals they all use a single primitive type (`list[i64]`, `array`) as the data backbone, simulating the richer category-named structure via index arithmetic or value-arrays. The language never actually compiled a real linked-list / tree / set type — the test passed via simulation, not via real coverage of the claimed feature category.
+
+### Symptoms
+
+- Stress-test categories named after data structures show high pass (e.g. "10/10 linked list", "9/10 binary tree")
+- All "linked list" .cb programs share a comment like `# Algorithm: store values in an array then two-pointer / index manipulate`
+- Tree problems use level-order index encoding (`parent = (i-1)/2`) on a flat array, not real tree nodes
+- Hash-set problems use dict-with-1-as-value, not a Set type
+- `grep -r 'struct.*Node\|struct.*Tree\|enum.*List' src/` returns nothing matching real recursive types
+
+### Root cause
+
+The corpus author (P7-TEST or human spec author) selects categories by their algorithmic shape ("LinkedList problems", "Binary Tree problems") but the corpus's pass condition is "expected stdout matches actual stdout" — which is achievable by **any** correct algorithm regardless of data structure. The cheapest correct implementation often routes through a primitive the language already supports, bypassing the structure the category implicitly claims.
+
+Without an explicit constraint "this category MUST use a recursive struct" or "this category MUST allocate K Tree nodes", the pass rate measures algorithmic correctness, not feature-category coverage.
+
+This is F1 family because: the coverage claim ("we tested 10 linked list problems") is declared, but no enforcement mechanism verifies the language actually exercised linked-list semantics. The declaration drifts from the enforced reality.
+
+### Evidence
+
+**Cobrust LC-100 Tier A close (2026-05-12, HEAD 459b820)**: 99/100 pass rate. Linked-list problems inspection:
+
+```cobrust
+# examples/leetcode-stress/045-linked-list-palindrome/solution.cb
+# Algorithm: store all values in an array, then two-pointer compare from both ends
+fn main() -> i64:
+    let vals = list_new(n)
+    # ... list_set / list_get loops, two-pointer arithmetic
+```
+
+```cobrust
+# examples/leetcode-stress/047-merge-k-sorted-lists/solution.cb
+# Algorithm: store all lists in a flat array, then selection-sort via K pointers
+fn main() -> i64:
+    let flat = list_new(10000)
+    let offsets = list_new(k + 1)
+    # ... index arithmetic, no Node struct
+```
+
+```cobrust
+# examples/leetcode-stress/050-rotate-linked-list/solution.cb
+# Algorithm: values in array, rotate by index
+```
+
+All linked-list programs use `list_new / list_set / list_get` flat-array simulation. Same pattern across the 10 linked-list + 10 tree + N hash-set programs.
+
+Cobrust language as of HEAD 459b820:
+- `grep -rE 'LinkedList|TreeNode|HashSet' crates/cobrust-stdlib/src/` returns Rust-side `HashSet` internal wrappers but **no source-level (`.cb`-visible) types** for LinkedList / Tree / Set
+- `grep -rE 'struct.*ref|recursive struct' crates/cobrust-types/src/` returns nothing matching source-level recursive struct support
+
+Conclusion: the 99/100 pass rate is **valid as algorithmic stress test** but **does not bound the language's recursive-type support**.
The two metrics diverge; the corpus's category names suggest coverage that the language did not actually achieve.
+
+### Rule of thumb
+
+> **Coverage claims by feature category MUST be verified at the implementation surface, not just the output surface.**
+>
+> Concrete forms:
+> 1. **Type-asserting pass condition**: corpus per-program asserts that the .cb solution uses the claimed type (e.g. `solution.cb` for LinkedList must contain `struct.*Node` or import the stdlib `LinkedList`). Static check at sprint exit gate.
+> 2. **Feature-category audit**: P9 Phase 3 triage explicitly inspects K random programs per category for primitive-simulation pattern. If > 50% use the same primitive, flag the category as "simulated, not really tested".
+> 3. **Counterfactual sample**: write 1-2 programs per category that DELIBERATELY use the claimed type. If they don't compile, the category was never really covered.
+
+For Cobrust LC-100: forms 1+2 should have fired during P9 Phase 3 triage. Recovery: track the gap as explicit tech debt with a pre-tag blocker (per ADR-0045 user-traction milestone gate pattern).
+
+### Recovery
+
+When F24 has fired (your stress-test passes mask a real coverage gap):
+
+1. **Document the tech debt explicitly**. Write a finding citing per-category simulation patterns observed. Cite specific .cb files.
+2. **Set a binding pre-tag gate**: the next major-version release (v0.X+1.0) MUST NOT ship until the simulated categories have real-type implementations. Codify in an ADR.
+3. **Dispatch the tech debt sprint**: design + implement the missing language features (recursive struct + ref semantics + stdlib LinkedList/Tree/Set generics) + retrofit a subset of programs (3-5 per category) to use the real types.
+4. **Re-baseline pass rate on retrofit subset**: confirm the language really compiles and runs the typed implementations. Pass rate on retrofit subset is the honest coverage metric.
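The type-asserting pass condition and the > 50% feature-category audit from the rule of thumb could be sketched as a small static check. This is a minimal illustration, not part of the Cobrust tooling: the category names, the regex patterns, the `struct Node:` spelling, and the helper names `uses_claimed_type` / `audit_category` are all assumptions made for the example. The only detail taken from the catalogue entry is that `.cb` comments start with `#`.

```python
import re

# Hypothetical mapping: corpus category -> pattern a solution must contain
# to count as exercising the claimed type. Patterns are illustrative only.
REQUIRED_PATTERNS = {
    "linked-list": re.compile(r"struct\s+\w*Node|LinkedList"),
    "tree": re.compile(r"struct\s+\w*Tree\w*|TreeNode"),
    "hash-set": re.compile(r"HashSet|\bSet\b"),
}


def uses_claimed_type(source: str, category: str) -> bool:
    """Return True if the solution source mentions the claimed type in code.

    Comment text (everything after '#') is dropped first, so a comment such as
    '# simulate a LinkedList with an array' does not count as real usage.
    """
    code = "\n".join(line.split("#", 1)[0] for line in source.splitlines())
    return REQUIRED_PATTERNS[category].search(code) is not None


def audit_category(solutions: dict[str, str], category: str,
                   threshold: float = 0.5) -> dict:
    """Flag a category when more than `threshold` of its programs avoid the type."""
    simulated = sorted(
        name for name, src in solutions.items()
        if not uses_claimed_type(src, category)
    )
    ratio = len(simulated) / len(solutions) if solutions else 0.0
    return {
        "category": category,
        "simulated": simulated,
        "ratio": ratio,
        "flagged": ratio > threshold,
    }
```

A sprint exit gate could call `audit_category` once per category and fail when `flagged` is true. A regex check of this kind is only a cheap first filter; the counterfactual sample (programs that deliberately use the claimed type and must compile) remains the stronger evidence.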
+ +### Prevention going forward + +For future stress-test corpus design: + +1. **Categorize by data structure constraint, not just by algorithm**. "10 programs that MUST use struct Node" not "10 programs about linked lists" — the difference is enforcement. +2. **Sprint exit gate per-category**: static analysis confirms each program in a category exercises the claimed feature. +3. **Cross-reference language ADRs**: if your corpus has a "tree" category, but the language doesn't have an ADR for recursive struct support, the category is fictional until that ADR lands. + +This pattern composes with F19 (install-not-tested): both reflect a gap between **what the artifact claims** and **what was actually verified**. F19 is on user-facing surface (install commands), F24 is on test-coverage surface (category claims). Both close by the same principle: independent verification of the claim against reality. + +--- + ## Catalogue maintenance This catalogue is alive — add to it as you encounter new failure modes. From 7244736c8b13724fd1877556eff2c94fb7d71bca Mon Sep 17 00:00:00 2001 From: wbj010101 Date: Tue, 12 May 2026 11:22:40 +0800 Subject: [PATCH 07/16] =?UTF-8?q?fix(framing):=20correct=20'9-week'=20over?= =?UTF-8?q?claim=20=E2=80=94=20Cobrust=20is=2012=20days=20wall-clock=20(F8?= =?UTF-8?q?=20closure)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit User audit-by-skepticism 2026-05-12 caught the F8 instance: ADSD repo described Cobrust as "9-week multi-agent Rust compiler project" across 6 public surfaces + 1 GitHub About. Empirical verification: git log first commit: 2026-04-30 16:23:41 (M0 bootstrap) git log HEAD: 2026-05-12 11:19:10 (ADR-0048 supplement) Wall-clock: 11 days = 1.6 weeks (rounded to "12 days") Commit count: 278 (was claimed as ~178 in SKILL.md origin section — also stale) Overclaim factor: ~5.6× (9 weeks claimed vs 1.6 weeks actual). 
Codified as F8 (marketing overreach without citation) — ADSD repo's own description committed F8 about its own founding project's duration. Honest framing landed in 6 surfaces + SKILL.md origin section + GitHub About (separate gh repo edit command queued): - README.md L3 (top hero) - README.md L137 (Origin section): "9-week run" → "10-day intensive run (first commit 2026-04-30 → v0.1.0 tag 2026-05-10)" - SKILL.md L11 (Distilled from): "~178 commits, ~24 hours" → "12 days wall-clock (2026-04-30 → 2026-05-12), ~278 commits, 48+ ADRs, 24+ findings, v0.1.0 + v0.1.1 + v0.1.2 shipped + α Phase F.2 in flight" - SKILL.md L898 (cross-references): "Cobrust 9-week run" → "Cobrust 12-day intensive run" - SKILL.md L923 (lineage): "~178 commits" → "~278 commits over 12 wall-clock days, 48+ ADRs, 24+ findings" - reference/failure-modes-catalogue.md F20 entry: "9 weeks of constitution mandate" → "12 days of constitution mandate fact-violated" - docs/human/en/getting-started.md L16: "9 weeks of Cobrust project" → "12 days of intensive Cobrust development (2026-04-30 → 2026-05-12, ~278 commits)" - docs/human/zh/getting-started.md L16 (parallel) GitHub About description update (separate `gh repo edit` step, queued). Lesson: ADSD's own description must dogfood F8 (cite or admit). The overclaim was likely a human-frame intuition ("feels like 9 weeks of work"; ADSD §3 reference for agent-velocity calibration codifies the inverse error — underclaiming agent time as 6-8 weeks). Both directions of the same anti-pattern: time framing without empirical anchor. 
Co-Authored-By: Claude Opus 4.7 (session 4bb35f43) --- README.md | 4 ++-- docs/human/en/getting-started.md | 2 +- docs/human/zh/getting-started.md | 2 +- plugins/adsd/skills/agent-driven-development/SKILL.md | 8 +++----- .../reference/failure-modes-catalogue.md | 2 +- 5 files changed, 8 insertions(+), 10 deletions(-) diff --git a/README.md b/README.md index 7b9abbf..ea52499 100644 --- a/README.md +++ b/README.md @@ -1,6 +1,6 @@ # Agent-Driven Software Development (ADSD) -> Methodology distilled from running a 9-week multi-agent Rust compiler +> Methodology distilled from running a 12-day multi-agent Rust compiler > project where AI agents wrote ≥ 70% of the code under human strategic > direction. @@ -134,7 +134,7 @@ The script also verifies reference files have YAML frontmatter and ADR files are ADSD was extracted from the [Cobrust](https://github.com/Cobrust-lang/cobrust) project, a Rust-implemented Python successor with an AI-native compiler. -Cobrust shipped its `0.1.0` stable tag on 2026-05-10 after a 9-week run +Cobrust shipped its `0.1.0` stable tag on 2026-05-10 after a 10-day intensive run (first commit 2026-04-30 → v0.1.0 tag 2026-05-10) with multiple parallel Claude agents (Opus 4.7 and Sonnet 4.6) coordinated via the methodology you'll find in [`SKILL.md`](./plugins/adsd/skills/agent-driven-development/SKILL.md). diff --git a/docs/human/en/getting-started.md b/docs/human/en/getting-started.md index 71995bb..9c576e9 100644 --- a/docs/human/en/getting-started.md +++ b/docs/human/en/getting-started.md @@ -13,7 +13,7 @@ If you're writing a single-agent small script, ADSD is overkill. Skip. ## 30-second overview -ADSD is the multi-agent working discipline distilled from 9 weeks of Cobrust project, codifying: +ADSD is the multi-agent working discipline distilled from 12 days of intensive Cobrust development (2026-04-30 → 2026-05-12, ~278 commits), codifying: 1. **Decision capture** — every cross-file decision becomes an ADR (Architecture Decision Record) 2. 
**Failure capture** — every "this broke / surprised / dead-ended" becomes a Finding (negative result) diff --git a/docs/human/zh/getting-started.md b/docs/human/zh/getting-started.md index 129190e..cc37cc0 100644 --- a/docs/human/zh/getting-started.md +++ b/docs/human/zh/getting-started.md @@ -13,7 +13,7 @@ ## 30 秒概览 -ADSD 是从 Cobrust 项目 9 周实战提炼的**多 agent 工作纪律**, 把以下三件事做硬: +ADSD 是从 Cobrust 项目 12 天密集开发实战 (2026-04-30 → 2026-05-12, ~278 commits) 提炼的**多 agent 工作纪律**, 把以下三件事做硬: 1. **决策捕获** — 每个跨文件的决定都写 ADR (Architecture Decision Record) 2. **失败捕获** — 每次"翻车 / 意外 / 死胡同"都写 Finding (负向结果) diff --git a/plugins/adsd/skills/agent-driven-development/SKILL.md b/plugins/adsd/skills/agent-driven-development/SKILL.md index 0f75a2a..acd68b6 100644 --- a/plugins/adsd/skills/agent-driven-development/SKILL.md +++ b/plugins/adsd/skills/agent-driven-development/SKILL.md @@ -8,9 +8,7 @@ description: ADSD methodology for managing multi-agent software projects where A > A methodology for managing software projects where the bulk of the work > is done by AI agents under human strategic direction. > -> **Distilled from**: Cobrust project, ~178 commits, ~24 hours of intense -> multi-agent work, 39 ADRs, 14 findings, 2 P0 codegen bugs found via -> organic stress test, 0.1.0-beta release plan. +> **Distilled from**: Cobrust project, **12 days wall-clock (2026-04-30 → 2026-05-12)**, ~278 commits, 48+ ADRs, 24+ findings, 2 P0 codegen bugs found via organic stress test, v0.1.0 + v0.1.1 + v0.1.2 shipped + α Phase F.2 in flight. > > **Status**: extracted 2026-05-10. Apply as-is or adapt; this is > battle-tested but not orthodoxy. @@ -895,7 +893,7 @@ Everything else is adaptable. 
## Cross-references (within this skill) -### Originals (distilled from Cobrust 9-week run) +### Originals (distilled from Cobrust 12-day intensive run) - Part 1 Topology details: `reference/role-topology.md` - Part 2 Two-phase dispatch deep dive: `reference/two-phase-dispatch.md` @@ -920,7 +918,7 @@ These five represent **5 of 12 v1.2.0 gap candidates** identified by review-clau This skill is distilled from the Cobrust project (2026-04-30 to 2026-05-10) — a Rust-implemented Python successor with AI-native -compiler. ~178 commits, 39 ADRs, 14 findings, 2 stress-test farms, +compiler. ~278 commits over 12 wall-clock days, 48+ ADRs, 24+ findings, 2 stress-test farms, 4 parallel-agent topology stress-tested at 4-way max. Patterns documented here passed the test of "did we hit this in production and did the fix work?". diff --git a/plugins/adsd/skills/agent-driven-development/reference/failure-modes-catalogue.md b/plugins/adsd/skills/agent-driven-development/reference/failure-modes-catalogue.md index 224b7b5..f75f801 100644 --- a/plugins/adsd/skills/agent-driven-development/reference/failure-modes-catalogue.md +++ b/plugins/adsd/skills/agent-driven-development/reference/failure-modes-catalogue.md @@ -1188,7 +1188,7 @@ d337cf0 tests(adr-0044): W2 Phase 3 LeetCode oracle-match corpus (TDD step 1) 0145e8b feat(examples): W2 Phase 3 — 10 LeetCode .cb programs (TDD dev step, ADR-0044 stdin/argv usage) ``` -The TDD step 1 commits land before TDD dev step commits in temporal order. **First executed test-first sprint after 9 weeks of constitution mandate.** F20 is closed for Cobrust via execution evidence, not just documentation. +The TDD step 1 commits land before TDD dev step commits in temporal order. **First executed test-first sprint after 12 days of constitution mandate fact-violated.** F20 is closed for Cobrust via execution evidence, not just documentation. 
### Rule of thumb From a737d95c76245f9d1025cf3da37b09d81c39a2bb Mon Sep 17 00:00:00 2001 From: wbj010101 Date: Tue, 12 May 2026 11:39:02 +0800 Subject: [PATCH 08/16] fix(v1.2.5): apply review-claude 3-agent audit findings A1+A2+A3 MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Post v1.2.4 (9-week→12-day F8 closure), 3 sub-agent audits identified 27 total defects across ADSD repo. This fix-pack closes all empirically- fixable items in one atomic commit. === A1 empirical-anchor audit (opus, 18 mismatches) === README.md: - L13 "~178 commits, 43 ADRs, 19 findings, 21 documented failure modes" → "~278 commits, 49 ADRs (0001..0048 + 0047a), 27 findings, 24 documented failure modes" (Cobrust ground truth at HEAD a2b3eab) - L137 v0.1.0-beta (05-10) vs v0.1.0 stable (05-11) disambiguated; "10-day intensive run" → "11-day intensive run (first commit 04-30 → v0.1.0 stable 05-11; v0.1.2 + α Phase F.2 followed)" - L77 directory tree "plugins/agent-driven-development/" → "plugins/adsd/" - L84 "F1-F21" → "F1-F24" case-study/cobrust-multi-agent-experience.md (heavily stale): - Frontmatter case_study_id terminal date 2026-05-10 → 2026-05-12 - duration: 11 days → 12 days wall-clock - title "11-day multi-agent build-up" → "12-day" - "~178 commits / 39 ADRs / 14 findings" → "~278 / 49 / 27" (3 sites: frontmatter L7 + §Project shape L32-34 + Numbers table L376-380) SKILL.md: - L920 "(2026-04-30 to 2026-05-10)" → "(2026-04-30 to 2026-05-12)" (resolved self-contradiction with L11) - L921 "~278 commits over 12 wall-clock days, 48+ ADRs, 24+ findings" → "~278 commits over 12 wall-clock days, 49 ADRs (0001..0048 + 0047a), 27 findings" (exact counts vs lower bounds) - L920 "AI-native compiler" → "LLM-driven translation pipeline" (consistent with ADR-0048 framing reframe applied to lineage description) === A2 F-pattern evidence audit (opus, 3 stale citations of 24 entries) === reference/failure-modes-catalogue.md: - F1.2 (L122-125): "0 hits" 
empirical claim time-stamped at 11th review (HEAD ~06df4b4, 2026-05-10) + 2026-05-12 note that specific grep is now stale but systemic pattern remains the F1.2 instance — recursive F1.2 - F18 (L1024-1033): direct quote not literally present; reframed as pattern description anchored to actual file (review-claude-handoff/ README.md §"Attribution policy") with paraphrase rather than verbatim - F21 (L1253-1257): "~2,800-line Cobrust Studio handoff" + §0.5.1/§12.8 unfindable; reframed to actual locatable artifacts (claude-desktop- integrated-handoff.md + docs/agent/conventions.md §"Identity hygiene") 21/24 F-pattern entries cleanly verified. === A3 mechanical audit (sonnet, 8 defects) === - 4 SKILL.md cross-reference broken links removed (role-topology.md, two-phase-dispatch.md, snapshot-discipline.md were referenced but never created — honest framing > placeholder pretense) - 2 docs/human/{zh,en}/getting-started.md L131 broken refs to why-adsd.md removed (file never created) - reference/evals-first-development.md L150,210 templates/eval-template.md references reframed as inline guidance (file split out deferred to v1.3.0) - reference/cost-monitoring-discipline.md L246 stale URL docs.anthropic.com/claude/docs/prompt-caching → platform.claude.com/docs/en/prompt-caching (HTTP 301 follow) zh+en parity (A3.2): CLEAN per scripts/doc-coverage.sh ✓ External URL audit (A3.3): 4/5 HTTP 200, 1 stale-but-redirecting URL fixed above; OpenAI platform.openai.com 403 inconclusive (Cloudflare bot block, DNS resolves — not flagged as broken). 
=== Verification === - bash scripts/doc-coverage.sh → PASS (Inv 1+2+3 all OK) - grep "178 commits" → 0 hits - grep "39 ADRs" → 0 hits - grep "14 findings" → 0 hits - intentional residual: "11-day intensive run" in README L137 (correctly describes 04-30 → v0.1.0 stable 05-11 span) 3 audit findings preserved at review-claude-handoff/findings/: - 2026-05-12-adsd-empirical-mismatch-audit-A1.md (audit-A1 opus) - 2026-05-12-adsd-f-pattern-evidence-audit-A2.md (audit-A2 opus) - 2026-05-12-adsd-mechanical-audit-A3.md (audit-A3 sonnet) review-claude own #21: a824d77 fix was scope-incomplete (caught 9-week phrase but missed numeric drift + case-study frozen snapshot + self-contradictions + broken refs). v1.2.5 closes the residue via 3-agent parallel audit + atomic synthesis. Pattern: empirical audit > manual phrase-grep. Co-Authored-By: Claude Opus 4.7 (session 4bb35f43) --- README.md | 12 +++--- docs/human/en/getting-started.md | 3 +- docs/human/zh/getting-started.md | 3 +- .../skills/agent-driven-development/SKILL.md | 7 +--- .../cobrust-multi-agent-experience.md | 28 ++++++------- .../reference/cost-monitoring-discipline.md | 2 +- .../reference/evals-first-development.md | 3 +- .../reference/failure-modes-catalogue.md | 40 ++++++++++++------- 8 files changed, 51 insertions(+), 47 deletions(-) diff --git a/README.md b/README.md index ea52499..41744c6 100644 --- a/README.md +++ b/README.md @@ -10,9 +10,9 @@ ## What this is ADSD is **not a framework**. It's a documented working style that survived -contact with reality: ~178 commits, ~2,611 tests, 43 ADRs, 19 findings, 21 -documented failure modes, 2 P0 codegen bugs caught via organic stress test, -and a 0.1.1 release shipped publicly. +contact with reality: ~278 commits, ~2,611 tests, 49 ADRs (0001..0048 + 0047a), +27 findings, 24 documented failure modes, 2 P0 codegen bugs caught via organic +stress test, and v0.1.2 stable shipped publicly + α Phase F.2 in flight. 
ADSD codifies the discipline that kept the multi-agent project coherent: ADRs as decision capture, findings as negative-result memory, bilingual @@ -74,14 +74,14 @@ agent-driven-development/ ├── .claude-plugin/ │ └── marketplace.json # Self-hosted single-plugin marketplace catalog ├── plugins/ -│ └── agent-driven-development/ # Plugin root (matches marketplace.json source) +│ └── adsd/ # Plugin root (matches marketplace.json source) │ ├── .claude-plugin/ │ │ └── plugin.json # Plugin manifest │ └── skills/ │ └── agent-driven-development/ # Skill — auto-discovered by Claude Code │ ├── SKILL.md # Main methodology document (~36 KB) │ ├── reference/ -│ │ └── failure-modes-catalogue.md # F1-F21 anti-patterns with empirical evidence +│ │ └── failure-modes-catalogue.md # F1-F24 anti-patterns with empirical evidence │ ├── case-study/ │ │ └── cobrust-multi-agent-experience.md # The founding case study (N=1) │ └── templates/ @@ -134,7 +134,7 @@ The script also verifies reference files have YAML frontmatter and ADR files are ADSD was extracted from the [Cobrust](https://github.com/Cobrust-lang/cobrust) project, a Rust-implemented Python successor with an AI-native compiler. -Cobrust shipped its `0.1.0` stable tag on 2026-05-10 after a 10-day intensive run (first commit 2026-04-30 → v0.1.0 tag 2026-05-10) +Cobrust shipped its `0.1.0-beta` tag on 2026-05-10 and `0.1.0` stable on 2026-05-11, after an 11-day intensive run (first commit 2026-04-30 → v0.1.0 stable tag 2026-05-11; v0.1.2 + α Phase F.2 followed) with multiple parallel Claude agents (Opus 4.7 and Sonnet 4.6) coordinated via the methodology you'll find in [`SKILL.md`](./plugins/adsd/skills/agent-driven-development/SKILL.md). diff --git a/docs/human/en/getting-started.md b/docs/human/en/getting-started.md index 9c576e9..7a7d3b5 100644 --- a/docs/human/en/getting-started.md +++ b/docs/human/en/getting-started.md @@ -128,8 +128,7 @@ If Claude auto-references ADSD's `reference/` files, you installed correctly. 
If ## Next steps - Read [`concept-map.md`](./concept-map.md) for the complete ADSD concept diagram -- Read [`why-adsd.md`](./why-adsd.md) for motivation / value prop (if not yet convinced) -- When you hit a wall, write a finding. Don't hide it. F1-F21 catalogue is at [`reference/failure-modes-catalogue.md`](https://github.com/Cobrust-lang/agent-driven-development/blob/main/plugins/adsd/skills/agent-driven-development/reference/failure-modes-catalogue.md); you may have hit the same one +- When you hit a wall, write a finding. Don't hide it. F1-F24 catalogue is at [`reference/failure-modes-catalogue.md`](https://github.com/Cobrust-lang/agent-driven-development/blob/main/plugins/adsd/skills/agent-driven-development/reference/failure-modes-catalogue.md); you may have hit the same one ## FAQ diff --git a/docs/human/zh/getting-started.md b/docs/human/zh/getting-started.md index cc37cc0..271ed37 100644 --- a/docs/human/zh/getting-started.md +++ b/docs/human/zh/getting-started.md @@ -128,8 +128,7 @@ REPORT FORMAT: [P7-COMPLETION] with verification block (paste raw cargo test out ## 下一步 - 读 [`concept-map.md`](./concept-map.md) 看 ADSD 完整概念图 -- 读 [`why-adsd.md`](./why-adsd.md) 看为什么需要这套方法论 (如果还没被说服) -- 撞坑了写 finding, 不要藏起来. F1-F21 catalogue 在 [`reference/failure-modes-catalogue.md`](https://github.com/Cobrust-lang/agent-driven-development/blob/main/plugins/adsd/skills/agent-driven-development/reference/failure-modes-catalogue.md), 你可能撞上同一个 +- 撞坑了写 finding, 不要藏起来. 
F1-F24 catalogue 在 [`reference/failure-modes-catalogue.md`](https://github.com/Cobrust-lang/agent-driven-development/blob/main/plugins/adsd/skills/agent-driven-development/reference/failure-modes-catalogue.md), 你可能撞上同一个 ## 常见问题 diff --git a/plugins/adsd/skills/agent-driven-development/SKILL.md b/plugins/adsd/skills/agent-driven-development/SKILL.md index acd68b6..21aed9c 100644 --- a/plugins/adsd/skills/agent-driven-development/SKILL.md +++ b/plugins/adsd/skills/agent-driven-development/SKILL.md @@ -895,9 +895,6 @@ Everything else is adaptable. ### Originals (distilled from Cobrust 12-day intensive run) -- Part 1 Topology details: `reference/role-topology.md` -- Part 2 Two-phase dispatch deep dive: `reference/two-phase-dispatch.md` -- Part 3 Snapshot discipline: `reference/snapshot-discipline.md` - Part 6 Full failure-modes catalogue: `reference/failure-modes-catalogue.md` - Templates: `templates/*.md` - Cobrust case study: `case-study/cobrust-multi-agent-experience.md` @@ -917,8 +914,8 @@ These five represent **5 of 12 v1.2.0 gap candidates** identified by review-clau ## Origin & lineage This skill is distilled from the Cobrust project (2026-04-30 to -2026-05-10) — a Rust-implemented Python successor with AI-native -compiler. ~278 commits over 12 wall-clock days, 48+ ADRs, 24+ findings, 2 stress-test farms, +2026-05-12) — a Rust-implemented Python successor with an LLM-driven +translation pipeline. ~278 commits over 12 wall-clock days, 49 ADRs (0001..0048 + 0047a), 27 findings, 2 stress-test farms, 4 parallel-agent topology stress-tested at 4-way max. Patterns documented here passed the test of "did we hit this in production and did the fix work?". 
diff --git a/plugins/adsd/skills/agent-driven-development/case-study/cobrust-multi-agent-experience.md b/plugins/adsd/skills/agent-driven-development/case-study/cobrust-multi-agent-experience.md index 560cebe..c222ada 100644 --- a/plugins/adsd/skills/agent-driven-development/case-study/cobrust-multi-agent-experience.md +++ b/plugins/adsd/skills/agent-driven-development/case-study/cobrust-multi-agent-experience.md @@ -1,14 +1,14 @@ --- -case_study_id: cobrust-multi-agent-2026-04-30-to-2026-05-10 -project: Cobrust (Rust-implemented Python successor + AI-native compiler) -duration: 11 days (~24 hours of intense agent work in final 36 hours) -human_time: ~6 hours (estimated, mostly strategic decisions + 守闸) +case_study_id: cobrust-multi-agent-2026-04-30-to-2026-05-12 +project: Cobrust (Rust-implemented Python successor + AI-native translation pipeline) +duration: 12 days wall-clock (2026-04-30 → 2026-05-12); main narrative covers Days 1-10 with Day 11+12 appendix events +human_time: ~6 hours (estimated, mostly strategic decisions + 守闸; expanded in Day 11+12 appendix) agent_time: ~80% of LOC produced -final_state: 0.1.0-beta release plan, ~178 commits, 39 ADRs, 14 findings +final_state: v0.1.0 + v0.1.0-beta + v0.1.0-beta.1 + v0.1.1 + v0.1.2 stable shipped + α Phase F.2 in flight; ~278 commits, 49 ADRs (0001..0048 + 0047a), 27 findings attribution_origin: review-claude window (third-party audit) --- -# Case study: Cobrust 11-day multi-agent build-up +# Case study: Cobrust 12-day multi-agent build-up This case study reports what worked and what failed in applying ADSD-flavor methodology to a real software project — Cobrust, a @@ -29,10 +29,10 @@ record of what ADSD prevents and what it doesn't. 
- **Goal at start**: Phase E (M0..M14) — language core + tooling - **Goal at day 11**: 0.1.0-beta public release with end-to-end Python library translation demo -- **Total commits**: ~178 -- **Total ADRs**: 39 (0001..0039 with some reservations) -- **Total findings**: 14 -- **Cumulative tests**: 2,541 passed / verified at HEAD `6008634` +- **Total commits**: ~278 (at HEAD `a2b3eab` 2026-05-12) +- **Total ADRs**: 49 files (0001..0048 + 0047a sub-numbered) +- **Total findings**: 27 +- **Cumulative tests**: 2,541 passed / verified at HEAD `6008634` (Day 8 anchor; current at HEAD ~2,611+, not re-baselined in case study) ## Topology actually used @@ -373,10 +373,10 @@ N = 5). | Metric | Value | |---|---| -| Total commits | ~178 | -| ADRs landed | 39 | -| Findings | 14 | -| Tests passing at HEAD | 2,541 | +| Total commits | ~278 (at HEAD `a2b3eab` 2026-05-12) | +| ADRs landed | 49 (0001..0048 + 0047a) | +| Findings | 27 | +| Tests passing at Day-8 anchor | 2,541 (current ~2,611+ not re-baselined) | | Test failures pre-cleanup | 2 (msgpack DoS + pyo3 compile) | | P0 codegen bugs found via stress-test farm | 2 | | Hours of human work (estimated) | ~6 | diff --git a/plugins/adsd/skills/agent-driven-development/reference/cost-monitoring-discipline.md b/plugins/adsd/skills/agent-driven-development/reference/cost-monitoring-discipline.md index da8b2e1..3dfecde 100644 --- a/plugins/adsd/skills/agent-driven-development/reference/cost-monitoring-discipline.md +++ b/plugins/adsd/skills/agent-driven-development/reference/cost-monitoring-discipline.md @@ -243,5 +243,5 @@ The F-pattern catalogue should include cost-anomaly as a diagnostic. 
Add to disp - `reference/prompt-engineering-patterns.md` PT7 — D-rating drives cost - `reference/evals-first-development.md` — eval delta lets you compare cost across optimizations - `reference/failure-modes-catalogue.md` — F12 (model output starvation), cost signal for diagnosis -- Anthropic prompt caching docs: https://docs.anthropic.com/claude/docs/prompt-caching +- Anthropic prompt caching docs: https://platform.claude.com/docs/en/prompt-caching - OpenAI structured outputs: https://platform.openai.com/docs/guides/structured-outputs diff --git a/plugins/adsd/skills/agent-driven-development/reference/evals-first-development.md b/plugins/adsd/skills/agent-driven-development/reference/evals-first-development.md index 048f733..1047be6 100644 --- a/plugins/adsd/skills/agent-driven-development/reference/evals-first-development.md +++ b/plugins/adsd/skills/agent-driven-development/reference/evals-first-development.md @@ -147,7 +147,7 @@ When a finding is discovered (e.g. `lc100-i8-i64-nested-if`), the **same sprint ## Concrete template -`templates/eval-template.md`: +Inline template (copy this into your project's `evals//REPORT.md` — a dedicated `templates/eval-template.md` may be split out in a future ADSD release): ``` --- @@ -207,6 +207,5 @@ oracle: - `SKILL.md` §"Wave + Tx commit tags" — eval delta is the 6th gate - `reference/failure-modes-catalogue.md` F19 (release install-not-tested) — eval-first is the systemic prevention - `reference/failure-modes-catalogue.md` F20 (constitution-vs-workflow) — eval-first IS the workflow that enforces "test-first" mandate -- `templates/eval-template.md` — runnable template per feature - Anthropic: https://www.anthropic.com/engineering (search "evals") - OpenAI Evals: https://github.com/openai/evals diff --git a/plugins/adsd/skills/agent-driven-development/reference/failure-modes-catalogue.md b/plugins/adsd/skills/agent-driven-development/reference/failure-modes-catalogue.md index f75f801..6153461 100644 --- 
a/plugins/adsd/skills/agent-driven-development/reference/failure-modes-catalogue.md +++ b/plugins/adsd/skills/agent-driven-development/reference/failure-modes-catalogue.md @@ -119,10 +119,14 @@ in Cobrust's case — silently skips the gate. script-creation time. New milestones add ADRs/findings, but the script doesn't auto-extend. -**Evidence**: Cobrust 11th-review §H2. -`grep -rE "ADR-003[0-9]" docs/human/` returns **0 hits**. -ADR-0030..0039 全部 not in zh+en doc trees. Triple-tree drift is -systemic for all post-M14 work, but doc-coverage.sh is silent on it. +**Evidence**: Cobrust 11th-review §H2 (anchored at HEAD ~`06df4b4`, 2026-05-10). +At that time, `grep -rE "ADR-003[0-9]" docs/human/` returned **0 hits** — +ADR-0030..0039 were not in zh+en doc trees. Triple-tree drift was +systemic for all post-M14 work, but doc-coverage.sh was silent on it. +(Note 2026-05-12: this specific grep has since changed as later doc +sync added ADR-0030..0039 mentions, but the systemic pattern remains +the F1.2 instance — the verification step was hardcoded against a +specific milestone range and went stale.) **Recovery**: doc-coverage scripts must auto-discover scope via `ls docs/agent/adr/00*.md` patterns, not hardcode milestone lists. @@ -1022,15 +1026,21 @@ attribution rather than schema invariants. ### Evidence -Cobrust Day 11, ~14:00: P7 sonnet sub-agent, dispatched for a broad cleanup -sprint, edited README sections that included review-claude's narrative §F -findings summary. This is described in review-claude's README §A.NEW5 -"review-claude 13 own" item: "P7 sonnet boundary violation editing -review-claude README". - -The attribution policy was clearly stated in README §Attribution: -"findings/ entries are review-claude originals — discovered_by field -marks source." P7 had no enforcement signal preventing the edit. 
+Cobrust Day 11 (2026-05-11), early afternoon: a P7 sonnet sub-agent +dispatched for a broad cleanup sprint edited README narrative sections +that review-claude considered its own authoring territory. The +boundary violation was paraphrased in the review-claude session's +own-up log (review-claude session 4bb35f43, paraphrased as: "P7 broad- +cleanup spawn edited review-claude's narrative §F without scope- +exclusion guard in dispatch prompt"). The literal handoff-README +text content at that line had evolved over multiple turns, so the +canonical citation is the pattern description, not a verbatim quote. + +The attribution policy was stated in `review-claude-handoff/README.md` +§"Attribution policy": "findings/ entries are review-claude originals — +each file's `discovered_by:` frontmatter marks source." P7 had no +machine-enforcement signal preventing the edit (no CODEOWNERS, no +dispatch-prompt-level exclusion list). Note: This is a **candidate F18** because (a) it was observed in a single session, (b) the root-cause was partially ambiguous (was it P7 ignoring @@ -1250,11 +1260,11 @@ This is F1 family because: the role is declared (review-claude is the auditor), ### Evidence -Cobrust 2026-05-11 evening: project owner asked claude-desktop to draft a Cobrust Studio handoff. Claude-desktop drafted ~2,800-line document signing it "— review-claude, 2026-05-11". A separate Claude Code session (the parallel one auditing Cobrust live, session ID `4bb35f43...`) was also active that day and had been signing its own artifacts "review-claude". The Studio handoff was claimed to be "synthesized from a multi-turn external review-claude session" — but the original session that performed those reviews did not write the handoff; claude-desktop did, citing the parallel session's prior work. +Cobrust 2026-05-11 evening: project owner asked claude-desktop to draft a Cobrust Studio handoff. Claude-desktop drafted a multi-hundred-line document signing it "— review-claude, 2026-05-11". 
A separate Claude Code session (the parallel one auditing Cobrust live, session ID `4bb35f43...`) was also active that day and had been signing its own artifacts "review-claude". The Studio handoff cited an external "multi-turn review-claude session" — but the original session that performed those reviews did not write the handoff; claude-desktop did, citing the parallel session's prior work. Result: future readers of the Studio handoff cannot tell which review-claude session authored each claim, when, with what context. The handle "review-claude" became identity-overloaded between at least 2 concurrent sessions on the same day. -Recovery in same session: appended §0.5.1 "Identity hygiene (F21)" + §12.8 "When in doubt, ask the parallel review-claude session" to the Studio handoff, prescribing session-ID-stamped attribution going forward. +The cleanly-locatable artifact instances on disk: `review-claude-handoff/handoff-pack/dispatches/claude-desktop-integrated-handoff.md` (claude-desktop integration record) + 5+ findings under `review-claude-handoff/findings/` with `discovered_by:` frontmatter, and ADSD's own `docs/agent/conventions.md` §"Identity hygiene (F21 closure)" prescribing session-ID-stamped attribution going forward. ### Rule of thumb From 0b436e02bdb7e52531d47fc28e1db5514c2b382a Mon Sep 17 00:00:00 2001 From: Hakureirm Date: Tue, 12 May 2026 14:45:07 +0800 Subject: [PATCH 09/16] =?UTF-8?q?feat(case-study):=20Cobrust=20Studio=20N?= =?UTF-8?q?=3D2=20dogfood=20=E2=80=94=202-day=20MVP=20exercised=20+=20exte?= =?UTF-8?q?nded=20ADSD=20v1.2.1?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Land case-study/cobrust-studio-experience.md as the second ADSD case study. Where Cobrust (N=1) generated the methodology from its own 12-day intensive, Studio (N=2) is the first project to consume ADSD v1.2.1 as input rather than co-evolving with it — a 2-day MVP run applying the methodology under acceleration. 
Concrete catalogue evidence surfaced: - 2× F1.0 catches (BSD-sed silent failure in M0 doc-coverage.sh; CTO 守闸 grep leak swallowing 9 failed integration tests at A4 merge) - 2× F19 catches (M4 SPA fallback Path regression on Router::fallback in v0.1.0; v0.1.1 Cargo.lock stale on --locked) - 2× F20 catches (last_verified_commit: HEAD placeholder shipped 2× before doc-coverage.sh §5 SHA-shape + git-reachability enforcement landed; doc-coverage.sh §6 paired exit-code + FAILED-grep gate hardened in v0.1.2 — the "recursive F20 closure" pattern) - 1× prospective F21 validation (zero macOS Full-Name leak across 125 commits; explicit session-handle attribution on every dispatch) Methodology extensions surfaced for v1.2.2+ back-port: - "Tag → audit → patch" as a RELEASE PATTERN, not a one-shot gate (v0.1.0 broken → v0.1.1 broken → v0.1.2 usable in 6 hours wall-clock; each tag is the experimental cycle) - Recursive F20 closure (every enforcement layer needs its own orthogonal-failure paired review) - Continuous persona testing executed in-sprint with persona-output → PR mapping (Mei/Aleksandr/Sarah dispatches drove the M5 README rewrite + F-05 dead-deps removal + CI matrix landing) - AI velocity confirmed ~2.5× on a 5-day plan, but multiplier buys experimental cycles, not shippable-first-try - 4-layer "constitution → ADR → finding → script" stack as the right F20 abstraction (each layer's gaps map cleanly to the next layer's enforcement) The case study is structured symmetrically to cobrust-multi-agent- experience.md: project meta + topology, what ADSD validated, what ADSD stressed (broken catches with file:line evidence), what ADSD extended, numbers, patterns to carry forward/reconsider. 
Signed-off: studio-p7-adsd-backport-opus47 Co-Authored-By: Claude Opus 4.7 (1M context) --- .../case-study/cobrust-studio-experience.md | 1370 +++++++++++++++++ 1 file changed, 1370 insertions(+) create mode 100644 plugins/adsd/skills/agent-driven-development/case-study/cobrust-studio-experience.md diff --git a/plugins/adsd/skills/agent-driven-development/case-study/cobrust-studio-experience.md b/plugins/adsd/skills/agent-driven-development/case-study/cobrust-studio-experience.md new file mode 100644 index 0000000..8d045ec --- /dev/null +++ b/plugins/adsd/skills/agent-driven-development/case-study/cobrust-studio-experience.md @@ -0,0 +1,1370 @@ +--- +case_study_id: cobrust-studio-2026-05-11-to-2026-05-12 +project: Cobrust Studio (AI-agent project-management console; self-hosted web UI + REST/SSE API over a markdown ADR/finding/ledger tree) +duration: ~21 hours wall-clock (2026-05-11 17:22:37 +0800 → 2026-05-12 14:36:16 +0800; 5-day human plan compressed to 2 calendar days) +human_time: ~3-4 hours (strategic decisions + 守闸 + persona-audit reading; no implementation code written by human) +agent_time: ~95% of LOC produced (3 Rust crates + SvelteKit 5 frontend); 18 opus sub-agent dispatches across 6 waves + 4 reconcile rounds + 1 release agent +final_state: v0.1.0 (broken) → v0.1.1 (broken) → v0.1.2 (usable) shipped within the same calendar day; 125 commits on main; 3 tags; 6 ADRs; 4 findings; 4 module-docs; 196 Rust tests / 14 hermetic Playwright e2e / 2 dogfood specs / real-LLM e2e — all green at HEAD +attribution_origin: studio-cto-session-002-opus47 + studio-p7-{a*,m*}-opus47 sub-agents (live dispatch window, no third-party audit gap) +relates_to: [case-study:cobrust-multi-agent-experience.md (N=1), reference:failure-modes-catalogue.md §F1.0/F19/F20/F21, SKILL.md §"Origin & lineage"] +--- + +# Case study: Cobrust Studio — N=2 dogfood, 2-day MVP exercised + extended ADSD v1.2.1 + +This case study reports what worked and what failed in applying ADSD +v1.2.1 to a 
**second, independent project** — Cobrust Studio, a +self-hosted web console for managing AI coding agents under engineering +discipline. The first ADSD case study +([`cobrust-multi-agent-experience.md`](cobrust-multi-agent-experience.md)) +documents a 12-day multi-agent build of the Cobrust language project +(N=1). Studio is the **N=2** evidence: a different codebase, a +different domain, a 10× shorter timeline, executed against the +already-codified methodology rather than co-evolving with it. + +If the Cobrust case study answers "did this discipline scale to a +12-day language compiler with a 4-way parallel agent team?", this +case study answers a sharper question: + +> **Does ADSD survive being applied as-written to a project it wasn't distilled from?** + +Short answer: yes, with two important caveats — **the methodology +was both validated and stressed in load-bearing ways**, and **Studio +surfaced new F-sub-forms that retroactively validate F19/F20/F21** +(added to the catalogue between N=1 and N=2). Where Cobrust *generated* +the failure-modes catalogue from its own scars, Studio *consumed* it +and reported back on which entries paid for themselves under +acceleration. + +This case study is also not a sanitised success story. v0.1.0 shipped +broken. v0.1.1 shipped broken (differently). v0.1.2 was the first +usable tag. Each broken tag is a data point about which enforcement +layer was missing; the patch dance below names file:line for every +gap. 
+ +--- + +## §0 Dashboard (one-pager) + +``` +Project: Cobrust Studio +Repo: github.com/Cobrust-lang/cobrust-studio +License: Apache-2.0 OR MIT (ADR-0001) +Span (wall-clock): 2026-05-11 17:22 → 2026-05-12 14:36 (~21 hours) +Span (5-day human plan): collapsed to 2 calendar days (AI velocity ~2.5×) +Bus factor: 1 (single human contributor; explicit caveat) +Commits on main: 125 +Tags pushed: 3 (v0.1.0 broken / v0.1.1 broken / v0.1.2 usable) +Rust crates: 3 (studio-router / studio-store / studio-server) +Frontend: SvelteKit 5, 5 pages, Tailwind v4 +Binary deployment: single 9.0 MiB self-contained (rust-embed; ADR-0002) +Rust tests at HEAD: 196 (32 ok groups, 0 FAILED) +Playwright e2e: 14 hermetic + 2 dogfood (all green at HEAD) +Real-LLM e2e: PASS (codex-forwarder + gpt-5.5) +ADRs landed: 6 (0001..0006) +Findings filed: 4 (P0 / P1 / P2 / P3 all represented; 3 of 4 closed within session) +Module-docs: 4 (studio-router / studio-store / studio-server / web-frontend) +Opus agent dispatches: ~18 (6 waves × DEV+TEST+REVIEW trio + 4 reconcile rounds + 1 release agent) +Reconcile rounds: 4 (A2, A3, A4, A5 — multiple per wave on M1-era waves) +CI gates enforced: 6 (fmt / clippy -D warnings / build / test / doc-coverage §5 SHA / doc-coverage §6 cargo test) +Persona audits: 3 (Mei / Aleksandr / Sarah — post-v0.1.2, AMBER / REAL / PASS-watch-6-month) +F-catalogue catches: F1.0 ×2 / F19 ×2 / F20 ×2 / F21 ×1 +ADSD-firsts: First F20 systemic closure in a non-origin project + First documented "tag → audit → patch" release pattern + First "recursive F20 closure" (enforcement script auditing itself) + First N=2 dogfood of the methodology +``` + +--- + +## §1 Project shape & meta + +### What Studio is + +A 9 MiB single Rust binary that gives engineering teams a web UI + +REST/SSE API over a plain-markdown ADR/finding/ledger tree backed by +a git repo. Five pages: `/login` / `/adr` / `/agent` (dispatch) / +`/finding` / `/ledger`. 
The discipline it productises is the +**ADR + finding + bilingual docs + Tx-tagged waves + doc-coverage CI +gate** stack distilled from Cobrust. Studio's pitch: "if your team is +doing serious AI-driven development, you need answers to *what did +the agents decide / what went wrong / where did the tokens go / are +we drifting / is the methodology actually being followed?* — Studio +gives you all five against any git repo, no SaaS, no per-seat +pricing." + +### Why it was built + +Two reasons, neither of which is the publicly stated pitch: + +1. **N=2 ADSD dogfood.** The Cobrust language project (N=1) was + the substrate the methodology was distilled from. Distillation + from the same project that generates the data is methodologically + suspect — "did the methodology actually work, or did we just + describe what we did?" Studio is the **second, independent + application**: a different language stack (Axum + SvelteKit vs + Cranelift codegen), a different problem domain (CRUD + SSE vs + LLM-driven translation), a different parallelism profile (3-way + dev/test/review trio per wave vs 4-way parallel sprint farm). The + only constant is the methodology and the human at P10. + +2. **A vehicle for back-porting v1.2.1 catalogue entries + (F19/F20/F21) under acceleration.** F19/F20/F21 were added to the + ADSD failure-modes catalogue between N=1 ending and Studio + starting. They were untested under the conditions they describe + (a new project, a fresh constitution, a tight timeline that + pressures shortcuts). Studio's M0-M5 trajectory was the first + project to consume those entries as inputs rather than outputs. 
+ +### Topology + +``` +Human (1, hakureirm ): + - Strategic decisions: license (Apache + MIT), repo namespace + (Cobrust-lang/cobrust-studio), public tag timing, persona-audit + follow-up direction + - Final 守闸 + merge approval on all wave merges + - ~3-4 hours total work; zero implementation code authored + +CTO agent (1, opus, studio-cto-session-002-opus47): + - M0..M5 milestone planning + Phase 1 ADR spikes (ADR-0001..0006) + - 18 sub-agent dispatches across 6 waves + - 4 reconcile rounds (DEV-TEST-REVIEW resolution per wave on A2-A5) + - 1 dedicated release-readiness agent (M4 post-tag audit) + - 3 persona-audit dispatches (Mei / Aleksandr / Sarah, post-v0.1.2) + +P7 sub-agents (~18 opus dispatches): + - studio-p7-a1-1-opus47 : router lift + strip + - studio-p7-a2-dev-opus47 : studio-store impl + - studio-p7-a2-test-opus47 : studio-store contract corpus + - studio-p7-a2-review-opus47: A2 audit + - studio-p7-a3-{dev,test,review}-opus47 : Axum core + - studio-p7-a4-{dev,test,review}-opus47 : 10 M1 routes + SSE + - studio-p7-a5-{dev,test,review}-opus47 : router wire + dispatch SSE + - studio-p7-m2-{dev,test,review}-opus47 : SvelteKit 5 frontend + - studio-p7-m3-{dev,test,review}-opus47 : rust-embed + dogfood + - studio-p7-m4-{dev,test,review}-opus47 : v0.1.0 release prep + - studio-cto-m4.1-release-readiness-opus47 : post-v0.1.0 audit (caught F-M4-01) + +Persona agents (3, sonnet, post-v0.1.2): + - Mei (Python data scientist, target user) + - Aleksandr (Rust skeptic, technical credibility) + - Sarah (OSS evaluator + tech-lead, governance) +``` + +**Critical attribution note (F21 hygiene)**: every CTO and P7 dispatch +in Studio carried an explicit session-handle suffix +(`studio-cto-session-002-opus47`, `studio-p7-a4-dev-opus47`). No +artifact in this repo signs bare "review-claude" or bare "the CTO". +The discipline came from F21 being on the table at session start — +empirical validation that **F21 prevention is cheap if applied +prospectively**. 
+ +### Wave structure + +Six waves, each with the **3-team trio pattern** (DEV + TEST + REVIEW +in parallel, then CTO reconcile): + +| Wave | Scope | Merge SHA | Notes | +|---|---|---|---| +| A0/M0 | Workspace scaffold + 5 ADRs + 5-gate CI | `b7d8f71` | Initial commit; F1.0 BSD-sed caught on first run | +| A1.x | studio-router lift from cobrust-llm-router @ `61f2aff` + strip | `d616548` | Strip #2 verified as no-op (`a1-1-strip-2-noop-at-pin-61f2aff.md`) | +| A2 | studio-store: ADR/finding/ledger CRUD + SQLite index | `36651a4` | First `last_verified_commit: HEAD` leak (F-A2-01) | +| A3 | studio-server Axum core | `d26f3ac` | Second HEAD leak (F-A3-01); same wave fixed via doc-coverage §5 | +| A4 | 10 M1 HTTP routes + SSE | `8d5475f` | Shipped 9 failing integration tests under broken grep守闸 | +| A5 | Router wire + dispatch SSE + A4 baseline fixes | `0e699c4` | A5 DEV agent flagged the broken-baseline as side-effect; finding filed | +| M2 | SvelteKit 5 frontend (5 pages) | `bfbfb8f` | Vitest + Playwright scaffolding | +| M3 | rust-embed integration + dogfood smoke | `5685f49`, `a426067` | The `Path` mounted on `Router::fallback` — landed here, caught at M4 | +| M4 | v0.1.0 release prep | `a722e09` | Tag `0a7fd3e` v0.1.0 — known-broken (SPA fallback) | +| M4.1 | Post-tag CTO 守闸 release-readiness audit | `503260d` | Caught F-M4-01; doc-coverage §6 added | +| v0.1.1 | SPA fallback `Path` → `Uri` extractor | `15b6f46` | Tag — known-broken (stale Cargo.lock) | +| v0.1.2 | Cargo.lock refresh + doc-coverage §6 paired exit-code gate | `7ea9ae3` | Tag — first usable | +| M5 | persona-audit-driven README rewrite + F-05 dead deps + CI matrix | `339e1ab`, `58cbe94`, `ffaf1fb` | Mei/Aleksandr/Sarah outputs converted into concrete PRs | + +The Wave A waves used a 3-team-per-wave dispatch pattern (~3 P7 +dispatches per wave); Wave M3+ collapsed back to single-P7 dispatches +because the frontend work was less cross-cutting. 
The variance is +itself an N=2 data point: **3-team trio is overkill for +single-surface UI work; appropriate for cross-crate Rust changes**. + +--- + +## §2 What Studio validated about ADSD + +This section walks each ADSD invariant the project exercised. The +question: did the methodology, applied as written, behave the way +the catalogue claims? + +### §2.1 The 4-tier role topology held under tight-timeline pressure + +ADSD §1 specifies P10 (CTO) / P9 (tech lead) / P7 (senior engineer) / +P0 (atomic) + external review. Studio used **only P10 + P7** — +collapsed P9 into P10 because the wave scope was tractable for direct +CTO-to-P7 dispatch. **The ≤4-way parallel cap was honored throughout**; +peak concurrency was 3 (DEV + TEST + REVIEW trio). + +This is a meaningful adaptation: ADSD's case study #1 (Cobrust) ran +4-way parallel through a heavyweight P9-led decomposition, because +each milestone (M11.x, M12.x) was a multi-crate spike. Studio's +waves were narrower (single crate per wave on A-series; single page +per dispatch on M2). The trio pattern at ≤3-way is **the right +fidelity for narrow-scope waves**; the P9 layer is overhead for +projects shorter than ~5 days. + +> **Methodology learning: P9 is optional below a complexity floor.** +> When the wave plan fits in a single ADR with ≤5 sub-tasks, CTO → +> P7 trio direct is fine. Reserve P9 for waves that need +> sub-decomposition of the ADR itself. + +This learning is being back-ported into §1 of SKILL.md (see §6 +below). + +### §2.2 Two-phase dispatch SOP held — and ADR-0006 demonstrates the blame-integrity move + +The single most-validated pattern was the **CTO Phase 1 ADR spike → +P7 Phase 2 impl** loop. Every wave followed it: + +``` +Phase 1 (CTO): Commit ADR-NNNN with options/decision/done-means. + Land on main. +Phase 2 (P7): Dispatch with a working dir + required reads + (including the ADR) + mission + deliverables + gates. +Phase 3 (CTO): 守闸 — 5-gate green check + read the diff + merge. 
+``` + +**Concrete validation**: +[`docs/agent/adr/0006-studio-router-api-and-lift-provenance.md`](https://github.com/Cobrust-lang/cobrust-studio/blob/main/docs/agent/adr/0006-studio-router-api-and-lift-provenance.md) +was spiked CTO-solo at Phase 1 (commit `93ae8f8`, 2026-05-11 17:24). +The §"Decision" block enumerated the studio-router public surface +and proposed a builder shape (`with_config / with_cache / with_ledger +/ from_toml`). P7 A1.1 lifted the upstream code and **discovered the +real upstream builder shape was different** (`register_provider / +build(&cfg)` async, `from_toml_str(&str)` not `from_toml(&path)`). + +This is exactly the F2 layer-divergence pattern from Cobrust ADR-0033 +/ ADR-0035. The right move was the **blame-integrity addendum**: + +> ADR-0006 §"Addendum 2026-05-11 — post-A1.1 reality reconciliation" +> preserves the original §"Decision" text **unchanged**, then appends +> a §F-01 / §F-02 / §F-03 addendum block enumerating each correction +> with the as-built reality. The original CTO speculation is +> preserved verbatim; the corrections are dated, attributed +> (`studio-review-wave-a1-opus47`), and load-bearing for downstream +> implementation. + +This pattern — **don't rewrite the spike, append the correction** — +is identical to Cobrust ADR-0033 §"Layer correction". Studio's +contribution: a clean second instance, with explicit prose calling +out *why* the original text is preserved (audit trail / blame +integrity). Future ADSD users now have two case-study instances of +the pattern in the wild. + +> **Methodology learning: ADR addendum pattern is the BLAME-INTEGRITY MOVE.** +> When Phase 2 implementation reveals Phase 1 was speculative-wrong, +> never edit §"Decision". Append `§"Addendum YYYY-MM-DD"` with the +> reality and a pointer to the review that surfaced it. Anyone +> reading the ADR can see both the original strategic intent and the +> tactical correction, with the lineage intact. 
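The append-only rule above lends itself to a mechanical check. What follows is a minimal sketch, not Studio's actual tooling: it assumes ADRs use a `## Decision` heading and that corrections land in later `## Addendum YYYY-MM-DD` sections (both heading names are assumptions, since the ADR layout is not reproduced in this case study). `adr_section` and `decision_unchanged` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; neither function exists in Studio's scripts.

# Print the body of one "## <heading>" section of a markdown file,
# stopping at the next "## " heading. "### " subheadings stay in the body.
adr_section() {
  awk -v want="## $2" '
    /^## / { printing = ($0 == want); next }
    printing { print }
  ' "$1"
}

# Succeed only when the Decision section is identical in both versions
# of the ADR (old version first, new version second).
decision_unchanged() {
  [ "$(adr_section "$1" "Decision")" = "$(adr_section "$2" "Decision")" ]
}
```

In a review step, the old version could come from something like `git show "$base:$adr_path" > old.md`, with `decision_unchanged old.md "$adr_path"` run before merge; an appended addendum passes, an edited §"Decision" fails.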
+ +### §2.3 5-gate verification held — and gained a 6th gate the same session + +ADSD §"5-gate verification" specifies: fmt / clippy / build / test / +doc-coverage. Studio enforced all 5 from M0; the M0 scaffold's first +commit (`b7d8f71`) shipped with green CI on day 0 hour 0. + +**By M4.1, the gate count was 6**. Studio added a §6 gate to +`scripts/doc-coverage.sh`: + +```bash +# Excerpt: scripts/doc-coverage.sh §6 (post-M4.1 hardened) +if ! cargo test --workspace --locked --no-fail-fast > "$test_log" 2>&1; then + cargo_exit=$? + echo "doc-coverage: FAIL — cargo test exited $cargo_exit (lockfile mismatch / compile error / panic)" >&2 + exit 1 +fi +failed_count=$(grep -c '^test result: FAILED' "$test_log" || true) +if [ "${failed_count:-0}" -ne 0 ]; then + echo "doc-coverage: FAIL — cargo test reported $failed_count failed test groups" >&2 + exit 1 +fi +``` + +This gate exists because the **standard 5-gate** as documented in +ADSD was insufficient against the failure modes Studio hit at A4 and +v0.1.1. The CTO 守闸 SOP that wrapped the 5-gate used a `cargo test +| grep "^test result" | wc -l` pipeline that **counted both `ok` and +`FAILED` summary lines** as if they were the same. Then the +post-v0.1.1 audit caught a Cargo.lock staleness where `cargo test +--locked` exited 101 *without* emitting a `test result: FAILED` line +at all. Both gaps fixed at the script layer (see §3.4 below). + +> **Methodology learning: the canonical 5-gate is insufficient under +> aggressive parallelism. The 6th gate (paired exit-code + FAILED-grep +> on `cargo test`) closes two systemic gaps the 5-gate misses.** Back-port +> candidate for SKILL.md §"5-gate verification". 
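One bash detail is worth checking when adapting the excerpt above: inside the `then` branch of `if ! cmd`, `$?` holds the status of the negated test (which is 0), not the exit code of `cmd` itself, so a message built from it would report 0 rather than 101. The sketch below shows the same paired gate with the status captured before it is tested. `paired_gate` and its arguments are hypothetical names, and the real script may already handle this differently.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not taken from Studio's doc-coverage.sh.
# Usage: paired_gate <log-file> <command> [args...]
paired_gate() {
  local log="$1"; shift
  local status=0
  # Capture the command's real exit code. The `|| status=$?` form also
  # keeps `set -e` from aborting the script before the message is printed.
  "$@" > "$log" 2>&1 || status=$?
  if [ "$status" -ne 0 ]; then
    echo "gate: FAIL (command exited $status)" >&2
    return 1
  fi
  # Second, independent signal: FAILED summary lines in the captured log.
  local failed
  failed=$(grep -c '^test result: FAILED' "$log" || true)
  if [ "$failed" -ne 0 ]; then
    echo "gate: FAIL ($failed failed test groups)" >&2
    return 1
  fi
  return 0
}
```

A call such as `paired_gate "$test_log" cargo test --workspace --locked --no-fail-fast` would then cover both gaps named in this section: a non-zero exit with no summary line, and a FAILED summary line hidden behind a counting pipeline.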
+ +### §2.4 3-team trio dispatch executed at ~3-way parallel under 4-way cap + +Each Wave-A and Wave-M sprint ran: + +``` + ┌─ studio-p7-{wave}-dev-opus47 (impl) +P7 ┼─ studio-p7-{wave}-test-opus47 (TDD contract corpus) + └─ studio-p7-{wave}-review-opus47 (audit — REVIEW only, no edits) + ↓ + CTO reconcile (merge DEV + TEST, address REVIEW findings) + ↓ + 守闸 (5-gate / 6-gate / read the diff) + ↓ + merge to main +``` + +Total opus dispatches: ~18 across 6 waves + 1 release-readiness agent ++ 3 persona agents = **~22 sub-agent dispatches in ~21 hours wall-clock** +(the 3 persona agents ran on sonnet; the rest ran on opus). +Token spend documented in `CHANGELOG.md §"Methodology firsts"`. CTO +reconcile rounds (4 of them, on A2/A3/A4/A5) were the most +human-time-expensive turns — typically 30-60 min each of human-driven +diff reading + small CTO edits to make DEV's wire-shape and TEST's +contract assumptions agree. + +The 3-team trio pattern is the **most ADSD-orthodox part of Studio's +execution**. It's the pattern §1 of SKILL.md specifies most directly, +and it worked as advertised — including the F-class catches (REVIEW +agent's audit reports are the source of the F-A2-01 / F-A3-01 / +F-A5-01 finding numbers below). + +### §2.5 Worktree-per-sprint pattern, scaled down to ~12 worktrees over 21 hours + +ADSD §"Worktree-per-sprint" specifies `git worktree add` per active +sprint; Studio created ~12 worktrees across the session +(`../studio-a2-dev`, `../studio-a2-test`, `../studio-a2-review`, +etc.), all cleaned up via `git worktree remove --force` post-merge. +**No worktree leaked into HEAD by accident**; no `target/` directory +collided. The pattern is identical to what cobrust-multi-agent +exercised, scaled down to single-day cadence. + +One M1 Pro 16GB machine, 3-way parallel cargo builds, zero exit-144 +(SIGUSR2) global lock starvation events. Cobrust hit this once at +6-way; Studio's ≤3-way cap never approached the ceiling. 
**The 4-way +parallel cap from §1 is real; 3-way is comfortable.** + +### §2.6 Atomic commits — code + tests + docs in one merge + +Every wave merge brought code + tests + module-docs in one commit. +Cross-references: + +- `36651a4 merge: A2 studio-store impl + contract corpus reconciled (Wave A2 complete)` + — brought `crates/studio-store/src/*.rs` + `crates/studio-store/tests/*.rs` + + `docs/agent/modules/studio-store.md` in one merge commit. +- `d26f3ac merge: A3 studio-server Axum core (Wave A3 complete)` — + same shape, scoped to server crate. + +**Atomic commit invariant violation count**: 1 (the A4 merge `8d5475f`, +which shipped 9 failing integration tests that compile-passed but +runtime-failed — see §3.3 below). One violation in 21 hours of +dispatch is in-line with the discipline; the violation itself produced +the catalogue's first **`cto-shougate-test-gate-grep-leak.md`** +finding. + +### §2.7 F21 identity hygiene held at 100% commit-attribution fidelity + +`git log --format='%an <%ae>' | sort -u`: + +``` +hakureirm +``` + +**One author, one email, across all 125 commits**. Zero leak of the +macOS Full-Name default (which had leaked into an unrelated public +repo in a prior session, per F21 evidence). The discipline came from +F21 being on the table at session start: every dispatch prompt +specified `git config user.name` verification as a tier-0 step +before any commit. + +This is a **direct, prospective validation of F21's prevention +mechanism**. F21 was added to the catalogue from an N=1 negative case; +Studio is the N=2 positive case — F21 catches the leak if you remember +F21 exists. + +### §2.8 Triple-track doc discipline (zh / en / agent) enforced by doc-coverage.sh + +Every public crate ships with `docs/agent/modules/.md`; every +top-level doc has zh + en parity. Six ADRs, four module-docs, four +findings, all carry `last_verified_commit:` frontmatter that points +to a real, git-reachable SHA. 
The doc-coverage gate enforces this +mechanically — see §3.2 below. + +### §2.9 Honest fail acceptance — three patch-tags in one day + +Every project ships at v0.1.0 if not before. Studio shipped at +v0.1.0, then v0.1.1, then v0.1.2 *in the same calendar day*. The +CHANGELOG names each tag explicitly: + +- **v0.1.0**: known-broken — SPA fallback `Path` regression on + `Router::fallback`. +- **v0.1.1**: known-broken (different bug) — stale Cargo.lock; `cargo + build --locked` returns 101. +- **v0.1.2**: first usable. + +No quiet retag. No "we'll bump the version and silently fix it." Each +patch tag is its own commit, its own CHANGELOG entry, its own +"`v0.1.` is known-broken; upgrade to `v0.1.`" note. **The +README's §"Honest status" section names the patch dance up front**: +*"If you'd prefer a year-old tag where you don't see the patch dance, +this isn't your project."* This is honest-fail-acceptance applied to +release-engineering, not just internal findings. + +--- + +## §3 What Studio STRESSED about ADSD + +This is the load-bearing section. Each item below: where the discipline +broke, how it was caught, what the fix was, what catalogue entry it +informs. Studio's value as N=2 evidence is concentrated here — +methodology that doesn't break under acceleration is methodology that +isn't being tested. + +### §3.1 F1.0 instance #1: BSD-sed in M0 doc-coverage.sh — declared invariant `ADR id monotonic` silently no-op'd on macOS + +**Where it broke** + +M0 (`b7d8f71`, the workspace-scaffold commit) shipped +`scripts/doc-coverage.sh` §4: + +```bash +# ORIGINAL (BSD-sed silent failure pattern) +for adr in $(ls docs/agent/adr/0*-*.md 2>/dev/null | sort); do + n=$(basename "$adr" | sed 's/^0*\([0-9]\+\).*/\1/') + # ... +done +``` + +On macOS (BSD sed), `\+` is **not a special character** — sed interprets +the regex literally. So `n` came back as the basename itself (e.g. 
+`0001-stack-choice.md`), the integer comparison `[ "$n" -le "$last" +]` returned a non-integer error, and `set -e` did **not** trip +because the construct was inside `$(...)` subshell expansion. The +gate printed `M0 — ADR id monotonic` and exited 0 on every run. + +**How it was caught** + +First-ever run of the gate from a clean macOS shell during M0 review. +CTO 守闸 noticed the gate "passed" against an ADR-roster that the +agent knew had a missing 0002 (intentionally — testing the monotonic +check should fail). Empirical confirmation: the gate was a no-op, not +a check. + +**Fix** + +`sed -E 's/^([0-9]+).*/\1/'` + a second `sed -E 's/^0+//'` to handle +leading zeros — POSIX-compatible regex (`-E` switch is GNU+BSD both). +Tested on macOS BSD sed and Linux GNU sed; both return monotonic +verdicts now. + +**Catalogue mapping** + +This is **F1.0 (declared invariants without enforcement) sub-form: cross-platform +shell silent failure**. The script declared an invariant ("ADR id +monotonic") and shipped a check that, on BSD tools, was equivalent +to no check. Same family as F1.2 (constitution rules with partial-scope +enforcement). + +> **Forward implication**: any project-level enforcement script should +> have a **deliberately-broken-input test** in CI: feed the script a +> known-bad fixture (intentionally non-monotonic ADR sequence), assert +> exit ≠ 0. If the test passes (gate caught the bad fixture), green. +> If the test fails (gate didn't catch), the gate is theatre. + +**This was the first F1.0 catch in Studio's session and the trigger +for tightening doc-coverage.sh's enforcement layer**. Two consecutive +F1.0 catches in the same session is the §"two strikes = systemic +blind spot" signal — see §3.2 below. 
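The deliberately-broken-input test described in the forward implication above can be sketched as follows. This is an illustration under stated assumptions, not Studio's `doc-coverage.sh`: it assumes ADR files are named with a zero-padded numeric prefix (`NNNN-title.md`), and it reads "monotonic" as "ids run 1, 2, 3, ... with no gap", which matches the missing-0002 example but may differ from the real script's rule. `adr_ids_monotonic` and `gate_self_test` are hypothetical names.

```shell
#!/usr/bin/env bash
# Hypothetical re-implementation of the monotonic-id check plus its self-test.

# Succeed only if ADR ids under "$1" run 1, 2, 3, ... with no gap or repeat.
adr_ids_monotonic() {
  local dir="$1" expected=1 file id
  for file in "$dir"/[0-9]*-*.md; do
    [ -e "$file" ] || continue   # glob matched nothing
    # `sed -E` is accepted by both GNU and BSD sed; strip leading zeros
    # so the id is a plain decimal number.
    id=$(basename "$file" | sed -E 's/^([0-9]+).*/\1/' | sed -E 's/^0+//')
    # Fail closed: an empty or non-numeric id is an error, never a silent pass.
    case "$id" in
      ''|*[!0-9]*) echo "ADR id not numeric in $file" >&2; return 1 ;;
    esac
    if [ "$id" -ne "$expected" ]; then
      echo "ADR id gap: expected $expected, found $id in $file" >&2
      return 1
    fi
    expected=$((expected + 1))
  done
  return 0
}

# Feed the gate one good fixture and one known-bad fixture; the gate is
# only trustworthy if it accepts the first and rejects the second.
gate_self_test() {
  local tmp rc=0
  tmp=$(mktemp -d)
  mkdir "$tmp/good" "$tmp/bad"
  touch "$tmp/good/0001-a.md" "$tmp/good/0002-b.md" "$tmp/good/0003-c.md"
  touch "$tmp/bad/0001-a.md" "$tmp/bad/0003-c.md"   # 0002 deliberately missing
  adr_ids_monotonic "$tmp/good" || rc=1
  if adr_ids_monotonic "$tmp/bad" 2>/dev/null; then
    echo "self-test: bad fixture accepted, gate is not checking anything" >&2
    rc=1
  fi
  rm -rf "$tmp"
  return "$rc"
}
```

Running `gate_self_test` as its own CI step, on both macOS and Linux runners, is the point of the design: the original BSD-sed no-op would have failed this step on the first macOS run instead of passing silently.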
+ +### §3.2 F19/F20 paired instance: `last_verified_commit: HEAD` placeholder shipped twice in module-docs + +**Where it broke** + +Wave A2 merge `36651a4` (2026-05-12) shipped +`docs/agent/modules/studio-store.md` with frontmatter: + +```yaml +--- +doc_kind: module +crate: studio-store +last_verified_commit: HEAD # ← placeholder, never replaced +--- +``` + +`doc-coverage.sh` at that point did **not** check that +`last_verified_commit:` was a real SHA. The gate just checked frontmatter +*existed*. The literal string `HEAD` is frontmatter content; gate +passed. + +A2 external review (`studio-p7-a2-review-opus47`) caught it visually +as P2 finding F-A2-01. + +**24 hours later, Wave A3** merge `d26f3ac` shipped +`docs/agent/modules/studio-server.md` **with the same `HEAD` +placeholder**. Second instance, same blind spot. The A3 review caught +it (F-A3-01); but the structural issue — *the gate doesn't enforce* +— was diagnosed only after the second occurrence. + +**Two strikes = systemic blind spot** (per Cobrust F2 pattern). Filed +finding [`f20-closure-last-verified-commit-enforcement.md`](https://github.com/Cobrust-lang/cobrust-studio/blob/main/docs/agent/findings/f20-closure-last-verified-commit-enforcement.md) +naming the gap as an F20 instance (constitution-vs-workflow alignment). + +**Fix** + +`scripts/doc-coverage.sh` §5 extended in the **same commit as the +A3 review fix** (per F20 §"Rule of thumb": *every binding constitution +rule must have a paired enforcement step in the same PR that introduces +it*): + +```bash +check_last_verified() { + local file="$1" + grep -q "^last_verified_commit:" "$file" || fail "missing frontmatter" + local sha + sha=$(grep "^last_verified_commit:" "$file" | head -1 \ + | sed -E 's/^last_verified_commit:[[:space:]]*//') + if [ "$sha" = "HEAD" ] || [ -z "$sha" ]; then + fail "$file last_verified_commit='$sha' is a placeholder (F20)" + fi + if ! 
echo "$sha" | grep -qE '^[0-9a-f]{7,40}$'; then + fail "$file last_verified_commit='$sha' does not look like a git SHA (F20)" + fi + # F-A3-01 closure: hex-shape alone passes `deadbee` (valid hex, + # not a real commit). git cat-file -e is the canonical reachability check. + if ! git cat-file -e "${sha}^{commit}" 2>/dev/null; then + fail "$file last_verified_commit='$sha' is hex-shaped but NOT a reachable git commit (F20)" + fi +} +``` + +Three layers of check now: presence + shape + git-reachability. The +reachability check (`git cat-file -e ^{commit}`) is the +F-A3-01 closure — without it, a typo like `deadbee` passes +hex-validation but doesn't actually point to a real commit. + +**Catalogue mapping** + +This is the **first F20 systemic closure landed in Cobrust Studio** +— the finding's title literally is `f20-closure-last-verified-commit-enforcement`, +and the §"Conclusion" states: *"this finding is the first F20-class fix landed +in Cobrust Studio. Mechanism is now load-bearing: any future module-doc or +finding that lands with `last_verified_commit: HEAD` will be caught by CI +on the same PR that introduces it. The placeholder pattern is dead."* + +> **First-ever validation of F20's prevention mechanism in a non-Cobrust +> project.** F20 was added to the catalogue from Cobrust's TDD-mandate-without-enforcement +> N=1 negative case. Studio is the first project to land an F20 *closure* +> against a brand-new instance — confirming F20's §"Rule of thumb" is +> actionable, not just diagnostic. + +> **Forward implication**: F19 (release-readiness untested) and F20 +> (constitution-vs-workflow alignment) **pair naturally**. F19 is "did +> you run it?"; F20 is "did your runner enforce it?". Any project that +> takes F20 seriously will produce F19 closures automatically — and +> vice versa. 
+ +### §3.3 F1.0 instance #2: CTO 守闸 grep leak — A4 merged with 9 failing integration tests under green-gate report + +**Where it broke** + +A4 merge `8d5475f` (10 M1 HTTP routes + SSE; 2026-05-12) was +ratified by CTO 守闸 using: + +```bash +# WRONG — counts both `ok` and `FAILED` as "test groups" +cargo test --workspace --locked --no-fail-fast 2>&1 \ + | grep "^test result" | wc -l \ + | xargs -I{} echo "{} test groups all green" +``` + +This pipeline counts every line that **starts with** `test result:` — +including `test result: ok.` and `test result: FAILED.`. Both shapes +match; both increment the counter. The 守闸 report said "22 test +groups all green"; in reality, **5 of the 22 groups were FAILED, covering 9 failing tests**. + +The 9 failures were API-shape drift between A4 P7 DEV's wire shape +and A4 P7 TEST's contract assumptions (the same drift class as A2 +reconcile — but uncaught because of the broken grep). Specifically: + +| File | Failed tests | +|---|---| +| `tests/adr_routes.rs` | 4 (post_adr_malformed_body, get_adr_by_id, post_adr_then_list, post_adr_persists) | +| `tests/auth_route.rs` | 1 (set_endpoint_malformed) | +| `tests/events_route.rs` | 1 (events_sse_emits_on_adr_create) | +| `tests/finding_routes.rs` | 2 (post_finding_malformed, post_finding_then_list) | +| `tests/ledger_route.rs` | 1 (ledger_recent_n_zero) | + +The A4 守闸 commit `6775cce` ("M4.1 守闸 — apply A3 review P2 fixes") +did NOT address these — it fixed clippy and lib doc edits but didn't +run a clean test gate against the new integration corpus. + +**How it was caught** + +Wave A5 dispatch (the next sprint). A5 P7 DEV agent ran `cargo test` +against base `6775cce` as a sanity check before starting impl — and +**reported 9 pre-existing failures** in its `[P7-COMPLETION]` mid-flight +("base branch has 5 pre-existing failing test files; should I work +on top or wait for fix?"). 
+ +The CTO immediately recognised: "the 5-gate I claimed green at A4 +was wrong". The green claim came from a grep pipeline that +folded `FAILED` summary lines into a generic line count. Filed finding +[`cto-shougate-test-gate-grep-leak.md`](https://github.com/Cobrust-lang/cobrust-studio/blob/main/docs/agent/findings/cto-shougate-test-gate-grep-leak.md) +with severity P1, naming three structural takeaways: + +1. CTO 守闸 SOP must use exit-code-aware test-gate checks (either + propagate cargo's exit code OR grep for FAILED explicitly — not + count `^test result` lines). +2. P7 TEST agents must run BOTH `cargo check` AND `cargo test` (with + acceptance that test FAIL is expected at TDD-red — but the agent + must REPORT the failure shape, not claim "all green"). +3. Same-PR enforcement: extend `scripts/doc-coverage.sh` to run + `cargo test` and explicitly check the summary line. This is the + F20 closure for "atomic commit invariant" → script-level enforcement. + +**Fix** + +Landed at M4.1 (`503260d` "fix: M4.1 守闸 — close cto-shougate finding +via doc-coverage §6 test gate"). `scripts/doc-coverage.sh` §6 added: + +```bash +# Capture cargo's real exit code; inside `if ! cmd; then`, `$?` would be 0. +cargo_exit=0 +cargo test --workspace --locked --no-fail-fast > "$test_log" 2>&1 || cargo_exit=$? +if [ "$cargo_exit" -ne 0 ]; then + echo "doc-coverage: FAIL — cargo test exited $cargo_exit"; exit 1 +fi +failed_count=$(grep -c '^test result: FAILED' "$test_log" || true) +if [ "${failed_count:-0}" -ne 0 ]; then + echo "doc-coverage: FAIL — $failed_count failed groups"; exit 1 +fi +``` + +Note: **paired** check on exit code AND FAILED-grep. Either one alone +is insufficient (the original CTO pipeline counted every `^test result` +line and discarded cargo's non-zero exit, because a pipeline reports +the status of its last command). + +The finding's `status: closed_by_m4.1` records this closure. 
The +file's opening note (added at closure time): + +> *"Closure 2026-05-12 (M4.1 守闸): `scripts/doc-coverage.sh` §6 now +> runs `cargo test --workspace --locked --no-fail-fast` and explicitly +> greps `^test result: FAILED` to enforce the gate at script level. +> F20 systemic enforcement complete — the broken `grep "^test result" +> | wc -l` pattern that A4 merge tripped on can no longer ship green."* + +**Catalogue mapping** + +This is **F1.0 (declared invariant `5 gates green` lacking enforcement +in the verification mechanism itself)** + **F20 (constitution-vs-workflow: +the SOP's grep was the workflow; CLAUDE.md's "5 gates green before +any merge" was the constitution; the gap was the grep)**. + +> **The CTO 守闸 procedure is itself a workflow. F20 applies to the procedure +> as much as it applies to the code being reviewed.** Studio's evidence +> shows the discipline must be **layered** — the constitution rule, the +> SOP grep, the doc-coverage script, and a deliberately-broken-input +> test that confirms each layer catches what the upper layer would otherwise +> miss. + +### §3.4 F-M4-01: SPA fallback `Path<String>` shipped to v0.1.0 — caught by post-tag M4 release-readiness audit + +**Where it broke** + +M3 rust-embed integration (`5685f49`) mounted `embed::serve_asset` +via `axum::Router::fallback(...)`. The handler signature was: + +```rust +pub async fn serve_asset(Path(path): Path<String>) -> Response { ... } +``` + +The structural Axum bug: **`axum::extract::Path` only extracts +from matched route patterns; `Router::fallback` does NOT match a +pattern** — it's a catch-all that the framework dispatches to when +no other route matches. So `Path<String>` has nothing to extract from, +and every request to a SPA route (`/login`, `/adr`, `/agent`, +`/finding`, `/ledger`) returned the Axum runtime error: + +``` +Wrong number of path arguments for `Path`. Expected 1 but got 0. 
+Note that multiple parameters must be extracted with a tuple `Path<(_, _)>` +or a struct `Path` +``` + +as the response body, instead of the SvelteKit `index.html` shell. +**The frontend was unreachable**. Every navigation to a SPA route +returned an Axum error string. **v0.1.0 shipped with this regression.** + +The bug was hidden from prior audits because: + +1. `scripts/smoke-dogfood.sh` only tests `GET /` (which uses + `embed::serve_index`, a separate handler with no extractor) and + `GET /api/*` paths (which never reach the embed fallback). It + never exercises a SPA route through the binary. +2. `embed::serve_asset`'s collocated unit test called the function + *directly* with a literal `Path("adr/3".to_string())` instead of + going through the Axum router — so the extractor plumbing was + never exercised in the unit test either. +3. M3 review forecast (`studio-review-wave-m3-opus47`) said the + 13-of-14 prior-fail Playwright state was "rust-embed not on TEST + branch yet; post-merge all 14 will pass." **This was wrong** — + the bug is in the rust-embed integration's extractor choice, not + in branch merge state. The forecast was speculative; the empirical + measurement was deferred. +4. M4 TEST agent (the wave that was supposed to validate the release) + returned mid-flight without running Playwright. The CTO did not + re-dispatch; instead, the CTO ran the audit directly. + +**How it was caught** + +**Post-tag CTO 守闸 M4 release-readiness audit** ran hermetic +Playwright (`STUDIO_E2E=1 pnpm run test:e2e`) against +`./target/release/cobrust-studio` built from main HEAD `a722e09` +(== v0.1.0). 13 of 14 e2e specs failed at the first +`page.goto('/login')` step. Inspection of Playwright's +`error-context.md` showed the exact Axum error string. Hypothesis +confirmed in <60 seconds. 
+ +Filed P0 finding +[`m4-release-readiness-spa-fallback-extractor.md`](https://github.com/Cobrust-lang/cobrust-studio/blob/main/docs/agent/findings/m4-release-readiness-spa-fallback-extractor.md). + +**Fix** + +`v0.1.1` (commit `15b6f46`): replace `Path` with +`axum::http::Uri`: + +```rust +use axum::http::Uri; + +pub async fn serve_asset(uri: Uri) -> Response { + serve_path(uri.path()) +} +``` + +Locked against regression by new unit test +`serve_asset_handles_spa_routes_login_agent_etc` that exercises +**every** SPA route through the fixed `Uri` extractor (not just the +literal-Path collocated test pattern that had failed to catch the bug). + +**Catalogue mapping** + +This is **F19 (release-readiness untested) — first time the F19 +prevention mechanism caught a real shipping bug in Cobrust Studio**. +The finding's §"Forward implications" makes this explicit: + +> *"The smoke-dogfood.sh script SHOULD probe a SPA route (e.g., +> `curl /login | grep ' the script level. Filed for v0.1.2."* +> +> *"M4 release-readiness pattern: the F19 mandate 'any public-facing +> install / quickstart / release command must pass independent +> execution in a clean shell' implicitly extends to 'any public +> ROUTE must be hit by an independent caller before publish.' +> smoke-dogfood.sh covers /api/* and /; v0.1.1 forward should cover +> SPA routes too."* + +> **Methodology learning: F19 extends from "install commands" to +> "every public surface".** The original F19 (Cobrust v0.1.x release +> notes) was about `cargo install` URLs and curl commands. Studio's +> instance generalises it to **any public-facing route that a real +> user would hit through normal use**. The mechanism is the same — +> independent caller (Playwright + curl) probing a clean-shell binary. 
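The "every public surface" reading of F19 can be sketched as a route probe for the smoke script. This is an illustrative sketch only, not the actual `smoke-dogfood.sh`: the function names `looks_like_spa_shell` and `probe_spa_routes`, the base-URL argument, and the assumption that the SPA shell contains an `<html` tag are all hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical sketch; function names and the "<html" marker are assumptions.

# Reads a response body on stdin. Returns 0 if it looks like an HTML shell,
# non-zero if it is empty or carries the extractor error string.
looks_like_spa_shell() {
  local body
  body=$(cat)
  case "$body" in
    *"Wrong number of path arguments"*) return 1 ;;
  esac
  printf '%s' "$body" | grep -qi '<html'
}

# Probes each SPA route through the running binary, the way a real user would.
# $1 is the base URL of a locally started server (an assumption of this sketch).
probe_spa_routes() {
  local base="$1" route failed=0
  for route in /login /adr /agent /finding /ledger; do
    # If curl fails, the body is empty and the check below rejects it.
    if ! curl -fsS --max-time 5 "$base$route" | looks_like_spa_shell; then
      echo "smoke: FAIL $route did not return the SPA shell" >&2
      failed=1
    fi
  done
  return "$failed"
}
```

Keeping the body check separate from the `curl` call means the check itself can be exercised against canned good and bad bodies, without a server.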
+ +### §3.5 F20 recursive closure: doc-coverage §6 hardened against `cargo test --locked` exit 101 leaking past FAILED-grep + +**Where it broke** + +v0.1.1 (`15b6f46`) shipped with the workspace version bumped from +0.1.0 → 0.1.1 in `Cargo.toml`, but `Cargo.lock` still referenced the +v0.1.0 workspace versions (`studio-server v0.1.0` etc.). Any user +running `cargo build --workspace --locked` against v0.1.1 — including +`scripts/build-release.sh`, the CI release workflow, or the M3-docs +recommended user clone path — got: + +``` +error: the lock file CARGO_LOCK needs to be updated but --locked was passed +to prevent this +``` + +`cargo test --workspace --locked` exits 101 (cargo's "build failed +or lockfile mismatch" code) **WITHOUT** ever running tests — so it +never emits a `test result: FAILED` line. + +The `doc-coverage.sh` §6 gate, hardened at M4.1 against the +`grep '^test result'` swallow-fail pattern, used: + +```bash +test_output=$(cargo test --workspace --locked --no-fail-fast 2>&1) +failed_count=$(echo "$test_output" | grep -c '^test result: FAILED') +[ "$failed_count" -eq 0 ] || exit 1 +``` + +This **only catches `FAILED` summary lines**. Exit code 101 from +lockfile-mismatch doesn't produce any summary line — so `failed_count` +is 0, gate passes green, **v0.1.1 ships broken-from-tag**. + +**How it was caught** + +Post-v0.1.1 tag, the `scripts/release-tarball.sh` build pipeline +errored at `cargo build --workspace --locked`. The doc-coverage §6 +gate had passed green just minutes before. CTO 守闸 immediately +identified the recursive pattern: **the gate that was supposed to +enforce F20 ("constitution-vs-workflow alignment") had itself a F20 +gap** — the workflow's enforcement was incomplete. + +This is the **recursive F20 closure**: F20 applied to its own +enforcement script. + +**Fix** + +v0.1.2 (`7ea9ae3`). Two changes: + +1. **Cargo.lock regenerated** via `cargo build` against the new + workspace version. Lockfile now consistent with `0.1.1+`. +2. 
**doc-coverage.sh §6 paired-gate** — a separate exit-code check on + `cargo test ...` AND a `failed_count` check for FAILED-grep. Either + non-zero fails the gate. + +```bash +# v0.1.2: paired gate. EITHER cargo exit != 0 OR FAILED count > 0 fails the script. +# Capture cargo's real exit code; inside `if ! cmd; then`, `$?` would be 0. +cargo_exit=0 +cargo test --workspace --locked --no-fail-fast > "$test_log" 2>&1 || cargo_exit=$? +if [ "$cargo_exit" -ne 0 ]; then + echo "doc-coverage: FAIL — cargo test exited $cargo_exit (lockfile mismatch / compile error / panic)" >&2 + exit 1 +fi +failed_count=$(grep -c '^test result: FAILED' "$test_log" || true) +if [ "${failed_count:-0}" -ne 0 ]; then + echo "doc-coverage: FAIL — $failed_count failed groups" >&2 + exit 1 +fi +``` + +The CHANGELOG names the gap explicitly: + +> *"v0.1.1 Cargo.lock stale ... v0.1.1's commit shipped with Cargo.lock +> still referencing the v0.1.0 workspace versions. ... cargo test +> --locked exited 101 but the grep returned 0 FAILED, so the script +> passed green. v0.1.2 closes."* + +**Catalogue mapping** + +This is the **first documented "F20 recursive closure"** instance — +F20 applied to its own enforcement script, with each enforcement +layer requiring its own paired review. Studio is the empirical +substrate for the pattern: + +> **Methodology learning: F20 closure is not one-shot. The enforcement +> layer needs its own paired review.** A doc-coverage gate that hardens +> against pattern X can ship green against pattern Y on the same +> code-path. The script's invariant ("no test failures shipped") is +> declared once; each new failure mode (FAILED summary line / non-zero +> exit code without summary / hang / panic / OOM) needs its own +> orthogonal check. +> +> **Empirical pattern**: every enforcement layer needs its own paired +> orthogonal-failure review until the failure-mode class no longer +> recurs. Studio took two patches (M4.1 + v0.1.2) before the §6 gate +> stopped letting things through. 
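The orthogonal-failure review described above can itself be tested by pointing the gate at stub commands that fail in different ways. The following is a minimal sketch, not the real `doc-coverage.sh` §6: `run_test_gate` and the three `fake_*` stubs are hypothetical names, and the stub output lines only imitate the general shape of cargo's summary lines.

```bash
#!/usr/bin/env bash
# Hypothetical sketch; run_test_gate and the fake_* stubs are illustrative only.

# run_test_gate CMD [ARGS...]
# Fails if CMD exits non-zero OR if its output has a FAILED summary line.
run_test_gate() {
  local log status=0 failed
  log=$(mktemp) || return 2
  "$@" > "$log" 2>&1 || status=$?   # record the command's own exit code
  failed=$(grep -c '^test result: FAILED' "$log" || true)
  rm -f "$log"
  if [ "$status" -ne 0 ]; then
    echo "gate: FAIL (command exited $status)" >&2
    return 1
  fi
  if [ "${failed:-0}" -ne 0 ]; then
    echo "gate: FAIL ($failed failed groups)" >&2
    return 1
  fi
  return 0
}

# Stubs that imitate two orthogonal failure modes and one passing run.
fake_lockfile_mismatch() { echo "error: lock file needs to be updated" >&2; return 101; }
fake_failed_summary()    { echo "test result: FAILED. 3 passed; 1 failed"; return 0; }
fake_all_ok()            { echo "test result: ok. 4 passed; 0 failed"; return 0; }
```

In real use the gate would be invoked as `run_test_gate cargo test --workspace --locked --no-fail-fast`; the stubs exist only so that each failure mode has a fixture proving the gate rejects it.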
+ +### §3.6 The strip-#2 declared-empty-must-be-observed-empty discipline + +**Where the discipline was tested** + +ADR-0006 §"Strip list" item #2 directed the A1.1 lift to remove "ADR-0040 +honest-gate hooks (L2 verdict typing)" from `router.rs` + `ledger.rs`. +The strip list was authored from the Studio handoff doc's plan-time +view of upstream entanglement — i.e., a CTO Phase-1 belief about +what the upstream pin contained. + +P7 A1.1 lift agent searched the actual upstream pin +(`~/repos/cobrust-source-pin/crates/cobrust-llm-router/` at SHA +`61f2aff`, v0.1.1): + +```bash +grep -rn "L2Verdict\|gate_verdict\|L2.*Verdict\|HonestGate" \ + ~/repos/cobrust-source-pin/crates/cobrust-llm-router/src/ +# Result: zero hits. +``` + +The strip was a **no-op at this pin**. The honest-gate surface +evidently lived in a different upstream crate (the translation +pipeline, not the router crate). + +**The discipline applied** + +The lift didn't silently elide strip-#2 from the report. Filed P3 +finding +[`a1-1-strip-2-noop-at-pin-61f2aff.md`](https://github.com/Cobrust-lang/cobrust-studio/blob/main/docs/agent/findings/a1-1-strip-2-noop-at-pin-61f2aff.md) +explicitly recording the no-op: + +> *"ADR-0040's 'honest gate' surface evidently lives elsewhere in the +> upstream Cobrust workspace (likely the translation pipeline crates, +> not the router crate). At the pinned router-crate SHA, there is +> nothing to strip. The lift therefore proceeded with strip #2 as a +> verified no-op."* + +The finding's §"Conclusion" articulates the principle: + +> *"ADSD §'Atomic commits' + §'5-gate verification' both demand that +> declared invariants get verified in the same commit as the code they +> constrain. Strip #2 declared an invariant ('no honest-gate hooks in +> studio-router'). 
The honest verification was 'they were never in +> scope at this pin'; recording that explicitly closes the F1.0 / F19 +> risk class — namely, future readers seeing strip #2 in ADR-0006 +> might assume there must have been code removed, look in vain, and +> either re-add bogus honest-gate machinery to 'fix' what they think +> went missing, or distrust ADR-0006's other strip claims."* + +**Catalogue mapping** + +This is a **"declared-empty must be observed-empty" pattern** — a +proactive F1 family prevention. The principle: when an ADR declares +*the absence* of something (no honest-gate, no consensus mode, no +per-task routing), that absence must be **empirically observed** and +**recorded as observed**, not silently assumed. Otherwise a future +reader can't distinguish "we removed X" from "X was never there". + +> **Methodology learning: ADR strip-lists and constitution-prohibitions +> must record the *empirical observation* of absence, not just the +> *declaration* of absence.** Without the observation record, a future +> reader hitting an empty grep can't tell whether the absence is +> intentional (strip succeeded) or accidental (strip silently failed). + +This is a candidate for a new F-sub-form in the catalogue — a parallel +to F19/F20 framed around **strip-and-fork lift provenance**. For now, +documented in the finding itself; flagged for catalogue back-port. + +--- + +## §4 What Studio EXTENDED about ADSD + +The methodology learned things during this session. This section +captures the deltas — items worth back-porting to SKILL.md or to the +failure-modes catalogue. + +### §4.1 "Tag → audit → patch" as a RELEASE PATTERN, not just an audit gate + +**The pattern** + +ADSD v1.2.1 F19 documents "release-readiness agent runs in clean +shell before publish" as a **gate** — something that decides +GO / BLOCK before a tag is pushed. Studio's experience reframes this +as a **release pattern**, not a binary gate: + +``` +Tag the current candidate → v0.1. 
+ ↓ +Audit the tagged artifact in clean shell → release-readiness agent + ↓ +If BLOCK: file finding + patch + tag v0.1. → patch dance + ↓ +Re-audit + ↓ +If GO: announce, publish notes → first usable tag +``` + +**Why this matters** + +The framing change is load-bearing: under tight timelines (Studio's +21-hour run), there is no time for *one* perfect tag. The right +pattern is *fast tag → fast audit → fast patch*, accepting that +v0.1.0 / v0.1.1 are intentionally just the experiment substrate that +the audit will reveal. + +Each tag is the **experiment**; the audit is the **observation**; +the patch is the **learning**. Three tags in one day for Studio is +not three failures — it's three completed experimental cycles, each +of which revealed an enforcement gap that intent-driven self-checks +had missed. + +**Empirical evidence** + +- v0.1.0 tag → M4.1 release-readiness audit → caught F-M4-01 SPA + fallback regression → v0.1.1 patch. +- v0.1.1 tag → release-tarball.sh in clean shell → caught Cargo.lock + staleness → v0.1.2 patch. +- v0.1.2 tag → release-readiness audit → green; no v0.1.3 needed + same-day. + +The pattern's success metric isn't "first tag is perfect"; it's +"convergence after K patches stays bounded". For Studio: K=2. + +**Back-port candidate for SKILL.md** + +Propose adding §4 (Quality & Verification) sub-section: + +> *"Tag → audit → patch as a release pattern: under acceleration, +> accept that the first tag will not be the publishable one. The +> right discipline is fast experimental cycle: tag, run the +> release-readiness audit in clean shell, patch the gap, re-tag. +> Cobrust Studio shipped v0.1.0 → v0.1.1 → v0.1.2 in 6 hours +> wall-clock; each tag was a learning step. CHANGELOG.md names each +> tag explicitly as broken/usable so users know which to skip."* + +### §4.2 Recursive F20 closure — enforcement layers need orthogonal-failure review + +**The pattern** + +F20 closure is not one-shot. 
When you harden enforcement layer A +against failure mode X, you reveal that layer A has a sibling gap +against failure mode Y (orthogonal failure on the same code path). +Closing X without checking Y leaves the layer half-closed. + +**Empirical evidence** + +The `doc-coverage.sh` §6 gate evolution: + +| Stage | Enforcement | Gap revealed | +|---|---|---| +| Pre-M4.1 | `grep '^test result' \| wc -l` | Counts both `ok` and `FAILED` as `result` lines | +| M4.1 | `grep -c '^test result: FAILED'` | Misses non-zero exit without summary line (e.g. lockfile mismatch exit 101) | +| v0.1.2 | Paired: `if ! cargo test` AND FAILED-grep | Both classes now caught | + +Each fix was complete against the bug class it was designed for. But +the enforcement layer had **orthogonal failure modes** (FAILED-line +emit-ing vs not-emit-ing) that needed their own paired review. + +**Forward implication** + +> **Methodology learning: when closing an F20 instance, scan for +> orthogonal failure modes on the same code path.** Ask: "could my +> enforcement layer still pass under a different failure mode of the +> same operation?" If yes, the closure is partial. + +This generalises beyond test gates. The same logic applies to: + +- Schema invariants in frontmatter (different shape of violation) +- CI lint scripts (different shape of bad input) +- Dispatch prompt template fields (different shape of agent shortcut) + +Back-port candidate for failure-modes-catalogue §F20 §"Prevention +going forward": add a fourth layer ("Layer 4: orthogonal-failure +review against every paired-gate enforcement"). + +### §4.3 Continuous persona testing executed in-sprint, with persona-output → PR mapping + +**The pattern** + +ADSD v1.2.1 §1 §"Continuous persona testing" documents persona +simulation as continuous dev cadence, not one-shot audit. Studio's +post-v0.1.2 turn was the first project to execute this as a +**deliberate sprint output**, not as a pre-release ceremony. 
+ +**How it was executed** + +Three persona agents dispatched in parallel post-v0.1.2: + +| Persona | Profile | Verdict | Key catches | +|---|---|---|---| +| **Mei** | Python data scientist, target user | AMBER | Vocabulary confusion (what's an "ADR"?), missing "why not Linear/Notion?" framing, install path assumes `rustup` knowledge | +| **Aleksandr** | Senior Rust eng, technical skeptic | REAL (genuine assessment) | F-05 dead deps (`unicode-normalization`, `uuid`, `hex`, `tracing` carried from upstream lift but unused in studio-router), missing CI matrix | +| **Sarah** | OSS evaluator / governance | PASS-watch-6-month | Bus factor 1 (single contributor; flagged as adoption risk), no SECURITY.md, no CONTRIBUTING.md | + +The personas were given: +- Persona identity + background (years exp, prior burned-by experiences) +- Specific scenario ("you have 30 min, someone shared this on HN") +- Concrete actions to perform (open README, mentally try install) +- Stay-in-character constraint ("don't break into 'as an AI...'") +- Structured report fields aligned to persona's actual decision + ("would I upvote on HN?" / "what would I PR if I had a free + afternoon?") + +**Persona → PR mapping (empirical evidence the pattern works as a PR-driver, not as theatre)** + +Mei's friction items drove the M5 README rewrite directly: + +- *"What's an ADR? The vocabulary table dropped me in"* → README §"Methodology vocabulary" table added (`docs/agent/adr/`, `docs/agent/findings/`, "Wave", "Tx tag", "5 gates", "守闸"). +- *"Why not just use Linear?"* → README §"Why this and not Linear + git?" comparison matrix added. +- *"Is this production-ready?"* → README §"Honest status" section added, naming the v0.1.0/v0.1.1 patch dance up front. +- *"Bus factor 1 is a yellow flag"* → README §"Looking for 3-5 design partners" section added with concrete asks. 
+ +Aleksandr's F-05 dead-deps catch landed in the same M5 commit +(`339e1ab`): +``` +Remove studio-router/Cargo.toml deps lifted but unused: + - unicode-normalization + - uuid + - hex + - tracing +(carried from upstream cobrust-llm-router @ 61f2aff; not used in +the post-strip surface.) +``` + +Sarah's bus-factor + governance findings drove the M5 CI matrix and +release workflow (`58cbe94`). + +**Why this matters** + +Mei's findings, in particular, were **structurally undiscoverable by +the internal review-claude pipeline**. The internal P7-REVIEW agent's +job is "is the code sound?" — it reads the code, the ADRs, the +findings. Mei's job is "would a Python user, with no Rust background, +recognise enough vocabulary to want to install this?" — she reads +only the README from a cold-context start. + +> **Methodology learning: persona-output is the highest-leverage +> source of README/positioning PRs.** Internal reviews maintain +> internal coherence; persona simulation creates **external coherence** +> — between the project's pitch and the user's mental model. Three +> persona dispatches @ 30 min each (90 min total) produced ~15 +> concrete PR items, of which 7 landed in the same wave. + +**Back-port candidate for SKILL.md §1**: extend §"Continuous persona +testing" with a sub-bullet: + +> *"Persona output → PR mapping: each persona finding should map to +> exactly one of {README edit, ADR addendum, finding, doc fix, code +> fix}. 
If a persona finding maps to 'no action / acknowledged', it's +> a research finding (file for the case study) not a product finding."* + +### §4.4 AI velocity confirmed at ~2.5× on a 5-day plan + +**The empirical evidence** + +CLAUDE.md §6 specified a 5-day MVP target: + +| M | Scope | Day target | Actual | +|---|---|---|---| +| M0 | scaffold + 5 ADRs + 5-gate CI | Day 1 | Day 1 hour 0-2 | +| M1 | backend MVP — Axum + routes + studio-router lift | Day 2 | Day 1 hours 2-20 (A1-A5) | +| M2 | frontend MVP — SvelteKit + 4 pages | Day 3 | Day 2 hours 0-5 | +| M3 | dogfood + polish + single binary | Day 4 | Day 2 hours 5-10 | +| M4 | release v0.1.0 + demo + reviewer invite | Day 5 | Day 2 hours 10-14 | +| M5 | (post-MVP, persona-driven) | not planned | Day 2 hours 14-18 | + +**Total wall-clock: ~21 hours** for a plan estimated at 5 human-days +(40 work-hours). Velocity multiplier: **~2.5×** if we count only +human-equivalent effort (the human's 3-4 hours was strategic; the +~125 commits were all agent-produced). + +The AI velocity heuristic in SKILL.md §5 predicted *"a 5-day human +plan = ~2-day AI plan with ≤4-way parallel"*. Studio confirms the +heuristic with N=2 evidence, at slightly more conservative +parallelism (≤3-way trio). + +**The catch** + +The 2.5× velocity multiplier did NOT translate to "first tag is +shippable". The 21-hour run produced *three tags* — v0.1.0 broken, +v0.1.1 broken, v0.1.2 usable. AI velocity buys faster experimental +cycles; it does NOT buy shippable-on-first-try. **The right framing +is "first usable tag in 21 hours" not "feature-complete in 21 hours".** + +Back-port candidate for SKILL.md §5 §"AI velocity planning": + +> *"AI velocity multiplier (~2.5× to ~10×) buys experimental cycles, +> not shippable-first-try. Plan for K=2 patch tags before first +> usable tag. Each patch is its own experimental cycle; aim for total +> wall-clock = (plan_days × velocity_inverse) × (1 + K × 0.1). 
For +> Studio: (5 days × 0.4) × (1 + 2 × 0.1) = 2.4 days; reality was 0.9 +> day, comfortably under the estimate."* + +### §4.5 Persona report as PR-driver, not as theatre + +(Already covered in §4.3 above; summarized here for catalogue +back-port.) + +The pattern: persona simulation produces actionable PRs when: +1. Personas are richly defined (years exp, prior burned-by, current + frustrations) — not "a Python dev" +2. Personas have a specific scenario ("you have 30 min on HN") +3. Personas have stay-in-character constraint enforced in prompt +4. Persona output is structured ("would I upvote", "what would I PR") + +Without these four, persona simulation regresses to "an AI agent +giving generic feedback" — which is theatre. + +### §4.6 The "constitution → ADR → finding → script-enforcement" stack as a 4-layer F20 discipline + +Studio's discipline can be described as a 4-layer stack: + +| Layer | Artifact | What it enforces | Where it can fail | +|---|---|---|---| +| 1: Constitution | `CLAUDE.md` | Strategic invariants ("5 gates green before merge") | Text-only; survives only in agent's session context | +| 2: ADR | `docs/agent/adr/NNNN-*.md` | Architectural commitments ("studio-router public surface is X") | Drifts from as-built; corrected via §Addendum | +| 3: Finding | `docs/agent/findings/*.md` | Empirical observations ("the grep leaked") | Filed but no script-level enforcement | +| 4: Script | `scripts/doc-coverage.sh` | Mechanical CI gate (paired exit-code + FAILED-grep) | The ultimate truth — if it passes, the build passes | + +**F20 mandates the gradient**: every rule at layer N must have a +paired enforcement at layer N+1. 
Studio's 4-finding count maps +1-to-1 to layer transitions: + +- F-A2-01 `last_verified_commit: HEAD` placeholder leaked → layer 1 + rule had no layer 4 enforcement → fixed in `f20-closure-last-verified-commit-enforcement.md` +- F-A4-01 9 failing tests under green-gate → layer 1 rule had no + layer 4 enforcement → fixed in `cto-shougate-test-gate-grep-leak.md` +- F-M4-01 SPA fallback `Path` → layer 2 ADR-0002 (single-binary) + had no layer 4 release-readiness audit covering SPA routes → fixed + in `m4-release-readiness-spa-fallback-extractor.md` +- A1-1 strip-2 no-op at pin `61f2aff` → layer 2 ADR-0006 §"Strip + list" item #2 had no layer 4 verification of the strip; fixed by + empirically observing the absence and filing the finding. + +> **Methodology learning: the 4-layer constitution → ADR → finding → +> script stack is the right abstraction for F20.** Every rule needs +> a script-level enforcement; every finding should record which +> layer's gap it closes. + +Back-port candidate for SKILL.md Part 3 (Documentation Discipline): +make the 4-layer model explicit. 
+
+---
+
+## §5 Numbers worth quoting
+
+| Metric | Value |
+|---|---|
+| Span wall-clock | ~21 hours (2026-05-11 17:22 → 2026-05-12 14:36) |
+| Span 5-day human plan | compressed to 2 calendar days (~2.5× AI velocity) |
+| Commits on main | 125 |
+| Tags pushed | 3 (v0.1.0 / v0.1.1 / v0.1.2) |
+| Rust crates | 3 (studio-router / studio-store / studio-server) |
+| Binary size | 9.0 MiB (single-file deployment) |
+| Rust tests at HEAD | 196 (32 ok groups, 0 FAILED) |
+| Playwright e2e | 14 hermetic + 2 dogfood (all green) |
+| Real-LLM e2e | PASS (codex-forwarder + gpt-5.5) |
+| ADRs | 6 (0001..0006) |
+| Findings | 4 (P0 / P1 / P2 / P3 all represented; 3 closed within session) |
+| Module-docs | 4 (studio-router / studio-store / studio-server / web-frontend) |
+| Opus sub-agent dispatches | ~18 (6 waves × 3-team trio + 4 reconcile rounds + 1 release-readiness agent) |
+| Persona dispatches | 3 (Mei / Aleksandr / Sarah) |
+| CI gates enforced | 6 |
+| Human work-hours (estimated) | 3-4 (strategic + gatekeeping (守闸) only) |
+| Agent work-hours (estimated) | ~22 active (across parallel sub-agents) |
+| AI velocity multiplier observed | ~2.5× on a 5-day plan |
+| F1.0 catches | 2 (BSD-sed; CTO 守闸 grep leak) |
+| F19 catches | 2 (M4 SPA fallback; v0.1.1 Cargo.lock) |
+| F20 catches | 2 (last_verified_commit HEAD placeholder; recursive doc-coverage §6 closure) |
+| F21 catches | 1 prospective (zero git-author leak; all 125 commits attributed cleanly) |
+| Methodology firsts | First F20 closure in non-origin project; first documented "tag → audit → patch" release pattern; first "recursive F20 closure" |
+
+---
+
+## §6 What's still ahead (post-session)
+
+These are out of scope for this case study but worth naming for completeness:
+
+- **AEAD real round-trip on `/login` (M5+)**: WebCrypto m2-stub auth
+  blob is opaque to the server today. Users set `ANTHROPIC_API_KEY` /
+  `OPENAI_API_KEY` env var as the actual auth path. Real
+  server-side decrypt deferred.
+- **Linux + Windows tarball CI matrix**: `release.yml` workflow + landed at M5 (`58cbe94`); awaits next tag to fire. +- **ADSD case-study back-port**: this document. +- **Design partner recruitment**: README §"Looking for 3-5 design + partners" published; concrete asks enumerated. + +None of these block the N=2 dogfood validation conclusion: the +methodology survived contact with a new codebase under acceleration, +and Studio's session produced enough catalogue-augmenting evidence +to retrofit F19/F20/F21 into validated-pattern status. + +--- + +## §7 Patterns I'd carry forward (Studio → next ADSD project) + +1. **3-team trio dispatch** at ≤3-way parallel for narrow-scope + waves. Reserve P9 layer for waves needing sub-decomposition of + the ADR itself. +2. **ADR §Addendum YYYY-MM-DD pattern**: never edit §"Decision"; + append corrections preserving the original CTO Phase-1 text. The + blame-integrity move. +3. **doc-coverage.sh layered enforcement**: presence + shape + + reachability + paired-gate exit-code on `cargo test`. Six gates + minimum, not five. +4. **F21 prospective discipline**: verify `git config user.name` + before every commit; suffix every sub-agent handle with the + session ID. +5. **Tag → audit → patch as a release pattern**: under acceleration, + first tag is the experimental substrate; expect K=2 patch tags + before first usable. +6. **Persona dispatch → README rewrite pipeline**: each persona finding + maps to exactly one PR; persona output is the highest-leverage + external-coherence source. + +## §8 Patterns I'd add or strengthen for v1.2.2+ of ADSD + +1. **6-gate canonical (extend the standard 5-gate)** — add §6 + doc-coverage as a load-bearing gate, with paired exit-code + + FAILED-grep on `cargo test`. The 5-gate is insufficient under + aggressive parallelism. +2. **F20 recursive closure pattern documentation** — F20 closure is + not one-shot; every enforcement layer needs its own paired + orthogonal-failure review. +3. 
**F1 "declared-empty-must-be-observed-empty" sub-form** — when an + ADR declares the absence of something (strip-lists, prohibitions), + the absence must be empirically observed and recorded. +4. **Tag → audit → patch as a release pattern** — explicit named + pattern in §4 of SKILL.md, with the v0.1.0/v0.1.1/v0.1.2 sequence + as canonical example. +5. **AI velocity = experimental cycles, not shippable-first-try** — + sharpen the SKILL.md §5 velocity guidance to plan for K patch tags + before first usable. +6. **Persona output → PR mapping** — extend §1 continuous-persona + testing with the explicit "every persona finding maps to exactly + one PR" rule. + +## §9 Patterns I'd reconsider + +1. **3-team trio dispatch on single-surface waves**: Wave M2 (SvelteKit + frontend, 5 pages) used the 3-team pattern but the parallel + review surface was narrow — REVIEW agent had little to audit until + DEV merged. **Reserve 3-team trio for cross-crate Rust waves**; + single-P7-with-self-review-step is sufficient for narrow surface. +2. **Triple-track docs (zh / en / agent) at bus factor 1**: maintained + for methodology fidelity, but the cost is real (every doc edit + touches 3 files). Cobrust N=1 has the same observation. + Consider downgrading to dual-track (en + agent) below ~3 + contributors, per SKILL.md §3 escape hatch. + +--- + +## §10 Closing + +Cobrust Studio is not a "solved" project. It's at v0.1.2 with: +- A working 9 MiB single-binary web console for AI agent dispatch +- A 6-gate CI bar that enforces ADR + finding + bilingual doc + discipline mechanically +- 196 Rust tests + 14 Playwright e2e + 2 dogfood specs + real-LLM + e2e all green at HEAD +- A documented patch dance (v0.1.0 broken → v0.1.1 broken → v0.1.2 + usable) that names each gap by file:line + +The ADSD methodology distilled from Cobrust (N=1) was the +**experimental substrate** for Studio (N=2). 
The result confirms:
+
+- **Core invariants hold under acceleration.** 4-tier topology
+  (collapsed to P10+P7), two-phase dispatch, 5-gate verification,
+  atomic commits, worktree-per-sprint, F21 identity hygiene — all
+  executed as documented.
+- **The 5-gate is insufficient; 6-gate is the new floor.** Studio's
+  M4.1 §6 + v0.1.2 §6 paired-gate work is the canonical evidence.
+- **F19/F20/F21 are validated as prevention mechanisms, not just
+  diagnostic vocabulary.** Each fired in Studio; each prevented or
+  caught a real shipping bug.
+- **The patch dance is a release pattern, not a failure pattern.**
+  Tag → audit → patch is the right discipline under acceleration.
+
+If you adopt ADSD on your project after reading this case study,
+expect to:
+- Land your first tag in days, not weeks
+- Ship K=2 patch tags before the first usable one
+- Spend ~10% of project time on doc-coverage discipline (worth it —
+  Studio's 4 findings are all directly attributable to gate-level
+  enforcement gaps that the discipline made visible)
+- Run a persona dispatch every release — the output is your
+  highest-leverage external-coherence source.
+
+The N=2 evidence is in. ADSD v1.2.1 holds.
+
+---
+
+**Cobrust Studio origin**: 2026-05-11 17:22 +0800.
+**ADSD N=2 dogfood completed**: 2026-05-12 14:36 +0800.
+**Case study authored**: 2026-05-12 (this document).
+ +— Signed-off: studio-p7-adsd-backport-opus47 + (working window 2026-05-12; back-port commissioned by P10 CTO + studio-cto-session-002-opus47 after the v0.1.2 release sealed and + persona-audit output landed in M5) From 9c1acade5832e31af53c3388e31cc98113d54b65 Mon Sep 17 00:00:00 2001 From: Hakureirm Date: Tue, 12 May 2026 14:45:15 +0800 Subject: [PATCH 10/16] docs(skill): SKILL.md cite Cobrust Studio as N=2 case study + Origin & lineage update MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Hook the new case-study/cobrust-studio-experience.md into SKILL.md at four touch-points: - YAML frontmatter description: extend "pull case-study/... on demand" to enumerate both N=1 and N=2 case studies with one-line motivators - §"Distilled from" header: add N=2-validated-against line summarising the Studio run (125 commits, 21 hours, v0.1.0/v0.1.1/v0.1.2 patch dance) and the first F20 closure in a non-origin project - Part 7 §"Templates & Examples": list both case studies under case-study/ with N=1 / N=2 designation - §"Cross-references" (within this skill): same dual entry under "Originals (distilled from Cobrust 12-day intensive run)" - §"Origin & lineage": full N=2 paragraph describing what Studio added (first F20 systemic closure in a non-origin project; first "tag → audit → patch" release pattern documentation; first recursive F20 closure) No methodology changes; this is reference-update only. The N=2 findings themselves are documented in the case-study file landed in the prior commit. 
Signed-off: studio-p7-adsd-backport-opus47 Co-Authored-By: Claude Opus 4.7 (1M context) --- .../skills/agent-driven-development/SKILL.md | 29 +++++++++++++++---- 1 file changed, 23 insertions(+), 6 deletions(-) diff --git a/plugins/adsd/skills/agent-driven-development/SKILL.md b/plugins/adsd/skills/agent-driven-development/SKILL.md index 21aed9c..5a80cdc 100644 --- a/plugins/adsd/skills/agent-driven-development/SKILL.md +++ b/plugins/adsd/skills/agent-driven-development/SKILL.md @@ -1,6 +1,6 @@ --- name: agent-driven-development -description: ADSD methodology for managing multi-agent software projects where AI agents produce ≥70% of code. Use when starting such a project, planning P10/P9/P7 sub-agent dispatch, running tactical or strategic project reviews, drafting ADR/finding/snapshot artifacts, designing pre-release multi-agent audit teams, or diagnosing multi-agent failure modes (snapshot sediment, post-compaction role-identity drift, silent miscompile, marketing overreach without benchmark cite, sub-agent KPI self-report fidelity gaps, attribution-policy scope leaks). Provides 4-tier role topology (P10 CTO / P9 tech lead / P7 senior eng / P0 atomic + external review), two-phase dispatch SOP (Phase 1 ADR spike → Phase 2 P9 impl), 8-dimension audit pattern (4 internal + 3 persona + deep-source-read), F1–F18 failure-modes catalogue, AI velocity planning heuristic, and ADR/finding/snapshot/dispatch-prompt-{p7,p9}/handoff-cover-letter templates under templates/. Read SKILL.md first; pull reference/failure-modes-catalogue.md and case-study/cobrust-multi-agent-experience.md on demand. +description: ADSD methodology for managing multi-agent software projects where AI agents produce ≥70% of code. 
Use when starting such a project, planning P10/P9/P7 sub-agent dispatch, running tactical or strategic project reviews, drafting ADR/finding/snapshot artifacts, designing pre-release multi-agent audit teams, or diagnosing multi-agent failure modes (snapshot sediment, post-compaction role-identity drift, silent miscompile, marketing overreach without benchmark cite, sub-agent KPI self-report fidelity gaps, attribution-policy scope leaks). Provides 4-tier role topology (P10 CTO / P9 tech lead / P7 senior eng / P0 atomic + external review), two-phase dispatch SOP (Phase 1 ADR spike → Phase 2 P9 impl), 8-dimension audit pattern (4 internal + 3 persona + deep-source-read), F1–F18 failure-modes catalogue, AI velocity planning heuristic, and ADR/finding/snapshot/dispatch-prompt-{p7,p9}/handoff-cover-letter templates under templates/. Read SKILL.md first; pull reference/failure-modes-catalogue.md and case-study/cobrust-multi-agent-experience.md (N=1) + case-study/cobrust-studio-experience.md (N=2, 2-day MVP applying the methodology under acceleration) on demand. --- # Agent-Driven Software Development (ADSD) @@ -8,9 +8,11 @@ description: ADSD methodology for managing multi-agent software projects where A > A methodology for managing software projects where the bulk of the work > is done by AI agents under human strategic direction. > -> **Distilled from**: Cobrust project, **12 days wall-clock (2026-04-30 → 2026-05-12)**, ~278 commits, 48+ ADRs, 24+ findings, 2 P0 codegen bugs found via organic stress test, v0.1.0 + v0.1.1 + v0.1.2 shipped + α Phase F.2 in flight. +> **Distilled from**: Cobrust project (N=1), **12 days wall-clock (2026-04-30 → 2026-05-12)**, ~278 commits, 48+ ADRs, 24+ findings, 2 P0 codegen bugs found via organic stress test, v0.1.0 + v0.1.1 + v0.1.2 shipped + α Phase F.2 in flight. > -> **Status**: extracted 2026-05-10. 
Apply as-is or adapt; this is +> **N=2 validated against**: Cobrust Studio (2026-05-11 → 2026-05-12), 125 commits over ~21 hours, 6 ADRs, 4 findings, v0.1.0 broken → v0.1.1 broken → v0.1.2 usable patch dance documented as the canonical "tag → audit → patch" release pattern. First F20 systemic closure in a non-origin project. See `case-study/cobrust-studio-experience.md`. +> +> **Status**: extracted 2026-05-10; N=2 validated 2026-05-12. Apply as-is or adapt; this is > battle-tested but not orthodoxy. --- @@ -862,8 +864,9 @@ See `templates/` folder: - `templates/dispatch-prompt-p7.md` - `templates/handoff-cover-letter.md` -See `case-study/` folder for Cobrust experience report: -- `case-study/cobrust-multi-agent-experience.md` +See `case-study/` folder for ADSD case-study reports: +- `case-study/cobrust-multi-agent-experience.md` — **N=1** Cobrust language project (12-day multi-agent build; ~278 commits; methodology distilled from this run) +- `case-study/cobrust-studio-experience.md` — **N=2** Cobrust Studio (2-day MVP applied the codified methodology; 125 commits; first F20 closure in a non-origin project; first "tag → audit → patch" release pattern documentation) --- @@ -897,7 +900,8 @@ Everything else is adaptable. - Part 6 Full failure-modes catalogue: `reference/failure-modes-catalogue.md` - Templates: `templates/*.md` -- Cobrust case study: `case-study/cobrust-multi-agent-experience.md` +- Cobrust case study (N=1, language project, 12-day multi-agent build): `case-study/cobrust-multi-agent-experience.md` +- Cobrust Studio case study (N=2, project-management console, 2-day MVP — first F20 closure in a non-origin project; first "tag → audit → patch" release pattern documentation): `case-study/cobrust-studio-experience.md` ### Cross-pollination from Anthropic + OpenAI public guidance (v1.2.0) @@ -920,6 +924,19 @@ translation pipeline. 
~278 commits over 12 wall-clock days, 49 ADRs (0001..0048 documented here passed the test of "did we hit this in production and did the fix work?". +**N=2 validation** (2026-05-11 → 2026-05-12): Cobrust Studio — a +self-hosted web console for AI agent project management — applied +the codified methodology as input rather than co-evolving with it. +125 commits over ~21 hours wall-clock, 6 ADRs, 4 findings, 3 tags +(v0.1.0 broken / v0.1.1 broken / v0.1.2 usable). First F20 systemic +closure in a non-origin project; first documented "tag → audit → +patch" release pattern; first "recursive F20 closure" (enforcement +script auditing itself). See `case-study/cobrust-studio-experience.md`. +The methodology survived contact with a new codebase under +acceleration, surfaced 2 F1.0 catches and 2 F19/F20/F21 catches as +empirical evidence the v1.2.0/v1.2.1 catalogue additions are +load-bearing. + Specific Cobrust artifacts that inspired each Part: - Part 1 Topology: `findings/multi-agent-cobrust-topology.md` - Part 2 Two-phase dispatch: `feedback_p9_two_phase_dispatch` memory From 7321a6e7f6b648eb168cb3b3edeb614e316c4216 Mon Sep 17 00:00:00 2001 From: Hakureirm Date: Tue, 12 May 2026 15:01:09 +0800 Subject: [PATCH 11/16] =?UTF-8?q?feat(catalogue):=20F25-F28=20+=20F1.3/F1.?= =?UTF-8?q?4=20=E2=80=94=20Studio=20N=3D2=20case-study=20patterns=20promot?= =?UTF-8?q?ed=20to=20first-class=20entries?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Six Studio-surfaced patterns from cobrust-studio-experience.md promoted from case-study-only learnings to first-class catalogue entries. Bumps catalogue version 1.2.0 -> 1.2.6. 
New main entries: - F25 — Tag -> audit -> patch as a release pattern under AI velocity (discipline entry; legitimate-and-disciplined under three preconditions: honest CHANGELOG, audit-as-experiment, K-bound convergence) - F26 — Recursive enforcement-script closure required (F1 Sediment Family, orthogonal-failure sub-form; doc-coverage.sh §6 evolution + §5b paired-gate as empirical substrate) - F27 — Continuous persona testing as dev-loop primitive (discipline entry; persona -> PR -> land -> re-spawn loop with five preconditions for legitimate use) - F28 — Persona-simulation-as-validation epistemic risk (closed-feedback- loop sub-form; names the failure mode F27 regresses into without external grounding; mitigation = N=3 independent adoption / external user contact) New F1 Sediment Family sub-forms: - F1.3 — Local-vs-CI gate definition drift (sub-form of F1.2; M5.8 cargo fmt drift evidence) - F1.4 — Doc-coverage script enforces what it knows; README-vs-release-tag drifts silently (sub-form of F1.0; Sarah-v2 R9 evidence) F1 parent block upgraded from "6 sub-forms confirmed" to "8 sub-forms confirmed". Catalogue maintenance trailer updated F1-F11 -> F1-F28. 
Source citations (case study §§): - F25: §3.4 + §3.5 + §4.1 (tag dance) - F26: §3.5 + §4.2 (recursive closure) - F27: §4.3 + §4.5 (continuous persona) - F28: §4.5 + §10 (Sarah-v2 R8 closed-feedback-loop risk) - F1.3: §M5.8 (local vs CI fmt drift) - F1.4: §M5.8 Sarah-v2 R9 (README vs latest tag drift) Signed-off: studio-p7-adsd-catalogue-augment-opus47 Co-Authored-By: Claude Opus 4.7 (1M context) --- .../reference/failure-modes-catalogue.md | 740 +++++++++++++++++- 1 file changed, 728 insertions(+), 12 deletions(-) diff --git a/plugins/adsd/skills/agent-driven-development/reference/failure-modes-catalogue.md b/plugins/adsd/skills/agent-driven-development/reference/failure-modes-catalogue.md index 6153461..a903b02 100644 --- a/plugins/adsd/skills/agent-driven-development/reference/failure-modes-catalogue.md +++ b/plugins/adsd/skills/agent-driven-development/reference/failure-modes-catalogue.md @@ -1,11 +1,11 @@ --- -name: ADSD failure modes catalogue (F1-F21) -description: Concrete failure modes encountered in real ADSD projects with empirical evidence, root cause analysis, recovery patterns, and prevention mechanisms. F1 Sediment Family + F2-F21 individual entries. Add F22+ as your project hits new failure modes. +name: ADSD failure modes catalogue (F1-F28) +description: Concrete failure modes encountered in real ADSD projects with empirical evidence, root cause analysis, recovery patterns, and prevention mechanisms. F1 Sediment Family (8 sub-forms) + F2-F28 individual entries. Cobrust N=1 surfaced F1.0-F1.2 + F2-F24; Cobrust Studio N=2 surfaced F1.3, F1.4, F25-F28. Add F29+ as your project hits new failure modes. 
type: reference -version: 1.2.0 +version: 1.2.6 date: 2026-05-12 status: active -relates_to: [skill:SKILL.md §"Failure modes catalogue", case-study:cobrust-multi-agent-experience.md, reference:evals-first-development.md, reference:context-window-strategy.md, reference:cross-session-memory-architecture.md] +relates_to: [skill:SKILL.md §"Failure modes catalogue", case-study:cobrust-multi-agent-experience.md, case-study:cobrust-studio-experience.md, reference:evals-first-development.md, reference:context-window-strategy.md, reference:cross-session-memory-architecture.md] --- # Failure modes catalogue @@ -18,15 +18,18 @@ relates_to: [skill:SKILL.md §"Failure modes catalogue", case-study:cobrust-mult --- -## F1 — Declared rules without enforcement — **"F1 Sediment Family"** (**P0 SOP gap, 6 sub-forms confirmed**) +## F1 — Declared rules without enforcement — **"F1 Sediment Family"** (**P0 SOP gap, 8 sub-forms confirmed**) > **Status upgraded to "F1 Sediment Family" parent pattern** after 6 distinct -> sub-forms observed across Cobrust 11-day experiment. F1 is the single most +> sub-forms observed across Cobrust 11-day experiment, and 2 additional +> sub-forms (F1.3 local-vs-CI gate drift, F1.4 README-vs-release-tag drift) +> confirmed on Cobrust Studio's 21-hour N=2 dogfood. F1 is the single most > common systemic failure in ADSD-flavor projects. Original 3 sub-forms -> (F1.0 / F1.1 / F1.2) remain as implementation-level instances. New -> sub-forms F16, F17, F18 extend the family to identity, self-reporting, and -> attribution-policy dimensions — all share the same root: **declaration ≠ -> enforcement, and enforcement scope silently lags reality.** +> (F1.0 / F1.1 / F1.2) remain as implementation-level instances; F1.3 + F1.4 +> extend the family to enforcement-scaffold drift (script vs CI; script vs +> public surface). 
New sub-forms F16, F17, F18 extend the family to identity, +> self-reporting, and attribution-policy dimensions — all share the same root: +> **declaration ≠ enforcement, and enforcement scope silently lags reality.** > > **Family pattern one-liner**: Claim is written somewhere (constitution, > schema frontmatter, KPI card, attribution policy, auto-memory). No @@ -34,7 +37,9 @@ relates_to: [skill:SKILL.md §"Failure modes catalogue", case-study:cobrust-mult > turns. Violation is invisible until an auditor manually checks. > > See F16 (identity drift), F17 (self-report fidelity gap), F18 (attribution -> policy without dir-scope enforcement) for the three new sub-forms. +> policy without dir-scope enforcement) for the three new sub-forms; F1.3 +> (local-vs-CI gate drift), F1.4 (README-vs-release-tag drift) for the two +> Studio-surfaced scaffold-level sub-forms. ### F1.0 — Snapshot sediment ("重写忘删") @@ -133,6 +138,121 @@ specific milestone range and went stale.) Same applies to any "rule covers M0-M" pattern — it will go stale the moment M lands. +### F1.3 — Local-vs-CI gate definition drift (sub-form of F1.2) + +**Symptoms**: a project enforces "N gates green before merge" at two +layers — `scripts/doc-coverage.sh` (developer-local fast feedback) and +the GitHub Actions workflow (canonical merge gate). The mandate is +nominally identical at both layers, but the two layers define the +gate-set differently. Local script reports "all 6 gates passed"; CI +fails the PR on a 7th gate the local script never ran. The developer +sees a green local run + a red CI run and cannot reconcile them +without reading both scripts side-by-side. + +**Concrete shapes seen**: +- Local `doc-coverage.sh` runs fmt/clippy/build/test + 2 doc-shape + checks (6 steps). CI workflow runs the same 6 + a separate + `cargo fmt --check` job (7 jobs). Local passes; CI fails on fmt + drift because the local script never ran fmt-check. 
+- Local script runs `cargo test --workspace`; CI runs `cargo test
+  --workspace --all-features`. Feature-gated test fails only in CI.
+- Local script uses one cargo binary; CI uses pinned toolchain
+  version. Toolchain-specific lint fails only in CI.
+
+**Root cause**: structurally identical to F1.2 (constitution rules
+with partial-scope enforcement), but applied to the enforcement
+*scaffold itself*. The "N gates" rule is declared in the project
+constitution; the enforcement layer has two implementations (local
+script + CI workflow), and their definitions of "N" diverge silently.
+Without a meta-check that script-set ⊇ CI-set, drift is invisible
+until the next CI red.
+
+**Evidence**: Cobrust Studio M5.8 sprint, 2026-05-12. Persona auditor
+Sarah v2 caught the gap: local `scripts/doc-coverage.sh` reported "6
+gates passed"; the GitHub Actions matrix-CI workflow added at M5 (per
+Sarah v1 dispatch) ran `cargo fmt --check` as a separate job and
+failed on the same SHA the local script approved. Resolution: §5b
+added to `doc-coverage.sh` to run `cargo fmt --check` alongside the
+existing gates, restoring the script ⊇ CI invariant. See Studio case
+study §4.2 and persona-driven §M5.8.
+
+**Recovery**:
+1. Establish the invariant **script ⊇ CI** (the local script runs at
+   least every check the CI workflow runs).
+2. Add a meta-check: a small CI job that fails if any check in
+   `workflows/*.yml` lacks a corresponding step in `doc-coverage.sh`
+   (grep-driven; brittle but bounds the drift).
+3. When CI fails on a gate the local script didn't run, the fix is
+   to extend the local script in the same PR, not to silently rely
+   on CI catching it.
+
+**Prevention going forward**: in the SAME PR that introduces a new
+CI job, extend the local enforcement script to run it. The "N gates"
+mandate must name a single source of truth (the local script) and
+treat CI as the canonical re-runner of that script — not as a parallel
+gate-set.
See F26 (recursive enforcement-script closure) for the +multi-layer-review discipline this implies. + +### F1.4 — Doc-coverage script enforces what it knows, README-vs-release-tag drifts silently (sub-form of F1.0) + +**Symptoms**: a project's `scripts/doc-coverage.sh` enforces invariants +on artifacts it knows about — module-doc `last_verified_commit:` SHA +reachability, ADR roster completeness, findings frontmatter shape. The +script is rigorous on its declared scope. Meanwhile, the public README +ships claims that **the script has no clause for**: +- README badge shows version `vX.Y.Z` while the latest pushed tag is + `vX.Y.(Z+1)` (badge-vs-tag drift) +- README §"Install" describes a single-platform tarball while + `release.yml` builds a 5-platform matrix (asset-coverage drift) +- README §"Compare to X" cites old positioning while the current + positioning was updated 3 commits ago (narrative drift) + +The script is green; the public surface is stale. Discovery happens +only when a persona auditor or new visitor reads the README cold. + +**Root cause**: F1.0 family — the script enforces what it knows to +enforce. Its declared scope (module-doc / ADR / finding) is rigorous; +its undeclared scope (README ↔ latest tag, README ↔ release.yml +matrix, README ↔ current positioning) is unenforced. The script +doesn't know to enforce these; nobody told it to. + +**Evidence**: Cobrust Studio post-v0.1.3 sprint, persona auditor Sarah +v2 R9 finding (2026-05-12). README §"Releases" badge displayed +`v0.1.2` and §"Install" described a single-platform `aarch64-apple-darwin` +tarball, AFTER v0.1.3 had shipped with a 5-platform `release.yml` +matrix (Linux + macOS x86_64/aarch64 + Windows). `doc-coverage.sh` +green; `last_verified_commit:` rigorous on every module-doc; README +content not under any gate. See Studio case study §M5.8 and Sarah-v2 +R9. + +**Rule of thumb**: + +> **What the doc-coverage script doesn't enforce, drifts. 
The script +> enforces what it knows to enforce. Anything outside its declared +> scope is on human discipline alone — i.e., it will drift.** + +**Recovery**: +1. For every public-facing artifact (README, release notes, landing + page), enumerate the claims that are bound to a current-tag value + (badge SHA, asset names, platform matrix, version string). +2. Add a `scripts/doc-coverage.sh` clause per claim: + - Badge SHA must equal `git describe --tags --abbrev=0` + - Every asset URL in README must resolve via `gh api` + - Every platform mentioned in README must appear in + `.github/workflows/release.yml` matrix +3. Mark previously-aspirational claims (e.g. "single-platform tarball" + wording) as ASPIRATIONAL per F1's generalized prevention rule, or + add the enforcement. + +**Prevention going forward**: when introducing a new public-facing +claim (README §"Install", release notes), in the same commit add the +script clause that enforces the claim. F1 family applied to public +surface, not just internal scaffolding. Composes with F19 +(release-readiness independent install-test) and F8 (marketing +overreach without citation): F19 verifies the install path runs; F8 +verifies marketing claims have citations; F1.4 verifies README claims +track current tag. + ### Generalized prevention going forward (P0 SOP) > **Any project-level rule without an automated check is security @@ -1587,6 +1707,601 @@ This pattern composes with F19 (install-not-tested): both reflect a gap between --- +## F25 — Tag → audit → patch as a release pattern under AI velocity (discipline, not failure) + +> **Discipline entry, not a defect pattern**. F25 is the empirically validated +> *legitimate-and-disciplined* form of what would otherwise read as "shipping +> broken tags". The pattern only becomes anti-pattern when its three preconditions +> (honest CHANGELOG, audit-as-experiment, K-bound convergence) are violated — +> see §"When F25 degrades into anti-pattern" below. 
Catalogued here because under
+> AI velocity (~2.5×-10×) the first tag will not be the publishable one, and the
+> right discipline is to *plan for K patch tags* rather than aim for shippable-on-first-try.
+
+### Definition
+
+Under AI-velocity acceleration, a project ships its first tag with the
+expectation that the **first release-readiness audit will reveal an enforcement
+gap that intent-driven self-checks missed**. The pattern is:
+
+```
+Tag v0.1.                              ← experiment substrate
+    ↓
+Release-readiness audit in clean shell ← observation
+    ↓ (BLOCK)
+Finding filed + patch + tag v0.1.      ← learning
+    ↓
+Re-audit
+    ↓ (GO)
+Announce, publish notes
+```
+
+Each tag is the experiment; each audit is the observation; each patch is
+the learning. The pattern's success metric is **bounded convergence after K
+patches**, not "first tag is perfect". For Cobrust Studio: K=2 (v0.1.0 broken
+→ v0.1.1 broken → v0.1.2 usable, in 6 hours wall-clock).
+
+### Symptoms (legitimate form)
+
+- Multiple consecutive patch-tags in a single calendar day (v0.1.0 → v0.1.1
+  → v0.1.2 in 6 hours)
+- Each tag has its own CHANGELOG entry naming the gap explicitly
+  ("v0.1.1 stale Cargo.lock; cargo build --locked exit 101")
+- Each tag has a corresponding finding under `docs/agent/findings/` filed
+  before the next patch
+- README §"Honest status" or equivalent names the patch dance up front
+  for users
+- Total K is bounded (typically 2–3); convergence is not "endless patch
+  spiral"
+
+### When F25 degrades into anti-pattern
+
+F25 becomes a defect pattern (and should be filed as a separate finding)
+when any of the following hold:
+
+1. **No honest CHANGELOG**: subsequent tag silently overwrites prior
+   without naming the broken state. Users cannot distinguish which tags
+   to skip. *Recovery*: amend CHANGELOG at the next patch; never delete
+   the prior tag's broken state.
+2.
**Audit-as-ceremony, not audit-as-experiment**: the release-readiness + audit is rubber-stamping rather than truly running install commands + in a clean shell. Same F19 (release-readiness untested) instance, + wearing a release-pattern costume. +3. **K unbounded**: more than ~3 patch tags without convergence suggests + the project is missing a structural fix (the F20/F26 enforcement + layer the patches are nominally closing). *Recovery*: stop tagging; + land the enforcement-script fix; re-tag once. + +### Root cause + +AI-velocity acceleration buys experimental cycles, not shippable-first-try. +Under a 5-day human plan compressed to 2 days, the writer's mental model of +"what will install correctly" diverges from the actual artifact more than +under a 5-day human cadence. The release-readiness audit (F19's prevention +mechanism) catches the divergence; the patch closes it. The pattern is +*the right discipline* for AI velocity — but only with the three preconditions +above honored. + +### Evidence + +Cobrust Studio 2026-05-12, three consecutive tags in 6 hours wall-clock +(case study §3.4, §3.5, §4.1): + +1. **v0.1.0** (commit `a722e09`, tag `0a7fd3e`): SPA fallback regression + (`Path` on `Router::fallback`) shipped. Post-tag CTO 守闸 + release-readiness audit ran hermetic Playwright against + `./target/release/cobrust-studio` built from main HEAD; 13/14 e2e specs + failed. Finding `m4-release-readiness-spa-fallback-extractor.md` filed + P0. +2. **v0.1.1** (commit `15b6f46`): SPA fallback fixed via `Uri` extractor. + Stale Cargo.lock shipped; `cargo build --workspace --locked` exit 101. + `release-tarball.sh` errored; CHANGELOG names the gap. +3. **v0.1.2** (commit `7ea9ae3`): Cargo.lock regenerated + `doc-coverage.sh` + §6 hardened with paired exit-code + FAILED-grep gate. Release-readiness + audit returned GO. First usable tag. + +CHANGELOG names each broken tag explicitly; README §"Honest status" names +the patch dance up front. All three preconditions honored. 
K=2 (within +the bounded convergence claim). + +### Rule of thumb + +> **Under AI velocity, plan for K=2 patch tags before first usable. The +> right discipline is fast experimental cycle: tag, audit in clean shell, +> patch the gap, re-tag.** +> +> Hard preconditions for the pattern to remain legitimate-and-disciplined: +> +> 1. **Honest CHANGELOG**: each broken tag named with its gap; no quiet +> retag. +> 2. **Audit-as-experiment**: the release-readiness audit must actually +> run commands in a clean shell, not read the README. +> 3. **K-bound convergence**: K ≤ 3 typical. If K > 3, the underlying +> enforcement-script layer is missing — stop tagging, fix the +> enforcement, re-tag once. + +### Recovery + +When F25 is firing (legitimate use): + +1. After each patch tag, file a finding naming the gap as an instance + of F19/F20/F26 (which enforcement layer was missing). +2. Update `scripts/doc-coverage.sh` or equivalent enforcement script + in the same PR as the patch, closing the gap structurally — not + just fixing the symptom. +3. Verify convergence: each subsequent patch should close a *different* + gap. Two consecutive patches closing the same gap = K-bound violated, + stop tagging. + +When F25 has degraded into anti-pattern (quiet retag / endless spiral): + +1. Audit CHANGELOG: name every prior broken state retroactively. +2. Locate the missing enforcement layer (the F20 instance the patches + are nominally closing); land it; re-tag once. +3. Communicate to users: "we shipped K tags rapidly; here is what + each one fixed; here is the structural fix we landed at v0.1.". + +### Prevention going forward + +Adopt F25 as an explicit release pattern in `cto_operations_runbook.md` +§"Tagging policy" for any AI-velocity project: + +- Plan for K=2 patch tags in the release window. +- Spawn the release-readiness agent (F19) on **every** tag push, not + just the planned "final" one. +- CHANGELOG template includes a §"This tag is known-broken; upgrade to + v0.1." 
section for any tag the audit returned BLOCK on. +- README §"Honest status" names the current usable tag, not the latest + tag — users can find both with `git tag --sort=-creatordate`. + +This composes with F19 (release-readiness untested — F25's audit step +*is* an F19 prevention exercise) and F20 (constitution-vs-workflow +alignment — each patch is an F20 closure landed in the same PR as the +fix). + +--- + +## F26 — Recursive enforcement-script closure required (F1 Sediment Family, orthogonal-failure sub-form) + +> **F1 sub-form, confirmed**. Direct refinement of F20 (constitution-vs-workflow +> alignment). F20 closure is not one-shot; every enforcement layer needs its +> own paired review against orthogonal failure modes on the same code path. +> A doc-coverage gate hardened against pattern X can ship green against pattern +> Y on the same operation. Studio's `doc-coverage.sh` §6 evolution is the +> empirical substrate: two patches before the §6 gate stopped letting things +> through. + +### Definition + +An enforcement script (CI lint, doc-coverage gate, pre-commit hook) is +written or hardened to catch failure mode X on operation Op. The script +appears correct against X. The script ships green against failure mode Y +on the same operation Op — Y being a different shape of the same underlying +contract violation that X manifests. Each enforcement layer needs its own +paired orthogonal-failure review until the failure-mode class no longer +recurs. 
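The X-versus-Y gap in the definition above can be sketched in shell. This is a minimal illustration under stated assumptions, not the project's actual `doc-coverage.sh`: the function name `run_gate` and its messages are hypothetical. It pairs the two checks the catalogue names for test runs, the command's exit code and a grep for a `test result: FAILED` summary line, so that neither failure shape alone can pass the gate.

```shell
#!/usr/bin/env bash
# Hypothetical paired gate; names and messages are illustrative only.

# run_gate CMD [ARGS...]
# Returns 0 only when CMD exits 0 AND its output has no FAILED summary line.
run_gate() {
  local output status
  # Capture combined output; record the exit status without tripping `set -e`.
  output="$("$@" 2>&1)" && status=0 || status=$?
  printf '%s\n' "$output"

  # Axis 1: non-zero exit, possibly with no summary line at all
  # (for example a build or lockfile error before any test runs).
  if [ "$status" -ne 0 ]; then
    echo "gate: command exited with status $status" >&2
    return 1
  fi

  # Axis 2: a FAILED summary line, even when the exit code reads as success.
  if printf '%s\n' "$output" | grep -q '^test result: FAILED'; then
    echo "gate: FAILED summary line found in output" >&2
    return 1
  fi

  return 0
}
```

A call such as `run_gate cargo test --workspace --locked || exit 1` would apply both checks to one invocation. Other axes (hang, OOM) are not covered by this sketch and would need their own checks, such as a timeout wrapper.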
+ +### Symptoms + +- An F20 closure (script hardened against bug pattern X) ships green + against bug pattern Y the same week +- The script's invariant is declared once ("no test failures shipped") but + the operation has multiple orthogonal failure shapes (FAILED summary line + emitted vs exit code only vs hang vs panic vs OOM) +- "Two strikes" pattern: same script, same invariant, two consecutive + bypasses through different failure modes +- Auditor's review of the script reads correct against the failure mode + that motivated the script's creation, but doesn't scan for orthogonal + failure modes on the same code path + +### Root cause + +Enforcement-script authors close the failure mode that triggered the script. +They do not scan the same operation for other failure modes that would +bypass the new check. The script's coverage is local to the bug; the +contract's coverage is global to the operation. Closing X without checking +Y leaves the layer half-closed. + +This is structurally a recursive application of F20 (constitution-vs-workflow: +mandate vs workflow has a gap). F26 is F20 applied to the workflow itself +— the enforcement layer is a workflow, the workflow has a gap, the gap +becomes a new finding, the new closure may itself have a gap. + +### Evidence + +Cobrust Studio `doc-coverage.sh` §6 evolution (2026-05-12; case study §3.5 +and §4.2): + +| Stage | Enforcement | Gap revealed | Closure tag | +|---|---|---|---| +| Pre-M4.1 | `grep '^test result' \| wc -l` | Counts both `ok` and `FAILED` as "result" lines; 9 failing tests shipped as "22 test groups all green" | A4 merge `8d5475f` shipped 9 failing integration tests under green-gate | +| M4.1 | `grep -c '^test result: FAILED'` | Misses non-zero exit without summary line (e.g. `cargo build --locked` exit 101 from lockfile mismatch) | v0.1.1 tag `15b6f46` shipped broken | +| v0.1.2 | Paired: `if ! 
cargo test ...` AND FAILED-grep | Both classes now caught | v0.1.2 tag `7ea9ae3` first usable | + +Each fix was complete against the bug class it was designed for. But the +enforcement layer had orthogonal failure modes (a FAILED summary line +emitted vs not emitted on `cargo test --locked`) that needed their own +paired review. + +A second F26 instance landed in M5.8: `doc-coverage.sh` §5b added `cargo +fmt --check` after Sarah-persona v2 caught local "6 gates passed" while +CI's separate `cargo fmt --check` job failed on the same SHA — the §5b +gate was missing because the §6 gate's authoring scope was "test-failure +shape", not "any orthogonal pre-merge check the project also runs in CI". +Same F26 pattern, different orthogonal failure axis. + +### Rule of thumb + +> **When closing an F20 instance, scan for orthogonal failure modes on the +> same code path BEFORE declaring the closure complete.** +> +> Ask explicitly: "could my enforcement layer still pass under a different +> failure shape of the same operation?" If yes, the closure is partial. +> +> Common orthogonal axes to enumerate per operation: +> +> | Operation | Orthogonal failure axes | +> |---|---| +> | `cargo test --locked` | exit code ≠ 0 / FAILED summary line / hang / panic / OOM / lockfile mismatch / build error | +> | Frontmatter SHA check | absent / placeholder string ("HEAD") / wrong hex shape / hex-shaped but unreachable / wrong-branch SHA | +> | README install command | URL 404 / URL redirect needs -fsSL / asset name typo / wrong-arch asset / missing dependency | +> | CI matrix job | platform missing / runner image deprecated / cache miss balloons time / artifact upload silently truncated | + +### Recovery + +When F26 fires (a closure shipped, then a sibling failure bypassed it): + +1. **Add the paired check to the same script in the same PR**. Don't + wait for the next sprint. +2. **Enumerate orthogonal failure axes for the operation** (use the table + above as a starting point; extend per project).
+3. **Add a "deliberately-broken-input test" in CI**: feed the enforcement + script a fixture for each orthogonal failure mode; assert exit ≠ 0. + This is the F20 §"Rule of thumb" layer-3 enforcement applied to F26. +4. **Document the closure as a finding**: `