diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 9488ca5..7b4d107 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -7,7 +7,7 @@ findings, atomic commits, doc-coverage on the same change. | Type | Where it lands | Bar | |---|---|---| -| New failure mode (F22+) | `reference/failure-modes-catalogue.md` | At least one concrete empirical instance with citation | +| New failure mode (F31+) | `reference/failure-modes-catalogue.md` | At least one concrete empirical instance with citation | | Case study extension | `case-study/-multi-agent-experience.md` | Real project, real outcomes, dated entries | | Template improvement | `templates/.md` | Backwards-compatible OR new template | | Section refinement in SKILL.md | `SKILL.md` | Cite at least one project where this refinement was tested | @@ -16,7 +16,7 @@ findings, atomic commits, doc-coverage on the same change. ## What we will not merge - Speculative methodology rules without an empirical instance -- F-pattern proposals that are restatements of existing F1-F21 +- F-pattern proposals that are restatements of existing F1-F30 - "Tone" rewrites that lose specific examples or evidence - Removal of attribution (e.g. dropping "discovered_by" frontmatter) @@ -28,7 +28,7 @@ findings, atomic commits, doc-coverage on the same change. doc + code (if any) + cross-references in the same commit. 3. **For F-pattern additions**: include `## FN — Title`, `**Signal**`, `**Root cause**`, `**Evidence**` (cite project + commit SHA or - equivalent), `**Rule of thumb**`. Mirror the existing F1-F21 entry + equivalent), `**Rule of thumb**`. Mirror the existing F1-F30 entry shape. 4. **For case studies**: use day-by-day or week-by-week structure. Mark counterfactuals (`What would have failed without this discipline:`). 
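The entry shape required in item 3 above can be checked mechanically before review. The following is a minimal sketch under stated assumptions: `check_f_entry` is a hypothetical helper, not a script in this repo, and it assumes each proposed entry sits in its own file with the heading and bold labels written exactly as item 3 spells them.

```sh
# Hypothetical helper (not part of this repo): verify that one F-pattern
# entry file carries the heading and the four bold labels named in item 3.
# Prints one line per missing element and returns non-zero if any is missing.
check_f_entry() {
  local file="$1" status=0 label
  # Heading such as "## F31 — Title"; also accepts dotted ids like F1.2.
  if ! grep -Eq '^## F[0-9]+(\.[0-9]+)? — ' "$file"; then
    echo "MISSING: '## FN — Title' heading"
    status=1
  fi
  for label in 'Signal' 'Root cause' 'Evidence' 'Rule of thumb'; do
    # -F treats the asterisks as literal characters, not regex operators.
    if ! grep -Fq "**${label}**" "$file"; then
      echo "MISSING: **${label}**"
      status=1
    fi
  done
  return "$status"
}
```

A contributor might run `check_f_entry my-entry.md` (file name illustrative) before opening a PR. The check covers only shape; whether the cited evidence is a real empirical instance still needs a human reviewer.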
diff --git a/README.md b/README.md index d8e73f4..accc865 100644 --- a/README.md +++ b/README.md @@ -1,6 +1,6 @@ # Agent-Driven Software Development (ADSD) -> Methodology distilled from running a 9-week multi-agent Rust compiler +> Methodology distilled from running a 12-day multi-agent Rust compiler > project where AI agents wrote ≥ 70% of the code under human strategic > direction. @@ -10,9 +10,9 @@ ## What this is ADSD is **not a framework**. It's a documented working style that survived -contact with reality: ~178 commits, ~2,611 tests, 43 ADRs, 19 findings, 21 -documented failure modes, 2 P0 codegen bugs caught via organic stress test, -and a 0.1.1 release shipped publicly. +contact with reality: ~278 commits, ~2,611 tests, 49 ADRs (0001..0048 + 0047a), +27 findings, 24 documented failure modes, 2 P0 codegen bugs caught via organic +stress test, and v0.1.2 stable shipped publicly + α Phase F.2 in flight. ADSD codifies the discipline that kept the multi-agent project coherent: ADRs as decision capture, findings as negative-result memory, bilingual @@ -46,7 +46,7 @@ simple workflows. After install, invoke via `/agent-driven-development` or let Claude pick it automatically based on context — the description-triggered activation fires -for multi-agent dispatch planning, ADR drafting, F1–F18 failure-mode triage, +for multi-agent dispatch planning, ADR drafting, F1–F30 failure-mode triage, pre-release audit team design, and similar prompts. 
### As a personal skill (fallback, no plugin system) @@ -74,14 +74,14 @@ agent-driven-development/ ├── .claude-plugin/ │ └── marketplace.json # Self-hosted single-plugin marketplace catalog ├── plugins/ -│ └── agent-driven-development/ # Plugin root (matches marketplace.json source) +│ └── adsd/ # Plugin root (matches marketplace.json source) │ ├── .claude-plugin/ │ │ └── plugin.json # Plugin manifest │ └── skills/ │ └── agent-driven-development/ # Skill — auto-discovered by Claude Code │ ├── SKILL.md # Main methodology document (~36 KB) │ ├── reference/ -│ │ └── failure-modes-catalogue.md # F1-F21 anti-patterns with empirical evidence +│ │ └── failure-modes-catalogue.md # F1-F30 anti-patterns with empirical evidence │ ├── case-study/ │ │ └── cobrust-multi-agent-experience.md # The founding case study (N=1) │ └── templates/ @@ -100,16 +100,41 @@ agent-driven-development/ ## Quick start (after install) 1. Read [`SKILL.md`](./plugins/adsd/skills/agent-driven-development/SKILL.md) for the full methodology (~36 KB, 30 min). -2. Read [`reference/failure-modes-catalogue.md`](./plugins/adsd/skills/agent-driven-development/reference/failure-modes-catalogue.md) for the F1–F18 anti-patterns you'll likely hit. Don't re-derive them. +2. Read [`reference/failure-modes-catalogue.md`](./plugins/adsd/skills/agent-driven-development/reference/failure-modes-catalogue.md) for the F1–F30 anti-patterns you'll likely hit. Don't re-derive them. 3. Read [`case-study/cobrust-multi-agent-experience.md`](./plugins/adsd/skills/agent-driven-development/case-study/cobrust-multi-agent-experience.md) to see ADSD applied in practice (warts and all). 4. Copy the relevant template from [`templates/`](./plugins/adsd/skills/agent-driven-development/templates/) into your project's `docs/agent/` tree. 5. Start writing ADRs as decisions actually happen — not speculatively. +## Documentation + +User-facing docs are in [`docs/human/`](./docs/human/) (zh + en parallel per ADSD §3 bilingual mandate). 
Agent-facing meta-conventions for this repo are in [`docs/agent/`](./docs/agent/). + +### Bilingual user docs + +| Topic | 中文 | English | +|---|---|---| +| Getting started — 30-min onboarding | [`docs/human/zh/getting-started.md`](./docs/human/zh/getting-started.md) | [`docs/human/en/getting-started.md`](./docs/human/en/getting-started.md) | +| Concept map — mermaid diagrams + narrative | [`docs/human/zh/concept-map.md`](./docs/human/zh/concept-map.md) | [`docs/human/en/concept-map.md`](./docs/human/en/concept-map.md) | + +### Agent-facing meta-conventions + +- [`docs/agent/conventions.md`](./docs/agent/conventions.md) — repo structure, frontmatter contracts, bilingual mandate enforcement, commit message format, identity hygiene (F21) + +### Doc coverage gate + +`scripts/doc-coverage.sh` enforces the ADSD §3 bilingual mandate on this repo itself: every `docs/human/zh/*.md` MUST have a parallel `docs/human/en/*.md`. Run locally before commits: + +```sh +bash scripts/doc-coverage.sh +``` + +The script also verifies that reference files have YAML frontmatter and that ADR filenames are zero-padded and monotonically numbered. This applies the ADSD §3 mandate to ADSD itself as systemic prevention of F20. + ## Origin ADSD was extracted from the [Cobrust](https://github.com/Cobrust-lang/cobrust) project, a Rust-implemented Python successor with an AI-native compiler. -Cobrust shipped its `0.1.0` stable tag on 2026-05-10 after a 9-week run +Cobrust shipped its `0.1.0-beta` tag on 2026-05-10 and `0.1.0` stable on 2026-05-11, after an 11-day intensive run (first commit 2026-04-30 → v0.1.0 stable tag 2026-05-11; v0.1.2 + α Phase F.2 followed) with multiple parallel Claude agents (Opus 4.7 and Sonnet 4.6) coordinated via the methodology you'll find in [`SKILL.md`](./plugins/adsd/skills/agent-driven-development/SKILL.md). @@ -126,7 +151,7 @@ project so the methodology can be tested outside its founding context. File an issue describing your project if interested. ADSD is **battle-tested but not orthodoxy**. 
Adapt it. If you find a -failure mode we missed, propose F22+ via a PR to +failure mode we missed, propose F31+ via a PR to [`reference/failure-modes-catalogue.md`](./plugins/adsd/skills/agent-driven-development/reference/failure-modes-catalogue.md). ## Contributing diff --git a/docs/agent/conventions.md b/docs/agent/conventions.md new file mode 100644 index 0000000..651a46c --- /dev/null +++ b/docs/agent/conventions.md @@ -0,0 +1,185 @@ +--- +name: ADSD repo conventions +description: Meta-conventions for this repo itself. ADSD codifies how to manage AI-agent projects; this file applies ADSD discipline to the ADSD methodology itself. Agents contributing to this repo should read this first. +type: convention +version: 1.0.0 +date: 2026-05-12 +status: active +relates_to: [SKILL.md §"Documentation Discipline", README.md §Contributing, CONTRIBUTING.md] +--- + +# ADSD repo conventions + +> ADSD is a methodology for managing AI-agent software projects. This repo IS such a project. Therefore, **ADSD applies to ADSD**. This file captures the meta-conventions specific to this repo's contributors (humans and AI agents). + +## Repo structure (binding) + +``` +agent-driven-development/ +├── .claude-plugin/marketplace.json # Plugin marketplace catalog +├── plugins/ +│ └── adsd/ +│ └── skills/ +│ └── agent-driven-development/ # The skill — auto-loaded by Claude Code +│ ├── SKILL.md # Main methodology (~36 KB) +│ ├── reference/ # Deep-dive references (F-patterns, evals, prompts, etc.) 
+│ ├── case-study/ # Founding case study (Cobrust N=1) +│ └── templates/ # Templates for ADR / finding / dispatch / snapshot / handoff +├── docs/ +│ ├── human/ +│ │ ├── zh/ # Chinese user docs — 与 en 一一对应 +│ │ └── en/ # English user docs — 1:1 parity with zh +│ └── agent/ +│ ├── conventions.md # This file +│ ├── adr/ # Meta-ADRs for ADSD itself +│ └── findings/ # Findings about ADSD's evolution +├── scripts/ +│ └── doc-coverage.sh # Enforces zh+en parity per ADSD §3 mandate +├── CONTRIBUTING.md # Human-facing contribution guide +├── LICENSE-APACHE / LICENSE-MIT +└── README.md # Entry point +``` + +**Binding constraints**: + +1. Every file in `docs/human/zh/` MUST have a parallel file at `docs/human/en/` with the same filename. Enforced by `scripts/doc-coverage.sh`. +2. Every reference under `plugins/adsd/skills/agent-driven-development/reference/` MUST have YAML frontmatter (`name`, `description`, `type`, `version`, `date`, `status`, `relates_to`). +3. The SKILL.md `description` field is the auto-activation trigger — keep it keyword-dense and specific. +4. ADRs in `docs/agent/adr/` are zero-padded sequential (`0001-*.md`, `0002-*.md`, ...). Once accepted, an ADR is immutable; supersede via a new ADR. + +## Frontmatter contracts + +### Reference files (in `plugins/adsd/skills/agent-driven-development/reference/`) + +```yaml +--- +name: +description: +type: reference +version: +date: +status: active | deprecated | candidate +relates_to: [skill:SKILL.md §section, reference:other-file.md, ...] +--- +``` + +### Meta-ADRs (in `docs/agent/adr/`) + +```yaml +--- +doc_kind: adr +adr_id: +title: +status: proposed | accepted | superseded | deprecated +date: +last_verified_commit: +supersedes: [, ...] +superseded_by: [, ...] +relates_to: [, , ...] +--- +``` + +### Findings (in `docs/agent/findings/`) + +```yaml +--- +doc_kind: finding +finding_id: +last_verified_commit: +status: open | closed | partial +discovered_by: +dependencies: [adr:, finding:, ...] 
+--- +``` + +## Bilingual docs mandate (ADSD §3 dogfood) + +The skill's SKILL.md §3 mandates that every public item gets entries in: + +- `docs/human/zh/.md` +- `docs/human/en/.md` (1:1 parity) +- Agent-facing schema (in this repo: SKILL.md + reference/) + +This rule applies to ADSD itself. `scripts/doc-coverage.sh` enforces zh+en parity. + +**Operative checks** (run by `doc-coverage.sh`): + +1. Every `docs/human/zh/*.md` has a parallel `docs/human/en/*.md` +2. Every `docs/human/en/*.md` has a parallel `docs/human/zh/*.md` +3. Parallel files have identical filenames (case-sensitive) +4. (Future) Section headers are 1:1 between zh and en + +CI fails if any check fails. + +## When to add a new ADR vs amend SKILL.md vs add a finding + +| Change type | Where | Trigger | +|---|---|---| +| New methodology rule | `docs/agent/adr/NNNN-.md` | The rule affects ≥2 reference files, templates, or SKILL.md sections | +| Refine existing reference | edit the reference file directly + note in commit | Single-file refinement | +| Document an ADSD evolution event | `docs/agent/findings/.md` | Real-world ADSD use surfaced a gap or worked unexpectedly well | +| Update SKILL.md | edit SKILL.md + cross-reference an ADR if it's a binding rule | Adds a new "Part N" or modifies an existing one | +| New cross-pollination ref (Anthropic / OpenAI / other) | `plugins/.../reference/.md` | New industry pattern worth adopting | + +## When NOT to add an ADR + +- Bug fix in a reference doc (typo, broken link) +- Updating frontmatter date / last_verified +- Adding an example to an existing section +- Re-organizing within a single file +- Translation update (zh ⟵→ en sync) + +Per ADSD §"ADR vs Finding distinction": ADRs are forward-looking decisions; small refinements don't need them. 
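The operative checks listed under the bilingual docs mandate above can be sketched in a few lines of shell. This is an illustrative sketch only: the real `scripts/doc-coverage.sh` is not shown in this change and may work differently, and `check_parity` is a hypothetical name.

```sh
# Hypothetical sketch, not the actual scripts/doc-coverage.sh.
# For a root such as docs/human, report every *.md file that exists under
# one language directory but has no same-named file under the other.
# Returns non-zero if any file lacks its parallel.
check_parity() {
  local root="$1" status=0 src dst f name
  for src in zh en; do
    if [ "$src" = zh ]; then dst=en; else dst=zh; fi
    for f in "$root/$src"/*.md; do
      [ -e "$f" ] || continue          # glob matched nothing
      name="$(basename "$f")"
      if [ ! -f "$root/$dst/$name" ]; then
        echo "MISSING: $dst/$name (parallel of $src/$name)"
        status=1
      fi
    done
  done
  return "$status"
}
```

Calling `check_parity docs/human` from the repo root would cover checks 1 and 2. Check 3 (case-sensitive names) holds only on a case-sensitive filesystem, since `[ -f ]` follows the filesystem's own matching rules, so a CI run on Linux is the safer place to rely on it.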
+ +## Commit message format + +``` +<type>(<scope>): <subject> [vX.Y.Z] +``` + +- `<type>`: `feat`, `docs`, `fix`, `refactor`, `chore` +- `<scope>`: `skill`, `reference`, `case-study`, `templates`, `docs-zh`, `docs-en`, `meta` +- Include `[vX.Y.Z]` semver if the change is release-worthy + +Examples: + +``` +feat(reference): add evals-first-development.md [v1.2.0] +docs(docs-zh): translate getting-started.md to match en parity +fix(skill): correct cross-reference path after plugin layout migration +chore(meta): bump SKILL.md description for trigger keyword coverage +``` + +Sign with session ID per F21 (Cross-session identity overload): + +``` +Co-Authored-By: Claude Opus 4.7 (session XYZ) +``` + +## Identity hygiene (F21 closure) + +Per F21 codification: + +- Do NOT sign as bare "review-claude" or "ADSD-author" in commits or files +- Use session-stamped attribution: `review-claude (session 4bb35f43)` or `ADSD-author (session XYZ)` +- Reserve plain handles for the abstract role in narrative prose only + +## Versioning policy + +- **v1.0.x** — initial release, plugin migration +- **v1.1.x** — F19/F20/F21 codification +- **v1.2.x** — cross-pollination references (Anthropic + OpenAI) +- **v1.3.x** — bilingual docs + remaining G3/G5/G6/G8/G9/G10/G12 gaps + +Semver bumps follow SemVer 2.0: + +- MAJOR: breaking change to skill format, plugin layout, or canonical paths +- MINOR: new reference file, new template, new ADR +- PATCH: refinement, typo, frontmatter update, translation sync + +## Cross-references + +- `CONTRIBUTING.md` — human-facing contribution flow +- `plugins/adsd/skills/agent-driven-development/SKILL.md` §"Documentation Discipline" — methodology origin of these rules +- `plugins/adsd/skills/agent-driven-development/reference/failure-modes-catalogue.md` F1 family + F19 + F20 + F21 — the failure modes these conventions prevent +- `scripts/doc-coverage.sh` — machine enforcement of zh+en parity diff --git a/docs/human/en/concept-map.md b/docs/human/en/concept-map.md new file mode 100644 index 
0000000..d0bb669 --- /dev/null +++ b/docs/human/en/concept-map.md @@ -0,0 +1,149 @@ +# ADSD concept map + +> Mermaid diagrams + short prose to unpack the full ADSD concept landscape at once. + +## Top-level view + +```mermaid +flowchart TB + Constitution[CLAUDE.md constitution] --> Decisions{Need a decision?} + Decisions -->|Yes, affects ≥2 files| ADR[ADR — decision record] + Decisions -->|No, single file| InCode[Just code it] + + Implementation[Implementation work] --> Failure{Did something break?} + Failure -->|Yes| Finding[Finding — failure record] + Failure -->|No| Continue[Continue] + + State[Project state] --> Snapshot[snapshot.md — state snapshot] + + ADR --> Sprint[Sprint = Wave + Tx] + Sprint --> Dispatch[Dispatch P9/P7 sub-agent] + + Dispatch --> Drating{D-Matrix assessment} + Drating -->|D0 doc-only| Sonnet[sonnet solo] + Drating -->|D1-D3 multi complexity| Pair[dev/test pair TDD] + Drating -->|D4 ADR| OpusSolo[opus solo, P9 personal] + Drating -->|D5 real LLM/consensus| OpusPair[opus dev + opus test] + + Pair --> CommitWave[Atomic commit + Wave merge] + OpusPair --> CommitWave + + CommitWave --> ReleaseGate{Release artifact?} + ReleaseGate -->|Yes| ReleaseReady[Release-readiness agent independent verify] + ReleaseReady -->|GO| Tag[git tag v0.X.Y] + ReleaseReady -->|BLOCK| Fix[fix-pack sprint] + Fix --> ReleaseReady +``` + +## Three abstraction layers (slow → fast) + +```mermaid +flowchart LR + Strategy[Strategy — month-scale] --> Tactical[Tactical — week-scale] + Tactical --> Execution[Execution — hour/day-scale] + + Strategy -.includes.-> Constitution[Constitution] & Wedge[Wedge / strategic direction] & Roadmap[Milestone roadmap] + Tactical -.includes.-> ADRs[ADRs] & Findings[Findings] & Waves[Waves] & PreMortem[Pre-mortem] + Execution -.includes.-> Dispatch[Sub-agent Dispatch] & Tx[Tx commits] & Gates[5-gate + 6th eval-gate] & Release[Release-readiness] + + Strategy -.through.- Tactical -.through.- Execution +``` + +- **Strategy layer**: CLAUDE.md 
rarely changes; month-scale decisions. Changing it = major project pivot. +- **Tactical layer**: ADR + Finding added weekly; milestone checkpoints. +- **Execution layer**: daily sprints, sub-agent dispatch, gate enforcement, atomic commits. + +## Failure modes (F1 Sediment Family) panorama + +```mermaid +flowchart TB + F1[F1 Sediment Family — declared-without-enforcement] --> F1_0[F1.0 schema invariant] + F1 --> F1_1[F1.1 snapshot HEAD freshness] + F1 --> F1_2[F1.2 ADR roster completeness] + F1 --> F16[F16 post-compaction identity drift] + F1 --> F17[F17 sub-agent KPI self-report] + F1 --> F18[F18 attribution policy scope] + F1 --> F19[F19 install-not-tested] + F1 --> F20[F20 constitution-vs-workflow] + F1 --> F21[F21 cross-session identity overload] + + F1_0 -.via.-> Enforce0[snapshot-lint Inv] + F1_1 -.via.-> Enforce1[pre-commit hook] + F16 -.via.-> Enforce16[auto-memory identity preamble] + F17 -.via.-> Enforce17[verification commands block in completion report] + F19 -.via.-> Enforce19[release-readiness agent in clean shell] + F20 -.via.-> Enforce20[D-matrix + dev/test pair workflow] + F21 -.via.-> Enforce21[session-ID stamping convention] +``` + +Each F-pattern has a corresponding enforcement mechanism. The F1 Family core lesson: **declaring rules isn't enough; you must have machine / workflow enforcement**. + +## Four-layer storage model (memory decision) + +```mermaid +flowchart TB + NewInfo{Where to write new info?} --> Type{What category?} + Type -->|Identity / operative rule / SOP| L1[L1 auto-memory
~/.claude/projects//memory/] + Type -->|Cross-file decision| L2A[L2 ADR
docs/agent/adr/] + Type -->|Failure / surprise / dead-end| L2B[L2 Finding
docs/agent/findings/] + Type -->|Project state fact| L2C[L2 Snapshot
project_state_snapshot.md] + Type -->|Mid-sprint working state| L3[L3 session scratch
notes in messages] + Type -->|Re-fetchable ephemeral output| L4[L4 ephemeral
don't store] + + L1 -.auto-load.-> Session[Session start] + L2A -.persists in.-> Repo[git history] + L2B -.persists in.-> Repo + L2C -.persists in.-> Repo + L3 -.persists until.-> SessionEnd[Session end] +``` + +When unsure, **default to L3 scratch**. Promotion to L1/L2 is a deliberate decision at sprint-end, not in-flight. + +## Dispatch protocol (dev/test pair pattern) + +```mermaid +sequenceDiagram + participant P9 as P9 Tech Lead + participant Test as P7 Test Agent + participant Dev as P7 Dev Agent + + P9->>P9: Assess D-rating (D1-D3 / D5 → pair) + P9->>Test: spawn (TDD step 1 — write failing test corpus) + Test-->>P9: [P7-TEST-CORPUS-READY] N tests, K fail + P9->>P9: review test corpus (10 min) + P9->>Dev: spawn (TDD dev step — implement + pass corpus) + Dev-->>P9: [P7-DEV-COMPLETION] cargo test 0 fail + P9->>P9: verify gate + atomic commit + P9-->>CTO: [P9-MILESTONE-COMPLETION] +``` + +**Why a separate test agent + dev agent is mandatory**: a single agent writing impl + test has confirmation bias — the test verifies what the agent intended, not what the spec demands. Separate test agent eliminates the bias. + +## Release closure (with release-readiness) + +```mermaid +flowchart LR + Code[Code Ready] --> Gate5[5-gate Green
fmt+clippy+build+test+doc-cov] + Gate5 --> Gate6[6th gate — Eval Delta Non-Regression] + Gate6 --> ReleaseFile[Edit Release Notes / README] + ReleaseFile --> ReleaseAgent[Spawn Release-readiness Agent
clean shell + curl + cargo install --dry-run] + ReleaseAgent --> Decision{GO or BLOCK?} + Decision -->|GO| Tag[git tag v0.X.Y] + Decision -->|BLOCK| Fix[Fix root cause] + Fix --> ReleaseAgent +``` + +**F19 closure key**: don't let the agent that wrote the docs self-verify the docs. **Independent release-readiness agent in a clean shell** is the only robust F19 defense. + +## Turning these diagrams into practice + +Each diagram is a "practice script": + +- Top-level view → follow this flow for a new project +- Three abstraction layers → team cadence, what to do daily/weekly/monthly +- F1 Family → consult this when you hit a wall, find the missing enforcement +- Storage four-layer → consult the decision tree before writing +- Dispatch protocol → P9 follows this sequence when initiating a sprint +- Release closure → mandatory path before any tag + +See [`getting-started.md`](./getting-started.md) 5-step practice section to map these diagrams to concrete commands. diff --git a/docs/human/en/getting-started.md b/docs/human/en/getting-started.md new file mode 100644 index 0000000..6b0e50a --- /dev/null +++ b/docs/human/en/getting-started.md @@ -0,0 +1,145 @@ +# Getting started + +> **Goal**: in 30 minutes, an engineer unfamiliar with ADSD has the ADRs + findings + sub-agent dispatch discipline running on their own project. + +## Who should read this + +- You're managing a project with **multi-agent parallelism** (≥3 AI agents working concurrently) +- You want to avoid the multi-agent endemic ailments: sediment / drift / silent regression +- You already use Claude Code / Cursor / similar IDE-agent tools at a basic level +- You have a git project to apply this methodology to + +If you're writing a single-agent small script, ADSD is overkill. Skip. + +## 30-second overview + +ADSD is the multi-agent working discipline distilled from 12 days of intensive Cobrust development (2026-04-30 → 2026-05-12, ~278 commits), codifying: + +1. 
**Decision capture** — every cross-file decision becomes an ADR (Architecture Decision Record) +2. **Failure capture** — every "this broke / surprised / dead-ended" becomes a Finding (negative result) +3. **Dispatch discipline** — D0-D5 difficulty matrix + dev/test pair TDD protocol + +Plus **bilingual docs mandate** + **wave + Tx atomic commits** + **F1-F30 anti-pattern catalogue** + **release-readiness pre-publish independent verification**. That's the full picture. + +Detailed architecture: [`concept-map.md`](./concept-map.md) + +## Three install paths + +### Method 1 (recommended) — Claude Code plugin + +``` +/plugin marketplace add Cobrust-lang/agent-driven-development +/plugin install adsd@adsd +``` + +Once installed, when a prompt mentions "multi-agent dispatch / ADR drafting / F1-F30 failure modes" etc., Claude auto-activates the ADSD skill. + +### Method 2 — Personal skill directory (fallback) + +```sh +mkdir -p ~/.claude/skills +git clone --depth 1 https://github.com/Cobrust-lang/agent-driven-development.git /tmp/adsd-src +cp -r /tmp/adsd-src/plugins/adsd/skills/agent-driven-development ~/.claude/skills/ +rm -rf /tmp/adsd-src +``` + +### Method 3 — Read-only (no install, just markdown) + +Read [`plugins/adsd/skills/agent-driven-development/SKILL.md`](https://github.com/Cobrust-lang/agent-driven-development/blob/main/plugins/adsd/skills/agent-driven-development/SKILL.md) top-to-bottom (~30 min) for the full methodology. No install required to learn. + +## First real use — 5 steps + +Assume you have a project at `~/my-project/` and want to start with ADSD. 
+ +### Step 1: Create the project `CLAUDE.md` (constitution) + +Write a ~30-line project constitution at `~/my-project/CLAUDE.md` with at minimum: + +- **Project identity** — one-line pitch (what + who uses it) +- **What you keep** (good properties borrowed from other tools / languages / workflows) +- **What you drop** (explicit anti-patterns) +- **Engineering standards** — Elegant / Scientific / Efficient with 3-5 concrete rules each +- **Milestone roadmap** — M0 (scaffold) → M1 → ... 6-12 months out + +Reference: the "Engineering standards" section of ADSD's own SKILL.md can serve as a template. + +### Step 2: Create `docs/agent/` + `docs/human/{zh,en}/` skeleton + +```sh +cd ~/my-project +mkdir -p docs/agent/adr docs/agent/findings docs/agent/modules +mkdir -p docs/human/zh docs/human/en +``` + +Copy ADSD's `templates/adr-template.md` to `docs/agent/adr/_template.md` as your ADR drafting template. Do the same for the finding and snapshot templates. + +### Step 3: Write ADR-0001 (license choice) + +Every project's first ADR is typically the license choice (Apache+MIT dual, or BSL-1.1, or ...). This is **the start of the mandatory ADR flow** — one cross-file decision taken through the complete process: Context → Options → Decision → Consequences → Cross-references. + +### Step 4: Build `MEMORY.md` index (Claude Code auto-memory) + +If you use Claude Code, project-level memory lives in `~/.claude/projects//memory/`. Create the `MEMORY.md` index with one-line hooks: + +``` +- [Project identity preamble](identity.md) — read first when resuming a session +- [Subagent model tier rule](subagent_tiers.md) — D0-D5 matrix per ADSD +- [CTO operations runbook](runbook.md) — dispatch SOPs +``` + +See [`reference/cross-session-memory-architecture.md`](https://github.com/Cobrust-lang/agent-driven-development/blob/main/plugins/adsd/skills/agent-driven-development/reference/cross-session-memory-architecture.md). 
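Steps 2 and 3 above involve creating ADR files with zero-padded sequential numbers. The following is a small sketch of how the next number could be derived, assuming ADR files are named `NNNN-slug.md` with a four-digit prefix. `next_adr_id` is a hypothetical helper, not something ADSD's templates ship.

```sh
# Hypothetical helper: print the next four-digit ADR id for a directory.
# Assumes ADR files are named NNNN-slug.md. Files that do not start with
# four digits and a hyphen (for example _template.md) are ignored.
next_adr_id() {
  local dir="$1" max=0 f num
  for f in "$dir"/[0-9][0-9][0-9][0-9]-*.md; do
    [ -e "$f" ] || continue        # glob matched nothing: no ADRs yet
    num="$(basename "$f")"
    num="${num%%-*}"               # keep the leading NNNN
    # Strip leading zeros so the shell does not read 0008 as octal.
    while [ "${num#0}" != "$num" ]; do num="${num#0}"; done
    [ -n "$num" ] || num=0
    if [ "$num" -gt "$max" ]; then max="$num"; fi
  done
  printf '%04d\n' "$((max + 1))"
}
```

For example, `cp docs/agent/adr/_template.md "docs/agent/adr/$(next_adr_id docs/agent/adr)-license-choice.md"` would create the first ADR from the template (the slug is illustrative). Suffixed ids are skipped by the glob, so a project that uses them would need to extend the pattern.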
+ +### Step 5: First sub-agent dispatch (using ADSD D-matrix) + +Use Claude Code's Agent tool to dispatch a concrete task. **The prompt MUST include difficulty self-rating**: + +``` +DIFFICULTY-RATING: D2 (multi-fn stdlib API new, single crate, ADR clear) +MODEL-DEV: sonnet +MODEL-TEST: sonnet +PAIR: yes + +MISSION: implement such that all passes. + +REQUIRED READS: +- /abs/path/to/ADR-0XXX.md +- /abs/path/to/test_corpus.rs +- see reference/prompt-engineering-patterns.md PT2 (few-shot output format) + +REPORT FORMAT: [P7-COMPLETION] with verification block (paste raw cargo test output, no paraphrase) +``` + +See [`reference/prompt-engineering-patterns.md`](https://github.com/Cobrust-lang/agent-driven-development/blob/main/plugins/adsd/skills/agent-driven-development/reference/prompt-engineering-patterns.md). + +## Verify you installed correctly + +Run these two checks: + +```sh +# 1. Verify plugin activated +/plugin status adsd + +# 2. In Claude Code, ask a question with ADSD keywords +"I need to plan a multi-agent dispatch, how do I use the D-matrix to assess difficulty?" +``` + +If Claude auto-references ADSD's `reference/` files, you installed correctly. If Claude answers from general knowledge, the skill didn't activate. + +## Next steps + +- Read [`concept-map.md`](./concept-map.md) for the complete ADSD concept diagram +- When you hit a wall, write a finding. Don't hide it. F1-F30 catalogue is at [`reference/failure-modes-catalogue.md`](https://github.com/Cobrust-lang/agent-driven-development/blob/main/plugins/adsd/skills/agent-driven-development/reference/failure-modes-catalogue.md); you may have hit the same one + +## FAQ + +**Q: My project is small. Do I really need ADRs?** +A: Only for decisions affecting ≥2 files. Single-file modifications don't write ADRs. Bug fixes don't write ADRs (but do write findings). + +**Q: Bilingual docs feel burdensome?** +A: ADSD mandates this because it addresses the real "Chinese teams are natively multilingual" problem. 
Single-language projects can relax this; but the README + getting-started bilingual pair is recommended. + +**Q: D-matrix is tedious, do I need to evaluate every time?** +A: Manual evaluation for the first 5 times; after that it becomes muscle memory. Skipping costs you model-tier mismatch (F20 family) — projects that hit it think it's worth it. + +**Q: I use OpenAI not Anthropic?** +A: ADSD is LLM-agnostic. D-matrix / dev-test pair / evals-first are all vendor-neutral. The Claude Code plugin is just a distribution channel; the methodology itself doesn't bind to Anthropic. diff --git a/docs/human/zh/concept-map.md b/docs/human/zh/concept-map.md new file mode 100644 index 0000000..44a1265 --- /dev/null +++ b/docs/human/zh/concept-map.md @@ -0,0 +1,149 @@ +# ADSD 概念图 + +> 用 mermaid 图表 + 简短文字, 把 ADSD 全套概念一图打散. + +## 顶层视图 + +```mermaid +flowchart TB + Constitution[CLAUDE.md 宪法] --> Decisions{需要决策吗?} + Decisions -->|是, 跨 ≥2 文件| ADR[ADR — 决策记录] + Decisions -->|否, 单文件| InCode[就在代码里改] + + Implementation[实施工作] --> Failure{出问题了吗?} + Failure -->|是| Finding[Finding — 失败记录] + Failure -->|否| Continue[继续] + + State[项目状态] --> Snapshot[snapshot.md — 状态快照] + + ADR --> Sprint[Sprint = Wave + Tx] + Sprint --> Dispatch[Dispatch P9/P7 sub-agent] + + Dispatch --> Drating{D-Matrix 评估} + Drating -->|D0 doc-only| Sonnet[sonnet solo] + Drating -->|D1-D3 多复杂度| Pair[dev/test pair TDD] + Drating -->|D4 ADR| OpusSolo[opus solo, P9 亲笔] + Drating -->|D5 真 LLM/consensus| OpusPair[opus dev + opus test] + + Pair --> CommitWave[原子 commit + Wave merge] + OpusPair --> CommitWave + + CommitWave --> ReleaseGate{Release artifact?} + ReleaseGate -->|是| ReleaseReady[Release-readiness agent 独立验证] + ReleaseReady -->|GO| Tag[git tag v0.X.Y] + ReleaseReady -->|BLOCK| Fix[fix-pack sprint] + Fix --> ReleaseReady +``` + +## 三层抽象 (从慢到快) + +```mermaid +flowchart LR + Strategy[战略层 — 月级别] --> Tactical[战术层 — 周级别] + Tactical --> Execution[执行层 — 小时/天级别] + + Strategy -.包含.-> Constitution[宪法] & Wedge[Wedge / 战略方向] & 
Roadmap[Milestone 路线] + Tactical -.包含.-> ADRs[ADRs] & Findings[Findings] & Waves[Waves] & PreMortem[Pre-mortem] + Execution -.包含.-> Dispatch[Sub-agent Dispatch] & Tx[Tx commits] & Gates[5-gate + 6th eval-gate] & Release[Release-readiness] + + Strategy -.通过.- Tactical -.通过.- Execution +``` + +- **战略层**: CLAUDE.md 不常改, 月级别决策. 改 = 项目重大转向. +- **战术层**: ADR + Finding 每周新增, milestone 检查点. +- **执行层**: 每日 sprint, sub-agent 派活, gate 通过, atomic commit. + +## 失败模式 (F1 Sediment Family) 全景 + +```mermaid +flowchart TB + F1[F1 Sediment Family — declared-without-enforcement] --> F1_0[F1.0 schema invariant] + F1 --> F1_1[F1.1 snapshot HEAD freshness] + F1 --> F1_2[F1.2 ADR roster completeness] + F1 --> F16[F16 post-compaction identity drift] + F1 --> F17[F17 sub-agent KPI self-report] + F1 --> F18[F18 attribution policy scope] + F1 --> F19[F19 install-not-tested] + F1 --> F20[F20 constitution-vs-workflow] + F1 --> F21[F21 cross-session identity overload] + + F1_0 -.通过.-> Enforce0[snapshot-lint Inv] + F1_1 -.通过.-> Enforce1[pre-commit hook] + F16 -.通过.-> Enforce16[auto-memory identity preamble] + F17 -.通过.-> Enforce17[verification commands block in completion report] + F19 -.通过.-> Enforce19[release-readiness agent in clean shell] + F20 -.通过.-> Enforce20[D-matrix + dev/test pair workflow] + F21 -.通过.-> Enforce21[session-ID stamping convention] +``` + +每个 F-pattern 都有对应的 enforcement 机制. F1 Family 的核心 lesson: **声明规则不够, 必须有机器/工作流强制**. + +## 四层 storage 模型 (memory 决策) + +```mermaid +flowchart TB + NewInfo{新信息要写哪?} --> Type{是哪类?} + Type -->|身份 / 操作规则 / SOP| L1[L1 auto-memory
~/.claude/projects//memory/] + Type -->|跨文件决策| L2A[L2 ADR
docs/agent/adr/] + Type -->|失败 / 意外 / 死胡同| L2B[L2 Finding
docs/agent/findings/] + Type -->|项目状态事实| L2C[L2 Snapshot
project_state_snapshot.md] + Type -->|本 sprint 工作中| L3[L3 session scratch
消息中的笔记] + Type -->|可再 fetch 的临时输出| L4[L4 ephemeral
不存] + + L1 -.auto-load.-> Session[Session start] + L2A -.持续到.-> Repo[git history] + L2B -.持续到.-> Repo + L2C -.持续到.-> Repo + L3 -.持续到.-> SessionEnd[Session end] +``` + +不确定就**默认 L3 scratch**. 升级到 L1/L2 是 sprint 收尾时**主动决策**, 不在过程中. + +## Dispatch 协议 (dev/test pair pattern) + +```mermaid +sequenceDiagram + participant P9 as P9 Tech Lead + participant Test as P7 Test Agent + participant Dev as P7 Dev Agent + + P9->>P9: 评估 D-rating (D1-D3 / D5 → pair) + P9->>Test: spawn (TDD step 1 — 写 failing 测试集) + Test-->>P9: [P7-TEST-CORPUS-READY] N tests, K fail + P9->>P9: review test corpus (10 min) + P9->>Dev: spawn (TDD dev step — 实现 + 通过 corpus) + Dev-->>P9: [P7-DEV-COMPLETION] cargo test 0 fail + P9->>P9: verify gate + atomic commit + P9-->>CTO: [P9-MILESTONE-COMPLETION] +``` + +**为什么必须独立 test agent + dev agent**: 同一个 agent 写 impl + test 会有 confirmation bias — test 验证的是它自己想做的, 不是 spec 要求的. 独立 test agent 消除偏见. + +## Release 闭环 (含 release-readiness) + +```mermaid +flowchart LR + Code[Code Ready] --> Gate5[5-gate Green
fmt+clippy+build+test+doc-cov] + Gate5 --> Gate6[6th gate — Eval Delta Non-Regression] + Gate6 --> ReleaseFile[Edit Release Notes / README] + ReleaseFile --> ReleaseAgent[Spawn Release-readiness Agent
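clean shell + curl + cargo install --dry-run] + ReleaseAgent --> Decision{GO or BLOCK?} + Decision -->|GO| Tag[git tag v0.X.Y] + Decision -->|BLOCK| Fix[Fix root cause] + Fix --> ReleaseAgent +``` + +**Key to closing the F19 loop**: never let the agent that wrote the docs verify its own docs. **Running an independent release-readiness agent in a clean shell** is the only robust defense against F19. + +## Turning these diagrams into practice + +Each diagram is a kind of "playbook": + +- Top-level view → follow this flow when starting a new project +- Three-layer abstraction → team cadence: what to do daily / weekly / monthly +- F1 Family → when you hit a pitfall, check this diagram to see which enforcement is missing +- Four-layer storage → consult the decision tree before writing anything down +- Dispatch protocol → P9 follows this sequence when kicking off a sprint +- Release loop → always walk this path before tagging + +See the 5-step walkthrough in [`getting-started.md`](./getting-started.md) to turn these diagrams into concrete commands. diff --git a/docs/human/zh/getting-started.md b/docs/human/zh/getting-started.md new file mode 100644 index 0000000..0e7d0a1 --- /dev/null +++ b/docs/human/zh/getting-started.md @@ -0,0 +1,145 @@ +# Getting started + +> **Goal**: within 30 minutes, get an engineer unfamiliar with ADSD using ADRs + findings + disciplined sub-agent dispatch in their own project. + +## Who should read this + +- You are managing a software project with **multiple agents working in parallel** (≥3 AI agents at once) +- You want to avoid sediment / drift / silent regression, the **chronic ailments of multi-agent work** +- You already know the basics of Claude Code / Cursor / similar IDE-agent tools +- You have a git project to apply the methodology to + +If you are only writing a small single-agent script, ADSD is overkill; skip it. + +## 30-second overview + +ADSD is a **multi-agent working discipline** distilled from 12 days of intensive development on the Cobrust project (2026-04-30 → 2026-05-12, ~278 commits). It makes three things non-negotiable: + +1. **Decision capture** — every cross-file decision gets an ADR (Architecture Decision Record) +2. **Failure capture** — every "crash / surprise / dead end" gets a Finding (negative result) +3. **Principled dispatch** — D0-D5 difficulty matrix + dev/test pair TDD dispatch protocol + +Add **mandatory bilingual docs** + **wave + Tx atomic commits** + **the F1-F30 anti-pattern catalogue** + **independent release-readiness verification before launch**, and you have the whole of ADSD. + +Detailed architecture: [`concept-map.md`](./concept-map.md) + +## Three ways to install + +### Option 1 (recommended) — Claude Code plugin + +``` +/plugin marketplace add Cobrust-lang/agent-driven-development +/plugin install adsd@adsd +``` + +Once installed, Claude activates the ADSD skill automatically when a prompt hits keywords such as "multi-agent dispatch / ADR drafting / F1-F30 failure mode". 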
+ +### Option 2 — personal skill directory (fallback) + +```sh +mkdir -p ~/.claude/skills +git clone --depth 1 https://github.com/Cobrust-lang/agent-driven-development.git /tmp/adsd-src +cp -r /tmp/adsd-src/plugins/adsd/skills/agent-driven-development ~/.claude/skills/ +rm -rf /tmp/adsd-src +``` + +### Option 3 — read-only (no install, just read the markdown) + +Read [`plugins/adsd/skills/agent-driven-development/SKILL.md`](https://github.com/Cobrust-lang/agent-driven-development/blob/main/plugins/adsd/skills/agent-driven-development/SKILL.md) directly; the core methodology takes about 30 minutes. You can learn it without installing anything. + +## First hands-on run — 5 steps + +Suppose you have a project at `~/my-project/` and want to start using ADSD. + +### Step 1: Create the project `CLAUDE.md` (constitution) + +In `~/my-project/CLAUDE.md`, write a 30-line project constitution containing at least: + +- **Project identity** — a one-line pitch (what it is + who uses it) +- **What to keep** (good properties borrowed from other languages / tools / workflows) +- **What to discard** (explicit anti-patterns) +- **Engineering standards** — 3-5 concrete rules each for Elegant / Scientific / Efficient +- **Milestone table** — M0 (scaffold) → M1 → ... now + the next 6-12 months + +Reference: the "Engineering standards" section of ADSD's own SKILL.md serves as a template. + +### Step 2: Create the `docs/agent/` + `docs/human/{zh,en}/` directory skeleton + +```sh +cd ~/my-project +mkdir -p docs/agent/adr docs/agent/findings docs/agent/modules +mkdir -p docs/human/zh docs/human/en +``` + +Copy ADSD's `templates/adr-template.md` to `docs/agent/adr/_template.md` as the ADR drafting template. Do the same for finding-template and snapshot-template. + +### Step 3: Write ADR-0001 (license choice) + +The first ADR in most projects is the license choice (Apache+MIT dual, or BSL-1.1, or ...). This is where **mandatory use of the ADR process** begins: a decision spanning multiple files goes through the full flow: Context → Options → Decision → Consequences → Cross-references. + +### Step 4: Set up the `MEMORY.md` index (Claude Code auto-memory) + +If you use Claude Code, project-level memory lives in `~/.claude/projects//memory/`. 
Create a `MEMORY.md` index, one entry per line: + +``` +- [Project identity preamble](identity.md) — read first when resuming a session +- [Subagent model tier rule](subagent_tiers.md) — D0-D5 matrix per ADSD +- [CTO operations runbook](runbook.md) — dispatch SOPs +``` + +See [`reference/cross-session-memory-architecture.md`](https://github.com/Cobrust-lang/agent-driven-development/blob/main/plugins/adsd/skills/agent-driven-development/reference/cross-session-memory-architecture.md) for details. + +### Step 5: First sub-agent dispatch (using the ADSD D-matrix) + +Use Claude Code's Agent tool to dispatch one concrete task. **The prompt must include a difficulty self-rating**: + +``` +DIFFICULTY-RATING: D2 (multi-fn stdlib API new, single crate, ADR clear) +MODEL-DEV: sonnet +MODEL-TEST: sonnet +PAIR: yes + +MISSION: implement  such that  all pass. + +REQUIRED READS: +- /abs/path/to/ADR-0XXX.md +- /abs/path/to/test_corpus.rs +- see reference/prompt-engineering-patterns.md PT2 (few-shot output format) + +REPORT FORMAT: [P7-COMPLETION] with verification block (paste raw cargo test output, no paraphrase) +``` + +See [`reference/prompt-engineering-patterns.md`](https://github.com/Cobrust-lang/agent-driven-development/blob/main/plugins/adsd/skills/agent-driven-development/reference/prompt-engineering-patterns.md) for details. + +## Verify your install + +Run these two checks in Claude Code: + +```sh +# 1. Verify the plugin is active +/plugin status adsd + +# 2. Ask a question in Claude Code that contains ADSD keywords +"I need to plan a multi-agent dispatch; how do I use the D-matrix to assess difficulty?" +``` + +If Claude automatically cites ADSD's references, the install worked. If Claude answers from general knowledge, the skill did not activate. + +## Next steps + +- Read [`concept-map.md`](./concept-map.md) for the full ADSD concept map +- When you hit a pitfall, write a finding; don't hide it. The F1-F30 catalogue is at [`reference/failure-modes-catalogue.md`](https://github.com/Cobrust-lang/agent-driven-development/blob/main/plugins/adsd/skills/agent-driven-development/reference/failure-modes-catalogue.md); you may well be hitting one that is already documented + +## FAQ + +**Q: My project is small. Do I really need ADRs?** +A: Only write one for decisions that span ≥2 files. Single-file changes don't need one. Bug fixes don't either (but do write a finding). + +**Q: Isn't bilingual zh + en documentation too heavy a burden?** +A: ADSD mandates it because it solves a real problem: "Chinese teams are naturally multi-lingual". Single-language projects can relax this, but keeping README + getting-started bilingual is recommended. 
+ +**Q: The D-matrix is tedious. Do I have to think it through every time?** +A: Rate manually the first 5 times; after that it becomes muscle memory. The cost of skipping it is model-tier mismatch (F20 family) — projects that have actually been bitten consider it worth it. + +**Q: What if I use OpenAI rather than Anthropic?** +A: ADSD is LLM-agnostic. D-matrix / dev-test pair / evals-first are all vendor-neutral. The Claude Code plugin is only a distribution channel; the methodology itself is not tied to Anthropic. diff --git a/plugins/adsd/skills/agent-driven-development/SKILL.md b/plugins/adsd/skills/agent-driven-development/SKILL.md index 7492bbe..0cb6d9d 100644 --- a/plugins/adsd/skills/agent-driven-development/SKILL.md +++ b/plugins/adsd/skills/agent-driven-development/SKILL.md @@ -1,6 +1,6 @@ --- name: agent-driven-development -description: ADSD methodology for managing multi-agent software projects where AI agents produce ≥70% of code. Use when starting such a project, planning P10/P9/P7 sub-agent dispatch, running tactical or strategic project reviews, drafting ADR/finding/snapshot artifacts, designing pre-release multi-agent audit teams, or diagnosing multi-agent failure modes (snapshot sediment, post-compaction role-identity drift, silent miscompile, marketing overreach without benchmark cite, sub-agent KPI self-report fidelity gaps, attribution-policy scope leaks). Provides 4-tier role topology (P10 CTO / P9 tech lead / P7 senior eng / P0 atomic + external review), two-phase dispatch SOP (Phase 1 ADR spike → Phase 2 P9 impl), 8-dimension audit pattern (4 internal + 3 persona + deep-source-read), F1–F18 failure-modes catalogue, AI velocity planning heuristic, and ADR/finding/snapshot/dispatch-prompt-{p7,p9}/handoff-cover-letter templates under templates/. Read SKILL.md first; pull reference/failure-modes-catalogue.md and case-study/cobrust-multi-agent-experience.md on demand. +description: ADSD methodology for managing multi-agent software projects where AI agents produce ≥70% of code. 
Use when starting such a project, planning P10/P9/P8/P7 sub-agent dispatch, running tactical or strategic project reviews, drafting ADR/finding/snapshot artifacts, designing pre-release multi-agent audit teams, or diagnosing multi-agent failure modes (snapshot sediment, post-compaction role-identity drift, silent miscompile, marketing overreach without benchmark cite, sub-agent KPI self-report fidelity gaps, attribution-policy scope leaks). Provides 5-tier role topology (P10 CTO / P9 tech lead / P8 domain expert / P7 senior eng / P0 atomic + external review), two-phase dispatch SOP (Phase 1 ADR spike → Phase 2 P9 impl), snapshot-first documentation discipline, staged mocked-to-live progression guidance, 8-dimension audit pattern (4 internal + 3 persona + deep-source-read), F1–F30 failure-modes catalogue, AI velocity planning heuristic, and ADR/finding/snapshot/dispatch-prompt-{p7,p9}/handoff-cover-letter templates under templates/. Read SKILL.md first; pull reference/failure-modes-catalogue.md and case-study/cobrust-multi-agent-experience.md on demand. --- # Agent-Driven Software Development (ADSD) @@ -8,9 +8,7 @@ description: ADSD methodology for managing multi-agent software projects where A > A methodology for managing software projects where the bulk of the work > is done by AI agents under human strategic direction. > -> **Distilled from**: Cobrust project, ~178 commits, ~24 hours of intense -> multi-agent work, 39 ADRs, 14 findings, 2 P0 codegen bugs found via -> organic stress test, 0.1.0-beta release plan. +> **Distilled from**: Cobrust project, **12 days wall-clock (2026-04-30 → 2026-05-12)**, ~278 commits, 48+ ADRs, 24+ findings, 2 P0 codegen bugs found via organic stress test, v0.1.0 + v0.1.1 + v0.1.2 shipped + α Phase F.2 in flight. > > **Status**: extracted 2026-05-10. Apply as-is or adapt; this is > battle-tested but not orthodoxy. @@ -54,9 +52,20 @@ timetable, hand-wave verification claims. 
## Part 1 — Roles & Topology -ADSD uses a **4-tier role hierarchy** plus an external review track. +ADSD uses a **5-tier role hierarchy** plus an external review track. Roles map to model size + autonomy budget, not to humans. +### Why P8 became a first-class role + +As multi-agent projects grow beyond a single stream of work, P9 is easily overloaded if domain refinement stays informal. Once one lead agent is simultaneously doing milestone decomposition, cross-workstream coordination, domain-boundary design, acceptance ownership, and close-out policing, execution quality falls because task packages arrive underspecified. + +The correction is to make **P8 domain ownership explicit** instead of implicit: +- `P9` owns delivery orchestration across workstreams. +- `P8` owns domain-boundary refinement and acceptance inside one workstream. +- `P7` executes against a bounded package rather than improvising architecture. + +Treat this as a strong default once a project has multiple active workstreams, non-trivial acceptance boundaries, or a P9 that is starting to absorb both delivery orchestration and domain-detail ownership. If the project is still tiny, one agent can temporarily wear both P9 and P8 hats, but the handoff responsibilities should still be named separately so the split can become explicit later. + ### P10 — CTO / Architect (human-led, strategic) **Responsibility**: @@ -80,10 +89,10 @@ milestone. Most turns CTO is just merging PRs and unblocking P9. 
**Responsibility**: - Take CTO's strategic anchor → decompose into ≤ 5 sub-tasks -- Write **Task Prompts** for P7 sub-agents (the "六要素": working dir + - required reads + mission + deliverables + gates + report format) +- Sequence workstreams and delivery gates +- Write **Task Prompts** for downstream agents - Run **Two-phase dispatch SOP** (see Part 2) -- Receive [P7-COMPLETION] reports → verify gates → merge or reject +- Receive completion reports → verify gates → merge or reject **Model**: Opus or top-tier sonnet. P9 is reasoning-heavy. @@ -92,14 +101,31 @@ wall-clock. **Trigger words**: "tech-lead", "拆这个需求", "manage agent team". +### P8 — Domain Expert / Staff Engineer (agent-led, workstream owner) + +**Responsibility**: +- Refine one workstream's domain design before coding starts +- Define module and contract boundaries inside that workstream +- Shape acceptance criteria and the close-out checklist +- Package tasks so P7 can execute without changing architecture on the fly +- Review implementation completeness before work returns to P9 + +**Model**: Opus or top-tier sonnet. P8 is domain-detail heavy rather than milestone-orchestration heavy. + +**Cadence**: 1-3 P8 workstreams can sit under one P9. Each P8 sprint is usually 30-120 min of refinement plus close-out review. + +**Trigger words**: "domain owner", "staff engineer", "refine this workstream", "shape acceptance", "package implementation". 
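One way to make the P8 hand-off concrete is to treat the package as a small record and hold dispatch until every section is filled in. The sketch below is illustrative only: `TaskPackage`, its field names, `missing_fields`, and `ready_for_p7` are hypothetical and are not part of ADSD's templates. The fields simply mirror the P8 responsibilities listed above (module and contract boundaries, acceptance criteria, close-out checklist), and the `parser` workstream is a made-up example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TaskPackage:
    """Hypothetical record of what P8 hands to P7 (not an ADSD template)."""

    workstream: str
    boundaries: List[str] = field(default_factory=list)           # module / contract boundaries
    acceptance_criteria: List[str] = field(default_factory=list)  # what "done" means
    closeout_checklist: List[str] = field(default_factory=list)   # reviewed before work returns to P9


def missing_fields(pkg: TaskPackage) -> List[str]:
    """Return the names of required sections that are still empty."""
    required = {
        "boundaries": pkg.boundaries,
        "acceptance_criteria": pkg.acceptance_criteria,
        "closeout_checklist": pkg.closeout_checklist,
    }
    return [name for name, items in required.items() if not items]


def ready_for_p7(pkg: TaskPackage) -> bool:
    """A package is dispatchable only when no required section is empty."""
    return not missing_fields(pkg)


# Assumed example: a draft package that only has its boundaries written down.
draft = TaskPackage(workstream="parser", boundaries=["lexer -> parser token contract"])
print(missing_fields(draft))  # ['acceptance_criteria', 'closeout_checklist']
```

A check like this keeps "P7 improvises architecture" visible as a packaging gap rather than an execution failure; how strict the required sections should be is a project-level choice.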
+ ### P7 — Senior Engineer (agent-led, executor) **Responsibility**: - Receive Task Prompt → first action `cd && pwd && git branch` (enforce working-dir discipline) - Read required-reads list before coding (enforce context loading) -- Implement deliverables → run gates locally → report - [P7-COMPLETION] +- Implement the scoped package defined by P9/P8 +- Start with tests or evals whenever the package changes a public behavior, + contract, or regression boundary +- Run gates locally → report [P7-COMPLETION] **Model**: Sonnet for mechanical fixes / well-defined tasks; Opus for complex codegen / novel design. @@ -393,7 +419,7 @@ audit topology is non-optional for high-stakes gates. ### Two-phase dispatch SOP The single most important pattern in ADSD. Used for any sprint where -a P9 sub-agent will produce code based on a CTO-level decision. +a downstream agent will produce code based on a CTO-level decision. ``` Phase 1 — CTO solo (30-60 min): @@ -403,28 +429,42 @@ Phase 1 — CTO solo (30-60 min): - Decision: which one + why - Done means: falsifiable success criteria - Cross-references: prior ADRs, findings + • If the capability is user-visible or contract-bearing, place the failing + eval/test skeleton now so the downstream implementation starts from a + visible proof obligation • Commit Phase 1 to main with `docs(adr): land ADR-NNNN — (CTO spike)` - • Place tests-corpus skeleton if needed (empty test fns with TODO bodies) -Phase 2 — P9 sub-agent (60-180 min, background): - • Reads ADR (Phase 1 commit) + related code +Phase 2 — P9/P8 refinement (30-120 min): + • P9 sequences the sprint and names the delivery gate + • P8 defines the workstream package: boundaries, acceptance, required reads, + and the exact tests/evals that must go green + • If test-first is part of the contract, the package explicitly tells P7 + what failing proof should exist before implementation starts + +Phase 3 — P7 implementation (60-180 min, background): + • Reads ADR + package + related code + 
• Lands tests/evals first when required by the package • Implements decision - • Reports [P9-COMPLETION] with branch + final SHA + gate verdicts + • Reports [P7-COMPLETION] with branch + final SHA + gate verdicts + • P8 checks domain completeness before P9 accepts delivery • CTO 守闸: smoke check + cold rebuild + 5-gate + merge --no-ff ``` -**Why two phases**: -- ADR is a strategic decision. Sub-agent shouldn't make it. -- ADR landed in Phase 1 = stable anchor for Phase 2 to read. -- If Phase 2 P9 sub-agent times out / drifts, Phase 1 ADR is still - preserved evidence of the strategic intent. -- Avoids the classic failure mode "P9 spent 2 hours implementing a - scope I never approved". - -**Failure mode**: skipping Phase 1, dispatching directly. Then sub-agent -either over-scopes (fixing things you didn't want fixed) or under-scopes -(missing the actual decision). 80% of "agent went off the rails" -stories trace to skipped Phase 1. +**Why this staged dispatch works**: +- ADR is a strategic decision. Downstream agents shouldn't invent it. +- P9 and P8 have different jobs; separating delivery orchestration from + domain-boundary packaging reduces overloaded handoffs. +- A visible failing eval/test before implementation is the cleanest way to + stop acceptance from drifting during execution. +- If implementation times out or drifts, the ADR and the packaged acceptance + still preserve the approved intent. +- Avoids the classic failure mode "the executor spent 2 hours implementing a + scope nobody actually packaged or approved". + +**Failure modes**: +- Skip Phase 1 and the sprint can over-scope or under-scope. +- Skip P8 packaging and P7 often improvises architecture under delivery pressure. +- Claim test-first in principle but never dispatch a failing proof, and the rule collapses into aspiration. ### Worktree-per-sprint pattern @@ -549,6 +589,10 @@ related: [<other findings>] **Snapshot** — *Compressed current state*. Updated end-of-turn. 
Source of truth for future agents loaded post-compaction. +Snapshot is the **canonical state document**. Other top-level docs such as +`README`, operator runbooks, and agent guidance files are projection layers. +When they disagree, snapshot wins until the projections are synchronized. + Schema invariants (use frontmatter to enforce): ```yaml schema_invariant: | @@ -561,6 +605,18 @@ schema_invariant: | These invariants prevent the most common failure mode: "重写新段、忘删旧段" (write new section, forget to delete old). +### Snapshot-first close-out + +For documentation-affecting work, close-out order matters: +1. Update snapshot first. +2. Update every projection document that depends on it. +3. Run the documentation verification surface. +4. Only then claim the work closed. + +This turns documentation truthfulness into a deliverable rather than a +best-effort cleanup task. A repo whose README is more current than its +snapshot is upside down. + ### Triple-tree doc (optional) For projects with multi-language audiences: @@ -648,6 +704,29 @@ marketing-overreach we had to walk back. ## Part 5 — Strategic Layer +### Staged capability progression + +When building agentic products, avoid jumping straight from mockups to +full live operation. A cleaner path is to stage reality in deliberate +cuts, with each cut proving one new class of truth. + +One example progression is: + +1. **Mocked orchestrator baseline** — contracts, persistence, and UI shells + can move before real runtime integration exists. +2. **Local read loop** — the product can read and render locally produced run + state without pretending it owns execution yet. +3. **HTTP transport cutover** — replace mocked reads with a real process + boundary while keeping the execution model stable. +4. **Live local server loop** — prove the operator can drive the system + through the actual local service surface. +5. **Local run production loop** — prove the system can produce and then + observe real local runs end-to-end. 
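The example progression above can be sketched as an ordered list of stages, each gated on its own proof. This is a minimal illustration under the assumption that every stage's proof obligation reduces to "proven or not"; `STAGES` and `next_allowed_stage` are hypothetical names, and the stage labels are copied from the example list rather than prescribed.

```python
from typing import Optional, Set

# Stage labels copied from the example progression; the gating helper is hypothetical.
STAGES = [
    "mocked orchestrator baseline",
    "local read loop",
    "HTTP transport cutover",
    "live local server loop",
    "local run production loop",
]


def next_allowed_stage(proven: Set[str]) -> Optional[str]:
    """Return the first stage whose proof is still missing, or None if all are proven.

    Stages are checked in order, so proving a later stage early does not
    unlock anything while an earlier stage remains unproven.
    """
    for stage in STAGES:
        if stage not in proven:
            return stage
    return None


# A later stage was proven out of order; the earlier gap is still the next cut.
print(next_allowed_stage({"mocked orchestrator baseline", "HTTP transport cutover"}))
# local read loop
```

The ordering check is the point of the sketch: each cut introduces one new class of truth, so the next cut is always the earliest unproven one.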
+ +Do not treat this exact sequence as universal law. The reusable lesson is +that transport, runtime, UI, and production semantics should be introduced +in staged cuts with explicit proof obligations, rather than all at once. + ### Strategic vs tactical review cadence | Layer | What it asks | Cadence | Who | @@ -895,18 +974,29 @@ Everything else is adaptable. ## Cross-references (within this skill) -- Part 1 Topology details: `reference/role-topology.md` -- Part 2 Two-phase dispatch deep dive: `reference/two-phase-dispatch.md` -- Part 3 Snapshot discipline: `reference/snapshot-discipline.md` +### Originals (distilled from Cobrust 12-day intensive run) + - Part 6 Full failure-modes catalogue: `reference/failure-modes-catalogue.md` - Templates: `templates/*.md` - Cobrust case study: `case-study/cobrust-multi-agent-experience.md` +### Cross-pollination from Anthropic + OpenAI public guidance (v1.2.0) + +These references adopt established industry practices into ADSD discipline, distilled and adapted to ADSD's sub-agent dispatch context. Read in priority order: + +- **Evals-first development** (`reference/evals-first-development.md`) — Anthropic's "evals are the moat" claim applied to ADSD. Every public capability gets a falsifiable test corpus before implementation. The 6th gate (eval delta non-regression) closes F20 systemically. **Highest leverage of all v1.2.0 additions.** +- **Context-window strategy** (`reference/context-window-strategy.md`) — Positive practices for long agent sessions, complementing F16 (post-compaction identity drift). Three-tier model: persistent / session-scoped / transient. Bootstrap-from-cold prompt template. +- **Cross-session memory architecture** (`reference/cross-session-memory-architecture.md`) — Four-layer storage model (auto-memory / project artifacts / session scratch / ephemeral) with decision tree for "where does this go?". Codifies ADSD's hard-won memory file discipline. 
+- **Prompt engineering patterns** (`reference/prompt-engineering-patterns.md`) — Distilled patterns from Anthropic + OpenAI prompt guides: role priming, anti-hallucination guards, structured output, refusal/escalation conditions, ADSD-specific patterns (D-rating, identity hygiene, release-readiness guard). +- **Cost monitoring discipline** (`reference/cost-monitoring-discipline.md`) — Practical patterns for tracking LLM cost per sprint / per release / per project. Cost as a diagnostic signal for loops + drift + scope misestimation. Anthropic prompt caching + OpenAI structured outputs as cost-reduction levers. + +These five represent **5 of 12 v1.2.0 gap candidates** identified by review-claude self-audit. Remaining 7 (skills architecture, agent specialization, HITL tree, RCA template, MCP patterns, calibrated confidence, structured-output enforcement) are queued for v1.3.0. + ## Origin & lineage This skill is distilled from the Cobrust project (2026-04-30 to -2026-05-10) — a Rust-implemented Python successor with AI-native -compiler. ~178 commits, 39 ADRs, 14 findings, 2 stress-test farms, +2026-05-12) — a Rust-implemented Python successor with an LLM-driven +translation pipeline. ~278 commits over 12 wall-clock days, 49 ADRs (0001..0048 + 0047a), 27 findings, 2 stress-test farms, 4 parallel-agent topology stress-tested at 4-way max. Patterns documented here passed the test of "did we hit this in production and did the fix work?". 
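As a rough illustration of the eval-delta non-regression gate named in the cross-references above, the check can be thought of as "nothing that passed at the baseline may fail now". The sketch below assumes eval results are available as a mapping from eval name to pass/fail; that representation, the function names, and the sample eval names are all hypothetical. The actual gate is defined in `reference/evals-first-development.md`, which is not reproduced here.

```python
from typing import Dict, List


def eval_regressions(baseline: Dict[str, bool], current: Dict[str, bool]) -> List[str]:
    """Return evals that passed in the baseline but fail (or are missing) now."""
    return sorted(
        name
        for name, passed in baseline.items()
        if passed and not current.get(name, False)
    )


def gate_passes(baseline: Dict[str, bool], current: Dict[str, bool]) -> bool:
    """The gate holds when no previously passing eval has regressed."""
    return not eval_regressions(baseline, current)


# Hypothetical eval names, for illustration only.
baseline = {"parse_basic": True, "codegen_loop": True, "fmt_roundtrip": False}
current = {"parse_basic": True, "codegen_loop": False, "fmt_roundtrip": True}
print(eval_regressions(baseline, current))  # ['codegen_loop']
```

Note that a newly passing eval (`fmt_roundtrip` here) does not offset a regression; the sketch treats the gate as per-eval, which is one plausible reading of "non-regression" rather than a confirmed definition.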
diff --git a/plugins/adsd/skills/agent-driven-development/case-study/cobrust-multi-agent-experience.md b/plugins/adsd/skills/agent-driven-development/case-study/cobrust-multi-agent-experience.md index 560cebe..c222ada 100644 --- a/plugins/adsd/skills/agent-driven-development/case-study/cobrust-multi-agent-experience.md +++ b/plugins/adsd/skills/agent-driven-development/case-study/cobrust-multi-agent-experience.md @@ -1,14 +1,14 @@ --- -case_study_id: cobrust-multi-agent-2026-04-30-to-2026-05-10 -project: Cobrust (Rust-implemented Python successor + AI-native compiler) -duration: 11 days (~24 hours of intense agent work in final 36 hours) -human_time: ~6 hours (estimated, mostly strategic decisions + 守闸) +case_study_id: cobrust-multi-agent-2026-04-30-to-2026-05-12 +project: Cobrust (Rust-implemented Python successor + AI-native translation pipeline) +duration: 12 days wall-clock (2026-04-30 → 2026-05-12); main narrative covers Days 1-10 with Day 11+12 appendix events +human_time: ~6 hours (estimated, mostly strategic decisions + 守闸; expanded in Day 11+12 appendix) agent_time: ~80% of LOC produced -final_state: 0.1.0-beta release plan, ~178 commits, 39 ADRs, 14 findings +final_state: v0.1.0 + v0.1.0-beta + v0.1.0-beta.1 + v0.1.1 + v0.1.2 stable shipped + α Phase F.2 in flight; ~278 commits, 49 ADRs (0001..0048 + 0047a), 27 findings attribution_origin: review-claude window (third-party audit) --- -# Case study: Cobrust 11-day multi-agent build-up +# Case study: Cobrust 12-day multi-agent build-up This case study reports what worked and what failed in applying ADSD-flavor methodology to a real software project — Cobrust, a @@ -29,10 +29,10 @@ record of what ADSD prevents and what it doesn't. 
- **Goal at start**: Phase E (M0..M14) — language core + tooling - **Goal at day 11**: 0.1.0-beta public release with end-to-end Python library translation demo -- **Total commits**: ~178 -- **Total ADRs**: 39 (0001..0039 with some reservations) -- **Total findings**: 14 -- **Cumulative tests**: 2,541 passed / verified at HEAD `6008634` +- **Total commits**: ~278 (at HEAD `a2b3eab` 2026-05-12) +- **Total ADRs**: 49 files (0001..0048 + 0047a sub-numbered) +- **Total findings**: 27 +- **Cumulative tests**: 2,541 passed / verified at HEAD `6008634` (Day 8 anchor; current at HEAD ~2,611+, not re-baselined in case study) ## Topology actually used @@ -373,10 +373,10 @@ N = 5). | Metric | Value | |---|---| -| Total commits | ~178 | -| ADRs landed | 39 | -| Findings | 14 | -| Tests passing at HEAD | 2,541 | +| Total commits | ~278 (at HEAD `a2b3eab` 2026-05-12) | +| ADRs landed | 49 (0001..0048 + 0047a) | +| Findings | 27 | +| Tests passing at Day-8 anchor | 2,541 (current ~2,611+ not re-baselined) | | Test failures pre-cleanup | 2 (msgpack DoS + pyo3 compile) | | P0 codegen bugs found via stress-test farm | 2 | | Hours of human work (estimated) | ~6 | diff --git a/plugins/adsd/skills/agent-driven-development/case-study/cobrust-studio-experience.md b/plugins/adsd/skills/agent-driven-development/case-study/cobrust-studio-experience.md new file mode 100644 index 0000000..f8764bc --- /dev/null +++ b/plugins/adsd/skills/agent-driven-development/case-study/cobrust-studio-experience.md @@ -0,0 +1,1911 @@ +--- +case_study_id: cobrust-studio-2026-05-11-to-2026-05-12 +project: Cobrust Studio (AI-agent project-management console; self-hosted web UI + REST/SSE API over a markdown ADR/finding/ledger tree) +duration: ~21 hours wall-clock (2026-05-11 17:22:37 +0800 → 2026-05-12 14:36:16 +0800; 5-day human plan compressed to 2 calendar days) +human_time: ~3-4 hours (strategic decisions + 守闸 + persona-audit reading; no implementation code written by human) +agent_time: ~95% of LOC 
produced (3 Rust crates + SvelteKit 5 frontend); 18 opus sub-agent dispatches across 6 waves + 4 reconcile rounds + 1 release agent +final_state: v0.1.0 (broken) → v0.1.1 (broken) → v0.1.2 (usable) shipped within the same calendar day; 125 commits on main; 3 tags; 6 ADRs; 4 findings; 4 module-docs; 196 Rust tests / 14 hermetic Playwright e2e / 2 dogfood specs / real-LLM e2e — all green at HEAD +attribution_origin: studio-cto-session-002-opus47 + studio-p7-{a*,m*}-opus47 sub-agents (live dispatch window, no third-party audit gap) +relates_to: [case-study:cobrust-multi-agent-experience.md (N=1), reference:failure-modes-catalogue.md §F1.0/F19/F20/F21, SKILL.md §"Origin & lineage"] +--- + +# Case study: Cobrust Studio — N=2 dogfood, 2-day MVP exercised + extended ADSD v1.2.1 + +This case study reports what worked and what failed in applying ADSD +v1.2.1 to a **second, independent project** — Cobrust Studio, a +self-hosted web console for managing AI coding agents under engineering +discipline. The first ADSD case study +([`cobrust-multi-agent-experience.md`](cobrust-multi-agent-experience.md)) +documents a 12-day multi-agent build of the Cobrust language project +(N=1). Studio is the **N=2** evidence: a different codebase, a +different domain, a 10× shorter timeline, executed against the +already-codified methodology rather than co-evolving with it. + +If the Cobrust case study answers "did this discipline scale to a +12-day language compiler with a 4-way parallel agent team?", this +case study answers a sharper question: + +> **Does ADSD survive being applied as-written to a project it wasn't distilled from?** + +Short answer: yes, with two important caveats — **the methodology +was both validated and stressed in load-bearing ways**, and **Studio +surfaced new F-sub-forms that retroactively validate F19/F20/F21** +(added to the catalogue between N=1 and N=2). 
Where Cobrust *generated* +the failure-modes catalogue from its own scars, Studio *consumed* it +and reported back on which entries paid for themselves under +acceleration. + +This case study is also not a sanitised success story. v0.1.0 shipped +broken. v0.1.1 shipped broken (differently). v0.1.2 was the first +usable tag. Each broken tag is a data point about which enforcement +layer was missing; the patch dance below names file:line for every +gap. + +--- + +## §0 Dashboard (one-pager) + +``` +Project: Cobrust Studio +Repo: github.com/Cobrust-lang/cobrust-studio +License: Apache-2.0 OR MIT (ADR-0001) +Span (wall-clock): 2026-05-11 17:22 → 2026-05-12 14:36 (~21 hours) +Span (5-day human plan): collapsed to 2 calendar days (AI velocity ~2.5×) +Bus factor: 1 (single human contributor; explicit caveat) +Commits on main: 125 +Tags pushed: 3 (v0.1.0 broken / v0.1.1 broken / v0.1.2 usable) +Rust crates: 3 (studio-router / studio-store / studio-server) +Frontend: SvelteKit 5, 5 pages, Tailwind v4 +Binary deployment: single 9.0 MiB self-contained (rust-embed; ADR-0002) +Rust tests at HEAD: 196 (32 ok groups, 0 FAILED) +Playwright e2e: 14 hermetic + 2 dogfood (all green at HEAD) +Real-LLM e2e: PASS (codex-forwarder + gpt-5.5) +ADRs landed: 6 (0001..0006) +Findings filed: 4 (P0 / P1 / P2 / P3 all represented; 3 of 4 closed within session) +Module-docs: 4 (studio-router / studio-store / studio-server / web-frontend) +Opus agent dispatches: ~18 (6 waves × DEV+TEST+REVIEW trio + 4 reconcile rounds + 1 release agent) +Reconcile rounds: 4 (A2, A3, A4, A5 — multiple per wave on M1-era waves) +CI gates enforced: 6 (fmt / clippy -D warnings / build / test / doc-coverage §5 SHA / doc-coverage §6 cargo test) +Persona audits: 3 (Mei / Aleksandr / Sarah — post-v0.1.2, AMBER / REAL / PASS-watch-6-month) +F-catalogue catches: F1.0 ×2 / F19 ×2 / F20 ×2 / F21 ×1 +ADSD-firsts: First F20 systemic closure in a non-origin project + First documented "tag → audit → patch" release pattern + 
First "recursive F20 closure" (enforcement script auditing itself) + First N=2 dogfood of the methodology +``` + +--- + +## §1 Project shape & meta + +### What Studio is + +A 9 MiB single Rust binary that gives engineering teams a web UI + +REST/SSE API over a plain-markdown ADR/finding/ledger tree backed by +a git repo. Five pages: `/login` / `/adr` / `/agent` (dispatch) / +`/finding` / `/ledger`. The discipline it productises is the +**ADR + finding + bilingual docs + Tx-tagged waves + doc-coverage CI +gate** stack distilled from Cobrust. Studio's pitch: "if your team is +doing serious AI-driven development, you need answers to *what did +the agents decide / what went wrong / where did the tokens go / are +we drifting / is the methodology actually being followed?* — Studio +gives you all five against any git repo, no SaaS, no per-seat +pricing." + +### Why it was built + +Two reasons, neither of which is the publicly stated pitch: + +1. **N=2 ADSD dogfood.** The Cobrust language project (N=1) was + the substrate the methodology was distilled from. Distillation + from the same project that generates the data is methodologically + suspect — "did the methodology actually work, or did we just + describe what we did?" Studio is the **second, independent + application**: a different language stack (Axum + SvelteKit vs + Cranelift codegen), a different problem domain (CRUD + SSE vs + LLM-driven translation), a different parallelism profile (3-way + dev/test/review trio per wave vs 4-way parallel sprint farm). The + only constant is the methodology and the human at P10. + +2. **A vehicle for back-porting v1.2.1 catalogue entries + (F19/F20/F21) under acceleration.** F19/F20/F21 were added to the + ADSD failure-modes catalogue between N=1 ending and Studio + starting. They were untested under the conditions they describe + (a new project, a fresh constitution, a tight timeline that + pressures shortcuts). 
Studio's M0-M5 trajectory was the first + project to consume those entries as inputs rather than outputs. + +### Topology + +``` +Human (1, hakureirm <wbj010101@gmail.com>): + - Strategic decisions: license (Apache + MIT), repo namespace + (Cobrust-lang/cobrust-studio), public tag timing, persona-audit + follow-up direction + - Final 守闸 + merge approval on all wave merges + - ~3-4 hours total work; zero implementation code authored + +CTO agent (1, opus, studio-cto-session-002-opus47): + - M0..M5 milestone planning + Phase 1 ADR spikes (ADR-0001..0006) + - 18 sub-agent dispatches across 6 waves + - 4 reconcile rounds (DEV-TEST-REVIEW resolution per wave on A2-A5) + - 1 dedicated release-readiness agent (M4 post-tag audit) + - 3 persona-audit dispatches (Mei / Aleksandr / Sarah, post-v0.1.2) + +P7 sub-agents (~18 opus dispatches): + - studio-p7-a1-1-opus47 : router lift + strip + - studio-p7-a2-dev-opus47 : studio-store impl + - studio-p7-a2-test-opus47 : studio-store contract corpus + - studio-p7-a2-review-opus47: A2 audit + - studio-p7-a3-{dev,test,review}-opus47 : Axum core + - studio-p7-a4-{dev,test,review}-opus47 : 10 M1 routes + SSE + - studio-p7-a5-{dev,test,review}-opus47 : router wire + dispatch SSE + - studio-p7-m2-{dev,test,review}-opus47 : SvelteKit 5 frontend + - studio-p7-m3-{dev,test,review}-opus47 : rust-embed + dogfood + - studio-p7-m4-{dev,test,review}-opus47 : v0.1.0 release prep + - studio-cto-m4.1-release-readiness-opus47 : post-v0.1.0 audit (caught F-M4-01) + +Persona agents (3, sonnet, post-v0.1.2): + - Mei (Python data scientist, target user) + - Aleksandr (Rust skeptic, technical credibility) + - Sarah (OSS evaluator + tech-lead, governance) +``` + +**Critical attribution note (F21 hygiene)**: every CTO and P7 dispatch +in Studio carried an explicit session-handle suffix +(`studio-cto-session-002-opus47`, `studio-p7-a4-dev-opus47`). No +artifact in this repo signs bare "review-claude" or bare "the CTO". 
+The discipline came from F21 being on the table at session start — +empirical validation that **F21 prevention is cheap if applied +prospectively**. + +### Wave structure + +Six waves, each with the **3-team trio pattern** (DEV + TEST + REVIEW +in parallel, then CTO reconcile): + +| Wave | Scope | Merge SHA | Notes | +|---|---|---|---| +| A0/M0 | Workspace scaffold + 5 ADRs + 5-gate CI | `b7d8f71` | Initial commit; F1.0 BSD-sed caught on first run | +| A1.x | studio-router lift from cobrust-llm-router @ `61f2aff` + strip | `d616548` | Strip #2 verified as no-op (`a1-1-strip-2-noop-at-pin-61f2aff.md`) | +| A2 | studio-store: ADR/finding/ledger CRUD + SQLite index | `36651a4` | First `last_verified_commit: HEAD` leak (F-A2-01) | +| A3 | studio-server Axum core | `d26f3ac` | Second HEAD leak (F-A3-01); same wave fixed via doc-coverage §5 | +| A4 | 10 M1 HTTP routes + SSE | `8d5475f` | Shipped 9 failing integration tests under broken grep守闸 | +| A5 | Router wire + dispatch SSE + A4 baseline fixes | `0e699c4` | A5 DEV agent flagged the broken-baseline as side-effect; finding filed | +| M2 | SvelteKit 5 frontend (5 pages) | `bfbfb8f` | Vitest + Playwright scaffolding | +| M3 | rust-embed integration + dogfood smoke | `5685f49`, `a426067` | The `Path<String>` mounted on `Router::fallback` — landed here, caught at M4 | +| M4 | v0.1.0 release prep | `a722e09` | Tag `0a7fd3e` v0.1.0 — known-broken (SPA fallback) | +| M4.1 | Post-tag CTO 守闸 release-readiness audit | `503260d` | Caught F-M4-01; doc-coverage §6 added | +| v0.1.1 | SPA fallback `Path<String>` → `Uri` extractor | `15b6f46` | Tag — known-broken (stale Cargo.lock) | +| v0.1.2 | Cargo.lock refresh + doc-coverage §6 paired exit-code gate | `7ea9ae3` | Tag — first usable | +| M5 | persona-audit-driven README rewrite + F-05 dead deps + CI matrix | `339e1ab`, `58cbe94`, `ffaf1fb` | Mei/Aleksandr/Sarah outputs converted into concrete PRs | + +The Wave A waves used a 3-team-per-wave dispatch pattern (~3 P7 +dispatches 
per wave); Wave M3+ collapsed back to single-P7 dispatches +because the frontend work was less cross-cutting. The variance is +itself an N=2 data point: **3-team trio is overkill for +single-surface UI work; appropriate for cross-crate Rust changes**. + +--- + +## §2 What Studio validated about ADSD + +This section walks each ADSD invariant the project exercised. The +question: did the methodology, applied as written, behave the way +the catalogue claims? + +### §2.1 The 4-tier role topology held under tight-timeline pressure + +ADSD §1 specifies P10 (CTO) / P9 (tech lead) / P7 (senior engineer) / +P0 (atomic) + external review. Studio used **only P10 + P7** — +collapsed P9 into P10 because the wave scope was tractable for direct +CTO-to-P7 dispatch. **The ≤4-way parallel cap was honored throughout**; +peak concurrency was 3 (DEV + TEST + REVIEW trio). + +This is a meaningful adaptation: ADSD's case study #1 (Cobrust) ran +4-way parallel through a heavyweight P9-led decomposition, because +each milestone (M11.x, M12.x) was a multi-crate spike. Studio's +waves were narrower (single crate per wave on A-series; single page +per dispatch on M2). The trio pattern at ≤3-way is **the right +fidelity for narrow-scope waves**; the P9 layer is overhead for +projects shorter than ~5 days. + +> **Methodology learning: P9 is optional below a complexity floor.** +> When the wave plan fits in a single ADR with ≤5 sub-tasks, CTO → +> P7 trio direct is fine. Reserve P9 for waves that need +> sub-decomposition of the ADR itself. + +This learning is being back-ported into §1 of SKILL.md (see §6 +below). + +### §2.2 Two-phase dispatch SOP held — and ADR-0006 demonstrates the blame-integrity move + +The single most-validated pattern was the **CTO Phase 1 ADR spike → +P7 Phase 2 impl** loop. Every wave followed it: + +``` +Phase 1 (CTO): Commit ADR-NNNN with options/decision/done-means. + Land on main. 
+Phase 2 (P7): Dispatch with a working dir + required reads + (including the ADR) + mission + deliverables + gates. +Phase 3 (CTO): 守闸 — 5-gate green check + read the diff + merge. +``` + +**Concrete validation**: +[`docs/agent/adr/0006-studio-router-api-and-lift-provenance.md`](https://github.com/Cobrust-lang/cobrust-studio/blob/main/docs/agent/adr/0006-studio-router-api-and-lift-provenance.md) +was spiked CTO-solo at Phase 1 (commit `93ae8f8`, 2026-05-11 17:24). +The §"Decision" block enumerated the studio-router public surface +and proposed a builder shape (`with_config / with_cache / with_ledger +/ from_toml`). P7 A1.1 lifted the upstream code and **discovered the +real upstream builder shape was different** (`register_provider / +build(&cfg)` async, `from_toml_str(&str)` not `from_toml(&path)`). + +This is exactly the F2 layer-divergence pattern from Cobrust ADR-0033 +/ ADR-0035. The right move was the **blame-integrity addendum**: + +> ADR-0006 §"Addendum 2026-05-11 — post-A1.1 reality reconciliation" +> preserves the original §"Decision" text **unchanged**, then appends +> a §F-01 / §F-02 / §F-03 addendum block enumerating each correction +> with the as-built reality. The original CTO speculation is +> preserved verbatim; the corrections are dated, attributed +> (`studio-review-wave-a1-opus47`), and load-bearing for downstream +> implementation. + +This pattern — **don't rewrite the spike, append the correction** — +is identical to Cobrust ADR-0033 §"Layer correction". Studio's +contribution: a clean second instance, with explicit prose calling +out *why* the original text is preserved (audit trail / blame +integrity). Future ADSD users now have two case-study instances of +the pattern in the wild. + +> **Methodology learning: ADR addendum pattern is the BLAME-INTEGRITY MOVE.** +> When Phase 2 implementation reveals Phase 1 was speculative-wrong, +> never edit §"Decision". 
Append `§"Addendum YYYY-MM-DD"` with the +> reality and a pointer to the review that surfaced it. Anyone +> reading the ADR can see both the original strategic intent and the +> tactical correction, with the lineage intact. + +### §2.3 5-gate verification held — and gained a 6th gate the same session + +ADSD §"5-gate verification" specifies: fmt / clippy / build / test / +doc-coverage. Studio enforced all 5 from M0; the M0 scaffold's first +commit (`b7d8f71`) shipped with green CI on day 0 hour 0. + +**By M4.1, the gate count was 6**. Studio added a §6 gate to +`scripts/doc-coverage.sh`: + +```bash +# Based on scripts/doc-coverage.sh §6 (post-M4.1 hardened) +# Capture the exit code via `||`: inside `if ! cmd; then`, $? is the status of `! cmd`, i.e. always 0. +cargo_exit=0 +cargo test --workspace --locked --no-fail-fast > "$test_log" 2>&1 || cargo_exit=$? +if [ "$cargo_exit" -ne 0 ]; then + echo "doc-coverage: FAIL — cargo test exited $cargo_exit (lockfile mismatch / compile error / panic)" >&2 + exit 1 +fi +failed_count=$(grep -c '^test result: FAILED' "$test_log" || true) +if [ "${failed_count:-0}" -ne 0 ]; then + echo "doc-coverage: FAIL — cargo test reported $failed_count failed test groups" >&2 + exit 1 +fi +``` + +This gate exists because the **standard 5-gate** as documented in +ADSD was insufficient against the failure modes Studio hit at A4 and +v0.1.1. The CTO 守闸 SOP that wrapped the 5-gate used a `cargo test +| grep "^test result" | wc -l` pipeline that **counted both `ok` and +`FAILED` summary lines** as if they were the same. Then the +post-v0.1.1 audit caught a Cargo.lock staleness where `cargo test +--locked` exited 101 *without* emitting a `test result: FAILED` line +at all. Both gaps were fixed at the script layer (see §3.3 and §3.5 below). + +> **Methodology learning: the canonical 5-gate is insufficient under +> aggressive parallelism. The 6th gate (paired exit-code + FAILED-grep +> on `cargo test`) closes two systemic gaps the 5-gate misses.** Back-port +> candidate for SKILL.md §"5-gate verification". 
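To make the first of those two gaps concrete, the sketch below runs both counting strategies over a hand-written log. The log only imitates the shape of cargo's summary lines; its contents and the helper names (`sample_log`, `count_summary_lines`, `count_failed_groups`) are illustrative assumptions, not part of Studio's scripts.

```shell
#!/bin/sh
# Hypothetical sample: three summary lines in the shape cargo prints, one of them FAILED.
sample_log() {
  cat <<'EOF'
running 3 tests
test result: ok. 3 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
running 2 tests
test result: FAILED. 1 passed; 1 failed; 0 ignored; 0 measured; 0 filtered out
running 1 test
test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
EOF
}

# Leaky count: every summary line matches, whatever its verdict.
count_summary_lines() { grep -c '^test result'; }

# Status-aware count: only FAILED summaries match.
# `|| true` keeps a zero count from tripping `set -e`, because grep exits 1 on no match.
count_failed_groups() { grep -c '^test result: FAILED' || true; }

echo "summary lines: $(sample_log | count_summary_lines)"   # prints 3
echo "failed groups: $(sample_log | count_failed_groups)"   # prints 1
```

The first number is what a line-counting report would label "all green". Only the second says anything about failures, and neither notices a run that dies before printing any summary line, which is why the exit code needs its own check.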
+ +### §2.4 3-team trio dispatch executed at ~3-way parallel under 4-way cap + +Each Wave-A and Wave-M sprint ran: + +``` + ┌─ studio-p7-{wave}-dev-opus47 (impl) +P7 ┼─ studio-p7-{wave}-test-opus47 (TDD contract corpus) + └─ studio-p7-{wave}-review-opus47 (audit — REVIEW only, no edits) + ↓ + CTO reconcile (merge DEV + TEST, address REVIEW findings) + ↓ + 守闸 (5-gate / 6-gate / read the diff) + ↓ + merge to main +``` + +Total dispatches: ~18 opus P7 dispatches across 6 waves + 1 opus release-readiness agent ++ 3 sonnet persona agents = **~22 sub-agents in 21 hours wall-clock**. +Token spend documented in `CHANGELOG.md §"Methodology firsts"`. CTO +reconcile rounds (4 of them, on A2/A3/A4/A5) were the most +human-time-expensive turns — typically 30-60 min each of human-driven +diff reading + small CTO edits to make DEV's wire-shape and TEST's +contract assumptions agree. + +The 3-team trio pattern is the **most ADSD-orthodox part of Studio's +execution**. It's the pattern §1 of SKILL.md specifies most directly, +and it worked as advertised — including the F-class catches (REVIEW +agent's audit reports are the source of the F-A2-01 / F-A3-01 / +F-A5-01 finding numbers below). + +### §2.5 Worktree-per-sprint pattern, scaled down to ~12 worktrees over 21 hours + +ADSD §"Worktree-per-sprint" specifies `git worktree add` per active +sprint; Studio created ~12 worktrees across the session +(`../studio-a2-dev`, `../studio-a2-test`, `../studio-a2-review`, +etc.), all cleaned up via `git worktree remove --force` post-merge. +**No worktree leaked into HEAD by accident**; no `target/` directory +collided. The pattern is identical to what cobrust-multi-agent +exercised, scaled down to single-day cadence. + +One M1 Pro 16GB machine, 3-way parallel cargo builds, zero exit-144 +(SIGUSR2) global lock starvation events. Cobrust hit this once at +6-way; Studio's ≤3-way cap never approached the ceiling. 
**The 4-way +parallel cap from §1 is real; 3-way is comfortable.** + +### §2.6 Atomic commits — code + tests + docs in one merge + +Every wave merge brought code + tests + module-docs in one commit. +Cross-references: + +- `36651a4 merge: A2 studio-store impl + contract corpus reconciled (Wave A2 complete)` + — brought `crates/studio-store/src/*.rs` + `crates/studio-store/tests/*.rs` + + `docs/agent/modules/studio-store.md` in one merge commit. +- `d26f3ac merge: A3 studio-server Axum core (Wave A3 complete)` — + same shape, scoped to server crate. + +**Atomic commit invariant violation count**: 1 (the A4 merge `8d5475f`, +which shipped 9 failing integration tests that compile-passed but +runtime-failed — see §3.3 below). One violation in 21 hours of +dispatch is in-line with the discipline; the violation itself produced +the catalogue's first **`cto-shougate-test-gate-grep-leak.md`** +finding. + +### §2.7 F21 identity hygiene held at 100% commit-attribution fidelity + +`git log --format='%an <%ae>' | sort -u`: + +``` +hakureirm <wbj010101@gmail.com> +``` + +**One author, one email, across all 125 commits**. Zero leak of the +macOS Full-Name default (which had leaked into an unrelated public +repo in a prior session, per F21 evidence). The discipline came from +F21 being on the table at session start: every dispatch prompt +specified `git config user.name` verification as a tier-0 step +before any commit. + +This is a **direct, prospective validation of F21's prevention +mechanism**. F21 was added to the catalogue from an N=1 negative case; +Studio is the N=2 positive case — F21 catches the leak if you remember +F21 exists. + +### §2.8 Triple-track doc discipline (zh / en / agent) enforced by doc-coverage.sh + +Every public crate ships with `docs/agent/modules/<crate>.md`; every +top-level doc has zh + en parity. Six ADRs, four module-docs, four +findings, all carry `last_verified_commit:` frontmatter that points +to a real, git-reachable SHA. 
The doc-coverage gate enforces this +mechanically — see §3.2 below. + +### §2.9 Honest fail acceptance — three patch-tags in one day + +Every project ships at v0.1.0 if not before. Studio shipped at +v0.1.0, then v0.1.1, then v0.1.2 *in the same calendar day*. The +CHANGELOG names each tag explicitly: + +- **v0.1.0**: known-broken — SPA fallback `Path<String>` regression on + `Router::fallback`. +- **v0.1.1**: known-broken (different bug) — stale Cargo.lock; `cargo + build --locked` returns 101. +- **v0.1.2**: first usable. + +No quiet retag. No "we'll bump the version and silently fix it." Each +patch tag is its own commit, its own CHANGELOG entry, its own +"`v0.1.<N-1>` is known-broken; upgrade to `v0.1.<N>`" note. **The +README's §"Honest status" section names the patch dance up front**: +*"If you'd prefer a year-old tag where you don't see the patch dance, +this isn't your project."* This is honest-fail-acceptance applied to +release-engineering, not just internal findings. + +--- + +## §3 What Studio STRESSED about ADSD + +This is the load-bearing section. Each item below: where the discipline +broke, how it was caught, what the fix was, what catalogue entry it +informs. Studio's value as N=2 evidence is concentrated here — +methodology that doesn't break under acceleration is methodology that +isn't being tested. + +### §3.1 F1.0 instance #1: BSD-sed in M0 doc-coverage.sh — declared invariant `ADR id monotonic` silently no-op'd on macOS + +**Where it broke** + +M0 (`b7d8f71`, the workspace-scaffold commit) shipped +`scripts/doc-coverage.sh` §4: + +```bash +# ORIGINAL (BSD-sed silent failure pattern) +for adr in $(ls docs/agent/adr/0*-*.md 2>/dev/null | sort); do + n=$(basename "$adr" | sed 's/^0*\([0-9]\+\).*/\1/') + # ... +done +``` + +On macOS (BSD sed), `\+` is **not a special character** — sed interprets +the regex literally. So `n` came back as the basename itself (e.g. 
+`0001-stack-choice.md`), the integer comparison `[ "$n" -le "$last" +]` returned a non-integer error, and `set -e` did **not** trip +because the construct was inside `$(...)` subshell expansion. The +gate printed `M0 — ADR id monotonic` and exited 0 on every run. + +**How it was caught** + +First-ever run of the gate from a clean macOS shell during M0 review. +CTO 守闸 noticed the gate "passed" against an ADR-roster that the +agent knew had a missing 0002 (intentionally — testing the monotonic +check should fail). Empirical confirmation: the gate was a no-op, not +a check. + +**Fix** + +`sed -E 's/^([0-9]+).*/\1/'` + a second `sed -E 's/^0+//'` to handle +leading zeros — POSIX-compatible regex (`-E` switch is GNU+BSD both). +Tested on macOS BSD sed and Linux GNU sed; both return monotonic +verdicts now. + +**Catalogue mapping** + +This is **F1.0 (declared invariants without enforcement) sub-form: cross-platform +shell silent failure**. The script declared an invariant ("ADR id +monotonic") and shipped a check that, on BSD tools, was equivalent +to no check. Same family as F1.2 (constitution rules with partial-scope +enforcement). + +> **Forward implication**: any project-level enforcement script should +> have a **deliberately-broken-input test** in CI: feed the script a +> known-bad fixture (intentionally non-monotonic ADR sequence), assert +> exit ≠ 0. If the test passes (gate caught the bad fixture), green. +> If the test fails (gate didn't catch), the gate is theatre. + +**This was the first F1.0 catch in Studio's session and the trigger +for tightening doc-coverage.sh's enforcement layer**. Two consecutive +F1.0 catches in the same session is the §"two strikes = systemic +blind spot" signal — see §3.2 below. 
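One possible shape for such a deliberately-broken-input test is sketched below. It assumes "monotonic" means ids run 1, 2, 3, ... with no gap, which may be stricter than Studio's actual check; the function names and fixture file names are hypothetical.

```shell
#!/bin/sh
# Hypothetical gate: succeeds only if ADR ids under $1 run 1, 2, 3, ... with no gap.
adr_ids_sequential() (
  dir=$1
  expected=1
  for adr in "$dir"/[0-9]*-*.md; do
    [ -e "$adr" ] || continue          # glob matched nothing
    # `sed -E` is accepted by both BSD and GNU sed; the second sed drops leading zeros.
    n=$(basename "$adr" | sed -E 's/^([0-9]+).*/\1/' | sed -E 's/^0+//')
    case $n in
      '' | *[!0-9]*) echo "not an integer id: $adr" >&2; exit 2 ;;
    esac
    if [ "$n" -ne "$expected" ]; then
      echo "expected ADR $expected, found $n ($adr)" >&2
      exit 1
    fi
    expected=$((expected + 1))
  done
)

# Self-test: the gate must accept a good fixture AND reject a known-bad one.
gate_self_test() (
  good=$(mktemp -d) || exit 2
  bad=$(mktemp -d) || exit 2
  touch "$good/0001-a.md" "$good/0002-b.md" "$good/0003-c.md"
  touch "$bad/0001-a.md" "$bad/0003-c.md"        # 0002 deliberately missing
  status=0
  adr_ids_sequential "$good" 2>/dev/null || status=1
  if adr_ids_sequential "$bad" 2>/dev/null; then
    echo "gate accepted a known-bad fixture" >&2
    status=1
  fi
  rm -rf "$good" "$bad"
  exit "$status"
)

gate_self_test && echo "gate self-test passed"
```

Running the self-test on every platform in the CI matrix is what would have exposed the BSD-sed case: a gate whose extraction silently degrades tends to accept the bad fixture or reject the good one, and either outcome fails the self-test.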
+ +### §3.2 F19/F20 paired instance: `last_verified_commit: HEAD` placeholder shipped twice in module-docs + +**Where it broke** + +Wave A2 merge `36651a4` (2026-05-12) shipped +`docs/agent/modules/studio-store.md` with frontmatter: + +```yaml +--- +doc_kind: module +crate: studio-store +last_verified_commit: HEAD # ← placeholder, never replaced +--- +``` + +`doc-coverage.sh` at that point did **not** check that +`last_verified_commit:` was a real SHA. The gate just checked frontmatter +*existed*. The literal string `HEAD` is frontmatter content; gate +passed. + +A2 external review (`studio-p7-a2-review-opus47`) caught it visually +as P2 finding F-A2-01. + +**24 hours later, Wave A3** merge `d26f3ac` shipped +`docs/agent/modules/studio-server.md` **with the same `HEAD` +placeholder**. Second instance, same blind spot. The A3 review caught +it (F-A3-01); but the structural issue — *the gate doesn't enforce* +— was diagnosed only after the second occurrence. + +**Two strikes = systemic blind spot** (per Cobrust F2 pattern). Filed +finding [`f20-closure-last-verified-commit-enforcement.md`](https://github.com/Cobrust-lang/cobrust-studio/blob/main/docs/agent/findings/f20-closure-last-verified-commit-enforcement.md) +naming the gap as an F20 instance (constitution-vs-workflow alignment). + +**Fix** + +`scripts/doc-coverage.sh` §5 extended in the **same commit as the +A3 review fix** (per F20 §"Rule of thumb": *every binding constitution +rule must have a paired enforcement step in the same PR that introduces +it*): + +```bash +check_last_verified() { + local file="$1" + grep -q "^last_verified_commit:" "$file" || fail "missing frontmatter" + local sha + sha=$(grep "^last_verified_commit:" "$file" | head -1 \ + | sed -E 's/^last_verified_commit:[[:space:]]*//') + if [ "$sha" = "HEAD" ] || [ -z "$sha" ]; then + fail "$file last_verified_commit='$sha' is a placeholder (F20)" + fi + if ! 
echo "$sha" | grep -qE '^[0-9a-f]{7,40}$'; then + fail "$file last_verified_commit='$sha' does not look like a git SHA (F20)" + fi + # F-A3-01 closure: hex-shape alone passes `deadbee` (valid hex, + # not a real commit). git cat-file -e is the canonical reachability check. + if ! git cat-file -e "${sha}^{commit}" 2>/dev/null; then + fail "$file last_verified_commit='$sha' is hex-shaped but NOT a reachable git commit (F20)" + fi +} +``` + +Three layers of check now: presence + shape + git-reachability. The +reachability check (`git cat-file -e <sha>^{commit}`) is the +F-A3-01 closure — without it, a typo like `deadbee` passes +hex-validation but doesn't actually point to a real commit. + +**Catalogue mapping** + +This is the **first F20 systemic closure landed in Cobrust Studio** +— the finding's title literally is `f20-closure-last-verified-commit-enforcement`, +and the §"Conclusion" states: *"this finding is the first F20-class fix landed +in Cobrust Studio. Mechanism is now load-bearing: any future module-doc or +finding that lands with `last_verified_commit: HEAD` will be caught by CI +on the same PR that introduces it. The placeholder pattern is dead."* + +> **First-ever validation of F20's prevention mechanism in a non-Cobrust +> project.** F20 was added to the catalogue from Cobrust's TDD-mandate-without-enforcement +> N=1 negative case. Studio is the first project to land an F20 *closure* +> against a brand-new instance — confirming F20's §"Rule of thumb" is +> actionable, not just diagnostic. + +> **Forward implication**: F19 (release-readiness untested) and F20 +> (constitution-vs-workflow alignment) **pair naturally**. F19 is "did +> you run it?"; F20 is "did your runner enforce it?". Any project that +> takes F20 seriously will produce F19 closures automatically — and +> vice versa. 
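The gap between shape and reachability can be seen without a git repository. The sketch below implements only the first two layers as a hypothetical `sha_verdict` helper (not a function from `doc-coverage.sh`) and feeds it a few candidate values.

```shell
#!/bin/sh
# Layers 1 and 2 only: classify a candidate value without touching git.
sha_verdict() {
  case $1 in
    '' | HEAD) echo placeholder; return 1 ;;
  esac
  if ! printf '%s\n' "$1" | grep -qE '^[0-9a-f]{7,40}$'; then
    echo malformed
    return 1
  fi
  # Still unverified at this point: reachability needs
  #   git cat-file -e "$1^{commit}"
  echo shape-ok
}

# 36651a4 is the A2 merge SHA quoted in this case study; deadbee is valid hex only.
for candidate in HEAD '' 36651a4 deadbee not-a-sha; do
  printf '%-12s -> %s\n' "'$candidate'" "$(sha_verdict "$candidate")"
done
```

`deadbee` comes back as `shape-ok`, which is the reason the `git cat-file -e` layer exists. The sketch deliberately stops short of that layer so it can run outside a repository.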
+ +### §3.3 F1.0 instance #2: CTO 守闸 grep leak — A4 merged with 9 failing integration tests under green-gate report + +**Where it broke** + +A4 merge `8d5475f` (10 M1 HTTP routes + SSE; 2026-05-12) was +ratified by CTO 守闸 using: + +```bash +# WRONG — counts both `ok` and `FAILED` as "test groups" +cargo test --workspace --locked --no-fail-fast 2>&1 \ + | grep "^test result" | wc -l \ + | xargs -I{} echo "{} test groups all green" +``` + +This pipeline counts every line that **starts with** `test result:` — +including `test result: ok.` and `test result: FAILED.`. Both shapes +match; both increment the counter. The 守闸 report said "22 test +groups all green"; in reality, **5 of the 22 groups were FAILED, covering 9 failing tests**. + +The 9 failures were API-shape drift between A4 P7 DEV's wire shape +and A4 P7 TEST's contract assumptions (the same drift class as A2 +reconcile — but uncaught because of the broken grep). Specifically: + +| File | Failed tests | +|---|---| +| `tests/adr_routes.rs` | 4 (post_adr_malformed_body, get_adr_by_id, post_adr_then_list, post_adr_persists) | +| `tests/auth_route.rs` | 1 (set_endpoint_malformed) | +| `tests/events_route.rs` | 1 (events_sse_emits_on_adr_create) | +| `tests/finding_routes.rs` | 2 (post_finding_malformed, post_finding_then_list) | +| `tests/ledger_route.rs` | 1 (ledger_recent_n_zero) | + +The A4 守闸 commit `6775cce` ("M4.1 守闸 — apply A3 review P2 fixes") +did NOT address these — it fixed clippy and lib doc edits but didn't +run a clean test gate against the new integration corpus. + +**How it was caught** + +Wave A5 dispatch (the next sprint). A5 P7 DEV agent ran `cargo test` +against base `6775cce` as a sanity check before starting impl — and +**reported 9 pre-existing failures** in its mid-flight `[P7-COMPLETION]` report +("base branch has 5 pre-existing failing test files; should I work +on top or wait for fix?"). 
+ +The CTO immediately recognised: "the 5-gate I claimed green at A4 +was wrong" — that the green claim came from a grep pipeline that +folded FAILED summary lines into a generic line count. Filed finding +[`cto-shougate-test-gate-grep-leak.md`](https://github.com/Cobrust-lang/cobrust-studio/blob/main/docs/agent/findings/cto-shougate-test-gate-grep-leak.md) +with severity P1, naming three structural takeaways: + +1. CTO 守闸 SOP must use exit-code-aware test-gate checks (either + propagate cargo's exit code OR grep for FAILED explicitly — not + count `^test result` lines). +2. P7 TEST agents must run BOTH `cargo check` AND `cargo test` (with + acceptance that test FAIL is expected at TDD-red — but the agent + must REPORT the failure shape, not claim "all green"). +3. Same-PR enforcement: extend `scripts/doc-coverage.sh` to run + `cargo test` and explicitly check the summary line. This is the + F20 closure for "atomic commit invariant" → script-level enforcement. + +**Fix** + +Landed at M4.1 (`503260d` "fix: M4.1 守闸 — close cto-shougate finding +via doc-coverage §6 test gate"). `scripts/doc-coverage.sh` §6 added +(shown here in its final paired form; the M4.1 revision checked only the +FAILED summary line, and the exit-code half arrived in v0.1.2, see §3.5): + +```bash +# Capture the exit code via `||`: inside `if ! cmd; then`, $? would always be 0. +cargo_exit=0 +cargo test --workspace --locked --no-fail-fast > "$test_log" 2>&1 || cargo_exit=$? +if [ "$cargo_exit" -ne 0 ]; then + echo "doc-coverage: FAIL — cargo test exited $cargo_exit"; exit 1 +fi +failed_count=$(grep -c '^test result: FAILED' "$test_log" || true) +if [ "${failed_count:-0}" -ne 0 ]; then + echo "doc-coverage: FAIL — $failed_count failed groups"; exit 1 +fi +``` + +Note: **paired** check on exit code AND FAILED-grep. Either one alone +is insufficient (the original CTO pipeline counted every `^test result` +line regardless of verdict and dropped cargo's non-zero exit through the pipe). + +The finding's `status: closed_by_m4.1` records this closure. 
The +file's opening note (added at closure time): + +> *"Closure 2026-05-12 (M4.1 守闸): `scripts/doc-coverage.sh` §6 now +> runs `cargo test --workspace --locked --no-fail-fast` and explicitly +> greps `^test result: FAILED` to enforce the gate at script level. +> F20 systemic enforcement complete — the broken `grep "^test result" +> | wc -l` pattern that A4 merge tripped on can no longer ship green."* + +**Catalogue mapping** + +This is **F1.0 (declared invariant `5 gates green` lacking enforcement +in the verification mechanism itself)** + **F20 (constitution-vs-workflow: +the SOP's grep was the workflow; CLAUDE.md's "5 gates green before +any merge" was the constitution; the gap was the grep)**. + +> **The CTO 守闸 procedure is itself a workflow. F20 applies to the procedure +> as much as it applies to the code being reviewed.** Studio's evidence +> shows the discipline must be **layered** — the constitution rule, the +> SOP grep, the doc-coverage script, and a deliberately-broken-input +> test that confirms each layer catches what the upper layer would otherwise +> miss. + +### §3.4 F-M4-01: SPA fallback `Path<String>` shipped to v0.1.0 — caught by post-tag M4 release-readiness audit + +**Where it broke** + +M3 rust-embed integration (`5685f49`) mounted `embed::serve_asset` +via `axum::Router::fallback(...)`. The handler signature was: + +```rust +pub async fn serve_asset(Path(path): Path<String>) -> Response { ... } +``` + +The structural Axum bug: **`axum::extract::Path<T>` only extracts +from matched route patterns; `Router::fallback` does NOT match a +pattern** — it's a catch-all that the framework dispatches to when +no other route matches. So `Path<String>` has nothing to extract from, +and every request to a SPA route (`/login`, `/adr`, `/agent`, +`/finding`, `/ledger`) returned the Axum runtime error: + +``` +Wrong number of path arguments for `Path`. Expected 1 but got 0. 
+Note that multiple parameters must be extracted with a tuple `Path<(_, _)>` +or a struct `Path<YourParams>` +``` + +as the response body, instead of the SvelteKit `index.html` shell. +**The frontend was unreachable**. Every navigation to a SPA route +returned an Axum error string. **v0.1.0 shipped with this regression.** + +The bug was hidden from prior audits because: + +1. `scripts/smoke-dogfood.sh` only tests `GET /` (which uses + `embed::serve_index`, a separate handler with no extractor) and + `GET /api/*` paths (which never reach the embed fallback). It + never exercises a SPA route through the binary. +2. `embed::serve_asset`'s collocated unit test called the function + *directly* with a literal `Path("adr/3".to_string())` instead of + going through the Axum router — so the extractor plumbing was + never exercised in the unit test either. +3. M3 review forecast (`studio-review-wave-m3-opus47`) said the + 13-of-14 prior-fail Playwright state was "rust-embed not on TEST + branch yet; post-merge all 14 will pass." **This was wrong** — + the bug is in the rust-embed integration's extractor choice, not + in branch merge state. The forecast was speculative; the empirical + measurement was deferred. +4. M4 TEST agent (the wave that was supposed to validate the release) + returned mid-flight without running Playwright. The CTO did not + re-dispatch; instead, the CTO ran the audit directly. + +**How it was caught** + +**Post-tag CTO 守闸 M4 release-readiness audit** ran hermetic +Playwright (`STUDIO_E2E=1 pnpm run test:e2e`) against +`./target/release/cobrust-studio` built from main HEAD `a722e09` +(== v0.1.0). 13 of 14 e2e specs failed at the first +`page.goto('/login')` step. Inspection of Playwright's +`error-context.md` showed the exact Axum error string. Hypothesis +confirmed in <60 seconds. 
+ +Filed P0 finding +[`m4-release-readiness-spa-fallback-extractor.md`](https://github.com/Cobrust-lang/cobrust-studio/blob/main/docs/agent/findings/m4-release-readiness-spa-fallback-extractor.md). + +**Fix** + +`v0.1.1` (commit `15b6f46`): replace `Path<String>` with +`axum::http::Uri`: + +```rust +use axum::http::Uri; + +pub async fn serve_asset(uri: Uri) -> Response { + serve_path(uri.path()) +} +``` + +Locked against regression by new unit test +`serve_asset_handles_spa_routes_login_agent_etc` that exercises +**every** SPA route through the fixed `Uri` extractor (not just the +literal-Path collocated test pattern that had failed to catch the bug). + +**Catalogue mapping** + +This is **F19 (release-readiness untested) — first time the F19 +prevention mechanism caught a real shipping bug in Cobrust Studio**. +The finding's §"Forward implications" makes this explicit: + +> *"The smoke-dogfood.sh script SHOULD probe a SPA route (e.g., +> `curl /login | grep '<html'`) to catch this class of regression at +> the script level. Filed for v0.1.2."* +> +> *"M4 release-readiness pattern: the F19 mandate 'any public-facing +> install / quickstart / release command must pass independent +> execution in a clean shell' implicitly extends to 'any public +> ROUTE must be hit by an independent caller before publish.' +> smoke-dogfood.sh covers /api/* and /; v0.1.1 forward should cover +> SPA routes too."* + +> **Methodology learning: F19 extends from "install commands" to +> "every public surface".** The original F19 (Cobrust v0.1.x release +> notes) was about `cargo install` URLs and curl commands. Studio's +> instance generalises it to **any public-facing route that a real +> user would hit through normal use**. The mechanism is the same — +> independent caller (Playwright + curl) probing a clean-shell binary. 
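A route-level probe along the lines the finding asks for could look like the sketch below. The route list comes from this case study; the helper names, the base-URL argument, and the choice to treat the Axum error string as an explicit failure marker are assumptions for illustration, not the contents of `smoke-dogfood.sh`.

```shell
#!/bin/sh
# SPA routes as listed in this case study.
SPA_ROUTES="/login /adr /agent /finding /ledger"

# Reads a response body on stdin; succeeds only if it looks like the HTML shell.
looks_like_spa_shell() {
  body=$(cat)
  case $body in
    *"Wrong number of path arguments"*) return 1 ;;   # the v0.1.0 symptom
  esac
  printf '%s' "$body" | grep -qi '<html'
}

# Hypothetical probe: $1 is the base URL of an already-running binary (assumption).
# Returns non-zero if any SPA route fails to serve the HTML shell.
probe_spa_routes() {
  base=$1
  failed=0
  for route in $SPA_ROUTES; do
    if curl -fsS "$base$route" | looks_like_spa_shell; then
      echo "ok   $route"
    else
      echo "FAIL $route" >&2
      failed=1
    fi
  done
  return "$failed"
}
```

Against a locally running binary this would be invoked as `probe_spa_routes "$BASE_URL"`, where `BASE_URL` is whatever address the server was started on. Because the classifier reads from stdin, it can be exercised against canned bodies without a server.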
+ +### §3.5 F20 recursive closure: doc-coverage §6 hardened against `cargo test --locked` exit 101 leaking past FAILED-grep + +**Where it broke** + +v0.1.1 (`15b6f46`) shipped with the workspace version bumped from +0.1.0 → 0.1.1 in `Cargo.toml`, but `Cargo.lock` still referenced the +v0.1.0 workspace versions (`studio-server v0.1.0` etc.). Any user +running `cargo build --workspace --locked` against v0.1.1 — including +`scripts/build-release.sh`, the CI release workflow, or the M3-docs +recommended user clone path — got: + +``` +error: the lock file CARGO_LOCK needs to be updated but --locked was passed +to prevent this +``` + +`cargo test --workspace --locked` exits 101 (cargo's "build failed +or lockfile mismatch" code) **WITHOUT** ever running tests — so it +never emits a `test result: FAILED` line. + +The `doc-coverage.sh` §6 gate, hardened at M4.1 against the +`grep '^test result'` swallow-fail pattern, used: + +```bash +test_output=$(cargo test --workspace --locked --no-fail-fast 2>&1) +failed_count=$(echo "$test_output" | grep -c '^test result: FAILED') +[ "$failed_count" -eq 0 ] || exit 1 +``` + +This **only catches `FAILED` summary lines**. Exit code 101 from +lockfile-mismatch doesn't produce any summary line — so `failed_count` +is 0, gate passes green, **v0.1.1 ships broken-from-tag**. + +**How it was caught** + +Post-v0.1.1 tag, the `scripts/release-tarball.sh` build pipeline +errored at `cargo build --workspace --locked`. The doc-coverage §6 +gate had passed green just minutes before. CTO 守闸 immediately +identified the recursive pattern: **the gate that was supposed to +enforce F20 ("constitution-vs-workflow alignment") had itself a F20 +gap** — the workflow's enforcement was incomplete. + +This is the **recursive F20 closure**: F20 applied to its own +enforcement script. + +**Fix** + +v0.1.2 (`7ea9ae3`). Two changes: + +1. **Cargo.lock regenerated** via `cargo build` against the new + workspace version. Lockfile now consistent with `0.1.1+`. +2. 
**doc-coverage.sh §6 paired-gate**: an exit-code check on `cargo test + ...` AND a separate `failed_count` check for FAILED-grep. Either + non-zero fails the gate. + +```bash +# v0.1.2: paired gate. EITHER cargo exit != 0 OR FAILED count > 0 fails the script. +# Capture the exit code via `||`: inside `if ! cmd; then`, $? would always be 0. +cargo_exit=0 +cargo test --workspace --locked --no-fail-fast > "$test_log" 2>&1 || cargo_exit=$? +if [ "$cargo_exit" -ne 0 ]; then + echo "doc-coverage: FAIL — cargo test exited $cargo_exit (lockfile mismatch / compile error / panic)" >&2 + exit 1 +fi +failed_count=$(grep -c '^test result: FAILED' "$test_log" || true) +if [ "${failed_count:-0}" -ne 0 ]; then + echo "doc-coverage: FAIL — $failed_count failed groups" >&2 + exit 1 +fi +``` + +The CHANGELOG names the gap explicitly: + +> *"v0.1.1 Cargo.lock stale ... v0.1.1's commit shipped with Cargo.lock +> still referencing the v0.1.0 workspace versions. ... cargo test +> --locked exited 101 but the grep returned 0 FAILED, so the script +> passed green. v0.1.2 closes."* + +**Catalogue mapping** + +This is the **first documented "F20 recursive closure"** instance — +F20 applied to its own enforcement script, with each enforcement +layer requiring its own paired review. Studio is the empirical +substrate for the pattern: + +> **Methodology learning: F20 closure is not one-shot. The enforcement +> layer needs its own paired review.** A doc-coverage gate that hardens +> against pattern X can ship green against pattern Y on the same +> code-path. The script's invariant ("no test failures shipped") is +> declared once; each new failure mode (FAILED summary line / non-zero +> exit code without summary / hang / panic / OOM) needs its own +> orthogonal check. +> +> **Empirical pattern**: every enforcement layer needs its own paired +> orthogonal-failure review until the failure-mode class no longer +> recurs. Studio took two patches (M4.1 + v0.1.2) before the §6 gate +> stopped letting things through. 
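One way to keep each failure shape covered is to give it its own fixture and run the gate against stand-ins for `cargo test`. The sketch below is an illustration under stated assumptions: `paired_test_gate` and the four `stub_*` commands are hypothetical names, and the stubs only mimic the exit codes and summary lines described above.

```shell
#!/bin/sh
# Paired gate: run "$@" as the test command; fail on a non-zero exit OR any FAILED summary.
paired_test_gate() (
  log=$(mktemp) || exit 2
  rc=0
  "$@" > "$log" 2>&1 || rc=$?        # capture the real exit code, no `!` and no pipe
  failed=$(grep -c '^test result: FAILED' "$log" || true)
  rm -f "$log"
  [ "$rc" -eq 0 ] && [ "${failed:-0}" -eq 0 ]
)

# Hypothetical stand-ins for cargo, one per failure shape.
stub_green()    { echo 'test result: ok. 2 passed; 0 failed'; }
stub_failed()   { echo 'test result: FAILED. 1 passed; 1 failed'; return 101; }
stub_lockfile() { echo 'error: lock file needs to be updated' >&2; return 101; }  # no summary line
stub_masked()   { echo 'test result: FAILED. 0 passed; 1 failed'; return 0; }     # exit code lost

# The gate must pass the green stub and reject every bad one.
gate_fixture_suite() {
  paired_test_gate stub_green || return 1
  for bad in stub_failed stub_lockfile stub_masked; do
    if paired_test_gate "$bad"; then
      echo "gate let $bad through" >&2
      return 1
    fi
  done
}

gate_fixture_suite && echo "all gate fixtures behaved as expected"
```

A gate that greps only for FAILED would let `stub_lockfile` through, and one that checks only the exit code would let `stub_masked` through. Adding a stub whenever a new failure shape is discovered keeps the orthogonal checks from regressing.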
+ +### §3.6 The strip-#2 declared-empty-must-be-observed-empty discipline + +**Where the discipline was tested** + +ADR-0006 §"Strip list" item #2 directed the A1.1 lift to remove "ADR-0040 +honest-gate hooks (L2 verdict typing)" from `router.rs` + `ledger.rs`. +The strip list was authored from the Studio handoff doc's plan-time +view of upstream entanglement — i.e., a CTO Phase-1 belief about +what the upstream pin contained. + +P7 A1.1 lift agent searched the actual upstream pin +(`~/repos/cobrust-source-pin/crates/cobrust-llm-router/` at SHA +`61f2aff`, v0.1.1): + +```bash +grep -rn "L2Verdict\|gate_verdict\|L2.*Verdict\|HonestGate" \ + ~/repos/cobrust-source-pin/crates/cobrust-llm-router/src/ +# Result: zero hits. +``` + +The strip was a **no-op at this pin**. The honest-gate surface +evidently lived in a different upstream crate (the translation +pipeline, not the router crate). + +**The discipline applied** + +The lift didn't silently elide strip-#2 from the report. Filed P3 +finding +[`a1-1-strip-2-noop-at-pin-61f2aff.md`](https://github.com/Cobrust-lang/cobrust-studio/blob/main/docs/agent/findings/a1-1-strip-2-noop-at-pin-61f2aff.md) +explicitly recording the no-op: + +> *"ADR-0040's 'honest gate' surface evidently lives elsewhere in the +> upstream Cobrust workspace (likely the translation pipeline crates, +> not the router crate). At the pinned router-crate SHA, there is +> nothing to strip. The lift therefore proceeded with strip #2 as a +> verified no-op."* + +The finding's §"Conclusion" articulates the principle: + +> *"ADSD §'Atomic commits' + §'5-gate verification' both demand that +> declared invariants get verified in the same commit as the code they +> constrain. Strip #2 declared an invariant ('no honest-gate hooks in +> studio-router'). 
The honest verification was 'they were never in +> scope at this pin'; recording that explicitly closes the F1.0 / F19 +> risk class — namely, future readers seeing strip #2 in ADR-0006 +> might assume there must have been code removed, look in vain, and +> either re-add bogus honest-gate machinery to 'fix' what they think +> went missing, or distrust ADR-0006's other strip claims."* + +**Catalogue mapping** + +This is a **"declared-empty must be observed-empty" pattern** — a +proactive F1 family prevention. The principle: when an ADR declares +*the absence* of something (no honest-gate, no consensus mode, no +per-task routing), that absence must be **empirically observed** and +**recorded as observed**, not silently assumed. Otherwise a future +reader can't distinguish "we removed X" from "X was never there". + +> **Methodology learning: ADR strip-lists and constitution-prohibitions +> must record the *empirical observation* of absence, not just the +> *declaration* of absence.** Without the observation record, a future +> reader hitting an empty grep can't tell whether the absence is +> intentional (strip succeeded) or accidental (strip silently failed). + +This is a candidate for a new F-sub-form in the catalogue — a parallel +to F19/F20 framed around **strip-and-fork lift provenance**. For now, +documented in the finding itself; flagged for catalogue back-port. + +--- + +## §4 What Studio EXTENDED about ADSD + +The methodology learned things during this session. This section +captures the deltas — items worth back-porting to SKILL.md or to the +failure-modes catalogue. + +### §4.1 "Tag → audit → patch" as a RELEASE PATTERN, not just an audit gate + +**The pattern** + +ADSD v1.2.1 F19 documents "release-readiness agent runs in clean +shell before publish" as a **gate** — something that decides +GO / BLOCK before a tag is pushed. 
Studio's experience reframes this +as a **release pattern**, not a binary gate: + +``` +Tag the current candidate → v0.1.<N> + ↓ +Audit the tagged artifact in clean shell → release-readiness agent + ↓ +If BLOCK: file finding + patch + tag v0.1.<N+1> → patch dance + ↓ +Re-audit + ↓ +If GO: announce, publish notes → first usable tag +``` + +**Why this matters** + +The framing change is load-bearing: under tight timelines (Studio's +21-hour run), there is no time for *one* perfect tag. The right +pattern is *fast tag → fast audit → fast patch*, accepting that +v0.1.0 / v0.1.1 are intentionally just the experiment substrate that +the audit will reveal. + +Each tag is the **experiment**; the audit is the **observation**; +the patch is the **learning**. Three tags in one day for Studio is +not three failures — it's three completed experimental cycles, each +of which revealed an enforcement gap that intent-driven self-checks +had missed. + +**Empirical evidence** + +- v0.1.0 tag → M4.1 release-readiness audit → caught F-M4-01 SPA + fallback regression → v0.1.1 patch. +- v0.1.1 tag → release-tarball.sh in clean shell → caught Cargo.lock + staleness → v0.1.2 patch. +- v0.1.2 tag → release-readiness audit → green; no v0.1.3 needed + same-day. + +The pattern's success metric isn't "first tag is perfect"; it's +"convergence after K patches stays bounded". For Studio: K=2. + +**Back-port candidate for SKILL.md** + +Propose adding §4 (Quality & Verification) sub-section: + +> *"Tag → audit → patch as a release pattern: under acceleration, +> accept that the first tag will not be the publishable one. The +> right discipline is fast experimental cycle: tag, run the +> release-readiness audit in clean shell, patch the gap, re-tag. +> Cobrust Studio shipped v0.1.0 → v0.1.1 → v0.1.2 in 6 hours +> wall-clock; each tag was a learning step. 
CHANGELOG.md names each
+> tag explicitly as broken/usable so users know which to skip."*
+
+### §4.2 Recursive F20 closure: enforcement layers need orthogonal-failure review
+
+**The pattern**
+
+F20 closure is not one-shot. When you harden enforcement layer A
+against failure mode X, layer A may still have a sibling gap
+against failure mode Y (an orthogonal failure on the same code path).
+Closing X without checking Y leaves the layer half-closed.
+
+**Empirical evidence**
+
+The `doc-coverage.sh` §6 gate evolution:
+
+| Stage | Enforcement | Gap revealed |
+|---|---|---|
+| Pre-M4.1 | `grep '^test result' \| wc -l` | Counts both `ok` and `FAILED` as `result` lines |
+| M4.1 | `grep -c '^test result: FAILED'` | Misses non-zero exit without summary line (e.g. lockfile mismatch exit 101) |
+| v0.1.2 | Paired: `cargo test` exit-code check AND FAILED-grep | Both classes now caught |
+
+Each fix was complete against the bug class it was designed for. But
+the enforcement layer had **orthogonal failure modes** (runs that emit a
+FAILED line vs. runs that emit no summary line at all) that needed their
+own paired review.
+
+**Forward implication**
+
+> **Methodology learning: when closing an F20 instance, scan for
+> orthogonal failure modes on the same code path.** Ask: "could my
+> enforcement layer still pass under a different failure mode of the
+> same operation?" If yes, the closure is partial.
+
+This generalises beyond test gates. The same logic applies to:
+
+- Schema invariants in frontmatter (different shape of violation)
+- CI lint scripts (different shape of bad input)
+- Dispatch prompt template fields (different shape of agent shortcut)
+
+Back-port candidate for failure-modes-catalogue §F20 §"Prevention
+going forward": add a fourth layer ("Layer 4: orthogonal-failure
+review against every paired-gate enforcement").
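One way to make the "scan for orthogonal failure modes" question mechanical is to enumerate the outcome classes of a test run explicitly. The following is a hedged sketch rather than anything from the Studio repository: `classify_run` is a hypothetical helper and the class names are illustrative. It assumes the run was wrapped in GNU `timeout` (which exits 124 on expiry and is not shipped on stock macOS) and relies on the shell convention that a process killed by signal N reports 128+N. The `no-summary` class is an extra check that the original gate does not describe.

```bash
#!/usr/bin/env bash
# Hypothetical classifier: maps (exit status, log file) to exactly one class.
classify_run() {
  local status=$1 log=$2
  if [ "$status" -eq 124 ]; then
    echo "hang"                 # GNU timeout's exit code when the limit expires
  elif [ "$status" -gt 128 ]; then
    echo "killed-by-signal"     # 128+N, e.g. 137 = SIGKILL (a common OOM symptom)
  elif [ "$status" -ne 0 ]; then
    echo "nonzero-exit"         # e.g. exit 101 with no summary line
  elif grep -q '^test result: FAILED' "$log"; then
    echo "failed-summary"       # exit 0 but a FAILED summary is present
  elif ! grep -q '^test result:' "$log"; then
    echo "no-summary"           # exit 0 but no summary at all: did any tests run?
  else
    echo "ok"
  fi
}
```

A gate built on this would pass only when the class is `ok`; every other class is a distinct way for the same operation to fail, which is the review question this section asks.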
+ +### §4.3 Continuous persona testing executed in-sprint, with persona-output → PR mapping + +**The pattern** + +ADSD v1.2.1 §1 §"Continuous persona testing" documents persona +simulation as continuous dev cadence, not one-shot audit. Studio's +post-v0.1.2 turn was the first project to execute this as a +**deliberate sprint output**, not as a pre-release ceremony. + +**How it was executed** + +Three persona agents dispatched in parallel post-v0.1.2: + +| Persona | Profile | Verdict | Key catches | +|---|---|---|---| +| **Mei** | Python data scientist, target user | AMBER | Vocabulary confusion (what's an "ADR"?), missing "why not Linear/Notion?" framing, install path assumes `rustup` knowledge | +| **Aleksandr** | Senior Rust eng, technical skeptic | REAL (genuine assessment) | F-05 dead deps (`unicode-normalization`, `uuid`, `hex`, `tracing` carried from upstream lift but unused in studio-router), missing CI matrix | +| **Sarah** | OSS evaluator / governance | PASS-watch-6-month | Bus factor 1 (single contributor; flagged as adoption risk), no SECURITY.md, no CONTRIBUTING.md | + +The personas were given: +- Persona identity + background (years exp, prior burned-by experiences) +- Specific scenario ("you have 30 min, someone shared this on HN") +- Concrete actions to perform (open README, mentally try install) +- Stay-in-character constraint ("don't break into 'as an AI...'") +- Structured report fields aligned to persona's actual decision + ("would I upvote on HN?" / "what would I PR if I had a free + afternoon?") + +**Persona → PR mapping (empirical evidence the pattern works as a PR-driver, not as theatre)** + +Mei's friction items drove the M5 README rewrite directly: + +- *"What's an ADR? The vocabulary table dropped me in"* → README §"Methodology vocabulary" table added (`docs/agent/adr/`, `docs/agent/findings/`, "Wave", "Tx tag", "5 gates", "守闸"). +- *"Why not just use Linear?"* → README §"Why this and not Linear + git?" comparison matrix added. 
+- *"Is this production-ready?"* → README §"Honest status" section added, naming the v0.1.0/v0.1.1 patch dance up front. +- *"Bus factor 1 is a yellow flag"* → README §"Looking for 3-5 design partners" section added with concrete asks. + +Aleksandr's F-05 dead-deps catch landed in the same M5 commit +(`339e1ab`): +``` +Remove studio-router/Cargo.toml deps lifted but unused: + - unicode-normalization + - uuid + - hex + - tracing +(carried from upstream cobrust-llm-router @ 61f2aff; not used in +the post-strip surface.) +``` + +Sarah's bus-factor + governance findings drove the M5 CI matrix and +release workflow (`58cbe94`). + +**Why this matters** + +Mei's findings, in particular, were **structurally undiscoverable by +the internal review-claude pipeline**. The internal P7-REVIEW agent's +job is "is the code sound?" — it reads the code, the ADRs, the +findings. Mei's job is "would a Python user, with no Rust background, +recognise enough vocabulary to want to install this?" — she reads +only the README from a cold-context start. + +> **Methodology learning: persona-output is the highest-leverage +> source of README/positioning PRs.** Internal reviews maintain +> internal coherence; persona simulation creates **external coherence** +> — between the project's pitch and the user's mental model. Three +> persona dispatches @ 30 min each (90 min total) produced ~15 +> concrete PR items, of which 7 landed in the same wave. + +**Back-port candidate for SKILL.md §1**: extend §"Continuous persona +testing" with a sub-bullet: + +> *"Persona output → PR mapping: each persona finding should map to +> exactly one of {README edit, ADR addendum, finding, doc fix, code +> fix}. 
If a persona finding maps to 'no action / acknowledged', it's +> a research finding (file for the case study) not a product finding."* + +### §4.4 AI velocity confirmed at ~2.5× on a 5-day plan + +**The empirical evidence** + +CLAUDE.md §6 specified a 5-day MVP target: + +| M | Scope | Day target | Actual | +|---|---|---|---| +| M0 | scaffold + 5 ADRs + 5-gate CI | Day 1 | Day 1 hour 0-2 | +| M1 | backend MVP — Axum + routes + studio-router lift | Day 2 | Day 1 hours 2-20 (A1-A5) | +| M2 | frontend MVP — SvelteKit + 4 pages | Day 3 | Day 2 hours 0-5 | +| M3 | dogfood + polish + single binary | Day 4 | Day 2 hours 5-10 | +| M4 | release v0.1.0 + demo + reviewer invite | Day 5 | Day 2 hours 10-14 | +| M5 | (post-MVP, persona-driven) | not planned | Day 2 hours 14-18 | + +**Total wall-clock: ~21 hours** for a plan estimated at 5 human-days +(40 work-hours). Velocity multiplier: **~2.5×** if we count only +human-equivalent effort (the human's 3-4 hours was strategic; the +~125 commits were all agent-produced). + +The AI velocity heuristic in SKILL.md §5 predicted *"a 5-day human +plan = ~2-day AI plan with ≤4-way parallel"*. Studio confirms the +heuristic with N=2 evidence, at slightly more conservative +parallelism (≤3-way trio). + +**The catch** + +The 2.5× velocity multiplier did NOT translate to "first tag is +shippable". The 21-hour run produced *three tags* — v0.1.0 broken, +v0.1.1 broken, v0.1.2 usable. AI velocity buys faster experimental +cycles; it does NOT buy shippable-on-first-try. **The right framing +is "first usable tag in 21 hours" not "feature-complete in 21 hours".** + +Back-port candidate for SKILL.md §5 §"AI velocity planning": + +> *"AI velocity multiplier (~2.5× to ~10×) buys experimental cycles, +> not shippable-first-try. Plan for K=2 patch tags before first +> usable tag. Each patch is its own experimental cycle; aim for total +> wall-clock = (plan_days × velocity_inverse) × (1 + K × 0.1). 
For
+> Studio: (5 days × 0.4) × (1 + 2 × 0.1) = 2.4 days; reality was 0.9
+> day, comfortably under the estimate."*
+
+### §4.5 Persona report as PR-driver, not as theatre
+
+(Already covered in §4.3 above; summarized here for catalogue
+back-port.)
+
+The pattern: persona simulation produces actionable PRs when:
+1. Personas are richly defined (years exp, prior burned-by, current
+   frustrations), not just "a Python dev"
+2. Personas have a specific scenario ("you have 30 min on HN")
+3. Personas have a stay-in-character constraint enforced in the prompt
+4. Persona output is structured ("would I upvote", "what would I PR")
+
+Without these four, persona simulation regresses to "an AI agent
+giving generic feedback", which is theatre.
+
+### §4.6 The "constitution → ADR → finding → script-enforcement" stack as a 4-layer F20 discipline
+
+Studio's discipline can be described as a 4-layer stack:
+
+| Layer | Artifact | What it enforces | Where it can fail |
+|---|---|---|---|
+| 1: Constitution | `CLAUDE.md` | Strategic invariants ("5 gates green before merge") | Text-only; survives only in agent's session context |
+| 2: ADR | `docs/agent/adr/NNNN-*.md` | Architectural commitments ("studio-router public surface is X") | Drifts from as-built; corrected via §Addendum |
+| 3: Finding | `docs/agent/findings/*.md` | Empirical observations ("the grep leaked") | Filed but no script-level enforcement |
+| 4: Script | `scripts/doc-coverage.sh` | Mechanical CI gate (paired exit-code + FAILED-grep) | Final arbiter: if it passes, the build passes, so any gap in the script itself ships too (see §3.5) |
+
+**F20 mandates the gradient**: every rule at layer N must have a
+paired enforcement at layer N+1.
Studio's 4-finding count maps +1-to-1 to layer transitions: + +- F-A2-01 `last_verified_commit: HEAD` placeholder leaked → layer 1 + rule had no layer 4 enforcement → fixed in `f20-closure-last-verified-commit-enforcement.md` +- F-A4-01 9 failing tests under green-gate → layer 1 rule had no + layer 4 enforcement → fixed in `cto-shougate-test-gate-grep-leak.md` +- F-M4-01 SPA fallback `Path<String>` → layer 2 ADR-0002 (single-binary) + had no layer 4 release-readiness audit covering SPA routes → fixed + in `m4-release-readiness-spa-fallback-extractor.md` +- A1-1 strip-2 no-op at pin `61f2aff` → layer 2 ADR-0006 §"Strip + list" item #2 had no layer 4 verification of the strip; fixed by + empirically observing the absence and filing the finding. + +> **Methodology learning: the 4-layer constitution → ADR → finding → +> script stack is the right abstraction for F20.** Every rule needs +> a script-level enforcement; every finding should record which +> layer's gap it closes. + +Back-port candidate for SKILL.md Part 3 (Documentation Discipline): +make the 4-layer model explicit. 
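As an illustration of pairing a layer-1 rule with a layer-4 check, the sketch below fails when any Markdown file still carries the `last_verified_commit: HEAD` placeholder named in F-A2-01. It is a guess at the shape such a check could take, not the enforcement that actually landed in Studio: `check_last_verified_commit` and its directory argument are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical layer-4 check: return non-zero if any *.md file under the given
# directory contains a line that is exactly `last_verified_commit: HEAD`.
check_last_verified_commit() {
  local dir=$1 offenders
  # find ... -exec grep -l lists only the files that match; `|| true` absorbs
  # the non-zero status that grep reports when nothing matches.
  offenders=$(find "$dir" -type f -name '*.md' -exec \
    grep -lE '^last_verified_commit:[[:space:]]*HEAD[[:space:]]*$' {} + || true)
  if [ -n "$offenders" ]; then
    echo "doc-coverage: FAIL (placeholder last_verified_commit: HEAD found in:)" >&2
    echo "$offenders" >&2
    return 1
  fi
  return 0
}
```

Called as `check_last_verified_commit docs/agent`, a check like this turns the text-only rule into something CI can refuse to merge; per §4.2, it would still deserve its own orthogonal-failure review (for example, a misspelled key would slip past the pattern).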
+
+---
+
+## §5 Numbers worth quoting
+
+| Metric | Value |
+|---|---|
+| Span wall-clock | ~21 hours (2026-05-11 17:22 → 2026-05-12 14:36) |
+| Span 5-day human plan | compressed to 2 calendar days (~2.5× AI velocity) |
+| Commits on main | 125 |
+| Tags pushed | 3 (v0.1.0 / v0.1.1 / v0.1.2) |
+| Rust crates | 3 (studio-router / studio-store / studio-server) |
+| Binary size | 9.0 MiB (single-file deployment) |
+| Rust tests at HEAD | 196 (32 ok groups, 0 FAILED) |
+| Playwright e2e | 14 hermetic + 2 dogfood (all green) |
+| Real-LLM e2e | PASS (codex-forwarder + gpt-5.5) |
+| ADRs | 6 (0001..0006) |
+| Findings | 4 (P0 / P1 / P2 / P3 all represented; 3 closed within session) |
+| Module-docs | 4 (studio-router / studio-store / studio-server / web-frontend) |
+| Opus sub-agent dispatches | ~18 (6 waves × 3-team trio + 4 reconcile rounds + 1 release-readiness agent) |
+| Persona dispatches | 3 (Mei / Aleksandr / Sarah) |
+| CI gates enforced | 6 |
+| Human work-hours (estimated) | 3-4 (strategic + 守闸 only) |
+| Agent work-hours (estimated) | ~22 active (across parallel sub-agents) |
+| AI velocity multiplier observed | ~2.5× on a 5-day plan |
+| F1.0 catches | 2 (BSD-sed; CTO 守闸 grep leak) |
+| F19 catches | 2 (M4 SPA fallback; v0.1.1 Cargo.lock) |
+| F20 catches | 2 (last_verified_commit HEAD placeholder; recursive doc-coverage §6 closure) |
+| F21 catches | 1 prospective (zero git-author leak; all 125 commits attributed cleanly) |
+| Methodology firsts | First F20 closure in non-origin project; first documented "tag → audit → patch" release pattern; first "recursive F20 closure" |
+
+---
+
+## §6 What's still ahead (post-session)
+
+These are out of scope for this case study but worth naming for completeness:
+
+- **AEAD real round-trip on `/login` (M5+)**: WebCrypto m2-stub auth
+  blob is opaque to the server today. Users set `ANTHROPIC_API_KEY` /
+  `OPENAI_API_KEY` env var as the actual auth path. Real
+  server-side decrypt deferred.
+- **Linux + Windows tarball CI matrix**: `release.yml` workflow + landed at M5 (`58cbe94`); awaits next tag to fire. +- **ADSD case-study back-port**: this document. +- **Design partner recruitment**: README §"Looking for 3-5 design + partners" published; concrete asks enumerated. + +None of these block the N=2 dogfood validation conclusion: the +methodology survived contact with a new codebase under acceleration, +and Studio's session produced enough catalogue-augmenting evidence +to retrofit F19/F20/F21 into validated-pattern status. + +--- + +## §7 Patterns I'd carry forward (Studio → next ADSD project) + +1. **3-team trio dispatch** at ≤3-way parallel for narrow-scope + waves. Reserve P9 layer for waves needing sub-decomposition of + the ADR itself. +2. **ADR §Addendum YYYY-MM-DD pattern**: never edit §"Decision"; + append corrections preserving the original CTO Phase-1 text. The + blame-integrity move. +3. **doc-coverage.sh layered enforcement**: presence + shape + + reachability + paired-gate exit-code on `cargo test`. Six gates + minimum, not five. +4. **F21 prospective discipline**: verify `git config user.name` + before every commit; suffix every sub-agent handle with the + session ID. +5. **Tag → audit → patch as a release pattern**: under acceleration, + first tag is the experimental substrate; expect K=2 patch tags + before first usable. +6. **Persona dispatch → README rewrite pipeline**: each persona finding + maps to exactly one PR; persona output is the highest-leverage + external-coherence source. + +## §8 Patterns I'd add or strengthen for v1.2.2+ of ADSD + +1. **6-gate canonical (extend the standard 5-gate)** — add §6 + doc-coverage as a load-bearing gate, with paired exit-code + + FAILED-grep on `cargo test`. The 5-gate is insufficient under + aggressive parallelism. +2. **F20 recursive closure pattern documentation** — F20 closure is + not one-shot; every enforcement layer needs its own paired + orthogonal-failure review. +3. 
**F1 "declared-empty-must-be-observed-empty" sub-form** — when an + ADR declares the absence of something (strip-lists, prohibitions), + the absence must be empirically observed and recorded. +4. **Tag → audit → patch as a release pattern** — explicit named + pattern in §4 of SKILL.md, with the v0.1.0/v0.1.1/v0.1.2 sequence + as canonical example. +5. **AI velocity = experimental cycles, not shippable-first-try** — + sharpen the SKILL.md §5 velocity guidance to plan for K patch tags + before first usable. +6. **Persona output → PR mapping** — extend §1 continuous-persona + testing with the explicit "every persona finding maps to exactly + one PR" rule. + +## §9 Patterns I'd reconsider + +1. **3-team trio dispatch on single-surface waves**: Wave M2 (SvelteKit + frontend, 5 pages) used the 3-team pattern but the parallel + review surface was narrow — REVIEW agent had little to audit until + DEV merged. **Reserve 3-team trio for cross-crate Rust waves**; + single-P7-with-self-review-step is sufficient for narrow surface. +2. **Triple-track docs (zh / en / agent) at bus factor 1**: maintained + for methodology fidelity, but the cost is real (every doc edit + touches 3 files). Cobrust N=1 has the same observation. + Consider downgrading to dual-track (en + agent) below ~3 + contributors, per SKILL.md §3 escape hatch. + +--- + +## §10 Closing + +Cobrust Studio is not a "solved" project. It's at v0.1.2 with: +- A working 9 MiB single-binary web console for AI agent dispatch +- A 6-gate CI bar that enforces ADR + finding + bilingual doc + discipline mechanically +- 196 Rust tests + 14 Playwright e2e + 2 dogfood specs + real-LLM + e2e all green at HEAD +- A documented patch dance (v0.1.0 broken → v0.1.1 broken → v0.1.2 + usable) that names each gap by file:line + +The ADSD methodology distilled from Cobrust (N=1) was the +**experimental substrate** for Studio (N=2). 
The result confirms: + +- **Core invariants hold under acceleration.** 4-tier topology + (collapsed to P10+P7), two-phase dispatch, 5-gate verification, + atomic commits, worktree-per-sprint, F21 identity hygiene — all + executed as documented. +- **The 5-gate is insufficient; 6-gate is the new floor.** Studio's + M4.1 §6 + v0.1.2 §6 paired-gate work is the canonical evidence. +- **F19/F20/F21 are validated as prevention mechanisms, not just + diagnostic vocabulary.** Each fired in Studio; each prevented or + caught a real shipping bug. +- **The patch dance is a release pattern, not a failure pattern.** + Tag → audit → patch is the right discipline under acceleration. + +If you adopt ADSD on your project after reading this case study, +expect to: +- Land your first tag in days, not weeks +- Expect K=2 patch tags before first usable +- Spend ~10% of project time on doc-coverage discipline (worth it — + Studio's 4 findings are all directly attributable to gate-level + enforcement gaps that the discipline made visible) +- Run a persona dispatch every release — the output is your highest- + leverage external-coherence source. + +The N=2 evidence is in. ADSD v1.2.1 holds. + +--- + +**Cobrust Studio origin**: 2026-05-11 17:22 +0800. +**ADSD N=2 dogfood completed**: 2026-05-12 14:36 +0800. +**Case study authored**: 2026-05-12 (this document). + +— Signed-off: studio-p7-adsd-backport-opus47 + (working window 2026-05-12; back-port commissioned by P10 CTO + studio-cto-session-002-opus47 after the v0.1.2 release sealed and + persona-audit output landed in M5) + +--- + +## §11 M6/M7 cycle empirical evidence (2026-05-12 evening) + +This section documents the second major wave of Cobrust Studio development, +covering **M6 (ADR-0007 AEAD round-trip)** and **M7 (ADR-0008 multi-provider +/login)**, both completed on the same calendar day as the v0.1.0–v0.1.2 +patch dance. 
The ADSD methodology was applied a **third and fourth time** via +the two-phase dispatch SOP in immediate succession, producing v0.2.0 (M6), +v0.2.1 (infrastructure patch), and v0.3.0 (M7) within ~6 hours wall-clock. + +The empirical findings from this cycle are qualitatively different from §2–§4: +where §2–§4 document the methodology's first real-world pressure-test (N=2 +dogfood), §11 documents the methodology **operating as a repeatable cadence** +— what happens when you apply the two-phase SOP twice in a row with no +intervening friction, and whether the patterns hold under that pressure. + +Dashboard update: + +``` +New tags in this cycle: 3 (v0.2.0 / v0.2.1 / v0.3.0) +New ADRs landed: 2 (ADR-0007 / ADR-0008) +New commits (M6+M7): ~13 (6 M6 commits + 7 M7 commits, including fixes) +Wall-clock total: ~6 hours (ADR-0007 spike → v0.3.0 tag) +Sarah persona cycles: 4 (v1 post-M4 → v2 post-M5 → v3 post-M6 → v4 post-M7) +P9 sub-agent dispatches: 2 (one for M6, one for M7; both opus, both 守闸'd) +Methodology firsts: Two consecutive two-phase SOP applications without + intervening friction; Sarah persona verdict path from + "6+ months out" to "pilot-ready NOW"; persona-found + bug fixed in the same cycle (same-cycle-closure) +``` + +--- + +### §11.1 Two-phase dispatch SOP applied twice consecutively + +**The M6 cycle (ADR-0007)** + +Phase 1 (CTO solo): `ADR-0007 secret-storage AEAD round-trip` was written +and committed before any implementation. 
The ADR documented: + +- Algorithm choice (AES-256-GCM + Argon2id; 4 options considered, 3 rejected) +- Wire format (packed `salt(16) || nonce(12) || ciphertext+tag` in the + `ciphertext` column of `session_kv` — avoids schema migration on the already-shipped table) +- Dispatch integration pattern (`Arc<RwLock<Option<SessionKey>>>` in AppState) +- 7 falsifiable Done-means criteria (unit tests / integration tests / E2E spec / doc-coverage / README / CHANGELOG / smoke-dogfood) +- `--dev-api-key` escape hatch for headless CI flows +- An explicit Phase 2 worktree target: `feature/m6-aead-round-trip` + +Phase 2 (P9 dispatch): the P9 agent received the ADR as its primary read, +implemented in worktree `feature/m6-aead-round-trip`, produced 6 commits, and +reported `[P9-COMPLETION]` with all 7 gates green. Wall-clock: **120 minutes**. +CTO 守闸 verified the diff, ran cold rebuild from clean `target/`, and merged +`--no-ff` at commit `dd0b181`. + +**The M7 cycle (ADR-0008)** + +After v0.2.0 tagged and Sarah v3 audited (see §11.2), Phase 1 for M7 was +written immediately: `ADR-0008 multi-provider /login`. 
The ADR documented: + +- 4 options (Option A: explicit field only; Option B: auto-detect from URL; + Option C: explicit field + URL hint; Option D: per-provider routes) +- Chose **Option C** — unambiguous wire format + friendly UX +- Wire-format additivity: `LoginRequest` gains `provider_kind` with + `#[serde(default)]` defaulting to `Anthropic` for v0.2.x back-compat +- `EndpointSecret` gains the same field so `provider_kind` lives **inside** the + AEAD ciphertext, not in SQLite plaintext metadata +- Dispatch match arm: `match secret.provider_kind { Anthropic => ..., Openai + => ..., Synthetic => Err(503) }` +- SvelteKit URL-hint logic: `$effect` reactive binding auto-suggests provider + based on URL typed, user can override +- 7 Done-means criteria (2 unit / 6 integration / 1 E2E / 7-gate CI / 2 doc + updates / CHANGELOG / README update) +- Phase 2 worktree: `feature/m7-multi-provider-login` + +Phase 2 (P9 dispatch): 7 commits, all 7 gates green, **90 minutes** wall-clock. +Merge `--no-ff` at commit `ae9df29`. + +**Why the second cycle was 30 minutes faster (90 vs 120 min)** + +Three compounding factors: + +1. **P9 prompt template reused verbatim.** The M6 dispatch prompt's structure + (working dir + required reads list + mission + deliverables + 7-gate target + + report format) was copy-adapted for M7 in under 5 minutes. No template + design overhead. + +2. **Test skeleton was a known pattern.** The M6 cycle established the shape + of `tests/secret_roundtrip.rs` (integration-test file with wiremock stub + + `#[ignore]`-attributed placeholder tests). M7's `tests/multi_provider_login.rs` + followed the identical pattern; the P9 agent had the M6 test file as a + required read and replicated the structure without hesitation. + +3. **SvelteKit form integration had M6 as a reference.** M6 had already added + the fourth input (Passphrase), restructured the SvelteKit `/login` page, + and wired `POST /api/login`. 
M7's addition of a Provider `<select>` dropdown + + `$effect` URL-hint was a targeted extension onto an already-known surface. + The P9 agent did not need to discover the SvelteKit form's structure; it was + already documented in the ADR-0008 Phase 1 spike (CTO Phase 1 had read M6's + form implementation and documented the exact extension point). + +**Methodology conclusion**: the two-phase SOP is **self-bootstrapping** when +applied consecutively. Each cycle leaves artifacts (test pattern, form shape, +dispatch prompt structure) that reduce the friction of the next cycle. This is +not specifically documented in ADSD §"Two-phase dispatch SOP" and is worth +adding as an operational note: *"The second cycle of a two-phase dispatch series +runs measurably faster than the first because the P9 prompt template, test +skeleton pattern, and integration surface are already established."* + +--- + +### §11.2 Continuous persona testing — 4 cycle Sarah path + +Sarah Chen is the Studio persona representing an OSS tech lead evaluating +AI-tooling for adoption at a 10–50 person engineering team. Her profile: +8 years Rust experience, responsible for build-vs-buy decisions, governance +concerns (bus factor, SECURITY.md, CONTRIBUTING.md), and pilot-readiness +gates for tooling used in production adjacent workflows. 
+
+Sarah ran **4 audit cycles in a single day**, each dispatched after a tag:
+
+| Cycle | Triggered by | Verdict | Key gate states |
+|---|---|---|---|
+| **v1** (post-M4) | v0.1.2 first usable tag | "6+ months out" | Gate #1 (AEAD round-trip) open; Gate #2 (multi-provider) open; Gate #3 (5-platform green) open |
+| **v2** (post-M5) | v0.1.3 CI matrix + persona-driven polish | "3 months out IF 3 pilot-gates close" | Gates named explicitly: #1 AEAD, #2 multi-provider, #3 5-platform tarball |
+| **v3** (post-M6 / v0.2.0–v0.2.1) | v0.2.1 5-platform green after macos-13 patch | "2 months out: gate #1 closed; gate #3 'one tag away'" | Gate #1 (AEAD) CLOSED; gate #2 (multi-provider) remains; gate #3 (5-platform) → predicted need for runner-pool patch (see §11.4) |
+| **v4** (post-M7 / v0.3.0) | v0.3.0 multi-provider /login | **"pilot-ready NOW for 1-5 person teams"** | All 3 pilot-gates CLOSED; remaining items are social/outreach, not code |
+
+The verdict shift from v3 to v4, a single-version jump from "2 months out"
+to "pilot-ready NOW", is the most concentrated signal in the four-cycle path.
+It validates that **ADSD's two-phase SOP, when applied cleanly to the right
+ADR, can close a persona-level gate in a single sprint**. Sarah v3's feedback
+on multi-provider was specific and actionable ("add a `provider_kind` field to
+`LoginRequest`; the fix is ~50 LoC in the LoginRequest struct and a match arm in
+`resolve_router`"). ADR-0008 Phase 1 adopted that framing verbatim as the
+decision rationale. M7 P9 closed the gate.
+
+**Cost vs value of 4 persona cycles**
+
+Each Sarah cycle cost approximately 30–40 minutes of Sonnet wall-clock (persona
+dispatch + structured report output). Total for 4 cycles: ~2–2.5 hours.
For +that cost, the project received: + +- A named set of pilot-readiness gates that organized the M6 and M7 sprint + priorities (instead of "what should we build next?", the answer was "what + closes Sarah's next gate?") +- Actionable PRs from each cycle (passphrase strength validation, Argon2id + benchmark, README security hierarchy table, passphrase rotation docs, + provider dropdown UX, deprecation warning on `api_key_env`) +- A public-facing verdict that could be quoted in design-partner outreach + ("our evaluator persona upgraded from '2 months out' to 'pilot-ready' in + a single sprint") + +**Framing: persona as pilot-readiness oracle** + +The four-cycle Sarah path demonstrates a specific application pattern not +explicitly named in ADSD §1 §"Continuous persona testing": the persona as a +**pilot-readiness oracle**. Each cycle produces a structured verdict with +explicit gate conditions. The project's sprint priorities are derived from +the gate conditions. When the gates close, the verdict changes. + +This is more structured than the §1 description ("spawn the same persona +after sprint completion → verify fix actually closes gap"). The oracle +framing adds: + +1. Each persona cycle's verdict is explicitly conditioned on named gates +2. The gates are stable across cycles (same 3 gates v1 through v4) +3. Sprint priorities are directly derived from open gates +4. Verdict change is the measure of sprint success, not just "gates closed" + +Back-port candidate for SKILL.md §1 §"Continuous persona testing": + +> *"Pilot-readiness oracle variant: for pre-release cycles, structure the +> persona's verdict as a named set of pilot-gates. Each cycle reports which +> gates are open vs closed. Sprint priorities derive directly from open gates. 
+> The verdict sequence (6+ months → 3 months → 2 months → pilot-ready NOW) +> is the empirical evidence that the sprint plan is closing the right gaps."* + +--- + +### §11.3 F1.0 declared-invariant gap → P9 implementation bug (seal-salt mismatch) + +**The bug** + +The M6 P9 implementation of `SessionKey::seal()` generated a **fresh random +salt on every call** and packed it into the blob header (`blob[..16]`). But +the `SessionKey` itself was derived from a **different** salt at login time +(the salt generated during the Argon2id KDF step in `POST /api/login`). + +Result: `blob[..16]` (packed salt) ≠ `self.salt` (derive salt). Any subsequent +`SessionKey::derive(passphrase, blob[..16])` produced a different 32-byte key +from the one stored in memory → AES-GCM tag mismatch → `SecretError::Open` +→ false-positive `wrong_passphrase` 400 on every re-login with the correct +passphrase. + +The symptom: Playwright login-aead.spec.ts test 2 (which exercised the re- +derive path: login → session drop → re-login same passphrase) and integration +test `restart_drops_key_returns_401` (which tested re-derive after simulated +restart) both reported `authenticated=false` after a valid second login. + +The fix (commit `3753a2b`): `SessionKey` now carries its `derive_salt` as a +field; `seal()` packs `self.salt` (not a fresh random salt) into the blob +header. Nonce remains fresh per seal (AES-GCM uniqueness requirement is per- +nonce, not per-salt). New test `seal_then_re_derive_then_open_round_trips` +locks the contract. + +**Root cause: F1.0 (declared-invariant gap)** + +ADR-0007 §"Wire format" stated explicitly: + +> *"packed salt enables re-derive — at restart the user re-types passphrase, +> server runs `derive(passphrase, blob[..16])` to reconstruct the key"* + +This is a declared invariant: the blob's first 16 bytes are the salt used to +derive the key, enabling re-derivation from the same passphrase. 
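The declared invariant and the seal-salt defect described above can be reproduced in a few lines. This is a minimal sketch, not the project's implementation: PBKDF2 stands in for Argon2id, and an XOR keystream plus an HMAC tag stands in for AES-GCM, so that the example runs on the standard library alone. The `buggy` flag and the helper `re_derive_round_trip` are hypothetical names introduced for illustration.

```python
import hashlib
import hmac
import os

SALT_LEN, NONCE_LEN, TAG_LEN = 16, 12, 32


def derive_key(passphrase: str, salt: bytes) -> bytes:
    # Stand-in KDF (PBKDF2-HMAC-SHA256); the project itself uses Argon2id.
    return hashlib.pbkdf2_hmac("sha256", passphrase.encode(), salt, 10_000, dklen=32)


def _keystream(key: bytes, nonce: bytes, n: int) -> bytes:
    # Toy keystream for illustration only; this is NOT AES-GCM.
    out, counter = b"", 0
    while len(out) < n:
        out += hashlib.sha256(key + nonce + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:n]


class SessionKey:
    def __init__(self, key: bytes, salt: bytes):
        self.key = key
        self.salt = salt  # the derive salt travels with the key

    @classmethod
    def derive(cls, passphrase: str, salt: bytes) -> "SessionKey":
        return cls(derive_key(passphrase, salt), salt)

    def seal(self, plaintext: bytes, *, buggy: bool = False) -> bytes:
        # buggy=True mimics the reported defect: a fresh salt goes into the header.
        header_salt = os.urandom(SALT_LEN) if buggy else self.salt
        nonce = os.urandom(NONCE_LEN)  # fresh per seal in both variants
        stream = _keystream(self.key, nonce, len(plaintext))
        ct = bytes(a ^ b for a, b in zip(plaintext, stream))
        tag = hmac.new(self.key, header_salt + nonce + ct, hashlib.sha256).digest()
        return header_salt + nonce + ct + tag

    def open(self, blob: bytes) -> bytes:
        body, tag = blob[:-TAG_LEN], blob[-TAG_LEN:]
        expected = hmac.new(self.key, body, hashlib.sha256).digest()
        if not hmac.compare_digest(tag, expected):
            raise ValueError("tag mismatch")
        nonce = body[SALT_LEN:SALT_LEN + NONCE_LEN]
        ct = body[SALT_LEN + NONCE_LEN:]
        return bytes(a ^ b for a, b in zip(ct, _keystream(self.key, nonce, len(ct))))


def re_derive_round_trip(passphrase: str, *, buggy: bool) -> bool:
    # derive -> seal -> extract salt from blob -> re-derive -> open
    key = SessionKey.derive(passphrase, os.urandom(SALT_LEN))
    blob = key.seal(b"secret", buggy=buggy)
    key2 = SessionKey.derive(passphrase, blob[:SALT_LEN])
    try:
        return key2.open(blob) == b"secret"
    except ValueError:
        return False
```

With `buggy=False` the round trip succeeds; with `buggy=True` the re-derived key differs and the tag check fails, even though sealing and opening with the same in-memory key still works.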
+ +The P9 test corpus (6 unit tests as specified in ADR-0007's Done-means §1) +tested: +- `argon2id_kdf_deterministic` — same passphrase + salt → same key +- `aes_gcm_round_trip` — encrypt-decrypt round-trip +- `wrong_passphrase_fails_open` — wrong passphrase → error +- `tampered_ciphertext_fails_open` — bit flip → error +- `tampered_salt_fails_open` — flip salt bytes → different key → error +- `malformed_blob_too_short` — short input → error + +**None of these 6 tests exercised the re-derive path**: `key.seal()` followed +by `derive(same_passphrase, blob[..16])` followed by `key2.open(blob)`. The +test corpus ran `seal` then `open` with the same key — which passes trivially +because the wrong salt is packed but the same wrong-salt key is used to +open. The test couldn't detect the bug because it never exercised the contract +path. + +The bug was **structurally invisible** to the unit test corpus. The Playwright +E2E test caught it because it exercised the re-derive path *naturally* — the +test simulated what a real user does (restart browser, re-enter passphrase, +expect dispatch to work). + +**This is textbook F1.0**: the invariant was declared ("packed salt enables +re-derive") but the test corpus did not contain a test that would prove the +invariant holds on the code path that exercises it. The gap was structural, +not an oversight — the 6 tests in ADR-0007's Done-means were necessary but +not sufficient. + +**Methodology finding: persona-found bug + same-cycle closure** + +The bug was caught by the E2E test the same day as the v0.2.0 tag. This is a +data point that ADSD §1 §"Continuous persona testing" coverage caught what the +unit test corpus structurally missed: + +1. Unit test corpus: exercises individual operations on individual components + (key derivation, encryption, tamper detection) +2. Integration test corpus: exercises API-level round-trips (login → dispatch → + logout), which happen to exercise `seal` but not re-derive +3. 
Playwright E2E: exercises user-level scenarios (browser session drop +
+   re-login), which naturally exercise the re-derive path
+
+The E2E tests are the **orthogonal coverage layer** that the unit and
+integration tests structurally cannot provide. This generalises: for any
+invariant that depends on a **sequence of user-level actions across session
+boundaries** (login → restart → re-login; install → upgrade → re-install;
+publish → consumer → upgrade), the test that proves the invariant must simulate
+that sequence end-to-end.
+
+**Back-port candidate for failure-modes-catalogue §F1.0 §"Prevention"**:
+
+> *"For any ADR §'Wire format' or §'Decision' that declares a re-derive /
+> re-construct path ('packed salt enables re-derive'), the
+> Done-means test corpus MUST include a test that exercises that path
+> end-to-end: derive → seal → extract-from-blob → re-derive → open.
+> Unit tests that only run `seal; open` on the same key cannot detect
+> derive-salt vs seal-salt mismatch."*
+
+---
+
+### §11.4 macos-13 (Intel) runner queue stall — infrastructure-not-code
+
+**The pattern**
+
+v0.1.3 (M5 CI matrix release) and v0.2.0 (M6 AEAD release) both shipped
+**4 of 5 platform tarballs** because the GitHub-hosted `macos-13` (Intel
+x86_64) runner queue stalled for 30+ minutes on the `x86_64-apple-darwin`
+build job. The job eventually timed out or the release was tagged incomplete.
+
+Sarah v3's audit, dispatched against v0.2.0 + the v0.2.1 state, included an
+explicit prediction:
+
+> *"If this stalls again, consider whether the cross-compile setup needs to
+> change. The macos-13 runner pool appears to have queue depth issues."*
+
+The v0.2.1 release addressed this directly: `.github/workflows/release.yml`
+was patched to cross-compile `x86_64-apple-darwin` from `macos-14` (Apple
+Silicon) using `--target=x86_64-apple-darwin`. Rust + Apple clang both support
+this natively.
The only change was the runner label (`macos-13` → `macos-14` +with the existing `--target=x86_64-apple-darwin` flag triggering +cross-compilation). **v0.2.1 shipped all 5 platform tarballs first-time green.** + +**The lesson for ADSD** + +Not every release-cycle regression is a code bug. CI infrastructure +dependencies — GitHub-hosted runner pool queue depths, external service +availability, macOS runner generations — can stall a release in ways that +are invisible from the code itself. The release.yml is correct; the runner +pool is the failure mode. + +The ADSD §4 "tag → audit → patch" pattern applies here, but with a critical +distinction: **v0.2.1 contained no code changes**. The patch was +infrastructure-only. The existing ADSD framing of "no tag→patch dance" as +a failure pattern should be refined: + +> **"No CODE tag→patch dance" is the rule. Infrastructure patches between +> tags are acceptable when the audit predicted the failure mode.** A +> release.yml runner-label fix that addresses a predicted runner-pool stall +> is not a methodology failure; it's the pattern working correctly (Sarah v3 +> predicted the stall; v0.2.1 closed it). + +This refinement matters for future ADSD projects that run multi-platform CI: +the infrastructure layer (runner pools, action versions, Docker image +availability, certificate expiry) is a legitimate release-infra concern that +sits outside the code quality envelope. Auditing "release readiness" must +include the infrastructure layer, not just the code. + +**Sarah v3 as a predictive audit** + +Sarah v3's explicit prediction of the runner-pool stall — before v0.2.1 was +tagged — is notable. The prediction was based on observing the pattern twice +(v0.1.3 and v0.2.0 both missing the Intel tarball) and inferring that the +`macos-13` runner pool was structurally insufficient. This is the +**predictive audit** pattern: a persona or reviewer that has enough context +to identify failure modes the team hasn't explicitly discussed. 
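The inference behind that prediction, the same target missing from consecutive releases, can be sketched as a small check over release history. This is an illustrative sketch only: the function name `recurring_gaps` and the input shape (an ordered list of `(release, missing_targets)` pairs) are assumptions, not part of the project's tooling.

```python
def recurring_gaps(missing_by_release, threshold=2):
    """Return targets missing from at least `threshold` consecutive releases.

    `missing_by_release` is an ordered list of (release, set_of_missing_targets).
    """
    streak = {}
    flagged = set()
    for _release, missing in missing_by_release:
        # A target that shipped this time loses its streak.
        streak = {t: n for t, n in streak.items() if t in missing}
        for target in missing:
            streak[target] = streak.get(target, 0) + 1
            if streak[target] >= threshold:
                flagged.add(target)
    return flagged


history = [
    ("v0.1.3", {"x86_64-apple-darwin"}),
    ("v0.2.0", {"x86_64-apple-darwin"}),
]
print(recurring_gaps(history))
```

A single missing tarball stays below the threshold; the second consecutive miss is what turns an incident into a candidate structural cause.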
+
+The mechanism: Sarah v3 had read the CHANGELOG (which named "4 of 5 platform
+tarballs" for v0.1.3) and the release.yml. Two data points showing the same
+pattern were enough for a structural inference. ADSD §1 external review discipline already notes that
+external reviewers "find what the internal team won't think to find"; this is a
+persona-level instance of that capability.
+
+---
+
+### §11.5 Autonomous loop discipline + autonomous-vs-confirm boundary
+
+**The restatement pattern**
+
+Across the full Studio project history (M0 through M7, spanning roughly 6
+hours wall-clock for the M6/M7 segment), the user explicitly restated the
+"autonomous loop, don't ask for permission" rule a total of **4 times**. Each
+restatement occurred when the CTO agent paused to ask for confirmation before
+an action that was clearly autonomous-safe:
+
+| Restate # | Context | What the agent asked | Why it was wrong |
+|---|---|---|---|
+| 1 | Early M1 | "Should I proceed with the router lift?" | Lift was already specified in ADR-0006; Phase 2 was in flight |
+| 2 | M4 post-tag | "Should I dispatch the M4.1 release-readiness audit?" | M4 tag was already pushed; the SOP mandates post-tag audit |
+| 3 | M5→M6 transition | "Should I start M6 now?" | Sarah v2 had explicitly named AEAD as pilot-gate #1; M6 was the next clear action |
+| 4 | Post-v0.3.0 | "Should I update the Show HN draft?" | Editing a local file in the project repo is autonomous-safe by any reasonable boundary |
+
+Restatement #4 is the canonical example: the agent paused to ask permission
+before editing `docs/outreach/show-hn-draft-v1.md`, a local file in the
+project repo, with no external publication step involved. The user's response
+was "你是 CTO,这种事情不需要问。" (You're the CTO; you don't need to ask
+for this.)
+
+**The autonomous-vs-confirm boundary (canonical refinement)**
+
+The 4 restatements across the project's history show enough of a pattern to
+formalize.
The boundary is: + +**Autonomous (proceed without asking)**: +- Edit local files (code, docs, configuration) +- Commit to the working branch +- Push to the project's remote (non-force) +- Merge feature branches to main (--no-ff) +- Tag a release +- Dispatch sub-agents (within the 4-way parallel cap) +- Update documentation, READMEs, CHANGELOG entries +- Run test suites, CI gates, verification scripts + +**Requires P10 confirmation**: +- Post to an external service (HN, Twitter/X, LinkedIn, email blast) +- DM specific individuals (potential design partners, press, investors) +- Spend money (API credits beyond project budget, compute infrastructure) +- Force-push to public main (or any destructive git operation) +- Publish a GitHub Release with release notes (the tag is autonomous; the + public announcement text warrants a quick P10 read) +- License or legal decisions + +The boundary is: **local + reversible + no external audience = autonomous; +external + irreversible + involves real people or money = confirm**. + +**Why this matters for ADSD §"Operating instructions for agents"** + +ADSD SKILL.md §8 ("When to bend ADSD") notes "Default to proceed" but the +catalogue doesn't give explicit boundary examples. The Studio experience +provides the canonical boundary definition with 4 concrete restatement +instances as evidence. The lesson is: + +> *"An agent that asks 'should I edit this file?' is not operating autonomously. +> An agent that asks 'should I post this to HN?' is operating correctly. The +> boundary is external audience + irreversibility."* + +Back-port candidate for SKILL.md §5 §"Operating instructions for agents": + +> *"Autonomous-vs-confirm boundary (empirical from Studio N=2): editing +> local files, committing, pushing, merging, tagging, dispatching sub-agents, +> running scripts — all autonomous. Posting to external services, DMing +> individuals, spending money, force-pushing public main — confirm with P10. 
+> If you pause before editing a local documentation file, you are being too +> conservative; the user will correct you."* + +--- + +### §11.6 New catalogue entries proposed: F29 and F1.5 + +The M6/M7 cycle surfaces two failure-mode patterns worth proposing for the +catalogue. Both pass the bar of "actionable in future projects, not a one-off +curiosity." + +**F29 proposal: cross-platform runner-pool dependency as a release-infra failure mode** + +*Distinct from F1.0 (declared invariant gap) because the failure is not in +code or documentation — it's in the infrastructure layer that executes the +release*. A release workflow declares "all 5 platforms ship as tarballs" (the +intent). The release.yml is correct (the code). The GitHub-hosted runner pool +for one of the 5 targets (`macos-13` Intel) has insufficient queue depth or +availability. Two consecutive releases ship 4/5 tarballs despite correct code. + +This is a new failure mode class: **infrastructure-not-code release +regression**. F1.0 handles "the code declares an invariant the tests don't +enforce." F29 handles "the release workflow declares a multi-platform target +that the runner infrastructure can't reliably serve." + +The recovery pattern (cross-compile from a more reliable runner with the same +`--target=X` flag) is actionable and applicable to any multi-platform CI +release that uses GitHub-hosted runners. + +Evidence: v0.1.3 and v0.2.0 both missing Intel macOS tarball; v0.2.1 fixed +via `macos-14 --target=x86_64-apple-darwin` runner-label patch. Sarah v3 +predicted the failure before v0.2.1. + +*Candidate F29 is proposed for the catalogue at time of this case-study +back-port. Promoted from candidate if a second instance is observed in a +different ADSD project.* + +**F1.5 proposal: test-corpus structural blind spot (re-derive path gap)** + +*F1 Sediment Family sub-form.* The existing F1.0 covers "declared invariants +without enforcement" at the schema / snapshot / constitution level. 
The M6 +seal-salt bug introduces a narrower sub-form: **a declared wire-format +invariant has a re-construct path that the test corpus structurally cannot +exercise because the test always uses the same in-memory key for both seal +and open**. + +The pattern: ADR declares "packed field enables re-derive" (or "packed field +enables re-construct / re-validate / re-open"). The unit test corpus tests +`seal()` and `open()` on the same key object, not `seal()` → extract field +→ `derive(same_params, extracted_field)` → `open()`. The in-memory key +object bypasses the serialization-deserialization path that the packed field +is meant to support. Bugs in the packed field's content (wrong value packed) +are invisible. + +This sub-form is distinct from F1.0 because: +- The enforcement mechanism (unit tests) exists and passes +- The gap is that the tests don't cover the *path being claimed* (re-derive), + only the *happy path* (direct key reuse) +- Detection requires E2E or integration tests that simulate the full + user-level sequence including session drops + +Evidence: ADR-0007 §"Wire format" ("packed salt enables re-derive"); M6 P9 +unit tests passing; Playwright E2E test detecting the bug on the first run. +Fix at commit `3753a2b`. + +*Candidate F1.5 is proposed for the F1 Sediment Family. 
Both F29 and F1.5 +should land in the failure-modes-catalogue at v1.2.7 when the case study +back-port is complete.* + +--- + +### §11.7 Updated numbers (cumulative through v0.3.0) + +| Metric | v0.1.2 baseline (§5) | M6/M7 additions | Cumulative | +|---|---|---|---| +| Tags pushed | 3 | 3 (v0.2.0 / v0.2.1 / v0.3.0) | 6 | +| ADRs | 6 | 2 (ADR-0007 / ADR-0008) | 8 | +| P9 sub-agent dispatches | 0 (all P7 in N=2) | 2 | 2 | +| Persona cycles | 3 (Mei/Aleksandr/Sarah v1) | 3 (Sarah v2/v3/v4) | 6 | +| Rust tests at HEAD | 196 | +~25 (secret module + integration + re-derive) | ~221 | +| Two-phase SOP applications | 0 (Phase-1-only per wave in N=2) | 2 (M6+M7 both full two-phase) | 2 | +| F1.0 catches | 2 (BSD-sed; grep leak) | 1 (seal-salt mismatch) | 3 | +| Infrastructure-not-code patches | 0 | 1 (v0.2.1 runner-label fix) | 1 | +| Persona verdict shifts | 0 | 3 (Sarah v1→v2→v3→v4) | 3 | +| Autonomous-loop restatements | 2 | 2 | 4 total across project | + +--- + +### §11.8 Closing for the M6/M7 cycle + +The M6/M7 cycle answers a question that §2–§4 could not: **what does ADSD +look like when it's working reliably, not being stress-tested?** + +The stress-testing phase (N=2 dogfood) produced the F19/F20/F21 catches, the +three-patch-tag dance, the grep-leak finding. All of those were methodology +discovering its own enforcement gaps. The M6/M7 cycle ran the two-phase SOP +twice in a row with no grep leaks, no bad-baseline agents, no infrastructure +surprises (the macos-13 stall was predicted and patched cleanly). The primary +anomaly — the seal-salt bug — was caught by the E2E layer the same day it was +introduced and closed with a single commit. + +What that says about the methodology: + +1. **Two-phase SOP is genuinely repeatable.** Applied once (M6), the pattern + established templates and patterns that made the second application (M7) + 30 minutes faster. The SOP is not a ceremonial overhead; it compounds. + +2. 
**Persona-as-oracle produces a convergent verdict path.** Sarah's 4 cycles + produced a monotone improving sequence terminating in "pilot-ready NOW." + The gates were stable; the sprints were pointed at the gates; the gates + closed. This is the methodology working as designed. + +3. **E2E coverage is the orthogonal layer that unit tests cannot substitute.** + The seal-salt bug was structurally invisible to 6 unit tests and 3 + integration tests. One Playwright test caught it. The lesson generalises: + for any invariant that lives on a path across session boundaries, E2E + coverage is not optional. + +4. **Infrastructure is part of the release envelope.** The macos-13 stall is + not a methodology failure; it's a reminder that "release readiness" extends + to the runner pool, not just the code. The F29 candidate entry captures + this for future projects. + +5. **Autonomous-vs-confirm boundary needs explicit documentation.** Four + restatements across the project's history is a signal that the ADSD + methodology's "default to proceed" guidance is insufficient without explicit + boundary examples. The boundary (local + reversible = autonomous; external + + irreversible = confirm) is actionable and should land in SKILL.md. 
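The boundary restated in item 5 can be written down as a predicate an agent checks before acting. This is a sketch under one reading of the rule, in which any single property (external audience, irreversibility, or spending money) is enough to require confirmation; the `Action` type and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    external_audience: bool = False  # reaches people outside the project
    irreversible: bool = False       # cannot be cleanly undone
    spends_money: bool = False       # beyond the agreed project budget


def requires_confirmation(action: Action) -> bool:
    # Assumption: one triggering property is sufficient to pause and ask.
    return action.external_audience or action.irreversible or action.spends_money


edit_draft = Action("edit docs/outreach/show-hn-draft-v1.md")
post_hn = Action("post to HN", external_audience=True, irreversible=True)
force_push = Action("force-push public main", irreversible=True)

for action in (edit_draft, post_hn, force_push):
    verdict = "confirm" if requires_confirmation(action) else "autonomous"
    print(f"{action.name}: {verdict}")
```

Encoding the rule as data rather than prose makes the default explicit: an action with no flags set proceeds without a question.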
+ +--- + +**M6/M7 section authored**: 2026-05-12 (evening) + +— Signed-off: adsd-case-study-update-m6m7-sonnet46 + Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> diff --git a/plugins/adsd/skills/agent-driven-development/reference/cobrust-f41-f43/F41-source-surface-leakage-codegen-primitive.md b/plugins/adsd/skills/agent-driven-development/reference/cobrust-f41-f43/F41-source-surface-leakage-codegen-primitive.md new file mode 100644 index 0000000..bb8f186 --- /dev/null +++ b/plugins/adsd/skills/agent-driven-development/reference/cobrust-f41-f43/F41-source-surface-leakage-codegen-primitive.md @@ -0,0 +1,136 @@ +--- +catalogue_id: F41 +title: "Source-surface leakage of codegen-internal primitive — type-suffix name fossilizes in user-facing API" +family: F1-Sediment (design-surface contamination sub-form) +severity: P1 +status: ratified_2026-05-20 +empirical_project: Cobrust Phase G sprint (2026-05-19/20) +cobrust_local_id: F38 (f38-source-surface-leakage-codegen-primitive.md) +date_ratified: 2026-05-20 +cobrust_sha: 46c0946 +resolution_adr: ADR-0064 (print-monomorphization-source-surface-cleanup) +constitutional_binding: CLAUDE.md §2.5 (LLM-first design principle, training-data-overlap rule) +--- + +# F41 — Source-surface leakage of codegen-internal primitive + +## Pattern + +A codegen-internal primitive — named by type shape (`<verb>_<type>`, e.g., +`print_int`, `print_str`) — leaks into the source-face PRELUDE during a +demo sprint. It fossilizes when subsequent waves do not audit the question: +"is this name source-face API or codegen-internal symbol?" + +The leak path: + +1. Demo sprint needs to prove codegen works → quickest route is direct + monomorphic names (`print_int`, `print_str`). +2. Demo lands, wave closes, no cleanup ADR authored. +3. Next wave sees the names in PRELUDE, writes examples against them, + accumulates usage at call sites. +4. 
By the time an audit catches it, migration cost is non-trivial + (50-100+ call sites across examples, fixtures, skills). + +This is not a logic bug. It is a **design-surface contamination bug**: the +internal implementation vocabulary bleeds into the user vocabulary. + +## Root cause + +Two independent dynamics compound: + +- **Sprint-tempo bias**: demo-ware ships the shortest path to visible output. + Monomorphic names (`print_int`) are that shortest path. No gate asks "is + this user-facing?" at demo time. +- **Accumulation drift (F1 Sediment)**: wave-2 onward does not re-examine + whether PRELUDE entries are source-face intentional. Each usage is another + call site, each call site raises the migration cost, which raises the + perceived risk of cleanup, which delays the cleanup further. + +## Why this is critical for ADSD / LLM-first projects + +Per CLAUDE.md §2.5 (LLM-first design principle, constitutional north star): + +> Cobrust is the language LLM agents write correctly on the first try. + +The **training-data-overlap rule** is the key binding: + +- LLMs trained on Python/Rust write `print(x)` — one of the highest-frequency + call patterns in any Python corpus. +- `print_int(x)` appears in neither Python nor Rust training data. It is a + Cobrust-internal artifact. +- Result: LLM generates `print(x)` → `NameError: print_int is not defined` → + LLM confused by gap between prior and actual API → corrective loop consumes + tokens and latency for zero semantic value. + +Every type-suffix source-face name is a **friction multiplier on every future +LLM-driven generation session** against the codebase. 
+
+## Empirical evidence (Cobrust 2026-05-19/20)
+
+**Affected names (Phase E demo era, Cobrust 2026-04):**
+
+| Source-face name (wrong) | Should be    | Internal C-ABI symbol    |
+|--------------------------|--------------|--------------------------|
+| `print_int`              | `print`      | `__cobrust_print_int`    |
+| `print_str`              | `print`      | `__cobrust_print_str`    |
+| `print_bool`             | `print`      | `__cobrust_print_bool`   |
+| `print_float`            | `print`      | `__cobrust_print_float`  |
+
+**Call-site count at cleanup (ADR-0064 sprint):**
+- 133 `.cb` call sites + ~200 Rust inline-source test strings refactored.
+- Net source delta ~333 LOC across 4 cleanup commits.
+
+**Sprint commit references (Cobrust main):**
+- `c73be4e` — PRELUDE table: remove `print_int`/`str`/`bool`/`float` source-face entries
+- `b51b907` — polymorphic `print()` dispatch in `synth_call` + codegen monomorphization
+- `5e87e77` — mechanical refactor: 133 `.cb` call sites + Rust inline strings → `print()`
+- `46c0946` — Phase 4 fix: `Ty::None` callret locals must dispatch to `__cobrust_println_int`
+  not str-buf (caught by regression during cleanup)
+
+**Ratified at:** commit `46c0946` (feature/0064-print-mono, rebased on main 2026-05-20).
+
+**Post-ratification state:**
+- Zero `print_int`/`print_str`/`print_bool`/`print_float` call-sites in any `.cb` file
+  under `examples/`. Confirmed via `grep -rEn "print_(int|str|bool|float)\(" examples/ --include="*.cb"` → empty.
+- LC-100 12/12 maintained (including LC-05 which caught a `Ty::None` dispatch bug exposed by cleanup).
+- 5+ integration tests passing for polymorphic `print`.
+
+## Detection rule (CI gate candidate)
+
+For every function listed in the PRELUDE source-face table:
+
+> If the function name matches `<verb>_<type>` where `<type>` ∈
+> `{int, str, bool, float, list, dict, set, tuple, ...}`, file an audit issue:
+> "should this be polymorphic in source?"
+
+```python
+import re
+
+TYPE_SUFFIX = re.compile(r'^[a-z_]+_(int|str|bool|float|list|dict|set|tuple)$')
+
+def audit_prelude(source_face_names):
+    # One warning per PRELUDE name shaped like <verb>_<type>.
+    return [
+        f"PRELUDE name '{name}' matches type-suffix pattern: "
+        "verify it is source-face intentional, not codegen-internal leakage"
+        for name in source_face_names if TYPE_SUFFIX.match(name)
+    ]
+```
+
+Candidate for a lint pass in CI. False positives should be rare on a well-curated PRELUDE:
+intentional type-suffix names are uncommon, and any hit deserves a justification comment.
+
+## Resolution path
+
+1. **Identify**: grep PRELUDE source-face table for `<verb>_<type>` names.
+2. **Classify**: for each hit, determine whether it is source-face intentional
+   (user writes it) or codegen-internal (should be hidden behind a polymorphic
+   dispatch).
+3. **Cleanup sprint**: remove the monomorphic names from PRELUDE; add polymorphic
+   dispatch that routes `print(x: T)` to `__cobrust_print_T` post-typecheck.
+4. **Mechanical refactor**: batch-rename all call sites (mirrors LC-100 &borrow
+   226-site batch pattern — treat as a mechanical sprint, not a semantic one).
+5. **Gate**: add CI lint to prevent re-introduction.
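Steps 3 and 4 of the resolution path can be illustrated with a small sketch. It is written in Python for readability and is not the Cobrust compiler's code: `resolve_print_symbol` and `rename_call_sites` are hypothetical helpers, and the type names are passed as plain strings where a real compiler would use its own type representation. Only the four symbols listed in the evidence table are assumed to exist.

```python
import re

# Internal C-ABI symbols, keyed by the statically inferred argument type.
PRINT_SYMBOLS = {
    "int": "__cobrust_print_int",
    "str": "__cobrust_print_str",
    "bool": "__cobrust_print_bool",
    "float": "__cobrust_print_float",
}


def resolve_print_symbol(arg_type: str) -> str:
    """Step 3: route a source-level print(x: T) to its monomorphic symbol."""
    try:
        return PRINT_SYMBOLS[arg_type]
    except KeyError:
        raise TypeError(f"print() has no monomorphization for type '{arg_type}'") from None


# \b keeps the internal __cobrust_print_* symbols out of the match.
_MONO_CALL = re.compile(r"\bprint_(?:int|str|bool|float)\(")


def rename_call_sites(source: str) -> str:
    """Step 4: mechanically rewrite print_<type>( call sites to print(."""
    return _MONO_CALL.sub("print(", source)


print(resolve_print_symbol("int"))
print(rename_call_sites("print_int(n)\nprint_str(s)"))
```

Keeping the type-to-symbol table in one place means the type suffix survives only in the internal symbol names, which is the separation the finding asks for.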
+ +## Related findings + +| Finding | Relationship | +|---------|--------------| +| F36 — fixture-name-vs-behavior drift (Cobrust F36) | Same family: wave-1 demo-ware fossilizes without audit checkpoint | +| F37 — silent-rot-on-accepted-debt (Cobrust F37) | Same family: accepted debt silently accumulates usage; no discipline at debt boundary | +| F1 — Declared rules without enforcement | Parent family: "design surface should be polymorphic" is common sense; no enforcement gate exists at PRELUDE authorship time | diff --git a/plugins/adsd/skills/agent-driven-development/reference/cobrust-f41-f43/F42-device-name-leakage-public-artifacts.md b/plugins/adsd/skills/agent-driven-development/reference/cobrust-f41-f43/F42-device-name-leakage-public-artifacts.md new file mode 100644 index 0000000..b111c4c --- /dev/null +++ b/plugins/adsd/skills/agent-driven-development/reference/cobrust-f41-f43/F42-device-name-leakage-public-artifacts.md @@ -0,0 +1,141 @@ +--- +catalogue_id: F42 +title: "Device-identifying names leaked into git history and repo files via sub-agent memory read-through" +family: F1-Sediment (opsec-boundary sub-form) +severity: P1 (privacy / opsec — identifying info in public repo) +status: ratified_2026-05-19 +empirical_project: Cobrust pre-publish privacy sweep (2026-05-19) +cobrust_local_id: F39 (f39-device-name-leakage-in-commits.md) +date_ratified: 2026-05-19 +cobrust_sha: d012df9 +resolution: git filter-repo force-rewrite + rename + CI grep gate (Option A) +discovered_by: P10 CTO emergency audit — pre-publish privacy sweep +--- + +# F42 — Device-identifying names leaked into public artifacts via sub-agent memory read-through + +## Pattern + +Sub-agents writing commit messages, ADRs, and module documentation frequently embed +**device-identifying strings** sourced from operator memory references — hostnames, +IP addresses, SSH port numbers, GPU model SKUs, OS kernel versions, user login names — +into public-repo artifacts that land on `main`. 
Pre-publish, this leaks operator +infrastructure opsec into a soon-public repository. + +The mechanism is a **memory read-through without opsec boundary**: + +1. Operator stores concrete connection info in agent memory (e.g., `reference_x86_workstation.md`) + so they can reconnect quickly between sessions. +2. Sub-agents reading that memory treat the literals as **publishable grounding detail** + (it "contextualizes" the work) rather than **opsec-sensitive material**. +3. No pre-write rule prohibits embedding these strings. CI does not grep commit/diff text + for banned patterns. +4. Strings accumulate in commit messages (not trivially rewriteable in a normal git flow), + ADRs, workflow files, and architecture pages over many sprint sessions. + +## Root cause + +This is F1-family: the rule "don't embed infrastructure literals in publishable text" +exists as common sense, but no enforcement gate verifies it at write time or CI time. + +Two independent contributing factors: + +- **Memory-to-artifact boundary ambiguity**: agents correctly use memory to orient + themselves. The distinction "this literal is ops-private" vs. "this literal is + publishable" is not enforced at the tool boundary. Any memory read can silently + propagate private literals into any subsequent write. +- **Commit message irreversibility**: file contents can be edited in place; commit + messages require history rewrite. The longer the leak persists, the more invasive + the remediation (force-push, filter-repo, coordinated branch cleanup). + +## Empirical evidence (Cobrust 2026-05-19, pre-rewrite) + +**Quantified leak inventory:** +- **31 commit messages** across `main` + feature branches contained one or more of: + `DG-Workstation-2x3090`, `wubingjing`, `112.74.60.44`, `port 10040`, `Linux 6.x kernel`. 
+- **18 repo files** carried the same strings inline: + - 8 ADRs + - 2 architecture pages + - 4 test files + - 1 module documentation page + - 1 spike document + - 1 GitHub Actions workflow file +- **Workflow filename** `.github/workflows/workstation-gates.yml` itself hinted at + the host identity tier via its name. + +**Remediation executed (Cobrust 2026-05-19):** +- `git filter-repo --replace-text` + `--replace-message` rewrote all branches, + mapping device-identifying strings to neutral placeholders: + - hostname → `<self-hosted-runner>` + - user login → `<runner-user>` + - IP address → `<runner-ip>` + - SSH port → `<runner-port>` + - GPU model SKU → `<gpu-host>` + - OS kernel version → `linux x86_64 host` +- 18 leftover worktree branches deleted (local + remote). +- Workflow renamed to `.github/workflows/self-hosted-gates.yml`. +- Force-pushed `main` with rewritten history (solo dev, no external consumers, + operator explicit authorization). +- Ratified at commit `d012df9`. + +## Detection rule (CI gate — open as of ratification) + +Add a pre-commit / CI grep gate that fails the build if any banned literal reappears: + +```bash +# .github/workflows/opsec-lint.yml (or pre-commit hook) +BANNED_PATTERNS=( + "DG-Workstation" # specific host class name + "wubingjing" # specific user login + "[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+" # any IPv4 (catch-all) + "port [0-9]{4,5}" # explicit SSH port references + "RTX [0-9]{4}" # GPU model SKU + "Linux [0-9]+\.[0-9]+" # minor kernel version +) +for pattern in "${BANNED_PATTERNS[@]}"; do + if git diff --cached | grep -qE "$pattern"; then + echo "OPSEC LINT FAIL: banned pattern '$pattern' in staged diff" + exit 1 + fi +done +``` + +Apply to commit messages via `commit-msg` hook as well as file content via `pre-commit`. + +## Going-forward rule + +When writing commit messages, ADRs, module docs, or any other publishable artifact, +**never** embed: + +- Specific hostnames (use `<self-hosted-runner>` or `runner host`). 
+- Specific user logins (use `<runner-user>` or `the operator account`). +- IP addresses (use `<runner-ip>` or `the runner endpoint`). +- SSH port numbers (use `<runner-port>` or `the SSH port`). +- GPU model SKUs as tier identifiers (use `<gpu-host>` or describe capability: "x86_64 GPU host with CUDA"). +- OS minor version + kernel version (use `linux x86_64 host`). + +Initials-only references (e.g., "DG verify", "on DG") are acceptable when the +two-letter token does not uniquely identify a public-facing artifact. + +## Resolution path + +If the leak has already accumulated: + +1. **Audit**: `git log --all --oneline | xargs -I{} git show {} -- | grep -E "<pattern>"` to + quantify the blast radius across all branches and files. +2. **Triage**: separate file-content leaks (patchable in place) from commit-message leaks + (require filter-repo rewrite). +3. **Rewrite**: `git filter-repo --replace-text replacements.txt --replace-message replacements.txt` + where `replacements.txt` maps each banned literal to its neutral placeholder. +4. **Branch cleanup**: delete worktree branches that carried unrewritten history. +5. **Gate**: add CI opsec lint as described in the Detection Rule above. +6. **Memory cross-link**: add an in-repo finding file so future agents resuming without the + operator's memory entry still have the rule available. 
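The gate in step 5 also has to cover commit messages, which a staged-diff check does not see. A minimal sketch of the check a `commit-msg` hook could run follows; the pattern set is a generic, hypothetical one (IPv4 addresses and explicit port references), and a real hook would load the project's own banned list.

```python
import re

# Hypothetical generic patterns; a real hook would load the project's own list.
BANNED = {
    "ipv4": re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"),
    "ssh-port": re.compile(r"\bport \d{4,5}\b"),
}


def find_banned(text, patterns=BANNED):
    """Return the names of all banned patterns that occur in `text`."""
    return [name for name, rx in patterns.items() if rx.search(text)]


def check_commit_message(message):
    """Return (ok, hits); a hook would exit non-zero when ok is False."""
    hits = find_banned(message)
    return (not hits, hits)


# 192.0.2.10 is from the reserved documentation range, not a real host.
print(check_commit_message("verify on 192.0.2.10 port 2222"))
print(check_commit_message("verify on <runner-ip>"))
```

A catch-all IPv4 pattern will also flag four-part version strings, so a project adopting this would probably want an allowlist or a narrower pattern.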
+ +## Related findings + +| Finding | Relationship | +|---------|--------------| +| F1 — Declared rules without enforcement | Parent family: opsec boundary exists as common sense, but no enforcement gate at write time | +| F43 — SPOF heavy-build host (Cobrust F40) | Same origin: over-reliance on a named private host created both the opsec exposure (F42) and the availability failure (F43) | +| F35 — commit-message scope drift (upstream catalogue) | Adjacent: commit messages carry unintended context; this finding is the opsec variant | diff --git a/plugins/adsd/skills/agent-driven-development/reference/cobrust-f41-f43/F43-spof-heavy-build-host.md b/plugins/adsd/skills/agent-driven-development/reference/cobrust-f41-f43/F43-spof-heavy-build-host.md new file mode 100644 index 0000000..4e92640 --- /dev/null +++ b/plugins/adsd/skills/agent-driven-development/reference/cobrust-f41-f43/F43-spof-heavy-build-host.md @@ -0,0 +1,131 @@ +--- +catalogue_id: F43 +title: "Single-point-of-failure heavy-build host — SSH-gated workstation as sole verification path collapses when host dies" +family: infrastructure-resilience (SPOF sub-form) +severity: P1 (pipeline halt — full sprint blocked on single host availability) +status: ratified_2026-05-20 +empirical_project: Cobrust Phase J wave-2 sprint (2026-05-19/20) +cobrust_local_id: F40 (f40-single-point-of-failure-heavy-build-host.md) +date_ratified: 2026-05-20 +cobrust_sha: 9cb84b5 +resolution: DG abandonment policy — all heavy gates route to GH Actions CI +--- + +# F43 — Single-point-of-failure heavy-build host + +## Pattern + +Depending on a single SSH-reachable workstation for full-workspace cargo verification +creates a single point of failure. When the host becomes unavailable — network reset, +ISP interruption, OS issue, power event — the entire heavy-build pipeline collapses +with no fallback path and no clear error escalation. + +The failure mode has three compounding layers: + +1. 
**Hard dependency**: all heavy-build gates (`cargo test --workspace`, + `cargo build --workspace`) route exclusively through the SSH host. +2. **Silent retry loop**: sub-agents follow their SOP and retry the SSH connection + on failure, consuming tool budget on failed invocations for the duration of the + outage, without escalating "host is unreachable — route to CI." +3. **No fallback policy**: no written rule exists for "if the SSH host is down, do + this instead." The agent cannot route around the failure. + +## Root cause + +This is an **infrastructure resilience gap** compounded by an **F1 Sediment pattern**: + +- The policy "heavy builds run on the SSH host" was written once into the dispatch + SOP (Mode C VERIFY LOOP). It was never given a "what if the host is down?" fallback. +- Sub-agents execute the SOP faithfully, including the retry loop, without the + meta-rule "if retry > N, escalate and route differently." +- The same implicit coupling that created the F42 opsec leak (over-reliance on a + named private host) also created this availability failure. + +## Why reproducibility matters for ADSD + +Per CLAUDE.md §3 dispatch reproducibility, verification must be reproducible by any +contributor. An SSH-credential-gated single host violates this in three ways: + +1. **Credential dependency**: new contributor (human or agent) cannot run heavy-build + gates without SSH credentials to the specific host. +2. **Availability dependency**: the host must be alive, reachable, and fully configured + (current repo clone, correct Rust toolchain, working PATH). +3. **Opacity**: any of the three failing silently stalls a sprint without a clear + diagnostic message distinguishing "code is broken" from "host is down." + +Any of these three failing silently is indistinguishable from a code regression +until diagnosed — wasting agent time on root-cause analysis of an infra issue. 
+ +## Empirical evidence (Cobrust 2026-05-19/20) + +**Incident:** +- SSH endpoint failed throughout an 8+ hour session with: + `kex_exchange_identification: read: Connection reset by peer` +- Sub-agents continued retrying (per Mode C SOP) rather than escalating. +- Tool budget consumed on failed SSH invocations. +- Mac single-crate per-crate verify (`cargo test -p <crate>`) was sufficient to + unblock the session but was ad-hoc — no policy existed for this fallback. +- Host degradation went unflagged for the full session. + +**Archaeology SHA:** `9cb84b5` — the commit where the DG self-hosted-runner +abandonment policy was explicitly documented ("Mac single-crate + CI authoritative"). +Related: `d012df9` renamed `workstation-gates.yml` → `self-hosted-gates.yml` in the +same cleanup session. + +**Quantified cost:** 8+ hours of sprint time during which no heavy full-workspace +verification was possible; multiple sub-agent dispatches consumed tool budget on +failed SSH retries before escalation. + +## Resolution path (adopted 2026-05-20, Cobrust) + +**Adopted policy — DG abandonment / GH Actions primary:** + +- ALL heavy full-workspace cargo (`cargo test --workspace`, `cargo build --workspace`) + routes to GH Actions CI (ubuntu-latest + macos-latest matrix). +- Mac local = single-crate quick-feedback only (`cargo test -p <crate>`). +- No SSH credentials in dispatch templates. +- No `ssh -p <port> <user>@<host>` patterns in SOP blocks. + +GH Actions is the authoritative 2-OS matrix verifier. It is reproducible, +credential-free, and available to all contributors. + +**Dispatch template change:** replace Mode C VERIFY LOOP SSH block with +"push branch → GH Actions CI passes → merge." + +## Detection rule (process audit) + +Signs that a project has drifted into SPOF-build territory: + +1. Any SOP template contains a literal `ssh -p <port> <user>@<host>` command as a + required verification step. +2. No documented fallback for "what if that host is unreachable." +3. 
CI definition (`.github/workflows/`) references a `self-hosted` runner with no + redundancy (single runner label, no runner pool). +4. Sub-agents report "SSH connection refused / reset" but continue retrying rather + than escalating within 2-3 attempts. + +Remediation audit question: "Can a brand-new contributor with only a GitHub account +run every required verification gate?" If no, identify and route around the gap. + +## General ADSD mitigation + +For any ADSD project that uses a self-hosted runner or SSH-gated build host: + +1. **Define the fallback policy in writing** before the first sprint that uses the host. + Include: escalation threshold (e.g., "3 consecutive SSH failures"), fallback path + (e.g., "push to GH Actions CI"), and clear ownership. +2. **Use cloud CI as the authoritative gate**. Self-hosted runners are opt-in acceleration, + never the sole verification path. +3. **No host-specific identifiers in dispatch templates**. SOPs must be portable: any + runner with the right toolchain should satisfy the gate. +4. **Sub-agent retry cap**: dispatch prompts should include "if this SSH command fails + N consecutive times, stop retrying and report 'verification offloaded to CI'." 
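The retry cap in item 4 can be sketched as follows. This is a minimal illustration, not the project's dispatch code: `attempt` stands in for whatever runs the SSH-gated verification, the default of 3 follows the example threshold in item 1, and the function name and the "verified on host" string are assumptions.

```python
from typing import Callable


def run_with_retry_cap(attempt: Callable[[], bool], max_failures: int = 3) -> str:
    """Call `attempt` until it succeeds or has failed `max_failures` times in a row."""
    failures = 0
    while failures < max_failures:
        if attempt():
            return "verified on host"
        failures += 1
    # Cap reached: stop retrying and hand the gate to cloud CI instead.
    return "verification offloaded to CI"
```

The cap turns an open-ended retry loop into a bounded one with an explicit escalation message.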
+ +## Related findings + +| Finding | Relationship | +|---------|--------------| +| F42 — Device-name leakage (Cobrust F39) | Co-origin: the same named private host created both the opsec exposure (F42) and the availability failure (F43) | +| F37 (Cobrust) — Silent-rot-on-accepted-debt | The host degradation was not escalated; sub-agents silently retried | +| F29 — Cross-platform runner-pool dependency (upstream catalogue) | Adjacent: F29 covers runner-pool failure at the CI-matrix level; F43 covers SSH-gated single host at the sprint-verification level | +| F1 — Declared rules without enforcement | Parent family: "have a fallback" is common sense; no enforcement gate at SOP authorship time | diff --git a/plugins/adsd/skills/agent-driven-development/reference/cobrust-f41-f43/README.md b/plugins/adsd/skills/agent-driven-development/reference/cobrust-f41-f43/README.md new file mode 100644 index 0000000..56fa6e4 --- /dev/null +++ b/plugins/adsd/skills/agent-driven-development/reference/cobrust-f41-f43/README.md @@ -0,0 +1,69 @@ +--- +batch_id: cobrust-f41-f43 +title: "F41-F43: Cobrust empirical corroboration batch (source-surface leakage + device-name redaction + SPOF build host)" +date: 2026-05-21 +cobrust_baseline: Phase J wave-2 FULL CLOSED (main HEAD 53b5ed2 at time of filing) +prior_batch: cobrust-f31-f39 (PR #1, open) +--- + +# Cobrust F41-F43 batch — README + +Three new failure-mode findings empirically corroborated by Cobrust Phase G/J +sprints (2026-05-19/20), submitted as a follow-up to PR #1 (F31-F40 batch). 
+ +## Slot mapping + +| This batch | Cobrust local ID | Ratified SHA | Incident date | +|------------|-----------------|--------------|---------------| +| **F41** | F38 — source-surface leakage of codegen primitive | `46c0946` | 2026-05-19/20 | +| **F42** | F39 — device-name leakage in commits + repo files | `d012df9` | 2026-05-19 | +| **F43** | F40 — single-point-of-failure heavy-build host | `9cb84b5` | 2026-05-19/20 | + +## Finding summaries + +### F41 — Source-surface leakage of codegen-internal primitive + +A codegen-internal monomorphic name (`print_int`, `print_str`, ...) leaks into the +source-face PRELUDE during a demo sprint and fossilizes as examples accumulate usage +against it. This directly violates the LLM-first training-data-overlap rule: LLMs +trained on Python/Rust write `print(x)`, not `print_int(x)`. Cleanup required 333 LOC +across 4 commits. Resolution: polymorphic dispatch + CI lint on type-suffix PRELUDE names. + +**Key metric:** 133 `.cb` call sites + ~200 Rust inline strings refactored. +**Cobrust SHA:** `46c0946` (ADR-0064 ratified). + +### F42 — Device-identifying names leaked into git history via sub-agent memory read-through + +Sub-agents treating operator memory references (SSH host, IP, port, GPU SKU) as +publishable grounding detail embedded opsec-sensitive strings into 31 commit messages +and 18 repo files before a pre-publish audit caught them. Required `git filter-repo` +history rewrite + force-push. Resolution: going-forward opsec boundary rule + CI grep gate. + +**Key metric:** 31 commit messages + 18 repo files rewritten. +**Cobrust SHA:** `d012df9` (Option-A privacy rewrite). + +### F43 — Single-point-of-failure heavy-build host + +Routing all full-workspace cargo verification through a single SSH-gated workstation +created a pipeline-halting SPOF: when the host died, sub-agents retried silently for 8+ +hours consuming tool budget, with no fallback policy. 
Resolution: DG abandonment, with all +heavy gates routed to GH Actions CI; Mac local = single-crate quick-feedback only. + +**Key metric:** 8+ hour sprint blocked on unavailable SSH host. +**Cobrust SHA:** `9cb84b5` (DG abandonment, same session as F42 remediation). + +## Files in this batch + +``` +cobrust-f41-f43/ + README.md (this file) + F41-source-surface-leakage-codegen-primitive.md + F42-device-name-leakage-public-artifacts.md + F43-spof-heavy-build-host.md +``` + +## Relationship to PR #1 (cobrust-f31-f39) + +This batch is independent of PR #1. It can be merged before or after PR #1. +The F41-F43 slot numbers were chosen to be free given that PR #1 claims F31-F40 +(using upstream F38/F39/F40 for different patterns than the Cobrust local ones). diff --git a/plugins/adsd/skills/agent-driven-development/reference/context-window-strategy.md b/plugins/adsd/skills/agent-driven-development/reference/context-window-strategy.md new file mode 100644 index 0000000..396df30 --- /dev/null +++ b/plugins/adsd/skills/agent-driven-development/reference/context-window-strategy.md @@ -0,0 +1,200 @@ +--- +name: Context-window strategy for long agent sessions +description: Positive practices for organizing a multi-hour / multi-week agent session so the context stays useful across compaction events. Complements F16 (post-compaction identity drift) by codifying what should be in memory, what should be re-derivable, what's transient. +type: reference +version: 1.0.0 +date: 2026-05-12 +status: active +relates_to: [skill:SKILL.md §"Snapshot discipline", reference:failure-modes-catalogue.md F16, reference:cross-session-memory-architecture.md] +--- + +# Context-window strategy + +> Long sessions degrade. Compaction is automatic but lossy. The agent that survives the compaction is the one whose **identity, current state, and operative rules are in the persistent layer**, not the transcript. This reference codifies what goes where. 
+ +## When this applies + +- Any session expected to run > 50K tokens or > 4 hours wall-clock +- Any agent role that should survive context compaction (CTO, P9 tech lead, review-claude) +- Any project where memory files / snapshot files / handoff docs exist + +If you're a one-shot agent (< 5 tool calls, single deliverable), this reference is overkill — just execute and return. + +## Three-tier model + +Adopt three explicit context tiers. Every piece of information lives in exactly one tier: + +``` +┌────────────────────────────────────────────────────────────────────────┐ +│ TIER 1: Persistent (auto-memory, repo, version control) │ +│ Survives compaction + session restart + machine change │ +│ - Identity preamble (you are P9 / CTO / review-claude) │ +│ - Operative rules (D-matrix, dev/test pair, F1-Fxx awareness) │ +│ - Project snapshot (HEAD, ADR roster, finding ledger) │ +│ - Cross-references (memory file → other memory files) │ +├────────────────────────────────────────────────────────────────────────┤ +│ TIER 2: Session-scoped (this conversation's context) │ +│ Survives within session but not across sessions │ +│ - Current sprint's working state (which Tx in progress) │ +│ - Files read this session (don't re-read what you already have) │ +│ - Decisions made this turn (will go to ADR or memory at end of sprint) │ +├────────────────────────────────────────────────────────────────────────┤ +│ TIER 3: Transient (one tool call at a time) │ +│ Doesn't need to persist; if needed again, re-fetch │ +│ - Bash output of intermediate verification │ +│ - Read-tool output for files not central to decision │ +│ - Search results, grep outputs │ +└────────────────────────────────────────────────────────────────────────┘ +``` + +The discipline: **Tier 1 must be sufficient to bootstrap a fresh session.** A new agent reading only Tier 1 + the current user prompt should be able to make a correct decision about what to do next. 
+ +## Anthropic-pattern adoption + +### "If you can't answer 'what's my role this session?' in 1 sentence: you've drifted" + +From Claude Code docs on subagents: identity must be re-asserted at compaction boundaries. ADSD encodes this in `feedback_p10_post_compaction_identity_recovery.md` (F16 mitigation). + +Concrete check: when you receive a message and the prior turn was > 30 turns ago, ask yourself the three questions in `feedback_p10_post_compaction_identity_recovery.md §"Self-check trigger"` before acting. + +### Memory file is read on every session start (auto-load) + +Anthropic Claude Code auto-loads `MEMORY.md` index every session. Use this: + +- `MEMORY.md` = the table of contents (one-line per memory file with hook) +- Each memory file = self-contained chapter +- Read order matters — put the most critical file at line 1 (e.g. identity recovery) + +ADSD example: Cobrust's `MEMORY.md` has 14 entries, identity-recovery first, snapshot second, runbook third. New session in Cobrust dir reads index, knows where to look. + +### Skill description is the trigger + +Anthropic skills auto-activate when description keywords match user prompt. So: + +- `description` field = precise + keyword-rich + scoped (NOT generic) +- A skill named "agent dispatch" with description "general agent stuff" won't trigger usefully +- A skill named "agent dispatch" with description "multi-agent dispatch planning, P9 tech lead role, dev/test pair pattern" triggers on the right turns + +Keep skill descriptions tight (~30 words), keyword-dense. + +## OpenAI-pattern adoption + +### Conversation summary turn (Anthropic also uses this) + +When approaching context limit, take a deliberate "summary turn": + +- List of decisions made this session +- Files modified this session +- Open questions +- Next action + +This synthetic message becomes the bootstrap for compaction. Better than letting the system auto-compact a random middle chunk. 
+ +Mechanism: just write a paragraph or YAML block titled "## Session checkpoint <timestamp>" with the structure above. + +### Cache-friendly ordering (cost optimization) + +OpenAI + Anthropic both cache prefix tokens. Strategy: + +- Put unchanging context (system prompt, project preamble, tool definitions) FIRST +- Put changing context (recent messages, current task) LAST +- Don't shuffle the order; let the cache hit + +For ADSD: the auto-loaded memory + skill content sits at session start → cached. New user prompts append → small delta. This is already the right shape. + +### Don't re-read files you've already read + +OpenAI guidance: assume tool result outputs stay in context for the rest of the session. Don't `Read` the same file twice unless you wrote to it. + +ADSD anti-pattern: re-reading SKILL.md or the constitution every turn out of nervous habit wastes context. Trust the agent's memory of recent reads. + +## ADSD integration with existing patterns + +### Snapshot.md as Tier 1 checkpoint + +ADSD's `project_state_snapshot.md` is the canonical Tier 1 checkpoint. It contains: + +- HEAD SHA +- ADR roster +- Finding ledger +- Phase F milestones +- Binary verification claim + +A fresh session reads snapshot.md and bootstraps situational awareness in ~200 lines. Don't replicate this in transient context. + +### Handoff cover-letter as Tier 1 cross-session + +When ending a sprint, write a handoff cover-letter (template in `templates/handoff-cover-letter.md`) that becomes the bootstrap for the receiving session. Don't rely on transcript transfer. + +### F16 mitigation: identity recovery preamble + +Identity is Tier 1. The skill description triggers; the memory file confirms; the operative rules guide. If identity drifts post-compaction → re-read the identity recovery memory. + +### Long-session bookkeeping rhythm + +Every ~30 tool calls or hourly (whichever comes first), explicitly: + +1. Update snapshot.md with latest HEAD + new ADRs/findings +2. 
Commit any in-progress work (don't let it rot in the working tree) +3. Write a session-checkpoint paragraph (per OpenAI pattern above) +4. Run snapshot-lint to verify the Tier 1 invariants + +This rhythm prevents the "20-tool-call-no-checkpoint" cliff where compaction loses critical state. + +## Concrete templates + +### Session-checkpoint format (insert as message in long sessions) + +```yaml +## Session checkpoint <ISO timestamp> + +decisions_this_session: + - <decision 1, with ADR or finding link if applicable> + - <decision 2> + +files_modified: + - <path>: <one-line change summary> + +open_questions: + - <q1> + - <q2> + +next_action: + who: <agent role> + what: <one-sentence action> + blocking_on: <user sign-off | dependency | timing> +``` + +### Bootstrap-from-cold prompt (for fresh session resuming work) + +When a fresh Claude Code session starts on an in-flight project: + +``` +First action (mandatory before any tool): +1. Read MEMORY.md (table of contents) +2. Read project_state_snapshot.md (HEAD + roster + ledger) +3. Read cto_operations_runbook.md (SOPs) +4. Read feedback_subagent_model_tier.md (D-matrix) +5. Then look at user's prompt + decide what to do +``` + +If MEMORY.md doesn't exist in the project, refuse to act until the user clarifies role / project. 
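The bootstrap order and the refusal rule can be sketched as a pre-flight check. This is a hedged illustration only: it assumes all four files sit in one memory directory, which the text does not state, and the function name `bootstrap_reads` is hypothetical.

```python
from pathlib import Path

# Read order taken from the bootstrap-from-cold prompt above.
BOOTSTRAP_READS = (
    "MEMORY.md",
    "project_state_snapshot.md",
    "cto_operations_runbook.md",
    "feedback_subagent_model_tier.md",
)


def bootstrap_reads(memory_dir: Path) -> list[Path]:
    """Return the files to read, in order; refuse if the MEMORY.md index is missing."""
    if not (memory_dir / "MEMORY.md").is_file():
        raise FileNotFoundError("MEMORY.md missing: ask the user to clarify role / project")
    # Assumption: files other than the index are optional and skipped when absent.
    return [memory_dir / name for name in BOOTSTRAP_READS if (memory_dir / name).is_file()]
```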
+ +## Pitfalls + +| Pitfall | Symptom | Recovery | +|---|---|---| +| Re-reading the same file 10× per session | Tool call waste, slow turns | Track read files in working memory; trust prior reads | +| Putting transient bash output in memory | Memory file grows unbounded | Memory is for stable facts; transient goes to scratch | +| Identity in skill description only (F16) | Post-compaction drift to executor mode | Mirror identity preamble in auto-memory; F16 mitigation | +| Tier 1 file never updated as project evolves | Snapshot becomes lying narrative | Pre-commit hook runs snapshot-lint | +| Cache-busting by shuffling system prompt order | Token cost inflates | Lock system prompt order; mutations go to user-turn | + +## Cross-references + +- `reference/cross-session-memory-architecture.md` — what goes in memory vs ADR vs finding vs snapshot +- `reference/failure-modes-catalogue.md` F16 — post-compaction identity drift (the negative form) +- `templates/snapshot-template.md` — Tier 1 bootstrap doc +- `templates/handoff-cover-letter.md` — cross-session handoff +- Anthropic Claude Code docs: subagents, MEMORY.md auto-load, plan mode +- OpenAI: cache optimization guidance + summary turn pattern diff --git a/plugins/adsd/skills/agent-driven-development/reference/cost-monitoring-discipline.md b/plugins/adsd/skills/agent-driven-development/reference/cost-monitoring-discipline.md new file mode 100644 index 0000000..3dfecde --- /dev/null +++ b/plugins/adsd/skills/agent-driven-development/reference/cost-monitoring-discipline.md @@ -0,0 +1,247 @@ +--- +name: Cost monitoring + budget gate discipline +description: Practical patterns for tracking LLM cost across ADSD sprints, setting budget gates per sprint and per release, and recognizing when cost is signaling a deeper problem (loop / drift / over-spawn). 
+type: reference +version: 1.0.0 +date: 2026-05-12 +status: active +relates_to: [skill:SKILL.md §"Wave + Tx pattern", reference:failure-modes-catalogue.md F12] +--- + +# Cost monitoring discipline + +> "Token cost is not a constraint" (per Cobrust constitution) does NOT mean "ignore cost." It means cost is not the primary correctness gate. Cost is still a **signal**: a sprint costing 10× the expected budget is telling you something — either the work is harder than estimated, or an agent is looping, or someone spawned 10× more sub-agents than needed. + +## When this applies + +- Any multi-sprint project running parallel agents +- Any sprint exceeding ~$5 in LLM cost +- Any sprint where consensus mode or stress sweeps fire (10×+ multipliers) +- Any release where you want to defend "we shipped at $X cost" + +If you're a one-shot agent with one tool call, this reference is overkill. + +## Three budget tiers + +``` +┌─────────────────────────────────────────────────────────────────────────┐ +│ TIER-A: Per-sprint budget │ +│ Set BEFORE dispatch. Stop and escalate if exceeded. │ +│ Typical: $1-$5 for sonnet sprint, $5-$15 for opus sprint │ +├─────────────────────────────────────────────────────────────────────────┤ +│ TIER-B: Per-release budget │ +│ Sum across all sprints leading to a tag. │ +│ Typical: $10-$50 per v0.X release; $50-$200 per v1.X major │ +├─────────────────────────────────────────────────────────────────────────┤ +│ TIER-C: Per-project lifetime budget │ +│ Track for sanity / ROI conversation with funding source. 
│ +│ Typical: $100-$1000 for a research-grade project; $1K-$10K for products │ +└─────────────────────────────────────────────────────────────────────────┘ +``` + +Each tier escalates differently: + +- TIER-A breach → STOP, report, ask user before continuing +- TIER-B breach → publish a finding documenting why; reassess release scope +- TIER-C breach → strategic review (ROI / pivot question) + +## Cost ledger (ADSD pattern) + +Every project running parallel agents must maintain a per-dispatch ledger. ADSD's recommended schema (codified in `cobrust-llm-router` if using a custom router, or in `.adsd/ledger.jsonl` as append-only log): + +``` +{ + "timestamp_utc": "2026-05-12T03:45:00Z", + "sprint_id": "lc100-stress-sweep", + "agent_role": "P7-sonnet-test-B1", + "session_id": "abc12345", + "provider": "anthropic", + "model": "claude-sonnet-4-6", + "prompt_tokens": 12345, + "completion_tokens": 6789, + "total_tokens": 19134, + "cost_micro_usd": 142500, + "cache_hit": false, + "task_tag": "test_corpus_generation", + "outcome": "ok" +} +``` + +Append on every API call. Materialize SQLite index for fast queries. 
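A minimal sketch of "append on every API call, materialize a SQLite index", using only the Python standard library. The column list mirrors the schema above. The helper names are hypothetical, and rebuilding the table from scratch on each materialization is an assumption made for simplicity, not a stated ADSD requirement.

```python
import json
import sqlite3
from pathlib import Path

LEDGER_COLUMNS = (
    "timestamp_utc", "sprint_id", "agent_role", "session_id", "provider", "model",
    "prompt_tokens", "completion_tokens", "total_tokens", "cost_micro_usd",
    "cache_hit", "task_tag", "outcome",
)


def append_entry(ledger_path: Path, entry: dict) -> None:
    """Append one ledger record as a single JSON line (append-only log)."""
    with ledger_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry, sort_keys=True) + "\n")


def materialize_index(ledger_path: Path, db_path: str) -> sqlite3.Connection:
    """Rebuild a `ledger` table from the JSONL log so it can be queried with SQL."""
    conn = sqlite3.connect(db_path)
    conn.execute("DROP TABLE IF EXISTS ledger")
    conn.execute(f"CREATE TABLE ledger ({', '.join(LEDGER_COLUMNS)})")
    with ledger_path.open(encoding="utf-8") as fh:
        # Missing keys become NULL; unknown keys are ignored.
        rows = [
            tuple(json.loads(line).get(col) for col in LEDGER_COLUMNS)
            for line in fh
            if line.strip()
        ]
    placeholders = ", ".join("?" for _ in LEDGER_COLUMNS)
    conn.executemany(f"INSERT INTO ledger VALUES ({placeholders})", rows)
    conn.commit()
    return conn
```

Keeping the JSONL file as the source of truth means the SQLite file is disposable and can be regenerated at any time.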
+ +### Useful ledger queries + +```sql +-- Cost per sprint +SELECT sprint_id, sum(cost_micro_usd)/1e6 as usd +FROM ledger +GROUP BY sprint_id +ORDER BY usd DESC; + +-- Cost per model +SELECT model, count(*) as calls, sum(total_tokens) as tokens, sum(cost_micro_usd)/1e6 as usd +FROM ledger +WHERE timestamp_utc > date('now', '-7 days') +GROUP BY model; + +-- Cache hit rate (savings) +SELECT cache_hit, count(*) as calls, sum(total_tokens) as tokens +FROM ledger +WHERE timestamp_utc > date('now', '-7 days') +GROUP BY cache_hit; +``` + +## Pre-sprint budget estimation + +Before dispatching a sprint, write the budget estimate in the dispatch prompt itself: + +``` +BUDGET ESTIMATE (must include in P9 dispatch): +- Phase 1 (P9 opus ADR drafting): ~30K prompt + 5K completion = ~$2 +- Phase 2 (4 × P7 sonnet pairs, ~25 problems each): + - Per pair: 5 reads × 10K + 10 writes × 5K = ~$1 + - 4 pairs × 2 agents = ~$8 +- Phase 3 (P9 opus triage): ~10K + 5K = ~$1 +- Phase 4 (decision report): ~5K = ~$0.5 + +TOTAL ESTIMATE: $11.50 ± 30% = $8-$15 range +TIER-A BUDGET: $20 (~30% headroom) + +Escalate at $15 actual if Phase 2 still in progress. +``` + +Estimation accuracy improves with practice. Track estimate vs actual across 10+ sprints to calibrate. + +## In-flight monitoring + +For long-running sprints (> 4 hr wall-clock or > $5 budget), check the ledger every ~1 hr: + +```bash +# Quick health check +sqlite3 .adsd/ledger.db " + SELECT + sprint_id, + count(*) as calls, + sum(total_tokens) as tokens, + sum(cost_micro_usd)/1e6 as usd + FROM ledger + WHERE sprint_id = '<current-sprint>' +" +``` + +If actual ≥ 70% of TIER-A budget and the sprint is < 50% complete → escalate early. Don't wait for the breach. + +## Cost as a signal + +Cost is not just expense — it's a diagnostic indicator: + +### High cost without progress = loop + +If a sprint is at $10 with 0 new commits, the agent is likely in a loop. 
Symptoms: + +- Same files re-read 5+ times in ledger +- Same tool sequence repeating +- No new test cases / no new ADR sections + +Recovery: kill the sprint, audit the prompt for ambiguity, re-dispatch with sharper scope. + +### High cache miss rate = context shuffling + +If cache hit rate < 30% on Anthropic/OpenAI, the prompt structure is changing per-call. Likely cause: system prompt or memory file being mutated mid-sprint. + +Recovery: lock memory updates to inter-sprint boundaries. Don't edit memory while a sprint is running. + +### Cost spike at specific phase = under-estimated scope + +If Phase 2 of a 4-phase sprint costs 3× the budget for that phase, the work was scoped wrong. The next dispatch should split Phase 2 into 2a + 2b. + +This is a productive finding — write it up as a finding entry under `docs/agent/findings/sprint-<id>-cost-overrun.md`. + +## Anthropic-pattern adoption + +### Prompt caching reduces cost dramatically + +Anthropic caches stable prefixes (system prompt, project preamble) at ~10% of full cost. + +For ADSD: structure agent prompts so: + +1. System role + project preamble (cached) — top +2. Required-reads + RFC fragments (cached) — middle +3. User-turn / sprint-specific context — bottom + +Don't shuffle the order — that breaks the cache. ADSD memory files + dispatch-prompt templates already shape this. + +### Model selection by D-rating + +Anthropic explicitly recommends "use the cheapest model that passes your eval." ADSD's D0-D5 matrix is the practical implementation: + +- D0/D1 sonnet: ~5-10× cheaper than opus, generally sufficient +- D2 sonnet (with eval pair): fine if test corpus catches edge cases +- D3+ opus: pay the premium when the task requires it + +Don't default to opus for every task — that's overspending. Don't default to sonnet for D3+ tasks — that's underspending leading to F20. 
+ +## OpenAI-pattern adoption + +### Structured outputs reduce iteration + +OpenAI's structured-outputs (JSON schema) feature reduces "re-prompt for fix" cycles. Each correct-format reply saves 1× the call cost. + +ADSD shape: P7/P9 completion reports include YAML block (per prompt-engineering-patterns PT4). Saves ~20-30% across a typical multi-call sprint vs free-text reports. + +### Streaming saves wall-clock but not token cost + +OpenAI streaming saves user-perceived latency but not token cost. ADSD should use streaming for UX where it helps (release-readiness audit feedback to user) but understand it doesn't reduce $. + +## ADSD integration with existing patterns + +### Dispatch prompt budget block + +Add to `templates/dispatch-prompt-p9.md` § just below DIFFICULTY-RATING: + +``` +BUDGET ESTIMATE (must include): +- Phase-by-phase cost estimate +- TIER-A budget with ~30% headroom +- Early escalation threshold + +If actual cost exceeds estimate by 50% mid-sprint, STOP and report. +``` + +### Release-readiness ledger snapshot + +Include in `[P7-RELEASE-READY-VERDICT]`: + +``` +Cost snapshot at release tag: +- This sprint: $X.YY +- Prior sprints to this tag: $Z.WW +- Total release-bearing cost: $A.BB +``` + +Defensible "we shipped at $X" claim. + +### Cost as F-pattern detector + +The F-pattern catalogue should include cost-anomaly as a diagnostic. Add to dispatch: + +> If actual cost > 2× estimate, flag as potential F-pattern occurrence (likely F13 plan-vs-execute, F17 self-report, or unidentified). Findings entry mandatory. 
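The thresholds in this reference (TIER-A breach, 70% of budget before the halfway point, 50% over estimate, 2× estimate) can be combined into one check. This is a sketch under stated assumptions: the function name, parameter names, and signal strings are illustrative, and treating the rules as independent, stackable signals is an interpretation rather than something the text mandates.

```python
def cost_signals(actual_usd: float, estimate_usd: float,
                 tier_a_budget_usd: float, fraction_complete: float) -> list[str]:
    """Return the escalation signals that apply to a sprint's current spend."""
    signals = []
    if actual_usd > tier_a_budget_usd:
        signals.append("tier-a-breach: stop, report, ask user")
    elif actual_usd >= 0.7 * tier_a_budget_usd and fraction_complete < 0.5:
        signals.append("early-escalation: 70% of budget spent before halfway")
    if actual_usd > 1.5 * estimate_usd:
        signals.append("overrun-50pct: stop and report")
    if actual_usd > 2 * estimate_usd:
        signals.append("f-pattern-flag: findings entry mandatory")
    return signals
```

For example, a sprint estimated at $11.50 with a $20 TIER-A budget that has spent $15 at 40% completion would raise only the early-escalation signal.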
+ +## Pitfalls + +| Pitfall | Symptom | Fix | +|---|---|---| +| No estimate, no monitoring | Bill shock at month end | Pre-sprint estimate + ledger | +| Ignore cost as "not a constraint" | Drift to over-spawning sub-agents | Cost as signal, not constraint | +| Cache miss not measured | Cost stays high after prompt-engineering optimization | Track cache hit rate per sprint | +| Over-using opus | Sonnet would suffice; 5-10× overspend | D-matrix rigor (PT7 in prompt patterns) | +| Cost ledger stale | Decisions made on outdated data | Append on every API call, not batched | + +## Cross-references + +- `templates/dispatch-prompt-p9.md` — budget estimate block (add per this reference) +- `reference/prompt-engineering-patterns.md` PT7 — D-rating drives cost +- `reference/evals-first-development.md` — eval delta lets you compare cost across optimizations +- `reference/failure-modes-catalogue.md` — F12 (model output starvation), cost signal for diagnosis +- Anthropic prompt caching docs: https://platform.claude.com/docs/en/prompt-caching +- OpenAI structured outputs: https://platform.openai.com/docs/guides/structured-outputs diff --git a/plugins/adsd/skills/agent-driven-development/reference/cross-session-memory-architecture.md b/plugins/adsd/skills/agent-driven-development/reference/cross-session-memory-architecture.md new file mode 100644 index 0000000..a7d2aef --- /dev/null +++ b/plugins/adsd/skills/agent-driven-development/reference/cross-session-memory-architecture.md @@ -0,0 +1,227 @@ +--- +name: Cross-session memory architecture +description: ADSD's distinction between auto-memory, project artifacts, scratch context, and ephemeral state. Codifies what survives session boundaries and how to design new memory entries that don't decay. 
+type: reference +version: 1.0.0 +date: 2026-05-12 +status: active +relates_to: [skill:SKILL.md §"Snapshot discipline", reference:context-window-strategy.md, reference:failure-modes-catalogue.md F1 family + F16 + F17] +--- + +# Cross-session memory architecture + +> ADSD's hard-won memory discipline: not everything that lasts deserves to last; not everything ephemeral should be ephemeral. Four storage layers, each with a different persistence contract. + +## When this applies + +- You're about to write something down and don't know where it goes +- You're designing a new memory file or template +- You're auditing why a piece of project knowledge keeps getting re-derived + +If you're producing one commit and done, just commit it. This reference is for projects with state. + +## Four storage layers + +``` +┌──────────────────────────────────────────────────────────────────────────┐ +│ LAYER 1: Auto-memory (~/.claude/projects/<proj>/memory/) │ +│ Survives: all sessions, all hosts (if synced) │ +│ Auto-loaded at session start via MEMORY.md index │ +│ Contains: identity preamble, operative rules, cross-session SOPs │ +│ Mutation policy: edit in-place; index entries are one-line hooks │ +├──────────────────────────────────────────────────────────────────────────┤ +│ LAYER 2: Project artifacts (repo's docs/ + ADR + findings + snapshot) │ +│ Survives: as long as the repo does │ +│ Contains: decisions (ADR), negative results (finding), state (snapshot) │ +│ Mutation policy: ADR immutable once accepted; finding append-only; │ +│ snapshot updated atomically with HEAD │ +├──────────────────────────────────────────────────────────────────────────┤ +│ LAYER 3: Session scratch (this conversation's working notes) │ +│ Survives: within session only │ +│ Contains: in-progress reasoning, intermediate computations, working set │ +│ Mutation policy: free-form; nothing committed unless promoted to L1/L2 │ +├──────────────────────────────────────────────────────────────────────────┤ +│ 
LAYER 4: Ephemeral (single tool call output) │ +│ Survives: only as long as tool result is in context window │ +│ Contains: bash stdout, file read contents, grep results │ +│ Mutation policy: re-fetch if needed; don't memorialize │ +└──────────────────────────────────────────────────────────────────────────┘ +``` + +## Decision tree: where does this go? + +``` +Is it identity, role, or operative rule? ─yes─→ Layer 1 (auto-memory) + - new file under memory/ + - one-line hook in MEMORY.md + - frontmatter: type=feedback + │ + no + ▼ +Is it a binding decision affecting ≥2 files? ─yes─→ Layer 2 (ADR) + docs/agent/adr/NNNN-*.md + │ + no + ▼ +Is it a negative result / surprise / failure? ─yes─→ Layer 2 (finding) + docs/agent/findings/*.md + │ + no + ▼ +Is it a state fact (HEAD, version, count)? ─yes─→ Layer 2 (snapshot.md) + project_state_snapshot.md + │ + no + ▼ +Is it in-progress reasoning this sprint? ─yes─→ Layer 3 (session scratch — comment, message) + │ + no + ▼ +Is it a one-time output that can be re-fetched? ─yes─→ Layer 4 (ephemeral) + no memorialization needed +``` + +When in doubt, **default to Layer 3 scratch**. Promotion to L1/L2 happens deliberately at sprint end, not in the moment. + +## Layer 1 (auto-memory) deep dive + +### File naming convention + +- `feedback_<topic>.md` — operative rules / SOPs / user-mandated guidance (e.g. `feedback_subagent_model_tier.md`, `feedback_p10_post_compaction_identity_recovery.md`) +- `reference_<topic>.md` — pointers to external systems (e.g. 
`reference_proxy_config.md`) +- `project_state_snapshot.md` — current canonical state (single file, mutated atomically) +- `MEMORY.md` — index (one-line hooks pointing to files above) + +### Frontmatter contract + +```yaml +--- +name: <human-readable title> +description: <one-line trigger for the agent reading the index> +type: feedback | reference | snapshot +originSessionId: <session UUID, optional> +last_verified_date: <ISO date, optional but recommended> +related_memory: [<other_file>.md, ...] +--- +``` + +### MEMORY.md index format + +``` +- [<file's human title>](file.md) — <one-line hook describing when to read this> +``` + +Top entries are read-first. Place identity-recovery / role-clarifying files at the top. + +### Mutation discipline + +Auto-memory mutates in-place (no git history). Therefore: + +- Date your edits — `## Extension 2026-05-12: ...` rather than overwriting silently +- Don't delete past sections; mark them `## Deprecated 2026-05-12: was X, now Y because Z` +- One-line description in MEMORY.md must stay accurate; update it when content shifts + +### When to add a new memory file vs extend an existing one + +Add new file when: +- New topic area not covered by existing file +- File would grow > 200 lines +- The rule applies to a different agent role than existing file's audience + +Extend existing file when: +- New sub-rule of an existing rule +- Refinement / amendment to existing operative practice +- The "## Extension <date>:" pattern keeps mutations auditable + +## Layer 2 (project artifacts) deep dive + +### ADR vs Finding distinction + +ADRs are **forward-looking decisions** ("we will do X going forward"). Findings are **backward-looking observations** ("we hit Y; here's what we learned"). They're not interchangeable. + +A failure observation (finding) → drives a future decision (ADR). The finding doesn't bind anyone; the ADR does. Don't put binding rules in findings; don't put incident history in ADRs. 
+ +### Snapshot.md responsibility + +Single source of truth for current project state: + +- Current HEAD SHA (auto-updated post-merge) +- ADR roster table (each accepted ADR listed) +- Finding ledger (each finding with status: open / closed) +- Phase / milestone progress +- Binary verification claim (e.g. "cobrust build hello.cb passes at HEAD") + +Snapshot has its own enforcement: `scripts/snapshot-lint.sh` validates that the invariants hold. Without snapshot-lint, snapshot drifts (F1.1: declared invariant without enforcement). + +## Layer 3 vs Layer 4 boundary + +Most agent failures come from **misplacing Layer 3 facts into Layer 4 (forgetting useful state) or Layer 4 facts into Layer 3 (cluttering working memory)**. Examples: + +- ❌ Re-reading the same source file 5 times (Layer 3 working set treated as Layer 4: already had it, should re-use) +- ❌ Writing intermediate bash output to a memory file (Layer 4 promoted to Layer 1: bloat) +- ✅ Keeping a running list of "files I've read this turn" in scratch (Layer 3 working set) +- ✅ Discarding `grep -c` count after using it (Layer 4 ephemeral) + +## Anthropic-pattern adoption + +### MEMORY.md auto-load contract + +Anthropic Claude Code auto-loads MEMORY.md at session start. ADSD uses this: + +- Memory files are the agent's "world model" at boot time +- Index hooks must be precise: the agent decides which files to read based on the hooks +- A line-1 entry like `[Identity recovery SOP] — read if post-compaction or fresh session` ensures the right file gets opened + +### Anti-pattern: stale memory + +Anthropic warns: memory is point-in-time. Don't trust years-old memory entries blindly. 
ADSD codifies this: + +- Frontmatter `last_verified_date` field +- Pre-action verification when memory makes a claim about file paths or current state +- Stale memory entries get marked deprecated, not silently re-relied-upon + +## OpenAI-pattern adoption + +### Vector store + retrieval (NOT YET in ADSD) + +OpenAI's Assistants API does retrieval over uploaded files. ADSD currently relies on the agent's context window + memory; retrieval not adopted. + +For ADSD v1.3.0+: consider retrieval if memory + repo content together exceed context budget. Until then, the four-tier model is sufficient. + +### Threads (session scoping) + +OpenAI threads are persistent multi-session conversations. ADSD's analog: per-project memory folder. Same idea — bound the persistence to the project, not the global model. + +## ADSD integration with existing patterns + +### Snapshot-lint enforcement loop + +Layer 2 snapshot.md has invariants (HEAD freshness, ADR roster completeness). `scripts/snapshot-lint.sh` runs these as Inv 1-4. Pre-commit hook fires snapshot-lint, blocking commits that violate invariants. This is the F1.1 closure mechanism. + +### CTO operations runbook is the Layer 1 cookbook + +`cto_operations_runbook.md` codifies P9 dispatch SOPs, conflict resolution, gates. It's auto-memory because it must survive session boundaries — every new session running CTO role reads it on bootstrap. + +### Identity recovery memory closes F16 + +`feedback_p10_post_compaction_identity_recovery.md` lives in Layer 1 specifically because identity must survive compaction. The corresponding F-pattern is the negative form of why this memory exists. 
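
The stale-memory rule under "Anti-pattern: stale memory" can be sketched as a small check over the frontmatter contract. This is a minimal sketch, not part of ADSD: the 90-day threshold and the helper names `parse_frontmatter` and `is_stale` are assumptions chosen for illustration, and the parser only handles flat `key: value` lines.

```python
from datetime import date

# Hypothetical threshold; ADSD does not prescribe a number of days.
MAX_AGE_DAYS = 90


def parse_frontmatter(text):
    """Return the flat `key: value` pairs between the leading '---' fences."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def is_stale(text, today, max_age_days=MAX_AGE_DAYS):
    """Treat a missing, malformed, or too-old `last_verified_date` as stale."""
    raw = parse_frontmatter(text).get("last_verified_date", "")
    try:
        verified = date.fromisoformat(raw)
    except ValueError:
        return True  # unverifiable entries get re-checked, not trusted
    return (today - verified).days > max_age_days


example = """---
name: Identity recovery SOP
type: feedback
last_verified_date: 2026-05-12
---
body text
"""

print(is_stale(example, date(2026, 5, 20)))  # verified 8 days earlier
print(is_stale(example, date(2027, 1, 1)))   # verification has aged out
```

Treating a missing date as stale is a deliberate choice in this sketch: it pushes the agent toward pre-action verification rather than silent trust.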
+ +## Pitfalls + +| Pitfall | Layer confusion | Recovery | +|---|---|---| +| Memory file holds in-progress sprint notes | L3 → L1 leak | Move to a scratch message, promote permanent rule to L1 if it's actually a rule | +| ADR captures incident history | L2-finding → L2-ADR confusion | Rewrite as finding; if a decision was made, separate ADR linking the finding | +| Snapshot.md not updated post-merge | L2 staleness | snapshot-lint pre-commit hook (F1.1 mitigation) | +| "I'll remember to do X" (Layer 3) becomes binding | L3 informal → expected L1 | Either codify in memory or accept it'll be forgotten | +| Re-reading same memory file every turn | L4-style use of L1 | Trust L1 was loaded at session start; don't re-fetch | +| MEMORY.md hook is generic ("various rules") | Index loses dispatch value | Rewrite as specific keyword-dense one-liner | + +## Cross-references + +- `reference/context-window-strategy.md` — what to put in context (different question than where to put facts) +- `reference/failure-modes-catalogue.md` F1 family + F16 + F17 — anti-patterns this architecture mitigates +- `templates/snapshot-template.md` — Layer 2 snapshot template +- `SKILL.md` §"Snapshot discipline" — the operative discipline this architecture supports +- Anthropic Claude Code memory docs — MEMORY.md auto-load contract +- OpenAI Assistants API — threads + retrieval (for ADSD future consideration) diff --git a/plugins/adsd/skills/agent-driven-development/reference/evals-first-development.md b/plugins/adsd/skills/agent-driven-development/reference/evals-first-development.md new file mode 100644 index 0000000..1047be6 --- /dev/null +++ b/plugins/adsd/skills/agent-driven-development/reference/evals-first-development.md @@ -0,0 +1,211 @@ +--- +name: Evals-first development discipline +description: Build the evaluation harness before the feature. Anthropic's central claim "evals are the moat" applied to ADSD. Every public capability gets a falsifiable test before implementation, not after. 
+type: reference +version: 1.0.0 +date: 2026-05-12 +status: active +relates_to: [skill:SKILL.md §"Wave + Tx pattern", reference:failure-modes-catalogue.md F19+F20] +--- + +# Evals-first development + +> **Anthropic central claim**: "Evals are the moat. Better evals beat better models." +> ADSD adopts this: every Cobrust public capability has a falsifiable acceptance signal **before** the impl Tx fires. This is the positive form of F20 (constitution-vs-workflow alignment) — workflow enforces the rule by requiring the eval to fail-then-pass. + +## When this applies + +For any task type: + +- Adding a new public API surface (CLI flag, language feature, stdlib fn) +- Translating a Python library (every entrypoint = its own eval slice) +- Migrating an internal API (eval guards behavior preservation) +- Performance-claim release (eval = repeatable benchmark) +- LLM-driven anything (eval is the only way to detect prompt drift) + +Skip evals only for: + +- One-shot scripts with no future maintenance +- Doc-only changes (release-readiness verify is its own form, see F19) +- Strict refactors with full test coverage already + +## Anthropic-pattern adoption + +### Eval as code, not document + +Anthropic's evals are runnable artifacts — `pytest`-style scripts, JSON-line inputs/outputs, scoring functions. Not markdown narratives. 
+ +ADSD shape: + +``` +project-root/ +├── evals/ +│ ├── <feature-name>/ +│ │ ├── cases.jsonl # one (input, expected) per line +│ │ ├── score.py # scoring fn (exact / fuzzy / LLM-judge) +│ │ ├── run.sh # entrypoint +│ │ └── REPORT.md # last-run summary, machine-updated +│ └── README.md # eval directory index +``` + +### Eval categories (Anthropic taxonomy) + +| Category | When | Example for Cobrust | +|---|---|---| +| **Exact match** | Deterministic output | `cobrust build hello.cb` exit 0 + stdout `hello, world` | +| **Fuzzy match** | Allows whitespace / order drift | TOML round-trip; output equivalent under canonicalization | +| **Regex / structural** | Format known, content variable | Compiler error messages match pattern `/error\[\w+\]:/` | +| **LLM-judge** | Open-ended (docs, NL output) | Translated library's docstrings preserve original meaning | +| **Differential** | Compare against oracle | `cobrust-tomli.parse(s)` == `cpython tomllib.loads(s)` for 1000+ fuzz inputs | + +Differential evals are ADSD's strongest pattern, already baked in for tomli T1.1. Generalize. + +### Minimum eval bar (Anthropic guideline) + +- **≥ 50 cases** per public capability +- **≥ 10 adversarial cases** (boundary conditions, malformed input, edge encodings) +- **Reproducible**: `bash evals/<name>/run.sh` exits non-zero on regression +- **Cheap**: full eval suite runs in < 5 min, fuzz suite in < 30 min + +### Eval delta as merge gate + +ADSD enforces the moat through one rule: **every PR must report its eval delta**. + +``` +[P9-COMPLETION] eval-delta block (required for any merge touching public surface): +- evals/<name>/cases.jsonl: +N cases (was M, now M+N) +- pass rate before: <X>/<M> → after: <Y>/<M+N> +- adversarial cases: +K new (was J, now J+K) +- regression check: 0 prior cases newly failing +- if ANY prior case newly fails → BLOCK merge until justified or fixed +``` + +CTO gatekeeping protocol: spot-check the eval delta. Don't merge sprints that touch public surface without an eval-delta block. 
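
The eval-delta block above can be computed rather than hand-written. The following is a minimal sketch under stated assumptions: each run is represented as a mapping from case id to pass/fail, the function name `eval_delta` and the result keys are hypothetical, and the "until justified" escape hatch from the block is not modeled.

```python
def eval_delta(before, after):
    """Compare two eval runs; each maps case id -> True (pass) / False (fail)."""
    new_cases = sorted(set(after) - set(before))
    # A prior passing case that now fails, or that vanished from the suite,
    # counts as a regression in this sketch.
    regressions = sorted(
        case for case, passed in before.items()
        if passed and not after.get(case, False)
    )
    return {
        "cases_before": len(before),
        "cases_after": len(after),
        "new_cases": new_cases,
        "pass_before": sum(before.values()),
        "pass_after": sum(after.values()),
        "regressions": regressions,
        "verdict": "BLOCK" if regressions else "GO",
    }


before = {"c1": True, "c2": True, "c3": False}
after = {"c1": True, "c2": False, "c3": True, "c4": True}

delta = eval_delta(before, after)
print(delta["verdict"])      # BLOCK, because c2 passed before and fails now
print(delta["regressions"])  # ['c2']
print(delta["new_cases"])    # ['c4']
```

Note that the raw pass count rises from 2 to 3 in this example while the verdict is still BLOCK: the gate keys on prior cases newly failing, not on the aggregate pass rate.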
+ +## OpenAI-pattern adoption + +### Function/tool eval (structured-output enforcement) + +OpenAI's strongest practice: **if your agent emits structured output, eval the structure**. + +For ADSD: any sub-agent reporting `[P9-COMPLETION]` should emit a JSON-shaped block. CTO can machine-parse + verify required fields present. + +``` +[P9-LC100-COMPLETION] +```yaml +phase_1_adr: { sha: 3839742, status: accepted } +phase_2_buckets: + - { name: B1, pass: 27, compile_fail: 2, runtime_fail: 1 } + - { name: B2, pass: 24, compile_fail: 4, runtime_fail: 2 } + - { name: B3, pass: 22, compile_fail: 5, runtime_fail: 3 } + - { name: B4, pass: 9, compile_fail: 1, runtime_fail: 0 } +total_pass: 82 +total_fail: 18 +ramp_recommendation: GO_TIER_B +bug_patterns_top5: + - { signature: "i8/i64 mismatch in nested if", count: 4, finding: lc100-i8-i64-nested-if } + - ... +``` + +CTO `yq` or `jq` the YAML; verify fields; spot-check 3 random bugs. + +### OpenAI Evals framework (open source) + +OpenAI Evals repo (github.com/openai/evals) is well-documented. ADSD shouldn't reinvent — adopt their core types: + +- `match` — exact substring +- `fuzzy_match` — token / whitespace tolerant +- `model_graded` — LLM-as-judge with own evaluator model +- `code_run` — execute generated code, compare output + +ADSD's `score.py` per-eval-folder can wrap OpenAI Evals primitives. + +## ADSD integration with existing patterns + +### Wave + Tx + eval delta + +Existing pattern: every Wave merge has 5-gate green (fmt / clippy / build / test / doc-coverage). + +Add 6th gate for any public-surface Wave: **eval delta non-regression**. New cases land + 0 prior cases newly failing. + +This 6th gate is the systemic closure of F20 (constitution mandate without workflow alignment). It makes "evals first" a binding mandate, not aspiration. 
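
The structured `[P9-LC100-COMPLETION]` block shown under "Function/tool eval" lends itself to a mechanical consistency check before any human spot-check. This is a minimal sketch that assumes the YAML has already been parsed into a plain dict (parsing is out of scope here); the `REQUIRED_FIELDS` tuple and the `check_completion` name are illustrative assumptions, not an ADSD-defined schema.

```python
# Hypothetical required-field list, derived from the sample block only.
REQUIRED_FIELDS = (
    "phase_1_adr",
    "phase_2_buckets",
    "total_pass",
    "total_fail",
    "ramp_recommendation",
)


def check_completion(report):
    """Return a list of problems; an empty list means the report is consistent."""
    problems = [
        f"missing field: {name}"
        for name in REQUIRED_FIELDS
        if name not in report
    ]
    if problems:
        return problems
    buckets = report["phase_2_buckets"]
    passed = sum(b["pass"] for b in buckets)
    failed = sum(b["compile_fail"] + b["runtime_fail"] for b in buckets)
    if passed != report["total_pass"]:
        problems.append(f"total_pass {report['total_pass']} != bucket sum {passed}")
    if failed != report["total_fail"]:
        problems.append(f"total_fail {report['total_fail']} != bucket sum {failed}")
    return problems


# Values copied from the sample completion block.
report = {
    "phase_1_adr": {"sha": "3839742", "status": "accepted"},
    "phase_2_buckets": [
        {"name": "B1", "pass": 27, "compile_fail": 2, "runtime_fail": 1},
        {"name": "B2", "pass": 24, "compile_fail": 4, "runtime_fail": 2},
        {"name": "B3", "pass": 22, "compile_fail": 5, "runtime_fail": 3},
        {"name": "B4", "pass": 9, "compile_fail": 1, "runtime_fail": 0},
    ],
    "total_pass": 82,
    "total_fail": 18,
    "ramp_recommendation": "GO_TIER_B",
}

print(check_completion(report))  # [] because the bucket sums match the totals
```

A check like this only confirms that the report is internally consistent; it does not confirm that the reported numbers are true, which is why the spot-check of individual bugs remains.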
+ +### Eval-first vs TDD-pair (already in ADSD) + +These are complementary, not competing: + +- **TDD-pair** (Phase 2 in dispatch): test agent writes test corpus first, dev agent implements to pass. Per-feature TDD. +- **Eval-first** (Phase 0 in sprint): eval harness exists before the feature is dispatched. Per-public-surface lifetime guard. + +TDD pair tests that the impl matches the test corpus this sprint. Evals catch that impl still matches behavioral contract across sprint history. + +### Finding ↔ eval bidirectional + +When a finding is discovered (e.g. `lc100-i8-i64-nested-if`), the **same sprint that fixes the bug must add an eval case that catches it next time**. This is the prevention layer beyond the documentation layer. + +`docs/agent/findings/<slug>.md` must have a §"Eval case added" section listing the line in `evals/<feature>/cases.jsonl` that catches this specific failure. + +## Concrete template + +Inline template (copy this into your project's `evals/<feature>/REPORT.md` — a dedicated `templates/eval-template.md` may be split out in a future ADSD release): + +``` +--- +name: <feature>-evals +description: <one-line behavior under eval> +date: <date> +last_verified_commit: <SHA> +case_count: <N> +adversarial_count: <K> +oracle: <Python lib | manual | differential against ...> +--- + +# Evals: <feature> + +## Behavior under eval + +<2-3 sentences. 
The falsifiable claim about what the feature does.> + +## Eval suite layout + +- `cases.jsonl` — N input cases with expected output +- `score.py` — scoring function (cite category: exact / fuzzy / regex / model_graded / code_run) +- `run.sh` — entrypoint, exits non-zero on regression + +## Adversarial cases (subset of cases.jsonl) + +<list the K cases that target edge conditions; identify them by line index> + +## Last run + +| Field | Value | +|---|---| +| Date | <date> | +| Commit | <SHA> | +| Pass | <pass> / <total> | +| New failures vs prior | <K> | +| Regression status | <PASS / FAIL> | + +## Pitfalls + +- LLM-judge eval drifts if evaluator model changes. Pin evaluator model in `score.py`. +- Differential evals need pinned oracle version. Document oracle version in frontmatter. +``` + +## Pitfalls + +| Pitfall | Symptom | Recovery | +|---|---|---| +| Evals as documentation, not runnable code | `cases.jsonl` exists but no `run.sh` | Promote to runnable in 1 PR or delete | +| Eval coverage cliff: tons of easy cases, no adversarial | All cases pass on first try | Demand `adversarial_count ≥ N/5` in frontmatter | +| LLM-judge instability | Same case gives different verdict on rerun | Pin evaluator model + temperature 0 + cache responses | +| Differential eval oracle drift | Oracle library version bumps and evals silently re-baseline | Pin oracle version in frontmatter + verify in CI | +| Eval delta forgotten in PR | Sub-agent completion report omits eval-delta block | Make `[P9-COMPLETION]` template require the block | + +## Cross-references + +- `SKILL.md` §"Wave + Tx commit tags" — eval delta is the 6th gate +- `reference/failure-modes-catalogue.md` F19 (release install-not-tested) — eval-first is the systemic prevention +- `reference/failure-modes-catalogue.md` F20 (constitution-vs-workflow) — eval-first IS the workflow that enforces "test-first" mandate +- Anthropic: https://www.anthropic.com/engineering (search "evals") +- OpenAI Evals: 
https://github.com/openai/evals diff --git a/plugins/adsd/skills/agent-driven-development/reference/failure-modes-catalogue.md b/plugins/adsd/skills/agent-driven-development/reference/failure-modes-catalogue.md index 9909285..df8b68b 100644 --- a/plugins/adsd/skills/agent-driven-development/reference/failure-modes-catalogue.md +++ b/plugins/adsd/skills/agent-driven-development/reference/failure-modes-catalogue.md @@ -1,3 +1,13 @@ +--- +name: ADSD failure modes catalogue (F1-F30) +description: Concrete failure modes encountered in real ADSD projects with empirical evidence, root cause analysis, recovery patterns, and prevention mechanisms. F1 Sediment Family (9 sub-forms) + F2-F30 individual entries. Cobrust N=1 surfaced F1.0-F1.2 + F2-F24; Cobrust Studio N=2 (M0-M5) surfaced F1.3, F1.4, F25-F28; Cobrust Studio M6/M7 cycle surfaced F1.5 (candidate) + F29 (candidate). Add F31+ as your project hits new failure modes. +type: reference +version: 1.2.7 +date: 2026-05-12 +status: active +relates_to: [skill:SKILL.md §"Failure modes catalogue", case-study:cobrust-multi-agent-experience.md, case-study:cobrust-studio-experience.md, reference:evals-first-development.md, reference:context-window-strategy.md, reference:cross-session-memory-architecture.md] +--- + # Failure modes catalogue > Concrete failure modes encountered in real ADSD projects, with @@ -8,15 +18,20 @@ --- -## F1 — Declared rules without enforcement — **"F1 Sediment Family"** (**P0 SOP gap, 6 sub-forms confirmed**) +## F1 — Declared rules without enforcement — **"F1 Sediment Family"** (**P0 SOP gap, 9 sub-forms confirmed**) > **Status upgraded to "F1 Sediment Family" parent pattern** after 6 distinct -> sub-forms observed across Cobrust 11-day experiment. F1 is the single most -> common systemic failure in ADSD-flavor projects. Original 3 sub-forms -> (F1.0 / F1.1 / F1.2) remain as implementation-level instances. 
New -> sub-forms F16, F17, F18 extend the family to identity, self-reporting, and -> attribution-policy dimensions — all share the same root: **declaration ≠ -> enforcement, and enforcement scope silently lags reality.** +> sub-forms observed across Cobrust 11-day experiment, 2 additional sub-forms +> (F1.3 local-vs-CI gate drift, F1.4 README-vs-release-tag drift) confirmed on +> Cobrust Studio's 21-hour N=2 dogfood (M0-M5), and 1 additional sub-form +> (F1.5 test-corpus structural blind spot) surfaced in Cobrust Studio M6 cycle. +> F1 is the single most common systemic failure in ADSD-flavor projects. Original +> 3 sub-forms (F1.0 / F1.1 / F1.2) remain as implementation-level instances; +> F1.3 + F1.4 extend to enforcement-scaffold drift; F1.5 extends to test-corpus +> coverage gaps on re-derive paths. New sub-forms F16, F17, F18 extend the family +> to identity, self-reporting, and attribution-policy dimensions — all share the +> same root: **declaration ≠ enforcement, and enforcement scope silently lags +> reality.** > > **Family pattern one-liner**: Claim is written somewhere (constitution, > schema frontmatter, KPI card, attribution policy, auto-memory). No @@ -24,7 +39,10 @@ > turns. Violation is invisible until an auditor manually checks. > > See F16 (identity drift), F17 (self-report fidelity gap), F18 (attribution -> policy without dir-scope enforcement) for the three new sub-forms. +> policy without dir-scope enforcement) for the three new sub-forms; F1.3 +> (local-vs-CI gate drift), F1.4 (README-vs-release-tag drift) for the two +> Studio M0-M5-surfaced scaffold-level sub-forms; F1.5 (test-corpus structural +> blind spot on re-derive paths) for the M6-surfaced sub-form. ### F1.0 — Snapshot sediment ("重写忘删") @@ -109,16 +127,135 @@ in Cobrust's case — silently skips the gate. script-creation time. New milestones add ADRs/findings, but the script doesn't auto-extend. -**Evidence**: Cobrust 11th-review §H2. 
-`grep -rE "ADR-003[0-9]" docs/human/` returns **0 hits**. -ADR-0030..0039 全部 not in zh+en doc trees. Triple-tree drift is -systemic for all post-M14 work, but doc-coverage.sh is silent on it. +**Evidence**: Cobrust 11th-review §H2 (anchored at HEAD ~`06df4b4`, 2026-05-10). +At that time, `grep -rE "ADR-003[0-9]" docs/human/` returned **0 hits** — +ADR-0030..0039 were not in zh+en doc trees. Triple-tree drift was +systemic for all post-M14 work, but doc-coverage.sh was silent on it. +(Note 2026-05-12: this specific grep has since changed as later doc +sync added ADR-0030..0039 mentions, but the systemic pattern remains +the F1.2 instance — the verification step was hardcoded against a +specific milestone range and went stale.) **Recovery**: doc-coverage scripts must auto-discover scope via `ls docs/agent/adr/00*.md` patterns, not hardcode milestone lists. Same applies to any "rule covers M0-M<N>" pattern — it will go stale the moment M<N+1> lands. +### F1.3 — Local-vs-CI gate definition drift (sub-form of F1.2) + +**Symptoms**: a project enforces "N gates green before merge" at two +layers — `scripts/doc-coverage.sh` (developer-local fast feedback) and +the GitHub Actions workflow (canonical merge gate). The mandate is +nominally identical at both layers, but the two layers define the +gate-set differently. Local script reports "all 6 gates passed"; CI +fails the PR on a 7th gate the local script never ran. The developer +sees a green local run + a red CI run and cannot reconcile them +without reading both scripts side-by-side. + +**Concrete shapes seen**: +- Local `doc-coverage.sh` runs fmt/clippy/build/test + 2 doc-shape + checks (6 steps). CI workflow runs the same 6 + a separate + `cargo fmt --check` job (7 jobs). Local passes; CI fails on fmt + drift because the local script never ran fmt-check. +- Local script runs `cargo test --workspace`; CI runs `cargo test + --workspace --all-features`. Feature-gated test fails only in CI. 
+- Local script uses one cargo binary; CI uses pinned toolchain + version. Toolchain-specific lint fails only in CI. + +**Root cause**: structurally identical to F1.2 (constitution rules +with partial-scope enforcement), but applied to the enforcement +*scaffold itself*. The "N gates" rule is declared in the project +constitution; the enforcement layer has two implementations (local +script + CI workflow), and their definitions of "N" diverge silently. +Without a meta-check that script-set ⊆ CI-set, drift is invisible +until the next CI red. + +**Evidence**: Cobrust Studio M5.8 sprint, 2026-05-12. Persona auditor +Sarah v2 caught the gap: local `scripts/doc-coverage.sh` reported "6 +gates passed"; the GitHub Actions matrix-CI workflow added at M5 (per +Sarah v1 dispatch) ran `cargo fmt --check` as a separate job and +failed on the same SHA the local script approved. Resolution: §5b +added to `doc-coverage.sh` to run `cargo fmt --check` alongside the +existing gates, restoring script ⊇ CI invariant. See Studio case +study §4.2 and persona-driven §M5.8. + +**Recovery**: +1. Establish the invariant **script ⊇ CI** (the local script runs at + least every check the CI workflow runs). +2. Add a meta-check: a small CI job that fails if any check in + `workflows/*.yml` lacks a corresponding step in `doc-coverage.sh` + (grep-driven; brittle but bounds the drift). +3. When CI fails on a gate the local script didn't run, the fix is + to extend the local script in the same PR, not to silently rely + on CI catching it. + +**Prevention going forward**: in the SAME PR that introduces a new +CI job, extend the local enforcement script to run it. The "N gates" +mandate must name a single source of truth (the local script) and +treat CI as the canonical re-runner of that script — not as a parallel +gate-set. See F26 (recursive enforcement-script closure) for the +multi-layer-review discipline this implies. 
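
The meta-check from the F1.3 recovery steps (script ⊇ CI) can be sketched as a text comparison. This is a minimal, deliberately brittle sketch in the "grep-driven" spirit described above: it only sees single-line `run:` steps, it matches by substring, and the function names `ci_commands` and `missing_from_local` as well as the sample workflow and script contents are assumptions for illustration.

```python
import re

# Matches single-line steps such as `- run: cargo fmt --check`.
RUN_LINE = re.compile(r"^\s*(?:-\s*)?run:\s*(.+?)\s*$")


def ci_commands(workflow_text):
    """Collect single-line `run:` commands from workflow YAML text."""
    commands = []
    for line in workflow_text.splitlines():
        match = RUN_LINE.match(line)
        # Block scalars (`run: |`) are skipped; this sketch cannot see inside them.
        if match and match.group(1) not in ("|", ">"):
            commands.append(match.group(1))
    return commands


def missing_from_local(workflow_text, script_text):
    """Return CI commands that the local script never mentions."""
    return [cmd for cmd in ci_commands(workflow_text) if cmd not in script_text]


workflow = """
jobs:
  fmt:
    steps:
      - run: cargo fmt --check
  test:
    steps:
      - run: cargo test --workspace
"""

local_script = """
cargo clippy
cargo build
cargo test --workspace
"""

print(missing_from_local(workflow, local_script))  # ['cargo fmt --check']
```

A non-empty result would fail the meta-check job. Because the match is by substring, a CI step with extra flags (for example `--all-features`) is reported as missing when the local script runs the command without them, which is the drift shape listed above.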
+ +### F1.4 — Doc-coverage script enforces what it knows, README-vs-release-tag drifts silently (sub-form of F1.0) + +**Symptoms**: a project's `scripts/doc-coverage.sh` enforces invariants +on artifacts it knows about — module-doc `last_verified_commit:` SHA +reachability, ADR roster completeness, findings frontmatter shape. The +script is rigorous on its declared scope. Meanwhile, the public README +ships claims that **the script has no clause for**: +- README badge shows version `vX.Y.Z` while the latest pushed tag is + `vX.Y.(Z+1)` (badge-vs-tag drift) +- README §"Install" describes a single-platform tarball while + `release.yml` builds a 5-platform matrix (asset-coverage drift) +- README §"Compare to X" cites old positioning while the current + positioning was updated 3 commits ago (narrative drift) + +The script is green; the public surface is stale. Discovery happens +only when a persona auditor or new visitor reads the README cold. + +**Root cause**: F1.0 family — the script enforces what it knows to +enforce. Its declared scope (module-doc / ADR / finding) is rigorous; +its undeclared scope (README ↔ latest tag, README ↔ release.yml +matrix, README ↔ current positioning) is unenforced. The script +doesn't know to enforce these; nobody told it to. + +**Evidence**: Cobrust Studio post-v0.1.3 sprint, persona auditor Sarah +v2 R9 finding (2026-05-12). README §"Releases" badge displayed +`v0.1.2` and §"Install" described a single-platform `aarch64-apple-darwin` +tarball, AFTER v0.1.3 had shipped with a 5-platform `release.yml` +matrix (Linux + macOS x86_64/aarch64 + Windows). `doc-coverage.sh` +green; `last_verified_commit:` rigorous on every module-doc; README +content not under any gate. See Studio case study §M5.8 and Sarah-v2 +R9. + +**Rule of thumb**: + +> **What the doc-coverage script doesn't enforce, drifts. The script +> enforces what it knows to enforce. 
Anything outside its declared +> scope is on human discipline alone — i.e., it will drift.** + +**Recovery**: +1. For every public-facing artifact (README, release notes, landing + page), enumerate the claims that are bound to a current-tag value + (badge SHA, asset names, platform matrix, version string). +2. Add a `scripts/doc-coverage.sh` clause per claim: + - Badge SHA must equal `git describe --tags --abbrev=0` + - Every asset URL in README must resolve via `gh api` + - Every platform mentioned in README must appear in + `.github/workflows/release.yml` matrix +3. Mark previously-aspirational claims (e.g. "single-platform tarball" + wording) as ASPIRATIONAL per F1's generalized prevention rule, or + add the enforcement. + +**Prevention going forward**: when introducing a new public-facing +claim (README §"Install", release notes), in the same commit add the +script clause that enforces the claim. F1 family applied to public +surface, not just internal scaffolding. Composes with F19 +(release-readiness independent install-test) and F8 (marketing +overreach without citation): F19 verifies the install path runs; F8 +verifies marketing claims have citations; F1.4 verifies README claims +track current tag. + ### Generalized prevention going forward (P0 SOP) > **Any project-level rule without an automated check is security @@ -1012,15 +1149,21 @@ attribution rather than schema invariants. ### Evidence -Cobrust Day 11, ~14:00: P7 sonnet sub-agent, dispatched for a broad cleanup -sprint, edited README sections that included review-claude's narrative §F -findings summary. This is described in review-claude's README §A.NEW5 -"review-claude 13 own" item: "P7 sonnet boundary violation editing -review-claude README". - -The attribution policy was clearly stated in README §Attribution: -"findings/ entries are review-claude originals — discovered_by field -marks source." P7 had no enforcement signal preventing the edit. 
+Cobrust Day 11 (2026-05-11), early afternoon: a P7 sonnet sub-agent +dispatched for a broad cleanup sprint edited README narrative sections +that review-claude considered its own authoring territory. The +boundary violation was paraphrased in the review-claude session's +own-up log (review-claude session 4bb35f43, paraphrased as: "P7 broad- +cleanup spawn edited review-claude's narrative §F without scope- +exclusion guard in dispatch prompt"). The literal handoff-README +text content at that line had evolved over multiple turns, so the +canonical citation is the pattern description, not a verbatim quote. + +The attribution policy was stated in `review-claude-handoff/README.md` +§"Attribution policy": "findings/ entries are review-claude originals — +each file's `discovered_by:` frontmatter marks source." P7 had no +machine-enforcement signal preventing the edit (no CODEOWNERS, no +dispatch-prompt-level exclusion list). Note: This is a **candidate F18** because (a) it was observed in a single session, (b) the root-cause was partially ambiguous (was it P7 ignoring @@ -1065,6 +1208,1377 @@ When declaring attribution/ownership policies: --- +## F19 — Public-facing onboarding text written but never independently install-tested (F1 Sediment Family, install-test sub-form) + +> **F1 sub-form, confirmed**. Same family as F1.1 (declared invariants without enforcement). The text claims an install path; the install path is never executed in a clean shell by the writer. F19 is high-blast-radius: failures land on every new user's first impression, not on internal dev productivity. + +### Definition + +A release artifact (README quickstart, release notes, GitHub Release body, install script, `cargo install` command, `curl -L` URL) ships to public users without anyone running the documented commands in a clean shell before publish. The text passes review by being read; it fails reality by being run. 
+ +### Symptoms + +- `README.md` says `cargo install foo-cli` but the package is not on crates.io (path-deps not published) → user gets `error: could not find foo-cli in registry` +- Release notes list `cobrust-v0.1.1-x86_64-apple-darwin.tar.gz` but `release.yml` never built that target → user gets HTTP 404 +- README's dynamic URL builder via `$(uname -sm | tr ' ' '-')` builds `Darwin-arm64.tar.gz` but the actual asset is named `aarch64-apple-darwin` → 404 on Apple Silicon Macs +- GitHub Release body references action SHAs that don't resolve (typo'd / hallucinated commit hash) → CI red on the release branch +- Quick-start uses `curl -L ... | tar xz` but server doesn't follow redirect on plain `-L` without `-fsSL` → empty extraction, opaque error + +### Root cause + +Two compounding patterns: + +1. **Author writes from intent**: "this is how install should work" — writer's mental model of the asset naming convention or registry state, not the actual file on disk / artifact in release. +2. **No clean-shell verification step**: standard PR review reads the README diff in GitHub UI. Reviewer does not `cd /tmp && bash <(paste commands)`. So the text passes review by being plausible, not by being executable. + +Same structural pattern as F1.1: declaration (install path) + missing verification step (run in clean shell). F17 (self-report fidelity) is the report analog; F19 is the user-onboarding analog. Both are F1 family. + +### Evidence + +Cobrust 24-hour window 2026-05-10 → 2026-05-11, three consecutive instances: + +1. **M10 hallucinated SHA pins (v0.1.0 tag)**: M10 sub-agent SHA-pinned 4 GitHub Actions to fake 40-char hex with confident `# v4.2.2` comments. 13/14 CI jobs failed at action resolution, leaving v0.1.0 tag with red CI for ~4 hours until user spotted. Recovery: revert to tag form (`@stable`, `@v4`). +2. 
**v0.1.1 install path 404 (release notes)**: release notes listed `cargo install cobrust-cli` (package not on crates.io, path-deps unpublished) and curl URL `cobrust-v0.1.1-x86_64-apple-darwin.tar.gz` (release.yml never built x86_64-apple-darwin). Mei persona audit + Layer 3 review-claude curl test caught both. Recovery: change install command to `cargo install --git ...`, remove non-existent asset URL. +3. **v0.1.2 release-readiness audit (mechanism validated)**: §A.3 dispatched a release-readiness sub-agent before public announcement. Agent ran the documented curl commands in clean shell, surfaced friction (`curl -L` without `-fsSL` left empty extractions on some platforms). Friction was fixed pre-release **and back-ported to v0.1.1's notes** (`4baea69 docs(release): back-port -fsSL curl flag to v0.1.1 release notes`). This is the closure cycle: BLOCK → fix → re-test → GO. **First validated execution of F19's prevention mechanism in the wild.** + +### Rule of thumb + +> **Any public-facing install / quickstart / release command must pass independent execution in a clean shell before publish.** +> +> Mandatory release-readiness gate: +> ``` +> # In a /tmp/release-test-<sprint> directory with no env vars from dev box: +> 1. Run each `cargo install` / `curl -L` / install command verbatim from the doc +> 2. For each URL: curl -fsSL -o /tmp/check.tar.gz <URL> ; echo "HTTP $?" +> 3. For each command: confirm exit 0 + expected stdout +> 4. Block merge if any command fails +> ``` +> Spawn a dedicated **release-readiness agent** for this — not the same agent that wrote the docs (avoid F1.1 self-attestation pattern). + +### Recovery + +1. **Immediate**: if a 404 install URL is in a published release, edit the release body via `gh release edit <tag> --notes-file <fixed>.md` and force-push a docs commit on main. Tag itself is immutable, but body + asset uploads are not. +2. 
**Workspace version**: if `Cargo.toml` workspace.package.version was not bumped before tag, the prebuilt binary will report wrong version → user files bug → confidence damaged. Bump version BEFORE tag in every release SOP. +3. **Backport friction fixes** to prior releases: if you find `curl -L` should have been `curl -fsSL`, back-port via `gh release edit v0.1.1` so old release pages also fix. Don't leave half the user base hitting a known-fixed friction. + +### Prevention going forward + +In every project adopting ADSD: + +1. **Add release-readiness as a tier-0 verification step** in `cto_operations_runbook.md` §"Dispatching a new P9": for any commit touching `README.md` / `docs/releases/*.md` / GitHub Release body / `release.yml`, spawn a P7 sonnet release-readiness agent that runs install commands in a clean shell and reports `[P7-RELEASE-READY-VERDICT] GO / BLOCK`. +2. **Release-readiness agent prompt template** in `templates/dispatch-prompt-p7.md` (release-readiness flavor): the agent's job is to be skeptical of the docs it's auditing, run every command verbatim, paste raw exit codes + sizes as evidence. +3. **CI lint** (stretch): a release-time CI gate that resolves each URL/asset listed in release notes via `gh api` and fails if any returns 404. This is the F1 family enforcement mechanism. + +### Closure: BLOCK → fix → GO cycle as validation + +The validation that F19's mitigation works is itself empirical: v0.1.2's release-readiness audit produced a BLOCK verdict, the friction was fixed (back-port + new asset naming), the next audit returned GO. The system worked. Any project adopting this pattern should expect: first 1-2 releases produce BLOCK verdicts; over time, BLOCKs become rare because the writing convention internalizes the verification step. + +--- + +## F20 — Constitution mandate written but workflow never aligned (F1 Sediment Family, mandate-vs-workflow sub-form) + +> **F1 sub-form, confirmed**. 
A project constitution declares a binding rule ("test-first development", "atomic commits", "no `unwrap()` in non-test code"). The dispatch SOP, daily workflow, and reviewer checklist never align to enforce the rule. The constitution becomes aspirational marketing; the workflow runs unconstrained. + +### Definition + +The project's foundational document (CLAUDE.md, constitution, README §Principles) states a binding development rule. Implementation of that rule requires a corresponding step in the workflow: dispatch prompt template field, CI gate, pre-commit hook, reviewer checklist item. The workflow step is missing or unenforced. Code continues to be written without violating the constitution textually (no one disputes the rule), but the rule is never actually exercised. + +### Symptoms + +- Constitution §"Test-first": "failing test before implementation" — but every sprint's commits show `feat(X): implementation + tests in same commit`. Test-first ordering is impossible to verify from the diff. +- Constitution §"Atomic commits, code + tests + docs same commit" — but findings get added in separate doc-cleanup commits days later. +- Constitution §"No `unwrap()` in non-test code; use `expect("rationale")` instead" — `grep -r 'unwrap()' crates/*/src/` returns N hits, none with rationale. +- Project memory `feedback_subagent_model_tier.md` says "Opus for hard / sonnet for easy / haiku NEVER" — but P9 dispatch prompts consistently use sonnet without difficulty assessment, occasional spawns of haiku for trivial doc rewrites. + +### Root cause + +Two compounding patterns: + +1. **Mandate is text-level, not workflow-level**: writing the rule in CLAUDE.md feels like enforcing it. But the rule exists in agents' context only at session start; after compaction or sub-agent spawn, the rule is not re-asserted. +2. 
**No enforcement scaffold built alongside the rule**: when the constitution is drafted, the corresponding CI lint / dispatch prompt field / commit hook is not built in the same PR. The rule is declared; the enforcement is "we'll add it later." + +This is the meta-pattern of F1 Sediment Family applied to the project's own ground rules. Every other F1 sub-form (F1.0 schema invariants, F1.1 declared without CI, F16 identity preamble in skill not memory, F17 self-report fidelity, F18 attribution scope, F19 install commands) is an instance of this F20 meta-pattern: declared without enforcement at the right layer. + +### Evidence + +Cobrust, the 9 days before 2026-05-11: CLAUDE.md §6 stated "Test-first for compiler internals: failing test before implementation." Every P9 sprint from M3 through M12 used a single P7 sonnet agent writing impl + tests in the same commit. No commit log shows tests committed before impl. The constitution mandate was violated in practice for 9 consecutive days without anyone (including review-claude) spotting it. + +Discovery: 2026-05-11, the project owner raised the concern "The CTO only handles development and not testing, which is not great; every developer under the CTO should be paired with a sonnet tester" (translated from Chinese). The constitution gap was owner-spotted, not agent-spotted. Review-claude's analysis (this catalogue's parallel session) confirmed: CLAUDE.md §6 mandate without dispatch-prompt workflow alignment = F20 instance. 
+ +Resolution: 2026-05-11 same-day codification of D0-D5 difficulty matrix + mandatory dev/test pair workflow (separate test agent + dev agent, test-first ordering, P9 reviews corpus between) into: + +- `feedback_subagent_model_tier.md` §"Extension 2026-05-11" (memory enforcement) +- `cto_operations_runbook.md` §"Dispatching a new P9" + §"Dev/test pair pattern" (SOP enforcement) +- ADSD `templates/dispatch-prompt-p9.md` (template enforcement) + +Validation: Cobrust W2 sprint (the first sprint after codification) executed with TDD ordering visible in commit log: + +``` +ca4c37c tests(adr-0044): W2 Phase 2 failing test corpus per ADR-0044 (TDD step 1) +2eb4fca feat(stdlib+codegen+cli+types): wire source-level input/read_line/argv per ADR-0044 W2 Phase 2 (TDD dev step) + +d337cf0 tests(adr-0044): W2 Phase 3 LeetCode oracle-match corpus (TDD step 1) +0145e8b feat(examples): W2 Phase 3 — 10 LeetCode .cb programs (TDD dev step, ADR-0044 stdin/argv usage) +``` + +Within each phase, the "TDD step 1" commit lands before the "TDD dev step" commit in temporal order. **First executed test-first sprint after 12 days in which the constitution mandate was violated in practice.** F20 is closed for Cobrust via execution evidence, not just documentation. + +### Rule of thumb + +> **Every binding constitution rule must have a paired enforcement step in the same PR that introduces it.** +> +> Enforcement layers in ascending strength: +> 1. Mandate appears in dispatch prompt template (workflow text) +> 2. Mandate appears in auto-loaded project memory (survives compaction) +> 3. Mandate has a CI lint / commit hook / pre-commit check +> 4. Mandate is enforced by the tool itself (e.g. `cobrust build` rejects code with `unwrap()`) +> +> Aim for layer 3+ on critical rules. Layer 1 alone = F20 instance waiting to happen. + +### Recovery + +When discovering an F20 instance: + +1. **Locate the mandate text**: which paragraph in which doc? +2. **Identify the workflow gap**: which dispatch prompt template / SOP / CI file should enforce this? 
+3. **Add the enforcement in the next PR**, not "later". Same-day codification is the minimum. +4. **Backfill validation**: after enforcement is added, run one sprint that exercises the enforced path; verify the enforcement actually fires (e.g. CI rejects bad commit). + +### Prevention going forward + +In every new constitution / CLAUDE.md / project rules document: + +1. After each rule, add a `**Enforced by**: <CI lint / dispatch prompt field / memory entry / N/A — aspirational>` line. +2. If `Enforced by: N/A — aspirational` appears, flag for future codification or downgrade the rule to "guidance" rather than "mandate". +3. Periodic constitution audit (quarterly): grep every mandate, verify each has a working enforcement mechanism. + +This is itself a meta-application of ADSD: the project's own development discipline must be ADSD-managed. + +--- + +## F21 — Cross-session AI agent identity overload (F1 Sediment Family, identity-namespace sub-form) + +> **F1 sub-form, confirmed**. A symbolic agent handle ("review-claude", "the CTO", "studio-reviewer") is used across multiple distinct AI sessions/contexts as if it were a stable identity. Audit trail becomes ambiguous: claims attributed to handle X may originate from session A, B, or C, each with different context and authority. + +### Definition + +A natural-language handle is adopted as the de-facto name for a role (audit reviewer, CTO, tech lead). Multiple distinct AI sessions assume the same handle when fulfilling that role at different times or in parallel. Cross-session artifacts (documents, findings, commit messages) attribute work to "review-claude" without disambiguating which session. Future readers cannot distinguish whether a claim came from a session with deep project context vs. a fresh session with shallow context. 
+ +### Symptoms + +- Document signed "— review-claude, 2026-05-11" appears in a directory; another document also signed "— review-claude, 2026-05-11" appears with conflicting analysis +- A handoff doc claims "review-claude audited the project across 7+ review rounds" — but the actual author was a different session that synthesized the prior rounds from transcripts, not the session that performed them +- Commit message says `Co-Authored-By: review-claude` — git log cannot distinguish which session +- Project memory references "review-claude" as if it were a single persistent agent, when in practice it's been multiple sessions with different context depths + +### Root cause + +Three compounding patterns: + +1. **Symbolic-handle reuse**: humans naming AI roles (review-claude, CTO, tech-lead-p9) creates an implicit identity. Distinct sessions, when assigned that role, adopt the handle as their own. +2. **No session-ID attribution in artifacts**: documents/commits sign with the handle, not with `handle (session XYZ)` or `handle (timestamp)`. Audit trail collapses across sessions. +3. **Cross-session learning illusion**: readers assume "review-claude knows" things from prior sessions because the handle is consistent. But each session has fresh context unless explicitly fed prior artifacts. + +This is F1 family because: the role is declared (review-claude is the auditor), the identity is not enforced (no scheme to distinguish session-A's review-claude from session-B's). Audit attribution drifts. + +### Evidence + +Cobrust 2026-05-11 evening: project owner asked claude-desktop to draft a Cobrust Studio handoff. Claude-desktop drafted a multi-hundred-line document signing it "— review-claude, 2026-05-11". A separate Claude Code session (the parallel one auditing Cobrust live, session ID `4bb35f43...`) was also active that day and had been signing its own artifacts "review-claude". 
The Studio handoff cited an external "multi-turn review-claude session" — but the original session that performed those reviews did not write the handoff; claude-desktop did, citing the parallel session's prior work. + +Result: future readers of the Studio handoff cannot tell which review-claude session authored each claim, when, with what context. The handle "review-claude" became identity-overloaded between at least 2 concurrent sessions on the same day. + +The cleanly-locatable artifact instances on disk: `review-claude-handoff/handoff-pack/dispatches/claude-desktop-integrated-handoff.md` (claude-desktop integration record) + 5+ findings under `review-claude-handoff/findings/` with `discovered_by:` frontmatter, and ADSD's own `docs/agent/conventions.md` §"Identity hygiene (F21 closure)" prescribing session-ID-stamped attribution going forward. + +### Rule of thumb + +> **Symbolic AI role handles must carry session-ID or timestamp attribution in any persistent artifact.** +> +> Naming convention: +> - In documents: `— review-claude (session 4bb35f43, 2026-05-11)` +> - In commits: `Co-Authored-By: review-claude-session-4bb35f43 <noreply@anthropic.com>` +> - In findings: frontmatter `discovered_by: review-claude (session 4bb35f43)` +> - Reserve plain "review-claude" for the abstract role; never use it bare in attribution. +> +> Stronger: when spawning a new internal review agent, give it a distinct handle (e.g. `studio-reviewer-001`) rather than reusing "review-claude". Reserve "review-claude" for the originating external audit window. + +### Recovery + +When discovering an F21 instance in existing artifacts: + +1. Audit document signatures: identify which actually came from which session. +2. Where ambiguous: leave the original signature, append `(provenance: see commit <SHA> for session metadata)`. +3. Going forward, prefix new artifacts with explicit session ID. + +### Prevention going forward + +In every ADSD project: + +1. 
At the start of a session that will produce persistent artifacts, declare the session ID. Stamp every commit / finding / ADR with that ID. +2. Distinct roles get distinct handles. "review-claude" is the role; "review-claude (session 4bb35f43)" is the agent instance. Documents reference the latter. +3. If multiple sessions of the same role are concurrent: choose disambiguating suffixes (`review-claude-A`, `review-claude-B`, or session-ID). + +This convention applies to any AI agent role that produces persistent artifacts in a multi-session project. The cost is one extra string per signature; the benefit is unambiguous audit trail forever. + +--- + +## F22 — Coverage drive without bug-fix cadence (mitigation pattern validated, F1 Sediment Family suppression sub-form) + +> **F1 sub-form, candidate → validated-as-suppressed**. F22 is the negative pattern an ADSD project hits when it scales a stress-test corpus (N → 5N → 10N programs) without applying fix-pack between scales. ADR-0047 (LeetCode coverage strategy) was authored as the explicit F22 mitigation, and the LC-100 → Option H decision was the empirical validation that the mitigation works. + +### Definition + +The temptation to run "all 3816 LeetCode problems" / "all 500 test cases" / "all N stress-test inputs" *before* triaging and fixing bugs from the first batch. The result: each subsequent batch hits the same N bug-patterns as the first, multiplying the surface defect count without surfacing new failure modes. Bug-pattern density per batch saturates after ~100 programs; the next 3700 are mostly re-discovery of the same gaps. + +### Symptoms + +- A coverage-drive sprint exits with 3000+ test programs but only 5-7 distinct bug-patterns +- Triage time grows quadratically with batch size (more programs to classify into same patterns) +- "Pass rate" stays roughly flat across scales (e.g. 
77% at N=100 stays 75-80% at N=500 absent fix-pack) +- Fix-pack debt accumulates: each unfixed pattern blocks ~N/k programs per round, where k ≈ the number of distinct patterns + +### Root cause + +Coverage-as-throughput optimism: the assumption that running more cases surfaces more bugs. In practice, the bug-pattern distribution is heavy-tailed: the first ~100 programs of any reasonable sample surface ~80% of patterns. Continuing past saturation is re-discovery. + +ADR-0047 codified the **ramp gate**: pass rate < 70% → HOLD (fix-pack), 70-90% → conditional GO (fix-pack-OR-ramp evidence-driven), ≥ 90% → SKIP back to other work (gap-saturated, no Tier B ROI). + +### Evidence + +**LC-100 Tier A discovery sweep (2026-05-12)**: P9 opus + 4 P7 sonnet TDD pairs ran 100 programs across 10 algorithm categories. Initial result 77/100 with 3 distinct failure patterns (Pattern A codegen rodata literals, Pattern B list[str] type gap, Pattern C test corpus oracle defects). + +**ADR-0047 ramp logic predicted**: 77% is the conditional zone, so the choice was Option G (immediate Tier B) vs Option H (fix-pack first). P9 + review-claude both recommended Option H based on the F22 mitigation principle: don't ramp the same defect distribution to 5N scale. + +**Option H executed** at commits `2d952e0` (Sprint 1 Pattern C fix, +15 programs) + `2a8bdc0` (Sprint 2 Pattern A C-ABI fix, +7 programs) = 99/100 stable (77 + 15 + 7). Post-fix-pack pass rate 99/100 = 99% triggers ADR-0047's SKIP-back-to-W1 gate: Tier A is gap-saturated, Tier B has no ROI. + +**Validation**: F22 did NOT fire because the mitigation existed and was followed. The reverse-evidence is the counterfactual: had ADR-0047 not existed, P9 likely would have ramped to 500 programs and re-discovered the same 3 patterns at ~75-defect scale, wasting ~5-10× agent-time. 
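The ramp gate thresholds quoted under Root cause reduce to a three-way check on the current batch's pass count. The following is a minimal sketch, not ADR-0047's actual implementation: `RampDecision` and `ramp_gate` are hypothetical names, and treating exactly 90% as SKIP is an assumption, since the stated ranges overlap at that boundary.

```python
from enum import Enum


class RampDecision(Enum):
    HOLD = "hold: fix-pack first, then re-baseline at N"
    CONDITIONAL_GO = "conditional go: weigh fix-pack cost against ramping"
    SKIP = "skip: corpus is gap-saturated, ramping has no ROI"


def ramp_gate(passed: int, total: int) -> RampDecision:
    """Map a batch pass count onto the three ramp-gate zones.

    Assumption (not from the source): a pass rate of exactly 90%
    falls into SKIP, and exactly 70% into CONDITIONAL_GO.
    """
    if total <= 0 or not 0 <= passed <= total:
        raise ValueError("need total > 0 and 0 <= passed <= total")
    # Integer comparison avoids float rounding at the 70% / 90% edges.
    if passed * 100 < 70 * total:
        return RampDecision.HOLD
    if passed * 100 < 90 * total:
        return RampDecision.CONDITIONAL_GO
    return RampDecision.SKIP


if __name__ == "__main__":
    print(ramp_gate(77, 100).name)  # initial LC-100 Tier A result
    print(ramp_gate(99, 100).name)  # result after the fix-pack sprints
```

With the LC-100 figures, 77/100 falls in the conditional zone and 99/100 in the SKIP zone, which matches the sequence of decisions described in the evidence. The fix-pack cost check that resolves the conditional zone is a human or P9 judgment and is deliberately left out of the sketch.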
+ +### Rule of thumb + +> **Stress-test corpus growth (N → 5N) MUST be gated by current-batch pass rate.** +> +> Decision logic: +> - < 70% pass: HOLD; fix-pack the patterns surfaced; re-baseline at N before ramping +> - 70-90% pass: conditional GO with bug-fix-cost check — if fix-pack > 1 day, ramp anyway; if ≤ 1 day, fix first +> - ≥ 90% pass: SKIP — corpus is gap-saturated for this language area; ramping has no ROI + +Time-cap the discovery sweep at the same time: ADR-0047 capped Tier A at 1-2 day. Without a time-cap, F22 manifests as "ramp anyway because the test feels useful". The cap forces the gate decision. + +### Recovery + +If F22 has already fired (you ramped before fixing): + +1. **Triage all failures into pattern groups** (ADR-0047 Phase 3-style). Aim for 3-7 distinct patterns; if more, the test corpus is noisy. +2. **Identify the high-multiplicity patterns**: which 2-3 patterns account for ≥ 80% of failures? Fix those first. +3. **Re-baseline at the smaller scale (e.g. N) after the fix-pack**. Confirm pass rate ≥ 90% before considering further ramp. +4. **Document the cost lesson** as a finding — "we ramped to 5N before fix-pack and lost ~K agent-hours to repeated triage." + +### Prevention going forward + +For any future stress-test corpus design: + +1. **ADR the ramp strategy before generating the corpus** (per ADR-0047 template). Include the gate thresholds. +2. **Build the time-cap into the dispatch prompt**. P9 sub-agents that exceed cap MUST escalate, not auto-ramp. +3. **Track bug-pattern density per batch**. When it falls below ~1 new pattern per 50 programs, you've hit saturation. +4. **Reverse-evidence is real evidence**. When F22 doesn't fire, document the counterfactual cost saved. + +--- + +## F23-A — Oracle authorship without independent verification (F1 Sediment Family, oracle-verify sub-form) + +> **F1 sub-form, confirmed**. 
Same family as F1.1 (declared invariants without enforcement) and F17 (sub-agent KPI self-report fidelity), but specific to **the test oracle itself** rather than the implementation under test. The pattern: the agent authoring the test corpus mentally executes the algorithm and writes both the algorithm description AND the expected output. Without an independent verifier (a reference implementation), arithmetic / DP-trace / tree-encoding mistakes get encoded directly into the oracle — silently invalidating the test gate. + +### Definition + +A P7-TEST sonnet agent produces a test corpus (`test.toml` cases + algorithm paraphrase in README) by mental execution of the algorithm. The expected output field is the agent's mental computation result — no independent verification path runs. Bugs in the agent's mental execution become bugs in the oracle. + +The downstream effect: a P7-DEV agent's `solution.cb` may be algorithmically correct, but fails the oracle because the oracle itself is wrong. Triage misclassifies this as a "language gap" instead of a "test corpus defect", wasting language-implementation effort on a test-author mistake. + +### Symptoms + +- Algorithm-style stress-test corpus shows 15-30% failures concentrated in arithmetic / DP-trace / graph-traversal categories +- DEV agent's failing solutions look algorithmically reasonable on careful read +- Quick reference-implementation check (running the algorithm in Python by hand) confirms DEV output is correct and the oracle is wrong +- "Pattern: test corpus defects" emerges as a primary failure class in triage + +### Root cause + +Mental execution is unreliable for non-trivial algorithms. 
Even high-quality LLM agents have non-zero error rate when computing: +- DP transitions for sequences > ~10 elements +- BFS / DFS over trees with > ~5 levels +- Modular arithmetic chains +- Bit manipulation edge cases +- String parsing with escape sequences + +The author of the algorithm description and the author of the expected output are the same agent in the same session — confirmation bias guarantees the oracle agrees with the agent's mental model, not with reality. + +### Evidence + +**LC-100 Tier A failure triage**: 15 of 23 initial failures (65%) were oracle-authorship defects, not language gaps. Concrete examples (from `lc100-pattern-c-test-corpus-defects.md`): + +- coin-change DP: agent computed DP[5] = 2 mentally; actual algorithm returns 1 +- BFS level-count: agent encoded "depth = 3" for a tree where actual BFS returns 4 (off-by-one on root) +- Roman-to-int: agent's mental arithmetic on "MCMXCIV" yielded 1995 instead of 1994 (subtraction-rule miscount) +- Climbing-stairs: agent encoded fib(N+1) instead of fib(N) (off-by-one on base case) + +15 corrections were derivable post-hoc by running reference Python implementations against the same inputs. The author's mental execution had been the sole oracle source — no second pass. + +**Codified mitigation: ADR-0047a verify.py mandate** (2026-05-12). Every Tier B program must ship with a `verify.py` reference Python implementation that runs against the `test.toml` corpus and confirms the oracle before the DEV phase begins. + +### Rule of thumb + +> **The test oracle author MUST run an independent verification (different code path, ideally different agent) before declaring the corpus ready.** +> +> Concrete forms: +> 1. **Reference-implementation pattern (lightweight, default)**: P7-TEST authors a `verify.py` reference Python impl in the same sprint; runs it against test cases; commits only when all match. +> 2. 
**CPython differential pattern (heavyweight, for numerical / library translations)**: oracle is computed by an authoritative external implementation; agent encodes the input + the differential check, not the expected output. +> 3. **Hand-verified pattern (lowest scale, ≤ 5 cases)**: human reviewer hand-traces each case; works only at small N. + +For algorithm-style corpora (LeetCode shape), Form 1 (verify.py) is the empirically validated default. + +### Recovery + +When F23-A fires (oracle defects discovered post-hoc): + +1. **Triage**: separate corpus-defect failures from language-gap failures. The corpus-defect class shows DEV output looking algorithmically reasonable. +2. **Author reference impls** (Python, Rust, or pseudocode) for the affected cases; run them against the corpus. +3. **Fix the corpus, not the implementation**, for any case where reference impl confirms DEV output. +4. **Re-run the full corpus** post-fix; confirm pass rate change matches the corrected-defect count. + +### Prevention going forward + +In any future stress-test corpus dispatch: + +1. **Update dispatch templates**: P7-TEST prompt MUST include verify.py authoring as a step before test.toml finalization (per ADR-0047a pattern). +2. **Sprint exit gate**: `[P7-TEST-CORPUS-READY]` report MUST include per-program `verify_py_matches: yes/no` rows. +3. **CI extension (stretch)**: a release-readiness-style harness re-runs verify.py against test.toml at corpus-edit time, catching oracle drift between sprints. + +--- + +## F23-B — Synthetic stress test distribution drift from real-world (F1 Sediment Family, distribution-coverage sub-form) [CANDIDATE, UNMEASURED] + +> **F1 sub-form, candidate**. A stress-test corpus is hand-picked or algorithmically generated to exercise a specific surface (e.g. "10 algorithm categories × 10 programs each"). The resulting bug-pattern distribution may diverge from what real-world programs in the same language would surface. 
The corpus's coverage claim ("we tested 100 programs") may not generalize to "the language handles 100% of similar real-world programs." + +### Definition + +A discovery sweep's bug-distribution is a function of the corpus's input-distribution. If the corpus's distribution differs from production-distribution, the bug-set found is unrepresentative — both falsely confident (missing bugs that real programs would surface) and falsely alarming (surfacing bugs that real programs never trigger). + +For Cobrust LC-100: 10 algorithm categories × 10 paraphrased programs each is a synthetic distribution. Real-world Python programs (e.g. tomli, msgpack, dateutil) have very different structure — heavy on string parsing, library boilerplate, error-handling, less on DP/graph/numerical algorithms. + +### Symptoms (predicted, not yet validated) + +- Stress-test discovery surfaces N bug-patterns; real Python lib translation later surfaces M ≠ N bug-patterns +- Bug-pattern overlap between synthetic and real-world is < 70% +- "Pass rate at N synthetic programs ≥ 90%" does NOT imply "pass rate on real Python libs ≥ 90%" + +### Root cause + +Distribution mismatch: + +- **Synthetic-leaning bias**: algorithm-style problems exercise control flow + arithmetic + small data structures. Real Python programs exercise string manipulation + I/O + library interop more heavily. +- **Length distribution**: LeetCode programs typically 20-100 LOC. Real Python files are 200-2000 LOC with multi-module imports. +- **Error-handling absence**: algorithm-style problems usually have well-defined inputs; real programs need defensive error handling, validation, malformed input recovery. + +A 99/100 pass rate on synthetic corpus doesn't bound the failure rate on production-distribution programs. + +### Evidence + +**Unmeasured at LC-100 Tier A close (2026-05-12)**. 
Empirical validation requires running translated real Python libraries (T1.1 tomli, msgpack, dateutil) against the same Cobrust compiler that achieves 99/100 on LC-100, then comparing bug-pattern overlap. + +Hypothesis: pattern overlap will be < 60%. Real-Python translation will surface string-handling + library-interop bugs that LC-100 doesn't probe; LC-100 surfaces algorithmic-edge bugs that real Python rarely hits. + +The candidate becomes "confirmed F23-B" when this measurement happens. + +### Rule of thumb + +> **A stress-test pass rate is a function of the test corpus distribution. To bound real-world failure rates, run additional probes on the actual production distribution (or a sample of it).** +> +> Practical forms: +> 1. **Cross-distribution validation**: after a synthetic corpus closes, run a smaller (~10-30) real-distribution sample. Compare bug-pattern overlap. +> 2. **Real-distribution prioritization**: if real-world coverage is the goal, prioritize real-distribution corpus design over synthetic. +> 3. **Cite distribution explicitly**: marketing / release messaging "Cobrust passes N/M LeetCode" must qualify "(synthetic algorithm-style corpus; real Python lib translation rates vary)". + +### Recovery + +If F23-B is suspected (synthetic pass rate is high but real-world deployment has issues): + +1. **Build a real-distribution sample**: ~30 representative programs from production code or real libraries. +2. **Run against the same compiler**; classify failures. +3. **Pattern-overlap analysis**: which patterns appear in both? Which only in synthetic? Which only in real-world? +4. **Update marketing / release messaging** to cite the appropriate distribution. + +### Prevention going forward + +When designing future stress-test corpora: + +1. **Declare the corpus distribution in the dispatch ADR**. "10 algorithm categories × 10 programs each" is a synthetic-leaning distribution and must be acknowledged as such. +2. 
**Add a real-distribution sample at Phase 4** of any large coverage sweep. Even 10-20 real programs validate the synthetic pass rate's generalizability. +3. **Marketing copy must qualify**: "99/100 on synthetic algorithm corpus" not "99% language coverage." F8 (marketing overreach) prevention. + +### Status + +**Candidate**, awaiting empirical measurement post-T1.1 real-LLM E2E on msgpack / dateutil / requests / click. When pattern-overlap data lands, this entry promotes to confirmed. + +--- + +## F24 — Stress-test pass via primitive-as-everything simulation (F1 Sediment Family, coverage-fidelity sub-form) + +> **F1 sub-form, confirmed**. Related to F23-A (oracle-without-verify) and F23-B (distribution drift) but distinct: F24 is about **what the implementation under test actually exercises**. The pass rate metric becomes semantically vacuous when programs route around a missing language feature using a primitive type (list as linked-list / dict as tree / list-as-stack-as-queue) — the language passes the test but doesn't actually implement the structure the test category claims to cover. + +### Definition + +A stress-test corpus organized by feature category (e.g. "10 linked-list problems / 10 tree problems / 10 hash-set problems") shows a high pass rate. But inspection of the actual program implementations reveals they all use a single primitive type (`list[i64]`, `array<T>`) as the data backbone, simulating the richer category-named structure via index arithmetic or value-arrays. The language never actually compiled a real linked-list / tree / set type — the test passed via simulation, not via real coverage of the claimed feature category. + +### Symptoms + +- Stress-test categories named after data structures show high pass (e.g. 
"10/10 linked list", "9/10 binary tree") +- All "linked list" .cb programs share a comment like `# Algorithm: store values in an array then two-pointer / index manipulate` +- Tree problems use level-order index encoding (`parent = (i-1)/2`) on a flat array, not real tree nodes +- Hash-set problems use dict-with-1-as-value, not a Set type +- `grep -r 'struct.*Node\|struct.*Tree\|enum.*List' src/` returns nothing matching real recursive types + +### Root cause + +The corpus author (P7-TEST or human spec author) selects categories by their algorithmic shape ("LinkedList problems", "Binary Tree problems") but the corpus's pass condition is "expected stdout matches actual stdout" — which is achievable by **any** correct algorithm regardless of data structure. The cheapest correct implementation often routes through a primitive the language already supports, bypassing the structure the category implicitly claims. + +Without an explicit constraint "this category MUST use a recursive struct" or "this category MUST allocate K Tree nodes", the pass rate measures algorithmic correctness, not feature-category coverage. + +This is F1 family because: the coverage claim ("we tested 10 linked list problems") is declared, but no enforcement mechanism verifies the language actually exercised linked-list semantics. The declaration drifts from the enforced reality. + +### Evidence + +**Cobrust LC-100 Tier A close (2026-05-12, HEAD 459b820)**: 99/100 pass rate. Linked-list problems inspection: + +```cobrust +# examples/leetcode-stress/045-linked-list-palindrome/solution.cb +# Algorithm: store all values in an array, then two-pointer compare from both ends +fn main() -> i64: + let vals = list_new(n) + # ... 
list_set / list_get loops, two-pointer arithmetic +``` + +```cobrust +# examples/leetcode-stress/047-merge-k-sorted-lists/solution.cb +# Algorithm: store all lists in a flat array, then selection-sort via K pointers +fn main() -> i64: + let flat = list_new(10000) + let offsets = list_new(k + 1) + # ... index arithmetic, no Node struct +``` + +```cobrust +# examples/leetcode-stress/050-rotate-linked-list/solution.cb +# Algorithm: values in array, rotate by index +``` + +All linked-list programs use `list_new / list_set / list_get` flat-array simulation. Same pattern across the 10 linked-list + 10 tree + N hash-set programs. + +Cobrust language as of HEAD 459b820: +- `grep -rE 'LinkedList|TreeNode|HashSet' crates/cobrust-stdlib/src/` returns Rust-side `HashSet<T>` internal wrappers but **no source-level (`.cb`-visible) types** for LinkedList / Tree / Set +- `grep -rE 'struct.*ref|recursive struct' crates/cobrust-types/src/` returns nothing matching source-level recursive struct support + +Conclusion: the 99/100 pass rate is **valid as algorithmic stress test** but **does not bound the language's recursive-type support**. The two metrics diverge; the corpus's category names suggest coverage that the language did not actually achieve. + +### Rule of thumb + +> **Coverage claims by feature category MUST be verified at the implementation surface, not just the output surface.** +> +> Concrete forms: +> 1. **Type-asserting pass condition**: corpus per-program asserts that the .cb solution uses the claimed type (e.g. `solution.cb` for LinkedList must contain `struct.*Node` or import the stdlib `LinkedList`). Static check at sprint exit gate. +> 2. **Feature-category audit**: P9 Phase 3 triage explicitly inspects K random programs per category for primitive-simulation pattern. If > 50% use the same primitive, flag the category as "simulated, not really tested". +> 3. **Counterfactual sample**: write 1-2 programs per category that DELIBERATELY use the claimed type. 
If they don't compile, the category was never really covered. + +For Cobrust LC-100: forms 1+2 should have fired during P9 Phase 3 triage. Recovery: track the gap as explicit tech debt with a pre-tag blocker (per ADR-0045 user-traction milestone gate pattern). + +### Recovery + +When F24 has fired (your stress-test passes mask a real coverage gap): + +1. **Document the tech debt explicitly**. Write a finding citing per-category simulation patterns observed. Cite specific .cb files. +2. **Set a binding pre-tag gate**: the next major-version release (v0.X+1.0) MUST NOT ship until the simulated categories have real-type implementations. Codify in an ADR. +3. **Dispatch the tech debt sprint**: design + implement the missing language features (recursive struct + ref semantics + stdlib LinkedList/Tree/Set generics) + retrofit a subset of programs (3-5 per category) to use the real types. +4. **Re-baseline pass rate on retrofit subset**: confirm the language really compiles and runs the typed implementations. Pass rate on retrofit subset is the honest coverage metric. + +### Prevention going forward + +For future stress-test corpus design: + +1. **Categorize by data structure constraint, not just by algorithm**. "10 programs that MUST use struct Node" not "10 programs about linked lists" — the difference is enforcement. +2. **Sprint exit gate per-category**: static analysis confirms each program in a category exercises the claimed feature. +3. **Cross-reference language ADRs**: if your corpus has a "tree" category, but the language doesn't have an ADR for recursive struct support, the category is fictional until that ADR lands. + +This pattern composes with F19 (install-not-tested): both reflect a gap between **what the artifact claims** and **what was actually verified**. F19 is on user-facing surface (install commands), F24 is on test-coverage surface (category claims). Both close by the same principle: independent verification of the claim against reality. 
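
Forms 1 and 2 of the F24 rule of thumb can be approximated with a static check. The following is a minimal Python sketch, not part of the Cobrust tooling: `audit_category`, the regex, the shortened file paths, and the `struct Node:` source text are all assumptions made for illustration. It counts how many programs in a category never mention the claimed type and flags the category when that share exceeds the 50% threshold named above.

```python
import re


def audit_category(sources, required_pattern, threshold=0.5):
    """Return (simulated_fraction, flagged) for one corpus category.

    sources: mapping of file path -> solution source text.
    required_pattern: regex a solution must match to count as using the
    claimed type (a hypothetical stand-in for a real static check).
    """
    if not sources:
        raise ValueError("category has no programs to audit")
    pattern = re.compile(required_pattern)
    simulated = [path for path, text in sources.items()
                 if not pattern.search(text)]
    fraction = len(simulated) / len(sources)
    # Form 2 of the rule of thumb: flag when more than half simulate.
    return fraction, fraction > threshold


# Two flat-array programs in the style quoted in the Evidence section,
# plus one hypothetical program that declares a real node type.
linked_list_sources = {
    "045/solution.cb": "let vals = list_new(n)\n# two-pointer compare",
    "047/solution.cb": "let flat = list_new(10000)\n# index arithmetic",
    "900/solution.cb": "struct Node:\n    val: i64\n",
}
fraction, flagged = audit_category(linked_list_sources, r"struct\s+\w*Node")
print(f"simulated={fraction:.0%} flagged={flagged}")
```

A regex match is only a proxy: a program could declare a node type and still never allocate one, so a check like this complements the counterfactual sample (form 3) rather than replacing it.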
+ +--- + +## F25 — Tag → audit → patch as a release pattern under AI velocity (discipline, not failure) + +> **Discipline entry, not a defect pattern**. F25 is the empirically validated +> *legitimate-and-disciplined* form of what would otherwise read as "shipping +> broken tags". The pattern only becomes anti-pattern when its three preconditions +> (honest CHANGELOG, audit-as-experiment, K-bound convergence) are violated — +> see §"When F25 degrades into anti-pattern" below. Catalogued here because under +> AI velocity (~2.5×-10×) the first tag will not be the publishable one, and the +> right discipline is to *plan for K patch tags* rather than aim for shippable-on-first-try. + +### Definition + +Under AI-velocity acceleration, a project ships its first tag with the +expectation that the **first release-readiness audit will reveal an enforcement +gap that intent-driven self-checks missed**. The pattern is: + +``` +Tag v0.1.<N> ← experiment substrate + ↓ +Release-readiness audit in clean shell ← observation + ↓ (BLOCK) +Finding filed + patch + tag v0.1.<N+1> ← learning + ↓ +Re-audit + ↓ (GO) +Announce, publish notes +``` + +Each tag is the experiment; each audit is the observation; each patch is +the learning. The pattern's success metric is **bounded convergence after K +patches**, not "first tag is perfect". For Cobrust Studio: K=2 (v0.1.0 broken +→ v0.1.1 broken → v0.1.2 usable, in 6 hours wall-clock). 
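
The tag, audit, patch cycle in the F25 definition can be modelled as a bounded loop. This is a hedged Python sketch under stated assumptions: `release_loop` and the `audit` callable are hypothetical stand-ins for a real clean-shell audit, and the bound of 3 follows the "typically 2-3" convergence figure given in this entry.

```python
def release_loop(audit, first_patch=0, max_patches=3):
    """Tag, audit, patch until the audit returns GO or K exceeds the bound.

    audit: callable taking a patch number and returning ("GO", None) or
    ("BLOCK", gap_description). It stands in for a clean-shell audit.
    Returns (usable_patch, changelog); changelog names every broken tag.
    """
    changelog = []
    patch = first_patch
    while True:
        verdict, gap = audit(patch)
        if verdict == "GO":
            return patch, changelog
        # Honest CHANGELOG: record the broken tag and its gap before re-tagging.
        changelog.append((f"v0.1.{patch}", gap))
        if len(changelog) > max_patches:
            raise RuntimeError("K bound exceeded: stop tagging, fix enforcement")
        patch += 1


# Hypothetical audit results mirroring the K=2 sequence described in the text.
gaps = {0: "SPA fallback regression", 1: "stale Cargo.lock"}
usable, log = release_loop(
    lambda n: ("BLOCK", gaps[n]) if n in gaps else ("GO", None)
)
print(usable, log)
```

The loop makes the success metric explicit: it terminates either with a usable tag and a complete record of broken ones, or with an error that forces the structural fix instead of another patch.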
+ +### Symptoms (legitimate form) + +- Multiple consecutive patch-tags in a single calendar day (v0.1.0 → v0.1.1 + → v0.1.2 in 6 hours) +- Each tag has its own CHANGELOG entry naming the gap explicitly + ("v0.1.1 stale Cargo.lock; cargo build --locked exit 101") +- Each tag has a corresponding finding under `docs/agent/findings/` filed + before the next patch +- README §"Honest status" or equivalent names the patch dance up front + for users +- Total K is bounded (typically 2–3); convergence is not "endless patch + spiral" + +### When F25 degrades into anti-pattern + +F25 becomes a defect pattern (and should be filed as a separate finding) +when any of the following hold: + +1. **No honest CHANGELOG**: subsequent tag silently overwrites prior + without naming the broken state. Users cannot distinguish which tags + to skip. *Recovery*: amend CHANGELOG at the next patch; never delete + the prior tag's broken state. +2. **Audit-as-ceremony, not audit-as-experiment**: the release-readiness + audit is rubber-stamping rather than truly running install commands + in a clean shell. Same F19 (release-readiness untested) instance, + wearing a release-pattern costume. +3. **K unbounded**: more than ~3 patch tags without convergence suggests + the project is missing a structural fix (the F20/F26 enforcement + layer the patches are nominally closing). *Recovery*: stop tagging; + land the enforcement-script fix; re-tag once. + +### Root cause + +AI-velocity acceleration buys experimental cycles, not shippable-first-try. +Under a 5-day human plan compressed to 2 days, the writer's mental model of +"what will install correctly" diverges from the actual artifact more than +under a 5-day human cadence. The release-readiness audit (F19's prevention +mechanism) catches the divergence; the patch closes it. The pattern is +*the right discipline* for AI velocity — but only with the three preconditions +above honored. 
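
The first precondition (honest CHANGELOG) lends itself to a mechanical check. The sketch below is illustrative Python, assuming audit verdicts are available as a tag-to-verdict mapping and the CHANGELOG as plain text; `unnamed_broken_tags` and the sample changelog are hypothetical and deliberately omit one broken tag to show the detection.

```python
import re


def unnamed_broken_tags(verdicts, changelog_text):
    """List tags whose audit returned BLOCK but which the CHANGELOG never names.

    verdicts: mapping of tag -> "GO" or "BLOCK" (assumed audit-log shape).
    """
    unnamed = []
    for tag, verdict in verdicts.items():
        if verdict != "BLOCK":
            continue
        # Match the tag as a whole token so "v0.1.1" is not satisfied by "v0.1.10".
        named = re.search(rf"(?<![\w.]){re.escape(tag)}(?!\d)", changelog_text)
        if not named:
            unnamed.append(tag)
    return sorted(unnamed)


verdicts = {"v0.1.0": "BLOCK", "v0.1.1": "BLOCK", "v0.1.2": "GO"}
# Hypothetical changelog that silently skips v0.1.0.
changelog = (
    "## v0.1.2\nRegenerated Cargo.lock.\n"
    "## v0.1.1\nKnown-broken: stale Cargo.lock; cargo build --locked exit 101.\n"
)
print(unnamed_broken_tags(verdicts, changelog))
```

A non-empty result is the "quiet retag" signal: a tag the audit blocked that users cannot identify as one to skip.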
+ +### Evidence + +Cobrust Studio 2026-05-12, three consecutive tags in 6 hours wall-clock +(case study §3.4, §3.5, §4.1): + +1. **v0.1.0** (commit `a722e09`, tag `0a7fd3e`): SPA fallback regression + (`Path<String>` on `Router::fallback`) shipped. Post-tag CTO 守闸 + release-readiness audit ran hermetic Playwright against + `./target/release/cobrust-studio` built from main HEAD; 13/14 e2e specs + failed. Finding `m4-release-readiness-spa-fallback-extractor.md` filed + P0. +2. **v0.1.1** (commit `15b6f46`): SPA fallback fixed via `Uri` extractor. + Stale Cargo.lock shipped; `cargo build --workspace --locked` exit 101. + `release-tarball.sh` errored; CHANGELOG names the gap. +3. **v0.1.2** (commit `7ea9ae3`): Cargo.lock regenerated + `doc-coverage.sh` + §6 hardened with paired exit-code + FAILED-grep gate. Release-readiness + audit returned GO. First usable tag. + +CHANGELOG names each broken tag explicitly; README §"Honest status" names +the patch dance up front. All three preconditions honored. K=2 (within +the bounded convergence claim). + +### Rule of thumb + +> **Under AI velocity, plan for K=2 patch tags before first usable. The +> right discipline is fast experimental cycle: tag, audit in clean shell, +> patch the gap, re-tag.** +> +> Hard preconditions for the pattern to remain legitimate-and-disciplined: +> +> 1. **Honest CHANGELOG**: each broken tag named with its gap; no quiet +> retag. +> 2. **Audit-as-experiment**: the release-readiness audit must actually +> run commands in a clean shell, not read the README. +> 3. **K-bound convergence**: K ≤ 3 typical. If K > 3, the underlying +> enforcement-script layer is missing — stop tagging, fix the +> enforcement, re-tag once. + +### Recovery + +When F25 is firing (legitimate use): + +1. After each patch tag, file a finding naming the gap as an instance + of F19/F20/F26 (which enforcement layer was missing). +2. 
Update `scripts/doc-coverage.sh` or equivalent enforcement script + in the same PR as the patch, closing the gap structurally — not + just fixing the symptom. +3. Verify convergence: each subsequent patch should close a *different* + gap. Two consecutive patches closing the same gap = K-bound violated, + stop tagging. + +When F25 has degraded into anti-pattern (quiet retag / endless spiral): + +1. Audit CHANGELOG: name every prior broken state retroactively. +2. Locate the missing enforcement layer (the F20 instance the patches + are nominally closing); land it; re-tag once. +3. Communicate to users: "we shipped K tags rapidly; here is what + each one fixed; here is the structural fix we landed at v0.1.<K+1>". + +### Prevention going forward + +Adopt F25 as an explicit release pattern in `cto_operations_runbook.md` +§"Tagging policy" for any AI-velocity project: + +- Plan for K=2 patch tags in the release window. +- Spawn the release-readiness agent (F19) on **every** tag push, not + just the planned "final" one. +- CHANGELOG template includes a §"This tag is known-broken; upgrade to + v0.1.<N+1>" section for any tag the audit returned BLOCK on. +- README §"Honest status" names the current usable tag, not the latest + tag — users can find both with `git tag --sort=-creatordate`. + +This composes with F19 (release-readiness untested — F25's audit step +*is* an F19 prevention exercise) and F20 (constitution-vs-workflow +alignment — each patch is an F20 closure landed in the same PR as the +fix). + +--- + +## F26 — Recursive enforcement-script closure required (F1 Sediment Family, orthogonal-failure sub-form) + +> **F1 sub-form, confirmed**. Direct refinement of F20 (constitution-vs-workflow +> alignment). F20 closure is not one-shot; every enforcement layer needs its +> own paired review against orthogonal failure modes on the same code path. +> A doc-coverage gate hardened against pattern X can ship green against pattern +> Y on the same operation. 
Studio's `doc-coverage.sh` §6 evolution is the +> empirical substrate: two patches before the §6 gate stopped letting things +> through. + +### Definition + +An enforcement script (CI lint, doc-coverage gate, pre-commit hook) is +written or hardened to catch failure mode X on operation Op. The script +appears correct against X. The script ships green against failure mode Y +on the same operation Op — Y being a different shape of the same underlying +contract violation that X manifests. Each enforcement layer needs its own +paired orthogonal-failure review until the failure-mode class no longer +recurs. + +### Symptoms + +- An F20 closure (script hardened against bug pattern X) ships green + against bug pattern Y the same week +- The script's invariant is declared once ("no test failures shipped") but + the operation has multiple orthogonal failure shapes (FAILED summary line + emitted vs exit code only vs hang vs panic vs OOM) +- "Two strikes" pattern: same script, same invariant, two consecutive + bypasses through different failure modes +- Auditor's review of the script reads correct against the failure mode + that motivated the script's creation, but doesn't scan for orthogonal + failure modes on the same code path + +### Root cause + +Enforcement-script authors close the failure mode that triggered the script. +They do not scan the same operation for other failure modes that would +bypass the new check. The script's coverage is local to the bug; the +contract's coverage is global to the operation. Closing X without checking +Y leaves the layer half-closed. + +This is structurally a recursive application of F20 (constitution-vs-workflow: +mandate vs workflow has a gap). F26 is F20 applied to the workflow itself +— the enforcement layer is a workflow, the workflow has a gap, the gap +becomes a new finding, the new closure may itself have a gap. 
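
The gap between closing failure shape X and closing the operation's contract can be shown with two gates run against deliberately broken fixtures. This is a minimal Python sketch, not the project's `doc-coverage.sh`: the function names are hypothetical and the fixture strings are illustrative paraphrases of tool output. It covers only two of the shapes listed under Symptoms (FAILED summary line, exit code only); hang, panic, and OOM would need their own checks.

```python
def gate_failed_grep(exit_code, output):
    """Gate hardened against shape X only: looks for a FAILED summary line.

    Returns True when the gate lets the build through. The exit code is
    accepted but ignored, which is exactly the gap being illustrated.
    """
    return not any(line.startswith("test result: FAILED")
                   for line in output.splitlines())


def gate_paired(exit_code, output):
    """Paired gate: the exit code AND the summary-line check must both pass."""
    return exit_code == 0 and gate_failed_grep(exit_code, output)


# Deliberately broken fixtures, one per orthogonal failure shape (illustrative).
FIXTURES = {
    "failed_summary": (101, "test result: FAILED. 13 passed; 9 failed"),
    "exit_code_only": (101, "error: the lock file needs to be updated"),
}
for name, (code, out) in FIXTURES.items():
    print(name,
          "grep-only passes:", gate_failed_grep(code, out),
          "paired passes:", gate_paired(code, out))
```

The grep-only gate is correct against the failure that motivated it and still ships the second fixture green; asserting that every fixture is rejected turns the orthogonal-failure review into a repeatable test.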
+ +### Evidence + +Cobrust Studio `doc-coverage.sh` §6 evolution (2026-05-12; case study §3.5 +and §4.2): + +| Stage | Enforcement | Gap revealed | Closure tag | +|---|---|---|---| +| Pre-M4.1 | `grep '^test result' \| wc -l` | Counts both `ok` and `FAILED` as "result" lines; 9 failing tests shipped as "22 test groups all green" | A4 merge `8d5475f` shipped 9 failing integration tests under green-gate | +| M4.1 | `grep -c '^test result: FAILED'` | Misses non-zero exit without summary line (e.g. `cargo build --locked` exit 101 from lockfile mismatch) | v0.1.1 tag `15b6f46` shipped broken | +| v0.1.2 | Paired: `if ! cargo test ...` AND FAILED-grep | Both classes now caught | v0.1.2 tag `7ea9ae3` first usable | + +Each fix was complete against the bug class it was designed for. But the +enforcement layer had orthogonal failure modes (a FAILED summary line +emitted vs. not emitted on `cargo test --locked`) that needed their own paired review. + +A second F26 instance landed in M5.8: `doc-coverage.sh` §5b added `cargo +fmt --check` after Sarah-persona v2 caught local "6 gates passed" while +CI's separate `cargo fmt --check` job failed on the same SHA — the §5b +gate was missing because the §6 gate's authoring scope was "test-failure +shape", not "any orthogonal pre-merge check the project also runs in CI". +Same F26 pattern, different orthogonal failure axis. + +### Rule of thumb + +> **When closing an F20 instance, scan for orthogonal failure modes on the +> same code path BEFORE declaring the closure complete.** +> +> Ask explicitly: "could my enforcement layer still pass under a different +> failure shape of the same operation?" If yes, the closure is partial. 
+> +> Common orthogonal axes to enumerate per operation: +> +> | Operation | Orthogonal failure axes | +> |---|---| +> | `cargo test --locked` | exit code ≠ 0 / FAILED summary line / hang / panic / OOM / lockfile mismatch / build error | +> | Frontmatter SHA check | absent / placeholder string ("HEAD") / wrong hex shape / hex-shaped but unreachable / wrong-branch SHA | +> | README install command | URL 404 / URL redirect needs -fsSL / asset name typo / wrong-arch asset / missing dependency | +> | CI matrix job | platform missing / runner image deprecated / cache miss balloons time / artifact upload silently truncated | + +### Recovery + +When F26 fires (a closure shipped, then a sibling failure bypassed it): + +1. **Add the paired check to the same script in the same PR**. Don't + wait for the next sprint. +2. **Enumerate orthogonal failure axes for the operation** (use the table + above as starting point; extend per project). +3. **Add a "deliberately-broken-input test" in CI**: feed the enforcement + script a fixture for each orthogonal failure mode; assert exit ≠ 0. + This is the F20 §"Rule of thumb" layer-3 enforcement applied to F26. +4. **Document the closure as a finding**: `<script>-orthogonal-<mode>-closure.md` + naming the prior closure that missed the orthogonal mode. + +### Prevention going forward + +In every project's enforcement-script authoring SOP: + +1. **Script-level review checklist**: every new check has a §"Orthogonal + failure modes considered" comment block enumerating the operation's + failure axes and which the check covers vs which it explicitly delegates + to other checks. +2. **Layered review discipline**: F20 closure dispatches must include a + `[P7-ORTHOGONAL-SCAN]` step before declaring closure — scan the script + against the orthogonal-axes table for the operation it gates. +3. 
**Layer 4 enforcement** (the F20 §"Prevention going forward" layer-4 + extension Studio surfaced): "orthogonal-failure review against every + paired-gate enforcement" is a first-class layer in the enforcement + stack. + +F26 generalizes beyond test gates. Same logic applies to schema invariants +in frontmatter (different shape of violation), CI lint scripts (different +shape of bad input), and dispatch-prompt template fields (different shape +of agent shortcut). Anywhere a workflow is itself an enforcement layer, +F26 applies. + +--- + +## F27 — Continuous persona testing as dev-loop primitive (discipline, not failure) + +> **Discipline entry, not a defect pattern**. F27 catalogues the validated +> dev-loop form of ADSD v1.2.1's "persona simulation as 5th audit dimension". +> v1.2.1 introduced persona simulation as a *pre-release* audit pattern. +> Studio's M5 cycle validated the *continuous* variant: persona → concrete PR +> → land → re-spawn persona → verify gap closed → next PR. Pattern emerges +> as a dev-loop primitive, not a one-shot pre-release ceremony. + +### Definition + +Persona simulation is dispatched as **a dev-loop step** in the same +cadence as test-runs and lint-runs, not as a pre-release audit. Each +persona round produces concrete findings; each finding maps to exactly +one PR ({README edit, ADR addendum, finding, doc fix, code fix}); the +PR lands; a fresh persona round verifies the gap is closed. The loop +runs continuously across releases, not once-per-release. 
+ +Loop shape: + +``` +Persona Vn dispatched (Mei v1 / Aleksandr v1 / Sarah v1) + ↓ +Findings filed; each maps to exactly one PR + ↓ +PRs land within hours (not next release cycle) + ↓ +Persona V(n+1) dispatched against the same persona profile + ↓ +Verify prior gaps closed; surface new gaps (typically post-rewrite, the +README has new vocabulary that wasn't in V1) + ↓ +Next round of PRs +``` + +### Symptoms (legitimate form) + +- Multiple persona rounds per persona profile within a single release + window (Mei v1 → Mei v2 → Mei v3) +- Each persona round's findings have a 1:1 mapping to PRs that land + before the next round +- README / positioning evolves measurably between rounds (a Mei v1 + finding "what's an ADR?" → README v2 has §"Methodology vocabulary" + → Mei v2's response no longer flags vocabulary) +- Persona finding-rate decreases per round (V1 produces ~10 findings, + V2 ~5, V3 ~2; saturation) +- Persona dispatch is a P7 step in the dispatch SOP, not a pre-tag + ceremony + +### Root cause for the pattern's value + +Internal review agents (P7-REVIEW) maintain *internal* coherence — "is +the code sound?". Persona agents simulate *external* coherence — "would +a real user understand this?". Internal review cannot catch external- +coherence gaps because the internal reviewer has the same context as +the writer. Only an agent that starts cold (persona simulation with +explicit fresh-context constraint) can probe the external surface. + +Continuous (vs pre-release-only) cadence matters because each README +rewrite surfaces *new* external-coherence gaps. The vocabulary that +replaces the old vocabulary may itself be opaque to the persona. Only +re-running the persona against the new version closes the loop. 
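
Two of the symptoms above, the 1:1 finding-to-PR mapping and the falling finding-rate, are easy to track per round. The Python sketch below is illustrative only: `round_report` and the sample findings are hypothetical, the PR kinds come from the set listed in the Definition, and the threshold of 2 follows the saturation figure this entry gives.

```python
ALLOWED_PR_KINDS = {"README edit", "ADR addendum", "finding", "doc fix", "code fix"}


def round_report(findings, saturation_threshold=2):
    """Summarize one persona round.

    findings: list of (description, pr_kind) pairs; pr_kind is None when a
    finding maps to no action.
    """
    unmapped = [desc for desc, kind in findings if kind not in ALLOWED_PR_KINDS]
    return {
        "count": len(findings),
        "unmapped": unmapped,
        # Saturation signal: fewer new findings than the threshold.
        "saturated": len(findings) < saturation_threshold,
    }


# Hypothetical round; the PR kinds assigned here are assumptions.
v1 = [
    ("vocabulary: what's an ADR?", "README edit"),
    ("install path assumes rustup", "doc fix"),
    ("is this production-ready?", None),
]
report = round_report(v1)
print(report)
```

Recording one such report per round gives the loop its two control signals: unmapped findings point at a persona prompt producing generic feedback, and a saturated round is the cue to rotate profiles.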
+ +### Evidence + +Cobrust Studio M5 cycle, 2026-05-12 (case study §4.3): + +- **Mei v1** (Python data scientist target user) → 4 findings: vocabulary + confusion ("what's an ADR?"), missing "why not Linear/Notion?", + install path assumes `rustup`, "is this production-ready?". +- **README rewrite** (`339e1ab`): §"Methodology vocabulary" table added; + §"Why this and not Linear + git?" comparison; §"Honest status" + section naming patch dance; §"Looking for design partners" with + concrete asks. +- **Mei v2** → vocabulary confusion resolved; new gap: "Honest-status" + placement was buried mid-page; persona-naming was visible to users. +- **README v2 rewrite**: "Honest-status" moved to top of README; + persona-naming removed from public-facing copy. + +Aleksandr loop (Rust skeptic): + +- **Aleksandr v1** → F-05 dead deps catch (`unicode-normalization`, `uuid`, + `hex`, `tracing` lifted from upstream but unused); missing CI matrix. +- **2 PRs landed** (`339e1ab` dead-deps removal + `58cbe94` matrix CI). +- **Aleksandr v2** → next PR filed: Windows test matrix (Sarah's release.yml + added Windows tarball builds, but the test matrix only covered Linux + + macOS). + +Sarah loop (OSS evaluator / tech-lead): + +- **Sarah v1** → bus-factor flag, no SECURITY.md, no CONTRIBUTING.md. +- **PRs landed**: CI matrix + release pipeline + design-partner template. +- **Sarah v2** → verdict updated 6mo-watch → 3mo-watch; flagged R8 + (closed-feedback-loop, see F28) and R9 (README-vs-release drift, see + F1.4). + +All three persona profiles ran 2 rounds within a 4-hour window after +v0.1.2. Finding-rate decreased per round (V1: 10 items; V2: 4 items). +~15 concrete PR items in 90 min total persona dispatch time. ~7 landed +in the same wave. + +### Rule of thumb + +> **Persona simulation is a dev-loop primitive, not a pre-release +> ceremony. 
Run it continuously, with finding-rate as the saturation +> signal.** +> +> Five preconditions for legitimate continuous persona testing: +> +> 1. **Personas richly defined**: years of experience, prior burned-by +> experiences, current frustrations — not "a Python dev". +> 2. **Specific scenario per round**: "you have 30 min on HN", not +> "evaluate this README". +> 3. **Stay-in-character constraint** in prompt: no "as an AI..." +> breakouts. +> 4. **Structured output fields** aligned to persona's actual decision: +> "would I upvote?", "what would I PR if I had an afternoon?". +> 5. **1:1 finding-to-PR mapping**: each finding maps to exactly one of +> {README edit, ADR addendum, finding, doc fix, code fix}. Findings +> mapping to "no action / acknowledged" are research findings (file +> for case study), not product findings. +> +> Saturation signal: when finding-rate falls below ~2 new findings per +> persona round, the persona profile has reached coverage saturation +> for the current artifact. Pause this profile; rotate in a different +> persona; resume when the artifact changes substantially. + +### Recovery + +When the pattern degrades (persona output not driving PRs): + +1. Audit the finding→PR mapping. If >30% of findings map to "no action", + the persona prompt is producing generic feedback, not decision-bound + feedback. Tighten constraints 1-5 above. +2. Audit finding-rate. If V2 produces *more* findings than V1, the + intervening rewrite surfaced new gaps — that's the pattern working. + If V2 produces the same findings, the rewrite missed the gap. + +### Prevention going forward + +In every ADSD project's dispatch SOP: + +1. **Persona-as-step in dispatch template**: after every README / public + surface PR, dispatch the relevant persona profiles before the next + wave starts. +2. **Persona finding-rate tracked per round**: record rate-of-new-findings + in the case study or operations log; saturation triggers profile + rotation. +3. 
**Persona prompts versioned**: keep persona prompts in `templates/ + personas/<profile>.md` so V1 and V2 use the same persona profile + text; only the scenario differs. + +F27 composes with F28 (persona-simulation-as-validation epistemic risk) +— F27 is the legitimate dev-loop form; F28 names the failure mode that +emerges when F27's loop becomes the *substitute* for external grounding +rather than an internal-coherence check. The two entries must be read +together. + +--- + +## F28 — Persona-simulation-as-validation epistemic risk (closed-feedback-loop sub-form) + +> **Confirmed failure mode, surfaced by Studio Sarah v2 as risk R8**. F28 is +> the failure mode F27 (continuous persona testing) regresses into when +> persona simulation becomes the *primary* validation surface, with no +> out-of-distribution grounding from actual external users or independent +> teams. The feedback loop is internally consistent and externally untested. + +### Definition + +A project's release-readiness validation pipeline consists of: + +- Internal review agents (P7-REVIEW) +- Persona simulation agents (Mei / Aleksandr / Sarah) +- The project's own maintainer 守闸 + +All agents are spawned by the same maintainer / harness / methodology. No +agent is *out-of-distribution* with respect to the project's training context. +The persona-simulation loop (F27) iterates: persona → README rewrite → +persona again → README v2. Each iteration is internally coherent. None of +the iterations are validated against an actual external user, an independent +team running the methodology, or a real-world install attempt by someone +who has never read the project's prompts. + +The closed-feedback-loop is the failure mode: **the methodology that built +the tool is also auditing the tool, with no external grounding**. 
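
Whether a validation pipeline is closed-loop in the sense defined above can be stated as a property of who spawned each validator. The following Python sketch is an assumption-laden illustration: `ValidationSource`, `grounding_status`, and the status strings are hypothetical names, and "maintainer" is used as a proxy for the shared harness and methodology.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ValidationSource:
    name: str
    maintainer: str  # who spawned or recruited this validator
    is_human: bool   # True for a real external person, False for an agent


def grounding_status(sources, project_maintainer):
    """Classify a pipeline by its strongest out-of-distribution source."""
    external = [s for s in sources if s.maintainer != project_maintainer]
    if not external:
        return "closed-loop self-validation"
    if any(s.is_human for s in external):
        return "external user contact"
    return "external persona only"


# The pipeline from the Definition: every validator shares one maintainer.
pipeline = [
    ValidationSource("P7-REVIEW", "maintainer", False),
    ValidationSource("Mei persona", "maintainer", False),
    ValidationSource("maintainer gate", "maintainer", True),
]
print(grounding_status(pipeline, "maintainer"))
```

Note that the maintainer's own review counts as human but not as external, so the pipeline in the Definition still classifies as closed-loop.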
+ +### Symptoms + +- Persona rounds converge to "PASS-watch" verdicts without any actual + external user contact +- Persona-driven README rewrites optimize for what the persona simulation + responds to, not what an actual external reader would respond to +- Case study artifacts cite the persona output as validation evidence + ("Mei v2 confirms the README is now accessible"), with no follow-up + external user reading +- The project's methodology section (ADSD-style) cites the project itself + as the methodology's N=2 dogfood, and the project's own personas + validate the methodology — circular validation chain +- From a tech-lead vendor-eval standpoint: the project looks suspicious + because the methodology that built the tool is the same methodology + auditing the tool + +### Root cause + +Persona agents are LLMs simulating users. Their training distribution overlaps +with the maintainer's distribution. When the maintainer rewrites the README +to address persona findings, the rewrite's vocabulary and framing are +calibrated to *what the persona simulation responds to* — which is +calibrated to *what the underlying LLM family interprets as accessible*. + +This is structurally a closed feedback loop: the optimizer (maintainer + +LLM) and the evaluator (persona LLM) share latent space. Improvements +along the persona-validated axis may or may not correspond to improvements +along the actual-external-user axis. Without out-of-distribution input, +the loop converges to a local optimum in the shared latent space. + +The methodology (ADSD v1.2.1) explicitly recognizes this risk by treating +persona simulation as a *dev-loop variant* (F27) — but the case study's +"PASS-watch" verdicts and methodology's own self-validation through the +persona loop *do* exhibit the closed-loop pattern absent external grounding. 
+ +### Evidence + +Cobrust Studio Sarah v2 risk R8 (2026-05-12; persona dispatch artifact +referenced in case study §4.5 and §10): + +Sarah-persona v2 explicitly raised R8 as a tech-lead-vendor-eval finding: +the maintainer is running 2-round continuous persona tests as a substitute +for external review; personas are agents simulating users; the Mei v1 → +Mei v2 → README-rewrite-to-hide-persona-names loop is a closed feedback +system with no external grounding. ADSD calls this "dev-loop variant" and +treats it as legitimate. Sarah v2 says: from a tech-lead vendor-eval +standpoint, it's exactly the failure mode that makes a project suspicious +— the methodology that built the tool is also auditing the tool, with no +out-of-distribution input. + +Cobrust + Studio N=2 case-study chain (case study §10): + +> *"The ADSD methodology distilled from Cobrust (N=1) was the experimental +> substrate for Studio (N=2). The result confirms: core invariants hold +> under acceleration."* + +Both case studies were authored by the same maintainer's agent harness. +The "N=2 validation" is methodologically a closed-loop self-validation +until a third, independent team runs ADSD on a project the maintainer +did not author. + +### Rule of thumb + +> **Persona simulation cannot substitute for actual external grounding. +> A persona-validated artifact is internally coherent; it is not +> externally validated.** +> +> Required external-grounding sources, in increasing strength: +> +> 1. **Eventual external persona dispatch**: a persona agent dispatched +> by a *different maintainer* on a *different harness*, with no +> access to the project's own prompts. The "external persona" is +> still an LLM simulation, but it's no longer the same harness. +> 2. **Actual external user contact**: a real person (named, attributable, +> not anonymous) installs the project from a clean shell, reports +> back. One real user is worth ~10 persona rounds. +> 3. 
**N=3+ independent case studies**: another team adopts the +> methodology on a project the methodology's author did not touch, +> reports outcomes. N=3 is the minimum to move from "self-validated" +> to "externally validated". +> +> Until at least source 1 lands, every persona-simulation-driven +> validation claim must carry a **"closed-loop self-validation" caveat** +> in the case study, README, and methodology document. + +### Recovery + +When F28 is firing (a project relies on persona simulation as primary +validation): + +1. **Add closed-loop caveat to case study and README**: explicitly name + that the validation is internal; persona simulation is a dev-loop + step, not external grounding. +2. **Solicit at least one external user**: design-partner outreach, HN + post, conference demo. One real-user data point breaks the closed + loop. +3. **Track N independent applications of the methodology**. Cobrust + Studio is N=2 by the same maintainer. The methodology becomes + externally validated at N=3 by an independent team. +4. **Distinguish "internal coherence claim" from "external validation + claim"** in all marketing and case-study copy. F8 (marketing + overreach) applies: "persona-validated" is a weaker claim than + "user-validated" and must be qualified accordingly. + +### Prevention going forward + +In every ADSD project that uses continuous persona testing (F27): + +1. **Case-study template includes §"External grounding status"**: enumerate + which validation sources (persona only / external persona / external + user / independent team) have been exercised. +2. **README "Validated against" claims must cite the strongest source + exercised**, not the most flattering. "Validated against persona + simulation" is honest; "user-validated" without an actual user is + F8-class overreach. +3. **Active outreach for external grounding**: a design-partner template, + a SECURITY.md, a CONTRIBUTING.md — anything that channels external + contact. 
Bus-factor 1 projects (Studio is one) cannot escape F28 + without active outreach. +4. **Methodology document carries the same caveat**: ADSD itself must + name "N=2 validated by same maintainer; awaiting N=3 independent + adoption" until that adoption lands. + +F28 is the structural risk that makes F27 (continuous persona testing) +both valuable and dangerous. F27 is the right discipline for internal +coherence; F28 is what happens when F27's loop is treated as external +validation. The two entries close together: F27 names the legitimate +form, F28 names the failure mode, and the prevention is to always run +F27 with the F28 caveat attached. + +--- + +## F1.5 — Test-corpus structural blind spot (re-derive path gap) [CANDIDATE] (F1 Sediment Family, coverage sub-form) + +> **Candidate entry — confirmed once in Cobrust Studio M6 cycle (2026-05-12).** F1 +> Sediment Family sub-form. Same root as F1.0 (declared invariants without +> enforcement), but the enforcement mechanism (unit tests) exists and passes — +> the gap is that the tests don't cover the *path being claimed* in the ADR, only +> the *happy path* that bypasses the claimed path. Promote from candidate if a +> second instance is observed in a different ADSD project. + +### Definition + +An ADR declares a wire-format or protocol invariant that involves a +**re-construct / re-derive / re-open** path: *"packed field X enables +re-derive"* / *"packed salt enables re-construct the key at restart"* / +*"serialised blob enables re-validate at next login"*. The unit test corpus +tests the happy path — `seal()` then `open()` using the same in-memory key +object — which passes trivially. No test exercises the re-derive path: extract +field X from blob → re-derive key from extracted X + passphrase → open blob +with re-derived key. 
Bugs in the packed field's content (wrong value packed +vs value used for derivation) are structurally invisible to the unit test +corpus, because the happy path never exercises the extraction step. + +### Symptoms + +- ADR §"Wire format" or §"Decision" contains language like: "packed salt + enables re-derive at restart", "serialised token ID enables re-validate", + "blob header encodes the derivation parameters for session recovery" +- Unit tests for the module pass 100% (all happy-path; no re-derive tests) +- Integration tests pass (they exercise API-level round-trips but use the + same session without a drop+re-login) +- Bug manifests in Playwright E2E or production when a real user drops their + session and re-enters their passphrase — the re-derive produces a different + key, AEAD open fails, user sees "wrong passphrase" on a correct passphrase +- The bug is NOT findable by code review alone — the implementation looks + correct (it packs a salt, it derives from a salt) without tracing the + specific values + +### Root cause + +The test corpus was designed to verify the cryptographic operations (derive, +seal, open, tamper-detect) in isolation. None of the tests simulate the +sequence of operations a real user performs across a session boundary: seal +with key K → drop K from memory → extract packed-salt from blob → re-derive +K' from passphrase + packed-salt → open blob with K'. The test corpus is +correct against its own test design; the test design is incomplete against the +ADR's claimed invariant. + +This is a structural gap, not an oversight: the developer who wrote the tests +wrote correct tests for the function signatures. The gap is that the ADR's +"packed salt enables re-derive" claim implies a test pattern that the natural +test design does not produce unless explicitly prompted by the ADR's Done-means +criteria. + +### Recovery + +1. When ADR §"Done means" is written, scan §"Wire format" and §"Decision" + for any "packed X enables re-Y" language. 
+2. For each such claim, add a required test that exercises the re-Y path + end-to-end: seal → extract X from blob output → re-Y(passphrase, X) → + open. This test should pass before Phase 2 is declared complete. +3. If the bug is already shipped, the fix is to correct the packed value + (ensure packed value = value used for derivation, not a newly-generated + value) and add the re-derive test. + +### Evidence + +Cobrust Studio M6 (2026-05-12): `SessionKey::seal()` generated a fresh random +salt on each call and packed it into the blob header, but `SessionKey` was +derived from a different salt at login time. The 6 unit tests in ADR-0007's +Done-means tested seal+open on the same key — none tested re-derive from blob. +Playwright login-aead.spec.ts test 2 (restart + re-login) caught the bug the +same day as v0.2.0. Fixed at commit `3753a2b` (`SessionKey` now carries its +`derive_salt`; `seal()` packs `self.salt`). New test +`seal_then_re_derive_then_open_round_trips` locks the contract. + +Case study: `cobrust-studio-experience.md §11.3`. + +### Prevention going forward + +When writing ADR §"Done means" for any module with a wire format: + +1. Scan §"Wire format" for "packed X enables re-Y" clauses +2. For each such clause, add a required Done-means test of the form: + ``` + <module>_packed_<X>_enables_re_<Y>: + derive key K from (passphrase, fresh-salt) + sealed = K.seal(payload) + extracted_X = sealed[..len(X)] + K2 = re_derive(passphrase, extracted_X) + assert K2.open(sealed) == payload + ``` +3. This test class is orthogonal to tamper-detection tests (which also + flip bits in X but don't re-derive from the flipped X) and to happy-path + seal-open tests. All three are necessary; none is sufficient. + +The general principle: **any ADR claim that references "packed field enables +reconstruct" implies a test that exercises the reconstruction path, not just +the forward path**. 
If the Done-means criteria don't name this test explicitly, +the claim is declared but not enforced — F1 Sediment Family. + +--- + +## F29 — Cross-platform runner-pool dependency as a release-infra failure mode [CANDIDATE] + +> **Candidate entry — confirmed twice in Cobrust Studio (v0.1.3 and v0.2.0, +> both 2026-05-12) and closed at v0.2.1.** Distinct from F1.0 (code/doc +> invariants without enforcement) because the failure is not in code or +> documentation but in the infrastructure layer that executes the release. +> Promote from candidate if a second instance is observed in a different +> ADSD project. + +### Definition + +A release workflow declares N build targets (platforms, architectures, +OS variants) via a CI matrix. The workflow code is correct. One or more +targets depend on a **GitHub-hosted runner pool** (or equivalent +infrastructure service) with insufficient queue depth, unpredictable +availability, or a specific runner generation that has been deprioritised +in the provider's scheduling. Multiple consecutive releases ship N-1 (or +fewer) successful artifacts for the affected targets, despite no code +changes between attempts. + +### Symptoms + +- `cargo build --target=X` succeeds locally on the developer's machine +- CI build for the same `--target=X` completes 0/N or times out when using + runner label `old-generation` (e.g. `macos-13` Intel) +- The release workflow shows the target job as "queued" for 30+ minutes + before eventually timing out or completing with a stale artifact +- The pattern recurs across multiple release tags (same missing target, + same runner label) +- No code change produces a fix; only a runner label change resolves it + +### Root cause + +GitHub-hosted runner pools for older or less-popular runner generations +(`ubuntu-20.04`, `macos-13`, `windows-2019`) have smaller pool sizes than +current-generation runners. 
During peak CI periods or for projects with +infrequent cache warming, the queue wait time can exceed job timeouts. +The release workflow is correct; the infrastructure serving it is the +bottleneck. + +For macOS specifically: GitHub maintains separate pools for Intel +(`macos-13`) and Apple Silicon (`macos-14`/`macos-15`). Apple Silicon runners +are currently more abundant and have shorter queue times. Rust supports +cross-compilation from Apple Silicon → Intel via +`--target=x86_64-apple-darwin` natively, making the pool substitution +transparent to the build output. + +### Recovery + +1. Identify the stalling target (check CI logs for long queue times vs + build times). +2. Check if the runner can be substituted with a higher-availability + alternative while keeping the `--target` flag unchanged (e.g. + `macos-13` → `macos-14 --target=x86_64-apple-darwin`). +3. Verify that the language's toolchain supports the cross-compile path + (Rust: yes for Apple Silicon → Intel via LLVM; Go: yes; Python: depends + on C extensions). +4. Patch release.yml with the new runner label. Ship as a patch tag with + no code changes (infrastructure-only patch is acceptable; see §11.4 of + `cobrust-studio-experience.md`). +5. Validate by observing the next release: if all N targets ship first-time + green, the runner pool was the root cause. + +### Evidence + +Cobrust Studio: +- v0.1.3 (2026-05-12): `x86_64-apple-darwin` build job stalled on + `macos-13` runner; release shipped 4/5 platform tarballs. +- v0.2.0 (2026-05-12): same pattern; 4/5 tarballs. +- Sarah v3 audit predicted: "if this stalls again, consider whether the + cross-compile setup needs to change." +- v0.2.1 (2026-05-12): `.github/workflows/release.yml` patched to + `runner: macos-14` (with existing `--target=x86_64-apple-darwin` flag). + **All 5 platforms green first-time.** Runner pool was confirmed root cause. + +Case study: `cobrust-studio-experience.md §11.4`. 
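The first two recovery steps (find the targets that still sit on an older runner generation, then consider a substitution) can be sketched as a small check. This is a minimal sketch, not part of the Cobrust Studio tooling: the `legacy_entries` helper and the list-of-dicts matrix shape are hypothetical, and the legacy set contains only the labels this entry names, so it should be re-checked against the provider's current documentation.

```python
# Labels this F29 entry names as older generations (assumption: extend or
# replace this set after checking the provider's runner documentation).
LEGACY_RUNNERS = {"ubuntu-20.04", "macos-13", "windows-2019"}


def legacy_entries(matrix):
    """Return (runner, target) pairs whose runner label is in LEGACY_RUNNERS.

    `matrix` is a hypothetical in-memory form of the release matrix:
    a list of dicts with "runner" and "target" keys.
    """
    return [
        (entry["runner"], entry["target"])
        for entry in matrix
        if entry["runner"] in LEGACY_RUNNERS
    ]
```

Each flagged pair is a candidate for the substitution described in recovery step 2; whether cross-compilation is actually viable still has to be verified per toolchain.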
+ +### Prevention going forward + +When writing a multi-platform release workflow for the first time: + +1. Check the GitHub Actions runner availability documentation for each + runner label in the matrix. Note which generations are "current" vs + "legacy." +2. For any target that uses a legacy runner generation, prefer cross-compilation + from a current-generation runner if the toolchain supports it. +3. Add a comment in release.yml citing the runner substitution rationale: + ```yaml + # macos-14 used instead of macos-13 to avoid Intel runner pool + # queue stalls. --target=x86_64-apple-darwin provides cross-compile. + # See cobrust-studio v0.2.1 / ADSD F29. + runner: macos-14 + ``` +4. The "no CODE tag→patch dance" rule applies: infrastructure-only patches + between release tags are acceptable when the audit predicted the failure + mode. CHANGELOG the change explicitly as "infrastructure patch, no code + changes." + +**ADSD §4 ("tag → audit → patch") extends to the infrastructure layer.** +A release-infra failure mode (runner pool stall, action version deprecation, +Docker image removal) is a legitimate release regression that warrants its +own patch tag with honest CHANGELOG. It is NOT a code quality failure; it +DOES count against the release readiness gate if it blocks one or more +declared targets from shipping. + +--- + +## F30 — Projection docs outrank the canonical snapshot (F1 Sediment Family, doc-authority sub-form) + +> **F1 sub-form, confirmed.** The repo declares a canonical state record, but day-to-day editing happens in the more visible projection docs (`README`, agent guidance, operator docs). The projections drift ahead of the canonical source, so future agents inherit a persuasive but false narrative. + +### Definition + +A project has one document intended to be the canonical statement of current repo state, phase, verification surface, or next target. Other docs are projection layers derived from it. 
In practice, contributors update the projection docs first because they are easier to notice or more user-facing. The canonical doc lags, and the declared authority order silently inverts. + +### Symptoms + +- `README` says the project is in phase N while snapshot still says phase N-1 +- Agent guidance lists commands or constraints not yet reflected in the canonical state doc +- A close-out report claims docs are synced, but only the projection docs changed +- Future agents cold-start from the stale canonical doc and make wrong dispatch or review decisions + +### Root cause + +Two compounding patterns: + +1. **Visibility bias**: people naturally update the doc they are already reading (`README`, CLAUDE-like guidance, release notes), not the denser state ledger. +2. **Truthfulness is declared, not operationalized**: the project says "snapshot is canonical" but does not enforce update order or require a synchronized close-out set. + +This is an F1-family pattern because the authority rule exists only as prose until the workflow makes it executable. + +### Evidence + +ADD Studio methodology codified the countermeasure explicitly after repeated emphasis during Wave 1 → Wave 2 close-out: +- `docs/agent/snapshot.md` named as canonical repo-state record +- `README.md` and `CLAUDE.md` named as projection layers +- close-out rule: update snapshot first, then synchronize the affected projections before the work is considered done + +This is important because the repo's current phase, verification commands, and next target all appear in multiple top-level docs. Without a canonical-first rule, the most visible doc would naturally outrank the truth source. + +### Rule of thumb + +> **If a project has a canonical state document, every projection doc must be downstream of it in both authority and update order.** +> +> Close-out sequence: +> 1. Update canonical snapshot/state ledger +> 2. Update every dependent projection doc +> 3. Run doc verification +> 4. 
Only then report completion + +### Recovery + +When F30 fires: + +1. Stop editing projection docs in isolation. +2. Reconcile the canonical state doc against repo reality first. +3. Diff every named projection doc against the canonical state and remove contradictions. +4. Add an explicit close-out checklist entry so future sprints cannot skip the sync. + +### Prevention going forward + +For any ADSD project that uses ADRs/findings/snapshot discipline: + +1. Name the canonical doc explicitly in both the snapshot template and the top-level guidance docs. +2. Add a close-out rule that starts with the canonical doc and ends only when projections match. +3. Make documentation verification part of the required gate surface for doc-affecting work. +4. Treat "docs truthfulness" as a deliverable owned during dispatch, not as polish after the merge. + +--- + ## Catalogue maintenance This catalogue is alive — add to it as you encounter new failure modes. @@ -1075,7 +2589,8 @@ When adding: 3. Evidence section MUST cite a specific case-study artifact (not "I think we hit this once") 4. Submit via PR; reviewer should verify the failure mode is - distinct from existing F1-F11 + distinct from existing F1-F30 (and from existing F1 Sediment + Family sub-forms F1.0-F1.5, F16, F17, F18, F19, F20, F21) If a failure mode becomes obsolete (e.g. tool now prevents it automatically), don't delete — mark as "superseded by <SOP>" and link. diff --git a/plugins/adsd/skills/agent-driven-development/reference/prompt-engineering-patterns.md b/plugins/adsd/skills/agent-driven-development/reference/prompt-engineering-patterns.md new file mode 100644 index 0000000..1bd6a7c --- /dev/null +++ b/plugins/adsd/skills/agent-driven-development/reference/prompt-engineering-patterns.md @@ -0,0 +1,280 @@ +--- +name: Prompt engineering patterns for sub-agent dispatch +description: Distilled prompt engineering patterns from Anthropic and OpenAI public guidance, adapted for ADSD sub-agent dispatch context. 
Covers chain-of-thought, few-shot, structured output, role priming, anti-hallucination guards. +type: reference +version: 1.0.0 +date: 2026-05-12 +status: active +relates_to: [skill:SKILL.md §"Two-phase dispatch", templates:dispatch-prompt-p7.md + dispatch-prompt-p9.md] +--- + +# Prompt engineering patterns + +> When you spawn a sub-agent, the prompt is your only lever. A sub-agent with a poorly written prompt cannot recover at runtime — there's no second chance. This reference codifies the patterns Anthropic and OpenAI publicly recommend, adapted to ADSD's sub-agent dispatch context. + +## When this applies + +- Writing any P9 or P7 dispatch prompt +- Designing a new sub-agent role +- Diagnosing why a sub-agent went off the rails +- Auditing existing dispatch templates for gaps + +Not for: writing user-facing docs, marketing copy, or release notes (different audience, different goals). + +## Core principles (Anthropic + OpenAI consensus) + +### P1 — Explicit role + scope first + +Start every sub-agent prompt with: + +``` +You are <ROLE> delivering <SPECIFIC SCOPE>. +Your deliverable is <ARTIFACT>. +You do NOT do <BLACKLIST: e.g. "modify files outside crates/cobrust-mir/">. +Time budget: <DURATION>. +``` + +Role priming concentrates the agent's behavior. Without it, generic Claude / GPT defaults take over — verbose, hedging, broad-scoped. + +### P2 — Required-reads section before mission + +Sub-agents have no prior context. List the exact files they must read before starting work: + +``` +REQUIRED READS (read all before any tool call): +- /abs/path/to/relevant/ADR.md +- /abs/path/to/spec.md +- /abs/path/to/existing/test_surface.rs +``` + +Absolute paths. Not "look in the docs folder." + +### P3 — Mission expressed as a verifiable claim + +Bad: "Implement stdin support." + +Good: "Implement `input(prompt: str) -> str` such that the test corpus in `crates/cobrust-stdlib/tests/input_corpus.rs` passes with 0 failures." 
+ +A verifiable claim has: a specific surface (`input(...)`), an acceptance signal (test corpus), and a measurable outcome (0 failures). The agent can self-check progress against this. + +### P4 — Anti-hallucination guards + +Three guards Anthropic specifically calls out: + +1. **"Cite or admit"**: "When making a quantitative claim (test count, file count, SHA), include the verifying command in the same response. If you don't have the command result, say 'unverified'." +2. **"No phantom paths"**: "If you reference a file path, only reference paths returned by an actual tool call this session. Don't invent plausible-looking paths." +3. **"Match-or-mismatch"**: "When the user provides a value and you echo it back, ensure character-for-character match. If your output differs even by case or whitespace, flag the discrepancy explicitly." + +### P5 — Output structure first, content second + +OpenAI's structured-output discipline: define the output schema before describing what goes in each field. + +Bad: "Return a completion report with all the details." + +Good: +``` +Report format (must include these exact section headers): + +[P7-MISSION-COMPLETION] +- Branch: <name> +- Final SHA: <40-char hex> +- Gate verdicts: + - fmt: <pass | fail with count> + - clippy: <pass | fail with count> + - build: <pass | fail> + - test: <pass | fail with count> + - doc-coverage: <pass | fail> +- Empirical evidence: + - <command>: <output snippet> +- Followups: <bullet list> +- Escalations: <bullet list or "none"> +``` + +The agent fills the slots. Structure resists drift. + +## Pattern catalogue + +### PT1 — Chain-of-thought elicitation + +Anthropic + OpenAI: explicit "think step by step" works on hard tasks but hurts on simple ones. + +Use for: design decisions, debugging, ambiguous specs. +Skip for: well-scoped impl tasks (TDD pair handles the structure). + +Form: +``` +Before writing code, write 3-5 sentences answering: +1. What does the spec actually require? +2. 
What's the simplest implementation that meets the spec? +3. What edge cases must the impl handle? +4. What's an alternative implementation, and why was it rejected? +5. What's the test that would catch a regression here? + +Then write the code. +``` + +### PT2 — Few-shot examples (for output format) + +When you want the sub-agent's output to follow a specific format, **show 1-2 examples in the prompt itself**. + +Form: +``` +Example completion report (do NOT copy these literal values; use this STRUCTURE): + +[P7-EXAMPLE-COMPLETION] +- Branch: feature/foo-bar +- Final SHA: abcd1234abcd1234abcd1234abcd1234abcd1234 +- Gate verdicts: + - fmt: pass (0 diff) + - clippy: pass (0 warnings) + - ... +``` + +Anti-pattern: telling without showing. "Return a structured report" without an example produces freeform prose. + +### PT3 — Role priming with negative example + +Form: +``` +You are a P9 tech lead. Your deliverable is Task Prompts for P7 sub-agents. + +You do NOT: +- Edit source files yourself (that's P7 work) +- Run cargo test on feature branches (that's P7 work) +- Push to remote on feature branches (that's P7's deliverable) +- Ask the user about decisions covered by the constitution + +You DO: +- Draft ADRs for design decisions (~3 hr opus solo work) +- Spawn P7 sub-agents for impl +- Review their completion reports +- Merge cleanly after independent gate verification +``` + +The negative blacklist concentrates the agent's behavior more reliably than the positive whitelist alone. 
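The PT3 form can be assembled mechanically so that the blacklist is never left out. The following is a minimal sketch in Python; `render_role_block` and its parameter names are hypothetical and are not part of the ADSD templates.

```python
def render_role_block(role, deliverable, do_not, do):
    """Render a PT3-style role block as plain prompt text.

    Hypothetical helper: refuses to render when the blacklist is empty,
    since PT3 relies on the negative list to narrow the agent's scope.
    """
    if not do_not:
        raise ValueError("PT3 requires at least one 'You do NOT' item")
    lines = [
        f"You are {role}. Your deliverable is {deliverable}.",
        "",
        "You do NOT:",
    ]
    lines += [f"- {item}" for item in do_not]
    lines += ["", "You DO:"]
    lines += [f"- {item}" for item in do]
    return "\n".join(lines)
```

Raising on an empty blacklist is a design choice for this sketch; a template that only warns would also be reasonable.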
+ +### PT4 — Structured output via JSON / YAML block + +When downstream parsing is needed (CTO will `yq` the result): + +```` +After the human-readable report, append a YAML block with these fields: + +```yaml +status: success | partial | failed +final_sha: <40-char hex> +gates: + fmt: { pass: true, count: 0 } + clippy: { pass: true, count: 0 } + build: { pass: true } + test: { pass: true, total: 2611, failed: 0 } + doc_coverage: { pass: true } +followups: + - <string> +escalations: + - <string> +``` +```` + +Both human-readable and machine-parseable. + +### PT5 — Refusal / escalation conditions + +Tell the agent when to STOP and report instead of continuing: + +``` +STOP and report to CTO if any of: +- The ADR's "Done means" is unreachable with the spec as written +- The spec contradicts another ADR — escalate the conflict +- 600s+ stream-idle on cargo test (likely environment issue) +- > 50 retry attempts on any single failing test (root cause is deeper) + +In these cases, report partial work + ask for guidance. Don't loop indefinitely. +``` + +### PT6 — Self-verification block + +Before submitting completion, the agent must verify its own claims (Anthropic anti-hallucination): + +``` +VERIFICATION (run these commands and paste raw output before submitting): +- git log --oneline main..HEAD | head -5 +- cargo test --workspace --locked 2>&1 | tail -3 +- bash scripts/doc-coverage.sh 2>&1 | tail -3 +- grep -c "F<N>" reference/failure-modes-catalogue.md (if claiming N entries) +``` + +The agent's claim counts only if the verification command output is pasted alongside it.
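On the receiving side, the PT6 rule can be partially mechanised. The sketch below assumes a hypothetical `check_report` helper and the `Final SHA:` line from the P5 report format; it only checks that each verification command appears in the report text, so it cannot prove that genuine output was pasted.

```python
import re

# A full git SHA-1 as required by the report format: 40 lowercase hex chars.
SHA_RE = re.compile(r"[0-9a-f]{40}")


def check_report(report, required_commands):
    """Return a list of problems found in a completion report (empty = ok).

    Hypothetical reviewer-side check, not an ADSD tool.
    """
    problems = []
    sha = None
    for line in report.splitlines():
        # Drop list-bullet prefixes such as "- " before matching the field.
        stripped = line.strip().lstrip("- ").strip()
        if stripped.lower().startswith("final sha:"):
            sha = stripped.split(":", 1)[1].strip()
    if sha is None:
        problems.append("missing 'Final SHA' line")
    elif not SHA_RE.fullmatch(sha):
        problems.append("Final SHA is not 40-char lowercase hex")
    for cmd in required_commands:
        if cmd not in report:
            problems.append(f"verification command not shown: {cmd}")
    return problems
```

A non-empty result is a reason to bounce the report back to the agent; an empty result still leaves the independent gate verification to the reviewer.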
+ +## ADSD-specific patterns + +### PT7 — Difficulty self-rating (per D-matrix) + +Every P9 dispatch must include: + +``` +DIFFICULTY-RATING (mandatory): +- D-RATING: D0 / D1 / D2 / D3 / D4 / D5 +- RATIONALE: <2-3 sentences citing specific crates/files/edge cases> +- MODEL-DEV: sonnet | opus +- MODEL-TEST: sonnet | opus | n/a +- PAIR: yes (D1/D2/D3/D5) | no (D0/D4) +``` + +This pattern catches model-tier mismatches before agent spawn. + +### PT8 — Identity hygiene (F21 closure) + +For agents producing persistent artifacts: + +``` +Sign commits and documents with your SESSION ID, not your role handle alone. + +Wrong: `Co-Authored-By: review-claude` +Right: `Co-Authored-By: review-claude (session 4bb35f43)` + +Wrong: "— CTO, 2026-05-12" +Right: "— CTO session XYZ, 2026-05-12" +``` + +### PT9 — Release-readiness guard (F19 closure) + +For any commit touching user-facing artifact: + +``` +Before declaring this Tx done, spawn a P7 sonnet release-readiness agent +to clean-shell-verify install commands in this commit's changes. See +cto_operations_runbook.md §"Release-readiness agent". + +Do NOT self-attest "the install command works" without independent +verification. F17/F19 closure mechanism. 
+``` + +## Pitfalls + +| Pitfall | Symptom | Fix | +|---|---|---| +| Generic role ("you are a helpful AI") | Sub-agent over-explains, hedges, asks unnecessary questions | Replace with specific role + scope (P1) | +| Mission as a verb without scope | Sub-agent expands work indefinitely | Reframe as verifiable claim (P3) | +| No required-reads list | Sub-agent makes up plausible-but-wrong file paths | Required-reads with absolute paths (P2) | +| "Be thorough" | Long, low-density output | Demand structured output (P5 / PT4) | +| No escalation conditions | Sub-agent retries forever | PT5 explicit STOP conditions | +| No verification block | Claims drift from reality (F17) | PT6 mandatory verification | +| No difficulty rating | Model tier mismatch (F20 family) | PT7 mandatory D-rating | +| Generic sign-off "review-claude" | Cross-session identity overload (F21) | PT8 session-ID stamping | + +## Anti-patterns (cross-reference to F-patterns) + +- **F13 (plan-vs-execute coherence gap)**: prompt says "do X carefully" but doesn't specify what "carefully" looks like in execution. Fix: PT5 + PT6 explicit verification. +- **F17 (KPI self-report fidelity)**: agent claims completed work without verification. Fix: PT6 mandatory verification block. +- **F19 (install-not-tested)**: prompt asks agent to write docs but doesn't require execution verification. Fix: PT9 release-readiness guard. +- **F21 (cross-session identity overload)**: agent signs with bare role handle. Fix: PT8 session-ID stamping. 
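Several of the pitfalls above leave a textual trace, so a dispatch prompt can be linted before spawn. This is a minimal sketch under stated assumptions: `lint_prompt` is hypothetical, and the marker strings are taken from the example forms in P2, PT5, PT6, and PT7, so a template that words these blocks differently would need different markers.

```python
# Marker text (from the example forms in this reference) mapped to the
# pitfall reported when the marker is absent from a prompt.
REQUIRED_MARKERS = {
    "REQUIRED READS": "P2: no required-reads list",
    "STOP and report": "PT5: no escalation conditions",
    "VERIFICATION": "PT6: no verification block",
    "DIFFICULTY-RATING": "PT7: no difficulty rating",
}


def lint_prompt(prompt):
    """Return the pitfalls whose marker text is missing from the prompt."""
    return [
        message
        for marker, message in REQUIRED_MARKERS.items()
        if marker not in prompt
    ]
```

A substring check catches only omitted blocks; it says nothing about whether the block's content is adequate, which remains a review task.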
+ +## Cross-references + +- `templates/dispatch-prompt-p9.md` — P9 template applies these patterns +- `templates/dispatch-prompt-p7.md` — P7 template applies these patterns +- `reference/failure-modes-catalogue.md` — anti-patterns these prompts mitigate +- `reference/evals-first-development.md` — verification block ↔ eval delta +- Anthropic prompt engineering guide: https://www.anthropic.com/engineering +- OpenAI prompt engineering best practices: https://platform.openai.com/docs/guides diff --git a/scripts/doc-coverage.sh b/scripts/doc-coverage.sh new file mode 100755 index 0000000..26c23ea --- /dev/null +++ b/scripts/doc-coverage.sh @@ -0,0 +1,130 @@ +#!/usr/bin/env bash +# scripts/doc-coverage.sh — ADSD repo doc-coverage gate +# +# Enforces ADSD §3 documentation mandate on this repo itself: +# - Every docs/human/zh/*.md has a parallel docs/human/en/*.md (and vice versa) +# - Parallel files have matching filenames +# - Reference files in plugins/adsd/skills/agent-driven-development/reference/ +# have YAML frontmatter +# +# Exits non-zero on coverage failure. Pre-commit hook + CI both should run this. + +set -euo pipefail + +REPO_ROOT="${1:-$(git rev-parse --show-toplevel)}" +cd "$REPO_ROOT" + +# Color output (skip if not a TTY) +if [ -t 1 ]; then + RED='\033[0;31m' + GREEN='\033[0;32m' + YELLOW='\033[1;33m' + NC='\033[0m' +else + RED='' GREEN='' YELLOW='' NC='' +fi + +errors=0 + +echo "ADSD doc-coverage gate" +echo "----------------------" + +# ---------------------------------------------------------------------------- +# Inv 1: docs/human/zh/<file> ⟺ docs/human/en/<file> parity +# ---------------------------------------------------------------------------- +echo "" +echo "[Inv 1] Bilingual parity (zh ⟺ en)" + +if [ ! -d docs/human/zh ] || [ ! 
-d docs/human/en ]; then + echo -e " ${YELLOW}Warning: docs/human/{zh,en} missing — skipping parity check${NC}" +else + # Build sorted lists + zh_files=$(find docs/human/zh -maxdepth 2 -name '*.md' -exec basename {} \; | sort) + en_files=$(find docs/human/en -maxdepth 2 -name '*.md' -exec basename {} \; | sort) + + # Diff zh against en + while IFS= read -r f; do + [ -z "$f" ] && continue + if [ ! -f "docs/human/en/$f" ]; then + echo -e " ${RED}error${NC}: docs/human/zh/$f has no parallel docs/human/en/$f" + errors=$((errors + 1)) + fi + done <<< "$zh_files" + + # Diff en against zh + while IFS= read -r f; do + [ -z "$f" ] && continue + if [ ! -f "docs/human/zh/$f" ]; then + echo -e " ${RED}error${NC}: docs/human/en/$f has no parallel docs/human/zh/$f" + errors=$((errors + 1)) + fi + done <<< "$en_files" + + if [ "$errors" -eq 0 ]; then + zh_count=$(echo "$zh_files" | grep -c . || true) + en_count=$(echo "$en_files" | grep -c . || true) + echo -e " ${GREEN}OK${NC}: $zh_count zh + $en_count en files, all parallel" + fi +fi + +# ---------------------------------------------------------------------------- +# Inv 2: reference files have YAML frontmatter +# ---------------------------------------------------------------------------- +echo "" +echo "[Inv 2] Reference file frontmatter" + +ref_dir="plugins/adsd/skills/agent-driven-development/reference" +if [ -d "$ref_dir" ]; then + # Count Inv 2 failures separately so the OK line reflects this check only + inv2_errors=0 + for f in "$ref_dir"/*.md; do + [ -f "$f" ] || continue + first_line=$(head -1 "$f") + if [ "$first_line" != "---" ]; then + echo -e " ${RED}error${NC}: $f missing YAML frontmatter (first line not '---')" + errors=$((errors + 1)) + inv2_errors=$((inv2_errors + 1)) + fi + done + + if [ "$inv2_errors" -eq 0 ]; then + ref_count=$(find "$ref_dir" -name '*.md' | wc -l | tr -d ' ') + echo -e " ${GREEN}OK${NC}: $ref_count reference file(s) all have frontmatter" + fi +fi + +# ---------------------------------------------------------------------------- +# Inv 3: ADR files (if any) zero-padded monotonic +# 
---------------------------------------------------------------------------- +echo "" +echo "[Inv 3] ADR numbering (zero-padded monotonic)" + +adr_dir="docs/agent/adr" +if [ -d "$adr_dir" ]; then + adr_count=$(find "$adr_dir" -name '[0-9][0-9][0-9][0-9]-*.md' | wc -l | tr -d ' ') + if [ "$adr_count" -gt 0 ]; then + # Just verify each ADR filename starts with 4 digits + bad_count=$(find "$adr_dir" -name '*.md' -not -name '_*' \ + | grep -cv '/[0-9][0-9][0-9][0-9]-' || true) + if [ "$bad_count" -gt 0 ]; then + echo -e " ${RED}error${NC}: $bad_count ADR file(s) not zero-padded 4-digit prefixed" + errors=$((errors + 1)) + else + echo -e " ${GREEN}OK${NC}: $adr_count ADR file(s) properly numbered" + fi + else + echo -e " ${YELLOW}info${NC}: no ADRs yet (acceptable for a fresh repo)" + fi +else + echo -e " ${YELLOW}info${NC}: docs/agent/adr/ doesn't exist (acceptable)" +fi + +# ---------------------------------------------------------------------------- +# Summary +# ---------------------------------------------------------------------------- +echo "" +echo "----------------------" +if [ "$errors" -eq 0 ]; then + echo -e "${GREEN}doc-coverage: PASS${NC}" + exit 0 +else + echo -e "${RED}doc-coverage: FAIL ($errors errors)${NC}" + exit 1 +fi