From eb6f34befd409f6a4b1bf7f45dc90eb3b7fab304 Mon Sep 17 00:00:00 2001 From: Jon Langevin Date: Sat, 25 Apr 2026 12:59:56 -0400 Subject: [PATCH 1/4] Tighten subagent-driven-development reviewer prompts MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Field experience running subagent-driven-development on a Phase 0 foundations branch (separate-Steam-SKU companion-game scaffold) showed that reviewers reflexively approve once each spec acceptance bullet has a matching file in the diff — without running the test the spec describes, without applying every bug class, and without promoting Minor determinism issues to Important. Copilot subsequently caught seven issues that internal review had missed: - test using a real network call to localhost - hardcoded magic number duplicated against a constant in another module - Windows path-separator bug in a HUD-file detector - FORBIDDEN list scope mismatched its own doc comment - test invocation command copy-paste error - Cargo.toml missing [[test]] entries (deferred-then-cascaded) These are bugs the existing rubric should have caught — symmetry violations, comment-vs-code drift, missing edge cases, magic literals. The gap was reviewer discipline: prompts delegated the bug-class checklist to skills/requesting-code-review/SKILL.md by reference, and reviewers glossed past it without applying every class. Changes: 1. skills/subagent-driven-development/code-quality-reviewer-prompt.md — inline a four-step required workflow (read git show , run the tests yourself, full bug-class scan, Minor->Important promotion). The bug-class checklist still lives in SKILL.md but the discipline to apply it on every dispatch is now in the prompt template. 2. skills/subagent-driven-development/spec-reviewer-prompt.md — inline a five-step workflow that requires walking every Acceptance bullet and locating its corresponding diff change, running TDD tests yourself (red must fail; green must pass), checking scope creep, and grepping for reuse-of-existing-helpers. 3. skills/requesting-code-review/SKILL.md — add two new bug classes to the checklist: - Test hermeticity / non-determinism (real network/DB/clock/fs in tests; promote to Important even if the test passes today). - Cross-platform / portability (path-separator literals, bash-only shell syntax, Unix-only env-var assumptions). The four-step / five-step inlines duplicate spirit-of-rule that's also in SKILL.md; this is intentional. Field experience: the reviewer skill file gets read once at agent spawn and then forgotten on subsequent review rounds. Inlining the discipline in the prompt template makes it unavoidable on every dispatch. Co-Authored-By: Claude Opus 4.7 (1M context) --- skills/requesting-code-review/SKILL.md | 2 + .../code-quality-reviewer-prompt.md | 59 ++++++++++++++++++- .../spec-reviewer-prompt.md | 55 +++++++++++++++-- 3 files changed, 111 insertions(+), 5 deletions(-) diff --git a/skills/requesting-code-review/SKILL.md b/skills/requesting-code-review/SKILL.md index 6db906a..1924767 100644 --- a/skills/requesting-code-review/SKILL.md +++ b/skills/requesting-code-review/SKILL.md @@ -62,6 +62,8 @@ The reviewer must explicitly scan for each of these bug classes on every diff. T | **Scope-vs-dispatch drift** | The implementation ships materially MORE or LESS than the dispatch asked for. Both directions are findings. (See the dedicated gate above.) | | **Shortcut / band-aid fix** | The fix suppresses a symptom without addressing the root cause. Examples: nil-guard on a value that should never be nil (ask why it is nil); retry loop masking a race condition; special-casing one known-bad input instead of fixing the upstream producer; deleting a failing test instead of fixing the underlying code. Ask: "if the root cause were restored and only the symptom suppressor kept, would the test still pass?" If yes, the test proves nothing about correctness. | | **One-sided boundary wiring** | A change touches an interface boundary (producer→consumer, caller→callee, plugin→host, sender→handler — see `agents/boundary-classes.md` for the canonical list) but only one side is implemented or tested. The other side is a stub, TODO, or unchecked assumption. Both sides of every boundary must be wired, and at least one test must exercise the full crossing end-to-end (not mock both sides independently). | +| **Test hermeticity / non-determinism** | A test makes a real network call (e.g. `connect("localhost:50051")`), depends on a real DB without a `t.Skip`-when-missing guard, writes to a shared filesystem path that races with parallel tests, depends on the system clock without injection, or relies on external services that may flake. CI should be reproducible from a clean checkout; tests that touch the world without mediation are landmines. Promote to **Important** even if the test currently passes — non-determinism only fails on the day you most need green CI. | +| **Cross-platform / portability** | A path uses `/` literals or string-`contains("\\")` checks where `std::path::Path::components` (Rust) or `filepath.Join`/`filepath.Separator` (Go) would be portable. Code spawns shell commands with bash-only syntax. Code reads or hardcodes Unix-style env vars (`HOME`, `PATH`) without checking the OS. Asset paths assume one filesystem layout. CI matrices commonly hit Linux first, so these regressions slip past local development. | For each finding, the reviewer cites the bug class explicitly. This makes patterns visible across reviews and lets the team identify recurring weaknesses. diff --git a/skills/subagent-driven-development/code-quality-reviewer-prompt.md b/skills/subagent-driven-development/code-quality-reviewer-prompt.md index 34bcba9..adddb5b 100644 --- a/skills/subagent-driven-development/code-quality-reviewer-prompt.md +++ b/skills/subagent-driven-development/code-quality-reviewer-prompt.md @@ -2,7 +2,7 @@ Use this template when dispatching a code quality reviewer subagent. -**Purpose:** Verify implementation is well-built (clean, tested, maintainable) +**Purpose:** Verify implementation is well-built (clean, tested, maintainable) and find bugs the implementer missed **Only dispatch after spec compliance review passes.** @@ -30,9 +30,66 @@ Task tool (superpowers:code-reviewer): For the bug-class checklist and verdict vocabulary, use `skills/requesting-code-review/SKILL.md`. + ## Adversarial framing — required mindset + + You are NOT validating that the code matches the spec. You are looking + for bugs the implementer missed. Find at least three things wrong, even + if they seem minor. Bias toward finding issues. If fewer than three are + found, explicitly document every bug-class check you ran and what you + verified. Reflexive approval is forbidden. + + ## Required workflow (DO ALL FOUR — do not skip any) + + 1. **Read the diff with `git show `** (NOT `git diff HEAD~1` — + the latter pulls in commits that may not belong to this task). + + 2. **Run the relevant tests yourself** in the same turn. Do NOT trust + the implementer's "tests pass" report: + - For Go diffs: `go test -race -tags=e2e .//...` + - For Rust diffs: `cargo test --manifest-path --features ` + - For tests-only diffs: still run the test you reviewed; capture pass/fail. + + 3. **Run the full bug-class checklist from + `skills/requesting-code-review/SKILL.md`**. Every class. State for + each: "ran, found X" or "ran, nothing found". + + 4. **Apply Minor→Important promotion rules.** Promote any of these + from Minor to Important (blocking) before issuing your verdict: + - Test makes a real network call, real DB call without skip-when-missing, + real filesystem write that could race, or any non-hermetic dependency + that could cause CI flakes. + - Magic number duplicated across modules where the source-of-truth + constant should be referenced (or exposed for tests to reference). + - Hardcoded path separator (`/`), endpoint, or platform-specific + assumption that breaks on Windows/macOS/Linux when it should be + portable. + - Test-only deferral noted as "fix later in Task X" that is the third + (or later) such deferral against the same module — call it out as + cascade risk before it becomes a fix-cycle in a later task. + + ## Output format + + Per-finding inline at file:line, using the format from + `skills/requesting-code-review/SKILL.md`. End with one verdict: + SHIP-IT | FIX-FORWARD | REQUEST-CHANGES | REVERT-AND-REWRITE. + When notified that a task is spec-approved and ready for quality review: - Notify the implementer when quality issues are found - Notify the orchestrator (team-lead on Claude Code) when the task is fully approved ``` **Code reviewer returns:** Strengths, Issues (Critical/Important/Minor), Assessment + +## Why this template inlines key rules + +The bug-class checklist and adversarial framing live in +`skills/requesting-code-review/SKILL.md`, but field experience shows that +delegating these via reference alone produces reviewers who "review the +work" rather than adversarially scan it. The reviewer reads the spec, +sees the diff matches, and signs off — without running tests, without +applying every bug class, and without promoting Minor determinism issues +to Important. + +The four-step workflow (read diff → run tests → full checklist → promote) +is the floor, not the ceiling. Inlining it here forces every dispatch to +include the discipline rather than trusting the agent to look it up. diff --git a/skills/subagent-driven-development/spec-reviewer-prompt.md b/skills/subagent-driven-development/spec-reviewer-prompt.md index 48d60b9..f7fe040 100644 --- a/skills/subagent-driven-development/spec-reviewer-prompt.md +++ b/skills/subagent-driven-development/spec-reviewer-prompt.md @@ -30,12 +30,59 @@ Task tool (general-purpose): [From implementer's report] - Verify by reading actual code — do not trust the report. Compare - implementation to spec line by line. Report: - - ✅ Spec compliant (after code inspection) - - ❌ Issues found: [list specifically what's missing or extra, with file:line references] + ## Required workflow (DO ALL FIVE) + + 1. **Read the actual diff** with `git show `. Do NOT trust + the implementer's summary. Do NOT use `git diff HEAD~1` (it pulls + in unrelated commits). + + 2. **Walk every "Acceptance" bullet from the task description**. + For each, locate the corresponding change in the diff: + - Required file created/modified? Confirm path. + - Required function/RPC/test added? grep the diff for it. + - Required behavior asserted? Read the test body, not just the file. + - Required side-effect (DB write, AGE edge, forensic_evidence row, + event emission)? grep for the helper call by name. + + If any acceptance bullet has no corresponding change: **MISSING**. + + 3. **For TDD tasks, run the test yourself**. Do not trust "I ran it": + - "RED" task (failing test): `go test ./...` (or equivalent) MUST FAIL + when you run it. If it passes, the test isn't asserting what the + spec says it asserts. + - "GREEN" task (implementation makes prior RED test pass): run the + same command; the prior failing test MUST now pass. + - Capture and paste the actual command output in your review. + + 4. **Diff scope check** — list everything the diff does that the spec + did NOT ask for. Even helpful additions are SCOPE CREEP findings; + flag them. Implementer may justify and you accept on merits. + + 5. **Reuse-vs-reinvention check** — if the spec says "emit X via the + existing pipeline", grep the diff to confirm the existing helper + was called. A new parallel helper or inline raw SQL where a typed + helper exists = reject as scope creep / drift. + + ## Output + + - ✅ Spec compliant (after running steps 1–5; cite which were verified) + - ❌ Issues found — list per finding with file:line: + - **MISSING** — spec asked for X, diff doesn't deliver it + - **SCOPE CREEP** — diff includes Y not in spec + - **DRIFT** — diff fakes/reinvents an existing helper When notified that a task is ready for review: - Notify the code-reviewer when spec compliance passes - Notify the implementer when issues are found ``` + +## Why these steps are inlined + +Field experience: spec-reviewers who "read the spec and check the diff" +reflexively approve once each acceptance bullet has a matching file in +the diff — without verifying the file actually does what the spec says, +without running the test the spec describes, and without grepping for +the side-effects the spec demands. + +The five-step workflow is the floor. The reviewer must produce evidence +(command output, grep matches) for each step before issuing approval. From a84ae4f6cb9e539616d54550e7f1590dedc74fda Mon Sep 17 00:00:00 2001 From: Jon Langevin Date: Sat, 25 Apr 2026 14:51:31 -0400 Subject: [PATCH 2/4] Reviewer prompts: DM-routing + bookkeeping + vacuous-assertion class MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Phase 1 field-data refinements after running the tightened rubric on a 13-task implementation phase. Three patterns surfaced that the existing prompt didn't address explicitly: 1. **DM-routing slips.** Code-reviewer occasionally DM'd the wrong implementer when fixes were needed — typically the one who wrote a prior task's test fixture, rather than the current task's owner. Add: explicit "DM the implementer who CURRENTLY OWNS the task per TaskList owner field" rule. Same rule added to spec-reviewer prompt. 2. **Bookkeeping completion gap.** Code-reviewer consistently marked the "Review quality" task completed but missed flipping the "Implement" task to completed, leaving team-lead with a stragglers queue. Add: explicit "mark BOTH tasks completed" rule with note that code-reviewer is the only role that flips the Implement task. 3. **Vacuous-assertion bug class.** Phase 1 reviewer-loop caught a test whose assertion `p < Floor || p > Ceiling` was tautologically true for all the inputs it exercised — the function under test was correct but the test proved nothing. Existing checklist had "Test-name-vs-body mismatch" but didn't explicitly cover the tautology case. Add as a 14th bug class with explicit examples. Field results from Phase 1 with the tightened rubric (pre-refinement): - 4 fix cycles (3 caught by code-reviewer, 1 by spec-reviewer running tests directly) - Error-swallowing class caught twice (Tasks 1.7 and 1.13) by the same pattern recognition — pure validation of the existing checklist - Code-reviewer cited actual `go test` / `cargo test` output on every approval, walked failure modes one-by-one, and consistently produced audit-grade output Co-Authored-By: Claude Opus 4.7 (1M context) --- skills/requesting-code-review/SKILL.md | 1 + .../code-quality-reviewer-prompt.md | 13 +++++++++++-- .../spec-reviewer-prompt.md | 4 ++-- 3 files changed, 14 insertions(+), 4 deletions(-) diff --git a/skills/requesting-code-review/SKILL.md b/skills/requesting-code-review/SKILL.md index 1924767..2f9a004 100644 --- a/skills/requesting-code-review/SKILL.md +++ b/skills/requesting-code-review/SKILL.md @@ -64,6 +64,7 @@ The reviewer must explicitly scan for each of these bug classes on every diff. T | **One-sided boundary wiring** | A change touches an interface boundary (producer→consumer, caller→callee, plugin→host, sender→handler — see `agents/boundary-classes.md` for the canonical list) but only one side is implemented or tested. The other side is a stub, TODO, or unchecked assumption. Both sides of every boundary must be wired, and at least one test must exercise the full crossing end-to-end (not mock both sides independently). | | **Test hermeticity / non-determinism** | A test makes a real network call (e.g. `connect("localhost:50051")`), depends on a real DB without a `t.Skip`-when-missing guard, writes to a shared filesystem path that races with parallel tests, depends on the system clock without injection, or relies on external services that may flake. CI should be reproducible from a clean checkout; tests that touch the world without mediation are landmines. Promote to **Important** even if the test currently passes — non-determinism only fails on the day you most need green CI. | | **Cross-platform / portability** | A path uses `/` literals or string-`contains("\\")` checks where `std::path::Path::components` (Rust) or `filepath.Join`/`filepath.Separator` (Go) would be portable. Code spawns shell commands with bash-only syntax. Code reads or hardcodes Unix-style env vars (`HOME`, `PATH`) without checking the OS. Asset paths assume one filesystem layout. CI matrices commonly hit Linux first, so these regressions slip past local development. | +| **Vacuous assertion** | A test whose name promises behavior X but whose body asserts a tautology (e.g., `assert p < Floor \|\| p > Ceiling` on a value that doesn't trigger either bound, so the OR is always true; `assert result == result`; `assert err == nil \|\| err != nil`). The function under test may be correct, but the test proves nothing. Vacuously-passing tests give false confidence and are worse than no test. Either rewrite the assertion to actually exercise the named path, or rename the test to match what it does. | For each finding, the reviewer cites the bug class explicitly. This makes patterns visible across reviews and lets the team identify recurring weaknesses. diff --git a/skills/subagent-driven-development/code-quality-reviewer-prompt.md b/skills/subagent-driven-development/code-quality-reviewer-prompt.md index adddb5b..990bf15 100644 --- a/skills/subagent-driven-development/code-quality-reviewer-prompt.md +++ b/skills/subagent-driven-development/code-quality-reviewer-prompt.md @@ -9,9 +9,9 @@ Use this template when dispatching a code quality reviewer subagent. **Agent Teams additions:** When using Agent Teams, also add to the prompt: - Wait for DMs from spec-reviewer saying a task is spec-approved -- DM implementer when quality issues are found +- DM the implementer who CURRENTLY OWNS the task (per `TaskList` `owner` field), not the implementer who wrote the original test fixture or whoever happened to be idle. The owner is the person responsible for fixing it. - DM the orchestrator (team-lead) when task is fully approved -- Use TaskUpdate to mark "Review quality:" tasks as completed +- After approving, **mark BOTH tasks completed**: the "Review quality: …" task AND the corresponding "Implement: …" task. Code-reviewer is the only role that flips the Implement task to completed; missing this leaves the team-lead with bookkeeping stragglers. ``` @@ -66,6 +66,15 @@ Task tool (superpowers:code-reviewer): - Test-only deferral noted as "fix later in Task X" that is the third (or later) such deferral against the same module — call it out as cascade risk before it becomes a fix-cycle in a later task. + - **Vacuous assertion** — a test whose name promises behavior X but + whose body's assertion is tautologically true (e.g., `assert p < Floor || p > Ceiling` + on a value `0.135` that doesn't trigger either bound, so the OR is + always satisfied; or `assert err == nil || err != nil`). The function + under test may be correct, but the test doesn't prove it. A vacuously-passing + test is worse than no test — it gives false confidence. Promote to + Important and require the implementer to either rewrite the assertion + to actually exercise the named path, or rename the test to match what + it does. ## Output format diff --git a/skills/subagent-driven-development/spec-reviewer-prompt.md b/skills/subagent-driven-development/spec-reviewer-prompt.md index f7fe040..578fddf 100644 --- a/skills/subagent-driven-development/spec-reviewer-prompt.md +++ b/skills/subagent-driven-development/spec-reviewer-prompt.md @@ -8,8 +8,8 @@ Use this template when dispatching a spec compliance reviewer subagent. **Agent Teams additions:** When using Agent Teams, also add to the prompt: - Wait for DMs from implementers saying a task is ready - DM code-reviewer when spec compliance passes -- DM implementer when issues are found -- Use TaskUpdate to mark "Review spec:" tasks as completed +- DM the implementer who CURRENTLY OWNS the task (per `TaskList` `owner` field) when issues are found, not whoever sent the most recent DM +- Use TaskUpdate to mark "Review spec:" tasks as completed (only the Review spec task; never the Implement task — code-reviewer marks that one) ``` From 71ae9da5e6bd7856ee1fe912819ce4d3655a5494 Mon Sep 17 00:00:00 2001 From: Jon Langevin Date: Sat, 25 Apr 2026 16:50:24 -0400 Subject: [PATCH 3/4] Address Copilot review on PR #21 MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - SKILL.md: clarify cross-platform examples — PATH exists on Windows with `;` separator; HOME / XDG_* / SHELL are the truly Unix-only vars. Mention the PATH-separator gotcha explicitly. - spec-reviewer-prompt.md Step 1: prefer `git diff ..` when the task spans multiple commits; `git show ` covers only the tip and would miss earlier task commits. - code-quality-reviewer-prompt.md Step 1: same clarification on range diff vs single-commit show. - code-quality-reviewer-prompt.md Step 2: drop hard-coded `-tags=e2e` for Go. The project's actual tag set is project-specific; forcing e2e tags can run non-hermetic tests, contradicting the rubric's own hermeticity rule. Examples now show: with project's actual tags; without tags; Rust with conditional features. Co-Authored-By: Claude Opus 4.7 (1M context) --- skills/requesting-code-review/SKILL.md | 2 +- .../code-quality-reviewer-prompt.md | 17 +++++++++++------ .../spec-reviewer-prompt.md | 8 +++++--- 3 files changed, 17 insertions(+), 10 deletions(-) diff --git a/skills/requesting-code-review/SKILL.md b/skills/requesting-code-review/SKILL.md index 2f9a004..f908b8f 100644 --- a/skills/requesting-code-review/SKILL.md +++ b/skills/requesting-code-review/SKILL.md @@ -63,7 +63,7 @@ The reviewer must explicitly scan for each of these bug classes on every diff. T | **Shortcut / band-aid fix** | The fix suppresses a symptom without addressing the root cause. Examples: nil-guard on a value that should never be nil (ask why it is nil); retry loop masking a race condition; special-casing one known-bad input instead of fixing the upstream producer; deleting a failing test instead of fixing the underlying code. Ask: "if the root cause were restored and only the symptom suppressor kept, would the test still pass?" If yes, the test proves nothing about correctness. | | **One-sided boundary wiring** | A change touches an interface boundary (producer→consumer, caller→callee, plugin→host, sender→handler — see `agents/boundary-classes.md` for the canonical list) but only one side is implemented or tested. The other side is a stub, TODO, or unchecked assumption. Both sides of every boundary must be wired, and at least one test must exercise the full crossing end-to-end (not mock both sides independently). | | **Test hermeticity / non-determinism** | A test makes a real network call (e.g. `connect("localhost:50051")`), depends on a real DB without a `t.Skip`-when-missing guard, writes to a shared filesystem path that races with parallel tests, depends on the system clock without injection, or relies on external services that may flake. CI should be reproducible from a clean checkout; tests that touch the world without mediation are landmines. Promote to **Important** even if the test currently passes — non-determinism only fails on the day you most need green CI. | -| **Cross-platform / portability** | A path uses `/` literals or string-`contains("\\")` checks where `std::path::Path::components` (Rust) or `filepath.Join`/`filepath.Separator` (Go) would be portable. Code spawns shell commands with bash-only syntax. Code reads or hardcodes Unix-style env vars (`HOME`, `PATH`) without checking the OS. Asset paths assume one filesystem layout. CI matrices commonly hit Linux first, so these regressions slip past local development. | +| **Cross-platform / portability** | A path uses `/` literals or string-`contains("\\")` checks where `std::path::Path::components` (Rust) or `filepath.Join`/`filepath.Separator` (Go) would be portable. Code spawns shell commands with bash-only syntax. Code reads Unix-only env vars (`HOME`, `XDG_*`, `SHELL`) without falling back on Windows. Code splits `PATH` on `:` rather than the platform separator (`;` on Windows). Asset paths assume one filesystem layout. CI matrices commonly hit Linux first, so these regressions slip past local development. | | **Vacuous assertion** | A test whose name promises behavior X but whose body asserts a tautology (e.g., `assert p < Floor \|\| p > Ceiling` on a value that doesn't trigger either bound, so the OR is always true; `assert result == result`; `assert err == nil \|\| err != nil`). The function under test may be correct, but the test proves nothing. Vacuously-passing tests give false confidence and are worse than no test. Either rewrite the assertion to actually exercise the named path, or rename the test to match what it does. | For each finding, the reviewer cites the bug class explicitly. This makes patterns visible across reviews and lets the team identify recurring weaknesses. diff --git a/skills/subagent-driven-development/code-quality-reviewer-prompt.md b/skills/subagent-driven-development/code-quality-reviewer-prompt.md index 990bf15..fd32340 100644 --- a/skills/subagent-driven-development/code-quality-reviewer-prompt.md +++ b/skills/subagent-driven-development/code-quality-reviewer-prompt.md @@ -40,14 +40,19 @@ Task tool (superpowers:code-reviewer): ## Required workflow (DO ALL FOUR — do not skip any) - 1. **Read the diff with `git show `** (NOT `git diff HEAD~1` — - the latter pulls in commits that may not belong to this task). + 1. **Read the diff for the WHOLE task:** + - When `BASE_SHA` and `HEAD_SHA` are both known: `git diff ..` (covers all commits in the task) + - When you only have the tip commit: `git show ` covers ONE commit. Check `git log ..` first; if the task spans multiple commits, prefer the range diff. + - Do NOT use `git diff HEAD~1` blindly — it picks an arbitrary parent that may not be the task's base. 2. **Run the relevant tests yourself** in the same turn. Do NOT trust - the implementer's "tests pass" report: - - For Go diffs: `go test -race -tags=e2e .//...` - - For Rust diffs: `cargo test --manifest-path --features ` - - For tests-only diffs: still run the test you reviewed; capture pass/fail. + the implementer's "tests pass" report. Use the project's actual test + invocation, NOT a hard-coded one — read the project's CONTRIBUTING.md / + README / Makefile if unsure: + - Go projects with build tags: `go test -race -tags= .//...` — only include `-tags=e2e` if the project actually uses that tag AND you intend to run the integration tests. Don't force non-hermetic e2e runs that the rubric's hermeticity rule says to flag. + - Go projects without build tags: `go test -race .//...` + - Rust: `cargo test --manifest-path ` (add `--features ` only when the diff actually changed features-gated code). + - For tests-only diffs: run only the changed test file; capture pass/fail. 3. **Run the full bug-class checklist from `skills/requesting-code-review/SKILL.md`**. Every class. State for diff --git a/skills/subagent-driven-development/spec-reviewer-prompt.md b/skills/subagent-driven-development/spec-reviewer-prompt.md index 578fddf..aee6f0c 100644 --- a/skills/subagent-driven-development/spec-reviewer-prompt.md +++ b/skills/subagent-driven-development/spec-reviewer-prompt.md @@ -32,9 +32,11 @@ Task tool (general-purpose): ## Required workflow (DO ALL FIVE) - 1. **Read the actual diff** with `git show `. Do NOT trust - the implementer's summary. Do NOT use `git diff HEAD~1` (it pulls - in unrelated commits). + 1. **Read the actual diff for the WHOLE task**: + - If `BASE_SHA` and `HEAD_SHA` are both known: `git diff ..` (covers all commits in the task) + - If only the latest commit is provided: `git show ` covers a single commit; if you suspect the task spans multiple commits (look at `git log ..`), prefer the range diff + - Do NOT trust the implementer's summary + - Do NOT use `git diff HEAD~1` blindly — it picks an arbitrary parent that may not be the task's base 2. **Walk every "Acceptance" bullet from the task description**. For each, locate the corresponding change in the diff: From 85af672cb5fa98ee9a3cf1f0a0caab809b961bac Mon Sep 17 00:00:00 2001 From: Jon Langevin Date: Sat, 25 Apr 2026 17:07:46 -0400 Subject: [PATCH 4/4] Address Copilot review round 2 on PR #21 MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - SKILL.md (subagent-driven-development) line 202: code-reviewer now marks BOTH tasks completed (Review quality + Implement). Reconciles the canonical workflow with the prompt-template addition. Field experience showed leaving Implement tasks at in_progress strands bookkeeping for the orchestrator. - SKILL.md (requesting-code-review) Vacuous-assertion class: replaced the misleading example. Now distinguishes (a) syntactic tautology (`err == nil || err != nil`, `p >= Floor || p < Floor`) from (b) trivially-satisfied-for-every-test-input (`composed_probability == Floor` on inputs that never trigger clamping; `spawn_count == observed_count` where both come from the same in-memory list — this was the actual Phase 2 finding pattern). - code-quality-reviewer-prompt.md: same vacuous-assertion fix as SKILL.md table. - code-quality-reviewer-prompt.md "tests-only" command: replaced the vague "run only the changed test file" with language-specific smallest-scope examples — `go test .// -run ` or full package for Go; `cargo test --test ` or `-p ` for Rust. Co-Authored-By: Claude Opus 4.7 (1M context) --- skills/requesting-code-review/SKILL.md | 2 +- skills/subagent-driven-development/SKILL.md | 2 +- .../code-quality-reviewer-prompt.md | 21 ++++++++++--------- 3 files changed, 13 insertions(+), 12 deletions(-) diff --git a/skills/requesting-code-review/SKILL.md b/skills/requesting-code-review/SKILL.md index f908b8f..e6b0129 100644 --- a/skills/requesting-code-review/SKILL.md +++ b/skills/requesting-code-review/SKILL.md @@ -64,7 +64,7 @@ The reviewer must explicitly scan for each of these bug classes on every diff. T | **One-sided boundary wiring** | A change touches an interface boundary (producer→consumer, caller→callee, plugin→host, sender→handler — see `agents/boundary-classes.md` for the canonical list) but only one side is implemented or tested. The other side is a stub, TODO, or unchecked assumption. Both sides of every boundary must be wired, and at least one test must exercise the full crossing end-to-end (not mock both sides independently). | | **Test hermeticity / non-determinism** | A test makes a real network call (e.g. `connect("localhost:50051")`), depends on a real DB without a `t.Skip`-when-missing guard, writes to a shared filesystem path that races with parallel tests, depends on the system clock without injection, or relies on external services that may flake. CI should be reproducible from a clean checkout; tests that touch the world without mediation are landmines. Promote to **Important** even if the test currently passes — non-determinism only fails on the day you most need green CI. | | **Cross-platform / portability** | A path uses `/` literals or string-`contains("\\")` checks where `std::path::Path::components` (Rust) or `filepath.Join`/`filepath.Separator` (Go) would be portable. Code spawns shell commands with bash-only syntax. Code reads Unix-only env vars (`HOME`, `XDG_*`, `SHELL`) without falling back on Windows. Code splits `PATH` on `:` rather than the platform separator (`;` on Windows). Asset paths assume one filesystem layout. CI matrices commonly hit Linux first, so these regressions slip past local development. | -| **Vacuous assertion** | A test whose name promises behavior X but whose body asserts a tautology (e.g., `assert p < Floor \|\| p > Ceiling` on a value that doesn't trigger either bound, so the OR is always true; `assert result == result`; `assert err == nil \|\| err != nil`). The function under test may be correct, but the test proves nothing. Vacuously-passing tests give false confidence and are worse than no test. Either rewrite the assertion to actually exercise the named path, or rename the test to match what it does. | +| **Vacuous assertion** | A test whose body asserts something tautologically true so it cannot fail meaningfully. Two flavors: (1) syntactic tautology — `assert err == nil \|\| err != nil`, `assert result == result`, `assert p >= Floor \|\| p < Floor`. (2) trivially satisfied for every test input — assertion is non-tautological in general but the test inputs never exercise the failing branch (e.g., `assert composed_probability == Floor` on inputs that never trigger clamping; `assert spawns_observed == spawns_recorded` where both sides come from the same in-process list). The function under test may be correct, but the test proves nothing. Worse than no test — gives false confidence. Either rewrite the assertion to exercise the named path, or rename the test to match what it does. | For each finding, the reviewer cites the bug class explicitly. This makes patterns visible across reviews and lets the team identify recurring weaknesses. diff --git a/skills/subagent-driven-development/SKILL.md b/skills/subagent-driven-development/SKILL.md index 481e367..9e27d48 100644 --- a/skills/subagent-driven-development/SKILL.md +++ b/skills/subagent-driven-development/SKILL.md @@ -199,7 +199,7 @@ Agent tool: - DM the implementer who built it with specific issues - Wait for fix and re-review 4. If approved: - - Mark the "Review quality:" task as completed via TaskUpdate + - Mark BOTH tasks completed via TaskUpdate: the "Review quality:" task AND the corresponding "Implement:" task. Code-reviewer is the only role that flips the Implement task to completed; missing this leaves the orchestrator with bookkeeping stragglers. - DM team-lead: "Task N fully approved" ## Team Conventions diff --git a/skills/subagent-driven-development/code-quality-reviewer-prompt.md b/skills/subagent-driven-development/code-quality-reviewer-prompt.md index fd32340..165571d 100644 --- a/skills/subagent-driven-development/code-quality-reviewer-prompt.md +++ b/skills/subagent-driven-development/code-quality-reviewer-prompt.md @@ -52,7 +52,7 @@ Task tool (superpowers:code-reviewer): - Go projects with build tags: `go test -race -tags= .//...` — only include `-tags=e2e` if the project actually uses that tag AND you intend to run the integration tests. Don't force non-hermetic e2e runs that the rubric's hermeticity rule says to flag. - Go projects without build tags: `go test -race .//...` - Rust: `cargo test --manifest-path ` (add `--features ` only when the diff actually changed features-gated code). - - For tests-only diffs: run only the changed test file; capture pass/fail. + - For tests-only diffs: run the smallest relevant test scope. Go: `go test .// -run ` to scope to specific tests, or the full package if multiple. Rust: `cargo test --manifest-path --test ` to scope to a specific integration test, or `cargo test -p ` for the full crate. Capture pass/fail in the review. 3. **Run the full bug-class checklist from `skills/requesting-code-review/SKILL.md`**. Every class. State for @@ -71,15 +71,16 @@ Task tool (superpowers:code-reviewer): - Test-only deferral noted as "fix later in Task X" that is the third (or later) such deferral against the same module — call it out as cascade risk before it becomes a fix-cycle in a later task. - - **Vacuous assertion** — a test whose name promises behavior X but - whose body's assertion is tautologically true (e.g., `assert p < Floor || p > Ceiling` - on a value `0.135` that doesn't trigger either bound, so the OR is - always satisfied; or `assert err == nil || err != nil`). The function - under test may be correct, but the test doesn't prove it. A vacuously-passing - test is worse than no test — it gives false confidence. Promote to - Important and require the implementer to either rewrite the assertion - to actually exercise the named path, or rename the test to match what - it does. + - **Vacuous assertion** — two flavors: (a) syntactic tautology like + `assert err == nil || err != nil` or `assert p >= Floor || p < Floor`; + (b) assertion is non-tautological in general but the chosen test inputs + trivially satisfy it (e.g., `assert composed_probability == Floor` when + no input combination actually triggers clamping; `assert spawn_count == + observed_count` when both values come from the same in-memory list). + The function under test may be correct, but the test doesn't prove it. + Worse than no test — gives false confidence. Promote to Important and + require the implementer to either rewrite the assertion to actually + exercise the named path, or rename the test to match what it does. ## Output format