Tighten subagent-driven-development reviewer prompts by intel352 · Pull Request #21 · GoCodeAlone/autonomous-dev-kit

intel352 · 2026-04-25T17:00:30Z

Summary

Field experience running subagent-driven-development on a Phase 0 + Phase 1 + Phase 2 sequence (Core Runner companion-game scaffold for the Core Dump MMORPG) showed that internal reviewers were systematically passing PRs that Copilot subsequently flagged. The existing rubric in skills/requesting-code-review/SKILL.md was comprehensive — twelve bug classes with adversarial framing — but reviewers were reflexively approving once each spec acceptance bullet had a matching file in the diff, without running the test the spec described, without applying every bug class, and without promoting Minor determinism issues to Important.

This PR tightens the prompt templates to inline the discipline that the bug-class checklist alone wasn't enforcing. Plus three new bug classes (Test hermeticity, Cross-platform / portability, Vacuous assertion) all validated by repeated field catches across Phases 1 and 2.

Field validation

Across 24 implementation tasks (Phase 1 + 2) under the tightened rubric:

Vacuous assertion caught 4 times (Tasks 2.3, 2.2, 2.6, 2.10) — all real test-quality bugs, not noise
Error swallowing caught 3 times (Tasks 1.7, 1.13, plus a follow-up) — the _ = err / unchecked-Scan pattern
Test hermeticity caught 1 time (Task 0.2 — test connecting to localhost network)
Cross-platform / portability caught 1 time (Task 0.9 — Windows path-separator bug)
Cascade risks preempted twice — Task 2.1's missing proto_sync_test maps folded into Task 2.2 spec; Task 2.8's dead JobBoard rendering folded into Task 2.10 spec

Code-reviewer now consistently cites actual go test / cargo test output on every approval. Spec-reviewer runs TDD red/green checks themselves rather than trusting implementer reports.

Changes

skills/subagent-driven-development/code-quality-reviewer-prompt.md

Inlines a four-step required workflow: read the diff for the WHOLE task (range diff git diff <BASE>..<HEAD> for multi-commit tasks; single git show <HEAD> only when verified single-commit), run the tests yourself with project-appropriate invocation, run the full bug-class checklist, apply Minor→Important promotion.
Adds explicit Minor→Important promotion rules for: tests with real network/DB/filesystem dependencies, magic numbers duplicated across modules, hardcoded path separators, accumulating test-only deferrals, and vacuous assertions (both syntactic tautologies and tests where every input trivially satisfies the assertion).

skills/subagent-driven-development/spec-reviewer-prompt.md

Inlines a five-step required workflow: read the diff for the whole task, walk every Acceptance bullet and locate its diff change, run TDD tests yourself (red must fail, green must pass), check scope-creep, and grep for reuse of existing helpers vs. parallel reinvention.

skills/requesting-code-review/SKILL.md (bug-class checklist) — adds three new classes:

Test hermeticity / non-determinism — real network calls in tests, real DB without skip-when-missing, fs races, clock dependencies. Promote to Important even if the test passes today.
Cross-platform / portability — / path literals where portable APIs exist, bash-only shell syntax, Unix-only env vars (with note that PATH exists on Windows with ; separator).
Vacuous assertion — distinguishes (a) syntactic tautology (err == nil || err != nil, p >= Floor || p < Floor) from (b) trivially-satisfied-for-every-test-input (the more common field pattern, e.g. asserting composed_probability == Floor on inputs that never trigger clamping).

skills/subagent-driven-development/SKILL.md — code-reviewer now marks BOTH the "Review quality:" task AND the corresponding "Implement:" task as completed. Field experience: leaving Implement tasks at in_progress after quality approval strands bookkeeping for the orchestrator. The prompt template and SKILL.md are now consistent on this rule.

Phase 2 retro additions:

DM-routing rule explicit (DM the implementer who CURRENTLY OWNS the task per TaskList, not whoever sent the most recent message)
Implementer self-marking-Implement-complete prohibited (only code-reviewer flips that bit)

Why inline rules that already exist in SKILL.md?

Bug-class checklists and adversarial framing live in requesting-code-review/SKILL.md, but field experience shows that delegating these by reference alone produces reviewers who "review the work" rather than adversarially scan it. The reviewer reads the spec, sees the diff matches, signs off — without running tests, without applying every bug class.

The four-step / five-step inlines duplicate spirit-of-rule that's also in SKILL.md. This is intentional. The reviewer skill file gets read once at agent spawn and then forgotten on subsequent review rounds. Inlining the discipline in the prompt template makes it unavoidable on every dispatch.

Test plan

Phase 0 retro: 10 tasks reviewed under old rubric → 3 internal Important findings + 7 Copilot misses
Phase 1 retro (with tightened prompts via DM): 4 fix cycles, all caught by internal review; new bug-class catches validated
Phase 2 retro (with tightened prompts via DM, third refinement round): 4 vacuous-assertion catches, 3 error-swallowing catches, 2 cross-task cascade preemptions; reviewers consistently cite test-run output
Manual review of the prompt-template wording for clarity (Copilot-driven, in flight)

🤖 Generated with Claude Code

Field experience running subagent-driven-development on a Phase 0 foundations branch (separate-Steam-SKU companion-game scaffold) showed that reviewers reflexively approve once each spec acceptance bullet has a matching file in the diff — without running the test the spec describes, without applying every bug class, and without promoting Minor determinism issues to Important. Copilot subsequently caught seven issues that internal review had missed: - test using a real network call to localhost - hardcoded magic number duplicated against a constant in another module - Windows path-separator bug in a HUD-file detector - FORBIDDEN list scope mismatched its own doc comment - test invocation command copy-paste error - Cargo.toml missing [[test]] entries (deferred-then-cascaded) These are bugs the existing rubric should have caught — symmetry violations, comment-vs-code drift, missing edge cases, magic literals. The gap was reviewer discipline: prompts delegated the bug-class checklist to skills/requesting-code-review/SKILL.md by reference, and reviewers glossed past it without applying every class. Changes: 1. skills/subagent-driven-development/code-quality-reviewer-prompt.md — inline a four-step required workflow (read git show <sha>, run the tests yourself, full bug-class scan, Minor->Important promotion). The bug-class checklist still lives in SKILL.md but the discipline to apply it on every dispatch is now in the prompt template. 2. skills/subagent-driven-development/spec-reviewer-prompt.md — inline a five-step workflow that requires walking every Acceptance bullet and locating its corresponding diff change, running TDD tests yourself (red must fail; green must pass), checking scope creep, and grepping for reuse-of-existing-helpers. 3. skills/requesting-code-review/SKILL.md — add two new bug classes to the checklist: - Test hermeticity / non-determinism (real network/DB/clock/fs in tests; promote to Important even if the test passes today). - Cross-platform / portability (path-separator literals, bash-only shell syntax, Unix-only env-var assumptions). The four-step / five-step inlines duplicate spirit-of-rule that's also in SKILL.md; this is intentional. Field experience: the reviewer skill file gets read once at agent spawn and then forgotten on subsequent review rounds. Inlining the discipline in the prompt template makes it unavoidable on every dispatch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

This PR tightens the reviewer prompt templates used by the subagent-driven-development workflow to enforce more disciplined reviews (read the real diff, run tests, apply the full checklist, and promote determinism/portability issues).

Changes:

Inline a required step-by-step workflow into the spec reviewer prompt (acceptance-by-acceptance verification, TDD test execution, scope and reuse checks).
Inline an adversarial, test-running, checklist-driven workflow into the code-quality reviewer prompt (including Minor→Important promotion rules).
Extend the requesting-code-review bug-class checklist with new “test hermeticity” and “cross-platform/portability” classes.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File	Description
skills/subagent-driven-development/spec-reviewer-prompt.md	Adds an explicit 5-step spec-review workflow and required output expectations.
skills/subagent-driven-development/code-quality-reviewer-prompt.md	Adds adversarial framing and a 4-step quality-review workflow including test execution and promotion rules.
skills/requesting-code-review/SKILL.md	Adds two new bug classes to the reviewer checklist: hermeticity and portability.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Phase 1 field-data refinements after running the tightened rubric on a 13-task implementation phase. Three patterns surfaced that the existing prompt didn't address explicitly: 1. **DM-routing slips.** Code-reviewer occasionally DM'd the wrong implementer when fixes were needed — typically the one who wrote a prior task's test fixture, rather than the current task's owner. Add: explicit "DM the implementer who CURRENTLY OWNS the task per TaskList owner field" rule. Same rule added to spec-reviewer prompt. 2. **Bookkeeping completion gap.** Code-reviewer consistently marked the "Review quality" task completed but missed flipping the "Implement" task to completed, leaving team-lead with a stragglers queue. Add: explicit "mark BOTH tasks completed" rule with note that code-reviewer is the only role that flips the Implement task. 3. **Vacuous-assertion bug class.** Phase 1 reviewer-loop caught a test whose assertion `p < Floor || p > Ceiling` was tautologically true for all the inputs it exercised — the function under test was correct but the test proved nothing. Existing checklist had "Test-name-vs-body mismatch" but didn't explicitly cover the tautology case. Add as a 14th bug class with explicit examples. Field results from Phase 1 with the tightened rubric (pre-refinement): - 4 fix cycles (3 caught by code-reviewer, 1 by spec-reviewer running tests directly) - Error-swallowing class caught twice (Tasks 1.7 and 1.13) by the same pattern recognition — pure validation of the existing checklist - Code-reviewer cited actual `go test` / `cargo test` output on every approval, walked failure modes one-by-one, and consistently produced audit-grade output Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- SKILL.md: clarify cross-platform examples — PATH exists on Windows with `;` separator; HOME / XDG_* / SHELL are the truly Unix-only vars. Mention the PATH-separator gotcha explicitly. - spec-reviewer-prompt.md Step 1: prefer `git diff <BASE>..<HEAD>` when the task spans multiple commits; `git show <HEAD>` covers only the tip and would miss earlier task commits. - code-quality-reviewer-prompt.md Step 1: same clarification on range diff vs single-commit show. - code-quality-reviewer-prompt.md Step 2: drop hard-coded `-tags=e2e` for Go. The project's actual tag set is project-specific; forcing e2e tags can run non-hermetic tests, contradicting the rubric's own hermeticity rule. Examples now show: with project's actual tags; without tags; Rust with conditional features. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

- SKILL.md (subagent-driven-development) line 202: code-reviewer now marks BOTH tasks completed (Review quality + Implement). Reconciles the canonical workflow with the prompt-template addition. Field experience showed leaving Implement tasks at in_progress strands bookkeeping for the orchestrator. - SKILL.md (requesting-code-review) Vacuous-assertion class: replaced the misleading example. Now distinguishes (a) syntactic tautology (`err == nil || err != nil`, `p >= Floor || p < Floor`) from (b) trivially-satisfied-for-every-test-input (`composed_probability == Floor` on inputs that never trigger clamping; `spawn_count == observed_count` where both come from the same in-memory list — this was the actual Phase 2 finding pattern). - code-quality-reviewer-prompt.md: same vacuous-assertion fix as SKILL.md table. - code-quality-reviewer-prompt.md "tests-only" command: replaced the vague "run only the changed test file" with language-specific smallest-scope examples — `go test ./<pkg>/ -run <TestName>` or full package for Go; `cargo test --test <binary>` or `-p <crate>` for Rust. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Picks up the reviewer-prompt tightening from PR #21: - Inlined four-step / five-step required workflows in the subagent-driven-development reviewer prompt templates so reviewers apply the bug-class checklist + run-the-tests-yourself discipline on every dispatch - Three new bug classes in skills/requesting-code-review/SKILL.md: Test hermeticity / non-determinism, Cross-platform / portability, Vacuous assertion - DM-routing rule (DM the implementer who currently OWNS the task) and bookkeeping rule (code-reviewer marks BOTH the Review-quality AND the Implement task completed) reconciled across SKILL.md and the prompt template - Range-diff guidance for multi-commit tasks (`git diff <base>..<head>`) - Project-appropriate test invocations (no hard-coded `-tags=e2e`) Field-validated across 24 implementation tasks (Phase 1 + Phase 2 of a Core Runner companion-game scaffold): 4 vacuous-assertion catches, 3 error-swallowing catches, 2 cross-task cascade preemptions, code- reviewers consistently citing actual test-run output. Additive only. No breaking changes to existing skill APIs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Picks up the reviewer-prompt tightening from PR #21: - Inlined four-step / five-step required workflows in the subagent-driven-development reviewer prompt templates so reviewers apply the bug-class checklist + run-the-tests-yourself discipline on every dispatch - Three new bug classes in skills/requesting-code-review/SKILL.md: Test hermeticity / non-determinism, Cross-platform / portability, Vacuous assertion - DM-routing rule (DM the implementer who currently OWNS the task) and bookkeeping rule (code-reviewer marks BOTH the Review-quality AND the Implement task completed) reconciled across SKILL.md and the prompt template - Range-diff guidance for multi-commit tasks (`git diff <base>..<head>`) - Project-appropriate test invocations (no hard-coded `-tags=e2e`) Field-validated across 24 implementation tasks (Phase 1 + Phase 2 of a Core Runner companion-game scaffold): 4 vacuous-assertion catches, 3 error-swallowing catches, 2 cross-task cascade preemptions, code- reviewers consistently citing actual test-run output. Additive only. No breaking changes to existing skill APIs. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI review requested due to automatic review settings April 25, 2026 17:00

Copilot AI reviewed Apr 25, 2026

View reviewed changes

intel352 and others added 2 commits April 25, 2026 14:51

Copilot AI review requested due to automatic review settings April 25, 2026 20:50

Copilot started reviewing on behalf of intel352 April 25, 2026 20:50 View session

Copilot AI reviewed Apr 25, 2026

View reviewed changes

intel352 merged commit 74fabb1 into main Apr 25, 2026
6 checks passed

intel352 mentioned this pull request Apr 25, 2026

chore: bump plugin manifests to v5.2.0 #22

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Tighten subagent-driven-development reviewer prompts#21

Tighten subagent-driven-development reviewer prompts#21
intel352 merged 4 commits into
mainfrom
cross-llm/tighten-reviewer-prompts

intel352 commented Apr 25, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

intel352 commented Apr 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Field validation

Changes

Why inline rules that already exist in SKILL.md?

Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

intel352 commented Apr 25, 2026 •

edited

Loading