Skip to content

Tighten subagent-driven-development reviewer prompts#21

Merged
intel352 merged 4 commits into
mainfrom
cross-llm/tighten-reviewer-prompts
Apr 25, 2026
Merged

Tighten subagent-driven-development reviewer prompts#21
intel352 merged 4 commits into
mainfrom
cross-llm/tighten-reviewer-prompts

Conversation

@intel352
Copy link
Copy Markdown
Contributor

@intel352 intel352 commented Apr 25, 2026

Summary

Field experience running subagent-driven-development on a Phase 0 + Phase 1 + Phase 2 sequence (Core Runner companion-game scaffold for the Core Dump MMORPG) showed that internal reviewers were systematically passing PRs that Copilot subsequently flagged. The existing rubric in skills/requesting-code-review/SKILL.md was comprehensive — twelve bug classes with adversarial framing — but reviewers were reflexively approving once each spec acceptance bullet had a matching file in the diff, without running the test the spec described, without applying every bug class, and without promoting Minor determinism issues to Important.

This PR tightens the prompt templates to inline the discipline that the bug-class checklist alone wasn't enforcing. Plus three new bug classes (Test hermeticity, Cross-platform / portability, Vacuous assertion) all validated by repeated field catches across Phases 1 and 2.

Field validation

Across 24 implementation tasks (Phase 1 + 2) under the tightened rubric:

  • Vacuous assertion caught 4 times (Tasks 2.3, 2.2, 2.6, 2.10) — all real test-quality bugs, not noise
  • Error swallowing caught 3 times (Tasks 1.7, 1.13, plus a follow-up) — the _ = err / unchecked-Scan pattern
  • Test hermeticity caught 1 time (Task 0.2 — test connecting to localhost network)
  • Cross-platform / portability caught 1 time (Task 0.9 — Windows path-separator bug)
  • Cascade risks preempted twice — Task 2.1's missing proto_sync_test maps folded into Task 2.2 spec; Task 2.8's dead JobBoard rendering folded into Task 2.10 spec

Code-reviewer now consistently cites actual go test / cargo test output on every approval. Spec-reviewer runs TDD red/green checks themselves rather than trusting implementer reports.

Changes

skills/subagent-driven-development/code-quality-reviewer-prompt.md

  • Inlines a four-step required workflow: read the diff for the WHOLE task (range diff git diff <BASE>..<HEAD> for multi-commit tasks; single git show <HEAD> only when verified single-commit), run the tests yourself with project-appropriate invocation, run the full bug-class checklist, apply Minor→Important promotion.
  • Adds explicit Minor→Important promotion rules for: tests with real network/DB/filesystem dependencies, magic numbers duplicated across modules, hardcoded path separators, accumulating test-only deferrals, and vacuous assertions (both syntactic tautologies and tests where every input trivially satisfies the assertion).

skills/subagent-driven-development/spec-reviewer-prompt.md

  • Inlines a five-step required workflow: read the diff for the whole task, walk every Acceptance bullet and locate its diff change, run TDD tests yourself (red must fail, green must pass), check scope-creep, and grep for reuse of existing helpers vs. parallel reinvention.

skills/requesting-code-review/SKILL.md (bug-class checklist) — adds three new classes:

  • Test hermeticity / non-determinism — real network calls in tests, real DB without skip-when-missing, fs races, clock dependencies. Promote to Important even if the test passes today.
  • Cross-platform / portability/ path literals where portable APIs exist, bash-only shell syntax, Unix-only env vars (with note that PATH exists on Windows with ; separator).
  • Vacuous assertion — distinguishes (a) syntactic tautology (err == nil || err != nil, p >= Floor || p < Floor) from (b) trivially-satisfied-for-every-test-input (the more common field pattern, e.g. asserting composed_probability == Floor on inputs that never trigger clamping).

skills/subagent-driven-development/SKILL.md — code-reviewer now marks BOTH the "Review quality:" task AND the corresponding "Implement:" task as completed. Field experience: leaving Implement tasks at in_progress after quality approval strands bookkeeping for the orchestrator. The prompt template and SKILL.md are now consistent on this rule.

Phase 2 retro additions:

  • DM-routing rule explicit (DM the implementer who CURRENTLY OWNS the task per TaskList, not whoever sent the most recent message)
  • Implementer self-marking-Implement-complete prohibited (only code-reviewer flips that bit)

Why inline rules that already exist in SKILL.md?

Bug-class checklists and adversarial framing live in requesting-code-review/SKILL.md, but field experience shows that delegating these by reference alone produces reviewers who "review the work" rather than adversarially scan it. The reviewer reads the spec, sees the diff matches, signs off — without running tests, without applying every bug class.

The four-step / five-step inlines duplicate spirit-of-rule that's also in SKILL.md. This is intentional. The reviewer skill file gets read once at agent spawn and then forgotten on subsequent review rounds. Inlining the discipline in the prompt template makes it unavoidable on every dispatch.

Test plan

  • Phase 0 retro: 10 tasks reviewed under old rubric → 3 internal Important findings + 7 Copilot misses
  • Phase 1 retro (with tightened prompts via DM): 4 fix cycles, all caught by internal review; new bug-class catches validated
  • Phase 2 retro (with tightened prompts via DM, third refinement round): 4 vacuous-assertion catches, 3 error-swallowing catches, 2 cross-task cascade preemptions; reviewers consistently cite test-run output
  • Manual review of the prompt-template wording for clarity (Copilot-driven, in flight)

🤖 Generated with Claude Code

Field experience running subagent-driven-development on a Phase 0
foundations branch (separate-Steam-SKU companion-game scaffold) showed
that reviewers reflexively approve once each spec acceptance bullet has
a matching file in the diff — without running the test the spec
describes, without applying every bug class, and without promoting
Minor determinism issues to Important. Copilot subsequently caught
seven issues that internal review had missed:

  - test using a real network call to localhost
  - hardcoded magic number duplicated against a constant in another module
  - Windows path-separator bug in a HUD-file detector
  - FORBIDDEN list scope mismatched its own doc comment
  - test invocation command copy-paste error
  - Cargo.toml missing [[test]] entries (deferred-then-cascaded)

These are bugs the existing rubric should have caught — symmetry
violations, comment-vs-code drift, missing edge cases, magic literals.
The gap was reviewer discipline: prompts delegated the bug-class checklist
to skills/requesting-code-review/SKILL.md by reference, and reviewers
glossed past it without applying every class.

Changes:

1. skills/subagent-driven-development/code-quality-reviewer-prompt.md —
   inline a four-step required workflow (read git show <sha>, run the
   tests yourself, full bug-class scan, Minor->Important promotion).
   The bug-class checklist still lives in SKILL.md but the discipline
   to apply it on every dispatch is now in the prompt template.

2. skills/subagent-driven-development/spec-reviewer-prompt.md —
   inline a five-step workflow that requires walking every Acceptance
   bullet and locating its corresponding diff change, running TDD
   tests yourself (red must fail; green must pass), checking scope
   creep, and grepping for reuse-of-existing-helpers.

3. skills/requesting-code-review/SKILL.md — add two new bug classes
   to the checklist:
     - Test hermeticity / non-determinism (real network/DB/clock/fs in
       tests; promote to Important even if the test passes today).
     - Cross-platform / portability (path-separator literals, bash-only
       shell syntax, Unix-only env-var assumptions).

The four-step / five-step inlines duplicate spirit-of-rule that's also
in SKILL.md; this is intentional. Field experience: the reviewer skill
file gets read once at agent spawn and then forgotten on subsequent
review rounds. Inlining the discipline in the prompt template makes it
unavoidable on every dispatch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 25, 2026 17:00
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR tightens the reviewer prompt templates used by the subagent-driven-development workflow to enforce more disciplined reviews (read the real diff, run tests, apply the full checklist, and promote determinism/portability issues).

Changes:

  • Inline a required step-by-step workflow into the spec reviewer prompt (acceptance-by-acceptance verification, TDD test execution, scope and reuse checks).
  • Inline an adversarial, test-running, checklist-driven workflow into the code-quality reviewer prompt (including Minor→Important promotion rules).
  • Extend the requesting-code-review bug-class checklist with new “test hermeticity” and “cross-platform/portability” classes.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File Description
skills/subagent-driven-development/spec-reviewer-prompt.md Adds an explicit 5-step spec-review workflow and required output expectations.
skills/subagent-driven-development/code-quality-reviewer-prompt.md Adds adversarial framing and a 4-step quality-review workflow including test execution and promotion rules.
skills/requesting-code-review/SKILL.md Adds two new bug classes to the reviewer checklist: hermeticity and portability.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread skills/requesting-code-review/SKILL.md Outdated
Comment thread skills/subagent-driven-development/spec-reviewer-prompt.md Outdated
Comment thread skills/subagent-driven-development/code-quality-reviewer-prompt.md Outdated
Comment thread skills/subagent-driven-development/code-quality-reviewer-prompt.md Outdated
intel352 and others added 2 commits April 25, 2026 14:51
Phase 1 field-data refinements after running the tightened rubric on a
13-task implementation phase. Three patterns surfaced that the existing
prompt didn't address explicitly:

1. **DM-routing slips.** Code-reviewer occasionally DM'd the wrong
   implementer when fixes were needed — typically the one who wrote a
   prior task's test fixture, rather than the current task's owner.
   Add: explicit "DM the implementer who CURRENTLY OWNS the task per
   TaskList owner field" rule. Same rule added to spec-reviewer prompt.

2. **Bookkeeping completion gap.** Code-reviewer consistently marked
   the "Review quality" task completed but missed flipping the
   "Implement" task to completed, leaving team-lead with a stragglers
   queue. Add: explicit "mark BOTH tasks completed" rule with note
   that code-reviewer is the only role that flips the Implement task.

3. **Vacuous-assertion bug class.** Phase 1 reviewer-loop caught a test
   whose assertion `p < Floor || p > Ceiling` was tautologically true
   for all the inputs it exercised — the function under test was
   correct but the test proved nothing. Existing checklist had
   "Test-name-vs-body mismatch" but didn't explicitly cover the
   tautology case. Add as a 14th bug class with explicit examples.

Field results from Phase 1 with the tightened rubric (pre-refinement):
- 4 fix cycles (3 caught by code-reviewer, 1 by spec-reviewer running
  tests directly)
- Error-swallowing class caught twice (Tasks 1.7 and 1.13) by the same
  pattern recognition — pure validation of the existing checklist
- Code-reviewer cited actual `go test` / `cargo test` output on every
  approval, walked failure modes one-by-one, and consistently produced
  audit-grade output

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- SKILL.md: clarify cross-platform examples — PATH exists on Windows
  with `;` separator; HOME / XDG_* / SHELL are the truly Unix-only
  vars. Mention the PATH-separator gotcha explicitly.

- spec-reviewer-prompt.md Step 1: prefer `git diff <BASE>..<HEAD>` when
  the task spans multiple commits; `git show <HEAD>` covers only the
  tip and would miss earlier task commits.

- code-quality-reviewer-prompt.md Step 1: same clarification on range
  diff vs single-commit show.

- code-quality-reviewer-prompt.md Step 2: drop hard-coded `-tags=e2e`
  for Go. The project's actual tag set is project-specific; forcing
  e2e tags can run non-hermetic tests, contradicting the rubric's own
  hermeticity rule. Examples now show: with project's actual tags;
  without tags; Rust with conditional features.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 25, 2026 20:50
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread skills/subagent-driven-development/code-quality-reviewer-prompt.md
Comment thread skills/subagent-driven-development/code-quality-reviewer-prompt.md Outdated
Comment thread skills/subagent-driven-development/code-quality-reviewer-prompt.md Outdated
Comment thread skills/requesting-code-review/SKILL.md Outdated
Comment thread skills/requesting-code-review/SKILL.md Outdated
- SKILL.md (subagent-driven-development) line 202: code-reviewer now
  marks BOTH tasks completed (Review quality + Implement). Reconciles
  the canonical workflow with the prompt-template addition. Field
  experience showed leaving Implement tasks at in_progress strands
  bookkeeping for the orchestrator.

- SKILL.md (requesting-code-review) Vacuous-assertion class: replaced
  the misleading example. Now distinguishes (a) syntactic tautology
  (`err == nil || err != nil`, `p >= Floor || p < Floor`) from (b)
  trivially-satisfied-for-every-test-input (`composed_probability ==
  Floor` on inputs that never trigger clamping; `spawn_count ==
  observed_count` where both come from the same in-memory list — this
  was the actual Phase 2 finding pattern).

- code-quality-reviewer-prompt.md: same vacuous-assertion fix as
  SKILL.md table.

- code-quality-reviewer-prompt.md "tests-only" command: replaced the
  vague "run only the changed test file" with language-specific
  smallest-scope examples — `go test ./<pkg>/ -run <TestName>` or full
  package for Go; `cargo test --test <binary>` or `-p <crate>` for Rust.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@intel352 intel352 merged commit 74fabb1 into main Apr 25, 2026
6 checks passed
intel352 added a commit that referenced this pull request Apr 25, 2026
Picks up the reviewer-prompt tightening from PR #21:

- Inlined four-step / five-step required workflows in the
  subagent-driven-development reviewer prompt templates so reviewers
  apply the bug-class checklist + run-the-tests-yourself discipline
  on every dispatch
- Three new bug classes in skills/requesting-code-review/SKILL.md:
  Test hermeticity / non-determinism, Cross-platform / portability,
  Vacuous assertion
- DM-routing rule (DM the implementer who currently OWNS the task)
  and bookkeeping rule (code-reviewer marks BOTH the Review-quality
  AND the Implement task completed) reconciled across SKILL.md and
  the prompt template
- Range-diff guidance for multi-commit tasks (`git diff <base>..<head>`)
- Project-appropriate test invocations (no hard-coded `-tags=e2e`)

Field-validated across 24 implementation tasks (Phase 1 + Phase 2 of
a Core Runner companion-game scaffold): 4 vacuous-assertion catches,
3 error-swallowing catches, 2 cross-task cascade preemptions, code-
reviewers consistently citing actual test-run output.

Additive only. No breaking changes to existing skill APIs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
intel352 added a commit that referenced this pull request Apr 25, 2026
Picks up the reviewer-prompt tightening from PR #21:

- Inlined four-step / five-step required workflows in the
  subagent-driven-development reviewer prompt templates so reviewers
  apply the bug-class checklist + run-the-tests-yourself discipline
  on every dispatch
- Three new bug classes in skills/requesting-code-review/SKILL.md:
  Test hermeticity / non-determinism, Cross-platform / portability,
  Vacuous assertion
- DM-routing rule (DM the implementer who currently OWNS the task)
  and bookkeeping rule (code-reviewer marks BOTH the Review-quality
  AND the Implement task completed) reconciled across SKILL.md and
  the prompt template
- Range-diff guidance for multi-commit tasks (`git diff <base>..<head>`)
- Project-appropriate test invocations (no hard-coded `-tags=e2e`)

Field-validated across 24 implementation tasks (Phase 1 + Phase 2 of
a Core Runner companion-game scaffold): 4 vacuous-assertion catches,
3 error-swallowing catches, 2 cross-task cascade preemptions, code-
reviewers consistently citing actual test-run output.

Additive only. No breaking changes to existing skill APIs.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants