From 57eae8d1350ace650d2d7f2fcaec38ea765b6e6d Mon Sep 17 00:00:00 2001 From: Andrew Stellman Date: Wed, 15 Apr 2026 10:34:20 -0400 Subject: [PATCH 1/4] Update quality-playbook skill to v1.4.0, add agent --- agents/quality-playbook.agent.md | 64 + skills/quality-playbook/LICENSE.txt | 211 +- skills/quality-playbook/SKILL.md | 1708 ++++++++++++++++- skills/quality-playbook/quality_gate.sh | 632 ++++++ .../references/defensive_patterns.md | 20 + .../references/exploration_patterns.md | 283 +++ .../quality-playbook/references/iteration.md | 190 ++ .../references/requirements_pipeline.md | 427 +++++ .../references/requirements_refinement.md | 113 ++ .../references/requirements_review.md | 158 ++ .../references/review_protocols.md | 279 ++- .../quality-playbook/references/spec_audit.md | 107 +- .../references/verification.md | 192 +- 13 files changed, 4279 insertions(+), 105 deletions(-) create mode 100644 agents/quality-playbook.agent.md create mode 100755 skills/quality-playbook/quality_gate.sh create mode 100644 skills/quality-playbook/references/exploration_patterns.md create mode 100644 skills/quality-playbook/references/iteration.md create mode 100644 skills/quality-playbook/references/requirements_pipeline.md create mode 100644 skills/quality-playbook/references/requirements_refinement.md create mode 100644 skills/quality-playbook/references/requirements_review.md diff --git a/agents/quality-playbook.agent.md b/agents/quality-playbook.agent.md new file mode 100644 index 000000000..48ca51fe0 --- /dev/null +++ b/agents/quality-playbook.agent.md @@ -0,0 +1,64 @@ +--- +name: "Quality Playbook" +description: "Run a complete quality engineering audit on any codebase. Derives behavioral requirements from the code, generates spec-traced functional tests, runs a three-pass code review with regression tests, executes a multi-model spec audit (Council of Three), and produces a consolidated bug report with patches and TDD verification. 
Finds the 35% of real defects that structural code review alone cannot catch." +tools: + - search/codebase + - web/fetch +--- + +# Quality Playbook Agent + +You are a quality engineering agent. Your job is to run the Quality Playbook — a systematic methodology for finding bugs that require understanding what the code is *supposed* to do, not just what it does. + +## Before you start + +Check that the quality playbook skill is installed. Look for it in one of these locations: + +1. `.github/skills/quality-playbook/SKILL.md` +2. `.github/skills/SKILL.md` + +Also check for the reference files directory alongside SKILL.md (in a `references/` folder). + +**If the skill is not installed**, tell the user: + +> The quality playbook skill isn't installed in this repository yet. You can install it from [awesome-copilot](https://awesome-copilot.github.com/#file=skills%2Fquality-playbook%2FSKILL.md) or from the [quality-playbook repository](https://github.com/andrewstellman/quality-playbook). Copy the `SKILL.md` file and the `references/` directory into `.github/skills/quality-playbook/`. + +Then stop and wait for the user to install it. + +**If the skill is installed**, read SKILL.md and every file in the `references/` directory. Then follow the skill's instructions exactly — it defines six phases, each with entry gates and exit gates. + +## How it works — phase by phase + +The playbook runs one phase at a time. Each phase runs with a clean context window, producing files that the next phase reads. After each phase, stop and tell the user what happened and what to say next. + +1. **Phase 1 (Explore)** — Understand the codebase: architecture, risks, failure modes +2. **Phase 2 (Generate)** — Produce quality artifacts: requirements, tests, protocols +3. **Phase 3 (Code Review)** — Three-pass review with regression tests for every bug +4. **Phase 4 (Spec Audit)** — Three independent auditors check code against requirements +5. 
**Phase 5 (Reconciliation)** — TDD red-green verification for every confirmed bug +6. **Phase 6 (Verify)** — Self-check benchmarks validate all artifacts + +After all six phases, the user can run iteration strategies (gap, unfiltered, parity, adversarial) to find more bugs — iterations typically add 40-60% more confirmed bugs. + +**Default behavior: run Phase 1 only, then stop.** The user drives each phase forward by saying "keep going" or "run phase N". + +## Documentation warning + +Before starting Phase 1, check if the project has documentation (a `docs/` or `docs_gathered/` directory). If not, warn the user that the playbook finds significantly more bugs with documentation, and suggest they add specs or API docs to `docs_gathered/` before running. + +## Responding to user questions + +- **"help" / "how does this work"** — Explain the six phases, mention that documentation improves results, and suggest "Run the quality playbook on this project" to get started. +- **"what happened" / "what's going on"** — Read `quality/PROGRESS.md` and give a status update. +- **"keep going" / "continue" / "next"** — Run the next phase in sequence. +- **"run phase N"** — Run the specified phase (check prerequisites first). + +## How to invoke + +Tell the user they can invoke you by name in Copilot Chat. Example prompts: + +- "Run the quality playbook on this project" +- "Keep going" (after any phase completes) +- "Run quality playbook phase 3" +- "Help — how does the quality playbook work?" +- "What happened? What should I do next?" 
diff --git a/skills/quality-playbook/LICENSE.txt b/skills/quality-playbook/LICENSE.txt index e0d4f9147..ce64d27c2 100644 --- a/skills/quality-playbook/LICENSE.txt +++ b/skills/quality-playbook/LICENSE.txt @@ -1,21 +1,190 @@ -MIT License - -Copyright (c) 2025 Andrew Stellman - -Permission is hereby granted, free of charge, to any person obtaining a copy -of this software and associated documentation files (the "Software"), to deal -in the Software without restriction, including without limitation the rights -to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -copies of the Software, and to permit persons to whom the Software is -furnished to do so, subject to the following conditions: - -The above copyright notice and this permission notice shall be included in all -copies or substantial portions of the Software. - -THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -SOFTWARE. + Apache License + Version 2.0, January 2004 + http://www.apache.org/licenses/ + + TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION + + 1. Definitions. + + "License" shall mean the terms and conditions for use, reproduction, + and distribution as defined by Sections 1 through 9 of this document. + + "Licensor" shall mean the copyright owner or entity authorized by + the copyright owner that is granting the License. + + "Legal Entity" shall mean the union of the acting entity and all + other entities that control, are controlled by, or are under common + control with that entity. 
For the purposes of this definition, + "control" means (i) the power, direct or indirect, to cause the + direction or management of such entity, whether by contract or + otherwise, or (ii) ownership of fifty percent (50%) or more of the + outstanding shares, or (iii) beneficial ownership of such entity. + + "You" (or "Your") shall mean an individual or Legal Entity + exercising permissions granted by this License. + + "Source" form shall mean the preferred form for making modifications, + including but not limited to software source code, documentation + source, and configuration files. + + "Object" form shall mean any form resulting from mechanical + transformation or translation of a Source form, including but + not limited to compiled object code, generated documentation, + and conversions to other media types. + + "Work" shall mean the work of authorship, whether in Source or + Object form, made available under the License, as indicated by a + copyright notice that is included in or attached to the work + (an example is provided in the Appendix below). + + "Derivative Works" shall mean any work, whether in Source or Object + form, that is based on (or derived from) the Work and for which the + editorial revisions, annotations, elaborations, or other modifications + represent, as a whole, an original work of authorship. For the purposes + of this License, Derivative Works shall not include works that remain + separable from, or merely link (or bind by name) to the interfaces of, + the Work and Derivative Works thereof. + + "Contribution" shall mean any work of authorship, including + the original version of the Work and any modifications or additions + to that Work or Derivative Works thereof, that is intentionally + submitted to the Licensor for inclusion in the Work by the copyright owner + or by an individual or Legal Entity authorized to submit on behalf of + the copyright owner. 
For the purposes of this definition, "submitted" + means any form of electronic, verbal, or written communication sent + to the Licensor or its representatives, including but not limited to + communication on electronic mailing lists, source code control systems, + and issue tracking systems that are managed by, or on behalf of, the + Licensor for the purpose of discussing and improving the Work, but + excluding communication that is conspicuously marked or otherwise + designated in writing by the copyright owner as "Not a Contribution." + + "Contributor" shall mean Licensor and any individual or Legal Entity + on behalf of whom a Contribution has been received by the Licensor and + subsequently incorporated within the Work. + + 2. Grant of Copyright License. Subject to the terms and conditions of + this License, each Contributor hereby grants to You a perpetual, + worldwide, non-exclusive, no-charge, royalty-free, irrevocable + copyright license to reproduce, prepare Derivative Works of, + publicly display, publicly perform, sublicense, and distribute the + Work and such Derivative Works in Source or Object form. + + 3. Grant of Patent License. Subject to the terms and conditions of + this License, each Contributor hereby grants to You a perpetual, + worldwide, non-exclusive, no-charge, royalty-free, irrevocable + (except as stated in this section) patent license to make, have made, + use, offer to sell, sell, import, and otherwise transfer the Work, + where such license applies only to those patent claims licensable + by such Contributor that are necessarily infringed by their + Contribution(s) alone or by combination of their Contribution(s) + with the Work to which such Contribution(s) was submitted. 
If You + institute patent litigation against any entity (including a + cross-claim or counterclaim in a lawsuit) alleging that the Work + or a Contribution incorporated within the Work constitutes direct + or contributory patent infringement, then any patent licenses + granted to You under this License for that Work shall terminate + as of the date such litigation is filed. + + 4. Redistribution. You may reproduce and distribute copies of the + Work or Derivative Works thereof in any medium, with or without + modifications, and in Source or Object form, provided that You + meet the following conditions: + + (a) You must give any other recipients of the Work or + Derivative Works a copy of this License; and + + (b) You must cause any modified files to carry prominent notices + stating that You changed the files; and + + (c) You must retain, in the Source form of any Derivative Works + that You distribute, all copyright, patent, trademark, and + attribution notices from the Source form of the Work, + excluding those notices that do not pertain to any part of + the Derivative Works; and + + (d) If the Work includes a "NOTICE" text file as part of its + distribution, then any Derivative Works that You distribute must + include a readable copy of the attribution notices contained + within such NOTICE file, excluding any notices that do not + pertain to any part of the Derivative Works, in at least one + of the following places: within a NOTICE text file distributed + as part of the Derivative Works; within the Source form or + documentation, if provided along with the Derivative Works; or, + within a display generated by the Derivative Works, if and + wherever such third-party notices normally appear. The contents + of the NOTICE file are for informational purposes only and + do not modify the License. 
You may add Your own attribution + notices within Derivative Works that You distribute, alongside + or as an addendum to the NOTICE text from the Work, provided + that such additional attribution notices cannot be construed + as modifying the License. + + You may add Your own copyright statement to Your modifications and + may provide additional or different license terms and conditions + for use, reproduction, or distribution of Your modifications, or + for any such Derivative Works as a whole, provided Your use, + reproduction, and distribution of the Work otherwise complies with + the conditions stated in this License. + + 5. Submission of Contributions. Unless You explicitly state otherwise, + any Contribution intentionally submitted for inclusion in the Work + by You to the Licensor shall be under the terms and conditions of + this License, without any additional terms or conditions. + Notwithstanding the above, nothing herein shall supersede or modify + the terms of any separate license agreement you may have executed + with Licensor regarding such Contributions. + + 6. Trademarks. This License does not grant permission to use the trade + names, trademarks, service marks, or product names of the Licensor, + except as required for reasonable and customary use in describing the + origin of the Work and reproducing the content of the NOTICE file. + + 7. Disclaimer of Warranty. Unless required by applicable law or + agreed to in writing, Licensor provides the Work (and each + Contributor provides its Contributions) on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or + implied, including, without limitation, any warranties or conditions + of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A + PARTICULAR PURPOSE. You are solely responsible for determining the + appropriateness of using or redistributing the Work and assume any + risks associated with Your exercise of permissions under this License. + + 8. 
Limitation of Liability. In no event and under no legal theory, + whether in tort (including negligence), contract, or otherwise, + unless required by applicable law (such as deliberate and grossly + negligent acts) or agreed to in writing, shall any Contributor be + liable to You for damages, including any direct, indirect, special, + incidental, or consequential damages of any character arising as a + result of this License or out of the use or inability to use the + Work (including but not limited to damages for loss of goodwill, + work stoppage, computer failure or malfunction, or any and all + other commercial damages or losses), even if such Contributor + has been advised of the possibility of such damages. + + 9. Accepting Warranty or Additional Liability. While redistributing + the Work or Derivative Works thereof, You may choose to offer, + and charge a fee for, acceptance of support, warranty, indemnity, + or other liability obligations and/or rights consistent with this + License. However, in accepting such obligations, You may act only + on Your own behalf and on Your sole responsibility, not on behalf + of any other Contributor, and only if You agree to indemnify, + defend, and hold each Contributor harmless for any liability + incurred by, or claims asserted against, such Contributor by reason + of your accepting any such warranty or additional liability. + + END OF TERMS AND CONDITIONS + + Copyright 2025 Andrew Stellman + + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+ See the License for the specific language governing permissions and + limitations under the License. diff --git a/skills/quality-playbook/SKILL.md b/skills/quality-playbook/SKILL.md index b5242ec42..7d5b30ff0 100644 --- a/skills/quality-playbook/SKILL.md +++ b/skills/quality-playbook/SKILL.md @@ -1,21 +1,41 @@ --- name: quality-playbook -description: "Explore any codebase from scratch and generate six quality artifacts: a quality constitution (QUALITY.md), spec-traced functional tests, a code review protocol with regression test generation, an integration testing protocol, a multi-model spec audit (Council of Three), and an AI bootstrap file (AGENTS.md). Includes state machine completeness analysis and missing safeguard detection. Works with any language (Python, Java, Scala, TypeScript, Go, Rust, etc.). Use this skill whenever the user asks to set up a quality playbook, generate functional tests from specifications, create a quality constitution, build testing protocols, audit code against specs, or establish a repeatable quality system for a project. Also trigger when the user mentions 'quality playbook', 'spec audit', 'Council of Three', 'fitness-to-purpose', 'coverage theater', or wants to go beyond basic test generation to build a full quality system grounded in their actual codebase." +description: "Run a complete quality engineering audit on any codebase. Derives behavioral requirements from the code, generates spec-traced functional tests, runs a three-pass code review with regression tests, executes a multi-model spec audit (Council of Three), and produces a consolidated bug report with TDD-verified patches. Finds the 35% of real defects that structural code review alone cannot catch. Works with any language. Trigger on 'quality playbook', 'spec audit', 'Council of Three', 'fitness-to-purpose', or 'coverage theater'." 
license: Complete terms in LICENSE.txt metadata: - version: 1.2.0 + version: 1.4.0 author: Andrew Stellman - github: https://github.com/andrewstellman/ + github: https://github.com/andrewstellman/quality-playbook --- # Quality Playbook Generator -**When this skill starts, display this banner before doing anything else:** +## Plan Overview — read this first, then explain it to the user -``` -Quality Playbook v1.2.0 — by Andrew Stellman -https://github.com/andrewstellman/ -``` +Before reading any other section of this skill, understand the plan and its dependencies. Each phase produces artifacts that the next phase depends on. Skipping or rushing a phase means every downstream phase works from incomplete information. + +**Phase 0 (Prior Run Analysis):** If previous quality runs exist, load their findings as seed data. This is automatic and only applies to re-runs. + +**Phase 1 (Explore):** Explore the codebase thoroughly in three stages. First, open exploration driven by domain knowledge — understand the architecture, risks, and failure modes the way an experienced developer would. Second, domain-knowledge risk analysis — step back and reason about what goes wrong in systems like this one, generating specific failure scenarios grounded in the code you just explored. Third, apply structured exploration patterns (selected for this codebase, not all six exhaustively) to catch specific bug classes that free exploration misses. Write all findings to `quality/EXPLORATION.md`. This file is the foundation — Phase 2 reads it as its primary input. + +**Phase 2 (Generate):** Read EXPLORATION.md and produce the quality artifacts: requirements, constitution, functional tests, code review protocol, integration tests, spec audit protocol, TDD protocol, AGENTS.md. + +**Phase 3 (Code Review):** Run the three-pass code review against HEAD. Write regression tests for every confirmed bug. Generate patches. 
+ +**Phase 4 (Spec Audit):** Three independent AI auditors review the code against requirements. Triage with verification probes. Write regression tests for net-new findings. + +**Phase 5 (Reconciliation):** Close the loop — every bug from code review and spec audit is tracked, regression-tested or explicitly exempted. Run TDD red-green cycle. Finalize the completeness report. + +**Phase 6 (Verify):** Run self-check benchmarks against all generated artifacts. Check for internal consistency, version stamp correctness, and convergence. + +Every bug found traces back to a requirement, and every requirement traces back to an exploration finding. + +**The critical dependency chain:** Exploration findings → EXPLORATION.md → Requirements → Code review + Spec audit → Bug discovery. A shallow exploration produces abstract requirements. Abstract requirements miss bugs. The exploration phase is where bugs are won or lost. + +**MANDATORY FIRST ACTION:** After reading and understanding the plan above, print the following message to the user, then explain the plan in your own words — what you'll do, what each phase produces, and why the exploration phase matters most. Emphasize that exploration starts with open-ended domain-driven investigation, followed by domain-knowledge risk analysis that reasons about what goes wrong in systems like this, then supplemented by selected structured patterns. Do not copy the plan verbatim; paraphrase it to demonstrate understanding. + +> Quality Playbook v1.4.0 — by Andrew Stellman +> https://github.com/andrewstellman/quality-playbook Generate a complete quality system tailored to a specific codebase. Unlike test stub generators that work mechanically from source code, this skill explores the project first — understanding its domain, architecture, specifications, and failure history — then produces a quality playbook grounded in what it finds. 
@@ -27,42 +47,253 @@ Without a quality playbook, every new contributor (and every new AI session) sta ## What This Skill Produces -Six files that together form a repeatable quality system: +Nine files that together form a repeatable quality system: | File | Purpose | Why It Matters | Executes Code? | |------|---------|----------------|----------------| | `quality/QUALITY.md` | Quality constitution — coverage targets, fitness-to-purpose scenarios, theater prevention | Every AI session reads this first. It tells them what "good enough" means so they don't guess. | No | +| `quality/REQUIREMENTS.md` | Testable requirements with project overview, use cases, and narrative — generated by a five-phase pipeline (contract extraction → derivation → verification → completeness → narrative) | The foundation for Passes 2 and 3 of the code review. Without requirements, review is limited to structural anomalies (~65% ceiling). With them, the review can catch intent violations — absence bugs, cross-file contradictions, and design gaps that are invisible to code reading alone. | No | | `quality/test_functional.*` | Automated functional tests derived from specifications | The safety net. Tests tied to what the spec says should happen, not just what the code does. Use the project's language: `test_functional.py` (Python), `FunctionalSpec.scala` (Scala), `functional.test.ts` (TypeScript), `FunctionalTest.java` (Java), etc. | **Yes** | -| `quality/RUN_CODE_REVIEW.md` | Code review protocol with guardrails that prevent hallucinated findings | AI code reviews without guardrails produce confident but wrong findings. The guardrails (line numbers, grep before claiming, read bodies) often improve accuracy. | No | +| `quality/RUN_CODE_REVIEW.md` | Three-pass code review protocol: structural review, requirement verification, cross-requirement consistency | Structural review alone misses ~35% of real defects. 
The three-pass pipeline adds requirement verification and consistency checking — backed by experiment evidence showing it finds bugs invisible to all structural review conditions. | No | | `quality/RUN_INTEGRATION_TESTS.md` | Integration test protocol — end-to-end pipeline across all variants | Unit tests pass, but does the system actually work end-to-end with real external services? | **Yes** | +| `quality/BUGS.md` | Consolidated bug report with patches | Every confirmed bug in one place with reproduction details, spec basis, severity, and patch references. The single source of truth for what's broken and how to verify it. | No | +| `quality/RUN_TDD_TESTS.md` | TDD red-green verification protocol | Proves each bug is real (test fails on unpatched code) and each fix works (test passes after patch). Stronger evidence than a bug report alone — maintainers trust FAIL→PASS demonstrations. | **Yes** | | `quality/RUN_SPEC_AUDIT.md` | Council of Three multi-model spec audit protocol | No single AI model catches everything. Three independent models with different blind spots catch defects that any one alone would miss. | No | | `AGENTS.md` | Bootstrap context for any AI session working on this project | The "read this first" file. Without it, AI sessions waste their first hour figuring out what's going on. | No | -Plus output directories: `quality/code_reviews/`, `quality/spec_audits/`, `quality/results/`. +Plus output directories: `quality/code_reviews/`, `quality/spec_audits/`, `quality/results/`, `quality/history/`. + +The pipeline also generates supporting artifacts: `quality/PROGRESS.md` (phase-by-phase checkpoint log with cumulative BUG tracker), `quality/CONTRACTS.md` (behavioral contracts), `quality/COVERAGE_MATRIX.md` (traceability), `quality/COMPLETENESS_REPORT.md` (final gate), `quality/VERSION_HISTORY.md` (review log), `quality/REVIEW_REQUIREMENTS.md` (interactive review protocol), and `quality/REFINE_REQUIREMENTS.md` (refinement pass protocol). 
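As an illustration of what "spec-traced" means for the functional test file, here is a hypothetical example. The UC identifier, function, and behavior below are invented for illustration only; Phase 2 generates the real tests from `quality/REQUIREMENTS.md`.

```python
# Hypothetical spec-traced functional test. The UC identifier and the
# system-under-test are invented for illustration; real tests are derived
# from the requirements generated in Phase 2.

def normalize_id(raw: str) -> str:
    """Toy system-under-test: trim whitespace and uppercase an identifier."""
    return raw.strip().upper()

def test_uc03_identifiers_are_normalized():
    """UC-03: User-supplied identifiers are stored trimmed and uppercased."""
    assert normalize_id("  bug-001 ") == "BUG-001"

test_uc03_identifiers_are_normalized()
```

The point of the trace is the docstring: every test names the requirement it verifies, so the coverage matrix can map UC identifiers to tests mechanically.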
+ +The two critical deliverables are the requirements file and the functional test file. The requirements file (`quality/REQUIREMENTS.md`) feeds the code review protocol's verification and consistency passes — it's what makes the code review catch more than structural anomalies. The functional test file (named for the project's language and test framework conventions) is the automated safety net. The Markdown protocols are documentation for humans and AI agents. + +### Complete Artifact Contract + +The quality gate (`quality_gate.sh`) validates these artifacts. If the gate checks for it, this skill must instruct its creation. This is the canonical list — any artifact not listed here should not be gate-enforced, and any gate check should trace to an artifact listed here. + +| Artifact | Location | Required? | Created In | +|----------|----------|-----------|------------| +| Quality constitution | `quality/QUALITY.md` | Yes | Phase 2 | +| Requirements (UC identifiers) | `quality/REQUIREMENTS.md` | Yes | Phase 2 | +| Behavioral contracts | `quality/CONTRACTS.md` | Yes | Phase 2 | +| Functional tests | `quality/test_functional.*` | Yes | Phase 2 | +| Regression tests | `quality/test_regression.*` | If bugs found | Phase 3 | +| Code review protocol | `quality/RUN_CODE_REVIEW.md` | Yes | Phase 2 | +| Integration test protocol | `quality/RUN_INTEGRATION_TESTS.md` | Yes | Phase 2 | +| Spec audit protocol | `quality/RUN_SPEC_AUDIT.md` | Yes | Phase 2 | +| TDD verification protocol | `quality/RUN_TDD_TESTS.md` | Yes | Phase 2 | +| Bug tracker | `quality/BUGS.md` | Yes | Phase 3 | +| Coverage matrix | `quality/COVERAGE_MATRIX.md` | Yes | Phase 2 | +| Completeness report | `quality/COMPLETENESS_REPORT.md` | Yes | Phase 5 | +| Progress tracker | `quality/PROGRESS.md` | Yes | Throughout | +| AI bootstrap | `AGENTS.md` | Yes | Phase 2 | +| Bug writeups | `quality/writeups/BUG-NNN.md` | If bugs found | Phase 5 | +| Regression patches | 
`quality/patches/BUG-NNN-regression-test.patch` | If bugs found | Phase 3 | +| Fix patches | `quality/patches/BUG-NNN-fix.patch` | Optional | Phase 3 | +| TDD sidecar | `quality/results/tdd-results.json` | If bugs found | Phase 5 | +| TDD red-phase logs | `quality/results/BUG-NNN.red.log` | If bugs found | Phase 5 | +| TDD green-phase logs | `quality/results/BUG-NNN.green.log` | If fix patch exists | Phase 5 | +| Integration sidecar | `quality/results/integration-results.json` | When integration tests run | Phase 5 | +| Mechanical verify script | `quality/mechanical/verify.sh` | Yes (benchmark) | Phase 5 | +| Verify receipt | `quality/results/mechanical-verify.log` + `.exit` | Yes (benchmark) | Phase 5 | +| Triage probes | `quality/spec_audits/triage_probes.sh` | When triage runs | Phase 4 | +| Code review reports | `quality/code_reviews/*.md` | Yes | Phase 3 | +| Spec audit reports | `quality/spec_audits/*auditor*.md` + `*triage*` | Yes | Phase 4 | + +**Sidecar JSON lifecycle:** Write all bug writeups *before* finalizing `tdd-results.json` — the sidecar's `writeup_path` field must point to an existing file, not a placeholder. Similarly, run integration tests and collect results before writing `integration-results.json`. 
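The `writeup_path` invariant above can be checked mechanically before the sidecar is finalized. A minimal sketch, assuming the repo layout described in the artifact table (the validator itself is illustrative and not part of the shipped artifacts):

```python
# Sketch: confirm every writeup_path in tdd-results.json resolves to a real
# file. Field names follow the sidecar schema; the repo layout is assumed to
# place the sidecar at <repo>/quality/results/tdd-results.json.
import json
import os
import tempfile

def validate_sidecar(sidecar_path):
    """Return a list of problems; an empty list means all writeups exist."""
    with open(sidecar_path) as f:
        sidecar = json.load(f)
    # Repo root is three levels above the sidecar file (results -> quality -> root)
    root = os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(sidecar_path))))
    problems = []
    for bug in sidecar.get("bugs", []):
        writeup = bug.get("writeup_path", "")
        if not writeup or not os.path.isfile(os.path.join(root, writeup)):
            problems.append(f"{bug.get('id', '?')}: missing writeup {writeup!r}")
    return problems

# Demo against a throwaway repo layout: BUG-001 has a writeup, BUG-002 does not
with tempfile.TemporaryDirectory() as repo:
    os.makedirs(os.path.join(repo, "quality", "results"))
    os.makedirs(os.path.join(repo, "quality", "writeups"))
    open(os.path.join(repo, "quality", "writeups", "BUG-001.md"), "w").close()
    sidecar = {"bugs": [
        {"id": "BUG-001", "writeup_path": "quality/writeups/BUG-001.md"},
        {"id": "BUG-002", "writeup_path": "quality/writeups/BUG-002.md"},
    ]}
    path = os.path.join(repo, "quality", "results", "tdd-results.json")
    with open(path, "w") as f:
        json.dump(sidecar, f)
    print(validate_sidecar(path))  # reports BUG-002 only
```

Running a check like this before writing the final sidecar catches the placeholder-path mistake the lifecycle note warns about.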
+ +### Sidecar JSON Canonical Examples + +**`quality/results/tdd-results.json`** — the gate validates field names, not just presence: + +```json +{ + "schema_version": "1.1", + "skill_version": "1.4.0", + "date": "2026-04-12", + "project": "repo-name", + "bugs": [ + { + "id": "BUG-001", + "requirement": "UC-03: Description of the requirement violated", + "red_phase": "Regression test fails on unpatched code, confirming the bug", + "green_phase": "After applying fix patch, regression test passes", + "verdict": "TDD verified", + "fix_patch_present": true, + "writeup_path": "quality/writeups/BUG-001.md" + } + ], + "summary": { + "total": 3, "confirmed_open": 1, "red_failed": 0, "green_failed": 0, "tdd_verified": 2 + } +} +``` + +`verdict` must be one of: `"TDD verified"`, `"red failed"`, `"green failed"`, `"confirmed open"`, `"deferred"`. `date` must be ISO 8601 (YYYY-MM-DD), not a placeholder, not in the future. + +**`quality/results/integration-results.json`:** + +```json +{ + "schema_version": "1.1", + "skill_version": "1.4.0", + "date": "2026-04-12", + "project": "repo-name", + "recommendation": "SHIP", + "groups": [{ "name": "Group 1", "tests": [{ "name": "happy path", "status": "pass" }] }], + "summary": { "total": 12, "passed": 11, "failed": 1, "skipped": 0 }, + "uc_coverage": { "UC-01": "covered", "UC-02": "not covered — no API key" } +} +``` -The critical deliverable is the functional test file (named for the project's language and test framework conventions). The Markdown protocols are documentation for humans and AI agents. The functional tests are the automated safety net. +`recommendation` must be one of: `"SHIP"`, `"FIX BEFORE MERGE"`, `"BLOCK"`. `uc_coverage` maps UC identifiers from REQUIREMENTS.md to coverage status. ## How to Use -Point this skill at any codebase: +**The playbook is designed to run one phase at a time.** Each phase runs in its own session with a clean context window, producing files on disk that the next phase reads. 
This gives much better results than running all phases at once — each phase gets the full context window for deep analysis instead of competing for space with other phases. + +**Default behavior: run Phase 1 only.** When someone says "run the quality playbook" or "execute the quality playbook," run Phase 1 (Explore) and stop. After Phase 1 completes, tell the user what happened and what to say next. The user drives each phase forward explicitly. + +### Interactive protocol — how to guide the user + +**After every phase and every iteration, STOP and print guidance.** Use a `#` header so it's prominent in the chat. The guidance must include: what just happened (one line), what the key outputs are, and the exact prompt to continue. See the end-of-phase messages defined after each phase section below. + +**If the user says "keep going", "continue", "next phase", "next", or anything similar**, run the next phase in sequence. If all 6 phases are complete, suggest the first iteration strategy (gap). If an iteration just finished, suggest the next strategy in the recommended cycle. + +**If the user says "run all phases", "run everything", or "run the full pipeline"**, run all phases sequentially in a single session. This uses more context but some users prefer it. + +**If the user asks "help", "how does this work", "what is this", or any similar phrasing**, respond with this explanation (adapt the wording naturally, don't copy verbatim): + +> The Quality Playbook finds bugs that structural code review alone can't catch — the 35% of real defects that require understanding what the code is *supposed* to do. 
It works in six phases: +> +> - **Phase 1 (Explore):** Understand the codebase — architecture, risks, failure modes, specifications +> - **Phase 2 (Generate):** Produce quality artifacts — requirements, tests, review protocols +> - **Phase 3 (Code Review):** Three-pass review with regression tests for every confirmed bug +> - **Phase 4 (Spec Audit):** Three independent AI auditors check the code against requirements +> - **Phase 5 (Reconciliation):** Close the loop — TDD red-green verification for every bug +> - **Phase 6 (Verify):** Self-check benchmarks validate all generated artifacts +> +> After all six phases, you can run iteration strategies (gap, unfiltered, parity, adversarial) to find additional bugs — iterations typically add 40-60% more confirmed bugs on top of the baseline. +> +> The playbook works best when you provide documentation alongside the code — specs, API docs, design documents, community documentation. It also gets significantly better results when you run each phase separately rather than all at once. +> +> To get started, say: **"Run the quality playbook on this project."** + +**If the user asks "what happened", "what's going on", "where are we", or "what should I do next"**, read `quality/PROGRESS.md` and give them a concise status update: which phases are complete, how many bugs found so far, and what the next step is. + +### Documentation warning + +**At the start of Phase 1, before exploring any code, check for documentation.** Look for directories named `docs/`, `docs_gathered/`, `doc/`, `documentation/`, or any gathered documentation files. Also check if the user mentioned documentation in their prompt. 
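
The documentation check described above is mechanical enough to sketch. This is an illustrative sketch, not part of the skill's tooling — the directory names come from the list above, and `find_documentation` is a hypothetical helper name:

```python
from pathlib import Path

# Candidate documentation locations, per the documentation warning protocol.
DOC_DIRS = ["docs", "docs_gathered", "doc", "documentation"]

def find_documentation(project_root: str) -> list[str]:
    """Return documentation directories that exist and contain at least one file."""
    root = Path(project_root)
    found = []
    for name in DOC_DIRS:
        candidate = root / name
        if candidate.is_dir() and any(candidate.iterdir()):
            found.append(name)
    return found

if __name__ == "__main__":
    hits = find_documentation(".")
    if not hits:
        print("WARNING: no project documentation found — proceeding code-only")
    else:
        print(f"Documentation found in: {', '.join(hits)}")
```

If nothing comes back, print the no-documentation warning and proceed — don't block on it.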
+ +**If no documentation is found, print this warning immediately (before proceeding):** + +> **Important: No project documentation found.** The quality playbook works without documentation, but it finds significantly more bugs — and higher-confidence bugs — when you provide specs, API docs, design documents, or community documentation. In controlled experiments, documentation-enriched runs found different and better bugs than code-only baselines. +> +> If you have documentation available, you can add it to a `docs_gathered/` directory and re-run Phase 1. Otherwise, I'll proceed with code-only analysis. + +Then proceed with Phase 1 — don't block on this, just make sure the user sees the warning. + +### Running a specific phase + +The user can request any individual phase: ``` -Generate a quality playbook for this project. +Run quality playbook phase 1. +Run quality playbook phase 3 — code review. +Run phase 5 reconciliation. ``` +When running a specific phase, check that its prerequisites exist (e.g., Phase 3 requires Phase 2 artifacts). If prerequisites are missing, tell the user which phases need to run first. + +### Iteration mode — improve on a previous run + +Use this when a previous playbook run exists and you want to find additional bugs. Iteration mode replaces Phase 1's from-scratch exploration with a targeted exploration using one of five strategies, then merges findings with the previous run and re-runs Phases 2–6 against the combined results. + +**When to use iteration mode:** After a complete playbook run, when you believe the codebase has more bugs than the first run found. This is especially effective for large codebases where a single run can only cover 3–5 subsystems, and for library/framework codebases where different exploration paths find different bug classes. 
+ +**Read `.github/skills/references/iteration.md` for detailed strategy instructions.** That file contains the full operational detail for each strategy, shared rules, merge steps, and the completion gate. The summary below describes when to use each strategy. + +**TDD applies to iteration runs.** Every newly confirmed bug in an iteration run must go through the full TDD red-green cycle and produce `quality/results/BUG-NNN.red.log` (and `.green.log` if a fix patch exists). The quality gate enforces this — missing logs cause FAIL. See `references/iteration.md` shared rule 5 and the TDD Log Closure Gate in Phase 5. + +**Iteration strategies.** The user selects a strategy by naming it in the prompt. If no strategy is named, default to `gap`. + ``` -Update the functional tests — the quality playbook already exists. +Run the next iteration of the quality playbook. # default: gap strategy +Run the next iteration of the quality playbook using the gap strategy. +Run the next iteration using the unfiltered strategy. +Run the next iteration using the parity strategy. +Run an iteration using the adversarial strategy. ``` +**Recommended cycle:** gap → unfiltered → parity → adversarial. Each strategy finds different bug classes: + +- **`gap`** (default) — Scan previous coverage, explore uncovered subsystems and thin sections. Best when the first run was structurally sound but only covered a subset of the codebase. +- **`unfiltered`** — Pure domain-driven exploration with no structural constraints. No pattern templates, no applicability matrices, no section format requirements. Recovers bugs that structured exploration suppresses. +- **`parity`** — Systematically enumerate parallel implementations of the same contract (transport variants, fallback chains, setup-vs-reset paths) and diff them for inconsistencies. Finds bugs that only emerge from cross-path comparison. +- **`adversarial`** — Re-investigate dismissed/demoted triage findings and challenge thin SATISFIED verdicts. 
Recovers Type II errors from conservative triage. +- **`all`** — Runner-level convenience: executes gap → unfiltered → parity → adversarial in sequence, each as a separate agent session. Stops early if a strategy finds zero new bugs. + +### Phase-by-phase execution + +Each phase produces files on disk that the next phase reads. This is how context transfers between phases — through files, not through conversation history. The key handoff files are: + +- **`quality/EXPLORATION.md`** — Phase 1 writes this, Phase 2 reads it. Contains everything Phase 2 needs to generate artifacts without re-exploring the codebase. +- **`quality/PROGRESS.md`** — Updated after every phase. Cumulative BUG tracker ensures no finding is lost. +- **Generated artifacts** (REQUIREMENTS.md, CONTRACTS.md, etc.) — Phase 2 writes these, Phases 3–5 read them to run reviews, audits, and reconciliation. + +The pattern for each phase boundary: finish the current phase, write everything to disk, then print the end-of-phase message and stop. When the user starts the next phase, read back the files you need before proceeding. This "write then read" cycle is the phase boundary — it lets you drop exploration context from working memory before loading review context, for example. + +Write your Phase 1 exploration findings to `quality/EXPLORATION.md` before proceeding. This file is mandatory in all modes. Make it thorough: domain identification, architecture map, existing tests, specification summary, quality risks, skeleton/dispatch analysis, derived requirements (REQ-NNN), and derived use cases (UC-NN). Everything Phase 2 needs to generate artifacts must be in this file. + +The discipline of writing exploration findings to disk is what forces thorough analysis. Without it, the model keeps vague impressions in working memory and produces broad, abstract requirements that miss function-level defects. Writing forces specificity: file paths, line numbers, exact function names, concrete behavioral rules. 
That specificity is what makes requirements precise enough to catch bugs during code review. + +--- + +## Phase 0: Prior Run Analysis (Automatic) + +**This phase runs only if `previous_runs/` exists and contains prior quality artifacts.** If there are no prior runs, skip to Phase 1. + +When prior runs exist, the playbook enters **continuation mode**. This enables iterative bug discovery: each run inherits confirmed findings from prior runs, verifies them mechanically, and explores for additional bugs. The iteration converges when a run finds zero net-new bugs. + +**Step 0a: Build the seed list.** Read `previous_runs/*/quality/BUGS.md` from all prior runs. For each confirmed bug, extract: bug ID, file:line, summary, and the regression test assertion. Deduplicate by file:line (the same bug found in multiple runs counts once). Write the merged seed list to `quality/SEED_CHECKS.md` with this format: + +```markdown +## Seed Checks (from N prior runs) + +| Seed | Origin Run | File:Line | Summary | Assertion | +|------|-----------|-----------|---------|-----------| +| SEED-001 | run-1 | virtio_ring.c:3509-3529 | RING_RESET dropped | `"case VIRTIO_F_RING_RESET:" in func` | ``` -Run the spec audit protocol. -``` -If a quality playbook already exists (`quality/QUALITY.md`, functional tests, etc.), read the existing files first, then evaluate them against the self-check benchmarks in the verification phase. Don't assume existing files are complete — treat them as a starting point. +**Step 0b: Execute seed checks mechanically.** For each seed, run the assertion against the current source tree. Record PASS (bug was fixed since last run) or FAIL (bug still present). A failing seed is a confirmed carry-forward bug — it must appear in this run's BUGS.md regardless of whether any auditor independently finds it. A passing seed means the bug was fixed — note it in PROGRESS.md as "SEED-NNN: resolved since prior run." 
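
Step 0b's mechanical check can be sketched as follows — an illustrative sketch, not shipped tooling. The `Seed` fields mirror the SEED_CHECKS.md columns, and the PASS-when-substring-present policy is an assumption (real assertions may be arbitrary predicates):

```python
from dataclasses import dataclass
from pathlib import Path

@dataclass
class Seed:
    seed_id: str   # e.g. "SEED-001"
    file: str      # file the assertion inspects, relative to the source root
    needle: str    # substring whose presence indicates the bug was fixed
    summary: str

def check_seed(seed: Seed, source_root: str) -> str:
    """PASS = fixed since the prior run; FAIL = still present (carry-forward bug)."""
    path = Path(source_root) / seed.file
    if not path.exists():
        # Conservative policy (assumption): a missing file carries the bug forward.
        return "FAIL"
    text = path.read_text(errors="replace")
    return "PASS" if seed.needle in text else "FAIL"
```

Every FAIL goes into this run's BUGS.md regardless of whether any auditor rediscovers it; every PASS gets a "resolved since prior run" note in PROGRESS.md.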
+ +**Step 0c: Identify prior-run scope.** Read `previous_runs/*/quality/PROGRESS.md` for scope declarations. Note which subsystems were covered in prior runs. During Phase 1 exploration, prioritize areas NOT covered by prior runs to maximize the chance of finding new bugs. If all subsystems were covered in prior runs, explore the same scope but with different emphasis (e.g., different scrutiny areas, different entry points). + +**Step 0d: Inject seeds into downstream phases.** The seed list becomes input to: +- **Phase 3 (code review):** Add to the code review prompt: "Prior runs confirmed these bugs — verify they are still present and look for additional findings in the same subsystems." +- **Phase 4 (spec audit):** Add to `RUN_SPEC_AUDIT.md`: "Known open issues from prior runs: [seed list]. Expect auditors to find these. If an auditor does NOT flag a known seed bug, that is a coverage gap in their review, not evidence the bug was fixed." + +**Why this exists:** Non-deterministic scope exploration means different runs notice different bugs. In cross-version testing, 4/8 repos had bugs found in some versions but not others — not because the bugs were fixed, but because the model explored different parts of the codebase. Iterating with seed injection solves this: confirmed bugs carry forward mechanically (no re-discovery needed), and each new run can focus exploration on uncovered territory. + +### Phase 0b: Sibling-Run Seed Discovery (Automatic) + +**This step runs only if `previous_runs/` does not exist** (i.e., Phase 0a has nothing to work with) **and** the project directory is versioned (e.g., `httpx-1.3.23/` sits alongside `httpx-1.3.21/`). If `previous_runs/` exists, Phase 0a already handles seed injection — skip this step. + +When no `previous_runs/` directory exists but sibling versioned directories do, look for prior quality artifacts in those siblings: + +1. 
**Discover siblings.** List sibling directories matching the versioned pattern `<name>-<version>/quality/BUGS.md` relative to the parent directory. Exclude the current directory. Sort by version descending (most recent first). +2. **Import confirmed bugs as seeds.** For each sibling with a `quality/BUGS.md`, extract confirmed bugs using the same format as Step 0a. Write them to `quality/SEED_CHECKS.md` with origin noted as the sibling directory name. +3. **Execute seed checks mechanically** (same as Step 0b in Phase 0a). For each imported seed, run the assertion against the current source tree and record PASS/FAIL. +4. **Inject into downstream phases** (same as Step 0d in Phase 0a). + +**Why this exists:** In v1.3.23 benchmarking, httpx produced a zero-bug result despite httpx-1.3.21 having found the `Headers.__setitem__` non-ASCII encoding bug. The model simply explored different code paths and never examined the Headers area. Sibling-run seeding ensures that bugs confirmed in prior versioned runs carry forward even without an explicit `previous_runs/` archive. This is a different failure class than mechanical tampering — it addresses **exploration non-determinism**, not evidence corruption. --- -## Phase 1: Explore the Codebase (Do Not Write Yet) +## Phase 1: Explore the Codebase (Write As You Go) + +> **Required references for this phase** — read these before proceeding: +> - `.github/skills/references/exploration_patterns.md` — six bug-finding patterns to apply after open exploration Spend the first phase understanding the project. The quality playbook must be grounded in this specific codebase — not generic advice. @@ -70,6 +301,44 @@ Spend the first phase understanding the project. The quality playbook must be gr **Scaling for large codebases:** For projects with more than ~50 source files, don't try to read everything. Focus exploration on the 3–5 core modules (the ones that handle the primary data flow, the most complex logic, and the most failure-prone operations).
Read representative tests from each subsystem rather than every test file. The goal is depth on what matters, not breadth across everything. +**Depth over breadth (critical).** A narrow scope with function-level detail finds more bugs than a broad scope with subsystem-level summaries. For each core module you explore, identify the specific functions that implement critical behavior and document them by name, file path, and line number. Requirements derived from "the reset subsystem should handle errors" will not catch bugs. Requirements derived from "`vm_reset()` at `virtio_mmio.c:256` must poll the status register after writing zero" will. The difference between a useful exploration and a useless one is specificity — file paths, function names, line numbers, exact behavioral rules. + +**Three-stage exploration: open first, then domain risks, then selected patterns.** Exploration has three stages, and the order matters: + +1. **Open exploration (domain-driven).** Before applying any structured pattern, explore the codebase the way an experienced developer would: read the code, understand the architecture, identify risks based on your domain knowledge of what goes wrong in systems like this one. Ask yourself: "What would an expert in [this domain] check first?" For an HTTP library, that means redirect handling, header encoding, connection lifecycle. For a CLI framework, that means flag parsing, help generation, completion/validation consistency. For a serialization library, that means type coverage, round-trip fidelity, edge-case handling. Write concrete findings with file paths and line numbers. This stage must produce at least 8 concrete bug hypotheses or suspicious findings — not architectural observations, but specific "this code at file:line might be wrong because [reason]" findings. At least 4 must reference different modules or subsystems. + +2. 
**Domain-knowledge risk analysis.** After open exploration, step back from the code and reason about what you know from training about systems like this one. This is the primary bug-hunting pass for library and framework codebases. Complete the Step 6 questions below using two sources — the code you just explored AND your domain knowledge of similar systems. Generate at least 5 ranked failure scenarios, each naming a specific function, file, and line, and explaining why a domain-specific edge case produces wrong behavior. You don't need to have observed these failures — you know from training that they happen to systems of this type. Write the results to the `## Quality Risks` section of EXPLORATION.md before proceeding to patterns. + + **What this stage must NOT produce:** A section that lists defensive patterns the code already has (things the code does RIGHT) is not a risk analysis. A section that lists risky modules without specific failure scenarios is not a risk analysis. A section that concludes "this is a mature, well-tested library so basic bugs are unlikely" is actively harmful — mature libraries have the most subtle bugs, precisely because the obvious ones were found years ago. The test: could a code reviewer read each scenario and immediately know what to check? If not, the scenario is too abstract. + +3. **Pattern-driven exploration (selected, not exhaustive).** After open exploration and domain-risk analysis are written to disk, evaluate all six analysis patterns from `exploration_patterns.md` using a pattern applicability matrix. For each pattern, assess whether it applies to this codebase and what it would target. Then select 3 to 4 patterns for deep-dive treatment — the highest-yield patterns for this specific codebase. The remaining patterns get a brief "not applicable" or "deferred" note with codebase-specific rationale. Do not produce deep sections for all six patterns — depth on 3–4 beats shallow coverage of 6. 
Select 4 when a fourth pattern has clear applicability and would cover code areas not reached by the other three; default to 3 when in doubt. + + For each selected pattern deep dive, use the output format from the reference file and trace code paths across 2+ functions. The deep dives should pressure-test, refine, or extend the findings from the open exploration and risk analysis — not repeat them. + +The Phase 1 completion gate checks for all three stages. The open exploration section, the quality risks section, the pattern applicability matrix, and the pattern deep-dive sections must all be present. + +**Write incrementally — do not hold findings in memory.** This is the single most important execution rule in Phase 1. After you explore each subsystem or apply each pattern, **immediately append your findings to `quality/EXPLORATION.md` on disk before moving to the next subsystem or pattern.** Do not try to hold findings in working memory across multiple subsystems. The write-as-you-go discipline serves two purposes: + +1. **Depth recovery.** If you explore the PCI interrupt routing subsystem and find suspicious code at `vp_find_vqs_intx()`, write that finding to EXPLORATION.md immediately. Then when you move to the admin queue subsystem, your working memory is free to go deep there. Without incremental writes, findings from the first subsystem compete with findings from the second, and both end up shallow. + +2. **Nothing gets lost.** In v1.3.41 benchmarking, the model explored 8 pattern sections but wrote only 5–7 lines per section — perfectly uniform, perfectly shallow. Every section passed the gate but none went deep enough to find bugs that require tracing code paths across multiple functions. The model was trying to compose the entire EXPLORATION.md at the end, after reading everything, and could only recall the surface-level findings. Incremental writes prevent this. 
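
The write-as-you-go rhythm amounts to a trivial append helper — sketched here for illustration only (`append_finding`, the section name, and the hypothesis text are all hypothetical):

```python
from pathlib import Path

EXPLORATION = Path("quality/EXPLORATION.md")

def append_finding(section: str, body: str) -> None:
    """Append one subsystem's findings to disk before exploring the next."""
    EXPLORATION.parent.mkdir(parents=True, exist_ok=True)
    with EXPLORATION.open("a", encoding="utf-8") as f:
        f.write(f"\n## {section}\n\n{body.strip()}\n")

# The rhythm: explore a subsystem, write immediately, move on.
append_finding(
    "PCI interrupt routing",  # illustrative section name
    "Hypothesis: the INTx fallback path frees the vq list on error but does "
    "not clear the saved IRQ — trace the retry path. (Cite file:line here.)",
)
```

The helper is deliberately dumb: the value is in the discipline of calling it after every subsystem, not in the code.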
+ +**The rhythm is: read a subsystem → write findings to disk → read the next subsystem → append findings → repeat.** Each append should include specific function names, file paths, line numbers, and concrete bug hypotheses. A 5-line section that says "checked cross-implementation consistency, found one gap" is a gate-passing placeholder, not an exploration finding. A useful section traces a code path: "function A at file:line calls function B at file:line, which does X but not Y; compare with function C at file:line which does both X and Y." + +**Mandatory consolidation step.** After all three stages (open exploration, quality risks, and selected pattern deep dives) are explored and written to EXPLORATION.md, add a final section: `## Candidate Bugs for Phase 2`. This section consolidates the strongest bug hypotheses from all earlier sections into a prioritized handoff list. For each candidate, include: the hypothesis, the specific file:line references, which stage surfaced it (open exploration, quality risks, or pattern), and what the code review should look for. This section is the bridge between exploration and artifact generation — it tells Phase 3 exactly where to focus. Minimum: 4 candidate bugs with file:line references — at least 2 from open exploration or quality risks, and at least 1 from a pattern deep dive. There is no maximum. + +**Pre-flight: Scope declaration for large repositories** + +Before exploring any source code, estimate scale: approximate source-file count (excluding tests, docs, and generated files), major subsystem count, and documentation volume. Note the count in PROGRESS.md. + +- **Fewer than 200 source files:** Proceed with full exploration. The depth-vs-breadth guidance above still applies. +- **200–500 source files:** Declare your intended scope before exploring. 
Write a `## Scope declaration` section to PROGRESS.md naming the 3–5 subsystems you will cover, the expected file count for each, and which subsystems you are deferring with rationale. Then proceed with exploration of the declared scope only. +- **More than 500 source files:** Stop and write a mandatory scope declaration to PROGRESS.md before reading any source files. The scope declaration must include: (a) the subsystems covered in this run, (b) the subsystems explicitly deferred, (c) the exclusion rationale for each deferred subsystem, and (d) recommended subsystem scope for follow-on runs. Do not begin exploration until this is written. A scope declaration that covers "everything" is not valid for repositories above this threshold. + +**Resuming a previous session:** If PROGRESS.md already exists and shows phases marked complete, read it first. Do not redo phases already marked complete — resume from the first phase marked incomplete. If a scope declaration is already written, honor it exactly. If the previous session's scope declaration deferred subsystems, do not expand scope to cover them unless this run is explicitly a follow-on for the deferred areas. + +**Specification-primary repositories:** Some repositories ship a specification, configuration, or protocol document as their primary product, with executable code as supporting infrastructure. Examples: a skill definition with benchmark tooling, a schema registry with validation scripts, a pipeline config with orchestration helpers. When the primary product is a specification rather than executable code, derive requirements from the specification's internal consistency, completeness, and correctness — not just from the executable code paths. The specification is the thing users depend on; the tooling is secondary. If you find yourself writing 80%+ of requirements about helper scripts and <20% about the primary specification, you have the focus inverted. 
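
The pre-flight scale estimate maps directly onto the three tiers above. A minimal sketch — the extension list and excluded directory names are assumptions to adjust per project:

```python
from pathlib import Path

# Illustrative defaults — tune for the project's languages and layout.
SOURCE_EXTS = {".py", ".c", ".h", ".rs", ".go", ".java", ".ts", ".js"}
EXCLUDE_PARTS = {"tests", "test", "docs", "doc", "generated", "vendor", "node_modules"}

def estimate_scale(root: str) -> tuple[int, str]:
    """Count source files (excluding tests/docs/generated) and pick the scope tier."""
    base = Path(root)
    count = 0
    for path in base.rglob("*"):
        if not path.is_file() or path.suffix not in SOURCE_EXTS:
            continue
        rel_dirs = {p.lower() for p in path.relative_to(base).parts[:-1]}
        if rel_dirs & EXCLUDE_PARTS:
            continue
        count += 1
    if count < 200:
        tier = "full exploration"
    elif count <= 500:
        tier = "scope declaration before exploring"
    else:
        tier = "mandatory scope declaration before reading any source"
    return count, tier
```

Record the count and tier in PROGRESS.md before reading any source files.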
+ ### Step 0: Ask About Development History Before exploring code, ask the user one question: @@ -87,6 +356,8 @@ This context is gold. A chat history where the developer discussed "why we chose If the user doesn't have chat history, proceed normally — the skill works without it, just with less context. +**Autonomous fallback:** When running in benchmark mode, via `run_playbook.sh`, or without user interaction (e.g., `--single-pass`), skip Step 0's question and proceed directly to Step 1. If chat history folders are visible in the project tree (e.g., `AI Chat History/`, `.chat_exports/`), scan them without asking. If no chat history is found, proceed — do not block waiting for a response that won't come. + ### Step 1: Identify Domain, Stack, and Specifications Read the README, existing documentation, and build config (`pyproject.toml` / `package.json` / `Cargo.toml`). Answer: @@ -114,6 +385,33 @@ When working from non-formal requirements, label each scenario and test with a * Use this exact tag format in QUALITY.md scenarios, functional test documentation, and spec audit findings. It makes clear which requirements are authoritative and which need validation. +### Step 1b: Evaluate Documentation Depth + +If `docs_gathered/` exists, read every file in it before deciding which subsystems to focus on. For each document, classify its depth: + +- **Deep** — contains internal contracts, safety invariants, concurrency models, defensive patterns, error handling details, or line-number-level source references. Suitable for deriving requirements. +- **Moderate** — covers architecture and API surface with some implementation detail. Useful for orientation but insufficient alone for requirement derivation. +- **Shallow** — API catalog, feature overview, or marketing-level summary. Lists what exists but not how it works, how it fails, or what contracts it enforces. 
**Not sufficient for scoping decisions.** + +**The scoping rule:** Do not narrow the audit scope to only the subsystems that have deep documentation. If the most complex or most failure-prone module has only shallow documentation, that is a **documentation gap to flag in PROGRESS.md**, not a reason to skip the module. The highest-risk code with the thinnest documentation is where bugs hide — auditing only well-documented areas produces a safe-looking report that misses real defects. + +When documentation is shallow for a high-risk area: + +1. Note the gap explicitly in PROGRESS.md under a `## Documentation depth assessment` section. +2. Derive requirements from source code directly (doc comments, safety annotations, defensive patterns, existing tests) and tag them as `[Req: inferred — from source]`. +3. Flag the area for deeper documentation gathering in the completeness report. + +Record the depth classification for each `docs_gathered/` file in PROGRESS.md so reviewers can assess whether the documentation influenced the scope appropriately. + +**Coverage commitment table:** After classifying all `docs_gathered/` documents, produce this table in PROGRESS.md under the `## Documentation depth assessment` section: + +| Document | Depth | Subsystem | Requirements commitment | If excluded: justification | +|----------|-------|-----------|------------------------|---------------------------| + +For every **deep** document, map it to the subsystem it covers, then either commit to deriving requirements from it ("will cover in Phase 2") or provide a specific justification that names the tradeoff. A sentence like "out of scope for this run" is not sufficient — the justification must say *why*, e.g., "interpreter JIT is excluded because this run focuses on the parser/compiler/GC pipeline; separate run recommended." + +**Gate:** A high-risk subsystem documented deeply in `docs_gathered/` must not silently disappear from the requirements set. 
If a deep document has a "will cover" commitment but produces zero requirements by the end of Step 7, the requirements pipeline is incomplete — go back and derive requirements for the gap before proceeding to Phase 2 artifact generation. + ### Step 2: Map the Architecture List source directories and their purposes. Read the main entry point, trace execution flow. Identify: @@ -135,7 +433,7 @@ Read the existing test files — all of them for small/medium projects, or a rep Walk each spec document section by section. For every section, ask: "What testable requirement does this state?" Record spec requirements without corresponding tests — these are the gaps the functional tests must close. -If using inferred requirements (from tests, types, or code behavior), tag each with its confidence tier using the `[Req: tier — source]` format defined in Step 1. Inferred requirements feed into QUALITY.md scenarios and should be flagged for user review in Phase 4. +If using inferred requirements (from tests, types, or code behavior), tag each with its confidence tier using the `[Req: tier — source]` format defined in Step 1. Inferred requirements feed into QUALITY.md scenarios and should be flagged for user review in Phase 7. ### Step 4b: Read Function Signatures and Real Data @@ -179,7 +477,9 @@ If the project has a validation layer (Pydantic models in Python, JSON Schema, T **Read `references/schema_mapping.md`** for the mapping format and why this matters for writing valid boundary tests. -### Step 6: Identify Quality Risks (Code + Domain Knowledge) +### Step 6: Domain-Knowledge Risk Analysis (Code + Domain Knowledge) + +**This is the primary bug-hunting pass for library and framework codebases.** Complete it before selecting any structured patterns. Write the results to the `## Quality Risks` section of EXPLORATION.md immediately — do not hold them in memory. Every project has a different failure profile. 
This step uses **two sources** — not just code exploration, but your training knowledge of what goes wrong in similar systems. @@ -190,22 +490,437 @@ Every project has a different failure profile. This step uses **two sources** - Where do cross-cutting concerns hide? **From domain knowledge**, ask: -- "What goes wrong in systems like this?" — If it's a batch processor, think about crash recovery, idempotency, silent data loss, state corruption. If it's a web app, think about auth edge cases, race conditions, input validation bypasses. If it handles randomness or statistics, think about seeding, correlation, distribution bias. -- "What produces correct-looking output that is actually wrong?" — This is the most dangerous class of bug: output that passes all checks but is subtly corrupted. +- "What goes wrong in systems like this?" — If it's an HTTP router, think about header parsing edge cases (quality values, token lists, case sensitivity), middleware ordering dependencies, and path normalization. If it's an HTTP client, think about redirect credential stripping, encoding detection, and connection state leaking. If it's a serialization library, think about null handling asymmetry, API surface consistency between direct methods and view wrappers, lazy evaluation caching bugs, and round-trip fidelity. If it's a web framework, think about response helper edge cases, configuration compilation chains, and middleware state isolation. If it's a batch processor, think about crash recovery, idempotency, silent data loss, state corruption. If it handles randomness or statistics, think about seeding, correlation, distribution bias. +- "What produces correct-looking output that is actually wrong?" — This is the most dangerous class of bug: output that passes all checks but is subtly corrupted. A response with a `200 OK` but the wrong `Content-Type`. A redirect that succeeds but leaks credentials. A deserialized object that has silently truncated values. 
- "What happens at 10x scale that doesn't happen at 1x?" — Chunk boundaries, rate limits, timeout cascading, memory pressure. - "What happens when this process is killed at the worst possible moment?" — Mid-write, mid-transaction, mid-batch-submission. +- "Where do two surfaces that should behave the same drift on edge inputs?" — Overloads, aliases, sync/async APIs, builder vs direct APIs, direct mutators vs live views/wrappers, stdlib-compatible wrappers vs framework-native surfaces. For Java/Kotlin: `add(null)` vs `asList().add(null)`, `put(key,null)` vs `asMap().put(key,null)`. For Python: constructor encoding vs mutator encoding, sync vs async client behavior. +- "What emits plausible output with subtly wrong metadata?" — Content type, charset, route pattern, ETag strength, byte count, auth/header/cookie propagation, status code, cache validators. +- "What standard grammar or list syntax is being parsed with ad hoc string logic?" — Quality values (`q=0`), comma-separated headers, digest challenges, MIME types with parameters, query strings, enum/keyword sets, cookie merging. +- "What edge-case inputs would a domain expert reach for?" — For HTTP code: `Accept-Encoding: gzip;q=0`, `Connection: keep-alive, Upgrade`, `Content-Type: application/problem+json`. For serialization code: `null` through different API surfaces, values at `Integer.MAX_VALUE + 1`, round-tripping through encode-then-decode. For routing code: overlapping patterns, mounted prefix propagation, same path with different methods. - "What information does the user need before committing to an irreversible or expensive operation?" — Pre-run cost estimates, confirmation of scope (especially when fan-out or expansion will multiply the work), resource warnings. If the system can silently commit the user to hours of processing or significant cost without showing them what they're about to do, that's a missing safeguard. 
Search for operations that start long-running processes, submit batch jobs, or trigger expansion/fan-out — and check whether the user sees a preview, estimate, or confirmation with real numbers before the point of no return. - "What happens when a long-running process finishes — does it actually stop?" — Polling loops, watchers, background threads, and daemon processes that run until completion should have explicit termination conditions. If the loop checks "is there more work?" but never checks "is all work done?", it will run forever after completion. This is especially common in batch processors and queue consumers. -Generate realistic failure scenarios from this knowledge. You don't need to have observed these failures — you know from training that they happen to systems of this type. Write them as **architectural vulnerability analyses** with specific quantities and consequences. Frame each as "this architecture permits the following failure mode" — not as a fabricated incident report. Use concrete numbers to make the severity non-negotiable: "If the process crashes mid-write during a 10,000-record batch, `save_state()` without an atomic rename pattern will leave a corrupted state file — the next run gets JSONDecodeError and cannot resume without manual intervention." Then ground them in the actual code you explored: "Read persistence.py line ~340 (save_state): verify temp file + rename pattern." +Generate at least 5 ranked failure scenarios from this knowledge. You don't need to have observed these failures — you know from training that they happen to systems of this type. Write them as **specific bug hypotheses with file-path and line-number citations**, ranked by priority. Frame each as: "Because [code at file:line] does [X], a [domain-specific edge case] will produce [wrong behavior] instead of [correct behavior]." Then ground them in the actual code you explored: "Read persistence.py line ~340 (save_state): verify temp file + rename pattern." 
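One of the domain-knowledge probes above — quality values parsed with ad hoc string logic — shows what a concrete, testable hypothesis looks like. This is an illustrative sketch only (the header value is hypothetical; a real review would target the project's own parsing function): a substring match treats `gzip;q=0` as acceptance, while the quality-value grammar says `q=0` means the client refuses gzip.

```shell
# Illustrative only: why a substring match is the wrong test for Accept-Encoding.
header='gzip;q=0, identity'

# Ad hoc check: matches the token even though q=0 disables it
echo "$header" | grep -q 'gzip' && echo "naive check: gzip accepted (wrong)"

# q-aware check: isolate the gzip entry, read its q parameter (default 1)
q=$(echo "$header" | tr ',' '\n' | grep -i 'gzip' | grep -oE 'q=[0-9.]+' | cut -d= -f2)
q=${q:-1}
awk -v q="$q" 'BEGIN { exit (q > 0) ? 0 : 1 }' \
  && echo "q-aware check: gzip accepted" \
  || echo "q-aware check: gzip refused (correct)"
```

A failure scenario written at this level of specificity tells the reviewer exactly which function to open and which input to feed it.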
+ +**Anti-patterns that fail the gate:** A Quality Risks section that lists defensive patterns the code already has (things the code does right) is not a risk analysis — it is a reassurance exercise and will not find bugs. A section that lists risky modules without specific failure scenarios is not actionable. A section that concludes "this is a mature, well-tested library so basic bugs are unlikely" is actively harmful — mature libraries have the most subtle API-contract and edge-case bugs, precisely because the obvious ones were found years ago. The test: could a code reviewer read each scenario and immediately know what function to open and what input to test? If not, the scenario is too abstract. + +### Step 7: Derive Testable Requirements + +**Read `references/requirements_pipeline.md`** for the complete five-phase pipeline, domain checklist, and versioning protocol. + +This is the most important step for the code review protocol. Everything found during exploration — specs, ChangeLog entries, config structs, source comments, chat history — gets distilled into a set of testable requirements that the code review will verify. The pipeline separates contract discovery from requirement derivation, uses file-based external memory, and includes mechanical verification with a completeness gate. + +**Why this matters:** Structural code review catches about 65% of real defects. The remaining 35% are intent violations — absence bugs, cross-file contradictions, and design gaps. These are invisible to code reading because the code that IS there is correct. You need to know what the code is supposed to do, then check whether it does it. That's what testable requirements provide. + +**The five-phase pipeline:** + +1. **Phase A — Contract extraction.** Read all source files, list every behavioral contract. Write to `quality/CONTRACTS.md`. This is discovery — list everything, even if it seems obvious. +2. 
**Phase B — Requirement derivation.** Read CONTRACTS.md and documentation. Group related contracts, enrich with user intent, write formal requirements. Write to `quality/REQUIREMENTS.md`. For each requirement, record the **doc source** with its authority tier — the specific gathered document (filename), section, and passage that establishes the expected behavior, prefixed with `[Tier N]`. If the requirement derives from source code rather than documentation, record `[Tier 3] [source] file:line` and tag it `[Req: inferred — from source]`. This doc-source field creates the forward link in the traceability chain: gathered docs → requirements → bugs → tests. Without it, a reviewer cannot verify that the requirement reflects the spec's actual intent rather than the agent's interpretation. + + **Primary-source extraction rule for code-presence claims.** When writing a requirement that asserts specific constants, values, or labels are handled by a specific function (e.g., "the whitelist must preserve X, Y, and Z"), the requirement must distinguish between what the **spec says should be there** and what the **code actually contains**. Extract the actual contents from the code (case labels, map keys, if-else branches) and compare to the spec's list. If a constant appears in the spec but NOT in the code, write the requirement as "must handle X — **[NOT IN CODE]**: defined in header.h:NN but absent from function() at file.c:NN-NN." Do not write "must preserve X" without verifying X is actually preserved. This prevents a contamination chain where a requirement asserts code presence, the code review copies the assertion, the spec audit inherits it, and the triage accepts it — all without anyone reading the actual code. This exact chain was observed in v1.3.17 virtio testing: REQUIREMENTS.md asserted RING_RESET was preserved in a switch, the code review copied the list, three spec auditors inherited the claim, and the bug went undetected. 
+ **Mechanical verification artifact for dispatch functions (mandatory).** When a contract asserts that a function handles, preserves, or dispatches a set of named constants (feature bits, enum values, opcode tables, event types, handler registries), you must generate and execute a shell command or script that mechanically extracts the actual case labels/branches from the function body **before writing the contract line**. Save the raw output to `quality/mechanical/<function>_cases.txt`. The command must be a non-interactive pipeline (e.g., `awk` + `grep`) that cannot hallucinate — it reads file bytes and prints matches. Example:
+
+ ```bash
+ awk '/void vring_transport_features/,/^}$/' drivers/virtio/virtio_ring.c \
+   | grep -E '^\s*case\s+' > quality/mechanical/vring_transport_features_cases.txt
+ ```
+
+ After execution, read the output file and use it as the sole source of truth for what the function handles. A contract line asserting "function preserves constant X" is **forbidden** unless `quality/mechanical/<function>_cases.txt` contains a matching `case X:` line. If a constant appears in a spec or header but NOT in the mechanical output, the contract must record it as absent: `"must handle X — **[NOT IN CODE]**: defined in header.h:NN but absent from function() per mechanical check."` Downstream artifacts (`REQUIREMENTS.md`, `RUN_SPEC_AUDIT.md`, code review) must cite the mechanical file path when referencing dispatch-function coverage — they may not replace the mechanical output with a hand-written list.
+
+ **Mechanical artifact integrity check (mandatory).** For each mechanical extraction command, also append it to `quality/mechanical/verify.sh` as a verification step. The script must re-run the same extraction pipeline and diff the result against the saved file.
Generate `verify.sh` with this structure:
+
+ ```bash
+ #!/bin/bash
+ # Auto-generated: re-run mechanical extraction commands and verify saved artifacts
+ set -euo pipefail
+ FAIL=0
+
+ # Verify
+ ACTUAL=$(awk '/void vring_transport_features/,/^}$/' drivers/virtio/virtio_ring.c | grep -E '^\s*case\s+')
+ SAVED=$(cat quality/mechanical/vring_transport_features_cases.txt)
+ if [ "$ACTUAL" != "$SAVED" ]; then
+   echo "MISMATCH: vring_transport_features_cases.txt"
+   diff <(echo "$ACTUAL") <(echo "$SAVED") || true
+   FAIL=1
+ else
+   echo "OK: vring_transport_features_cases.txt"
+ fi
+
+ exit $FAIL
+ ```
+
+ **Phase 6 must execute `bash quality/mechanical/verify.sh`** and the benchmark fails if any artifact mismatches. This catches a failure mode observed in v1.3.19: the model executed the extraction command but wrote its own expected output to the file instead of letting the shell redirect capture it, inserting a hallucinated `case VIRTIO_F_RING_RESET:` line that the real command does not produce. Re-running the same command in a separate step and diffing against the file detects this tampering.
+
+ **Immediate integrity gate (mandatory, Phase 2a).** Run `bash quality/mechanical/verify.sh` **immediately** after writing each `*_cases.txt` file and **before** writing any contract, requirement, or prose artifact that cites the extraction. If exit code ≠ 0: stop, delete the failed `*_cases.txt`, re-run the extraction command with a fresh shell redirect (do not hand-edit the output), and re-verify. Do not advance to Phase 3/2c until verify.sh exits 0. Save verify.sh stdout and exit code to `quality/results/mechanical-verify.log` and `quality/results/mechanical-verify.exit` as durable receipt files.
This gate exists because v1.3.23 showed that deferring verification to Phase 6 allows downstream artifacts (CONTRACTS.md, REQUIREMENTS.md, triage probes) to build on a forged extraction — the model reconciles a discrepancy between requirements and the artifact by corrupting the artifact instead of correcting the requirement. + + **Mechanical artifacts are immutable after extraction.** Once a `*_cases.txt` file has been written by the shell redirect and verified by `verify.sh`, it must not be modified, overwritten, or regenerated for the remainder of the run. If a downstream step discovers a discrepancy between the mechanical artifact and a requirement or contract, the requirement or contract is wrong — not the artifact. Fix the prose, not the extraction. This rule prevents the v1.3.23 failure mode where the model overwrote a correct extraction with fabricated content to match its own narrative. + + **Forbidden probe pattern (triage and verification).** Triage probes, verification probes, and audit assertions must not use `open('quality/mechanical/...')` or `cat quality/mechanical/...` as sole evidence for what a source file contains at a given line. To verify that function F handles constant C at line N, the probe must either: (a) read the source file directly (`open('drivers/virtio/virtio_ring.c')` with line-anchored assertions), or (b) re-execute the same extraction pipeline used by `verify.sh` and check its output. Reading the saved artifact proves only what the artifact says, not what the code says — this is circular verification. In v1.3.23, Probe C validated the forged artifact instead of the source code, passing with fabricated data. + + **Do not create an empty mechanical/ directory.** Only create `quality/mechanical/` if the project's contracts include dispatch functions, registries, or enumeration checks that require mechanical extraction. 
If no such contracts exist, skip the directory entirely and record in PROGRESS.md: `Mechanical verification: NOT APPLICABLE — no dispatch/registry/enumeration contracts in scope.` Creating an empty mechanical/ directory (or one without verify.sh) is non-conformant — it signals that extraction was attempted and abandoned. Decide before creating the directory: does this project have dispatch-function contracts? If no, don't `mkdir`. If yes, populate it fully. + + **Normative vs. descriptive split.** Requirements and contracts must use normative language ("must preserve," "should handle") for expected behavior. They may only use descriptive language ("preserves," "handles") when the mechanical verification artifact confirms the claim. A requirement that says "the implementation preserves VIRTIO_F_RING_RESET" without a confirming mechanical artifact is non-conformant — write "the implementation **must** preserve VIRTIO_F_RING_RESET" and cite the mechanical check result showing whether the constant is currently present or absent. + +3. **Phase C — Coverage verification.** Cross-reference every contract against every requirement. Fix gaps. Loop up to 3 times until coverage reaches 100%. Write to `quality/COVERAGE_MATRIX.md`. The matrix must have **one row per requirement** (REQ-001, REQ-002, etc.) — not grouped ranges like "C-001 to C-007 | REQ-001, REQ-003". Grouped ranges make machine verification impossible and hide gaps. +4. **Phase D — Completeness check + self-refinement loop.** Apply the domain checklist, testability audit, and cross-requirement consistency check. Also verify that every deep document with a "will cover" commitment in the coverage commitment table has at least one requirement traced to it — if not, add requirements for the gap before continuing. 
+ + Write to `quality/COMPLETENESS_REPORT.md` as a **baseline** completeness report (without a `## Verdict` section — the verdict is deferred to Phase 5 post-reconciliation, which produces the only verdict that counts for closure). Then run up to 3 self-refinement iterations: read the report, fix gaps, re-check. Short-circuit when fewer than 3 changes per iteration. +5. **Phase E — Narrative pass.** Add project overview (with overview validation gate), then derive use cases (with use case derivation gate). Both gates must pass before proceeding to category narratives, cross-cutting concerns, and final reordering. This sequencing prevents multi-pass loops where a failed late gate forces re-derivation. Reorder for top-down flow. Renumber sequentially. + +**REQUIREMENTS.md must begin with a human-readable overview** that answers: What is this project? What does it do? Who are the actors (users, systems, hardware, protocols)? What are the highest-risk areas? This overview should be useful to someone who has never seen the project before. If the project is a library or driver where all actors are systems, describe the system actors (kernel maintainers, protocol peers, integrators, end-user developers) and their interactions. Do not start with raw scope metadata or HTML comments — lead with a plain-language description. + +**Overview validation gate (mandatory).** After writing the overview, perform this self-check before proceeding to use case derivation: + +> Does this overview describe the project the way its actual users would recognize it? Specifically: +> - Does it name the project's ecosystem role and real-world significance? +> - Does it identify who depends on it and for what? +> - Would a developer who uses this project daily say "yes, that's what it is and why it matters"? 
+> - For well-known projects, does it reflect publicly known adoption (e.g., Cobra → kubectl/Hugo/GitHub CLI; Express → millions of Node.js API servers; Zod → form validation/tRPC; Serde → the default Rust serialization layer)? + +If the overview reads like it was written by someone who only read the source code and never used the software, revise it before proceeding. The overview sets the frame for everything downstream — feature-oriented use cases and internally focused requirements are symptoms of an overview that only describes the code, not the project. + +**Use case derivation (mandatory, runs after overview gate).** Derive 5–7 use cases from the validated overview and gathered documentation, then validate them against the code. Each use case must: + +- Describe a **real user outcome**, not a code feature. "Developer builds a CLI tool with nested subcommands, persistent flags, and shell completion" — not "Framework supports command trees." +- Name a **concrete actor** and what they are trying to accomplish. Actors include end-user developers, system administrators, kernel maintainers, protocol peers, integrators, and automated consumers. +- Be **recognizable to an actual user** of the software. For well-known projects, validate use cases against the model's own knowledge of the project, community docs, tutorials, and real-world adoption patterns. +- Connect to at least one requirement through testable conditions of satisfaction. + +The pipeline should explicitly ask: "Based on this project's overview, gathered documentation, and known user base, what are the 5–7 most important things real users do with this software?" Derive use cases from that question — not from scanning the code and grouping features into categories. + +**Use case validation against code:** After deriving use cases from the overview and docs, verify each one against the codebase. If a use case describes something the code doesn't actually support, revise or remove it. 
If the code supports an important user outcome that no use case covers, add one. The goal is use cases that are both user-recognizable AND code-grounded. + +**Acceptance criteria span check (mandatory, runs after use case derivation).** After use cases are finalized and validated against code, check whether the conditions of satisfaction across all requirements collectively span the project's main behaviors: + +> Do these acceptance criteria, taken together, cover the project? Is there a major user-facing behavior described in the overview or use cases that no requirement's conditions of satisfaction would catch if it broke? + +For each use case, at least one requirement's conditions of satisfaction must be traceable to it, and at least one linked requirement must be `specific` (not `architectural-guidance`). Use cases with no linked specific requirements indicate a gap. When gaps are found, either: (a) add new requirements or sharpen existing conditions to cover the gap, or (b) revise the use case if it doesn't reflect what the requirements actually protect. Record the results of this check in the completeness report. + +Follow the use cases with the individual requirements. + +**For each requirement, provide all of these fields:** + +- **Summary**: State it as a testable assertion: "X must satisfy Y" or "When A, the system must B" +- **User story**: Frame it from the caller's perspective: "As a [role] doing [action], I expect [behavior] **so that** [outcome]." The "so that" clause is mandatory — it forces you to articulate the intent behind the requirement. +- **Implementation note**: How the code achieves this requirement — the mechanism, the relevant code paths, the design choice. +- **Conditions of satisfaction**: Specific, testable scenarios that prove this requirement is met. Include the happy path, edge cases, and failure modes. Each individual contract from Phase A that was grouped into this requirement becomes a condition of satisfaction. 
+- **Alternative paths**: Multiple code paths, modes, or entry points that must all satisfy the requirement. Alternative paths are where bugs hide. +- **References**: Cite the source — spec section, ChangeLog entry, config field definition, source comment, issue number, or domain knowledge. +- **Doc source**: The specific gathered document and section that establishes this requirement's expected behavior, with an inline authority tier. Format: `[Tier N] [doc_filename] § [section/page]` with a ≤15-word behavioral contract quote. If derived from source code, use `[Tier 3] [source] file:line` with the relevant comment or assertion. This field feeds the TDD traceability chain — when a bug violates this requirement, the test cites this passage. + + **Authority tiers:** + - **Tier 1 (Canonical):** Official API docs, published specs, language/protocol standards. These directly state the expected behavior. + - **Tier 2 (Strong secondary):** Design documents, gathered docs with behavioral contracts, well-maintained READMEs, inline Javadoc/docstrings that define public API contracts, formal locking annotations and safety invariants that document caller-facing contracts (not incidental implementation notes — the test is "was this written as a deliberate contract for callers?"). + - **Tier 3 (Weak secondary):** Changelogs, issue summaries, troubleshooting guides, source comments, test files, migration guides. + + Requirements backed only by Tier 3 sources must be tagged `[Req: inferred]` with a note explaining why no stronger source exists. The completeness report must flag the ratio of Tier 1/2/3 sources across all requirements. 
+- **Specificity**: **specific** (testable — must have conditions of satisfaction that a code reviewer can check against a specific code location or behavior; this is the default and counts toward coverage metrics) or **architectural-guidance** (not testable against individual code paths — covers cross-cutting properties like "remain lightweight and stdlib-compatible" or "no_std support"; informs the quality constitution but is not counted in coverage metrics; most projects should have 0–3 architectural-guidance requirements — more than 3 triggers the mandatory self-check below). The category "directional" is retired. Any requirement that would have been "directional" must either be made specific (with testable conditions) or explicitly classified as architectural-guidance. + + **Architectural-guidance self-check (mandatory, runs after requirement derivation).** Count the requirements tagged `architectural-guidance`. Apply both bounds: + + - **Maximum bound (>3):** If the count exceeds 3, stop and re-examine each one. For each, ask: "Can I add a testable condition of satisfaction that a code reviewer could verify against a specific code location?" If yes, reclassify it as `specific` and add the condition. Only requirements that genuinely cannot be verified against any specific code path should remain `architectural-guidance`. A final count above 3 requires an explicit justification per excess requirement explaining why it cannot be made specific. + - **Minimum bound (0 on 15+ requirements):** If the total requirement count is 15 or more and the `architectural-guidance` count is 0, re-examine the requirements for cross-cutting design invariants. Libraries that span protocol layers, manage resource lifecycles, enforce ordering guarantees, or maintain compatibility contracts (e.g., "remain stdlib-compatible," "preserve no_std support," "maintain wire-format backward compatibility") typically have 1–3 architectural-guidance requirements. 
Write one sentence in the completeness report explaining why no requirement qualified as architectural-guidance, or reclassify the appropriate requirements. + + Record the count and any reclassifications in the completeness report. + +**Do not cap the requirement count.** Derive as many as the project warrants. A small utility might have 20. A mature library might have 100+. The goal is completeness. + +**Step 7a: Documentation-to-requirement reconciliation** + +Re-read the coverage commitment table from PROGRESS.md. For each deep document you committed to covering ("will cover in Phase 2"), verify that at least one requirement traces to the subsystem it documents. If your requirements cover only some committed subsystems, add requirements for the gaps before completing Step 7. + +For each subsystem, record one of the following in PROGRESS.md: +- the requirement IDs that cover it, or +- an explicit exclusion with rationale, risk acknowledgment, and recommended follow-up + +A deep-documented subsystem with a "will cover" commitment and zero mapped requirements is a process failure, not a legitimate scope choice. Do not proceed to artifact generation until every commitment is satisfied or explicitly converted to a justified exclusion. + +**Step 7b: Bidirectional traceability check (mandatory)** + +**Timing: Execute Step 7a and 7b after Phase E completes** (i.e., after the overview validation gate, use case derivation, and acceptance criteria span check have all run). The reverse traceability check depends on finalized requirements AND finalized use cases. + +After requirements derivation is complete, run a reverse traceability check. Forward traceability (gathered docs → requirements → bugs → tests) is already built into the pipeline. This step checks the reverse direction: do significant code paths map back to requirement conditions? + +This operates at **path/branch/helper granularity**, not file level. 
File-level coverage was 100% in v1.3.13 and still missed two real bugs. The question is not "does this file map to some requirement?" but "does this significant branch map to a requirement clause that states what must be preserved here?" + +**Scoped to four categories** (not an open-ended branch audit): + +1. **Alternative paths already named in requirements.** If a requirement mentions fallback or alternative paths (e.g., "primary vs. degraded mode," "negotiated vs. default configuration," "sync vs. async"), each alternative must have an explicit **symmetry condition** — a statement of what invariant must hold across both paths. A requirement that says "the system handles both X and Y" without specifying what "handles" means for each is incomplete. + +2. **Helpers that translate public constants into runtime behavior.** If a helper function whitelists, filters, or translates between defined constants and runtime behavior (e.g., feature flag gates, codec registry lookups, capability whitelist helpers), it must have a helper-specific requirement enumerating the expected preserved/translated values. + +3. **Capability-negotiation and fallback logic.** Code paths where the system negotiates capabilities with an external peer (protocol version negotiation, feature detection, graceful degradation) must have requirements covering both the negotiated-up and negotiated-down paths. + +4. **Functions named in prior BUGS.md, VERSION_HISTORY.md, or spec audit outputs.** If a previous run found a bug in a specific function, future runs must show explicit re-check evidence for that function ("known bug class sentinels"). This prevents the "lost requirement" regression class. If prior spec audit outputs exist in `quality/spec_audits/`, read them before running the sentinel check — cross-model findings from council reviews are a high-value source of known bug surfaces. + +For each category, check whether the requirements contain specific conditions covering the identified paths. 
Orphaned paths — significant code paths without requirement coverage — trigger a "coverage gap" marker in the completeness report. These gaps must be resolved (by adding requirement conditions or by providing explicit justification) before the completeness report can declare requirements sufficient. + +**Carry-forward rule:** When a prior run's REQUIREMENTS.md exists in the quality directory, the pipeline must read it and check whether any conditions from the prior version were dropped. If conditions were dropped, the pipeline must either: (a) re-derive them with updated justification, or (b) document why the condition is no longer relevant. Silent drops are not permitted — they are a direct cause of regressions where previously learned requirements are lost between runs. + +**After the pipeline:** The skill also generates `quality/REVIEW_REQUIREMENTS.md` (interactive review protocol) and `quality/REFINE_REQUIREMENTS.md` (refinement pass protocol). These support iterative improvement — the user can review requirements interactively, run refinement passes with different models, and keep versioned backups of each iteration. See `references/requirements_pipeline.md` for the full versioning protocol and backup structure. + +Record all requirements in a structured format. These feed directly into the code review protocol's verification and consistency passes. + +### Checkpoint: Initialize PROGRESS.md + +After completing Phase 1 exploration, create `quality/PROGRESS.md`. This file is the skill's external memory — it persists state across phases so that context is never lost, even if the session is interrupted and resumed. It also serves as an audit trail for debugging and improvement. + +**Why this exists:** In single-session runs, the agent holds context in memory. But context degrades over long sessions — findings from Phase 1 are forgotten by Phase 6, BUG counts drift, spec-audit bugs get orphaned because the closure check never saw them. 
PROGRESS.md solves this by making every phase write its state to disk. The agent reads it back before each phase, so it always has an accurate picture of what happened so far. As a side benefit, it makes the skill work correctly even if the run is split across multiple sessions. + +**Checkpoint discipline for long runs:** After each requirements-pipeline phase (Contracts, Requirements, Coverage Matrix, Completeness, Narrative), update `quality/PROGRESS.md` with: completed phase, artifact paths, current scoped subsystems, remaining work, and exact resume point. This ensures a resumed session can continue from the last completed checkpoint without redoing work. + +**Timestamp discipline:** Write each phase completion entry to PROGRESS.md immediately when you finish that phase, before starting the next phase. Do not batch-write or back-fill timestamps after the fact. The timestamps are an audit trail — if Phase 2 shows a completion time earlier than Phase 1, a reviewer cannot verify that phases ran in the correct sequence. If you realize you forgot to write a checkpoint, write it now with an honest timestamp and a note explaining the gap. 
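The timestamp discipline can be made mechanical. A minimal sketch, run at the moment a phase completes (the phase name and entry text are placeholders; match them to your run's PROGRESS.md entry format):

```shell
# Append a checkpoint entry the instant a phase completes; never back-fill.
mkdir -p quality
stamp=$(date -u +'%Y-%m-%dT%H:%M:%SZ')
printf -- '- [x] Phase 1: Exploration — completed %s\n' "$stamp" >> quality/PROGRESS.md
```

Because the shell writes the timestamp at execution time, the audit trail cannot drift the way hand-typed times can.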
+ +Write the initial PROGRESS.md: + +```markdown +# Quality Playbook Progress + +## Run metadata +Started: [date/time] +Project: [project name] +Skill version: [read from .github/skills/SKILL.md metadata — must match exactly] +With docs: [yes/no] + +## Phase completion +- [x] Phase 1: Exploration — completed [date/time] +- [ ] Phase 2: Artifact generation (QUALITY.md, REQUIREMENTS.md, tests, protocols, BUGS.md, RUN_TDD_TESTS.md, AGENTS.md) +- [ ] Phase 3: Code review + regression tests +- [ ] Phase 4: Spec audit + triage +- [ ] Phase 5: Post-review reconciliation + closure verification +- [ ] TDD logs: red-phase log for every confirmed bug, green-phase log for every bug with fix patch +- [ ] Phase 6: Verification benchmarks + +## Artifact inventory +| Artifact | Status | Path | Notes | +|----------|--------|------|-------| +| QUALITY.md | pending | | | +| REQUIREMENTS.md | pending | | | +| CONTRACTS.md | pending | | | +| COVERAGE_MATRIX.md | pending | | | +| COMPLETENESS_REPORT.md | pending | | | +| Functional tests | pending | | | +| RUN_CODE_REVIEW.md | pending | | | +| RUN_INTEGRATION_TESTS.md | pending | | | +| BUGS.md | pending | | | +| RUN_TDD_TESTS.md | pending | | | +| RUN_SPEC_AUDIT.md | pending | | | +| AGENTS.md | pending | | | +| tdd-results.json | pending | quality/results/ | Structured TDD output | +| integration-results.json | pending | quality/results/ | Structured integration output | +| Bug writeups | pending | quality/writeups/ | One per TDD-verified bug | + +## Cumulative BUG tracker + + +| # | Source | File:Line | Description | Severity | Closure Status | Test/Exemption | +|---|--------|-----------|-------------|----------|----------------|----------------| + + + +## Terminal Gate Verification + + +## Exploration summary +[Brief notes on architecture, key modules, spec sources, defensive patterns found] +``` + +Update this file after every phase. 
The cumulative BUG tracker is the most important section — it ensures no finding is orphaned regardless of which phase produced it. + +### Write exploration findings to disk + +After initializing PROGRESS.md, write your full exploration findings to `quality/EXPLORATION.md`. This file captures everything you learned in Phase 1 so it can survive a context boundary (session break, multi-pass handoff, or long-run memory degradation). Structure it as: + +```markdown +# Exploration Findings + +## Domain and Stack +[Language, framework, build system, deployment target] + +## Architecture +[Key modules with file paths, entry points, data flow, layering] + +## Existing Tests +[Test framework, test count, coverage areas, gaps] + +## Specifications +[What docs_gathered/ contains, key spec sections, behavioral rules] + +## Open Exploration Findings +[At least 8 concrete findings from domain-driven investigation. +Each must have a file path, line number, and specific bug hypothesis. +At least 4 must reference different modules or subsystems. +At least 3 must trace a behavior across 2+ functions.] + +## Quality Risks +[At least 5 domain-driven failure scenarios ranked by priority. +Each must name a specific function, file, and line and explain the failure +mechanism using domain knowledge of what goes wrong in systems like this. +These are hypotheses, not confirmed bugs — they tell Phase 2 where to look. +Frame each as: "Because [code at file:line] does [X], a [domain-specific +edge case] will produce [wrong behavior] instead of [correct behavior]." +A section that lists defensive patterns the code already has does NOT belong here.] 
+ +## Skeletons and Dispatch +[State machines, dispatch tables, feature registries — with file:line citations] + +## Pattern Applicability Matrix +| Pattern | Decision (`FULL` / `SKIP`) | Target modules | Why | +|---|---|---|---| +| Fallback and Degradation Path Parity | | | | +| Dispatcher Return-Value Correctness | | | | +| Cross-Implementation Consistency | | | | +| Enumeration and Representation Completeness | | | | +| API Surface Consistency | | | | +| Spec-Structured Parsing Fidelity | | | | + +[3 to 4 patterns must be marked FULL. The rest are SKIP with codebase-specific rationale. Select 4 when a fourth pattern clearly applies and covers different code areas.] + +## Pattern Deep Dive — [Pattern Name] +[Use the output format from `exploration_patterns.md`. +Trace the relevant code path across 2+ functions, implementations, or API surfaces. +Each deep dive should pressure-test, refine, or extend findings from the open +exploration and quality risks stages.] + +## Pattern Deep Dive — [Pattern Name] +[Repeat for each selected FULL pattern. 3 to 4 deep-dive sections total.] + +## Pattern Deep Dive — [Pattern Name] +[Third and final deep dive.] + +## Candidate Bugs for Phase 2 +[Consolidated from ALL earlier sections — open exploration, quality risks, AND patterns. +Minimum 4 candidates with file:line references. At least 2 from open exploration or +quality risks, at least 1 from a pattern deep dive. For each candidate include the +source stage and what the Phase 2 code review should inspect.] + +## Derived Requirements +[REQ-001 through REQ-NNN, each with spec basis and tier] + +## Derived Use Cases +[UC-01 through UC-NN, each with actor, trigger, expected outcome] + +## Notes for Artifact Generation +[Anything the next phase needs to know — naming conventions, test patterns, framework quirks] + +## Gate Self-Check +[Written by the Phase 1 gate. Each check 1–12 with PASS/FAIL and one-line evidence. +This section proves the gate was executed. 
Do not write this section until you have
actually verified each check against the file contents.]
```

**Minimum depth expectation:** EXPLORATION.md must contain at least 120 lines of substantive content — not padding or boilerplate headers, but actual findings (file paths, behavioral rules, derived requirements, architecture observations). A skeleton that lists section headers with one-line placeholders is not a valid handoff artifact. If the file is thinner than this, go back and add the detail Phase 2 will need.

**Re-read after writing (mandatory).** After writing EXPLORATION.md, explicitly read the file back from disk before proceeding to Phase 2. This serves two purposes: (1) it confirms the file was written correctly, and (2) it loads the structured findings into working memory for artifact generation. Do not skip this step and rely on what you remember writing — the "write then read" cycle is the context bridge.

This file is essential in all modes. In single-pass mode it forces the model to articulate specific findings (file paths, function names, line numbers) before generating artifacts. In multi-pass mode it is also the handoff artifact between passes. Either way, the write-then-read cycle is the quality gate for exploration depth.

**Phase 1 completion gate (mandatory — STOP HERE before Phase 2).** You MUST execute this gate before proceeding to Phase 2. This is not optional. Re-read `quality/EXPLORATION.md` from disk and run every check below. After checking, append a `## Gate Self-Check` section to the bottom of EXPLORATION.md that lists each check number (1–12) with PASS or FAIL and a one-line evidence note. If any check fails, fix EXPLORATION.md and re-run the gate. Do not proceed to Phase 2 until all checks pass AND the Gate Self-Check section is written to disk.
+ +**Common gate-bypass failure mode:** In v1.3.43 benchmarking, two repos (chi, zod) produced EXPLORATION.md files with completely wrong section structure — sections like "Architecture summary", "Behavioral contracts", "Repository and architecture map" instead of the required sections. The model never ran the gate checks and proceeded directly to Phase 2, producing zero bugs. If your EXPLORATION.md does not contain sections with the EXACT titles listed below, it is non-conformant and must be rewritten before proceeding. + +1. The file exists on disk and contains at least 120 lines of substantive content. +2. `quality/PROGRESS.md` exists and marks Phase 1 complete. +3. The Derived Requirements section contains at least one REQ-NNN with specific file paths and function names — not abstract subsystem descriptions. +4. A section titled **exactly** `## Open Exploration Findings` exists and contains at least 8 concrete bug hypotheses or suspicious findings, each with a file path and line number. These must come from domain-driven investigation, not just from applying patterns. At least 4 must reference different modules or subsystems. +5. **Open-exploration depth check:** At least 3 findings in `## Open Exploration Findings` must trace a behavior across 2 or more functions or 2 concrete code locations. A list of isolated single-function suspicions is not sufficient depth. +6. A section titled **exactly** `## Quality Risks` exists and contains at least 5 domain-driven failure scenarios ranked by priority. Each scenario must: (a) name a specific function, file, and line, (b) describe a domain-specific edge case or failure mode, and (c) explain why the code produces wrong behavior. These must come from domain knowledge about what goes wrong in systems like this one — not from structural analysis of the code alone. A section that lists defensive patterns the code already has (things the code does RIGHT) does not satisfy this gate. 
A section that lists risky modules without specific failure scenarios does not satisfy this gate. A section that concludes the library is mature and unlikely to have basic bugs does not satisfy this gate. +7. A section titled **exactly** `## Pattern Applicability Matrix` exists and evaluates all six patterns from `exploration_patterns.md`, marking each as `FULL` or `SKIP` with target modules and codebase-specific rationale. +8. Between 3 and 4 patterns (inclusive) are marked `FULL` in the applicability matrix. +9. There are between 3 and 4 sections (inclusive) whose titles begin with `## Pattern Deep Dive — `. Each must contain concrete file:line evidence, not just pattern-name placeholders. The count must match the number of `FULL` patterns in the matrix. +10. **Pattern depth check:** At least 2 of the pattern deep-dive sections must trace a code path across 2 or more functions. A section that says "function X at file:line has a gap" is a surface finding. A section that says "function X at file:line calls function Y at file:line, which does A but not B; compare with function Z which does both" is a depth finding. +11. A section titled **exactly** `## Candidate Bugs for Phase 2` exists and contains at least 4 prioritized bug hypotheses with file:line references, the stage that surfaced each one (open exploration, quality risks, or pattern), and what the code review should look for. +12. **Ensemble balance check:** At least 2 candidate bugs must originate from open exploration or quality risks, and at least 1 must originate from or be materially strengthened by a pattern deep dive. This ensures both domain-knowledge and structural-analysis findings flow into Phase 2. + +Do not begin Phase 2 until all twelve checks pass AND the `## Gate Self-Check` section is written to EXPLORATION.md on disk. Phase 1 is your only chance to understand the codebase deeply. Every requirement you miss here is a bug you will not find in Phase 3. Invest the time. 
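Most of these checks are countable, so they can be pre-screened mechanically before the manual depth review. A minimal sketch in bash (the `qpb_gate` helper name is hypothetical; it covers only the countable checks, and depth checks 5, 10, and 12 still require human judgment):

```bash
# Pre-screen the mechanical Phase 1 gate checks: file existence, line count,
# exact section titles, and deep-dive count. Depth checks (5, 10, 12) still
# require manual review of the content itself.
qpb_gate() {
  local f=$1 fail=0 lines deep title
  _check() { if [ "$2" -eq 0 ]; then echo "PASS: $1"; else echo "FAIL: $1"; fail=1; fi; }

  [ -f "$f" ]; _check "exploration file exists" $?

  lines=$(grep -c . "$f" 2>/dev/null); lines=${lines:-0}
  [ "$lines" -ge 120 ]; _check "at least 120 non-empty lines (found $lines)" $?

  # Exact-title matches (grep -x requires the whole line to equal the title).
  for title in "## Open Exploration Findings" "## Quality Risks" \
               "## Pattern Applicability Matrix" "## Candidate Bugs for Phase 2" \
               "## Gate Self-Check"; do
    grep -qxF "$title" "$f" 2>/dev/null; _check "section present: $title" $?
  done

  deep=$(grep -c '^## Pattern Deep Dive — ' "$f" 2>/dev/null); deep=${deep:-0}
  [ "$deep" -ge 3 ] && [ "$deep" -le 4 ]
  _check "3-4 Pattern Deep Dive sections (found $deep)" $?

  return "$fail"
}
```

Run `qpb_gate quality/EXPLORATION.md` from the repository root; a nonzero return means at least one mechanical check failed, so the Gate Self-Check cannot honestly be written yet.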
**If you find yourself about to start Phase 2 without having written a Gate Self-Check section, STOP.** Go back and run the gate. This instruction exists because models reliably skip the gate when they feel confident about their exploration — and that confidence is precisely when bugs are missed.

**End-of-phase message (mandatory — print this after Phase 1 completes, then STOP):**

```
# Phase 1 Complete — Exploration

I've finished exploring the codebase and written my findings to `quality/EXPLORATION.md`.
[Summarize: how many candidate bugs, which subsystems explored, key risks identified.]

To continue to Phase 2 (Generate quality artifacts), say:

    Run quality playbook phase 2.

Or say "keep going" to continue automatically.
```

**After printing this message, STOP. Do not proceed to Phase 2 unless the user explicitly asks.**

---

## Phase 2: Generate the Quality Playbook
+> **Required references for this phase** — read these before proceeding: +> - `quality/EXPLORATION.md` — your Phase 1 findings (architecture, requirements, use cases, pattern analysis) +> - `.github/skills/references/requirements_pipeline.md` — five-phase pipeline for requirement derivation +> - `.github/skills/references/defensive_patterns.md` — grep patterns for finding defensive code +> - `.github/skills/references/schema_mapping.md` — field mapping format for schema-aware tests +> - `.github/skills/references/constitution.md` — QUALITY.md template +> - `.github/skills/references/functional_tests.md` — test structure and anti-patterns +> - `.github/skills/references/review_protocols.md` — code review and integration test templates + +**Phase 2 entry gate (mandatory — HARD STOP).** Before generating any artifacts, read `quality/EXPLORATION.md` from disk and verify ALL of the following exact section titles exist (grep or search — do not rely on memory): + +1. `## Open Exploration Findings` — must exist verbatim +2. `## Quality Risks` — must exist verbatim +3. `## Pattern Applicability Matrix` — must exist verbatim +4. At least 3 sections starting with `## Pattern Deep Dive — ` — must exist verbatim +5. `## Candidate Bugs for Phase 2` — must exist verbatim +6. `## Gate Self-Check` — must exist (proves the Phase 1 gate was run) + +If the file does not exist, has fewer than 120 lines, or is **missing ANY of these exact section titles**, STOP and go back to Phase 1. Do not attempt to proceed with "equivalent" sections under different names — the exact titles above are required. Write EXPLORATION.md now, starting with domain-driven open exploration, then domain-knowledge risk analysis, then selecting 3–4 patterns from `.github/skills/references/exploration_patterns.md` for deep dives. Do not proceed with Phase 2 until EXPLORATION.md passes the Phase 1 completion gate. This check exists because single-pass execution can skip the Phase 1 gate — this is the backstop. 
In v1.3.43, two repos bypassed both gates and produced zero bugs.

Use `quality/EXPLORATION.md` as your primary source for this phase — do not re-explore the codebase from scratch. The exploration findings contain the architecture map, derived requirements, use cases, and risk analysis that drive every artifact below. If you find yourself reading source files to figure out what the project does, go back to EXPLORATION.md instead. Re-exploration wastes context and produces inconsistencies between what Phase 1 found and what Phase 2 generates.

Now write the nine files. For each one, follow the structure below and consult the relevant reference file for detailed guidance.

**Version stamp (mandatory on every generated file).** Every Markdown file the playbook generates must begin with the following attribution line immediately after the file's title heading:

```
> Generated by [Quality Playbook](https://github.com/andrewstellman/quality-playbook) v1.4.0 — Andrew Stellman
> Date: YYYY-MM-DD · Project: [project name]
```

Every generated code file (test files, scripts) must begin with a comment header:

```
# Generated by Quality Playbook v1.4.0 — https://github.com/andrewstellman/quality-playbook
# Author: Andrew Stellman · Date: YYYY-MM-DD · Project: [project name]
```
Use the comment syntax appropriate for the language (`#`, `//`, `/* */`, etc.). The version in the stamp must match the `metadata.version` in this skill's frontmatter. This stamp makes every generated artifact traceable back to the tool, version, and run that created it — essential when files are emailed, attached to tickets, or reviewed outside the repository context. Use the date the playbook generation started, not the date each individual file was written.

**Stamp placement and exemptions:**
- For Python files with an encoding pragma (`# -*- coding: utf-8 -*-`) or a shebang (`#!/usr/bin/env python`), place the stamp comment *after* the pragma/shebang, not before — a shebang only works on line 1, and Python only honors an encoding pragma on the first two lines.
- For sidecar JSON files (`tdd-results.json`, `integration-results.json`), the `skill_version` field already serves as the version stamp. JSON does not support comments — do not inject one.
- For JUnit XML files, no stamp is needed — these are framework-generated.
- For `.patch` files, do not inject a stamp into the diff body — it would break `git apply`. Rely on the surrounding artifact metadata (BUGS.md, tdd-results.json) for provenance.

**Artifact dependency rules:**
- `quality/RUN_CODE_REVIEW.md` Pass 2 depends on a stable `quality/REQUIREMENTS.md` — thin requirements produce a thin Pass 2 review. If the requirements count seems low for the code surface (fewer than ~3–4 requirements per core module), note this at the start of the Pass 2 report.
- Functional tests depend on `quality/REQUIREMENTS.md` and `quality/QUALITY.md` — after any requirements refinement, re-verify that `test_functional.*` still covers every requirement.
- `quality/RUN_SPEC_AUDIT.md` depends on requirements, quality scenarios, and docs validation.
- `quality/COMPLETENESS_REPORT.md` has two stages: baseline (pre-review, no verdict section) and final (post-reconciliation in Phase 5, with the authoritative verdict).
- `quality/PROGRESS.md` is the authoritative state file and must be updated before each downstream artifact begins.

**Why nine files instead of just tests?** Tests catch regressions but don't prevent new categories of bugs. The quality constitution (`QUALITY.md`) tells future sessions what "correct" means before they start writing code. The protocols (`RUN_*.md`) provide structured processes for review, integration testing, and spec auditing that produce repeatable results — instead of leaving quality to whatever the AI feels like checking. Together, these files create a quality system where each piece reinforces the others: scenarios in QUALITY.md map to tests in the functional test file, which are verified by the integration protocol, which is audited by the Council of Three.

### File 1: `quality/QUALITY.md` — Quality Constitution

Key rules:

**Read `references/review_protocols.md`** for the template.

The code review protocol has three passes. Each pass runs independently — a fresh session with no shared context except the requirements document. This clean separation prevents cross-contamination between structural review and requirement-based review.

**Pass 1 — Structural Review.** Read the code and spot anomalies. This is what every AI code review tool already does well. No requirements, no focus areas — just the model's own knowledge of code correctness. Keep these mandatory guardrails:

- Line numbers are mandatory — no line number, no finding
- Read function bodies, not just signatures
- Grep before claiming missing
- Do NOT suggest style changes — only flag things that are incorrect

**Minimum required Pass 1 scrutiny areas (address each explicitly):**

1. **Input validation and boundary handling** — check every trust boundary where external or caller-supplied data enters the code. Every string parser, enum lookup, and binary-format parser must reject input that shares a valid prefix with a valid token but contains additional characters.
2. **Resource lifecycle** — allocation, refcount management, error-path cleanup, lock release on failure, file descriptor/handle lifetime. Every function that acquires a reference or resource must release it on every early-exit path, or must complete all validation before acquiring the resource.
3. **Concurrency and state management** — lock ordering, atomic operation correctness (every atomic modification of a shared state word must use read-modify-write semantics and preserve bits outside the intended modification), state machine completeness (all states handled at all consumers).
4. **Unit and encoding correctness** — every field read from hardware, protocol structures, or user input that has defined units must be converted correctly before use in calculations or comparisons.
5. **Enumeration and whitelist completeness** — when a function uses a `switch`/`case`, `match`, if-else chain, or any branching construct to handle a set of named constants (feature bits, enum values, event types, command codes, permission flags), perform a **mechanical enumeration check**:

   (a) **List A (code extraction):** If a `quality/mechanical/_cases.txt` artifact exists for this function, use it as the authoritative code-side list — do not re-extract manually. If no mechanical artifact exists, extract every branch/case label actually present in the code.
List each with its exact line number: "line 3511: `case VIRTIO_RING_F_INDIRECT_DESC`", "line 3513: `case VIRTIO_RING_F_EVENT_IDX`", etc. **Extract this list from the code only — do not copy from REQUIREMENTS.md, CONTRACTS.md, or any other generated artifact.** If you cannot cite a line number for a case label, it is not present.

   (b) **List B (spec extraction):** List every constant defined in the relevant header, enum, or spec that *should* be handled.

   (c) **Diff:** Compare the two lists. For each constant in List B, mark it as "FOUND (line NNN)" or "NOT IN CODE." Report any constants that are defined but not handled.

   **Do not assert that a whitelist "covers all values" or "preserves supported bits" without performing this two-list comparison.** AI models reliably hallucinate completeness for switch/case constructs — the model sees the function, sees the constants defined elsewhere, and assumes coverage without checking each case label. The most dangerous form of this hallucination is copying from an upstream artifact (like REQUIREMENTS.md) that asserts a constant is present, rather than extracting from the code. In v1.3.17, the code review's "case labels present" list was word-for-word identical to the requirements list — proving it was copied rather than extracted. The mechanical check with per-label line numbers is the fix.

These five areas must appear as labeled subsections in the Pass 1 report. If a project has no meaningful concurrency, say so explicitly and document why rather than omitting the section. Add project-specific scrutiny areas beyond these five as warranted.

Pass 1 catches ~65% of real defects: race conditions, null pointer hazards, resource leaks, off-by-one errors, type mismatches — structural problems visible in the code.

**Pass 2 — Requirement Verification.** For each testable requirement derived in Step 7 of Phase 1, check whether the code satisfies it.
For each requirement, either show the code that satisfies it or explain specifically why it doesn't. This is a pure verification pass — the reviewer's only job is "does the code satisfy this requirement?" Not a general review. Not looking for other bugs. Just verification. + +**Minimum evidence rule:** Pass 2 must cite at least one code location (file:line or file:function) **per requirement**. Blanket satisfaction claims like "REQ-003 through REQ-012 — satisfied by the client paths reviewed during the pass" without per-requirement code citations do not satisfy Pass 2. If two or three requirements are satisfied by the same function, cite the function once and list those specific requirements — but each requirement must appear individually with its own SATISFIED/VIOLATED verdict, not as part of an unverified range. A group of more than three requirements under a single citation is a sign that the verification was superficial. The point is traceability — a reviewer reading Pass 2 should be able to follow the evidence chain from any single requirement to the code that satisfies it without re-reading the entire codebase. + +**Enumeration completeness claims require mechanical proof.** When evaluating a requirement that involves a whitelist, lookup table, feature-bit set, handler registry, or any claim of the form "all X are covered by Y," the reviewer must perform the two-list enumeration check from Pass 1 scrutiny area 5: extract every item from the code (with line numbers), extract every item from the spec, and diff. **The code-side list must be extracted fresh from the source — do not reuse any list from REQUIREMENTS.md, CONTRACTS.md, the code review prompt, or any other generated artifact.** If the code-side list matches the requirements list word-for-word, that is a sign the list was copied rather than extracted, and the check must be redone. 
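For illustration, the fresh-extraction requirement can be made mechanical rather than left to eyeballing. This sketch assumes a bash shell and a C-style codebase; the file paths, the `enum_check` helper name, and the `VIRTIO_` prefix are placeholders for whatever the requirement actually covers:

```bash
# Two-list enumeration check: List A = case labels actually in the code
# (cited with line numbers), List B = constants the spec header defines.
# Usage: enum_check <source.c> <header.h> <CONSTANT_PREFIX>
enum_check() {
  local src=$1 hdr=$2 pfx=$3 in_code in_spec
  # List A: every case label present in the code, with its line number.
  grep -n "case ${pfx}" "$src"
  in_code=$(grep -o "case ${pfx}[A-Z0-9_]*" "$src" | sed 's/^case //' | sort -u)
  # List B: every constant the spec header defines with this prefix.
  in_spec=$(grep -o "#define ${pfx}[A-Z0-9_]*" "$hdr" | awk '{print $2}' | sort -u)
  # Diff: constants defined in the spec but absent from the switch.
  comm -13 <(printf '%s\n' "$in_code") <(printf '%s\n' "$in_spec") \
    | sed 's/^/NOT IN CODE: /'
}
```

Because List A is re-derived from the source at check time, a word-for-word match with a REQUIREMENTS.md list is no longer possible by construction.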
+ +Do not mark such a requirement SATISFIED based on reading the function and believing it handles everything — that is the specific hallucination pattern this rule prevents. Example: a requirement says "the transport feature whitelist must preserve all supported ring features." The reviewer reads `vring_transport_features()` and sees it has a switch/case. The correct check: extract each case label with its line number (`line 3511: INDIRECT_DESC`, `line 3513: EVENT_IDX`, ..., `line 3527: default`), then list the header constants, then diff. The hallucination: "the whitelist preserves supported bits including VIRTIO_F_RING_RESET" without checking that RING_RESET actually appears as a case label. This exact failure mode has been observed in practice across multiple versions — the model asserted coverage of a constant that was absent from the switch, and in v1.3.17, the code review's "case labels present" list was copied from the requirements rather than extracted from the code, causing three independent spec auditors to inherit the false claim. + +Pass 2 catches violations of individual requirements — cases where the code doesn't do what the specification says it should. This finds bugs that structural review misses because the code that IS there is correct; the bug is what's missing or what doesn't match the spec. + +**Pass 3 — Cross-Requirement Consistency.** Compare pairs of requirements that reference the same field, constant, range, or security policy. For each pair, verify that their constraints are mutually consistent. Do numeric ranges match bit widths? Do security policies propagate to all connection types? Do validation bounds in one file agree with encoding limits in another? + +Pass 3 catches contradictions where two individually-correct pieces of code disagree about a shared constraint. 
These bugs are invisible to both structural review and per-requirement verification because each requirement IS satisfied individually — the bug only appears when you compare them. This is the pass that catches cross-file arithmetic bugs and design gaps where a security configuration doesn't propagate to all connection paths. + +**Source code boundary rule:** The playbook must never modify files outside the `quality/` directory. All source-tree changes — bug fixes, test additions to the project's own test suite — must be expressed as `git diff`-format patch files saved under `quality/patches/`. This ensures the original source tree remains untouched, patches are reviewable and reversible, and the playbook's findings are cleanly separable from the code it audited. + +**BUGS.md:** After all review and audit phases, generate `quality/BUGS.md` — a consolidated bug report with full reproduction details for each confirmed bug. For each bug, include: bug ID, source (code review or spec audit), file:line, description, severity, minimal reproduction scenario (what input or sequence triggers the bug), expected vs actual behavior, references to the regression test and any proposed fix patch, and **spec basis**. + +**What counts as sufficient evidence to confirm a bug (critical).** A code-path trace that demonstrates a specific behavioral violation IS sufficient evidence to confirm a bug. You do NOT need executed request-level evidence, a running test, or an integration-level reproduction to promote a finding from candidate to confirmed. Specifically: + +- A code-path trace showing function A calls function B which does X but should do Y, with file:line references — **sufficient to confirm**. +- A missing case/branch identified by enumeration comparison (spec says X should be handled, code has no handler for X) — **sufficient to confirm**. 
+- A requirement violation identified in Pass 2 where the code demonstrably does not implement the specified behavior — **sufficient to confirm**. +- A domain-knowledge finding where you can trace from input through specific code to wrong output — **sufficient to confirm**. + +Do NOT demand "executed request-level evidence" or defer findings because "they require runtime testing to distinguish implementation choice from spec gap." If the spec or documentation says the behavior should be X, and the code demonstrably produces Y (traceable through the code path), that is a confirmed bug — not a candidate awaiting runtime validation. The regression test and TDD protocol exist to provide runtime evidence AFTER confirmation, not as a prerequisite FOR confirmation. + +**Why this rule exists:** In v1.3.43 javalin benchmarking, the code review and triage both identified 4 legitimate candidate bugs with code-path traces and requirement violations, then demoted all of them because "the highest-confidence items still require executed request-level evidence." This produced zero confirmed bugs from a codebase where previous versions found 5. The evidentiary bar was set at runtime-proof-before-confirmation, which is backwards — the playbook's design is confirm-then-prove-with-TDD. + +**Severity calibration:** Credential leakage, authentication bypass, and injection-class bugs are always high severity regardless of assessed likelihood. Authorization header exposure across trust boundaries (e.g., cross-domain redirects) is credential leakage. When in doubt about security-relevant severity, default to high. + +**Spec basis (mandatory field per bug):** Cite the specific documentation passage that establishes the expected behavior — the gathered doc filename, section/page, and the behavioral contract it defines. If no gathered doc covers the behavior, check whether the project's own comments, README, or API docs define it. 
If no documentation exists for the expected behavior, classify the bug as a "code inconsistency" rather than a "spec violation" and note this in the severity assessment. A spec violation is a stronger finding than a code inconsistency — it means the code contradicts an authoritative source, not just that the code looks wrong. This distinction matters when reporting upstream: maintainers respond to "your code violates section X.Y of your own spec" differently than "this looks like it might be a bug."

**Patch files (mandatory for every confirmed bug).** For each confirmed bug, generate:
- `quality/patches/BUG-NNN-regression-test.patch` — a `git diff` that adds a test demonstrating the bug. **This patch is mandatory, not optional.** It is the strongest evidence a bug exists — independent of any opinion about the fix. A confirmed bug without a regression-test patch is incomplete and will cause `quality_gate.sh` to FAIL. Generate this patch immediately after confirming the bug, before moving to the next bug.
- `quality/patches/BUG-NNN-fix.patch` (optional but strongly encouraged) — a `git diff` with the proposed fix. For bugs where the fix is a single-line or few-line change (e.g., adding a case label, fixing an argument), generate the fix patch — these are low-effort and high-value.

**How to generate patch files.** Use `git diff` format. The simplest approach: write the patch content directly as a unified diff. Example for a regression test patch (the `#include` lines here are illustrative placeholders):

```
--- /dev/null
+++ b/quality/test_regression_virtio.c
@@ -0,0 +1,15 @@
+// Generated by Quality Playbook v1.4.0
+// Regression test for BUG-004: VIRTIO_F_RING_RESET missing from vring_transport_features()
+#include <stdio.h>
+#include <assert.h>
+...
```

For fix patches that modify existing source files, use the `--- a/path` / `+++ b/path` format with correct line offsets.
If you cannot determine exact line offsets, generate the patch content and note "offsets approximate" — an approximate patch is more valuable than no patch.

Patches must apply cleanly against the original source tree with `git apply`. Do not modify the source tree directly.

**Patch validation gate (mandatory).** Before declaring any bug as confirmed with a fix patch, run this gate:

1. **Apply test:** `git apply --check quality/patches/BUG-NNN-regression-test.patch` — must exit 0.
2. **Apply test + fix:** `git apply --check quality/patches/BUG-NNN-fix.patch` — must exit 0 (test against clean tree, not against regression-test-applied tree, unless the fix patch depends on the regression test).
3. **Compile check:** After applying both patches, run the project's build/compile command (e.g., `go build ./...`, `mvn compile`, `cargo check`, `tsc --noEmit`). Must succeed.

**Temporary worktree for step 3.** Steps 1–2 use `--check` (non-destructive). Step 3 requires actually applying patches and compiling, which modifies the source tree. To comply with the source code boundary rule ("never modify files outside `quality/`"), run step 3 in a disposable worktree:

```bash
REPO="$(pwd)"
git worktree add /tmp/qpb-patch-check HEAD --quiet
cd /tmp/qpb-patch-check
# quality/ is typically untracked, so reference the patches by absolute path
git apply "$REPO/quality/patches/BUG-NNN-regression-test.patch" "$REPO/quality/patches/BUG-NNN-fix.patch"
# [run the project's build/compile command here, e.g. go build ./...]
cd -
git worktree remove /tmp/qpb-patch-check --force
```

If `git worktree` is unavailable (shallow clone, detached HEAD), use `git stash && git apply ... && [build command] && git checkout . && git stash pop` as a fallback, or accept `--check`-only validation and note the limitation.
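The `--check` half of the gate can be run in bulk across the patches directory. A sketch, assuming the `BUG-NNN-*.patch` naming convention above and a git checkout as the working directory (`validate_patches` is a hypothetical helper):

```bash
# Run git apply --check against every patch under a directory.
# --check is non-destructive: it verifies each patch applies cleanly
# against the current tree without modifying anything.
validate_patches() {
  local dir=${1:-quality/patches} p rc=0
  for p in "$dir"/BUG-*.patch; do
    [ -e "$p" ] || { echo "no patches found in $dir"; return 0; }
    if git apply --check "$p" 2>/dev/null; then
      echo "OK: $p"
    else
      echo "FAIL: $p (does not apply cleanly)"
      rc=1
    fi
  done
  return "$rc"
}
```

A nonzero return flags at least one patch that needs to be regenerated before its bug can be recorded as confirmed.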
+ +**Compile check for interpreted languages.** The compile command varies by ecosystem: +- **Go:** `go build ./...` +- **Rust:** `cargo check` +- **Java/Kotlin (Maven):** `mvn compile -q` +- **Java/Kotlin (Gradle):** `./gradlew compileJava compileTestJava -q` +- **TypeScript:** `tsc --noEmit` +- **Python:** `python -m py_compile ` for syntax, then `pytest --collect-only -q` for import/discovery validation +- **JavaScript (Node.js):** `node --check ` for syntax; if the project uses ESLint, `npx eslint ` for structural issues +- **JavaScript (Mocha/Jest):** Run the specific test in discovery-only mode (`mocha --dry-run` or `jest --listTests`) to verify it loads without errors + +If no compile/syntax check is feasible for the project's language, document this in the patch entry and rely on the TDD red phase to catch syntax errors. + +If any step fails, fix the patch before recording the bug as confirmed. A bug with a corrupt patch that won't apply is not a confirmed bug — it's a hypothesis with broken evidence. The TDD red-green cycle cannot run on patches that don't apply, and reporting a bug with an unapplyable patch undermines credibility with upstream maintainers. Common patch failures: truncated hunks (missing closing braces), wrong line offsets (patch generated against modified tree instead of clean tree), and syntax errors in generated test code. + +**Fix patch requirement.** Every confirmed bug must have either: +- A `quality/patches/BUG-NNN-fix.patch` that passes the validation gate above, OR +- An explicit justification in BUGS.md explaining why no fix patch is provided (e.g., "fix requires architectural change beyond patch scope," "multiple valid fix strategies — deferring to maintainer judgment," "bug is in upstream dependency"). + +A bug with a regression test but no fix patch and no justification is incomplete. The regression test proves the bug exists; the fix patch (or justification for its absence) completes the evidence chain. 
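Selecting the right compile check can itself be mechanical. A sketch that maps build-system marker files to the commands listed above (`detect_compile_cmd` is a hypothetical helper; the marker-file heuristic is an assumption, so extend the mapping per project):

```bash
# Pick the project's compile/syntax check from build-system marker files
# in the current directory. Prints nothing if no ecosystem is recognized,
# in which case document the gap and rely on the TDD red phase.
detect_compile_cmd() {
  if [ -f go.mod ]; then echo "go build ./..."
  elif [ -f Cargo.toml ]; then echo "cargo check"
  elif [ -f pom.xml ]; then echo "mvn compile -q"
  elif [ -f gradlew ]; then echo "./gradlew compileJava compileTestJava -q"
  elif [ -f tsconfig.json ]; then echo "tsc --noEmit"
  elif [ -f pyproject.toml ] || [ -f setup.py ]; then echo "pytest --collect-only -q"
  elif [ -f package.json ]; then echo "node --check"
  else echo ""
  fi
}
```

The ordering matters for mixed repositories (e.g., a Go project with a `package.json` for tooling): put the primary ecosystem's marker first.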
Bugs without fix patches cannot achieve "TDD verified (FAIL→PASS)" status — they remain at "confirmed open (xfail)" until a fix is provided. + +**TDD verification cycle:** Each confirmed bug with a fix patch should go through the red-green TDD cycle (test fails on unpatched code, passes after fix). This is executed via the `quality/RUN_TDD_TESTS.md` protocol (File 7), not inline during the code review. The protocol generates spec-grounded tests where every assertion message, variable name, and comment traces back to gathered documentation. + +**After all three passes:** Combine findings. Write regression tests in `quality/test_regression.*` that reproduce each confirmed bug. Use the same test framework as `test_functional.*` — if functional tests use pytest, regression tests use pytest (with `@pytest.mark.xfail(strict=True)`); if functional tests use unittest, regression tests use unittest (with `@unittest.expectedFailure`). Report results as a confirmation table (BUG CONFIRMED / FALSE POSITIVE / NEEDS INVESTIGATION). See `references/review_protocols.md` for the full three-pass template and regression test protocol. + +**Regression test skip guards (mandatory).** Every regression test in `quality/test_regression.*` must include a skip/xfail guard so that running the full test suite on unpatched code does not produce unexpected failures. The guard must be the **earliest syntactic guard for the framework** — a decorator or annotation where idiomatic, otherwise the first executable line in the test body. Use the language-appropriate mechanism: + +- **Python (pytest):** `@pytest.mark.xfail(strict=True, reason="BUG-NNN: [description]")` — placed as a **decorator above** `def test_...():`, not inside the function body. When the bug is present, the test fails → XFAIL (expected). When the bug is fixed but the marker isn't removed, the test passes → XPASS → strict mode makes this a failure, signaling the guard should be removed. 
+- **Python (unittest):** `@unittest.expectedFailure` — decorator above the test method. +- **Go:** `t.Skip("BUG-NNN: [description] — unskip after applying quality/patches/BUG-NNN-fix.patch")` — first line inside the test function. Note: Go's `t.Skip` hides the test entirely (reports SKIP, not FAIL), which is weaker evidence than Python's xfail. This is a known limitation of Go's test primitives. +- **Java (JUnit 5):** `@Disabled("BUG-NNN: [description]")` — annotation above the test method. +- **Rust:** `#[ignore]` attribute on the test function (the standard "don't run in default suite" mechanism). Use `#[should_panic]` only for bugs that manifest as panics; use `compile_fail` doctest annotation only for compile-time bugs. +- **TypeScript/JavaScript (Jest):** `test.failing("BUG-NNN: [description]", () => { ... })` +- **TypeScript/JavaScript (Vitest):** `test.fails("BUG-NNN: [description]", () => { ... })` +- **JavaScript (Mocha):** `it.skip("BUG-NNN: [description]", () => { ... })` or `this.skip()` inside the test body for conditional skipping. + +When a bug is fixed (fix patch applied permanently), remove the skip guard and update the BUG tracker closure status from "confirmed open" to "fixed (test passes)". The skip guard message must reference the bug ID and the fix patch path so that someone encountering a skipped test knows exactly how to resolve it. + +**Source-inspection tests must execute (no `run=False`).** Regression tests that verify source-file structure — string presence in function bodies, case label existence, enum extraction, generated-code shape checks — are safe, deterministic, and fast. They read repository files and perform string matches. For these tests, use `@pytest.mark.xfail(strict=True)` with execution enabled. **Do not use `run=False`** unless the test would mutate external state, hang, or require unavailable infrastructure. A source-inspection test with `run=False` is the worst possible state: the correct check exists but is inert. 
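As a minimal sketch of an executing source-inspection guard, using the stdlib `unittest` variant (the pytest variant uses `@pytest.mark.xfail(strict=True)`); the bug ID, function body, and missing case label are hypothetical stand-ins for the real repository file:

```python
import unittest

# Stand-in for reading the repository source file under test; the real test
# would read the file and extract the function body.
BUGGY_SOURCE = """
static u64 vring_transport_features(void)
{
    switch (feature) {
    default:
        break;
    }
}
"""

class RegressionTests(unittest.TestCase):
    @unittest.expectedFailure  # BUG-NNN: remove after applying the fix patch
    def test_ring_reset_in_whitelist(self):
        # The assertion actually executes (no run=False), so it fires while
        # the bug is present and the suite reports an expected failure.
        self.assertIn("case VIRTIO_F_RING_RESET:", BUGGY_SOURCE)

suite = unittest.defaultTestLoader.loadTestsFromTestCase(RegressionTests)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(len(result.expectedFailures))  # 1 (expected failure; suite still passes)
```

The guard turns the failing assertion into an expected failure rather than a suite failure, which is exactly the reporting behavior described above for `xfail(strict=True)`.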
In v1.3.18, the regression test for BUG-004 (`test_bug_004_transport_feature_whitelist_keeps_ring_reset`) contained the correct assertion `assert "case VIRTIO_F_RING_RESET:" in func` but was marked `run=False` — so the test never executed, the assertion never fired, and the bug remained undetected despite the test suite "passing." When an `xfail(strict=True)` test actually executes and fails, the test suite reports it as XFAIL (expected failure) — this is correct behavior, not a suite failure. + +**TDD red/green interaction with skip guards.** During the TDD verification cycle, the red and green phases must temporarily bypass the skip guard to actually execute the test. The protocol should instruct the agent to: +- **Red phase (NEVER SKIPPED):** Remove or disable the skip/xfail guard, then run the test against unpatched code. It must fail. Re-enable the guard after recording the result. **The red phase is mandatory for every confirmed bug, even when no fix patch exists.** A bug without red-phase evidence is unverified — do not record `verdict: "skipped"` without a failing red run. If the red phase cannot execute for a documented reason (compilation failure, environment unavailable), record `red_phase: "error"` with an explanation in `notes`. +- **Green phase:** Remove or disable the guard, apply the fix patch, run the test. It must pass. If the fix will be reverted, re-enable the guard. **If no fix patch exists, record `green_phase: "skipped"` — but the red phase must still have run.** +- **After TDD cycle:** The guard remains in the committed regression test file. It is only permanently removed when the fix is merged into the source tree. + +**TDD execution enforcement (mandatory).** Regression tests must be actually executed during the TDD verification cycle, not just generated as patch files. For every confirmed bug, the red-phase test run must produce a log file at `quality/results/BUG-NNN.red.log` capturing the test output. 
The green-phase (if a fix patch exists) must produce `quality/results/BUG-NNN.green.log`. Each log file's first line must be a status tag: `RED` (test failed as expected), `GREEN` (test passed after fix), `NOT_RUN` (test could not be executed — with explanation), or `ERROR` (test infrastructure failed — with explanation). + +**Language-aware test execution commands.** Use the project's native test runner to execute regression tests. Detect the project language and use the appropriate command: + +- **Go:** `go test -v -run TestBugNNN ./path/to/package` +- **Python (pytest):** `python -m pytest -xvs quality/test_regression.py::test_bug_nnn` +- **Python (unittest):** `python -m unittest quality.test_regression.TestRegression.test_bug_nnn` +- **Java (Maven + JUnit):** `mvn test -pl module -Dtest=RegressionTest#testBugNnn` +- **Java (Gradle + JUnit):** `./gradlew test --tests RegressionTest.testBugNnn` +- **Rust:** `cargo test bug_nnn -- --nocapture` +- **TypeScript/JavaScript (Jest):** `npx jest --verbose --testNamePattern="BUG-NNN"` +- **TypeScript/JavaScript (Vitest):** `npx vitest run --reporter=verbose --testNamePattern="BUG-NNN"` +- **C (kernel/make-based):** Source-inspection tests via shell script (grep/awk on source files) — log the script output. + +If the project uses a language or test framework not listed above, use whatever test runner the project already uses (check for `Makefile`, `package.json`, `build.gradle`, `Cargo.toml`, `go.mod`, `setup.py`, `pyproject.toml`, etc.) and adapt the pattern. If no test runner is available or the language runtime is not installed, record `NOT_RUN` with an explanation — do not skip the log file entirely. 
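The language-detection step can be sketched as a marker-file lookup. The marker-to-command mapping below is an assumption drawn from the table above; extend it for other ecosystems:

```python
from pathlib import Path
import tempfile

# Marker file -> native test runner command template (illustrative subset).
RUNNERS = [
    ("go.mod", "go test -v -run {pattern} ./..."),
    ("Cargo.toml", "cargo test {pattern} -- --nocapture"),
    ("pyproject.toml", "python -m pytest -xvs quality/test_regression.py::{pattern}"),
    ("package.json", 'npx jest --verbose --testNamePattern="{pattern}"'),
    ("pom.xml", "mvn test -Dtest=RegressionTest#{pattern}"),
]

def pick_runner(project_root, pattern):
    root = Path(project_root)
    for marker, cmd in RUNNERS:
        if (root / marker).exists():
            return cmd.format(pattern=pattern)
    return None  # no known runner; record NOT_RUN with an explanation

# Demo against a throwaway directory with a Go marker file:
with tempfile.TemporaryDirectory() as d:
    Path(d, "go.mod").touch()
    print(pick_runner(d, "TestBug001"))  # go test -v -run TestBug001 ./...
```

The `None` branch is the hook for the rule above: when no runner is detected, the log file still gets written with `NOT_RUN` on the first line.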
+ +**Log capture format.** Each `BUG-NNN.red.log` and `BUG-NNN.green.log` must follow this format: +``` +RED +--- Test output for BUG-NNN red phase --- +Command: [exact command run] +Exit code: [exit code] +[full stdout/stderr from test execution] +``` + +The status tag (`RED`, `GREEN`, `NOT_RUN`, `ERROR`) on the first line is machine-readable — `quality_gate.sh` will check for its presence. The `NOT_RUN` status is acceptable when the test runner is unavailable (e.g., a C project where the kernel build environment is not present), but the log file must still exist with an explanation of why the test could not be executed. + +**Ready-to-run TDD log template.** For each confirmed BUG-NNN, execute this sequence (adapt the test command for the project's language per the table above): + +```bash +# ── Red phase: revert fix, run test, expect FAIL ── +git apply -R quality/patches/BUG-NNN-fix.patch 2>/dev/null # revert fix if applied +TEST_CMD="python -m pytest -xvs quality/test_regression.py::test_bug_nnn" # adapt per language +OUTPUT=$($TEST_CMD 2>&1); EXIT=$? +printf 'RED\n--- Test output for BUG-NNN red phase ---\nCommand: %s\nExit code: %d\n%s\n' \ + "$TEST_CMD" "$EXIT" "$OUTPUT" > quality/results/BUG-NNN.red.log + +# ── Green phase: apply fix, run test, expect PASS ── +git apply quality/patches/BUG-NNN-fix.patch +OUTPUT=$($TEST_CMD 2>&1); EXIT=$? +printf 'GREEN\n--- Test output for BUG-NNN green phase ---\nCommand: %s\nExit code: %d\n%s\n' \ + "$TEST_CMD" "$EXIT" "$OUTPUT" > quality/results/BUG-NNN.green.log +``` + +Run this for every confirmed bug. If the test runner is not available, create the log file with `NOT_RUN` on the first line and an explanation. Do not skip this step — the TDD Log Closure Gate in Phase 5 will block completion if logs are missing. + +**TDD execution gate.** Before the terminal gate in Phase 5, verify that for every confirmed bug in `quality/BUGS.md`, a corresponding `quality/results/BUG-NNN.red.log` exists. 
Bugs without red-phase logs are incomplete — the regression test patch exists but was never proven to detect the bug. This gate exists because v1.3.45 benchmarking showed that most repos generate regression test patches but never execute them, leaving the TDD verdict unverified. ### File 4: `quality/RUN_INTEGRATION_TESTS.md` @@ -263,10 +1163,101 @@ Key sections: bootstrap files, focus areas mapped to architecture, and these man Must include: safety constraints, pre-flight checks, test matrix with specific pass criteria, an execution UX section, and a structured reporting format. Cover happy path, cross-variant consistency, output correctness, and component boundaries. +**Use-case traceability (mandatory).** The test matrix must include a **use-case traceability column**. Each integration test group must either: + +1. **Map to a use case** — Name the use case (e.g., UC-03) it validates and describe how the test exercises the user outcome from that use case. These are primary integration tests — they verify that the end-to-end behavior described in the use case actually works. + +2. **Be labeled as infrastructure** — Tests that don't map to a use case (build validation, race detection, compatibility checks, existing test suite regression guards) are explicitly labeled `[Infrastructure]` in the traceability column. They have value but don't count toward use-case coverage. + +After generating the test matrix, check: does every use case in REQUIREMENTS.md have at least one integration test mapped to it? If not, flag the uncovered use case as a gap. Integration tests mapped to use cases should test the **end-to-end behavior** described in the use case — not just run existing unit tests that happen to touch the same code paths. For example, if a use case says "Developer authenticates and follows redirects without leaking secrets," the integration test should perform a redirect across domains with auth headers and verify they're stripped — not just run `pytest -k auth`. 
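The use-case coverage check described above can be sketched as a set difference over the traceability column (group names and UC IDs are illustrative):

```python
# Every use case needs at least one mapped integration test group;
# "[Infrastructure]" groups are ignored for coverage purposes.
def uc_coverage_gaps(use_cases, groups):
    """groups: (name, traceability) pairs, where traceability is a UC list
    or the literal "[Infrastructure]" label."""
    covered = set()
    for _name, trace in groups:
        if trace == "[Infrastructure]":
            continue  # infrastructure never counts toward UC coverage
        covered.update(trace)
    return [uc for uc in use_cases if uc not in covered]

gaps = uc_coverage_gaps(
    ["UC-01", "UC-02", "UC-03"],
    [("Core routing", ["UC-01", "UC-02"]),
     ("Build + race checks", "[Infrastructure]")],
)
print(gaps)  # ['UC-03']; flag this use case as a gap
```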
+ +**Per-UC group splitting (mandatory).** Each integration test group must map to at most **2 use cases**. A group that maps to 3+ UCs is too coarse — it can't distinguish which use case failed when a test breaks. If a single test command (e.g., `mvn test`, `go test ./...`) would exercise multiple use cases, split it into separate groups with targeted test selectors (`-Dtest=`, `-run`, `-k`, `--tests`, `-- test_name`, etc.) so each group isolates 1–2 UCs. Groups covering all UCs in one undifferentiated command are explicitly prohibited — they provide no diagnostic value when a failure occurs. + +**No-selector fallback.** If the project's test framework cannot select tests at the granularity needed for splitting (e.g., a monolithic test suite with no tag/filter support), document the limitation in the integration protocol and use the narrowest feasible command. Record which UCs the group covers and why further splitting is not possible. **A single-command project must still use the grouped JSON schema** — wrap the command in one group with a `use_cases` list covering all UCs that command exercises. A flat list of commands is never a valid substitute for the `groups[]` structure. + +**Pre-flight command validation (mandatory).** Before finalizing `RUN_INTEGRATION_TESTS.md`, verify that each group's test command actually discovers and runs tests. Use the framework's dry-run or list mode to confirm: +- **Python:** `pytest --collect-only -q ` — must list at least one test +- **Go:** `go test -list "." 
` — must list at least one test name +- **Java/Kotlin:** `mvn -Dtest= test -pl --batch-mode -DfailIfNoTests=true` +- **TypeScript (Vitest):** `vitest list --config ` — must list at least one test +- **TypeScript (Jest):** `jest --listTests ` — must list at least one file +- **Rust:** `cargo test -- --list` — must list at least one test +- **JavaScript (Mocha):** `mocha --dry-run ` — must list at least one test + +If the dry-run exits with "no tests found," "No test files found," or a zero-test count, fix the selector before recording the group. Common fixes: add `--config` or `--root` flags, use file paths instead of `-t` name patterns, anchor regex patterns to the right package. Do not record a group whose command fails discovery — it will produce a `covered_fail` result that masks a selector bug as a code bug. + +If the dry-run fails with a **build error** (compilation failure, import error, missing dependency, test setup exception) rather than "no tests found," record the failure in the group's `notes` field as `"pre_flight_error": "environment"` and do not attempt to fix the selector. Environment errors during pre-flight require environment setup, not selector changes. + +**Infrastructure group definition.** A single `[Infrastructure]` group may cover build validation, race detection, static analysis, and platform compatibility checks without UC mapping. Infrastructure tests verify build toolchain and platform support, not user-observable behavior. 
Infrastructure groups: +- Do **not** count toward use-case coverage (the UC coverage check ignores them) +- Must include a one-line rationale explaining what they validate +- May **not** be used to relabel broad user-workflow commands to avoid splitting — if the tests exercise user-facing behavior described in a use case, they must be mapped to that UC regardless of how the test is organized + **All commands must use relative paths.** The generated protocol should include a "Working Directory" section at the top stating that all commands run from the project root using relative paths. Never generate commands that `cd` to an absolute path — this breaks when the protocol is run from a different machine or directory. Use `./scripts/`, `./pipelines/`, `./quality/`, etc. **Include an Execution UX section.** When someone tells an AI agent to "run the integration tests," the agent needs to know how to present its work. The protocol should specify three phases: (1) show the plan as a numbered table before running anything, (2) report one-line progress updates as each test runs (`✓`/`✗`/`⧗`), (3) show a summary table with pass/fail counts and a recommendation. See `references/review_protocols.md` section "Execution UX" for the template and examples. Without this, the agent dumps raw output or stays silent — neither is useful. +**Structured output (mandatory).** The protocol must instruct the agent to produce machine-readable results alongside the Markdown report, using **JUnit XML** for test execution and a **sidecar JSON** for QPB-specific metadata. 
+ +**JUnit XML output:** Each test group should run with the framework's native JUnit XML reporter: +- Python: `pytest --junitxml=quality/results/integration-group-N.xml` +- Go: `gotestsum --junitfile quality/results/integration-group-N.xml -- -run "TestPattern"` +- Java/Kotlin: Copy Surefire XML reports to `quality/results/` +- TypeScript: `jest --reporters=jest-junit` with `JEST_JUNIT_OUTPUT_DIR=quality/results/` +- Rust: `cargo test -- -Z unstable-options --format json | cargo2junit > quality/results/integration-group-N.xml` (if `cargo2junit` is available; it consumes cargo's JSON test output) + +If the JUnit XML reporter is unavailable, skip XML and note `"junit_available": false` in the sidecar JSON. + +**Sidecar JSON:** Generate `quality/results/integration-results.json` by copying the template below verbatim and filling in only the values. Do not invent fields, rename keys, or restructure the schema. A flat list of commands without the `groups` array is **invalid** — even if the project runs all tests through a single command, wrap it in one group. + +```json +{ + "schema_version": "1.1", + "skill_version": "", + "date": "YYYY-MM-DD", + "project": "", + "recommendation": "SHIP", + "groups": [ + { + "group": 1, + "name": "Core routing dispatch", + "use_cases": ["UC-01", "UC-02"], + "result": "pass", + "tests_passed": 5, + "tests_failed": 0, + "junit_file": "integration-group-1.xml", + "junit_available": true, + "notes": "" + } + ], + "summary": { + "total_groups": 9, + "passed": 8, + "failed": 1, + "skipped": 0 + }, + "uc_coverage": { + "UC-01": "covered_pass", + "UC-02": "covered_pass", + "UC-03": "not_mapped" + } +} +``` + +**Required top-level fields:** `schema_version`, `skill_version`, `date`, `project`, `recommendation`, `groups`, `summary`, `uc_coverage`. If any of these fields are missing from your output, the result is non-conformant. + +**Invalid examples (do not emit these):** +- A flat `"results": [{"command": "go test ./...", "result": "pass"}]` — this is not the grouped schema.
+- A schema with `"commands_run"` instead of `"groups"` — wrong key name. +- A schema missing `"uc_coverage"` — every use case from REQUIREMENTS.md must appear. +- A schema with `"use_case_traceability"` instead of `"use_cases"` — wrong field name. + +Valid `result` values: `"pass"`, `"fail"`, `"skipped"`, `"error"`. Valid `recommendation` values: `"SHIP"` (all groups pass), `"FIX BEFORE MERGE"` (failures in non-blocking groups), `"BLOCK"` (failures in critical groups). The `uc_coverage` section maps every use case from REQUIREMENTS.md to one of: `"covered_pass"` (at least one mapped group passed), `"covered_fail"` (groups mapped but all failed), or `"not_mapped"` (no integration test group maps to this use case). The distinction between `"covered_fail"` and `"not_mapped"` matters: the first means the test exists but the code is broken; the second means the test is missing. + +Runner scripts and CI tools should read the sidecar JSON for results rather than grepping the Markdown report. This eliminates the class of bugs where grep-based counting produces wrong numbers from matching words in prose. + +**Post-write validation (mandatory).** After writing `integration-results.json`, reopen the file and verify: (1) every required top-level field is present, (2) every `groups[]` entry has `group`, `name`, `use_cases`, `result`, and `notes`, (3) all `result` and `recommendation` values use only the allowed enum values listed above, (4) `uc_coverage` maps every use case from REQUIREMENTS.md, (5) no extra undocumented root keys exist. If any check fails, fix the file before proceeding. + **This protocol must exercise real external dependencies.** If the project talks to APIs, databases, or external services, the integration test protocol runs real end-to-end executions against those services — not just local validation checks. Design the test matrix around the project's actual execution modes and external dependencies. 
Look for API keys, provider abstractions, and existing integration test scripts during exploration and build on them. **Derive quality gates from the code, not generic checks.** Read validation rules, schema enums, and generation logic during exploration. Turn them into per-pipeline quality checks with specific fields and acceptable value ranges. "All units validated" is not enough — the protocol must verify domain-specific correctness. @@ -289,40 +1280,614 @@ Three independent AI models audit the code against specifications. Why three? Be The protocol defines: a copy-pasteable audit prompt with guardrails, project-specific scrutiny areas, a triage process (merge findings by confidence level), and fix execution rules (small batches by subsystem, not mega-prompts). +**Secondary emphasis lenses:** Optionally assign each audit model a secondary emphasis — for example, one starts with input validation, one with resource lifecycle, one with concurrency. Each model still performs a full independent audit; the emphasis biases attention without restricting coverage. Do not split models into disjoint ownership by bug class. + +**Minority finding rule:** During triage, any finding where only one of three auditors flags it (a minority finding) requires a re-investigation — read the specific code location and make an explicit CONFIRMED/FALSE-POSITIVE determination rather than discarding by default. Minority findings are disproportionately likely to be real bugs that two models missed. + +**Triage must not raise the evidentiary bar above code-path analysis.** The triage step confirms or rejects findings — it does not defer them pending runtime evidence. If a finding includes a code-path trace showing a behavioral violation (function calls, missing branches, wrong return values with file:line references), the triage should confirm it. Do not demote code-path-traced findings to "candidate" or "needs runtime verification." 
The TDD protocol (Phase 5) provides runtime evidence AFTER confirmation. See "What counts as sufficient evidence to confirm a bug" in the BUGS.md section for the full evidentiary standard. + +**Code review vs spec audit conflicts:** If the code review and spec audit disagree on the same finding, the spec audit finding is not automatically correct. Deploy a verification probe — read the specific code location and determine which assessment is accurate. Record the resolution in the BUG tracker. A code review BUG not flagged by any spec auditor is still confirmed but should be verified with a targeted probe before closure. + +**Verification probes must produce executable evidence.** When the triage step confirms OR rejects a finding via verification probe, prose reasoning alone is not sufficient. The probe must produce a test assertion that mechanically proves the determination: + +- **For rejections** (finding is false positive): Write an assertion that PASSES, proving the finding is wrong. Example: if rejecting "function X is missing null check," write `assert "if (ptr == NULL)" in source_of("X"), "X has null check at line NNN"`. If you cannot write a passing assertion that proves your rejection, **do not reject the finding** — escalate it to confirmed or flag it for manual review. + +- **For confirmations** (finding is a real bug): Write an assertion that FAILS (expected-failure), proving the bug exists. Example: if confirming "RING_RESET missing from switch," write `assert "case VIRTIO_F_RING_RESET:" in source_of("vring_transport_features"), "RING_RESET should be in the switch but is not"`. + +- **Every assertion must cite an exact line number** for the evidence it references. Not "lines 3527-3528" but "line 3527: `default:`" — showing what the line actually contains. Assertions without line-number citations are insufficient. 
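A minimal sketch of a probe helper that produces line-number citations (the helper name `find_line` and the inlined source are illustrative; a real probe reads the repository file under audit):

```python
# Returns the first matching line number and its content, so every triage
# assertion can cite exactly what the line contains.
def find_line(source, needle):
    for lineno, line in enumerate(source.splitlines(), start=1):
        if needle in line:
            return lineno, line.strip()
    return None

# Stand-in for reading the real file under audit:
SOURCE = "static u64 f(void)\n{\n\tswitch (x) {\n\tdefault:\n\t\tbreak;\n\t}\n}\n"

# Rejection probe: must PASS, and cites the exact line it relies on.
hit = find_line(SOURCE, "default:")
assert hit is not None, "no default branch found"
print(f"line {hit[0]}: `{hit[1]}`")  # line 4: `default:`

# Confirmation probe: the real probe asserts the label IS present and runs as
# an expected failure; here we only show that the label is genuinely absent.
assert find_line(SOURCE, "case VIRTIO_F_RING_RESET:") is None
```

Because the citation comes from the file itself, a hallucinated rejection cannot produce a passing assertion: if the claimed evidence is not in the source, `find_line` returns `None` and the probe fails.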
+ +**Why this rule exists:** In v1.3.16 virtio testing, the triage correctly received a minority finding that `VIRTIO_F_RING_RESET` was missing from a switch/case whitelist. The triage performed a "verification probe" that claimed lines 3527-3528 "explicitly preserve VIRTIO_F_RING_RESET" — but those lines actually contained the `default:` branch. The triage hallucinated compliance with the code. Had it been required to write `assert "case VIRTIO_F_RING_RESET:" in source`, the assertion would have failed, exposing the hallucination. Requiring executable evidence for rejections makes hallucinated rejections self-defeating: the model cannot write a passing assertion for something that isn't in the code. + +**Triage evidence must be written to disk.** Verification probe assertions must appear in a file on disk — either appended to `quality/mechanical/verify.sh` or written to a dedicated `quality/spec_audits/triage_probes.sh`. Assertions described in the triage report prose but never written to an executable file are not executable evidence. The gate checks for the existence of probe assertions in the triage output; a triage report that says "verification probe confirms..." without a corresponding assertion in an executable file is non-conformant. This prevents the failure mode where the model narrates what a probe *would* show without actually running it. + ### File 6: `AGENTS.md` If `AGENTS.md` already exists, update it — don't replace it. Add a Quality Docs section pointing to all generated files. If creating from scratch: project description, setup commands, build & test commands, architecture overview, key design decisions, known quirks, and quality docs pointers. +### File 7: `quality/RUN_TDD_TESTS.md` — TDD Verification Protocol + +This protocol is executed after the code review and spec audit have confirmed bugs and generated fix patches. It runs the red-green TDD cycle for each confirmed bug: test fails on unpatched code, apply fix, test passes. 
+ +**Why a separate protocol?** The code review finds bugs and writes regression tests with `xfail` markers. The TDD protocol takes those tests and proves they actually detect the bug — and that the fix actually fixes it. This is a stronger claim than "we found a bug and wrote a test." It's "here's a test that fails without the patch and passes with it." The distinction matters when reporting bugs upstream: maintainers trust a FAIL→PASS demonstration more than a bug description. + +The generated protocol must include: + +1. **Spec-grounded test requirements.** For each bug in `quality/BUGS.md`, the protocol instructs the agent to: + - Read the bug's **spec basis** field to identify the documentation passage that defines the expected behavior + - Read the gathered doc (from `docs_gathered/` or the project's own docs) at the cited section + - Write test assertions using **language from the spec** — variable names, constants, function names, and assertion messages should echo the spec's terminology, not the code's internal naming + - Include a comment block in each test citing: the requirement ID (from REQUIREMENTS.md), the bug ID (from BUGS.md), and the spec passage (doc name, section, and a ≤15-word quote of the behavioral contract) + +2. **Red-green execution steps.** For each bug with a fix patch: + - **Red:** Run the regression test against unpatched source. It must fail. If it passes, the test doesn't detect the bug — rewrite it using the spec basis to understand what behavior to assert. + - **Green:** Apply the fix patch (`git apply quality/patches/BUG-NNN-fix.patch`), run the same test. It must pass. + - **Record:** Log both results in the BUG tracker with closure status "TDD verified (FAIL→PASS)". + +3. 
**Framework adaptation.** The protocol must detect the project's test framework and generate idiomatic tests: + - **Projects with test infrastructure** (pytest, JUnit, Go testing, Jest, cargo test, etc.): Write tests in the project's own framework, following existing test conventions discovered during exploration. + - **Projects without test infrastructure** (e.g., Linux kernel, embedded C): Extract the target function with `sed`, write a self-contained C test file with minimal type shims, compile and run directly. Include the extraction command in the test file's header comment so it's self-documenting. + +4. **Upstream reporting format.** For each TDD-verified bug, generate a ready-to-send report block containing: + - One-sentence description citing the spec section violated + - The FAIL→PASS output (copy-pasteable terminal session) + - The test file (as an attachment or inline) + - The fix patch (as an attachment or inline) + +5. **Traceability table.** The protocol produces a `quality/TDD_TRACEABILITY.md` file mapping: + + | Bug ID | Requirement ID | Spec Doc | Spec Section | Behavioral Contract | Test File:Function | Red Result | Green Result | + |--------|---------------|----------|-------------|--------------------|--------------------|------------|--------------| + + Every row must be fully populated. A bug without a spec doc entry is a code inconsistency, not a spec violation — note this in the table and adjust the upstream reporting language accordingly. + +6. **Structured output (mandatory).** The protocol must produce machine-readable results alongside the Markdown report. Use **JUnit XML** for test execution results and a **sidecar JSON** file for QPB-specific metadata that JUnit XML cannot represent. 
+ + **JUnit XML output:** For each red-green phase, run the test with the framework's native JUnit XML output flag: + - Python: `pytest --junitxml=quality/results/tdd-red-BUG-NNN.xml` + - Go: `gotestsum --junitfile quality/results/tdd-red-BUG-NNN.xml -- -run TestRegression_BUG_NNN` + - Java/Kotlin: Maven Surefire reports are generated automatically in `target/surefire-reports/`; copy relevant XML to `quality/results/` + - Rust: `cargo test --test regression -- -Z unstable-options --format json | cargo2junit > quality/results/tdd-red-BUG-NNN.xml` (if cargo2junit is available; it consumes cargo's JSON test output; otherwise skip XML for Rust) + - TypeScript: `jest --reporters=default --reporters=jest-junit` with `JEST_JUNIT_OUTPUT_DIR=quality/results/` + + If the framework's JUnit XML reporter is not available or requires a missing dependency, skip the XML output for that language and note it in the sidecar JSON (`"junit_available": false`). Do not fail the TDD run over missing XML tooling. + + **Sidecar JSON (strict schema enforcement):** Generate `quality/results/tdd-results.json` by copying the template below **verbatim** and filling in only the values. Do not invent fields, rename keys, or restructure the schema. The template is the schema — any deviation (extra keys, missing keys, renamed keys, restructured nesting) makes the output non-conformant. Copy-paste the template into your editor first, then fill in the values. Do not write the JSON from memory.
+ + ```json + { + "schema_version": "1.1", + "skill_version": "", + "date": "YYYY-MM-DD", + "project": "", + "bugs": [ + { + "id": "BUG-001", + "requirement": "REQ-003", + "red_phase": "fail", + "green_phase": "pass", + "verdict": "TDD verified", + "regression_patch": "quality/patches/BUG-001-regression-test.patch", + "fix_patch": "quality/patches/BUG-001-fix.patch", + "fix_patch_present": true, + "patch_gate_passed": true, + "writeup_path": "quality/writeups/BUG-001.md", + "junit_red": "tdd-red-BUG-001.xml", + "junit_green": "tdd-green-BUG-001.xml", + "junit_available": true, + "notes": "" + } + ], + "summary": { + "total": 6, + "verified": 4, + "confirmed_open": 1, + "red_failed": 1, + "green_failed": 0 + } + } + ``` + + **Required top-level fields:** `schema_version`, `skill_version`, `date`, `project`, `bugs`, `summary`. **Required per-bug fields:** `id`, `requirement`, `red_phase`, `green_phase`, `verdict`, `fix_patch_present`, `writeup_path`. If any required field is missing, the result is non-conformant. + + **Required summary sub-keys:** The `summary` object must contain exactly these keys: `total`, `verified`, `confirmed_open`, `red_failed`, `green_failed`. All five are required — omitting any of them (especially `red_failed` or `green_failed`) makes the summary non-conformant. + + **Canonical patch file names:** Regression test patches must be named `BUG-NNN-regression-test.patch`. Fix patches must be named `BUG-NNN-fix.patch`. The gate script globs for these exact patterns — creative variants like `BUG-001-regression.patch` or `BUG-001-test.patch` will not be counted. + + **Date field:** Use the actual date of this session (e.g., `"2026-04-12"`), not the template placeholder `"YYYY-MM-DD"`. The gate validates that the date is a real ISO 8601 date and rejects placeholder strings and future dates. 
+
+   **Invalid examples (do not emit these):**
+   - `"runs": [{"phase": "red", "command": "...", "result": "4 xfailed"}]` — this is a flat runs array, not the bug-indexed `"bugs"` schema.
+   - A schema with ad-hoc root keys like `"generated"`, `"scope"`, `"status"`, `"testsRun"` — these are not the standard schema fields.
+   - `"verdict": "skipped"` — this value is deprecated; use `"confirmed open"` with `red_phase: "fail"` and `green_phase: "skipped"`.
+   - Missing `"schema_version"` at the root — every tdd-results.json must include this field.
+
+   Valid `verdict` values: `"TDD verified"` (FAIL→PASS), `"red failed"` (test passed on unpatched code — the test doesn't detect the bug), `"green failed"` (test still fails after the fix — the fix is incomplete or the patch is corrupt), `"confirmed open"` (red phase ran and confirmed the bug, no fix patch available), and `"deferred"` (TDD could not execute — see the TDD artifact closure gate below). **Do not use `"skipped"` as a verdict** — every confirmed bug must have a red-phase result. A bug with `verdict: "confirmed open"` must have `red_phase: "fail"` (red ran and confirmed the bug) and `green_phase: "skipped"` (no fix to apply). Valid `red_phase`/`green_phase` values: `"fail"`, `"pass"`, `"error"` (compile/apply failure), `"skipped"` (green only — red is never skipped). The `patch_gate_passed` field records whether the patch validation gate (apply-check + compile) succeeded — `false` if the gate failed and the patch was repaired, `null` if no fix patch exists. The `writeup_path` field points to the per-bug writeup file (see "Bug writeup generation" below) — `null` if no writeup was generated for this bug.
+
+   Runner scripts and CI tools should read the sidecar JSON for pass/fail counts rather than grepping the Markdown report.
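As a sketch of what such a runner script could do, here is a hypothetical validator for the required keys and verdict enum described above (the function name and error strings are illustrative, not part of the skill):

```python
REQUIRED_ROOT = {"schema_version", "skill_version", "date", "project", "bugs", "summary"}
REQUIRED_BUG = {"id", "requirement", "red_phase", "green_phase", "verdict",
                "fix_patch_present", "writeup_path"}
REQUIRED_SUMMARY = {"total", "verified", "confirmed_open", "red_failed", "green_failed"}
# "deferred" covers environment-blocked runs per the TDD artifact closure gate.
VALID_VERDICTS = {"TDD verified", "red failed", "green failed", "confirmed open", "deferred"}

def sidecar_errors(data):
    """Collect conformance errors for a parsed tdd-results.json; empty list = conformant."""
    errors = [f"missing root key: {k}" for k in sorted(REQUIRED_ROOT - data.keys())]
    for bug in data.get("bugs", []):
        bug_id = bug.get("id", "?")
        errors += [f"{bug_id}: missing field {k}" for k in sorted(REQUIRED_BUG - bug.keys())]
        if bug.get("verdict") not in VALID_VERDICTS:
            errors.append(f"{bug_id}: invalid verdict {bug.get('verdict')!r}")
    errors += [f"summary missing key: {k}"
               for k in sorted(REQUIRED_SUMMARY - data.get("summary", {}).keys())]
    return errors
```

A CI step can fail the build when `sidecar_errors(json.load(f))` is non-empty, which is cheaper and more reliable than grepping the Markdown report.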
+ + **Post-write validation (mandatory).** After writing `tdd-results.json`, reopen the file and verify: (1) every required top-level field is present, (2) every required per-bug field is present in each `bugs[]` entry, (3) all `verdict`, `red_phase`, and `green_phase` values use only the allowed enum values listed above, (4) no extra undocumented root keys exist. If any check fails, fix the file before proceeding. This step catches the most common failure mode: the agent paraphrases the schema from memory instead of copying the template, producing plausible but non-conformant output. + +**TDD artifact closure gate (mandatory).** If `quality/BUGS.md` contains any confirmed bugs, `quality/results/tdd-results.json` is mandatory — not optional. If any bug has a red-phase result (whether TDD-verified or confirmed-open), `quality/TDD_TRACEABILITY.md` is also mandatory. Zero-bug repos may omit both files. A run that confirms bugs but produces no tdd-results.json is incomplete — the phase cannot close. For repos where TDD cannot execute (environment blocked, no test infrastructure), generate tdd-results.json with `verdict: "deferred"` and a `notes` field explaining why (e.g., `"environment_blocked: missing workspace Cargo.toml"`, `"no_test_infrastructure: kernel C code without userspace harness"`). The deferred verdict makes the gap visible instead of silently omitting the file. + +**Execution UX:** Same three-phase pattern as the integration tests — (1) show the plan as a numbered table of bugs to verify, (2) report one-line progress as each red-green cycle runs (`FAIL ✓ → PASS ✓` or `FAIL ✗ — test passes on unpatched code, rewriting`), (3) show a summary table with verified/failed/rewritten counts. + +7. **Bug writeup generation (for TDD-verified bugs).** After a successful red→green cycle (`verdict: "TDD verified"`), generate a self-contained writeup at `quality/writeups/BUG-NNN.md`. 
This file is designed to be emailed to a maintainer, attached to a Jira ticket, or reviewed outside the repository — it must stand alone without requiring the reader to navigate the rest of the quality artifacts. + + **Template (sections 1–4, 6, 7 are required in every writeup; add 5 when the depth judgment fires; add 8 when related bugs exist):** + + 1. **Summary** — One paragraph: what's wrong, where (file:line), what breaks in practice. + 2. **Spec reference** — The specific spec section violated, with URL if available. Quote the behavioral contract (≤15 words) that the code fails to satisfy. + 3. **The code** — The buggy code with file:line citation. Explain why it's wrong in terms of the spec, not just "it looks weird." + 4. **Observable consequence** — What actually breaks. Not "could theoretically fail" — what does fail, under what conditions, with what symptoms. + 5. **Depth judgment** *(include only when expansion is warranted)* — After drafting sections 1–4, assess: is the consequence self-evident from the code and test alone? If a reader would reasonably ask "why hasn't anyone noticed this?" or "does this affect all configurations equally?", expand the analysis. Trace the buggy function's callers. Show which code paths expose the bug and which mask it. Concrete expansion triggers: transport/config-dependent behavior, feature flags that mask the bug on some paths, indirect dispatch hiding callers, bugs in negotiation/initialization code that only manifest under specific runtime conditions. If the consequence is obvious from the immediate code (e.g., a null dereference, an off-by-one), keep sections 1–4 tight and omit this section. + 6. **The fix** — A proposed fix as an inline diff (unified diff format), with a brief explanation of why this is the right fix. **Always include a concrete diff** — even for confirmed-open bugs without a separate `.patch` file. If the fix is a one-line change (adding a case label, fixing an argument), write the diff. 
If the fix requires broader changes, write the minimal diff that addresses the core defect and note what additional changes a full fix would need. The inline diff in the writeup is what makes the writeup actionable — a writeup that says "No fix patch is included" is incomplete and not useful to a maintainer. Example format: + ```diff + --- a/drivers/virtio/virtio_ring.c + +++ b/drivers/virtio/virtio_ring.c + @@ -3527,6 +3527,7 @@ void vring_transport_features(...) + case VIRTIO_F_ORDER_PLATFORM: + case VIRTIO_F_IN_ORDER: + + case VIRTIO_F_RING_RESET: + default: + ``` + 7. **The test** — What the test proves, how to run it, and what output to expect on unpatched vs patched code. + 8. **Related issues** *(include only when related bugs exist)* — Other bugs in the same class, if any. Flag them even if they're not confirmed yet. Omit this section if no related issues were identified. + + **Include the version stamp** at the top of the writeup file (same format as all other generated files). + + **Writeup generation for all confirmed bugs (mandatory).** Generate a writeup at `quality/writeups/BUG-NNN.md` for every confirmed bug — both TDD-verified and confirmed-open. Use the numbered section template above (sections 1–8). For confirmed-open bugs, follow the same template including a proposed fix diff in section 6 (the diff is always required even without a separate `.patch` file). The writeup threshold is bug confirmation, not TDD completion. A run with confirmed bugs and no writeups directory is incomplete. + + **Inline diff is gate-enforced.** The `quality_gate.sh` script checks that every writeup contains a ` ```diff ` block. A writeup without an inline diff will cause the gate to FAIL. Do not write "see patch file" — paste the actual diff inline in the writeup body, inside a fenced ` ```diff ` code block. This is the single most important element of the writeup because it makes the bug actionable for a maintainer reading just the writeup. 
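The diff-block check reduces to a few lines. A hypothetical Python equivalent of what `quality_gate.sh` greps for (assumption: the shell script's exact matching may differ):

```python
import re

def writeup_has_inline_diff(markdown):
    """True if the writeup contains a fenced ```diff block with actual content."""
    match = re.search(r"```diff\n(.*?)```", markdown, re.DOTALL)
    return bool(match and match.group(1).strip())
```

A writeup whose section 6 says only "see patch file" returns `False` and fails the gate, as does an empty ` ```diff ` fence.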
+ +### Checkpoint: Update PROGRESS.md after artifact generation + +Re-read `quality/PROGRESS.md`. Update: +- Mark Phase 2 complete with timestamp +- Update the artifact inventory: set each generated artifact to "generated" with its file path +- Add exploration summary notes if not already present + +**Phase 2 completion gate (mandatory).** Before proceeding to Phase 3, verify: +1. All nine core artifacts exist on disk (`QUALITY.md`, `CONTRACTS.md`, `REQUIREMENTS.md`, `COVERAGE_MATRIX.md`, `test_functional.*`, `RUN_CODE_REVIEW.md`, `RUN_INTEGRATION_TESTS.md`, `RUN_SPEC_AUDIT.md`, `RUN_TDD_TESTS.md`). +2. `REQUIREMENTS.md` contains requirements with specific conditions of satisfaction referencing actual code (file paths, function names, line numbers) — not abstract behavioral descriptions. +3. If dispatch/enumeration contracts exist: `quality/mechanical/verify.sh` exists and has been executed. +4. PROGRESS.md marks Phase 2 complete with timestamp. + +Re-read `quality/PROGRESS.md` and `quality/REQUIREMENTS.md` before starting Phase 3. The requirements are the target list for the code review — every requirement is a potential bug if the code doesn't satisfy its conditions. + +**End-of-phase message (mandatory — print this after Phase 2 completes, then STOP):** + +``` +# Phase 2 Complete — Quality Artifacts Generated + +I've generated the quality infrastructure for this project: +[List the key artifacts created: REQUIREMENTS.md with N requirements and N use cases, +QUALITY.md with N scenarios, functional tests, review protocols, etc.] + +The requirements are now the target list for Phase 3's code review — every requirement +is a potential bug if the code doesn't satisfy it. + +To continue to Phase 3 (Code review with regression tests), say: + + Run quality playbook phase 3. + +Or say "keep going" to continue automatically. +``` + +**After printing this message, STOP. 
Do not proceed to Phase 3 unless the user explicitly asks.**
+
+---
+
+## Phase 3: Code Review and Regression Tests
+
+> **Required references for this phase:**
+> - `quality/REQUIREMENTS.md` — target list for the code review
+> - `.github/skills/references/review_protocols.md` — three-pass protocol and regression test conventions
+
+Run the code review protocol (all three passes) as described in File 3 (`references/review_protocols.md`). After producing findings, write regression tests for every confirmed BUG per the closure mandate in `references/review_protocols.md`.
+
+**Update PROGRESS.md:** Add every confirmed BUG to the cumulative BUG tracker with source "Code Review", the file:line reference, description, severity, and closure status (regression test function name or exemption reason). Mark Phase 3 (Code review + regression tests) complete.
+
+**End-of-phase message (mandatory — print this after Phase 3 completes, then STOP):**
+
+```
+# Phase 3 Complete — Code Review
+
+The three-pass code review is done. [Summarize: N bugs confirmed, N regression test
+patches generated, N fix patches generated. List the bug IDs and one-line summaries.]
+
+To continue to Phase 4 (Spec audit — Council of Three), say:
+
+    Run quality playbook phase 4.
+
+Or say "keep going" to continue automatically.
+```
+
+**After printing this message, STOP. Do not proceed to Phase 4 unless the user explicitly asks.**
+
+---
+
+## Phase 4: Spec Audit and Triage
+
+> **Required references for this phase:**
+> - `.github/skills/references/spec_audit.md` — Council of Three protocol, triage process, verification probes
+
+Run the spec audit protocol as described in File 5 (`references/spec_audit.md`). The triage report **must** include a `## Pre-audit docs validation` section (see `references/spec_audit.md` for the full template). This section is required even if `docs_gathered/` is empty — in that case, note what baseline the auditors used instead.
Every verification probe in the triage must produce executable evidence (test assertions with line-number citations) per the "Verification probes must produce executable evidence" rule above. After triage, categorize each confirmed finding. + +**Effective council gating for enumeration checks.** If the effective council is less than 3/3 (fewer than three auditors returned usable reports) and the run includes any whitelist/enumeration/dispatch-function checks or any carried-forward seed checks, the audit may not conclude "no confirmed defects" for those checks without executed mechanical proof artifacts. An incomplete council with mechanical verification is acceptable. An incomplete council relying on prose-only validation for code-presence claims is not — escalate to "NEEDS VERIFICATION" and run the mechanical check before closing. + +**Pre-audit spot-checks must extract from code, not assert from docs.** When the spec audit prompt includes spot-check claims for pre-validation (e.g., "verify that function X handles constant Y at line Z"), the triage must validate each claim by extracting the actual code at the cited lines — not by confirming that the claim sounds plausible. For each spot-check claim about code contents, the pre-validation must report what the cited lines actually contain: "Line 3527 contains `default:` — NOT `case VIRTIO_F_RING_RESET:` as claimed." If the spot-check was generated from requirements or gathered docs rather than from the code itself, treat it as a hypothesis to test, not a fact to confirm. This rule prevents the contamination chain observed in v1.3.17 where a false spot-check claim ("RING_RESET at 3527-3528") was accepted as "accurate" without reading the actual lines, then propagated through the triage and into every downstream artifact. + +**Update PROGRESS.md:** Add every confirmed **code bug** from the spec audit to the cumulative BUG tracker with source "Spec Audit". 
This is critical — spec-audit bugs are systematically orphaned if they aren't added to the same tracker that the closure verification reads. + +### Post-spec-audit regression tests + +After the spec audit triage, check the cumulative BUG tracker in PROGRESS.md. Any spec-audit BUG that doesn't have a regression test yet needs one now. Write regression tests for spec-audit confirmed code bugs using the same conventions as code-review regression tests (expected-failure markers, test-finding alignment, executable source files). + +**Why this step exists:** Code review bugs get regression tests immediately because tests are written right after the review. Spec audit runs after the tests are written, so its confirmed bugs are orphaned — they appear in the triage report but never get tests. This step closes that gap. + +**Individual auditor artifacts (mandatory).** The spec audit must produce individual auditor report files at `quality/spec_audits/YYYY-MM-DD-auditor-N.md` (one per auditor), not just the triage synthesis. Each auditor report records what that auditor found independently before triage reconciliation. If only the triage file exists with no individual auditor artifacts, the audit is incomplete — the triage cannot be verified because there is no record of pre-reconciliation findings. This requirement exists because a single triage file conflates discovery with reconciliation, making it impossible to tell whether a finding was independently confirmed or synthesized from a single source. + +**Phase 4 completion gate.** Phase 4 is not complete until a triage file exists at `quality/spec_audits/YYYY-MM-DD-triage.md` **and** individual auditor reports exist. If only auditor reports exist with no triage synthesis, mark Phase 4 as "partial — triage pending" in PROGRESS.md and complete the triage before proceeding. If only the triage exists with no individual reports, mark Phase 4 as "partial — auditor artifacts missing" and regenerate them. 
The PROGRESS.md checkbox must not be set until both the triage file and auditor reports are confirmed present. + +Update the BUG tracker entries with regression test references. Mark Phase 4 (Spec audit + triage) complete. + +**End-of-phase message (mandatory — print this after Phase 4 completes, then STOP):** + +``` +# Phase 4 Complete — Spec Audit + +The Council of Three spec audit is done. [Summarize: N auditors ran, N net-new bugs +confirmed from triage, total bugs now at N. List any new bug IDs and summaries.] + +To continue to Phase 5 (Reconciliation — TDD verification, writeups, closure), say: + + Run quality playbook phase 5. + +Or say "keep going" to continue automatically. +``` + +**After printing this message, STOP. Do not proceed to Phase 5 unless the user explicitly asks.** + +--- + +## Phase 5: Post-Review Reconciliation and Closure Verification + +> **Required references for this phase:** +> - `quality/PROGRESS.md` — cumulative BUG tracker (authoritative finding list) +> - `.github/skills/references/requirements_pipeline.md` — post-review reconciliation process +> - `.github/skills/references/review_protocols.md` — regression test cleanup after reversals +> - `.github/skills/references/spec_audit.md` — verification probe protocol for conflicts + +Re-read `quality/PROGRESS.md` — specifically the cumulative BUG tracker. This is the authoritative list of all findings across both code review and spec audit. + +1. **Run the Post-Review Reconciliation** as described in `references/requirements_pipeline.md`. Update COMPLETENESS_REPORT.md. +2. **Run closure verification:** For every row in the BUG tracker, verify it has either a regression test reference or an explicit exemption. If any BUG lacks both, write the test or exemption now. +3. **Triage-to-BUGS.md sync gate (mandatory).** Re-read the triage report (`quality/spec_audits/*-triage.md`). For every finding confirmed as a code bug, verify it appears in `quality/BUGS.md`. 
If BUGS.md does not exist, create it now. If BUGS.md exists but is missing confirmed bugs from the triage, append them. A triage report with confirmed code bugs and no corresponding BUGS.md entries is non-conformant — the phase cannot be marked complete until they are synced. This gate exists because in v1.3.21 benchmarking, javalin's triage confirmed 2 bugs but BUGS.md was never created. +4. **Clean up after spec-audit reversals:** If the spec audit reclassified any code review BUG as a design choice or false positive, remove or relocate the corresponding regression test per `references/review_protocols.md`. +5. **Resolve CR vs spec-audit conflicts:** If the code review and spec audit disagree on the same finding (one says BUG, the other says design choice), deploy a verification probe per `references/spec_audit.md` and record the resolution in the BUG tracker. + +**TDD sidecar-to-log consistency check (mandatory).** For every bug entry in `tdd-results.json`, verify the corresponding log files exist and agree. If `tdd-results.json` contains a bug with `verdict: "TDD verified"`, then `quality/results/BUG-NNN.red.log` must exist with first line `RED` and `quality/results/BUG-NNN.green.log` must exist with first line `GREEN`. If the sidecar claims "TDD verified" but no red-phase log exists, the verdict is unsubstantiated — either create the log by running the test, or downgrade the verdict to `"confirmed open"`. This check exists because v1.3.46 benchmarking showed agents writing "TDD verified" verdicts in the JSON based on narrative reasoning without ever executing the test. + +**Executed evidence outranks narrative artifacts (contradiction gate).** Before running the terminal gate, check for contradictions between executed evidence and prose artifacts. 
Executed evidence includes: mechanical verification artifacts (`quality/mechanical/*`), verification receipt files (`quality/results/mechanical-verify.log`, `quality/results/mechanical-verify.exit`), regression test results (`test_regression.*` with `xfail` outcomes), TDD red-phase log files (`quality/results/BUG-NNN.red.log`), and any shell command output saved during the pipeline. Prose artifacts include: `REQUIREMENTS.md`, `CONTRACTS.md`, code reviews, spec audit triage, and `BUGS.md`. If an executed artifact shows a constant is absent (mechanical check), a test fails (regression test), or a red-phase confirms a bug (TDD traceability) — but a prose artifact claims the constant is present, the bug is fixed, or the code is compliant — the executed result wins. Re-open and correct the contradictory prose artifact before proceeding. Specifically: if `mechanical-verify.exit` contains a non-zero value, PROGRESS.md may not claim "Mechanical verification: passed" and the terminal gate may not pass — regardless of what any other artifact says. In v1.3.18, the triage claimed RING_RESET was preserved (`spec_audits/triage.md`), BUGS.md claimed "fixed in working tree," but TDD traceability showed the assertion `assert "case VIRTIO_F_RING_RESET:" in func` failed on the current source. Those three cannot all be true — the executed failure is the ground truth. This gate would have caught that contradiction. + +**Version stamp consistency check (mandatory).** Read the `version:` field from the SKILL.md metadata (in `.github/skills/SKILL.md`). Then check every generated artifact: PROGRESS.md's `Skill version:` field, every `> Generated by` attribution line, every code file header stamp, and every sidecar JSON `skill_version` field. Every version stamp must match the SKILL.md metadata exactly. A single mismatch is a benchmark failure — fix the stamp before proceeding. 
This check exists because in v1.3.21 benchmarking, 5 of 9 repos had version stamps from older skill versions (v1.3.16 or v1.3.20) because the PROGRESS.md template contained a hardcoded version number. + +**Mechanical directory conformance check.** If `quality/mechanical/` exists, it must contain at minimum a `verify.sh` file. An empty `quality/mechanical/` directory is non-conformant — it implies the step was attempted but abandoned. If no dispatch-function contracts exist in this project's scope, do not create a `mechanical/` directory at all. Instead, record in PROGRESS.md: `Mechanical verification: NOT APPLICABLE — no dispatch/registry/enumeration contracts in scope.` If dispatch contracts do exist, `verify.sh` must include one verification block per saved extraction file under `quality/mechanical/` (not just one). A verify.sh that checks only one artifact when multiple exist is incomplete. + +**Verification receipt gate (mandatory before terminal gate).** If `quality/mechanical/` exists, the following receipt files must also exist before the terminal gate may run: +- `quality/results/mechanical-verify.log` — full stdout/stderr from `bash quality/mechanical/verify.sh` +- `quality/results/mechanical-verify.exit` — a single line containing the exit code (e.g., `0`) + +If either file is missing, run `bash quality/mechanical/verify.sh > quality/results/mechanical-verify.log 2>&1; echo $? > quality/results/mechanical-verify.exit` now. If the exit code is not `0`, the terminal gate fails — do not proceed until the mechanical mismatch is resolved (by fixing the extraction, not by editing verify.sh or the receipt). PROGRESS.md may not claim "Mechanical verification: passed" unless `mechanical-verify.exit` contains `0`. This gate exists because v1.3.23 PROGRESS.md claimed all verification passed when verify.sh actually returned exit 1 — the receipt file makes this claim auditable. 
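The receipt gate's decision logic can be sketched as a pure function over the two receipt facts (hypothetical helper; the real gate reads the files under `quality/results/`):

```python
def receipt_gate_error(log_exists, exit_contents):
    """Return "" if the receipt gate passes, else the failure reason."""
    if not log_exists:
        return "missing quality/results/mechanical-verify.log"
    if exit_contents is None:
        return "missing quality/results/mechanical-verify.exit"
    code = exit_contents.strip()
    if code != "0":
        return f"verify.sh exited {code}: fix the extraction, not the receipt"
    return ""
```

When this returns a non-empty string, PROGRESS.md may not claim "Mechanical verification: passed" and the terminal gate may not run.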
+ +**TDD Log Closure Gate (mandatory before terminal gate).** Before proceeding to the terminal gate, enumerate all confirmed bug IDs from `quality/BUGS.md` and verify: +1. `quality/results/BUG-NNN.red.log` exists for every confirmed bug. +2. If `quality/patches/BUG-NNN-fix.patch` exists for that bug, `quality/results/BUG-NNN.green.log` also exists. +3. The first line of each log file is one of: `RED`, `GREEN`, `NOT_RUN`, `ERROR`. +If any check fails, stop and generate the missing logs now using the language-aware test execution commands from the TDD execution enforcement section. Do not proceed to the terminal gate with missing TDD logs — a bug with a "TDD verified" verdict in tdd-results.json but no corresponding red-phase log is a contradiction. + +**Terminal gate (mandatory before marking Phase 5 complete):** + +**Prerequisite check:** The terminal gate may run only if Phase 3 (code review) and Phase 4 (spec audit) are both complete, or explicitly marked skipped with rationale in PROGRESS.md. A zero-bug outcome is valid only if code review and spec audit artifacts exist (i.e., `quality/code_reviews/` and `quality/spec_audits/` directories contain report files). If these artifacts are missing and the phases are not explicitly skipped, the terminal gate fails — do not mark Phase 5 complete. + +**BUGS.md is always required.** Every completed run must produce `quality/BUGS.md`, regardless of whether bugs were found. If code review and spec audit confirmed zero source-code bugs, create BUGS.md with a `## Summary` stating "No confirmed source-code bugs found" and listing how many candidates were evaluated and eliminated (e.g., "Code review evaluated N candidates; spec audit evaluated M candidates; all were reclassified as design choices, test-only issues, or false positives"). This provides a positive assertion of a clean outcome rather than ambiguous file absence. A completed run with no BUGS.md is non-conformant. 
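The TDD Log Closure Gate above also reduces to a pure function over the gathered facts (hypothetical helper; the real gate enumerates bug IDs from `quality/BUGS.md` and reads first lines from `quality/results/`):

```python
VALID_FIRST_LINES = {"RED", "GREEN", "NOT_RUN", "ERROR"}

def tdd_log_gaps(bug_ids, red_first_lines, bugs_with_fix_patch, green_first_lines):
    """Return closure-gate violations; an empty list means the gate passes.

    red_first_lines/green_first_lines map bug id -> first line of that log file.
    """
    gaps = []
    for bug in bug_ids:
        if bug not in red_first_lines:
            gaps.append(f"{bug}: quality/results/{bug}.red.log missing")
        elif red_first_lines[bug] not in VALID_FIRST_LINES:
            gaps.append(f"{bug}: invalid red log first line {red_first_lines[bug]!r}")
        if bug in bugs_with_fix_patch and bug not in green_first_lines:
            gaps.append(f"{bug}: fix patch exists but green log missing")
    return gaps
```

Any reported gap means the terminal gate must not run until the missing logs are generated by actually executing the tests.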
+ +**BUGS.md heading format.** Each confirmed bug must use the heading level `### BUG-NNN` (e.g., `### BUG-001`). This is the canonical heading format — not `## BUG-001`, not `**BUG-001**`, not a bullet point. The `### BUG-NNN` heading is what downstream tools grep for when counting bugs, and what the tdd-results.json `id` field must match. Inconsistent heading levels cause machine-readable counts to disagree with the document. + +Re-read `quality/PROGRESS.md`. Count the BUG tracker entries. Then: + +1. Print the following statement to the user (this is mandatory, not optional): + + > "BUG tracker has N entries. N have regression tests, N have exemptions, N are unresolved. Code review confirmed M bugs. Spec audit confirmed K code bugs (L net-new). Expected total: M + L." + +2. Write the same statement into PROGRESS.md under a new `## Terminal Gate Verification` section (immediately after the BUG tracker table). This persists the gate into the artifact so reviewers can verify it without reading session logs. + +If the tracker entry count does not equal M + L, stop and reconcile — a BUG was orphaned from the tracker. Do not mark Phase 5 complete until the counts match. This gate exists because the v1.3.5 bootstrap showed that agents reliably skip the tracker update after spec audit, orphaning 30-50% of confirmed bugs. + +**Regression test function-name verification:** For each BUG tracker entry that references a regression test, grep for the test function name in the regression test file and confirm it exists. An agent can write a test name in the tracker without actually creating the test. If any referenced test function does not exist, write it now before passing the gate. + +3. Verify the `With docs` metadata field in PROGRESS.md matches reality: if `docs_gathered/` exists and contains files, it should say `yes`; otherwise `no`. Fix it if wrong. 
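Since downstream tools count bugs by grepping the canonical heading, the extraction is worth sketching (hypothetical function; `quality_gate.sh` implements the equivalent in shell):

```python
import re

def confirmed_bug_ids(bugs_md):
    """Extract canonical `### BUG-NNN` headings; other heading styles are not counted."""
    return re.findall(r"^### (BUG-\d{3})\b", bugs_md, re.MULTILINE)
```

A `## BUG-002` or `**BUG-003**` entry silently disappears from the machine-readable count, which is exactly the count/document disagreement the heading rule prevents.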
+ +**Artifact file-existence gate (mandatory before marking Phase 5 complete).** Before writing the Phase 5 completion checkbox, verify that every required artifact exists as a file on disk — not just mentioned in PROGRESS.md. Run these checks (use `ls` or equivalent): + +- `quality/BUGS.md` exists (required for all completed runs, per benchmark 34) +- `quality/REQUIREMENTS.md` exists +- `quality/QUALITY.md` exists +- `quality/PROGRESS.md` exists (obviously — you're writing to it) +- `quality/COVERAGE_MATRIX.md` exists +- `quality/COMPLETENESS_REPORT.md` exists +- If Phase 3 ran: `quality/code_reviews/` contains at least one `.md` file +- If Phase 4 ran: `quality/spec_audits/` contains a triage file AND individual auditor files +- If Phase 0 or 0b ran: `quality/SEED_CHECKS.md` exists as a standalone file (not inlined in PROGRESS.md) +- If confirmed bugs exist: `quality/results/tdd-results.json` exists +- If confirmed bugs exist: `quality/results/BUG-NNN.red.log` exists for every confirmed bug ID in `quality/BUGS.md` +- If confirmed bugs exist with fix patches: `quality/results/BUG-NNN.green.log` exists for each bug that has a `quality/patches/BUG-NNN-fix.patch` + +For each missing file, create it now. Do not mark Phase 5 complete with missing artifacts — the terminal gate verification in PROGRESS.md is meaningless if the files it references don't exist on disk. This gate exists because v1.3.24 benchmarking showed express completing all phases and writing a terminal gate section in PROGRESS.md, but BUGS.md, SEED_CHECKS.md, and code review/spec audit files were never written to disk. + +**Sidecar JSON post-write validation (mandatory).** After writing `quality/results/tdd-results.json` and/or `quality/results/integration-results.json`, immediately reopen each file and verify it contains all required keys. For `tdd-results.json`, the required root keys are: `schema_version`, `skill_version`, `date`, `project`, `bugs`, `summary`. 
Each entry in `bugs` must have: `id`, `requirement`, `red_phase`, `green_phase`, `verdict`, `fix_patch_present`, `writeup_path`. The `summary` object must include `confirmed_open` alongside `verified`, `red_failed`, `green_failed`. For `integration-results.json`, the required root keys are: `schema_version`, `skill_version`, `date`, `project`, `recommendation`, `groups`, `summary`, `uc_coverage`. Both files must have `schema_version: "1.1"`. If any key is missing, add it now — do not leave a non-conformant JSON file on disk. This validation exists because v1.3.25 benchmarking showed 6 of 8 repos with non-conformant sidecar JSON: httpx invented an alternate schema, serde used legacy shape, javalin omitted `summary` and per-bug fields, and others used invalid enum values. + +**Script-verified closure gate (mandatory, final step before marking Phase 5 complete).** Run `bash .github/skills/quality_gate.sh .` from the project root directory. This script mechanically validates: file existence, BUGS.md heading format, sidecar JSON required keys AND per-bug field names (`id`, `requirement`, `red_phase`, `green_phase`, `verdict`, `fix_patch_present`, `writeup_path`) AND enum values AND summary consistency, use case identifiers, terminal gate section, mechanical verification receipts, version stamps, writeup completeness, **regression-test patch presence for every confirmed bug**, and **inline fix diffs in every writeup** (every `quality/writeups/BUG-NNN.md` must contain a ` ```diff ` block). If the script reports any FAIL results, fix each failing check before proceeding — the most common FAILs are: (1) missing `quality/patches/BUG-NNN-regression-test.patch` files, (2) non-canonical JSON field names like `bug_id` instead of `id`, (3) missing `confirmed_open` in the TDD summary, (4) writeups without inline fix diffs (section 6 must include a concrete diff, not just "see patch file"). Do not mark Phase 5 complete until `quality_gate.sh` exits 0. 
Append the script's full output to `quality/results/quality-gate.log`. + +**Use case identifier format.** REQUIREMENTS.md must use canonical use case identifiers in the format `UC-01`, `UC-02`, etc. for all derived use cases. Each use case must be labeled with its identifier. This is required for machine-readable traceability — the identifier format enables `quality_gate.sh` and downstream tooling to count and cross-reference use cases programmatically. Use cases written as prose paragraphs without identifiers are non-conformant. + +Update PROGRESS.md: mark Phase 5 complete. The BUG tracker should now show closure status for every entry. + +**End-of-phase message (mandatory — print this after Phase 5 completes, then STOP):** + +``` +# Phase 5 Complete — Reconciliation and TDD Verification + +All confirmed bugs now have regression tests, writeups, and TDD red-green verification. +[Summarize: N total confirmed bugs, N with TDD verified status, N with fix patches. +List all bug IDs with one-line summaries and their TDD verdicts.] + +To continue to Phase 6 (Final verification and quality gate), say: + + Run quality playbook phase 6. + +Or say "keep going" to continue automatically. +``` + +**After printing this message, STOP. Do not proceed to Phase 6 unless the user explicitly asks.** + --- -## Phase 3: Verify +## Phase 6: Verify + +> **Required references for this phase:** +> - `.github/skills/references/verification.md` — 45 self-check benchmarks **Why a verification phase?** AI-generated output can look polished and be subtly wrong. Tests that reference undefined fixtures report 0 failures but 16 errors — and "0 failures" sounds like success. Integration protocols can list field names that don't exist in the actual schemas. The verification phase catches these problems before the user discovers them, which is important because trust in a generated quality playbook is fragile — one wrong field name undermines confidence in everything else. 
-### Self-Check Benchmarks +**Phase 6 execution model: incremental, not monolithic.** Phase 6 runs as a series of independent verification steps, each reading only the file(s) it needs, checking one thing, and writing its result to `quality/results/phase6-verification.log` before moving to the next step. Do NOT load all artifacts into context at once. Do NOT try to hold the full verification checklist in memory while reading artifacts. Each step below is self-contained — read the file, check the condition, append the result, drop the context. + +### Step 6.1: Mechanical Verification Closure (mandatory first step) + +If `quality/mechanical/` exists, the **literal first action** of Phase 6 is: + +```bash +bash quality/mechanical/verify.sh > quality/results/mechanical-verify.log 2>&1 +echo $? > quality/results/mechanical-verify.exit +``` + +Execute this command in the shell. Do not substitute a Python script, do not read the artifact file and assert on its contents, do not skip this step. The command must be `bash quality/mechanical/verify.sh` — not `python3 -c "..."`, not `cat quality/mechanical/... | grep ...`, not any other equivalent. + +Record the exit code. If non-zero, **Phase 6 fails immediately.** Do not proceed to further steps. Go back to the extraction step: delete the mismatched `*_cases.txt`, re-run the extraction command with a fresh shell redirect, re-verify, and update all downstream artifacts that cited the old extraction. + +Record in PROGRESS.md under `## Phase 6 Mechanical Closure` and append to `quality/results/phase6-verification.log`: +``` +[Step 6.1] Mechanical verification: PASS (exit 0) +``` + +**Why this is non-substitutable:** In v1.3.23, the model replaced `bash verify.sh` with `python3 -c "from pathlib import Path; ..."` that read the (forged) artifact file and asserted on its contents — a circular check that passed despite the artifact being fabricated. 
The only trustworthy verification is re-running the same shell pipeline that produced the artifact and diffing the results. Any other method can be fooled by a corrupted intermediate file. + +### Step 6.2: Run quality_gate.sh (script-verified checks) + +Run the mechanical validation gate: + +```bash +bash .github/skills/quality_gate.sh . > quality/results/quality-gate.log 2>&1 +echo $? >> quality/results/phase6-verification.log +``` + +Read `quality/results/quality-gate.log`. If it reports any FAIL results, fix each failing check before proceeding. The most common FAILs are: (1) missing `quality/patches/BUG-NNN-regression-test.patch` files, (2) non-canonical JSON field names like `bug_id` instead of `id`, (3) missing `confirmed_open` in the TDD summary, (4) writeups without inline fix diffs, (5) missing TDD red/green log files. Do not proceed until `quality_gate.sh` exits 0. + +Append to `quality/results/phase6-verification.log`: +``` +[Step 6.2] quality_gate.sh: PASS (exit 0) — N checks passed, 0 FAIL, 0 WARN +``` + +This step covers verification benchmarks: 14 (sidecar JSON), 17 (test file extension), 18 (use case count), 20 (writeups), 23 (mechanical artifacts), 26 (version stamps), 27 (mechanical directory), 29 (triage-to-BUGS sync), 34 (BUGS.md exists), 38 (individual auditor reports), 39 (BUGS.md heading format), 40 (artifact file existence), 41 (sidecar JSON validation), 42 (script-verified closure), 43 (use case identifiers), 44 (regression-test patches), 45 (writeup inline diffs). + +### Step 6.3: Test execution verification + +Run the functional test suite. 
Read only `quality/test_functional.*` to determine the test command: + +- **Python:** `pytest quality/test_functional.py -v 2>&1 | tail -20` +- **Java:** `mvn test -Dtest=FunctionalTest` or `gradle test --tests FunctionalTest` +- **Go:** `go test -v` targeting the generated test file's package +- **TypeScript:** `npx jest functional.test.ts --verbose` +- **Rust:** `cargo test` +- **Scala:** `sbt "testOnly *FunctionalSpec"` + +Check for both failures AND errors. Errors from missing fixtures, failed imports, or unresolved dependencies count as broken tests. Expected-failure (xfail) regression tests do not count against this check. + +Append to `quality/results/phase6-verification.log`: +``` +[Step 6.3] Functional tests: PASS — N tests, 0 failures, 0 errors +``` + +This covers benchmarks 8 (all tests pass) and 9 (existing tests unbroken). + +### Step 6.4: Verification checklist — file-by-file checks + +Process the remaining verification benchmarks from `references/verification.md` in small batches. For each batch, read only the file(s) needed, check the condition, and append the result. **Do not read more than 2 files per batch.** + +**Batch A — QUALITY.md (benchmarks 1-2, 10):** Read `quality/QUALITY.md`. Count scenarios. Verify each scenario references real code (grep for cited function names). Append results. -Before declaring done, check every benchmark. **Read `references/verification.md`** for the complete checklist. +**Batch B — Functional test file (benchmarks 3-7):** Read `quality/test_functional.*`. Check cross-variant coverage (~30%), boundary test count, assertion depth (value checks vs presence checks), layer correctness (outcomes vs mechanisms), mutation validity. -The critical checks: +**Batch C — Protocol files (benchmarks 11-13):** Read `quality/RUN_CODE_REVIEW.md`, then `quality/RUN_INTEGRATION_TESTS.md`, then `quality/RUN_SPEC_AUDIT.md` — one at a time. Check each is self-contained and executable. Verify Field Reference Table in integration tests. 
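The failures-AND-errors rule in Step 6.3 can be sketched as a log-appending gate. The summary line here is a hypothetical runner tail; adapt the pattern match to the project's test runner:

```shell
# Append a Step 6.3 result only after checking for BOTH failures and
# errors -- "0 failures" with nonzero errors still means a broken suite.
log=/tmp/phase6-verification.log
: > "$log"                      # fresh log for this sketch
summary="47 passed in 2.31s"    # hypothetical tail line from the runner
if printf '%s\n' "$summary" | grep -Eq '[1-9][0-9]* (failed|error)'; then
  echo "[Step 6.3] Functional tests: FAIL — $summary" >> "$log"
else
  echo "[Step 6.3] Functional tests: PASS — $summary" >> "$log"
fi
cat "$log"
```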
-1. **Test count** near heuristic target (spec sections + scenarios + defensive patterns) -2. **Scenario coverage** — scenario test count matches QUALITY.md scenario count -3. **Cross-variant coverage** — ~30% of tests parametrize across all input variants -4. **Boundary test count** ≈ defensive pattern count from Step 5 -5. **Assertion depth** — Majority of assertions check values, not just presence -6. **Layer correctness** — Tests assert outcomes (what spec says), not mechanisms (how code implements) -7. **Mutation validity** — Every fixture mutation uses a schema-valid value from Step 5b -8. **All tests pass — zero failures AND zero errors.** Run the test suite using the project's test runner (Python: `pytest -v`, Scala: `sbt testOnly`, Java: `mvn test`/`gradle test`, TypeScript: `npx jest`, Go: `go test -v`, Rust: `cargo test`) and check the summary. Errors from missing fixtures, failed imports, or unresolved dependencies count as broken tests. If you see setup errors, you forgot to create the fixture/setup file or referenced undefined test helpers. -9. **Existing tests unbroken** — The new files didn't break anything. -10. **Integration test quality gates were written from a Field Reference Table.** Verify that you built a Field Reference Table by re-reading each schema file before writing quality gates, and that every field name in the quality gates is copied from that table — not from memory. If you skipped the table, go back and build it now. +**Batch D — Regression tests (benchmarks 15-16, 24):** Read `quality/test_regression.*` if it exists. Verify skip guards reference bug IDs, verify patch validation gate commands, verify source-inspection tests don't use `run=False`. -If any benchmark fails, go back and fix it before proceeding. +**Batch E — Enumeration and triage checks (benchmarks 19, 21-22, 25, 36):** Read `quality/code_reviews/*.md` (just the enumeration sections). Read `quality/spec_audits/*triage*` (just the verification probe sections). 
Check two-list comparisons, executable probe evidence, no circular mechanical artifact references, contradiction gate.
+
+**Batch F — Continuation mode (benchmarks 32-33):** Only if `quality/SEED_CHECKS.md` exists. Read it, verify mechanical execution, verify convergence section in PROGRESS.md.
+
+Append each batch result to `quality/results/phase6-verification.log`:
+```
+[Step 6.4A] QUALITY.md scenarios: PASS — 8 scenarios, all reference real code
+[Step 6.4B] Functional test quality: PASS — 30% cross-variant, assertion depth OK
+[Step 6.4C] Protocol files: PASS — all self-contained and executable
+[Step 6.4D] Regression tests: PASS — all skip guards present
+[Step 6.4E] Enumeration/triage: PASS — two-list checks present, probes have assertions
+[Step 6.4F] Continuation mode: SKIP — no SEED_CHECKS.md
+```
+
+If any batch fails, fix the issue immediately before proceeding to the next batch.
+
+### Step 6.5: Metadata Consistency Check
+
+Read `quality/PROGRESS.md` (just the metadata and artifact inventory sections). Then spot-check:
+- The requirement count is consistent across REQUIREMENTS.md header, PROGRESS.md artifact inventory, and COVERAGE_MATRIX.md header. All three must state the same number.
+- The `With docs` field accurately reflects whether `docs_gathered/` exists
+- The Terminal Gate Verification section is present and filled in
+
+Then read `quality/COMPLETENESS_REPORT.md` (just the verdict section). Verify no stale pre-reconciliation text remains — if both a `## Verdict` and an `## Updated verdict` (or `## Post-Review Reconciliation`) section exist, **delete the original `## Verdict` section entirely**. The final document must have exactly one `## Verdict` heading.
+
+Append to `quality/results/phase6-verification.log`:
+```
+[Step 6.5] Metadata consistency: PASS — requirement counts match, version stamps consistent
+```
+
+If any metadata is stale, fix it now.
+
+### Checkpoint: Finalize PROGRESS.md
+
+Re-read `quality/PROGRESS.md`.
Update: +- Mark Phase 6 (Verification benchmarks) complete with timestamp +- Verify the BUG tracker has closure for every entry +- Add a final summary line: "Run complete. N BUGs found (N from code review, N from spec audit). N regression tests written. N exemptions granted." +- **Print the suggested next prompt to the user (mandatory, all runs).** This applies to EVERY run, including baseline — it is not iteration-specific. Print the following block so the user can copy-paste it to start the next iteration: + + For a baseline run (no iteration strategy): + ``` + ──────────────────────────────────────────────────────── + Next iteration suggestion: + "Run the next iteration of the quality playbook using the gap strategy." + ──────────────────────────────────────────────────────── + ``` + + For iteration runs, use this mapping to determine the next strategy: + - **gap** → suggest unfiltered + - **unfiltered** → suggest parity + - **parity** → suggest adversarial + - **adversarial** → suggest "Run the quality playbook from scratch." (cycle complete) + +The completed PROGRESS.md is a permanent audit trail. It documents what the skill did, what it found, and how it resolved each finding. Users can read it to understand the run, debug failures, and compare across runs. + +### Convergence Check (continuation mode only) + +> **Scope:** This subsection only. The suggested-next-prompt step above is unconditional and must execute on every run regardless of whether this convergence check is skipped. + +**This step runs only if Phase 0 executed** (i.e., `quality/SEED_CHECKS.md` exists from prior-run analysis). If this is a first run with no prior history, skip to Phase 7. + +Compare this run's bug list against the seed list: + +1. **Count net-new bugs:** bugs in this run's BUGS.md that do NOT match any seed (by file:line). A bug is "net-new" if it was not found in any prior run. +2. **Count seed carryovers:** seeds that were re-confirmed in this run (FAIL result in Step 0b). 
+3. **Count seed resolutions:** seeds that are now passing (bug was fixed since prior run). + +Write a `## Convergence` section to PROGRESS.md: + +```markdown +## Convergence + +Run number: N (N prior runs in previous_runs/) +Seeds from prior runs: S (S confirmed, R resolved) +Net-new bugs this run: K +Convergence: [CONVERGED | NOT CONVERGED] + +Net-new bugs: +- BUG-NNN: [summary] (file:line) — not in any prior run +``` + +**Convergence criterion:** The run is converged if **net-new bugs = 0** — every bug found in this run was already known from a prior run. This means further runs are unlikely to find additional bugs in the declared scope. + +**If CONVERGED:** Print to the user: "This run found no new bugs beyond the N already known from prior runs. Bug discovery has converged for this scope. Total confirmed bugs across all runs: T." Then proceed to Phase 7. + +**If NOT converged — automatic re-iteration.** When the convergence check shows net-new bugs > 0 and the iteration count has not reached the maximum (default: 5), the skill re-iterates automatically: + +1. Record the iteration number and net-new count in PROGRESS.md. +2. Archive the current `quality/` directory: `cp -a quality/ previous_runs//quality/` then `rm -rf quality/ control_prompts/`. +3. Restart from **Phase 0** (which will now find the newly archived run in `previous_runs/`). +4. Print to the user: "Iteration N found K net-new bugs. Archiving and starting iteration N+1 (max M)." + +The iteration counter starts at 1 for the first run. Each archive-and-restart increments it. When the counter reaches the maximum, stop iterating even if not converged and print: "Reached maximum iterations (M) without convergence. K net-new bugs found in the last run. Total confirmed bugs across all runs: T." + +**Iteration limits.** The default maximum is 5 iterations. If the user's prompt includes an explicit limit (e.g., "run the playbook with 3 iterations"), use that limit instead. 
If the user's prompt says "single run" or "no iteration," skip re-iteration entirely and treat NOT CONVERGED the same as the pre-iteration behavior: print the net-new count and suggest re-running. + +**Context window awareness.** If at any point during re-iteration you detect that your context window is substantially consumed (e.g., you are producing noticeably shorter or lower-quality output than earlier iterations), stop iterating, write the current state to PROGRESS.md, and print: "Stopping iteration due to context constraints. Completed N of M iterations. Re-run the playbook to continue — Phase 0 will pick up the seed list from previous_runs/." This is a safety valve, not a target — most codebases converge in 2-3 iterations. + +**Why this matters:** A single playbook run explores a subset of the codebase non-deterministically. The first run on virtio might find BUG-001 and BUG-004 but miss BUG-005. The second run might find BUG-005 and BUG-006. By the third run, if no net-new bugs appear, the exploration has likely covered the high-value territory. The seed list ensures previously found bugs are never lost between runs, and the convergence check tells the user when additional runs have diminishing returns. Automatic re-iteration means the skill is self-contained — callers don't need external scripts or manual re-runs to achieve convergence. + +**End-of-phase message (mandatory — print this after Phase 6 completes, then STOP):** + +``` +# Phase 6 Complete — All Phases Done + +The quality playbook baseline run is complete. Here's the summary: + +[Include: total confirmed bugs, quality gate pass/fail/warn counts, +list of all bug IDs with one-line summaries and severities.] + +Key output files: +- quality/BUGS.md — all confirmed bugs with spec basis and patches +- quality/results/tdd-results.json — structured TDD verification results +- quality/patches/ — regression test and fix patches for every bug + +You can now run iteration strategies to find additional bugs. 
Iterations typically +add 40-60% more confirmed bugs on top of the baseline. The recommended cycle is: +gap → unfiltered → parity → adversarial. + +To start the first iteration, say: + + Run the next iteration of the quality playbook. + +Or ask me about the results: "Tell me about BUG-001" or "Which bugs are highest priority?" +``` + +**After printing this message, STOP. Do not proceed to iterations unless the user explicitly asks.** + +**End-of-iteration message (mandatory — print this after each iteration completes, then STOP):** + +``` +# Iteration Complete — [Strategy Name] + +[Summarize: N net-new bugs found in this iteration, total now at N. +List new bug IDs with one-line summaries.] + +[If there are remaining strategies in the recommended cycle, suggest the next one:] +The next recommended strategy is [next strategy]. To run it, say: + + Run the next iteration using the [next strategy] strategy. + +[If all four strategies have been run:] +All four iteration strategies have been run. Total confirmed bugs: N. +You can review the results, ask about specific bugs, or re-run any strategy. + +Or say "keep going" to run the next iteration automatically. +``` + +**After printing this message, STOP. Do not proceed to the next iteration unless the user explicitly asks.** --- -## Phase 4: Present, Explore, Improve (Interactive) +## Phase 7: Present, Explore, Improve (Interactive) After generating and verifying, present the results clearly and give the user control over what happens next. This phase has three parts: a scannable summary, drill-down on demand, and a menu of improvement paths. 
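The next-strategy rotation suggested at each run's end (gap, then unfiltered, then parity, then adversarial) can be sketched as a small lookup — `next_strategy` is an illustrative helper, not part of the skill:

```shell
# Map the strategy just completed to the suggested next one.
next_strategy() {
  case "$1" in
    "")          echo "gap" ;;          # baseline run: start the cycle
    gap)         echo "unfiltered" ;;
    unfiltered)  echo "parity" ;;
    parity)      echo "adversarial" ;;
    adversarial) echo "restart" ;;      # cycle complete: run from scratch
    *)           echo "unknown" ;;
  esac
}
next_strategy gap   # prints the next strategy in the cycle
```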
@@ -337,12 +1902,14 @@ Here's what I generated: | File | What It Does | Key Metric | Confidence | |------|-------------|------------|------------| +| REQUIREMENTS.md | Testable requirements with use cases | N requirements, N use cases | ██████░░ Medium — solid baseline from 5-phase pipeline, improves with refinement passes | | QUALITY.md | Quality constitution | 10 scenarios | ██████░░ High — grounded in code, but scenarios are inferred, not from real incidents | | Functional tests | Automated tests | 47 passing | ████████ High — all tests pass, 35% cross-variant | -| RUN_CODE_REVIEW.md | Code review protocol | 8 focus areas | ████████ High — derived from architecture | +| RUN_CODE_REVIEW.md | Three-pass code review | 3 passes | ████████ High — structural + requirement verification + consistency | | RUN_INTEGRATION_TESTS.md | Integration test protocol | 9 runs × 3 providers | ██████░░ Medium — quality gates need threshold tuning | | RUN_SPEC_AUDIT.md | Council of Three audit | 10 scrutiny areas | ████████ High — guardrails included | | AGENTS.md | AI session bootstrap | Updated | ████████ High — factual | +| RUN_TDD_TESTS.md | TDD verification protocol | N bugs to verify | ████████ High — mechanical red-green cycle with spec traceability | ``` Adapt the table to what you actually generated — the file names, metrics, and confidence levels will vary by project. The confidence column is the most important: it tells the user where to focus their attention. @@ -368,6 +1935,9 @@ To use these artifacts, start a new AI session and try one of these prompts: • Start a spec audit (Council of Three): "Read quality/RUN_SPEC_AUDIT.md and follow its instructions using [model name]." + +• Run TDD verification for confirmed bugs: + "Read quality/RUN_TDD_TESTS.md and follow its instructions to verify all confirmed bugs." ``` Adapt the test runner command and module names to the actual project. 
The point is to give the user copy-pasteable prompts — not descriptions of what they could do, but the actual text they'd type. @@ -390,21 +1960,36 @@ The user may go through several drill-downs before they're ready to improve anyt After the user has seen the summary (and optionally drilled into details), present the improvement options: -> "Three ways to make this better:" +> "Five ways to make this better:" +> +> **1. Review requirements interactively** — Read `quality/REVIEW_REQUIREMENTS.md` for a guided walkthrough of the requirements organized by use case. You can pick specific use cases to drill into, or walk through all of them sequentially. A different model can also fact-check the completeness report (cross-model audit). Good for: finding gaps the pipeline missed. +> +> **2. Refine requirements with a different model** — Read `quality/REFINE_REQUIREMENTS.md` and run a refinement pass. You can run this with any AI model — Claude, GPT, Gemini — and each will catch different gaps. Run as many models as you want until you hit diminishing returns. Each pass backs up the current version and logs changes in `quality/VERSION_HISTORY.md`. Good for: pushing requirements from the baseline toward completeness. > -> **1. Review and harden individual items** — Pick any scenario, test, or protocol section and I'll walk through it with you. Good for: tightening specific quality gates, fixing inferred scenarios, adding missing edge cases. +> **3. Review and harden other items** — Pick any scenario, test, or protocol section and I'll walk through it with you. Good for: tightening specific quality gates, fixing inferred scenarios, adding missing edge cases. > -> **2. Guided Q&A** — I'll ask you 3-5 targeted questions about things I couldn't infer from the code: incident history, expected distributions, cost tolerance, model preferences. Good for: filling knowledge gaps that make scenarios more authoritative. +> **4. 
Guided Q&A** — I'll ask you 3-5 targeted questions about things I couldn't infer from the code: incident history, expected distributions, cost tolerance, model preferences. Good for: filling knowledge gaps that make scenarios more authoritative. > -> **3. Review development history** — Point me to exported AI chat history (Claude, Gemini, ChatGPT exports, Claude Code transcripts) and I'll mine it for design decisions, incident reports, and quality discussions that should be in QUALITY.md. Good for: grounding scenarios in real project history instead of inference. +> **5. Feed in additional documentation** — The requirements pipeline works better with more intent sources. Point me to any of these and I'll use them to refine the requirements and quality constitution: +> - Exported AI chat history (Claude, Gemini, ChatGPT exports, Claude Code transcripts) +> - Slack or Teams channels where the project was discussed +> - Email threads, Jira/Linear tickets, or GitHub issues about the project +> - Design documents, architecture decision records, or meeting notes +> - Newsgroup posts, forum discussions, or mailing list archives +> +> You can use tools like Claude Cowork, GitHub Copilot, or OpenClaw to connect to these sources and gather them into a folder, then point me at the folder. Good for: grounding scenarios and requirements in real project history instead of inference. > > "You can do any combination of these, in any order. Which would you like to start with?" ### Executing Each Improvement Path -**Path 1: Review and harden.** The user picks an item. Walk through it: show the current text, explain your reasoning, ask if it's accurate. Revise based on their feedback. Re-run tests if the functional tests change. +**Path 1: Review requirements interactively.** Point the user to `quality/REVIEW_REQUIREMENTS.md` and offer to walk through it together. 
The protocol supports self-guided (pick use cases), fully guided (sequential walkthrough), and cross-model audit (different model fact-checks the completeness report). Progress is tracked in `quality/REFINEMENT_HINTS.md` so the user can pick up where they left off. + +**Path 2: Refine requirements with a different model.** Point the user to `quality/REFINE_REQUIREMENTS.md`. Each refinement pass: backs up the current version to `quality/history/vX.Y/`, reads feedback from REFINEMENT_HINTS.md, makes targeted improvements, bumps the minor version, and logs changes in VERSION_HISTORY.md. The user can run this with Claude, GPT, Gemini, or any other model — each catches different blind spots. Run until diminishing returns. + +**Path 3: Review and harden other items.** The user picks a scenario, test, or protocol section. Walk through it: show the current text, explain your reasoning, ask if it's accurate. Revise based on their feedback. Re-run tests if the functional tests change. -**Path 2: Guided Q&A.** Ask 3-5 questions derived from what you actually found during exploration. These categories cover the most common high-leverage gaps: +**Path 4: Guided Q&A.** Ask 3-5 questions derived from what you actually found during exploration. These categories cover the most common high-leverage gaps: - **Incident history for scenarios.** "I found [specific defensive code]. What failure caused this? How many records were affected?" - **Quality gate thresholds.** "I'm checking that [field] contains [values]. What distribution is normal? What signals a problem?" @@ -414,14 +1999,14 @@ After the user has seen the summary (and optionally drilled into details), prese After the user answers, revise the generated files and re-run tests. 
-**Path 3: Review development history.** If the user provides a chat history folder: +**Path 5: Feed in additional documentation.** The user points you to additional intent sources — chat history, Slack exports, email threads, Jira tickets, design docs, meeting notes, forum archives. These contain design decisions, incident history, and quality discussions that didn't make it into formal documentation. -1. Scan for index files and navigate to quality-relevant conversations (same approach as Step 0, but now with specific targets — you know which scenarios need grounding, which quality gates need thresholds, which design decisions need rationale). -2. Extract: incident stories with specific numbers, design rationale for defensive patterns, quality framework discussions, cross-model audit results. -3. Revise QUALITY.md scenarios with real incident details. Update integration test thresholds with real-world values. Add Council of Three empirical data if audit results exist. -4. Re-run tests after revisions. +1. Scan for index files and navigate to quality-relevant content (same approach as Step 0, but now with specific targets — you know which requirements need grounding, which scenarios need thresholds, which gaps need closing). +2. Extract: incident stories with specific numbers, design rationale for defensive patterns, quality framework discussions, cross-model audit results, and behavioral contracts that weren't visible from the code alone. +3. Feed findings into `quality/REFINEMENT_HINTS.md` as new feedback items, then run a refinement pass to update the requirements. +4. Revise QUALITY.md scenarios with real incident details. Update integration test thresholds with real-world values. Re-run tests after revisions. -If the user already provided chat history in Step 0, you've already mined it — but they may want to point you to specific conversations or ask you to dig deeper into a particular topic. 
+If the user already provided chat history in Step 0, you've already mined it — but they may want to point you to specific conversations, connect additional sources, or ask you to dig deeper into a particular topic. ### Iteration @@ -461,6 +2046,11 @@ Examine existing test files to understand how they set up test data. Whatever pa 3. Concrete failure modes make standards non-negotiable — abstract requirements invite rationalization 4. Guardrails transform AI review quality (line numbers, read bodies, grep before claiming) 5. Triage before fixing — many "defects" are spec bugs or design decisions +6. Structural review has a ceiling (~65%). The remaining ~35% are intent violations — absence bugs, cross-file contradictions, design gaps — invisible to any tool that only reads code. Requirements make the invisible visible. +7. The specification is the unique contribution, not the review structure. Focus areas and review protocols are secondary to having the right testable requirements derived from intent sources. +8. Cross-requirement consistency checking is essential. Bugs often live in the gap between two individually-correct pieces of code. Per-requirement verification alone can't find these. +9. Keep all derived requirements — do not filter. The cost of checking an extra requirement is low; the cost of missing a bug because you pruned the requirement that would have caught it is high. +10. A failing test is the strongest evidence a bug exists. Run the red-green TDD cycle (test fails on buggy code, passes on fixed code) for every confirmed bug with a fix patch. Show the FAIL→PASS output — reviewers can disagree with your fix but can't argue with a reproducing test. 
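Principle 10's red-green cycle can be sketched with a stand-in test. Here `run_test` and the `FIXED` flag simulate the regression test and the fix patch; a real run applies the patches from `quality/patches/` and runs the actual suite instead:

```shell
# Red phase: the regression test must FAIL on the buggy code.
# Green phase: after the fix is applied, the same test must PASS.
# "run_test" and FIXED are hypothetical stand-ins for the real suite/patch.
log=/tmp/redgreen.log
: > "$log"
run_test() { [ "${FIXED:-0}" = "1" ]; }   # stand-in for the BUG-001 test

if run_test; then echo "red: unexpected PASS" >> "$log"
else echo "red: FAIL (expected)" >> "$log"; fi

FIXED=1                                    # simulate applying the fix patch
if run_test; then echo "green: PASS (expected)" >> "$log"
else echo "green: unexpected FAIL" >> "$log"; fi
cat "$log"
```

The FAIL-then-PASS transcript is the artifact reviewers can't argue with.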
--- @@ -474,6 +2064,6 @@ Read these as you work through each phase: | `references/schema_mapping.md` | Step 5b (schema types) | Field mapping format, mutation validity rules | | `references/constitution.md` | File 1 (QUALITY.md) | Full template with section-by-section guidance | | `references/functional_tests.md` | File 2 (functional tests) | Test structure, anti-patterns, cross-variant strategy | -| `references/review_protocols.md` | Files 3–4 (code review, integration) | Templates for both protocols | +| `references/review_protocols.md` | Files 3–4 (code review, integration) | Templates for both protocols, patch validation, skip guards | | `references/spec_audit.md` | File 5 (Council of Three) | Full audit protocol, triage process, fix execution | -| `references/verification.md` | Phase 3 (verify) | Complete self-check checklist with all 13 benchmarks | +| `references/verification.md` | Phase 6 (verify) | Complete self-check checklist (45 benchmarks) including structured output, patch gate, skip guard validation, pre-flight discovery, version stamps, bug writeups, enumeration completeness, triage executable evidence, code-extracted enumeration lists, mechanical verification artifacts, source-inspection test execution, contradiction gate, seed check execution, convergence tracking, sidecar JSON schema validation, script-verified closure gate, canonical use case identifiers, and writeup inline fix diffs | diff --git a/skills/quality-playbook/quality_gate.sh b/skills/quality-playbook/quality_gate.sh new file mode 100755 index 000000000..11a59937a --- /dev/null +++ b/skills/quality-playbook/quality_gate.sh @@ -0,0 +1,632 @@ +#!/bin/bash +# Post-run validation gate — script-verified closure for benchmark runs. +# +# Mechanically checks artifact conformance issues that model self-attestation +# persistently misses. v1.3.27 adds deep JSON field validation, enum checks, +# summary consistency, and mandatory regression-test patches. 
v1.3.28 adds +# writeup inline diff validation (every writeup must contain a ```diff block). +# v1.3.31 adds TDD summary shape validation (red_failed, green_failed), +# date validation (reject placeholders/future dates), and cross-run +# contamination detection (version mismatch between directory and SKILL.md). +# v1.3.32 adds test file extension validation, minimum UC count (5+), +# and triage executable evidence check. +# v1.3.33 fixes language detection (find-based, not ls-glob), adds --benchmark +# vs --general strictness modes, and makes UC threshold size-aware. +# v1.3.49 adds TDD log file checks: verifies BUG-NNN.red.log exists for every +# confirmed bug and BUG-NNN.green.log for every bug with a fix patch. +# +# Usage: +# ./quality_gate.sh . # Check current directory (benchmark mode) +# ./quality_gate.sh --general . # Check with relaxed thresholds +# ./quality_gate.sh virtio # Check named repo (from repos/) +# ./quality_gate.sh --all # Check all current-version repos +# ./quality_gate.sh --version 1.3.27 virtio # Check specific version +# +# Exit codes: +# 0 — all checks passed +# 1 — one or more checks failed +# +# This script is also copied into each repo at .github/skills/quality_gate.sh +# so the playbook agent can run it as its final Phase 6 verification step. 
+
+set -uo pipefail
+
+SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
+FAIL=0
+WARN=0
+REPO_DIRS=()
+VERSION=""
+CHECK_ALL=false
+STRICTNESS="benchmark"  # "benchmark" (default) or "general"
+
+# Parse args
+EXPECT_VERSION=false
+for arg in "$@"; do
+  if [ "$EXPECT_VERSION" = true ]; then
+    VERSION="$arg"
+    EXPECT_VERSION=false
+    continue
+  fi
+  case "$arg" in
+    --version) EXPECT_VERSION=true ;;
+    --all) CHECK_ALL=true ;;
+    --benchmark) STRICTNESS="benchmark" ;;
+    --general) STRICTNESS="general" ;;
+    *) REPO_DIRS+=("$arg") ;;
+  esac
+done
+
+# Detect version from SKILL.md — try multiple locations
+if [ -z "$VERSION" ]; then
+  for loc in "${SCRIPT_DIR}/../SKILL.md" "${SCRIPT_DIR}/SKILL.md" ".github/skills/SKILL.md"; do
+    if [ -f "$loc" ]; then
+      VERSION=$(grep -m1 'version:' "$loc" 2>/dev/null | sed 's/.*version: *//' | tr -d ' ')
+      [ -n "$VERSION" ] && break
+    fi
+  done
+fi
+
+fail() { echo "  FAIL: $1"; FAIL=$((FAIL + 1)); }
+pass() { echo "  PASS: $1"; }
+warn() { echo "  WARN: $1"; WARN=$((WARN + 1)); }
+info() { echo "  INFO: $1"; }
+
+# Helper: check if a JSON file contains a key at any nesting level
+json_has_key() {
+  local file="$1" key="$2"
+  grep -q "\"${key}\"" "$file" 2>/dev/null
+}
+
+# Helper: extract a string value for a key (first occurrence)
+json_str_val() {
+  local file="$1" key="$2"
+  grep -o "\"${key}\"[[:space:]]*:[[:space:]]*\"[^\"]*\"" "$file" 2>/dev/null \
+    | head -1 | sed 's/.*: *"\([^"]*\)"/\1/'
+}
+
+# Helper: count occurrences of a key in JSON.
+# Note: `grep -c ... || echo 0` would emit "0" twice when the file exists but
+# has no match (grep -c prints 0 AND exits nonzero), so capture the count first.
+json_key_count() {
+  local file="$1" key="$2" n
+  n=$(grep -c "\"${key}\"" "$file" 2>/dev/null)
+  echo "${n:-0}"
+}
+
+check_repo() {
+  local repo_dir="$1"
+  local repo_name
+  repo_name=$(basename "$repo_dir")
+  local q="${repo_dir}/quality"
+
+  # Handle "." as current directory
+  [ "$repo_dir" = "." 
] && repo_dir="$(pwd)" && repo_name=$(basename "$repo_dir") && q="${repo_dir}/quality" + + echo "" + echo "=== ${repo_name} ===" + + # --- File existence (benchmark 40) --- + echo "[File Existence]" + for f in BUGS.md REQUIREMENTS.md QUALITY.md PROGRESS.md COVERAGE_MATRIX.md COMPLETENESS_REPORT.md; do + if [ -f "${q}/${f}" ]; then + pass "${f} exists" + else + fail "${f} missing" + fi + done + + # Code reviews dir + if [ -d "${q}/code_reviews" ] && [ -n "$(ls ${q}/code_reviews/*.md 2>/dev/null)" ]; then + pass "code_reviews/ has .md files" + else + fail "code_reviews/ missing or empty" + fi + + # Spec audits + if [ -d "${q}/spec_audits" ]; then + local triage_count auditor_count + triage_count=$(ls ${q}/spec_audits/*triage* 2>/dev/null | wc -l | tr -d ' ') + auditor_count=$(ls ${q}/spec_audits/*auditor* 2>/dev/null | wc -l | tr -d ' ') + [ "$triage_count" -gt 0 ] && pass "spec_audits/ has triage file" || fail "spec_audits/ missing triage file" + [ "$auditor_count" -gt 0 ] && pass "spec_audits/ has ${auditor_count} auditor file(s)" || fail "spec_audits/ missing individual auditor files" + + # Triage executable evidence — probe assertions must exist on disk + if [ "$triage_count" -gt 0 ]; then + local has_probes=false + if [ -f "${q}/spec_audits/triage_probes.sh" ]; then + has_probes=true + pass "triage_probes.sh exists (executable triage evidence)" + elif [ -f "${q}/mechanical/verify.sh" ] && grep -q 'probe\|triage\|auditor' "${q}/mechanical/verify.sh" 2>/dev/null; then + has_probes=true + pass "verify.sh contains triage probe assertions" + fi + if [ "$has_probes" = false ]; then + if [ "$STRICTNESS" = "benchmark" ]; then + fail "No executable triage evidence found (expected spec_audits/triage_probes.sh or probe assertions in mechanical/verify.sh)" + else + warn "No executable triage evidence found (expected spec_audits/triage_probes.sh or probe assertions in mechanical/verify.sh)" + fi + fi + fi + else + fail "spec_audits/ directory missing" + fi + + # --- BUGS.md 
heading format (benchmark 39) --- + echo "[BUGS.md Heading Format]" + local bug_count=0 + if [ -f "${q}/BUGS.md" ]; then + local correct_headings wrong_headings + correct_headings=$(grep -cE '^### BUG-[0-9]+' "${q}/BUGS.md" || true) + correct_headings=${correct_headings:-0} + wrong_headings=$(grep -E '^## BUG-[0-9]+' "${q}/BUGS.md" 2>/dev/null | grep -cvE '^### BUG-' || true) + wrong_headings=${wrong_headings:-0} + local bold_headings bullet_headings + bold_headings=$(grep -cE '^\*\*BUG-[0-9]+' "${q}/BUGS.md" || true) + bold_headings=${bold_headings:-0} + bullet_headings=$(grep -cE '^- BUG-[0-9]+' "${q}/BUGS.md" || true) + bullet_headings=${bullet_headings:-0} + + bug_count=$correct_headings + + if [ "$correct_headings" -gt 0 ] && [ "$wrong_headings" -eq 0 ] && [ "$bold_headings" -eq 0 ] && [ "$bullet_headings" -eq 0 ]; then + pass "All ${correct_headings} bug headings use ### BUG-NNN format" + else + [ "$wrong_headings" -gt 0 ] && fail "${wrong_headings} heading(s) use ## instead of ###" + [ "$bold_headings" -gt 0 ] && fail "${bold_headings} heading(s) use **BUG- format" + [ "$bullet_headings" -gt 0 ] && fail "${bullet_headings} heading(s) use - BUG- format" + if [ "$correct_headings" -eq 0 ] && [ "$wrong_headings" -eq 0 ]; then + if grep -qE '(No confirmed|zero|0 confirmed)' "${q}/BUGS.md" 2>/dev/null; then + pass "Zero-bug run — no headings expected" + else + # Count wrong-format headings as bugs for patch check + bug_count=$((wrong_headings + bold_headings + bullet_headings)) + warn "No ### BUG-NNN headings found in BUGS.md" + fi + else + bug_count=$((correct_headings + wrong_headings + bold_headings + bullet_headings)) + fi + fi + else + fail "BUGS.md missing" + fi + + # --- TDD sidecar JSON — deep validation (benchmarks 14, 41) --- + echo "[TDD Sidecar JSON]" + if [ "$bug_count" -gt 0 ]; then + local json_file="${q}/results/tdd-results.json" + if [ -f "$json_file" ]; then + pass "tdd-results.json exists (${bug_count} bugs)" + + # Required root keys + for key 
in schema_version skill_version date project bugs summary; do + json_has_key "$json_file" "$key" && pass "has '${key}'" || fail "missing root key '${key}'" + done + + # schema_version value + local sv + sv=$(json_str_val "$json_file" "schema_version") + [ "$sv" = "1.1" ] && pass "schema_version is '1.1'" || fail "schema_version is '${sv:-missing}', expected '1.1'" + + # Per-bug required fields — check that canonical field names exist + for field in id requirement red_phase green_phase verdict fix_patch_present writeup_path; do + local fcount + fcount=$(json_key_count "$json_file" "$field") + if [ "$fcount" -ge "$bug_count" ]; then + pass "per-bug field '${field}' present (${fcount}x)" + elif [ "$fcount" -gt 0 ]; then + warn "per-bug field '${field}' found ${fcount}x, expected ${bug_count}" + else + fail "per-bug field '${field}' missing entirely" + fi + done + + # Check for wrong field names (common model errors) + for bad_field in bug_id bug_name status phase result; do + if json_has_key "$json_file" "$bad_field"; then + fail "non-canonical field '${bad_field}' found (use standard field names)" + fi + done + + # Summary must include confirmed_open, red_failed, green_failed + for skey in confirmed_open red_failed green_failed; do + if json_has_key "$json_file" "$skey"; then + pass "summary has '${skey}'" + else + fail "summary missing '${skey}' count" + fi + done + + # Date validation — must be real ISO 8601, not placeholder + local tdd_date + tdd_date=$(json_str_val "$json_file" "date") + if [ -n "$tdd_date" ]; then + if echo "$tdd_date" | grep -qE '^[0-9]{4}-[0-9]{2}-[0-9]{2}$'; then + # Check for placeholder + if [ "$tdd_date" = "YYYY-MM-DD" ] || [ "$tdd_date" = "0000-00-00" ]; then + fail "tdd-results.json date is placeholder '${tdd_date}'" + else + # Check not in the future + local today + today=$(date +%Y-%m-%d) + if [[ "$tdd_date" > "$today" ]]; then + fail "tdd-results.json date '${tdd_date}' is in the future" + else + pass "tdd-results.json date 
'${tdd_date}' is valid"
+            fi
+          fi
+        else
+          fail "tdd-results.json date '${tdd_date}' is not ISO 8601 (YYYY-MM-DD)"
+        fi
+      else
+        fail "tdd-results.json date field missing or empty"
+      fi
+
+      # Verdict enum validation — allowed: "TDD verified", "red failed", "green failed", "confirmed open"
+      local bad_verdicts
+      bad_verdicts=$(grep -oE '"verdict"[[:space:]]*:[[:space:]]*"[^"]*"' "$json_file" 2>/dev/null \
+        | sed 's/.*: *"\(.*\)"/\1/' \
+        | grep -cvE '^(TDD verified|red failed|green failed|confirmed open|deferred)$' || true)
+      bad_verdicts=${bad_verdicts:-0}
+      [ "$bad_verdicts" -eq 0 ] && pass "all verdict values are canonical" || fail "${bad_verdicts} non-canonical verdict value(s)"
+
+    else
+      fail "tdd-results.json missing (${bug_count} bugs require it)"
+    fi
+  else
+    info "Zero bugs — tdd-results.json not required"
+  fi
+
+  # --- TDD log files — red/green phase logs per bug (v1.3.49) ---
+  echo "[TDD Log Files]"
+  if [ "$bug_count" -gt 0 ]; then
+    local red_found=0 red_missing=0 green_found=0 green_missing=0 green_expected=0
+    # Extract confirmed bug IDs from BUGS.md headings only, so that BUG-NNN
+    # cross-references in prose elsewhere in the file are not counted as bugs
+    local bug_ids
+    bug_ids=$(grep -E '^### BUG-[0-9]+' "${q}/BUGS.md" 2>/dev/null \
+      | grep -oE 'BUG-[0-9]+' | sort -u -t'-' -k2,2n)
+    for bid in $bug_ids; do
+      # Red-phase log — required for every confirmed bug
+      if [ -f "${q}/results/${bid}.red.log" ]; then
+        red_found=$((red_found + 1))
+      else
+        red_missing=$((red_missing + 1))
+      fi
+      # Green-phase log — required only if a fix patch exists
+      if ls ${q}/patches/${bid}-fix*.patch &>/dev/null; then
+        green_expected=$((green_expected + 1))
+        if [ -f "${q}/results/${bid}.green.log" ]; then
+          green_found=$((green_found + 1))
+        else
+          green_missing=$((green_missing + 1))
+        fi
+      fi
+    done
+
+    if [ "$red_missing" -eq 0 ] && [ "$red_found" -gt 0 ]; then
+      pass "All ${red_found} confirmed bug(s) have red-phase logs"
+    elif [ "$red_found" -gt 0 ]; then
+      fail "${red_missing} confirmed bug(s) missing red-phase log (BUG-NNN.red.log)"
+    else
+      fail "No 
red-phase logs found (every confirmed bug needs quality/results/BUG-NNN.red.log)" + fi + + if [ "$green_expected" -gt 0 ]; then + if [ "$green_missing" -eq 0 ]; then + pass "All ${green_found} bug(s) with fix patches have green-phase logs" + else + fail "${green_missing} bug(s) with fix patches missing green-phase log (BUG-NNN.green.log)" + fi + else + info "No fix patches found — green-phase logs not required" + fi + else + info "Zero bugs — TDD log files not required" + fi + + # --- Integration sidecar JSON — deep validation --- + echo "[Integration Sidecar JSON]" + local ij="${q}/results/integration-results.json" + if [ -f "$ij" ]; then + for key in schema_version skill_version date project recommendation groups summary uc_coverage; do + json_has_key "$ij" "$key" && pass "has '${key}'" || fail "missing key '${key}'" + done + + # Recommendation enum + local rec + rec=$(json_str_val "$ij" "recommendation") + case "$rec" in + SHIP|"FIX BEFORE MERGE"|BLOCK) pass "recommendation '${rec}' is canonical" ;; + *) [ -n "$rec" ] && fail "recommendation '${rec}' is non-canonical (must be SHIP/FIX BEFORE MERGE/BLOCK)" || fail "recommendation missing" ;; + esac + else + if [ "$STRICTNESS" = "benchmark" ]; then + warn "integration-results.json not present" + else + info "integration-results.json not present (optional in general mode)" + fi + fi + + # --- Use cases in REQUIREMENTS.md (benchmark 43, 48) --- + echo "[Use Cases]" + if [ -f "${q}/REQUIREMENTS.md" ]; then + local uc_ids uc_unique + uc_ids=$(grep -cE 'UC-[0-9]+' "${q}/REQUIREMENTS.md" || true) + uc_ids=${uc_ids:-0} + uc_unique=$(grep -oE 'UC-[0-9]+' "${q}/REQUIREMENTS.md" 2>/dev/null | sort -u | wc -l | tr -d ' ') + uc_unique=${uc_unique:-0} + # Size-aware UC threshold: count source files to calibrate expectation + local src_count=0 + if [ -d "$repo_dir" ]; then + src_count=$(find "$repo_dir" -maxdepth 4 -type f \ + -not -path '*/vendor/*' -not -path '*/node_modules/*' \ + -not -path '*/.git/*' -not -path 
'*/quality/*' \ + \( -name '*.go' -o -name '*.py' -o -name '*.java' -o -name '*.kt' \ + -o -name '*.rs' -o -name '*.ts' -o -name '*.js' -o -name '*.scala' \ + -o -name '*.c' -o -name '*.h' -o -name '*.agc' \) 2>/dev/null | wc -l | tr -d ' ') + fi + local min_uc=5 + if [ "$src_count" -lt 5 ]; then + min_uc=3 # Small projects: 3 UCs acceptable + fi + + if [ "$uc_unique" -ge "$min_uc" ]; then + pass "Found ${uc_unique} distinct UC identifiers (${uc_ids} total references, ${src_count} source files)" + elif [ "$uc_unique" -gt 0 ]; then + if [ "$STRICTNESS" = "general" ]; then + warn "Only ${uc_unique} distinct UC identifiers (minimum ${min_uc} for ${src_count} source files)" + else + fail "Only ${uc_unique} distinct UC identifiers (minimum ${min_uc} required for ${src_count} source files)" + fi + else + fail "No canonical UC-NN identifiers in REQUIREMENTS.md" + fi + else + fail "REQUIREMENTS.md missing" + fi + + # --- Test file extension matches project language (benchmark 47) --- + echo "[Test File Extension]" + local func_test reg_test + func_test=$(ls ${q}/test_functional.* 2>/dev/null | head -1) + reg_test=$(ls ${q}/test_regression.* 2>/dev/null | head -1) + if [ -n "$func_test" ]; then + local ext="${func_test##*.}" + # Detect project language using find (portable, no globstar needed). + # Exclude vendor/, node_modules/, .git/, and quality/ to avoid false positives. 
+ local detected_lang="" + local find_exclude="-not -path '*/vendor/*' -not -path '*/node_modules/*' -not -path '*/.git/*' -not -path '*/quality/*'" + if eval "find '${repo_dir}' -maxdepth 3 ${find_exclude} -name '*.go' -print -quit" 2>/dev/null | grep -q .; then detected_lang="go" + elif eval "find '${repo_dir}' -maxdepth 3 ${find_exclude} -name '*.py' -print -quit" 2>/dev/null | grep -q .; then detected_lang="py" + elif eval "find '${repo_dir}' -maxdepth 3 ${find_exclude} -name '*.java' -print -quit" 2>/dev/null | grep -q .; then detected_lang="java" + elif eval "find '${repo_dir}' -maxdepth 3 ${find_exclude} -name '*.kt' -print -quit" 2>/dev/null | grep -q .; then detected_lang="kt" + elif eval "find '${repo_dir}' -maxdepth 3 ${find_exclude} -name '*.rs' -print -quit" 2>/dev/null | grep -q .; then detected_lang="rs" + elif eval "find '${repo_dir}' -maxdepth 3 ${find_exclude} -name '*.ts' -print -quit" 2>/dev/null | grep -q .; then detected_lang="ts" + elif eval "find '${repo_dir}' -maxdepth 3 ${find_exclude} -name '*.js' -print -quit" 2>/dev/null | grep -q .; then detected_lang="js" + elif eval "find '${repo_dir}' -maxdepth 3 ${find_exclude} -name '*.scala' -print -quit" 2>/dev/null | grep -q .; then detected_lang="scala" + elif eval "find '${repo_dir}' -maxdepth 3 ${find_exclude} -name '*.c' -print -quit" 2>/dev/null | grep -q .; then detected_lang="c" + elif eval "find '${repo_dir}' -maxdepth 3 ${find_exclude} -name '*.agc' -print -quit" 2>/dev/null | grep -q .; then detected_lang="agc" + fi + + if [ -n "$detected_lang" ]; then + # Map detected language to valid test extensions + local valid_ext="" + case "$detected_lang" in + go) valid_ext="go" ;; + py) valid_ext="py" ;; + java) valid_ext="java" ;; + kt) valid_ext="kt java" ;; + rs) valid_ext="rs" ;; + ts) valid_ext="ts" ;; + js) valid_ext="js ts" ;; + scala) valid_ext="scala" ;; + c) valid_ext="c py sh" ;; # C projects may use Python/shell test harnesses + agc) valid_ext="py sh" ;; # AGC assembly projects 
use external test harnesses + esac + if echo "$valid_ext" | grep -qw "$ext"; then + pass "test_functional.${ext} matches project language (${detected_lang})" + else + fail "test_functional.${ext} does not match project language (${detected_lang}) — expected .${valid_ext%% *}" + fi + + # Also validate regression test extension if present + if [ -n "$reg_test" ]; then + local reg_ext="${reg_test##*.}" + if echo "$valid_ext" | grep -qw "$reg_ext"; then + pass "test_regression.${reg_ext} matches project language (${detected_lang})" + else + fail "test_regression.${reg_ext} does not match project language (${detected_lang}) — expected .${valid_ext%% *}" + fi + fi + else + info "Cannot detect project language — skipping extension check (test_functional.${ext})" + fi + else + warn "No test_functional.* found" + fi + + # --- Terminal Gate in PROGRESS.md --- + echo "[Terminal Gate]" + if [ -f "${q}/PROGRESS.md" ]; then + grep -qiE '^#+ *Terminal' "${q}/PROGRESS.md" 2>/dev/null \ + && pass "PROGRESS.md has Terminal Gate section" \ + || fail "PROGRESS.md missing Terminal Gate section" + fi + + # --- Mechanical verification (if applicable) --- + echo "[Mechanical Verification]" + if [ -d "${q}/mechanical" ]; then + if [ -f "${q}/mechanical/verify.sh" ]; then + pass "verify.sh exists" + if [ -f "${q}/results/mechanical-verify.log" ] && [ -f "${q}/results/mechanical-verify.exit" ]; then + local exit_code + exit_code=$(cat "${q}/results/mechanical-verify.exit" 2>/dev/null | tr -d '[:space:]') + [ "$exit_code" = "0" ] && pass "mechanical-verify.exit is 0" || fail "mechanical-verify.exit is '${exit_code}', expected 0" + else + fail "Verification receipt files missing" + fi + else + fail "mechanical/ exists but verify.sh missing" + fi + else + info "No mechanical/ directory" + fi + + # --- Patches for confirmed bugs (benchmark 44) --- + echo "[Patches]" + if [ "$bug_count" -gt 0 ]; then + local reg_patch_count=0 fix_patch_count=0 + if [ -d "${q}/patches" ]; then + 
reg_patch_count=$(ls ${q}/patches/BUG-*-regression*.patch 2>/dev/null | wc -l | tr -d ' ') + fix_patch_count=$(ls ${q}/patches/BUG-*-fix*.patch 2>/dev/null | wc -l | tr -d ' ') + fi + + if [ "$reg_patch_count" -ge "$bug_count" ]; then + pass "${reg_patch_count} regression-test patch(es) for ${bug_count} bug(s)" + elif [ "$reg_patch_count" -gt 0 ]; then + fail "Only ${reg_patch_count} regression-test patch(es) for ${bug_count} bug(s)" + else + fail "No regression-test patches (quality/patches/BUG-NNN-regression-test.patch required for each bug)" + fi + + if [ "$fix_patch_count" -gt 0 ]; then + pass "${fix_patch_count} fix patch(es)" + else + warn "0 fix patches (fix patches are optional but strongly encouraged)" + fi + + # Total patch count for summary + local total_patches=$((reg_patch_count + fix_patch_count)) + info "Total: ${total_patches} patch file(s) in quality/patches/" + fi + + # --- Writeups for confirmed bugs (benchmark 30) --- + echo "[Bug Writeups]" + if [ "$bug_count" -gt 0 ]; then + local writeup_count=0 writeup_diff_count=0 + if [ -d "${q}/writeups" ]; then + writeup_count=$(ls ${q}/writeups/BUG-*.md 2>/dev/null | wc -l | tr -d ' ') + # Check each writeup for inline diff (section 6 requirement) + for wf in ${q}/writeups/BUG-*.md; do + [ -f "$wf" ] || continue + if grep -q '```diff' "$wf" 2>/dev/null; then + writeup_diff_count=$((writeup_diff_count + 1)) + fi + done + fi + if [ "$writeup_count" -ge "$bug_count" ]; then + pass "${writeup_count} writeup(s) for ${bug_count} bug(s)" + elif [ "$writeup_count" -gt 0 ]; then + warn "${writeup_count} writeup(s) for ${bug_count} bug(s) — incomplete" + else + fail "No writeups for ${bug_count} confirmed bug(s)" + fi + + # Inline diff check — every writeup must have a ```diff block (section 6 "The fix") + if [ "$writeup_count" -gt 0 ]; then + if [ "$writeup_diff_count" -ge "$writeup_count" ]; then + pass "All ${writeup_diff_count} writeup(s) have inline fix diffs" + elif [ "$writeup_diff_count" -gt 0 ]; then + 
fail "Only ${writeup_diff_count}/${writeup_count} writeup(s) have inline fix diffs (all require section 6 diff)" + else + fail "No writeups have inline fix diffs (section 6 'The fix' must include a \`\`\`diff block)" + fi + fi + fi + + # --- Version stamp consistency (benchmark 26) --- + echo "[Version Stamps]" + local skill_version="" + for loc in "${repo_dir}/.github/skills/SKILL.md" "${repo_dir}/SKILL.md"; do + if [ -f "$loc" ]; then + skill_version=$(grep -m1 'version:' "$loc" 2>/dev/null | sed 's/.*version: *//' | tr -d ' ') + [ -n "$skill_version" ] && break + fi + done + if [ -n "$skill_version" ]; then + if [ -f "${q}/PROGRESS.md" ]; then + local pv + pv=$(grep -m1 'Skill version:' "${q}/PROGRESS.md" 2>/dev/null | sed 's/.*Skill version: *//' | tr -d ' ') + [ "$pv" = "$skill_version" ] && pass "PROGRESS.md version matches (${skill_version})" \ + || { [ -n "$pv" ] && fail "PROGRESS.md version '${pv}' != '${skill_version}'" || warn "PROGRESS.md missing Skill version field"; } + fi + if [ -f "${q}/results/tdd-results.json" ]; then + local tv + tv=$(json_str_val "${q}/results/tdd-results.json" "skill_version") + [ "$tv" = "$skill_version" ] && pass "tdd-results.json skill_version matches" \ + || { [ -n "$tv" ] && fail "tdd-results.json skill_version '${tv}' != '${skill_version}'"; } + fi + else + warn "Cannot detect skill version from SKILL.md" + fi + + # --- Cross-run contamination detection --- + echo "[Cross-Run Contamination]" + if [ -n "$skill_version" ] && [ -n "$VERSION" ]; then + # Check if the repo directory name contains a version that doesn't match the skill + local dir_version + dir_version=$(echo "$repo_name" | grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | tail -1) + if [ -n "$dir_version" ] && [ "$dir_version" != "$skill_version" ]; then + fail "Directory version '${dir_version}' != skill version '${skill_version}' — possible cross-run contamination" + else + pass "No version mismatch detected" + fi + fi + + # Check for artifacts referencing a different 
version in gate log or tdd-results + if [ -f "${q}/results/tdd-results.json" ] && [ -n "$skill_version" ]; then + local json_sv + json_sv=$(json_str_val "${q}/results/tdd-results.json" "skill_version") + if [ -n "$json_sv" ] && [ "$json_sv" != "$skill_version" ]; then + fail "tdd-results.json skill_version '${json_sv}' != SKILL.md '${skill_version}' — stale artifacts from prior run?" + fi + fi + + echo "" +} + +# Resolve repos +if [ "$CHECK_ALL" = true ]; then + for dir in "${SCRIPT_DIR}/"*-"${VERSION}"/; do + [ -d "$dir/quality" ] && REPO_DIRS+=("$dir") + done +elif [ ${#REPO_DIRS[@]} -eq 1 ] && [ "${REPO_DIRS[0]}" = "." ]; then + # Running from inside a repo + REPO_DIRS=("$(pwd)") +else + resolved=() + for name in "${REPO_DIRS[@]}"; do + if [ -d "$name/quality" ]; then + resolved+=("$name") + elif [ -d "${SCRIPT_DIR}/${name}-${VERSION}" ]; then + resolved+=("${SCRIPT_DIR}/${name}-${VERSION}") + elif [ -d "${SCRIPT_DIR}/${name}" ]; then + resolved+=("${SCRIPT_DIR}/${name}") + else + echo "WARNING: Cannot find repo '${name}'" + fi + done + REPO_DIRS=("${resolved[@]}") +fi + +if [ ${#REPO_DIRS[@]} -eq 0 ]; then + echo "Usage: $0 [--version V] [--all | repo1 repo2 ... 
| .]" + exit 1 +fi + +echo "=== Quality Gate — Post-Run Validation ===" +echo "Version: ${VERSION:-unknown}" +echo "Strictness: ${STRICTNESS}" +echo "Repos: ${#REPO_DIRS[@]}" + +for repo_dir in "${REPO_DIRS[@]}"; do + check_repo "$repo_dir" +done + +echo "" +echo "===========================================" +echo "Total: ${FAIL} FAIL, ${WARN} WARN" +if [ "$FAIL" -gt 0 ]; then + echo "RESULT: GATE FAILED — ${FAIL} check(s) must be fixed" + exit 1 +else + echo "RESULT: GATE PASSED" + exit 0 +fi diff --git a/skills/quality-playbook/references/defensive_patterns.md b/skills/quality-playbook/references/defensive_patterns.md index 05070576a..281239768 100644 --- a/skills/quality-playbook/references/defensive_patterns.md +++ b/skills/quality-playbook/references/defensive_patterns.md @@ -155,6 +155,26 @@ State machines are a special category of defensive pattern. When you find status 3. Look for states you can enter but never leave (terminal state without cleanup) 4. Look for operations that should be available in a state but are blocked by an incomplete guard +## Enumeration and Whitelist Completeness + +When a function uses `switch`/`case`, `match`, if-else chains, or any dispatch construct to handle a set of named constants (feature bits, enum values, command codes, event types, permission flags), perform the **two-list enumeration check**: + +1. **List A (defined):** Extract every constant from the relevant header, enum, or spec that the code should handle. Use grep — do not list from memory. +2. **List B (handled):** Extract every case label, branch condition, or map key from the dispatch code. Use grep or line-by-line read — do not summarize. +3. **Diff:** Compare the two lists. Any constant in A but not in B is a potential gap. Any constant in B but not in A is a potential dead case. + +**Why this exists:** AI models reliably hallucinate completeness for switch/case constructs. 
The model sees a function with many case labels, sees constants defined elsewhere, and concludes all constants are handled without actually checking. In one observed case, the model asserted that a kernel feature-bit whitelist "preserves supported ring transport bits including VIRTIO_F_RING_RESET" when that constant was entirely absent from the switch — the model hallucinated coverage because the constant existed in a header the function's callers used. The mechanical two-list check is the only reliable countermeasure. + +**Triage verification probes must produce executable evidence.** When triage confirms or rejects an enumeration finding via verification probe, prose reasoning alone is insufficient. The probe must produce a test assertion for each constant: `assert "case VIRTIO_F_RING_RESET:" in source_of("vring_transport_features"), "RING_RESET at line NNN"`. This rule exists because in v1.3.16, the triage correctly received a minority finding about RING_RESET but rejected it with a hallucinated claim that "lines 3527-3528 explicitly preserve RING_RESET" — those lines were actually the `default:` branch. Had the triage been forced to write an assertion, it would have failed, exposing the hallucination. + +**Code-side lists must be extracted from the code, not copied from requirements.** When performing the two-list check in the code review or spec audit, the "handled" list must be extracted directly from the function body with per-item line numbers. Do not copy from REQUIREMENTS.md, CONTRACTS.md, the audit prompt, or any other generated artifact. If the two lists (code-extracted vs. requirements-claimed) are word-for-word identical, that is a red flag that the code list was copied — redo the extraction. In v1.3.17, the code review's "case labels present" list was identical to the requirements list, proving it was copied rather than extracted. Three spec auditors then inherited this false list and none independently verified. 
The per-item line-number citation prevents this: you cannot cite "line 3527: `case VIRTIO_F_RING_RESET:`" when line 3527 actually contains `default:`. + +**Mechanical verification artifacts outrank prose lists.** If `quality/mechanical/_cases.txt` exists for a dispatch function, use it as the authoritative source for what the function handles. Do not replace it with a hand-written list. If no mechanical artifact exists, generate one using a non-interactive shell pipeline (e.g., `awk` + `grep`) before writing contracts or requirements about the function's coverage. + +**Artifact integrity risk:** In v1.3.19 testing, the model executed the correct extraction command but wrote its own fabricated output to the file instead of letting the shell redirect capture it. The fabricated file included a hallucinated `case VIRTIO_F_RING_RESET:` line that the real command does not produce. To mitigate: `quality/mechanical/verify.sh` re-runs every extraction command and diffs against saved files. If any diff is non-empty, the artifact was tampered with and must be regenerated. + +**Where to apply:** Feature-bit negotiation functions, protocol message dispatchers, permission check switches, configuration option handlers, codec/format registration tables, HTTP method/status code handlers, and any function where a `default:` or `else` clause silently drops unrecognized values. + **Converting state machine gaps to scenarios:** ```markdown diff --git a/skills/quality-playbook/references/exploration_patterns.md b/skills/quality-playbook/references/exploration_patterns.md new file mode 100644 index 000000000..2f9590d1c --- /dev/null +++ b/skills/quality-playbook/references/exploration_patterns.md @@ -0,0 +1,283 @@ +# Exploration Patterns for Bug Discovery + +This reference defines the exploration patterns that Phase 1 applies during codebase exploration. These patterns target bug classes most commonly missed when exploration stays at the subsystem or architecture level. 
+ +Requirements problems are the most expensive to fix because they are not caught until after implementation. The exploration phase is requirements elicitation — it determines what the code review and spec audit will look for. A requirement that is never derived is a bug that is never found. These patterns exist to systematically surface requirements that broad exploration misses. + +Each pattern includes a definition, the bug class it targets, diverse examples from different domains, and the expected output format for EXPLORATION.md. + +**Important: These patterns supplement free exploration — they do not replace it.** Phase 1 begins with open-ended exploration driven by domain knowledge and codebase understanding. After that open exploration, apply the patterns below as a structured second pass to catch specific bug classes. If you find yourself only looking for things the patterns describe, you are using them wrong. The patterns are a checklist to run after you have already formed your own understanding of the codebase's risks. + +--- + +## Pattern 1: Fallback and Degradation Path Parity + +### Definition + +When code provides multiple strategies for accomplishing the same goal — a primary path and one or more fallback paths — each fallback must preserve the same behavioral invariants as the primary. The fallback may use a different mechanism, but the observable contract must be equivalent. + +### Bug class + +Fallback paths are written later, tested less, and reviewed with less scrutiny than primary paths. They often omit steps the primary path performs (validation, cleanup, index assignment, resource release) because the developer copied the primary path and simplified it for the "degraded" case. The result is a function that works correctly in the common case but violates its contract when the fallback activates. 
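The parity comparison itself can be made mechanical: represent the critical operations each path performs as a set, and diff each fallback against the primary. A minimal Python sketch of the bug class and the check, using hypothetical function and operation names (none of these come from the playbook or any real codebase):

```python
# Hypothetical authentication cascade: OAuth primary, API-key fallback.
# Each path reports the critical operations it performs, so a reviewer
# can diff the fallback against the primary mechanically.

def authenticate_oauth(token):
    """Primary path: validates the credential AND enforces scope."""
    return {"validate_credential", "enforce_scope", "audit_log"}

def authenticate_api_key(key):
    """Fallback path, written later and 'simplified': scope check omitted."""
    return {"validate_credential", "audit_log"}

def parity_gaps(primary_ops, fallback_ops):
    """Operations present in the primary but missing from a fallback.
    Each gap is a candidate requirement for EXPLORATION.md."""
    return sorted(primary_ops - fallback_ops)

gaps = parity_gaps(authenticate_oauth("t"), authenticate_api_key("k"))
print(gaps)  # ['enforce_scope']
```

In real code the operation lists come from reading each path, not from instrumentation; the point is that the comparison is a set difference, which keeps the check honest rather than impressionistic.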
+ +### Examples across domains + +- **Authentication:** A web service tries OAuth token validation, falls back to API key lookup, falls back to session cookie. Each fallback must enforce the same authorization scope. Bug: the API key fallback skips scope validation and grants full access. +- **Connection pooling:** A database client tries the primary connection pool, falls back to a secondary pool, falls back to creating a one-off connection. Each path must apply the same timeout and transaction isolation settings. Bug: the one-off connection fallback uses the driver default isolation level instead of the configured one. +- **Resource allocation:** A memory allocator tries a fast slab path, falls back to a slow page-level path. Both must zero-initialize sensitive fields. Bug: the slow path returns uninitialized memory because zero-fill was only in the slab fast path. +- **HTTP redirect handling:** A client follows a redirect and must strip security-sensitive headers (Authorization, Proxy-Authorization, cookies) when the redirect crosses an origin boundary. Bug: the redirect path strips Authorization but not Proxy-Authorization, leaking proxy credentials to the redirected origin. +- **Serialization fallback:** A message broker tries binary serialization, falls back to JSON, falls back to string encoding. Each path must preserve the same field ordering and null-handling semantics. Bug: the JSON fallback silently drops null fields that binary serialization preserves. + +### How to apply + +For each core module, look for: conditional chains that try one approach then fall through to another, strategy/adapter patterns where multiple implementations are selected at runtime, retry logic with different strategies per attempt, feature-negotiation cascades where capabilities determine which code path runs, HTTP redirect/retry logic that must preserve or strip headers. + +For each cascade found: +1. List the primary path and every fallback. +2. 
For each fallback, check whether it performs the same critical operations as the primary (validation, resource setup, index assignment, cleanup, error reporting, header stripping, resource release). +3. Any operation present in the primary but missing in a fallback is a candidate requirement. + +### EXPLORATION.md output format + +``` +## Fallback Path Analysis + +### [Name of cascade] +- **Primary path:** [function, file:line] — [what it does] +- **Fallback 1:** [function, file:line] — [what it does, what differs] +- **Fallback 2:** [function, file:line] — [what it does, what differs] +- **Parity gaps:** [specific operations present in primary but missing in fallback] +- **Candidate requirements:** REQ-NNN: [fallback must do X] +``` + +--- + +## Pattern 2: Dispatcher Return-Value Correctness + +### Definition + +When a function dispatches on input type or condition and must return a status value, the return value must be correct for every combination of inputs — not just the primary case. Dispatchers that handle multiple event types, request types, or state transitions are particularly prone to return-value bugs in edge combinations. + +### Bug class + +Dispatchers are typically written and tested for the common case. The return value is correct when the primary event fires. But when an unusual combination occurs (only a secondary event, no events at all, multiple concurrent events), the return-value logic may be wrong — returning "not handled" for a handled event, returning success for a partial failure, or returning a stale value from a previous iteration. + +### Examples across domains + +- **HTTP middleware:** A request dispatcher checks for authentication, rate-limiting, and routing. When rate-limiting triggers but authentication was already set, the dispatcher returns the auth status code instead of the rate-limit status code. Bug: rate-limited requests get 401 instead of 429. 
+- **CORS handler chain:** A CORS preflight handler sets 400 (rejected), then the missing-OPTIONS-handler path sets 404, then an AFTER handler normalizes 404→200 (meant for allowed origins). Bug: rejected preflights get 200 because the status was overwritten by downstream handlers. +- **Event loop:** A poll/select loop handles read-ready, write-ready, and error conditions. When only an error condition fires on a socket with no pending reads, the loop returns "no events" because the read-ready check was false. Bug: connection errors are silently ignored. +- **State machine transition:** A state machine dispatch function handles valid transitions, invalid transitions, and no-op transitions. When a no-op transition occurs (current state == target state), the function returns an error code intended for invalid transitions. Bug: idempotent operations fail when they should succeed. +- **Interrupt handler:** A hardware interrupt handler checks for multiple event types (data-ready, configuration-change, error). When only a secondary event fires (e.g., config change with no data), the handler returns "not mine" because the primary event check failed and the secondary path doesn't set the handled flag. Bug: legitimate secondary events are reported as spurious. + +### How to apply + +For each core module, look for: functions with switch/case or if-else chains that return a status, interrupt/event handlers that handle multiple event types, request dispatchers that check multiple conditions before returning, state machine transition functions, middleware chains where multiple handlers write to the same response status. + +For each dispatcher found: +1. Enumerate all input combinations (not just the ones with explicit case labels — also the implicit "else" and "default" paths). +2. For each combination, trace the return value through the entire handler chain (not just the immediate function). +3. 
Any combination where the return value doesn't match the expected semantics is a candidate requirement. + +### EXPLORATION.md output format + +``` +## Dispatcher Return-Value Analysis + +### [Function name] at [file:line] +- **Input types:** [list of conditions/events the function dispatches on] +- **Combinations checked:** + - [Condition A only]: returns [X] — correct/incorrect because [reason] + - [Condition B only]: returns [X] — correct/incorrect because [reason] + - [Both A and B]: returns [X] — correct/incorrect because [reason] + - [Neither A nor B]: returns [X] — correct/incorrect because [reason] +- **Candidate requirements:** REQ-NNN: [function must return Y when only B fires] +``` + +--- + +## Pattern 3: Cross-Implementation Contract Consistency + +### Definition + +When multiple functions implement the same logical operation for different contexts (different transports, different backends, different protocol versions), they should all satisfy the same specification requirement. A step that is mandatory in the specification must appear in every implementation — a missing step in one implementation that is present in another is a strong bug signal. + +### Bug class + +When the same operation is implemented in multiple places, each implementation is typically written by a different developer or at a different time. The specification says "reset must wait for completion," and the developer of implementation A writes the wait loop, but the developer of implementation B writes only the reset trigger and forgets the wait. The bug is invisible when testing implementation B in isolation because it "works" on fast hardware — the race condition only manifests under load or on slow devices. + +### Examples across domains + +- **Device reset:** A spec says "the driver must write zero and then poll until the status register reads back zero." The PCI implementation includes the poll loop. The MMIO implementation writes zero but does not poll. 
Bug: MMIO reset can race with reinitialization. +- **Database driver:** A connection-close spec says "the driver must send a termination message, wait for acknowledgment, then release the socket." The PostgreSQL driver does all three. The MySQL driver sends the termination message and releases the socket without waiting for acknowledgment. Bug: the server may process the termination after the socket is reused. +- **HTTP header encoding:** A Headers class constructor decodes raw bytes as Latin-1 per RFC 7230. The mutation method (`__setitem__`) encodes values as UTF-8. Bug: round-tripping a Latin-1 header through get-then-set corrupts the value because the encoding changed. +- **Cache invalidation:** A cache spec says "invalidation must remove the entry and notify all subscribers." The in-memory cache does both. The distributed cache removes the entry but does not broadcast the notification. Bug: other nodes serve stale data. +- **File locking:** A storage spec says "lock acquisition must set a timeout and clean up on failure." The local filesystem implementation sets the timeout. The NFS implementation uses blocking lock with no timeout. Bug: NFS lock contention can hang the process indefinitely. + +### How to apply + +For each core module, look for: the same operation name implemented in multiple files or classes, interface/trait implementations across different backends, protocol-version-specific implementations of the same message, transport-specific implementations of the same lifecycle operation, constructor vs. mutation implementations of the same logical operation. + +For each pair (or set) of implementations: +1. Identify the specification requirement they share. +2. List the mandatory steps from the spec. +3. Check each implementation for each step. +4. Any step present in one but missing in another is a candidate requirement. 
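The four steps above reduce to a mechanical set-difference check. The sketch below is a hypothetical illustration, not part of the playbook's tooling: it assumes you have already listed the spec's mandatory steps and each implementation's observed steps by hand during review (the step and implementation names here are invented, loosely following the connection-close example):

```python
# Hypothetical cross-implementation parity check. The spec's mandatory
# steps and each implementation's observed steps are filled in by hand
# after reading the code; the names below are invented for illustration.

SPEC_STEPS = {"send_terminate", "wait_for_ack", "release_socket"}

# Steps each implementation was observed to perform (from code reading).
implementations = {
    "postgres_close": {"send_terminate", "wait_for_ack", "release_socket"},
    "mysql_close":    {"send_terminate", "release_socket"},  # skips the ack wait
}

def parity_gaps(spec_steps, impls):
    """Return {impl_name: [missing steps]} for every implementation
    that omits a step the spec marks as mandatory."""
    return {
        name: sorted(spec_steps - steps)
        for name, steps in impls.items()
        if spec_steps - steps
    }

for name, missing in parity_gaps(SPEC_STEPS, implementations).items():
    # Each reported gap becomes a candidate requirement.
    print(f"{name}: missing {missing}")  # prints: mysql_close: missing ['wait_for_ack']
```

Each gap the check reports maps directly to one candidate REQ-NNN ("all implementations of X must perform step Y"); an implementation with no gap produces no entry.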
+ +**Check every cross-transport operation, not just the most obvious one.** If a codebase has multiple transports (PCI, MMIO, vDPA) or backends (PostgreSQL, MySQL), enumerate all operations that have cross-implementation equivalents — reset, interrupt handling, feature negotiation, queue setup, configuration access — and check each one. The first cross-implementation gap you find is rarely the only one. A common failure mode is analyzing reset thoroughly and then skipping interrupt dispatch, which has the same cross-transport structure. + +### EXPLORATION.md output format + +``` +## Cross-Implementation Consistency + +### [Operation name] — [spec reference] +- **Implementation A:** [function, file:line] — performs steps: [1, 2, 3] +- **Implementation B:** [function, file:line] — performs steps: [1, 3] (missing step 2) +- **Gap:** [Implementation B missing step 2: description] +- **Candidate requirements:** REQ-NNN: [all implementations of X must perform step 2] +``` + +--- + +## Pattern 4: Enumeration and Representation Completeness + +### Definition + +When a codebase maintains a closed set of recognized values — a switch/case whitelist, an array of valid constants, an enum/tagged-union definition, a trait/visitor method family, a set of schema keywords, a registry of accepted entries — every value that the specification, upstream definition, or the library's own public API surface says should be accepted must appear in the set. Values not in the set are silently dropped, rejected, or mishandled, and the absence of an entry is invisible at the call site. + +### Bug class + +Closed sets are written once and rarely revisited. When a new capability is added to the specification or upstream header, the code that defines the capability (the constant, the feature flag, the enum variant) is updated, and the code that uses the capability is updated, but the closed set that gates whether the capability survives a filtering step is forgotten. 
The feature appears to be supported — it's defined, it's negotiated, it's used — but it's silently stripped by a filter function that nobody remembered to update. The bug is invisible in normal testing because the feature simply doesn't activate, and the absence of activation looks like "the other end doesn't support it." + +This pattern also covers **internal representations** that must mirror a public API. If a library's public API accepts i128/u128 integers but an internal buffered representation only has variants for i64/u64, values that pass through the buffer are silently truncated or rejected — even though the public API promises to handle them. + +### Examples across domains + +- **Feature negotiation filter:** A transport layer maintains a switch/case whitelist of feature bits that should survive filtering. A new feature (`RING_RESET`) is added to the UAPI header and used by higher-level code, but never added to the whitelist. Bug: the feature is silently cleared during negotiation, disabling a capability the driver claims to support. +- **Serialization internal representation:** A serialization library's public `Deserializer` trait supports `deserialize_i128()`/`deserialize_u128()`. An internal buffered representation (`Content` enum) used by untagged and internally-tagged enum deserialization has variants only for `I64`/`U64`. Bug: 128-bit integers that pass through the buffer are rejected with a "no variant for i128" error, even though the public API claims to support them. +- **Schema keyword importer:** A validation library imports JSON Schema documents. The spec defines `uniqueItems`, `contains`, `minContains`, `maxContains` for arrays. The importer recognizes these keywords (no parse error) but doesn't enforce them. Bug: imported schemas silently accept arrays that violate the original constraints. +- **Permission system:** An authorization middleware maintains an array of recognized permission strings. 
A new permission (`audit:write`) is added to the role definitions but not to the middleware's whitelist. Bug: users with the `audit:write` role are silently denied access because the middleware doesn't recognize the permission. +- **Protocol message types:** A message router maintains a switch/case dispatch for recognized message types. A new message type is added to the protocol spec and the serialization layer, but not to the router. Bug: the new message type is silently dropped by the router's default case, and the sender receives no error. + +### How to apply + +For each core module, look for: switch/case statements with explicit case labels and a default that drops/clears/rejects, arrays or sets of accepted values used for filtering or validation, registration functions where new entries must be added manually, enum/tagged-union definitions that mirror a specification or public API, trait/visitor method families where each method handles one variant, schema importers that must handle every keyword the spec defines, internal representations (buffers, IR, AST) that must cover the full range of the public interface. + +For each closed set found: +1. Identify the authoritative source that defines what values should be valid. This could be: a spec, a header file, an upstream enum, a protocol definition, **or the library's own public API surface** (trait methods, function signatures, type definitions). +2. Extract the closed set mechanically (save the case labels, enum variants, visitor methods, array entries, or schema keywords to a file). +3. Compare the extracted set against the authoritative source. Every value in the authoritative source that is absent from the closed set is a candidate requirement. + +**Caller compensation does not excuse a missing entry.** If a closed set in a shared/generic function is missing an entry, that is a bug — even if specific callers compensate by restoring the value after the function runs. 
The compensation is a workaround, not a fix. Any new caller that doesn't know to compensate silently inherits the bug. Report each missing entry as a finding and note which callers (if any) compensate, but do not dismiss the finding because of compensation. + +### EXPLORATION.md output format + +``` +## Enumeration/Representation Completeness + +### [Function/type name] at [file:line] +- **Purpose:** [what this closed set gates — e.g., "feature bits that survive transport filtering" or "integer variants the buffer can hold"] +- **Authoritative source:** [where valid values are defined — e.g., "include/uapi/linux/virtio_config.h" or "public Deserializer trait methods"] +- **Extracted entries:** [list of values in the closed set, or reference to mechanical extraction file] +- **Missing entries:** [values present in the authoritative source but absent from the closed set] +- **Candidate requirements:** REQ-NNN: [closed set must include X] +``` + +--- + +## Pattern 5: API Surface Consistency + +### Definition + +When the same logical operation can be performed through multiple API surfaces — direct method vs. view/wrapper, constructor vs. mutator, sync vs. async variant, primary API vs. convenience alias — all surfaces must produce equivalent observable behavior for the same input. A divergence between two paths to the same operation is a bug, because callers reasonably expect consistent behavior regardless of which surface they use. + +### Bug class + +Libraries often expose the same underlying data through multiple interfaces: a direct method and a collection view (`add()` vs. `asList().add()`), a constructor and a setter, a sync and async variant. These surfaces are implemented at different times, often by different developers, and their edge-case handling diverges — especially around null/sentinel values, encoding, ordering, and error reporting. The divergence is invisible in normal testing because tests typically exercise only one surface per operation. 
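A minimal runnable sketch of this bug class, with invented class names: a direct `add()` that normalizes null to a sentinel, and a later-written collection view whose `add()` rejects null outright.

```python
# Hypothetical illustration of divergent API surfaces. Class names and
# the "JSON_NULL" sentinel are invented; the shape mirrors the bug class:
# two entry points for the same logical operation, written at different
# times, with contradictory null handling.

class JsonArray:
    def __init__(self):
        self._items = []

    def add(self, value):
        # Direct surface: normalizes None to a sentinel and succeeds.
        self._items.append("JSON_NULL" if value is None else value)

    def as_list(self):
        return _ListView(self._items)

class _ListView:
    def __init__(self, backing):
        self._backing = backing

    def add(self, value):
        # View surface, added later: rejects None. Divergence.
        if value is None:
            raise ValueError("null not allowed")
        self._backing.append(value)

arr = JsonArray()
arr.add(None)                # succeeds, stores the sentinel
try:
    arr.as_list().add(None)  # same logical operation, raises instead
    diverged = False
except ValueError:
    diverged = True
print(diverged)  # prints: True
```

Tests that exercise only `add()` or only `as_list().add()` pass; only a test that feeds the same edge input to both surfaces exposes the contradiction.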
+ +### Examples across domains + +- **JSON null handling:** `JsonArray.add(null)` converts null to `JsonNull.INSTANCE` and succeeds. `JsonArray.asList().add(null)` throws `NullPointerException` because the view's wrapper unconditionally rejects null. Bug: two methods for the same operation have contradictory null semantics. +- **HTTP header encoding:** `Headers([(b"X-Custom", b"\xe9")])` constructs a header from Latin-1 bytes. `headers["X-Custom"] = b"\xe9"` stores the value as UTF-8. Bug: round-tripping a header through get-then-set changes the encoding silently. +- **WebSocket protocol negotiation:** `WebSocketUpgrade::protocols()` returns a `BTreeSet`, which sorts and deduplicates the client's preference-ordered protocol list. Bug: the application sees a different order than the client sent, breaking preference-based negotiation. +- **Configuration option propagation:** `res.sendFile(path, { etag: false })` should disable ETag for this response. But the code converts the option to a boolean before passing to the underlying `send` module, losing the "strong" vs "weak" ETag mode. Bug: per-call ETag configuration is silently ignored or lossy-converted. +- **Map duplicate detection:** `map.put(key, value)` returns the previous value to signal duplicates. When the previous value is legitimately `null`, `put()` returns `null` — the same value it returns for "no previous entry." Bug: duplicate keys go undetected when the first value is null. + +### How to apply + +For each core module, look for: view/wrapper objects returned by methods like `asList()`, `asMap()`, `unmodifiableView()`, `stream()`, `iterator()`; constructor vs. mutation method pairs; sync vs. async variants of the same operation; convenience aliases that delegate to a primary implementation; methods that accept options/configuration objects. + +For each pair of surfaces: +1. Identify the logical operation they share. +2. 
Test the same edge-case inputs on both surfaces (null, empty, boundary values, special characters, ordering-sensitive data). +3. Any divergence in behavior (different exceptions, different encoding, different ordering, one succeeds and the other fails) is a candidate requirement. + +### EXPLORATION.md output format + +``` +## API Surface Consistency + +### [Operation name] — [two surfaces compared] +- **Surface A:** [method, file:line] — [behavior on edge input] +- **Surface B:** [method, file:line] — [behavior on same edge input] +- **Divergence:** [what differs — exception type, encoding, ordering, null handling] +- **Candidate requirements:** REQ-NNN: [both surfaces must behave equivalently for input X] +``` + +--- + +## Pattern 6: Spec-Structured Parsing Fidelity + +### Definition + +When code parses values defined by a formal grammar or specification — HTTP headers, URLs, MIME types, CLI flags, JSON Schema keywords, file paths — the parsing must match the grammar's actual rules. Shortcuts (substring matching, exact equality, wrong delimiter, prefix matching without boundary checks) produce parsers that work for common inputs but fail on valid edge cases or accept invalid inputs. + +### Bug class + +Developers frequently implement "good enough" parsers that handle the common case: `header.contains("gzip")` instead of tokenizing by comma and trimming whitespace, `url.startsWith("/api")` instead of checking path segment boundaries, `connection == "Upgrade"` instead of case-insensitive token list membership. These shortcuts pass all unit tests because tests use well-formed inputs, but they break on real-world edge cases like `gzip;q=0` (explicitly rejected), `Connection: keep-alive, Upgrade` (token list), or `/api-docs` (prefix match without boundary). + +### Examples across domains + +- **HTTP Accept-Encoding:** Middleware checks `accept.contains("gzip")` to decide whether to compress. 
This matches `gzip;q=0` (client explicitly rejects gzip) and `xgzip` (not a valid encoding). Bug: responses are compressed when the client said not to. +- **WebSocket Connection header:** Code checks `connection == "Upgrade"` (exact match). Per RFC 7230, `Connection` is a comma-separated token list; `Connection: keep-alive, Upgrade` is valid but fails exact match. Bug: valid WebSocket upgrades are rejected. +- **SPA fallback routing:** A single-page-app handler matches paths with `path.startsWith("/app")`. This matches both `/app/users` (correct) and `/api-docs` (incorrect sibling route). Bug: API documentation requests are swallowed by the SPA handler. +- **MIME type parameter handling:** Content negotiation compares `text/html;level=1` against handler keys but strips parameters before matching. Bug: the `level=1` parameter selected during negotiation is lost from the response Content-Type. +- **URL host normalization:** Code detects internationalized domain names by checking `host.startsWith("xn--")`. Per IDNA, only individual labels start with `xn--`; `foo.xn--example.com` has the punycode label in the middle. Bug: internationalized subdomains are not decoded. + +### How to apply + +For each core module, look for: string comparisons on values defined by RFCs or specs (headers, URLs, MIME types, encoding names), `contains()` / `indexOf()` / `startsWith()` / `endsWith()` on structured values, case-sensitive comparisons where the spec requires case-insensitive, splitting on the wrong delimiter or not splitting at all, prefix/suffix matching without path-segment or token boundaries. + +For each parser found: +1. Identify the spec that defines the grammar (RFC, ABNF, JSON Schema spec, POSIX, etc.). +2. Check whether the implementation handles: token lists (comma-separated), quoted strings, parameters (semicolon-separated), case folding, whitespace trimming, boundary conditions. +3. 
Construct an input that is valid per the spec but would fail the implementation's shortcut parser. That input is a candidate test case and the parsing gap is a candidate requirement. + +### EXPLORATION.md output format + +``` +## Spec-Structured Parsing + +### [Parser location] at [file:line] +- **Spec:** [which grammar/RFC/standard defines the format] +- **Implementation technique:** [contains/equals/startsWith/split-on-X] +- **Spec-valid input that breaks the parser:** [concrete example] +- **Why it breaks:** [substring match includes invalid case / missing case folding / etc.] +- **Candidate requirements:** REQ-NNN: [parser must tokenize per RFC NNNN §N.N] +``` + +--- + +## Extending This List + +These patterns were derived from analyzing 56 confirmed bugs across 11 open-source repositories spanning 7 languages. Each pattern represents a class of requirements that broad architectural summaries consistently miss. + +To add a new pattern: +1. Identify a confirmed bug that was missed by exploration but would have been found with a specific analysis technique. +2. Generalize the technique: what question should the explorer have asked about the code? +3. Provide at least 5 diverse examples from different domains (not all from the same project). +4. Define the expected output format for EXPLORATION.md. +5. Add the pattern to this file and add the corresponding section to the EXPLORATION.md template in SKILL.md. + +The goal is a library of systematic exploration techniques that accumulate over time as new bug classes are discovered. diff --git a/skills/quality-playbook/references/iteration.md b/skills/quality-playbook/references/iteration.md new file mode 100644 index 000000000..931c0bc68 --- /dev/null +++ b/skills/quality-playbook/references/iteration.md @@ -0,0 +1,190 @@ +# Iteration Mode Reference + +> This file contains the detailed instructions for each iteration strategy. 
+> The agent reads this file when running an iteration — all operational detail lives here, +> not in the prompt or in run_playbook.sh. + +## Iteration cycle + +The recommended iteration order is: **gap → unfiltered → parity → adversarial**. Each strategy finds different bug classes, and running them in this order maximizes cumulative yield. After each iteration, the skill prints a suggested prompt for the next strategy — follow the cycle until you hit diminishing returns or decide to stop. + +``` +Baseline run # structured three-stage exploration +→ gap scan previous coverage, explore gaps # finds bugs in uncovered subsystems +→ unfiltered pure domain-driven, no structure # finds bugs that structure suppresses +→ parity cross-path comparison and diffing # finds inconsistencies between parallel implementations +→ adversarial challenge dismissed/demoted findings # recovers Type II errors from previous triage +``` + +## Shared rules for all strategies + +These rules apply to every iteration strategy: + +1. **ITER file naming.** Write findings to `quality/EXPLORATION_ITER{N}.md` — check which iteration files already exist and use the next number (e.g., `EXPLORATION_ITER2.md` for the first iteration, `EXPLORATION_ITER3.md` for the second). + +2. **Do NOT delete or archive quality/.** You are building on the existing run, not replacing it. + +3. **Context budget discipline.** A first-run EXPLORATION.md can be 200–400 lines. Loading it all into context before starting your own exploration leaves too little room for deep investigation. The previous-run scan should consume ~20–30 lines of context. Targeted deep-reads should consume ~40–60 lines total. This leaves the bulk of your context budget for new exploration. + +4. **Merge.** After completing the strategy-specific exploration, create or update `quality/EXPLORATION_MERGED.md` that combines findings from ALL iterations. 
For each section, concatenate the findings with clear attribution (`[Iteration 1]` / `[Iteration 2: gap]` / `[Iteration 3: unfiltered]` / etc.). Include the strategy name in the attribution so downstream phases can see which approach surfaced each finding. The Candidate Bugs section should be re-consolidated from all findings across all iterations. If `EXPLORATION_MERGED.md` already exists from a previous iteration, merge the new iteration's findings into it rather than starting from scratch. + + **Demoted Candidates Manifest (mandatory in EXPLORATION_MERGED.md).** After re-consolidating the Candidate Bugs section, add or update a `## Demoted Candidates` section at the end of EXPLORATION_MERGED.md. This section tracks findings that were dismissed, demoted, or deprioritized during any iteration — they are the raw material for the adversarial strategy. For each demoted candidate, record: + + ``` + ### DC-NNN: [short title] + - **Source:** [which iteration and strategy first surfaced this] + - **Dismissal reason:** [why it was demoted — e.g., "classified as design choice," "insufficient evidence," "needs runtime confirmation"] + - **Code location:** [file:line references] + - **Re-promotion criteria:** [specific evidence that would flip this to a confirmed candidate — e.g., "show that the permissive behavior violates a documented contract," "trace the code path to prove the edge case is reachable," "demonstrate that the output differs from what the spec requires"] + - **Status:** DEMOTED | RE-PROMOTED [iteration] | FALSE POSITIVE [iteration] + ``` + + The re-promotion criteria are the most important field — they tell the adversarial strategy exactly what evidence to gather. Vague criteria like "needs more investigation" are not acceptable; write criteria that a different agent session could act on without additional context. If a subsequent iteration re-promotes or definitively falsifies a demoted candidate, update its status and add a note explaining the resolution. 
+ +5. **Continue with Phases 2–6.** Use `EXPLORATION_MERGED.md` as the primary input for Phase 2 artifact generation. All downstream artifacts (REQUIREMENTS.md, code review, spec audit) should reference the merged exploration. + + **TDD is mandatory for iteration runs (v1.3.49).** Iteration runs must execute the full TDD red-green cycle for every newly confirmed bug, exactly as baseline runs do. This means: for each new BUG-NNN confirmed in this iteration, create a regression test patch, run it against unpatched code to produce `quality/results/BUG-NNN.red.log`, and if a fix patch exists, run it against patched code to produce `quality/results/BUG-NNN.green.log`. The TDD Log Closure Gate in Phase 5 applies equally to iteration runs — missing log files will cause quality_gate.sh to FAIL. Do not skip TDD because this is "just an iteration" or because prior bugs already have logs. New bugs need new logs. If the test runner is not available for the project's language, create the log file with `NOT_RUN` on the first line and an explanation — the file must still exist. + +6. **Iteration mode completion gate.** Before proceeding to Phase 2 (applies to all strategies): + - `quality/ITERATION_PLAN.md` exists and names the strategy used + - `quality/EXPLORATION_ITER{N}.md` exists for this iteration with at least 80 lines of substantive content + - `quality/EXPLORATION_MERGED.md` exists and contains findings from all iterations + - The merged Candidate Bugs section has at least 2 new candidates not present in previous iterations + - At least 1 finding covers a code area not explored in previous iterations OR re-confirms a previously dismissed finding with fresh evidence + +7. **Suggested next iteration.** At the end of Phase 6, after writing the final PROGRESS.md summary, print a suggested prompt for the next iteration strategy in the cycle. 
If the current strategy was: + - **gap** → suggest: `Run the next iteration of the quality playbook using the unfiltered strategy.` + - **unfiltered** → suggest: `Run the next iteration of the quality playbook using the parity strategy.` + - **parity** → suggest: `Run the next iteration of the quality playbook using the adversarial strategy.` + - **adversarial** → suggest: `Run the quality playbook from scratch.` (cycle complete) + - **baseline (no strategy)** → suggest: `Run the next iteration of the quality playbook using the gap strategy.` + + Format the suggestion clearly so the user can copy-paste it: + ``` + ──────────────────────────────────────────────────────── + Next iteration suggestion: + "Run the next iteration of the quality playbook using the [strategy] strategy." + ──────────────────────────────────────────────────────── + ``` + +## Meta-strategy: `all` — run every strategy in sequence + +The `all` strategy is a runner-level convenience that executes gap → unfiltered → parity → adversarial in order, each as a separate agent session. A single agent session cannot run multiple strategies (context budget), so `all` is implemented by the runner (run_playbook.sh) as a loop of `--next-iteration` calls. If any strategy finds zero new bugs, the runner stops early (diminishing returns). + +Usage: `./run_playbook.sh --next-iteration --strategy all ` + +--- + +## Strategy: `gap` (default) — find what the previous run missed + +Scan the previous run's coverage and deliberately explore elsewhere. Best when the first run was structurally sound but only covered a subset of the codebase. + +1. **Coverage scan (lightweight).** Read the previous `quality/EXPLORATION.md` using a divide-and-conquer strategy — do NOT load the entire file into context at once. 
Instead: + - Read just the section headers and first 2–3 lines of each section to build a coverage map + - For each section, record: section name, subsystems covered, number of findings, depth level (shallow = single-function mentions, deep = multi-function traces) + - Write the coverage map to `quality/ITERATION_PLAN.md` + +2. **Gap identification.** From the coverage map, identify: + - Subsystems or modules that were not explored at all + - Sections with shallow findings (few lines, single-function mentions, no code-path traces) + - Quality Risks scenarios that were listed but never traced to specific code + - Pattern deep dives that could apply but weren't selected (from the applicability matrix) + - Domain-knowledge questions from Step 6 that weren't addressed + +3. **Targeted deep-read.** For only the 2–3 thinnest or most gap-rich sections, read the full section content from the previous EXPLORATION.md. This gives you specific context about what was already found without consuming your entire context budget on previous findings. + +4. **Gap exploration.** Run a focused Phase 1 exploration targeting only the identified gaps. Use the same three-stage approach (open exploration → quality risks → selected patterns) but scoped to the uncovered areas. Write findings to `quality/EXPLORATION_ITER{N}.md` using the same template structure. + +--- + +## Strategy: `unfiltered` — pure domain-driven exploration without structural constraints + +Ignore the three-stage gated structure entirely. Explore the codebase the way an experienced developer would — reading code, following hunches, tracing suspicious paths — with no pattern templates, applicability matrices, or section format requirements. This strategy deliberately removes the structural scaffolding to let domain expertise drive discovery without constraint. 
+ +**Why this strategy exists:** In benchmarking, the unfiltered domain-driven approach used in skill versions v1.3.25–v1.3.26 found bugs that the structured three-stage approach consistently missed, particularly in web frameworks and HTTP libraries. The structured approach excels at systematic coverage but can over-constrain exploration, causing the model to spend context on format compliance rather than deep code reading. The unfiltered strategy recovers that lost discovery power. + +1. **Lightweight previous-run scan.** Read just the `## Candidate Bugs for Phase 2` section and `quality/BUGS.md` from the previous run to know what was already found. Do NOT read the full EXPLORATION.md — you want a fresh perspective, not to be anchored by previous exploration paths. Write a brief note to `quality/ITERATION_PLAN.md` listing what the previous run found and confirming you are using the unfiltered strategy. + +2. **Unfiltered exploration.** Explore the codebase from scratch using pure domain knowledge. No required sections, no pattern applicability matrix, no gate self-check. Instead: + - Read source code deeply — entry points, hot paths, error handling, edge cases + - Follow your domain expertise: "What would an expert in [this domain] find suspicious?" + - For each suspicious finding, trace the code path across 2+ functions with file:line citations + - Generate bug hypotheses directly — not "areas to investigate" but "this specific code at file:line produces wrong behavior because [reason]" + - Write findings to `quality/EXPLORATION_ITER{N}.md` as a flat list of findings, each with file:line references and a bug hypothesis. No structural template required — depth and specificity matter, not section formatting. + - Minimum: 10 concrete findings with file:line references, at least 5 of which trace code paths across 2+ functions + +3. **Domain-knowledge questions.** Complete these questions using the code you just explored AND your domain knowledge. 
Write your answers inline with your findings, not in a separate gated section: + - What API surface inconsistencies exist between similar methods? + - Where does the code do ad-hoc string parsing of structured formats? + - What inputs would a domain expert try that a developer might not test? + - What metadata or configuration values could be silently wrong? + +--- + +## Strategy: `parity` — cross-path comparison and diffing + +Systematically enumerate parallel implementations of the same contract and diff them for inconsistencies. This strategy finds bugs by comparing code paths that should behave the same way but don't. + +**Why this strategy exists:** In benchmarking, the v1.3.40 skill version found 5 bugs in virtio using "fallback path parity" and "cross-implementation consistency" as explicit exploration patterns. Three of those bugs (MSI-X slow_virtqueues reattach, GFP_KERNEL under spinlock, INTx admin queue_idx) were found by lining up parallel code paths and spotting differences — not by exploring individual subsystems. The gap, unfiltered, and adversarial strategies all explore areas or challenge decisions, but none explicitly compare parallel paths. This strategy fills that gap. + +1. **Enumerate parallel paths.** Scan the codebase for groups of code that implement the same contract or handle the same logical operation via different paths. 
Common categories: + - **Transport/backend variants:** multiple implementations of the same interface (e.g., PCI vs MMIO vs vDPA, sync vs async, HTTP/1.1 vs HTTP/2) + - **Fallback chains:** primary path → fallback → last-resort (e.g., MSI-X → shared → INTx, rich error → generic error) + - **Setup vs teardown/reset:** initialization path vs cleanup/reset path for the same resource + - **Happy path vs error path:** normal flow vs exception/error handling for the same operation + - **Public API variants:** overloaded methods, convenience wrappers, format-specific parsers that should produce equivalent results + - Write the enumeration to `quality/ITERATION_PLAN.md` with a brief description of each parallel group. + +2. **Pairwise comparison.** For each parallel group, read the code paths side by side and systematically check each comparison sub-type below. Not every sub-type applies to every parallel group — but explicitly considering each one prevents the strategy from only finding "obvious" discrepancies while missing structural ones. + + **Comparison sub-type checklist** (check each one for every parallel group): + + - **Resource lifecycle parity:** Compare what setup/init does with a resource vs. what teardown/reset/cleanup does with the same resource. Every resource acquired in setup must be released in teardown — and in the same order, with the same scope. Look for resources that setup creates but reset forgets (e.g., a list populated during probe but not drained during reset). + - **Allocation context parity:** Compare allocation flags, lock context, and interrupt state across parallel paths. If one path allocates with `GFP_KERNEL` (sleepable) but runs under a spinlock that another path doesn't hold, that's a bug. Check: what locks are held? What allocation flags are valid in that context? Do parallel paths agree? + - **Identifier and index parity:** Compare how parallel paths compute indices, offsets, or identifiers for the same logical entity. 
If setup uses `queue_index + admin_offset` but reset uses `raw_queue_index`, the mismatch is a bug candidate. + - **Capability/feature-bit parity:** Compare which feature bits, flags, or capabilities each parallel path checks or sets. If the MSI-X path checks a slow-path vector list but the INTx fallback path doesn't, vectors may be misrouted after fallback. + - **Error/exception parity:** Compare error handling between paths. If the primary path handles an error gracefully but the fallback path lets it propagate, the fallback is less robust than the primary — which is backwards. + - **Iteration/collection parity:** Compare what collections each path iterates over. If setup iterates over `all_queues` but reset iterates over `active_queues`, resources for inactive queues leak. + + For each discrepancy found, trace both code paths with file:line citations and determine whether the difference is intentional (documented, tested, or structurally necessary) or a bug. + +3. **Cross-file contract tracing.** For the most promising discrepancies, trace the call chain across files to verify: + - What lock/interrupt context each path runs in + - What allocation flags are valid in that context + - Whether the contract (documented in specs, comments, or headers) requires parity + - Write findings to `quality/EXPLORATION_ITER{N}.md` with both code paths cited for each finding. + +4. **Minimum output:** At least 5 parallel groups enumerated, at least 8 pairwise comparisons traced with file:line references, at least 3 concrete discrepancy findings. + +--- + +## Strategy: `adversarial` — challenge the previous run's conclusions + +Re-investigate what the previous run dismissed, demoted, or marked SATISFIED. This strategy assumes the previous run made Type II errors (missed real bugs by being too conservative) and systematically challenges those decisions. 
+ +**Why this strategy exists:** In benchmarking, the triage step reliably demotes legitimate findings by demanding excessive evidence, marking ambiguous cases as "design choice," or accepting code-review SATISFIED verdicts without deep verification. The adversarial strategy specifically targets these failure modes. + +1. **Load previous decisions.** Read these files from the previous run (use divide-and-conquer — section headers first, then targeted deep reads): + - `quality/EXPLORATION_MERGED.md` — specifically the `## Demoted Candidates` section (this is your primary input — it contains structured re-promotion criteria for each dismissed finding) + - `quality/BUGS.md` — what was confirmed (to avoid re-finding the same bugs) + - `quality/spec_audits/*triage*` — what was dismissed or demoted during triage, and why + - `quality/code_reviews/*.md` — Pass 2 SATISFIED/VIOLATED verdicts + - `quality/EXPLORATION.md` — just the `## Candidate Bugs for Phase 2` section to see which candidates didn't become confirmed bugs + - Write a summary to `quality/ITERATION_PLAN.md` listing: (a) demoted candidates from the manifest with their re-promotion criteria, (b) additional dismissed triage findings not yet in the manifest, (c) candidates that weren't promoted, (d) requirements marked SATISFIED that had thin evidence + +2. **Re-investigate dismissed findings with a lower evidentiary bar.** The adversarial strategy uses a deliberately lower evidentiary standard than earlier strategies. The baseline and gap strategies rightly demand strong evidence to avoid false positives during initial discovery. But by the adversarial iteration, remaining undiscovered bugs are precisely the ones that conservative triage keeps rejecting — they look ambiguous, they could be "design choices," they lack dramatic runtime failures. 
For these findings: + - A code-path trace showing observable semantic drift (output differs from what spec or contract requires) is sufficient to confirm — you do not need a runtime crash or dramatic failure + - "Permissive behavior" is not automatically a design choice — check whether the spec, docs, or API contract defines the expected behavior. If the code deviates from a documented contract, it's a bug regardless of whether the deviation is "permissive" + - If the Demoted Candidates Manifest includes re-promotion criteria, attempt to satisfy those criteria specifically. Each criterion was written to be actionable — follow it + - Read the specific code location cited in the finding + - Trace the code path independently — do not rely on the previous run's analysis + - Make an explicit CONFIRMED/FALSE-POSITIVE determination with fresh evidence + - Update the Demoted Candidates Manifest: change status to RE-PROMOTED or FALSE POSITIVE with the iteration attribution + +3. **Challenge SATISFIED verdicts.** For each requirement the code review marked SATISFIED with thin evidence (single-line citation, no code-path trace, or grouped with 3+ other requirements under one citation): + - Re-verify the requirement by reading the cited code and tracing the behavior + - Check whether the requirement is actually satisfied or whether the review took a shallow pass + +4. **Explore adjacent code.** For each re-confirmed or newly confirmed finding, explore the surrounding code for related bugs — bugs cluster. If a function has one bug, its callers and siblings likely have related issues. + +5. Write all findings to `quality/EXPLORATION_ITER{N}.md`. Each finding must include: the original source (triage dismissal, candidate demotion, or SATISFIED challenge), the fresh evidence, and the new determination. 
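The "thin evidence" screen in the SATISFIED-challenge step can be done mechanically before the manual re-read. A minimal sketch, assuming a hypothetical review format in which each requirement gets a `### REQ-NNN` heading, a SATISFIED verdict line, and `file:line` citations in the body — the actual files in `quality/code_reviews/` may be structured differently, so treat this as a starting point, not a parser for the real artifacts:

```python
import re

def find_thin_satisfied(review_text: str) -> list:
    """Flag SATISFIED verdicts whose evidence looks thin and deserves
    adversarial re-verification: a single citation with no code-path
    trace, or 3+ requirements grouped under one verdict.
    Assumes one '### REQ-NNN' heading per review section."""
    thin = []
    sections = re.split(r"(?m)^### (?=REQ-\d{3})", review_text)
    for section in sections:
        if "SATISFIED" not in section:
            continue
        req_ids = re.findall(r"REQ-\d{3}", section)
        if not req_ids:
            continue
        # e.g. matches "parser.c:142" or "src/writer.c:55"
        citations = re.findall(r"[\w/]+\.\w+:\d+", section)
        reasons = []
        if len(citations) <= 1:
            reasons.append("single citation, no code-path trace")
        if len(set(req_ids)) >= 4:
            reasons.append("3+ requirements grouped under one verdict")
        if reasons:
            thin.append((req_ids[0], reasons))
    return thin
```

Each flagged requirement then gets the full re-verification described in step 3; the screen only prioritizes, it does not decide.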
diff --git a/skills/quality-playbook/references/requirements_pipeline.md b/skills/quality-playbook/references/requirements_pipeline.md new file mode 100644 index 000000000..361730e03 --- /dev/null +++ b/skills/quality-playbook/references/requirements_pipeline.md @@ -0,0 +1,427 @@ +# Requirements Pipeline + +## Overview + +This document defines the five-phase requirements generation pipeline for Step 7 of the Quality Playbook. The pipeline separates contract discovery from requirement derivation, uses file-based external memory so the model doesn't need to hold everything in context simultaneously, and includes mechanical verification with a completeness gate. + +**Why a pipeline?** Single-pass requirement generation runs out of attention after ~70 requirements because the model is simultaneously discovering contracts and writing formal requirements. Separating these into distinct phases with file-based handoffs produces significantly more complete coverage. In testing on Gson (81 source files, ~21K lines), single-pass produced 48 requirements; the pipeline produced 110. + +## Files produced + +| File | Purpose | +|------|---------| +| `quality/CONTRACTS.md` | Raw behavioral contracts extracted from source | +| `quality/REQUIREMENTS.md` | Testable requirements with narrative (the primary deliverable) | +| `quality/COVERAGE_MATRIX.md` | Contract-to-requirement traceability | +| `quality/COMPLETENESS_REPORT.md` | Final completeness assessment with verdict | +| `quality/VERSION_HISTORY.md` | Review log with version table and provenance | +| `quality/REFINEMENT_HINTS.md` | Review progress and feedback (created during review) | + +Versioned backups go in `quality/history/vX.Y/`. + +--- + +## Phase A: Extract behavioral contracts + +**Input:** All source files in the project (or a scoped subsystem — see scaling check below). 
+**Output:** `quality/CONTRACTS.md` + +### Scaling check + +Before starting extraction, count the source files in the project (exclude tests, generated code, vendored dependencies, and build artifacts). + +- **Standard project (≤300 source files):** Proceed normally — extract contracts from all files. Projects in this range have been tested end-to-end (e.g., Gson at ~81 source files produced 110 requirements with full coverage). +- **Large project (301–500 source files):** Focus on the 3–5 core subsystems identified in Phase 1, Step 2. Extract contracts from those modules and their internal dependencies. Note the scope in the CONTRACTS.md header so reviewers know what was covered. +- **Very large project (>500 source files):** Recommend that the user scope the pipeline to one subsystem at a time. Each subsystem gets its own pipeline run producing its own REQUIREMENTS.md, CONTRACTS.md, etc. Tell the user: "This project has N source files. For best results, run the requirements pipeline separately for each major subsystem (e.g., 'Generate requirements for the authentication module'). A single pipeline run across the full codebase will miss contracts due to context limits." + +If the user explicitly asks for full-project scope on a large codebase, honor the request but warn that coverage will be thinner than subsystem-level runs. + +### Scope breadth on the initial pass + +On the first pipeline run, favor breadth over depth. Cover all major subsystems and modules rather than going deep on a few. The goal is a broad baseline that the self-refinement loop and later review/refinement passes can deepen. If you focus on 3 modules and skip 8 others, the completeness check can't find gaps in modules it never saw. + +For projects with both a core library and supporting modules (middleware, plugins, adapters, extensions), include at least the core and the highest-risk supporting modules in Phase A. 
Note the scope in the CONTRACTS.md header so it's clear what was covered and what wasn't. Refinement passes can expand scope later, but the initial pass should cast the widest net the context window allows. + +### Contract extraction + +Read every source file (within scope) and list every behavioral contract it implements or should implement. A behavioral contract is any promise the code makes to its callers: + +- **METHOD**: What a public method guarantees about return value, side effects, exceptions, thread safety +- **NULL**: What happens when null is passed, returned, or stored +- **CONFIG**: What effect a configuration option has at its boundaries +- **ERROR**: What exceptions are thrown, when, and with what diagnostic information +- **INVARIANT**: Properties that must always hold +- **COMPAT**: Behaviors preserved for backward compatibility +- **ORDER**: Whether output/iteration order is stable, documented, or undefined +- **LIFECYCLE**: Resource creation/cleanup, initialization sequencing +- **THREAD**: Thread-safety guarantees or requirements + +### Contract extraction rules + +- **Be thorough.** For a 200-line file, expect 5–15 contracts. For a 1000-line file, expect 20–40. If you're finding fewer than 3 contracts in a file with real logic, you're skipping things. +- **Include internal files.** Internal contracts matter because the public API depends on them. +- **Include "should exist" contracts** — things the code doesn't do but should based on its domain. These catch absence bugs. +- **Read the code, not just the Javadoc/docstrings.** When documentation and code disagree, list both. +- **This is discovery, not judgment.** List everything, even if it seems obvious. + +### Output format + +``` +# Behavioral Contract Extraction +Generated: [date] +Source files analyzed: N +Total contracts extracted: N + +## Summary by category +- METHOD: N +- NULL: N +- CONFIG: N +[etc.] + +### path/to/file.ext (N contracts) + +1. 
[METHOD] ClassName.methodName(): description of what it guarantees +2. [NULL] ClassName.methodName(): what happens when null is passed/returned +[etc.] +``` + +--- + +### Requirement heading format + +All requirements in REQUIREMENTS.md must use the format `### REQ-NNN: Title` where NNN is a zero-padded three-digit number and Title is a short descriptive name. Do not use alternative formats like `### REQ-NNN — Title`, `### REQ-NNN. Title`, `**REQ-NNN**: Title`, or freeform headings without a number. Consistent formatting enables automated tooling to parse and cross-reference requirements. + +--- + +## Phase B: Derive requirements from contracts + +**Input:** `quality/CONTRACTS.md`, project documentation, SKILL.md Step 7 template. +**Output:** `quality/REQUIREMENTS.md` + +### How to work + +**B.1 — Group related contracts.** Many contracts across different files serve the same behavioral requirement. Group them by behavioral concern, not by file. Don't merge unrelated contracts just because they're in the same file. + +**B.2 — Enrich with intent.** For each group, find the user story from documentation: GitHub issues state what users expect, the user guide states intended behavior, troubleshooting docs reveal known edge cases, design docs explain design goals. The "so that" clause must come from understanding who cares and why. + +**B.3 — Write requirements.** Use the 7-field template from SKILL.md Step 7. Conditions of satisfaction come from the individual contracts in the group — each contract becomes a condition of satisfaction. + +**B.4 — Check for orphan contracts.** After writing all requirements, verify every contract in CONTRACTS.md is covered. Uncovered contracts become new requirements or get added to existing requirements' conditions of satisfaction. + +### Rules + +- **Do not cap the requirement count.** Write as many as the contracts warrant. 
+- **Every contract must map to at least one requirement.** +- **One requirement per distinct behavioral concern.** Don't merge "thread safety" with "null handling" just because they're in the same class. +- **Do not modify CONTRACTS.md.** Only read it. + +--- + +## Phase C: Verify coverage (loop, max 3 iterations) + +**Input:** `quality/CONTRACTS.md`, `quality/REQUIREMENTS.md` +**Output:** `quality/COVERAGE_MATRIX.md`, updated `quality/REQUIREMENTS.md` + +For every contract in CONTRACTS.md, determine whether it is covered by a requirement. A contract is "covered" if a requirement's conditions of satisfaction explicitly test the behavior. A contract is NOT covered if it's only tangentially mentioned, implied but not stated, or if a different aspect of the same file is covered but this specific contract isn't. + +### Output format + +``` +# Contract Coverage Matrix +Generated: [date] +Total contracts: N +Covered: N (percentage) +Uncovered: N (percentage) +Partially covered: N (percentage) + +## Fully covered contracts +[file]: [contract summary] → REQ-NNN (conditions of satisfaction #M) + +## Partially covered contracts +[file]: [contract summary] → REQ-NNN covers the general area but misses [specific aspect] + +## Uncovered contracts +[file]: [contract summary] → No requirement addresses this behavior +``` + +After writing the matrix, fix gaps in REQUIREMENTS.md: add missing conditions to existing requirements or create new requirements. Report changes. + +**Loop termination:** If uncovered count reaches 0, proceed to Phase D. Otherwise, regenerate the matrix and check again. Maximum 3 iterations. + +--- + +## Phase D: Completeness check + +**Input:** `quality/REQUIREMENTS.md`, `quality/CONTRACTS.md`, `quality/COVERAGE_MATRIX.md`, source tree. +**Output:** `quality/COMPLETENESS_REPORT.md`, updated `quality/REQUIREMENTS.md` + +This is the final gate before the narrative pass. 
Run the following checks. There are three core checks, plus a conditional fourth that applies when earlier review artifacts exist.
+
+ ### Check 1: Domain completeness
+
+ The following behavioral domains MUST have requirements. Check each one. This checklist is a minimum — if you notice a domain not listed here that this project should have requirements for, add it.
+
+ - [ ] **Null handling:** explicit null, absent fields, null keys, null values in collections
+ - [ ] **Type coercion:** string↔number, string↔boolean, number precision, overflow
+ - [ ] **Primitive vs wrapper:** primitive vs object null semantics during deserialization (for languages with this distinction)
+ - [ ] **Generic types:** erasure boundaries, wildcard handling, recursive generics (for languages with generics)
+ - [ ] **Thread safety:** concurrent access, publication safety, cache visibility
+ - [ ] **Error diagnostics:** exception types, path context, location information
+ - [ ] **Resource management:** stream closing, reader/writer lifecycle
+ - [ ] **Backward compatibility:** wire format stability, API behavioral stability
+ - [ ] **Security:** DoS protection (nesting depth, string length), injection prevention
+ - [ ] **Encoding:** Unicode, BOM, surrogate pairs, escape sequences
+ - [ ] **Date/time:** format precedence, timezone handling, precision
+ - [ ] **Collections:** arrays, lists, sets, maps, queues — empty, null elements, ordering
+ - [ ] **Enums:** name resolution, aliases, unknown values
+ - [ ] **Polymorphism:** runtime type vs declared type, adapter/handler delegation
+ - [ ] **Tree model / intermediate representation:** mutation semantics, deep copy structural independence, null normalization
+ - [ ] **Configuration:** builder immutability, instance isolation, option composition
+ - [ ] **Entry points:** every distinct public entry point must have its own contract — string-based, stream-based, tree-based, standalone parsing, multi-value parsing. If the library has N ways to start a read or write, there must be N sets of contracts.
+- [ ] **Output escaping:** which characters are escaped by default, what disabling escaping changes, how builder-level and writer-level controls interact +- [ ] **Built-in type handler contracts:** for each built-in handler that processes a standard library type, state what it promises about format, precision, normalization, and round-trip fidelity. The requirement should specify the handler's promise, not just that a handler exists. +- [ ] **Field/property serialization ordering:** whether output order follows declaration order, inheritance order, alphabetical order, or is undefined. State whether ordering is a promised contract or merely observed behavior. +- [ ] **Identity contracts for public types:** `toString()`, `hashCode()`/`equals()` (or language equivalent) on public model types. These are behavioral contracts users depend on for comparison, logging, and collection key usage. +- [ ] **Input validation:** for every configuration field with domain constraints, state the valid range and whether validation exists. + +For each domain, either cite the REQ-NNN numbers that cover it or flag it as a gap. + +### Check 2: Testability audit + +For each requirement, check whether its conditions of satisfaction are actually testable. Can a reviewer write a concrete test case from this condition? Is pass/fail unambiguous? Does the condition cover failure modes, not just the happy path? + +### Check 3: Cross-requirement consistency + +Check pairs of requirements that reference the same concept. Do ranges agree? Do null-handling rules agree? Do thread-safety guarantees conflict with lifecycle contracts? Do configuration defaults match across requirements? + +### Check 4: Cross-artifact consistency (if code review or spec audit results exist) + +If `quality/code_reviews/` or `quality/spec_audits/` contain results from a previous or current run, read them. 
For every finding with status VIOLATED, BUG, or INCONSISTENT, check whether the requirements address the behavioral concern that finding targets. If a code review found a bug in compression header parsing that the requirements don't cover, that's a completeness gap — add a requirement or conditions of satisfaction to close it. + +**The completeness report cannot say COMPLETE if unaddressed findings exist.** If any VIOLATED/BUG/INCONSISTENT finding from code review or spec audit targets behavior not covered by requirements, the verdict must be INCOMPLETE with the specific gaps listed. + +This check exists because earlier versions of the pipeline produced completeness reports that said "COMPLETE" while the code review in the same run found requirement violations. The completeness report must be consistent with all other quality artifacts. + +### Post-review completeness refresh (mandatory) + +**After the code review and spec audit are complete**, re-read `quality/COMPLETENESS_REPORT.md` and update it. The initial completeness report was written before the code review and spec audit ran, so it cannot reflect their findings. This refresh step reconciles the completeness verdict with the actual review results. + +**Procedure:** +1. Read the combined summary from `quality/code_reviews/` — count VIOLATED and BUG findings. +2. Read the triage summary from `quality/spec_audits/` — count confirmed code bugs. +3. For each finding, check whether REQUIREMENTS.md has a requirement covering that behavior. +4. Append a `## Post-Review Reconciliation` section to COMPLETENESS_REPORT.md: + +``` +## Post-Review Reconciliation +Updated: [date] + +### Code review findings: N VIOLATED, M BUG +- [finding summary] → covered by REQ-NNN / NOT COVERED (gap) +- ... + +### Spec audit findings: N confirmed code bugs +- [finding summary] → covered by REQ-NNN / NOT COVERED (gap) +- ... 
+ +### Updated verdict +[COMPLETE if all findings are covered by requirements, INCOMPLETE if gaps remain] +``` + +5. If the original verdict was COMPLETE but unaddressed findings exist, change the verdict to INCOMPLETE. + +### Resolving code review vs spec audit conflicts + +When the code review and spec audit disagree about the same behavioral claim — one says BUG, the other says design choice or false positive — the reconciliation must resolve the conflict, not paper over it. + +**Resolution procedure:** +1. Identify the factual claim at the center of the disagreement. What does the code actually do? +2. Deploy a verification probe: give a model the disputed claim and the relevant source code, and ask it to report ground truth. (See `spec_audit.md` § "The Verification Probe.") +3. Record the resolution in the Post-Review Reconciliation section: + ``` + ### Conflicts resolved + - [finding description]: Code review said [X], spec audit said [Y]. + Verification probe: [what the code actually does]. + Resolution: [BUG CONFIRMED / FALSE POSITIVE / DESIGN CHOICE]. [Explanation.] + ``` +4. If the resolution confirms a BUG, ensure it has a regression test. If the resolution overturns a BUG, clean up the regression test per `review_protocols.md` § "Cleaning up after spec audit reversals." + +**Do not resolve conflicts by defaulting to one source.** Neither the code review nor the spec audit is automatically more authoritative — they use different methods (structural reading vs. spec comparison) and have different blind spots. The verification probe is the tiebreaker. + +**This refresh is not optional.** A completeness report that predates the code review is a timestamp, not a quality gate. The refresh turns it into an actual reconciliation. 
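The verdict bookkeeping in the refresh procedure is mechanical enough to sketch. Assuming the findings have already been reduced to (summary, expected REQ id) pairs — that reduction is the judgment-heavy part and stays with the model — and that REQUIREMENTS.md uses the `### REQ-NNN: Title` heading format defined earlier, the reconciliation logic looks roughly like this:

```python
def reconcile(findings, requirements_text):
    """Derive the Post-Review Reconciliation verdict. Each finding is a
    (summary, req_id) pair: req_id is the requirement expected to cover
    it, or None if no candidate was identified. A finding counts as
    covered only if that requirement heading actually exists."""
    report_lines = []
    gaps = 0
    for summary, req_id in findings:
        if req_id is not None and f"### {req_id}:" in requirements_text:
            report_lines.append(f"- {summary} → covered by {req_id}")
        else:
            report_lines.append(f"- {summary} → NOT COVERED (gap)")
            gaps += 1
    verdict = "COMPLETE" if gaps == 0 else "INCOMPLETE"
    return verdict, report_lines
```

Note the strict direction of the check: a finding with a plausible-sounding REQ id still counts as a gap if the heading is absent, which is exactly the "COMPLETE verdict with unaddressed findings" failure mode this refresh exists to prevent.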
+ +### Output format + +``` +# Completeness Report +Generated: [date] + +## Domain coverage +[For each domain: COVERED (REQ-NNN, REQ-NNN) or GAP (description)] + +## Testability issues +[For each vague requirement: REQ-NNN — condition N is not testable because...] + +## Consistency issues +[For each conflict: REQ-NNN and REQ-NNN disagree about...] + +## Cross-artifact gaps (if code review/spec audit results exist) +[For each unaddressed finding: finding summary → missing requirement or condition] + +## Verdict +COMPLETE or INCOMPLETE with recommended actions +``` + +Then fix what you can: add requirements for domain gaps, sharpen vague conditions, resolve consistency issues, and close cross-artifact gaps. + +**Important:** This is the final check. Be adversarial. Assume previous passes were imperfect. For each domain marked COVERED, verify that the cited requirements actually address the checklist item — don't just check the box. + +### Self-refinement loop (max 3 iterations) + +After the initial completeness check, run up to 3 refinement iterations to close the gaps Phase D identified: + +1. **Read the completeness report.** Identify all GAP entries, testability issues, and consistency issues. +2. **Fix gaps in REQUIREMENTS.md.** For each GAP: add a new requirement using the 7-field template, or add conditions of satisfaction to an existing requirement. For testability issues: sharpen the condition. For consistency issues: resolve the conflict. +3. **Re-run all three checks** (domain completeness, testability audit, cross-requirement consistency). Write the updated results to COMPLETENESS_REPORT.md. +4. **Count the delta.** How many new requirements were added or existing requirements modified in this iteration? +5. **Short-circuit check:** If the delta is fewer than 3 changes, stop — you've hit diminishing returns. Proceed to Phase E. 
+ +**Why this works:** The initial completeness check identifies gaps but the model may not fix all of them in one pass, especially conceptual gaps where the model needs to re-read source files to understand what's missing. Each iteration shrinks the gap. Three iterations is enough to close the mechanical gaps; the remaining conceptual gaps are where cross-model audit and human review earn their keep. + +**Why it has limits:** This is self-refinement — the same model checking its own work. It catches gaps the model can see once they're pointed out (uncovered domains, vague conditions, numeric inconsistencies) but won't catch blind spots the model doesn't recognize as gaps. That's by design. The review and refinement protocols exist for closing those deeper gaps with different models or human input. + +After the loop completes (or short-circuits), proceed to Phase E. + +--- + +## Phase E: Narrative pass + +**Input:** `quality/REQUIREMENTS.md`, `quality/CONTRACTS.md`, project documentation, source tree. +**Output:** Restructured `quality/REQUIREMENTS.md` + +**Before starting:** Save a backup: `cp quality/REQUIREMENTS.md quality/REQUIREMENTS_pre_narrative.md` + +This phase transforms the specification into a guide. Add explanatory tissue so a new team member, code reviewer, or AI agent can read the document top-to-bottom and understand the software. + +### E.1 — Project overview (new, top of document) + +Write 400–600 words of connected prose explaining: what the software is, who uses it and why (primary personas and goals), how data flows through the major components, and the design philosophy (key architectural decisions and why they were made). + +### E.2 — Use cases (new, after overview) + +Write 6–8 use cases in the style of Applied Software Project Management (Stellman & Greene). 
Each has: + +- **Name**: Short descriptive name +- **Actor**: Who initiates it +- **Preconditions**: What must be true before this begins +- **Steps**: Numbered actor/system action sequence +- **Postconditions**: What is true on success +- **Alternative paths**: Variations and error cases +- **Requirements**: Which REQ-NNN numbers this use case exercises + +Cover the major usage patterns. The use cases are the bridge between "what the software does" and "what the requirements specify." + +### E.3 — Cross-cutting concerns (new, after use cases) + +Document architectural invariants that span multiple categories: threading model, null contract, error philosophy, backward compatibility strategy, configuration composition. Each references specific REQ-NNN numbers. Write as prose paragraphs. + +### E.4 — Category narratives (augment existing) + +For each requirement category, add 2–4 sentences before the first requirement explaining what the category covers, how it relates to other categories, and what a reviewer should keep in mind. + +### E.5 — Reorder for top-down flow + +Reorder categories from user-facing (entry points, configuration) to infrastructure (error handling, backward compatibility). Fold any catch-all sections into proper categories. + +### E.6 — Renumber sequentially + +After reordering, renumber all requirements REQ-001 through REQ-NNN following document order. Update all internal cross-references. + +### Rules + +- **Do not delete, merge, or weaken any existing requirement.** +- **Do not add new requirements in this pass.** +- **Write the overview and use cases from the user's perspective.** +- **Use cases must cite specific REQ numbers.** + +--- + +## Versioning protocol + +### Version scheme: major.minor + +- **Major** bump: structural changes (new pipeline architecture, narrative pass added, major scope expansion). Bumped by the user. +- **Minor** bump: refinement passes, gap fills, sharpened conditions. 
Increments automatically on each pipeline run or refinement pass. + +### VERSION_HISTORY.md + +Maintain a version history file at `quality/VERSION_HISTORY.md`: + +```markdown +# Requirements Version History + +## Current version: vX.Y + +| Version | Date | Model | Author | Reqs | Summary | +|---------|------|-------|--------|------|---------| +| v1.0 | YYYY-MM-DD | [model] | Quality Playbook | N | Initial pipeline generation | +| v1.1 | YYYY-MM-DD | [model] | [author] | N | [what changed] | + +## Pending review +[status from REFINEMENT_HINTS.md if review is in progress] +``` + +The **Author** column records provenance: "Quality Playbook" for automated pipeline runs, a person's name for manual edits, a model name for refinement passes. + +### Backup protocol + +Before each version change, copy all quality files to `quality/history/vX.Y/`: + +``` +quality/history/ +├── v1.0/ +│ ├── REQUIREMENTS.md +│ ├── CONTRACTS.md +│ ├── COVERAGE_MATRIX.md +│ └── COMPLETENESS_REPORT.md +├── v1.1/ +│ └── ... +└── v2.0/ + └── ... +``` + +Each version folder is a complete snapshot. Users can diff any two versions. + +### Version stamping + +The REQUIREMENTS.md header includes the current version: + +```markdown +# Behavioral Requirements — [Project Name] +Version: vX.Y +Generated: [date] +Pipeline: contract-extraction v2 with narrative pass +``` + +--- + +## After the pipeline: review and refinement + +The pipeline produces a solid baseline, but AI isn't 100% reliable. The skill provides two standalone tools for iterative improvement: + +### Requirements review (`quality/REVIEW_REQUIREMENTS.md`) + +An interactive or guided review of requirements organized by use case. Three modes: +- **Self-guided**: Pick use cases to drill into +- **Fully guided**: Walk through use cases sequentially +- **Cross-model audit**: A different model fact-checks the completeness report + +Progress and feedback are tracked in `quality/REFINEMENT_HINTS.md`. 
See the generated `quality/REVIEW_REQUIREMENTS.md` for the full protocol. + +### Requirements refinement (`quality/REFINE_REQUIREMENTS.md`) + +Reads `quality/REFINEMENT_HINTS.md` and updates `quality/REQUIREMENTS.md` to close identified gaps. Can be run with any model. Backs up the current version, bumps minor version, reports all changes. See the generated `quality/REFINE_REQUIREMENTS.md` for the full protocol. + +### Multi-model refinement + +Users can run refinement passes with different models to catch different blind spots. Each pass: backup → refine → version bump → log in VERSION_HISTORY.md. Run as many models as desired until diminishing returns. diff --git a/skills/quality-playbook/references/requirements_refinement.md b/skills/quality-playbook/references/requirements_refinement.md new file mode 100644 index 000000000..6a15a4e6c --- /dev/null +++ b/skills/quality-playbook/references/requirements_refinement.md @@ -0,0 +1,113 @@ +# Requirements Refinement Protocol + +## Overview + +This is the template for `quality/REFINE_REQUIREMENTS.md`. The playbook generates this file alongside the requirements pipeline output. It provides a structured process for updating requirements based on review feedback, and can be run with any model. + +## Generated file template + +The playbook should generate the following as `quality/REFINE_REQUIREMENTS.md`: + +--- + +```markdown +# Requirements Refinement Protocol: [Project Name] + +## How to use + +This protocol reads feedback from `quality/REFINEMENT_HINTS.md` and updates `quality/REQUIREMENTS.md` to close identified gaps. It can be run with any AI model — the protocol is self-contained. + +**Multi-model refinement:** You can run this protocol multiple times with different models. Each run backs up the current version, makes targeted improvements, bumps the minor version, and logs the changes. Run as many models as you want until you hit diminishing returns. + +--- + +## Before starting + +1. 
Read `quality/REFINEMENT_HINTS.md` — this contains the review feedback to address. +2. Read `quality/REQUIREMENTS.md` — the current requirements to update. +3. Read `quality/CONTRACTS.md` — for contract-level detail when adding new conditions. +4. Read `quality/VERSION_HISTORY.md` — to determine the current version number. + +## Step 1: Backup and version + +1. Read the current version from `quality/VERSION_HISTORY.md`. +2. Copy all files in `quality/` to `quality/history/vX.Y/` (current version number). +3. Bump the minor version: v1.2 becomes v1.3. +4. Update the version stamp at the top of `quality/REQUIREMENTS.md`. + +## Step 2: Process feedback + +Read each item in REFINEMENT_HINTS.md and categorize it: + +- **Gap — missing requirement:** A behavioral contract or domain area has no requirement. Create a new requirement using the 7-field template. +- **Gap — missing condition:** An existing requirement doesn't cover a specific scenario. Add a condition of satisfaction to the existing requirement. +- **Gap — missing use case coverage:** A use case doesn't link to a requirement that governs one of its steps. Add the REQ-NNN to the use case's Requirements line. +- **Sharpening — vague condition:** A condition of satisfaction is too vague to test. Rewrite it with concrete pass/fail criteria. +- **Correction — wrong content:** A requirement states something incorrect. Fix the specific field. +- **Cross-model audit finding:** A domain was marked COVERED in the completeness report but the cited requirements don't actually address it. Add the missing requirements. +- **Removal (user-directed only):** The user explicitly states a requirement is incorrect and should be removed (e.g., "REQ-047 is incorrect because X — remove it"). Only process removals when the hint clearly comes from the user, not from an automated pass. Log the removal and reason in the change report. + +## Step 3: Make changes + +For each feedback item: + +1. 
**New requirements:** Add at the end of the appropriate category section. Continue the existing numbering sequence. Follow the 7-field template exactly. +2. **Modified requirements:** Edit the specific field that needs changing. Do not rewrite requirements that aren't flagged. +3. **Use case updates:** Add newly created REQ numbers to the relevant use case's Requirements line. +4. **Cross-cutting concerns:** If new requirements affect cross-cutting concerns, update those sections. + +### Rules + +- **Do not delete or weaken existing requirements during automated refinement.** Every requirement that exists today must exist after refinement with at least the same conditions of satisfaction — unless the user has explicitly marked a requirement for removal with a reason. User-directed removals are the only exception. +- **Do not renumber existing requirements.** New requirements get the next available number. This preserves traceability across versions. +- **Do not restructure the document.** The narrative pass already established the structure. Refinement is surgical — add, sharpen, or fix individual items. +- **Each change must be traceable to a feedback item.** Don't make changes that weren't asked for. + +## Step 4: Report changes + +After all changes, append a summary to `quality/REFINEMENT_HINTS.md`: + +``` +## Refinement Pass — v[new version] +Date: [date] +Model: [model name] + +### Changes made +- REQ-NNN (NEW): [brief description] — addresses feedback: "[quoted hint]" +- REQ-NNN: Added condition of satisfaction for [what] — addresses feedback: "[quoted hint]" +- REQ-NNN: Sharpened condition #N: [what changed] — addresses feedback: "[quoted hint]" +- Use Case N: Added REQ-NNN to requirements list + +### Feedback items not addressed +- "[quoted hint]" — reason: [why this wasn't actionable or was out of scope] + +### Summary +Added N new requirements, modified N existing requirements, updated N use cases. +Total requirements: N (was N). 
+``` + +## Step 5: Update version history + +Add a row to `quality/VERSION_HISTORY.md`: + +``` +| vX.Y | YYYY-MM-DD | [model] | [author] | N | [summary of changes] | +``` + +## Step 6: Update completeness report + +If new requirements were added that address domain checklist gaps, update the relevant domain entries in `quality/COMPLETENESS_REPORT.md` to cite the new REQ numbers. + +--- + +## Running multiple refinement passes + +Each pass follows the same protocol: +1. Read the latest REFINEMENT_HINTS.md (which now includes the previous pass's report) +2. Focus only on feedback items marked "not addressed" or new feedback added since the last pass +3. Backup, bump version, make changes, report + +The user can add new hints between passes by editing REFINEMENT_HINTS.md directly. The next refinement pass picks them up automatically. + +The user can also run a fresh cross-model audit (Mode 3 of the review protocol) between refinement passes to find new gaps that the previous refinement didn't catch. This creates a review → refine → review → refine cycle that converges on completeness. +``` diff --git a/skills/quality-playbook/references/requirements_review.md b/skills/quality-playbook/references/requirements_review.md new file mode 100644 index 000000000..e395bb1c2 --- /dev/null +++ b/skills/quality-playbook/references/requirements_review.md @@ -0,0 +1,158 @@ +# Requirements Review Protocol + +## Overview + +This is the template for `quality/REVIEW_REQUIREMENTS.md`. The playbook generates this file alongside the requirements pipeline output. It provides three modes for reviewing requirements interactively after generation. + +## Generated file template + +The playbook should generate the following as `quality/REVIEW_REQUIREMENTS.md`: + +--- + +```markdown +# Requirements Review Protocol: [Project Name] + +## How to use + +This protocol helps you review the generated requirements for completeness and accuracy. 
Run it with any AI model — the review is self-contained and reads from the files in `quality/`. + +**Before starting:** Make sure `quality/REQUIREMENTS.md` exists (from the pipeline) and that you've read the Project Overview and Use Cases sections at the top. + +### Choose a review mode + +**Mode 1 — Self-guided review.** You pick which use cases to examine. Best when you already know which areas of the project need the most scrutiny. + +**Mode 2 — Fully guided review.** The AI walks you through every use case in order, drilling into each linked requirement. Best for a thorough first review. + +**Mode 3 — Cross-model audit.** A different AI model fact-checks the completeness report by verifying that every domain marked COVERED actually has requirements addressing the checklist item. Best run with a different model than the one that generated the requirements. + +All three modes track progress in `quality/REFINEMENT_HINTS.md`. + +--- + +## Mode 1: Self-guided review + +Read `quality/REQUIREMENTS.md` and present the user with a numbered list of use cases: + +``` +Use cases in REQUIREMENTS.md: +1. [x] Use Case 1: [name] (reviewed) +2. [ ] Use Case 2: [name] +3. [ ] Use Case 3: [name] +... +``` + +Check `quality/REFINEMENT_HINTS.md` for review progress — use cases marked `[x]` have already been reviewed. Present the list and ask the user which use case to examine. + +When the user picks a use case: +1. Show the use case (actor, steps, postconditions, alternative paths) +2. List the linked REQ-NNN numbers +3. Ask: "Want to drill into any of these requirements, or does this use case look complete?" + +When drilling into a requirement: +1. Show the full requirement (summary, user story, conditions of satisfaction, alternative paths) +2. Ask: "Does this capture the right behavior? Anything missing or wrong?" +3. 
Record feedback in REFINEMENT_HINTS.md under the use case heading + +After reviewing a use case, mark it `[x]` in REFINEMENT_HINTS.md and return to the use case list. + +Also offer: "Are there any cross-cutting concerns or requirements NOT linked to a use case that you'd like to review?" + +--- + +## Mode 2: Fully guided review + +Same as Mode 1, but instead of asking the user to pick, start at Use Case 1 and proceed sequentially. + +For each use case: +1. Present the use case overview +2. Walk through each linked requirement one by one +3. For each requirement, ask: "Does this look right? Anything missing?" +4. Record any feedback in REFINEMENT_HINTS.md +5. Mark the use case as reviewed +6. Move to the next use case + +After all use cases: +1. Present the Cross-Cutting Concerns section +2. Ask: "Any concerns about threading, null handling, errors, compatibility, or configuration composition?" +3. Ask: "Are there any requirements you expected to see that aren't here?" +4. Record feedback and present a summary of all hints collected + +--- + +## Mode 3: Cross-model audit + +Read `quality/COMPLETENESS_REPORT.md` and `quality/REQUIREMENTS.md`. For each domain in the completeness report: + +1. Read the domain checklist item (from the report's domain coverage section) +2. Read each cited REQ-NNN +3. Verify: does this requirement actually address the domain checklist item? +4. If the citation is wrong (the requirement covers something else), flag it as a gap + +Also check: +- Are there requirements that don't appear in any use case's Requirements list? If so, flag as potentially orphaned. +- Does every use case's alternative paths section have corresponding requirements for the error/edge cases it mentions? +- Do the cross-cutting concerns reference requirements that actually exist and address the stated concern? 
+

Write findings to `quality/REFINEMENT_HINTS.md` under a `## Cross-Model Audit` heading:

```
## Cross-Model Audit
Date: [date]
Model: [model name]

### Verified domains
- Null handling: CONFIRMED (REQ-054, REQ-055 correctly address null semantics)
- ...

### Gaps found
- Entry points: COMPLETENESS_REPORT cites REQ-100, REQ-101 but these are about
  pretty printing, not entry point contracts. JsonStreamParser has no coverage.
- ...

### Orphaned requirements
- REQ-NNN is not linked to any use case
- ...
```

Present findings to the user and ask which gaps should be addressed in a refinement pass.

---

## REFINEMENT_HINTS.md format

The review protocol creates and maintains this file:

```markdown
# Refinement Hints

## Review Progress
- [x] Use Case 1: [name] — reviewed, no issues
- [x] Use Case 2: [name] — reviewed, see feedback below
- [ ] Use Case 3: [name]
- [ ] Use Case 4: [name]
...

## Cross-Cutting Concerns
- [ ] Threading model — not yet reviewed
- [ ] Null contract — not yet reviewed
- [ ] Error philosophy — not yet reviewed
- [ ] Backward compatibility — not yet reviewed
- [ ] Configuration composition — not yet reviewed

## Feedback

### Use Case 2: [name]
- REQ-NNN: [specific feedback about what's missing or wrong]
- General: [broader observation about this use case's coverage]

### Cross-Model Audit
[if Mode 3 was run]

## Additional hints
[freeform feedback from the user, not tied to a specific use case]
```

This file serves a dual purpose: it tracks review progress (so the user can resume across sessions) AND accumulates feedback that the refinement pass reads.
+``` diff --git a/skills/quality-playbook/references/review_protocols.md b/skills/quality-playbook/references/review_protocols.md index 3f3b0cb94..73975d35d 100644 --- a/skills/quality-playbook/references/review_protocols.md +++ b/skills/quality-playbook/references/review_protocols.md @@ -11,23 +11,16 @@ Before reviewing, read these files for context: 1. `quality/QUALITY.md` — Quality constitution and fitness-to-purpose scenarios -2. [Main architectural doc] -3. [Key design decisions doc] -4. [Any other essential context] +2. `quality/REQUIREMENTS.md` — Testable requirements derived during playbook generation +3. [Main architectural doc] +4. [Key design decisions doc] +5. [Any other essential context] -## What to Check +## Pass 1: Structural Review -### Focus Area 1: [Subsystem/Risk Area Name] +Read the code and report anything that looks wrong. No requirements, no focus areas — use your own knowledge of code correctness. Look for: race conditions, null pointer hazards, resource leaks, off-by-one errors, type mismatches, error handling gaps, and any code that looks suspicious. -**Where:** [Specific files and functions] -**What:** [Specific things to look for] -**Why:** [What goes wrong if this is incorrect] - -### Focus Area 2: [Subsystem/Risk Area Name] - -[Repeat for 4–6 focus areas, mapped to architecture and risk areas from exploration] - -## Guardrails +### Guardrails - **Line numbers are mandatory.** If you cannot cite a specific line, do not include the finding. - **Read function bodies, not just signatures.** Don't assume a function works correctly based on its name. @@ -35,21 +28,189 @@ Before reviewing, read these files for context: - **Grep before claiming missing.** If you think a feature is absent, search the codebase. If found in a different file, that's a location defect, not a missing feature. - **Do NOT suggest style changes, refactors, or improvements.** Only flag things that are incorrect or could cause failures. 
-## Output Format - -Save findings to `quality/code_reviews/YYYY-MM-DD-reviewer.md` +### Output For each file reviewed: -### filename.ext +#### filename.ext - **Line NNN:** [BUG / QUESTION / INCOMPLETE] Description. Expected vs. actual. Why it matters. -### Summary -- Total findings by severity -- Files with no findings +## Pass 2: Requirement Verification + +Read `quality/REQUIREMENTS.md`. For each requirement, check whether the code satisfies it. This is a pure verification pass — your only job is "does the code satisfy this requirement?" + +Do NOT also do a general code review. Do NOT look for other bugs. Do NOT evaluate code quality. Just check each requirement. + +For each requirement, report one of: +- **SATISFIED**: The code implements this requirement. Quote the specific code. +- **VIOLATED**: The code does NOT satisfy this requirement. Explain what the code does vs. what the requirement says. Quote the code. +- **PARTIALLY SATISFIED**: Some aspects implemented, others missing. Explain both. +- **NOT ASSESSABLE**: Can't be checked from the files under review. + +### Output + +For each requirement: + +#### REQ-N: [requirement text] +**Status**: SATISFIED / VIOLATED / PARTIALLY SATISFIED / NOT ASSESSABLE +**Evidence**: [file:line] — [code quote] +**Analysis**: [explanation] +[If VIOLATED] **Severity**: [impact description] + +## Pass 3: Cross-Requirement Consistency + +Compare pairs of requirements from `quality/REQUIREMENTS.md` that reference the same field, constant, range, or security policy. For each pair, check whether their constraints are mutually consistent. + +What to look for: +- **Numeric range vs bit width**: If one requirement says the valid range is [0, N) and another says the field is M bits wide, does N = 2^M? +- **Security policy propagation**: If one requirement says a CA file is configured, do all requirements about connections that should use it actually reference using it? 
+- **Validation bounds vs encoding limits**: Does a validation check in one file agree with the storage capacity in another? +- **Lifecycle consistency**: If a resource is created by one requirement's code, is it cleaned up by another's? + +For each pair that shares a concept, verify consistency against the actual code. + +### Output + +For each shared concept: + +#### Shared Concept: [name] +**Requirements**: REQ-X, REQ-Y +**What REQ-X claims**: [summary] +**What REQ-Y claims**: [summary] +**Consistency**: CONSISTENT / INCONSISTENT +**Code evidence**: [quotes from both locations] +**Analysis**: [explanation] +[If INCONSISTENT] **Impact**: [what happens when the contradiction is triggered] + +## Combined Summary + +| Source | Finding | Severity | Status | +|--------|---------|----------|--------| +| Pass 1 | [structural finding] | [severity] | BUG / QUESTION | +| Pass 2, REQ-N | [requirement violation] | [severity] | VIOLATED | +| Pass 3, REQ-X vs REQ-Y | [consistency issue] | [severity] | INCONSISTENT | + +- Total findings by pass and severity - Overall assessment: SHIP IT / FIX FIRST / NEEDS DISCUSSION ``` +### Execution requirements + +**All three passes are mandatory.** Do not consolidate passes into a single review. Each pass produces distinct findings because it uses a different lens: + +- **Pass 1** finds structural bugs (race conditions, null hazards, resource leaks) +- **Pass 2** finds requirement violations (missing behavior, spec deviations) +- **Pass 3** finds cross-requirement contradictions (inconsistent ranges, conflicting guarantees) + +**Write each pass as a clearly labeled section** in the output file. Use the headers `## Pass 1: Structural Review`, `## Pass 2: Requirement Verification`, `## Pass 3: Cross-Requirement Consistency`, and `## Combined Summary`. + +**If a pass has no findings, explain why.** Do not just write "No findings." Write what you checked and why nothing was wrong. 
For example: "Reviewed 12 functions in lib/response.js for null hazards, resource leaks, and error handling gaps. No confirmed bugs — all error paths either throw or return a well-defined default." A pass with no findings and no explanation is a pass that wasn't done. + +**Scoping for large codebases:** If the project has more than 50 requirements, Pass 2 does not need to verify every requirement against every file. Instead, focus Pass 2 on the requirements most relevant to the files being reviewed — check the requirements that reference those files or that govern the behavioral domain those files implement. The goal is depth on the files under review, not breadth across all requirements. + +**Self-check before finishing:** After writing all three passes and the combined summary, verify: (1) all three pass sections exist in the output, (2) Pass 2 references specific REQ-NNN numbers with SATISFIED/VIOLATED verdicts, (3) Pass 3 identifies at least one shared concept between requirements (even if consistent), (4) every BUG finding has a corresponding regression test in `quality/test_regression.*` (see Phase 2 below), (5) every regression test exercises the actual code path cited in the finding (see test-finding alignment check below). If any check fails, go back and complete the missing work. + +### Adversarial stance when documentation is available + +If the playbook was generated with supplemental documentation (docs_gathered/, community docs, user guides, API references), the code review must use that documentation *against* the code, not in its defense. Documentation tells you what the code is supposed to do. Your job is to find where it doesn't. + +**Do not let documentation explanations excuse code defects.** If the docs say "the library handles X gracefully" but the code doesn't check for X, that's a bug — the documentation makes it *more* of a bug, not less. A richer understanding of intent should make you *harder* on the code, not softer. 
+ +The failure mode this addresses: when models have access to documentation, they build a richer mental model of the software and become more *forgiving* of code that approximately matches that model. The documentation gives the model reasons to believe the code works, which suppresses detections. Fight this by treating documentation as the prosecution's evidence — it defines what the code promised, and your job is to find broken promises. + +### Test-finding alignment check + +For each regression test that claims to reproduce a specific finding, verify that the test actually exercises the cited code path. A test that targets a different function, a different branch, or a different failure mode than the finding it claims to reproduce is worse than no test — it creates false confidence. + +**Verification procedure:** For each regression test: +1. Read the finding: note the specific file, line number, function, and failure condition +2. Read the test: identify which function it calls and what condition it asserts +3. Confirm alignment: the test must call the function cited in the finding, trigger the specific condition the finding describes, and assert on the behavior the finding says is wrong + +If the test doesn't exercise the cited code path, either fix the test or mark the finding as UNCONFIRMED. Do not ship a regression test that passes or fails for reasons unrelated to the finding. + +### Closure mandate + +Every confirmed BUG finding must produce a regression test in `quality/test_regression.*`. The test must be an executable source file in the project's language — not a Markdown file, not prose documentation, not a comment block describing what a test would do. If the project uses Java, write a `.java` file. If Python, a `.py` file. The test must compile (or parse) and be runnable by the project's test framework. + +**No language exemptions.** If introducing failing tests before fixes is a concern, use the language's expected-failure mechanism. 
The guard must be the **earliest syntactic guard for the framework** — a decorator or annotation where idiomatic, otherwise the first executable line in the test body: + +- **Python (pytest):** `@pytest.mark.xfail(strict=True, reason="BUG-NNN: [description]")` — decorator above `def test_...():`. When the bug is present: XFAIL (expected). When the bug is fixed but marker not removed: XPASS → strict mode fails, signaling the guard should be removed. +- **Python (unittest):** `@unittest.expectedFailure` — decorator above the test method. +- **Go:** `t.Skip("BUG-NNN: [description] — unskip after applying quality/patches/BUG-NNN-fix.patch")` — first line inside the test function. Note: Go's `t.Skip` hides the test (reports SKIP, not FAIL), which is weaker than Python's xfail. +- **Rust:** `#[ignore]` attribute on the test function — the standard "don't run in default suite" mechanism. Use `#[should_panic]` only for panic-shaped bugs. +- **Java (JUnit 5):** `@Disabled("BUG-NNN: [description]")` — annotation above the test method. +- **TypeScript/JavaScript (Jest):** `test.failing("BUG-NNN: [description]", () => { ... })` +- **TypeScript/JavaScript (Vitest):** `test.fails("BUG-NNN: [description]", () => { ... })` +- **JavaScript (Mocha):** `it.skip("BUG-NNN: [description]", () => { ... })` or `this.skip()` inside the test body for conditional skipping. + +Every guard must reference the bug ID (BUG-NNN format) and the fix patch path so that someone encountering a skipped test knows how to resolve it. + +These patterns ensure every bug has an executable test that can be enabled when the fix lands, without polluting CI with expected failures. + +**TDD red/green interaction with skip guards.** During TDD verification, the red and green phases must temporarily bypass the skip guard: +- **Red phase (NEVER SKIPPED):** Remove or disable the guard, run against unpatched code. Must fail. Re-enable guard after recording result. 
**The red phase is mandatory for every confirmed bug, even when no fix patch exists.** Record `verdict: "confirmed open"` with `red_phase: "fail"` and `green_phase: "skipped"`. Do not use `verdict: "skipped"` — that value is deprecated. +- **Green phase:** Remove or disable the guard, apply fix patch, run. Must pass. Re-enable guard if fix will be reverted. If no fix patch exists, record `green_phase: "skipped"`. +- **After successful red→green:** Generate a per-bug writeup at `quality/writeups/BUG-NNN.md` (see SKILL.md File 7, "Bug writeup generation"). Record the path in `tdd-results.json` as `writeup_path`. After writing `tdd-results.json`, reopen it and verify all required fields, enum values, and no extra undocumented root keys (see SKILL.md post-write validation step). Both sidecar JSON files must use `schema_version: "1.1"`. +- **After TDD cycle:** Guard remains in committed regression test file, removed only when fix is permanently merged. + +**The only acceptable exemption** is when a regression test genuinely cannot be written — for example, the bug requires multi-threaded timing that can't be reliably reproduced, or requires an external service not available in the test environment. In that case, write an explicit exemption note in the combined summary explaining why, and include a minimal code sketch showing what you would test if you could. + +Findings without either an executable regression test or an explicit exemption note are incomplete. The combined summary must not include unresolved findings — every BUG must have closure. + +### Regression test semantic convention + +All regression tests must assert **desired correct behavior** and be marked as expected-to-fail on the current code. Do not write tests that assert the current broken behavior and pass. 
The distinction matters: + +- **Correct:** Test says "this input should produce X" → test fails because buggy code produces Y → marked `xfail`/`@Disabled`/`t.Skip` → when bug is fixed, test passes and the skip marker is removed. +- **Wrong:** Test says "this input produces Y (the buggy output)" → test passes on buggy code → when bug is fixed, test fails silently → stale test that now asserts wrong behavior. + +The `xfail(strict=True)` pattern (Python/pytest) is the gold standard: it fails if the bug is present (expected), and also fails if the bug is fixed but the `xfail` marker wasn't removed (strict). Other languages should approximate this with skip + reason. + +### Post-review closure verification + +After writing all regression tests and the combined summary, run this checklist: + +1. **Count BUGs in the combined summary.** This is the expected count. +2. **Count test functions in `quality/test_regression.*`.** This should equal or exceed the BUG count (some BUGs may need multiple tests). +3. **For each BUG row in the summary**, verify it has either: + - A `REGRESSION TEST:` line citing the test function name, OR + - An `EXEMPTION:` line explaining why no test was written +4. **If any BUG lacks both**, go back and write the test or the exemption before declaring the review complete. + +This checklist is the enforcement mechanism for the closure mandate. Without it, the mandate is aspirational — agents document bugs fully in the pass summaries but skip the regression test and move on. + +### Post-spec-audit regression tests + +The closure mandate applies to spec-audit confirmed code bugs, not just code review bugs. After the spec audit triage categorizes findings, every finding classified as "Real code bug" must get a regression test — using the same conventions as code review regression tests (executable source file, expected-failure marker, test-finding alignment). 
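The expected-failure guard convention described above can be sketched with the stdlib `@unittest.expectedFailure` variant (pytest's `xfail(strict=True)` is the stronger form). The function, bug ID, and failure mode below are hypothetical, invented for illustration only:

```python
import unittest

# Hypothetical code under test. The (invented) finding BUG-042 says
# parse_port must reject values >= 2**16, but the upper-bound check
# is missing.
def parse_port(value: int) -> int:
    if value < 0:
        raise ValueError("negative port")
    return value  # BUG-042: accepts 70000 and other out-of-range values

class PortRegression(unittest.TestCase):
    # Earliest syntactic guard: the decorator above the test method.
    # The test asserts the DESIRED behavior, so it fails on the buggy
    # code and is reported as an expected failure until the fix lands.
    @unittest.expectedFailure
    def test_rejects_out_of_range_port(self):
        # BUG-042: unguard after applying quality/patches/BUG-042-fix.patch
        with self.assertRaises(ValueError):
            parse_port(70000)
```

When the bug is fixed, the test is reported as an unexpected success, signaling that the guard should be removed by hand.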
+ +**Why this is a separate step:** Code review regression tests are written immediately after the code review, before the spec audit runs. This means spec-audit bugs are systematically orphaned — they appear in the triage report but never enter the regression test file. Across v1.3.4 runs on 8 repos, spec-audit bugs accounted for ~30% of all findings, and only 1 of 8 repos (httpx) wrote regression tests for them. + +**Procedure:** +1. After spec audit triage, read the triage summary for findings classified as "Real code bug." +2. For each, write a regression test in `quality/test_regression.*` using the same format as code review regression tests. Use the spec audit report as the source citation: `[BUG from spec_audits/YYYY-MM-DD-triage.md]`. +3. Run the test to confirm it fails (expected) or passes (needs investigation). +4. Update the cumulative BUG tracker in PROGRESS.md with the test reference. + +If the spec audit produced no confirmed code bugs, skip this step — but document that in PROGRESS.md so the audit trail is complete. + +### Cleaning up after spec audit reversals + +When the spec audit overturns a code review finding (classifies a BUG as a design choice or false positive), the corresponding regression test must be either deleted or moved to a separate file (`quality/design_behavior_tests.*`) that documents intentional behavior. A failing test that points at documented-correct behavior is worse than no test — it creates noise and erodes trust in the regression suite. + +After spec audit triage, check: does any test in `quality/test_regression.*` correspond to a finding that was reclassified as non-defect? If so, remove it from the regression file. + +### Why Three Passes Instead of Focus Areas + +Previous experiments (the QPB NSQ benchmark) showed that focus areas don't reliably improve AI code review. 
A generic "review for bugs" prompt scored 65.5%, while a playbook with 7 named focus areas scored 48.3% — the focus areas narrowed the model's attention and suppressed detections. + +The three-pass pipeline works because each pass does one thing well with no cross-contamination: +- **Pass 1** lets the model do what it's already good at (structural review, ~65% of defects) +- **Pass 2** catches individual requirement violations that structural review misses (absence bugs, spec deviations) +- **Pass 3** catches contradictions between individually-correct pieces of code (cross-file arithmetic bugs, security policy gaps) + +Experiments on the NSQ codebase showed this pipeline finding 2 of 3 defects that were invisible to all structural review conditions — with zero knowledge of the specific bugs. The defects found were a cross-file numeric mismatch (validation bound vs bit field width) and a security design gap (configured CA not propagated to outbound auth client). + ### Phase 2: Regression Tests for Confirmed Bugs After the code review produces findings, write regression tests that reproduce each BUG finding. This transforms the review from "here are potential bugs" into "here are proven bugs with failing tests." @@ -212,7 +373,7 @@ As each test runs, report a one-line status update. Keep it compact — the user Use `✓` for pass, `✗` for fail, `⧗` for in-progress. If a test fails, show one line of context (the error message or assertion that failed), not the full stack trace. The user can ask for details if they want them. -### Phase 3: Results +### Phase 6: Results After all tests complete, show a summary table and a recommendation: @@ -362,6 +523,80 @@ The number of units/records/iterations per integration test run matters: Look for `chunk_size`, `batch_size`, or similar configuration in the project to calibrate. When in doubt, 10–30 records is usually the right range for integration testing — enough to catch real issues without burning API budget. 
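The calibration heuristic above can be sketched as a small helper. The config keys are the ones named in the text; treat the exact clamping policy as an assumption for the sketch, not part of the playbook:

```python
def calibrate_record_count(config: dict) -> int:
    """Pick an integration-test record count from project config.

    Prefers the project's own batching unit when one is configured;
    otherwise falls back to the 10-30 record heuristic.
    """
    batch = config.get("chunk_size") or config.get("batch_size")
    if batch is None:
        return 20  # midpoint of the 10-30 heuristic
    # Clamp the project's batch size into the heuristic range so the
    # run stays realistic without burning API budget.
    return max(10, min(30, int(batch)))

print(calibrate_record_count({}))                   # → 20
print(calibrate_record_count({"batch_size": 500}))  # → 30
print(calibrate_record_count({"chunk_size": 4}))    # → 10
```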
+### Integration Testing for Skills and LLM-Automated Tools + +When the project under test is an AI skill, CLI tool that wraps LLM calls, or any software whose primary execution path involves invoking an AI model, the integration test protocol must include **LLM-automated integration tests** — tests that run the tool end-to-end via a command-line AI agent and structurally verify the output. + +This is distinct from standard integration tests because the system under test doesn't have a deterministic API to call. The "integration" is: install the skill into a test repo, invoke it through a CLI agent (GitHub Copilot CLI, Claude Code, or similar), and verify the output artifacts meet structural and content expectations. + +**Why this matters:** Skills and LLM tools cannot be tested by calling functions directly — their execution path goes through an AI agent that interprets instructions, reads files, and produces artifacts. The only way to test whether the skill works is to run it. Manual execution is fine for development, but a quality playbook should encode the test as a repeatable protocol. 
+
+**Protocol structure for skill/LLM integration tests:**
+
+````markdown
+## Skill Integration Test Protocol
+
+### Prerequisites
+- CLI agent installed and configured (e.g., `gh copilot`, `claude`, `npx @anthropic-ai/claude-code`)
+- Test repo prepared with skill installed at `.github/skills/SKILL.md` (or equivalent)
+- Clean `quality/` directory (no artifacts from prior runs)
+- Optional: `docs_gathered/` folder for with-docs comparison runs
+
+### Test Matrix
+
+| Test | Method | Pass Criteria |
+|------|--------|---------------|
+| Full execution | Run skill via CLI with "execute" prompt | All expected artifacts exist in `quality/` |
+| PROGRESS.md completeness | Read `quality/PROGRESS.md` | All phases checked complete, BUG tracker populated |
+| Artifact structural check | Verify each expected file | Files are non-empty, contain expected sections |
+| BUG tracker closure | Count BUG entries vs regression tests | Every BUG has a test reference or exemption |
+| Baseline vs with-docs (optional) | Run twice: without and with docs_gathered/ | With-docs run produces >= baseline requirement count |
+
+### Execution
+
+```bash
+# Install skill into test repo
+cp -r path/to/skill/.github test-repo/.github
+
+# Run via CLI agent (adapt command to your agent)
+cd test-repo
+gh copilot -p "Read .github/skills/SKILL.md and its reference files. Execute the quality playbook for this project." \
+  --model gpt-5.4 --yolo > quality_run.output.txt 2>&1
+```
+
+### Structural Verification (automated)
+
+After the run, verify output structurally:
+
+```bash
+# Required artifacts exist and are non-empty
+for f in quality/QUALITY.md quality/REQUIREMENTS.md quality/CONTRACTS.md \
+         quality/COVERAGE_MATRIX.md quality/COMPLETENESS_REPORT.md \
+         quality/PROGRESS.md quality/RUN_CODE_REVIEW.md \
+         quality/RUN_INTEGRATION_TESTS.md quality/RUN_SPEC_AUDIT.md; do
+  [ -s "$f" ] || echo "FAIL: $f missing or empty"
+done
+
+# Functional test file exists (language-appropriate name)
+ls quality/test_functional.* quality/FunctionalSpec.* quality/functional.test.* 2>/dev/null \
+  || echo "FAIL: no functional test file"
+
+# PROGRESS.md has all phases checked
+grep -c '\[x\]' quality/PROGRESS.md  # should equal total phase count
+
+# BUG tracker has entries (if bugs were found)
+grep -c '^| [0-9]' quality/PROGRESS.md
+
+# Code reviews and spec audits produced substantive files
+find quality/code_reviews -name "*.md" -size +500c | wc -l  # should be >= 1
+find quality/spec_audits -name "*triage*" -size +500c | wc -l  # should be >= 1
+```
+````
+
+**Baseline vs with-docs comparison pattern:** Run the skill twice on the same repo — once without supplemental docs, once with a `docs_gathered/` folder containing project history. Compare: requirement count, scenario count, bug count, and pipeline completion. The with-docs run should produce equal or more requirements and equal or more bugs. If the baseline outperforms the with-docs run on bug detection, that's a finding about the docs quality, not a skill failure.
+
+**When to generate this protocol:** Generate a skill integration test section in `RUN_INTEGRATION_TESTS.md` whenever the project being analyzed is a skill, a CLI tool that wraps AI calls, or a framework for building AI-powered tools. Look for: `SKILL.md` files, prompt templates, LLM client configurations, agent orchestration code, or references to AI models in the codebase.
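The requirement-count half of that comparison can be checked mechanically. A minimal sketch, assuming requirements carry the canonical `[Req:` tag used elsewhere in the playbook; scenario and bug counts would be compared the same way:

```python
def count_requirements(requirements_md: str) -> int:
    """Count requirement entries by their canonical [Req: ...] tags."""
    return requirements_md.count("[Req:")

def compare_runs(baseline_md: str, with_docs_md: str) -> str:
    """Gate: the with-docs run should produce >= the baseline requirement count."""
    base = count_requirements(baseline_md)
    docs = count_requirements(with_docs_md)
    if docs >= base:
        return f"OK: with-docs run has {docs} requirements (baseline {base})"
    # Per the pattern above: a weaker with-docs run is a finding about docs quality
    return f"FINDING: with-docs count {docs} below baseline {base}; review docs quality"

print(compare_runs("[Req: formal] [Req: inferred]", "[Req: formal]"))
```

A FINDING result here points at the gathered docs, not at the skill, matching the comparison rule above.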
+ ### Post-Run Verification Depth A run that completes without errors may still be wrong. For each integration test run, verify at multiple levels: diff --git a/skills/quality-playbook/references/spec_audit.md b/skills/quality-playbook/references/spec_audit.md index 5049fa3cd..3e1ecfdf9 100644 --- a/skills/quality-playbook/references/spec_audit.md +++ b/skills/quality-playbook/references/spec_audit.md @@ -65,6 +65,31 @@ Requirements are tagged with `[Req: tier — source]`. Weight your findings by t --- +## Pre-audit docs validation (required triage section) + +The triage report must include a `## Pre-audit docs validation` section regardless of whether `docs_gathered/` exists. This section documents what the auditors used as their factual baseline. + +**If `docs_gathered/` exists:** Spot-check the gathered docs for factual accuracy before running the audit. Stale or incorrect docs can skew audit confidence — a model that reads "the library handles X by doing Y" in the docs will rate a divergent finding higher even if the docs are wrong. + +**Quick validation procedure (5 minutes max):** +1. Pick 2–3 factual claims from `docs_gathered/` that describe specific runtime behavior (e.g., "invalid input raises ValueError", "field X defaults to Y", "format Z is not supported"). +2. Grep the source code for the cited behavior. Does the code match the docs? +3. If any claim is wrong, note it in the triage header: "docs_gathered/ contains N known inaccuracies: [list]. Findings that rely on these claims are downgraded to NEEDS REVIEW." + +**Spot-check claims about code contents must extract, not assert.** When the spec audit prompt or pre-validation includes claims like "function X handles constant Y at line Z," the triage must read the cited lines and report what they actually contain. Do not confirm a claim by checking that the function exists or that the constant is defined somewhere — confirm it by showing the exact text at the cited lines. 
Format each spot-check result as: + +``` +Claim: "vring_transport_features() preserves VIRTIO_F_RING_RESET at line 3527" +Actual line 3527: `default:` +Result: CLAIM IS FALSE — line 3527 is the default branch, not a RING_RESET case label +``` + +Spot-check claims derived from generated requirements or gathered docs (rather than from the code) are **hypotheses to test**, not facts to confirm. This rule prevents the contamination chain observed in v1.3.17 where a false spot-check claim was accepted as "accurate" without reading the actual lines, causing three auditors to inherit a hallucinated code-presence claim. + +**If `docs_gathered/` does not exist:** State this explicitly: "No supplemental docs provided. Auditors relied on in-repo specs and code only." This confirms the absence is intentional, not an oversight. + +This section fires in every triage, not just when docs are present. In v1.3.5 cross-repo testing, it only fired in 1/8 repos because it was conditional — making it required ensures the audit trail always documents the factual baseline. + ## Running the Audit 1. Give the identical prompt to three AI tools @@ -73,7 +98,18 @@ Requirements are tagged with `[Req: tier — source]`. Weight your findings by t ## Triage Process -After all three models report, merge findings: +After all three models report, merge findings. + +**Log the effective council size.** If a model did not return a usable report (timeout, empty output, refusal), record this in the triage header: + +``` +## Council Status +- Model A: Fresh report received (YYYY-MM-DD) +- Model B: Fresh report received (YYYY-MM-DD) +- Model C: TIMEOUT — no usable report. Effective council: 2/3. +``` + +When the effective council is 2/3, downgrade the confidence tier: "All three" becomes impossible, "Two of three" becomes the ceiling. When the effective council is 1/3, all findings are "Needs verification" regardless of how confident that single model is. 
Do not silently substitute stale reports from prior runs — if a model didn't produce a fresh report for this run, it didn't participate.
 
 | Confidence | Found By | Action |
 |------------|----------|--------|
@@ -81,10 +117,36 @@ After all three models report, merge findings:
 | High | Two of three | Likely real — verify and fix |
 | Needs verification | One only | Could be real or hallucinated — deploy verification probe |
 
+**When the effective council is 2/3 or less:** Distinguish single-auditor findings from multi-auditor findings explicitly in the triage. With a 2/3 council, a finding from both present auditors has "High" confidence. A finding from only one present auditor has "Needs verification" — it cannot be promoted to confirmed BUG without a verification probe, because the missing auditor might have contradicted it. Do not treat all findings as equivalent just because the council is incomplete.
+
+In the triage summary table, add a column for auditor agreement: "2/2 present", "1/2 present", etc. This makes the confidence tier visible and auditable.
+
+**Incomplete council gate for enumeration/dispatch checks.** If the effective council is less than 3/3 and the run includes whitelist/enumeration/dispatch-function checks (claims about which constants a function handles), the audit may not conclude "no confirmed defects" for those checks without executed mechanical proof. Check whether a `quality/mechanical/*_cases.txt` artifact exists for each relevant function. If it does and shows the constant is present, the claim is confirmed. If it does and shows the constant is absent, the claim is false regardless of what any auditor wrote. If no mechanical artifact exists, generate one before closing the enumeration check. This rule exists because v1.3.18 had an effective council of 1/3, and the single model's triage fabricated line contents for enumeration claims — a mechanical artifact would have caught the contradiction.
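The mechanical proof this gate calls for can come from a short extraction script rather than any auditor's prose. A minimal sketch for C-style `switch` dispatch; the sample constants reuse this document's VIRTIO example, and a real run would save the output under `quality/mechanical/` as described:

```python
import re

def extract_case_labels(source: str) -> set:
    """List the constants actually present as case labels (extracted, not asserted)."""
    return set(re.findall(r"case\s+([A-Z_][A-Z0-9_]*)\s*:", source))

def enumeration_gaps(defined_constants: set, source: str) -> set:
    """Two-list comparison: defined constants with no matching case label."""
    return defined_constants - extract_case_labels(source)

dispatch_source = """
switch (feature) {
case VIRTIO_F_VERSION_1: break;
case VIRTIO_F_ACCESS_PLATFORM: break;
default: break;  /* everything else is cleared */
}
"""
defined = {"VIRTIO_F_VERSION_1", "VIRTIO_F_ACCESS_PLATFORM", "VIRTIO_F_RING_RESET"}
print(enumeration_gaps(defined, dispatch_source))  # {'VIRTIO_F_RING_RESET'}
```

Because the code-side list is extracted from the source, a model cannot close the check by asserting that a constant is handled; a non-empty gap set is mechanical evidence either way.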
+ ### The Verification Probe When models disagree on factual claims, deploy a read-only probe: give one model the disputed claim and ask it to read the code and report ground truth. Never resolve factual disputes by majority vote — the majority can be wrong about what code actually does. +**Verification probes must produce executable evidence.** Prose reasoning is not sufficient for either confirmations or rejections. Every verification probe must produce a test assertion that mechanically proves the determination: + +**For rejections** (finding is false positive): Write an assertion that PASSES, proving the auditor's claim is wrong: +```python +# Rejection proof: function X does check for null at line 247 +assert "if (ptr == NULL)" in source_of("X"), "X has null check at line 247" +``` +If you cannot write a passing assertion, **do not reject the finding**. The inability to produce mechanical proof is itself evidence that the finding may be real. + +**For confirmations** (finding is a real bug): Write an assertion that FAILS (expected-failure), proving the bug exists: +```python +# Confirmation proof: RING_RESET is not a case label in the whitelist +assert "case VIRTIO_F_RING_RESET:" in source_of("vring_transport_features"), \ + "RING_RESET should be in the switch but is not — cleared by default at line 3527" +``` + +**Every assertion must cite an exact line number** for the evidence it references. Not "lines 3527-3528" but "line 3527: `default:`" — showing what the line actually contains. + +**Why this rule exists:** In v1.3.16 virtio testing, the triage received a correct minority finding that VIRTIO_F_RING_RESET was missing from a switch/case whitelist. The triage performed a verification probe that claimed lines 3527-3528 "explicitly preserve VIRTIO_F_RING_RESET" — but those lines contained the `default:` branch. The probe hallucinated compliance. 
Had it been required to write `assert "case VIRTIO_F_RING_RESET:" in source`, the assertion would have failed, exposing the hallucination. Requiring executable evidence makes hallucinated rejections self-defeating. + ### Categorize Each Confirmed Finding - **Spec bug** — Spec is wrong, code is fine → update spec @@ -96,6 +158,45 @@ When models disagree on factual claims, deploy a read-only probe: give one model That last category is the bridge between the spec audit and the test suite. Every confirmed finding not already covered by a test should become one. +### Legacy and historical scripts + +Scripts documented as "historical," "deprecated," or "not part of current workflow" are sometimes downgraded during triage on the theory that they don't affect current operations. This is correct when the script genuinely never runs. But if the script's bug has already materialized in canonical artifacts — duplicate entries in a published file, stale data in a checked-in cache, incorrect mappings that downstream tools consume — the bug is not historical. It's a live defect in the repository's published state. + +**Rule: If a legacy script's bug is already visible in canonical artifacts, promote it to confirmed BUG regardless of the script's status.** The script may be historical, but the damage it left behind is current. The regression test should target the artifact (the duplicate entry, the stale mapping), not the script — because the artifact is what users encounter. + +This rule exists because v1.3.5 bootstrap runs on QPB found duplicate changelog entries and stale cache mappings produced by a "historical" script. Both triages downgraded the findings because the script was historical. But the duplicate entries were already in the published library, visible to every user. + +### Cross-artifact consistency check + +After triage, compare the spec audit findings against the code review findings from `quality/code_reviews/`. 
If the code review and spec audit disagree on the same factual claim (one says a bug is real, the other calls it a false positive), flag the disagreement and deploy a verification probe. The code review and spec audit use different methods (structural reading vs. spec comparison), so disagreements are informative, not errors. But a factual contradiction about what the code actually does needs to be resolved before either report is trusted. + +## Detecting partial sessions and carried-over artifacts + +### Partial session detection + +A session that terminates early (timeout, context exhaustion, crash) may generate scaffolding (directory structure, empty templates) without producing the actual review or audit content. The retry mechanism in the run script can regenerate scaffolding but cannot recover the analytical work. + +**After any session completes, check for partial results:** +1. If `quality/code_reviews/` exists but contains no `.md` files with actual findings (or only contains template headers with no BUG/VIOLATED/INCONSISTENT entries), the code review did not run. Mark this as FAILED in PROGRESS.md, not as "complete with no findings." +2. If `quality/spec_audits/` exists but contains no triage summary, the spec audit did not run. +3. If `quality/test_regression.*` exists but contains only imports and no test functions, regression tests were not written. + +A partial session is not a "clean run with no findings" — it's a failed run that needs to be re-executed. PROGRESS.md should record this clearly: "Phase 6: FAILED — code review session terminated before producing findings. Re-run required." + +### Provenance headers on carried-over artifacts + +When a new playbook run finds existing artifacts from a previous run (after archiving), or when artifacts survive from a failed session, they must carry provenance headers so readers know their origin. 
+
+**If any artifact was NOT generated fresh in the current run**, add a provenance header:
+
+```markdown
+> **PROVENANCE:** Carried over from the YYYY-MM-DD run (skill vX.Y.Z). Not regenerated in the current run; treat as archival, not as fresh findings.
+```
+
+This prevents the failure mode observed in v1.3.4 where express and zod silently preserved v1.3.3 code reviews and spec audits without marking them as archival. Users reading those artifacts assumed they were fresh v1.3.4 results.
+
 ## Fix Execution Rules
 
 - Group fixes by subsystem, not by defect number
@@ -130,6 +231,10 @@ Different models have different audit strengths. In practice:
 
 The specific models that excel will change over time. The principle holds: use multiple models with different strengths, and always include the four guardrails.
 
+### Minimum model capability
+
+The audit protocol requires reading function bodies, citing line numbers, grepping before claiming missing, and classifying defect types. Lightweight or speed-optimized models (Haiku-class, GPT-4o-mini-class) are not suitable as auditors. They tend to skim rather than read, skip the grep step, and produce shallow or empty reports ("No defects found") on codebases where stronger models find real bugs. Use models with strong code-reading ability for all three auditor slots. A weak auditor doesn't just miss findings — it reduces the Council from three independent perspectives to two.
+
 ## Tips for Writing Scrutiny Areas
 
 The scrutiny areas are the most important part of the prompt. Generic questions like "check if the code matches the spec" produce generic answers. Specific questions that name functions, files, and edge cases produce specific findings.
diff --git a/skills/quality-playbook/references/verification.md b/skills/quality-playbook/references/verification.md
index 66a0aaff7..1f553d463 100644
--- a/skills/quality-playbook/references/verification.md
+++ b/skills/quality-playbook/references/verification.md
@@ -1,4 +1,4 @@
-# Verification Checklist (Phase 3)
+# Verification Checklist (Phase 6: Verify)
 
 Before declaring the quality playbook complete, check every benchmark below.
If any fails, go back and fix it. @@ -53,6 +53,8 @@ Run the test suite using the project's test runner: **Check for both failures AND errors.** Most test frameworks distinguish between test failures (assertion errors) and test errors (setup failures, missing fixtures, import/resolution errors, exceptions during initialization). Both are broken tests. A common mistake: generating tests that reference shared fixtures or helpers that don't exist. These show up as setup errors, not assertion failures — but they are just as broken. +**Expected-failure (xfail) tests do not count against this benchmark.** Regression tests in `quality/test_regression.*` use expected-failure markers (`@pytest.mark.xfail(strict=True)`, `@Disabled`, `t.Skip`, `#[ignore]`) to confirm that known bugs are still present. These tests are *supposed* to fail — that's the point. The "zero failures and zero errors" benchmark applies to `quality/test_functional.*` (the functional test suite), not to `quality/test_regression.*` (the bug confirmation suite). If your test runner reports failures from xfail-marked regression tests, that's correct behavior, not a benchmark violation. If an xfail test unexpectedly *passes*, that means the bug was fixed and the xfail marker should be removed — treat that as a finding to investigate, not a test failure. + After running, check: - All tests passed — count must equal total test count - Zero failures @@ -70,7 +72,7 @@ Run the project's full test suite (not just your new tests). Your new files shou Every scenario should mention actual function names, file names, or patterns that exist in the codebase. Grep for each reference to confirm it exists. -If working from non-formal requirements, verify that each scenario and test includes a requirement tag using the canonical format: `[Req: formal — README §3]`, `[Req: inferred — from validate_input() behavior]`, `[Req: user-confirmed — "must handle empty input"]`. 
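Applying the benchmark with that split can be done mechanically when tallying results. A minimal sketch; the `(file, name, outcome)` tuples are an assumed shape, not a real test-runner API:

```python
def benchmark_violations(results):
    """Apply the zero-failure benchmark to the functional suite only.

    `results` holds (test_file, test_name, outcome) tuples, where outcome is
    one of "pass", "fail", "error", "xfail" (expected failure), or "xpass".
    """
    findings = []
    for test_file, name, outcome in results:
        if "test_regression" in test_file:
            # Regression suite: xfail is the healthy state; an unexpected pass
            # means the bug may have been fixed and the marker needs review.
            if outcome == "xpass":
                findings.append(f"INVESTIGATE: {name} xpassed")
        elif outcome in ("fail", "error"):
            # Functional suite: failures AND errors both break the benchmark.
            findings.append(f"BENCHMARK VIOLATION: {name} ({outcome})")
    return findings
```

Note that an `error` outcome in the functional suite counts the same as a failure, matching the failures-and-errors rule above, while a failing xfail-marked regression test produces no finding at all.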
Inferred requirements should be flagged for user review in Phase 4. +If working from non-formal requirements, verify that each scenario and test includes a requirement tag using the canonical format: `[Req: formal — README §3]`, `[Req: inferred — from validate_input() behavior]`, `[Req: user-confirmed — "must handle empty input"]`. Inferred requirements should be flagged for user review in Phase 7. ### 11. RUN_CODE_REVIEW.md Is Self-Contained @@ -93,6 +95,158 @@ If any field name, count, or type is wrong, fix it before proceeding. The table The definitive audit prompt should work when pasted into Claude Code, Cursor, and Copilot without modification (except file reference syntax). +### 14. Structured Output Schemas Are Valid and Conformant + +Verify that `RUN_TDD_TESTS.md` and `RUN_INTEGRATION_TESTS.md` both instruct the agent to produce: +- JUnit XML output using the framework's native reporter (pytest `--junitxml`, gotestsum `--junitxml`, Maven Surefire reports, `jest-junit`, `cargo2junit`) +- A sidecar JSON file (`tdd-results.json` or `integration-results.json`) in `quality/results/` + +Check that each protocol's JSON schema includes all mandatory fields: +- **tdd-results.json:** `schema_version`, `skill_version`, `date`, `project`, `bugs`, `summary`. Per-bug: `id`, `requirement`, `red_phase`, `green_phase`, `verdict`, `fix_patch_present`, `writeup_path`. +- **integration-results.json:** `schema_version`, `skill_version`, `date`, `project`, `recommendation`, `groups`, `summary`, `uc_coverage`. Per-group: `group`, `name`, `use_cases`, `result`. + +Verify that the protocol does NOT contain flat command-list schemas (a `"results"` or `"commands_run"` array without `"groups"` is non-conformant). 
Verify that verdict/result enum values use only the allowed values defined in SKILL.md (e.g., `"TDD verified"`, `"red failed"`, `"green failed"`, `"confirmed open"` for TDD verdicts; `"pass"`, `"fail"`, `"skipped"`, `"error"` for integration results; `"SHIP"`, `"FIX BEFORE MERGE"`, `"BLOCK"` for recommendations). The TDD verdict `"skipped"` is deprecated — use `"confirmed open"` with `red_phase: "fail"` and `green_phase: "skipped"` instead. The TDD summary must include a `confirmed_open` count alongside `verified`, `red_failed`, and `green_failed`. + +Both sidecar JSON templates must use `schema_version: "1.1"` (v1.1 change: `verdict: "skipped"` deprecated in favor of `"confirmed open"`). Both protocols must include a **post-write validation step** instructing the agent to reopen the sidecar JSON after writing it and verify required fields, enum values, and no extra undocumented root keys. + +### 15. Patch Validation Gate Is Executable + +For each confirmed bug with patches, verify: +1. The `git apply --check` commands specified in the patch validation gate use the correct patch paths (`quality/patches/BUG-NNN-*.patch`) +2. The compile/syntax check command matches the project's actual build system — not a generic placeholder +3. For interpreted languages (Python, JavaScript), the gate specifies the appropriate syntax check (`python -m py_compile`, `node --check`, `pytest --collect-only`, or equivalent) +4. The gate includes a temporary worktree or stash-and-revert instruction to comply with the source boundary rule + +### 16. Regression Test Skip Guards Are Present + +Grep `quality/test_regression.*` for the language-appropriate skip/xfail mechanism. 
Every test function must have a guard: +- Python: `@pytest.mark.xfail` or `@unittest.expectedFailure` +- Go: `t.Skip(` +- Java: `@Disabled` +- Rust: `#[ignore]` +- TypeScript/JavaScript: `test.failing(`, `test.fails(`, or `it.skip(` + +A regression test without a skip guard will cause unexpected failures when the test suite runs on unpatched code. Every guard must reference the bug ID (BUG-NNN format) and the fix patch path. + +### 17. Integration Group Commands Pass Pre-Flight Discovery + +For each integration test group command in `RUN_INTEGRATION_TESTS.md`, verify that the command discovers at least one test using the framework's dry-run mode (`pytest --collect-only`, `go test -list`, `vitest list`, `jest --listTests`, `cargo test -- --list`). A group whose command fails discovery will produce a `covered_fail` result that masks a selector bug as a code bug. If a command cannot be validated (no dry-run mode available), note the limitation. + +### 18. Version Stamps Present on All Generated Files + +Grep every generated Markdown file in `quality/` for the attribution line: `Generated by [Quality Playbook]`. Grep every generated code file for `Generated by Quality Playbook`. Every file must have the stamp with the correct version number. Files without stamps are not traceable to the tool and version that created them. **Exemptions:** sidecar JSON files (use `skill_version` field), JUnit XML files (framework-generated), and `.patch` files (stamp would break `git apply`). For Python files with shebang or encoding pragma, verify the stamp comes after the pragma, not before. + +### 19. Enumeration Completeness Checks Performed + +Verify that the code review (Pass 1 and Pass 2) performed mechanical two-list enumeration checks wherever the code uses `switch`/`case`, `match`, or if-else chains to dispatch on named constants. 
For each such check, the review must show: (a) the list of constants defined in headers/enums/specs, (b) the list of case labels actually present in the code, (c) any gaps. A review that claims "the whitelist covers all values" or "all cases are handled" without showing the two-list comparison is non-conformant — this is the specific hallucination pattern the check prevents. + +### 20. Bug Writeups Generated for TDD-Verified Bugs + +For each bug with `verdict: "TDD verified"` in `tdd-results.json`, verify that a corresponding `quality/writeups/BUG-NNN.md` file exists and that `tdd-results.json` has a non-null `writeup_path` for that bug. Each writeup must include: summary, spec reference, code citation, observable consequence, fix diff, and test description. A TDD-verified bug without a writeup is incomplete. + +### 21. Triage Verification Probes Include Executable Evidence + +Open the triage report (`quality/spec_audits/YYYY-MM-DD-triage.md`). For every finding that was confirmed or rejected via a verification probe, verify that the triage entry includes a test assertion (not just prose reasoning). Rejections must include a PASSING assertion proving the finding is wrong. Confirmations must include a FAILING assertion proving the bug exists. Every assertion must cite an exact line number. A triage decision based on prose reasoning alone ("lines 3527-3528 explicitly preserve X") without a mechanical assertion is non-conformant. + +### 22. Enumeration Lists Extracted From Code, Not Copied From Requirements + +When the code review includes an enumeration check (e.g., "case labels present in function X"), verify that the code-side list includes per-item line numbers from the actual source. If the list matches the requirements list word-for-word without line numbers, the enumeration was likely copied rather than extracted and must be redone. 
Also verify that the triage pre-audit spot-checks report the actual contents of cited lines ("line 3527 contains `default:`") rather than merely confirming claims ("line 3527 preserves RING_RESET").
+
+### 23. Mechanical Verification Artifacts Exist and Pass Integrity Check
+
+For every contract or requirement that asserts a function handles/preserves/dispatches a set of named constants (feature bits, enum values, opcode tables), verify that a corresponding `quality/mechanical/*_cases.txt` file exists and was generated by a non-interactive shell pipeline. Contracts that reference dispatch-function coverage without citing a mechanical artifact are non-conformant.
+
+**Integrity check (mandatory):** Run `bash quality/mechanical/verify.sh`. This script re-executes the same extraction commands that generated each mechanical artifact and diffs the results. If ANY diff is non-empty, the artifact was tampered with — the model may have written expected output instead of capturing actual shell output. A mismatched artifact must be regenerated by re-running the extraction command (not by editing the file). This check exists because in v1.3.19, the model executed the correct awk/grep command but wrote a fabricated 9-line output (including a hallucinated `case VIRTIO_F_RING_RESET:`) to the file, when the actual command only produces 8 lines.
+
+### 24. Source-Inspection Regression Tests Execute (No `run=False`)
+
+Grep `quality/test_regression.*` for `run=False` (Python), `t.Skip` with a source-inspection comment, or equivalent skip mechanisms. Any regression test whose purpose is source-structure verification (string presence in function bodies, case label existence, enum extraction) must execute — it must NOT use `run=False`. These tests are safe, deterministic string-match operations. An `xfail(strict=True)` test that actually fails reports as XFAIL (expected), which is correct behavior.
A source-inspection test with `run=False` is the worst possible state: the correct check exists but never fires. + +### 25. Contradiction Gate Passed (Executed Evidence vs. Prose) + + +Verify that no executed artifact contradicts a prose artifact at closure. Specifically: (a) if any `quality/mechanical/*` file shows a constant as absent, no prose artifact (`CONTRACTS.md`, `REQUIREMENTS.md`, code review, triage) may claim it is present; (b) if any regression test with `xfail` actually fails (XFAIL), `BUGS.md` may not claim that bug is "fixed in working tree" without a commit reference; (c) if TDD traceability shows a red-phase failure, the triage may not claim the corresponding code is compliant. Any contradiction must be resolved before closure. + +### 26. Version Stamp Consistency + +Read the `version:` field from the SKILL.md metadata (in `.github/skills/SKILL.md`). Check every generated artifact: PROGRESS.md's `Skill version:` field, every `> Generated by` attribution line, every code file header stamp, and every sidecar JSON `skill_version` field. Every version stamp must match the SKILL.md metadata exactly. A single mismatch is a benchmark failure. This check exists because in v1.3.21 benchmarking, 5 of 9 repos had version stamps from older skill versions due to a hardcoded template. + +### 27. Mechanical Directory Conformance + +If `quality/mechanical/` exists, it must contain at minimum a `verify.sh` file. An empty `quality/mechanical/` directory is non-conformant. If no dispatch-function contracts exist, the directory should not exist — instead record `Mechanical verification: NOT APPLICABLE` in PROGRESS.md. If the directory exists with extraction artifacts, `verify.sh` must include one verification block per saved file (not just one). A verify.sh that checks only one artifact when multiple exist is incomplete. + +### 28. TDD Artifact Closure + +If `quality/BUGS.md` contains any confirmed bugs, `quality/results/tdd-results.json` is mandatory. 
If any bug has a red-phase result, `quality/TDD_TRACEABILITY.md` is also mandatory. Zero-bug repos may omit both files. For repos where TDD cannot execute, tdd-results.json must exist with `verdict: "deferred"` and a `notes` field explaining why. + +### 29. Triage-to-BUGS.md Sync + +After spec audit triage, every finding confirmed as a code bug must appear in `quality/BUGS.md`. A triage report with confirmed code bugs and no corresponding BUGS.md entries is non-conformant. If BUGS.md does not exist when confirmed bugs exist, it must be created. + +### 30. Writeups for All Confirmed Bugs + +Every confirmed bug (TDD-verified or confirmed-open) must have a writeup at `quality/writeups/BUG-NNN.md`. For confirmed-open bugs without fix patches, the writeup notes the absence of fix/green-phase evidence. A run with confirmed bugs and no writeups directory is incomplete. + +### 31. Phase 4 Triage File Exists + +Phase 4 is not complete until a triage file exists at `quality/spec_audits/YYYY-MM-DD-triage.md`. If only auditor reports exist with no triage synthesis, Phase 4 is incomplete. + +### 32. Seed Checks Executed Mechanically (Continuation Mode) + +When `previous_runs/` exists and Phase 0 runs, verify that `quality/SEED_CHECKS.md` was generated with one entry per unique bug from prior runs. Each seed must have a mechanical verification result (FAIL = bug still present, PASS = bug fixed) obtained by actually running the assertion — not by reading prose from the prior run. If a seed's regression test exists in a prior run, the assertion must be re-executed against the current source tree. A seed marked FAIL without executing the assertion is non-conformant. This benchmark only applies when continuation mode is active (prior runs exist). + +### 33. 
Convergence Status Recorded in PROGRESS.md (Continuation Mode) + +When Phase 0 runs, verify that PROGRESS.md contains a `## Convergence` section with: run number, seed count, net-new bug count, and a CONVERGED/NOT CONVERGED verdict. The net-new count must equal the number of bugs in BUGS.md that don't match any seed by file:line. A missing convergence section when `SEED_CHECKS.md` exists is non-conformant. This benchmark only applies when continuation mode is active. + +### 34. BUGS.md Always Exists + +Every completed run must produce `quality/BUGS.md`. If the run confirmed source-code bugs, BUGS.md must list them. If the run found zero source-code bugs, BUGS.md must contain a `## Summary` with a positive assertion: "No confirmed source-code bugs found" with counts of candidates evaluated and eliminated. A completed run (Phase 5 marked complete) with no BUGS.md is non-conformant. This benchmark exists because in v1.3.22 benchmarking, express completed all phases with zero source bugs but produced no BUGS.md, making it ambiguous whether the file was intentionally omitted or accidentally skipped. + +### 35. Immediate Mechanical Integrity Gate (Phase 2a) + +If `quality/mechanical/` exists, verify that `bash quality/mechanical/verify.sh` was executed immediately after each `*_cases.txt` was written — before any contract, requirement, or triage artifact cites the extraction. Evidence: `quality/results/mechanical-verify.log` and `quality/results/mechanical-verify.exit` exist, and the exit file contains `0`. If these receipt files are missing or the exit code is non-zero, the mechanical extraction was not verified at the point of creation. This benchmark exists because v1.3.23 deferred verification to Phase 6, allowing downstream artifacts (CONTRACTS.md, REQUIREMENTS.md, triage probes) to build on a forged extraction for the entire run before the mismatch was (not) caught. + +### 36. 
Mechanical Artifacts Not Used as Evidence in Triage Probes + +Grep all triage and verification probe files (`quality/spec_audits/*`) for `open('quality/mechanical/` or `cat quality/mechanical/`. If any probe reads a `quality/mechanical/*.txt` file as sole evidence for what a source file contains, it is circular verification and the benchmark fails. Probes must read the source file directly or re-execute the extraction pipeline. This benchmark exists because v1.3.23 Probe C validated the forged mechanical artifact instead of the source code, passing with fabricated data. + +### 37. Phase 6 Mechanical Closure Uses Bash (Not Python Substitution) + +If `quality/mechanical/` exists, verify that Phase 6 ran `bash quality/mechanical/verify.sh` as a literal shell command — not a Python script reading the artifact file. Evidence: `quality/results/mechanical-verify.log` contains output from the bash script (lines like "OK: ..." or "MISMATCH: ..."), not Python tracebacks or `pathlib` output. PROGRESS.md must include a `## Phase 6 Mechanical Closure` heading with the recorded stdout and exit code. This benchmark exists because v1.3.23 substituted Python `Path.read_text()` for `bash verify.sh`, creating a circular check that passed despite the artifact being fabricated. + +### 38. Individual Auditor Report Artifacts Exist + +If Phase 4 (spec audit) ran, verify that individual auditor report files exist at `quality/spec_audits/YYYY-MM-DD-auditor-N.md` (one per auditor), not just the triage synthesis. A single triage file without individual reports conflates discovery with reconciliation. This benchmark exists to ensure pre-reconciliation findings are preserved for independent verification. + +### 39. BUGS.md Uses Canonical Heading Format + +Every confirmed bug in BUGS.md must use the heading level `### BUG-NNN`. Grep for `^### BUG-` and count; grep for other bug heading patterns (`^## BUG-`, `^\*\*BUG-`, `^- BUG-`) and verify zero matches. 
Inconsistent heading levels cause machine-readable counts to disagree with the document. + +### 40. Artifact File-Existence Gate Passed + +Before Phase 5 is marked complete, verify that all required artifacts exist as files on disk — not just referenced in PROGRESS.md. Required files: BUGS.md, REQUIREMENTS.md, QUALITY.md, PROGRESS.md, COVERAGE_MATRIX.md, COMPLETENESS_REPORT.md. If Phase 3 ran: at least one file in code_reviews/. If Phase 4 ran: at least one auditor file and a triage file in spec_audits/. If Phase 0 or 0b ran: SEED_CHECKS.md as a standalone file. If confirmed bugs exist: tdd-results.json in results/. This benchmark exists because v1.3.24 benchmarking showed express writing a terminal gate section to PROGRESS.md claiming 1 confirmed bug, but BUGS.md, code review files, and spec audit files were never written to disk. + +### 41. Sidecar JSON Post-Write Validation + +After `tdd-results.json` and/or `integration-results.json` are written, verify that each file contains all required keys with conformant values. For `tdd-results.json`: required root keys are `schema_version`, `skill_version`, `date`, `project`, `bugs`, `summary`. Each `bugs` entry must have `id`, `requirement`, `red_phase`, `green_phase`, `verdict`, `fix_patch_present`, `writeup_path`. The `summary` must include `confirmed_open`. For `integration-results.json`: required root keys are `schema_version`, `skill_version`, `date`, `project`, `recommendation`, `groups`, `summary`, `uc_coverage`. Both must have `schema_version: "1.1"`. A sidecar JSON with missing required keys, non-standard root keys, or invalid enum values is non-conformant. This benchmark exists because v1.3.25 benchmarking showed 6 of 8 repos with non-conformant sidecar JSON — httpx invented an alternate schema, serde used legacy shape, javalin omitted `summary` and per-bug fields, express used invalid phase values, and others used invalid verdict/result enum values. + +### 42. 
Script-Verified Closure Gate Passed + +Before Phase 5 is marked complete, `quality_gate.sh` must be executed from the project root and must exit 0. The script's full output must be saved to `quality/results/quality-gate.log`. A Phase 5 completion with no `quality-gate.log` or with a log showing FAIL results is non-conformant. This benchmark exists because v1.3.21–v1.3.25 relied entirely on model self-attestation for artifact conformance checks, and benchmarking showed persistent non-compliance (heading format, sidecar schema, use case identifiers, version stamps) that a script catches mechanically. + +### 43. Canonical Use Case Identifiers Present + +REQUIREMENTS.md must contain use cases labeled with canonical identifiers in the format `UC-01`, `UC-02`, etc. Grep for `UC-[0-9]` and count matches. A repo with use case content but no canonical identifiers is non-conformant. This benchmark exists because v1.3.25 benchmarking showed 7 of 8 repos with use case sections but no machine-readable identifiers — downstream tooling cannot count or cross-reference use cases without a canonical format. + +### 44. Regression-Test Patches Exist for Every Confirmed Bug + +For every confirmed bug (any BUG-NNN entry in BUGS.md), verify that `quality/patches/BUG-NNN-regression-test.patch` exists. A confirmed bug without a regression-test patch is incomplete — the patch is the strongest independent evidence that the bug exists. Fix patches (`BUG-NNN-fix.patch`) are optional but strongly encouraged for simple fixes. This benchmark exists because v1.3.25 and v1.3.26 benchmarking showed 4/8 repos with 0 patch files despite having confirmed bugs, and the writeups described what fixes should look like without generating actual patch files. + +### 45. Writeup Inline Fix Diffs + +Every writeup at `quality/writeups/BUG-NNN.md` must contain a ` ```diff ` fenced code block with the proposed fix in unified diff format. This is section 6 ("The fix") of the writeup template. 
A writeup that says "see patch file" or "no fix patch included" without an inline diff is incomplete — the inline diff is what makes the writeup actionable for a maintainer reading just the writeup without access to the patch directory. This benchmark exists because v1.3.27 benchmarking showed virtio producing 4 writeups with 0 inline diffs despite having fix patches in `quality/patches/`. The model wrote prose descriptions of the fix instead of pasting the actual diff. + ## Quick Checklist Format Use this as a final sign-off: @@ -112,3 +266,36 @@ Use this as a final sign-off: - [ ] Integration test quality gates were written from a Field Reference Table (not memory) - [ ] Integration tests have specific pass criteria - [ ] Spec audit prompt is copy-pasteable and uses `[Req: tier — source]` tag format +- [ ] Structured output schemas include all mandatory fields and valid enum values +- [ ] Patch validation gate uses correct commands for the project's build system +- [ ] Every regression test has a skip/xfail guard referencing the bug ID +- [ ] Integration group commands pass pre-flight discovery (dry-run finds tests) +- [ ] Every generated file has a version stamp with correct version number +- [ ] Enumeration completeness checks show two-list comparisons (not just assertions of coverage) +- [ ] Every TDD-verified bug has a writeup at `quality/writeups/BUG-NNN.md` +- [ ] Triage verification probes include test assertions (not just prose) for confirmations and rejections +- [ ] Enumeration code-side lists include per-item line numbers (not copied from requirements) +- [ ] Dispatch-function contracts cite `quality/mechanical/` artifacts (not hand-written lists) +- [ ] `bash quality/mechanical/verify.sh` passes (artifacts match re-extracted output) +- [ ] Source-inspection regression tests execute (no `run=False` for string-match tests) +- [ ] No executed artifact contradicts any prose artifact at closure (contradiction gate passed) +- [ ] All generated artifact 
version stamps match SKILL.md metadata version exactly +- [ ] `quality/mechanical/` is either absent (no dispatch contracts) or contains verify.sh + all extraction artifacts +- [ ] If BUGS.md has confirmed bugs: tdd-results.json exists (mandatory); TDD_TRACEABILITY.md exists if any bug has red-phase result +- [ ] Every confirmed bug in triage appears in BUGS.md (triage-to-BUGS.md sync) +- [ ] Every confirmed bug (TDD-verified or confirmed-open) has a writeup at `quality/writeups/BUG-NNN.md` +- [ ] Phase 4 has a triage file at `quality/spec_audits/YYYY-MM-DD-triage.md` +- [ ] (Continuation mode) Seed checks in `SEED_CHECKS.md` were executed mechanically, not inferred from prose +- [ ] Mechanical verification receipt files exist (`mechanical-verify.log` + `mechanical-verify.exit`) when `quality/mechanical/` exists +- [ ] No triage probe reads `quality/mechanical/*.txt` as sole evidence for source code contents +- [ ] Phase 6 mechanical closure used `bash verify.sh` (not Python substitution) +- [ ] Individual auditor reports exist at `quality/spec_audits/*-auditor-N.md` (not just triage) +- [ ] All BUGS.md bug headings use `### BUG-NNN` format +- [ ] All required artifact files exist on disk before Phase 5 marked complete (not just referenced in PROGRESS.md) +- [ ] (Continuation mode) PROGRESS.md contains `## Convergence` section with net-new count and verdict +- [ ] `quality/BUGS.md` exists (zero-bug runs include a summary of candidates evaluated and eliminated) +- [ ] Sidecar JSON files (`tdd-results.json`, `integration-results.json`) contain all required keys with `schema_version: "1.1"` +- [ ] `quality_gate.sh` was executed and exited 0; output saved to `quality/results/quality-gate.log` +- [ ] REQUIREMENTS.md contains canonical use case identifiers (`UC-01`, `UC-02`, etc.) 
+- [ ] Every confirmed bug has `quality/patches/BUG-NNN-regression-test.patch` +- [ ] Every writeup has an inline fix diff (` ```diff ` block in section 6) From ee1e4f916e033c26d86b27ca5f89ee919be6e442 Mon Sep 17 00:00:00 2001 From: Andrew Stellman Date: Wed, 15 Apr 2026 11:03:53 -0400 Subject: [PATCH 2/4] Update docs --- docs/README.agents.md | 1 + docs/README.skills.md | 2 +- 2 files changed, 2 insertions(+), 1 deletion(-) diff --git a/docs/README.agents.md b/docs/README.agents.md index fbafa8aec..ac8661a46 100644 --- a/docs/README.agents.md +++ b/docs/README.agents.md @@ -164,6 +164,7 @@ See [CONTRIBUTING.md](../CONTRIBUTING.md#adding-agents) for guidelines on how to | [Python MCP Server Expert](../agents/python-mcp-expert.agent.md)
[![Install in VS Code](https://img.shields.io/badge/VS_Code-Install-0098FF?style=flat-square&logo=visualstudiocode&logoColor=white)](https://aka.ms/awesome-copilot/install/agent?url=vscode%3Achat-agent%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fagents%2Fpython-mcp-expert.agent.md)
[![Install in VS Code Insiders](https://img.shields.io/badge/VS_Code_Insiders-Install-24bfa5?style=flat-square&logo=visualstudiocode&logoColor=white)](https://aka.ms/awesome-copilot/install/agent?url=vscode-insiders%3Achat-agent%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fagents%2Fpython-mcp-expert.agent.md) | Expert assistant for developing Model Context Protocol (MCP) servers in Python | | | [Python Notebook Sample Builder](../agents/python-notebook-sample-builder.agent.md)
[![Install in VS Code](https://img.shields.io/badge/VS_Code-Install-0098FF?style=flat-square&logo=visualstudiocode&logoColor=white)](https://aka.ms/awesome-copilot/install/agent?url=vscode%3Achat-agent%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fagents%2Fpython-notebook-sample-builder.agent.md)
[![Install in VS Code Insiders](https://img.shields.io/badge/VS_Code_Insiders-Install-24bfa5?style=flat-square&logo=visualstudiocode&logoColor=white)](https://aka.ms/awesome-copilot/install/agent?url=vscode-insiders%3Achat-agent%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fagents%2Fpython-notebook-sample-builder.agent.md) | Custom agent for building Python Notebooks in VS Code that demonstrate Azure and AI features | | | [QA](../agents/qa-subagent.agent.md)
[![Install in VS Code](https://img.shields.io/badge/VS_Code-Install-0098FF?style=flat-square&logo=visualstudiocode&logoColor=white)](https://aka.ms/awesome-copilot/install/agent?url=vscode%3Achat-agent%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fagents%2Fqa-subagent.agent.md)
[![Install in VS Code Insiders](https://img.shields.io/badge/VS_Code_Insiders-Install-24bfa5?style=flat-square&logo=visualstudiocode&logoColor=white)](https://aka.ms/awesome-copilot/install/agent?url=vscode-insiders%3Achat-agent%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fagents%2Fqa-subagent.agent.md) | Meticulous QA subagent for test planning, bug hunting, edge-case analysis, and implementation verification. | | +| [Quality Playbook](../agents/quality-playbook.agent.md)
[![Install in VS Code](https://img.shields.io/badge/VS_Code-Install-0098FF?style=flat-square&logo=visualstudiocode&logoColor=white)](https://aka.ms/awesome-copilot/install/agent?url=vscode%3Achat-agent%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fagents%2Fquality-playbook.agent.md)
[![Install in VS Code Insiders](https://img.shields.io/badge/VS_Code_Insiders-Install-24bfa5?style=flat-square&logo=visualstudiocode&logoColor=white)](https://aka.ms/awesome-copilot/install/agent?url=vscode-insiders%3Achat-agent%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fagents%2Fquality-playbook.agent.md) | Run a complete quality engineering audit on any codebase. Derives behavioral requirements from the code, generates spec-traced functional tests, runs a three-pass code review with regression tests, executes a multi-model spec audit (Council of Three), and produces a consolidated bug report with patches and TDD verification. Finds the 35% of real defects that structural code review alone cannot catch. | | | [React18 Auditor](../agents/react18-auditor.agent.md)
[![Install in VS Code](https://img.shields.io/badge/VS_Code-Install-0098FF?style=flat-square&logo=visualstudiocode&logoColor=white)](https://aka.ms/awesome-copilot/install/agent?url=vscode%3Achat-agent%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fagents%2Freact18-auditor.agent.md)
[![Install in VS Code Insiders](https://img.shields.io/badge/VS_Code_Insiders-Install-24bfa5?style=flat-square&logo=visualstudiocode&logoColor=white)](https://aka.ms/awesome-copilot/install/agent?url=vscode-insiders%3Achat-agent%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fagents%2Freact18-auditor.agent.md) | Deep-scan specialist for React 16/17 class-component codebases targeting React 18.3.1. Finds unsafe lifecycle methods, legacy context, batching vulnerabilities, event delegation assumptions, string refs, and all 18.3.1 deprecation surface. Reads everything, touches nothing. Saves .github/react18-audit.md. | | | [React18 Batching Fixer](../agents/react18-batching-fixer.agent.md)
[![Install in VS Code](https://img.shields.io/badge/VS_Code-Install-0098FF?style=flat-square&logo=visualstudiocode&logoColor=white)](https://aka.ms/awesome-copilot/install/agent?url=vscode%3Achat-agent%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fagents%2Freact18-batching-fixer.agent.md)
[![Install in VS Code Insiders](https://img.shields.io/badge/VS_Code_Insiders-Install-24bfa5?style=flat-square&logo=visualstudiocode&logoColor=white)](https://aka.ms/awesome-copilot/install/agent?url=vscode-insiders%3Achat-agent%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fagents%2Freact18-batching-fixer.agent.md) | Automatic batching regression specialist. React 18 batches ALL setState calls including those in Promises, setTimeout, and native event handlers - React 16/17 did NOT. Class components with async state chains that assumed immediate intermediate re-renders will produce wrong state. This agent finds every vulnerable pattern and fixes with flushSync where semantically required. | | | [React18 Class Surgeon](../agents/react18-class-surgeon.agent.md)
[![Install in VS Code](https://img.shields.io/badge/VS_Code-Install-0098FF?style=flat-square&logo=visualstudiocode&logoColor=white)](https://aka.ms/awesome-copilot/install/agent?url=vscode%3Achat-agent%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fagents%2Freact18-class-surgeon.agent.md)
[![Install in VS Code Insiders](https://img.shields.io/badge/VS_Code_Insiders-Install-24bfa5?style=flat-square&logo=visualstudiocode&logoColor=white)](https://aka.ms/awesome-copilot/install/agent?url=vscode-insiders%3Achat-agent%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fagents%2Freact18-class-surgeon.agent.md) | Class component migration specialist for React 16/17 → 18.3.1. Migrates all three unsafe lifecycle methods with correct semantic replacements (not just UNSAFE_ prefix). Migrates legacy context to createContext, string refs to React.createRef(), findDOMNode to direct refs, and ReactDOM.render to createRoot. Uses memory to checkpoint per-file progress. | | diff --git a/docs/README.skills.md b/docs/README.skills.md index 400eb0554..8c0449b88 100644 --- a/docs/README.skills.md +++ b/docs/README.skills.md @@ -248,7 +248,7 @@ See [CONTRIBUTING.md](../CONTRIBUTING.md#adding-skills) for guidelines on how to | [pytest-coverage](../skills/pytest-coverage/SKILL.md) | Run pytest tests with coverage, discover lines missing coverage, and increase coverage to 100%. | None | | [python-mcp-server-generator](../skills/python-mcp-server-generator/SKILL.md) | Generate a complete MCP server project in Python with tools, resources, and proper configuration | None | | [python-pypi-package-builder](../skills/python-pypi-package-builder/SKILL.md) | End-to-end skill for building, testing, linting, versioning, and publishing a production-grade Python library to PyPI. Covers all four build backends (setuptools+setuptools_scm, hatchling, flit, poetry), PEP 440 versioning, semantic versioning, dynamic git-tag versioning, OOP/SOLID design, type hints (PEP 484/526/544/561), Trusted Publishing (OIDC), and the full PyPA packaging flow. Use for: creating Python packages, pip-installable SDKs, CLI tools, framework plugins, pyproject.toml setup, py.typed, setuptools_scm, semver, mypy, pre-commit, GitHub Actions CI/CD, or PyPI publishing. 
| `references/architecture-patterns.md`
`references/ci-publishing.md`
`references/community-docs.md`
`references/library-patterns.md`
`references/pyproject-toml.md`
`references/release-governance.md`
`references/testing-quality.md`
`references/tooling-ruff.md`
`references/versioning-strategy.md`
`scripts/scaffold.py` | -| [quality-playbook](../skills/quality-playbook/SKILL.md) | Explore any codebase from scratch and generate six quality artifacts: a quality constitution (QUALITY.md), spec-traced functional tests, a code review protocol with regression test generation, an integration testing protocol, a multi-model spec audit (Council of Three), and an AI bootstrap file (AGENTS.md). Includes state machine completeness analysis and missing safeguard detection. Works with any language (Python, Java, Scala, TypeScript, Go, Rust, etc.). Use this skill whenever the user asks to set up a quality playbook, generate functional tests from specifications, create a quality constitution, build testing protocols, audit code against specs, or establish a repeatable quality system for a project. Also trigger when the user mentions 'quality playbook', 'spec audit', 'Council of Three', 'fitness-to-purpose', 'coverage theater', or wants to go beyond basic test generation to build a full quality system grounded in their actual codebase. | `LICENSE.txt`
`references/constitution.md`
`references/defensive_patterns.md`
`references/functional_tests.md`
`references/review_protocols.md`
`references/schema_mapping.md`
`references/spec_audit.md`
`references/verification.md` | +| [quality-playbook](../skills/quality-playbook/SKILL.md) | Run a complete quality engineering audit on any codebase. Derives behavioral requirements from the code, generates spec-traced functional tests, runs a three-pass code review with regression tests, executes a multi-model spec audit (Council of Three), and produces a consolidated bug report with TDD-verified patches. Finds the 35% of real defects that structural code review alone cannot catch. Works with any language. Trigger on 'quality playbook', 'spec audit', 'Council of Three', 'fitness-to-purpose', or 'coverage theater'. | `LICENSE.txt`
`quality_gate.sh`
`references/constitution.md`
`references/defensive_patterns.md`
`references/exploration_patterns.md`
`references/functional_tests.md`
`references/iteration.md`
`references/requirements_pipeline.md`
`references/requirements_refinement.md`
`references/requirements_review.md`
`references/review_protocols.md`
`references/schema_mapping.md`
`references/spec_audit.md`
`references/verification.md` | | [quasi-coder](../skills/quasi-coder/SKILL.md) | Expert 10x engineer skill for interpreting and implementing code from shorthand, quasi-code, and natural language descriptions. Use when collaborators provide incomplete code snippets, pseudo-code, or descriptions with potential typos or incorrect terminology. Excels at translating non-technical or semi-technical descriptions into production-quality code. | None | | [react-audit-grep-patterns](../skills/react-audit-grep-patterns/SKILL.md) | Provides the complete, verified grep scan command library for auditing React codebases before a React 18.3.1 or React 19 upgrade. Use this skill whenever running a migration audit - for both the react18-auditor and react19-auditor agents. Contains every grep pattern needed to find deprecated APIs, removed APIs, unsafe lifecycle methods, batching vulnerabilities, test file issues, dependency conflicts, and React 19 specific removals. Always use this skill when writing audit scan commands - do not rely on memory for grep syntax, especially for the multi-line async setState patterns which require context flags. | `references/dep-scans.md`
`references/react18-scans.md`
`references/react19-scans.md`
`references/test-scans.md` | | [react18-batching-patterns](../skills/react18-batching-patterns/SKILL.md) | Provides exact patterns for diagnosing and fixing automatic batching regressions in React 18 class components. Use this skill whenever a class component has multiple setState calls in an async method, inside setTimeout, inside a Promise .then() or .catch(), or in a native event handler. Use it before writing any flushSync call - the decision tree here prevents unnecessary flushSync overuse. Also use this skill when fixing test failures caused by intermediate state assertions that break after React 18 upgrade. | `references/batching-categories.md`
`references/flushSync-guide.md` | From 9f962cec47a6174d46d69bfe08662c655c5b6245 Mon Sep 17 00:00:00 2001 From: Andrew Stellman Date: Wed, 15 Apr 2026 11:39:11 -0400 Subject: [PATCH 3/4] Update orchestrator agent with automated phase orchestration --- agents/quality-playbook.agent.md | 146 ++++++++++++++++++++++++------- docs/README.agents.md | 2 +- 2 files changed, 116 insertions(+), 32 deletions(-) diff --git a/agents/quality-playbook.agent.md b/agents/quality-playbook.agent.md index 48ca51fe0..4cc7325ef 100644 --- a/agents/quality-playbook.agent.md +++ b/agents/quality-playbook.agent.md @@ -1,64 +1,148 @@ --- name: "Quality Playbook" -description: "Run a complete quality engineering audit on any codebase. Derives behavioral requirements from the code, generates spec-traced functional tests, runs a three-pass code review with regression tests, executes a multi-model spec audit (Council of Three), and produces a consolidated bug report with patches and TDD verification. Finds the 35% of real defects that structural code review alone cannot catch." +description: "Run a complete quality engineering audit on any codebase. Orchestrates six phases — explore, generate, review, audit, reconcile, verify — each in its own context window for maximum depth. Then runs iteration strategies to find even more bugs. Finds the 35% of real defects that structural code review alone cannot catch." tools: - search/codebase - web/fetch --- -# Quality Playbook Agent +# Quality Playbook — Orchestrator Agent -You are a quality engineering agent. Your job is to run the Quality Playbook — a systematic methodology for finding bugs that require understanding what the code is *supposed* to do, not just what it does. +You are a quality engineering orchestrator. Your job is to run the Quality Playbook across multiple phases, giving each phase a clean context window so it can do deep analysis instead of running out of context partway through. 
-## Before you start +## Setup: find the skill -Check that the quality playbook skill is installed. Look for it in one of these locations: +Check that the quality playbook skill is installed. Look for SKILL.md in these locations, in order: -1. `.github/skills/quality-playbook/SKILL.md` -2. `.github/skills/SKILL.md` +1. `.github/skills/quality-playbook/SKILL.md` (Copilot) +2. `.github/skills/SKILL.md` (Copilot, flat layout) +3. `.claude/skills/quality-playbook/SKILL.md` (Claude Code) -Also check for the reference files directory alongside SKILL.md (in a `references/` folder). +Also check for a `references/` directory alongside SKILL.md containing iteration.md, review_protocols.md, spec_audit.md, and verification.md. **If the skill is not installed**, tell the user: -> The quality playbook skill isn't installed in this repository yet. You can install it from [awesome-copilot](https://awesome-copilot.github.com/#file=skills%2Fquality-playbook%2FSKILL.md) or from the [quality-playbook repository](https://github.com/andrewstellman/quality-playbook). Copy the `SKILL.md` file and the `references/` directory into `.github/skills/quality-playbook/`. +> The quality playbook skill isn't installed in this repository yet. Install it from the [quality-playbook repository](https://github.com/andrewstellman/quality-playbook): +> +> ```bash +> # For Copilot +> mkdir -p .github/skills/quality-playbook/references +> cp SKILL.md .github/skills/quality-playbook/SKILL.md +> cp references/* .github/skills/quality-playbook/references/ +> +> # For Claude Code +> mkdir -p .claude/skills/quality-playbook/references +> cp SKILL.md .claude/skills/quality-playbook/SKILL.md +> cp references/* .claude/skills/quality-playbook/references/ +> ``` Then stop and wait for the user to install it. -**If the skill is installed**, read SKILL.md and every file in the `references/` directory. Then follow the skill's instructions exactly — it defines six phases, each with entry gates and exit gates. 
+**If the skill is installed**, read SKILL.md and every file in the `references/` directory. Then follow the instructions below. -## How it works — phase by phase +## Pre-flight checks -The playbook runs one phase at a time. Each phase runs with a clean context window, producing files that the next phase reads. After each phase, stop and tell the user what happened and what to say next. +Before starting Phase 1, do two things: -1. **Phase 1 (Explore)** — Understand the codebase: architecture, risks, failure modes -2. **Phase 2 (Generate)** — Produce quality artifacts: requirements, tests, protocols -3. **Phase 3 (Code Review)** — Three-pass review with regression tests for every bug -4. **Phase 4 (Spec Audit)** — Three independent auditors check code against requirements -5. **Phase 5 (Reconciliation)** — TDD red-green verification for every confirmed bug -6. **Phase 6 (Verify)** — Self-check benchmarks validate all artifacts +1. **Check for documentation.** Look for a `docs/`, `docs_gathered/`, or `documentation/` directory. If none exists, give a prominent warning: -After all six phases, the user can run iteration strategies (gap, unfiltered, parity, adversarial) to find more bugs — iterations typically add 40-60% more confirmed bugs. + > **Documentation improves results significantly.** The playbook finds more bugs — and higher-confidence bugs — when it has specs, API docs, design documents, or community documentation to check the code against. Consider adding documentation to `docs_gathered/` before running. You can proceed without it, but results will be limited to structural findings. -**Default behavior: run Phase 1 only, then stop.** The user drives each phase forward by saying "keep going" or "run phase N". +2. **Ask about scope.** For large projects (50+ source files), ask whether the user wants to focus on specific modules or run against the entire codebase. 
-## Documentation warning +## How to run -Before starting Phase 1, check if the project has documentation (a `docs/` or `docs_gathered/` directory). If not, warn the user that the playbook finds significantly more bugs with documentation, and suggest they add specs or API docs to `docs_gathered/` before running. +The playbook has two modes. Ask the user which they want, or infer from their prompt: + +### Mode 1: Phase by phase (recommended for first run) + +Run Phase 1 in the current session. When it completes, show the end-of-phase summary and tell the user to say "keep going" or "run phase N" to continue. Each subsequent phase should run in a **new session or context window** so it gets maximum depth. + +This is the default if the user says "run the quality playbook." + +### Mode 2: Full orchestrated run + +Run all six phases automatically, each in its own context window, with intelligent handoffs between them. Use this when the user says "run the full playbook" or "run all phases." + +**Orchestration protocol:** + +For each phase (1 through 6): + +1. **Start a new context.** Spawn a sub-agent, open a new session, or start a new chat — whatever your tool supports. The goal is a clean context window. +2. **Pass the phase prompt.** Tell the new context: + - Read SKILL.md at [path to skill] + - Read all files in the references/ directory + - Read quality/PROGRESS.md (if it exists) for context from prior phases + - Execute Phase N +3. **Wait for completion.** The phase is done when it writes its checkpoint to quality/PROGRESS.md. +4. **Check the result.** Read quality/PROGRESS.md after the phase completes. Verify the phase wrote its checkpoint. If it didn't, the phase failed — report to the user and ask whether to retry. +5. **Report progress.** Between phases, briefly tell the user what happened: how many findings, any issues, what's next. +6. **Continue to next phase.** Repeat from step 1. 
+
+After Phase 6 completes, report the full results and ask if the user wants to run iteration strategies.
+
+**Tool-specific guidance for spawning clean contexts:**
+
+- **Claude Code:** Use the Agent tool to spawn a sub-agent for each phase. Each sub-agent gets its own context window automatically.
+- **Claude Cowork:** Use agent spawning to run each phase in a separate session.
+- **GitHub Copilot:** Start a new chat for each phase. Include the phase prompt as your first message.
+- **Cursor:** Open a new Composer for each phase with the phase prompt.
+- **Windsurf / other tools:** Start a new conversation or chat for each phase.
+
+If your tool doesn't support spawning sub-agents or new contexts programmatically, fall back to Mode 1 (phase by phase with user driving).
+
+### Iteration strategies
+
+After all six phases, the playbook supports four iteration strategies that find different classes of bugs. Each strategy re-explores the codebase with a different approach, then re-runs Phases 2-6 on the merged findings. Read `references/iteration.md` for full details.
+
+The four strategies, in recommended order:
+
+1. **gap** — Explore areas the baseline missed
+2. **unfiltered** — Fresh-eyes re-review without structural constraints
+3. **parity** — Compare parallel code paths (setup vs. teardown, encode vs. decode)
+4. **adversarial** — Challenge prior dismissals and recover Type II errors
+
+Each iteration runs the same way as the baseline: Phase 1 through 6, each in its own context window. Between iterations, report what was found and suggest the next strategy.
+
+Iterations typically add 40-60% more confirmed bugs on top of the baseline.
+
+## The six phases
+
+1. **Phase 1 (Explore)** — Read the codebase: architecture, quality risks, candidate bugs. Output: `quality/EXPLORATION.md`
+2. **Phase 2 (Generate)** — Produce quality artifacts: requirements, constitution, functional tests, review protocols, TDD protocol, AGENTS.md. Output: nine files in `quality/`
+3.
**Phase 3 (Code Review)** — Three-pass review: structural, requirement verification, cross-requirement consistency. Regression tests for every confirmed bug. Output: `quality/code_reviews/`, patches
+4. **Phase 4 (Spec Audit)** — Three independent auditors check code against requirements. Triage with verification probes. Output: `quality/spec_audits/`, additional regression tests
+5. **Phase 5 (Reconciliation)** — Close the loop: every bug tracked, regression-tested, TDD red-green verified. Output: `quality/BUGS.md`, TDD logs, completeness report
+6. **Phase 6 (Verify)** — 45 self-check benchmarks validate all generated artifacts. Output: final PROGRESS.md checkpoint
+
+Each phase has entry gates (prerequisites from prior phases) and exit gates (what must be true before the phase is considered complete). SKILL.md defines these gates precisely — follow them exactly.
## Responding to user questions
-- **"help" / "how does this work"** — Explain the six phases, mention that documentation improves results, and suggest "Run the quality playbook on this project" to get started.
-- **"what happened" / "what's going on"** — Read `quality/PROGRESS.md` and give a status update.
+- **"help" / "how does this work"** — Explain the six phases and two run modes. Mention that documentation improves results. Suggest "Run the quality playbook on this project" to get started with Mode 1, or "Run the full playbook" for automatic orchestration.
+- **"what happened" / "what's going on" / "status"** — Read `quality/PROGRESS.md` and give a status update: which phases completed, how many bugs found, what's next.
- **"keep going" / "continue" / "next"** — Run the next phase in sequence.
- **"run phase N"** — Run the specified phase (check prerequisites first).
+- **"run iterations"** — Start the iteration cycle. Read `references/iteration.md` and run gap strategy first.
+- **"run [strategy] iteration"** — Run a specific iteration strategy.
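The status-update response can be sketched as a small shell helper. The checkbox format is the one `PROGRESS.md` uses; the `status_check` name and the parsing are illustrative only.

```shell
# Illustrative status helper for the "what happened" response.
# status_check is a made-up name; it only counts PROGRESS.md checkboxes.
status_check() {
  local progress="$1" done_count todo_count
  if [ ! -f "$progress" ]; then
    echo "no run yet"
    return 0
  fi
  done_count=$(grep -c '^- \[x\]' "$progress")
  todo_count=$(grep -c '^- \[ \]' "$progress")
  echo "completed=$done_count remaining=$todo_count"
}

# Demo against a minimal PROGRESS.md:
demo_dir=$(mktemp -d)
printf -- '- [x] Phase 1: Exploration\n- [ ] Phase 2: Artifact generation\n' > "$demo_dir/PROGRESS.md"
status_check "$demo_dir/PROGRESS.md"   # prints: completed=1 remaining=1
```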
+
+## Error recovery
+
+If a phase fails (crashes, runs out of context, doesn't write its checkpoint):
+
+1. Read `quality/PROGRESS.md` to see what was completed
+2. Report the failure to the user with specifics
+3. Suggest retrying the failed phase in a new context
+4. Do not skip phases — each phase depends on the prior phase's output
-## How to invoke
+If the tool runs out of context mid-phase, the phase's incremental writes to disk are preserved. A retry in a new context can pick up where it left off by reading `PROGRESS.md` and the `quality/` directory.
-Tell the user they can invoke you by name in Copilot Chat. Example prompts:
+## Example prompts
-- "Run the quality playbook on this project"
-- "Keep going" (after any phase completes)
-- "Run quality playbook phase 3"
-- "Help — how does the quality playbook work?"
-- "What happened? What should I do next?"
+- "Run the quality playbook on this project" — Mode 1, starts Phase 1
+- "Run the full playbook" — Mode 2, orchestrates all six phases
+- "Run the full playbook with all iterations" — Mode 2 + all four iteration strategies
+- "Keep going" — Continue to next phase
+- "What happened?" — Status check
+- "Run the adversarial iteration" — Specific iteration strategy
+- "Help" — Explain how it works
diff --git a/docs/README.agents.md b/docs/README.agents.md
index ac8661a46..df7ba7b3b 100644
--- a/docs/README.agents.md
+++ b/docs/README.agents.md
@@ -164,7 +164,7 @@ See [CONTRIBUTING.md](../CONTRIBUTING.md#adding-agents) for guidelines on how to
| [Python MCP Server Expert](../agents/python-mcp-expert.agent.md)
[![Install in VS Code](https://img.shields.io/badge/VS_Code-Install-0098FF?style=flat-square&logo=visualstudiocode&logoColor=white)](https://aka.ms/awesome-copilot/install/agent?url=vscode%3Achat-agent%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fagents%2Fpython-mcp-expert.agent.md)
[![Install in VS Code Insiders](https://img.shields.io/badge/VS_Code_Insiders-Install-24bfa5?style=flat-square&logo=visualstudiocode&logoColor=white)](https://aka.ms/awesome-copilot/install/agent?url=vscode-insiders%3Achat-agent%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fagents%2Fpython-mcp-expert.agent.md) | Expert assistant for developing Model Context Protocol (MCP) servers in Python | | | [Python Notebook Sample Builder](../agents/python-notebook-sample-builder.agent.md)
[![Install in VS Code](https://img.shields.io/badge/VS_Code-Install-0098FF?style=flat-square&logo=visualstudiocode&logoColor=white)](https://aka.ms/awesome-copilot/install/agent?url=vscode%3Achat-agent%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fagents%2Fpython-notebook-sample-builder.agent.md)
[![Install in VS Code Insiders](https://img.shields.io/badge/VS_Code_Insiders-Install-24bfa5?style=flat-square&logo=visualstudiocode&logoColor=white)](https://aka.ms/awesome-copilot/install/agent?url=vscode-insiders%3Achat-agent%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fagents%2Fpython-notebook-sample-builder.agent.md) | Custom agent for building Python Notebooks in VS Code that demonstrate Azure and AI features | | | [QA](../agents/qa-subagent.agent.md)
[![Install in VS Code](https://img.shields.io/badge/VS_Code-Install-0098FF?style=flat-square&logo=visualstudiocode&logoColor=white)](https://aka.ms/awesome-copilot/install/agent?url=vscode%3Achat-agent%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fagents%2Fqa-subagent.agent.md)
[![Install in VS Code Insiders](https://img.shields.io/badge/VS_Code_Insiders-Install-24bfa5?style=flat-square&logo=visualstudiocode&logoColor=white)](https://aka.ms/awesome-copilot/install/agent?url=vscode-insiders%3Achat-agent%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fagents%2Fqa-subagent.agent.md) | Meticulous QA subagent for test planning, bug hunting, edge-case analysis, and implementation verification. | |
-| [Quality Playbook](../agents/quality-playbook.agent.md)
[![Install in VS Code](https://img.shields.io/badge/VS_Code-Install-0098FF?style=flat-square&logo=visualstudiocode&logoColor=white)](https://aka.ms/awesome-copilot/install/agent?url=vscode%3Achat-agent%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fagents%2Fquality-playbook.agent.md)
[![Install in VS Code Insiders](https://img.shields.io/badge/VS_Code_Insiders-Install-24bfa5?style=flat-square&logo=visualstudiocode&logoColor=white)](https://aka.ms/awesome-copilot/install/agent?url=vscode-insiders%3Achat-agent%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fagents%2Fquality-playbook.agent.md) | Run a complete quality engineering audit on any codebase. Derives behavioral requirements from the code, generates spec-traced functional tests, runs a three-pass code review with regression tests, executes a multi-model spec audit (Council of Three), and produces a consolidated bug report with patches and TDD verification. Finds the 35% of real defects that structural code review alone cannot catch. | |
+| [Quality Playbook](../agents/quality-playbook.agent.md)
[![Install in VS Code](https://img.shields.io/badge/VS_Code-Install-0098FF?style=flat-square&logo=visualstudiocode&logoColor=white)](https://aka.ms/awesome-copilot/install/agent?url=vscode%3Achat-agent%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fagents%2Fquality-playbook.agent.md)
[![Install in VS Code Insiders](https://img.shields.io/badge/VS_Code_Insiders-Install-24bfa5?style=flat-square&logo=visualstudiocode&logoColor=white)](https://aka.ms/awesome-copilot/install/agent?url=vscode-insiders%3Achat-agent%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fagents%2Fquality-playbook.agent.md) | Run a complete quality engineering audit on any codebase. Orchestrates six phases — explore, generate, review, audit, reconcile, verify — each in its own context window for maximum depth. Then runs iteration strategies to find even more bugs. Finds the 35% of real defects that structural code review alone cannot catch. | | | [React18 Auditor](../agents/react18-auditor.agent.md)
[![Install in VS Code](https://img.shields.io/badge/VS_Code-Install-0098FF?style=flat-square&logo=visualstudiocode&logoColor=white)](https://aka.ms/awesome-copilot/install/agent?url=vscode%3Achat-agent%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fagents%2Freact18-auditor.agent.md)
[![Install in VS Code Insiders](https://img.shields.io/badge/VS_Code_Insiders-Install-24bfa5?style=flat-square&logo=visualstudiocode&logoColor=white)](https://aka.ms/awesome-copilot/install/agent?url=vscode-insiders%3Achat-agent%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fagents%2Freact18-auditor.agent.md) | Deep-scan specialist for React 16/17 class-component codebases targeting React 18.3.1. Finds unsafe lifecycle methods, legacy context, batching vulnerabilities, event delegation assumptions, string refs, and all 18.3.1 deprecation surface. Reads everything, touches nothing. Saves .github/react18-audit.md. | | | [React18 Batching Fixer](../agents/react18-batching-fixer.agent.md)
[![Install in VS Code](https://img.shields.io/badge/VS_Code-Install-0098FF?style=flat-square&logo=visualstudiocode&logoColor=white)](https://aka.ms/awesome-copilot/install/agent?url=vscode%3Achat-agent%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fagents%2Freact18-batching-fixer.agent.md)
[![Install in VS Code Insiders](https://img.shields.io/badge/VS_Code_Insiders-Install-24bfa5?style=flat-square&logo=visualstudiocode&logoColor=white)](https://aka.ms/awesome-copilot/install/agent?url=vscode-insiders%3Achat-agent%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fagents%2Freact18-batching-fixer.agent.md) | Automatic batching regression specialist. React 18 batches ALL setState calls including those in Promises, setTimeout, and native event handlers - React 16/17 did NOT. Class components with async state chains that assumed immediate intermediate re-renders will produce wrong state. This agent finds every vulnerable pattern and fixes with flushSync where semantically required. | | | [React18 Class Surgeon](../agents/react18-class-surgeon.agent.md)
[![Install in VS Code](https://img.shields.io/badge/VS_Code-Install-0098FF?style=flat-square&logo=visualstudiocode&logoColor=white)](https://aka.ms/awesome-copilot/install/agent?url=vscode%3Achat-agent%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fagents%2Freact18-class-surgeon.agent.md)
[![Install in VS Code Insiders](https://img.shields.io/badge/VS_Code_Insiders-Install-24bfa5?style=flat-square&logo=visualstudiocode&logoColor=white)](https://aka.ms/awesome-copilot/install/agent?url=vscode-insiders%3Achat-agent%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fagents%2Freact18-class-surgeon.agent.md) | Class component migration specialist for React 16/17 → 18.3.1. Migrates all three unsafe lifecycle methods with correct semantic replacements (not just UNSAFE_ prefix). Migrates legacy context to createContext, string refs to React.createRef(), findDOMNode to direct refs, and ReactDOM.render to createRoot. Uses memory to checkpoint per-file progress. | | From 0b8776079cf50334b083c2556b6ae9757e8af910 Mon Sep 17 00:00:00 2001 From: Andrew Stellman Date: Thu, 16 Apr 2026 02:06:39 -0400 Subject: [PATCH 4/4] Update quality-playbook to v1.4.1 - Recheck mode: say "recheck" after fixing bugs to verify fixes without re-running the full pipeline (2-10 min vs 60-90 min) - Fixed 19 bugs from bootstrap self-audit: eval injection in quality_gate.sh, bash 3.2 empty array crashes, required artifacts downgraded to WARN, json_key_count false positives, missing artifact checks, documentation inconsistencies - quality_gate.sh: integration-results.json validation depth parity, #### heading detection, functional test alternative name patterns Co-Authored-By: Claude Opus 4.6 --- skills/quality-playbook/SKILL.md | 270 ++++++++++++++---- skills/quality-playbook/quality_gate.sh | 141 +++++++-- .../references/review_protocols.md | 2 +- .../references/verification.md | 10 +- 4 files changed, 342 insertions(+), 81 deletions(-) diff --git a/skills/quality-playbook/SKILL.md b/skills/quality-playbook/SKILL.md index 7d5b30ff0..89e76205e 100644 --- a/skills/quality-playbook/SKILL.md +++ b/skills/quality-playbook/SKILL.md @@ -3,7 +3,7 @@ name: quality-playbook description: "Run a complete quality engineering audit on any codebase. 
Derives behavioral requirements from the code, generates spec-traced functional tests, runs a three-pass code review with regression tests, executes a multi-model spec audit (Council of Three), and produces a consolidated bug report with TDD-verified patches. Finds the 35% of real defects that structural code review alone cannot catch. Works with any language. Trigger on 'quality playbook', 'spec audit', 'Council of Three', 'fitness-to-purpose', or 'coverage theater'."
license: Complete terms in LICENSE.txt
metadata:
-  version: 1.4.0
+  version: 1.4.1
  author: Andrew Stellman
  github: https://github.com/andrewstellman/quality-playbook
---
@@ -28,17 +28,30 @@ Before reading any other section of this skill, understand the plan and its depe
**Phase 6 (Verify):** Run self-check benchmarks against all generated artifacts. Check for internal consistency, version stamp correctness, and convergence.
+**Phase 7 (Present, Explore, Improve):** Present results to the user with a scannable summary table, offer drill-down on any artifact, and provide a menu of improvement paths (iteration strategies, requirement refinement, integration test tuning). This is the interactive phase where the user takes ownership of the quality system.
+
Every bug found traces back to a requirement, and every requirement traces back to an exploration finding.
**The critical dependency chain:** Exploration findings → EXPLORATION.md → Requirements → Code review + Spec audit → Bug discovery. A shallow exploration produces abstract requirements. Abstract requirements miss bugs. The exploration phase is where bugs are won or lost.
**MANDATORY FIRST ACTION:** After reading and understanding the plan above, print the following message to the user, then explain the plan in your own words — what you'll do, what each phase produces, and why the exploration phase matters most.
Emphasize that exploration starts with open-ended domain-driven investigation, followed by domain-knowledge risk analysis that reasons about what goes wrong in systems like this, then supplemented by selected structured patterns. Do not copy the plan verbatim; paraphrase it to demonstrate understanding.
-> Quality Playbook v1.4.0 — by Andrew Stellman
+> Quality Playbook v1.4.1 — by Andrew Stellman
> https://github.com/andrewstellman/quality-playbook
Generate a complete quality system tailored to a specific codebase. Unlike test stub generators that work mechanically from source code, this skill explores the project first — understanding its domain, architecture, specifications, and failure history — then produces a quality playbook grounded in what it finds.
+### Locating reference files
+
+This skill references files in a `references/` directory (e.g., `references/iteration.md`, `references/review_protocols.md`). The location depends on how the skill was installed. When a reference file is mentioned, resolve it by checking these paths in order and using the first one that exists:
+
+1. `references/` (relative to SKILL.md — works when running from the skill directory)
+2. `.claude/skills/quality-playbook/references/` (Claude Code installation)
+3. `.github/skills/references/` (GitHub Copilot flat installation)
+4. `.github/skills/quality-playbook/references/` (alternate Copilot installation)
+
+All reference file mentions in this skill use the short form `references/filename.md`. If the relative path doesn't resolve, walk the fallback list above.
+
## Why This Exists
Most software projects have tests, but few have a quality *system*. Tests check whether code works. A quality system answers harder questions: what does "working correctly" mean for this specific project? What are the ways it could fail that wouldn't be caught by tests? What should every developer (human or AI) know before touching this code?
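The resolution order above can be sketched as a shell helper. The four paths are the ones listed; the `resolve_reference` function name is hypothetical.

```shell
# Sketch of the reference-file resolution order. resolve_reference is a
# hypothetical helper; the four paths are the ones the skill lists.
resolve_reference() {
  local name="$1" dir
  for dir in \
    "references" \
    ".claude/skills/quality-playbook/references" \
    ".github/skills/references" \
    ".github/skills/quality-playbook/references"; do
    if [ -f "$dir/$name" ]; then
      echo "$dir/$name"
      return 0
    fi
  done
  echo "could not resolve references/$name" >&2
  return 1
}

# Example: resolve_reference iteration.md prints the first existing copy.
```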
@@ -63,7 +76,7 @@ Nine files that together form a repeatable quality system:
Plus output directories: `quality/code_reviews/`, `quality/spec_audits/`, `quality/results/`, `quality/history/`.
-The pipeline also generates supporting artifacts: `quality/PROGRESS.md` (phase-by-phase checkpoint log with cumulative BUG tracker), `quality/CONTRACTS.md` (behavioral contracts), `quality/COVERAGE_MATRIX.md` (traceability), `quality/COMPLETENESS_REPORT.md` (final gate), `quality/VERSION_HISTORY.md` (review log), `quality/REVIEW_REQUIREMENTS.md` (interactive review protocol), and `quality/REFINE_REQUIREMENTS.md` (refinement pass protocol).
+The pipeline also generates supporting artifacts: `quality/PROGRESS.md` (phase-by-phase checkpoint log with cumulative BUG tracker), `quality/CONTRACTS.md` (behavioral contracts), `quality/COVERAGE_MATRIX.md` (traceability), `quality/COMPLETENESS_REPORT.md` (final gate), and `quality/VERSION_HISTORY.md` (review log). Phase 7 can additionally generate `quality/REVIEW_REQUIREMENTS.md` (interactive review protocol) and `quality/REFINE_REQUIREMENTS.md` (refinement pass protocol) for iterative improvement.
The two critical deliverables are the requirements file and the functional test file. The requirements file (`quality/REQUIREMENTS.md`) feeds the code review protocol's verification and consistency passes — it's what makes the code review catch more than structural anomalies. The functional test file (named for the project's language and test framework conventions) is the automated safety net. The Markdown protocols are documentation for humans and AI agents.
@@ -73,6 +86,7 @@ The quality gate (`quality_gate.sh`) validates these artifacts. If the gate chec
| Artifact | Location | Required?
| Created In |
|----------|----------|-----------|------------|
+| Exploration findings | `quality/EXPLORATION.md` | Yes | Phase 1 |
| Quality constitution | `quality/QUALITY.md` | Yes | Phase 2 |
| Requirements (UC identifiers) | `quality/REQUIREMENTS.md` | Yes | Phase 2 |
| Behavioral contracts | `quality/CONTRACTS.md` | Yes | Phase 2 |
@@ -84,21 +98,24 @@ The quality gate (`quality_gate.sh`) validates these artifacts. If the gate chec
| TDD verification protocol | `quality/RUN_TDD_TESTS.md` | Yes | Phase 2 |
| Bug tracker | `quality/BUGS.md` | Yes | Phase 3 |
| Coverage matrix | `quality/COVERAGE_MATRIX.md` | Yes | Phase 2 |
-| Completeness report | `quality/COMPLETENESS_REPORT.md` | Yes | Phase 5 |
+| Completeness report | `quality/COMPLETENESS_REPORT.md` | Yes | Phase 2 (baseline), Phase 5 (final verdict) |
| Progress tracker | `quality/PROGRESS.md` | Yes | Throughout |
| AI bootstrap | `AGENTS.md` | Yes | Phase 2 |
| Bug writeups | `quality/writeups/BUG-NNN.md` | If bugs found | Phase 5 |
| Regression patches | `quality/patches/BUG-NNN-regression-test.patch` | If bugs found | Phase 3 |
| Fix patches | `quality/patches/BUG-NNN-fix.patch` | Optional | Phase 3 |
+| TDD traceability | `quality/TDD_TRACEABILITY.md` | If bugs have red-phase results | Phase 5 |
| TDD sidecar | `quality/results/tdd-results.json` | If bugs found | Phase 5 |
| TDD red-phase logs | `quality/results/BUG-NNN.red.log` | If bugs found | Phase 5 |
| TDD green-phase logs | `quality/results/BUG-NNN.green.log` | If fix patch exists | Phase 5 |
| Integration sidecar | `quality/results/integration-results.json` | When integration tests run | Phase 5 |
-| Mechanical verify script | `quality/mechanical/verify.sh` | Yes (benchmark) | Phase 5 |
+| Mechanical verify script | `quality/mechanical/verify.sh` | Yes (benchmark) | Phase 2 |
| Verify receipt | `quality/results/mechanical-verify.log` + `.exit` | Yes (benchmark) | Phase 5 |
| Triage probes | `quality/spec_audits/triage_probes.sh` | When triage
runs | Phase 4 |
| Code review reports | `quality/code_reviews/*.md` | Yes | Phase 3 |
| Spec audit reports | `quality/spec_audits/*auditor*.md` + `*triage*` | Yes | Phase 4 |
+| Recheck results (JSON) | `quality/results/recheck-results.json` | When recheck runs | Recheck |
+| Recheck summary (MD) | `quality/results/recheck-summary.md` | When recheck runs | Recheck |
**Sidecar JSON lifecycle:** Write all bug writeups *before* finalizing `tdd-results.json` — the sidecar's `writeup_path` field must point to an existing file, not a placeholder. Similarly, run integration tests and collect results before writing `integration-results.json`.
@@ -109,7 +126,7 @@ The quality gate (`quality_gate.sh`) validates these artifacts.
```json
{
  "schema_version": "1.1",
-  "skill_version": "1.4.0",
+  "skill_version": "1.4.1",
  "date": "2026-04-12",
  "project": "repo-name",
  "bugs": [
@@ -124,7 +141,7 @@ The quality gate (`quality_gate.sh`) validates these artifacts.
    }
  ],
  "summary": {
-    "total": 3, "confirmed_open": 1, "red_failed": 0, "green_failed": 0, "tdd_verified": 2
+    "total": 3, "confirmed_open": 1, "red_failed": 0, "green_failed": 0, "verified": 2
  }
}
```
@@ -136,13 +153,13 @@ The quality gate (`quality_gate.sh`) validates these artifacts.
If the gate chec
```json
{
  "schema_version": "1.1",
-  "skill_version": "1.4.0",
+  "skill_version": "1.4.1",
  "date": "2026-04-12",
  "project": "repo-name",
  "recommendation": "SHIP",
-  "groups": [{ "name": "Group 1", "tests": [{ "name": "happy path", "status": "pass" }] }],
-  "summary": { "total": 12, "passed": 11, "failed": 1, "skipped": 0 },
-  "uc_coverage": { "UC-01": "covered", "UC-02": "not covered — no API key" }
+  "groups": [{ "group": 1, "name": "Group 1", "use_cases": ["UC-01"], "result": "pass", "tests_passed": 3, "tests_failed": 0, "notes": "" }],
+  "summary": { "total_groups": 12, "passed": 11, "failed": 1, "skipped": 0 },
+  "uc_coverage": { "UC-01": "covered_pass", "UC-02": "not_mapped" }
}
```
@@ -211,7 +228,7 @@ Use this when a previous playbook run exists and you want to find additional bug
**When to use iteration mode:** After a complete playbook run, when you believe the codebase has more bugs than the first run found. This is especially effective for large codebases where a single run can only cover 3–5 subsystems, and for library/framework codebases where different exploration paths find different bug classes.
-**Read `.github/skills/references/iteration.md` for detailed strategy instructions.** That file contains the full operational detail for each strategy, shared rules, merge steps, and the completion gate. The summary below describes when to use each strategy.
+**Read `references/iteration.md` for detailed strategy instructions.** That file contains the full operational detail for each strategy, shared rules, merge steps, and the completion gate. The summary below describes when to use each strategy.
**TDD applies to iteration runs.** Every newly confirmed bug in an iteration run must go through the full TDD red-green cycle and produce `quality/results/BUG-NNN.red.log` (and `.green.log` if a fix patch exists). The quality gate enforces this — missing logs cause FAIL.
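The log-closure requirement can be sketched as a shell check. This is illustrative only: `quality_gate.sh` is the authoritative implementation, and the grep-based JSON parsing assumes `id` fields in the `BUG-NNN` format shown in the sidecar example.

```shell
# Illustrative closure check: every BUG id in the sidecar needs a red-phase
# log. check_red_logs is a made-up helper, not part of the playbook.
check_red_logs() {
  local results_dir="$1" missing=0 id
  for id in $(grep -o 'BUG-[0-9][0-9]*' "$results_dir/tdd-results.json" | sort -u); do
    if [ ! -f "$results_dir/$id.red.log" ]; then
      echo "FAIL: missing $id.red.log"
      missing=1
    fi
  done
  if [ "$missing" -eq 0 ]; then
    echo "all red-phase logs present"
  fi
  return "$missing"
}

# Demo with a minimal sidecar:
demo=$(mktemp -d)
printf '{ "bugs": [ { "id": "BUG-001" } ] }\n' > "$demo/tdd-results.json"
: > "$demo/BUG-001.red.log"
check_red_logs "$demo"   # prints: all red-phase logs present
```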
See `references/iteration.md` shared rule 5 and the TDD Log Closure Gate in Phase 5.
@@ -293,7 +310,7 @@ When no `previous_runs/` directory exists but sibling versioned directories do,
## Phase 1: Explore the Codebase (Write As You Go)
> **Required references for this phase** — read these before proceeding:
-> - `.github/skills/references/exploration_patterns.md` — six bug-finding patterns to apply after open exploration
+> - `references/exploration_patterns.md` — six bug-finding patterns to apply after open exploration
Spend the first phase understanding the project. The quality playbook must be grounded in this specific codebase — not generic advice.
@@ -356,7 +373,7 @@ This context is gold. A chat history where the developer discussed "why we chose
If the user doesn't have chat history, proceed normally — the skill works without it, just with less context.
-**Autonomous fallback:** When running in benchmark mode, via `run_playbook.sh`, or without user interaction (e.g., `--single-pass`), skip Step 0's question and proceed directly to Step 1. If chat history folders are visible in the project tree (e.g., `AI Chat History/`, `.chat_exports/`), scan them without asking. If no chat history is found, proceed — do not block waiting for a response that won't come.
+**Autonomous fallback:** When running in benchmark mode, via `run_playbook.sh` (benchmark runner, not shipped with the skill), or without user interaction (e.g., `--single-pass`), skip Step 0's question and proceed directly to Step 1. If chat history folders are visible in the project tree (e.g., `AI Chat History/`, `.chat_exports/`), scan them without asking. If no chat history is found, proceed — do not block waiting for a response that won't come.
### Step 1: Identify Domain, Stack, and Specifications
@@ -658,7 +675,7 @@ For each category, check whether the requirements contain specific conditions co
**Carry-forward rule:** When a prior run's REQUIREMENTS.md exists in the quality directory, the pipeline must read it and check whether any conditions from the prior version were dropped. If conditions were dropped, the pipeline must either: (a) re-derive them with updated justification, or (b) document why the condition is no longer relevant. Silent drops are not permitted — they are a direct cause of regressions where previously learned requirements are lost between runs.
-**After the pipeline:** The skill also generates `quality/REVIEW_REQUIREMENTS.md` (interactive review protocol) and `quality/REFINE_REQUIREMENTS.md` (refinement pass protocol). These support iterative improvement — the user can review requirements interactively, run refinement passes with different models, and keep versioned backups of each iteration. See `references/requirements_pipeline.md` for the full versioning protocol and backup structure.
+**After the pipeline:** Phase 7 can generate `quality/REVIEW_REQUIREMENTS.md` (interactive review protocol) and `quality/REFINE_REQUIREMENTS.md` (refinement pass protocol). These are not Phase 2 artifacts — they support the Phase 7 interactive improvement paths. The user can review requirements interactively, run refinement passes with different models, and keep versioned backups of each iteration. See `references/requirements_pipeline.md` for the full versioning protocol and backup structure.
Record all requirements in a structured format. These feed directly into the code review protocol's verification and consistency passes.
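The carry-forward check can be sketched in shell: list use-case identifiers that appear in the prior REQUIREMENTS.md but not the current one. The `UC-NN` identifier format is the playbook's; the `dropped_ucs` helper name is made up.

```shell
# Sketch of the carry-forward check: use-case ids present in the prior
# requirements file but absent from the current one. dropped_ucs is a
# made-up helper, not part of the playbook.
dropped_ucs() {
  local prior="$1" current="$2" a b
  a=$(mktemp); b=$(mktemp)
  grep -o 'UC-[0-9][0-9]*' "$prior"   | sort -u > "$a"
  grep -o 'UC-[0-9][0-9]*' "$current" | sort -u > "$b"
  comm -23 "$a" "$b"   # in prior only: silently dropped conditions
  rm -f "$a" "$b"
}
```

Any identifier this prints must be re-derived with updated justification or explicitly documented as retired.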
@@ -680,17 +697,18 @@ Write the initial PROGRESS.md:
## Run metadata
Started: [date/time]
Project: [project name]
-Skill version: [read from .github/skills/SKILL.md metadata — must match exactly]
+Skill version: [read from SKILL.md metadata using the reference file resolution order — must match exactly]
With docs: [yes/no]
## Phase completion
- [x] Phase 1: Exploration — completed [date/time]
-- [ ] Phase 2: Artifact generation (QUALITY.md, REQUIREMENTS.md, tests, protocols, BUGS.md, RUN_TDD_TESTS.md, AGENTS.md)
+- [ ] Phase 2: Artifact generation (QUALITY.md, REQUIREMENTS.md, tests, protocols, RUN_TDD_TESTS.md, AGENTS.md)
- [ ] Phase 3: Code review + regression tests
- [ ] Phase 4: Spec audit + triage
- [ ] Phase 5: Post-review reconciliation + closure verification
- [ ] TDD logs: red-phase log for every confirmed bug, green-phase log for every bug with fix patch
- [ ] Phase 6: Verification benchmarks
+- [ ] Phase 7: Present, Explore, Improve (interactive)
## Artifact inventory
| Artifact | Status | Path | Notes |
> **Required references for this phase** — read these before proceeding: > - `quality/EXPLORATION.md` — your Phase 1 findings (architecture, requirements, use cases, pattern analysis) -> - `.github/skills/references/requirements_pipeline.md` — five-phase pipeline for requirement derivation -> - `.github/skills/references/defensive_patterns.md` — grep patterns for finding defensive code -> - `.github/skills/references/schema_mapping.md` — field mapping format for schema-aware tests -> - `.github/skills/references/constitution.md` — QUALITY.md template -> - `.github/skills/references/functional_tests.md` — test structure and anti-patterns -> - `.github/skills/references/review_protocols.md` — code review and integration test templates +> - `references/requirements_pipeline.md` — five-phase pipeline for requirement derivation +> - `references/defensive_patterns.md` — grep patterns for finding defensive code +> - `references/schema_mapping.md` — field mapping format for schema-aware tests +> - `references/constitution.md` — QUALITY.md template +> - `references/functional_tests.md` — test structure and anti-patterns +> - `references/review_protocols.md` — code review and integration test templates **Phase 2 entry gate (mandatory — HARD STOP).** Before generating any artifacts, read `quality/EXPLORATION.md` from disk and verify ALL of the following exact section titles exist (grep or search — do not rely on memory): @@ -885,23 +903,23 @@ Or say "keep going" to continue automatically. 5. `## Candidate Bugs for Phase 2` — must exist verbatim 6. `## Gate Self-Check` — must exist (proves the Phase 1 gate was run) -If the file does not exist, has fewer than 120 lines, or is **missing ANY of these exact section titles**, STOP and go back to Phase 1. Do not attempt to proceed with "equivalent" sections under different names — the exact titles above are required. 
Write EXPLORATION.md now, starting with domain-driven open exploration, then domain-knowledge risk analysis, then selecting 3–4 patterns from `.github/skills/references/exploration_patterns.md` for deep dives. Do not proceed with Phase 2 until EXPLORATION.md passes the Phase 1 completion gate. This check exists because single-pass execution can skip the Phase 1 gate — this is the backstop. In v1.3.43, two repos bypassed both gates and produced zero bugs. +If the file does not exist, has fewer than 120 lines, or is **missing ANY of these exact section titles**, STOP and go back to Phase 1. Do not attempt to proceed with "equivalent" sections under different names — the exact titles above are required. Write EXPLORATION.md now, starting with domain-driven open exploration, then domain-knowledge risk analysis, then selecting 3–4 patterns from `references/exploration_patterns.md` for deep dives. Do not proceed with Phase 2 until EXPLORATION.md passes the Phase 1 completion gate. This check exists because single-pass execution can skip the Phase 1 gate — this is the backstop. In v1.3.43, two repos bypassed both gates and produced zero bugs. Use `quality/EXPLORATION.md` as your primary source for this phase — do not re-explore the codebase from scratch. The exploration findings contain the architecture map, derived requirements, use cases, and risk analysis that drive every artifact below. If you find yourself reading source files to figure out what the project does, go back to EXPLORATION.md instead. Re-exploration wastes context and produces inconsistencies between what Phase 1 found and what Phase 2 generates. -Now write the nine files. For each one, follow the structure below and consult the relevant reference file for detailed guidance. +Now write the Phase 2 artifacts. The requirements pipeline above produced REQUIREMENTS.md, CONTRACTS.md, COVERAGE_MATRIX.md, and COMPLETENESS_REPORT.md. The seven files below complete the set. 
For each one, follow the structure below and consult the relevant reference file for detailed guidance. **Version stamp (mandatory on every generated file).** Every Markdown file the playbook generates must begin with the following attribution line immediately after the file's title heading: ``` -> Generated by [Quality Playbook](https://github.com/andrewstellman/quality-playbook) v1.4.0 — Andrew Stellman +> Generated by [Quality Playbook](https://github.com/andrewstellman/quality-playbook) v1.4.1 — Andrew Stellman > Date: YYYY-MM-DD · Project: ``` Every generated code file (test files, scripts) must begin with a comment header: ``` -# Generated by Quality Playbook v1.4.0 — https://github.com/andrewstellman/quality-playbook +# Generated by Quality Playbook v1.4.1 — https://github.com/andrewstellman/quality-playbook # Author: Andrew Stellman · Date: YYYY-MM-DD · Project: ``` @@ -988,7 +1006,7 @@ The code review protocol has three passes. Each pass runs independently — a fr **Do not assert that a whitelist "covers all values" or "preserves supported bits" without performing this two-list comparison.** AI models reliably hallucinate completeness for switch/case constructs — the model sees the function, sees the constants defined elsewhere, and assumes coverage without checking each case label. The most dangerous form of this hallucination is copying from an upstream artifact (like REQUIREMENTS.md) that asserts a constant is present, rather than extracting from the code. In v1.3.17, the code review's "case labels present" list was word-for-word identical to the requirements list — proving it was copied rather than extracted. The mechanical check with per-label line numbers is the fix. -These five areas must appear as labeled subsections in the Pass 1 report. If a project has no meaningful concurrency, say so explicitly and document why rather than omitting the section. Add project-specific scrutiny areas beyond these four as warranted. 
+These five areas must appear as labeled subsections in the Pass 1 report. If a project has no meaningful concurrency, say so explicitly and document why rather than omitting the section. Add project-specific scrutiny areas beyond these five as warranted. Pass 1 catches ~65% of real defects: race conditions, null pointer hazards, resource leaks, off-by-one errors, type mismatches — structural problems visible in the code. @@ -1035,7 +1053,7 @@ Do NOT demand "executed request-level evidence" or defer findings because "they --- /dev/null +++ b/quality/test_regression_virtio.c @@ -0,0 +1,15 @@ -+// Generated by Quality Playbook v1.4.0 ++// Generated by Quality Playbook v1.4.1 +// Regression test for BUG-004: VIRTIO_F_RING_RESET missing from vring_transport_features() +#include +#include @@ -1389,7 +1407,7 @@ The generated protocol must include: } ``` - **Required top-level fields:** `schema_version`, `skill_version`, `date`, `project`, `bugs`, `summary`. **Required per-bug fields:** `id`, `requirement`, `red_phase`, `green_phase`, `verdict`, `fix_patch_present`, `writeup_path`. If any required field is missing, the result is non-conformant. + **Required top-level fields:** `schema_version`, `skill_version`, `date`, `project`, `bugs`, `summary`. **Required per-bug fields:** `id`, `requirement`, `red_phase`, `green_phase`, `verdict`, `fix_patch_present`, `writeup_path`. If any required field is missing, the result is non-conformant. **Optional per-bug fields** (shown in the template above but not gate-checked): `regression_patch`, `fix_patch`, `patch_gate_passed`, `junit_red`, `junit_green`, `junit_available`, `notes`. Include these when the data is available; omit them without penalty. **Required summary sub-keys:** The `summary` object must contain exactly these keys: `total`, `verified`, `confirmed_open`, `red_failed`, `green_failed`. All five are required — omitting any of them (especially `red_failed` or `green_failed`) makes the summary non-conformant. 
@@ -1403,7 +1421,7 @@ The generated protocol must include: - `"verdict": "skipped"` — this value is deprecated; use `"confirmed open"` with `red_phase: "fail"` and `green_phase: "skipped"`. - Missing `"schema_version"` at the root — every tdd-results.json must include this field. - Valid `verdict` values: `"TDD verified"` (FAIL→PASS), `"red failed"` (test passed on unpatched code — test doesn't detect the bug), `"green failed"` (test still fails after fix — fix is incomplete or patch is corrupt), `"confirmed open"` (red phase ran and confirmed the bug, no fix patch available). **Do not use `"skipped"` as a verdict** — every confirmed bug must have a red-phase result. A bug with `verdict: "confirmed open"` must have `red_phase: "fail"` (red ran and confirmed the bug) and `green_phase: "skipped"` (no fix to apply). Valid `red_phase`/`green_phase` values: `"fail"`, `"pass"`, `"error"` (compile/apply failure), `"skipped"` (green only — red is never skipped). The `patch_gate_passed` field records whether the patch validation gate (apply-check + compile) succeeded — `false` if the gate failed and the patch was repaired, `null` if no fix patch exists. The `writeup_path` field points to the per-bug writeup file (see "Bug writeup generation" below) — `null` if no writeup was generated for this bug. + Valid `verdict` values: `"TDD verified"` (FAIL→PASS), `"red failed"` (test passed on unpatched code — test doesn't detect the bug), `"green failed"` (test still fails after fix — fix is incomplete or patch is corrupt), `"confirmed open"` (red phase ran and confirmed the bug, no fix patch available), `"deferred"` (TDD cannot execute in this environment — use with `notes` explaining why). **Do not use `"skipped"` as a verdict** — every confirmed bug must have a red-phase result. A bug with `verdict: "confirmed open"` must have `red_phase: "fail"` (red ran and confirmed the bug) and `green_phase: "skipped"` (no fix to apply). 
Valid `red_phase`/`green_phase` values: `"fail"`, `"pass"`, `"error"` (compile/apply failure), `"skipped"` (green only — red is never skipped). The `patch_gate_passed` field records whether the patch validation gate (apply-check + compile) succeeded — `false` if the gate failed and the patch was repaired, `null` if no fix patch exists. The `writeup_path` field points to the per-bug writeup file (see "Bug writeup generation" below) — `null` if no writeup was generated for this bug. Runner scripts and CI tools should read the sidecar JSON for pass/fail counts rather than grepping the Markdown report. @@ -1413,7 +1431,7 @@ The generated protocol must include: **Execution UX:** Same three-phase pattern as the integration tests — (1) show the plan as a numbered table of bugs to verify, (2) report one-line progress as each red-green cycle runs (`FAIL ✓ → PASS ✓` or `FAIL ✗ — test passes on unpatched code, rewriting`), (3) show a summary table with verified/failed/rewritten counts. -7. **Bug writeup generation (for TDD-verified bugs).** After a successful red→green cycle (`verdict: "TDD verified"`), generate a self-contained writeup at `quality/writeups/BUG-NNN.md`. This file is designed to be emailed to a maintainer, attached to a Jira ticket, or reviewed outside the repository — it must stand alone without requiring the reader to navigate the rest of the quality artifacts. +7. **Bug writeup generation (for all confirmed bugs).** After a successful red→green cycle (`verdict: "TDD verified"`) or confirmation without a fix (`verdict: "confirmed open"`), generate a self-contained writeup at `quality/writeups/BUG-NNN.md`. This file is designed to be emailed to a maintainer, attached to a Jira ticket, or reviewed outside the repository — it must stand alone without requiring the reader to navigate the rest of the quality artifacts. 
**Template (sections 1–4, 6, 7 are required in every writeup; add 5 when the depth judgment fires; add 8 when related bugs exist):** @@ -1449,7 +1467,7 @@ Re-read `quality/PROGRESS.md`. Update: - Add exploration summary notes if not already present **Phase 2 completion gate (mandatory).** Before proceeding to Phase 3, verify: -1. All nine core artifacts exist on disk (`QUALITY.md`, `CONTRACTS.md`, `REQUIREMENTS.md`, `COVERAGE_MATRIX.md`, `test_functional.*`, `RUN_CODE_REVIEW.md`, `RUN_INTEGRATION_TESTS.md`, `RUN_SPEC_AUDIT.md`, `RUN_TDD_TESTS.md`). +1. All core artifacts exist on disk (`QUALITY.md`, `CONTRACTS.md`, `REQUIREMENTS.md`, `COVERAGE_MATRIX.md`, `COMPLETENESS_REPORT.md`, `test_functional.*`, `RUN_CODE_REVIEW.md`, `RUN_INTEGRATION_TESTS.md`, `RUN_SPEC_AUDIT.md`, `RUN_TDD_TESTS.md`, `AGENTS.md`). 2. `REQUIREMENTS.md` contains requirements with specific conditions of satisfaction referencing actual code (file paths, function names, line numbers) — not abstract behavioral descriptions. 3. If dispatch/enumeration contracts exist: `quality/mechanical/verify.sh` exists and has been executed. 4. PROGRESS.md marks Phase 2 complete with timestamp. @@ -1483,7 +1501,7 @@ Or say "keep going" to continue automatically. > **Required references for this phase:** > - `quality/REQUIREMENTS.md` — target list for the code review -> - `.github/skills/references/review_protocols.md` — three-pass protocol and regression test conventions +> - `references/review_protocols.md` — three-pass protocol and regression test conventions Run the code review protocol (all three passes) as described in File 3. After producing findings, write regression tests for every confirmed BUG per the closure mandate in `references/review_protocols.md`. @@ -1511,7 +1529,7 @@ Or say "keep going" to continue automatically. 
## Phase 4: Spec Audit and Triage > **Required references for this phase:** -> - `.github/skills/references/spec_audit.md` — Council of Three protocol, triage process, verification probes +> - `references/spec_audit.md` — Council of Three protocol, triage process, verification probes Run the spec audit protocol as described in File 5. The triage report **must** include a `## Pre-audit docs validation` section (see `references/spec_audit.md` for the full template). This section is required even if `docs_gathered/` is empty — in that case, note what baseline the auditors used instead. Every verification probe in the triage must produce executable evidence (test assertions with line-number citations) per the "Verification probes must produce executable evidence" rule above. After triage, categorize each confirmed finding. @@ -1556,9 +1574,9 @@ Or say "keep going" to continue automatically. > **Required references for this phase:** > - `quality/PROGRESS.md` — cumulative BUG tracker (authoritative finding list) -> - `.github/skills/references/requirements_pipeline.md` — post-review reconciliation process -> - `.github/skills/references/review_protocols.md` — regression test cleanup after reversals -> - `.github/skills/references/spec_audit.md` — verification probe protocol for conflicts +> - `references/requirements_pipeline.md` — post-review reconciliation process +> - `references/review_protocols.md` — regression test cleanup after reversals +> - `references/spec_audit.md` — verification probe protocol for conflicts Re-read `quality/PROGRESS.md` — specifically the cumulative BUG tracker. This is the authoritative list of all findings across both code review and spec audit. @@ -1572,7 +1590,7 @@ Re-read `quality/PROGRESS.md` — specifically the cumulative BUG tracker. This **Executed evidence outranks narrative artifacts (contradiction gate).** Before running the terminal gate, check for contradictions between executed evidence and prose artifacts. 
Executed evidence includes: mechanical verification artifacts (`quality/mechanical/*`), verification receipt files (`quality/results/mechanical-verify.log`, `quality/results/mechanical-verify.exit`), regression test results (`test_regression.*` with `xfail` outcomes), TDD red-phase log files (`quality/results/BUG-NNN.red.log`), and any shell command output saved during the pipeline. Prose artifacts include: `REQUIREMENTS.md`, `CONTRACTS.md`, code reviews, spec audit triage, and `BUGS.md`. If an executed artifact shows a constant is absent (mechanical check), a test fails (regression test), or a red-phase confirms a bug (TDD traceability) — but a prose artifact claims the constant is present, the bug is fixed, or the code is compliant — the executed result wins. Re-open and correct the contradictory prose artifact before proceeding. Specifically: if `mechanical-verify.exit` contains a non-zero value, PROGRESS.md may not claim "Mechanical verification: passed" and the terminal gate may not pass — regardless of what any other artifact says. In v1.3.18, the triage claimed RING_RESET was preserved (`spec_audits/triage.md`), BUGS.md claimed "fixed in working tree," but TDD traceability showed the assertion `assert "case VIRTIO_F_RING_RESET:" in func` failed on the current source. Those three cannot all be true — the executed failure is the ground truth. This gate would have caught that contradiction. -**Version stamp consistency check (mandatory).** Read the `version:` field from the SKILL.md metadata (in `.github/skills/SKILL.md`). Then check every generated artifact: PROGRESS.md's `Skill version:` field, every `> Generated by` attribution line, every code file header stamp, and every sidecar JSON `skill_version` field. Every version stamp must match the SKILL.md metadata exactly. A single mismatch is a benchmark failure — fix the stamp before proceeding. 
This check exists because in v1.3.21 benchmarking, 5 of 9 repos had version stamps from older skill versions (v1.3.16 or v1.3.20) because the PROGRESS.md template contained a hardcoded version number. +**Version stamp consistency check (mandatory).** Read the `version:` field from the SKILL.md metadata (using the reference file resolution order). Then check every generated artifact: PROGRESS.md's `Skill version:` field, every `> Generated by` attribution line, every code file header stamp, and every sidecar JSON `skill_version` field. Every version stamp must match the SKILL.md metadata exactly. A single mismatch is a benchmark failure — fix the stamp before proceeding. This check exists because in v1.3.21 benchmarking, 5 of 9 repos had version stamps from older skill versions (v1.3.16 or v1.3.20) because the PROGRESS.md template contained a hardcoded version number. **Mechanical directory conformance check.** If `quality/mechanical/` exists, it must contain at minimum a `verify.sh` file. An empty `quality/mechanical/` directory is non-conformant — it implies the step was attempted but abandoned. If no dispatch-function contracts exist in this project's scope, do not create a `mechanical/` directory at all. Instead, record in PROGRESS.md: `Mechanical verification: NOT APPLICABLE — no dispatch/registry/enumeration contracts in scope.` If dispatch contracts do exist, `verify.sh` must include one verification block per saved extraction file under `quality/mechanical/` (not just one). A verify.sh that checks only one artifact when multiple exist is incomplete. @@ -1629,7 +1647,7 @@ For each missing file, create it now. Do not mark Phase 5 complete with missing **Sidecar JSON post-write validation (mandatory).** After writing `quality/results/tdd-results.json` and/or `quality/results/integration-results.json`, immediately reopen each file and verify it contains all required keys. 
For `tdd-results.json`, the required root keys are: `schema_version`, `skill_version`, `date`, `project`, `bugs`, `summary`. Each entry in `bugs` must have: `id`, `requirement`, `red_phase`, `green_phase`, `verdict`, `fix_patch_present`, `writeup_path`. The `summary` object must include `confirmed_open` alongside `verified`, `red_failed`, `green_failed`. For `integration-results.json`, the required root keys are: `schema_version`, `skill_version`, `date`, `project`, `recommendation`, `groups`, `summary`, `uc_coverage`. Both files must have `schema_version: "1.1"`. If any key is missing, add it now — do not leave a non-conformant JSON file on disk. This validation exists because v1.3.25 benchmarking showed 6 of 8 repos with non-conformant sidecar JSON: httpx invented an alternate schema, serde used legacy shape, javalin omitted `summary` and per-bug fields, and others used invalid enum values. -**Script-verified closure gate (mandatory, final step before marking Phase 5 complete).** Run `bash .github/skills/quality_gate.sh .` from the project root directory. This script mechanically validates: file existence, BUGS.md heading format, sidecar JSON required keys AND per-bug field names (`id`, `requirement`, `red_phase`, `green_phase`, `verdict`, `fix_patch_present`, `writeup_path`) AND enum values AND summary consistency, use case identifiers, terminal gate section, mechanical verification receipts, version stamps, writeup completeness, **regression-test patch presence for every confirmed bug**, and **inline fix diffs in every writeup** (every `quality/writeups/BUG-NNN.md` must contain a ` ```diff ` block). 
If the script reports any FAIL results, fix each failing check before proceeding — the most common FAILs are: (1) missing `quality/patches/BUG-NNN-regression-test.patch` files, (2) non-canonical JSON field names like `bug_id` instead of `id`, (3) missing `confirmed_open` in the TDD summary, (4) writeups without inline fix diffs (section 6 must include a concrete diff, not just "see patch file"). Do not mark Phase 5 complete until `quality_gate.sh` exits 0. Append the script's full output to `quality/results/quality-gate.log`. +**Script-verified closure gate (mandatory, final step before marking Phase 5 complete).** Locate `quality_gate.sh` using the same fallback as reference files (check `quality_gate.sh`, `.claude/skills/quality-playbook/quality_gate.sh`, `.github/skills/quality_gate.sh` in order) and run it from the project root directory. This script mechanically validates: file existence, BUGS.md heading format, sidecar JSON required keys AND per-bug field names (`id`, `requirement`, `red_phase`, `green_phase`, `verdict`, `fix_patch_present`, `writeup_path`) AND enum values AND summary consistency, use case identifiers, terminal gate section, mechanical verification receipts, version stamps, writeup completeness, **regression-test patch presence for every confirmed bug**, and **inline fix diffs in every writeup** (every `quality/writeups/BUG-NNN.md` must contain a ` ```diff ` block). If the script reports any FAIL results, fix each failing check before proceeding — the most common FAILs are: (1) missing `quality/patches/BUG-NNN-regression-test.patch` files, (2) non-canonical JSON field names like `bug_id` instead of `id`, (3) missing `confirmed_open` in the TDD summary, (4) writeups without inline fix diffs (section 6 must include a concrete diff, not just "see patch file"). Do not mark Phase 5 complete until `quality_gate.sh` exits 0. Append the script's full output to `quality/results/quality-gate.log`. 
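The fallback lookup described above can be sketched as a small shell helper. The three candidate paths are the documented resolution order; the helper name, placeholder file, and error message are illustrative only, not part of the playbook.

```shell
# Sketch of the quality_gate.sh fallback lookup described above. The three
# candidate paths are the documented resolution order; the helper name and
# error message are illustrative, not part of the playbook.
locate_quality_gate() {
  for candidate in \
      "quality_gate.sh" \
      ".claude/skills/quality-playbook/quality_gate.sh" \
      ".github/skills/quality_gate.sh"; do
    if [ -f "$candidate" ]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  echo "quality_gate.sh not found in any documented location" >&2
  return 1
}

# Demonstration only: drop an empty placeholder at the first candidate path
# so the lookup has something to resolve.
: > quality_gate.sh
GATE="$(locate_quality_gate)"
echo "resolved: $GATE"
```

The resolved path would then be run as `bash "$GATE" .` from the project root, with output appended to `quality/results/quality-gate.log`, exactly as the gate step requires.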
**Use case identifier format.** REQUIREMENTS.md must use canonical use case identifiers in the format `UC-01`, `UC-02`, etc. for all derived use cases. Each use case must be labeled with its identifier. This is required for machine-readable traceability — the identifier format enables `quality_gate.sh` and downstream tooling to count and cross-reference use cases programmatically. Use cases written as prose paragraphs without identifiers are non-conformant. @@ -1658,7 +1676,7 @@ Or say "keep going" to continue automatically. ## Phase 6: Verify > **Required references for this phase:** -> - `.github/skills/references/verification.md` — 45 self-check benchmarks +> - `references/verification.md` — 45 self-check benchmarks **Why a verification phase?** AI-generated output can look polished and be subtly wrong. Tests that reference undefined fixtures report 0 failures but 16 errors — and "0 failures" sounds like success. Integration protocols can list field names that don't exist in the actual schemas. The verification phase catches these problems before the user discovers them, which is important because trust in a generated quality playbook is fragile — one wrong field name undermines confidence in everything else. @@ -1689,7 +1707,7 @@ Record in PROGRESS.md under `## Phase 6 Mechanical Closure` and append to `quali Run the mechanical validation gate: ```bash -bash .github/skills/quality_gate.sh . > quality/results/quality-gate.log 2>&1 +bash quality_gate.sh . > quality/results/quality-gate.log 2>&1 # locate via fallback: quality_gate.sh, .claude/skills/quality-playbook/quality_gate.sh, .github/skills/quality_gate.sh echo $? 
>> quality/results/phase6-verification.log ``` @@ -1740,12 +1758,12 @@ Process the remaining verification benchmarks from `references/verification.md` Append each batch result to `quality/results/phase6-verification.log`: ``` -[Step 3.4A] QUALITY.md scenarios: PASS — 8 scenarios, all reference real code -[Step 3.4B] Functional test quality: PASS — 30% cross-variant, assertion depth OK -[Step 3.4C] Protocol files: PASS — all self-contained and executable -[Step 3.4D] Regression tests: PASS — all skip guards present -[Step 3.4E] Enumeration/triage: PASS — two-list checks present, probes have assertions -[Step 3.4F] Continuation mode: SKIP — no SEED_CHECKS.md +[Step 6.4A] QUALITY.md scenarios: PASS — 8 scenarios, all reference real code +[Step 6.4B] Functional test quality: PASS — 30% cross-variant, assertion depth OK +[Step 6.4C] Protocol files: PASS — all self-contained and executable +[Step 6.4D] Regression tests: PASS — all skip guards present +[Step 6.4E] Enumeration/triage: PASS — two-list checks present, probes have assertions +[Step 6.4F] Continuation mode: SKIP — no SEED_CHECKS.md ``` If any batch fails, fix the issue immediately before proceeding to the next batch. @@ -1854,11 +1872,19 @@ You can now run iteration strategies to find additional bugs. Iterations typical add 40-60% more confirmed bugs on top of the baseline. The recommended cycle is: gap → unfiltered → parity → adversarial. -To start the first iteration, say: +To run all four iterations automatically, say: + + Run all iterations. + +I'll orchestrate each strategy as a separate sub-agent with its own context window. + +To run one iteration at a time, say: Run the next iteration of the quality playbook. Or ask me about the results: "Tell me about BUG-001" or "Which bugs are highest priority?" + +After you fix the bugs, say "recheck" to verify the fixes were applied correctly. ``` **After printing this message, STOP. 
Do not proceed to iterations unless the user explicitly asks.** @@ -1880,6 +1906,8 @@ The next recommended strategy is [next strategy]. To run it, say: All four iteration strategies have been run. Total confirmed bugs: N. You can review the results, ask about specific bugs, or re-run any strategy. +After you fix the bugs, say "recheck" to verify the fixes were applied correctly. + Or say "keep going" to run the next iteration automatically. ``` @@ -1887,11 +1915,150 @@ Or say "keep going" to run the next iteration automatically. --- +## Recheck Mode — Verify Bug Fixes + +Recheck mode is a lightweight verification pass that checks whether bugs from a previous run have been fixed. Instead of re-running the full six-phase pipeline (60-90 minutes), recheck reads the existing `quality/BUGS.md`, checks each bug against the current source tree, and reports which bugs are fixed vs. still open. A typical recheck takes 2-10 minutes. + +**When to use recheck mode:** After the user (or another agent) has applied fixes for bugs found by the playbook. The user says "recheck" or "verify the bug fixes" or "check which bugs are fixed." + +**Do not use recheck mode** as a substitute for running the full playbook. Recheck only verifies previously found bugs — it does not find new ones. + +### Recheck procedure + +**Step 1: Read the bug inventory.** + +Read `quality/BUGS.md` and parse every `### BUG-NNN` entry. For each bug, extract: +- Bug ID (e.g., BUG-001) +- File path and line number from the `**File:**` field +- Description summary (first sentence of `**Description:**`) +- Severity +- Fix patch path from `**Fix patch:**` field (e.g., `quality/patches/BUG-001-fix.patch`) +- Regression test path from `**Regression test:**` field + +**Step 2: Check each bug against the current source.** + +For each bug, perform these checks in order: + +1. 
**Fix patch check.** If a fix patch exists at the referenced path, run `git apply --check --reverse quality/patches/BUG-NNN-fix.patch` against the current tree. If the reverse-apply succeeds (exit 0), the fix patch is already applied — the bug is likely fixed. If it fails, the fix has not been applied or the code has changed. + +2. **Source inspection.** Open the file at the cited line number. Read the surrounding context (±20 lines). Compare what you see against the bug description. Has the problematic code been changed? Does the fix address the root cause described in the bug report? + +3. **Regression test execution.** If a regression test patch exists: + - Apply it: `git apply quality/patches/BUG-NNN-regression-test.patch` + - Run the test (using the project's test runner). If the test PASSES, the bug is fixed. If it FAILS, the bug is still present. + - Reverse the patch: `git apply -R quality/patches/BUG-NNN-regression-test.patch` + + If the regression test patch doesn't apply cleanly (because the source has changed), note this and fall back to source inspection alone. + +4. 
**Verdict.** Assign one of these statuses: + - **FIXED** — Fix patch is applied AND regression test passes (or source inspection confirms the fix if the test can't run) + - **PARTIALLY_FIXED** — The problematic code has changed but the regression test still fails, or the fix addresses some but not all aspects of the bug + - **STILL_OPEN** — The original problematic code is unchanged, or the regression test still fails + - **INCONCLUSIVE** — Can't determine status (file moved, code heavily refactored, patches don't apply) + +**Step 3: Generate recheck results.** + +Write `quality/results/recheck-results.json` with this schema (`N` and bracketed values are placeholders to fill in): + +```json +{ + "schema_version": "1.0", + "skill_version": "[read from SKILL.md metadata — must match exactly]", + "date": "YYYY-MM-DD", + "project": "[project name]", + "source_run": { + "bugs_md_date": "[date of original run]", + "total_bugs": N + }, + "results": [ + { + "id": "BUG-001", + "severity": "HIGH", + "summary": "[one-line bug description]", + "status": "FIXED", + "evidence": "[what was checked and what it showed]" + } + ], + "summary": { + "total": N, + "fixed": N, + "partially_fixed": N, + "still_open": N, + "inconclusive": N + } +} +``` + +Also write a human-readable summary to `quality/results/recheck-summary.md`: + +```markdown +# Recheck Results + +> Recheck of quality/BUGS.md from [date of original run] +> Recheck run: [date/time] +> Skill version: [must match SKILL.md metadata] + +## Summary + +| Status | Count | +|--------|-------| +| Fixed | N | +| Partially fixed | N | +| Still open | N | +| Inconclusive | N | +| **Total** | **N** | + +## Per-Bug Results + +| Bug | Severity | Status | Evidence | +|-----|----------|--------|----------| +| BUG-001 | HIGH | FIXED | Reverse-apply succeeded, regression test passes | +| BUG-002 | MEDIUM | STILL_OPEN | Original code unchanged at quality_gate.sh:125 | +| ... | ... | ... | ... | + +## Still Open — Details + +[For each STILL_OPEN or PARTIALLY_FIXED bug, include a brief explanation of what remains to be fixed.] +``` + +**Step 4: Print the recheck summary.** + +Print the summary table to the user, then STOP.
Example: + +``` +# Recheck Complete + +Checked 19 bugs from quality/BUGS.md against current source. + +| Status | Count | +|--------|-------| +| Fixed | 17 | +| Still open | 2 | +| **Total** | **19** | + +Fixed bugs: BUG-001, BUG-002, BUG-003, BUG-004, BUG-005, BUG-006, BUG-007, +BUG-008, BUG-009, BUG-010, BUG-011, BUG-013, BUG-014, BUG-015, BUG-016, +BUG-017, BUG-018 + +Still open: BUG-012 (stale .orig file still present), BUG-019 (benchmark 40 +artifact list not updated) + +Results saved to: +- quality/results/recheck-results.json (machine-readable) +- quality/results/recheck-summary.md (human-readable) +``` + +### Triggering recheck mode + +Recheck mode activates when the user says any of: "recheck", "verify the bug fixes", "check which bugs are fixed", "recheck the bugs", "run recheck mode", or similar phrasing that clearly indicates they want to verify fixes rather than find new bugs. When triggered, skip Phases 1-7 entirely and execute only the recheck procedure above. + +--- + ## Phase 7: Present, Explore, Improve (Interactive) After generating and verifying, present the results clearly and give the user control over what happens next. This phase has three parts: a scannable summary, drill-down on demand, and a menu of improvement paths. -**Do not skip this phase.** The autonomous output from Phases 1-3 is a solid starting point, but the user needs to understand what was generated, explore what matters to them, and choose how to improve it. A quality playbook is only useful if the people who own the project trust it and understand it. Dumping six files without explanation creates artifacts nobody reads. +**Do not skip this phase.** The autonomous output from Phases 1-6 is a solid starting point, but the user needs to understand what was generated, explore what matters to them, and choose how to improve it. A quality playbook is only useful if the people who own the project trust it and understand it. 
Dumping a directory of files without explanation creates artifacts nobody reads. ### Part 1: The Summary Table @@ -2060,10 +2227,13 @@ Read these as you work through each phase: | File | When to Read | Contains | |------|-------------|----------| +| `references/exploration_patterns.md` | Phase 1 (explore) | Pattern applicability matrix, deep-dive templates, domain-knowledge questions | | `references/defensive_patterns.md` | Step 5 (finding skeletons) | Grep patterns, how to convert findings to scenarios | | `references/schema_mapping.md` | Step 5b (schema types) | Field mapping format, mutation validity rules | +| `references/requirements_pipeline.md` | Phase 2 (requirements) | Five-phase pipeline, versioning protocol, carry-forward rules | | `references/constitution.md` | File 1 (QUALITY.md) | Full template with section-by-section guidance | | `references/functional_tests.md` | File 2 (functional tests) | Test structure, anti-patterns, cross-variant strategy | | `references/review_protocols.md` | Files 3–4 (code review, integration) | Templates for both protocols, patch validation, skip guards | | `references/spec_audit.md` | File 5 (Council of Three) | Full audit protocol, triage process, fix execution | +| `references/iteration.md` | Iterations (after Phase 6) | Four iteration strategies: gap, unfiltered, parity, adversarial | | `references/verification.md` | Phase 6 (verify) | Complete self-check checklist (45 benchmarks) including structured output, patch gate, skip guard validation, pre-flight discovery, version stamps, bug writeups, enumeration completeness, triage executable evidence, code-extracted enumeration lists, mechanical verification artifacts, source-inspection test execution, contradiction gate, seed check execution, convergence tracking, sidecar JSON schema validation, script-verified closure gate, canonical use case identifiers, and writeup inline fix diffs | diff --git a/skills/quality-playbook/quality_gate.sh b/skills/quality-playbook/quality_gate.sh
index 11a59937a..3fb989a40 100755
--- a/skills/quality-playbook/quality_gate.sh
+++ b/skills/quality-playbook/quality_gate.sh
@@ -41,7 +41,7 @@ STRICTNESS="benchmark"  # "benchmark" (default) or "general"
 
 # Parse args
 EXPECT_VERSION=false
-for arg in "$@"; do
+for arg in ${@+"$@"}; do
   if [ "$EXPECT_VERSION" = true ]; then
     VERSION="$arg"
     EXPECT_VERSION=false
@@ -58,7 +58,7 @@ done
 
 # Detect version from SKILL.md — try multiple locations
 if [ -z "$VERSION" ]; then
-  for loc in "${SCRIPT_DIR}/../SKILL.md" "${SCRIPT_DIR}/SKILL.md" ".github/skills/SKILL.md"; do
+  for loc in "${SCRIPT_DIR}/../SKILL.md" "${SCRIPT_DIR}/SKILL.md" "SKILL.md" ".claude/skills/quality-playbook/SKILL.md" ".github/skills/SKILL.md" ".github/skills/quality-playbook/SKILL.md"; do
     if [ -f "$loc" ]; then
       VERSION=$(grep -m1 'version:' "$loc" 2>/dev/null | sed 's/.*version: *//' | tr -d ' ')
       [ -n "$VERSION" ] && break
@@ -84,10 +84,10 @@ json_str_val() {
     | head -1 | sed 's/.*: *"\([^"]*\)"/\1/'
 }
 
-# Helper: count occurrences of a key in JSON
+# Helper: count occurrences of a key in JSON (matches key: value pairs only)
 json_key_count() {
   local file="$1" key="$2"
-  grep -c "\"${key}\"" "$file" 2>/dev/null || echo 0
+  { grep -c "\"${key}\"[[:space:]]*:" "$file" 2>/dev/null || echo 0; } | head -1
 }
 
 check_repo() {
@@ -112,6 +112,33 @@ check_repo() {
     fi
   done
 
+  # Additional required artifacts (Required: Yes in SKILL.md artifact contract table)
+  for f in CONTRACTS.md RUN_CODE_REVIEW.md RUN_SPEC_AUDIT.md RUN_INTEGRATION_TESTS.md RUN_TDD_TESTS.md; do
+    if [ -f "${q}/${f}" ]; then
+      pass "${f} exists"
+    else
+      fail "${f} missing"
+    fi
+  done
+  # Functional test file — check all SKILL.md-documented naming patterns
+  if ls ${q}/test_functional.* ${q}/FunctionalSpec.* ${q}/FunctionalTest.* ${q}/functional.test.* >/dev/null 2>&1; then
+    pass "functional test file exists"
+  else
+    fail "functional test file missing (test_functional.*, FunctionalSpec.*, FunctionalTest.*, functional.test.*)"
+  fi
+  # AGENTS.md — required per SKILL.md
artifact contract table (Phase 2) + if [ -f "${repo_dir}/AGENTS.md" ]; then + pass "AGENTS.md exists" + else + fail "AGENTS.md missing (required at project root)" + fi + # EXPLORATION.md — mandatory in all modes (SKILL.md line 259) + if [ -f "${q}/EXPLORATION.md" ]; then + pass "EXPLORATION.md exists" + else + fail "EXPLORATION.md missing" + fi + # Code reviews dir if [ -d "${q}/code_reviews" ] && [ -n "$(ls ${q}/code_reviews/*.md 2>/dev/null)" ]; then pass "code_reviews/ has .md files" @@ -158,7 +185,9 @@ check_repo() { correct_headings=${correct_headings:-0} wrong_headings=$(grep -E '^## BUG-[0-9]+' "${q}/BUGS.md" 2>/dev/null | grep -cvE '^### BUG-' || true) wrong_headings=${wrong_headings:-0} - local bold_headings bullet_headings + local deep_headings bold_headings bullet_headings + deep_headings=$(grep -cE '^#{4,} BUG-[0-9]+' "${q}/BUGS.md" || true) + deep_headings=${deep_headings:-0} bold_headings=$(grep -cE '^\*\*BUG-[0-9]+' "${q}/BUGS.md" || true) bold_headings=${bold_headings:-0} bullet_headings=$(grep -cE '^- BUG-[0-9]+' "${q}/BUGS.md" || true) @@ -166,10 +195,11 @@ check_repo() { bug_count=$correct_headings - if [ "$correct_headings" -gt 0 ] && [ "$wrong_headings" -eq 0 ] && [ "$bold_headings" -eq 0 ] && [ "$bullet_headings" -eq 0 ]; then + if [ "$correct_headings" -gt 0 ] && [ "$wrong_headings" -eq 0 ] && [ "$deep_headings" -eq 0 ] && [ "$bold_headings" -eq 0 ] && [ "$bullet_headings" -eq 0 ]; then pass "All ${correct_headings} bug headings use ### BUG-NNN format" else [ "$wrong_headings" -gt 0 ] && fail "${wrong_headings} heading(s) use ## instead of ###" + [ "$deep_headings" -gt 0 ] && fail "${deep_headings} heading(s) use #### or deeper instead of ###" [ "$bold_headings" -gt 0 ] && fail "${bold_headings} heading(s) use **BUG- format" [ "$bullet_headings" -gt 0 ] && fail "${bullet_headings} heading(s) use - BUG- format" if [ "$correct_headings" -eq 0 ] && [ "$wrong_headings" -eq 0 ]; then @@ -177,7 +207,7 @@ check_repo() { pass "Zero-bug run — no 
headings expected" else # Count wrong-format headings as bugs for patch check - bug_count=$((wrong_headings + bold_headings + bullet_headings)) + bug_count=$((wrong_headings + deep_headings + bold_headings + bullet_headings)) warn "No ### BUG-NNN headings found in BUGS.md" fi else @@ -225,8 +255,8 @@ check_repo() { fi done - # Summary must include confirmed_open, red_failed, green_failed - for skey in confirmed_open red_failed green_failed; do + # Summary must include all 5 required keys + for skey in total verified confirmed_open red_failed green_failed; do if json_has_key "$json_file" "$skey"; then pass "summary has '${skey}'" else @@ -282,10 +312,18 @@ check_repo() { local bug_ids bug_ids=$(grep -oE 'BUG-[0-9]+' "${q}/BUGS.md" 2>/dev/null \ | grep -E '^BUG-[0-9]+$' | sort -u -t'-' -k2,2n) + local red_bad_tag=0 green_bad_tag=0 for bid in $bug_ids; do # Red-phase log — required for every confirmed bug if [ -f "${q}/results/${bid}.red.log" ]; then red_found=$((red_found + 1)) + # Validate first-line status tag + local red_tag + red_tag=$(head -1 "${q}/results/${bid}.red.log" 2>/dev/null | tr -d '[:space:]') + case "$red_tag" in + RED|GREEN|NOT_RUN|ERROR) ;; + *) red_bad_tag=$((red_bad_tag + 1)) ;; + esac else red_missing=$((red_missing + 1)) fi @@ -294,6 +332,13 @@ check_repo() { green_expected=$((green_expected + 1)) if [ -f "${q}/results/${bid}.green.log" ]; then green_found=$((green_found + 1)) + # Validate first-line status tag + local green_tag + green_tag=$(head -1 "${q}/results/${bid}.green.log" 2>/dev/null | tr -d '[:space:]') + case "$green_tag" in + RED|GREEN|NOT_RUN|ERROR) ;; + *) green_bad_tag=$((green_bad_tag + 1)) ;; + esac else green_missing=$((green_missing + 1)) fi @@ -317,6 +362,26 @@ check_repo() { else info "No fix patches found — green-phase logs not required" fi + + # Status tag validation + if [ "$red_bad_tag" -gt 0 ]; then + fail "${red_bad_tag} red-phase log(s) missing valid first-line status tag (expected RED/GREEN/NOT_RUN/ERROR)" + elif [ 
"$red_found" -gt 0 ]; then + pass "All red-phase logs have valid status tags" + fi + if [ "$green_bad_tag" -gt 0 ]; then + fail "${green_bad_tag} green-phase log(s) missing valid first-line status tag (expected RED/GREEN/NOT_RUN/ERROR)" + elif [ "$green_found" -gt 0 ]; then + pass "All green-phase logs have valid status tags" + fi + # TDD_TRACEABILITY.md — mandatory when bugs have red-phase results (benchmark 28) + if [ "$red_found" -gt 0 ]; then + if [ -f "${q}/TDD_TRACEABILITY.md" ]; then + pass "TDD_TRACEABILITY.md exists (${red_found} bugs with red-phase results)" + else + fail "TDD_TRACEABILITY.md missing (mandatory when bugs have red-phase results)" + fi + fi else info "Zero bugs — TDD log files not required" fi @@ -329,6 +394,32 @@ check_repo() { json_has_key "$ij" "$key" && pass "has '${key}'" || fail "missing key '${key}'" done + # schema_version value (must be "1.1" — same check as tdd-results.json) + local isv + isv=$(json_str_val "$ij" "schema_version") + [ "$isv" = "1.1" ] && pass "integration schema_version is '1.1'" || fail "integration schema_version is '${isv:-missing}', expected '1.1'" + + # Date validation — same checks as tdd-results.json + local int_date + int_date=$(json_str_val "$ij" "date") + if [ -n "$int_date" ]; then + if echo "$int_date" | grep -qE '^[0-9]{4}-[0-9]{2}-[0-9]{2}$'; then + if [ "$int_date" = "YYYY-MM-DD" ] || [ "$int_date" = "0000-00-00" ]; then + fail "integration-results.json date is placeholder '${int_date}'" + else + local today_int + today_int=$(date +%Y-%m-%d) + if [[ "$int_date" > "$today_int" ]]; then + fail "integration-results.json date '${int_date}' is in the future" + else + pass "integration-results.json date '${int_date}' is valid" + fi + fi + else + fail "integration-results.json date '${int_date}' is not ISO 8601 (YYYY-MM-DD)" + fi + fi + # Recommendation enum local rec rec=$(json_str_val "$ij" "recommendation") @@ -392,17 +483,16 @@ check_repo() { # Detect project language using find (portable, no globstar 
needed). # Exclude vendor/, node_modules/, .git/, and quality/ to avoid false positives. local detected_lang="" - local find_exclude="-not -path '*/vendor/*' -not -path '*/node_modules/*' -not -path '*/.git/*' -not -path '*/quality/*'" - if eval "find '${repo_dir}' -maxdepth 3 ${find_exclude} -name '*.go' -print -quit" 2>/dev/null | grep -q .; then detected_lang="go" - elif eval "find '${repo_dir}' -maxdepth 3 ${find_exclude} -name '*.py' -print -quit" 2>/dev/null | grep -q .; then detected_lang="py" - elif eval "find '${repo_dir}' -maxdepth 3 ${find_exclude} -name '*.java' -print -quit" 2>/dev/null | grep -q .; then detected_lang="java" - elif eval "find '${repo_dir}' -maxdepth 3 ${find_exclude} -name '*.kt' -print -quit" 2>/dev/null | grep -q .; then detected_lang="kt" - elif eval "find '${repo_dir}' -maxdepth 3 ${find_exclude} -name '*.rs' -print -quit" 2>/dev/null | grep -q .; then detected_lang="rs" - elif eval "find '${repo_dir}' -maxdepth 3 ${find_exclude} -name '*.ts' -print -quit" 2>/dev/null | grep -q .; then detected_lang="ts" - elif eval "find '${repo_dir}' -maxdepth 3 ${find_exclude} -name '*.js' -print -quit" 2>/dev/null | grep -q .; then detected_lang="js" - elif eval "find '${repo_dir}' -maxdepth 3 ${find_exclude} -name '*.scala' -print -quit" 2>/dev/null | grep -q .; then detected_lang="scala" - elif eval "find '${repo_dir}' -maxdepth 3 ${find_exclude} -name '*.c' -print -quit" 2>/dev/null | grep -q .; then detected_lang="c" - elif eval "find '${repo_dir}' -maxdepth 3 ${find_exclude} -name '*.agc' -print -quit" 2>/dev/null | grep -q .; then detected_lang="agc" + if find "${repo_dir}" -maxdepth 3 -not -path '*/vendor/*' -not -path '*/node_modules/*' -not -path '*/.git/*' -not -path '*/quality/*' -name '*.go' -print -quit 2>/dev/null | grep -q .; then detected_lang="go" + elif find "${repo_dir}" -maxdepth 3 -not -path '*/vendor/*' -not -path '*/node_modules/*' -not -path '*/.git/*' -not -path '*/quality/*' -name '*.py' -print -quit 2>/dev/null | grep 
-q .; then detected_lang="py" + elif find "${repo_dir}" -maxdepth 3 -not -path '*/vendor/*' -not -path '*/node_modules/*' -not -path '*/.git/*' -not -path '*/quality/*' -name '*.java' -print -quit 2>/dev/null | grep -q .; then detected_lang="java" + elif find "${repo_dir}" -maxdepth 3 -not -path '*/vendor/*' -not -path '*/node_modules/*' -not -path '*/.git/*' -not -path '*/quality/*' -name '*.kt' -print -quit 2>/dev/null | grep -q .; then detected_lang="kt" + elif find "${repo_dir}" -maxdepth 3 -not -path '*/vendor/*' -not -path '*/node_modules/*' -not -path '*/.git/*' -not -path '*/quality/*' -name '*.rs' -print -quit 2>/dev/null | grep -q .; then detected_lang="rs" + elif find "${repo_dir}" -maxdepth 3 -not -path '*/vendor/*' -not -path '*/node_modules/*' -not -path '*/.git/*' -not -path '*/quality/*' -name '*.ts' -print -quit 2>/dev/null | grep -q .; then detected_lang="ts" + elif find "${repo_dir}" -maxdepth 3 -not -path '*/vendor/*' -not -path '*/node_modules/*' -not -path '*/.git/*' -not -path '*/quality/*' -name '*.js' -print -quit 2>/dev/null | grep -q .; then detected_lang="js" + elif find "${repo_dir}" -maxdepth 3 -not -path '*/vendor/*' -not -path '*/node_modules/*' -not -path '*/.git/*' -not -path '*/quality/*' -name '*.scala' -print -quit 2>/dev/null | grep -q .; then detected_lang="scala" + elif find "${repo_dir}" -maxdepth 3 -not -path '*/vendor/*' -not -path '*/node_modules/*' -not -path '*/.git/*' -not -path '*/quality/*' -name '*.c' -print -quit 2>/dev/null | grep -q .; then detected_lang="c" + elif find "${repo_dir}" -maxdepth 3 -not -path '*/vendor/*' -not -path '*/node_modules/*' -not -path '*/.git/*' -not -path '*/quality/*' -name '*.agc' -print -quit 2>/dev/null | grep -q .; then detected_lang="agc" fi if [ -n "$detected_lang" ]; then @@ -504,7 +594,8 @@ check_repo() { if [ -d "${q}/writeups" ]; then writeup_count=$(ls ${q}/writeups/BUG-*.md 2>/dev/null | wc -l | tr -d ' ') # Check each writeup for inline diff (section 6 requirement) - for wf 
in ${q}/writeups/BUG-*.md; do + # Note: the [ -f "$wf" ] guard handles the case where the glob doesn't match + for wf in "${q}"/writeups/BUG-*.md; do [ -f "$wf" ] || continue if grep -q '```diff' "$wf" 2>/dev/null; then writeup_diff_count=$((writeup_diff_count + 1)) @@ -514,7 +605,7 @@ check_repo() { if [ "$writeup_count" -ge "$bug_count" ]; then pass "${writeup_count} writeup(s) for ${bug_count} bug(s)" elif [ "$writeup_count" -gt 0 ]; then - warn "${writeup_count} writeup(s) for ${bug_count} bug(s) — incomplete" + fail "${writeup_count} writeup(s) for ${bug_count} bug(s) — all confirmed bugs require writeups (SKILL.md line 1454)" else fail "No writeups for ${bug_count} confirmed bug(s)" fi @@ -534,7 +625,7 @@ check_repo() { # --- Version stamp consistency (benchmark 26) --- echo "[Version Stamps]" local skill_version="" - for loc in "${repo_dir}/.github/skills/SKILL.md" "${repo_dir}/SKILL.md"; do + for loc in "${repo_dir}/SKILL.md" "${repo_dir}/.claude/skills/quality-playbook/SKILL.md" "${repo_dir}/.github/skills/SKILL.md" "${repo_dir}/.github/skills/quality-playbook/SKILL.md" "${SCRIPT_DIR}/../SKILL.md" "${SCRIPT_DIR}/SKILL.md"; do if [ -f "$loc" ]; then skill_version=$(grep -m1 'version:' "$loc" 2>/dev/null | sed 's/.*version: *//' | tr -d ' ') [ -n "$skill_version" ] && break @@ -592,7 +683,7 @@ elif [ ${#REPO_DIRS[@]} -eq 1 ] && [ "${REPO_DIRS[0]}" = "." 
]; then REPO_DIRS=("$(pwd)") else resolved=() - for name in "${REPO_DIRS[@]}"; do + for name in ${REPO_DIRS[@]+"${REPO_DIRS[@]}"}; do if [ -d "$name/quality" ]; then resolved+=("$name") elif [ -d "${SCRIPT_DIR}/${name}-${VERSION}" ]; then @@ -603,7 +694,7 @@ else echo "WARNING: Cannot find repo '${name}'" fi done - REPO_DIRS=("${resolved[@]}") + REPO_DIRS=(${resolved[@]+"${resolved[@]}"}) fi if [ ${#REPO_DIRS[@]} -eq 0 ]; then diff --git a/skills/quality-playbook/references/review_protocols.md b/skills/quality-playbook/references/review_protocols.md index 73975d35d..cc7395e9b 100644 --- a/skills/quality-playbook/references/review_protocols.md +++ b/skills/quality-playbook/references/review_protocols.md @@ -373,7 +373,7 @@ As each test runs, report a one-line status update. Keep it compact — the user Use `✓` for pass, `✗` for fail, `⧗` for in-progress. If a test fails, show one line of context (the error message or assertion that failed), not the full stack trace. The user can ask for details if they want them. -### Phase 6: Results +### Phase 3: Results After all tests complete, show a summary table and a recommendation: diff --git a/skills/quality-playbook/references/verification.md b/skills/quality-playbook/references/verification.md index 1f553d463..2111b246e 100644 --- a/skills/quality-playbook/references/verification.md +++ b/skills/quality-playbook/references/verification.md @@ -72,7 +72,7 @@ Run the project's full test suite (not just your new tests). Your new files shou Every scenario should mention actual function names, file names, or patterns that exist in the codebase. Grep for each reference to confirm it exists. -If working from non-formal requirements, verify that each scenario and test includes a requirement tag using the canonical format: `[Req: formal — README §3]`, `[Req: inferred — from validate_input() behavior]`, `[Req: user-confirmed — "must handle empty input"]`. Inferred requirements should be flagged for user review in Phase 7. 
+If working from non-formal requirements, verify that each scenario and test includes a requirement tag using the canonical format: `[Req: formal — README §3]`, `[Req: inferred — from validate_input() behavior]`, `[Req: user-confirmed — "must handle empty input"]`. Inferred requirements should be flagged for user review in the Phase 7 interactive session. ### 11. RUN_CODE_REVIEW.md Is Self-Contained @@ -140,9 +140,9 @@ Grep every generated Markdown file in `quality/` for the attribution line: `Gene Verify that the code review (Pass 1 and Pass 2) performed mechanical two-list enumeration checks wherever the code uses `switch`/`case`, `match`, or if-else chains to dispatch on named constants. For each such check, the review must show: (a) the list of constants defined in headers/enums/specs, (b) the list of case labels actually present in the code, (c) any gaps. A review that claims "the whitelist covers all values" or "all cases are handled" without showing the two-list comparison is non-conformant — this is the specific hallucination pattern the check prevents. -### 20. Bug Writeups Generated for TDD-Verified Bugs +### 20. Bug Writeups Generated for All Confirmed Bugs -For each bug with `verdict: "TDD verified"` in `tdd-results.json`, verify that a corresponding `quality/writeups/BUG-NNN.md` file exists and that `tdd-results.json` has a non-null `writeup_path` for that bug. Each writeup must include: summary, spec reference, code citation, observable consequence, fix diff, and test description. A TDD-verified bug without a writeup is incomplete. +For each bug in `tdd-results.json` (both `verdict: "TDD verified"` and `verdict: "confirmed open"`), verify that a corresponding `quality/writeups/BUG-NNN.md` file exists and that `tdd-results.json` has a non-null `writeup_path` for that bug. Each writeup must include: summary, spec reference, code citation, observable consequence, fix diff, and test description. A confirmed bug without a writeup is incomplete. ### 21. 
Triage Verification Probes Include Executable Evidence @@ -169,7 +169,7 @@ Verify that no executed artifact contradicts a prose artifact at closure. Specif ### 26. Version Stamp Consistency -Read the `version:` field from the SKILL.md metadata (in `.github/skills/SKILL.md`). Check every generated artifact: PROGRESS.md's `Skill version:` field, every `> Generated by` attribution line, every code file header stamp, and every sidecar JSON `skill_version` field. Every version stamp must match the SKILL.md metadata exactly. A single mismatch is a benchmark failure. This check exists because in v1.3.21 benchmarking, 5 of 9 repos had version stamps from older skill versions due to a hardcoded template. +Read the `version:` field from the SKILL.md metadata (locate SKILL.md in the skill installation directory — typically `.github/skills/SKILL.md` or `.claude/skills/quality-playbook/SKILL.md`). Check every generated artifact: PROGRESS.md's `Skill version:` field, every `> Generated by` attribution line, every code file header stamp, and every sidecar JSON `skill_version` field. Every version stamp must match the SKILL.md metadata exactly. A single mismatch is a benchmark failure. This check exists because in v1.3.21 benchmarking, 5 of 9 repos had version stamps from older skill versions due to a hardcoded template. ### 27. Mechanical Directory Conformance @@ -225,7 +225,7 @@ Every confirmed bug in BUGS.md must use the heading level `### BUG-NNN`. Grep fo ### 40. Artifact File-Existence Gate Passed -Before Phase 5 is marked complete, verify that all required artifacts exist as files on disk — not just referenced in PROGRESS.md. Required files: BUGS.md, REQUIREMENTS.md, QUALITY.md, PROGRESS.md, COVERAGE_MATRIX.md, COMPLETENESS_REPORT.md. If Phase 3 ran: at least one file in code_reviews/. If Phase 4 ran: at least one auditor file and a triage file in spec_audits/. If Phase 0 or 0b ran: SEED_CHECKS.md as a standalone file. If confirmed bugs exist: tdd-results.json in results/. 
This benchmark exists because v1.3.24 benchmarking showed express writing a terminal gate section to PROGRESS.md claiming 1 confirmed bug, but BUGS.md, code review files, and spec audit files were never written to disk.
+Before Phase 5 is marked complete, verify that all required artifacts exist as files on disk — not just referenced in PROGRESS.md. Required files: EXPLORATION.md, BUGS.md, REQUIREMENTS.md, QUALITY.md, PROGRESS.md, COVERAGE_MATRIX.md, COMPLETENESS_REPORT.md, CONTRACTS.md, test_functional.* (or language-appropriate alternative: FunctionalSpec.*, FunctionalTest.*, functional.test.*), RUN_CODE_REVIEW.md, RUN_INTEGRATION_TESTS.md, RUN_SPEC_AUDIT.md, RUN_TDD_TESTS.md, and AGENTS.md (at project root). If Phase 3 ran: at least one file in code_reviews/. If Phase 4 ran: at least one auditor file and a triage file in spec_audits/. If Phase 0 or 0b ran: SEED_CHECKS.md as a standalone file. If confirmed bugs exist: tdd-results.json in results/. If any bug has a red-phase result: TDD_TRACEABILITY.md. This benchmark exists because v1.3.24 benchmarking showed express mode writing a terminal gate section to PROGRESS.md claiming 1 confirmed bug, while BUGS.md, the code review files, and the spec audit files were never written to disk.

### 41. Sidecar JSON Post-Write Validation