
Overhaul the Claude-driven docs PR review pipeline (v1) #18680

Draft

CamSoper wants to merge 194 commits into master from CamSoper/pr-review-overhaul

Conversation

@CamSoper (Contributor) commented Apr 23, 2026

Replaces the legacy single-comment Claude review on pulumi/docs with a domain-aware, re-entrant pipeline. Goal: keep maintainer-grade fact-checking running on every PR as agentic workflows raise contribution velocity beyond manual review capacity.

What ships

Two skill packages working as a pair:

  • docs-review (CI) — runs on every PR. A deterministic Python classifier routes each PR to the right domain reference: docs, blog, infra, programs, or website. Posts a pinned <!-- CLAUDE_REVIEW --> comment with a status table + tiered findings (🚨 / ⚠️ / 💡 / ✅) + suggestion blocks. Re-entrant via @claude mention (refresh / dispute / re-verify).
  • pr-review (interactive) — local maintainer skill (/pr-review <PR#>). Reads the latest CI-posted review, applies a trust-and-scrutiny model, presents an action menu (approve / request changes / make changes / close). Optimized for "act on this PR right now without re-reading everything."

Triage gating. Trivial PRs (≤10 added lines, ≤2 docs/blog files) and frontmatter-only PRs short-circuit through a fast Haiku spelling/grammar pass instead of a full Opus review. Marketing/legal pages route to a dedicated domain:website review with verification-ask framing per FTC truthfulness norms.
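The triage gate above can be sketched as a small predicate. This is a hypothetical illustration, not the shipped classifier (triage-classify.py): the two caps come from the text, but the path prefixes and the docs/blog-only condition are assumptions.

```python
# Hypothetical sketch of the trivial-PR gate. The two caps are from the
# spec; the path prefixes and "docs/blog files only" rule are assumptions.
TRIVIAL_MAX_ADDED_LINES = 10
TRIVIAL_MAX_FILES = 2
DOCS_BLOG_PREFIXES = ("content/docs/", "content/blog/")


def is_trivial(added_lines: int, changed_files: list[str]) -> bool:
    """Trivial PRs short-circuit to the fast Haiku spelling/grammar pass."""
    docs_blog = [f for f in changed_files if f.startswith(DOCS_BLOG_PREFIXES)]
    return (
        added_lines <= TRIVIAL_MAX_ADDED_LINES
        and len(docs_blog) == len(changed_files)  # assumed: docs/blog only
        and len(docs_blog) <= TRIVIAL_MAX_FILES
    )
```
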

Workflows. claude-triage.yml (classify + prose check on pull_request), claude-code-review.yml (full review on ready_for_review), claude.yml (re-entrant on @claude mention).

Pinned-comment script. _common/scripts/pinned-comment.sh manages the review as a single logical comment sequence (<!-- CLAUDE_REVIEW N/M -->) with in-place edits, overflow append, and tail prune.
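The single-logical-comment idea can be illustrated in Python even though the real script is shell: locate the `<!-- CLAUDE_REVIEW N/M -->` parts and return them in order so edits land in place. The comment-dict shape here is an assumption.

```python
import re

# Marker format from the description: <!-- CLAUDE_REVIEW N/M -->
MARKER = re.compile(r"<!-- CLAUDE_REVIEW (\d+)/(\d+) -->")


def review_comment_sequence(comments: list[dict]) -> list[dict]:
    """Return the pinned-review comments in part order (1/M .. M/M).

    `comments` is a list of {"id": ..., "body": ...} dicts -- a
    hypothetical shape; the real pinned-comment.sh drives the gh CLI.
    """
    parts = []
    for c in comments:
        m = MARKER.search(c["body"])
        if m:
            parts.append((int(m.group(1)), c))
    return [c for _, c in sorted(parts, key=lambda t: t[0])]
```
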

Domain references. references/{shared-criteria,docs,blog,infra,programs,website,fact-check,prose-patterns,spelling-grammar,code-examples,image-review,output-format,update,domain-routing}.md. fact-check.md is the shared claim-extraction engine used by both CI and pr-review.

Labels. domain:{docs,blog,infra,programs,website,mixed} + review:{trivial,frontmatter-only,prose-flagged,claude-ran,claude-stale,claude-working} + needs-author-response. Deployed via scripts/labels/sync-labels.sh.

Contributor guidance. Draft-first posture, AI-authored-PR conventions, and re-entrant review semantics documented in CONTRIBUTING.md and AGENTS.md.

Benchmark

Validated head-to-head against the live legacy pipeline on 11 production PRs (full report at scratch/2026-05-01-live-comparison-v2/REPORT.md on branch):

  • 10 substantive bugs caught that legacy missed — every one would have shipped to production
  • 100% coverage of legacy's author-addressed substantive findings (correctly silent on already-fixed)
  • 0% false positive rate on both pipelines
  • Maintainer signal quality: 95% (new) vs 30% (legacy) on tier / evidence / grouping / suggestion-block axes
  • Cost: $0.65 per incremental shipped-defect prevented; 1.93× legacy on this sample, projected ~1.5× on production mix once trivial-skip fires at expected ~43% rate

Side-by-side comparison material at CamSoper/pulumi.docs#105–#115. Internal exec summary lives in Notion under Knowledge Preservation → Docs.

Notable catches the new pipeline made and legacy missed: workflow-breaking SAML/SCIM nav bugs (#18605), OutSystems source-misattribution propagated to LinkedIn/Bluesky social copy (#18647), broken /docs/ai/integrations/ link on a launch post (#18685), AGENTS.md canonical-path regressions (#18568, #18599), Java snippet truncation introduced while addressing legacy feedback (#18331).

One regression: PR 18573 trivial-cap edge case (4-line nav rewrite in a multi-section doc) — minor, soft-watch.

Status before merge

  • ✅ Cam fork validated end-to-end (all 5 domain paths exercised across 11 PRs)
  • ✅ All commits ready on CamSoper/pr-review-overhaul
  • ⏳ domain:website label still needs deployment to the pulumi/docs upstream label set: scripts/labels/sync-labels.sh --repo pulumi/docs
  • ⏳ Trivial-cap edge case (PR 18573 shape) — soft-watch, not a merge blocker

How to review

  • Quick read: the benchmark REPORT.md on this branch, or the side-by-side fork PRs at CamSoper/pulumi.docs#105–#115.
  • Diff-by-diff: start with .claude/commands/docs-review/scripts/triage-classify.py (the classifier), then .claude/commands/docs-review/references/{docs,blog,infra,programs,website}.md (the domain reviews), then .github/workflows/claude-{triage,code-review}.yml (the wiring).
  • Design history: SESSION-NOTES.md carries 18+ sessions of rationale. Sessions 5–7 (initial domain composition), 9–10 (shared-criteria + label rename), 12–13 (audit + cost optimization), 16–18 (e2e validation + trivial-cap calibration), 19 (domain:website + trivial/fmonly tightening to docs+blog only).

Notes

  • CI fact-check is public-sources-only by design: no Notion, no Slack MCP. Rationale lives in docs-review-ci.md.
  • Models: claude-opus-4-7 for initial domain reviews, claude-sonnet-4-6 for re-entrant updates, claude-haiku-4-5-20251001 for triage prose checks (50KB diff cap, JSON output).
  • SESSION-NOTES.md is for this PR's review cycle; marked for removal once merged.

Ship as draft; ready-flip after the upstream label deploy lands.

@pulumi-bot (Collaborator) commented Apr 23, 2026

CamSoper added a commit to CamSoper/pulumi.docs that referenced this pull request Apr 23, 2026
Fork-only tweak so claude.yml works without org-side ESC setup.
@claude retains all its capabilities (re-entrant reviews, Q&A,
make-changes on PRs) -- only difference is commits pushed with
GITHUB_TOKEN don't trigger downstream workflows, which is fine for
fork testing.

This commit is NOT for upstream. Origin/master and pulumi#18680
keep the ESC design. Do not cherry-pick.
CamSoper added a commit to CamSoper/pulumi.docs that referenced this pull request Apr 23, 2026
Fork-only tweak so claude.yml works without org-side ESC setup.
@claude retains all its capabilities (re-entrant reviews, Q&A,
make-changes on PRs) -- only difference is commits pushed with
GITHUB_TOKEN don't trigger downstream workflows, which is fine for
fork testing.

This commit is NOT for upstream. Origin/master and pulumi#18680
keep the ESC design. Do not cherry-pick.
CamSoper added a commit to CamSoper/pulumi.docs that referenced this pull request Apr 23, 2026
Fork-only tweak so claude.yml works without org-side ESC setup.
@claude retains all its capabilities (re-entrant reviews, Q&A,
make-changes on PRs) -- only difference is commits pushed with
GITHUB_TOKEN don't trigger downstream workflows, which is fine for
fork testing.

This commit is NOT for upstream. Origin/master and pulumi#18680
keep the ESC design. Do not cherry-pick.
CamSoper added a commit to CamSoper/pulumi.docs that referenced this pull request Apr 23, 2026
Fork-only tweak so claude.yml works without org-side ESC setup.
@claude retains all its capabilities (re-entrant reviews, Q&A,
make-changes on PRs) -- only difference is commits pushed with
GITHUB_TOKEN don't trigger downstream workflows, which is fine for
fork testing.

This commit is NOT for upstream. Origin/master and pulumi#18680
keep the ESC design. Do not cherry-pick.
CamSoper added a commit to CamSoper/pulumi.docs that referenced this pull request Apr 23, 2026
Fork-only tweak so claude.yml works without org-side ESC setup.
@claude retains all its capabilities (re-entrant reviews, Q&A,
make-changes on PRs) -- only difference is commits pushed with
GITHUB_TOKEN don't trigger downstream workflows, which is fine for
fork testing.

This commit is NOT for upstream. Origin/master and pulumi#18680
keep the ESC design. Do not cherry-pick.
CamSoper added a commit to CamSoper/pulumi.docs that referenced this pull request Apr 23, 2026
Fork-only tweak so claude.yml works without org-side ESC setup.
@claude retains all its capabilities (re-entrant reviews, Q&A,
make-changes on PRs) -- only difference is commits pushed with
GITHUB_TOKEN don't trigger downstream workflows, which is fine for
fork testing.

This commit is NOT for upstream. Origin/master and pulumi#18680
keep the ESC design. Do not cherry-pick.
CamSoper added a commit to CamSoper/pulumi.docs that referenced this pull request Apr 23, 2026
Fork-only tweak so claude.yml works without org-side ESC setup.
@claude retains all its capabilities (re-entrant reviews, Q&A,
make-changes on PRs) -- only difference is commits pushed with
GITHUB_TOKEN don't trigger downstream workflows, which is fine for
fork testing.

This commit is NOT for upstream. Origin/master and pulumi#18680
keep the ESC design. Do not cherry-pick.
CamSoper added a commit to CamSoper/pulumi.docs that referenced this pull request Apr 23, 2026
Fork-only tweak so claude.yml works without org-side ESC setup.
@claude retains all its capabilities (re-entrant reviews, Q&A,
make-changes on PRs) -- only difference is commits pushed with
GITHUB_TOKEN don't trigger downstream workflows, which is fine for
fork testing.

This commit is NOT for upstream. Origin/master and pulumi#18680
keep the ESC design. Do not cherry-pick.
CamSoper added a commit to CamSoper/pulumi.docs that referenced this pull request Apr 23, 2026
Fork-only tweak so claude.yml works without org-side ESC setup.
@claude retains all its capabilities (re-entrant reviews, Q&A,
make-changes on PRs) -- only difference is commits pushed with
GITHUB_TOKEN don't trigger downstream workflows, which is fine for
fork testing.

This commit is NOT for upstream. Origin/master and pulumi#18680
keep the ESC design. Do not cherry-pick.
CamSoper added a commit to CamSoper/pulumi.docs that referenced this pull request Apr 23, 2026
Fork-only tweak so claude.yml works without org-side ESC setup.
@claude retains all its capabilities (re-entrant reviews, Q&A,
make-changes on PRs) -- only difference is commits pushed with
GITHUB_TOKEN don't trigger downstream workflows, which is fine for
fork testing.

This commit is NOT for upstream. Origin/master and pulumi#18680
keep the ESC design. Do not cherry-pick.
CamSoper and others added 30 commits May 11, 2026 17:25
The validator's `internal-link-existence` check rejects links to pages the
PR itself is creating — the destination doesn't exist on the base branch
but will once the PR merges, so the link is valid. S34's pr18568 captures
hit this twice (the new Azure provider guides reference one another via
canonical paths the PR is adding).

Changes:
- New helper `gh_pr_diff_added_files()` queries the GitHub API for files
  with `status="added"` on the PR, returning a `set[str]` of relative
  paths.
- New `Context.diff_files_added: set[str]` field, populated in `cmd_check`
  alongside `diff_files`.
- `check_internal_link_existence` now accepts a candidate path if it's in
  the PR-added set, before falling through to the alias grep.

Validated:
- s34-runs/spot-check/pr18568-spot2 (was 1x internal-link): now 0.
- s34-runs/run1/pr18568-r1 (was 2x internal-link): now 0.
- s34-runs/spot-check/pr18647-spot2 (was clean): still clean.
- Synthetic guard (link to a path in neither base nor added): still flags.
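The two pieces described above can be sketched as follows. The helper name matches the commit message, but the signature, the `--jq` filter, and the check function's shape are assumptions about the real implementation.

```python
import subprocess


def gh_pr_diff_added_files(repo: str, pr: int) -> set[str]:
    """Query the GitHub API for files with status "added" on the PR.

    Sketch only: the real helper's signature may differ. Uses the gh CLI's
    per-page --jq filtering so pagination stays simple.
    """
    out = subprocess.run(
        ["gh", "api", f"repos/{repo}/pulls/{pr}/files", "--paginate",
         "--jq", '.[] | select(.status == "added") | .filename'],
        check=True, capture_output=True, text=True,
    ).stdout
    return set(out.splitlines())


def check_internal_link_existence(candidate: str,
                                  base_files: set[str],
                                  pr_added: set[str]) -> bool:
    # Accept a destination the PR itself is creating, before falling
    # through to the base-branch lookup (the alias grep in the real check).
    return candidate in pr_added or candidate in base_files
```
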
Empirical: across 83 pinned-review captures, the AI-drafting H3 section
rendered on 6 — all 6 on the canonical S30 fixture (pr-17240). Outside
the calibration corpus, the feature has never fired. Author-allowlist +
AI-trailer detection in claude-triage.yml already covers the obvious
cases at intake; the content-side detector was a backstop with no
demonstrated catch.

This commit removes the dedicated section + dispatch and re-routes the
specific tells that fit cleanly into Vale rules. Findings render as
existing-style nits in the low-confidence section; the maintainer
weighs them like any other prose flag. Hedged copy ("often appears in
AI-drafted prose; consider...") makes clear we're flagging a smell, not
asserting authorship.

Removed:
- prose-patterns.md §AI-drafting signals (6 detectors + Sonnet/Sonnet
  parallel-subagent dispatch). Replaced with a brief §AI-drafting tells
  that points at the four Vale rules below.
- output-format.md §AI-drafting signals (rendered section spec, the
  investigation-log bullet, the placeholder in the schematic, and the
  passing reference in §Subagent decomposition).
- validate-pinned.py: check_ai_drafting_threshold_section and its rule
  entry. INVESTIGATION_LOG_BULLETS goes from 9 to 8.

Added (styles/Pulumi/):
- SetPieceTransitions.yml — fixed phrase list from prose-patterns.md
  detector 2 ("But here's the thing", "Let's dive in", etc.)
- EmDashDensity.yml — paragraph-scope occurrence rule, max=2 (fires
  when any paragraph has 3+ em-dashes)
- ListicleH2Headings.yml — heading.h2-scope existence rule for
  numbered listicle prefixes (`**1.**`, `1.`, `Part N`, `Section N`,
  `Phase N`)
- HedgeThenPivot.yml — `While X, Y is also worth ...` /
  `Although X, what really matters is Y` constructions

vale-findings-filter.py: 4 new RULE_CATEGORIES entries so findings
surface with user-friendly category names.

Calibration on the 4 fixtures:
- pr-17240 (canonical): 2 hits (Part 1, Part 2 listicle H2s)
- pr-18647 (blog control): 0 hits
- pr-18605 (docs control): 0 hits
- pr-18599 (docs control): 0 hits across 8 files

Existing pinned-review captures re-validate cleanly under the trimmed
INVESTIGATION_LOG_BULLETS (extra "AI-drafting-signals pass" lines in
older captures are ignored, not flagged).

D1 (uniform per-section template) and D3 (parallel four-bullet lists)
from the original spec stay dropped — they don't fit Vale's scope model
cleanly, and the data shows they only ever fired on the calibration
fixture.
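The EmDashDensity rule's logic (paragraph scope, max=2, fires on a third em-dash) can be illustrated in Python. This is a parallel sketch of what the Vale rule checks, not the YAML rule itself; blank-line paragraph splitting is an assumption about how Vale scopes paragraphs here.

```python
EM_DASH = "\u2014"
MAX_PER_PARAGRAPH = 2  # the rule's max=2: a third em-dash in a paragraph fires


def emdash_dense_paragraphs(text: str) -> list[int]:
    """Indices of paragraphs containing 3+ em-dashes.

    Sketch of the EmDashDensity.yml occurrence rule; paragraphs are
    approximated as blank-line-separated blocks.
    """
    return [
        i for i, para in enumerate(text.split("\n\n"))
        if para.count(EM_DASH) > MAX_PER_PARAGRAPH
    ]
```
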
Inserts a deterministic Haiku 4.5 fix-pass into cmd_upsert_validated for
violation classes where the fix is a localized text edit. Closes the
S34 retry-loop dead-end where Opus's "fix" re-render reproduced the same
class of violation in a different form (e.g., another hallucinated link
path), so the soft-floor kept publishing bodies with leftover violations.

New: scripts/validator-fix.py
  Reads /tmp/validate-pinned.fix-me.json. Exits 2 if any violation falls
  outside the surgical set (caller proceeds to soft-floor without
  invoking Haiku). For surgical violations, dispatches one claude CLI
  call per violation with a class-specific prompt and tool use disabled,
  cap of 5 dispatches per body (cost ceiling).

  Surgical classes shipped (5 of 8 from s34-validator-loop-findings.md):
  - internal-link-existence (verified end-to-end on a synthetic case)
  - external-claim-pass2-outcome (verified end-to-end on
    s34-runs/run1/pr18685-r1, the real-world S34 capture)
  - shortcode-existence
  - bucket-bullet-line-range-prefix
  - mandatory-h3-order

  The other 3 (external-claim-state-format, -dispatch-metadata,
  -routed-metadata) follow the same single-line-edit shape and can be
  added with one prompt template each; deferred to a follow-up rather
  than padding this commit's untested surface.

Modified: scripts/pinned-comment.sh, cmd_upsert_validated
  On validator pass 1 fail: snapshot body to body.pre-haiku.bak, run
  validator-fix.py, re-validate. On success, publish; on persistent
  failure, restore from backup and return 1 (soft-floor publishes the
  ORIGINAL body, never a Haiku-degraded one). The re-validate step
  intentionally omits --soft-floor — we want a clean retry-0 verdict
  on the post-fix body, not a soft-floor downgrade.

Modified: scripts/per-tool-spend.py
  Adds Bash:validator-fix category at $0.015/call (one Haiku 4.5
  dispatch with medium prompt). Without this, validator-fix.py
  invocations bucket as Bash:other and the cost-variance reports
  miss the new spend line.

Tested:
- pr18685-r1 (external-claim-pass2-outcome): validator pass 1 fails →
  validator-fix dispatches Haiku → Pass 2 segment gets `(verified 2,
  contradicted 0, unverifiable 0)` appended → re-validate clean. ~35s.
- Synthetic internal-link-existence on pr18647-spot2: validator pass 1
  flags hallucinated link → validator-fix removes the [text](path)
  wrapper → re-validate clean. ~35s.
- Mixed-violation gate: surgical + frontmatter-locations (re-render
  required) → exit 2 immediately, no Haiku dispatch.

Per-call latency is dominated by claude CLI startup (~30s without
--bare). --bare requires ANTHROPIC_API_KEY explicitly; CI has it but
local OAuth users don't, so the CLI invocation is run without --bare
for portability. Optimization to direct SDK calls is a follow-up if
the wall-clock cost matters in production.
I'd dropped --bare during initial Ship 4 work because local testing
without ANTHROPIC_API_KEY errored "Not logged in" — and rationalized
the 30s startup cost as "follow-up" instead of fixing it. Cam caught
this: the action sets ANTHROPIC_API_KEY in env, and subprocess
invocations from pinned-comment.sh inherit it, so --bare works in CI.

--bare skips hooks, LSP, plugin sync, CLAUDE.md auto-discovery, and
keychain reads — drops dispatch latency from ~30s to ~2-3s per
violation. Local testing of this script now requires
ANTHROPIC_API_KEY=... in the environment.
Split the External claim verification "external" lane into two:
* Pass 2 -- consult `.fetched-urls.json` (workflow pre-step writes it)
* Pass 3 -- WebSearch + WebFetch fan-out for external-public claims
  with no URL in the diff

Stream-JSON audit of S35 captures showed docs reviews rendered Pass 2
routing without any Agent / WebFetch / WebSearch dispatches. Schema v5
adds four faithfulness floors so that drift can no longer pass review:

* `pass-2-fetch-faithfulness` -- F > 0 requires non-empty `.fetched-urls.json`.
* `pass-3-dispatch-mandate`   -- Y > I+P+F with empty fetched-urls must route
                                  to Pass 3 (S > 0).
* `pass-3-unverifiable-evidence` -- ⚠️ Pass 3 unverifiable verdicts must
                                     name the search that was attempted.
* `external-claim-pass3-outcome` -- mirror of pass2-outcome rule for the
                                     new lane (V/C/U attribution; sum check).

Routed-metadata regex extends to the optional `, S Pass 3` segment;
existing v4 captures re-validate cleanly (S Pass 3 = 0 by absence).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
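Two of the four v5 floors above can be sketched directly from their definitions. The single-letter count names (F, Y, I, P, S) follow the commit message; the dict representation is an assumption about the validator's internals.

```python
def check_faithfulness_floors(counts: dict, fetched_urls: list) -> list[str]:
    """Sketch of two schema-v5 floors; count keys follow the commit text.

    F = Pass 2 fetch-verified claims, S = Pass 3 dispatches,
    Y = external claims, I/P = other lanes (assumed meanings).
    """
    failures = []
    # pass-2-fetch-faithfulness: F > 0 requires non-empty .fetched-urls.json
    if counts.get("F", 0) > 0 and not fetched_urls:
        failures.append("pass-2-fetch-faithfulness")
    # pass-3-dispatch-mandate: Y > I+P+F with empty fetched-urls must
    # route to Pass 3 (S > 0)
    routed = counts.get("I", 0) + counts.get("P", 0) + counts.get("F", 0)
    if (not fetched_urls
            and counts.get("Y", 0) > routed
            and counts.get("S", 0) == 0):
        failures.append("pass-3-dispatch-mandate")
    return failures
```
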
Move the mechanical parts of the editorial-balance pass into a workflow
pre-step. Tier 1 (listicle / FAQ trigger detection, section-depth stats,
outlier flag) computes deterministically from the post-PR blog markdown
and writes `.editorial-balance.json`. Tier 2 (comparison trigger via
canonical entity list, entity counting, recommendation steering, FAQ-
answer voting) stays model-computed in S36 -- defer to S37 once Tier 1
stabilizes. Tier 3 (don't-flag exceptions) stays model-judged forever.

The validator's new `editorial-balance-counts-faithful` rule cross-checks
the rendered Editorial balance section's Tier 1 fields against the JSON:

* trigger=null in JSON forces empty-form rendering
* trigger != null forces rich-form rendering
* section count, mean, median, std must match (±10% tolerance)
* JSON-flagged outliers must appear in the rendered Section depth bullet

Rendered-vs-recomputed validation (`editorial-balance-counts`) stays in
place as a complementary check; both arrive at the same numbers when
Tier 1 is faithful, which is the load-bearing signal.

Empirical justification (S35 audit): editorial-balance fired only on
pr-17240 (11/86 captures), but on its target case it caught real signal
(Pulumi Neo section ~7.1× median, FAQ steering ≥60%, ⚠️ flags fired
correctly). The "always-on-calibration" pattern was rare comparison/
listicle/FAQ posts, not a broken feature -- so determinize Tier 1 rather
than dropping the section.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
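The deterministic Tier 1 computation (section-depth stats plus outlier flags, the fields the validator cross-checks within ±10%) can be sketched as below. The outlier factor here is an assumption; the commit cites a ~7.1× median catch but does not state the threshold.

```python
import statistics

OUTLIER_FACTOR = 3.0  # assumed cutoff; the real Tier 1 threshold isn't stated


def section_depth_stats(section_word_counts: dict[str, int]) -> dict:
    """Sketch of the .editorial-balance.json Tier 1 depth fields.

    Returns the count/mean/median/std the validator cross-checks, plus
    sections whose depth is >= OUTLIER_FACTOR x the median.
    """
    counts = list(section_word_counts.values())
    median = statistics.median(counts)
    return {
        "count": len(counts),
        "mean": statistics.mean(counts),
        "median": median,
        "std": statistics.pstdev(counts),
        "outliers": [name for name, c in section_word_counts.items()
                     if median and c / median >= OUTLIER_FACTOR],
    }
```
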
Add the 3 deferred External-claim-verification surgical classes to
validator-fix.py (S35's Ship 4 logged them as carry-overs):

* `external-claim-state-format` (rewrite the leading state form)
* `external-claim-dispatch-metadata` (append extraction-specialists tail)
* `external-claim-routed-metadata` (append routed-verification tail)

All three follow the same single-line edit shape as the already-shipped
`external-claim-pass2-outcome` template. The routed-metadata prompt now
also emits per-lane V/C/U attribution inline so a single Haiku-fix
recovers the canonical form (no chained second-pass needed).

Corpus-tested all 6 surgical classes (3 above + 3 untested-but-shipped
from S35: shortcode-existence, bucket-bullet-line-range-prefix,
mandatory-h3-order) against synthetic mutations of pr18647-spot2. All
recover on first Haiku-fix attempt:

  state-format          ✅ first attempt
  dispatch-metadata     ✅ first attempt
  routed-metadata       ✅ first attempt
  shortcode             ✅ first attempt
  bucket-prefix         ✅ first attempt (after sharpening prompt to
                            preserve trail anchor format)
  h3-order              ✅ first attempt

Bucket-prefix prompt sharpened to look up the exact trail anchor (was
inventing `[L40-40]` from a single-line `L40` trail record). Routed-
metadata prompt sharpened to preserve the dispatch-metadata segment
verbatim (was overwriting "cross-specialist corroborations" text).

Local-test path: HAIKU_TIMEOUT_S falls back to 120s when ANTHROPIC_API_KEY
is unset (OAuth mode adds ~30s of CLI startup). Production --bare mode
unaffected (~2-3s per dispatch; 60s timeout still applies).

Mutation seeds live in scratch: `s36-runs/synth/pr18647-spot2-mut-*.md`
(committed separately to the scratch repo for regression replay).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
fact-check.md §Inline lane: descriptive "<3 turns each" → prescriptive
hard cap of 5 gh CLI calls per claim, with explicit don't-iterate rule
against gh api repos/pulumi/docs/issues|pulls (exploration ≠ verification).
After cap, reclassify to ambiguous (→ Pass 1) or external-public
(→ Pass 2/3) and let the harder-verification lane take it.

per-tool-spend.py: emit ::warning:: GitHub Actions annotations when
Bash:gh > 25 OR num_turns > 80. Validated on existing S36 captures —
fires on pr18568 r2 (34 gh calls, the rabbit hole) and pr18647 r1
(30 gh + 83 turns). Doesn't fire on the 3 clean runs.
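The annotation emission described above can be sketched with GitHub Actions workflow-command syntax (`::warning::` lines on stdout). The thresholds are the ones named in the commit; everything else is illustrative.

```python
GH_CALLS_WARN = 25   # threshold from the commit: Bash:gh > 25
NUM_TURNS_WARN = 80  # threshold from the commit: num_turns > 80


def spend_annotations(gh_calls: int, num_turns: int) -> list[str]:
    """Sketch of per-tool-spend.py's warning emission.

    Returns GitHub Actions ::warning:: workflow-command lines; the
    caller prints them to stdout so the runner renders annotations.
    """
    notes = []
    if gh_calls > GH_CALLS_WARN:
        notes.append(f"::warning::Bash:gh calls ({gh_calls}) exceed {GH_CALLS_WARN}")
    if num_turns > NUM_TURNS_WARN:
        notes.append(f"::warning::num_turns ({num_turns}) exceeds {NUM_TURNS_WARN}")
    return notes
```
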
Audit of S36 stream-JSON captures grouped 27 actionable rejections into
4 patterns. One YAML-fixable; three need spec instructions.

claude-code-review.yml: add Bash(python3 -c:*) to --allowed-tools.
Closes Pattern 1 (inline python3 -c "..." rejected as multi-op part —
the path-prefix patterns over-restricted routine inline scripting).

ci.md Hard rule 7: three patterns the harness sandbox blocks regardless
of allow-list — write commands that avoid them.
- /tmp/ paths: filesystem-path policy blocks cat/grep/redirect; use
  the Read tool. Workspace root scratch files (.fetched-urls.json etc.)
  remain Bash-accessible.
- Shell control flow (for/while/case/if): multi-op decomposer rejects
  even when constituent commands are allow-listed. Use python3 -c "..."
  for iteration.
- Brace expansion / subshell grouping: same decomposer issue; expand
  manually or move to python3 -c "...".

ci.md §4: change cat /tmp/validate-pinned.fix-me.md example to Read
(the previous example contradicted the harness sandbox).

Audit notes: scratch/2026-05-06-final-battery/s37-runs/notes/
harness-friction-audit.md.
Inline-lane spec gains a canonical-source table mapping common claim
shapes (menu, example-program, sibling-pattern, schema, shortcode,
alias) to the path that resolves them, plus token-first / path-second
search-order rules and a shrug rule (3 targeted reads → ambiguous).

S37 post-session analysis (n=43 captures, S33-S37) showed the deep
inline runs were doing real cross-file verification, not wandering —
S37 pr18568 i=8 caught 3 critical structural bugs the i=13 cheap run
missed. The S37 per-claim cap fired against the wrong target. The
playbook gives the model better starting points so 3-5 calls/claim
closes the typical structural verification.

The existing per-claim 5-call cap stays as a backstop.
Per-claim 5-call cap stays as backstop. New per-PR 40-call cap is the
primary control: beyond ~40 inline gh calls, the model summarizes
unresolved pulumi-internal claims and dispatches a final Pass 1 batch
with the playbook embedded — that batch is the escalation tier.

per-tool-spend.py keeps the existing gh>25 ::warning:: (productivity-
zone observability) and adds gh>50 ::error:: (genuine over-spend).
The S37 pr151-r1 stream-JSON (75 gh calls) is the canonical historical
case: trips both annotations under the new spec, no regression on the
five other S37 captures.
New validator rule catches the inline-lane exploration patterns that
the canonical-source playbook (Ship A, fact-check.md §Inline lane) is
designed to displace. Trail evidence containing `gh api repos/.../
issues|pulls` or recursive `git/trees/<sha>?recursive=...` is flagged
— these don't read canonical source.

Self-validation:
- All 6 S37 captures re-validate clean under v6 (the trail evidence
  in those captures already cites canonical paths; the exploration
  was in tool calls, not trail evidence).
- Synthetic violation capture trips all 3 exploration patterns
  (issues, pulls, trees recursive).

Schema version: 5 → 6.
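The exploration-pattern flagging can be sketched as two regexes over trail evidence. These patterns are approximations of the rule described above, not the validator's exact expressions.

```python
import re

# Approximations of the v6 exploration patterns: issue/PR listings and
# recursive tree walks don't read canonical source.
EXPLORATION_PATTERNS = [
    re.compile(r"gh api repos/\S+/(issues|pulls)\b"),
    re.compile(r"git/trees/\S+\?recursive="),
]


def trail_cites_exploration(trail_evidence: str) -> bool:
    """Sketch of the validator rule: flag trail evidence whose cited
    commands explore the repo instead of reading canonical paths."""
    return any(p.search(trail_evidence) for p in EXPLORATION_PATTERNS)
```
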
Strip rationale from the canonical-source playbook — preamble cut to
one sentence, table rows lose parentheticals, search-order rules drop
the "why" tail, shrug rule trimmed to the directive. Same shape, same
intent, fewer runtime tokens.
S38 Ship D variance retest revealed that pr18568 r1+r2 both classified
the fixture as "not in a templated section" because the changed
file's directory had zero peers. The model checked the wrong path
(`content/docs/iac/clouds/<x>/guides/`) and didn't broaden the
search to the parallel category tree (`content/docs/iac/guides/
clouds/<x>/`) where the actual sibling set lives. Result: the
file-location, alias-collision, and menu-parent triplet that S37 r1
caught was silently dropped.

Add a zero-peer-check rule under §Cross-sibling consistency: when
the changed file's directory has 0 peers but the category has known
parallel pages elsewhere, search adjacent paths before concluding
"no siblings yet." The empty result is itself a sibling-consistency
claim and a 🚨 file-location finding.

Spike-tested next.
Pipeline-of-artifacts experiment. Discovery layer was independently
failable in S38 Ship D (pr18568 r1+r2 both classified the changed
file as "not in a templated section" and skipped the sibling sweep
that surfaces structural bugs). Ship F's inline zero-peer rule
helped on a single spike but depends on the model running it.

Encode the same logic deterministically as a workflow pre-step:
walks the docs tree per changed file, runs the parallel-path check
table-driven, emits .cross-sibling-discovery.json. Spec in
fact-check.md §Cross-sibling consistency now reads from the
artifact first; the structural_warning surfaces directly as a 🚨
file-location finding.

Self-tested on PR 140 (SAML JumpCloud, in_templated=true with 8
peers) and PR 151 (pr18568, parallel-path warning + canonical
sibling at content/docs/iac/guides/clouds/azure.md).

Smallest viable architectural pivot: one step, one artifact. If
the structural guarantee makes discovery reliable across runs,
the pipeline pivot is justified for further extraction in S39+.
Bundle 1 of the atomized discovery pattern (see new
docs-review:references:pre-computation for the architectural
codification that emerged across S38 Ship G + this work).

Ship H: frontmatter-validate.py + workflow wire-in + spec update.
Two checks bundled:

1. Menu-parent identifier resolution. For each menu.<name>.parent
   declared in PR-changed frontmatter, walk content/**/*.md to build
   a global menu-identifier map and check whether the parent resolves
   in the same named menu. The S37/S38 pr18568 case: menu.iac.parent:
   azure-clouds resolves only against menu.integrations — wrong-menu
   parent. Closes the L11 finding Ship G missed.

2. Alias collision detection. Build a global alias map from
   content/**/*.md, cross-reference the PR's declared aliases. Flag
   PR-internal collisions and repo-wide collisions (against existing
   canonical pages). Self-tested on pr18568: caught both
   /docs/clouds/azure/guides/ and /docs/clouds/azure/guides/providers/
   as repo-wide collisions with the existing canonical
   content/docs/iac/guides/clouds/azure.md.

Ship I: references/pre-computation.md is the architectural meta-doc.
Codifies the principle (scripts find structural facts, agent makes
editorial judgments), the bundle-by-reading-pattern architecture,
the false-positive triage contract, and how to add a new pre-step.
Lets S39+ ship the remaining bundles consistently without
rediscovering the pattern.

Self-tested locally on pr18568 and pr18605 (clean — no false positives
on the SAML JumpCloud fixture).

Spike retest pending.
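Check 2 (alias collision detection) can be sketched as a map lookup once the global alias map is built. The data shapes below are assumptions; the real script walks content/**/*.md frontmatter to build both maps.

```python
def alias_collisions(pr_aliases: dict[str, list[str]],
                     repo_aliases: dict[str, str]) -> list[tuple[str, str, str]]:
    """Sketch of frontmatter-validate.py's alias-collision check.

    pr_aliases: PR file -> aliases it declares (assumed shape).
    repo_aliases: existing alias -> owning file on the base branch.
    Returns (alias, pr_file, other_owner) for PR-internal and
    repo-wide collisions.
    """
    hits = []
    seen: dict[str, str] = {}
    for pr_file, aliases in pr_aliases.items():
        for alias in aliases:
            if alias in seen and seen[alias] != pr_file:
                hits.append((alias, pr_file, seen[alias]))       # PR-internal
            elif alias in repo_aliases and repo_aliases[alias] != pr_file:
                hits.append((alias, pr_file, repo_aliases[alias]))  # repo-wide
            seen[alias] = pr_file
    return hits
```
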
Cam pushback on cross-sibling-discover.py's PARALLEL_PATTERNS table:
hardcoded against the one observed pr18568 layout swap; brittle and
maintenance-heavy — every new layout divergence requires a code edit.

Replace with a data-driven approach using signals the codebase already
curates intentionally: Hugo `aliases:` frontmatter (declared per file
when content is moved) and S3 redirect tables under `scripts/redirects/`
(maintained alongside non-Hugo URL routing).

Algorithm:

1. frontmatter-validate.py builds a unified URL-ownership map keyed on
   normalized URLs (leading slash, no `index.html`, trailing slash).
   Each entry: {file, scope: hugo-alias|s3-redirect}.
2. For each PR-changed file, compute the file's rendered Hugo URL from
   its path.
3. Look up that URL in the ownership map. Any other file (alias) or
   redirect entry claiming the URL → emit url_collision with scope tag.
4. Spec mandates surfacing url_collisions as 🚨 by default. The S38
   pr18568 case: PR drops content at /docs/iac/clouds/azure/guides/
   which is already aliased by content/docs/iac/guides/clouds/azure.md.

Net change:
- cross-sibling-discover.py shrinks 241 → 130 lines (drops PARALLEL_PATTERNS,
  parallel-path check, and the structural_warning logic that came with it).
  Now does just what the name says: peer-counting for templated-section
  detection.
- frontmatter-validate.py grows ~80 lines (URL-ownership map building,
  normalization, derive_url_from_path, url_collisions per-file output).
- fact-check.md spec rewritten around the URL-ownership check; "Zero-peer
  check" subsection (Ship F's inline rule) removed since the pre-step
  fully replaces it.

Self-tested locally on pr18568 (catches /docs/iac/clouds/azure/guides/
and /docs/iac/clouds/azure/guides/providers/ as hugo-alias collisions
against content/docs/iac/guides/clouds/azure.md) and pr18605 (clean —
no false positives on the SAML JumpCloud fixture).

The S3 redirect coverage is bonus value: PRs landing content at URLs
that already have S3 redirects pointing somewhere else will now be
caught — that class of collision was previously invisible to all
review layers.
Atomized Hugo build validation as a workflow pre-step. The agent now
reads `.hugo-build.json` for the build-correctness floor instead of
trying to reason about whether the build would succeed (the workflow
intentionally skips `make build` per ci.md hard rule 4).

Bundles three checks in one script:
- `hugo --renderToMemory` at HEAD → errors, warnings, link-integrity.
- `hugo list all` at HEAD and BASE (worktree from base SHA) → sitemap
  diff (added/removed URLs).
- Output schema v1: errors, warnings, link_integrity, sitemap_diff.

Subsumes the originally-queued `docs-reference-graph` bundle (Hugo's
own warnings cover broken refs / broken shortcodes / missing assets,
and sitemap-diff covers orphaned-target detection). Bundle 2 retired.

Wall-clock: ~135-180s in this worktree (Hugo full render + base list);
acceptable on top of the 5-15min review wall-clock. CI runners may add
~30% — re-evaluate if it becomes blocking.

Spec updates:
- references/fact-check.md: artifact contract + surface rules + known
  false-positive scenarios.
- references/pre-computation.md: bundle-inventory entry + retire
  Bundle 2.

Phase 1 (S39) gate that authorized this ship: pr18568 finding-parity
N=2 PASSED (3🚨 vs 3🚨, structural triplet caught both runs, cost
variance ±4%). Architecture demonstrably reproducible at the load-
bearing fixture before piling on more.
Spike retest revealed two integration bugs in the original Ship K commit:

1. **mise.toml didn't include hugo.** CI runner had no `hugo` binary on
   PATH, so `subprocess.run(["hugo", ...])` raised FileNotFoundError —
   added `hugo = "0.157.0"` matching the local Codespace pin.
2. **`run_hugo_list` didn't catch FileNotFoundError.** First Hugo
   invocation (`run_hugo_render`) caught it and degraded gracefully;
   the second (`run_hugo_list` for HEAD) did not, so the script
   crashed mid-flight, dumping a traceback to stderr (which the
   workflow had `2>/dev/null`'d) and tripping the workflow's `||`
   fallback. Result: `.hugo-build.json` was the empty stub regardless
   of what would have surfaced.

Fixes:
- mise.toml: pin hugo 0.157.0 (matches local Codespace).
- script: catch FileNotFoundError + OSError in every hugo subprocess
  call site; wrap `main()` in `safe_main()` that emits a structured
  error artifact on any uncaught exception so the workflow always
  receives a useful JSON, never the empty fallback.
- workflow: drop `2>/dev/null` so tracebacks surface in CI logs;
  fallback `echo` now emits `errors: ["hugo-build-validate.py failed
  to start"]` to make zero-output runs distinguishable from clean
  builds.

Lesson: the workflow's `|| echo` fallback masks script failures from
the agent's view of the artifact. The fix moves error-state
representation INTO the artifact (well-formed JSON with a non-empty
`errors` array) instead of relying on file presence as a success
signal.

Spike on pr18568 to follow this commit.
Phase 1 (S39) confirmed: pr18568 finding-parity holds at N=2 AND
Java-column miss persists (zero Java-specific 🚨 across both runs).
Both Ship E gate conditions met. Ship.

Add a third code-examples specialist `cross-reference` (Sonnet 4.6,
`general-purpose`) split out from the existing `existence` specialist.
Scope of the two existing specialists:
- `structural` (Sonnet 4.6) — syntax, casing, idiomatic per-language.
- `existence` (Haiku 4.5) — imports + provider API currency.
  **Cross-reference body-vs-code coverage moved out (was 3rd
  responsibility; now its own specialist).**

New specialist fans out **once per content file** (not per code block
like the other two). Receives:
- the full content body
- a structured catalog: every fenced block + language declaration +
  first 8 lines, every `{{< example-program >}}` shortcode and its
  referenced `name`, every `static/programs/<name>-<language>/` directory
  listing.

Verifies in both directions:
- (a) every body language claim corroborated by an inline fenced block
  or a `static/programs/` directory.
- (b) every cited program directory's language variant set matches what
  the body advertises.

Always-🚨 carve-out: column or list claiming language X without a
corroborating snippet → 🚨 (page promises something it doesn't deliver).
Reciprocal direction → ⚠️ (orphan variant; usually intentional).
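The asymmetric severity rule can be sketched as a two-direction set check (illustrative only; the real specialist reasons over the full catalog, but the severity assignment follows this shape):

```python
def language_coverage_findings(claimed: set, delivered: set) -> list:
    """claimed: languages the body advertises (columns, lists);
    delivered: languages backed by an inline fenced block or a
    static/programs/ variant. Missing delivery is blocking; an
    orphan variant is advisory."""
    findings = []
    for lang in sorted(claimed - delivered):
        findings.append(("🚨", f"body claims {lang} but no snippet or "
                               f"static/programs variant delivers it"))
    for lang in sorted(delivered - claimed):
        findings.append(("⚠️", f"orphan {lang} variant not advertised "
                               f"by the body"))
    return findings
```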

The static/programs/ exemption from per-block dispatch (covers `structural`
+ `existence` — closed by the test harness) does NOT apply to
`cross-reference`: program-only diffs may still rebalance the language
inventory of a referenced page, so the body-level check runs whenever a
content file is in the diff.

Why a separate specialist (not just better prompting):
The body↔code correspondence requires holding the entire comparison
table + every language claim + every cited program directory in
attention simultaneously. Folded into `existence` (which is also doing
per-block import / API checks), it gets squeezed under attention
pressure — observed across S37/S38/S39 Phase 1 as a persistent
Java-column-class miss across multiple sessions.

Validator impact: none. The DISPATCH_METADATA_RE regex matches the
EXTRACTION-side specialists (numerical, cross-reference, capability,
framing — Pass 1), not code-examples. Code-examples bullet has no
specialist-count enforcement.

Spike on pr18568 to follow this commit.
…py + fact-check.md)

The Hugo build pre-step renders without `make ensure`, so it reliably emits
CI-environment-only errors (PostCSS/Hugo-Pipes fingerprint failure on /404;
`data/openapi-spec.json not found`) that are not PR-introduced. S39's Ship K
spike text told the agent to "surface every entry as 🚨 build-failure" — the
agent correctly suppressed the noise instead, but that contradicted the spec.

- hugo-build-validate.py: add `KNOWN_CI_NOISE_PATTERNS`; strip matching lines
  from `errors`/`warnings`/`link_integrity` before emitting; collect them under
  a new `suppressed_ci_noise` artifact field + `stats.suppressed_ci_noise_count`;
  add `head_exit_nonzero_is_ci_noise` so the agent doesn't have to reason about a
  non-zero exit explained entirely by the stripped noise. All inside `safe_main()`.
- workflow: update the `||` stub to match the new schema fields.
- fact-check.md §Hugo build artifact: drop the "surface every entry" mandate;
  note the script pre-filters; add a "demote a residual CI-env-only error
  silently with a `suppressed: CI-env-only` trail note" rule + a "Known
  CI-environment-only error classes" reference list.
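A minimal sketch of the pre-filter (the patterns shown are illustrative stand-ins, not the shipped `KNOWN_CI_NOISE_PATTERNS` list):

```python
import re

# Illustrative stand-ins for the real pattern list.
KNOWN_CI_NOISE_PATTERNS = [
    re.compile(r"data/openapi-spec\.json not found"),
    re.compile(r"fingerprint.*/404", re.IGNORECASE),
]

def strip_ci_noise(entries: list) -> tuple:
    """Split build-output lines into (kept, suppressed_ci_noise);
    suppressed lines land in the artifact's suppressed_ci_noise field."""
    kept, suppressed = [], []
    for line in entries:
        bucket = suppressed if any(
            p.search(line) for p in KNOWN_CI_NOISE_PATTERNS) else kept
        bucket.append(line)
    return kept, suppressed
```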

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Critical-evaluation pass over the docs-review spec files: fix accuracy drift
left behind by the Ship A→K refactors, resolve cross-file contradictions, and
trim clearly-redundant text. No new review behavior beyond resolving the
contradictions.

- code-examples.md + output-format.md: rename the Ship E code-examples
  specialist `cross-reference` → `body-code-coverage` (it collided with
  fact-check.md's `claim_type: cross-reference`); fix output-format.md's stale
  "2 specialists" investigation-log line → "3 specialists"; correct the
  `static/programs/`-only-diff behavior (`body-code-coverage` still runs).
- pre-computation.md: drop the two aspirational "Queued" table rows (no scripts
  exist), replace with a "Next candidates" paragraph reflecting S39's
  reprioritization (`markdown-link-validate.py` first); tidy the existing-bundle
  table labels.
- output-format.md DO-NOT item 7 ⟷ shared-criteria.md §Ordered-list numbering:
  removed "ordered-list `1.` numbering" from the lint-caught list — `markdownlint`
  MD029 (`one_or_ordered`) doesn't flag ascending lists and `.md` is in
  `.prettierignore`, so it stays in scope per shared-criteria.md. Added the
  reasoning inline.
- §Style nits → §Style findings: aligned the section reference name across
  output-format.md, docs.md, blog.md, prose-patterns.md (the actual heading is
  `#### Style findings`).
- style-bullet format: SKILL.md and ci.md now reference output-format.md's
  Style-findings render contract instead of restating a divergent inline form;
  output-format.md's bullet form now carries the `[style]` tag the CI workflow
  prompt already uses.
- shared-criteria.md + docs.md: the internal-link / frontmatter / alias checks
  now point at the `.hugo-build.json` and `.frontmatter-validation.json`
  pre-step artifacts first, with the model-side `gh api` checks reframed as the
  fallback. Always-🚨 carve-out definitions kept intact.
- ci.md: the workspace-root pre-step-artifact list now includes
  `.cross-sibling-discovery.json`, `.frontmatter-validation.json`,
  `.hugo-build.json`.
- programs.md: §Compilability check notes the test harness is the CI floor and
  isn't runnable in CI; merged the redundant §Scope section into the preamble.
- spelling-grammar.md: dropped the "Tokens that look like errors but are
  protected" section — five examples that each restate a protected-token rule
  stated 20 lines above.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tional spec

The "Ship A/B/.../K" letters and "Bundle 1/2/3" numbers were session-tracking
shorthand that leaked into the reviewer-loaded spec — they convey nothing the
descriptive name doesn't ("Ship K" vs "the Hugo build pre-step (hugo-build-validate.py)").
Removed from references/*.md, the pre-step script docstrings, and the workflow
comments; kept the *content* of the history notes (why a script is shaped the
way it is) and the session-provenance where it's mild ("...added S39"). The
"Ship X" record stays in SESSION-NOTES.md / REPORT.md (history), not the spec.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e it

It's a Claude Code runtime lock file (sessionId/pid/timestamp) written by
ScheduleWakeup — not source. Slipped into the previous commit via `git add -A`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…step

S41's fresh-fixture battery showed blog/claims-heavy PR reviews aren't
single-run-reproducible at the 🚨 tier — claim *discovery* is model-generated
and varies run to run, so one run catches a real blocking finding the next
misses (#18771 StrongDM misattribution, #18743 p5.48xlarge price vs Llama-3.3
nonexistence). Discovery is the weak link; verification is fine.

This lifts claim extraction out of the variable Opus review into a pre-step:

- extract-claims.py — Layer A: deterministic regex floor (numbers, version
  pins, temporal words, source attributions, URLs, named-entity/spec claims,
  positioning/comparison triggers) over the whole diff. Guarantees the
  concrete claims can never be silently dropped. safe_main().
- extract-claims-llm.py — Layer B: two redundant, differently-framed Sonnet
  passes (atomic/per-sentence and holistic/paragraph), direct /v1/messages
  call with temperature 0 + forced extract_claims tool schema, one call per
  changed content/**/*.md file, prompt-cached system prompt. Prompted with the
  new references/claim-extraction.md (taxonomy + the "what is NOT a claim"
  list incl. the third-party-attribution flip + framing rule + ≥10 worked
  examples, the S41 misses among them). safe_main(); degrades gracefully.
- merge-claims.py — unions the three layers into .candidate-claims.json:
  dedup by overlapping line range + token overlap, anchor LLM line ranges to
  file content, found_by provenance, pass-count → confidence.
- claude-code-review.yml — wires the four pre-steps; timeout-minutes: 25 on
  the claude-review job (S41 saw a review hang ~18 min).
- fact-check.md — .candidate-claims.json is the claim *floor* the review MUST
  verify (MAY add more); the in-review 4-way claim-finder dispatch retires on
  the normal path (the pre-step subsumes it), kept as a degraded-pre-step
  fallback; frontmatter-sweep scope pinned to frontmatter-validate.py's new
  per-file frontmatter_keys (fixes the #18745-r2 social.* omission).
- validate-pinned.py (schema v6→v7) — candidate-claims-coverage rule fails
  the review (soft-flooring loudly) if a candidate claim has no overlapping
  trail record; trail-bucket-consistency relaxed for pure-layout/0-claim PRs
  (#18857-r1 over-trigger).
- test_extract_claims.py + testdata/ — synthetic per-category tests + the 3
  real S41-fixture diffs (assert the dropped claims surface) + merge-claims
  dedup/anchor/provenance tests.
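The merge step can be sketched like this (simplified to line-range overlap only; the real merge-claims.py also checks token overlap and anchors LLM line ranges to file content):

```python
def ranges_overlap(a, b):
    return a[0] <= b[1] and b[0] <= a[1]

def merge_claims(layers: list) -> list:
    """layers: [(layer_name, [claim, ...]), ...] where each claim has
    file/lines/text. Union into one list, deduping by overlapping line
    range within the same file; keep found_by provenance and derive
    confidence from how many layers found the claim."""
    merged = []
    for layer_name, claims in layers:
        for claim in claims:
            for existing in merged:
                if (existing["file"] == claim["file"]
                        and ranges_overlap(existing["lines"],
                                           claim["lines"])):
                    existing["found_by"].append(layer_name)
                    break
            else:
                merged.append({**claim, "found_by": [layer_name]})
    for c in merged:
        c["confidence"] = ("high" if len(set(c["found_by"])) >= 2
                           else "medium")
    return merged
```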

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Refactor claim extraction logic in `extract-claims-llm.py` for clarity.
- Add a new script `markdown-syntax-findings.py` to identify markdown syntax issues.
- Update Vale configuration to include additional style checks for vague link text, empty alt text, and directional references.
- Improve documentation consistency by refining prose patterns and adding new rules for command backticks and product names.
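For context, a vague-link-text check in Vale's `existence` rule format might look like this (an illustrative sketch with a hypothetical path and token list, not the shipped rule):

```yaml
# styles/Docs/VagueLinkText.yml -- illustrative path and tokens
extends: existence
message: "Avoid vague link text like '%s'; describe the destination."
level: warning
scope: link
ignorecase: true
tokens:
  - click here
  - here
  - read more
  - this page
```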