
Overhaul the Claude-driven docs PR review pipeline (v1) #18680

Draft

CamSoper wants to merge 194 commits into master from CamSoper/pr-review-overhaul

Conversation

@CamSoper (Contributor) commented Apr 23, 2026

Replaces the legacy single-comment Claude review on pulumi/docs with a domain-aware, re-entrant pipeline. Goal: keep maintainer-grade fact-checking running on every PR as agentic workflows raise contribution velocity beyond manual review capacity.

What ships

Two skill packages working as a pair:

  • docs-review (CI) — runs on every PR. A deterministic Python classifier routes each PR to the right domain reference: docs, blog, infra, programs, or website. Posts a pinned <!-- CLAUDE_REVIEW --> comment with a status table + tiered findings (🚨 / ⚠️ / 💡 / ✅) + suggestion blocks. Re-entrant via @claude mention (refresh / dispute / re-verify).
  • pr-review (interactive) — local maintainer skill (/pr-review <PR#>). Reads the latest CI-posted review, applies a trust-and-scrutiny model, presents an action menu (approve / request changes / make changes / close). Optimized for "act on this PR right now without re-reading everything."

Triage gating. Trivial PRs (≤10 added lines, ≤2 docs/blog files) and frontmatter-only PRs short-circuit through a fast Haiku spelling/grammar pass instead of a full Opus review. Marketing/legal pages route to a dedicated domain:website review with verification-ask framing per FTC truthfulness norms.
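The triage gate above can be sketched as a small predicate. This is a hypothetical illustration, not the shipped classifier (triage-classify.py): the two caps come from the text, but the path prefixes and the docs/blog-only condition are assumptions.

```python
# Hypothetical sketch of the trivial-PR gate. The two caps are from the
# spec; the path prefixes and "docs/blog files only" rule are assumptions.
TRIVIAL_MAX_ADDED_LINES = 10
TRIVIAL_MAX_FILES = 2
DOCS_BLOG_PREFIXES = ("content/docs/", "content/blog/")


def is_trivial(added_lines: int, changed_files: list[str]) -> bool:
    """Trivial PRs short-circuit to the fast Haiku spelling/grammar pass."""
    docs_blog = [f for f in changed_files if f.startswith(DOCS_BLOG_PREFIXES)]
    return (
        added_lines <= TRIVIAL_MAX_ADDED_LINES
        and len(docs_blog) == len(changed_files)  # assumed: docs/blog only
        and len(docs_blog) <= TRIVIAL_MAX_FILES
    )
```
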

Workflows. claude-triage.yml (classify + prose check on pull_request), claude-code-review.yml (full review on ready_for_review), claude.yml (re-entrant on @claude mention).

Pinned-comment script. _common/scripts/pinned-comment.sh manages the review as a single logical comment sequence (<!-- CLAUDE_REVIEW N/M -->) with in-place edits, overflow append, and tail prune.
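The single-logical-comment idea can be illustrated in Python even though the real script is shell: locate the `<!-- CLAUDE_REVIEW N/M -->` parts and return them in order so edits land in place. The comment-dict shape here is an assumption.

```python
import re

# Marker format from the description: <!-- CLAUDE_REVIEW N/M -->
MARKER = re.compile(r"<!-- CLAUDE_REVIEW (\d+)/(\d+) -->")


def review_comment_sequence(comments: list[dict]) -> list[dict]:
    """Return the pinned-review comments in part order (1/M .. M/M).

    `comments` is a list of {"id": ..., "body": ...} dicts -- a
    hypothetical shape; the real pinned-comment.sh drives the gh CLI.
    """
    parts = []
    for c in comments:
        m = MARKER.search(c["body"])
        if m:
            parts.append((int(m.group(1)), c))
    return [c for _, c in sorted(parts, key=lambda t: t[0])]
```
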

Domain references. references/{shared-criteria,docs,blog,infra,programs,website,fact-check,prose-patterns,spelling-grammar,code-examples,image-review,output-format,update,domain-routing}.md. fact-check.md is the shared claim-extraction engine used by both CI and pr-review.

Labels. domain:{docs,blog,infra,programs,website,mixed} + review:{trivial,frontmatter-only,prose-flagged,claude-ran,claude-stale,claude-working} + needs-author-response. Deployed via scripts/labels/sync-labels.sh.

Contributor guidance. Draft-first posture, AI-authored-PR conventions, and re-entrant review semantics documented in CONTRIBUTING.md and AGENTS.md.

Benchmark

Validated head-to-head against the live legacy pipeline on 11 production PRs (full report at scratch/2026-05-01-live-comparison-v2/REPORT.md on branch):

  • 10 substantive bugs caught that legacy missed — every one would have shipped to production
  • 100% coverage of legacy's author-addressed substantive findings (correctly silent on already-fixed)
  • 0% false positive rate on both pipelines
  • Maintainer signal quality: 95% (new) vs 30% (legacy) on tier / evidence / grouping / suggestion-block axes
  • Cost: $0.65 per incremental shipped-defect prevented; 1.93× legacy on this sample, projected ~1.5× on production mix once trivial-skip fires at expected ~43% rate

Side-by-side comparison material at CamSoper/pulumi.docs#105–#115. Internal exec summary lives in Notion under Knowledge Preservation → Docs.

Notable catches the new pipeline made and legacy missed: workflow-breaking SAML/SCIM nav bugs (#18605), OutSystems source-misattribution propagated to LinkedIn/Bluesky social copy (#18647), broken /docs/ai/integrations/ link on a launch post (#18685), AGENTS.md canonical-path regressions (#18568, #18599), Java snippet truncation introduced while addressing legacy feedback (#18331).

One regression: PR 18573 trivial-cap edge case (4-line nav rewrite in a multi-section doc) — minor, soft-watch.

Status before merge

  • ✅ Cam fork validated end-to-end (all 5 domain paths exercised across 11 PRs)
  • ✅ All commits ready on CamSoper/pr-review-overhaul
  • ⏳ domain:website label still needs deployment to the pulumi/docs upstream label set: scripts/labels/sync-labels.sh --repo pulumi/docs
  • ⏳ Trivial-cap edge case (PR 18573 shape) — soft-watch, not a merge blocker

How to review

  • Quick read: the benchmark REPORT.md on this branch, or the side-by-side fork PRs at CamSoper/pulumi.docs#105–#115.
  • Diff-by-diff: start with .claude/commands/docs-review/scripts/triage-classify.py (the classifier), then .claude/commands/docs-review/references/{docs,blog,infra,programs,website}.md (the domain reviews), then .github/workflows/claude-{triage,code-review}.yml (the wiring).
  • Design history: SESSION-NOTES.md carries 18+ sessions of rationale. Sessions 5–7 (initial domain composition), 9–10 (shared-criteria + label rename), 12–13 (audit + cost optimization), 16–18 (e2e validation + trivial-cap calibration), 19 (domain:website + trivial/fmonly tightening to docs+blog only).

Notes

  • CI fact-check is public-sources-only by design: no Notion, no Slack MCP. Rationale lives in docs-review-ci.md.
  • Models: claude-opus-4-7 for initial domain reviews, claude-sonnet-4-6 for re-entrant updates, claude-haiku-4-5-20251001 for triage prose checks (50KB diff cap, JSON output).
  • SESSION-NOTES.md is for this PR's review cycle; marked for removal once merged.

Ship as draft; ready-flip after the upstream label deploy lands.

@pulumi-bot (Collaborator) commented Apr 23, 2026

CamSoper added a commit to CamSoper/pulumi.docs that referenced this pull request Apr 23, 2026
Fork-only tweak so claude.yml works without org-side ESC setup.
@claude retains all its capabilities (re-entrant reviews, Q&A,
make-changes on PRs) -- only difference is commits pushed with
GITHUB_TOKEN don't trigger downstream workflows, which is fine for
fork testing.

This commit is NOT for upstream. Origin/master and pulumi#18680
keep the ESC design. Do not cherry-pick.
CamSoper added a commit to CamSoper/pulumi.docs that referenced this pull request Apr 23, 2026
Fork-only tweak so claude.yml works without org-side ESC setup.
@claude retains all its capabilities (re-entrant reviews, Q&A,
make-changes on PRs) -- only difference is commits pushed with
GITHUB_TOKEN don't trigger downstream workflows, which is fine for
fork testing.

This commit is NOT for upstream. Origin/master and pulumi#18680
keep the ESC design. Do not cherry-pick.
CamSoper added a commit to CamSoper/pulumi.docs that referenced this pull request Apr 23, 2026
Fork-only tweak so claude.yml works without org-side ESC setup.
@claude retains all its capabilities (re-entrant reviews, Q&A,
make-changes on PRs) -- only difference is commits pushed with
GITHUB_TOKEN don't trigger downstream workflows, which is fine for
fork testing.

This commit is NOT for upstream. Origin/master and pulumi#18680
keep the ESC design. Do not cherry-pick.
CamSoper added a commit to CamSoper/pulumi.docs that referenced this pull request Apr 23, 2026
Fork-only tweak so claude.yml works without org-side ESC setup.
@claude retains all its capabilities (re-entrant reviews, Q&A,
make-changes on PRs) -- only difference is commits pushed with
GITHUB_TOKEN don't trigger downstream workflows, which is fine for
fork testing.

This commit is NOT for upstream. Origin/master and pulumi#18680
keep the ESC design. Do not cherry-pick.
CamSoper added a commit to CamSoper/pulumi.docs that referenced this pull request Apr 23, 2026
Fork-only tweak so claude.yml works without org-side ESC setup.
@claude retains all its capabilities (re-entrant reviews, Q&A,
make-changes on PRs) -- only difference is commits pushed with
GITHUB_TOKEN don't trigger downstream workflows, which is fine for
fork testing.

This commit is NOT for upstream. Origin/master and pulumi#18680
keep the ESC design. Do not cherry-pick.
CamSoper added a commit to CamSoper/pulumi.docs that referenced this pull request Apr 23, 2026
Fork-only tweak so claude.yml works without org-side ESC setup.
@claude retains all its capabilities (re-entrant reviews, Q&A,
make-changes on PRs) -- only difference is commits pushed with
GITHUB_TOKEN don't trigger downstream workflows, which is fine for
fork testing.

This commit is NOT for upstream. Origin/master and pulumi#18680
keep the ESC design. Do not cherry-pick.
CamSoper added a commit to CamSoper/pulumi.docs that referenced this pull request Apr 23, 2026
Fork-only tweak so claude.yml works without org-side ESC setup.
@claude retains all its capabilities (re-entrant reviews, Q&A,
make-changes on PRs) -- only difference is commits pushed with
GITHUB_TOKEN don't trigger downstream workflows, which is fine for
fork testing.

This commit is NOT for upstream. Origin/master and pulumi#18680
keep the ESC design. Do not cherry-pick.
CamSoper added a commit to CamSoper/pulumi.docs that referenced this pull request Apr 23, 2026
Fork-only tweak so claude.yml works without org-side ESC setup.
@claude retains all its capabilities (re-entrant reviews, Q&A,
make-changes on PRs) -- only difference is commits pushed with
GITHUB_TOKEN don't trigger downstream workflows, which is fine for
fork testing.

This commit is NOT for upstream. Origin/master and pulumi#18680
keep the ESC design. Do not cherry-pick.
CamSoper added a commit to CamSoper/pulumi.docs that referenced this pull request Apr 23, 2026
Fork-only tweak so claude.yml works without org-side ESC setup.
@claude retains all its capabilities (re-entrant reviews, Q&A,
make-changes on PRs) -- only difference is commits pushed with
GITHUB_TOKEN don't trigger downstream workflows, which is fine for
fork testing.

This commit is NOT for upstream. Origin/master and pulumi#18680
keep the ESC design. Do not cherry-pick.
CamSoper added a commit to CamSoper/pulumi.docs that referenced this pull request Apr 23, 2026
Fork-only tweak so claude.yml works without org-side ESC setup.
@claude retains all its capabilities (re-entrant reviews, Q&A,
make-changes on PRs) -- only difference is commits pushed with
GITHUB_TOKEN don't trigger downstream workflows, which is fine for
fork testing.

This commit is NOT for upstream. Origin/master and pulumi#18680
keep the ESC design. Do not cherry-pick.
CamSoper and others added 30 commits May 11, 2026 17:25
The validator's `internal-link-existence` check rejects links to pages the
PR itself is creating — the destination doesn't exist on the base branch
but will once the PR merges, so the link is valid. S34's pr18568 captures
hit this twice (the new Azure provider guides reference one another via
canonical paths the PR is adding).

Changes:
- New helper `gh_pr_diff_added_files()` queries the GitHub API for files
  with `status="added"` on the PR, returning a `set[str]` of relative
  paths.
- New `Context.diff_files_added: set[str]` field, populated in `cmd_check`
  alongside `diff_files`.
- `check_internal_link_existence` now accepts a candidate path if it's in
  the PR-added set, before falling through to the alias grep.

Validated:
- s34-runs/spot-check/pr18568-spot2 (was 1x internal-link): now 0.
- s34-runs/run1/pr18568-r1 (was 2x internal-link): now 0.
- s34-runs/spot-check/pr18647-spot2 (was clean): still clean.
- Synthetic guard (link to a path in neither base nor added): still flags.
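The two pieces described above can be sketched as follows. The helper name matches the commit message, but the signature, the `--jq` filter, and the check function's shape are assumptions about the real implementation.

```python
import subprocess


def gh_pr_diff_added_files(repo: str, pr: int) -> set[str]:
    """Query the GitHub API for files with status "added" on the PR.

    Sketch only: the real helper's signature may differ. Uses the gh CLI's
    per-page --jq filtering so pagination stays simple.
    """
    out = subprocess.run(
        ["gh", "api", f"repos/{repo}/pulls/{pr}/files", "--paginate",
         "--jq", '.[] | select(.status == "added") | .filename'],
        check=True, capture_output=True, text=True,
    ).stdout
    return set(out.splitlines())


def check_internal_link_existence(candidate: str,
                                  base_files: set[str],
                                  pr_added: set[str]) -> bool:
    # Accept a destination the PR itself is creating, before falling
    # through to the base-branch lookup (the alias grep in the real check).
    return candidate in pr_added or candidate in base_files
```
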
Empirical: across 83 pinned-review captures, the AI-drafting H3 section
rendered on 6 — all 6 on the canonical S30 fixture (pr-17240). Outside
the calibration corpus, the feature has never fired. Author-allowlist +
AI-trailer detection in claude-triage.yml already covers the obvious
cases at intake; the content-side detector was a backstop with no
demonstrated catch.

This commit removes the dedicated section + dispatch and re-routes the
specific tells that fit cleanly into Vale rules. Findings render as
existing-style nits in the low-confidence section; the maintainer
weighs them like any other prose flag. Hedged copy ("often appears in
AI-drafted prose; consider...") makes clear we're flagging a smell, not
asserting authorship.

Removed:
- prose-patterns.md §AI-drafting signals (6 detectors + Sonnet/Sonnet
  parallel-subagent dispatch). Replaced with a brief §AI-drafting tells
  that points at the four Vale rules below.
- output-format.md §AI-drafting signals (rendered section spec, the
  investigation-log bullet, the placeholder in the schematic, and the
  passing reference in §Subagent decomposition).
- validate-pinned.py: check_ai_drafting_threshold_section and its rule
  entry. INVESTIGATION_LOG_BULLETS goes from 9 to 8.

Added (styles/Pulumi/):
- SetPieceTransitions.yml — fixed phrase list from prose-patterns.md
  detector 2 ("But here's the thing", "Let's dive in", etc.)
- EmDashDensity.yml — paragraph-scope occurrence rule, max=2 (fires
  when any paragraph has 3+ em-dashes)
- ListicleH2Headings.yml — heading.h2-scope existence rule for
  numbered listicle prefixes (`**1.**`, `1.`, `Part N`, `Section N`,
  `Phase N`)
- HedgeThenPivot.yml — `While X, Y is also worth ...` /
  `Although X, what really matters is Y` constructions

vale-findings-filter.py: 4 new RULE_CATEGORIES entries so findings
surface with user-friendly category names.

Calibration on the 4 fixtures:
- pr-17240 (canonical): 2 hits (Part 1, Part 2 listicle H2s)
- pr-18647 (blog control): 0 hits
- pr-18605 (docs control): 0 hits
- pr-18599 (docs control): 0 hits across 8 files

Existing pinned-review captures re-validate cleanly under the trimmed
INVESTIGATION_LOG_BULLETS (extra "AI-drafting-signals pass" lines in
older captures are ignored, not flagged).

D1 (uniform per-section template) and D3 (parallel four-bullet lists)
from the original spec stay dropped — they don't fit Vale's scope model
cleanly, and the data shows they only ever fired on the calibration
fixture.
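The EmDashDensity rule's logic (paragraph scope, max=2, fires on a third em-dash) can be illustrated in Python. This is a parallel sketch of what the Vale rule checks, not the YAML rule itself; blank-line paragraph splitting is an assumption about how Vale scopes paragraphs here.

```python
EM_DASH = "\u2014"
MAX_PER_PARAGRAPH = 2  # the rule's max=2: a third em-dash in a paragraph fires


def emdash_dense_paragraphs(text: str) -> list[int]:
    """Indices of paragraphs containing 3+ em-dashes.

    Sketch of the EmDashDensity.yml occurrence rule; paragraphs are
    approximated as blank-line-separated blocks.
    """
    return [
        i for i, para in enumerate(text.split("\n\n"))
        if para.count(EM_DASH) > MAX_PER_PARAGRAPH
    ]
```
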
Inserts a deterministic Haiku 4.5 fix-pass into cmd_upsert_validated for
violation classes where the fix is a localized text edit. Closes the
S34 retry-loop dead-end where Opus's "fix" re-render reproduced the same
class of violation in a different form (e.g., another hallucinated link
path), so the soft-floor kept publishing bodies with leftover violations.

New: scripts/validator-fix.py
  Reads /tmp/validate-pinned.fix-me.json. Exits 2 if any violation falls
  outside the surgical set (caller proceeds to soft-floor without
  invoking Haiku). For surgical violations, dispatches one claude CLI
  call per violation with a class-specific prompt and tool use disabled,
  cap of 5 dispatches per body (cost ceiling).

  Surgical classes shipped (5 of 8 from s34-validator-loop-findings.md):
  - internal-link-existence (verified end-to-end on a synthetic case)
  - external-claim-pass2-outcome (verified end-to-end on
    s34-runs/run1/pr18685-r1, the real-world S34 capture)
  - shortcode-existence
  - bucket-bullet-line-range-prefix
  - mandatory-h3-order

  The other 3 (external-claim-state-format, -dispatch-metadata,
  -routed-metadata) follow the same single-line-edit shape and can be
  added with one prompt template each; deferred to a follow-up rather
  than padding this commit's untested surface.

Modified: scripts/pinned-comment.sh, cmd_upsert_validated
  On validator pass 1 fail: snapshot body to body.pre-haiku.bak, run
  validator-fix.py, re-validate. On success, publish; on persistent
  failure, restore from backup and return 1 (soft-floor publishes the
  ORIGINAL body, never a Haiku-degraded one). The re-validate step
  intentionally omits --soft-floor — we want a clean retry-0 verdict
  on the post-fix body, not a soft-floor downgrade.

Modified: scripts/per-tool-spend.py
  Adds Bash:validator-fix category at $0.015/call (one Haiku 4.5
  dispatch with medium prompt). Without this, validator-fix.py
  invocations bucket as Bash:other and the cost-variance reports
  miss the new spend line.

Tested:
- pr18685-r1 (external-claim-pass2-outcome): validator pass 1 fails →
  validator-fix dispatches Haiku → Pass 2 segment gets `(verified 2,
  contradicted 0, unverifiable 0)` appended → re-validate clean. ~35s.
- Synthetic internal-link-existence on pr18647-spot2: validator pass 1
  flags hallucinated link → validator-fix removes the [text](path)
  wrapper → re-validate clean. ~35s.
- Mixed-violation gate: surgical + frontmatter-locations (re-render
  required) → exit 2 immediately, no Haiku dispatch.

Per-call latency is dominated by claude CLI startup (~30s without
--bare). --bare requires ANTHROPIC_API_KEY explicitly; CI has it but
local OAuth users don't, so the CLI invocation is run without --bare
for portability. Optimization to direct SDK calls is a follow-up if
the wall-clock cost matters in production.
I'd dropped --bare during initial Ship 4 work because local testing
without ANTHROPIC_API_KEY errored "Not logged in" — and rationalized
the 30s startup cost as "follow-up" instead of fixing it. Cam caught
this: the action sets ANTHROPIC_API_KEY in env, and subprocess
invocations from pinned-comment.sh inherit it, so --bare works in CI.

--bare skips hooks, LSP, plugin sync, CLAUDE.md auto-discovery, and
keychain reads — drops dispatch latency from ~30s to ~2-3s per
violation. Local testing of this script now requires
ANTHROPIC_API_KEY=... in the environment.
Split the External claim verification "external" lane into two:
* Pass 2 -- consult `.fetched-urls.json` (workflow pre-step writes it)
* Pass 3 -- WebSearch + WebFetch fan-out for external-public claims
  with no URL in the diff

Stream-JSON audit of S35 captures showed docs reviews rendered Pass 2
routing without any Agent / WebFetch / WebSearch dispatches. Schema v5
adds four faithfulness floors so that drift can no longer pass review:

* `pass-2-fetch-faithfulness` -- F > 0 requires non-empty `.fetched-urls.json`.
* `pass-3-dispatch-mandate`   -- Y > I+P+F with empty fetched-urls must route
                                  to Pass 3 (S > 0).
* `pass-3-unverifiable-evidence` -- ⚠️ Pass 3 unverifiable verdicts must
                                     name the search that was attempted.
* `external-claim-pass3-outcome` -- mirror of pass2-outcome rule for the
                                     new lane (V/C/U attribution; sum check).

Routed-metadata regex extends to the optional `, S Pass 3` segment;
existing v4 captures re-validate cleanly (S Pass 3 = 0 by absence).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
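Two of the four v5 floors above can be sketched directly from their definitions. The single-letter count names (F, Y, I, P, S) follow the commit message; the dict representation is an assumption about the validator's internals.

```python
def check_faithfulness_floors(counts: dict, fetched_urls: list) -> list[str]:
    """Sketch of two schema-v5 floors; count keys follow the commit text.

    F = Pass 2 fetch-verified claims, S = Pass 3 dispatches,
    Y = external claims, I/P = other lanes (assumed meanings).
    """
    failures = []
    # pass-2-fetch-faithfulness: F > 0 requires non-empty .fetched-urls.json
    if counts.get("F", 0) > 0 and not fetched_urls:
        failures.append("pass-2-fetch-faithfulness")
    # pass-3-dispatch-mandate: Y > I+P+F with empty fetched-urls must
    # route to Pass 3 (S > 0)
    routed = counts.get("I", 0) + counts.get("P", 0) + counts.get("F", 0)
    if (not fetched_urls
            and counts.get("Y", 0) > routed
            and counts.get("S", 0) == 0):
        failures.append("pass-3-dispatch-mandate")
    return failures
```
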
Move the mechanical parts of the editorial-balance pass into a workflow
pre-step. Tier 1 (listicle / FAQ trigger detection, section-depth stats,
outlier flag) computes deterministically from the post-PR blog markdown
and writes `.editorial-balance.json`. Tier 2 (comparison trigger via
canonical entity list, entity counting, recommendation steering, FAQ-
answer voting) stays model-computed in S36 -- defer to S37 once Tier 1
stabilizes. Tier 3 (don't-flag exceptions) stays model-judged forever.

The validator's new `editorial-balance-counts-faithful` rule cross-checks
the rendered Editorial balance section's Tier 1 fields against the JSON:

* trigger=null in JSON forces empty-form rendering
* trigger != null forces rich-form rendering
* section count, mean, median, std must match (±10% tolerance)
* JSON-flagged outliers must appear in the rendered Section depth bullet

Rendered-vs-recomputed validation (`editorial-balance-counts`) stays in
place as a complementary check; both arrive at the same numbers when
Tier 1 is faithful, which is the load-bearing signal.

Empirical justification (S35 audit): editorial-balance fired only on
pr-17240 (11/86 captures), but on its target case it caught real signal
(Pulumi Neo section ~7.1× median, FAQ steering ≥60%, ⚠️ flags fired
correctly). The "always-on-calibration" pattern was rare comparison/
listicle/FAQ posts, not a broken feature -- so determinize Tier 1 rather
than dropping the section.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
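The deterministic Tier 1 computation (section-depth stats plus outlier flags, the fields the validator cross-checks within ±10%) can be sketched as below. The outlier factor here is an assumption; the commit cites a ~7.1× median catch but does not state the threshold.

```python
import statistics

OUTLIER_FACTOR = 3.0  # assumed cutoff; the real Tier 1 threshold isn't stated


def section_depth_stats(section_word_counts: dict[str, int]) -> dict:
    """Sketch of the .editorial-balance.json Tier 1 depth fields.

    Returns the count/mean/median/std the validator cross-checks, plus
    sections whose depth is >= OUTLIER_FACTOR x the median.
    """
    counts = list(section_word_counts.values())
    median = statistics.median(counts)
    return {
        "count": len(counts),
        "mean": statistics.mean(counts),
        "median": median,
        "std": statistics.pstdev(counts),
        "outliers": [name for name, c in section_word_counts.items()
                     if median and c / median >= OUTLIER_FACTOR],
    }
```
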
Add the 3 deferred External-claim-verification surgical classes to
validator-fix.py (S35's Ship 4 logged them as carry-overs):

* `external-claim-state-format` (rewrite the leading state form)
* `external-claim-dispatch-metadata` (append extraction-specialists tail)
* `external-claim-routed-metadata` (append routed-verification tail)

All three follow the same single-line edit shape as the already-shipped
`external-claim-pass2-outcome` template. The routed-metadata prompt now
also emits per-lane V/C/U attribution inline so a single Haiku-fix
recovers the canonical form (no chained second-pass needed).

Corpus-tested all 6 surgical classes (3 above + 3 untested-but-shipped
from S35: shortcode-existence, bucket-bullet-line-range-prefix,
mandatory-h3-order) against synthetic mutations of pr18647-spot2. All
recover on first Haiku-fix attempt:

  state-format          ✅ first attempt
  dispatch-metadata     ✅ first attempt
  routed-metadata       ✅ first attempt
  shortcode             ✅ first attempt
  bucket-prefix         ✅ first attempt (after sharpening prompt to
                            preserve trail anchor format)
  h3-order              ✅ first attempt

Bucket-prefix prompt sharpened to look up the exact trail anchor (was
inventing `[L40-40]` from a single-line `L40` trail record). Routed-
metadata prompt sharpened to preserve the dispatch-metadata segment
verbatim (was overwriting "cross-specialist corroborations" text).

Local-test path: HAIKU_TIMEOUT_S falls back to 120s when ANTHROPIC_API_KEY
is unset (OAuth mode adds ~30s of CLI startup). Production --bare mode
unaffected (~2-3s per dispatch; 60s timeout still applies).

Mutation seeds live in scratch: `s36-runs/synth/pr18647-spot2-mut-*.md`
(committed separately to the scratch repo for regression replay).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
fact-check.md §Inline lane: descriptive "<3 turns each" → prescriptive
hard cap of 5 gh CLI calls per claim, with explicit don't-iterate rule
against gh api repos/pulumi/docs/issues|pulls (exploration ≠ verification).
After cap, reclassify to ambiguous (→ Pass 1) or external-public
(→ Pass 2/3) and let the harder-verification lane take it.

per-tool-spend.py: emit ::warning:: GitHub Actions annotations when
Bash:gh > 25 OR num_turns > 80. Validated on existing S36 captures —
fires on pr18568 r2 (34 gh calls, the rabbit hole) and pr18647 r1
(30 gh + 83 turns). Doesn't fire on the 3 clean runs.
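The annotation emission described above can be sketched with GitHub Actions workflow-command syntax (`::warning::` lines on stdout). The thresholds are the ones named in the commit; everything else is illustrative.

```python
GH_CALLS_WARN = 25   # threshold from the commit: Bash:gh > 25
NUM_TURNS_WARN = 80  # threshold from the commit: num_turns > 80


def spend_annotations(gh_calls: int, num_turns: int) -> list[str]:
    """Sketch of per-tool-spend.py's warning emission.

    Returns GitHub Actions ::warning:: workflow-command lines; the
    caller prints them to stdout so the runner renders annotations.
    """
    notes = []
    if gh_calls > GH_CALLS_WARN:
        notes.append(f"::warning::Bash:gh calls ({gh_calls}) exceed {GH_CALLS_WARN}")
    if num_turns > NUM_TURNS_WARN:
        notes.append(f"::warning::num_turns ({num_turns}) exceeds {NUM_TURNS_WARN}")
    return notes
```
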
Audit of S36 stream-JSON captures grouped 27 actionable rejections into
4 patterns. One YAML-fixable; three need spec instructions.

claude-code-review.yml: add Bash(python3 -c:*) to --allowed-tools.
Closes Pattern 1 (inline python3 -c "..." rejected as multi-op part —
the path-prefix patterns over-restricted routine inline scripting).

ci.md Hard rule 7: three patterns the harness sandbox blocks regardless
of allow-list — write commands that avoid them.
- /tmp/ paths: filesystem-path policy blocks cat/grep/redirect; use
  the Read tool. Workspace root scratch files (.fetched-urls.json etc.)
  remain Bash-accessible.
- Shell control flow (for/while/case/if): multi-op decomposer rejects
  even when constituent commands are allow-listed. Use python3 -c "..."
  for iteration.
- Brace expansion / subshell grouping: same decomposer issue; expand
  manually or move to python3 -c "...".

ci.md §4: change cat /tmp/validate-pinned.fix-me.md example to Read
(the previous example contradicted the harness sandbox).

Audit notes: scratch/2026-05-06-final-battery/s37-runs/notes/
harness-friction-audit.md.
Inline-lane spec gains a canonical-source table mapping common claim
shapes (menu, example-program, sibling-pattern, schema, shortcode,
alias) to the path that resolves them, plus token-first / path-second
search-order rules and a shrug rule (3 targeted reads → ambiguous).

S37 post-session analysis (n=43 captures, S33-S37) showed the deep
inline runs were doing real cross-file verification, not wandering —
S37 pr18568 i=8 caught 3 critical structural bugs the i=13 cheap run
missed. The S37 per-claim cap fired against the wrong target. The
playbook gives the model better starting points so 3-5 calls/claim
closes the typical structural verification.

The existing per-claim 5-call cap stays as a backstop.
Per-claim 5-call cap stays as backstop. New per-PR 40-call cap is the
primary control: beyond ~40 inline gh calls, the model summarizes
unresolved pulumi-internal claims and dispatches a final Pass 1 batch
with the playbook embedded — that batch is the escalation tier.

per-tool-spend.py keeps the existing gh>25 ::warning:: (productivity-
zone observability) and adds gh>50 ::error:: (genuine over-spend).
The S37 pr151-r1 stream-JSON (75 gh calls) is the canonical historical
case: trips both annotations under the new spec, no regression on the
five other S37 captures.
New validator rule catches the inline-lane exploration patterns that
the canonical-source playbook (Ship A, fact-check.md §Inline lane) is
designed to displace. Trail evidence containing `gh api repos/.../
issues|pulls` or recursive `git/trees/<sha>?recursive=...` is flagged
— these don't read canonical source.

Self-validation:
- All 6 S37 captures re-validate clean under v6 (the trail evidence
  in those captures already cites canonical paths; the exploration
  was in tool calls, not trail evidence).
- Synthetic violation capture trips all 3 exploration patterns
  (issues, pulls, trees recursive).

Schema version: 5 → 6.
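The exploration-pattern flagging can be sketched as two regexes over trail evidence. These patterns are approximations of the rule described above, not the validator's exact expressions.

```python
import re

# Approximations of the v6 exploration patterns: issue/PR listings and
# recursive tree walks don't read canonical source.
EXPLORATION_PATTERNS = [
    re.compile(r"gh api repos/\S+/(issues|pulls)\b"),
    re.compile(r"git/trees/\S+\?recursive="),
]


def trail_cites_exploration(trail_evidence: str) -> bool:
    """Sketch of the validator rule: flag trail evidence whose cited
    commands explore the repo instead of reading canonical paths."""
    return any(p.search(trail_evidence) for p in EXPLORATION_PATTERNS)
```
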
Strip rationale from the canonical-source playbook — preamble cut to
one sentence, table rows lose parentheticals, search-order rules drop
the "why" tail, shrug rule trimmed to the directive. Same shape, same
intent, fewer runtime tokens.
S38 Ship D variance retest revealed that pr18568 r1+r2 both classified
the fixture as "not in a templated section" because the changed
file's directory had zero peers. The model checked the wrong path
(`content/docs/iac/clouds/<x>/guides/`) and didn't broaden the
search to the parallel category tree (`content/docs/iac/guides/
clouds/<x>/`) where the actual sibling set lives. Result: the
file-location, alias-collision, and menu-parent triplet that S37 r1
caught was silently dropped.

Add a zero-peer-check rule under §Cross-sibling consistency: when
the changed file's directory has 0 peers but the category has known
parallel pages elsewhere, search adjacent paths before concluding
"no siblings yet." The empty result is itself a sibling-consistency
claim and a 🚨 file-location finding.

Spike-tested next.
Pipeline-of-artifacts experiment. Discovery layer was independently
failable in S38 Ship D (pr18568 r1+r2 both classified the changed
file as "not in a templated section" and skipped the sibling sweep
that surfaces structural bugs). Ship F's inline zero-peer rule
helped on a single spike but depends on the model running it.

Encode the same logic deterministically as a workflow pre-step:
walks the docs tree per changed file, runs the parallel-path check
table-driven, emits .cross-sibling-discovery.json. Spec in
fact-check.md §Cross-sibling consistency now reads from the
artifact first; the structural_warning surfaces directly as a 🚨
file-location finding.

Self-tested on PR 140 (SAML JumpCloud, in_templated=true with 8
peers) and PR 151 (pr18568, parallel-path warning + canonical
sibling at content/docs/iac/guides/clouds/azure.md).

Smallest viable architectural pivot: one step, one artifact. If
the structural guarantee makes discovery reliable across runs,
the pipeline pivot is justified for further extraction in S39+.
Bundle 1 of the atomized discovery pattern (see new
docs-review:references:pre-computation for the architectural
codification that emerged across S38 Ship G + this work).

Ship H: frontmatter-validate.py + workflow wire-in + spec update.
Two checks bundled:

1. Menu-parent identifier resolution. For each menu.<name>.parent
   declared in PR-changed frontmatter, walk content/**/*.md to build
   a global menu-identifier map and check whether the parent resolves
   in the same named menu. The S37/S38 pr18568 case: menu.iac.parent:
   azure-clouds resolves only against menu.integrations — wrong-menu
   parent. Closes the L11 finding Ship G missed.

2. Alias collision detection. Build a global alias map from
   content/**/*.md, cross-reference the PR's declared aliases. Flag
   PR-internal collisions and repo-wide collisions (against existing
   canonical pages). Self-tested on pr18568: caught both
   /docs/clouds/azure/guides/ and /docs/clouds/azure/guides/providers/
   as repo-wide collisions with the existing canonical
   content/docs/iac/guides/clouds/azure.md.

Ship I: references/pre-computation.md is the architectural meta-doc.
Codifies the principle (scripts find structural facts, agent makes
editorial judgments), the bundle-by-reading-pattern architecture,
the false-positive triage contract, and how to add a new pre-step.
Lets S39+ ship the remaining bundles consistently without
rediscovering the pattern.

Self-tested locally on pr18568 and pr18605 (clean — no false positives
on the SAML JumpCloud fixture).

Spike retest pending.
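Check 2 (alias collision detection) can be sketched as a map lookup once the global alias map is built. The data shapes below are assumptions; the real script walks content/**/*.md frontmatter to build both maps.

```python
def alias_collisions(pr_aliases: dict[str, list[str]],
                     repo_aliases: dict[str, str]) -> list[tuple[str, str, str]]:
    """Sketch of frontmatter-validate.py's alias-collision check.

    pr_aliases: PR file -> aliases it declares (assumed shape).
    repo_aliases: existing alias -> owning file on the base branch.
    Returns (alias, pr_file, other_owner) for PR-internal and
    repo-wide collisions.
    """
    hits = []
    seen: dict[str, str] = {}
    for pr_file, aliases in pr_aliases.items():
        for alias in aliases:
            if alias in seen and seen[alias] != pr_file:
                hits.append((alias, pr_file, seen[alias]))       # PR-internal
            elif alias in repo_aliases and repo_aliases[alias] != pr_file:
                hits.append((alias, pr_file, repo_aliases[alias]))  # repo-wide
            seen[alias] = pr_file
    return hits
```
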
Cam pushback on cross-sibling-discover.py's PARALLEL_PATTERNS table:
hardcoded against the one observed pr18568 layout swap; brittle and
maintenance-heavy — every new layout divergence requires a code edit.

Replace with a data-driven approach using signals the codebase already
curates intentionally: Hugo `aliases:` frontmatter (declared per file
when content is moved) and S3 redirect tables under `scripts/redirects/`
(maintained alongside non-Hugo URL routing).

Algorithm:

1. frontmatter-validate.py builds a unified URL-ownership map keyed on
   normalized URLs (leading slash, no `index.html`, trailing slash).
   Each entry: {file, scope: hugo-alias|s3-redirect}.
2. For each PR-changed file, compute the file's rendered Hugo URL from
   its path.
3. Look up that URL in the ownership map. Any other file (alias) or
   redirect entry claiming the URL → emit url_collision with scope tag.
4. Spec mandates surfacing url_collisions as 🚨 by default. The S38
   pr18568 case: PR drops content at /docs/iac/clouds/azure/guides/
   which is already aliased by content/docs/iac/guides/clouds/azure.md.

Net change:
- cross-sibling-discover.py shrinks 241 → 130 lines (drops PARALLEL_PATTERNS,
  parallel-path check, and the structural_warning logic that came with it).
  Now does just what the name says: peer-counting for templated-section
  detection.
- frontmatter-validate.py grows ~80 lines (URL-ownership map building,
  normalization, derive_url_from_path, url_collisions per-file output).
- fact-check.md spec rewritten around the URL-ownership check; "Zero-peer
  check" subsection (Ship F's inline rule) removed since the pre-step
  fully replaces it.

Self-tested locally on pr18568 (catches /docs/iac/clouds/azure/guides/
and /docs/iac/clouds/azure/guides/providers/ as hugo-alias collisions
against content/docs/iac/guides/clouds/azure.md) and pr18605 (clean —
no false positives on the SAML JumpCloud fixture).

The S3 redirect coverage is bonus value: PRs landing content at URLs
that already have S3 redirects pointing somewhere else will now be
caught — that class of collision was previously invisible to all
review layers.
Atomized Hugo build validation as a workflow pre-step. The agent now
reads `.hugo-build.json` for the build-correctness floor instead of
trying to reason about whether the build would succeed (the workflow
intentionally skips `make build` per ci.md hard rule 4).

Bundles three checks in one script:
- `hugo --renderToMemory` at HEAD → errors, warnings, link-integrity.
- `hugo list all` at HEAD and BASE (worktree from base SHA) → sitemap
  diff (added/removed URLs).
- Output schema v1: errors, warnings, link_integrity, sitemap_diff.

Subsumes the originally-queued `docs-reference-graph` bundle (Hugo's
own warnings cover broken refs / broken shortcodes / missing assets,
and sitemap-diff covers orphaned-target detection). Bundle 2 retired.

Wall-clock: ~135-180s in this worktree (Hugo full render + base list);
acceptable on top of the 5-15min review wall-clock. CI runners may add
~30% — re-evaluate if it becomes blocking.

Spec updates:
- references/fact-check.md: artifact contract + surface rules + known
  false-positive scenarios.
- references/pre-computation.md: bundle-inventory entry + retire
  Bundle 2.

Phase 1 (S39) gate that authorized this ship: pr18568 finding-parity
N=2 PASSED (3🚨 vs 3🚨, structural triplet caught both runs, cost
variance ±4%). Architecture demonstrably reproducible at the load-
bearing fixture before piling on more.
Spike retest revealed two integration bugs in the original Ship K commit:

1. **mise.toml didn't include hugo.** CI runner had no `hugo` binary on
   PATH, so `subprocess.run(["hugo", ...])` raised FileNotFoundError —
   added `hugo = "0.157.0"` matching the local Codespace pin.
2. **`run_hugo_list` didn't catch FileNotFoundError.** First Hugo
   invocation (`run_hugo_render`) caught it and degraded gracefully;
   the second (`run_hugo_list` for HEAD) did not, so the script
   crashed mid-flight, dumping a traceback to stderr (which the
   workflow had `2>/dev/null`'d) and tripping the workflow's `||`
   fallback. Result: `.hugo-build.json` was the empty stub regardless
   of what would have surfaced.

Fixes:
- mise.toml: pin hugo 0.157.0 (matches local Codespace).
- script: catch FileNotFoundError + OSError in every hugo subprocess
  call site; wrap `main()` in `safe_main()` that emits a structured
  error artifact on any uncaught exception so the workflow always
  receives a useful JSON, never the empty fallback.
- workflow: drop `2>/dev/null` so tracebacks surface in CI logs;
  fallback `echo` now emits `errors: ["hugo-build-validate.py failed
  to start"]` to make zero-output runs distinguishable from clean
  builds.

Lesson: the workflow's `|| echo` fallback masks script failures from
the agent's view of the artifact. The fix moves error-state
representation INTO the artifact (well-formed JSON with a non-empty
`errors` array) instead of relying on file presence as a success
signal.

Spike on pr18568 to follow this commit.
Phase 1 (S39) confirmed: pr18568 finding-parity holds at N=2 AND
Java-column miss persists (zero Java-specific 🚨 across both runs).
Both Ship E gate conditions met. Ship.

Add a third code-examples specialist `cross-reference` (Sonnet 4.6,
`general-purpose`) split out from the existing `existence` specialist.
Scope of the two existing specialists:
- `structural` (Sonnet 4.6) — syntax, casing, idiomatic per-language.
- `existence` (Haiku 4.5) — imports + provider API currency.
  **Cross-reference body-vs-code coverage moved out (was 3rd
  responsibility; now its own specialist).**

New specialist fans out **once per content file** (not per code block
like the other two). Receives:
- the full content body
- a structured catalog: every fenced block + language declaration +
  first 8 lines, every `{{< example-program >}}` shortcode and its
  referenced `name`, every `static/programs/<name>-<language>/` directory
  listing.

Verifies in both directions:
- (a) every body language claim corroborated by an inline fenced block
  or a `static/programs/` directory.
- (b) every cited program directory's language variant set matches what
  the body advertises.

Always-🚨 carve-out: column or list claiming language X without a
corroborating snippet → 🚨 (page promises something it doesn't deliver).
Reciprocal direction → ⚠️ (orphan variant; usually intentional).
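The asymmetric severity rule can be sketched as a two-direction set check (illustrative only; the real specialist reasons over the full catalog, but the severity assignment follows this shape):

```python
def language_coverage_findings(claimed: set, delivered: set) -> list:
    """claimed: languages the body advertises (columns, lists);
    delivered: languages backed by an inline fenced block or a
    static/programs/ variant. Missing delivery is blocking; an
    orphan variant is advisory."""
    findings = []
    for lang in sorted(claimed - delivered):
        findings.append(("🚨", f"body claims {lang} but no snippet or "
                               f"static/programs variant delivers it"))
    for lang in sorted(delivered - claimed):
        findings.append(("⚠️", f"orphan {lang} variant not advertised "
                               f"by the body"))
    return findings
```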

The static/programs/ exemption from per-block dispatch (covers `structural`
+ `existence` — closed by the test harness) does NOT apply to
`cross-reference`: program-only diffs may still rebalance the language
inventory of a referenced page, so the body-level check runs whenever a
content file is in the diff.

Why a separate specialist (not just better prompting):
The body↔code correspondence requires holding the entire comparison
table + every language claim + every cited program directory in
attention simultaneously. Folded into `existence` (which is also doing
per-block import / API checks), it gets squeezed under attention
pressure — observed across S37/S38/S39 Phase 1 as a persistent
Java-column-class miss across multiple sessions.

Validator impact: none. The DISPATCH_METADATA_RE regex matches the
EXTRACTION-side specialists (numerical, cross-reference, capability,
framing — Pass 1), not code-examples. Code-examples bullet has no
specialist-count enforcement.

Spike on pr18568 to follow this commit.
…py + fact-check.md)

The Hugo build pre-step renders without `make ensure`, so it reliably emits
CI-environment-only errors (PostCSS/Hugo-Pipes fingerprint failure on /404;
`data/openapi-spec.json not found`) that are not PR-introduced. S39's Ship K
spike text told the agent to "surface every entry as 🚨 build-failure" — the
agent correctly suppressed the noise instead, but that contradicted the spec.

- hugo-build-validate.py: add `KNOWN_CI_NOISE_PATTERNS`; strip matching lines
  from `errors`/`warnings`/`link_integrity` before emitting; collect them under
  a new `suppressed_ci_noise` artifact field + `stats.suppressed_ci_noise_count`;
  add `head_exit_nonzero_is_ci_noise` so the agent doesn't have to reason about a
  non-zero exit explained entirely by the stripped noise. All inside `safe_main()`.
- workflow: update the `||` stub to match the new schema fields.
- fact-check.md §Hugo build artifact: drop the "surface every entry" mandate;
  note the script pre-filters; add a "demote a residual CI-env-only error
  silently with a `suppressed: CI-env-only` trail note" rule + a "Known
  CI-environment-only error classes" reference list.
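A minimal sketch of the pre-filter (the patterns shown are illustrative stand-ins, not the shipped `KNOWN_CI_NOISE_PATTERNS` list):

```python
import re

# Illustrative stand-ins for the real pattern list.
KNOWN_CI_NOISE_PATTERNS = [
    re.compile(r"data/openapi-spec\.json not found"),
    re.compile(r"fingerprint.*/404", re.IGNORECASE),
]

def strip_ci_noise(entries: list) -> tuple:
    """Split build-output lines into (kept, suppressed_ci_noise);
    suppressed lines land in the artifact's suppressed_ci_noise field."""
    kept, suppressed = [], []
    for line in entries:
        bucket = suppressed if any(
            p.search(line) for p in KNOWN_CI_NOISE_PATTERNS) else kept
        bucket.append(line)
    return kept, suppressed
```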

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Critical-evaluation pass over the docs-review spec files: fix accuracy drift
left behind by the Ship A→K refactors, resolve cross-file contradictions, and
trim clearly-redundant text. No new review behavior beyond resolving the
contradictions.

- code-examples.md + output-format.md: rename the Ship E code-examples
  specialist `cross-reference` → `body-code-coverage` (it collided with
  fact-check.md's `claim_type: cross-reference`); fix output-format.md's stale
  "2 specialists" investigation-log line → "3 specialists"; correct the
  `static/programs/`-only-diff behavior (`body-code-coverage` still runs).
- pre-computation.md: drop the two aspirational "Queued" table rows (no scripts
  exist), replace with a "Next candidates" paragraph reflecting S39's
  reprioritization (`markdown-link-validate.py` first); tidy the existing-bundle
  table labels.
- output-format.md DO-NOT item 7 ⟷ shared-criteria.md §Ordered-list numbering:
  removed "ordered-list `1.` numbering" from the lint-caught list — `markdownlint`
  MD029 (`one_or_ordered`) doesn't flag ascending lists and `.md` is in
  `.prettierignore`, so it stays in scope per shared-criteria.md. Added the
  reasoning inline.
- §Style nits → §Style findings: aligned the section reference name across
  output-format.md, docs.md, blog.md, prose-patterns.md (the actual heading is
  `#### Style findings`).
- style-bullet format: SKILL.md and ci.md now reference output-format.md's
  Style-findings render contract instead of restating a divergent inline form;
  output-format.md's bullet form now carries the `[style]` tag the CI workflow
  prompt already uses.
- shared-criteria.md + docs.md: the internal-link / frontmatter / alias checks
  now point at the `.hugo-build.json` and `.frontmatter-validation.json`
  pre-step artifacts first, with the model-side `gh api` checks reframed as the
  fallback. Always-🚨 carve-out definitions kept intact.
- ci.md: the workspace-root pre-step-artifact list now includes
  `.cross-sibling-discovery.json`, `.frontmatter-validation.json`,
  `.hugo-build.json`.
- programs.md: §Compilability check notes the test harness is the CI floor and
  isn't runnable in CI; merged the redundant §Scope section into the preamble.
- spelling-grammar.md: dropped the "Tokens that look like errors but are
  protected" section — five examples that each restate a protected-token rule
  stated 20 lines above.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tional spec

The "Ship A/B/.../K" letters and "Bundle 1/2/3" numbers were session-tracking
shorthand that leaked into the reviewer-loaded spec — they convey nothing the
descriptive name doesn't ("Ship K" vs "the Hugo build pre-step (hugo-build-validate.py)").
Removed from references/*.md, the pre-step script docstrings, and the workflow
comments; kept the *content* of the history notes (why a script is shaped the
way it is) and the session-provenance where it's mild ("...added S39"). The
"Ship X" record stays in SESSION-NOTES.md / REPORT.md (history), not the spec.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e it

It's a Claude Code runtime lock file (sessionId/pid/timestamp) written by
ScheduleWakeup — not source. Slipped into the previous commit via `git add -A`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…step

S41's fresh-fixture battery showed blog/claims-heavy PR reviews aren't
single-run-reproducible at the 🚨 tier — claim *discovery* is model-generated
and varies run to run, so one run catches a real blocking finding the next
misses (#18771 StrongDM misattribution, #18743 p5.48xlarge price vs Llama-3.3
nonexistence). Discovery is the weak link; verification is fine.

This lifts claim extraction out of the variable Opus review into a pre-step:

- extract-claims.py — Layer A: deterministic regex floor (numbers, version
  pins, temporal words, source attributions, URLs, named-entity/spec claims,
  positioning/comparison triggers) over the whole diff. Guarantees the
  concrete claims can never be silently dropped. safe_main().
- extract-claims-llm.py — Layer B: two redundant, differently-framed Sonnet
  passes (atomic/per-sentence and holistic/paragraph), direct /v1/messages
  call with temperature 0 + forced extract_claims tool schema, one call per
  changed content/**/*.md file, prompt-cached system prompt. Prompted with the
  new references/claim-extraction.md (taxonomy + the "what is NOT a claim"
  list incl. the third-party-attribution flip + framing rule + ≥10 worked
  examples, the S41 misses among them). safe_main(); degrades gracefully.
- merge-claims.py — unions the three layers into .candidate-claims.json:
  dedup by overlapping line range + token overlap, anchor LLM line ranges to
  file content, found_by provenance, pass-count → confidence.
- claude-code-review.yml — wires the four pre-steps; timeout-minutes: 25 on
  the claude-review job (S41 saw a review hang ~18 min).
- fact-check.md — .candidate-claims.json is the claim *floor* the review MUST
  verify (MAY add more); the in-review 4-way claim-finder dispatch retires on
  the normal path (the pre-step subsumes it), kept as a degraded-pre-step
  fallback; frontmatter-sweep scope pinned to frontmatter-validate.py's new
  per-file frontmatter_keys (fixes the #18745-r2 social.* omission).
- validate-pinned.py (schema v6→v7) — candidate-claims-coverage rule fails
  the review (soft-flooring loudly) if a candidate claim has no overlapping
  trail record; trail-bucket-consistency relaxed for pure-layout/0-claim PRs
  (#18857-r1 over-trigger).
- test_extract_claims.py + testdata/ — synthetic per-category tests + the 3
  real S41-fixture diffs (assert the dropped claims surface) + merge-claims
  dedup/anchor/provenance tests.
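The merge step can be sketched like this (simplified to line-range overlap only; the real merge-claims.py also checks token overlap and anchors LLM line ranges to file content):

```python
def ranges_overlap(a, b):
    return a[0] <= b[1] and b[0] <= a[1]

def merge_claims(layers: list) -> list:
    """layers: [(layer_name, [claim, ...]), ...] where each claim has
    file/lines/text. Union into one list, deduping by overlapping line
    range within the same file; keep found_by provenance and derive
    confidence from how many layers found the claim."""
    merged = []
    for layer_name, claims in layers:
        for claim in claims:
            for existing in merged:
                if (existing["file"] == claim["file"]
                        and ranges_overlap(existing["lines"],
                                           claim["lines"])):
                    existing["found_by"].append(layer_name)
                    break
            else:
                merged.append({**claim, "found_by": [layer_name]})
    for c in merged:
        c["confidence"] = ("high" if len(set(c["found_by"])) >= 2
                           else "medium")
    return merged
```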

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Refactor claim extraction logic in `extract-claims-llm.py` for clarity.
- Add a new script `markdown-syntax-findings.py` to identify markdown syntax issues.
- Update Vale configuration to include additional style checks for vague link text, empty alt text, and directional references.
- Improve documentation consistency by refining prose patterns and adding new rules for command backticks and product names.
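For context, a vague-link-text check in Vale's `existence` rule format might look like this (an illustrative sketch with a hypothetical path and token list, not the shipped rule):

```yaml
# styles/Docs/VagueLinkText.yml -- illustrative path and tokens
extends: existence
message: "Avoid vague link text like '%s'; describe the destination."
level: warning
scope: link
ignorecase: true
tokens:
  - click here
  - here
  - read more
  - this page
```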