Phoebe: reconcile exit-code (loud routine/quiet backfill) + window_start index + single-flight doc by hhuuggoo · Pull Request #18 · saturncloud/phoebe

hhuuggoo · 2026-06-15T23:50:14Z

Fix pass on the rater after PR #17 round 1 (Ben's reconcile-DELETE recommendations + mechanical findings). Money path — fail-closed, every behavioral change paired with an invariant-named test. Stacked on yaml-rater-fixes-3.

Contracts

Reconcile-delete exit-code (option (c): loud on routine, quiet on backfill). A run that DELETES a previously-billed rated_usage row (ReconciledDeletions > 0) is a revenue change. The exit code now depends on WHY the window was chosen:

Routine (default trailing-hours window, no --since/--until): a prior bill vanishing on a routine cadence means data loss / an upstream regression → ERROR + exit 2 (the anomaly code; page someone).
Backfill (--since/--until set): intended convergence (e.g. after a late price fix) → INFO + exit 0.

The gate is ReconciledDeletions > 0 && !windowExplicit. The decision lives in cmd/rater (the only place that knows the window was explicit) via a unit-testable exitCode helper; it is deliberately NOT folded into Result.HasAnomaly() (that would make explicit backfills exit nonzero too). resolveWindow now returns windowExplicit (true iff either flag was set). Semantics unchanged — store.go always reconciles ("what the latest run says is what bills"); this is purely the observability/exit contract.

New index for the reconcile DELETE. rated_usage_window_start_ix ON rated_usage (window_start) — the deleted CTE filters window_start alone, but every existing index leads with auth_id, so the reconcile seq-scanned and over-locked on every run. Added to both migrations/0002_rating.sql and the Atlas c2f1a3b4d5e6 revision. Also fixed the asymmetric downgrade: both drop_index calls now use if_exists=True for up/down/up idempotency. Verified against live PG: the reconcile DELETE plans as Index Only Scan using rated_usage_window_start_ix.

Single-flight requirement (Atlas concurrencyPolicy: Forbid hand-off). The reconcile DELETE + ordered upsert are deadlock-safe ONLY under single-flight (the ordered upsert prevents self-deadlock; cross-rater ABBA is unreachable only because no two raters run over overlapping windows). The ordered-upsert comment in store.go was overclaiming ("lock order across concurrent raters"); it now states the real contract. cmd/rater's package doc documents the single-flight assumption and that the deployment MUST enforce it. No delete-lock-ordering machinery added (per Ben/ledger — explicitly excluded). The actual concurrencyPolicy: Forbid lives in the Atlas manifest repo, outside phoebe — a hand-off item.

Tests (invariant-named)

TestRater_RoutineRunReconcileDeleteExitsNonzero — RED-demonstrated against the old anomaly-only exit before wiring.
TestRater_BackfillReconcileDeleteExitsZero — and a real anomaly during a backfill still exits nonzero.
TestResolveWindow_SingleFlagIsExplicit + windowExplicit assertions on the existing window tests.
TestIntegration_ReconcileDeleteUsesWindowStartIndex — live-PG EXPLAIN (seqscan off) proves the new index serves the reconcile predicate.

Gate

go build, go vet (+ -tags=integration), go test -race, golangci-lint v1.64.8 (default + integration tags), gofmt -l — all clean. Live-PG integration + e2e green; migration chain applies clean and the new index is used.

🤖 Generated with Claude Code

A rater run that DELETES a previously-billed rated_usage row (ReconciledDeletions > 0) is a revenue change. Whether it should page depends on WHY the window was chosen: - DEFAULT trailing-hours window (no --since/--until): a routine cron run rewriting a prior bill means events vanished from billing_event (data loss) or an upstream regression now drops them — ERROR + exit nonzero (treat like the other anomalies; page someone). - EXPLICIT --since/--until (operator backfill, e.g. after a late price fix): convergence the operator asked for — INFO + exit 0. The decision lives in cmd/rater (the only place that knows the window was explicit), via a unit-testable exitCode(reconciledDeletions, windowExplicit, hasAnomaly) helper; it is deliberately NOT folded into Result.HasAnomaly() (that would make explicit backfills exit nonzero too). resolveWindow now returns windowExplicit (true iff either flag was set) as the single source of truth for the routine-vs-backfill distinction. Semantics unchanged ("what the latest run says is what bills" — store.go always reconciles); this is purely the observability/exit contract. Tests (RED-demonstrated on the routine gate before wiring): - TestRater_RoutineRunReconcileDeleteExitsNonzero - TestRater_BackfillReconcileDeleteExitsZero - TestResolveWindow_SingleFlagIsExplicit + windowExplicit assertions on the existing resolveWindow tests. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The reconcile DELETE (the deleted CTE) filters rated_usage on window_start ALONE (window_start >= $1 AND window_start < $2). Every existing index leads with auth_id, leaving window_start trailing, so no index could serve the predicate — the reconcile seq-scanned rated_usage and took a full-trailing- window lock footprint on every run (the default window re-rates 24 closed hours). Add rated_usage_window_start_ix ON rated_usage (window_start) to BOTH migrations/0002_rating.sql and the Atlas c2f1a3b4d5e6 revision (create_index in upgrade; drop_index if_exists=True in downgrade, in reverse order). While here, fix the asymmetric existing downgrade: rated_usage_auth_id_window_start_ix now drops with if_exists=True too, matching the idempotent upgrade so up/down/up is clean. Verified against live PG: the migration chain applies clean, the index exists, and the reconcile DELETE predicate now plans as an Index (Only) Scan using rated_usage_window_start_ix. New integration check: TestIntegration_ReconcileDeleteUsesWindowStartIndex (seqscan off → a remaining seq scan would fail it). schemaDDL mirrors the production indexes. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The store.go upsert comment overclaimed "Deterministic lock order across concurrent raters (no ABBA deadlock)". The ordered upsert only prevents a SINGLE rater self-deadlocking against its own rows; cross-rater deadlock-safety (the ordered upsert vs. the deleted CTE's DELETE — a potential ABBA pair) holds ONLY UNDER SINGLE-FLIGHT. State the real contract: the deployment forbids two raters over overlapping windows (Atlas CronJob concurrencyPolicy: Forbid), so the cross-rater hazard is unreachable and no delete-lock-ordering machinery is added (per Ben/ledger — explicitly excluded). cmd/rater's package doc already carries the single-flight contract (added with the reconcile-exit commit): the rater MUST run single-flight, the Atlas CronJob enforces it (concurrencyPolicy: Forbid), and that manifest lives outside this repo. Also drop a stale cross-PR "FIX 2" label from a reconcile integration test docstring (it now collides with this PR's FIX 2 = the index). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The captured request body is hard-capped before to_tsvector (an uncapped body fails the io_log INSERT past ~1 MiB). That truncation is correct and stays, but it was silent — an operator had no way to see that a stored body was cut. Add a single WARN at the capture call site with the request_id and the original→cap sizes when, and only when, truncation fires. captureRequestBody now also returns the full body length so the caller can log original-vs-cap without re-reading. Behaviour is unchanged otherwise: the forwarded request still carries the full body. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

hhuuggoo · 2026-06-15T23:59:39Z

🔋 Re-battery — Round 1 (status: ESCALATE), WITH ledger

11 raw → 3 refuted → 8 confirmed + 3 persona. One real finding, narrow and good: the reconcile observability contract is HALF-implemented. The ledger suppressed all the settled decisions — this is a genuine implementation gap in the newest change.

The gap (contract-vs-implementation mismatch)

Option (c) says: routine reconcile-delete → loud (ERROR) + exit nonzero; backfill → quiet (INFO) + exit 0. The exit-code half is correct (exitCode() in cmd/rater gates on windowExplicit). But the log half is not: rater.go:134 emits the reconcile-delete line unconditionally at INFO — and rater.Run never receives windowExplicit, so it can't raise to ERROR for the routine case. Result: on the alarming path (a routine run rewriting a prior bill = data vanished / upstream regression), the process exits nonzero but the logs show only a reassuring INFO "what the latest run says is what bills" — the calmest message on the loudest event. Severity-keyed alerting sees an apparently-clean run.

Plus: TestRater_RoutineRunReconcileDeleteExitsNonzero (main_test.go:110) claims to pin "the loud half" but only asserts the exit code — the ERROR-log half is untested because it's unimplemented (the overclaim the cleanup angle flagged).

Why this is finish-the-ratified-decision, not a new door

The contract is decided (option c, in the ledger). The escalation was only about the package boundary (where routine-vs-backfill knowledge lives). That's a mechanical taste call, not a contract — resolution: thread windowExplicit into Rater.Run, mirroring the exit-code gate (keeps rating observability in the rating package), emit the routine reconcile-delete at ERROR with investigate-text, and add a test that asserts the ERROR line fires (closing the overclaim). I'm implementing this.

Also (mechanical, same pass)

The two ReconcileDeleteUsesWindowStartIndex integration tests EXPLAIN against an empty/standalone table (a freshly-created empty rated_usage → the planner may pick a seqscan regardless of the index) — tighten so the EXPLAIN actually proves index use on a populated table.

Battery wf_1e6324ec-606. Note: a trivial proxy-only commit (00833a2, iolog truncation-log carried from the now-closed #13) landed on the branch top after this battery started — it's outside the rating diff and was reviewed in #13's lineage; the final tip will re-battery clean. #18 needs the log-severity fix, then a dry round.

…XPLAIN Completes the ratified reconcile observability contract (option (c)) on the LOG side; the exit-code half (exitCode() in cmd/rater) was already done. FIX 1 — thread windowExplicit into Rater.Run and gate the reconcile-delete log severity on it. A reconcile-DELETE of a previously-billed rated_usage row is now LOUD on a ROUTINE run (default trailing-hours window, !windowExplicit) — ERROR, because rewriting a prior bill with no operator behind it means events vanished from billing_event (data loss) or an upstream regression dropped them — and QUIET on an EXPLICIT backfill (--since/--until) — INFO, intended convergence. This mirrors the exit-code gate; the reconcile SEMANTICS are unchanged. All Run call sites updated to pass windowExplicit (routine cron path passes false). FIX 2 — test the loud-log half. TestRater_RoutineReconcileDeleteLogsError pins that a routine reconcile-delete (no other anomaly, HasAnomaly stays false) emits an ERROR line; TestRater_BackfillReconcileDeleteLogsInfoNoError pins that the same delete under an explicit backfill logs INFO and NO ERROR. Both capture the INFO and ERROR streams into buffers. The routine-ERROR assertion was demonstrated RED against the pre-fix always-INFO code. Fixed the over-claiming docstring on TestRater_RoutineRunReconcileDeleteExitsNonzero to state it pins ONLY the exit code, now that the log half is separately tested. FIX 3 — make the reconcile-DELETE index EXPLAIN test a real proof. It previously EXPLAINed a standalone `SELECT 1` against an empty rated_usage, where the planner seqscans regardless. Now it populates rated_usage (50 auths x 200 hours = 10k rows), ANALYZEs, and EXPLAINs the ACTUAL reconcile DELETE (the `deleted` CTE shape: window_start range + NOT EXISTS anti-join), asserting the plan chooses rated_usage_window_start_ix at DEFAULT cost (seqscan enabled). Verified discriminating: dropping the index makes the plan fall back to the far costlier auth-leading composite (~280 vs ~8), so the assertion fails without the index. Gate: go build, go vet (+integration), golangci-lint v1.64.8, gofmt all clean; go test -race ./... and -tags=integration -race ./... against live PG all pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Phoebe: finish reconcile observability (loud-log half) + real index EXPLAIN

hhuuggoo · 2026-06-16T01:03:11Z

Landed on main via the squashed rating merge (3b22908). Closing the stack.

hhuuggoo and others added 4 commits June 15, 2026 23:49

hhuuggoo mentioned this pull request Jun 15, 2026

Phoebe: D1 truncation logging + D2 event_count BIGINT (Hugo's decisions) #13

Closed

hhuuggoo and others added 2 commits June 16, 2026 00:07

Merge pull request #19 from saturncloud/yaml-rater-fixes-5

301e9f6

Phoebe: finish reconcile observability (loud-log half) + real index EXPLAIN

hhuuggoo closed this Jun 16, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Phoebe: reconcile exit-code (loud routine/quiet backfill) + window_start index + single-flight doc#18

Phoebe: reconcile exit-code (loud routine/quiet backfill) + window_start index + single-flight doc#18
hhuuggoo wants to merge 6 commits into
yaml-rater-fixes-3from
yaml-rater-fixes-4

hhuuggoo commented Jun 15, 2026

Uh oh!

hhuuggoo commented Jun 15, 2026

Uh oh!

hhuuggoo commented Jun 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

hhuuggoo commented Jun 15, 2026

Contracts

Tests (invariant-named)

Gate

Uh oh!

hhuuggoo commented Jun 15, 2026

🔋 Re-battery — Round 1 (status: ESCALATE), WITH ledger

The gap (contract-vs-implementation mismatch)

Why this is finish-the-ratified-decision, not a new door

Also (mechanical, same pass)

Uh oh!

hhuuggoo commented Jun 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant