Phoebe: reconcile exit-code (loud routine/quiet backfill) + window_start index + single-flight doc#18
Phoebe: reconcile exit-code (loud routine/quiet backfill) + window_start index + single-flight doc#18hhuuggoo wants to merge 6 commits into
Conversation
A rater run that DELETES a previously-billed rated_usage row
(ReconciledDeletions > 0) is a revenue change. Whether it should page
depends on WHY the window was chosen:
- DEFAULT trailing-hours window (no --since/--until): a routine cron run
rewriting a prior bill means events vanished from billing_event (data
loss) or an upstream regression now drops them — ERROR + exit nonzero
(treat like the other anomalies; page someone).
- EXPLICIT --since/--until (operator backfill, e.g. after a late price
fix): convergence the operator asked for — INFO + exit 0.
The decision lives in cmd/rater (the only place that knows the window was
explicit), via a unit-testable exitCode(reconciledDeletions, windowExplicit,
hasAnomaly) helper; it is deliberately NOT folded into Result.HasAnomaly()
(that would make explicit backfills exit nonzero too). resolveWindow now
returns windowExplicit (true iff either flag was set) as the single source of
truth for the routine-vs-backfill distinction. Semantics unchanged ("what the
latest run says is what bills" — store.go always reconciles); this is purely
the observability/exit contract.
Tests (RED-demonstrated on the routine gate before wiring):
- TestRater_RoutineRunReconcileDeleteExitsNonzero
- TestRater_BackfillReconcileDeleteExitsZero
- TestResolveWindow_SingleFlagIsExplicit
+ windowExplicit assertions on the existing resolveWindow tests.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The reconcile DELETE (the deleted CTE) filters rated_usage on window_start ALONE (window_start >= $1 AND window_start < $2). Every existing index leads with auth_id, leaving window_start trailing, so no index could serve the predicate — the reconcile seq-scanned rated_usage and took a full-trailing- window lock footprint on every run (the default window re-rates 24 closed hours). Add rated_usage_window_start_ix ON rated_usage (window_start) to BOTH migrations/0002_rating.sql and the Atlas c2f1a3b4d5e6 revision (create_index in upgrade; drop_index if_exists=True in downgrade, in reverse order). While here, fix the asymmetric existing downgrade: rated_usage_auth_id_window_start_ix now drops with if_exists=True too, matching the idempotent upgrade so up/down/up is clean. Verified against live PG: the migration chain applies clean, the index exists, and the reconcile DELETE predicate now plans as an Index (Only) Scan using rated_usage_window_start_ix. New integration check: TestIntegration_ReconcileDeleteUsesWindowStartIndex (seqscan off → a remaining seq scan would fail it). schemaDDL mirrors the production indexes. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The store.go upsert comment overclaimed "Deterministic lock order across concurrent raters (no ABBA deadlock)". The ordered upsert only prevents a SINGLE rater self-deadlocking against its own rows; cross-rater deadlock-safety (the ordered upsert vs. the deleted CTE's DELETE — a potential ABBA pair) holds ONLY UNDER SINGLE-FLIGHT. State the real contract: the deployment forbids two raters over overlapping windows (Atlas CronJob concurrencyPolicy: Forbid), so the cross-rater hazard is unreachable and no delete-lock-ordering machinery is added (per Ben/ledger — explicitly excluded). cmd/rater's package doc already carries the single-flight contract (added with the reconcile-exit commit): the rater MUST run single-flight, the Atlas CronJob enforces it (concurrencyPolicy: Forbid), and that manifest lives outside this repo. Also drop a stale cross-PR "FIX 2" label from a reconcile integration test docstring (it now collides with this PR's FIX 2 = the index). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The captured request body is hard-capped before to_tsvector (an uncapped body fails the io_log INSERT past ~1 MiB). That truncation is correct and stays, but it was silent — an operator had no way to see that a stored body was cut. Add a single WARN at the capture call site with the request_id and the original→cap sizes when, and only when, truncation fires. captureRequestBody now also returns the full body length so the caller can log original-vs-cap without re-reading. Behaviour is unchanged otherwise: the forwarded request still carries the full body. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
🔋 Re-battery — Round 1 (status: ESCALATE), WITH ledger11 raw → 3 refuted → 8 confirmed + 3 persona. One real finding, narrow and good: the reconcile observability contract is HALF-implemented. The ledger suppressed all the settled decisions — this is a genuine implementation gap in the newest change. The gap (contract-vs-implementation mismatch)Option (c) says: routine reconcile-delete → loud (ERROR) + exit nonzero; backfill → quiet (INFO) + exit 0. The exit-code half is correct ( Plus: Why this is finish-the-ratified-decision, not a new doorThe contract is decided (option c, in the ledger). The escalation was only about the package boundary (where routine-vs-backfill knowledge lives). That's a mechanical taste call, not a contract — resolution: thread Also (mechanical, same pass)The two Battery |
…XPLAIN Completes the ratified reconcile observability contract (option (c)) on the LOG side; the exit-code half (exitCode() in cmd/rater) was already done. FIX 1 — thread windowExplicit into Rater.Run and gate the reconcile-delete log severity on it. A reconcile-DELETE of a previously-billed rated_usage row is now LOUD on a ROUTINE run (default trailing-hours window, !windowExplicit) — ERROR, because rewriting a prior bill with no operator behind it means events vanished from billing_event (data loss) or an upstream regression dropped them — and QUIET on an EXPLICIT backfill (--since/--until) — INFO, intended convergence. This mirrors the exit-code gate; the reconcile SEMANTICS are unchanged. All Run call sites updated to pass windowExplicit (routine cron path passes false). FIX 2 — test the loud-log half. TestRater_RoutineReconcileDeleteLogsError pins that a routine reconcile-delete (no other anomaly, HasAnomaly stays false) emits an ERROR line; TestRater_BackfillReconcileDeleteLogsInfoNoError pins that the same delete under an explicit backfill logs INFO and NO ERROR. Both capture the INFO and ERROR streams into buffers. The routine-ERROR assertion was demonstrated RED against the pre-fix always-INFO code. Fixed the over-claiming docstring on TestRater_RoutineRunReconcileDeleteExitsNonzero to state it pins ONLY the exit code, now that the log half is separately tested. FIX 3 — make the reconcile-DELETE index EXPLAIN test a real proof. It previously EXPLAINed a standalone `SELECT 1` against an empty rated_usage, where the planner seqscans regardless. Now it populates rated_usage (50 auths x 200 hours = 10k rows), ANALYZEs, and EXPLAINs the ACTUAL reconcile DELETE (the `deleted` CTE shape: window_start range + NOT EXISTS anti-join), asserting the plan chooses rated_usage_window_start_ix at DEFAULT cost (seqscan enabled). Verified discriminating: dropping the index makes the plan fall back to the far costlier auth-leading composite (~280 vs ~8), so the assertion fails without the index. Gate: go build, go vet (+integration), golangci-lint v1.64.8, gofmt all clean; go test -race ./... and -tags=integration -race ./... against live PG all pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Phoebe: finish reconcile observability (loud-log half) + real index EXPLAIN
|
Landed on main via the squashed rating merge (3b22908). Closing the stack. |
Fix pass on the rater after PR #17 round 1 (Ben's reconcile-DELETE recommendations + mechanical findings). Money path — fail-closed, every behavioral change paired with an invariant-named test. Stacked on
yaml-rater-fixes-3.Contracts
Reconcile-delete exit-code (option (c): loud on routine, quiet on backfill). A run that DELETES a previously-billed
rated_usagerow (ReconciledDeletions > 0) is a revenue change. The exit code now depends on WHY the window was chosen:--since/--until): a prior bill vanishing on a routine cadence means data loss / an upstream regression → ERROR + exit 2 (the anomaly code; page someone).--since/--untilset): intended convergence (e.g. after a late price fix) → INFO + exit 0.The gate is
ReconciledDeletions > 0 && !windowExplicit. The decision lives incmd/rater(the only place that knows the window was explicit) via a unit-testableexitCodehelper; it is deliberately NOT folded intoResult.HasAnomaly()(that would make explicit backfills exit nonzero too).resolveWindownow returnswindowExplicit(true iff either flag was set). Semantics unchanged — store.go always reconciles ("what the latest run says is what bills"); this is purely the observability/exit contract.New index for the reconcile DELETE.
rated_usage_window_start_ix ON rated_usage (window_start)— thedeletedCTE filterswindow_startalone, but every existing index leads withauth_id, so the reconcile seq-scanned and over-locked on every run. Added to bothmigrations/0002_rating.sqland the Atlasc2f1a3b4d5e6revision. Also fixed the asymmetric downgrade: bothdrop_indexcalls now useif_exists=Truefor up/down/up idempotency. Verified against live PG: the reconcile DELETE plans asIndex Only Scan using rated_usage_window_start_ix.Single-flight requirement (Atlas
concurrencyPolicy: Forbidhand-off). The reconcile DELETE + ordered upsert are deadlock-safe ONLY under single-flight (the ordered upsert prevents self-deadlock; cross-rater ABBA is unreachable only because no two raters run over overlapping windows). The ordered-upsert comment instore.gowas overclaiming ("lock order across concurrent raters"); it now states the real contract.cmd/rater's package doc documents the single-flight assumption and that the deployment MUST enforce it. No delete-lock-ordering machinery added (per Ben/ledger — explicitly excluded). The actualconcurrencyPolicy: Forbidlives in the Atlas manifest repo, outside phoebe — a hand-off item.Tests (invariant-named)
TestRater_RoutineRunReconcileDeleteExitsNonzero— RED-demonstrated against the old anomaly-only exit before wiring.TestRater_BackfillReconcileDeleteExitsZero— and a real anomaly during a backfill still exits nonzero.TestResolveWindow_SingleFlagIsExplicit+windowExplicitassertions on the existing window tests.TestIntegration_ReconcileDeleteUsesWindowStartIndex— live-PG EXPLAIN (seqscan off) proves the new index serves the reconcile predicate.Gate
go build,go vet(+-tags=integration),go test -race, golangci-lint v1.64.8 (default + integration tags),gofmt -l— all clean. Live-PG integration + e2e green; migration chain applies clean and the new index is used.🤖 Generated with Claude Code