Skip to content

Phoebe: reconcile exit-code (loud routine/quiet backfill) + window_start index + single-flight doc#18

Closed
hhuuggoo wants to merge 6 commits into
yaml-rater-fixes-3from
yaml-rater-fixes-4
Closed

Phoebe: reconcile exit-code (loud routine/quiet backfill) + window_start index + single-flight doc#18
hhuuggoo wants to merge 6 commits into
yaml-rater-fixes-3from
yaml-rater-fixes-4

Conversation

@hhuuggoo

Copy link
Copy Markdown
Contributor

Fix pass on the rater after PR #17 round 1 (Ben's reconcile-DELETE recommendations + mechanical findings). Money path — fail-closed, every behavioral change paired with an invariant-named test. Stacked on yaml-rater-fixes-3.

Contracts

Reconcile-delete exit-code (option (c): loud on routine, quiet on backfill). A run that DELETES a previously-billed rated_usage row (ReconciledDeletions > 0) is a revenue change. The exit code now depends on WHY the window was chosen:

  • Routine (default trailing-hours window, no --since/--until): a prior bill vanishing on a routine cadence means data loss / an upstream regression → ERROR + exit 2 (the anomaly code; page someone).
  • Backfill (--since/--until set): intended convergence (e.g. after a late price fix) → INFO + exit 0.

The gate is ReconciledDeletions > 0 && !windowExplicit. The decision lives in cmd/rater (the only place that knows the window was explicit) via a unit-testable exitCode helper; it is deliberately NOT folded into Result.HasAnomaly() (that would make explicit backfills exit nonzero too). resolveWindow now returns windowExplicit (true iff either flag was set). Semantics unchanged — store.go always reconciles ("what the latest run says is what bills"); this is purely the observability/exit contract.

New index for the reconcile DELETE. rated_usage_window_start_ix ON rated_usage (window_start) — the deleted CTE filters window_start alone, but every existing index leads with auth_id, so the reconcile seq-scanned and over-locked on every run. Added to both migrations/0002_rating.sql and the Atlas c2f1a3b4d5e6 revision. Also fixed the asymmetric downgrade: both drop_index calls now use if_exists=True for up/down/up idempotency. Verified against live PG: the reconcile DELETE plans as Index Only Scan using rated_usage_window_start_ix.

Single-flight requirement (Atlas concurrencyPolicy: Forbid hand-off). The reconcile DELETE + ordered upsert are deadlock-safe ONLY under single-flight (the ordered upsert prevents self-deadlock; cross-rater ABBA is unreachable only because no two raters run over overlapping windows). The ordered-upsert comment in store.go was overclaiming ("lock order across concurrent raters"); it now states the real contract. cmd/rater's package doc documents the single-flight assumption and that the deployment MUST enforce it. No delete-lock-ordering machinery added (per Ben/ledger — explicitly excluded). The actual concurrencyPolicy: Forbid lives in the Atlas manifest repo, outside phoebe — a hand-off item.

Tests (invariant-named)

  • TestRater_RoutineRunReconcileDeleteExitsNonzero — RED-demonstrated against the old anomaly-only exit before wiring.
  • TestRater_BackfillReconcileDeleteExitsZero — and a real anomaly during a backfill still exits nonzero.
  • TestResolveWindow_SingleFlagIsExplicit + windowExplicit assertions on the existing window tests.
  • TestIntegration_ReconcileDeleteUsesWindowStartIndex — live-PG EXPLAIN (seqscan off) proves the new index serves the reconcile predicate.

Gate

go build, go vet (+ -tags=integration), go test -race, golangci-lint v1.64.8 (default + integration tags), gofmt -l — all clean. Live-PG integration + e2e green; migration chain applies clean and the new index is used.

🤖 Generated with Claude Code

hhuuggoo and others added 4 commits June 15, 2026 23:49
A rater run that DELETES a previously-billed rated_usage row
(ReconciledDeletions > 0) is a revenue change. Whether it should page
depends on WHY the window was chosen:

  - DEFAULT trailing-hours window (no --since/--until): a routine cron run
    rewriting a prior bill means events vanished from billing_event (data
    loss) or an upstream regression now drops them — ERROR + exit nonzero
    (treat like the other anomalies; page someone).
  - EXPLICIT --since/--until (operator backfill, e.g. after a late price
    fix): convergence the operator asked for — INFO + exit 0.

The decision lives in cmd/rater (the only place that knows the window was
explicit), via a unit-testable exitCode(reconciledDeletions, windowExplicit,
hasAnomaly) helper; it is deliberately NOT folded into Result.HasAnomaly()
(that would make explicit backfills exit nonzero too). resolveWindow now
returns windowExplicit (true iff either flag was set) as the single source of
truth for the routine-vs-backfill distinction. Semantics unchanged ("what the
latest run says is what bills" — store.go always reconciles); this is purely
the observability/exit contract.

Tests (RED-demonstrated on the routine gate before wiring):
  - TestRater_RoutineRunReconcileDeleteExitsNonzero
  - TestRater_BackfillReconcileDeleteExitsZero
  - TestResolveWindow_SingleFlagIsExplicit
  + windowExplicit assertions on the existing resolveWindow tests.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The reconcile DELETE (the deleted CTE) filters rated_usage on window_start
ALONE (window_start >= $1 AND window_start < $2). Every existing index leads
with auth_id, leaving window_start trailing, so no index could serve the
predicate — the reconcile seq-scanned rated_usage and took a full-trailing-
window lock footprint on every run (the default window re-rates 24 closed
hours).

Add rated_usage_window_start_ix ON rated_usage (window_start) to BOTH
migrations/0002_rating.sql and the Atlas c2f1a3b4d5e6 revision (create_index
in upgrade; drop_index if_exists=True in downgrade, in reverse order). While
here, fix the asymmetric existing downgrade: rated_usage_auth_id_window_start_ix
now drops with if_exists=True too, matching the idempotent upgrade so up/down/up
is clean.

Verified against live PG: the migration chain applies clean, the index exists,
and the reconcile DELETE predicate now plans as an Index (Only) Scan using
rated_usage_window_start_ix. New integration check:
TestIntegration_ReconcileDeleteUsesWindowStartIndex (seqscan off → a remaining
seq scan would fail it). schemaDDL mirrors the production indexes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The store.go upsert comment overclaimed "Deterministic lock order across
concurrent raters (no ABBA deadlock)". The ordered upsert only prevents a
SINGLE rater self-deadlocking against its own rows; cross-rater deadlock-safety
(the ordered upsert vs. the deleted CTE's DELETE — a potential ABBA pair) holds
ONLY UNDER SINGLE-FLIGHT. State the real contract: the deployment forbids two
raters over overlapping windows (Atlas CronJob concurrencyPolicy: Forbid), so
the cross-rater hazard is unreachable and no delete-lock-ordering machinery is
added (per Ben/ledger — explicitly excluded).

cmd/rater's package doc already carries the single-flight contract (added with
the reconcile-exit commit): the rater MUST run single-flight, the Atlas CronJob
enforces it (concurrencyPolicy: Forbid), and that manifest lives outside this
repo.

Also drop a stale cross-PR "FIX 2" label from a reconcile integration test
docstring (it now collides with this PR's FIX 2 = the index).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The captured request body is hard-capped before to_tsvector (an uncapped
body fails the io_log INSERT past ~1 MiB). That truncation is correct and
stays, but it was silent — an operator had no way to see that a stored body
was cut. Add a single WARN at the capture call site with the request_id and
the original→cap sizes when, and only when, truncation fires.

captureRequestBody now also returns the full body length so the caller can
log original-vs-cap without re-reading. Behaviour is unchanged otherwise:
the forwarded request still carries the full body.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@hhuuggoo

Copy link
Copy Markdown
Contributor Author

🔋 Re-battery — Round 1 (status: ESCALATE), WITH ledger

11 raw → 3 refuted → 8 confirmed + 3 persona. One real finding, narrow and good: the reconcile observability contract is HALF-implemented. The ledger suppressed all the settled decisions — this is a genuine implementation gap in the newest change.

The gap (contract-vs-implementation mismatch)

Option (c) says: routine reconcile-delete → loud (ERROR) + exit nonzero; backfill → quiet (INFO) + exit 0. The exit-code half is correct (exitCode() in cmd/rater gates on windowExplicit). But the log half is not: rater.go:134 emits the reconcile-delete line unconditionally at INFO — and rater.Run never receives windowExplicit, so it can't raise to ERROR for the routine case. Result: on the alarming path (a routine run rewriting a prior bill = data vanished / upstream regression), the process exits nonzero but the logs show only a reassuring INFO "what the latest run says is what bills" — the calmest message on the loudest event. Severity-keyed alerting sees an apparently-clean run.

Plus: TestRater_RoutineRunReconcileDeleteExitsNonzero (main_test.go:110) claims to pin "the loud half" but only asserts the exit code — the ERROR-log half is untested because it's unimplemented (the overclaim the cleanup angle flagged).

Why this is finish-the-ratified-decision, not a new door

The contract is decided (option c, in the ledger). The escalation was only about the package boundary (where routine-vs-backfill knowledge lives). That's a mechanical taste call, not a contract — resolution: thread windowExplicit into Rater.Run, mirroring the exit-code gate (keeps rating observability in the rating package), emit the routine reconcile-delete at ERROR with investigate-text, and add a test that asserts the ERROR line fires (closing the overclaim). I'm implementing this.

Also (mechanical, same pass)

The two ReconcileDeleteUsesWindowStartIndex integration tests EXPLAIN against an empty/standalone table (a freshly-created empty rated_usage → the planner may pick a seqscan regardless of the index) — tighten so the EXPLAIN actually proves index use on a populated table.


Battery wf_1e6324ec-606. Note: a trivial proxy-only commit (00833a2, iolog truncation-log carried from the now-closed #13) landed on the branch top after this battery started — it's outside the rating diff and was reviewed in #13's lineage; the final tip will re-battery clean. #18 needs the log-severity fix, then a dry round.

hhuuggoo and others added 2 commits June 16, 2026 00:07
…XPLAIN

Completes the ratified reconcile observability contract (option (c)) on the
LOG side; the exit-code half (exitCode() in cmd/rater) was already done.

FIX 1 — thread windowExplicit into Rater.Run and gate the reconcile-delete log
severity on it. A reconcile-DELETE of a previously-billed rated_usage row is now
LOUD on a ROUTINE run (default trailing-hours window, !windowExplicit) — ERROR,
because rewriting a prior bill with no operator behind it means events vanished
from billing_event (data loss) or an upstream regression dropped them — and QUIET
on an EXPLICIT backfill (--since/--until) — INFO, intended convergence. This
mirrors the exit-code gate; the reconcile SEMANTICS are unchanged. All Run call
sites updated to pass windowExplicit (routine cron path passes false).

FIX 2 — test the loud-log half. TestRater_RoutineReconcileDeleteLogsError pins
that a routine reconcile-delete (no other anomaly, HasAnomaly stays false) emits
an ERROR line; TestRater_BackfillReconcileDeleteLogsInfoNoError pins that the same
delete under an explicit backfill logs INFO and NO ERROR. Both capture the INFO
and ERROR streams into buffers. The routine-ERROR assertion was demonstrated RED
against the pre-fix always-INFO code. Fixed the over-claiming docstring on
TestRater_RoutineRunReconcileDeleteExitsNonzero to state it pins ONLY the exit
code, now that the log half is separately tested.

FIX 3 — make the reconcile-DELETE index EXPLAIN test a real proof. It previously
EXPLAINed a standalone `SELECT 1` against an empty rated_usage, where the planner
seqscans regardless. Now it populates rated_usage (50 auths x 200 hours = 10k
rows), ANALYZEs, and EXPLAINs the ACTUAL reconcile DELETE (the `deleted` CTE
shape: window_start range + NOT EXISTS anti-join), asserting the plan chooses
rated_usage_window_start_ix at DEFAULT cost (seqscan enabled). Verified
discriminating: dropping the index makes the plan fall back to the far costlier
auth-leading composite (~280 vs ~8), so the assertion fails without the index.

Gate: go build, go vet (+integration), golangci-lint v1.64.8, gofmt all clean;
go test -race ./... and -tags=integration -race ./... against live PG all pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Phoebe: finish reconcile observability (loud-log half) + real index EXPLAIN
@hhuuggoo

Copy link
Copy Markdown
Contributor Author

Landed on main via the squashed rating merge (3b22908). Closing the stack.

@hhuuggoo hhuuggoo closed this Jun 16, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant