Phoebe: add resource_id to rated_usage rollup grain (billing customer attribution, E2) #22

Closed
hhuuggoo wants to merge 1 commit into main from rated-usage-resource-id

@hhuuggoo

Contributor

The rater currently DROPS resource_id, but billing needs it to identify the customer org (E2: bill the deployment's org via resource_id→org_id). resource_id is already on billing_event (plumbed from the proxy through the drainer); this change carries it through the rater into the rated_usage grain. No prod data exists (rater has no CronJob, nothing deployed), so the existing 0002 migration revision is edited in place, not stacked.

Contracts

  • rated_usage grain / unique key — changes from (auth_id, model_id, window_start) to (auth_id, resource_id, model_id, window_start). The constraint is renamed rated_usage_auth_resource_model_window_uq. model_id is KEPT (not redundant with resource_id): price resolution is per-model, one deployment can serve multiple models/adapters at different rates, and collapsing models would sum traffic priced at different rates. Two deployments of the same model by the same auth in one hour → two rows (correct — they may bill to different orgs). The ON CONFLICT target, the ORDER BY (deadlock-free lock order), the deterministic md5 surrogate-key natural key (length-prefixed, fixed field order auth_id|resource_id|model_id|epoch), and the reconcile deleted CTE anti-join all move to the new key in lockstep.

  • New index: rated_usage_resource_id_window_start_ix on (resource_id, window_start). E2 reads rated_usage by deployment over a time window (resolve the org, then sum that deployment's cost); a resource_id-leading index makes that a tight slice rather than a scan over the auth-leading index where resource_id only trails.

  • NULL-resource_id → unattributable (fail closed): resource_id is NULLABLE on billing_event but the new key column is NON-NULL. A row that can't name its deployment/org CANNOT be billed. The grouped/priced filter requires resource_id IS NOT NULL; the unattributable partition counts resource_id IS NULL (alongside auth_id/model_id); the unpriced count requires full attribution so a NULL-resource_id unpriced row is counted ONLY as unattributable. Net: a NULL-resource_id row is COUNTED as unattributable (surfaced, exits nonzero), never silently $0-billed or billed to a NULL org. The partition invariant still holds exactly: events_rated + unpriced + unattributable + ambiguous == total in-window events.
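
For readers unfamiliar with length-prefixed natural keys, here is a minimal Go sketch of the idea behind the md5 surrogate key. The field order (auth_id, resource_id, model_id, epoch) comes from the contract above; the `<len>:<value>|` framing and the function names `naturalKey` and `surrogateKey` are assumptions for illustration, not the rater's actual encoding.

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
	"strconv"
	"strings"
)

// naturalKey joins the grain fields in a fixed order, prefixing each with its
// byte length. The "<len>:<value>|" framing is an assumed encoding; the point
// is that ("ab","c") and ("a","bc") can never produce the same string.
func naturalKey(authID, resourceID, modelID string, windowEpoch int64) string {
	fields := []string{authID, resourceID, modelID, strconv.FormatInt(windowEpoch, 10)}
	var b strings.Builder
	for _, f := range fields {
		b.WriteString(strconv.Itoa(len(f)))
		b.WriteByte(':')
		b.WriteString(f)
		b.WriteByte('|')
	}
	return b.String()
}

// surrogateKey hashes the natural key into a deterministic hex id.
func surrogateKey(authID, resourceID, modelID string, windowEpoch int64) string {
	sum := md5.Sum([]byte(naturalKey(authID, resourceID, modelID, windowEpoch)))
	return hex.EncodeToString(sum[:])
}

func main() {
	// Same auth, model, and hour; different deployments give different keys.
	fmt.Println(surrogateKey("auth-1", "deploy-a", "model-x", 3600))
	fmt.Println(surrogateKey("auth-1", "deploy-b", "model-x", 3600))
}
```

Because the key is a pure function of the four grain fields, re-running the rater over the same window yields the same id for the same row, which is what lets the upsert and the reconcile step agree on identity.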

Tests

  • Go oracle (oracleStore) grain + md5 mirror the new key; SQL-shape fragments updated.
  • TestRater_DistinctDeploymentsBillSeparately — two deployments, same auth+model+hour → two rows (Go oracle); live-PG twin TestIntegration_ResourceIDGrainAndFailClosed.
  • TestRater_NullResourceIdIsUnattributable + extended TestRater_UnattributableCountedNotSilent — NULL resource_id counted, never billed; partition holds.
  • Integration conformance asserts every written rollup carries the seeded resource_id; e2e asserts the rollup carries X-Saturn-Resource-Id end-to-end.
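
The behaviour these tests pin down can be sketched in a few lines of Go. This is a simplified, hypothetical model, not the repo's oracleStore: the `Event`, `Grain`, `Result`, and `rate` names are invented here, the empty string stands in for SQL NULL, and the ambiguous bucket is left out because the description does not define how ambiguity is detected.

```go
package main

import "fmt"

// Event is a simplified stand-in for a billing_event row.
// An empty string models SQL NULL in this sketch.
type Event struct {
	AuthID, ResourceID, ModelID string
	WindowStart                 int64 // hour-truncated epoch seconds
	Priced                      bool  // whether a price resolved for ModelID
}

// Grain is the new rollup key: (auth_id, resource_id, model_id, window_start).
type Grain struct {
	AuthID, ResourceID, ModelID string
	WindowStart                 int64
}

type Result struct {
	Rows                                  map[Grain]int // events per rollup row
	EventsRated, Unpriced, Unattributable int
}

// rate partitions events so that each lands in exactly one bucket.
// Attribution is checked first, so a NULL-resource_id unpriced event
// is counted only as unattributable.
func rate(events []Event) Result {
	res := Result{Rows: map[Grain]int{}}
	for _, e := range events {
		switch {
		case e.AuthID == "" || e.ResourceID == "" || e.ModelID == "":
			res.Unattributable++
		case !e.Priced:
			res.Unpriced++
		default:
			res.Rows[Grain{e.AuthID, e.ResourceID, e.ModelID, e.WindowStart}]++
			res.EventsRated++
		}
	}
	return res
}

func main() {
	res := rate([]Event{
		{"auth-1", "deploy-a", "model-x", 3600, true},
		{"auth-1", "deploy-b", "model-x", 3600, true},
		{"auth-1", "", "model-x", 3600, false}, // NULL resource_id
	})
	// prints "2 2 0 1": rows, rated, unpriced, unattributable
	fmt.Println(len(res.Rows), res.EventsRated, res.Unpriced, res.Unattributable)
}
```

Checking attribution before pricing is what keeps the buckets disjoint, so the counts sum to the number of input events in this reduced model.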

Gate (all green)

go build ./..., go vet ./... + go vet -tags=integration ./..., go test -race ./..., golangci-lint v1.64.8 (plain + --build-tags=integration), gofmt -l . empty, and the full live-Postgres integration + e2e suite (PHOEBE_TEST_DATABASE_URL=… on postgres:16).

Note for Hugo

The grain decision (KEEP model_id, ADD resource_id) is flagged for your awareness: Ben endorsed it; there is no prod data; it is the correct grain for E2. Two same-model-same-auth deployments in one hour now bill as two rows by design.

…ibution)

The rater dropped resource_id, but billing needs it to identify the customer
org (E2: bill the deployment's org via resource_id→org_id). resource_id is
already on billing_event (nullable); plumb it through the rater into the
rated_usage grain.

New grain / unique key: (auth_id, resource_id, model_id, window_start). model_id
stays — price resolution is per-model and one deployment can serve multiple
models/adapters at different rates, so collapsing models would sum traffic
priced differently. Two deployments of the same model by the same auth in one
hour now correctly produce two rows (they may bill to different orgs).

FAIL-CLOSED ATTRIBUTION: resource_id is NULLABLE on billing_event but the new
key column is NON-NULL. A row that can't name its deployment/org CANNOT be
billed. The grouped/priced filter requires resource_id IS NOT NULL, and the
unattributable partition counts resource_id IS NULL — so a NULL-resource_id row
is surfaced (exits nonzero), never silently $0-billed or billed to a NULL org.
The partition invariant holds: events_rated + unpriced + unattributable +
ambiguous == total in-window events.

Touch-points: ev/resolved/grouped/priced CTEs, the md5 surrogate-key natural
key (length-prefixed, fixed order), ON CONFLICT target + ORDER BY (lock order),
the reconcile `deleted` CTE anti-join, both migrations (0002_rating.sql +
alembic, edited in place — unapplied, no prod data), a new
(resource_id, window_start) index for E2 per-deployment reads, doc comments, and
the oracle + SQL-shape + integration + e2e tests. New negative tests:
DistinctDeploymentsBillSeparately and NullResourceIdIsUnattributable (Go oracle
+ live-PG), plus a resource_id assertion in the e2e rollup read.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@hhuuggoo

Contributor Author

🔋 Battery — Round 1 (status: ESCALATE, but clean)

9 raw → 6 refuted → 3 confirmed (all low, all test files) + 1 persona. Zero production findings — the resource_id grain + fail-closed attribution partition are solid; the verifiers refuted the scary candidates. The two escalations are minor judgment calls, both with clear answers I'm applying (neither is a contract door):

1. Drop the speculative (resource_id, window_start) index. It was added "for E2 reads by deployment" — but no reader of rated_usage by resource_id exists in this repo (that's the Atlas/Stripe consumer, not built). Ben's own "forward intent without speculative build" rule says: don't pay write-amplification on every rated_usage upsert for a reader that may land later. Fix: remove the index here; it lands in the PR that adds the E2 reader, with proven usage. (The auth_id and window_start indexes stay — they have in-repo consumers.)

2. Document the empty-string-vs-NULL invariant at the oracle (comment only). The oracle uses Go "" to model SQL NULL; prod SQL filters resource_id IS NULL. The verifier proved '' is unreachable in prod: the proxy billing gate fails closed on empty ResourceID before metering, and the drainer's nullStr maps ''→NULL on write. So it's a test-fidelity symmetry note, not a bug. Fix: a comment at the oracle noting "" models NULL and prod guarantees ''→NULL at the drain. (Same pre-existing convention auth_id/model_id already rely on.)

Plus two low-sev test cleanups (store_test.go assertion tidies).
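
The empty-string-vs-NULL convention from item 2 can be illustrated with a short Go sketch. `emptyToNull` is a hypothetical stand-in for what the drainer's nullStr is described as doing ('' becomes NULL on write); its body and signature are assumptions, not the drainer's source. `oracleIsNull` is likewise an invented name for the oracle-side convention.

```go
package main

import (
	"database/sql"
	"fmt"
)

// emptyToNull maps "" to SQL NULL and any other value to a valid string.
// Assumed shape only; the real nullStr may differ.
func emptyToNull(s string) sql.NullString {
	return sql.NullString{String: s, Valid: s != ""}
}

// oracleIsNull mirrors the test oracle's convention: Go "" models SQL NULL.
func oracleIsNull(s string) bool { return s == "" }

func main() {
	for _, id := range []string{"", "deploy-a"} {
		ns := emptyToNull(id)
		fmt.Printf("%q -> SQL NULL: %t, oracle NULL: %t\n", id, !ns.Valid, oracleIsNull(id))
	}
}
```

Under that assumption the two views agree for every input: a value the oracle treats as NULL is also written as NULL, so the `resource_id IS NULL` filter and the oracle's `""` check classify the same events.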

The core change is correct: the partition invariant rated + unpriced + unattributable + ambiguous == total holds, NULL-resource_id is fail-loud unattributable (live-PG tested). Applying the two fixes + cleanups, then a confirming dry round.


Battery wf_bc105ec5-a2f. The grain change itself needs no rework — only the speculative index + a doc comment.

@hhuuggoo

Contributor Author

Merged to main via the squashed resource_id grain change (e854bba).

@hhuuggoo hhuuggoo closed this Jun 16, 2026