feat(billing,codex): per-tier pricing overlays + per-TTL cache writes#65

Closed
Menci wants to merge 4 commits into main from precursor-tier-aware-billing

Menci (Owner) commented Jun 19, 2026

Summary

One of four parallel precursor PRs splitting cross-cutting infrastructure out of #60.

  • Tier-aware billing schema: BillingDimension adds input_cache_write_1h (Anthropic's extended-cache-ttl-2025-04-11 1h TTL). ModelPricing gains a tiers? overlay; TokenUsage gains a tier field. Cost compute resolves effective pricing through resolveEffectivePricing(pricing, tier). Migration 0034_usage_per_ttl_and_tier.sql widens the dimension CHECK list and adds tier to the usage + usage_requests bucket identity (existing rows backfill tier = NULL, which is treated as base pricing → historical aggregations compute identically).
  • Per-shape tier extractors: Messages reads usage.cache_creation.ephemeral_{5m,1h}_input_tokens (split when present) + usage.speed. Responses + Chat read usage.service_tier. 'standard' / 'default' / 'auto' normalize to null (base); 'fast' / 'priority' / 'flex' are stored verbatim.
  • Codex flex/priority pricing: every priced Codex slug gains tiers.flex (50% off) + tiers.priority (~2x) overlays per OpenAI's published rates. Codex CLI's /fast writes service_tier: "priority" per the openai/codex source, so operator-facing rows tagged "fast" cost out at the priority row.
  • Codex refresh-token race recovery: recoverFromRefreshRace: on invalid_grant, re-read state, compare refresh_token, and defer to the sibling's rotation if it changed. Same architecture as the Claude Code data-plane in feat(provider): Claude Code subscription provider #60. REFRESH_TERMINAL_OAUTH_CODES is expanded to match sub2api's full non-retryable set.
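
The tier normalization rule from the extractor bullet can be sketched as below. This is a minimal illustration, not the PR's code: `normalizeTier` and `BASE_TIER_ALIASES` are hypothetical names, and treating an empty string like a missing value is an assumption.

```typescript
// Hypothetical helper; name and signature are assumptions for illustration.
// Values that denote base pricing collapse to null so they share a bucket
// with historical rows; every other tier string is kept verbatim.
const BASE_TIER_ALIASES: ReadonlySet<string> = new Set(["standard", "default", "auto"]);

function normalizeTier(raw: string | null | undefined): string | null {
  // Assumption: a missing or empty tier means base pricing.
  if (raw == null || raw === "") return null;
  return BASE_TIER_ALIASES.has(raw) ? null : raw;
}

console.log(normalizeTier("auto"));     // null
console.log(normalizeTier("priority")); // "priority"
```

Collapsing the base aliases to null keeps base-tier traffic from splitting into several buckets that would all price identically.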

Test plan

  • typecheck + test (2714 passing, +14 net new tests) clean
  • lint clean for changes in this PR (2 pre-existing import-order warnings in websocket_test.ts predate this branch and are owned by PR3)
  • Deploy: pnpm run db:migrate:remote && pnpm run deploy — applies migration 0034

Follow-ups

PR3 (control-plane endpoints) ships a per-upstream usage rollup that doesn't yet apply tier-aware compute; once this PR lands, the rollup can be extended to use resolveEffectivePricing. #60 (claude-code) consumes the tier extractor for usage.speed and ships the corresponding pricing tables (Opus 4.6+ fast-mode rates, Sonnet/Haiku/Fable 1h cache rates).

Menci added 4 commits June 20, 2026 00:40
…ier pricing override

Adds two cross-provider concepts the BillingDimension-based shape did not
capture:

- `input_cache_write_1h` is a new disjoint billing dimension for Anthropic's
  extended-cache-ttl-2025-04-11 1-hour cache writes. `cache_creation` on the
  Messages response splits writes into `ephemeral_5m_input_tokens` (existing
  `input_cache_write`) and `ephemeral_1h_input_tokens` (this new dimension).
  `unitPriceForDimension` falls back 1h -> 5m -> input.
- `ModelPricing.tiers` carries per-service-tier overrides (Anthropic
  `usage.speed`, OpenAI `usage.service_tier`). `resolveEffectivePricing`
  folds a tier override into a flat snapshot before any unit-price lookup.
  `UsageRecord` and the SQL usage tables grow a `tier` column that is part
  of bucket identity, so distinct tiers for the same model/hour aggregate
  as separate buckets with distinct unit-price snapshots.
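
A rough sketch of how the overlay fold and the fallback chain described above might fit together. The type shapes are assumptions (only the identifiers quoted in this message come from the PR), the prices are made-up numbers, and falling back to base pricing for an unknown tier is an assumption rather than confirmed behavior.

```typescript
// Assumed shapes; the real ModelPricing likely carries more fields.
type BillingDimension = "input" | "input_cache_write" | "input_cache_write_1h" | "output";
type PriceTable = Partial<Record<BillingDimension, number>>;
interface ModelPricing extends PriceTable {
  tiers?: Record<string, PriceTable>;
}

// Fold a tier override over the base prices into one flat snapshot.
// Assumption: a null or unknown tier resolves to base pricing.
function resolveEffectivePricing(pricing: ModelPricing, tier: string | null): PriceTable {
  const { tiers, ...base } = pricing;
  const overlay = tier !== null ? tiers?.[tier] : undefined;
  return overlay ? { ...base, ...overlay } : base;
}

// Cache writes fall back 1h -> 5m -> input, as stated in the commit message.
function unitPriceForDimension(p: PriceTable, dim: BillingDimension): number | undefined {
  if (dim === "input_cache_write_1h") return p.input_cache_write_1h ?? p.input_cache_write ?? p.input;
  if (dim === "input_cache_write") return p.input_cache_write ?? p.input;
  return p[dim];
}

// Illustrative prices only, not real rates.
const pricing: ModelPricing = {
  input: 10,
  input_cache_write: 12,
  output: 40,
  tiers: { fast: { input: 60, output: 240 } },
};

const fast = resolveEffectivePricing(pricing, "fast");
console.log(unitPriceForDimension(fast, "input"));                // 60 (overlay)
console.log(unitPriceForDimension(fast, "input_cache_write_1h")); // 12 (falls back to 5m)
```

Folding the overlay first means every later lookup sees a flat table and needs no tier awareness.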

Migration 0034 recreates `usage` and `usage_requests` to add the `tier`
column and to widen the dimension CHECK list to include the new bucket.
Existing rows backfill with `tier = NULL`, which `resolveEffectivePricing`
treats as base pricing - historical aggregations compute identically.
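
To illustrate why `tier` belongs in bucket identity, here is a small in-memory aggregation sketch. `UsageRow`, `bucketKey`, and `aggregate` are hypothetical names and the row shape is an assumption; the real tables are SQL and are not shown in this message.

```typescript
// Hypothetical row shape; the actual usage tables may have more columns.
interface UsageRow {
  model: string;
  hour: string;
  dimension: string;
  tier: string | null; // null = base pricing, including pre-migration rows
  tokens: number;
}

// Bucket identity includes the tier, so tiers never merge with each other.
function bucketKey(r: Pick<UsageRow, "model" | "hour" | "dimension" | "tier">): string {
  return JSON.stringify([r.model, r.hour, r.dimension, r.tier]);
}

function aggregate(rows: UsageRow[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const r of rows) {
    const key = bucketKey(r);
    totals.set(key, (totals.get(key) ?? 0) + r.tokens);
  }
  return totals;
}

const totals = aggregate([
  { model: "m", hour: "2026-06-19T10", dimension: "input", tier: null, tokens: 100 },       // historical row
  { model: "m", hour: "2026-06-19T10", dimension: "input", tier: null, tokens: 50 },        // new base-tier row
  { model: "m", hour: "2026-06-19T10", dimension: "input", tier: "priority", tokens: 30 },  // separate bucket
]);
console.log(totals.size); // 2
```

Because base-tier rows carry `null` rather than a label such as "standard", they land in the same bucket as rows written before the migration.
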
…tocol shapes

- Messages: when the upstream emits `usage.cache_creation` (the structured
  sub-object from extended-cache-ttl-2025-04-11), split the per-TTL counts
  onto `input_cache_write` (5m) and `input_cache_write_1h` (1h). Fall back
  to the flat `cache_creation_input_tokens` when the sub-object is absent.
  Capture `usage.speed: 'fast'` as `tier: 'fast'`; standard tier is left
  unset so base-tier rows aggregate with historical no-tier rows.
- Responses: surface the top-level `response.service_tier` as the bucket
  `tier`. Drop `default` and `auto` (both denote base pricing) so we don't
  needlessly split base-tier buckets. The WebSocket path now reads
  service_tier from the response object too — matching the HTTP path.
- Chat Completions: same as Responses but reading the top-level
  `chunk.service_tier` (chat.completion[.chunk]).
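
The Messages extraction rule above could look roughly like this. It is a sketch under assumptions: `extractCacheWrites` and `ExtractedUsage` are hypothetical names, and missing per-TTL counts inside the sub-object are assumed to mean zero.

```typescript
// Subset of the Messages usage payload relevant here.
interface MessagesUsage {
  cache_creation_input_tokens?: number;
  cache_creation?: {
    ephemeral_5m_input_tokens?: number;
    ephemeral_1h_input_tokens?: number;
  };
  speed?: string;
}

// Hypothetical output shape for illustration.
interface ExtractedUsage {
  input_cache_write: number;    // 5m TTL writes
  input_cache_write_1h: number; // 1h TTL writes
  tier: string | null;
}

function extractCacheWrites(usage: MessagesUsage): ExtractedUsage {
  // Only 'fast' is captured; standard stays unset so it merges with old rows.
  const tier = usage.speed === "fast" ? "fast" : null;
  const split = usage.cache_creation;
  if (split) {
    // The structured sub-object takes precedence over the rolled-up flat field.
    return {
      input_cache_write: split.ephemeral_5m_input_tokens ?? 0,
      input_cache_write_1h: split.ephemeral_1h_input_tokens ?? 0,
      tier,
    };
  }
  // No sub-object: attribute the flat count to the 5m dimension.
  return { input_cache_write: usage.cache_creation_input_tokens ?? 0, input_cache_write_1h: 0, tier };
}

console.log(extractCacheWrites({
  cache_creation_input_tokens: 300,
  cache_creation: { ephemeral_5m_input_tokens: 100, ephemeral_1h_input_tokens: 200 },
  speed: "fast",
}));
```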

Protocol types grow `MessagesUsage.cache_creation`, `MessagesUsage.speed`,
`ResponsesResult.service_tier`, `ChatCompletionsResult.service_tier`, and
`ChatCompletionsStreamEvent.service_tier`.

Dashboard usage page folds `input_cache_write_1h` into the existing Cache
Write column so the per-TTL split shows up correctly in totals, prefill,
and per-bucket detail rows. The model editor exposes a dedicated 1-hour
cache write input field so operators on custom upstreams can price it.
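
As a small illustration of the dashboard fold, the Cache Write column can be thought of as the sum of both TTL dimensions. `cacheWriteColumn` and the totals shape are hypothetical; the actual dashboard code is not shown here.

```typescript
// Hypothetical per-bucket totals keyed by billing dimension.
type DimensionTotals = Partial<Record<string, number>>;

// Fold the 1h dimension into the existing Cache Write column.
function cacheWriteColumn(totals: DimensionTotals): number {
  return (totals["input_cache_write"] ?? 0) + (totals["input_cache_write_1h"] ?? 0);
}

console.log(cacheWriteColumn({ input_cache_write: 100, input_cache_write_1h: 200 })); // 300
```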

Tests cover the per-TTL split (sub-object takes precedence over the
rolled-up flat field), fallback to the flat field, speed=fast extraction,
and per-tier override applying during cost compute including the new
input_cache_write_1h dimension fallback chain (1h -> 5m -> input).

Add `tiers.flex` and `tiers.priority` overlays for every priced Codex slug
so the dashboard's notional cost reflects which OpenAI service tier the
request actually ran on. The gateway already captures
`usage.service_tier` into `TokenUsage.tier`; this commit completes the
loop by giving the cost compute a per-tier rate row to look up.

Tier overrides match OpenAI's public pricing (verified 2026-06-19 against
https://platform.openai.com/docs/pricing):

  gpt-5.5         flex $2.5/$0.25/$15         priority $12.5/$1.25/$75
  gpt-5.4         flex $1.25/$0.13/$7.5       priority $5/$0.5/$30
  gpt-5.4-mini    flex $0.375/$0.0375/$2.25   priority $1.5/$0.15/$9

`codex-auto-review` shares `gpt-5.4`'s pricing including the tier
overrides. Codex CLI's `/fast` toggle writes `service_tier: "priority"`
on the wire (per openai/codex's `ServiceTier::Fast.request_value()`), so
operator-facing rows tagged "fast" cost out at the priority row.

Cache-write rate stays unset on these entries — OpenAI charges cache
creation at the same rate as input, which `unitPriceForDimension`'s
fallback chain already covers.
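
A worked example of costing one request against the gpt-5.5 overlays listed above. Two things here are assumptions: that the three figures are input / cached input / output, and that they are USD per 1M tokens. `Rates`, `Tokens`, and `costUsd` are hypothetical names for illustration.

```typescript
// Assumption: rates are USD per 1M tokens, ordered input / cached input / output.
type Rates = { input: number; cachedInput: number; output: number; cacheWrite?: number };
type Tokens = { input: number; cachedInput: number; output: number; cacheWrite: number };

const GPT_5_5_TIERS: Record<"flex" | "priority", Rates> = {
  flex: { input: 2.5, cachedInput: 0.25, output: 15 },
  priority: { input: 12.5, cachedInput: 1.25, output: 75 },
};

function costUsd(rates: Rates, t: Tokens): number {
  // Cache-write rate is unset on these entries, so it falls back to input.
  const cacheWriteRate = rates.cacheWrite ?? rates.input;
  const micros =
    t.input * rates.input +
    t.cachedInput * rates.cachedInput +
    t.output * rates.output +
    t.cacheWrite * cacheWriteRate;
  return micros / 1_000_000;
}

const request: Tokens = { input: 1_000_000, cachedInput: 0, output: 200_000, cacheWrite: 0 };
console.log(costUsd(GPT_5_5_TIERS.flex, request));     // 5.5
console.log(costUsd(GPT_5_5_TIERS.priority, request)); // 27.5
```

The same request costs five times as much on the priority row as on the flex row, which is why the tier has to be part of the stored usage record.
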

Under burst load, two workers can both observe a stale access token on
the same Codex upstream and both attempt a refresh. OpenAI rotates the
refresh_token on every successful /oauth/token call, so exactly one
racer wins; the other's request is rejected with `invalid_grant` for
trying to redeem the rotated-out copy. The previous flow treated every
`invalid_grant` as a dead credential and let the caller flip the
account to `refresh_failed` - destroying the working credential a
sibling had just rotated, and forcing the operator to re-import.

On `invalid_grant`, the access-token cache now re-reads upstream state
for the same `chatgpt-account-id` slot and compares the refresh_token
it tried against what is now stored. If they differ, a sibling rotated
and we return their freshly-minted access token (the caller treats it
as a normal cache hit and skips the terminal flip). If they match, we
re-raise the original error so the data-plane / control-plane caller
flips the row as before. The other refresh-terminal codes -
`app_session_terminated`, `invalid_refresh_token`, `invalid_client`,
`unauthorized_client`, `access_denied` - bypass recovery entirely;
none of them are caused by a rotation race.
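
The recovery decision above can be sketched as follows. This is an illustration under assumptions, not the PR's implementation: `UpstreamState`, `classifyInvalidGrant`, and the `readState` callback are hypothetical, and the real access-token cache and upstream repo interfaces are not shown in this message.

```typescript
// Hypothetical persisted state for one chatgpt-account-id slot.
interface UpstreamState {
  refreshToken: string;
  accessToken: string;
}

type Recovery =
  | { kind: "sibling_rotated"; accessToken: string }
  | { kind: "terminal" };

// Pure compare step: did someone else rotate the token we tried to redeem?
function classifyInvalidGrant(attemptedRefreshToken: string, stored: UpstreamState): Recovery {
  return stored.refreshToken !== attemptedRefreshToken
    ? { kind: "sibling_rotated", accessToken: stored.accessToken }
    : { kind: "terminal" };
}

// Sketch of the surrounding flow; readState stands in for the upstream repo.
async function recoverFromRefreshRace(
  oauthCode: string,
  attemptedRefreshToken: string,
  readState: () => Promise<UpstreamState>,
  originalError: Error,
): Promise<string> {
  // Other terminal codes are never caused by a rotation race.
  if (oauthCode !== "invalid_grant") throw originalError;
  const outcome = classifyInvalidGrant(attemptedRefreshToken, await readState());
  if (outcome.kind === "sibling_rotated") return outcome.accessToken;
  // Stored token is the one we tried: the credential really is dead.
  throw originalError;
}
```

Keeping the comparison pure makes the race outcome easy to unit test without a live OAuth endpoint.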

`CodexOAuthSessionTerminatedError` now carries the raw OAuth `error`
value as a `code` field alongside the existing `upstreamMessage` so the
recovery branch can single out `invalid_grant` from the catch.
`REFRESH_TERMINAL_OAUTH_CODES` is broadened to the audit-aligned set
(`app_session_terminated`, `invalid_grant`, `invalid_refresh_token`,
`invalid_client`, `unauthorized_client`, `access_denied`) - Codex is
OpenAI OAuth, so the list matches sub2api's `isNonRetryableRefreshError`
verbatim.
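
A minimal sketch of an error type that carries the raw OAuth code, so a catch site can single out `invalid_grant`. The class body and the `isRefreshRaceCandidate` helper are assumptions for illustration; only the class name, the `code` and `upstreamMessage` fields, and the code list come from this message.

```typescript
// The terminal set as listed in the commit message.
const REFRESH_TERMINAL_OAUTH_CODES: ReadonlySet<string> = new Set([
  "app_session_terminated",
  "invalid_grant",
  "invalid_refresh_token",
  "invalid_client",
  "unauthorized_client",
  "access_denied",
]);

// Sketch only; the real class may differ in constructor shape.
class CodexOAuthSessionTerminatedError extends Error {
  readonly code: string;
  readonly upstreamMessage: string;

  constructor(code: string, upstreamMessage: string) {
    super(`Codex OAuth session terminated (${code}): ${upstreamMessage}`);
    this.name = "CodexOAuthSessionTerminatedError";
    this.code = code;
    this.upstreamMessage = upstreamMessage;
    // Keeps instanceof working when compiled to older targets.
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

// Hypothetical helper: only invalid_grant is eligible for race recovery.
function isRefreshRaceCandidate(err: unknown): boolean {
  return err instanceof CodexOAuthSessionTerminatedError && err.code === "invalid_grant";
}

const raceErr = new CodexOAuthSessionTerminatedError("invalid_grant", "token already used");
console.log(isRefreshRaceCandidate(raceErr)); // true
```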

Sub2api's tryRecoverFromRefreshRace
(backend/internal/service/oauth_refresh_api.go:173-193) is the canonical
pattern; we apply it to Codex's per-account credential here. The token
rotation persistence hook stays awaited; the recovery branch reads from
the just-persisted state via the upstream repo and returns the
sibling's cached access token directly so no second mint fires from
this call site.
Menci changed the title from "Tier-aware billing schema + per-TTL cache + Codex flex/priority pricing + Codex refresh-race recovery" to "feat(billing,codex): per-tier pricing overlays + per-TTL cache writes" Jun 19, 2026
Menci force-pushed the precursor-tier-aware-billing branch from 5eb94f0 to c634ddc June 19, 2026 19:20
Menci (Owner, Author) commented Jun 19, 2026

Split into #67 (Codex refresh-race), #68 (input_cache_write_1h + migration), #69 (tier-aware overlay + extractors + Codex pricing) for focused review.

Menci closed this Jun 19, 2026