feat(billing,codex): per-tier pricing overlays + per-TTL cache writes by Menci · Pull Request #65 · Menci/Floway

Menci · 2026-06-19T17:14:45Z

Summary

One of four parallel precursor PRs splitting cross-cutting infrastructure out of #60.

- Tier-aware billing schema: `BillingDimension` adds `input_cache_write_1h` (Anthropic's extended-cache-ttl-2025-04-11 1h TTL). `ModelPricing` gains a `tiers?` overlay; `TokenUsage` gains a `tier` field. Cost compute resolves effective pricing through `resolveEffectivePricing(pricing, tier)`. Migration `0034_usage_per_ttl_and_tier.sql` widens the dimension CHECK list and adds `tier` to the `usage` + `usage_requests` bucket identity (existing rows backfill `tier = NULL`, treated as base pricing, so historical aggregations compute identically).
- Per-shape tier extractors: Messages reads `usage.cache_creation.ephemeral_{5m,1h}_input_tokens` (split when present) + `usage.speed`. Responses + Chat read `usage.service_tier`. `'standard'` / `'default'` / `'auto'` normalize to `null` (base); `'fast'` / `'priority'` / `'flex'` are stored verbatim.
- Codex flex/priority pricing: every priced Codex slug gains `tiers.flex` (50% off) + `tiers.priority` (~2x) overlays per OpenAI's published rates. Codex CLI's `/fast` writes `service_tier: "priority"` per the openai/codex source, so operator-facing rows tagged "fast" cost out at the priority row.
- Codex refresh-token race recovery: `recoverFromRefreshRace`: on `invalid_grant`, re-read state, compare `refresh_token`, and defer to the sibling rotation if it changed. Same architecture as the Claude Code data-plane in #60 (feat(provider): Claude Code subscription provider). `REFRESH_TERMINAL_OAUTH_CODES` is expanded to match sub2api's full non-retryable set.

Test plan

typecheck + test (2714 passing, +14 net new tests) clean
lint clean for changes in this PR (2 pre-existing import-order warnings in websocket_test.ts predate this branch and are owned by PR3)
Deploy: pnpm run db:migrate:remote && pnpm run deploy — applies migration 0034

Follow-ups

PR3 (control-plane endpoints) ships a per-upstream usage rollup that doesn't yet apply tier-aware compute; once this PR lands, the rollup can be extended to use resolveEffectivePricing. #60 (claude-code) consumes the tier extractor for usage.speed and ships the corresponding pricing tables (Opus 4.6+ fast-mode rates, Sonnet/Haiku/Fable 1h cache rates).

…ier pricing override

Adds two cross-provider concepts the BillingDimension-based shape did not capture:

- `input_cache_write_1h` is a new disjoint billing dimension for Anthropic's extended-cache-ttl-2025-04-11 1-hour cache writes. `cache_creation` on the Messages response splits writes into `ephemeral_5m_input_tokens` (existing `input_cache_write`) and `ephemeral_1h_input_tokens` (this new dimension). `unitPriceForDimension` falls back 1h -> 5m -> input.
- `ModelPricing.tiers` carries per-service-tier overrides (Anthropic `usage.speed`, OpenAI `usage.service_tier`). `resolveEffectivePricing` folds a tier override into a flat snapshot before any unit-price lookup. `UsageRecord` and the SQL usage tables grow a `tier` column that is part of bucket identity, so distinct tiers for the same model/hour aggregate as separate buckets with distinct unit-price snapshots.

Migration 0034 recreates `usage` and `usage_requests` to add the `tier` column and to widen the dimension CHECK list to include the new bucket. Existing rows backfill with `tier = NULL`, which `resolveEffectivePricing` treats as base pricing, so historical aggregations compute identically.
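The overlay fold and the 1h -> 5m -> input fallback described in this commit can be sketched as follows. This is a minimal illustration, not the code in the PR: the type shapes, the `output` dimension name, the per-field merge of the overlay onto base rates, and the `0` returned for an unpriced dimension are all assumptions.

```typescript
// Hypothetical shapes; the real BillingDimension / ModelPricing may differ.
type BillingDimension =
  | "input"
  | "output"
  | "input_cache_write"
  | "input_cache_write_1h";

type Rates = Partial<Record<BillingDimension, number>>;

interface ModelPricing extends Rates {
  // Per-service-tier overrides, keyed by the stored tier string.
  tiers?: Record<string, Rates>;
}

// Fold a tier override into a flat snapshot. A null tier (including
// backfilled historical rows) and an unknown tier both yield base pricing.
function resolveEffectivePricing(pricing: ModelPricing, tier: string | null): Rates {
  const { tiers, ...base } = pricing;
  const overlay = tier === null ? undefined : tiers?.[tier];
  return overlay ? { ...base, ...overlay } : base;
}

// Lookup order per dimension: 1h cache write -> 5m cache write -> input.
const FALLBACK: Record<BillingDimension, BillingDimension[]> = {
  input: ["input"],
  output: ["output"],
  input_cache_write: ["input_cache_write", "input"],
  input_cache_write_1h: ["input_cache_write_1h", "input_cache_write", "input"],
};

function unitPriceForDimension(rates: Rates, dim: BillingDimension): number {
  for (const candidate of FALLBACK[dim]) {
    const price = rates[candidate];
    if (price !== undefined) return price;
  }
  return 0; // assumption: an unpriced dimension costs nothing
}

// Usage: no cache-write rate is set, so the 1h dimension prices as input.
const examplePricing: ModelPricing = {
  input: 2,
  output: 8,
  tiers: { flex: { input: 1, output: 4 } },
};
const flexRates = resolveEffectivePricing(examplePricing, "flex");
console.log(unitPriceForDimension(flexRates, "input_cache_write_1h"));
```

Resolving the overlay once, before any unit-price lookup, keeps the fallback chain unaware of tiers, which matches the "flat snapshot" wording above.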

…tocol shapes

- Messages: when the upstream emits `usage.cache_creation` (the structured sub-object from extended-cache-ttl-2025-04-11), split the per-TTL counts onto `input_cache_write` (5m) and `input_cache_write_1h` (1h). Fall back to the flat `cache_creation_input_tokens` when the sub-object is absent. Capture `usage.speed: 'fast'` as `tier: 'fast'`; standard tier is left unset so base-tier rows aggregate with historical no-tier rows.
- Responses: surface the top-level `response.service_tier` as the bucket `tier`. Drop `default` and `auto` (both denote base pricing) so we don't needlessly split base-tier buckets. The WebSocket path now reads service_tier from the response object too, matching the HTTP path.
- Chat Completions: same as Responses but reading the top-level `chunk.service_tier` (chat.completion[.chunk]).

Protocol types grow `MessagesUsage.cache_creation`, `MessagesUsage.speed`, `ResponsesResult.service_tier`, `ChatCompletionsResult.service_tier`, and `ChatCompletionsStreamEvent.service_tier`.

Dashboard usage page folds `input_cache_write_1h` into the existing Cache Write column so the per-TTL split shows up correctly in totals, prefill, and per-bucket detail rows. The model editor exposes a dedicated 1-hour cache write input field so operators on custom upstreams can price it.

Tests cover the per-TTL split (sub-object takes precedence over the rolled-up flat field), fallback to the flat field, speed=fast extraction, and per-tier override applying during cost compute including the new input_cache_write_1h dimension fallback chain (1h -> 5m -> input).
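A rough sketch of the Messages extractor and the tier normalization described above. The field names on `usage` come from the commit message; the function names `normalizeTier` and `extractMessagesUsage`, the `ExtractedUsage` shape, and the treatment of missing counts as zero are illustrative assumptions.

```typescript
// Subset of the Messages usage object relevant to cache writes and tiers.
interface MessagesUsage {
  cache_creation_input_tokens?: number;
  cache_creation?: {
    ephemeral_5m_input_tokens?: number;
    ephemeral_1h_input_tokens?: number;
  };
  speed?: string;
}

// Hypothetical result shape for illustration only.
interface ExtractedUsage {
  input_cache_write: number;
  input_cache_write_1h: number;
  tier: string | null;
}

// Base-pricing labels collapse to null so they share a bucket with
// historical rows that have no tier; every other value is kept verbatim.
function normalizeTier(raw: string | null | undefined): string | null {
  if (!raw) return null;
  return raw === "standard" || raw === "default" || raw === "auto" ? null : raw;
}

function extractMessagesUsage(usage: MessagesUsage): ExtractedUsage {
  const tier = normalizeTier(usage.speed);
  const split = usage.cache_creation;
  if (split) {
    // The structured sub-object takes precedence over the rolled-up field.
    return {
      input_cache_write: split.ephemeral_5m_input_tokens ?? 0,
      input_cache_write_1h: split.ephemeral_1h_input_tokens ?? 0,
      tier,
    };
  }
  // No per-TTL breakdown: attribute everything to the 5m dimension.
  return {
    input_cache_write: usage.cache_creation_input_tokens ?? 0,
    input_cache_write_1h: 0,
    tier,
  };
}

// Usage: the flat total is ignored once the per-TTL split is present.
console.log(
  extractMessagesUsage({
    cache_creation_input_tokens: 300,
    cache_creation: { ephemeral_5m_input_tokens: 100, ephemeral_1h_input_tokens: 200 },
    speed: "fast",
  }),
);
```

The same `normalizeTier` helper would apply to the `service_tier` value read by the Responses and Chat Completions paths, under the same assumptions.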

Add `tiers.flex` and `tiers.priority` overlays for every priced Codex slug so the dashboard's notional cost reflects which OpenAI service tier the request actually ran on. The gateway already captures `usage.service_tier` into `TokenUsage.tier`; this commit completes the loop by giving the cost compute a per-tier rate row to look up.

Tier overrides match OpenAI's public pricing (verified 2026-06-19 against https://platform.openai.com/docs/pricing):

- gpt-5.5: flex $2.5/$0.25/$15, priority $12.5/$1.25/$75
- gpt-5.4: flex $1.25/$0.13/$7.5, priority $5/$0.5/$30
- gpt-5.4-mini: flex $0.375/$0.0375/$2.25, priority $1.5/$0.15/$9

`codex-auto-review` shares `gpt-5.4`'s pricing including the tier overrides.

Codex CLI's `/fast` toggle writes `service_tier: "priority"` on the wire (per openai/codex's `ServiceTier::Fast.request_value()`), so operator-facing rows tagged "fast" cost out at the priority row.

Cache-write rate stays unset on these entries: OpenAI charges cache creation at the same rate as input, which `unitPriceForDimension`'s fallback chain already covers.
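As a worked example of how one of these rows turns into a notional cost, the sketch below prices a single request against the `gpt-5.4` overrides. It assumes each triple is USD per 1M tokens in the order input / cached input / output, which the commit message does not state explicitly; the `TierRates` shape and `notionalCostUsd` helper are hypothetical.

```typescript
// Assumption: rates are USD per 1M tokens, ordered input / cached input / output.
interface TierRates {
  input: number;
  cachedInput: number;
  output: number;
}

interface TokenCounts {
  input: number;
  cachedInput: number;
  output: number;
}

// gpt-5.4 tier overrides as listed in the commit message.
const GPT_5_4_TIERS: Record<"flex" | "priority", TierRates> = {
  flex: { input: 1.25, cachedInput: 0.13, output: 7.5 },
  priority: { input: 5, cachedInput: 0.5, output: 30 },
};

function notionalCostUsd(rates: TierRates, tokens: TokenCounts): number {
  const micros =
    tokens.input * rates.input +
    tokens.cachedInput * rates.cachedInput +
    tokens.output * rates.output;
  return micros / 1_000_000;
}

// 1M uncached input tokens and 200k output tokens.
const request: TokenCounts = { input: 1_000_000, cachedInput: 0, output: 200_000 };

// A request sent with service_tier "priority" (what Codex CLI's /fast puts
// on the wire) looks up the priority row.
console.log(notionalCostUsd(GPT_5_4_TIERS.flex, request)); // 2.75
console.log(notionalCostUsd(GPT_5_4_TIERS.priority, request)); // 11
```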

Under burst load, two workers can both observe a stale access token on the same Codex upstream and both attempt a refresh. OpenAI rotates the refresh_token on every successful /oauth/token call, so exactly one racer wins; the other's request is rejected with `invalid_grant` for trying to redeem the rotated-out copy.

The previous flow treated every `invalid_grant` as a dead credential and let the caller flip the account to `refresh_failed`, destroying the working credential a sibling had just rotated and forcing the operator to re-import.

On `invalid_grant`, the access-token cache now re-reads upstream state for the same `chatgpt-account-id` slot and compares the refresh_token it tried against what is now stored. If they differ, a sibling rotated and we return their freshly-minted access token (the caller treats it as a normal cache hit and skips the terminal flip). If they match, we re-raise the original error so the data-plane / control-plane caller flips the row as before. The other refresh-terminal codes (`app_session_terminated`, `invalid_refresh_token`, `invalid_client`, `unauthorized_client`, `access_denied`) bypass recovery entirely; none of them is caused by a rotation race.

`CodexOAuthSessionTerminatedError` now carries the raw OAuth `error` value as a `code` field alongside the existing `upstreamMessage` so the recovery branch can single out `invalid_grant` from the catch. `REFRESH_TERMINAL_OAUTH_CODES` is broadened to the audit-aligned set (`app_session_terminated`, `invalid_grant`, `invalid_refresh_token`, `invalid_client`, `unauthorized_client`, `access_denied`). Codex is OpenAI OAuth, so the list matches sub2api's `isNonRetryableRefreshError` verbatim. Sub2api's tryRecoverFromRefreshRace (backend/internal/service/oauth_refresh_api.go:173-193) is the canonical pattern; we apply it to Codex's per-account credential here.

The token rotation persistence hook stays awaited; the recovery branch reads from the just-persisted state via the upstream repo and returns the sibling's cached access token directly so no second mint fires from this call site.
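The decision at the heart of the recovery branch can be isolated as a pure function, sketched below. The names `classifyRefreshFailure`, `StoredCredential`, and `RefreshOutcome` are hypothetical, and re-reading upstream state is left to the caller; only the compare-and-defer logic from the description above is shown.

```typescript
// State as re-read from the upstream repo after the failed refresh.
interface StoredCredential {
  refreshToken: string;
  accessToken: string;
}

// Hypothetical outcome type for illustration.
type RefreshOutcome =
  | { kind: "sibling_rotated"; accessToken: string }
  | { kind: "terminal" };

// Codes treated as non-retryable, as listed in the commit message.
const REFRESH_TERMINAL_OAUTH_CODES = new Set<string>([
  "app_session_terminated",
  "invalid_grant",
  "invalid_refresh_token",
  "invalid_client",
  "unauthorized_client",
  "access_denied",
]);

function classifyRefreshFailure(
  code: string,
  attemptedRefreshToken: string,
  stored: StoredCredential,
): RefreshOutcome {
  // Only invalid_grant can be explained by a rotation race.
  if (code !== "invalid_grant") return { kind: "terminal" };
  // Same token still stored: nobody rotated, so the credential is dead.
  if (stored.refreshToken === attemptedRefreshToken) return { kind: "terminal" };
  // A sibling rotated first; reuse the access token it just persisted.
  return { kind: "sibling_rotated", accessToken: stored.accessToken };
}

// Usage: this worker tried "rt-old" while a sibling had already stored "rt-new".
const outcome = classifyRefreshFailure("invalid_grant", "rt-old", {
  refreshToken: "rt-new",
  accessToken: "at-new",
});
console.log(outcome.kind, REFRESH_TERMINAL_OAUTH_CODES.has("invalid_grant"));
```

On a `terminal` outcome the caller would re-raise the original error so the account still flips to `refresh_failed`; on `sibling_rotated` it would return the access token as if it were a cache hit.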

Menci · 2026-06-19T20:12:22Z

Split into #67 (Codex refresh-race), #68 (input_cache_write_1h + migration), #69 (tier-aware overlay + extractors + Codex pricing) for focused review.

Menci added 4 commits June 20, 2026 00:40

Menci changed the title ~~Tier-aware billing schema + per-TTL cache + Codex flex/priority pricing + Codex refresh-race recovery~~ feat(billing,codex): per-tier pricing overlays + per-TTL cache writes Jun 19, 2026

Menci force-pushed the precursor-tier-aware-billing branch from 5eb94f0 to c634ddc June 19, 2026 19:20

This was referenced Jun 19, 2026

fix(codex): recover from concurrent refresh-token rotation races #67

Open

feat(protocols,gateway): input_cache_write_1h dimension + migration adding tier column #68

Open

feat(billing,codex): per-tier pricing overlays + service tier extractors #69

Open

Menci closed this Jun 19, 2026

Menci wants to merge 4 commits into main from precursor-tier-aware-billing
