feat(billing,codex): per-tier pricing overlays + per-TTL cache writes #65
Closed
Menci wants to merge 4 commits into
…ier pricing override

Adds two cross-provider concepts the BillingDimension-based shape did not capture:

- `input_cache_write_1h` is a new disjoint billing dimension for Anthropic's extended-cache-ttl-2025-04-11 1-hour cache writes. `cache_creation` on the Messages response splits writes into `ephemeral_5m_input_tokens` (existing `input_cache_write`) and `ephemeral_1h_input_tokens` (this new dimension). `unitPriceForDimension` falls back 1h -> 5m -> input.
- `ModelPricing.tiers` carries per-service-tier overrides (Anthropic `usage.speed`, OpenAI `usage.service_tier`). `resolveEffectivePricing` folds a tier override into a flat snapshot before any unit-price lookup.

`UsageRecord` and the SQL usage tables grow a `tier` column that is part of bucket identity, so distinct tiers for the same model/hour aggregate as separate buckets with distinct unit-price snapshots.

Migration 0034 recreates `usage` and `usage_requests` to add the `tier` column and to widen the dimension CHECK list to include the new bucket. Existing rows backfill with `tier = NULL`, which `resolveEffectivePricing` treats as base pricing, so historical aggregations compute identically.
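The overlay-then-fallback order described in this commit can be sketched as follows. This is a minimal illustration, not the PR's code: the `Rates` and `ModelPricing` shapes, the shallow per-dimension merge, the `FALLBACK` table, and every price in `examplePricing` are assumptions made for the example; only the function names, the dimension names, and the 1h -> 5m -> input chain come from the commit message.

```typescript
// Hypothetical shapes for illustration; the real types in the PR may differ.
type BillingDimension =
  | "input"
  | "input_cache_write"
  | "input_cache_write_1h"
  | "output";

type Rates = Partial<Record<BillingDimension, number>>;

interface ModelPricing {
  rates: Rates;                  // base-tier unit prices
  tiers?: Record<string, Rates>; // per-service-tier overrides
}

// Fold the tier override (if any) into a flat snapshot.
// A null/unknown tier yields base pricing unchanged.
function resolveEffectivePricing(
  pricing: ModelPricing,
  tier: string | null | undefined,
): Rates {
  const override = tier ? pricing.tiers?.[tier] : undefined;
  return { ...pricing.rates, ...(override ?? {}) };
}

// Fallback chain from the commit message: 1h -> 5m -> input.
const FALLBACK: Partial<Record<BillingDimension, BillingDimension>> = {
  input_cache_write_1h: "input_cache_write",
  input_cache_write: "input",
};

function unitPriceForDimension(
  rates: Rates,
  dim: BillingDimension,
): number | undefined {
  let current: BillingDimension | undefined = dim;
  while (current !== undefined) {
    const price = rates[current];
    if (price !== undefined) return price;
    current = FALLBACK[current];
  }
  return undefined;
}

// Illustrative numbers only; they are not taken from any price list.
const examplePricing: ModelPricing = {
  rates: { input: 1.0, input_cache_write: 1.25, output: 4.0 },
  tiers: { priority: { input: 2.0, output: 8.0 } },
};
```

Because the override is folded in before any lookup, a tier that reprices only `input` still inherits the base `input_cache_write` rate, and a model with no cache-write rate at all resolves 1-hour writes to whatever `input` costs on that tier.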
…tocol shapes

- Messages: when the upstream emits `usage.cache_creation` (the structured sub-object from extended-cache-ttl-2025-04-11), split the per-TTL counts onto `input_cache_write` (5m) and `input_cache_write_1h` (1h). Fall back to the flat `cache_creation_input_tokens` when the sub-object is absent. Capture `usage.speed: 'fast'` as `tier: 'fast'`; standard tier is left unset so base-tier rows aggregate with historical no-tier rows.
- Responses: surface the top-level `response.service_tier` as the bucket `tier`. Drop `default` and `auto` (both denote base pricing) so we don't needlessly split base-tier buckets. The WebSocket path now reads service_tier from the response object too, matching the HTTP path.
- Chat Completions: same as Responses but reading the top-level `chunk.service_tier` (chat.completion[.chunk]).

Protocol types grow `MessagesUsage.cache_creation`, `MessagesUsage.speed`, `ResponsesResult.service_tier`, `ChatCompletionsResult.service_tier`, and `ChatCompletionsStreamEvent.service_tier`.

Dashboard usage page folds `input_cache_write_1h` into the existing Cache Write column so the per-TTL split shows up correctly in totals, prefill, and per-bucket detail rows. The model editor exposes a dedicated 1-hour cache write input field so operators on custom upstreams can price it.

Tests cover the per-TTL split (sub-object takes precedence over the rolled-up flat field), fallback to the flat field, speed=fast extraction, and per-tier override applying during cost compute including the new input_cache_write_1h dimension fallback chain (1h -> 5m -> input).
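A rough sketch of the extraction rules listed above, assuming simplified types. `extractMessagesCacheWrites` and `normalizeServiceTier` are hypothetical names invented for this example, and the `MessagesUsage` interface below carries only the fields the commit message mentions; the actual extractor in the PR may be structured differently.

```typescript
// Simplified, assumed shape: only the fields named in the commit message.
interface MessagesUsage {
  cache_creation_input_tokens?: number;
  cache_creation?: {
    ephemeral_5m_input_tokens?: number;
    ephemeral_1h_input_tokens?: number;
  };
  speed?: string;
}

interface CacheWriteSplit {
  input_cache_write: number;    // 5-minute TTL writes
  input_cache_write_1h: number; // 1-hour TTL writes
  tier: string | null;
}

// Hypothetical helper: the structured sub-object takes precedence over
// the rolled-up flat field; the flat field is used only when it is absent.
function extractMessagesCacheWrites(usage: MessagesUsage): CacheWriteSplit {
  const sub = usage.cache_creation;
  const writes = sub
    ? {
        input_cache_write: sub.ephemeral_5m_input_tokens ?? 0,
        input_cache_write_1h: sub.ephemeral_1h_input_tokens ?? 0,
      }
    : {
        input_cache_write: usage.cache_creation_input_tokens ?? 0,
        input_cache_write_1h: 0,
      };
  // Only 'fast' is recorded; standard speed stays unset (null) so it
  // aggregates with historical rows that have no tier.
  return { ...writes, tier: usage.speed === "fast" ? "fast" : null };
}

// Hypothetical helper for the Responses / Chat Completions paths:
// base-pricing tiers collapse to null, anything else is kept verbatim.
function normalizeServiceTier(raw: string | null | undefined): string | null {
  if (!raw || raw === "default" || raw === "auto" || raw === "standard") {
    return null;
  }
  return raw;
}
```

Collapsing base-pricing tier names to null keeps new rows in the same bucket as rows written before the `tier` column existed.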
Add `tiers.flex` and `tiers.priority` overlays for every priced Codex slug so the dashboard's notional cost reflects which OpenAI service tier the request actually ran on. The gateway already captures `usage.service_tier` into `TokenUsage.tier`; this commit completes the loop by giving the cost compute a per-tier rate row to look up.

Tier overrides match OpenAI's public pricing (verified 2026-06-19 against https://platform.openai.com/docs/pricing):

- gpt-5.5: flex \$2.5/\$0.25/\$15, priority \$12.5/\$1.25/\$75
- gpt-5.4: flex \$1.25/\$0.13/\$7.5, priority \$5/\$0.5/\$30
- gpt-5.4-mini: flex \$0.375/\$0.0375/\$2.25, priority \$1.5/\$0.15/\$9

`codex-auto-review` shares `gpt-5.4`'s pricing including the tier overrides.

Codex CLI's `/fast` toggle writes `service_tier: "priority"` on the wire (per openai/codex's `ServiceTier::Fast.request_value()`), so operator-facing rows tagged "fast" cost out at the priority row.

Cache-write rate stays unset on these entries: OpenAI charges cache creation at the same rate as input, which `unitPriceForDimension`'s fallback chain already covers.
Under burst load, two workers can both observe a stale access token on the same Codex upstream and both attempt a refresh. OpenAI rotates the refresh_token on every successful /oauth/token call, so exactly one racer wins; the other's request is rejected with `invalid_grant` for trying to redeem the rotated-out copy. The previous flow treated every `invalid_grant` as a dead credential and let the caller flip the account to `refresh_failed`, destroying the working credential a sibling had just rotated and forcing the operator to re-import.

On `invalid_grant`, the access-token cache now re-reads upstream state for the same `chatgpt-account-id` slot and compares the refresh_token it tried against what is now stored. If they differ, a sibling rotated and we return their freshly-minted access token (the caller treats it as a normal cache hit and skips the terminal flip). If they match, we re-raise the original error so the data-plane / control-plane caller flips the row as before. The other refresh-terminal codes (`app_session_terminated`, `invalid_refresh_token`, `invalid_client`, `unauthorized_client`, `access_denied`) bypass recovery entirely; none of them is caused by a rotation race.

`CodexOAuthSessionTerminatedError` now carries the raw OAuth `error` value as a `code` field alongside the existing `upstreamMessage` so the recovery branch can single out `invalid_grant` from the catch. `REFRESH_TERMINAL_OAUTH_CODES` is broadened to the audit-aligned set (`app_session_terminated`, `invalid_grant`, `invalid_refresh_token`, `invalid_client`, `unauthorized_client`, `access_denied`). Codex is OpenAI OAuth, so the list matches sub2api's `isNonRetryableRefreshError` verbatim.

Sub2api's tryRecoverFromRefreshRace (backend/internal/service/oauth_refresh_api.go:173-193) is the canonical pattern; we apply it to Codex's per-account credential here.
The token rotation persistence hook stays awaited; the recovery branch reads from the just-persisted state via the upstream repo and returns the sibling's cached access token directly so no second mint fires from this call site.
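The decision at the heart of the recovery branch can be sketched as a pure function. This is an illustration under assumptions: `StoredCredential`, `RecoveryDecision`, and `decideRefreshRecovery` are hypothetical names, and the real code reads stored state asynchronously through the upstream repo and re-raises the original error rather than returning a tagged result.

```typescript
// Hypothetical snapshot of what is persisted for one account slot.
interface StoredCredential {
  refreshToken: string;
  accessToken: string;
}

type RecoveryDecision =
  | { kind: "use_sibling_token"; accessToken: string }
  | { kind: "terminal" };

// Called after a refresh attempt failed with a refresh-terminal OAuth code.
// `attemptedRefreshToken` is the token this worker tried to redeem;
// `stored` is the state re-read from storage after the failure.
function decideRefreshRecovery(
  code: string,
  attemptedRefreshToken: string,
  stored: StoredCredential | null,
): RecoveryDecision {
  // Only invalid_grant can be caused by a rotation race; every other
  // terminal code bypasses recovery.
  if (code !== "invalid_grant") return { kind: "terminal" };

  // Same token still stored: nobody rotated, the credential is dead.
  if (stored === null || stored.refreshToken === attemptedRefreshToken) {
    return { kind: "terminal" };
  }

  // A sibling rotated first: reuse its access token, no second mint.
  return { kind: "use_sibling_token", accessToken: stored.accessToken };
}
```

For example, if a worker tried refresh token `rt-old` and storage now holds `rt-new`, the result is `use_sibling_token` and the account is left untouched; if storage still holds `rt-old`, the result is `terminal` and the caller flips the row as before.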
5eb94f0 to c634ddc
Summary
One of four parallel precursor PRs splitting cross-cutting infrastructure out of #60.
- `BillingDimension` adds `input_cache_write_1h` (Anthropic's `extended-cache-ttl-2025-04-11` 1h TTL). `ModelPricing` gains a `tiers?` overlay; `TokenUsage` gains a `tier` field. Cost compute resolves effective pricing through `resolveEffectivePricing(pricing, tier)`. Migration `0034_usage_per_ttl_and_tier.sql` widens the dimension CHECK list and adds `tier` to `usage` + `usage_requests` bucket identity (existing rows backfill `tier = NULL`, treated as base pricing → historical aggregations compute identically).
- `usage.cache_creation.ephemeral_{5m,1h}_input_tokens` (split when present) + `usage.speed`. Responses + Chat read `usage.service_tier`. `'standard'`/`'default'`/`'auto'` normalize to null (base); `'fast'`/`'priority'`/`'flex'` stored verbatim.
- `tiers.flex` (50% off) + `tiers.priority` (~2x) overlays per OpenAI's published rates. Codex CLI's `/fast` writes `service_tier: "priority"` per `openai/codex` source, so operator-facing rows tagged "fast" cost out at the priority row.
- `recoverFromRefreshRace`: on `invalid_grant`, re-read state, compare refresh_token, defer to sibling rotation if changed. Same architecture as the Claude Code data-plane in feat(provider): Claude Code subscription provider #60. `REFRESH_TERMINAL_OAUTH_CODES` expanded to match sub2api's full non-retryable set.

Test plan
- `websocket_test.ts` predate this branch and are owned by PR3)
- `pnpm run db:migrate:remote && pnpm run deploy`: applies migration 0034

Follow-ups
- PR3 (control-plane endpoints) ships a per-upstream usage rollup that doesn't yet apply tier-aware compute; once this PR lands, the rollup can be extended to use `resolveEffectivePricing`.
- #60 (claude-code) consumes the tier extractor for `usage.speed` and ships the corresponding pricing tables (Opus 4.6+ fast-mode rates, Sonnet/Haiku/Fable 1h cache rates).