fix(codex): recover from concurrent refresh-token rotation races #67
Merged
Under burst load, two workers can both observe a stale access token on the same Codex upstream and both attempt a refresh. OpenAI rotates the refresh_token on every successful `/oauth/token` call, so exactly one racer wins; the other's request is rejected with `invalid_grant` for trying to redeem the rotated-out copy. The previous flow treated every `invalid_grant` as a dead credential and let the caller flip the account to `refresh_failed` - destroying the working credential a sibling had just rotated, and forcing the operator to re-import.

On `invalid_grant`, the access-token cache now re-reads upstream state for the same `chatgpt-account-id` slot and compares the refresh_token it tried against what is now stored. If they differ, a sibling rotated and we return their freshly-minted access token (the caller treats it as a normal cache hit and skips the terminal flip). If they match, we re-raise the original error so the data-plane / control-plane caller flips the row as before. The other refresh-terminal codes - `app_session_terminated`, `invalid_refresh_token`, `invalid_client`, `unauthorized_client`, `access_denied` - bypass recovery entirely; none of them are caused by a rotation race.

`CodexOAuthSessionTerminatedError` now carries the raw OAuth `error` value as a `code` field alongside the existing `upstreamMessage` so the recovery branch can single out `invalid_grant` from the catch. `REFRESH_TERMINAL_OAUTH_CODES` is broadened to the audit-aligned set (`app_session_terminated`, `invalid_grant`, `invalid_refresh_token`, `invalid_client`, `unauthorized_client`, `access_denied`) - Codex is OpenAI OAuth, so the list matches sub2api's `isNonRetryableRefreshError` verbatim.

Sub2api's `tryRecoverFromRefreshRace` (backend/internal/service/oauth_refresh_api.go:173-193) is the canonical pattern; we apply it to Codex's per-account credential here.
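
The error-classification change described above can be sketched roughly as follows. This is an illustrative TypeScript sketch, not the code from this PR: `classifyRefreshFailure`, `isRaceCandidate`, and the `OAuthErrorBody` shape are hypothetical names introduced here. Only `CodexOAuthSessionTerminatedError`, its `code` and `upstreamMessage` fields, and `REFRESH_TERMINAL_OAUTH_CODES` with its six codes come from the description.

```typescript
// Refresh-terminal OAuth error codes, as listed in the description.
const REFRESH_TERMINAL_OAUTH_CODES: ReadonlySet<string> = new Set([
  "app_session_terminated",
  "invalid_grant",
  "invalid_refresh_token",
  "invalid_client",
  "unauthorized_client",
  "access_denied",
]);

// Terminal refresh failure carrying the raw OAuth `error` value as `code`.
class CodexOAuthSessionTerminatedError extends Error {
  readonly code: string;
  readonly upstreamMessage: string;

  constructor(code: string, upstreamMessage: string) {
    super(`codex oauth session terminated (${code}): ${upstreamMessage}`);
    this.name = "CodexOAuthSessionTerminatedError";
    this.code = code;
    this.upstreamMessage = upstreamMessage;
  }
}

// Assumed shape of an OAuth error response body (field names per RFC 6749, section 5.2).
interface OAuthErrorBody {
  error?: string;
  error_description?: string;
}

// Hypothetical helper: returns a terminal error for codes in the set,
// or null when the failure is not terminal and the refresh may be retried.
function classifyRefreshFailure(
  body: OAuthErrorBody,
): CodexOAuthSessionTerminatedError | null {
  const code = body.error ?? "";
  if (!REFRESH_TERMINAL_OAUTH_CODES.has(code)) {
    return null;
  }
  return new CodexOAuthSessionTerminatedError(code, body.error_description ?? "");
}

// Hypothetical helper: only `invalid_grant` can be the losing side of a
// rotation race, so it is the only code eligible for recovery.
function isRaceCandidate(err: unknown): boolean {
  return err instanceof CodexOAuthSessionTerminatedError && err.code === "invalid_grant";
}
```

Keeping `code` on the error lets the catch site distinguish "possibly lost a race" from "definitely dead" without parsing the message string.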
The token rotation persistence hook stays awaited; the recovery branch reads from the just-persisted state via the upstream repo and returns the sibling's cached access token directly so no second mint fires from this call site.
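
The recovery branch described above might look roughly like the sketch below. It is a minimal illustration under stated assumptions, not the implementation in `access-token-cache.ts`: the `UpstreamRepo` interface, the `StoredCredential` shape, the `refreshWithRaceRecovery` wrapper, and all parameter names are hypothetical, and the error class is a trimmed stand-in for the real one.

```typescript
// Trimmed stand-in for the error type named in this PR.
class CodexOAuthSessionTerminatedError extends Error {
  readonly code: string;

  constructor(code: string, upstreamMessage: string) {
    super(upstreamMessage);
    this.code = code;
  }
}

// Assumed shape of the persisted credential for one account slot.
interface StoredCredential {
  refreshToken: string;
  accessToken: string;
}

// Hypothetical repo interface: re-reads persisted state for one account slot.
interface UpstreamRepo {
  readCredential(accountId: string): Promise<StoredCredential | undefined>;
}

// Resolves with the sibling's access token when the stored refresh token
// differs from the one this worker tried; otherwise rethrows `err` unchanged.
async function recoverFromRefreshRace(
  repo: UpstreamRepo,
  accountId: string,
  attemptedRefreshToken: string,
  err: unknown,
): Promise<string> {
  // Every code other than invalid_grant bypasses recovery entirely.
  if (!(err instanceof CodexOAuthSessionTerminatedError) || err.code !== "invalid_grant") {
    throw err;
  }
  const current = await repo.readCredential(accountId);
  if (current !== undefined && current.refreshToken !== attemptedRefreshToken) {
    // A sibling rotated: hand back its token, with no second refresh call.
    return current.accessToken;
  }
  // Same token still stored: the credential is genuinely dead.
  throw err;
}

// Hypothetical call site: one refresh attempt, then at most one recovery.
async function refreshWithRaceRecovery(
  repo: UpstreamRepo,
  accountId: string,
  attemptedRefreshToken: string,
  refresh: (refreshToken: string) => Promise<string>,
): Promise<string> {
  try {
    return await refresh(attemptedRefreshToken);
  } catch (err) {
    return recoverFromRefreshRace(repo, accountId, attemptedRefreshToken, err);
  }
}
```

Because the recovery path never calls `refresh` again, it cannot recurse, which is one simple way to bound recovery to a single attempt per request. A rethrown error reaches the caller unchanged, so the existing flip to `refresh_failed` still applies to genuinely dead credentials.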
Summary
Codex OAuth refresh-token rotation can race under concurrent refreshes: two workers attempting refresh from the same cached refresh_token deterministically produce one winner + one `invalid_grant` loser. Without recovery, the loser flips the credential to `refresh_failed` and destroys the rotation a sibling just persisted.

- `CodexOAuthSessionTerminatedError` now carries the OAuth `code` field alongside the existing message, so the cache layer can branch on `invalid_grant` specifically vs other terminal codes.
- `REFRESH_TERMINAL_OAUTH_CODES` expanded to match sub2api's `isNonRetryableRefreshError` set (`app_session_terminated`, `invalid_grant`, `invalid_refresh_token`, `invalid_client`, `unauthorized_client`, `access_denied`) - dead credentials get retried forever on the missing codes today.
- `recoverFromRefreshRace` in `access-token-cache.ts`: on `invalid_grant`, re-read state, compare refresh_token; if changed → sibling rotated, return sibling's access token as a normal cache hit; if same → genuine death, flip terminal. Depth-guarded to at most one recovery attempt per request. Scoped to the requested `accountId` slot (Codex supports N accounts per upstream).

Sub2api reference: `oauth_refresh_api.go:tryRecoverFromRefreshRace`.

Test plan