fix(codex): recover from concurrent refresh-token rotation races#67

Merged
Menci merged 1 commit into main from precursor-codex-race-recovery
Jun 20, 2026

@Menci Menci commented Jun 19, 2026

Summary

Codex OAuth refresh-token rotation can race under concurrent refreshes: when two workers attempt a refresh from the same cached refresh_token, exactly one wins and the other is rejected with invalid_grant. Without recovery, the loser flips the credential to refresh_failed and destroys the rotation a sibling just persisted.

  • CodexOAuthSessionTerminatedError now carries the OAuth code field alongside the existing message, so the cache layer can branch on invalid_grant specifically vs other terminal codes.
  • REFRESH_TERMINAL_OAUTH_CODES expanded to match sub2api's isNonRetryableRefreshError set (app_session_terminated, invalid_grant, invalid_refresh_token, invalid_client, unauthorized_client, access_denied). Today, dead credentials that fail with one of the previously missing codes are retried forever.
  • recoverFromRefreshRace in access-token-cache.ts: on invalid_grant, re-read state, compare refresh_token; if changed → sibling rotated, return sibling's access token as a normal cache hit; if same → genuine death, flip terminal. Depth-guarded to at most one recovery attempt per request. Scoped to the requested accountId slot (Codex supports N accounts per upstream).

Sub2api reference: oauth_refresh_api.go:tryRecoverFromRefreshRace.

Test plan

  • typecheck + test + lint clean

Under burst load, two workers can both observe a stale access token on
the same Codex upstream and both attempt a refresh. OpenAI rotates the
refresh_token on every successful /oauth/token call, so exactly one
racer wins; the other's request is rejected with `invalid_grant` for
trying to redeem the rotated-out copy. The previous flow treated every
`invalid_grant` as a dead credential and let the caller flip the
account to `refresh_failed` - destroying the working credential a
sibling had just rotated, and forcing the operator to re-import.

On `invalid_grant`, the access-token cache now re-reads upstream state
for the same `chatgpt-account-id` slot and compares the refresh_token
it tried against what is now stored. If they differ, a sibling rotated
and we return their freshly-minted access token (the caller treats it
as a normal cache hit and skips the terminal flip). If they match, we
re-raise the original error so the data-plane / control-plane caller
flips the row as before. The other refresh-terminal codes -
`app_session_terminated`, `invalid_refresh_token`, `invalid_client`,
`unauthorized_client`, `access_denied` - bypass recovery entirely;
none of them are caused by a rotation race.
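
The comparison step above can be sketched as a pure function. This is
only an illustration, not the code in `access-token-cache.ts`: the
`StoredCredential` shape, the `RaceRecovery` result type, the
`classifyInvalidGrant` name, and the choice to treat a missing slot as
terminal are all assumptions. The caller is assumed to have already
re-read the state for the requested account slot.

```typescript
// Hypothetical shape of the persisted per-account credential.
interface StoredCredential {
  refreshToken: string;
  accessToken: string;
}

// Hypothetical result type: either a sibling rotated, or the credential is dead.
type RaceRecovery =
  | { kind: "sibling_rotated"; accessToken: string }
  | { kind: "terminal" };

// Decide what an `invalid_grant` means, given the refresh_token we tried
// and the state re-read from storage for the same account slot.
function classifyInvalidGrant(
  attemptedRefreshToken: string,
  current: StoredCredential | undefined,
  depth: number,
): RaceRecovery {
  // Depth guard: at most one recovery attempt per request.
  if (depth > 0) return { kind: "terminal" };
  // Assumption: a slot that no longer exists cannot be recovered.
  if (current === undefined) return { kind: "terminal" };
  // Same refresh_token still stored: nobody rotated, the grant is genuinely dead.
  if (current.refreshToken === attemptedRefreshToken) return { kind: "terminal" };
  // Stored refresh_token differs: a sibling rotated, so reuse its access token.
  return { kind: "sibling_rotated", accessToken: current.accessToken };
}
```

On `terminal` the caller would re-raise the original error; on
`sibling_rotated` it would return the access token as a normal cache
hit.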

`CodexOAuthSessionTerminatedError` now carries the raw OAuth `error`
value as a `code` field alongside the existing `upstreamMessage` so the
recovery branch can single out `invalid_grant` from the catch.
`REFRESH_TERMINAL_OAUTH_CODES` is broadened to the audit-aligned set
(`app_session_terminated`, `invalid_grant`, `invalid_refresh_token`,
`invalid_client`, `unauthorized_client`, `access_denied`) - Codex is
OpenAI OAuth, so the list matches sub2api's `isNonRetryableRefreshError`
verbatim.
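
A minimal sketch of how the `code` field lets the catch single out
`invalid_grant`. The class name, the `code` and `upstreamMessage`
fields, and the six codes come from this change; the constructor
signature, the message format, and the `isRaceRecoveryCandidate` helper
are assumptions made for illustration.

```typescript
// The six refresh-terminal OAuth codes listed in this change.
const REFRESH_TERMINAL_OAUTH_CODES: ReadonlySet<string> = new Set([
  "app_session_terminated",
  "invalid_grant",
  "invalid_refresh_token",
  "invalid_client",
  "unauthorized_client",
  "access_denied",
]);

class CodexOAuthSessionTerminatedError extends Error {
  readonly code: string; // raw OAuth `error` value
  readonly upstreamMessage: string;

  // Assumed constructor signature; the real one may take other arguments.
  constructor(code: string, upstreamMessage: string) {
    super(`Codex OAuth session terminated (${code}): ${upstreamMessage}`);
    // Keep `instanceof` working even when compiled to an ES5 target.
    Object.setPrototypeOf(this, new.target.prototype);
    this.name = "CodexOAuthSessionTerminatedError";
    this.code = code;
    this.upstreamMessage = upstreamMessage;
  }
}

// Hypothetical helper: only `invalid_grant` can be caused by a rotation race,
// so every other terminal code bypasses recovery.
function isRaceRecoveryCandidate(err: unknown): boolean {
  return err instanceof CodexOAuthSessionTerminatedError && err.code === "invalid_grant";
}
```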

Sub2api's tryRecoverFromRefreshRace
(backend/internal/service/oauth_refresh_api.go:173-193) is the canonical
pattern; we apply it to Codex's per-account credential here. The token
rotation persistence hook stays awaited; the recovery branch reads from
the just-persisted state via the upstream repo and returns the
sibling's cached access token directly so no second mint fires from
this call site.