
fix: refresh upstream tokens transparently instead of forcing re-auth#4036

Open
aron-muon wants to merge 4 commits into stacklok:main from aron-muon:aron/retoken-issue


@aron-muon aron-muon commented Mar 6, 2026

Summary

Users are forced to fully re-authenticate with upstream OAuth providers every time the upstream access token expires (controlled by accessTokenLifespan), even though valid refresh tokens exist in storage.

The root cause is that upstream access tokens and refresh tokens are stored together in a single storage entry, with the entry's TTL/expiry set to the access token's expiry. When the access token expires, the entry is deleted (Redis) or marked expired (memory) — losing the refresh token, which is typically valid for 30-90 days or longer depending on the provider. The upstreamswap middleware has no refresh path and returns 401, forcing full re-authentication.

This affects both Redis and in-memory storage backends — Redis deletes the key via TTL, and memory storage's cleanup goroutine removes the expired entry.

Type of change

  • Bug fix

Root cause

The storage model bundles access + refresh tokens in a single entry (UpstreamTokens struct). The entry TTL is derived from the access token's ExpiresAt, so when the access token expires:

  1. Storage deletes or expires the entry — losing the still-valid refresh token
  2. GetUpstreamTokens returns nil, ErrExpired — discarding the token data
  3. Middleware returns 401 — no refresh path exists

Ideally, access and refresh tokens would be stored separately with independent TTLs matching their actual lifetimes. This fix extends the bundled entry's TTL as a pragmatic solution; a future refactor could separate them for cleaner lifecycle management.

Changes

Storage layer (pkg/authserver/storage/)

  • Extended upstream token entry TTL by DefaultRefreshTokenTTL (30 days) in both Redis and memory storage, so refresh tokens survive past access token expiry
  • Changed GetUpstreamTokens to return token data alongside ErrExpired (instead of nil) so callers can use the refresh token
  • Memory storage now checks the token's own ExpiresAt (access token expiry) rather than the entry's expiresAt (storage TTL) for the expired check

Token refresher (pkg/authserver/refresher.go)

  • New UpstreamTokenRefresher interface in storage/types.go
  • Implementation wraps upstream.OAuth2Provider.RefreshTokens() + UpstreamTokenStorage.StoreUpstreamTokens()
  • Preserves binding fields (ProviderID, UserID, UpstreamSubject, ClientID) across refresh
  • Handles refresh token rotation (keeps old refresh token if provider doesn't issue a new one)

Plumbing

  • Exposed refresher through Server → EmbeddedAuthServer → Runner → MiddlewareRunner using the same lazy accessor pattern as GetUpstreamTokenStorage
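
The exact shape of the accessor is not shown in this description. One plausible reading of "lazy accessor" is a getter that returns a function, so middleware built before the auth server exists still resolves the refresher at request time. The sketch below illustrates that reading only; every type and method name in it, including the simplified `RefreshAndStore` signature, is hypothetical.

```go
package main

import "fmt"

// UpstreamTokenRefresher is reduced to a single placeholder method here;
// the real interface lives in storage/types.go.
type UpstreamTokenRefresher interface {
	RefreshAndStore(sessionID string) error
}

type noopRefresher struct{}

func (noopRefresher) RefreshAndStore(string) error { return nil }

// authServer and runner are hypothetical stand-ins for the embedded auth
// server and the runner that owns it.
type authServer struct{ refresher UpstreamTokenRefresher }

type runner struct{ server *authServer } // server may be set after middleware is built

// GetUpstreamTokenRefresher returns a closure instead of the refresher
// itself, deferring the lookup until the closure is called.
func (r *runner) GetUpstreamTokenRefresher() func() UpstreamTokenRefresher {
	return func() UpstreamTokenRefresher {
		if r.server == nil {
			return nil
		}
		return r.server.refresher
	}
}

func main() {
	r := &runner{}
	accessor := r.GetUpstreamTokenRefresher() // captured at middleware creation

	fmt.Println("before start, refresher available:", accessor() != nil)
	r.server = &authServer{refresher: noopRefresher{}}
	fmt.Println("after start, refresher available: ", accessor() != nil)
}
```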

Middleware (pkg/auth/upstreamswap/)

  • Middleware now attempts transparent refresh before returning 401
  • Extracted getOrRefreshUpstreamTokens helper to keep cyclomatic complexity under lint threshold
  • Only requires re-auth when the refresh token itself is invalid/revoked

Production validation

Deployed to a production cluster with Redis (AWS Valkey) storage. We use a Sentinel emulator that simply returns the Valkey URL in every case, which is how we get Valkey working in this setup. All four upstream providers successfully refreshed tokens transparently, with no user re-authentication required:

| Provider | Token Endpoint | Access Token Lifetime | Refresh Token Rotated |
| --- | --- | --- | --- |
| Atlassian | cf.mcp.atlassian.com/v1/token | 1 hour | Yes |
| Asana | app.asana.com/-/oauth_token | 1 hour | Yes |
| Slack (GovSlack) | slack-gov.com/api/oauth.v2.access | 12 hours | Yes |
| Google | oauth2.googleapis.com/token | 1 hour | Yes |

Redis TTLs confirmed updated to ~30 days (previously ~1 hour). GitHub has not yet expired its 8-hour access token but uses the same code path.

Test plan

  • Updated storage tests to verify tokens returned alongside ErrExpired
  • Updated cleanup tests for extended TTL
  • Updated middleware tests with refresher parameter
  • All existing unit tests pass
  • Build clean, golangci-lint clean
  • Deployed to production — verified transparent refresh for Atlassian, Asana, Slack, and Google

Does this introduce a user-facing change?

Yes — upstream OAuth sessions now persist beyond the access token lifetime. Users will no longer be forced to re-authenticate as long as their upstream refresh token is valid (typically 30 days to indefinite depending on the provider).

Generated with Claude Code

@github-actions github-actions bot added the size/M Medium PR: 300-599 lines changed label Mar 6, 2026
@aron-muon aron-muon changed the title from "Aron/retoken issue" to "Refresh upstream tokens transparently instead of forcing re-auth" Mar 6, 2026
@aron-muon aron-muon changed the title from "Refresh upstream tokens transparently instead of forcing re-auth" to "fix: refresh upstream tokens transparently instead of forcing re-auth" Mar 6, 2026
@aron-muon aron-muon force-pushed the aron/retoken-issue branch from efdbe42 to cd44bbf March 6, 2026 14:07
@aron-muon aron-muon marked this pull request as ready for review March 6, 2026 14:09
@aron-muon aron-muon force-pushed the aron/retoken-issue branch from cd44bbf to 0dd2c1c March 6, 2026 14:11
The upstreamswap middleware returned 401 when upstream access tokens
expired, forcing users through full re-authentication even though
valid refresh tokens existed in storage. This happened because:

1. Redis/memory storage TTL was set to access token expiry, deleting
   the entry (and refresh token) when the access token expired
2. Storage returned nil on ErrExpired, discarding the refresh token
3. The middleware had no refresh path — only 401

Fix all three layers:

- Add DefaultRefreshTokenTTL (30 days) to storage entry TTL so
  refresh tokens survive past access token expiry
- Return token data alongside ErrExpired from storage so callers
  can use the refresh token
- Add UpstreamTokenRefresher interface and implementation that wraps
  the upstream OAuth2Provider and storage
- Plumb the refresher through Server → EmbeddedAuthServer → Runner →
  MiddlewareRunner
- Update upstreamswap middleware to attempt refresh before returning
  401, only requiring re-auth when the refresh token itself fails

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@aron-muon aron-muon force-pushed the aron/retoken-issue branch from 0dd2c1c to ab9807b March 6, 2026 14:12

codecov bot commented Mar 6, 2026

Codecov Report

❌ Patch coverage is 79.79798% with 20 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.69%. Comparing base (c3aeb02) to head (18b628d).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| pkg/auth/upstreamswap/middleware.go | 83.33% | 4 Missing and 2 partials ⚠️ |
| pkg/authserver/server_impl.go | 0.00% | 6 Missing ⚠️ |
| pkg/runner/runner.go | 0.00% | 5 Missing ⚠️ |
| pkg/authserver/runner/embeddedauthserver.go | 0.00% | 2 Missing ⚠️ |
| pkg/authserver/storage/redis.go | 88.88% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4036      +/-   ##
==========================================
- Coverage   68.70%   68.69%   -0.01%     
==========================================
  Files         445      446       +1     
  Lines       45374    45451      +77     
==========================================
+ Hits        31173    31224      +51     
- Misses      11796    11818      +22     
- Partials     2405     2409       +4     

Add comprehensive tests for RefreshAndStore (6 cases) and middleware
refresh paths (4 cases: successful refresh, failed refresh, no refresh
token, defense-in-depth expired-without-error).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions github-actions bot added size/L Large PR: 600-999 lines changed and removed size/M Medium PR: 300-599 lines changed labels Mar 6, 2026
Labels

size/L Large PR: 600-999 lines changed
