
fix(auth/m2m): remove double-caching to enable proactive token refresh#1550

Open
aausch wants to merge 1 commit into databricks:main from aausch:aausch/fix/cached-token-source-stale-refresh

Conversation


@aausch aausch commented Mar 16, 2026

Summary

Fixes a double-caching bug in M2M OAuth that caused the proactive async token refresh to have no effect, resulting in bursts of HTTP 401 errors at each token rotation boundary (~every hour).

Closes #1549.

Why

M2mCredentials.Configure previously called clientcredentials.Config.TokenSource(ctx), which returns an oauth2.ReuseTokenSource. That source was then passed to refreshableVisitor, which wraps it in a cachedTokenSource. The resulting stack is:

cachedTokenSource          ← Databricks async cache (proactive, T-20min)
    └── authTokenSource    ← discards context
            └── oauth2.ReuseTokenSource  ← Go stdlib cache, expiryDelta=10s
                    └── clientcredentials.tokenSource  ← HTTP /token endpoint

When cachedTokenSource triggers its async refresh at T−20 min, it calls through to ReuseTokenSource.Token(). Because the token still has 20 minutes of life, ReuseTokenSource considers it valid and returns the cached token without an HTTP call. The async refresh fires repeatedly (with an accelerating schedule) but each call hits the same inner cache — until only ~10 s remain, at which point ReuseTokenSource's expiryDelta window is crossed and a real network call is finally made.

Any request that receives the about-to-expire token and whose round-trip to Databricks completes after the token's expiry time gets HTTP 401. In production this manifests as a burst of 401s at precisely one-hour intervals, correlated with pod startup times.

What changed

Interface changes

None.

Behavioral changes

  • M2M OAuth token refresh now makes a real HTTP call to the token endpoint at T−20 min (the intended proactive window) rather than at T−10 s.
  • Callers will see one additional token endpoint call per hour per process (the proactive refresh), offset by eliminating the burst of 401s at expiry.

Internal changes

  • auth_m2m.go: replaced clientcredentials.Config.TokenSource(ctx) (which returns a ReuseTokenSource) with a TokenSourceFn closure that calls ccfg.Token(ctx) directly. cachedTokenSource is now the sole caching layer.
  • cache_test.go: added reuseTokenSource test helper and two new tests:
    • TestCachedTokenSource_AsyncRefreshBlockedByInnerCache — documents the double-caching behaviour using a mock clock.
    • TestCachedTokenSource_AsyncRefreshWithDirectSource — verifies that a direct (non-caching) inner source triggers an HTTP fetch at the proactive window.
  • NEXT_CHANGELOG.md: updated.

How is this tested?

  • TestCachedTokenSource_AsyncRefreshBlockedByInnerCache: uses a fake clock and a controlled reuseTokenSource helper to confirm that inner caching delays the HTTP fetch to T−10 s rather than T−20 min.
  • TestCachedTokenSource_AsyncRefreshWithDirectSource: uses the same fake clock with a direct token source to confirm the fetch occurs at T−20 min after the fix.
  • Existing TestM2mHappyFlow and TestM2mHappyFlowForAccount continue to pass.

clientcredentials.Config.TokenSource returns an oauth2.ReuseTokenSource,
which caches the token internally with a 10s expiryDelta. Wrapping this
in cachedTokenSource creates a double-caching stack where async refresh
calls return the inner-cached token instead of making a real HTTP request.

As a result, the proactive 20-min async refresh window is wasted: the
underlying token endpoint is not reached until ~10s before expiry. Any
request that holds the about-to-expire token and whose HTTP round-trip
to Databricks completes after the expiry time receives HTTP 401.

Replace clientcredentials.Config.TokenSource (ReuseTokenSource) with a
direct TokenSourceFn that always calls ccfg.Token(ctx). cachedTokenSource
becomes the sole cache layer and async refresh proactively fetches a fresh
token at T-20min as intended.

Fixes databricks#1549.

Tests:
- TestCachedTokenSource_AsyncRefreshBlockedByInnerCache: documents that
  inner ReuseTokenSource delays the real fetch to near T-10s
- TestCachedTokenSource_AsyncRefreshWithDirectSource: verifies that a
  direct source causes the fetch at T-20min as intended
- Existing TestM2mHappyFlow / TestM2mHappyFlowForAccount: still pass

Signed-off-by: Alex Ausch <alex@ausch.name>
@github-actions

If integration tests don't run automatically, an authorized user can run them manually by following the instructions below:

Trigger:
go/deco-tests-run/sdk-go

Inputs:

  • PR number: 1550
  • Commit SHA: b68a94111be2de295c84599c275bdc4ff3827d02

Checks will be approved automatically on success.


Successfully merging this pull request may close these issues.

[ISSUE] M2M OAuth: double-caching causes async token refresh to be ineffective until ~10s before expiry, causing 401 bursts
