37 changes: 32 additions & 5 deletions docs/arch/11-auth-server-storage.md
Expand Up @@ -64,19 +64,46 @@ The storage layer implements multiple interfaces from the [fosite](https://githu
- Memory backend: `pkg/authserver/storage/memory.go`
- Redis backend: `pkg/authserver/storage/redis.go`

## Synthesis-mode subjects
## Identity resolution for pure OAuth2 providers

OAuth2 upstreams configured without a userInfo endpoint use a fallback identity-resolution mode: the embedded auth server synthesizes a non-PII subject by hashing the upstream access token. The mode changes what `UserStorage` and `UpstreamTokenStorage` see and is observable to operators inspecting stored state.
For pure OAuth 2.0 upstream providers (`OAuth2Config`), OIDC is unavailable and there is no ID token. `BaseOAuth2Provider.ExchangeCodeForIdentity` resolves user identity through a three-way priority chain. Each path has distinct implications for `UserStorage`, `UpstreamTokenStorage`, and the Redis secondary index.

**When the path triggers.** Pure OAuth 2.0 upstream provider (`OAuth2Config`) configured with `userInfo == nil`. Reached at `BaseOAuth2Provider.ExchangeCodeForIdentity` after token exchange when no userInfo endpoint is available to consult. OIDC providers and OAuth2 providers with `userInfo` configured continue to resolve identity normally and are not affected.
### IdentityFromToken (priority 1)

An operator opt-in path that extracts identity claims directly from the token endpoint response body, skipping the userinfo HTTP call entirely.

**When the path triggers.** `IdentityFromToken` is configured on the upstream provider (`p.config.IdentityFromToken != nil`). The `tokenResponseRewriter` intercepts the token endpoint response and runs extraction against the raw pre-rewrite body; the result is available to `ExchangeCodeForIdentity` without an additional round-trip.

**Subject format.** Real, stable subject string extracted from the token response body via a gjson dot-notation path (e.g. `username`, `authed_user.id`). For token responses that embed a JWT, the `@upstreamjwt` modifier decodes the payload for further drilling (e.g. `access_token|@upstreamjwt|sub`). The `@upstreamjwt` modifier performs no signature verification — it is intended only for JWTs received directly from the upstream token endpoint over a TLS-authenticated channel. The returned `*Identity` carries `Synthetic = false`. Path semantics and trust-model notes are documented on the runtime config struct `IdentityFromTokenConfig` in `pkg/authserver/upstream/identity_from_token.go`. The corresponding CRD type (`cmd/thv-operator/api/v1alpha1.IdentityFromTokenConfig`) is defined in a sibling PR; operator-to-runner translation of this config lands separately.
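The extraction described above can be sketched with the standard library alone. This is a simplified approximation, not the project's implementation: `lookupPath` is a hypothetical stand-in for gjson path evaluation (plain dot paths only, no modifiers), and `decodeJWTPayload` plays the role of the `@upstreamjwt` modifier by decoding the payload segment without any signature check. All function names here are assumptions made for illustration.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// lookupPath walks a dot-separated path through decoded JSON objects.
// Hypothetical stand-in for gjson: no arrays, no modifiers, no escaping.
func lookupPath(doc map[string]any, path string) (string, error) {
	var cur any = doc
	for _, key := range strings.Split(path, ".") {
		obj, ok := cur.(map[string]any)
		if !ok {
			return "", fmt.Errorf("path %q: %q is not inside an object", path, key)
		}
		cur, ok = obj[key]
		if !ok {
			return "", fmt.Errorf("path %q: key %q not found", path, key)
		}
	}
	s, ok := cur.(string)
	if !ok || s == "" {
		return "", fmt.Errorf("path %q: value is not a non-empty string", path)
	}
	return s, nil
}

// decodeJWTPayload returns the claims of a compact JWT WITHOUT verifying the
// signature. That is only acceptable when the token came straight from the
// upstream token endpoint over TLS.
func decodeJWTPayload(token string) (map[string]any, error) {
	parts := strings.Split(token, ".")
	if len(parts) != 3 {
		return nil, errors.New("not a compact JWT: expected three segments")
	}
	raw, err := base64.RawURLEncoding.DecodeString(parts[1])
	if err != nil {
		return nil, fmt.Errorf("decode payload: %w", err)
	}
	var claims map[string]any
	if err := json.Unmarshal(raw, &claims); err != nil {
		return nil, fmt.Errorf("parse payload: %w", err)
	}
	return claims, nil
}

// subjectFromBody extracts a subject directly from the token response body.
func subjectFromBody(body []byte, path string) (string, error) {
	var doc map[string]any
	if err := json.Unmarshal(body, &doc); err != nil {
		return "", fmt.Errorf("parse token response: %w", err)
	}
	return lookupPath(doc, path)
}

// subjectFromEmbeddedJWT finds a JWT at tokenPath, decodes its payload
// (unverified), and reads claimPath from the claims.
func subjectFromEmbeddedJWT(body []byte, tokenPath, claimPath string) (string, error) {
	token, err := subjectFromBody(body, tokenPath)
	if err != nil {
		return "", err
	}
	claims, err := decodeJWTPayload(token)
	if err != nil {
		return "", err
	}
	return lookupPath(claims, claimPath)
}

func main() {
	slack := []byte(`{"access_token":"xoxp-1","authed_user":{"id":"U123"}}`)
	fmt.Println(subjectFromBody(slack, "authed_user.id"))

	// The middle segment is base64url for {"sub":"alice"}; the signature is a dummy.
	jwtBody := []byte(`{"access_token":"e30.eyJzdWIiOiJhbGljZSJ9.sig"}`)
	fmt.Println(subjectFromEmbeddedJWT(jwtBody, "access_token", "sub"))
}
```

The two helpers correspond loosely to the paths `authed_user.id` and `access_token|@upstreamjwt|sub`. Real gjson paths support far more syntax than this sketch handles.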

**`UserResolver` interaction.** Because `Identity.Synthetic` is false, `callback.go` takes the normal path: `UserResolver.ResolveUser` runs, a row is created (or looked up) in `UserStorage`, a provider-identities entry is written, and `UpdateLastAuthenticated` is called. `UpstreamTokens.UserID` carries the resolved internal user UUID, not the raw operator-supplied subject string.

**Reverse-index implication (Redis backend).** Stable user IDs mean `KeyTypeUserUpstream` works as designed — one set per user accumulates session IDs across re-authentications. No set churn.

**Operator visibility.** The `IdentitySynthesized` condition is not meant to fire for upstreams using `IdentityFromToken`. However, `SyntheticIdentityUpstreams()` (the controller-side predicate that drives the condition) currently checks only for `userInfo == nil` and does not yet inspect `IdentityFromToken`. Until the CRD type and controller logic land in a follow-up, an upstream with `IdentityFromToken` configured but no `userInfo` will still report `IdentitySynthesizedActive`, even though synthesis is not reached at runtime.

**Implementation.**
- `pkg/authserver/upstream/oauth2.go` — `ExchangeCodeForIdentity` priority 1 branch
- `pkg/authserver/upstream/identity_from_token.go` — `IdentityFromTokenConfig`, `extractIdentityFromTokenResponse`, `@upstreamjwt` modifier
- `pkg/authserver/upstream/token_exchange.go` — `tokenResponseRewriter.RoundTrip` extracts identity from the raw pre-rewrite body

### UserInfo endpoint (priority 2)

Existing behavior. When `IdentityFromToken` is unconfigured and `userInfo` is set, `fetchUserInfo` is called with the upstream access token. Subject, name, and email come from the userinfo response. `UserResolver.ResolveUser` runs normally and `Identity.Synthetic` is false.

### Synthesis-mode subjects (priority 3)

Reached when `IdentityFromToken` is unconfigured and `userInfo` is absent. The embedded auth server synthesizes a non-PII subject by hashing the upstream access token. The mode changes what `UserStorage` and `UpstreamTokenStorage` see and is observable to operators inspecting stored state.

**When the path triggers.** Pure OAuth 2.0 upstream provider (`OAuth2Config`) where both `IdentityFromToken` and `userInfo` are unconfigured. Reached at `BaseOAuth2Provider.ExchangeCodeForIdentity` as the final fallback. OIDC providers and OAuth2 providers with either `IdentityFromToken` or `userInfo` configured are not affected.

**Subject format.** `tk-` followed by 32 lowercase hex characters (the first 16 bytes of `SHA-256(accessToken)`), e.g. `tk-89abcdef0123456789abcdef01234567`. The output is opaque: assuming the upstream issues opaque (non-JWT) bearer tokens, the digest reveals nothing about the input that an attacker holding a candidate token could not already confirm by re-hashing. The returned `*Identity` carries `Synthetic = true`; the `upstream.IsSynthesizedSubject(string)` predicate lets bare-string consumers recognize the prefix.
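A minimal sketch of the subject format, assuming the derivation is exactly as described above (prefix plus the hex of the first 16 bytes of the SHA-256 digest). The names `synthesizeSubject` and `isSynthesized` are hypothetical; the real helpers are `synthesizeSubjectFromAccessToken` and `IsSynthesizedSubject`, whose bodies are not shown here, and the prefix-only check is an assumption.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
	"strings"
)

// subjectPrefix tags synthesized subjects so bare-string consumers can spot them.
const subjectPrefix = "tk-"

// synthesizeSubject derives a non-PII subject from an access token.
// Empty tokens are rejected so that every token-less login does not
// collapse onto the single well-known SHA-256("") digest.
func synthesizeSubject(accessToken string) (string, error) {
	if accessToken == "" {
		return "", errors.New("access token is empty")
	}
	sum := sha256.Sum256([]byte(accessToken))
	// First 16 bytes of the digest, rendered as 32 lowercase hex characters.
	return subjectPrefix + hex.EncodeToString(sum[:16]), nil
}

// isSynthesized reports whether a subject carries the synthesis prefix.
// Assumption: a plain prefix check; the real predicate may be stricter.
func isSynthesized(subject string) bool {
	return strings.HasPrefix(subject, subjectPrefix)
}

func main() {
	sub, err := synthesizeSubject("example-access-token")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(sub, isSynthesized(sub))
}
```

Because the subject is a pure function of the token, a new access token always produces a new subject, which is what drives the storage behavior described in the following paragraphs.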

**`UserResolver` bypass.** Synthetic identities skip `UserResolver.ResolveUser` entirely — no row is created in `UserStorage`, no entry is written to provider-identities, and `UpdateLastAuthenticated` is not called. The synthesized subject rotates per access token, so persisting it would create a fresh `users` row on every re-authentication. `UpstreamTokens.UserID` therefore carries the `tk-…` value directly rather than a stable internal UUID.
**`UserResolver` bypass.** The bypass is gated on `Identity.Synthetic` in `callback.go` — synthesis is the only path that sets this field. Synthetic identities skip `UserResolver.ResolveUser` entirely — no row is created in `UserStorage`, no entry is written to provider-identities, and `UpdateLastAuthenticated` is not called. The synthesized subject rotates per access token, so persisting it would create a fresh `users` row on every re-authentication. `UpstreamTokens.UserID` therefore carries the `tk-…` value directly rather than a stable internal UUID.
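The gating can be illustrated with a small sketch. The `identity` and `resolver` types below are hypothetical, trimmed-down stand-ins; the real `UserResolver.ResolveUser` signature and the surrounding `callback.go` logic are not reproduced here.

```go
package main

import "fmt"

// identity mirrors only the two fields relevant to the bypass.
type identity struct {
	Subject   string
	Synthetic bool
}

// resolver is a hypothetical, simplified view of the user resolver.
type resolver interface {
	ResolveUser(subject string) (string, error)
}

// countingResolver records how often it is consulted.
type countingResolver struct{ calls int }

func (r *countingResolver) ResolveUser(subject string) (string, error) {
	r.calls++
	return "uuid-for-" + subject, nil
}

// upstreamUserID picks the value stored as the upstream tokens' user ID:
// the raw tk-... subject for synthetic identities, otherwise the resolved
// internal user ID.
func upstreamUserID(id identity, r resolver) (string, error) {
	if id.Synthetic {
		// Persisting a rotating subject would create a new users row per login.
		return id.Subject, nil
	}
	return r.ResolveUser(id.Subject)
}

func main() {
	r := &countingResolver{}
	synthetic, _ := upstreamUserID(identity{Subject: "tk-0123", Synthetic: true}, r)
	resolved, _ := upstreamUserID(identity{Subject: "alice"}, r)
	fmt.Println(synthetic, resolved, r.calls)
}
```

Keeping the decision on a boolean field rather than on the `tk-` prefix means an upstream that happens to issue subjects starting with `tk-` is not mistaken for a synthetic identity.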

**Reverse-index implication (Redis backend).** The `KeyTypeUserUpstream` secondary-index set under `thv:auth:{ns:name}:user:upstream:{userID}` is designed around stable user IDs — one set per user, holding all of that user's session IDs. Under synthesis the userID rotates with every re-authentication, so each session lands in its own one-element set. Reads continue to work, but set churn is much higher than under OIDC. The existing TODO at `pkg/authserver/storage/redis.go:43-45` to scan and clean up stale secondary-index entries applies, and synthesis-mode workloads make a periodic scan more important.
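The churn can be seen in a toy in-memory model of the secondary index. This is a sketch only: `reverseIndex` is a hypothetical stand-in for the Redis sets, and the key is shortened to a `user:upstream:` prefix rather than the full namespaced key.

```go
package main

import "fmt"

// reverseIndex models the secondary index: one set of session IDs per user key.
type reverseIndex map[string]map[string]struct{}

// add records a session under the user's key, creating the set on first use.
func (idx reverseIndex) add(userID, sessionID string) {
	key := "user:upstream:" + userID // simplified; the real key carries a namespace prefix
	if idx[key] == nil {
		idx[key] = make(map[string]struct{})
	}
	idx[key][sessionID] = struct{}{}
}

// simulate records n logins, asking userIDFor which user ID each login gets.
func simulate(n int, userIDFor func(login int) string) reverseIndex {
	idx := reverseIndex{}
	for i := 0; i < n; i++ {
		idx.add(userIDFor(i), fmt.Sprintf("session-%d", i))
	}
	return idx
}

func main() {
	// Stable ID: every login lands in the same set.
	stable := simulate(3, func(int) string { return "user-uuid-1" })
	// Rotating ID: every login creates a new one-element set.
	rotating := simulate(3, func(i int) string { return fmt.Sprintf("tk-%032x", i) })
	fmt.Println("stable sets:", len(stable), "rotating sets:", len(rotating))
}
```

With a stable ID the number of sets stays constant as sessions accumulate; with rotating IDs it grows with every login, which is why stale-entry cleanup matters more under synthesis.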

**Operator visibility.** When at least one configured OAuth2 upstream has `userInfo == nil`, the controller surfaces the `IdentitySynthesized` condition on the `MCPExternalAuthConfig` and `VirtualMCPServer` status (Reason `IdentitySynthesizedActive`, naming the affected upstreams). The condition flips to `False` (Reason `IdentitySynthesizedInactive`) once every upstream has `userInfo` configured.
**Operator visibility.** When at least one configured OAuth2 upstream has `userInfo == nil`, the controller surfaces the `IdentitySynthesized` condition on the `MCPExternalAuthConfig` and `VirtualMCPServer` status (Reason `IdentitySynthesizedActive`, naming the affected upstreams). The condition flips to `False` (Reason `IdentitySynthesizedInactive`) once every upstream has `userInfo` configured. Note: the controller predicate (`SyntheticIdentityUpstreams`) checks only for `userInfo == nil` and does not yet account for `IdentityFromToken`; see the known gap noted under priority 1.

**Implementation.**
- `pkg/authserver/upstream/oauth2.go` — `synthesizeIdentity`, `synthesizeSubjectFromAccessToken`, `IsSynthesizedSubject`
156 changes: 120 additions & 36 deletions pkg/authserver/upstream/oauth2.go
Expand Up @@ -152,7 +152,16 @@ type OAuth2Config struct {
// When set, the provider performs the token exchange HTTP call directly (bypassing
// golang.org/x/oauth2) and extracts fields using gjson dot-notation paths.
// When nil, standard OAuth 2.0 token response parsing is used.
// See also: IdentityFromToken for extracting user identity from the same response.
TokenResponseMapping *TokenResponseMapping `json:"token_response_mapping,omitempty" yaml:"token_response_mapping,omitempty"`

// IdentityFromToken extracts user identity from the token-endpoint response
// body when the upstream provider includes identity claims there (e.g.,
// Snowflake's `username`, Slack's `authed_user.id`). When set, the embedded
// auth server skips the userinfo HTTP call entirely. Path semantics and
// trust-model notes are documented on IdentityFromTokenConfig in
// identity_from_token.go; the corresponding CRD type
// (cmd/thv-operator/api/v1alpha1.IdentityFromTokenConfig) lands in a sibling PR.
IdentityFromToken *IdentityFromTokenConfig `json:"identity_from_token,omitempty" yaml:"identity_from_token,omitempty"`
}

// TokenResponseMapping configures extraction of token fields from non-standard
Expand Down Expand Up @@ -195,6 +204,11 @@ func (c *OAuth2Config) Validate() error {
return errors.New("token_response_mapping.access_token_path is required when token_response_mapping is set")
}
}
if c.IdentityFromToken != nil {
if c.IdentityFromToken.SubjectPath == "" {
return errors.New("identity_from_token.subject_path is required when identity_from_token is set")
}
}
return c.CommonOAuthConfig.Validate()
}

Expand Down Expand Up @@ -400,39 +414,71 @@ func (p *BaseOAuth2Provider) buildAuthorizationURL(
}

// ExchangeCodeForIdentity exchanges an authorization code for tokens and resolves
// the user's identity in a single atomic operation.
// For pure OAuth2 providers, identity is resolved via UserInfo when configured;
// otherwise Subject is synthesized via synthesizeIdentity (which rejects empty
// access tokens to prevent the well-known sha256("") subject collision) and
// Name/Email are left empty. The nonce parameter is ignored (no ID token).
// the user's identity in a single atomic operation. For pure OAuth2 providers
// (no ID token) the priority chain is:
//
// 1. IdentityFromToken (operator opt-in): extract identity claims from the
// token-endpoint response body using gjson paths. The userinfo HTTP call
// is skipped entirely.
// 2. UserInfo endpoint: fetch identity from the configured userinfo URL.
// 3. Synthesis: when neither is configured, synthesizeIdentity derives a
// non-PII Subject from the access token (rejects empty tokens to prevent
// the well-known sha256("") collision). Name and Email are empty;
// Synthetic=true tells the callback handler to bypass UserResolver
// because the subject rotates per access token.
//
// The nonce parameter is ignored (no ID token to validate).
func (p *BaseOAuth2Provider) ExchangeCodeForIdentity(ctx context.Context, code, codeVerifier, _ string) (*Identity, error) {
tokens, err := p.exchangeCodeForTokens(ctx, code, codeVerifier)
exchanged, err := p.exchangeCodeForTokens(ctx, code, codeVerifier)
if err != nil {
return nil, err
}

// No userinfo: synthesize a non-PII subject from the access token.
// Synthetic=true tells the callback handler to bypass UserResolver — the
// synthesized subject rotates per access token, so persisting it would
// create a new `users` row on every re-authentication.
if p.config.UserInfo == nil {
return synthesizeIdentity(tokens)
}

userInfo, err := p.fetchUserInfo(ctx, tokens.AccessToken)
if err != nil {
return nil, fmt.Errorf("%w: %w", ErrIdentityResolutionFailed, err)
}
if userInfo == nil || userInfo.Subject == "" {
return nil, ErrIdentityResolutionFailed
}

return &Identity{
Tokens: tokens,
Subject: userInfo.Subject,
Name: userInfo.Name,
Email: userInfo.Email,
}, nil
// Priority 1: identityFromToken (configured by operator).
if p.config.IdentityFromToken != nil {
if exchanged.identity == nil {
// The rewriter logged the extraction failure at WARN with the
// operator-supplied path. Surface the same error here so callers
// can report it without requiring WARN-level log access.
if exchanged.extractionErr != nil {
return nil, exchanged.extractionErr
}
// Unreachable in practice: when identity is nil the rewriter always
// sets extractionErr. Kept as a safe fallback.
return nil, fmt.Errorf(
"%w: identityFromToken configured but extraction failed",
ErrIdentityResolutionFailed,
)
}
return &Identity{
Tokens: exchanged.tokens,
Subject: exchanged.identity.Subject,
Name: exchanged.identity.Name,
Email: exchanged.identity.Email,
}, nil
}

// Priority 2: userInfo (existing behavior).
if p.config.UserInfo != nil {
userInfo, err := p.fetchUserInfo(ctx, exchanged.tokens.AccessToken)
if err != nil {
return nil, fmt.Errorf("%w: %w", ErrIdentityResolutionFailed, err)
}
if userInfo == nil || userInfo.Subject == "" {
return nil, ErrIdentityResolutionFailed
}
return &Identity{
Tokens: exchanged.tokens,
Subject: userInfo.Subject,
Name: userInfo.Name,
Email: userInfo.Email,
}, nil
}

// Priority 3: synthesis (PR 5094). Subject derived from access token; rotates
// per token. The callback handler treats Synthetic=true as opt-out from the
// user-resolver to avoid creating a fresh users row on every re-auth.
return synthesizeIdentity(exchanged.tokens)
}

// synthesizedSubjectPrefix tags subjects produced by
Expand Down Expand Up @@ -496,21 +542,45 @@ func synthesizeIdentity(tokens *Tokens) (*Identity, error) {
}, nil
}

// tokenExchangeResult bundles the outputs of a successful token exchange:
// the obtained tokens, any identity extracted from the token response body,
// and any error from the identity extraction step. The exchange-level error
// is returned as the function's own error return value.
//
// extractionErr is populated when IdentityFromToken is configured but the
// extractor could not resolve the subject path. It carries operator-actionable
// diagnostics (path name, type description) and already wraps
// ErrIdentityResolutionFailed. The caller must check extractionErr when
// identity is nil and IdentityFromToken is set.
type tokenExchangeResult struct {
tokens *Tokens
identity *partialIdentity
extractionErr error
}

// exchangeCodeForTokens exchanges an authorization code for tokens with the upstream IDP.
func (p *BaseOAuth2Provider) exchangeCodeForTokens(ctx context.Context, code, codeVerifier string) (*Tokens, error) {
// It returns a tokenExchangeResult containing the tokens, any identity extracted from
// the token response body, and any extraction error. The function error is the
// exchange-level error (network, HTTP, token parsing).
func (p *BaseOAuth2Provider) exchangeCodeForTokens(
ctx context.Context, code, codeVerifier string,
) (*tokenExchangeResult, error) {
if code == "" {
return nil, errors.New("authorization code is required")
}

slog.Info("exchanging authorization code for tokens",
slog.Debug("exchanging authorization code for tokens",
"token_endpoint", p.config.TokenEndpoint,
"has_pkce_verifier", codeVerifier != "",
)

// Wrap HTTP client with token response rewriter if mapping is configured.
// This normalizes non-standard responses (e.g., GovSlack's nested fields)
// before the oauth2 library parses them, keeping the standard exchange flow.
httpClient := wrapHTTPClientWithMapping(p.httpClient, p.config.TokenResponseMapping, p.config.TokenEndpoint)
// Wrap HTTP client with token response rewriter if mapping or identity extraction
// is configured. At auth-code time, both mapping (field normalization) and
// identityCfg (identity extraction) may be active together. Keep a reference to
// the rewriter so we can read extractedIdentity and extractionErr after Exchange returns.
httpClient, rewriter := wrapHTTPClientForTokenExchange(
p.httpClient, p.config.TokenResponseMapping, p.config.IdentityFromToken, p.config.TokenEndpoint,
)
ctx = context.WithValue(ctx, oauth2.HTTPClient, httpClient)

// Build exchange options
Expand All @@ -534,7 +604,16 @@ func (p *BaseOAuth2Provider) exchangeCodeForTokens(ctx context.Context, code, co
"expires_at", expiresAtLogValue(tokens.ExpiresAt),
)

return tokens, nil
// Read any identity and extraction error captured by the rewriter during the
// token round-trip. rewriter is nil when neither mapping nor identityCfg is
// configured; nil-safe.
result := &tokenExchangeResult{tokens: tokens}
if rewriter != nil {
result.identity = rewriter.extractedIdentity
result.extractionErr = rewriter.extractionErr
}

return result, nil
}

// RefreshTokens refreshes the upstream IDP tokens.
Expand All @@ -559,7 +638,12 @@ func (p *BaseOAuth2Provider) RefreshTokens(ctx context.Context, refreshToken, _
)

// Wrap HTTP client with token response rewriter if mapping is configured.
httpClient := wrapHTTPClientWithMapping(p.httpClient, p.config.TokenResponseMapping, p.config.TokenEndpoint)
// Identity extraction (identityCfg) is intentionally nil here: per Snowflake's
// contract and the general design, the username/identity field is only present
// in the initial auth-code response and is omitted on refresh. Identity is
// cached at auth-code time and read from session storage on subsequent requests.
// The rewriter is discarded because refresh does not produce identity.
httpClient, _ := wrapHTTPClientForTokenExchange(p.httpClient, p.config.TokenResponseMapping, nil, p.config.TokenEndpoint)
ctx = context.WithValue(ctx, oauth2.HTTPClient, httpClient)

opts := []oauth2.AuthCodeOption{