Bug description
When thv registry list (or anything that routes through CachedAPIRegistryProvider.GetRegistry) triggers an OAuth/OIDC auth flow that fails — the user never completes the browser flow, the flow times out, or the registry rejects the token — the CLI silently returns stale cached registry data instead of surfacing the auth failure.
The fallback in refreshCache() (pkg/registry/provider_cached.go:116-143) is indiscriminate: any error from the upstream API causes previously-cached data to be returned with nil error. It does not distinguish transient failures (network blip, upstream 5xx) from authorization failures. The persistent disk cache lives at ~/.toolhive/cache/ with maxCacheAge = 7 * 24 * time.Hour (pkg/registry/provider_cached.go:26), so stale-serving can persist for up to a week after a user loses access.
Steps to reproduce
- Configure
thv against a registry behind OAuth:
thv config set-registry <url> --issuer <issuer> --client-id <id>
- Run
thv registry list once and complete the OAuth flow so the cache gets populated on disk.
- Wait past the in-memory cache TTL (1 hour) — or just run in a fresh process so the refresh path runs.
- Run
thv registry list again and do nothing when the OAuth URL is displayed. Wait for the flow to time out. (Reporters have seen a second OAuth flow kick off and also time out.)
- Observe: after the OAuth flow(s) fail, the registry listing is printed anyway, and the process exits 0.
Expected behavior
The CLI surfaces the auth error and exits non-zero, e.g.:
Error: failed to list servers: registry authentication required:
Run `thv registry login` to authenticate.
Authorization state may have changed since the cache was populated (group membership revoked, token rotated, session terminated), so silently returning stale data is a correctness and security hazard.
Actual behavior
thv registry list prints the previous (cached) server list, no error, exit 0 — giving the user the false impression they are still authenticated. Follow-up commands like thv run <server> that rely on live auth fail later with confusing errors.
Evidence
Bug site: pkg/registry/provider_cached.go:116-143
func (p *CachedAPIRegistryProvider) refreshCache() (*types.Registry, error) {
p.cacheMu.Lock()
defer p.cacheMu.Unlock()
registry, err := p.APIRegistryProvider.GetRegistry()
if err != nil {
// If fetch fails and we have stale cache, return it
if p.cachedData != nil {
return p.cachedData, nil // <- auth errors swallowed here
}
return nil, err
}
// ...
}
The upstream APIRegistryProvider.GetRegistry() (pkg/registry/provider_api.go:82-99) correctly classifies 401/403 into auth.ErrRegistryAuthRequired. Those errors reach refreshCache() and get silently replaced with nil.
For the OAuth-flow-failure case specifically (browser flow times out / is cancelled), the error path is different but the outcome is the same:
oauthTokenSource.Token "oauth flow failed: %w"
(pkg/registry/auth/oauth_token_source.go:61-67)
auth.Transport.RoundTrip "failed to get auth token: %w"
(pkg/registry/auth/transport.go:24-27)
api.Client.fetchServersPage "failed to fetch servers: %w"
(pkg/registry/api/client.go:190-193)
APIRegistryProvider.GetRegistry -> *UnavailableError
(pkg/registry/provider_api.go:93-99 — note: no sentinel match, falls through)
refreshCache swallowed; stale cache returned with nil err
(pkg/registry/provider_cached.go:121-127)
Reproducers
Two failing tests exercise the two distinct error-propagation paths that both hit the same buggy fallback. A draft PR with the failing tests is linked below.
// SPDX-FileCopyrightText: Copyright 2025 Stacklok, Inc.
// SPDX-License-Identifier: Apache-2.0
package registry
import (
"context"
"errors"
"fmt"
"net/http"
"net/http/httptest"
"sync/atomic"
"testing"
"github.com/stretchr/testify/require"
"github.com/stacklok/toolhive/pkg/registry/api"
"github.com/stacklok/toolhive/pkg/registry/auth"
)
// Test 1: server-side 401/403 path (token was sent, registry rejected it).
func TestCachedProvider_AuthErrorNotMaskedByStaleCache(t *testing.T) {
t.Parallel()
var returnUnauthorized atomic.Bool
srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
if returnUnauthorized.Load() {
w.WriteHeader(http.StatusUnauthorized)
return
}
w.Header().Set("Content-Type", "application/json")
w.WriteHeader(http.StatusOK)
_, _ = w.Write([]byte(`{"servers":[],"metadata":{"next_cursor":""}}`))
}))
defer srv.Close()
provider, err := NewCachedAPIRegistryProvider(srv.URL, true, false, nil)
require.NoError(t, err)
_, err = provider.ListServers()
require.NoError(t, err, "first fetch should populate cache")
returnUnauthorized.Store(true)
err = provider.ForceRefresh()
require.Error(t, err,
"expected auth error to propagate on refresh; got nil — "+
"stale cache is masking the auth failure (bug)")
require.True(t,
errors.Is(err, api.ErrRegistryUnauthorized) ||
errors.Is(err, auth.ErrRegistryAuthRequired),
"expected ErrRegistryUnauthorized or ErrRegistryAuthRequired; got: %v", err)
}
// failingTokenSource simulates the exact error wrapping that
// oauthTokenSource.Token() applies when its browser flow fails/times out.
type failingTokenSource struct{ fail atomic.Bool }
func (s *failingTokenSource) Token(_ context.Context) (string, error) {
if s.fail.Load() {
inner := fmt.Errorf("oauth flow start failed: %w",
errors.New("authorization timed out waiting for browser callback"))
return "", fmt.Errorf("oauth flow failed: %w", inner)
}
return "", nil // empty token = request passes through with no auth header
}
// Test 2: OAuth-browser-flow failure path — the scenario the user actually
// reported. The error carries NEITHER auth.ErrRegistryAuthRequired nor
// api.ErrRegistryUnauthorized — it gets classified as *UnavailableError.
// A sentinel-only fix will miss this path.
func TestCachedProvider_OAuthFlowFailureNotMaskedByStaleCache(t *testing.T) {
t.Parallel()
srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
w.Header().Set("Content-Type", "application/json")
w.WriteHeader(http.StatusOK)
_, _ = w.Write([]byte(`{"servers":[],"metadata":{"next_cursor":""}}`))
}))
defer srv.Close()
ts := &failingTokenSource{}
// Non-nil tokenSource causes NewAPIRegistryProvider to skip its
// construction-time validation probe (provider_api.go:57) — matching
// the real OAuth-configured code path.
provider, err := NewCachedAPIRegistryProvider(srv.URL, true, false, ts)
require.NoError(t, err)
_, err = provider.ListServers()
require.NoError(t, err, "first fetch should populate cache")
ts.fail.Store(true)
err = provider.ForceRefresh()
require.Error(t, err,
"expected OAuth flow failure to propagate on refresh; got nil — "+
"stale cache is masking the OAuth failure (bug)")
}
Running on current main
=== RUN TestCachedProvider_AuthErrorNotMaskedByStaleCache
provider_cached_authbug_test.go:69:
An error is expected but got nil.
expected auth error to propagate on refresh; got nil —
stale cache is masking the auth failure (bug)
--- FAIL: TestCachedProvider_AuthErrorNotMaskedByStaleCache
=== RUN TestCachedProvider_OAuthFlowFailureNotMaskedByStaleCache
provider_cached_authbug_test.go:167:
An error is expected but got nil.
expected OAuth flow failure to propagate on refresh; got nil —
stale cache is masking the OAuth failure (bug)
--- FAIL: TestCachedProvider_OAuthFlowFailureNotMaskedByStaleCache
Suggested fix
A naive sentinel check is not sufficient on its own — it only catches the 401/403 branch:
// NOT enough on its own — misses browser-flow errors.
if errors.Is(err, auth.ErrRegistryAuthRequired) ||
errors.Is(err, api.ErrRegistryUnauthorized) {
return nil, err
}
Because the OAuth-browser-flow failure arrives as a generic wrapped error that gets repackaged into *UnavailableError (pkg/registry/provider_api.go:93-99), neither sentinel matches and the stale-cache fallback still fires.
Two alternatives worth discussing:
- Classify at the source: wrap token-acquisition failures in
oauthTokenSource.Token() (or a layer close to it) so they carry an auth sentinel — e.g., a new ErrRegistryAuthFailed — all the way up to refreshCache. Then extend the sentinel check in refreshCache to include it.
- Distinguish retryable vs. non-retryable at the provider boundary: in
APIRegistryProvider.GetRegistry(), classify token-acquisition failures as an auth-class error (not UnavailableError) and have refreshCache only fall back on transient classes.
A secondary option: always return the error alongside any stale data (e.g., refreshCache returns (registry, ErrStaleCache)) and let the CLI surface a user-visible warning. That preserves current availability UX for transient failures while still informing the user when their data is stale.
Impact
- Correctness: users see registry data that no longer reflects their access.
- UX: no indication the OAuth flow failed — CLI exits 0 and prints results.
- Security (moderate): a user whose IdP group membership was revoked can keep browsing registry contents for up to 7 days (
maxCacheAge). They cannot fetch new data, but retain visibility into the last snapshot.
- Scope is broader than just the CLI:
CachedAPIRegistryProvider is also used in headless contexts — HTTP API server registry endpoints (pkg/api/v1/registry.go, pkg/api/v1/registry_v01_servers.go), MCP registry search (pkg/mcp/server/search_registry.go). All inherit the same stale-on-auth-failure behavior.
- Cloud-UI and enterprise-UI codepaths go through a different auth layer and are not affected.
Environment
- OS/version: reproduced on darwin 25.x; bug is platform-independent.
- ToolHive version:
main (post-pull from origin as of filing).
- Confirmed by two independent reviewers (Claude
claude-opus-4-7 and OpenAI gpt-5.4, high reasoning).
Additional context
Draft PR with both failing reproducers: #4951. The PR is intentionally marked draft — it demonstrates the bug but does not include a fix; the fix direction is under discussion in the "Suggested fix" section above.
Bug description
When
thv registry list(or anything that routes throughCachedAPIRegistryProvider.GetRegistry) triggers an OAuth/OIDC auth flow that fails — the user never completes the browser flow, the flow times out, or the registry rejects the token — the CLI silently returns stale cached registry data instead of surfacing the auth failure.The fallback in
refreshCache()(pkg/registry/provider_cached.go:116-143) is indiscriminate: any error from the upstream API causes previously-cached data to be returned withnilerror. It does not distinguish transient failures (network blip, upstream 5xx) from authorization failures. The persistent disk cache lives at~/.toolhive/cache/withmaxCacheAge = 7 * 24 * time.Hour(pkg/registry/provider_cached.go:26), so stale-serving can persist for up to a week after a user loses access.Steps to reproduce
thvagainst a registry behind OAuth:thv config set-registry <url> --issuer <issuer> --client-id <id>thv registry listonce and complete the OAuth flow so the cache gets populated on disk.thv registry listagain and do nothing when the OAuth URL is displayed. Wait for the flow to time out. (Reporters have seen a second OAuth flow kick off and also time out.)Expected behavior
The CLI surfaces the auth error and exits non-zero, e.g.:
Authorization state may have changed since the cache was populated (group membership revoked, token rotated, session terminated), so silently returning stale data is a correctness and security hazard.
Actual behavior
thv registry listprints the previous (cached) server list, no error, exit 0 — giving the user the false impression they are still authenticated. Follow-up commands likethv run <server>that rely on live auth fail later with confusing errors.Evidence
Bug site:
pkg/registry/provider_cached.go:116-143The upstream
APIRegistryProvider.GetRegistry()(pkg/registry/provider_api.go:82-99) correctly classifies 401/403 intoauth.ErrRegistryAuthRequired. Those errors reachrefreshCache()and get silently replaced withnil.For the OAuth-flow-failure case specifically (browser flow times out / is cancelled), the error path is different but the outcome is the same:
Reproducers
Two failing tests exercise the two distinct error-propagation paths that both hit the same buggy fallback. A draft PR with the failing tests is linked below.
Running on current
mainSuggested fix
A naive sentinel check is not sufficient on its own — it only catches the 401/403 branch:
Because the OAuth-browser-flow failure arrives as a generic wrapped error that gets repackaged into
*UnavailableError(pkg/registry/provider_api.go:93-99), neither sentinel matches and the stale-cache fallback still fires.Two alternatives worth discussing:
oauthTokenSource.Token()(or a layer close to it) so they carry an auth sentinel — e.g., a newErrRegistryAuthFailed— all the way up torefreshCache. Then extend the sentinel check inrefreshCacheto include it.APIRegistryProvider.GetRegistry(), classify token-acquisition failures as an auth-class error (notUnavailableError) and haverefreshCacheonly fall back on transient classes.A secondary option: always return the error alongside any stale data (e.g.,
refreshCachereturns(registry, ErrStaleCache)) and let the CLI surface a user-visible warning. That preserves current availability UX for transient failures while still informing the user when their data is stale.Impact
maxCacheAge). They cannot fetch new data, but retain visibility into the last snapshot.CachedAPIRegistryProvideris also used in headless contexts — HTTP API server registry endpoints (pkg/api/v1/registry.go,pkg/api/v1/registry_v01_servers.go), MCP registry search (pkg/mcp/server/search_registry.go). All inherit the same stale-on-auth-failure behavior.Environment
main(post-pull from origin as of filing).claude-opus-4-7and OpenAIgpt-5.4, high reasoning).Additional context
Draft PR with both failing reproducers: #4951. The PR is intentionally marked draft — it demonstrates the bug but does not include a fix; the fix direction is under discussion in the "Suggested fix" section above.