fix(controller): add timeout to RemoteMCPServer registration#1805
fix(controller): add timeout to RemoteMCPServer registration#1805AnantKumar17 wants to merge 4 commits intokagent-dev:mainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
This PR fixes a controller reliability bug where a hung RemoteMCPServer registration could block all subsequent RemoteMCPServer reconciliations by ensuring the registration workflow is time-bounded and observable via structured logs.
Changes:
- Adds a 30s registration timeout and applies it to MCP transport creation + tool listing.
- Ensures
newHTTPClient()always returns an*http.Clientwith a timeout (no longer returnshttp.DefaultClient). - Adds structured logr entries for registration start, success, and failure with duration.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // mcpRegistrationTimeout is the deadline applied to each RemoteMCPServer | ||
| // registration attempt (header resolution + MCP connect + tool listing). | ||
| // A hung or unreachable endpoint is bounded to this duration, ensuring the | ||
| // reconciler goroutine is always released and does not block subsequent | ||
| // RemoteMCPServer reconciliations. | ||
| mcpRegistrationTimeout = 30 * time.Second | ||
| ) | ||
|
|
There was a problem hiding this comment.
added a remoteMCPRegistrationTimeout() helper that returns spec.timeout when set and falls back to the 30s default when the field is nil. The constant is now just the fallback value.
| // Bound the entire registration sequence (header resolution + MCP connect + | ||
| // tool listing) to mcpRegistrationTimeout so that a hung or unreachable | ||
| // endpoint cannot block this goroutine — and therefore all subsequent | ||
| // RemoteMCPServer reconciliations — indefinitely. | ||
| tCtx, cancel := context.WithTimeout(ctx, mcpRegistrationTimeout) | ||
| defer cancel() | ||
|
|
||
| tsp, err := a.createMcpTransport(tCtx, remoteMcpServer) |
There was a problem hiding this comment.
Fixed - upsertToolServerForRemoteMCPServer now calls context.WithTimeout(ctx, remoteMCPRegistrationTimeout(remoteMcpServer)) so operators can tune the deadline per resource via .spec.timeout without any code changes.
| // go-sdk does not have a WithHeaders option when initializing transport | ||
| // so we need to create a custom HTTP client that adds headers to all requests. | ||
| func newHTTPClient(headers map[string]string) *http.Client { | ||
| if len(headers) == 0 { | ||
| return http.DefaultClient | ||
| return &http.Client{ | ||
| Timeout: mcpRegistrationTimeout, | ||
| } | ||
| } | ||
| return &http.Client{ | ||
| Timeout: mcpRegistrationTimeout, | ||
| Transport: &headerTransport{ |
There was a problem hiding this comment.
Updated newHTTPClient to accept an explicit timeout time.Duration parameter. createMcpTransport derives it from remoteMCPRegistrationTimeout(s) and passes it in, so the HTTP client timeout matches the CRD configuration in both branches.
| // Bound the entire registration sequence (header resolution + MCP connect + | ||
| // tool listing) to mcpRegistrationTimeout so that a hung or unreachable | ||
| // endpoint cannot block this goroutine — and therefore all subsequent | ||
| // RemoteMCPServer reconciliations — indefinitely. | ||
| tCtx, cancel := context.WithTimeout(ctx, mcpRegistrationTimeout) | ||
| defer cancel() | ||
|
|
||
| tsp, err := a.createMcpTransport(tCtx, remoteMcpServer) | ||
| if err != nil { | ||
| return nil, fmt.Errorf("failed to create client for toolServer %s: %w", toolServer.Name, err) | ||
| } | ||
|
|
||
| tools, err := a.listTools(ctx, tsp, toolServer) | ||
| tools, err := a.listTools(tCtx, tsp, toolServer) |
There was a problem hiding this comment.
Added TestRemoteMCPRegistrationTimeout (covers nil server, nil spec.timeout, and a custom value) and TestNewHTTPClient (covers nil/empty/with-headers - all assert the correct timeout and transport type). These are in reconciler_test.go in the same package.
jmhbh
left a comment
There was a problem hiding this comment.
Changes LGTM overall, just have on enit
| // .spec.timeout is not set. A hung or unreachable endpoint is bounded to this | ||
| // duration, ensuring the reconciler goroutine is always released and does not | ||
| // block subsequent RemoteMCPServer reconciliations. | ||
| mcpRegistrationTimeout = 30 * time.Second |
There was a problem hiding this comment.
nit: I almost prefer setting this as a default value in remotemcpserver_types.go instead in the spec.Timeout field so its more explicit.
There was a problem hiding this comment.
Done - added // +kubebuilder:default="30s" on Timeout in remotemcpserver_types.go and regenerated both the base CRD and the Helm CRD template.
The Go fallback in remoteMCPRegistrationTimeout() is kept for existing resources that were created before this default takes effect (nil Timeout at admission time).
Fixes the issue 1785 Signed-off-by: AnantKumar17 <anant3011k@gmail.com>
Signed-off-by: AnantKumar17 <anant3011k@gmail.com>
…ient Signed-off-by: AnantKumar17 <anant3011k@gmail.com>
Signed-off-by: AnantKumar17 <anant3011k@gmail.com>
b868f98 to
f3f1282
Compare
What
Fixes a bug where a single hung
RemoteMCPServerregistration blocks all subsequentRemoteMCPServerreconciliations for the lifetime of the controller process.Closes #1785
Why
newHTTPClient()was returninghttp.DefaultClient(no timeout) when there were no custom headers, andlistTools()was passing the context directly toclient.Connect()andsession.ListTools()with no deadline. A hung or unreachable endpoint — for example, an IPv6 address resolved in an IPv4-only cluster — would block the goroutine indefinitely, preventing any subsequentRemoteMCPServerfrom registering until the controller pod was restarted.Changes
Single file changed:
go/core/internal/controller/reconciler/reconciler.gonewHTTPClient(): always returns a*http.ClientwithTimeout: mcpRegistrationTimeoutset, in both the no-headers and with-headers branches. Removes the use ofhttp.DefaultClient.upsertToolServerForRemoteMCPServer(): wraps the transport creation and tool listing incontext.WithTimeout(ctx, mcpRegistrationTimeout)so the entire registration sequence is bounded. The database call (RefreshToolsForServer) intentionally uses the originalctx.ReconcileKagentRemoteMCPServer(): logs registration start (url, protocol), success (url, tool count, duration), and failure (error, duration) using structuredlogrkey-value pairs, matching existing codebase conventions.mcpRegistrationTimeout: new unexported constant set to30 * time.Second, matchingDefaultTimeoutused indiscoverer.go.Testing
All 26 test packages pass.
go vetclean.go buildclean.