Add health check to watch.stream for silent connection drops#2527
Urvashi0109 wants to merge 2 commits into kubernetes-client:master

Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED.

This pull-request has been approved by: Urvashi0109. The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
Pull request overview
Adds an optional client-side “health check” mechanism to Watch.stream() to detect silent/half-open watch connections (e.g., during control plane upgrades) by configuring a read timeout and reconnecting automatically.
Changes:
- Add `_health_check_interval` parameter to `Watch.stream()` that sets `_request_timeout` (read timeout) when enabled.
- Catch `ReadTimeoutError`/`ProtocolError` during streaming to trigger a reconnect using the last known `resource_version`.
- Add unit tests covering reconnect behavior, default behavior, and `_request_timeout` handling.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| kubernetes/base/watch/watch.py | Implements _health_check_interval handling, timeout configuration, and reconnect-on-timeout logic. |
| kubernetes/base/watch/watch_test.py | Adds unit tests for the new health-check/reconnect behavior and timeout parameter interactions. |
I thought we already have multiple timeouts (supporting both client side and server side). Could you check if the existing timeouts address this problem? Or is it (HTTP connection timeout) supported in client-go only?
Here is an example of using

cc @yliaog
Thanks for the feedback @roycaihw, I will take a look 😊
What type of PR is this?
/kind bug
/kind feature
What this PR does / why we need it:
When running a watch on Kubernetes objects (e.g., Jobs, Pods, Namespaces) and the Kubernetes control plane is upgraded, the watch connection can be silently dropped. The watcher then hangs indefinitely: no exception is raised and no new events are received. This is because the TCP connection enters a half-open state where the client believes the connection is still alive, but the server side has been torn down during the upgrade.
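The failure mode above can be reproduced in miniature with a plain stdlib socket: without a read timeout, the client blocks forever on a peer that has gone quiet; with one, it gets an exception it can act on. This sketches the principle only, not the client's actual transport code.

```python
import socket

# A connected pair; the "server" end goes quiet, like a control plane
# torn down mid-upgrade. Without a timeout, recv() would block forever.
client_sock, server_sock = socket.socketpair()
client_sock.settimeout(0.2)  # stand-in for the proposed read timeout

try:
    client_sock.recv(4096)   # no data ever arrives
    timed_out = False
except socket.timeout:       # alias of TimeoutError on Python 3.10+
    timed_out = True

client_sock.close()
server_sock.close()
print(timed_out)  # → True
```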
This PR adds a `_health_check_interval` parameter to `watch.stream()` that detects silent connection drops and automatically reconnects:

- When `_health_check_interval` is set to a value > 0, a socket-level read timeout (`_request_timeout`) is configured on the HTTP connection.
- If no data arrives within the interval, `urllib3` raises a `ReadTimeoutError`.
- The watch reconnects from the last known `resource_version`, ensuring no events are missed.
- The feature is disabled by default (`_health_check_interval=0`), preserving full backward compatibility: `ReadTimeoutError` propagates to the caller as before.

Which issue(s) this PR fixes:
Fixes #2462
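The reconnect behavior described above can be simulated with stdlib stand-ins. Here `TimeoutError` replaces urllib3's `ReadTimeoutError`, and `fake_stream`/`watch_with_reconnect` are illustrative names, not the PR's actual code.

```python
# Stdlib-only simulation of the proposed reconnect logic: a fake stream
# "goes silent" once (raises TimeoutError), and the outer loop reconnects
# from the last seen resource_version so no events are lost.

def fake_stream(resource_version):
    """Yield events after the given version; time out once at version 3."""
    for v in range(resource_version + 1, 6):
        if v == 3 and resource_version < 2:
            raise TimeoutError("simulated silent drop")  # read timeout fires
        yield {"object": f"event-{v}", "resource_version": v}

def watch_with_reconnect():
    last_version = 0
    events = []
    while True:
        try:
            for event in fake_stream(last_version):
                events.append(event["object"])
                last_version = event["resource_version"]  # remember progress
        except TimeoutError:
            continue  # health check fired: reconnect from last_version
        break  # stream ended normally
    return events

print(watch_with_reconnect())
# → ['event-1', 'event-2', 'event-3', 'event-4', 'event-5']
```

Because the resume point is the last `resource_version` actually delivered, the second connection picks up exactly where the first went silent.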
Special notes for your reviewer:

- The health check works by configuring a socket-level read timeout (`_request_timeout`) to break out of the blocking read, then catching the resulting `ReadTimeoutError`/`ProtocolError` exceptions.
- The underscore-prefixed `_health_check_interval` follows the existing convention in this codebase (e.g., `_preload_content`, `_request_timeout`) for parameters that are consumed by the client library rather than passed to the API server.

Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: