Add health check to watch.stream for silent connection drops#2527
Urvashi0109 wants to merge 2 commits into kubernetes-client:master

Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED.

This pull-request has been approved by: Urvashi0109. The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
Pull request overview
Adds an optional client-side “health check” mechanism to Watch.stream() to detect silent/half-open watch connections (e.g., during control plane upgrades) by configuring a read timeout and reconnecting automatically.
Changes:
- Add `_health_check_interval` parameter to `Watch.stream()` that sets `_request_timeout` (read timeout) when enabled.
- Catch `ReadTimeoutError`/`ProtocolError` during streaming to trigger a reconnect using the last known `resource_version`.
- Add unit tests covering reconnect behavior, default behavior, and `_request_timeout` handling.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| kubernetes/base/watch/watch.py | Implements _health_check_interval handling, timeout configuration, and reconnect-on-timeout logic. |
| kubernetes/base/watch/watch_test.py | Adds unit tests for the new health-check/reconnect behavior and timeout parameter interactions. |
I thought we already have multiple timeouts (supporting both client side and server side). Could you check if the existing timeouts address this problem? Or is it (HTTP connection timeout) supported in client-go only?
Here is an example of using

cc @yliaog
Thanks for the feedback @roycaihw, I will take a look 😊
What type of PR is this?
/kind bug
/kind feature
What this PR does / why we need it:
When running a watch on Kubernetes objects (e.g., Jobs, Pods, Namespaces) and the Kubernetes control plane is upgraded, the watch connection can be silently dropped. The watcher then hangs indefinitely: no exception is raised and no new events are received. This is because the TCP connection enters a half-open state where the client believes the connection is still alive, but the server side has been torn down during the upgrade.
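The failure mode above can be reproduced in miniature with a plain stdlib socket: without a read timeout, the client blocks forever on a peer that has gone quiet; with one, it gets an exception it can act on. This sketches the principle only, not the client's actual transport code.

```python
import socket

# A connected pair; the "server" end goes quiet, like a control plane
# torn down mid-upgrade. Without a timeout, recv() would block forever.
client_sock, server_sock = socket.socketpair()
client_sock.settimeout(0.2)  # stand-in for the proposed read timeout

try:
    client_sock.recv(4096)   # no data ever arrives
    timed_out = False
except socket.timeout:       # alias of TimeoutError on Python 3.10+
    timed_out = True

client_sock.close()
server_sock.close()
print(timed_out)  # → True
```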
This PR adds a `_health_check_interval` parameter to `watch.stream()` that detects silent connection drops and automatically reconnects:

- When `_health_check_interval` is set to a value > 0, a socket-level read timeout (`_request_timeout`) is configured on the HTTP connection.
- If no data arrives within the interval, `urllib3` raises a `ReadTimeoutError`.
- The watch reconnects from the last known `resource_version`, ensuring no events are missed.
- The feature is disabled by default (`_health_check_interval=0`), preserving full backward compatibility: `ReadTimeoutError` propagates to the caller as before.

Which issue(s) this PR fixes:
Fixes #2462
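The reconnect behavior described above can be simulated with stdlib stand-ins. Here `TimeoutError` replaces urllib3's `ReadTimeoutError`, and `fake_stream`/`watch_with_reconnect` are illustrative names, not the PR's actual code.

```python
# Stdlib-only simulation of the proposed reconnect logic: a fake stream
# "goes silent" once (raises TimeoutError), and the outer loop reconnects
# from the last seen resource_version so no events are lost.

def fake_stream(resource_version):
    """Yield events after the given version; time out once at version 3."""
    for v in range(resource_version + 1, 6):
        if v == 3 and resource_version < 2:
            raise TimeoutError("simulated silent drop")  # read timeout fires
        yield {"object": f"event-{v}", "resource_version": v}

def watch_with_reconnect():
    last_version = 0
    events = []
    while True:
        try:
            for event in fake_stream(last_version):
                events.append(event["object"])
                last_version = event["resource_version"]  # remember progress
        except TimeoutError:
            continue  # health check fired: reconnect from last_version
        break  # stream ended normally
    return events

print(watch_with_reconnect())
# → ['event-1', 'event-2', 'event-3', 'event-4', 'event-5']
```

Because the resume point is the last `resource_version` actually delivered, the second connection picks up exactly where the first went silent.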
Special notes for your reviewer:

- The health check works by configuring a socket-level read timeout (`_request_timeout`) to break out of the blocking read, then catching the resulting `ReadTimeoutError`/`ProtocolError` exceptions.
- The underscore-prefixed `_health_check_interval` follows the existing convention in this codebase (e.g., `_preload_content`, `_request_timeout`) for parameters that are consumed by the client library rather than passed to the API server.

Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: