[CORE-12452] feat: auto-recover host-networked pods when node IP changes #4784
Open
coutinhop wants to merge 2 commits into tigera:master
Detect Calico's host-networked pods (calico-typha, calico-node,
calico-node-windows) whose status.podIPs no longer matches the node's
current InternalIP, and delete them so the Deployment / DaemonSet
controller recreates them with the correct IP.
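
A minimal sketch of the staleness check described above, not the code in this PR. The `Pod` and `Node` structs are hypothetical stand-ins for the Kubernetes API types, and the rule used here (a pod is stale when it has recorded IPs and none of them is a current InternalIP of its node) is an assumption about how "no longer matches" is evaluated.

```go
package main

import "fmt"

// Pod and Node are simplified, hypothetical stand-ins for the Kubernetes
// API types; only the fields needed for the comparison are modeled.
type Pod struct {
	Name     string
	NodeName string
	PodIPs   []string // mirrors status.podIPs
}

type Node struct {
	Name        string
	InternalIPs []string // node addresses of type InternalIP
}

// isStale reports whether the pod has recorded IPs and none of them is a
// current InternalIP of the node. Pods or nodes with no addresses yet are
// treated as not stale (an assumption, to avoid deleting pods that are
// still starting).
func isStale(pod Pod, node Node) bool {
	if len(pod.PodIPs) == 0 || len(node.InternalIPs) == 0 {
		return false
	}
	current := make(map[string]struct{}, len(node.InternalIPs))
	for _, ip := range node.InternalIPs {
		current[ip] = struct{}{}
	}
	for _, ip := range pod.PodIPs {
		if _, ok := current[ip]; ok {
			return false
		}
	}
	return true
}

// stalePods returns the pods whose recorded IPs no longer match their node.
func stalePods(pods []Pod, nodes map[string]Node) []Pod {
	var stale []Pod
	for _, p := range pods {
		n, ok := nodes[p.NodeName]
		if !ok {
			continue // node not in the cache; skip rather than guess
		}
		if isStale(p, n) {
			stale = append(stale, p)
		}
	}
	return stale
}

func main() {
	nodes := map[string]Node{
		"node-a": {Name: "node-a", InternalIPs: []string{"10.0.0.7"}},
	}
	pods := []Pod{
		{Name: "calico-node-abc", NodeName: "node-a", PodIPs: []string{"10.0.0.5"}},
		{Name: "calico-typha-xyz", NodeName: "node-a", PodIPs: []string{"10.0.0.7"}},
	}
	for _, p := range stalePods(pods, nodes) {
		fmt.Println("stale:", p.Name)
	}
}
```

In this example only calico-node-abc is reported, because its recorded IP predates the node's address change.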
This works around an upstream Kubernetes limitation [1] where
status.podIPs is immutable for hostNetwork pods once set: when a node's
IP changes (e.g. KubeVirt VM reboot pulls a new DHCP lease), existing
hostNetwork pods keep their old IP. The kube EndpointSlice controller
reads from status.podIPs, so the calico-typha EndpointSlice ends up
advertising stale IPs and Felix times out connecting to Typha.
Restarting the container does not help — only deleting and recreating
the pod itself causes the kubelet to repopulate status.podIPs from the
current node IP.
Implementation lives in the existing Typha autoscaler tick (every 10s,
already has a Node informer cache):
- Compare each pod's status.podIPs to its node's InternalIP address in
status.addresses (which the kubelet does update promptly via heartbeat).
- Delete stale pods at a pace of one batch per workload per tick. The
batch size is read from each workload's existing rolling-update setting:
the Typha PDB's maxUnavailable, or the DaemonSet's
updateStrategy.rollingUpdate.maxUnavailable. It falls back to 1 if the
value is not set or resolves to less than 1 (minimum-progress guarantee).
- Order: Typha first; if any Typha was deleted this cycle, skip the
calico-node deletions until the next tick to give the new Typha pod
a clean window to come up. Linux and Windows DaemonSets are paced
independently of each other.
- Skipped entirely on the non-cluster-host autoscaler instance.
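
The pacing rules in the list above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: `tickInput`, `planTick`, `takeBatch`, and `resolveBatchSize` are hypothetical names, and maxUnavailable is modeled as an already-resolved integer pointer rather than the Kubernetes int-or-percentage type.

```go
package main

import "fmt"

// resolveBatchSize applies the fallback from the description: an unset
// value, or one that resolves below 1, becomes 1 so progress is always made.
func resolveBatchSize(maxUnavailable *int) int {
	if maxUnavailable == nil || *maxUnavailable < 1 {
		return 1
	}
	return *maxUnavailable
}

// takeBatch returns at most n entries from the front of stale.
func takeBatch(stale []string, n int) []string {
	if len(stale) < n {
		n = len(stale)
	}
	return stale[:n]
}

// tickInput is a hypothetical container for what one tick observes.
type tickInput struct {
	StaleTypha, StaleLinux, StaleWindows []string
	TyphaMax, LinuxMax, WindowsMax       *int // resolved maxUnavailable values
}

// planTick returns the pod names to delete on this tick. Typha goes first;
// if any Typha pod is selected, calico-node deletions wait for the next tick.
// The Linux and Windows DaemonSets are batched independently.
func planTick(in tickInput) []string {
	typha := takeBatch(in.StaleTypha, resolveBatchSize(in.TyphaMax))
	if len(typha) > 0 {
		return typha
	}
	var out []string
	out = append(out, takeBatch(in.StaleLinux, resolveBatchSize(in.LinuxMax))...)
	out = append(out, takeBatch(in.StaleWindows, resolveBatchSize(in.WindowsMax))...)
	return out
}

func main() {
	two := 2
	in := tickInput{
		StaleTypha: []string{"typha-1"},
		StaleLinux: []string{"node-1", "node-2", "node-3"},
		LinuxMax:   &two,
	}
	fmt.Println(planTick(in)) // first tick: Typha only

	in.StaleTypha = nil
	fmt.Println(planTick(in)) // later tick: up to two calico-node pods
}
```

With a 10s tick, a sketch like this removes at most one batch per workload every 10 seconds, which keeps the deletion rate within what each workload's rolling-update setting already tolerates.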
Tested by ODCN on KubeVirt: 3-node cluster with all node IPs changed,
all calico-node and Typha pods recovered automatically without manual
intervention.
[1] kubernetes/kubernetes#93897
Jira: CI-1951, CORE-12452
Add a new Installation.Spec.StalePodIPRecovery field (Enabled /
Disabled, default Enabled) that gates the host-networked stale pod
IP detection and deletion logic in the typha autoscaler. When set
to Disabled, the entire detection path is skipped each tick.
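
A rough sketch of how a default-on field like this can be evaluated. The constant names and the trimmed `InstallationSpec` struct are assumptions for illustration; only the type name, helper name, and the Enabled / Disabled values come from this PR's description.

```go
package main

import "fmt"

// StalePodIPRecoveryType mirrors the enum named in this PR; the actual
// declaration in api/v1 may differ.
type StalePodIPRecoveryType string

// Constant names are assumed for this sketch.
const (
	StalePodIPRecoveryEnabled  StalePodIPRecoveryType = "Enabled"
	StalePodIPRecoveryDisabled StalePodIPRecoveryType = "Disabled"
)

// InstallationSpec is a trimmed, hypothetical stand-in for the real spec.
type InstallationSpec struct {
	StalePodIPRecovery *StalePodIPRecoveryType
}

// IsStalePodIPRecoveryEnabled treats an unset field as Enabled, so the
// default-on behavior lives in exactly one place.
func IsStalePodIPRecoveryEnabled(v *StalePodIPRecoveryType) bool {
	return v == nil || *v == StalePodIPRecoveryEnabled
}

func main() {
	disabled := StalePodIPRecoveryDisabled

	unset := InstallationSpec{}
	off := InstallationSpec{StalePodIPRecovery: &disabled}

	fmt.Println(IsStalePodIPRecoveryEnabled(unset.StalePodIPRecovery)) // true
	fmt.Println(IsStalePodIPRecoveryEnabled(off.StalePodIPRecovery))   // false
}
```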
The default-on choice is consistent with other operator-managed
automation (e.g. the typha autoscaler is itself always-on with no
toggle), avoids opt-in friction for users who don't know the bug
exists, and provides an escape hatch for environments where the
detection might interact badly with custom node-IP management.
Implementation notes:
- api/v1: new StalePodIPRecoveryType enum and IsStalePodIPRecoveryEnabled
helper, modeled on the existing FIPSMode pattern. nil is treated as
Enabled so the default-on behavior is encoded in one place.
- typha_autoscaler.go: new optional func() bool field on the autoscaler
consulted at the top of each tick. Wired via the existing option
pattern (typhaAutoscalerOptionStalePodIPRecoveryEnabled) so tests can
inject true / false / nil. A nil getter is treated as enabled, which
keeps existing tests and the non-cluster-host autoscaler path
unchanged.
- core_controller.go: the closure reads the Installation named "default"
from the manager's cached client at call time so toggles take effect
on the next tick (~10s). Failures fall through to enabled — recovery
is the safer default for the kubelet bug we're working around.
Tests:
- 3 new gate tests covering nil getter, true, and false.
- Defensive Maybe() expectations on SetDegraded in the existing stale
pod IP detection and maxUnavailable resolution contexts, to fix
pre-existing race-condition flakiness exposed by this work.