
[CORE-12452] feat: auto-recover host-networked pods when node IP changes #4784

Open
coutinhop wants to merge 2 commits into tigera:master from coutinhop:pedro-CI-1951-1

Conversation

@coutinhop
Member

Description

Detect Calico's host-networked pods (calico-typha, calico-node, calico-node-windows) whose status.podIPs no longer includes the node's current InternalIP, and delete them so the owning Deployment / DaemonSet controller recreates them with the correct IP.

This works around an upstream Kubernetes limitation [1] where status.podIPs is immutable for hostNetwork pods once set: when a node's IP changes (e.g. KubeVirt VM reboot pulls a new DHCP lease), existing hostNetwork pods keep their old IP. The kube EndpointSlice controller reads from status.podIPs, so the calico-typha EndpointSlice ends up advertising stale IPs and Felix times out connecting to Typha. Restarting the container does not help — only deleting and recreating the pod itself causes the kubelet to repopulate status.podIPs from the current node IP.
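
A minimal sketch of the staleness check, using upstream client-go types; the helper name and the exact matching rule here are illustrative, not the PR's actual code:

```go
package sketch

import corev1 "k8s.io/api/core/v1"

// isStale reports whether a host-networked pod's status.podIPs no longer
// reflects any of its node's current InternalIP addresses.
func isStale(pod *corev1.Pod, node *corev1.Node) bool {
	if !pod.Spec.HostNetwork || len(pod.Status.PodIPs) == 0 {
		return false // not host-networked, or the kubelet hasn't reported IPs yet
	}
	for _, addr := range node.Status.Addresses {
		if addr.Type != corev1.NodeInternalIP {
			continue
		}
		for _, podIP := range pod.Status.PodIPs {
			if podIP.IP == addr.Address {
				return false // the pod still carries a current node IP
			}
		}
	}
	return true // no pod IP matches any current InternalIP: stale
}
```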

Implementation lives in the existing Typha autoscaler tick (every 10s, already has a Node informer cache):

  • Compare each pod's status.podIPs to its node's status.InternalIP (which the kubelet does update promptly via heartbeat).
  • Delete stale pods, paced at most one batch per workload per tick (see the sketch after this list). Batch size is read from each workload's existing rolling-update setting: the Typha PDB's maxUnavailable, or the DaemonSet's updateStrategy.rollingUpdate.maxUnavailable, falling back to 1 if not set or if the resolved value is < 1 (minimum-progress guarantee).
  • Order: Typha first; if any Typha was deleted this cycle, skip the calico-node deletions until the next tick to give the new Typha pod a clean window to come up. Linux and Windows DaemonSets are paced independently of each other.
  • Skipped entirely on the non-cluster-host autoscaler instance.
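
A compact sketch of the pacing and ordering rules above; workload is a simplified stand-in for the three Calico workloads and deletePod abstracts the real client call, so treat names and shapes as assumptions:

```go
package sketch

import "k8s.io/apimachinery/pkg/util/intstr"

// workload is a simplified stand-in for calico-typha, calico-node, or
// calico-node-windows.
type workload struct {
	stalePods      []string            // pods whose status.podIPs no longer match the node IP
	maxUnavailable *intstr.IntOrString // from the Typha PDB or the DaemonSet rolling update
	replicas       int
	deletePod      func(name string) error // stands in for the real client call
}

// batchSize resolves maxUnavailable, falling back to 1 when it is unset or
// resolves below 1 (the minimum-progress guarantee).
func (w *workload) batchSize() int {
	if w.maxUnavailable == nil {
		return 1
	}
	n, err := intstr.GetScaledValueFromIntOrPercent(w.maxUnavailable, w.replicas, false)
	if err != nil || n < 1 {
		return 1
	}
	return n
}

// deleteOneBatch deletes up to one batch of stale pods and returns the count.
func (w *workload) deleteOneBatch() int {
	deleted, batch := 0, w.batchSize()
	for _, name := range w.stalePods {
		if deleted >= batch {
			break
		}
		if w.deletePod(name) == nil {
			deleted++
		}
	}
	return deleted
}

// tick runs one recovery pass: Typha first; if any Typha pod was deleted
// this cycle, calico-node deletions wait for the next tick so the new Typha
// pod gets a clean window. The two DaemonSets pace independently.
func tick(typha, linuxNode, windowsNode *workload) {
	if typha.deleteOneBatch() > 0 {
		return
	}
	linuxNode.deleteOneBatch()
	windowsNode.deleteOneBatch()
}
```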

Tested by ODCN on KubeVirt: on a 3-node cluster with all node IPs changed, all calico-node and Typha pods recovered automatically without manual intervention.

[1] kubernetes/kubernetes#93897

Jira: CI-1951, CORE-12452

Release Note

Automatically recover Calico pods stranded with stale pod IPs after a node IP change (e.g. KubeVirt node reboot).

For PR author

  • Tests for change.
  • If changing pkg/apis/, run make gen-files
  • If changing versions, run make gen-versions

For PR reviewers

A note for code reviewers - all pull requests must have the following:

  • Milestone set according to targeted release.
  • Appropriate labels:
    • kind/bug if this is a bugfix.
    • kind/enhancement if this is a new feature.
    • enterprise if this PR applies to Calico Enterprise only.

The second commit adds a new Installation.Spec.StalePodIPRecovery field
(Enabled / Disabled, default Enabled) that gates the host-networked stale
pod IP detection and deletion logic in the Typha autoscaler. When set to
Disabled, the entire detection path is skipped each tick.

The default-on choice is consistent with other operator-managed
automation (e.g. the typha autoscaler is itself always-on with no
toggle), avoids opt-in friction for users who don't know the bug
exists, and provides an escape hatch for environments where the
detection might interact badly with custom node-IP management.

Implementation notes:
  - api/v1: new StalePodIPRecoveryType enum and IsStalePodIPRecoveryEnabled
    helper, modeled on the existing FIPSMode pattern (see the sketch after
    these notes). nil is treated as Enabled so the default-on behavior is
    encoded in one place.
  - typha_autoscaler.go: new optional func() bool field on the autoscaler
    consulted at the top of each tick. Wired via the existing option
    pattern (typhaAutoscalerOptionStalePodIPRecoveryEnabled) so tests can
    inject true / false / nil. A nil getter is treated as enabled, which
    keeps existing tests and the non-cluster-host autoscaler path
    unchanged.
  - core_controller.go: the closure reads the Installation named "default"
    from the manager's cached client at call time so toggles take effect
    on the next tick (~10s). Failures fall through to enabled — recovery
    is the safer default for the kubelet bug we're working around.
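
A sketch of the shapes listed in these notes: the api/v1 enum and helper modeled on FIPSMode, plus the autoscaler's optional getter. Names follow the description above, but the exact signatures are assumptions:

```go
package sketch

// StalePodIPRecoveryType mirrors the Enabled / Disabled gate described above.
type StalePodIPRecoveryType string

const (
	StalePodIPRecoveryEnabled  StalePodIPRecoveryType = "Enabled"
	StalePodIPRecoveryDisabled StalePodIPRecoveryType = "Disabled"
)

// IsStalePodIPRecoveryEnabled treats nil as Enabled, so the default-on
// behavior is encoded in one place.
func IsStalePodIPRecoveryEnabled(mode *StalePodIPRecoveryType) bool {
	return mode == nil || *mode == StalePodIPRecoveryEnabled
}

// typhaAutoscaler carries the optional getter, injected via the option
// pattern and consulted at the top of each tick.
type typhaAutoscaler struct {
	stalePodIPRecoveryEnabled func() bool // nil means enabled
}

func (a *typhaAutoscaler) recoveryEnabled() bool {
	if a.stalePodIPRecoveryEnabled == nil {
		// A nil getter counts as enabled, keeping existing tests and the
		// non-cluster-host autoscaler path unchanged.
		return true
	}
	return a.stalePodIPRecoveryEnabled()
}
```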

Tests:
  - 3 new gate tests covering nil getter, true, and false.
  - Defensive Maybe() expectations on SetDegraded in the existing stale
    pod IP detection and maxUnavailable resolution contexts to fix a
    pre-existing race-condition flakiness exposed by this work.
