fix: uncordon nodes left cordoned after lock loss #1282

Open
xavierleune wants to merge 1 commit into kubereboot:main from
xavierleune:fix/stuck-cordoned

@xavierleune

Summary

  • Fix nodes remaining cordoned indefinitely when kured loses the lock during reboot (e.g., lock TTL expiry, lock taken by another node)
  • Detect orphaned cordoned nodes using --pre-reboot-node-labels or --annotate-nodes markers

Problem

When a node reboots and kured restarts, the uncordon logic is skipped entirely if lock.Holding() returns false (because the lock expired or was released). The node remains cordoned and still carries its pre-reboot labels, but kured no longer recognizes it as a node it is responsible for uncordoning.
This is particularly problematic on MicroOS and similar auto-updating operating systems, where a second reboot may be required immediately after the first.

Fixes #63

Solution

After checking lock.Holding(), if we don't hold the lock, check if the node:

  • Has the KuredRebootInProgressAnnotation (if --annotate-nodes is enabled), OR
  • Has the --pre-reboot-node-labels labels

If either condition is true AND the node is unschedulable, uncordon it.
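
The decision described above can be sketched roughly as follows. This is a minimal, standalone illustration and not the code from this PR: the `nodeState` and `config` types, the `shouldUncordonOrphan` and `hasAllLabels` helpers, and the annotation key are all hypothetical stand-ins, and the sketch assumes that every configured pre-reboot label must match for the label marker to count.

```go
package main

import "fmt"

// rebootInProgressAnnotation is a placeholder key for this sketch only.
// The real value of KuredRebootInProgressAnnotation is not assumed here.
const rebootInProgressAnnotation = "example.invalid/reboot-in-progress"

// nodeState is a hypothetical, trimmed-down view of a Kubernetes node.
type nodeState struct {
	Unschedulable bool
	Annotations   map[string]string
	Labels        map[string]string
}

// config is a hypothetical holder for the two relevant flags.
type config struct {
	AnnotateNodes   bool              // --annotate-nodes
	PreRebootLabels map[string]string // parsed --pre-reboot-node-labels
}

// hasAllLabels reports whether every wanted key=value pair is present.
// An empty want set never matches, so an unset flag cannot mark a node.
func hasAllLabels(have, want map[string]string) bool {
	if len(want) == 0 {
		return false
	}
	for k, v := range want {
		if got, ok := have[k]; !ok || got != v {
			return false
		}
	}
	return true
}

// shouldUncordonOrphan covers only the "lock not held" path: the node
// must be cordoned and carry at least one of the two reboot markers.
func shouldUncordonOrphan(holdingLock bool, n nodeState, c config) bool {
	if holdingLock || !n.Unschedulable {
		return false
	}
	_, annotated := n.Annotations[rebootInProgressAnnotation]
	if c.AnnotateNodes && annotated {
		return true
	}
	return hasAllLabels(n.Labels, c.PreRebootLabels)
}

func main() {
	cfg := config{PreRebootLabels: map[string]string{"kured": "rebooting"}}
	orphan := nodeState{
		Unschedulable: true,
		Labels:        map[string]string{"kured": "rebooting"},
	}
	fmt.Println("uncordon orphan:", shouldUncordonOrphan(false, orphan, cfg))
}
```

A node cordoned by an operator for unrelated reasons carries neither marker, so this check leaves it alone; that is the reason for requiring a marker in addition to the unschedulable state.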

Signed-off-by: Xavier Leune <xavier.leune@gmail.com>
@evrardjp added the FEATURE-v2 label (This is a feature improvement that needs to be taken into consideration for v2) on Feb 22, 2026
@evrardjp added the keep label (This won't be closed by the stale bot.) on Mar 24, 2026
Development

Successfully merging this pull request may close these issues.

Node stays on Ready,SchedulingDisabled

2 participants