Fix rolling update deadlock when pods are stuck in non-running state#3051
annielzy wants to merge 5 commits into zalando:master
Conversation
Cannot start a pipeline due to: Click on pipeline status check Details link below for more information.
Thanks @annielzy for this nice contribution. We were also aware of this limitation preventing a cluster from repairing itself. Could you test the following scenario? Pick a Spilo docker image that does not exist. The first pod to be replaced should fail to come back because of the "broken" image. What would happen on the next SYNC cycle? I think the operator would then continue with the next pod until the cascading failure is complete. However, with your code a revert of the image would then work because the pods are getting recreated, which is a big plus.
Hi @FxKu,

Test Setup:
Results:
Conclusion: The operator does continue replacing healthy pods even after earlier pods fail, which could lead to cascading failures. The only reason the cluster didn't completely fail in my test was that the switchover logic requires a healthy replica, which prevented the leader from being replaced.

Recovery Test: I reverted the image back to a valid one, and the cluster successfully recovered - all pods came back healthy and the rolling update completed successfully.

Should this PR address this safety issue, or should it be handled separately? If we want to fix it here, I could add a check to verify that previously replaced pods are healthy (e.g., Running and Ready) before continuing with the next pod in the rolling update.
Problem
The postgres-operator can enter a permanent deadlock state where pods that need recreation are never recreated, because the Patroni API safety checks fail on non-running pods.

This occurs when:

1. A spec change updates the secret reference (e.g. from `postgres-operator-<VERSION A>-secret` to `postgres-operator-<VERSION B>-secret`)
2. The operator marks the pods with the `zalando-postgres-operator-rolling-update-required` annotation
3. The operator calls `recreatePods` to complete the rolling update
4. A pod fails with `CreateContainerConfigError` because the old secret no longer exists
5. `syncPatroniConfig` and `restartInstances` fail because they cannot reach the Patroni API on the non-running pod
6. The operator sets `isSafeToRecreatePods = false`, which blocks `recreatePods`

This affects both single-node and multi-node clusters. In a 3-node cluster, one broken pod blocks the rolling update of all pods, including the healthy ones.
Root Cause
`syncStatefulSet` unconditionally sets `isSafeToRecreatePods = false` when `syncPatroniConfig` or `restartInstances` returns any error. It does not distinguish between:

- genuine Patroni errors on running pods, where blocking recreation is the safe choice, and
- expected API failures on non-running pods, where recreation is exactly what would fix the cluster.
Fix
In `pod.go`:

- Add a `podIsNotRunning()` helper that checks if a pod is stuck in a non-running state (e.g. `CreateContainerConfigError`, `CrashLoopBackOff`, `ImagePullBackOff`, terminated containers, non-Running phase)
- Add an `allPodsRunning()` helper, used by `syncStatefulSet` to determine whether Patroni API errors are expected
- In `recreatePods`, skip the switchover attempt when the master pod is not running, since the Patroni API is unreachable and Patroni has likely already triggered automatic failover

In `sync.go`:

- When `syncPatroniConfig` or `restartInstances` fails, only set `isSafeToRecreatePods = false` if all pods are actually running. If some pods are not running, the Patroni API errors are expected and should not block the pod recreation that would fix them.
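The `pod.go` changes can be sketched with a simplified pod model. All types and field names below are illustrative: the real helpers inspect `corev1.Pod` container statuses, and `planMasterRecreate` only stands in for the switchover-skip decision inside `recreatePods`.

```go
package main

import "fmt"

// Simplified pod model; the real helpers in pod.go inspect corev1.Pod.
type Pod struct {
	Phase         string // "Running", "Pending", "Failed", ...
	WaitingReason string // e.g. "CrashLoopBackOff"; "" if no container is waiting
	Terminated    bool   // true if a container has terminated
}

// podIsNotRunning reports whether the pod is stuck in a non-running state.
func podIsNotRunning(p Pod) bool {
	if p.Phase != "Running" {
		return true
	}
	switch p.WaitingReason {
	case "CreateContainerConfigError", "CrashLoopBackOff", "ImagePullBackOff":
		return true
	}
	return p.Terminated
}

// planMasterRecreate sketches the recreatePods change: when the master is
// not running, its Patroni API is unreachable and Patroni has likely
// already failed over, so the switchover attempt is skipped.
func planMasterRecreate(master Pod) string {
	if podIsNotRunning(master) {
		return "skip switchover, recreate master pod directly"
	}
	return "switch over to a healthy replica, then recreate"
}

func main() {
	broken := Pod{Phase: "Running", WaitingReason: "CreateContainerConfigError"}
	fmt.Println(planMasterRecreate(broken))
	fmt.Println(planMasterRecreate(Pod{Phase: "Running"}))
}
```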
Behavior change summary
Testing
- `TestPodIsNotRunning` — verifies detection of various non-running states (CreateContainerConfigError, CrashLoopBackOff, ImagePullBackOff, terminated containers, mixed container states, pending/failed phase)
- `TestAllPodsRunning` — verifies correct behavior with all-running, mixed, all-broken, and empty pod lists
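A table-driven sketch of the kinds of cases `TestPodIsNotRunning` covers, run against a simplified, self-contained version of the helper (the real test builds `corev1.Pod` fixtures; the `Pod` struct here is a hypothetical stand-in):

```go
package main

import "fmt"

// Simplified pod model standing in for corev1.Pod; illustrative only.
type Pod struct {
	Phase         string
	WaitingReason string
	Terminated    bool
}

func podIsNotRunning(p Pod) bool {
	if p.Phase != "Running" {
		return true
	}
	switch p.WaitingReason {
	case "CreateContainerConfigError", "CrashLoopBackOff", "ImagePullBackOff":
		return true
	}
	return p.Terminated
}

func main() {
	cases := []struct {
		name string
		pod  Pod
		want bool
	}{
		{"healthy running pod", Pod{Phase: "Running"}, false},
		{"CreateContainerConfigError", Pod{Phase: "Running", WaitingReason: "CreateContainerConfigError"}, true},
		{"CrashLoopBackOff", Pod{Phase: "Running", WaitingReason: "CrashLoopBackOff"}, true},
		{"ImagePullBackOff", Pod{Phase: "Running", WaitingReason: "ImagePullBackOff"}, true},
		{"terminated container", Pod{Phase: "Running", Terminated: true}, true},
		{"pending phase", Pod{Phase: "Pending"}, true},
		{"failed phase", Pod{Phase: "Failed"}, true},
	}
	for _, c := range cases {
		if got := podIsNotRunning(c.pod); got != c.want {
			fmt.Printf("FAIL %s: got %v, want %v\n", c.name, got, c.want)
			return
		}
	}
	fmt.Println("all cases pass")
}
```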