Fix rolling update deadlock when pods are stuck in non-running state#3051
annielzy wants to merge 5 commits into zalando:master
Conversation
Cannot start a pipeline due to: Click on pipeline status check Details link below for more information.
Thanks @annielzy for this nice contribution. We were also aware of this limitation preventing a cluster from repairing itself. Could you test the following scenario? Pick a Spilo docker image that does not exist. The first pod to be replaced should fail to come back because of the "broken" image. What would happen on the next SYNC cycle? I think the operator would then continue with the next pod until the cascading failure is complete. However, with your code a revert of the image would then work because the pods are getting recreated, which is a big plus.
Hi @FxKu,

Test Setup:
Results:
Conclusion: The operator does continue replacing healthy pods even after earlier pods fail, which could lead to cascading failures. The only reason the cluster didn't completely fail in my test was that the switchover logic requires a healthy replica, which prevented the leader from being replaced.

Recovery Test: I reverted the image back to a valid one, and the cluster successfully recovered - all pods came back healthy and the rolling update completed successfully.

Should this PR address this safety issue, or should it be handled separately? If we want to fix it here, I could add a check to verify that previously replaced pods are healthy (e.g., Running and Ready) before continuing with the next pod in the rolling update.
Problem
The postgres-operator can enter a permanent deadlock state where pods that need recreation are never recreated, because the Patroni API safety checks fail on non-running pods.

This occurs when:

1. A spec change updates the secret reference (e.g. from `postgres-operator-<VERSION A>-secret` to `postgres-operator-<VERSION B>-secret`)
2. The operator marks the pods with the `zalando-postgres-operator-rolling-update-required` annotation
3. The operator calls `recreatePods` to complete the rolling update
4. A pod fails with `CreateContainerConfigError` because the old secret no longer exists
5. `syncPatroniConfig` and `restartInstances` fail because they cannot reach the Patroni API on the non-running pod
6. The operator sets `isSafeToRecreatePods = false`, which blocks `recreatePods`

This affects both single-node and multi-node clusters. In a 3-node cluster, one broken pod blocks the rolling update of all pods, including the healthy ones.
Root Cause
`syncStatefulSet` unconditionally sets `isSafeToRecreatePods = false` when `syncPatroniConfig` or `restartInstances` returns any error. It does not distinguish between:

- genuine Patroni errors on running pods, where blocking recreation is the safe choice, and
- expected API failures on non-running pods, where recreation is exactly what would fix the cluster.
Fix
In `pod.go`:

- Add a `podIsNotRunning()` helper that checks if a pod is stuck in a non-running state (e.g. `CreateContainerConfigError`, `CrashLoopBackOff`, `ImagePullBackOff`, terminated containers, non-Running phase)
- Add an `allPodsRunning()` helper, used by `syncStatefulSet` to determine whether Patroni API errors are expected
- In `recreatePods`, skip the switchover attempt when the master pod is not running, since the Patroni API is unreachable and Patroni has likely already triggered automatic failover

In `sync.go`:

- When `syncPatroniConfig` or `restartInstances` fails, only set `isSafeToRecreatePods = false` if all pods are actually running. If some pods are not running, the Patroni API errors are expected and should not block the pod recreation that would fix them.
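The `pod.go` changes can be sketched with a simplified pod model. All types and field names below are illustrative: the real helpers inspect `corev1.Pod` container statuses, and `planMasterRecreate` only stands in for the switchover-skip decision inside `recreatePods`.

```go
package main

import "fmt"

// Simplified pod model; the real helpers in pod.go inspect corev1.Pod.
type Pod struct {
	Phase         string // "Running", "Pending", "Failed", ...
	WaitingReason string // e.g. "CrashLoopBackOff"; "" if no container is waiting
	Terminated    bool   // true if a container has terminated
}

// podIsNotRunning reports whether the pod is stuck in a non-running state.
func podIsNotRunning(p Pod) bool {
	if p.Phase != "Running" {
		return true
	}
	switch p.WaitingReason {
	case "CreateContainerConfigError", "CrashLoopBackOff", "ImagePullBackOff":
		return true
	}
	return p.Terminated
}

// planMasterRecreate sketches the recreatePods change: when the master is
// not running, its Patroni API is unreachable and Patroni has likely
// already failed over, so the switchover attempt is skipped.
func planMasterRecreate(master Pod) string {
	if podIsNotRunning(master) {
		return "skip switchover, recreate master pod directly"
	}
	return "switch over to a healthy replica, then recreate"
}

func main() {
	broken := Pod{Phase: "Running", WaitingReason: "CreateContainerConfigError"}
	fmt.Println(planMasterRecreate(broken))
	fmt.Println(planMasterRecreate(Pod{Phase: "Running"}))
}
```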
Behavior change summary
Testing
- `TestPodIsNotRunning` — verifies detection of various non-running states (CreateContainerConfigError, CrashLoopBackOff, ImagePullBackOff, terminated containers, mixed container states, pending/failed phase)
- `TestAllPodsRunning` — verifies correct behavior with all-running, mixed, all-broken, and empty pod lists
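A table-driven sketch of the kinds of cases `TestPodIsNotRunning` covers, run against a simplified, self-contained version of the helper (the real test builds `corev1.Pod` fixtures; the `Pod` struct here is a hypothetical stand-in):

```go
package main

import "fmt"

// Simplified pod model standing in for corev1.Pod; illustrative only.
type Pod struct {
	Phase         string
	WaitingReason string
	Terminated    bool
}

func podIsNotRunning(p Pod) bool {
	if p.Phase != "Running" {
		return true
	}
	switch p.WaitingReason {
	case "CreateContainerConfigError", "CrashLoopBackOff", "ImagePullBackOff":
		return true
	}
	return p.Terminated
}

func main() {
	cases := []struct {
		name string
		pod  Pod
		want bool
	}{
		{"healthy running pod", Pod{Phase: "Running"}, false},
		{"CreateContainerConfigError", Pod{Phase: "Running", WaitingReason: "CreateContainerConfigError"}, true},
		{"CrashLoopBackOff", Pod{Phase: "Running", WaitingReason: "CrashLoopBackOff"}, true},
		{"ImagePullBackOff", Pod{Phase: "Running", WaitingReason: "ImagePullBackOff"}, true},
		{"terminated container", Pod{Phase: "Running", Terminated: true}, true},
		{"pending phase", Pod{Phase: "Pending"}, true},
		{"failed phase", Pod{Phase: "Failed"}, true},
	}
	for _, c := range cases {
		if got := podIsNotRunning(c.pod); got != c.want {
			fmt.Printf("FAIL %s: got %v, want %v\n", c.name, got, c.want)
			return
		}
	}
	fmt.Println("all cases pass")
}
```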