fix(controller): Deterministically select master pod #437

Merged: ashotland merged 31 commits into dragonflydb:main from xuekat:xuekat/operator-reconcile on Mar 19, 2026

Conversation

xuekat (Contributor) commented Dec 21, 2025

Deterministically select the next master pod. Context: the operator was failing to reconcile because of a race condition: multiple pods would fail to find a healthy master, get promoted to master simultaneously, and then try to mark each other as replicas.

Abhra303 (Contributor) previously approved these changes Jan 6, 2026
Abhra303 (Contributor) left a comment:

Looks good, Thanks!

Copilot AI left a comment:

Pull request overview

This pull request addresses a race condition in the Dragonfly operator where multiple pods could simultaneously be promoted to master. The fix introduces deterministic master pod selection to ensure only one pod becomes master at a time.

  • Adds selectMasterCandidate function to deterministically select the lowest-ordinal ready pod as master
  • Implements role verification to detect and fix labeled masters running as replicas
  • Introduces PhaseConfiguring state with recovery logic to prevent stuck configurations

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.

Changed files:

  • internal/controller/util.go: adds getOrdinal and selectMasterCandidate functions for deterministic master selection
  • internal/controller/dragonfly_pod_lifecycle_controller.go: updates master selection logic to use selectMasterCandidate, adds role verification for labeled masters, and enables recovery from PhaseConfiguring state
  • internal/controller/dragonfly_instance.go: modifies checkAndConfigureReplicas to return configuration status, adds getRedisRole function, and updates phase transition logic
  • e2e/dragonfly_pod_lifecycle_controller_test.go: adds test to verify recovery from stuck PhaseConfiguring phase


In internal/controller/util.go:
func getOrdinal(podName string) int {
    ordinal, err := strconv.Atoi(strings.Split(podName, "-")[1])
    if err != nil {
        return -1
    }
    return ordinal
}
Copilot AI commented Jan 8, 2026:

Returning -1 on error could lead to unexpected behavior in selectMasterCandidate. A pod with a parsing error would be treated as having ordinal -1, which would always be less than valid ordinals (0, 1, 2, etc.), potentially selecting an invalid pod as the master candidate. Consider returning a larger sentinel value like math.MaxInt or handling the error more explicitly in the caller.

Contributor:

either check that getOrdinal is not -1 before using in selectMasterCandidate or return MaxInt as suggested here.
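A minimal sketch of the suggested variant, assuming StatefulSet pod names of the form <name>-<ordinal>; splitting on the last dash (rather than Split(...)[1]) and returning math.MaxInt on failure are illustrative choices, so an unparsable name can never win the lowest-ordinal selection:

import (
    "math"
    "strconv"
    "strings"
)

// getOrdinal extracts the StatefulSet ordinal from a pod name such as
// "dragonfly-sample-2". On any parse failure it returns math.MaxInt, so the
// pod loses every lowest-ordinal comparison instead of winning them all.
func getOrdinal(podName string) int {
    idx := strings.LastIndex(podName, "-")
    if idx < 0 {
        return math.MaxInt
    }
    ordinal, err := strconv.Atoi(podName[idx+1:])
    if err != nil {
        return math.MaxInt
    }
    return ordinal
}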

xuekat and others added 3 commits January 8, 2026 15:16
Comment on lines +111 to +120
role, err := dfi.getRedisRole(ctx, master)
if err != nil {
    log.Error(err, "failed to get redis role for labeled master", "pod", master.Name)
} else if role == resources.Replica {
    log.Info("Pod labeled as master is running as replica. Promoting it.", "pod", master.Name)
    if err := dfi.replicaOfNoOne(ctx, master); err != nil {
        return ctrl.Result{}, fmt.Errorf("failed to promote master: %w", err)
    }
}

Contributor:

I am concerned with this logic and I think this is redundant - we already handle this in checkAndConfigureReplicas function. I don't see any reason to put it here. Why do you think the checkAndConfigureReplicas function is not enough?

xuekat (Author) replied:

I think checkAndConfigureReplicas only checks whether the replica pods are configured correctly (not rogue, connected to the right master). This logic checks whether the master is unintentionally running as a replica, so it covers problems with the master pod that checkAndConfigureReplicas doesn't.
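For context, a minimal sketch of what such a promotion boils down to, assuming a go-redis client; promoteToMaster is an illustrative stand-in for the operator's replicaOfNoOne, which may be wired differently:

import (
    "context"

    "github.com/redis/go-redis/v9"
)

// promoteToMaster detaches the instance at addr from its current master.
func promoteToMaster(ctx context.Context, addr string) error {
    rc := redis.NewClient(&redis.Options{Addr: addr})
    defer rc.Close()
    // SLAVEOF NO ONE stops replication and promotes the instance to master.
    return rc.SlaveOf(ctx, "NO", "ONE").Err()
}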

xuekat and others added 4 commits January 15, 2026 16:52
xuekat requested a review from Abhra303 on January 19, 2026
xuekat (Author) commented Jan 19, 2026:

Made a few more changes to ensure there are no race conditions. @Abhra303, could you review again? Summary of changes:

  • Deterministically select the master node by ordinal value.
  • Defer PhaseReady until all replicas are configured. The original code set PhaseReady prematurely, before replicas were configured, which caused race conditions where rolling updates could start before replication was fully established.
  • Handle isReplicationError, since these changes increased reconciliation frequency (due to Requeue: true when not all replicas are configured). That caused concurrent SLAVE OF commands to the same pod; DragonflyDB returns "ERR replication cancelled" for concurrent replication attempts, and without handling this it created an infinite retry loop.
  • Allow reconciliation in the PhaseConfiguring phase. Because PhaseReady is deferred, the controller needs to keep reconciling during PhaseConfiguring to finish setting up replicas.
  • Use Patch instead of Update, as Patch is safer in concurrent environments (see the sketch below).
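A minimal sketch of the last two points, assuming controller-runtime's client package; isReplicationError and setRoleLabel are illustrative names, and the matched string is the error quoted above:

import (
    "context"
    "strings"

    corev1 "k8s.io/api/core/v1"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

// isReplicationError detects Dragonfly's rejection of concurrent SLAVE OF
// commands, so the reconciler can requeue instead of retrying in a tight loop.
func isReplicationError(err error) bool {
    return err != nil && strings.Contains(err.Error(), "replication cancelled")
}

// setRoleLabel applies a role label with a merge patch. Unlike Update, the
// patch sends only the changed fields, so it cannot clobber concurrent writes
// to the rest of the object.
func setRoleLabel(ctx context.Context, c client.Client, pod *corev1.Pod, role string) error {
    patch := client.MergeFrom(pod.DeepCopy())
    if pod.Labels == nil {
        pod.Labels = map[string]string{}
    }
    pod.Labels["role"] = role
    return c.Patch(ctx, pod, patch)
}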

xuekat (Author) commented Jan 26, 2026:

hey @Abhra303 would appreciate another review here 🙏

xuekat (Author) commented Feb 19, 2026:

gentle bump @Abhra303

ashotland (Contributor) left a comment:

Thanks for this PR. Sorry to take it a step back; please see some comments.

}
}

masterPod, err := dfi.getMaster(ctx)
Contributor:

why do you getMaster again?

you also seem to call dfi.replicaOfNoOne(ctx, master) below.

can we just consistently use the master variable from line 77?

xuekat (Author) replied:

changed masterPod, err := dfi.getMaster(ctx) to use the existing master variable consistently

Contributor:

I still see masterPod, err := dfi.getMaster(ctx)

why can't you use master?

Contributor:

should we use patch here too?

xuekat (Author) replied:

updated

continue
}

if !isPodOnLatestVersion(&pod, updateRevision) && !isTerminating(&pod) && !isRunningAndReady(&pod) {
Contributor:

!isTerminating is now redundant?

xuekat (Author) replied:

thanks, removed

In internal/controller/util.go:

for i := range pods {
    p := &pods[i]
    // We can't use isReady() because the readiness probe might not pass until replication is configured.
Contributor:

but it seems we check isReady right after calling this function.

also, IIUC, if pod 0 is slow to become ready we'll keep requeuing even though there might be other ready pods, which is a considerable deviation from the current behaviour.

can we do something like the following instead:

func selectMasterCandidate(pods []corev1.Pod, isReady func(*corev1.Pod) bool) *corev1.Pod {
    var best *corev1.Pod

    for i := range pods {
        p := &pods[i]
        if !isReady(p) {
            continue
        }
        if best == nil || getOrdinal(p.Name) < getOrdinal(best.Name) {
            best = p
        }
    }
    return best
}

xuekat (Author) replied:

makes sense, updated as per this

xuekat requested a review from ashotland on February 23, 2026
return ctrl.Result{}, nil
}

masterReady, err := dfi.isPodReady(ctx, masterCandidate)
Contributor:

I believe this check is redundant, as the master candidate must be ready.

xuekat (Author) replied:

thanks, removed


return err
}

dfiStatus.Phase = PhaseReady
Contributor:

this seems like a behaviour change that doesn't seem related to the goal of this PR.

currently we move to ready once the master is configured and serving, and I am not sure we should change this.

generally it'd be great to limit the scope to changes that are required to achieve the purpose of the PR.

xuekat (Author) replied:

This was to deal with the test "DF Pod Lifecycle Reconciler Fail Over is working", which needs something that handles a pod event while in PhaseConfiguring, before the transition to PhaseReady. But I've reverted this change in favor of adding PhaseConfiguring back to the gate condition and, after checkAndConfigureReplicas succeeds, transitioning from PhaseConfiguring to PhaseReady.

Contributor:

Thanks, but you still moved dfiStatus.Phase = PhaseReady to later,

which is why I think you added changes in DfPodLifeCycleReconciler to handle a 'stuck' PhaseConfiguring.

I still don't understand why this is needed for the purpose of this PR, and why we can't keep dfiStatus.Phase = PhaseReady in its original state.

log.Info("non-deletion event for a pod with an existing role. checking if something is wrong", "pod", pod.Name, "role", pod.Labels[resources.RoleLabelKey])

-	if err = dfi.checkAndConfigureReplicas(ctx, master.Status.PodIP); err != nil {
+	if allConfigured, err := dfi.checkAndConfigureReplicas(ctx, master.Status.PodIP); err != nil {
Contributor:

not sure I understand why we need this change.

also, once we revert back to PhaseReady when the master is configured, the dfi.getStatus().Phase == PhaseConfiguring check below becomes dead code.

xuekat (Author) replied:

check my comment here, but reverted this logic

xuekat requested a review from ashotland on February 26, 2026

Expect(podRoles[resources.Replica]).To(HaveLen(replicas - 1))
})

It("Should recover from stuck Configuring phase", func() {
Contributor:

given the other comments this test may no longer be required

xuekat (Author) replied:

yup, removed

}

// selectMasterCandidate deterministically selects a master candidate from the given list of pods.
func selectMasterCandidate(pods []corev1.Pod, isReady func(*corev1.Pod) bool) *corev1.Pod {
Contributor:

how about a unit test for this function?

xuekat (Author) replied:

added a unit test
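A sketch of what such a test could look like (the test actually added in the PR may differ); makePod and the readiness stub are illustrative:

import (
    "testing"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func TestSelectMasterCandidate(t *testing.T) {
    makePod := func(name string) corev1.Pod {
        return corev1.Pod{ObjectMeta: metav1.ObjectMeta{Name: name}}
    }
    pods := []corev1.Pod{makePod("df-2"), makePod("df-0"), makePod("df-1")}
    // Mark the lowest ordinal not ready; df-1 should then be selected.
    isReady := func(p *corev1.Pod) bool { return p.Name != "df-0" }

    if got := selectMasterCandidate(pods, isReady); got == nil || got.Name != "df-1" {
        t.Fatalf("expected df-1 as the lowest ready ordinal, got %+v", got)
    }
    // With no ready pods, the function should return nil.
    if selectMasterCandidate(pods, func(*corev1.Pod) bool { return false }) != nil {
        t.Fatal("expected nil when no pod is ready")
    }
}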

xuekat (Author) commented Mar 10, 2026:

> Thanks, but you still moved dfiStatus.Phase = PhaseReady to later,
>
> which is why I think you added changes in DfPodLifeCycleReconciler to handle a 'stuck' PhaseConfiguring.
>
> I still don't understand why this is needed for the purpose of this PR, and why we can't keep dfiStatus.Phase = PhaseReady in its original state.

right, it's not needed anymore, so I've moved it back to its original place, thanks.

xuekat requested a review from ashotland on March 10, 2026
ashotland (Contributor) left a comment:

thanks a lot, looks good now. please just update the branch and resolve conflicts.

xuekat requested a review from ashotland on March 16, 2026
Abhra303 (Contributor) left a comment:

LGTM, please check my comment. Should be ready to merge after this.

role, err := dfi.getRedisRole(ctx, master)
if err != nil {
    log.Info("failed to verify master status in redis (ignoring)", "error", err)
} else if role == resources.Replica {
Contributor:

Can you please use else if role != resources.Master? Some dragonfly versions return slave instead of replica as the role.

xuekat (Author) replied:

done
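For background, both strings come from the ROLE command: older engines answer "slave" where newer ones answer "replica", which is why role != resources.Master is the robust check. A minimal, self-contained sketch of such a role probe, assuming go-redis (getRole is illustrative; the operator's getRedisRole may be wired differently):

import (
    "context"
    "fmt"

    "github.com/redis/go-redis/v9"
)

// getRole asks the instance at addr for its replication role. The ROLE reply
// is an array whose first element is "master", "replica", or, on some
// versions, "slave".
func getRole(ctx context.Context, addr string) (string, error) {
    rc := redis.NewClient(&redis.Options{Addr: addr})
    defer rc.Close()

    reply, err := rc.Do(ctx, "role").Result()
    if err != nil {
        return "", err
    }
    parts, ok := reply.([]interface{})
    if !ok || len(parts) == 0 {
        return "", fmt.Errorf("unexpected ROLE reply: %v", reply)
    }
    role, ok := parts[0].(string)
    if !ok {
        return "", fmt.Errorf("unexpected ROLE reply element: %v", parts[0])
    }
    return role, nil
}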

xuekat requested a review from Abhra303 on March 17, 2026
ashotland merged commit 5b2a6df into dragonflydb:main on Mar 19, 2026. 2 checks passed.