fix(controller): Deterministically select master pod #437

Merged: ashotland merged 31 commits into dragonflydb:main from xuekat:xuekat/operator-reconcile on Mar 19, 2026

Conversation

xuekat (Contributor) commented Dec 21, 2025

Deterministically select the next master pod. Context: the operator was failing to reconcile because of a race condition: multiple pods would fail to find a healthy master, get promoted to master simultaneously, and then try to mark each other as replicas.

Abhra303 (Contributor) previously approved these changes Jan 6, 2026
Abhra303 (Contributor) left a comment:

Looks good, Thanks!

Copilot AI left a comment:

Pull request overview

This pull request addresses a race condition in the Dragonfly operator where multiple pods could simultaneously be promoted to master. The fix introduces deterministic master pod selection to ensure only one pod becomes master at a time.

  • Adds selectMasterCandidate function to deterministically select the lowest-ordinal ready pod as master
  • Implements role verification to detect and fix labeled masters running as replicas
  • Introduces PhaseConfiguring state with recovery logic to prevent stuck configurations

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.

Changed files:

  • internal/controller/util.go: adds getOrdinal and selectMasterCandidate functions for deterministic master selection
  • internal/controller/dragonfly_pod_lifecycle_controller.go: updates master selection logic to use selectMasterCandidate, adds role verification for labeled masters, and enables recovery from PhaseConfiguring state
  • internal/controller/dragonfly_instance.go: modifies checkAndConfigureReplicas to return configuration status, adds getRedisRole function, and updates phase transition logic
  • e2e/dragonfly_pod_lifecycle_controller_test.go: adds test to verify recovery from stuck PhaseConfiguring phase


In internal/controller/util.go:
func getOrdinal(podName string) int {
    ordinal, err := strconv.Atoi(strings.Split(podName, "-")[1])
    if err != nil {
        return -1
    }
    return ordinal
}
Copilot AI commented Jan 8, 2026:

Returning -1 on error could lead to unexpected behavior in selectMasterCandidate. A pod with a parsing error would be treated as having ordinal -1, which would always be less than valid ordinals (0, 1, 2, etc.), potentially selecting an invalid pod as the master candidate. Consider returning a larger sentinel value like math.MaxInt or handling the error more explicitly in the caller.

Contributor:

either check that getOrdinal is not -1 before using in selectMasterCandidate or return MaxInt as suggested here.
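A minimal sketch of the suggested variant, assuming StatefulSet pod names of the form <name>-<ordinal>; splitting on the last dash (rather than Split(...)[1]) and returning math.MaxInt on failure are illustrative choices, so an unparsable name can never win the lowest-ordinal selection:

import (
    "math"
    "strconv"
    "strings"
)

// getOrdinal extracts the StatefulSet ordinal from a pod name such as
// "dragonfly-sample-2". On any parse failure it returns math.MaxInt, so the
// pod loses every lowest-ordinal comparison instead of winning them all.
func getOrdinal(podName string) int {
    idx := strings.LastIndex(podName, "-")
    if idx < 0 {
        return math.MaxInt
    }
    ordinal, err := strconv.Atoi(podName[idx+1:])
    if err != nil {
        return math.MaxInt
    }
    return ordinal
}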

xuekat and others added 3 commits January 8, 2026 15:16
Comment on lines +111 to +120
role, err := dfi.getRedisRole(ctx, master)
if err != nil {
    log.Error(err, "failed to get redis role for labeled master", "pod", master.Name)
} else if role == resources.Replica {
    log.Info("Pod labeled as master is running as replica. Promoting it.", "pod", master.Name)
    if err := dfi.replicaOfNoOne(ctx, master); err != nil {
        return ctrl.Result{}, fmt.Errorf("failed to promote master: %w", err)
    }
}

Contributor:

I am concerned with this logic and I think this is redundant - we already handle this in checkAndConfigureReplicas function. I don't see any reason to put it here. Why do you think the checkAndConfigureReplicas function is not enough?

xuekat (Author) replied:

I think checkAndConfigureReplicas only checks whether the replica pods are configured correctly (not rogue, connected to the right master). This logic checks whether the master is unintentionally running as a replica, so it covers problems with the master pod that checkAndConfigureReplicas doesn't.
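For context, a minimal sketch of what such a promotion boils down to, assuming a go-redis client; promoteToMaster is an illustrative stand-in for the operator's replicaOfNoOne, which may be wired differently:

import (
    "context"

    "github.com/redis/go-redis/v9"
)

// promoteToMaster detaches the instance at addr from its current master.
func promoteToMaster(ctx context.Context, addr string) error {
    rc := redis.NewClient(&redis.Options{Addr: addr})
    defer rc.Close()
    // SLAVEOF NO ONE stops replication and promotes the instance to master.
    return rc.SlaveOf(ctx, "NO", "ONE").Err()
}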

xuekat and others added 4 commits January 15, 2026 16:52
xuekat requested a review from Abhra303 on January 19, 2026
xuekat (Author) commented Jan 19, 2026:

Made a few more changes to ensure there are no race conditions. @Abhra303, could you review again? Summary of changes:

  • Deterministically select the master node by ordinal value.
  • Defer PhaseReady until all replicas are configured. The original code set PhaseReady prematurely, before replicas were configured, which caused race conditions where rolling updates could start before replication was fully established.
  • Handle isReplicationError, since these changes increased reconciliation frequency (due to Requeue: true when not all replicas are configured). That caused concurrent SLAVE OF commands to the same pod; DragonflyDB returns "ERR replication cancelled" for concurrent replication attempts, and without handling this it created an infinite retry loop.
  • Allow reconciliation in the PhaseConfiguring phase. Because PhaseReady is deferred, the controller needs to keep reconciling during PhaseConfiguring to finish setting up replicas.
  • Use Patch instead of Update, as Patch is safer in concurrent environments (see the sketch below).
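A minimal sketch of the last two points, assuming controller-runtime's client package; isReplicationError and setRoleLabel are illustrative names, and the matched string is the error quoted above:

import (
    "context"
    "strings"

    corev1 "k8s.io/api/core/v1"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

// isReplicationError detects Dragonfly's rejection of concurrent SLAVE OF
// commands, so the reconciler can requeue instead of retrying in a tight loop.
func isReplicationError(err error) bool {
    return err != nil && strings.Contains(err.Error(), "replication cancelled")
}

// setRoleLabel applies a role label with a merge patch. Unlike Update, the
// patch sends only the changed fields, so it cannot clobber concurrent writes
// to the rest of the object.
func setRoleLabel(ctx context.Context, c client.Client, pod *corev1.Pod, role string) error {
    patch := client.MergeFrom(pod.DeepCopy())
    if pod.Labels == nil {
        pod.Labels = map[string]string{}
    }
    pod.Labels["role"] = role
    return c.Patch(ctx, pod, patch)
}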

xuekat (Author) commented Jan 26, 2026:

hey @Abhra303 would appreciate another review here 🙏

xuekat (Author) commented Feb 19, 2026:

gentle bump @Abhra303

ashotland (Contributor) left a comment:

Thanks for this PR. Sorry to take it a step back; please see some comments.

}
}

masterPod, err := dfi.getMaster(ctx)
Contributor:

why do you getMaster again?

you also seem to call dfi.replicaOfNoOne(ctx, master) below.

can we just consistently use the master variable from line 77?

xuekat (Author) replied:

changed masterPod, err := dfi.getMaster(ctx) to use the existing master variable consistently

Contributor:

I still see masterPod, err := dfi.getMaster(ctx)

why can't you use master?

Contributor:

should we use patch here too?

xuekat (Author) replied:

updated

continue
}

if !isPodOnLatestVersion(&pod, updateRevision) && !isTerminating(&pod) && !isRunningAndReady(&pod) {
Contributor:

!isTerminating is now redundant?

xuekat (Author) replied:

thanks, removed

In internal/controller/util.go:

for i := range pods {
    p := &pods[i]
    // We can't use isReady() because the readiness probe might not pass until replication is configured.
Contributor:

but it seems we check isReady right after calling this function.

also, IIUC, if pod 0 is slow to become ready we'll keep requeuing even though there might be other ready pods, which is a considerable deviation from the current behaviour.

can we do something like the following instead:

func selectMasterCandidate(pods []corev1.Pod, isReady func(*corev1.Pod) bool) *corev1.Pod {
    var best *corev1.Pod

    for i := range pods {
        p := &pods[i]
        if !isReady(p) {
            continue
        }
        if best == nil || getOrdinal(p.Name) < getOrdinal(best.Name) {
            best = p
        }
    }
    return best
}

xuekat (Author) replied:

makes sense, updated as per this

xuekat requested a review from ashotland on February 23, 2026
return ctrl.Result{}, nil
}

masterReady, err := dfi.isPodReady(ctx, masterCandidate)
Contributor:

I believe this check is redundant, as the master candidate must be ready.

xuekat (Author) replied:

thanks, removed


return err
}

dfiStatus.Phase = PhaseReady
Contributor:

this seems like a behaviour change that doesn't seem related to the goal of this PR.

currently we move to ready once the master is configured and serving, and I am not sure we should change this.

generally it'd be great to limit the scope to changes that are required to achieve the purpose of the PR.

xuekat (Author) replied:

This was to deal with the test "DF Pod Lifecycle Reconciler Fail Over is working", which needs something that handles a pod event while in PhaseConfiguring, before the transition to PhaseReady. But I've reverted this change in favor of adding PhaseConfiguring back to the gate condition and, after checkAndConfigureReplicas succeeds, transitioning from PhaseConfiguring to PhaseReady.

Contributor:

Thanks, but you still moved dfiStatus.Phase = PhaseReady to later,

which is why I think you added changes in DfPodLifeCycleReconciler to handle a 'stuck' PhaseConfiguring.

I still don't understand why this is needed for the purpose of this PR, and why we can't keep dfiStatus.Phase = PhaseReady in its original state.

log.Info("non-deletion event for a pod with an existing role. checking if something is wrong", "pod", pod.Name, "role", pod.Labels[resources.RoleLabelKey])

-	if err = dfi.checkAndConfigureReplicas(ctx, master.Status.PodIP); err != nil {
+	if allConfigured, err := dfi.checkAndConfigureReplicas(ctx, master.Status.PodIP); err != nil {
Contributor:

not sure I understand why we need this change.

also, once we revert back to PhaseReady when the master is configured, the dfi.getStatus().Phase == PhaseConfiguring check below becomes dead code.

xuekat (Author) replied:

check my comment here, but reverted this logic

xuekat requested a review from ashotland on February 26, 2026

Expect(podRoles[resources.Replica]).To(HaveLen(replicas - 1))
})

It("Should recover from stuck Configuring phase", func() {
Contributor:

given the other comments this test may no longer be required

xuekat (Author) replied:

yup, removed

}

// selectMasterCandidate deterministically selects a master candidate from the given list of pods.
func selectMasterCandidate(pods []corev1.Pod, isReady func(*corev1.Pod) bool) *corev1.Pod {
Contributor:

how about a unit test for this function?

xuekat (Author) replied:

added a unit test
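A sketch of what such a test could look like (the test actually added in the PR may differ); makePod and the readiness stub are illustrative:

import (
    "testing"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func TestSelectMasterCandidate(t *testing.T) {
    makePod := func(name string) corev1.Pod {
        return corev1.Pod{ObjectMeta: metav1.ObjectMeta{Name: name}}
    }
    pods := []corev1.Pod{makePod("df-2"), makePod("df-0"), makePod("df-1")}
    // Mark the lowest ordinal not ready; df-1 should then be selected.
    isReady := func(p *corev1.Pod) bool { return p.Name != "df-0" }

    if got := selectMasterCandidate(pods, isReady); got == nil || got.Name != "df-1" {
        t.Fatalf("expected df-1 as the lowest ready ordinal, got %+v", got)
    }
    // With no ready pods, the function should return nil.
    if selectMasterCandidate(pods, func(*corev1.Pod) bool { return false }) != nil {
        t.Fatal("expected nil when no pod is ready")
    }
}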

xuekat (Author) commented Mar 10, 2026:

> Thanks, but you still moved dfiStatus.Phase = PhaseReady to later,
>
> which is why I think you added changes in DfPodLifeCycleReconciler to handle a 'stuck' PhaseConfiguring.
>
> I still don't understand why this is needed for the purpose of this PR, and why we can't keep dfiStatus.Phase = PhaseReady in its original state.

right, it's not needed anymore, so I've moved it back to its original place, thanks.

xuekat requested a review from ashotland on March 10, 2026
ashotland (Contributor) left a comment:

thanks a lot, looks good now. please just update the branch and resolve conflicts.

xuekat requested a review from ashotland on March 16, 2026
Abhra303 (Contributor) left a comment:

LGTM, please check my comment. Should be ready to merge after this.

role, err := dfi.getRedisRole(ctx, master)
if err != nil {
    log.Info("failed to verify master status in redis (ignoring)", "error", err)
} else if role == resources.Replica {
Contributor:

Can you please use else if role != resources.Master? Some dragonfly versions return slave instead of replica as the role.

xuekat (Author) replied:

done
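For background, both strings come from the ROLE command: older engines answer "slave" where newer ones answer "replica", which is why role != resources.Master is the robust check. A minimal, self-contained sketch of such a role probe, assuming go-redis (getRole is illustrative; the operator's getRedisRole may be wired differently):

import (
    "context"
    "fmt"

    "github.com/redis/go-redis/v9"
)

// getRole asks the instance at addr for its replication role. The ROLE reply
// is an array whose first element is "master", "replica", or, on some
// versions, "slave".
func getRole(ctx context.Context, addr string) (string, error) {
    rc := redis.NewClient(&redis.Options{Addr: addr})
    defer rc.Close()

    reply, err := rc.Do(ctx, "role").Result()
    if err != nil {
        return "", err
    }
    parts, ok := reply.([]interface{})
    if !ok || len(parts) == 0 {
        return "", fmt.Errorf("unexpected ROLE reply: %v", reply)
    }
    role, ok := parts[0].(string)
    if !ok {
        return "", fmt.Errorf("unexpected ROLE reply element: %v", parts[0])
    }
    return role, nil
}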

xuekat requested a review from Abhra303 on March 17, 2026
ashotland merged commit 5b2a6df into dragonflydb:main on Mar 19, 2026. 2 checks passed.