feat(controller): implement traffic gating for zero-downtime failovers by jgautheron · Pull Request #455 · dragonflydb/dragonfly-operator

jgautheron · 2026-01-28T17:31:37Z

This change implements traffic gating to prevent READONLY errors during failovers and rolling updates by ensuring the service endpoints are updated before traffic is routed to the new master.

Key changes:

- Add traffic label (traffic=enabled/disabled) to control service routing
- Service selector now requires both role=master AND traffic=enabled
- Implement endpoint synchronization before/after role changes
- Disconnect clients from old master before promoting new master

This eliminates the race condition where clients could connect to a demoted master before the service endpoints were updated.
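
As a rough illustration of the two-label selector described above, here is a minimal sketch of equality-based label matching. The label keys and values come from this description; the Go identifiers and the `selectorMatches` helper are hypothetical and are not taken from the operator's source.

```go
package main

import "fmt"

// Label keys and values as described in the PR text. The Go constant names
// are illustrative and may differ from the operator's own constants.
const (
	roleLabelKey    = "role"
	roleMaster      = "master"
	trafficLabelKey = "traffic"
	trafficEnabled  = "enabled"
	trafficDisabled = "disabled"
)

// selectorMatches reports whether every key/value pair in selector is present
// in podLabels, which is how an equality-based Service selector picks pods.
func selectorMatches(selector, podLabels map[string]string) bool {
	for k, want := range selector {
		got, ok := podLabels[k]
		if !ok || got != want {
			return false
		}
	}
	return true
}

func main() {
	selector := map[string]string{roleLabelKey: roleMaster, trafficLabelKey: trafficEnabled}

	// A pod that still carries role=master but has been gated off.
	gated := map[string]string{roleLabelKey: roleMaster, trafficLabelKey: trafficDisabled}
	// A pod that is master and open for traffic.
	serving := map[string]string{roleLabelKey: roleMaster, trafficLabelKey: trafficEnabled}

	fmt.Println(selectorMatches(selector, gated))   // false
	fmt.Println(selectorMatches(selector, serving)) // true
}
```

With both labels in the selector, flipping traffic to disabled is enough to drop a pod from the Service even while its role label is unchanged.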

This change implements traffic gating to prevent READONLY errors during failovers and rolling updates by ensuring the service endpoints are updated before traffic is routed to the new master.

Key changes:

- Add traffic label (traffic=enabled/disabled) to control service routing
- Service selector now requires both role=master AND traffic=enabled
- Implement endpoint synchronization before/after role changes
- Disconnect clients from old master before promoting new master
- Add RBAC permissions for endpoints resource

New e2e tests:

- DF Failover Under Load: verifies write continuity during master failover
- DF Rolling Update Under Load: verifies write continuity during image updates
- DF Traffic Label Edge Cases: verifies correct labels on master/replicas
- DF Service Name Override: verifies traffic gating with custom service names

The implementation ensures that:

1. Traffic is disabled on old master before demotion
2. Service endpoints are verified to exclude old master
3. Clients are disconnected from old master
4. New master is promoted with traffic enabled
5. Service endpoints are verified to include new master
6. Only then is the old master pod deleted

This eliminates the race condition where clients could connect to a demoted master before the service endpoints were updated.
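
The six ordered steps can be sketched as a fixed sequence that stops at the first failure. This is only an illustration of the ordering: the `gate` interface, its method names, and the `recorder` fake are hypothetical and do not reflect the operator's actual functions.

```go
package main

import (
	"errors"
	"fmt"
)

// gate is a hypothetical interface; the method names are illustrative and do
// not come from the operator's source.
type gate interface {
	DisableTraffic(pod string) error
	WaitEndpointAbsent(pod string) error
	DisconnectClients(pod string) error
	PromoteWithTraffic(pod string) error
	WaitEndpointPresent(pod string) error
	DeletePod(pod string) error
}

// failover runs the six steps in order and stops at the first error, so a
// later step never runs unless every earlier step succeeded.
func failover(g gate, oldMaster, newMaster string) error {
	steps := []struct {
		name string
		run  func() error
	}{
		{"disable traffic on old master", func() error { return g.DisableTraffic(oldMaster) }},
		{"wait for old master to leave endpoints", func() error { return g.WaitEndpointAbsent(oldMaster) }},
		{"disconnect clients from old master", func() error { return g.DisconnectClients(oldMaster) }},
		{"promote new master with traffic enabled", func() error { return g.PromoteWithTraffic(newMaster) }},
		{"wait for new master to join endpoints", func() error { return g.WaitEndpointPresent(newMaster) }},
		{"delete old master pod", func() error { return g.DeletePod(oldMaster) }},
	}
	for _, s := range steps {
		if err := s.run(); err != nil {
			return fmt.Errorf("%s: %w", s.name, err)
		}
	}
	return nil
}

// recorder is a fake gate that logs each call and can fail one named call.
type recorder struct {
	calls  []string
	failOn string
}

func (r *recorder) record(call string) error {
	r.calls = append(r.calls, call)
	if call == r.failOn {
		return errors.New("injected failure")
	}
	return nil
}

func (r *recorder) DisableTraffic(p string) error      { return r.record("disable:" + p) }
func (r *recorder) WaitEndpointAbsent(p string) error  { return r.record("absent:" + p) }
func (r *recorder) DisconnectClients(p string) error   { return r.record("disconnect:" + p) }
func (r *recorder) PromoteWithTraffic(p string) error  { return r.record("promote:" + p) }
func (r *recorder) WaitEndpointPresent(p string) error { return r.record("present:" + p) }
func (r *recorder) DeletePod(p string) error           { return r.record("delete:" + p) }

func main() {
	r := &recorder{}
	if err := failover(r, "df-0", "df-1"); err != nil {
		fmt.Println("failover failed:", err)
		return
	}
	for i, c := range r.calls {
		fmt.Printf("%d. %s\n", i+1, c)
	}
}
```

Stopping on the first error means the old master pod is never deleted unless the new master has been observed in the endpoints.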

Copilot

Pull request overview

This PR implements traffic gating to prevent READONLY errors during Dragonfly failovers and rolling updates. The mechanism uses a new traffic label to control service routing, ensuring service endpoints are updated before traffic is routed to new masters.

Changes:

- Adds traffic label control (traffic=enabled/disabled) to manage service endpoint selection during role transitions
- Implements synchronous endpoint propagation checks before/after master promotions and demotions
- Disconnects clients from old masters before demotion to prevent READONLY errors during the failover window

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| internal/resources/const.go | Defines new traffic label constants (TrafficLabelKey, TrafficEnabled, TrafficDisabled) |
| internal/resources/resources.go | Adds traffic=enabled to service selector alongside role=master requirement |
| internal/controller/timeouts.go | Adds timeout constants for endpoint propagation (30s), connection drain (1s), and polling interval (500ms) |
| internal/controller/dragonfly_instance.go | Core traffic gating logic: disables traffic before demotion, waits for endpoint removal, enables traffic after promotion, waits for endpoint addition |
| internal/controller/dragonfly_controller.go | Adds RBAC permissions for reading endpoints |
| config/rbac/role.yaml | Grants get/list/watch permissions on endpoints resource |
| e2e/util.go | Adds test utilities for continuous write testing to measure READONLY errors during failovers |
| e2e/dragonfly_pod_lifecycle_controller_test.go | Comprehensive e2e tests validating traffic gating during failovers, rolling updates, and edge cases |
| e2e/dragonfly_controller_test.go | Updates test timeouts to account for endpoint propagation delays |

miledxz · 2026-02-05T13:15:13Z

@jgautheron Thank you for contributing :) Great work!

Please take a look at the Copilot comment.

I'm looking at your PR; CI passed, and I will be testing it a bit more.
Also, if you have ideas for how I could test the new feature, feel free to share.

Kind regards

Copilot

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.

Add endpoint transition tracking to verify old master endpoints are removed before (or concurrently with) the new master during failover and rolling updates. Harden rolling update checks by re-establishing port-forwards after rollout and adding per-op write timeouts to avoid hangs on stale connections.
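
One way to picture the endpoint-order check described here is a scan over endpoint snapshots recorded during the transition. This is a minimal sketch under the assumption that each snapshot is the list of IPs behind the Service at one observation; `firstOrderViolation` is a hypothetical helper, not the e2e test's actual code.

```go
package main

import "fmt"

// firstOrderViolation scans endpoint snapshots in observation order and
// returns the index of the first snapshot in which the old master IP is still
// present at or after the point where the new master IP has appeared.
// It returns -1 when the old master was removed before, or in the same
// transition as, the new master being added.
func firstOrderViolation(snapshots [][]string, oldIP, newIP string) int {
	newSeen := false
	for i, snap := range snapshots {
		hasOld, hasNew := false, false
		for _, ip := range snap {
			if ip == oldIP {
				hasOld = true
			}
			if ip == newIP {
				hasNew = true
			}
		}
		if hasNew {
			newSeen = true
		}
		if hasOld && newSeen {
			return i
		}
	}
	return -1
}

func main() {
	const oldIP, newIP = "10.0.0.1", "10.0.0.2"

	removedFirst := [][]string{{oldIP}, {}, {newIP}}
	concurrent := [][]string{{oldIP}, {newIP}}
	overlapping := [][]string{{oldIP}, {oldIP, newIP}, {newIP}}

	fmt.Println(firstOrderViolation(removedFirst, oldIP, newIP)) // -1
	fmt.Println(firstOrderViolation(concurrent, oldIP, newIP))   // -1
	fmt.Println(firstOrderViolation(overlapping, oldIP, newIP))  // 1
}
```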

jgautheron · 2026-02-05T17:59:10Z

Hey @miledxz, thanks, I'll iterate on these!
As for testing scenarios, here are two:

1. Failover path
   - Create a Dragonfly with 2–3 replicas.
   - Watch endpoints: kubectl get endpoints -w
   - Delete master pod: kubectl delete pod
   - Expected: endpoints drop the old master IP before it’s demoted; the new master IP appears before the old master pod is deleted.
2. Rolling update path
   - Change image tag on the CR.
   - Expected: endpoints always show only the active master IP; no traffic to replicas.

Both are covered by the tests in this PR.

Copilot

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 7 comments.

miledxz

Please take a look at the nit comment about the test, and also at CI, which is not passing at the moment.

Sorry for the late reply, @jgautheron

ashotland · 2026-02-11T20:04:28Z

Hi @jgautheron - apologies for jumping in late here.

I see we already have RoleLabelKey: Master for the service selector.

And during failover we first call deleteMasterRoleLabel before configureReplication.

For the REPLTAKEOVER case, if it succeeds the master shuts down, so no client can hit it and get a READONLY response.

Can you please explain the scenario in which you encounter READONLY errors?

Can we reproduce it?

jgautheron · 2026-02-12T17:04:16Z

@ashotland Hi! Thanks for checking.
The READONLY we saw is a transient routing window during failover/rollout, not just post-REPLTAKEOVER steady state.
Role changes and Service/Endpoints updates are asynchronous, so clients can briefly still hit a demoted/transitioning pod and get READONLY.
role=master alone didn’t fully gate traffic during that transition window under load.
This PR adds explicit traffic gating (traffic=enabled) and ordering so old master is drained from endpoints before demotion/deletion, and new master is enabled before cutover.
Yes, it is reproducible with continuous writes + master deletion/image rollout, and covered by the new e2e failover/rollout tests with endpoint-order assertions.

ashotland · 2026-02-12T19:14:29Z

Thanks @jgautheron, but I still fail to understand:

For the failover case we call deleteMasterRoleLabel, which removes the role label, before calling configureReplication, which calls replicaOfNoOne (where you added removal of the traffic=enabled label).

So now we remove two labels instead of one? How does that help?

For the takeover case (rolling update), the old master shuts down after a successful REPLTAKEOVER, so the pod can't respond with READONLY. And if REPLTAKEOVER fails, the old master should keep serving; removing it from endpoints preemptively would cause unnecessary downtime.

So I fail to see a scenario where the traffic label provides protection that the role label doesn't already provide.

I also ran the test in this PR with the current operator release version and got total=488 success=427 failed=61 readOnlyErrors=0

So there are no READONLY errors for the scenario in the test, even with the current operator release.

Am I missing anything?

jgautheron · 2026-02-13T20:47:25Z

I am closing this for now; I'm heading off on vacation and will first dogfood this internally to make sure it solves the issue we're facing.

jgautheron force-pushed the fix/traffic-gating-zero-downtime branch from 01f6073 to 32c1a43 Compare January 28, 2026 17:36

jgautheron force-pushed the fix/traffic-gating-zero-downtime branch from 32c1a43 to 247272b Compare January 28, 2026 17:38

miledxz requested a review from Copilot February 4, 2026 14:36

Copilot started reviewing on behalf of miledxz February 4, 2026 14:36

Copilot AI reviewed Feb 4, 2026

Comment thread internal/controller/dragonfly_instance.go

fix(controller): avoid pod update conflicts in replicaOf

838d530

Switch replica metadata updates to Patch after traffic gating to prevent resourceVersion conflicts, and add a focused unit test that simulates an Update conflict while verifying the patch-based flow succeeds.

miledxz requested a review from Copilot February 5, 2026 14:15

Copilot started reviewing on behalf of miledxz February 5, 2026 14:16

Copilot AI reviewed Feb 5, 2026

Comment thread internal/controller/dragonfly_instance.go

Comment thread internal/controller/dragonfly_instance.go Outdated

Comment thread internal/controller/dragonfly_instance.go Outdated

Comment thread internal/controller/dragonfly_instance.go Outdated

jgautheron added 2 commits February 5, 2026 18:51

fix(controller): patch pod label updates in promotions

b8e34e0

Use MergeFrom Patch instead of Update when promoting a pod to master in replicaOfNoOne and replTakeover to avoid resourceVersion conflicts under concurrent updates.
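
The difference between a full Update and a merge-style Patch can be modeled with a small in-memory store. This is a simplified sketch of optimistic concurrency, not the Kubernetes API or the controller-runtime client; the `store` and `pod` types and their methods are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

var errConflict = errors.New("conflict: the object has been modified")

// pod is a toy object with a version counter and labels.
type pod struct {
	resourceVersion int
	labels          map[string]string
}

// store is a toy stand-in for the API server holding a single pod.
type store struct{ current pod }

func copyLabels(in map[string]string) map[string]string {
	out := make(map[string]string, len(in))
	for k, v := range in {
		out[k] = v
	}
	return out
}

// get returns a snapshot the caller can modify locally.
func (s *store) get() pod {
	return pod{resourceVersion: s.current.resourceVersion, labels: copyLabels(s.current.labels)}
}

// update replaces the whole object and is rejected when the caller's copy is
// older than what the store holds.
func (s *store) update(p pod) error {
	if p.resourceVersion != s.current.resourceVersion {
		return errConflict
	}
	s.current = pod{resourceVersion: p.resourceVersion + 1, labels: copyLabels(p.labels)}
	return nil
}

// patchLabels merges only the given keys into the stored object without
// comparing versions, leaving other writers' changes in place.
func (s *store) patchLabels(delta map[string]string) {
	for k, v := range delta {
		s.current.labels[k] = v
	}
	s.current.resourceVersion++
}

func main() {
	s := &store{current: pod{resourceVersion: 1, labels: map[string]string{"role": "replica"}}}

	stale := s.get()                                        // controller reads the pod
	s.patchLabels(map[string]string{"traffic": "disabled"}) // another writer changes it
	stale.labels["role"] = "master"

	fmt.Println(s.update(stale)) // conflict: the object has been modified

	s.patchLabels(map[string]string{"role": "master"}) // patch only the changed key
	fmt.Println(s.current.labels["role"], s.current.labels["traffic"]) // master disabled
}
```

In this model the patch succeeds where the stale update fails, and the concurrent writer's traffic label survives because only the changed key is sent.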

fix(controller): avoid disabling traffic on replicas

774da96

Also wait for tiered entries to appear in the tiering e2e test to avoid racing asynchronous offloading.

miledxz requested a review from Copilot February 8, 2026 17:44

Copilot started reviewing on behalf of miledxz February 8, 2026 17:45

Copilot AI reviewed Feb 8, 2026

miledxz reviewed Feb 10, 2026

Comment thread e2e/dragonfly_pod_lifecycle_controller_test.go Outdated

fix(controller): make failover waits cancellation-safe

5dfcd58

Use context-aware drain and endpoint polling waits to avoid delayed cancellation, and harden test helpers by removing silent skips and deduplicating role-label waits.

jgautheron requested a review from miledxz February 11, 2026 18:01

jgautheron closed this Feb 13, 2026
