Skip to content

feat(controller): implement traffic gating for zero-downtime failovers#455

Closed
jgautheron wants to merge 6 commits intodragonflydb:mainfrom
jgautheron:fix/traffic-gating-zero-downtime
Closed

feat(controller): implement traffic gating for zero-downtime failovers#455
jgautheron wants to merge 6 commits intodragonflydb:mainfrom
jgautheron:fix/traffic-gating-zero-downtime

Conversation

@jgautheron
Copy link
Copy Markdown

This change implements traffic gating to prevent READONLY errors during failovers and rolling updates by ensuring the service endpoints are updated before traffic is routed to the new master.

Key changes:

  • Add traffic label (traffic=enabled/disabled) to control service routing
  • Service selector now requires both role=master AND traffic=enabled
  • Implement endpoint synchronization before/after role changes
  • Disconnect clients from old master before promoting new master

This eliminates the race condition where clients could connect to a demoted master before the service endpoints were updated.

@jgautheron jgautheron force-pushed the fix/traffic-gating-zero-downtime branch from 01f6073 to 32c1a43 Compare January 28, 2026 17:36
This change implements traffic gating to prevent READONLY errors during
failovers and rolling updates by ensuring the service endpoints are
updated before traffic is routed to the new master.

Key changes:
- Add traffic label (traffic=enabled/disabled) to control service routing
- Service selector now requires both role=master AND traffic=enabled
- Implement endpoint synchronization before/after role changes
- Disconnect clients from old master before promoting new master
- Add RBAC permissions for endpoints resource

New e2e tests:
- DF Failover Under Load: verifies write continuity during master failover
- DF Rolling Update Under Load: verifies write continuity during image updates
- DF Traffic Label Edge Cases: verifies correct labels on master/replicas
- DF Service Name Override: verifies traffic gating with custom service names

The implementation ensures that:
1. Traffic is disabled on old master before demotion
2. Service endpoints are verified to exclude old master
3. Clients are disconnected from old master
4. New master is promoted with traffic enabled
5. Service endpoints are verified to include new master
6. Only then is the old master pod deleted

This eliminates the race condition where clients could connect to a
demoted master before the service endpoints were updated.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR implements traffic gating to prevent READONLY errors during Dragonfly failovers and rolling updates. The mechanism uses a new traffic label to control service routing, ensuring service endpoints are updated before traffic is routed to new masters.

Changes:

  • Adds traffic label control (traffic=enabled/disabled) to manage service endpoint selection during role transitions
  • Implements synchronous endpoint propagation checks before/after master promotions and demotions
  • Disconnects clients from old masters before demotion to prevent READONLY errors during the failover window

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
internal/resources/const.go Defines new traffic label constants (TrafficLabelKey, TrafficEnabled, TrafficDisabled)
internal/resources/resources.go Adds traffic=enabled to service selector alongside role=master requirement
internal/controller/timeouts.go Adds timeout constants for endpoint propagation (30s), connection drain (1s), and polling interval (500ms)
internal/controller/dragonfly_instance.go Core traffic gating logic: disables traffic before demotion, waits for endpoint removal, enables traffic after promotion, waits for endpoint addition
internal/controller/dragonfly_controller.go Adds RBAC permissions for reading endpoints
config/rbac/role.yaml Grants get/list/watch permissions on endpoints resource
e2e/util.go Adds test utilities for continuous write testing to measure READONLY errors during failovers
e2e/dragonfly_pod_lifecycle_controller_test.go Comprehensive e2e tests validating traffic gating during failovers, rolling updates, and edge cases
e2e/dragonfly_controller_test.go Updates test timeouts to account for endpoint propagation delays

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread internal/controller/dragonfly_instance.go
@miledxz
Copy link
Copy Markdown
Contributor

miledxz commented Feb 5, 2026

@jgautheron Thank you for contributing :) Great work !

please take a look at copilot comment,

I'm looking at your PR, CI passed, and I will be testing it a little bit more,
also if you come up with idea how could I test new feature feel free to share,

kind regards

Switch replica metadata updates to Patch after traffic gating to prevent
resourceVersion conflicts, and add a focused unit test that simulates
an Update conflict while verifying the patch-based flow succeeds.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread internal/controller/dragonfly_instance.go
Comment thread internal/controller/dragonfly_instance.go Outdated
Comment thread internal/controller/dragonfly_instance.go Outdated
Comment thread internal/controller/dragonfly_instance.go Outdated
Add endpoint transition tracking to verify old master endpoints are
removed before (or concurrently with) the new master during failover
and rolling updates. Harden rolling update checks by re-establishing
port-forwards after rollout and adding per-op write timeouts to avoid
hangs on stale connections.
Use MergeFrom Patch instead of Update when promoting a pod to master
in replicaOfNoOne and replTakeover to avoid resourceVersion conflicts
under concurrent updates.
@jgautheron
Copy link
Copy Markdown
Author

Hey @miledxz, thanks I'll iterate over these!
About testing scenarios, here are 2:

  1. Failover path
    Create a Dragonfly with 2–3 replicas.
    Watch endpoints: kubectl get endpoints -w
    Delete master pod: kubectl delete pod
    Expected: endpoints drop old master IP before it’s demoted; new master IP appears before old master pod is deleted.
  2. Rolling update path
    Change image tag on the CR.
    Expected: endpoints always show only the active master IP; no traffic to replicas.

Both are covered with the tests in this PR.

Also wait for tiered entries to appear in the tiering e2e test to avoid racing
asynchronous offloading.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 7 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread internal/controller/dragonfly_instance_test.go Outdated
Comment thread internal/controller/dragonfly_instance_test.go
Comment thread internal/controller/dragonfly_instance_test.go
Comment thread internal/controller/dragonfly_instance.go
Comment thread internal/controller/dragonfly_instance.go
Comment thread internal/controller/dragonfly_instance.go
Comment thread internal/controller/dragonfly_instance.go Outdated
Copy link
Copy Markdown
Contributor

@miledxz miledxz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please take a look at nit comment about test, and also CI that atm does not pass

sorry for later reply @jgautheron

Comment thread e2e/dragonfly_pod_lifecycle_controller_test.go Outdated
Use context-aware drain and endpoint polling waits to avoid delayed cancellation,
and harden test helpers by removing silent skips and deduplicating role-label waits.
@jgautheron jgautheron requested a review from miledxz February 11, 2026 18:01
@ashotland
Copy link
Copy Markdown
Contributor

Hi @jgautheron - apologies for jumping late here

I see we already have RoleLabelKey: Master, for the service selector

And during failover we first call deleteMasterRoleLabel before configureReplication

For the repltakeover case, if successful master shuts down so no client can hit it and get READONLY response

Can you please explain the scenario in which you encounter READONLY errors

Can we reproduce it ?

@jgautheron
Copy link
Copy Markdown
Author

@ashotland Hi! Thanks for checking.
The READONLY we saw is a transient routing window during failover/rollout, not just post-REPLTAKEOVER steady state.
Role changes and Service/Endpoints updates are asynchronous, so clients can briefly still hit a demoted/transitioning pod and get READONLY.
role=master alone didn’t fully gate traffic during that transition window under load.
This PR adds explicit traffic gating (traffic=enabled) and ordering so old master is drained from endpoints before demotion/deletion, and new master is enabled before cutover.
Yes, it is reproducible with continuous writes + master deletion/image rollout, and covered by the new e2e failover/rollout tests with endpoint-order assertions.

@ashotland
Copy link
Copy Markdown
Contributor

@ashotland Hi! Thanks for checking. The READONLY we saw is a transient routing window during failover/rollout, not just post-REPLTAKEOVER steady state. Role changes and Service/Endpoints updates are asynchronous, so clients can briefly still hit a demoted/transitioning pod and get READONLY. role=master alone didn’t fully gate traffic during that transition window under load. This PR adds explicit traffic gating (traffic=enabled) and ordering so old master is drained from endpoints before demotion/deletion, and new master is enabled before cutover. Yes, it is reproducible with continuous writes + master deletion/image rollout, and covered by the new e2e failover/rollout tests with endpoint-order assertions.

Thanks @jgautheron, but I am still failing to understand:

For failover case we call deleteMasterRoleLabel which removes the role label before calling configureReplication which calls replicaOfNoOne (where you added removal of the traffic=enabled label)

so now we remove 2 labels instead of 1 ? how does that help ?

For the takeover case (rolling update), the old master shuts down after successful REPLTAKEOVER, so the pod can't respond with READONLY. And if REPLTAKEOVER fails, the old master should keep serving — removing it from endpoints preemptively would cause unnecessary downtime.

so I'm failing to see a scenario where the traffic label provides protection that the role label doesn't already provide

I also ran the test in this PR with current operator release version and got total=488 success=427 failed=61 readOnlyErrors=0

so no READONLY errors for scenario in the test even for current operator release

Am I missing anything ?

@jgautheron
Copy link
Copy Markdown
Author

I am closing this for now, heading to vacations and will be 1st dogfooding this internally to make sure it solves the issue we're facing.

@jgautheron jgautheron closed this Feb 13, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants