feat(controller): implement traffic gating for zero-downtime failovers#455
feat(controller): implement traffic gating for zero-downtime failovers#455jgautheron wants to merge 6 commits intodragonflydb:mainfrom
Conversation
01f6073 to
32c1a43
Compare
This change implements traffic gating to prevent READONLY errors during failovers and rolling updates by ensuring the service endpoints are updated before traffic is routed to the new master. Key changes: - Add traffic label (traffic=enabled/disabled) to control service routing - Service selector now requires both role=master AND traffic=enabled - Implement endpoint synchronization before/after role changes - Disconnect clients from old master before promoting new master - Add RBAC permissions for endpoints resource New e2e tests: - DF Failover Under Load: verifies write continuity during master failover - DF Rolling Update Under Load: verifies write continuity during image updates - DF Traffic Label Edge Cases: verifies correct labels on master/replicas - DF Service Name Override: verifies traffic gating with custom service names The implementation ensures that: 1. Traffic is disabled on old master before demotion 2. Service endpoints are verified to exclude old master 3. Clients are disconnected from old master 4. New master is promoted with traffic enabled 5. Service endpoints are verified to include new master 6. Only then is the old master pod deleted This eliminates the race condition where clients could connect to a demoted master before the service endpoints were updated.
32c1a43 to
247272b
Compare
There was a problem hiding this comment.
Pull request overview
This PR implements traffic gating to prevent READONLY errors during Dragonfly failovers and rolling updates. The mechanism uses a new traffic label to control service routing, ensuring service endpoints are updated before traffic is routed to new masters.
Changes:
- Adds traffic label control (traffic=enabled/disabled) to manage service endpoint selection during role transitions
- Implements synchronous endpoint propagation checks before/after master promotions and demotions
- Disconnects clients from old masters before demotion to prevent READONLY errors during the failover window
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| internal/resources/const.go | Defines new traffic label constants (TrafficLabelKey, TrafficEnabled, TrafficDisabled) |
| internal/resources/resources.go | Adds traffic=enabled to service selector alongside role=master requirement |
| internal/controller/timeouts.go | Adds timeout constants for endpoint propagation (30s), connection drain (1s), and polling interval (500ms) |
| internal/controller/dragonfly_instance.go | Core traffic gating logic: disables traffic before demotion, waits for endpoint removal, enables traffic after promotion, waits for endpoint addition |
| internal/controller/dragonfly_controller.go | Adds RBAC permissions for reading endpoints |
| config/rbac/role.yaml | Grants get/list/watch permissions on endpoints resource |
| e2e/util.go | Adds test utilities for continuous write testing to measure READONLY errors during failovers |
| e2e/dragonfly_pod_lifecycle_controller_test.go | Comprehensive e2e tests validating traffic gating during failovers, rolling updates, and edge cases |
| e2e/dragonfly_controller_test.go | Updates test timeouts to account for endpoint propagation delays |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
@jgautheron Thank you for contributing :) Great work ! please take a look at copilot comment, I'm looking at your PR, CI passed, and I will be testing it a little bit more, kind regards |
Switch replica metadata updates to Patch after traffic gating to prevent resourceVersion conflicts, and add a focused unit test that simulates an Update conflict while verifying the patch-based flow succeeds.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Add endpoint transition tracking to verify old master endpoints are removed before (or concurrently with) the new master during failover and rolling updates. Harden rolling update checks by re-establishing port-forwards after rollout and adding per-op write timeouts to avoid hangs on stale connections.
Use MergeFrom Patch instead of Update when promoting a pod to master in replicaOfNoOne and replTakeover to avoid resourceVersion conflicts under concurrent updates.
|
Hey @miledxz, thanks I'll iterate over these!
Both are covered with the tests in this PR. |
Also wait for tiered entries to appear in the tiering e2e test to avoid racing asynchronous offloading.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 7 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
please take a look at nit comment about test, and also CI that atm does not pass
sorry for later reply @jgautheron
Use context-aware drain and endpoint polling waits to avoid delayed cancellation, and harden test helpers by removing silent skips and deduplicating role-label waits.
|
Hi @jgautheron - apologies for jumping late here I see we already have And during failover we first call deleteMasterRoleLabel before configureReplication For the repltakeover case, if successful master shuts down so no client can hit it and get READONLY response Can you please explain the scenario in which you encounter READONLY errors Can we reproduce it ? |
|
@ashotland Hi! Thanks for checking. |
Thanks @jgautheron, but I am still failing to understand: For failover case we call deleteMasterRoleLabel which removes the role label before calling configureReplication which calls replicaOfNoOne (where you added removal of the traffic=enabled label) so now we remove 2 labels instead of 1 ? how does that help ? For the takeover case (rolling update), the old master shuts down after successful REPLTAKEOVER, so the pod can't respond with READONLY. And if REPLTAKEOVER fails, the old master should keep serving — removing it from endpoints preemptively would cause unnecessary downtime. so I'm failing to see a scenario where the traffic label provides protection that the role label doesn't already provide I also ran the test in this PR with current operator release version and got so no READONLY errors for scenario in the test even for current operator release Am I missing anything ? |
|
I am closing this for now, heading to vacations and will be 1st dogfooding this internally to make sure it solves the issue we're facing. |
This change implements traffic gating to prevent READONLY errors during failovers and rolling updates by ensuring the service endpoints are updated before traffic is routed to the new master.
Key changes:
This eliminates the race condition where clients could connect to a demoted master before the service endpoints were updated.