maintainer: quiesce control plane during remove handoff#4828
Conversation
Freeze ordinary scheduling once RemoveMaintainer starts so the old maintainer can only finish the DDL trigger close path. This avoids late heartbeat, barrier, node-change, and operator activity from recreating dispatchers after shutdown begins.
|
Skipping CI for Draft Pull Request. |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
📝 WalkthroughWalkthroughAdd an allowlist-based quiescing mode to the operator controller and orchestrate it from the maintainer controller; while removing, the maintainer suppresses node-change handling, heartbeat-driven self-healing, block-state requests, and legacy resend traffic, preserving only a DDL close path. ChangesMaintainer removal handoff quiescing
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces a quiescing mechanism for the maintainer and its operator controller to ensure a stable shutdown and handoff process. By entering a removing mode, the maintainer suppresses ordinary scheduling, self-healing, and legacy control-plane traffic while allowing critical DDL trigger operations to complete. The review feedback correctly identifies several opportunities to optimize locking efficiency in the OperatorController, specifically by consolidating quiescing checks and operator lookups into single lock blocks to reduce contention and redundant acquisitions in performance-critical paths.
| if !oc.isOperatorAllowed(id) { | ||
| return | ||
| } | ||
| oc.mu.RLock() | ||
| op, ok := oc.operators[id] | ||
| oc.mu.RUnlock() |
There was a problem hiding this comment.
The current implementation of UpdateOperatorStatus performs redundant locking by calling isOperatorAllowed (which acquires a read lock) followed by another manual read lock acquisition. This can be optimized by using isOperatorAllowedLocked within a single lock block.
| if !oc.isOperatorAllowed(id) { | |
| return | |
| } | |
| oc.mu.RLock() | |
| op, ok := oc.operators[id] | |
| oc.mu.RUnlock() | |
| oc.mu.RLock() | |
| if !oc.isOperatorAllowedLocked(id) { | |
| oc.mu.RUnlock() | |
| return | |
| } | |
| op, ok := oc.operators[id] | |
| oc.mu.RUnlock() |
| ops := oc.GetAllOperators() | ||
|
|
||
| for _, op := range ops { | ||
| if !oc.isOperatorAllowed(op.ID()) { | ||
| continue | ||
| } |
There was a problem hiding this comment.
The GetMinCheckpointTs function is in the hot path and currently suffers from significant lock contention. It first copies all operators into a slice (acquiring a lock) and then repeatedly acquires and releases a read lock for every single operator in the loop via isOperatorAllowed.
Iterating over the map directly while holding the lock once is much more efficient and avoids unnecessary allocations and lock bouncing.
func (oc *Controller) GetMinCheckpointTs(minCheckpointTs uint64) uint64 {
oc.mu.RLock()
defer oc.mu.RUnlock()
for id, op := range oc.operators {
if !oc.isOperatorAllowedLocked(id) {
continue
}| if !oc.isOperatorAllowed(op.ID()) { | ||
| log.Info("skip remove operator while controller is quiescing", | ||
| zap.String("role", oc.role), | ||
| zap.Stringer("changefeedID", oc.changefeedID), | ||
| zap.String("dispatcherID", op.ID().String()), | ||
| zap.String("operator", op.String())) | ||
| return | ||
| } | ||
| oc.mu.RLock() | ||
| old, ok := oc.operators[op.ID()] | ||
| oc.mu.RUnlock() |
There was a problem hiding this comment.
Similar to other methods in this controller, removeReplicaSet performs redundant locking. Consolidating the quiescing check and the operator lookup into a single lock block improves performance.
oc.mu.RLock()
if !oc.isOperatorAllowedLocked(op.ID()) {
oc.mu.RUnlock()
log.Info("skip remove operator while controller is quiescing",
zap.String("role", oc.role),
zap.Stringer("changefeedID", oc.changefeedID),
zap.String("dispatcherID", op.ID().String()),
zap.String("operator", op.String()))
return
}
old, ok := oc.operators[op.ID()]
oc.mu.RUnlock()…e-conflicts # Conflicts: # maintainer/maintainer_test.go # maintainer/operator/operator_controller_test.go
|
/test all |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
maintainer/operator/operator_controller_test.go (1)
1-32:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winFix formatting issues flagged by CI pipeline.
The pipeline reports
gofumportsformatting check failed for this file. As per coding guidelines, runmake fmtbefore pushing.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@maintainer/operator/operator_controller_test.go` around lines 1 - 32, The file in package "operator" has gofumports formatting issues; run the formatter (make fmt or gofumports -w) on this test file to fix import grouping/formatting and then stage the updated file; ensure the import block and any changed test files (e.g., operator_controller_test.go in package operator) are formatted and committed so CI's gofumports check passes.Sources: Coding guidelines, Linters/SAST tools
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@maintainer/operator/operator_controller_test.go`:
- Around line 1-32: The file in package "operator" has gofumports formatting
issues; run the formatter (make fmt or gofumports -w) on this test file to fix
import grouping/formatting and then stage the updated file; ensure the import
block and any changed test files (e.g., operator_controller_test.go in package
operator) are formatted and committed so CI's gofumports check passes.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: b3f69584-0b5d-4437-b27d-11bfe222269e
📒 Files selected for processing (5)
maintainer/maintainer.gomaintainer/maintainer_controller.gomaintainer/maintainer_test.gomaintainer/operator/operator_controller.gomaintainer/operator/operator_controller_test.go
|
/test all |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
maintainer/operator/operator_controller.go (1)
443-470:⚠️ Potential issue | 🟠 Major | ⚡ Quick winMake quiescing admission atomic with operator registration.
Line 444 checks the allowlist outside the critical section that mutates
operators/runningQueue. A goroutine can pass that check, thenQuiesceExceptflipsquiescingat Lines 101-108, and this method still inserts and starts a now-blocked operator. BecausecheckAffectedNodesimmediately callsOnNodeRemove, the old maintainer can still mutate scheduling state after handoff has started.Suggested fix
func (oc *Controller) pushOperator(op operator.Operator[common.DispatcherID, *heartbeatpb.TableSpanStatus]) bool { - if !oc.isOperatorAllowed(op.ID()) { - log.Info("skip operator while controller is quiescing", - zap.String("role", oc.role), - zap.Stringer("changefeedID", oc.changefeedID), - zap.String("dispatcherID", op.ID().String()), - zap.String("operator", op.String())) - return false - } log.Info("add operator to running queue", zap.String("role", oc.role), zap.Stringer("changefeedID", oc.changefeedID), zap.String("operator", op.String())) withTime := operator.NewOperatorWithTime(op, time.Now()) oc.mu.Lock() + if !oc.isOperatorAllowedLocked(op.ID()) { + oc.mu.Unlock() + log.Info("skip operator while controller is quiescing", + zap.String("role", oc.role), + zap.Stringer("changefeedID", oc.changefeedID), + zap.String("dispatcherID", op.ID().String()), + zap.String("operator", op.String())) + return false + } oc.operators[op.ID()] = withTime oc.mu.Unlock()🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@maintainer/operator/operator_controller.go` around lines 443 - 470, The admission check for quiescing must be done while holding the controller mutex so registration and queue insertion are atomic with the quiescing state; change pushOperator to acquire oc.mu, call isOperatorAllowed (or re-check the quiescing condition/QuiesceExcept) while holding oc.mu, and if allowed insert withTime into oc.operators and heap.Push(&oc.runningQueue) before unlocking; only after the operator is registered and the lock released call op.Start() and oc.checkAffectedNodes(op). This prevents a goroutine from passing the allow check, having QuiesceExcept flip the state, and then starting/ mutating scheduling state (via checkAffectedNodes/OnNodeRemove) after handoff.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@maintainer/operator/operator_controller.go`:
- Around line 443-470: The admission check for quiescing must be done while
holding the controller mutex so registration and queue insertion are atomic with
the quiescing state; change pushOperator to acquire oc.mu, call
isOperatorAllowed (or re-check the quiescing condition/QuiesceExcept) while
holding oc.mu, and if allowed insert withTime into oc.operators and
heap.Push(&oc.runningQueue) before unlocking; only after the operator is
registered and the lock released call op.Start() and oc.checkAffectedNodes(op).
This prevents a goroutine from passing the allow check, having QuiesceExcept
flip the state, and then starting/ mutating scheduling state (via
checkAffectedNodes/OnNodeRemove) after handoff.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 2e8d7db8-64fa-485d-9e7a-9c89c6c167bd
📒 Files selected for processing (3)
maintainer/maintainer_test.gomaintainer/operator/operator_controller.gomaintainer/operator/operator_controller_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- maintainer/operator/operator_controller_test.go
…failover-issue-20260415
|
/test all |
|
/test pull-cdc-storage-integration-light |
|
/test all |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: hongyunyan, lidezhu The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
[LGTM Timeline notifier]Timeline:
|
|
In response to a cherrypick label: new pull request created to branch |
What problem does this PR solve?
Issue Number: close #4827
What is changed and how it works?
This PR stops the old maintainer from continuing ordinary control-plane work after
RemoveMaintainerstarts.It does that by:
Validation:
go test ./maintainer/...Check List
Tests
Questions
Will it cause performance regression or break compatibility?
No. The change only tightens old-maintainer behavior during the remove handoff window and prevents stale control-plane work from continuing after shutdown starts.
Do you need to update user documentation, design documentation or monitoring documentation?
No.
Release note
Summary by CodeRabbit
Bug Fixes
Improvements
Tests