maintainer: quiesce control plane during remove handoff (#4828)#5403
maintainer: quiesce control plane during remove handoff (#4828)#5403ti-chi-bot wants to merge 1 commit into
Conversation
Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
|
This cherry pick PR is for a release branch and has not yet been approved by triage owners. To merge this cherry pick:
DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
@wlwilliamx This PR has conflicts, I have hold it. |
|
@ti-chi-bot: ## If you want to know how to resolve it, please read the guide in TiDB Dev Guide. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository. |
|
Important Review skippedAuto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request implements control plane quiescence during maintainer removal to prevent stale control-plane messages from racing with the new maintainer. It freezes ordinary scheduling on the old maintainer while keeping the DDL trigger dispatcher close path alive. A critical issue was identified in maintainer/maintainer_controller.go where unresolved git merge conflict markers were left in the code, which must be resolved to ensure successful compilation.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| <<<<<<< HEAD | ||
| ======= | ||
|
|
||
| if !allowSelfHealing { | ||
| continue | ||
| } | ||
|
|
||
| // Fallback: dispatcher becomes non-working without an operator. | ||
| // | ||
| // In normal scheduling flow, a dispatcher should transition to Stopped/Removed as part of a maintainer | ||
| // operator (Remove/Move/Split...). However, after maintainer failover we can lose operatorController state | ||
| // while dispatcher managers keep executing the already-issued requests. | ||
| // | ||
| // A real example is a "remove request in transit" during bootstrap: | ||
| // - Old maintainer sends a Remove (e.g. the remove-origin phase of Move), but the request hasn't reached | ||
| // dispatcher manager yet. | ||
| // - New maintainer bootstraps from dispatcher manager snapshots and sees the dispatcher as Working, with | ||
| // no in-flight operator reported in bootstrap response. | ||
| // - After bootstrap, the in-transit Remove arrives, the dispatcher is removed, and the new maintainer | ||
| // observes a terminal status without a corresponding operator. | ||
| // | ||
| // In these cases we'd observe a non-working status but have no operator to drive the follow-up | ||
| // rescheduling, so we mark the span absent to let the scheduler recreate it. | ||
| // | ||
| // Safety against message reordering/resend: | ||
| // - We only reach here when stm != nil and stm.GetNodeID() == from (checked above). If the span was already | ||
| // rebound to a different node, we skip it, so late statuses from the old node won't trigger rescheduling. | ||
| // - MarkSpanAbsent is idempotent and only affects the scheduler state, so even if we get duplicate terminal | ||
| // statuses, the worst case is an extra no-op absent mark. | ||
| if status.ComponentStatus == heartbeatpb.ComponentState_Stopped || | ||
| status.ComponentStatus == heartbeatpb.ComponentState_Removed { | ||
| if op := operatorController.GetOperator(dispatcherID); op == nil { | ||
| log.Warn("dispatcher becomes non-working without operator, mark span absent for rescheduling", | ||
| zap.String("changefeed", c.changefeedID.Name()), | ||
| zap.String("from", from.String()), | ||
| zap.String("dispatcherID", dispatcherID.String()), | ||
| zap.Any("status", status)) | ||
| spanController.MarkSpanAbsent(stm) | ||
| } | ||
| } | ||
| >>>>>>> 776315e72 (maintainer: quiesce control plane during remove handoff (#4828)) |
There was a problem hiding this comment.
There are unresolved git merge conflict markers (<<<<<<< HEAD, =======, >>>>>>> 776315e72) in this file. Please resolve the conflict and remove the markers to ensure the code compiles successfully.
if !allowSelfHealing {
continue
}
// Fallback: dispatcher becomes non-working without an operator.
//
// In normal scheduling flow, a dispatcher should transition to Stopped/Removed as part of a maintainer
// operator (Remove/Move/Split...). However, after maintainer failover we can lose operatorController state
// while dispatcher managers keep executing the already-issued requests.
//
// A real example is a "remove request in transit" during bootstrap:
// - Old maintainer sends a Remove (e.g. the remove-origin phase of Move), but the request hasn't reached
// dispatcher manager yet.
// - New maintainer bootstraps from dispatcher manager snapshots and sees the dispatcher as Working, with
// no in-flight operator reported in bootstrap response.
// - After bootstrap, the in-transit Remove arrives, the dispatcher is removed, and the new maintainer
// observes a terminal status without a corresponding operator.
//
// In these cases we'd observe a non-working status but have no operator to drive the follow-up
// rescheduling, so we mark the span absent to let the scheduler recreate it.
//
// Safety against message reordering/resend:
// - We only reach here when stm != nil and stm.GetNodeID() == from (checked above). If the span was already
// rebound to a different node, we skip it, so late statuses from the old node won't trigger rescheduling.
// - MarkSpanAbsent is idempotent and only affects the scheduler state, so even if we get duplicate terminal
// statuses, the worst case is an extra no-op absent mark.
if status.ComponentStatus == heartbeatpb.ComponentState_Stopped ||
status.ComponentStatus == heartbeatpb.ComponentState_Removed {
if op := operatorController.GetOperator(dispatcherID); op == nil {
log.Warn("dispatcher becomes non-working without operator, mark span absent for rescheduling",
zap.String("changefeed", c.changefeedID.Name()),
zap.String("from", from.String()),
zap.String("dispatcherID", dispatcherID.String()),
zap.Any("status", status))
spanController.MarkSpanAbsent(stm)
}
}
This is an automated cherry-pick of #4828
What problem does this PR solve?
Issue Number: close #4827
What is changed and how it works?
This PR stops the old maintainer from continuing ordinary control-plane work after
RemoveMaintainerstarts.It does that by:
Validation:
go test ./maintainer/...Check List
Tests
Questions
Will it cause performance regression or break compatibility?
No. The change only tightens old-maintainer behavior during the remove handoff window and prevents stale control-plane work from continuing after shutdown starts.
Do you need to update user documentation, design documentation or monitoring documentation?
No.
Release note
Summary by CodeRabbit
Bug Fixes
Improvements
Tests