
maintainer: quiesce control plane during remove handoff (#4828) #5403

Open
ti-chi-bot wants to merge 1 commit into pingcap:release-8.5 from ti-chi-bot:cherry-pick-4828-to-release-8.5

Conversation

@ti-chi-bot

Member

This is an automated cherry-pick of #4828

What problem does this PR solve?

Issue Number: close #4827

What is changed and how it works?

This PR stops the old maintainer from continuing ordinary control-plane work after RemoveMaintainer starts.

It does that by:

  • disabling heartbeat self-healing while the maintainer is removing;
  • suppressing node-change, block-status, and non-close resend handling during removing;
  • quiescing the operator controller so only the DDL trigger close path can keep running;
  • adding regression tests for removing-state heartbeat handling, removing-state resend suppression, and operator quiescing.
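The quiescing step above can be pictured with a minimal sketch. Everything here is hypothetical: `quiescingController`, `DispatcherID`, `Quiesce`, and `Submit` are illustrative names, not the TiCDC operator controller API. The sketch only assumes that, once quiesced, the controller rejects new operators unless their target is on an explicit allowlist (for example, the DDL trigger dispatcher).

```go
package main

import "fmt"

// DispatcherID is a hypothetical stand-in for the real dispatcher identifier type.
type DispatcherID string

// quiescingController is a hypothetical, simplified operator controller.
// It is not the TiCDC operatorController; it only illustrates the gating idea.
type quiescingController struct {
	quiesced bool
	allowed  map[DispatcherID]struct{} // dispatchers still actionable while quiesced
	pending  []DispatcherID            // accepted operators, in submission order
}

func newQuiescingController() *quiescingController {
	return &quiescingController{allowed: make(map[DispatcherID]struct{})}
}

// Quiesce freezes ordinary scheduling and keeps only the given dispatchers actionable.
func (c *quiescingController) Quiesce(allow ...DispatcherID) {
	c.quiesced = true
	for _, id := range allow {
		c.allowed[id] = struct{}{}
	}
}

// Submit accepts an operator unless the controller is quiesced
// and the target dispatcher is not on the allowlist.
func (c *quiescingController) Submit(id DispatcherID) bool {
	if c.quiesced {
		if _, ok := c.allowed[id]; !ok {
			return false
		}
	}
	c.pending = append(c.pending, id)
	return true
}

func main() {
	c := newQuiescingController()
	fmt.Println(c.Submit("table-1")) // true: normal scheduling before removal

	c.Quiesce("ddl-trigger")
	fmt.Println(c.Submit("table-2"))     // false: ordinary work is frozen
	fmt.Println(c.Submit("ddl-trigger")) // true: the allowlisted close path still runs
}
```

Gating at submission time, rather than cancelling operators afterwards, keeps the old maintainer from emitting any new control-plane work once removal has started.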

Validation:

  • go test ./maintainer/...

Check List

Tests

  • Unit test
  • Manual test

Questions

Will it cause performance regression or break compatibility?

No. The change only tightens old-maintainer behavior during the remove handoff window and prevents stale control-plane work from continuing after shutdown starts.

Do you need to update user documentation, design documentation or monitoring documentation?

No.

Release note

Fix a bug where a removing TiCDC maintainer could still reschedule or recreate dispatchers after shutdown handoff started.

Summary by CodeRabbit

  • Bug Fixes

    • Suppressed failover/self-healing and legacy control-plane traffic during maintainer removal; block-status requests and redundant bootstrap/barrier/control resends are ignored while removing. Resend now only sends maintainer-close when cascade removal is enabled.
  • Improvements

    • Added a quiesce/allowlist mode that freezes normal operator scheduling while permitting only explicitly allowed dispatcher actions during handoff.
    • Checkpoint calculation now respects in-flight operators when removal begins.
  • Tests

    • Added tests covering removing-mode behavior, quiesce handoff, and checkpoint constraints.
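One possible reading of the resend rule summarized above, as a rough sketch: while removing, all pending resends are dropped except maintainer-close, and close is resent only when cascade removal is enabled. The `msgKind` type, its constants, and `filterResend` are hypothetical names chosen for illustration and do not come from the PR.

```go
package main

import "fmt"

// msgKind is a hypothetical classification of control-plane messages.
type msgKind int

const (
	kindBootstrap msgKind = iota
	kindBarrier
	kindClose
)

// filterResend returns the messages that may still be resent.
// Assumption: outside removal everything passes; during removal only
// close messages pass, and only if cascade removal is enabled.
func filterResend(pending []msgKind, removing, cascade bool) []msgKind {
	if !removing {
		return pending
	}
	var out []msgKind
	if !cascade {
		return out
	}
	for _, k := range pending {
		if k == kindClose {
			out = append(out, k)
		}
	}
	return out
}

func main() {
	pending := []msgKind{kindBootstrap, kindBarrier, kindClose}
	fmt.Println(len(filterResend(pending, false, false))) // 3
	fmt.Println(len(filterResend(pending, true, false)))  // 0
	fmt.Println(len(filterResend(pending, true, true)))   // 1
}
```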

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
@ti-chi-bot ti-chi-bot added the do-not-merge/hold, lgtm, release-note, size/XXL, and type/cherry-pick-for-release-8.5 labels Jun 15, 2026
@ti-chi-bot

ti-chi-bot Bot commented Jun 15, 2026


This cherry pick PR is for a release branch and has not yet been approved by triage owners.
Adding the do-not-merge/cherry-pick-not-approved label.

To merge this cherry pick:

  1. It must first be LGTMed and approved by the reviewers.
  2. For pull requests to TiDB-x branches, it must have no failed tests.
  3. AFTER it has lgtm and approved labels, please wait for the cherry-pick merging approval from triage owners.
Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@ti-chi-bot

ti-chi-bot Bot commented Jun 15, 2026


[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign kennytm for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot

Member Author

@wlwilliamx This PR has conflicts, so I have put it on hold.
Please resolve them or ask others to resolve them, then comment /unhold to remove the hold label.

@ti-chi-bot

ti-chi-bot Bot commented Jun 15, 2026


@ti-chi-bot: If you want to know how to resolve it, please read the guide in TiDB Dev Guide.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@coderabbitai

coderabbitai Bot commented Jun 15, 2026

Contributor

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 30ac96f3-fff6-4206-80d3-b19f5d8074a0

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request implements control plane quiescence during maintainer removal to prevent stale control-plane messages from racing with the new maintainer. It freezes ordinary scheduling on the old maintainer while keeping the DDL trigger dispatcher close path alive. A critical issue was identified in maintainer/maintainer_controller.go where unresolved git merge conflict markers were left in the code, which must be resolved to ensure successful compilation.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +221 to +261
<<<<<<< HEAD
=======

		if !allowSelfHealing {
			continue
		}

		// Fallback: dispatcher becomes non-working without an operator.
		//
		// In normal scheduling flow, a dispatcher should transition to Stopped/Removed as part of a maintainer
		// operator (Remove/Move/Split...). However, after maintainer failover we can lose operatorController state
		// while dispatcher managers keep executing the already-issued requests.
		//
		// A real example is a "remove request in transit" during bootstrap:
		// - Old maintainer sends a Remove (e.g. the remove-origin phase of Move), but the request hasn't reached
		//   dispatcher manager yet.
		// - New maintainer bootstraps from dispatcher manager snapshots and sees the dispatcher as Working, with
		//   no in-flight operator reported in bootstrap response.
		// - After bootstrap, the in-transit Remove arrives, the dispatcher is removed, and the new maintainer
		//   observes a terminal status without a corresponding operator.
		//
		// In these cases we'd observe a non-working status but have no operator to drive the follow-up
		// rescheduling, so we mark the span absent to let the scheduler recreate it.
		//
		// Safety against message reordering/resend:
		// - We only reach here when stm != nil and stm.GetNodeID() == from (checked above). If the span was already
		//   rebound to a different node, we skip it, so late statuses from the old node won't trigger rescheduling.
		// - MarkSpanAbsent is idempotent and only affects the scheduler state, so even if we get duplicate terminal
		//   statuses, the worst case is an extra no-op absent mark.
		if status.ComponentStatus == heartbeatpb.ComponentState_Stopped ||
			status.ComponentStatus == heartbeatpb.ComponentState_Removed {
			if op := operatorController.GetOperator(dispatcherID); op == nil {
				log.Warn("dispatcher becomes non-working without operator, mark span absent for rescheduling",
					zap.String("changefeed", c.changefeedID.Name()),
					zap.String("from", from.String()),
					zap.String("dispatcherID", dispatcherID.String()),
					zap.Any("status", status))
				spanController.MarkSpanAbsent(stm)
			}
		}
>>>>>>> 776315e72 (maintainer: quiesce control plane during remove handoff (#4828))


critical

There are unresolved git merge conflict markers (<<<<<<< HEAD, =======, >>>>>>> 776315e72) in this file. Please resolve the conflict and remove the markers to ensure the code compiles successfully.

		if !allowSelfHealing {
			continue
		}

		// Fallback: dispatcher becomes non-working without an operator.
		//
		// In normal scheduling flow, a dispatcher should transition to Stopped/Removed as part of a maintainer
		// operator (Remove/Move/Split...). However, after maintainer failover we can lose operatorController state
		// while dispatcher managers keep executing the already-issued requests.
		//
		// A real example is a "remove request in transit" during bootstrap:
		// - Old maintainer sends a Remove (e.g. the remove-origin phase of Move), but the request hasn't reached
		//   dispatcher manager yet.
		// - New maintainer bootstraps from dispatcher manager snapshots and sees the dispatcher as Working, with
		//   no in-flight operator reported in bootstrap response.
		// - After bootstrap, the in-transit Remove arrives, the dispatcher is removed, and the new maintainer
		//   observes a terminal status without a corresponding operator.
		//
		// In these cases we'd observe a non-working status but have no operator to drive the follow-up
		// rescheduling, so we mark the span absent to let the scheduler recreate it.
		//
		// Safety against message reordering/resend:
		// - We only reach here when stm != nil and stm.GetNodeID() == from (checked above). If the span was already
		//   rebound to a different node, we skip it, so late statuses from the old node won't trigger rescheduling.
		// - MarkSpanAbsent is idempotent and only affects the scheduler state, so even if we get duplicate terminal
		//   statuses, the worst case is an extra no-op absent mark.
		if status.ComponentStatus == heartbeatpb.ComponentState_Stopped ||
			status.ComponentStatus == heartbeatpb.ComponentState_Removed {
			if op := operatorController.GetOperator(dispatcherID); op == nil {
				log.Warn("dispatcher becomes non-working without operator, mark span absent for rescheduling",
					zap.String("changefeed", c.changefeedID.Name()),
					zap.String("from", from.String()),
					zap.String("dispatcherID", dispatcherID.String()),
					zap.Any("status", status))
				spanController.MarkSpanAbsent(stm)
			}
		}
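
Leftover markers like the ones flagged in this review can be caught mechanically before a cherry-pick is pushed. The following is a minimal sketch, not part of this PR or of the bot tooling; `findConflictMarkers` is a hypothetical helper that reports lines beginning with the standard git conflict markers.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// findConflictMarkers returns the 1-based line numbers of lines that
// look like git conflict markers ("<<<<<<< ", "=======", ">>>>>>> ").
func findConflictMarkers(src string) []int {
	var hits []int
	sc := bufio.NewScanner(strings.NewReader(src))
	line := 0
	for sc.Scan() {
		line++
		text := sc.Text()
		if strings.HasPrefix(text, "<<<<<<< ") ||
			text == "=======" ||
			strings.HasPrefix(text, ">>>>>>> ") {
			hits = append(hits, line)
		}
	}
	return hits
}

func main() {
	src := "<<<<<<< HEAD\n=======\nif !allowSelfHealing {\n\tcontinue\n}\n>>>>>>> 776315e72\n"
	fmt.Println(findConflictMarkers(src)) // [1 2 6]
}
```

A bare `=======` line can also appear in legitimate text files, so a check like this is best limited to source files such as `.go`.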


Labels

  • do-not-merge/cherry-pick-not-approved
  • do-not-merge/hold: Indicates that a PR should not merge because someone has issued a /hold command.
  • lgtm
  • release-note: Denotes a PR that will be considered when it comes time to generate release notes.
  • size/XXL: Denotes a PR that changes 1000+ lines, ignoring generated files.
  • type/cherry-pick-for-release-8.5: This PR is cherry-picked to release-8.5 from a source PR.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants