
maintainer: quiesce control plane during remove handoff #4828

Merged
ti-chi-bot[bot] merged 5 commits into pingcap:master from wlwilliamx:codex/maintainer-failover-issue-20260415 on Jun 15, 2026

@wlwilliamx wlwilliamx commented Apr 15, 2026

Collaborator

What problem does this PR solve?

Issue Number: close #4827

What is changed and how it works?

This PR stops the old maintainer from continuing ordinary control-plane work after RemoveMaintainer starts.

It does that by:

  • disabling heartbeat self-healing while the maintainer is removing;
  • suppressing node-change, block-status, and non-close resend handling during removing;
  • quiescing the operator controller so only the DDL trigger close path can keep running;
  • adding regression tests for removing-state heartbeat handling, removing-state resend suppression, and operator quiescing.
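The allowlist idea in the list above can be sketched as follows. This is a minimal illustration, not the PR's code: `DispatcherID`, the `Controller` fields, and `IsAllowed` are hypothetical stand-ins, and only the name `QuiesceExcept` is taken from this PR.

```go
package main

import (
	"fmt"
	"sync"
)

// DispatcherID is a hypothetical stand-in for the real dispatcher identifier type.
type DispatcherID string

// Controller is a simplified, hypothetical operator controller.
type Controller struct {
	mu        sync.RWMutex
	quiescing bool
	allowed   map[DispatcherID]struct{}
}

// QuiesceExcept freezes the controller; only IDs on the allowlist stay admissible.
func (c *Controller) QuiesceExcept(ids ...DispatcherID) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.quiescing = true
	c.allowed = make(map[DispatcherID]struct{}, len(ids))
	for _, id := range ids {
		c.allowed[id] = struct{}{}
	}
}

// IsAllowed reports whether an operator for id may still run.
func (c *Controller) IsAllowed(id DispatcherID) bool {
	c.mu.RLock()
	defer c.mu.RUnlock()
	if !c.quiescing {
		return true // normal mode: every operator is admitted
	}
	_, ok := c.allowed[id]
	return ok
}

func main() {
	c := &Controller{}
	fmt.Println(c.IsAllowed("table-1"))
	c.QuiesceExcept("ddl-trigger")
	fmt.Println(c.IsAllowed("table-1"), c.IsAllowed("ddl-trigger"))
}
```

Once `QuiesceExcept` has run, an ordinary table dispatcher is rejected while the allowlisted DDL trigger dispatcher is still admitted.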

Validation:

  • go test ./maintainer/...

Check List

Tests

  • Unit test
  • Manual test

Questions

Will it cause performance regression or break compatibility?

No. The change only tightens old-maintainer behavior during the remove handoff window and prevents stale control-plane work from continuing after shutdown starts.

Do you need to update user documentation, design documentation or monitoring documentation?

No.

Release note

Fix a bug where a removing TiCDC maintainer could still reschedule or recreate dispatchers after shutdown handoff started.

Summary by CodeRabbit

  • Bug Fixes

    • Suppressed failover/self-healing and legacy control-plane traffic during maintainer removal; block-status requests and redundant bootstrap/barrier/control resends are ignored while removing. Resend now only sends maintainer-close when cascade removal is enabled.
  • Improvements

    • Added a quiesce/allowlist mode that freezes normal operator scheduling while permitting only explicitly allowed dispatcher actions during handoff.
    • Checkpoint calculation now respects in-flight operators when removal begins.
  • Tests

    • Added tests covering removing-mode behavior, quiesce handoff, and checkpoint constraints.

Freeze ordinary scheduling once RemoveMaintainer starts so the old
maintainer can only finish the DDL trigger close path. This avoids
late heartbeat, barrier, node-change, and operator activity from
recreating dispatchers after shutdown begins.
@ti-chi-bot

ti-chi-bot Bot commented Apr 15, 2026


Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@ti-chi-bot ti-chi-bot Bot added do-not-merge/needs-triage-completed release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. labels Apr 15, 2026
@coderabbitai

coderabbitai Bot commented Apr 15, 2026

Contributor


No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ffdb9d42-8480-48d5-bdd1-709815d43bf7

📥 Commits

Reviewing files that changed from the base of the PR and between 5768c66 and af01e72.

📒 Files selected for processing (2)
  • maintainer/operator/operator_controller.go
  • maintainer/operator/operator_controller_test.go

📝 Walkthrough

Add an allowlist-based quiescing mode to the operator controller and orchestrate it from the maintainer controller; while removing, the maintainer suppresses node-change handling, heartbeat-driven self-healing, block-state requests, and legacy resend traffic, preserving only a DDL close path.
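A rough sketch of the maintainer-side gating this walkthrough describes, under stated assumptions: `EventKind`, `Maintainer`, `Handle`, and the returned strings are hypothetical, `EnterRemovingMode` is placed on this toy type for brevity (the walkthrough puts it on the maintainer controller), and the cascade-removal condition on resends is deliberately not modelled.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// EventKind is a hypothetical enumeration of control-plane inputs.
type EventKind int

const (
	NodeChange EventKind = iota
	Heartbeat
	BlockStatus
	Resend
)

// Maintainer is a simplified, hypothetical model of the old maintainer.
type Maintainer struct {
	removing atomic.Bool
}

// EnterRemovingMode marks the start of the remove handoff.
func (m *Maintainer) EnterRemovingMode() {
	m.removing.Store(true)
}

// Handle describes what happens to an event in the current mode.
func (m *Maintainer) Handle(kind EventKind) string {
	if !m.removing.Load() {
		return "handled normally"
	}
	switch kind {
	case Heartbeat:
		// Status is still recorded, but no failover or self-healing is triggered.
		return "status applied, self-healing skipped"
	case Resend:
		// Bootstrap, barrier, and other resends are dropped.
		return "only maintainer-close resent"
	default:
		// NodeChange and BlockStatus are short-circuited while removing.
		return "ignored"
	}
}

func main() {
	m := &Maintainer{}
	fmt.Println(m.Handle(Heartbeat))
	m.EnterRemovingMode()
	fmt.Println(m.Handle(Heartbeat), "/", m.Handle(NodeChange))
}
```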

Changes

Maintainer removal handoff quiescing

Layer / File(s) Summary
Operator controller quiescing state and initialization
maintainer/operator/operator_controller.go
Operator controller adds quiescing flag, admissionMu, and allowedOperatorIDs map; introduces QuiesceExcept and helpers to check allowlist membership.
Operator execution and admission gating under quiescing
maintainer/operator/operator_controller.go
Execute -> scheduleOperator, AddOperator, UpdateOperatorStatus, OnNodeRemoved, pollQueueingOperator, removeReplicaSet, and pushOperator enforce the quiescing allowlist and admission semantics; pushOperator now returns bool and delegates to pushOperatorWithAdmission.
Maintainer controller removal mode orchestration
maintainer/maintainer_controller.go
Adds private handleStatus(..., allowSelfHealing) and public EnterRemovingMode; EnterRemovingMode calls QuiesceExcept on main (and redo) operator controllers to quiesce all dispatchers except an allowlist.
Maintainer control-plane path suppression during removal
maintainer/maintainer.go
Short-circuit node-change handling when removing, route heartbeats to handleStatus(..., false) during removal, return early on block-state requests while removing, and limit handleResendMessage to maintainer-close when not cascadeRemoving.
Maintainer removal behavior tests
maintainer/maintainer_test.go
Adds tests verifying heartbeats do not trigger failover/self-healing during removal, legacy resend/block-state traffic is suppressed, and checkpoint calculation respects in-flight Add operator constraints when removing.
Operator controller quiescing tests
maintainer/operator/operator_controller_test.go
Adds test operators and concurrency tests that validate quiescing waits for in-flight Schedule/Start, allows only allowed dispatcher operators to run, retains blocked operators for checkpoint safety, and rejects new blocked operators during quiesce.
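One plausible way for quiescing to wait for an in-flight Schedule/Start, sketched under assumptions: the summary only names `admissionMu`, so how it is used here (shared lock around admit-and-start, exclusive lock in `QuiesceExcept`) is a guess, and `Admit` and the string IDs are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// Controller is a hypothetical, simplified operator controller.
type Controller struct {
	// admissionMu is held shared while an operator is admitted and started,
	// and exclusively while the controller flips into quiescing mode.
	admissionMu sync.RWMutex
	quiescing   bool
	allowed     map[string]bool
}

func NewController() *Controller {
	return &Controller{allowed: map[string]bool{}}
}

// Admit runs start for id unless the controller is quiescing and id is blocked.
func (c *Controller) Admit(id string, start func()) bool {
	c.admissionMu.RLock()
	defer c.admissionMu.RUnlock()
	if c.quiescing && !c.allowed[id] {
		return false
	}
	start() // still under the shared lock, so QuiesceExcept must wait for it
	return true
}

// QuiesceExcept blocks until every in-flight Admit has returned, then freezes.
func (c *Controller) QuiesceExcept(ids ...string) {
	c.admissionMu.Lock()
	defer c.admissionMu.Unlock()
	c.quiescing = true
	for _, id := range ids {
		c.allowed[id] = true
	}
}

func main() {
	c := NewController()
	started, release := make(chan struct{}), make(chan struct{})
	go c.Admit("t1", func() { close(started); <-release })
	<-started // an admission is now in flight

	quiesced := make(chan struct{})
	go func() { c.QuiesceExcept("ddl"); close(quiesced) }()

	close(release) // let the in-flight start finish
	<-quiesced     // QuiesceExcept can only return after that
	fmt.Println(c.Admit("t1", func() {}), c.Admit("ddl", func() {}))
}
```

The trade-off in this sketch is that a slow `start` delays `QuiesceExcept`, in exchange for the guarantee that no blocked operator is mid-start once quiescing is in effect.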

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested labels

lgtm, approved

Suggested reviewers

  • lidezhu
  • wk989898
  • hongyunyan

Poem

🐰 I still keep the DDL gate ajar,
hush heartbeats as the old maintainer spars.
No new dispatchers leap or roam,
I fold the handoff, send the close request home.
Quiet paws—handoff completes, spring onward, new comb.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name | Status | Explanation | Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 5.88% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name | Status | Explanation
Title check | ✅ Passed | The title 'maintainer: quiesce control plane during remove handoff' clearly and concisely summarizes the main change: quiescing the control plane during the maintainer removal handoff.
Description check | ✅ Passed | The PR description follows the template structure, includes the linked issue (#4827), explains the changes and how they work, covers test types, addresses questions, and provides a release note.
Linked Issues check | ✅ Passed | The PR addresses all requirements from issue #4827: disables heartbeat self-healing during removal, suppresses node-change/block-status/resend handling, quiesces the operator controller to allow only DDL close path, and includes regression tests.
Out of Scope Changes check | ✅ Passed | All changes are directly scoped to the removal handoff problem: adding quiesce modes, self-healing suppression, resend suppression, and comprehensive tests validating the removing-state behavior.


@ti-chi-bot ti-chi-bot Bot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Apr 15, 2026

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a quiescing mechanism for the maintainer and its operator controller to ensure a stable shutdown and handoff process. By entering a removing mode, the maintainer suppresses ordinary scheduling, self-healing, and legacy control-plane traffic while allowing critical DDL trigger operations to complete. The review feedback correctly identifies several opportunities to optimize locking efficiency in the OperatorController, specifically by consolidating quiescing checks and operator lookups into single lock blocks to reduce contention and redundant acquisitions in performance-critical paths.

Comment on lines +223 to 228
if !oc.isOperatorAllowed(id) {
return
}
oc.mu.RLock()
op, ok := oc.operators[id]
oc.mu.RUnlock()


medium

The current implementation of UpdateOperatorStatus performs redundant locking by calling isOperatorAllowed (which acquires a read lock) followed by another manual read lock acquisition. This can be optimized by using isOperatorAllowedLocked within a single lock block.

Suggested change
-if !oc.isOperatorAllowed(id) {
-	return
-}
-oc.mu.RLock()
-op, ok := oc.operators[id]
-oc.mu.RUnlock()
+oc.mu.RLock()
+if !oc.isOperatorAllowedLocked(id) {
+	oc.mu.RUnlock()
+	return
+}
+op, ok := oc.operators[id]
+oc.mu.RUnlock()

Comment on lines +276 to +281
ops := oc.GetAllOperators()

for _, op := range ops {
if !oc.isOperatorAllowed(op.ID()) {
continue
}


medium

The GetMinCheckpointTs function is in the hot path and currently suffers from significant lock contention. It first copies all operators into a slice (acquiring a lock) and then repeatedly acquires and releases a read lock for every single operator in the loop via isOperatorAllowed.

Iterating over the map directly while holding the lock once is much more efficient and avoids unnecessary allocations and lock bouncing.

func (oc *Controller) GetMinCheckpointTs(minCheckpointTs uint64) uint64 {
	oc.mu.RLock()
	defer oc.mu.RUnlock()

	for id, op := range oc.operators {
		if !oc.isOperatorAllowedLocked(id) {
			continue
		}
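A self-contained sketch of the single-lock pattern suggested here, under simplifying assumptions: `operators` maps a string ID straight to a checkpoint value, which is not how the real controller stores operators, and `MinCheckpointTs` and `isAllowedLocked` are hypothetical names. It mirrors the skip-if-not-allowed shape of the snippet above and does not claim to match the merged code.

```go
package main

import (
	"fmt"
	"sync"
)

// Controller is a hypothetical, simplified operator controller.
type Controller struct {
	mu        sync.RWMutex
	quiescing bool
	allowed   map[string]struct{}
	// operators maps a dispatcher ID to the checkpoint its operator pins.
	operators map[string]uint64
}

// isAllowedLocked must be called with c.mu held (read or write).
func (c *Controller) isAllowedLocked(id string) bool {
	if !c.quiescing {
		return true
	}
	_, ok := c.allowed[id]
	return ok
}

// MinCheckpointTs lowers bound to the smallest checkpoint among admitted
// operators, taking the read lock once instead of once per operator.
func (c *Controller) MinCheckpointTs(bound uint64) uint64 {
	c.mu.RLock()
	defer c.mu.RUnlock()
	for id, ts := range c.operators {
		if !c.isAllowedLocked(id) {
			continue
		}
		if ts < bound {
			bound = ts
		}
	}
	return bound
}

func main() {
	c := &Controller{
		quiescing: true,
		allowed:   map[string]struct{}{"ddl": {}},
		operators: map[string]uint64{"ddl": 90, "t1": 50},
	}
	fmt.Println(c.MinCheckpointTs(100)) // t1 is skipped, so this prints 90
}
```

Iterating the map under one `RLock` avoids both the slice copy and the per-operator lock acquisition, at the cost of holding the read lock for the whole loop.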

Comment on lines +400 to 410
if !oc.isOperatorAllowed(op.ID()) {
log.Info("skip remove operator while controller is quiescing",
zap.String("role", oc.role),
zap.Stringer("changefeedID", oc.changefeedID),
zap.String("dispatcherID", op.ID().String()),
zap.String("operator", op.String()))
return
}
oc.mu.RLock()
old, ok := oc.operators[op.ID()]
oc.mu.RUnlock()


medium

Similar to other methods in this controller, removeReplicaSet performs redundant locking. Consolidating the quiescing check and the operator lookup into a single lock block improves performance.

	oc.mu.RLock()
	if !oc.isOperatorAllowedLocked(op.ID()) {
		oc.mu.RUnlock()
		log.Info("skip remove operator while controller is quiescing",
			zap.String("role", oc.role),
			zap.Stringer("changefeedID", oc.changefeedID),
			zap.String("dispatcherID", op.ID().String()),
			zap.String("operator", op.String()))
		return
	}
	old, ok := oc.operators[op.ID()]
	oc.mu.RUnlock()

…e-conflicts

# Conflicts:
#	maintainer/maintainer_test.go
#	maintainer/operator/operator_controller_test.go
@wlwilliamx wlwilliamx marked this pull request as ready for review June 9, 2026 02:36
@ti-chi-bot ti-chi-bot Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 9, 2026
@wlwilliamx

Collaborator Author

/test all

@coderabbitai coderabbitai Bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
maintainer/operator/operator_controller_test.go (1)

1-32: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix formatting issues flagged by CI pipeline.

The pipeline reports gofumports formatting check failed for this file. As per coding guidelines, run make fmt before pushing.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@maintainer/operator/operator_controller_test.go` around lines 1 - 32, The
file in package "operator" has gofumports formatting issues; run the formatter
(make fmt or gofumports -w) on this test file to fix import grouping/formatting
and then stage the updated file; ensure the import block and any changed test
files (e.g., operator_controller_test.go in package operator) are formatted and
committed so CI's gofumports check passes.

Sources: Coding guidelines, Linters/SAST tools


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: b3f69584-0b5d-4437-b27d-11bfe222269e

📥 Commits

Reviewing files that changed from the base of the PR and between 6cd6024 and bcc40a6.

📒 Files selected for processing (5)
  • maintainer/maintainer.go
  • maintainer/maintainer_controller.go
  • maintainer/maintainer_test.go
  • maintainer/operator/operator_controller.go
  • maintainer/operator/operator_controller_test.go

@wlwilliamx

Collaborator Author

/test all

@coderabbitai coderabbitai Bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
maintainer/operator/operator_controller.go (1)

443-470: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Make quiescing admission atomic with operator registration.

Line 444 checks the allowlist outside the critical section that mutates operators/runningQueue. A goroutine can pass that check, then QuiesceExcept flips quiescing at Lines 101-108, and this method still inserts and starts a now-blocked operator. Because checkAffectedNodes immediately calls OnNodeRemove, the old maintainer can still mutate scheduling state after handoff has started.

Suggested fix
 func (oc *Controller) pushOperator(op operator.Operator[common.DispatcherID, *heartbeatpb.TableSpanStatus]) bool {
-	if !oc.isOperatorAllowed(op.ID()) {
-		log.Info("skip operator while controller is quiescing",
-			zap.String("role", oc.role),
-			zap.Stringer("changefeedID", oc.changefeedID),
-			zap.String("dispatcherID", op.ID().String()),
-			zap.String("operator", op.String()))
-		return false
-	}
 	log.Info("add operator to running queue",
 		zap.String("role", oc.role),
 		zap.Stringer("changefeedID", oc.changefeedID),
 		zap.String("operator", op.String()))
 	withTime := operator.NewOperatorWithTime(op, time.Now())

 	oc.mu.Lock()
+	if !oc.isOperatorAllowedLocked(op.ID()) {
+		oc.mu.Unlock()
+		log.Info("skip operator while controller is quiescing",
+			zap.String("role", oc.role),
+			zap.Stringer("changefeedID", oc.changefeedID),
+			zap.String("dispatcherID", op.ID().String()),
+			zap.String("operator", op.String()))
+		return false
+	}
 	oc.operators[op.ID()] = withTime
 	oc.mu.Unlock()
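The check-then-act race and its fix can be illustrated in isolation. The sketch below is hypothetical throughout (`Push`, `Registered`, string IDs and operators) and only shows the pattern of doing the allowlist check and the registration inside the same critical section.

```go
package main

import (
	"fmt"
	"sync"
)

// Controller is a hypothetical, simplified operator controller.
type Controller struct {
	mu        sync.Mutex
	quiescing bool
	allowed   map[string]bool
	operators map[string]string // dispatcher ID -> operator description
}

func NewController() *Controller {
	return &Controller{allowed: map[string]bool{}, operators: map[string]string{}}
}

// QuiesceExcept flips the controller into quiescing mode under c.mu.
func (c *Controller) QuiesceExcept(ids ...string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.quiescing = true
	for _, id := range ids {
		c.allowed[id] = true
	}
}

// Push admits and registers op in one critical section, so QuiesceExcept
// cannot slip in between the allowlist check and the map insert.
func (c *Controller) Push(id, op string) bool {
	c.mu.Lock()
	if c.quiescing && !c.allowed[id] {
		c.mu.Unlock()
		return false
	}
	c.operators[id] = op
	c.mu.Unlock()
	// Side effects such as starting the operator would run here, after registration.
	return true
}

// Registered reports whether an operator is registered for id.
func (c *Controller) Registered(id string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	_, ok := c.operators[id]
	return ok
}

func main() {
	c := NewController()
	fmt.Println(c.Push("t1", "add")) // admitted before quiescing
	c.QuiesceExcept("ddl")
	fmt.Println(c.Push("t2", "move"), c.Push("ddl", "remove"))
}
```

Note that in this sketch a concurrent `QuiesceExcept` can still return while the post-unlock side effects of an already admitted operator are running; closing that gap requires the quiesce call to wait for in-flight admissions.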
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@maintainer/operator/operator_controller.go` around lines 443 - 470, The
admission check for quiescing must be done while holding the controller mutex so
registration and queue insertion are atomic with the quiescing state; change
pushOperator to acquire oc.mu, call isOperatorAllowed (or re-check the quiescing
condition/QuiesceExcept) while holding oc.mu, and if allowed insert withTime
into oc.operators and heap.Push(&oc.runningQueue) before unlocking; only after
the operator is registered and the lock released call op.Start() and
oc.checkAffectedNodes(op). This prevents a goroutine from passing the allow
check, having QuiesceExcept flip the state, and then starting the operator or
mutating scheduling state (via checkAffectedNodes/OnNodeRemove) after handoff.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 2e8d7db8-64fa-485d-9e7a-9c89c6c167bd

📥 Commits

Reviewing files that changed from the base of the PR and between bcc40a6 and 5768c66.

📒 Files selected for processing (3)
  • maintainer/maintainer_test.go
  • maintainer/operator/operator_controller.go
  • maintainer/operator/operator_controller_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • maintainer/operator/operator_controller_test.go

@wlwilliamx

Collaborator Author

/test all

@wlwilliamx

Collaborator Author

/test pull-cdc-storage-integration-light

@ti-chi-bot ti-chi-bot Bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jun 11, 2026
@wlwilliamx

Collaborator Author

/test all

@ti-chi-bot ti-chi-bot Bot added needs-1-more-lgtm Indicates a PR needs 1 more LGTM. approved labels Jun 15, 2026
@ti-chi-bot

ti-chi-bot Bot commented Jun 15, 2026


[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hongyunyan, lidezhu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot Bot added the lgtm label Jun 15, 2026
@ti-chi-bot ti-chi-bot Bot removed the needs-1-more-lgtm Indicates a PR needs 1 more LGTM. label Jun 15, 2026
@ti-chi-bot

ti-chi-bot Bot commented Jun 15, 2026


[LGTM Timeline notifier]

Timeline:

  • 2026-06-15 03:24:41.126164203 +0000 UTC m=+1362382.196481593: ☑️ agreed by hongyunyan.
  • 2026-06-15 03:39:44.981074334 +0000 UTC m=+1363286.051391794: ☑️ agreed by lidezhu.

@ti-chi-bot ti-chi-bot Bot merged commit 776315e into pingcap:master Jun 15, 2026
18 of 19 checks passed
@wlwilliamx wlwilliamx added the needs-cherry-pick-release-8.5 Should cherry pick this PR to release-8.5 branch. label Jun 15, 2026
@ti-chi-bot

Member

In response to a cherrypick label: new pull request created to branch release-8.5: #5403.
But this PR has conflicts, please resolve them!


Labels

approved lgtm needs-cherry-pick-release-8.5 Should cherry pick this PR to release-8.5 branch. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.


Development

Successfully merging this pull request may close these issues.

maintainer remove handoff can still recreate dispatchers after shutdown starts

4 participants