maintainer,dispatcher: fence stale maintainer epochs by hongyunyan · Pull Request #5435 · pingcap/ticdc

hongyunyan · 2026-06-18T01:10:43Z

What problem does this PR solve?

Issue Number: ref #5083

This is PR 2 of 3 split from #5182 and is stacked on PR 1.

Background:

PR 1 persists a monotonic maintainer epoch before new ownership is scheduled.
The receiver side still needs to stamp outbound requests and reject control
messages from stale maintainer owners.

Motivation:

Without a receiver-local owner and epoch fence, a delayed old maintainer can
continue to send schedule, heartbeat, post-bootstrap, merge, or close messages
after a newer maintainer has taken over the same changefeed. Those stale
messages can mutate dispatcher state or complete operators that no longer
belong to the active owner.

What is changed and how it works?

This PR adds the receiver-side stale maintainer fence:

- Stamps maintainer status, bootstrap, post-bootstrap, close, scheduler, merge,
  heartbeat, and remove messages with the current maintainer epoch.
- Tracks the current maintainer epoch in the maintainer controller and its table
  operator controllers.
- Rejects stale maintainer lifecycle responses before they can complete the
  current maintainer flow.
- Guards duplicate add-maintainer replacement so a newer epoch cannot overlap a
  still-running local maintainer.
- Adds dispatcher-manager owner and epoch admission with epoch 0 rolling-upgrade
  compatibility.
- Serializes dispatcher-manager maintainer fence checks with schedule,
  heartbeat, merge, bootstrap, and close side effects.
- Keeps queued maintainer lifecycle messages ordered by maintainer epoch so a
  newer owner can replace an older pending request.
- Records closed maintainer epochs so delayed bootstrap requests cannot recreate
  a manager after a close has already completed.

Stack:

- #5434: coordinator epoch persistence and owner scheduling foundation.
- #5435: maintainer and dispatcher-manager receiver-side epoch fence.
- #5436: table trigger takeover hardening and integration coverage.

Check List

Tests

Unit test

Questions

Will it cause performance regression or break compatibility?

No expected performance regression. The new mutexes serialize only
per-changefeed control operations such as bootstrap, close, schedule, merge, and
heartbeat response handling. They are not in the downstream event write path.

The change is wire-compatible. Epoch 0 remains accepted only for the current
compatibility-mode owner until the receiver observes a non-zero maintainer epoch.
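
The compatibility rule above can be illustrated with a minimal sketch. The `fence` type and its `tryUpdate`/`allowed` methods are hypothetical stand-ins, not the PR's code. `allowed` mirrors the `IsMaintainerRequestAllowed` logic quoted in the review below, while the admission rules in `tryUpdate` are assumptions made only for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// fence is a hypothetical stand-in for the dispatcher manager's owner and
// epoch metadata; the names are illustrative, not the PR's actual types.
type fence struct {
	mu    sync.RWMutex
	owner string
	epoch uint64
}

// tryUpdate admits a (possibly new) owner. Assumed rules for this sketch:
// epoch 0 is admitted only while no non-zero epoch has been observed, and a
// non-zero epoch must be newer, or equal with the same owner.
func (f *fence) tryUpdate(from string, epoch uint64) bool {
	f.mu.Lock()
	defer f.mu.Unlock()
	switch {
	case epoch == 0 && f.epoch != 0:
		return false // compatibility mode has already ended
	case epoch < f.epoch:
		return false // stale owner
	case epoch != 0 && epoch == f.epoch && f.owner != from:
		return false // same epoch claimed by a different node
	}
	f.owner, f.epoch = from, epoch
	return true
}

// allowed mirrors the check quoted in the review: epoch 0 passes only while
// the fence itself is still at epoch 0, otherwise owner and epoch must match.
func (f *fence) allowed(from string, epoch uint64) bool {
	f.mu.RLock()
	defer f.mu.RUnlock()
	if epoch == 0 {
		return f.epoch == 0 && (f.owner == "" || f.owner == from)
	}
	return f.epoch == epoch && f.owner == from
}

func main() {
	f := &fence{}
	fmt.Println("old owner bootstraps at epoch 0:", f.tryUpdate("node-a", 0))
	fmt.Println("epoch-0 request accepted:", f.allowed("node-a", 0))
	fmt.Println("new owner takes over at epoch 5:", f.tryUpdate("node-b", 5))
	fmt.Println("epoch-0 request accepted after takeover:", f.allowed("node-a", 0))
	fmt.Println("current owner accepted:", f.allowed("node-b", 5))
}
```

Once any non-zero epoch is admitted, the sketch never returns to compatibility mode, which matches the wording "until the receiver observes a non-zero maintainer epoch".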

Do you need to update user documentation, design documentation or monitoring documentation?

No.

Release note

None

Validation

make fmt
go test ./maintainer ./maintainer/operator ./maintainer/replica ./downstreamadapter/dispatchermanager ./downstreamadapter/dispatcherorchestrator

Summary by CodeRabbit

Refactor
- Enhanced coordination mechanisms and request validation to improve system reliability and consistency during distributed operations.
- Strengthened state management for component synchronization with improved handling of concurrent requests and responses.
- Increased resilience through refined locking and version-based state tracking.

ti-chi-bot · 2026-06-18T01:10:46Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

coderabbitai · 2026-06-18T01:10:51Z

Walkthrough

This PR introduces maintainer epoch fencing across the TiCDC control plane. A maintainerEpoch uint64 field is threaded into all dispatcher scheduling, heartbeat, merge, and close messages, and all operator constructors. Epoch-aware admission, monotonic update, and stale-message dropping logic is added to the maintainer, maintainer manager, dispatcher manager, and dispatcher orchestrator.

Changes

Maintainer Epoch Fencing

Layer / File(s)	Summary
Dispatcher message epoch fields and SpanReplication builders `downstreamadapter/dispatchermanager/dispatcher_manager_info.go`, `downstreamadapter/dispatchermanager/helper.go`, `maintainer/replica/replication_span.go`, `maintainer/barrier.go`, `maintainer/barrier_event.go`, `downstreamadapter/dispatchermanager/dispatcher_manager*.go`	`SchedulerDispatcherRequest`, `HeartBeatResponse`, and `MergeDispatcherRequest` gain a `From node.ID` field; `SpanReplication.NewAddDispatcherMessage` and `NewRemoveDispatcherMessage` gain `maintainerEpoch`; `dispatcherCreateInfo.Id` → `.ID`; heartbeat responses in barrier include `MaintainerEpoch`.
Operator types carry and emit maintainerEpoch `maintainer/operator/operator_add.go`, `operator_remove.go`, `operator_move.go`, `operator_merge.go`, `operator_split.go`, `maintainer/operator/operator_*_test.go`	All five operator types add a `maintainerEpoch` field; constructors are extended; `Schedule()` methods forward the epoch into `NewAddDispatcherMessage`/`NewRemoveDispatcherMessage`; all operator tests updated with epoch argument `7` and epoch assertions.
OperatorController atomic epoch and scheduler factory plumbing `maintainer/operator/operator_controller.go`, `maintainer/scheduler/basic.go`, `maintainer/scheduler/balance.go`, `maintainer/scheduler/balance_splits.go`, `maintainer/scheduler/drain.go`, `maintainer/operator/operator_controller_test.go`, `maintainer/scheduler/drain_test.go`	`operator.Controller` gains `maintainerEpoch atomic.Uint64` with `SetMaintainerEpoch`/`MaintainerEpoch()` accessors; all remove, move, merge, and split creation paths pass `oc.MaintainerEpoch()` into operator constructors; schedulers use controller-scoped factory methods instead of direct operator constructors.
MaintainerController epoch init, propagation, and bootstrap operator restoration `maintainer/maintainer_controller.go`, `maintainer/maintainer_controller_bootstrap.go`, `maintainer/maintainer_controller_helper.go`, `maintainer/maintainer_controller_test.go`, `maintainer/barrier_test.go`, `maintainer/span/span_controller_test.go`	`NewController` accepts `maintainerEpoch`; `SetMaintainerEpoch` propagates to both operator controllers; orphan-removal in `handleStatus` includes epoch; bootstrap operator restoration calls `operatorController.MaintainerEpoch()` for add/remove operators; `FinishBootstrap` response sets `MaintainerEpoch`.
Maintainer epoch helpers, request/response fencing, and outgoing epoch stamping `maintainer/maintainer.go`, `maintainer/maintainer_test.go`	Adds `currentMaintainerEpoch()`, `isMaintainerEpochResponseAllowed`, `isMaintainerEpochRequestAllowed`, `logDroppedMaintainerResponse`, and `markRemoved()` helpers; gates `RemoveMaintainerRequest`, bootstrap/post-bootstrap/close responses on epoch checks; outgoing `MaintainerBootstrapRequest` and `MaintainerCloseRequest` stamped with current epoch; `NewMaintainerForRemove` accepts `maintainerEpoch`.
Maintainer manager epoch-based admission and registry serialization `maintainer/maintainer_manager_maintainers.go`, `maintainer/maintainer_manager_test.go`	Adds `registryMu` to `managerMaintainerSet`; `handleAddMaintainer` enforces epoch monotonicity via `mayRegisterMaintainerForAdd`/`registerMaintainerForAdd`; cascade-remove path guards registry under `registryMu`; `cleanupRemovedMaintainers` uses `CompareAndDelete` for correct multi-epoch cleanup.
DispatcherManager MaintainerFenceMu, handler fencing, and sender-identity tracking `downstreamadapter/dispatchermanager/dispatcher_manager.go`, `dispatcher_manager_info.go`, `dispatcher_manager_helper.go`, `helper.go`, `heartbeat_collector.go`, `helper_test.go`, `dispatcher_manager_test.go`	`DispatcherManager` gains `MaintainerFenceMu`; `SetMaintainerID` replaced by `TryUpdateMaintainer`/`IsMaintainerRequestAllowed`; `SchedulerDispatcherRequestHandler`, `HeartBeatResponseHandler`, and `MergeDispatcherRequestHandler` lock `MaintainerFenceMu` and drop stale senders; `preCheckForSchedulerHandler` performs epoch+dispatcherID filtering; `HeartBeatCollector.RecvMessages` passes `msg.From` into all constructors.
DispatcherOrchestrator closedMaintainerEpochs and per-handler epoch gates `downstreamadapter/dispatcherorchestrator/dispatcher_orchestrator.go`, `downstreamadapter/dispatcherorchestrator/helper.go`	Adds `closedMaintainerEpochs` map with `recordClosedMaintainerEpochLocked` tombstone logic; bootstrap handler drops stale requests against closed epochs and fences with `MaintainerFenceMu`/`TryUpdateMaintainer`; post-bootstrap and close handlers gate on `IsMaintainerRequestAllowed`; all responses include `MaintainerEpoch`; pending message queue upgrade path respects epoch ordering.
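
As a rough illustration of the operator-controller row above, the sketch below keeps the epoch in an `atomic.Uint64` and stamps it onto operators through a controller-scoped factory. Only `SetMaintainerEpoch` and `MaintainerEpoch` are names taken from the summary. The `controller` and `removeOperator` types and the `newRemoveOperator` factory are hypothetical.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// controller is a hypothetical stand-in for the operator controller.
type controller struct {
	maintainerEpoch atomic.Uint64
}

// SetMaintainerEpoch records the epoch of the maintainer that owns this controller.
func (c *controller) SetMaintainerEpoch(epoch uint64) { c.maintainerEpoch.Store(epoch) }

// MaintainerEpoch returns the epoch that new operators should carry.
func (c *controller) MaintainerEpoch() uint64 { return c.maintainerEpoch.Load() }

// removeOperator is an illustrative operator that remembers the epoch it was
// created under so that its outgoing messages can be stamped with it.
type removeOperator struct {
	dispatcherID    string
	maintainerEpoch uint64
}

// newRemoveOperator is a controller-scoped factory (an assumed shape): callers
// such as schedulers never pass the epoch themselves.
func (c *controller) newRemoveOperator(dispatcherID string) removeOperator {
	return removeOperator{dispatcherID: dispatcherID, maintainerEpoch: c.MaintainerEpoch()}
}

func main() {
	c := &controller{}
	c.SetMaintainerEpoch(7)
	op := c.newRemoveOperator("dispatcher-1")
	fmt.Println(op.dispatcherID, op.maintainerEpoch)
}
```

Routing creation through the controller keeps the epoch lookup in one place, which is presumably why the schedulers were moved off direct operator constructors.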

Sequence Diagram(s)

sequenceDiagram
  rect rgba(100, 149, 237, 0.5)
    Note over MaintainerManager,Maintainer: Maintainer side
    MaintainerManager->>Maintainer: AddMaintainer(epoch=N)
    Maintainer->>Maintainer: mayRegisterMaintainerForAdd(epoch)?
    Maintainer->>DispatcherNode: MaintainerBootstrapRequest(MaintainerEpoch=N)
  end
  rect rgba(144, 238, 144, 0.5)
    Note over DispatcherNode,DispatcherManager: Dispatcher side
    DispatcherNode->>DispatcherOrchestrator: handleBootstrapRequest(epoch=N)
    DispatcherOrchestrator->>DispatcherOrchestrator: check closedMaintainerEpochs >= N?
    DispatcherOrchestrator->>DispatcherManager: NewDispatcherManager(maintainerEpoch=N)
    DispatcherOrchestrator->>DispatcherManager: TryUpdateMaintainer(from, N)
    DispatcherNode->>DispatcherManager: ScheduleDispatcherRequest(From=maintainer, MaintainerEpoch=N)
    DispatcherManager->>DispatcherManager: Lock(MaintainerFenceMu) → IsMaintainerRequestAllowed?
    DispatcherManager->>DispatcherManager: preCheckForSchedulerHandler → handleScheduleCreate
  end
  rect rgba(255, 160, 122, 0.5)
    Note over Maintainer,DispatcherNode: Stale epoch rejection
    MaintainerManager->>Maintainer: AddMaintainer(epoch=N-1)
    Maintainer->>Maintainer: isNewerMaintainerEpoch? → reject
    DispatcherNode->>DispatcherManager: ScheduleDispatcherRequest(MaintainerEpoch=N-1)
    DispatcherManager->>DispatcherManager: IsMaintainerRequestAllowed? → drop + log
  end

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

pingcap/ticdc#3691: Shares the currentOperatorMap/preCheckForSchedulerHandler and dispatcher create/remove in-flight operator tracking code that this PR extends with maintainer-epoch staleness fencing.
pingcap/ticdc#5434: Introduces MaintainerEpoch fields on AddMaintainerRequest/RemoveMaintainerRequest and the heartbeat protocol types that this PR consumes for its fencing checks across the full scheduling pipeline.

Suggested labels

release-note

Suggested reviewers

asddongmen
wk989898

🐇 Epochs march forward, never back,
Old maintainers fade to black.
With fences locked and epochs checked,
No stale request goes unchecked!
The rabbit stamps each message true —
MaintainerEpoch sees it through. 🌟

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 36.59% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and concisely summarizes the main change: adding receiver-side fencing for stale maintainer epochs.
Description check	✅ Passed	The PR description comprehensively covers all required sections: problem statement, motivation, detailed implementation, stack context, test coverage, compatibility/performance implications, and validation steps.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.


gemini-code-assist

Code Review

This pull request introduces a robust maintainer epoch fencing mechanism to prevent stale maintainers from mutating dispatcher states or sending outdated control requests. Key changes include tracking active maintainer epochs, carrying sender node IDs and epochs in heartbeat responses and schedule requests, and implementing tombstones for closed managers to block delayed bootstrap requests. The review feedback highlights critical concurrency and performance improvements, specifically recommending the use of read locks to prevent data races on maintainer metadata and advising to release the registry lock before performing potentially slow close operations on maintainers.


gemini-code-assist · 2026-06-18T01:12:31Z

+func (e *DispatcherManager) IsMaintainerRequestAllowed(from node.ID, maintainerEpoch uint64) bool {
+	e.meta.Lock()
+	defer e.meta.Unlock()
+	if maintainerEpoch == 0 {
+		return e.meta.maintainerEpoch == 0 && (e.meta.maintainerID == "" || e.meta.maintainerID == from)
+	}
+	return e.meta.maintainerEpoch == maintainerEpoch && e.meta.maintainerID == from
 }


The IsMaintainerRequestAllowed method currently acquires a write lock (e.meta.Lock()) even though it only performs read operations. Since e.meta embeds sync.RWMutex, it is highly recommended to use RLock() and RUnlock() here to allow concurrent read access and avoid unnecessary serialization of heartbeat and scheduler checks.

Additionally, GetMaintainerID() and GetMaintainerEpoch() (which are called concurrently in other parts of the codebase, such as logging inside preCheckForSchedulerHandler) currently access e.meta fields without any locking. This introduces a data race with TryUpdateMaintainer which writes to these fields. They should also be updated to use RLock() and RUnlock() to ensure thread safety.

Suggested change

 func (e *DispatcherManager) IsMaintainerRequestAllowed(from node.ID, maintainerEpoch uint64) bool {
-	e.meta.Lock()
-	defer e.meta.Unlock()
+	e.meta.RLock()
+	defer e.meta.RUnlock()
 	if maintainerEpoch == 0 {
 		return e.meta.maintainerEpoch == 0 && (e.meta.maintainerID == "" || e.meta.maintainerID == from)
 	}
 	return e.meta.maintainerEpoch == maintainerEpoch && e.meta.maintainerID == from
 }
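
The second point of this comment, locking the getters, could look roughly like the sketch below. `GetMaintainerID` and `GetMaintainerEpoch` are the method names mentioned in the review. The `manager` and `meta` struct layout and the `setMaintainer` writer are assumptions made to keep the example self-contained.

```go
package main

import (
	"fmt"
	"sync"
)

// meta embeds sync.RWMutex, as the review says the real metadata struct does.
type meta struct {
	sync.RWMutex
	maintainerID    string
	maintainerEpoch uint64
}

// manager is a hypothetical, reduced dispatcher manager.
type manager struct {
	meta meta
}

// setMaintainer is an illustrative writer standing in for TryUpdateMaintainer.
func (m *manager) setMaintainer(id string, epoch uint64) {
	m.meta.Lock()
	defer m.meta.Unlock()
	m.meta.maintainerID = id
	m.meta.maintainerEpoch = epoch
}

// GetMaintainerID takes the read lock so it cannot race with the writer.
func (m *manager) GetMaintainerID() string {
	m.meta.RLock()
	defer m.meta.RUnlock()
	return m.meta.maintainerID
}

// GetMaintainerEpoch takes the read lock for the same reason.
func (m *manager) GetMaintainerEpoch() uint64 {
	m.meta.RLock()
	defer m.meta.RUnlock()
	return m.meta.maintainerEpoch
}

func main() {
	m := &manager{}
	var wg sync.WaitGroup
	for i := 1; i <= 4; i++ {
		wg.Add(1)
		go func(epoch uint64) {
			defer wg.Done()
			m.setMaintainer("node-a", epoch)
			_ = m.GetMaintainerEpoch() // concurrent read, safe under RLock
		}(uint64(i))
	}
	wg.Wait()
	fmt.Println(m.GetMaintainerID(), m.GetMaintainerEpoch())
}
```

Running a sketch like this with `go run -race` is one way to confirm that the getters no longer race with the writer.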

gemini-code-assist · 2026-06-18T01:12:31Z

+func (p *managerMaintainerSet) registerMaintainerForAdd(
+	changefeedID common.ChangeFeedID,
+	requestEpoch uint64,
+	newMaintainer func() *Maintainer,
+) *Maintainer {
+	p.registryMu.Lock()
+	defer p.registryMu.Unlock()
+
+	registered, loaded := p.registry.Load(changefeedID)
+	if !loaded {
+		maintainer := newMaintainer()
+		p.registry.Store(changefeedID, maintainer)
+		return maintainer
+	}
+	existing := registered.(*Maintainer)
+	if !canRegisterAfterExistingMaintainer(existing, requestEpoch) {
+		logRejectedAddMaintainer(changefeedID, existing, requestEpoch)
+		return nil
+	}
+	// The old maintainer has fully stopped, so it is safe to release the
+	// shared metric labels before the new maintainer creates its own metric
+	// children for the same changefeed.
+	existing.Close()
+	maintainer := newMaintainer()
+	p.registry.Store(changefeedID, maintainer)
+	return maintainer
+}


Calling existing.Close() inside registerMaintainerForAdd while holding the node-wide p.registryMu lock can block all other maintainer registrations or cleanups on this node if the close operation takes time (e.g., waiting for background goroutines to exit or cleaning up resources). Since existing is already fully stopped (as verified by canRegisterAfterExistingMaintainer), it is safe to release the lock before calling existing.Close().

func (p *managerMaintainerSet) registerMaintainerForAdd(
	changefeedID common.ChangeFeedID,
	requestEpoch uint64,
	newMaintainer func() *Maintainer,
) *Maintainer {
	p.registryMu.Lock()
	registered, loaded := p.registry.Load(changefeedID)
	if !loaded {
		maintainer := newMaintainer()
		p.registry.Store(changefeedID, maintainer)
		p.registryMu.Unlock()
		return maintainer
	}
	existing := registered.(*Maintainer)
	if !canRegisterAfterExistingMaintainer(existing, requestEpoch) {
		logRejectedAddMaintainer(changefeedID, existing, requestEpoch)
		p.registryMu.Unlock()
		return nil
	}
	maintainer := newMaintainer()
	p.registry.Store(changefeedID, maintainer)
	p.registryMu.Unlock()
	existing.Close()
	return maintainer
}

gemini-code-assist · 2026-06-18T01:12:31Z

+func (p *managerMaintainerSet) cleanupRemovedMaintainer(key, value interface{}) {
+	p.registryMu.Lock()
+	defer p.registryMu.Unlock()
+
+	cf := value.(*Maintainer)
+	if !cf.removed.Load() {
+		return
+	}
+	// Range can observe a removed maintainer just before a newer epoch replaces it.
+	// Only the value still stored in the registry owns the shared metric labels.
+	if !p.registry.CompareAndDelete(key, cf) {
+		return
+	}
+	cf.Close()
+	log.Info("maintainer removed, remove it from dynamic stream",
+		zap.Stringer("changefeedID", cf.changefeedID),
+		zap.Uint64("checkpointTs", cf.getWatermark().CheckpointTs),
+	)
+}


Similarly to registerMaintainerForAdd, calling cf.Close() while holding the node-wide p.registryMu lock can block other maintainers' operations on this node. It is safer and more performant to release the lock before calling cf.Close().

Suggested change

 func (p *managerMaintainerSet) cleanupRemovedMaintainer(key, value interface{}) {
 	p.registryMu.Lock()
-	defer p.registryMu.Unlock()
-
 	cf := value.(*Maintainer)
 	if !cf.removed.Load() {
+		p.registryMu.Unlock()
 		return
 	}
-	// Range can observe a removed maintainer just before a newer epoch replaces it.
-	// Only the value still stored in the registry owns the shared metric labels.
 	if !p.registry.CompareAndDelete(key, cf) {
+		p.registryMu.Unlock()
 		return
 	}
+	p.registryMu.Unlock()
 	cf.Close()
 	log.Info("maintainer removed, remove it from dynamic stream",
 		zap.Stringer("changefeedID", cf.changefeedID),
 		zap.Uint64("checkpointTs", cf.getWatermark().CheckpointTs),
 	)
 }
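
For readers unfamiliar with `sync.Map.CompareAndDelete` (available since Go 1.20), the sketch below shows why it suits multi-epoch cleanup: the entry is deleted only if the registry still holds the exact value the cleanup observed. The `maintainer` type and the helper names are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// maintainer is a hypothetical, reduced maintainer carrying only its epoch.
type maintainer struct {
	epoch uint64
}

// newRegistry returns an empty registry keyed by changefeed name.
func newRegistry() *sync.Map { return &sync.Map{} }

// cleanupIfCurrent deletes the entry only when the registry still maps key to
// exactly this maintainer pointer. It returns false if a newer maintainer has
// replaced it in the meantime, leaving the newer entry untouched.
func cleanupIfCurrent(registry *sync.Map, key string, m *maintainer) bool {
	return registry.CompareAndDelete(key, m)
}

func main() {
	registry := newRegistry()
	old := &maintainer{epoch: 1}
	registry.Store("cf-1", old)

	// A newer epoch replaces the entry before the cleanup of the old one runs.
	newer := &maintainer{epoch: 2}
	registry.Store("cf-1", newer)

	fmt.Println("old entry deleted:", cleanupIfCurrent(registry, "cf-1", old))
	current, _ := registry.Load("cf-1")
	fmt.Println("registry still holds epoch:", current.(*maintainer).epoch)
}
```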

…le-maintainer-fence

ti-chi-bot · 2026-06-24T01:51:44Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign nongfushanquan for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS
pkg/config/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

downstreamadapter/dispatcherorchestrator/dispatcher_orchestrator.go (2)

725-754: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Filter stale operators before returning bootstrap state.

After TryUpdateMaintainer moves the manager to a newer owner/epoch, currentOperatorMap can still contain requests from the previous owner. This loop appends every stored request, so a new maintainer can restore stale operators during bootstrap.

Proposed fix

-	manager.GetCurrentOperatorMap().Range(func(_, value any) bool {
+	manager.GetCurrentOperatorMap().Range(func(key, value any) bool {
 		req := value.(dispatchermanager.SchedulerDispatcherRequest)
+		if !manager.IsMaintainerRequestAllowed(req.From, req.MaintainerEpoch) {
+			manager.GetCurrentOperatorMap().Delete(key)
+			return true
+		}
 		dispatcherID := common.NewDispatcherIDFromPB(req.Config.DispatcherID)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@downstreamadapter/dispatcherorchestrator/dispatcher_orchestrator.go` around
lines 725 - 754, The loop in the Range function iterates through all operators
in manager.GetCurrentOperatorMap() and appends them to response.Operators
without filtering for stale operators. After TryUpdateMaintainer updates the
manager to a newer owner/epoch, the currentOperatorMap can still contain
requests from the previous owner, which causes a new maintainer to restore stale
operators during bootstrap. Add a filter condition before appending the operator
to response.Operators that checks if the operator belongs to the current
owner/epoch and only includes it if it does. This ensures stale operators from
previous owners are not included in the bootstrap response.

274-303: 🩺 Stability & Availability | 🔴 Critical | ⚡ Quick win

Unlock MaintainerFenceMu on write-path-closed returns.

The new fence lock is held when these write-path-closed branches return. Lines 302, 323, 423, and 436 leave the mutex locked, so later control messages for this dispatcher manager can block forever.

Proposed fix

 				if err != nil {
 					if dispatchermanager.IsWritePathClosedError(err) {
 						log.Info("dispatcher manager write path closed while creating table trigger event dispatcher",
 							zap.Stringer("changefeedID", cfId), zap.Error(err))
+						manager.MaintainerFenceMu.Unlock()
 						return nil
 					}
@@
 				if err != nil {
 					if dispatchermanager.IsWritePathClosedError(err) {
 						log.Info("dispatcher manager write path closed while creating table trigger redo dispatcher",
 							zap.Stringer("changefeedID", cfId), zap.Error(err))
+						manager.MaintainerFenceMu.Unlock()
 						return nil
 					}
@@
 	if err != nil {
 		if dispatchermanager.IsWritePathClosedError(err) {
 			log.Info("dispatcher manager write path closed while initializing table trigger event dispatcher",
 				zap.Any("changefeedID", cfId.Name()), zap.Error(err))
+			manager.MaintainerFenceMu.Unlock()
 			return nil
 		}
@@
 		if err != nil {
 			if dispatchermanager.IsWritePathClosedError(err) {
 				log.Info("dispatcher manager write path closed while initializing table trigger redo dispatcher",
 					zap.Any("changefeedID", cfId.Name()), zap.Error(err))
+				manager.MaintainerFenceMu.Unlock()
 				return nil
 			}

Also applies to: 320-324, 379-424, 433-436

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@downstreamadapter/dispatcherorchestrator/dispatcher_orchestrator.go` around
lines 274 - 303, The MaintainerFenceMu mutex is locked at the beginning of the
function but is not being unlocked before returning in the write-path-closed
error handling branches. Locate all the places where
dispatchermanager.IsWritePathClosedError(err) is checked and returns nil without
unlocking (including the one in NewTableTriggerEventDispatcher error handling
and any other similar patterns). Before each of these return statements, add
manager.MaintainerFenceMu.Unlock() to ensure the lock is released. This prevents
deadlocks that would occur when other control messages attempt to acquire this
lock.
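
An alternative to adding an `Unlock` call before every early return, if the fenced section can be moved into its own function, is to pair the lock with `defer`. This is a different technique from the proposed fix and is only a sketch. The `manager` type, `errWritePathClosed`, and the helper methods are hypothetical stand-ins for the real dispatcher manager code.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errWritePathClosed is a hypothetical sentinel standing in for the error that
// IsWritePathClosedError detects in the real code.
var errWritePathClosed = errors.New("write path closed")

// manager is a reduced, illustrative dispatcher manager.
type manager struct {
	fenceMu         sync.Mutex
	writePathClosed bool
	dispatchers     int
}

// createDispatcher fails with errWritePathClosed once the write path is closed.
func (m *manager) createDispatcher() error {
	if m.writePathClosed {
		return errWritePathClosed
	}
	m.dispatchers++
	return nil
}

// handleBootstrap holds the fence for the whole critical section. Because the
// unlock is deferred, every return path releases it, including the early
// return taken when the write path is already closed.
func (m *manager) handleBootstrap() error {
	m.fenceMu.Lock()
	defer m.fenceMu.Unlock()

	if err := m.createDispatcher(); err != nil {
		if errors.Is(err, errWritePathClosed) {
			return nil // benign: nothing left to do, lock still released
		}
		return err
	}
	return nil
}

// fenceIsFree reports whether the fence can currently be acquired.
func (m *manager) fenceIsFree() bool {
	if m.fenceMu.TryLock() {
		m.fenceMu.Unlock()
		return true
	}
	return false
}

func main() {
	m := &manager{writePathClosed: true}
	fmt.Println("bootstrap error:", m.handleBootstrap())
	fmt.Println("fence released after early return:", m.fenceIsFree())
}
```

Whether this shape fits depends on how much of the real handler must run after the fence is released, which is not visible in this thread.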

🧹 Nitpick comments (2)

maintainer/maintainer_test.go (1)

266-307: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick win

Add strict-mode epoch-0 response coverage.

The new tests cover non-zero mismatch cases, but epoch 0 is the compatibility sentinel and takes a separate branch. Add a strict-maintainer response test so unfenced stale responses cannot regress back to being accepted.

Proposed test coverage

 func TestMaintainerPostBootstrapResponseRequiresCurrentEpoch(t *testing.T) {
 	cfID := common.NewChangeFeedIDWithName("test", common.DefaultKeyspaceName)
 	m := &Maintainer{
 		changefeedID: cfID,
 		info:         &config.ChangeFeedInfo{Epoch: 2},
@@
 	))
 	require.NotNil(t, m.postBootstrapMsg)
 
+	m.onMaintainerPostBootstrapResponse(messaging.NewSingleTargetMessage(
+		node.ID("compat"),
+		messaging.MaintainerManagerTopic,
+		&heartbeatpb.MaintainerPostBootstrapResponse{
+			ChangefeedID:    cfID.ToPB(),
+			MaintainerEpoch: 0,
+		},
+	))
+	require.NotNil(t, m.postBootstrapMsg)
+
 	m.onMaintainerPostBootstrapResponse(messaging.NewSingleTargetMessage(
 		node.ID("current"),
 		messaging.MaintainerManagerTopic,
 		&heartbeatpb.MaintainerPostBootstrapResponse{

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@maintainer/maintainer_test.go` around lines 266 - 307, Add a test case to
verify that a strict maintainer rejects post-bootstrap responses with epoch 0,
which is the compatibility sentinel. Create a new test function similar to
TestMaintainerPostBootstrapResponseRequiresCurrentEpoch that constructs a strict
Maintainer with Epoch: 2 in the ChangeFeedInfo, calls
onMaintainerPostBootstrapResponse with a response containing epoch 0, and
asserts that postBootstrapMsg remains non-nil (indicating the response was
rejected). This ensures stale epoch-0 responses cannot regress back to being
accepted by strict maintainers.

maintainer/span/span_controller_test.go (1)

490-493: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Assert the propagated maintainer epoch.

Line 490 now passes 7 into NewAddDispatcherMessage, but the test still only checks StartTs. Add an assertion so this changed test covers the new message contract.

Proposed test assertion

 msg := task.NewAddDispatcherMessage("node1", heartbeatpb.OperatorType_O_Add, 7)
 req := msg.Message[0].(*heartbeatpb.ScheduleDispatcherRequest)
 require.Equal(t, uint64(20), req.Config.StartTs)
+require.Equal(t, uint64(7), req.MaintainerEpoch)

As per coding guidelines, **/*_test.go: Prefer focused deterministic tests; see docs/agents/testing.md before adding or changing tests.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@maintainer/span/span_controller_test.go` around lines 490 - 493, The test in
the span_controller_test.go file passes a maintainer epoch value of 7 to the
NewAddDispatcherMessage function on line 490, but the subsequent assertion only
verifies the StartTs field on line 492. Add an additional assertion after the
existing require.Equal call for StartTs to verify that the epoch value (7) is
properly propagated in the request's MaintainerEpoch field. This ensures the
test covers the complete contract of the NewAddDispatcherMessage call with all
its parameters.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@downstreamadapter/dispatchermanager/helper.go`:
- Around line 294-306: When a Remove request targets a missing dispatcher in the
code around lines 384-413, the existing operator entry is not being deleted from
dispatcherManager.currentOperatorMap. This causes subsequent Create retries to
be dropped by the check at lines 301-303 (which returns false when
ScheduleAction_Create is detected), leaving the dispatcher stuck in an
inconsistent state. In the Remove request handling path when the dispatcher is
absent, add a call to dispatcherManager.currentOperatorMap.Delete(dispatcherID)
to clear the existing operator entry before proceeding, similar to the deletion
already done in the IsMaintainerRequestAllowed check at line 298.

In `@maintainer/maintainer.go`:
- Around line 1028-1030: The isMaintainerEpochResponseAllowed method in the
Maintainer struct needs to be updated to reject epoch-0 responses once the
maintainer enters strict mode. Currently, common.MaintainerEpochMatches accepts
responseEpoch == 0 regardless of the currentMaintainerEpoch value, which leaves
the stale-response path open after strict epochs are in use. Modify
isMaintainerEpochResponseAllowed to reject responses with epoch 0 when the
currentMaintainerEpoch is non-zero, allowing epoch-0 responses only while the
maintainer is still in compatibility mode (when currentMaintainerEpoch is 0).
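
The difference between the current and the requested behaviour can be sketched as two small predicates. `lenientMatch` models what the comment says `common.MaintainerEpochMatches` does today (the real function body is not shown in this thread, so this is an assumption), and `strictMatch` is one possible reading of the requested rule. Both names are hypothetical.

```go
package main

import "fmt"

// lenientMatch is an assumed model of the existing helper as described in the
// comment: a response with epoch 0 is accepted regardless of the current epoch.
func lenientMatch(current, response uint64) bool {
	return response == 0 || response == current
}

// strictMatch accepts a response only when its epoch equals the current epoch.
// Epoch 0 therefore passes only while the maintainer itself is still at
// epoch 0 (compatibility mode). Rejecting a non-zero response while the
// maintainer is at epoch 0 is an assumption of this sketch.
func strictMatch(current, response uint64) bool {
	return response == current
}

func main() {
	fmt.Println("lenient, current=2 response=0:", lenientMatch(2, 0))
	fmt.Println("strict,  current=2 response=0:", strictMatch(2, 0))
	fmt.Println("strict,  current=0 response=0:", strictMatch(0, 0))
	fmt.Println("strict,  current=2 response=2:", strictMatch(2, 2))
}
```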

In `@maintainer/scheduler/balance_splits.go`:
- Around line 135-149: The MaintainerEpoch() method is being called at two
different times: once when creating the NewSplitDispatcherOperator and again
when creating the NewAddDispatcherOperator within the callback. If the epoch
advances between these calls, the split and child add operators will be stamped
with different epochs. Capture the epoch value once before creating the
NewSplitDispatcherOperator by storing s.operatorController.MaintainerEpoch() in
a local variable, then reuse that same variable in both the
NewSplitDispatcherOperator constructor and the NewAddDispatcherOperator
constructor inside the callback function.
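
A minimal sketch of the capture-once pattern described above, using hypothetical `controller` and `operator` types rather than the real split and add operators: the epoch is read a single time and the callback closes over that local value.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// controller is a hypothetical operator controller holding the current epoch.
type controller struct {
	epoch atomic.Uint64
}

func (c *controller) SetMaintainerEpoch(epoch uint64) { c.epoch.Store(epoch) }
func (c *controller) MaintainerEpoch() uint64         { return c.epoch.Load() }

// operator is an illustrative operator that records its kind and epoch.
type operator struct {
	kind  string
	epoch uint64
}

// newSplitWithChild reads the epoch once and reuses that value for both the
// split operator and the child add operator built later by the callback.
func newSplitWithChild(c *controller) (operator, func() operator) {
	epoch := c.MaintainerEpoch() // single read shared by parent and child
	split := operator{kind: "split", epoch: epoch}
	makeChild := func() operator {
		return operator{kind: "add", epoch: epoch}
	}
	return split, makeChild
}

func main() {
	c := &controller{}
	c.SetMaintainerEpoch(7)
	split, makeChild := newSplitWithChild(c)

	// The epoch advances before the callback runs.
	c.SetMaintainerEpoch(8)
	child := makeChild()

	fmt.Println(split.kind, split.epoch)
	fmt.Println(child.kind, child.epoch)
}
```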


ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: b2e0fbc4-b454-4bec-b9fb-cc2059074454

📥 Commits

Reviewing files that changed from the base of the PR and between 0eec971 and 45f36e3.

📒 Files selected for processing (41)

downstreamadapter/dispatchermanager/dispatcher_manager.go
downstreamadapter/dispatchermanager/dispatcher_manager_helper.go
downstreamadapter/dispatchermanager/dispatcher_manager_info.go
downstreamadapter/dispatchermanager/dispatcher_manager_redo.go
downstreamadapter/dispatchermanager/dispatcher_manager_test.go
downstreamadapter/dispatchermanager/heartbeat_collector.go
downstreamadapter/dispatchermanager/helper.go
downstreamadapter/dispatchermanager/helper_test.go
downstreamadapter/dispatcherorchestrator/dispatcher_orchestrator.go
downstreamadapter/dispatcherorchestrator/helper.go
maintainer/barrier.go
maintainer/barrier_event.go
maintainer/barrier_test.go
maintainer/maintainer.go
maintainer/maintainer_controller.go
maintainer/maintainer_controller_bootstrap.go
maintainer/maintainer_controller_helper.go
maintainer/maintainer_controller_test.go
maintainer/maintainer_manager_maintainers.go
maintainer/maintainer_manager_test.go
maintainer/maintainer_test.go
maintainer/operator/operator_add.go
maintainer/operator/operator_add_test.go
maintainer/operator/operator_controller.go
maintainer/operator/operator_controller_test.go
maintainer/operator/operator_merge.go
maintainer/operator/operator_merge_test.go
maintainer/operator/operator_move.go
maintainer/operator/operator_move_test.go
maintainer/operator/operator_remove.go
maintainer/operator/operator_remove_test.go
maintainer/operator/operator_split.go
maintainer/operator/operator_split_test.go
maintainer/replica/replication_span.go
maintainer/replica/replication_span_test.go
maintainer/scheduler/balance.go
maintainer/scheduler/balance_splits.go
maintainer/scheduler/basic.go
maintainer/scheduler/drain.go
maintainer/scheduler/drain_test.go
maintainer/span/span_controller_test.go

coderabbitai · 2026-06-24T02:06:29Z

+	if existing, operatorExists := dispatcherManager.currentOperatorMap.Load(dispatcherID); operatorExists {
+		existingReq := existing.(SchedulerDispatcherRequest)
+		if !dispatcherManager.IsMaintainerRequestAllowed(existingReq.From, existingReq.MaintainerEpoch) {
+			dispatcherManager.currentOperatorMap.Delete(dispatcherID)
+		} else {
+			// Create requests must be serialized per dispatcherID; otherwise we can end up creating multiple
+			// dispatchers for the same span/dispatcherID.
+			if req.ScheduleAction == heartbeatpb.ScheduleAction_Create {
+				return common.DispatcherID{}, false
+			}
+			// Remove requests are allowed to proceed: removeDispatcher is idempotent and the incoming request
+			// may carry a newer OperatorType for maintainer bootstrap/failover reconstruction.
 		}


🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Clear existing operators when a Remove targets a missing dispatcher.

Line 294 allows a current-epoch existing operator to remain while the Remove proceeds, but Lines 384-413 emit/skip the terminal remove path without deleting that existing entry when the dispatcher is absent. If that entry is a prior Create, later Create retries are dropped by Lines 301-303 and the dispatcher can stay stuck.

Proposed fix

 	if _, exists := dispatcherManager.redoDispatcherMap.Get(dispatcherID); exists {
 		dispatcherManager.currentOperatorMap.Store(dispatcherID, req)
 		log.Debug("store current working remove operator for redo dispatcher",
 			zap.String("changefeedID", req.ChangefeedID.String()),
 			zap.String("dispatcherID", dispatcherID.String()),
 			zap.Any("operator", req),
 		)
 	} else {
+		dispatcherManager.currentOperatorMap.Delete(dispatcherID)
 		log.Debug("redo dispatcher not found, skip remove operator store",
 			zap.String("changefeedID", req.ChangefeedID.String()),
 			zap.String("dispatcherID", dispatcherID.String()),
 			zap.Any("operator", req),
 		)
@@
 	if _, exists := dispatcherManager.dispatcherMap.Get(dispatcherID); exists {
 		dispatcherManager.currentOperatorMap.Store(dispatcherID, req)
 		log.Debug("store current working remove operator",
 			zap.String("changefeedID", req.ChangefeedID.String()),
 			zap.String("dispatcherID", dispatcherID.String()),
 			zap.Any("operator", req),
 		)
 	} else {
+		dispatcherManager.currentOperatorMap.Delete(dispatcherID)
 		log.Debug("dispatcher not found, skip remove operator store",
 			zap.String("changefeedID", req.ChangefeedID.String()),
 			zap.String("dispatcherID", dispatcherID.String()),
 			zap.Any("operator", req),
 		)

Also applies to: 371-413

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@downstreamadapter/dispatchermanager/helper.go` around lines 294 - 306, When a Remove request targets a missing dispatcher in the code around lines 384-413, the existing operator entry is not being deleted from dispatcherManager.currentOperatorMap. This causes subsequent Create retries to be dropped by the check at lines 301-303 (which returns false when ScheduleAction_Create is detected), leaving the dispatcher stuck in an inconsistent state. In the Remove request handling path when the dispatcher is absent, add a call to dispatcherManager.currentOperatorMap.Delete(dispatcherID) to clear the existing operator entry before proceeding, similar to the deletion already done in the IsMaintainerRequestAllowed check at line 298.
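
A minimal sketch of why the extra `Delete` matters, assuming a heavily simplified model: `registry`, `admit`, and the `action` constants are hypothetical, and the real request handling carries far more state. In this model a Create is recorded as pending but the dispatcher never materializes, which is the situation the finding describes.

```go
package main

import (
	"fmt"
	"sync"
)

type action int

const (
	actionCreate action = iota
	actionRemove
)

// registry is a hypothetical, simplified model of the dispatcher manager state.
type registry struct {
	pending     sync.Map        // dispatcherID -> action of the in-flight operator
	dispatchers map[string]bool // dispatchers that actually exist
}

// admit reports whether the request may proceed.
func (r *registry) admit(id string, a action) bool {
	if _, busy := r.pending.Load(id); busy && a == actionCreate {
		return false // creates are serialized per dispatcher ID
	}
	if a == actionRemove && !r.dispatchers[id] {
		// Nothing to remove: drop any leftover entry so that a later
		// Create retry is not rejected by the check above.
		r.pending.Delete(id)
		return true
	}
	r.pending.Store(id, a)
	return true
}

func main() {
	r := &registry{dispatchers: map[string]bool{}}
	fmt.Println(r.admit("d1", actionCreate)) // true: first create is recorded
	fmt.Println(r.admit("d1", actionCreate)) // false: a create is already pending
	fmt.Println(r.admit("d1", actionRemove)) // true: dispatcher absent, entry cleared
	fmt.Println(r.admit("d1", actionCreate)) // true: the retry is admitted again
}
```

Without the `Delete` in the absent-dispatcher branch, the last call would return `false` and the dispatcher would stay stuck behind the leftover Create entry.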

coderabbitai · 2026-06-24T02:06:29Z

+func (m *Maintainer) isMaintainerEpochResponseAllowed(responseEpoch uint64) bool {
+	return common.MaintainerEpochMatches(responseEpoch, m.currentMaintainerEpoch())
+}


🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Reject epoch-0 responses once the maintainer is strict.

common.MaintainerEpochMatches accepts responseEpoch == 0, so a maintainer with a non-zero current epoch can still accept unfenced bootstrap/post-bootstrap/close responses. That leaves the stale-response path open after strict epochs are in use; epoch 0 should only be accepted while this maintainer is still in compatibility mode.

Proposed fix

 func (m *Maintainer) isMaintainerEpochResponseAllowed(responseEpoch uint64) bool {
-	return common.MaintainerEpochMatches(responseEpoch, m.currentMaintainerEpoch())
+	currentEpoch := m.currentMaintainerEpoch()
+	if responseEpoch == 0 {
+		return currentEpoch == 0
+	}
+	return currentEpoch == 0 || responseEpoch == currentEpoch
 }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-func (m *Maintainer) isMaintainerEpochResponseAllowed(responseEpoch uint64) bool {
-	return common.MaintainerEpochMatches(responseEpoch, m.currentMaintainerEpoch())
-}
+func (m *Maintainer) isMaintainerEpochResponseAllowed(responseEpoch uint64) bool {
+	currentEpoch := m.currentMaintainerEpoch()
+	if responseEpoch == 0 {
+		return currentEpoch == 0
+	}
+	return currentEpoch == 0 || responseEpoch == currentEpoch
+}

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@maintainer/maintainer.go` around lines 1028 - 1030, The isMaintainerEpochResponseAllowed method in the Maintainer struct needs to be updated to reject epoch-0 responses once the maintainer enters strict mode. Currently, common.MaintainerEpochMatches accepts responseEpoch == 0 regardless of the currentMaintainerEpoch value, which leaves the stale-response path open after strict epochs are in use. Modify isMaintainerEpochResponseAllowed to reject responses with epoch 0 when the currentMaintainerEpoch is non-zero, allowing epoch-0 responses only while the maintainer is still in compatibility mode (when currentMaintainerEpoch is 0).
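
The proposed rule can be checked in isolation. The sketch below lifts the suggested logic into a free function, `epochResponseAllowed`, which is a hypothetical name used only for illustration; the real method reads the current epoch from the maintainer.

```go
package main

import "fmt"

// epochResponseAllowed applies the rule from the proposed fix:
// epoch 0 is accepted only while the receiver itself is still at epoch 0
// (compatibility mode); otherwise the epochs must match exactly.
func epochResponseAllowed(responseEpoch, currentEpoch uint64) bool {
	if responseEpoch == 0 {
		return currentEpoch == 0
	}
	return currentEpoch == 0 || responseEpoch == currentEpoch
}

func main() {
	cases := []struct{ response, current uint64 }{
		{0, 0}, // both unfenced: allowed
		{0, 2}, // unfenced response to a strict maintainer: rejected
		{2, 2}, // matching epochs: allowed
		{1, 2}, // stale epoch: rejected
		{2, 0}, // stamped response to a compatibility-mode maintainer: allowed
	}
	for _, c := range cases {
		fmt.Printf("response=%d current=%d allowed=%v\n",
			c.response, c.current, epochResponseAllowed(c.response, c.current))
	}
}
```

The second case is the one the review targets: with the proposed rule, a strict maintainer no longer accepts an epoch-0 response.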

coderabbitai · 2026-06-24T02:06:30Z

+					op := operator.NewSplitDispatcherOperator(
+						s.spanController,
+						checkResult.SplitSpan,
+						splitSpans,
+						checkResult.SplitTargetNodes,
+						s.operatorController.MaintainerEpoch(),
+						func(span *replica.SpanReplication, node node.ID) bool {
+							return s.operatorController.AddOperator(operator.NewAddDispatcherOperator(
+								s.spanController,
+								span,
+								node,
+								heartbeatpb.OperatorType_O_Split,
+								s.operatorController.MaintainerEpoch(),
+							))
+						},


🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Capture one epoch snapshot for split and child add operators.

Line 140 and Line 147 read MaintainerEpoch() at different times. If the epoch advances between split creation and postFinish, child add operators can be stamped with a different epoch than the split operator that produced them.

Suggested fix

 	if len(splitSpans) > 1 {
+		maintainerEpoch := s.operatorController.MaintainerEpoch()
 		op := operator.NewSplitDispatcherOperator(
 			s.spanController,
 			checkResult.SplitSpan,
 			splitSpans,
 			checkResult.SplitTargetNodes,
-			s.operatorController.MaintainerEpoch(),
+			maintainerEpoch,
 			func(span *replica.SpanReplication, node node.ID) bool {
 				return s.operatorController.AddOperator(operator.NewAddDispatcherOperator(
 					s.spanController,
 					span,
 					node,
 					heartbeatpb.OperatorType_O_Split,
-					s.operatorController.MaintainerEpoch(),
+					maintainerEpoch,
 				))
 			},
 		)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@maintainer/scheduler/balance_splits.go` around lines 135 - 149, The MaintainerEpoch() method is being called at two different times: once when creating the NewSplitDispatcherOperator and again when creating the NewAddDispatcherOperator within the callback. If the epoch advances between these calls, the split and child add operators will be stamped with different epochs. Capture the epoch value once before creating the NewSplitDispatcherOperator by storing s.operatorController.MaintainerEpoch() in a local variable, then reuse that same variable in both the NewSplitDispatcherOperator constructor and the NewAddDispatcherOperator constructor inside the callback function.
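
The difference between reading the epoch inside the callback and capturing it once can be shown with a small sketch. `controller` and `newSplit` are hypothetical stand-ins, not the real operator controller or split operator; the callback simply reports the epoch a child add operator would be stamped with when it runs.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// controller is a hypothetical stand-in for the operator controller.
type controller struct{ epoch atomic.Uint64 }

func (c *controller) MaintainerEpoch() uint64 { return c.epoch.Load() }

// newSplit returns the epoch stamped on the split operator and a callback that
// reports the epoch a child add operator would carry when it runs later.
func newSplit(c *controller, snapshot bool) (uint64, func() uint64) {
	if snapshot {
		e := c.MaintainerEpoch() // read once, shared by split and children
		return e, func() uint64 { return e }
	}
	// Re-reads the epoch whenever the callback fires.
	return c.MaintainerEpoch(), func() uint64 { return c.MaintainerEpoch() }
}

func main() {
	c := &controller{}
	c.epoch.Store(1)
	splitA, childA := newSplit(c, false)
	splitB, childB := newSplit(c, true)
	c.epoch.Store(2)              // epoch advances before the callbacks run
	fmt.Println(splitA, childA()) // 1 2: split and child disagree
	fmt.Println(splitB, childB()) // 1 1: snapshot keeps them consistent
}
```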

hongyunyan added 2 commits June 17, 2026 22:18

coordinator: persist maintainer epochs before ownership changes

57df930

maintainer,dispatcher: fence stale maintainer epochs

38669c8

ti-chi-bot Bot added release-note-none Denotes a PR that doesn't merit a release note. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. labels Jun 18, 2026

ti-chi-bot Bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Jun 18, 2026

This was referenced Jun 18, 2026

downstreamadapter: harden table trigger takeover #5436

Open

coordinator: persist maintainer epochs before ownership changes #5434

Merged

gemini-code-assist Bot reviewed Jun 18, 2026


hongyunyan added 7 commits June 22, 2026 14:57

coordinator: address resume review feedback

0d8b952

Merge branch 'codex/pr-5182-epoch-persistence' into codex/pr-5182-stale-maintainer-fence

a4f7d30

coordinator: refresh changefeed backend mock

06a6788

Merge branch 'codex/pr-5182-epoch-persistence' into codex/pr-5182-stale-maintainer-fence

7366373

pkg: use canonical errors import

cd9e0b5

Merge branch 'codex/pr-5182-epoch-persistence' into codex/pr-5182-stale-maintainer-fence

f10352d

Merge remote-tracking branch 'upstream/master' into codex/pr-5182-stale-maintainer-fence

45f36e3

hongyunyan changed the base branch from codex/pr-5182-epoch-persistence to master June 24, 2026 01:51

hongyunyan marked this pull request as ready for review June 24, 2026 01:52

ti-chi-bot Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 24, 2026

coderabbitai Bot reviewed Jun 24, 2026


-func (m *Maintainer) isMaintainerEpochResponseAllowed(responseEpoch uint64) bool {
-	return common.MaintainerEpochMatches(responseEpoch, m.currentMaintainerEpoch())
-}
+func (m *Maintainer) isMaintainerEpochResponseAllowed(responseEpoch uint64) bool {
+	currentEpoch := m.currentMaintainerEpoch()
+	if responseEpoch == 0 {
+		return currentEpoch == 0
+	}
+	return currentEpoch == 0 || responseEpoch == currentEpoch
+}
