Skip to content

downstreamadapter: harden table trigger takeover#5436

Open
hongyunyan wants to merge 5 commits into
pingcap:codex/pr-5182-stale-maintainer-fencefrom
hongyunyan:codex/pr-5182-trigger-takeover
Open

downstreamadapter: harden table trigger takeover#5436
hongyunyan wants to merge 5 commits into
pingcap:codex/pr-5182-stale-maintainer-fencefrom
hongyunyan:codex/pr-5182-trigger-takeover

Conversation

@hongyunyan

@hongyunyan hongyunyan commented Jun 18, 2026

Copy link
Copy Markdown
Collaborator

What problem does this PR solve?

Issue Number: close #5083

This is PR 3 of 3 split from #5182 and is stacked on PR 2.

Background:

PR 1 persists maintainer epochs before ownership changes. PR 2 fences stale
maintainer control messages at the maintainer and dispatcher-manager boundary.
This final PR hardens the table trigger takeover cases that still depend on
dispatcher-manager bootstrap recovery.

Motivation:

During a same-capture higher-epoch maintainer replacement, a reused dispatcher
manager can still have table trigger event or redo dispatchers from the previous
owner. Replacing a mismatched trigger in place can duplicate DDL ownership, while
dropping stale remove operators during bootstrap can lose the cleanup intent for
an already reported dispatcher.

What is changed and how it works?

This PR adds takeover-specific hardening:

  • Verifies that a reused dispatcher manager either has no table trigger
    dispatcher yet or already has the trigger dispatcher requested by the current
    bootstrap owner.
  • Creates missing event and redo table trigger dispatchers during bootstrap, but
    rejects mismatched trigger IDs instead of replacing them in place.
  • Restores stale remove operators in bootstrap responses only when the same
    bootstrap snapshot reports the dispatcher they remove.
  • Keeps current-epoch bootstrap operators while filtering stale create operators.
  • Adds unit coverage for stale remove restoration, bootstrap snapshot filtering,
    close acknowledgements, closed-epoch tombstones, and table trigger ID mismatch
    handling.
  • Updates the in-flight syncpoint scheduling integration case to search all CDC
    node logs because the merge task may run on any dispatcher-manager owner after
    scheduling.

Stack:

  • #5434: coordinator epoch persistence and owner scheduling foundation.
  • #5435: maintainer and dispatcher-manager receiver-side epoch fence.
  • #5436: table trigger takeover hardening and integration coverage.

Check List

Tests

  • Unit test
  • Integration test

Questions

Will it cause performance regression or break compatibility?

No expected performance regression. The added checks run during maintainer
bootstrap and bootstrap response reconstruction, not on the event write path.

The behavior is compatible with rolling upgrades because it builds on the epoch
0 compatibility rules introduced in the previous PRs.

Do you need to update user documentation, design documentation or monitoring documentation?

No.

Release note

Fix a bug where delayed stale maintainer requests could create duplicate dispatchers during maintainer failover.

Validation

  • make fmt
  • go test ./downstreamadapter/dispatcherorchestrator ./downstreamadapter/dispatchermanager

@ti-chi-bot

ti-chi-bot Bot commented Jun 18, 2026

Copy link
Copy Markdown

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@ti-chi-bot ti-chi-bot Bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note Denotes a PR that will be considered when it comes time to generate release notes. labels Jun 18, 2026
@coderabbitai

coderabbitai Bot commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 149c5849-378c-4250-a5f9-7fe342bb04b7

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@ti-chi-bot ti-chi-bot Bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Jun 18, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the bootstrap request handling in DispatcherOrchestrator by extracting helper functions to verify and initialize table trigger event and redo dispatchers, and updates operator retrieval to properly handle stale remove operators. It also adds extensive unit tests and updates an integration test to search logs across all CDC nodes. The review feedback highlights a critical mutex leak in handlePostBootstrapRequest due to a missing unlock when fenced, as well as a regression in the new helper functions which fail to gracefully handle write path closed errors during shutdown.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines 481 to 484
if m.fenced.Load() {
manager.MaintainerFenceMu.Unlock()
manager.LocalFence()
return nil
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

In handlePostBootstrapRequest, if m.fenced.Load() is true, the function returns early without unlocking manager.MaintainerFenceMu. This will cause a mutex leak and potential deadlocks during shutdown or subsequent requests. We should unlock manager.MaintainerFenceMu before calling manager.LocalFence() and returning.

Suggested change
if m.fenced.Load() {
manager.MaintainerFenceMu.Unlock()
manager.LocalFence()
return nil
}
if m.fenced.Load() {
manager.MaintainerFenceMu.Unlock()
manager.LocalFence()
return nil
}

Comment on lines +343 to +347
if err := manager.NewTableTriggerEventDispatcher(id, startTs, false); err != nil {
log.Error("failed to create table trigger event dispatcher",
zap.Stringer("changefeedID", cfId), zap.Error(err))
return err
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

If manager.NewTableTriggerEventDispatcher fails because the write path is closed, it is an expected state during shutdown or fencing. Logging it as an Error and returning the error to the maintainer is a regression from the original behavior (where it was logged as Info and returned nil). We should handle IsWritePathClosedError specially and return nil to allow clean shutdown.

		if err := manager.NewTableTriggerEventDispatcher(id, startTs, false); err != nil {
			if dispatchermanager.IsWritePathClosedError(err) {
				log.Info("dispatcher manager write path closed while creating table trigger event dispatcher",
					zap.Stringer("changefeedID", cfId), zap.Error(err))
				return nil
			}
			log.Error("failed to create table trigger event dispatcher",
				zap.Stringer("changefeedID", cfId), zap.Error(err))
			return err
		}

Comment on lines +375 to +379
if err := manager.NewTableTriggerRedoDispatcher(id, startTs, false); err != nil {
log.Error("failed to create table trigger redo dispatcher",
zap.Stringer("changefeedID", cfId), zap.Error(err))
return err
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Similarly, if manager.NewTableTriggerRedoDispatcher fails because the write path is closed, we should handle IsWritePathClosedError specially and return nil instead of logging an Error and returning the error to the maintainer.

		if err := manager.NewTableTriggerRedoDispatcher(id, startTs, false); err != nil {
			if dispatchermanager.IsWritePathClosedError(err) {
				log.Info("dispatcher manager write path closed while creating table trigger redo dispatcher",
					zap.Stringer("changefeedID", cfId), zap.Error(err))
				return nil
			}
			log.Error("failed to create table trigger redo dispatcher",
				zap.Stringer("changefeedID", cfId), zap.Error(err))
			return err
		}

@ti-chi-bot

ti-chi-bot Bot commented Jun 24, 2026

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign nongfushanquan for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@hongyunyan hongyunyan marked this pull request as ready for review June 24, 2026 01:52
@ti-chi-bot ti-chi-bot Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 24, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant