eventcollector: decouple dispatcher session from dispatcher stat #5007
lidezhu wants to merge 4 commits into pingcap:master from
Conversation
📝 Walkthrough
The dispatcher session is refactored to eliminate its dependency on dispatcher stat.
Changes: Dispatcher Session Dependency Injection
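For orientation, a minimal sketch of what the decoupled session looks like, based on the struct fields visible in the diffs further down in this review (the exact layout in the PR may differ): instead of holding a reference back to dispatcherStat, dispatcherSession receives the pieces it needs as injected values and callbacks.

```go
// Illustrative sketch only (package and import boilerplate omitted; the types
// come from the repo's dispatcher, messaging and node packages). The session
// keeps no pointer to dispatcherStat; everything it needs is injected.
type dispatcherSession struct {
	target         dispatcher.DispatcherService   // the dispatcher being served
	localServerID  node.ID                        // identity of this event collector node
	sendMessage    func(*messaging.TargetMessage) // how messages reach the event service
	nextResetEpoch func(resetTs uint64) uint64    // epoch allocation, still owned by dispatcherStat
	readyCallback  func()                         // invoked once the remote event service is ready
}
```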
Code Review
This pull request refactors the dispatcher management and heartbeat mechanism by introducing a dispatcherSession to handle event service interactions and implementing epoch-based progress tracking. Key changes include the addition of a background heartbeat sender in the EventCollector and updates to the wire format to include epoch information, ensuring stale progress updates are ignored. Review feedback identifies a critical missing length validation in the Unmarshal method for DispatcherProgressLegacy that could lead to panics, and notes a design concern regarding stale heartbeat responses in the session management logic.
```go
func (dp *DispatcherProgressLegacy) Unmarshal(data []byte) error {
	buf := bytes.NewBuffer(data)
	dp.DispatcherID.Unmarshal(buf.Next(dp.DispatcherID.GetSize()))
	dp.CheckpointTs = binary.BigEndian.Uint64(buf.Next(8))
```
The Unmarshal method does not validate the length of the input data before calling buf.Next(). If data is shorter than expected, buf.Next() will return a shorter slice, and binary.BigEndian.Uint64 will panic. This is a high-severity issue as it can lead to service crashes with malformed input.
The reviewer's suggested fix validates the input length up front:

```go
func (dp *DispatcherProgressLegacy) Unmarshal(data []byte) error {
	if len(data) < dp.DispatcherID.GetSize()+8 {
		return fmt.Errorf("data too short")
	}
	dp.DispatcherID.Unmarshal(data[:dp.DispatcherID.GetSize()])
	dp.CheckpointTs = binary.BigEndian.Uint64(data[dp.DispatcherID.GetSize() : dp.DispatcherID.GetSize()+8])
	return nil
}
```

The design concern about stale heartbeat responses is captured by this TODO in the session management logic:

```go
// TODO: this design is bad because we may receive stale heartbeat response,
// which make us call clear and register again. But the register may be ignore,
// so we will not receive any ready event.
```
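The epoch mechanism mentioned in the review summary is what makes such stale responses harmless. Below is a self-contained, hypothetical illustration (these names are not the PR's actual types): every reset bumps an epoch that is echoed back by the event service, and any progress stamped with an older epoch is dropped.

```go
// Hypothetical illustration of epoch-gated progress handling; the PR's real
// types and method names differ. A reset allocates a new epoch, and any
// progress update carrying an older epoch is ignored as stale.
type progressTracker struct {
	epoch        uint64 // bumped on every reset
	checkpointTs uint64 // highest progress accepted in the current epoch
}

// reset starts a new epoch from resetTs and returns the new epoch value,
// which would be sent along with the reset request to the event service.
func (t *progressTracker) reset(resetTs uint64) uint64 {
	t.epoch++
	t.checkpointTs = resetTs
	return t.epoch
}

// onProgress accepts a progress update only if it was produced in the
// current epoch; updates stamped with an older epoch are silently dropped.
func (t *progressTracker) onProgress(epoch, ts uint64) bool {
	if epoch != t.epoch {
		return false // stale: produced before the latest reset
	}
	if ts > t.checkpointTs {
		t.checkpointTs = ts
	}
	return true
}
```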
Actionable comments posted: 2
📒 Files selected for processing (3)
- downstreamadapter/eventcollector/dispatcher_session.go
- downstreamadapter/eventcollector/dispatcher_stat.go
- downstreamadapter/eventcollector/dispatcher_stat_test.go
```diff
 // commitReady is used to notify the event service to start sending events.
 func (s *dispatcherSession) commitReady(serverID node.ID) {
-	s.doReset(serverID, s.owner.getResetTs())
+	s.doReset(serverID, s.target.GetCheckpointTs())
 }

 // reset is used to reset the dispatcher to the specified commitTs,
 // it will remove the dispatcher from the dynamic stream and add it back.
 func (s *dispatcherSession) reset(serverID node.ID) {
-	s.doReset(serverID, s.owner.getResetTs())
+	s.doReset(serverID, s.target.GetCheckpointTs())
 }

 func (s *dispatcherSession) doReset(serverID node.ID, resetTs uint64) {
-	var epoch uint64
-	for {
-		currentState := s.owner.loadCurrentEpochState()
-		nextState := newDispatcherEpochState(currentState.epoch+1, 0, resetTs)
-		if s.owner.currentEpoch.CompareAndSwap(currentState, nextState) {
-			epoch = nextState.epoch
-			break
-		}
-	}
-	resetRequest := s.owner.newDispatcherResetRequest(
-		s.owner.eventCollector.getLocalServerID().String(),
+	epoch := s.nextResetEpoch(resetTs)
+	resetRequest := s.newDispatcherResetRequest(
+		s.localServerID.String(),
 		resetTs,
 		epoch,
 	)
 	msg := messaging.NewSingleTargetMessage(serverID, messaging.EventServiceTopic, resetRequest)
-	s.owner.eventCollector.enqueueMessageForSend(msg)
+	s.sendMessage(msg)
```
Clamp reset ts to collector-observed progress, not raw sink checkpoint.
Lines 162 and 168 use s.target.GetCheckpointTs() directly. That breaks the safety invariant documented in dispatcherEpochState.maxEventTs: old in-flight events can advance the sink checkpoint after a reset, even when the collector has not accepted that progress in the new epoch yet. If another commitReady/reset happens in that window, this code will start the next epoch too far ahead and can permanently skip data.
Possible fix shape
```diff
 type dispatcherSession struct {
 	target         dispatcher.DispatcherService
 	localServerID  node.ID
 	sendMessage    func(*messaging.TargetMessage)
+	getSafeResetTs func() uint64
 	nextResetEpoch func(resetTs uint64) uint64
 	readyCallback  func()
 }

-func newDispatcherSession(
+func newDispatcherSession(
 	target dispatcher.DispatcherService,
 	localServerID node.ID,
 	sendMessage func(*messaging.TargetMessage),
+	getSafeResetTs func() uint64,
 	nextResetEpoch func(resetTs uint64) uint64,
 	readyCallback func(),
 ) *dispatcherSession {
 	return &dispatcherSession{
 		target:         target,
 		localServerID:  localServerID,
 		sendMessage:    sendMessage,
+		getSafeResetTs: getSafeResetTs,
 		nextResetEpoch: nextResetEpoch,
 		readyCallback:  readyCallback,
 	}
 }

 func (s *dispatcherSession) commitReady(serverID node.ID) {
-	s.doReset(serverID, s.target.GetCheckpointTs())
+	s.doReset(serverID, s.getSafeResetTs())
 }

 func (s *dispatcherSession) reset(serverID node.ID) {
-	s.doReset(serverID, s.target.GetCheckpointTs())
+	s.doReset(serverID, s.getSafeResetTs())
 }
```

Wire getSafeResetTs from dispatcherStat using the same capped progress used for EventService heartbeats.
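One possible shape for that helper, assuming dispatcherStat already tracks per-epoch progress in dispatcherEpochState.maxEventTs (the exact clamping rule depends on how that field is maintained in this PR, so treat this as a sketch rather than the intended implementation):

```go
// Hypothetical sketch of the capped helper; field and method names beyond
// those mentioned in the comment above are assumptions. The reset ts starts
// from the sink checkpoint but never runs ahead of the progress the
// collector has actually observed in the current epoch.
func (d *dispatcherStat) getSafeResetTs() uint64 {
	resetTs := d.target.GetCheckpointTs()
	epochState := d.loadCurrentEpochState()
	if epochState.maxEventTs != 0 && epochState.maxEventTs < resetTs {
		// The sink checkpoint was advanced by in-flight events from an older
		// epoch; resetting from it could skip data, so clamp to observed progress.
		resetTs = epochState.maxEventTs
	}
	return resetTs
}
```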
```go
func (s *dispatcherSession) handleNotReusableEvent(event dispatcher.DispatcherEvent) {
	if *event.From == s.localServerID {
		log.Panic("should not happen: local event service should not send not reusable event")
	}
	candidate := s.connState.getNextRemoteCandidate()
	if candidate != "" {
		s.registerTo(candidate)
	}
```
Clear the failed remote binding when no replacement candidate exists.
If getNextRemoteCandidate() returns empty here, connState.eventServiceID still points at the rejected remote. After that, trySetRemoteCandidates() will refuse any later candidate list because the session still looks attached, so this dispatcher can get stuck until some unrelated clear path runs.
Minimal fix
```diff
 func (s *dispatcherSession) handleNotReusableEvent(event dispatcher.DispatcherEvent) {
 	if *event.From == s.localServerID {
 		log.Panic("should not happen: local event service should not send not reusable event")
 	}
 	candidate := s.connState.getNextRemoteCandidate()
-	if candidate != "" {
-		s.registerTo(candidate)
-	}
+	if candidate == "" {
+		s.connState.setEventServiceID("")
+		return
+	}
+	s.registerTo(candidate)
 }
```
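For context on why the lingering binding matters, here is a hypothetical sketch of the guard in trySetRemoteCandidates; only the method name and eventServiceID come from the comment above, while the mutex and remoteCandidates field are assumptions for illustration.

```go
import "sync"

// Hypothetical sketch of the connection state guard; not the PR's actual code.
type dispatcherConnState struct {
	mu               sync.Mutex
	eventServiceID   string   // non-empty while (apparently) attached to a remote event service
	remoteCandidates []string // remotes to try next, consumed by getNextRemoteCandidate
}

// trySetRemoteCandidates refuses a new candidate list while the session still
// looks attached, which is why handleNotReusableEvent must clear the failed
// binding when no replacement candidate exists.
func (c *dispatcherConnState) trySetRemoteCandidates(candidates []string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.eventServiceID != "" {
		return false // still bound: ignore the new candidate list
	}
	c.remoteCandidates = candidates
	return true
}
```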
What problem does this PR solve?
Issue Number: close #xxx
What is changed and how it works?
Check List
Tests
Questions
Will it cause performance regression or break compatibility?
Do you need to update user documentation, design documentation or monitoring documentation?
Release note
Summary by CodeRabbit
- Refactor
- Tests