[DNM] Branch for CSE testing by pingyu · Pull Request #5367 · pingcap/ticdc

pingyu · 2026-06-12T09:06:21Z

What problem does this PR solve?

Issue Number: close #xxx

What is changed and how it works?

Check List

Tests

Unit test
Integration test
Manual test (add detailed scripts or steps below)
No code

Questions

Will it cause performance regression or break compatibility?

Do you need to update user documentation, design documentation or monitoring documentation?

Release note

Please refer to [Release Notes Language Style Guide](https://pingcap.github.io/tidb-dev-guide/contribute-to-tidb/release-notes-style-guide.html) to write a quality release note.

If you don't think this PR needs a release note then fill it with `None`.

Summary by CodeRabbit

Release Notes

New Features
- Introduced a DML barrier mechanism for MySQL sinks to enhance synchronization of database operations
- Improved DML event handling in the sink pipeline
Documentation
- Added design documentation for the DML barrier behavior and implementation strategy
Tests
- Added comprehensive test coverage for barrier functionality, error handling, and DML cache behavior

Signed-off-by: Ping Yu <yuping@pingcap.com>

ti-chi-bot · 2026-06-12T09:06:29Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign flowbehappy for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai · 2026-06-12T09:06:35Z

📝 Walkthrough

Walkthrough

This PR implements a broadcast barrier mechanism in the MySQL sink causality detector to ensure FlushDMLBeforeBlock waits for all prior enqueued DMLs to be flushed downstream. The implementation adds removal-only fence nodes, in-band writer-queue barrier tokens, and sink-level barrier lifecycle management with comprehensive test coverage.

Changes

MySQL Sink DML Barrier Implementation

Layer / File(s)	Summary
Design and Type Contracts `design.md`, `downstreamadapter/sink/mysql/causality/txn_cache.go`, `downstreamadapter/sink/mysql/causality/helper_test.go`	Design document defines the full barrier protocol. `Barrier` interface and `WriterItem` union type (wrapping `DMLEvent` or `Barrier`) are introduced with constructors `NewDMLItem` and `NewBarrierItem`. Test helper `testChangefeedID()` supports barrier tests.
Channel Synchronization `utils/chann/unlimited_chann.go`	`UnlimitedChannel.PushIfNotClosed` method enables safe enqueue of barrier tokens when the channel is open, returning `false` on close.
Node Removal-Only Fence Resolution `downstreamadapter/sink/mysql/causality/node.go`	`Node` gains `resolveByRemovalOnly` flag to prevent dependency resolution assignment for fence nodes. When set, dependency assignment is skipped, depender notifications are suppressed, and `tryResolve` short-circuits to prevent cache assignment.
Slot Operations for Barrier Support `downstreamadapter/sink/mysql/causality/slot.go`	`AddWithDependencies` method injects extra dependencies (e.g., from active fences) into conflict detection. `SnapshotTailNodes` atomically captures all tail nodes across slots for barrier broadcast dependency collection.
Causality Detector Barrier Orchestration `downstreamadapter/sink/mysql/causality/conflict_detector.go`	`ConflictDetector` adds `admissionMu` and `activeFence` to serialize DML admission with fence state. `Add` locks admission, uses `AddWithDependencies`, and wires node notifications. Refactored `BroadcastBarrier` installs a removal-only fence node, snapshots dependencies, broadcasts `BarrierItem` tokens to all queues via `forceAdd`, and clears the fence on completion. `GetOutChByCacheID` returns `WriterItem` channels.
Transaction Cache Writer Items `downstreamadapter/sink/mysql/causality/txn_cache.go`	`txnCache` interface and implementations refactored to handle `WriterItem` instead of raw DML events. Cache gains `forceAdd` method to enqueue barrier tokens even when blocked, using `PushIfNotClosed` for safe insertion. Channels parameterized on `WriterItem`.
Sink DML Barrier Lifecycle `downstreamadapter/sink/mysql/sink.go`	Implements `dmlBarrier` type for DML coordination with mutex-protected error state, completion channel, and deferred done callbacks. `runDMLWriter` refactored to buffer `WriterItem` entries, flush pending DMLs before acking barriers, and fail barriers on context cancellation. `FlushDMLBeforeBlock` creates, broadcasts, and waits for barrier completion.
Barrier Test Coverage `downstreamadapter/sink/mysql/causality/barrier_test.go`, `downstreamadapter/sink/mysql/sink_test.go`	Tests for barrier broadcast enqueue, detector closure handling, removal-only fence semantics, `forceAdd` bypass/closure behavior, and throughput benchmark. Sink tests verify `FlushDMLBeforeBlock` error propagation, post-flush blocking, barrier idempotency, and close-induced unblocking.
Supporting Infrastructure `pkg/sink/mysql/mysql_writer.go`, `deployments/Dockerfile`	DML writer adds failpoint injection before `PostFlush` invocation. Dockerfile sets `LEGACY_SAFEPOINT=1` when `NEXT_GEN` build is enabled.
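
To make the in-band barrier idea in the table above concrete, here is a minimal sketch of a writer loop that carries DMLs and barrier tokens on the same queue. The names `writerItem`, `ackBarrier`, and `runWriter` are hypothetical stand-ins, not the PR's actual `WriterItem` or `runDMLWriter`; the only point is the ordering guarantee, namely that DMLs buffered ahead of a barrier are flushed before the barrier is acknowledged.

```go
package main

import "fmt"

// writerItem is a simplified, hypothetical stand-in for the PR's WriterItem:
// exactly one of dml or ackBarrier is set.
type writerItem struct {
	dml        string
	ackBarrier func() // non-nil marks a barrier token
}

// runWriter drains the queue in order. DMLs are buffered; a barrier token
// forces the buffered DMLs to be flushed before the barrier is acknowledged.
func runWriter(in <-chan writerItem, flush func([]string)) {
	var pending []string
	for it := range in {
		if it.ackBarrier == nil {
			pending = append(pending, it.dml)
			continue
		}
		if len(pending) > 0 {
			flush(pending)
			pending = nil
		}
		it.ackBarrier()
	}
	// Flush whatever is left once the queue is closed.
	if len(pending) > 0 {
		flush(pending)
	}
}

// demo enqueues two DMLs, a barrier, and one more DML, then records the
// order in which the writer flushes and acknowledges.
func demo() []string {
	var log []string
	in := make(chan writerItem, 4)
	in <- writerItem{dml: "insert-1"}
	in <- writerItem{dml: "insert-2"}
	in <- writerItem{ackBarrier: func() { log = append(log, "barrier acked") }}
	in <- writerItem{dml: "insert-3"}
	close(in)
	runWriter(in, func(batch []string) {
		log = append(log, fmt.Sprintf("flush %v", batch))
	})
	return log
}

func main() {
	for _, line := range demo() {
		fmt.Println(line)
	}
}
```

In this sketch the barrier is acknowledged only after `insert-1` and `insert-2` are flushed, and `insert-3` is flushed afterwards in its own batch.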

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested labels

lgtm, size/XXL, release-note-none

Suggested reviewers

wk989898
asddongmen
hongyunyan

🐰 A barrier in the queue, DMLs flush through,
No PostFlush per item, just when needed—it's true!
Removal-only fences keep order so tight,
While WriterItems carry the flow just right.
hops The sink now waits before the block takes flight! ✨

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name	Status	Explanation	Resolution
Description check	⚠️ Warning	The PR description is largely incomplete, containing only placeholder template text with no actual implementation details filled in. Critical sections like 'What is changed and how it works?' are empty, and the 'Issue Number' field is not properly populated.	Complete the PR description by filling in the issue number, explaining the DML barrier mechanism changes, listing the affected files, confirming tests are included, and providing a release note or marking as 'None'.
Title check	❓ Inconclusive	The title '[DNM] Branch for CSE testing' is vague and does not clearly describe the main changes. It uses a non-descriptive term and appears to be a temporary or work-in-progress label rather than a meaningful summary of the changeset.	Replace with a descriptive title that summarizes the main changes, such as 'Add MySQL sink DML barrier mechanism for causality ordering' or similar that reflects the actual implementation in the changeset.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

gemini-code-assist

Code Review

This pull request implements a DML barrier design for the MySQL sink, allowing block events (such as syncpoints) to wait until all prior enqueued DMLs are flushed downstream. It introduces a WriterItem struct to carry either DML events or barrier tokens, updates the ConflictDetector to broadcast barriers using a removal-only fence, and adapts the writer loop to process these items sequentially. The review feedback highlights critical improvements: fixing a potential deadlock in newDMLBarrier when workerCount is zero, using a sync.RWMutex instead of a standard Mutex to avoid serializing the DML ingestion hot path in ConflictDetector.Add, and utilizing Go's clear() built-in to prevent memory leaks when resetting pointer-holding buffers.

gemini-code-assist · 2026-06-12T09:08:23Z

+func newDMLBarrier(workerCount int) *dmlBarrier {
+	barrier := &dmlBarrier{
+		done:      make(chan struct{}),
+		remaining: workerCount,
+		acked:     make([]bool, workerCount),
+	}
+	if workerCount == 0 {
+		barrier.Fail(errors.ErrMySQLTxnError.GenWithStackByArgs("mysql DML barrier has no writers"))
+	}
+	return barrier
+}


If workerCount is 0, barrier.Fail is called when remaining is already 0. In Fail, the check if b.err != nil || b.remaining == 0 will trigger an early return, meaning b.err is never set and b.done is never closed. This causes Wait() to block forever (deadlock). Directly initialize the failed state when workerCount == 0 to avoid this.

Suggested change

 func newDMLBarrier(workerCount int) *dmlBarrier {
 	barrier := &dmlBarrier{
 		done:      make(chan struct{}),
 		remaining: workerCount,
 		acked:     make([]bool, workerCount),
 	}
 	if workerCount == 0 {
-		barrier.Fail(errors.ErrMySQLTxnError.GenWithStackByArgs("mysql DML barrier has no writers"))
+		barrier.err = errors.ErrMySQLTxnError.GenWithStackByArgs("mysql DML barrier has no writers")
+		close(barrier.done)
 	}
 	return barrier
 }
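
The early-return problem can be reproduced in isolation. The sketch below is a simplified model, assuming `Fail` has the guard described in the comment above (`b.err != nil || b.remaining == 0`); the `barrier` type, `newBarrier`, and `isDone` are hypothetical and are not the PR's `dmlBarrier`. It compares a constructor that routes the zero-worker case through `Fail` with one that sets the failed state directly.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// barrier is a hypothetical, reduced model of a DML barrier.
type barrier struct {
	mu        sync.Mutex
	err       error
	remaining int
	done      chan struct{}
}

// Fail ignores barriers that have already failed or already completed,
// mirroring the guard described in the review comment.
func (b *barrier) Fail(err error) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.err != nil || b.remaining == 0 {
		return
	}
	b.err = err
	close(b.done)
}

// newBarrier builds a barrier. For zero workers it either calls Fail
// (viaFail == true) or initializes the failed state directly.
func newBarrier(workers int, viaFail bool) *barrier {
	b := &barrier{remaining: workers, done: make(chan struct{})}
	if workers == 0 {
		err := errors.New("barrier has no writers")
		if viaFail {
			b.Fail(err) // returns early: remaining is already 0
		} else {
			b.err = err
			close(b.done)
		}
	}
	return b
}

// isDone reports whether a waiter on b.done would be released.
func isDone(b *barrier) bool {
	select {
	case <-b.done:
		return true
	default:
		return false
	}
}

func main() {
	fmt.Println(isDone(newBarrier(0, true)))  // false: a waiter would block forever
	fmt.Println(isDone(newBarrier(0, false))) // true: a waiter is released with err set
}
```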

gemini-code-assist · 2026-06-12T09:08:24Z

 	changefeedID                 common.ChangeFeedID
 	metricConflictDetectDuration prometheus.Observer
+
+	admissionMu sync.Mutex


Using a standard sync.Mutex for admissionMu serializes all concurrent Add calls on the DML ingestion hot path. Since barriers are rare, changing this to sync.RWMutex allows concurrent Add calls to proceed in parallel under a read lock, significantly improving throughput.

Suggested change

-	admissionMu sync.Mutex
+	admissionMu sync.RWMutex

gemini-code-assist · 2026-06-12T09:08:24Z

+	d.admissionMu.Lock()
+	defer d.admissionMu.Unlock()


Acquire a read lock (RLock) instead of a write lock (Lock) to allow concurrent Add calls to run in parallel. This avoids serializing the DML ingestion hot path.

Suggested change

-	d.admissionMu.Lock()
-	defer d.admissionMu.Unlock()
+	d.admissionMu.RLock()
+	defer d.admissionMu.RUnlock()
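
A rough sketch of the read/write split suggested above, under the assumption that everything `Add` does while holding the read lock is safe for concurrent callers on its own (here an atomic counter stands in for that work). The `detector` type, `setFence`, and `admitConcurrently` are hypothetical and only illustrate the locking pattern, not the real `ConflictDetector`.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// detector is a hypothetical model: many Add callers, rare fence updates.
type detector struct {
	admissionMu sync.RWMutex
	fenceActive bool         // written only under the write lock
	admitted    atomic.Int64 // work done under RLock must be concurrency-safe
}

// Add takes the read lock, so concurrent Add calls do not serialize
// against each other, only against setFence.
func (d *detector) Add() bool {
	d.admissionMu.RLock()
	defer d.admissionMu.RUnlock()
	d.admitted.Add(1)
	return d.fenceActive
}

// setFence takes the write lock, excluding every in-flight Add.
func (d *detector) setFence(active bool) {
	d.admissionMu.Lock()
	defer d.admissionMu.Unlock()
	d.fenceActive = active
}

// admitConcurrently runs n Add calls in parallel with a fence toggle and
// returns how many were admitted.
func admitConcurrently(n int) int64 {
	d := &detector{}
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			d.Add()
		}()
	}
	d.setFence(true)
	d.setFence(false)
	wg.Wait()
	return d.admitted.Load()
}

func main() {
	fmt.Println(admitConcurrently(100)) // 100
}
```

If the code under the read lock mutates shared state that is not otherwise synchronized, switching to `RLock` would introduce a data race, so that needs checking before applying the suggestion.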

gemini-code-assist · 2026-06-12T09:08:24Z

+			if err := flushDMLs(dmlBuffer, rowCount); err != nil {
+				s.failOutstandingBarriers(err)
+				return err
 			}
+			dmlBuffer = dmlBuffer[:0]
 			workerBatchFlushDuration.Observe(time.Since(start).Seconds())

 			// we record total time to calculate the worker busy ratio.
 			// so we record the total time after flushing, to unified statistics on
 			// flush time and total time
 			workerTotalDuration.Observe(time.Since(totalStart).Seconds())
 			totalStart = time.Now()
-			buffer = buffer[:0]
+			itemBuffer = itemBuffer[:0]


Resetting dmlBuffer and itemBuffer using [:0] does not clear the underlying array elements. Since these slices hold pointers (*commonEvent.DMLEvent and Barrier), the referenced objects will not be garbage collected while the goroutine is blocked waiting for new items. Use Go's built-in clear() to nil out the elements before resetting the slice length.

			if err := flushDMLs(dmlBuffer, rowCount); err != nil {
				s.failOutstandingBarriers(err)
				return err
			}
			clear(dmlBuffer)
			dmlBuffer = dmlBuffer[:0]
			workerBatchFlushDuration.Observe(time.Since(start).Seconds())

			// we record total time to calculate the worker busy ratio.
			// so we record the total time after flushing, to unified statistics on
			// flush time and total time
			workerTotalDuration.Observe(time.Since(totalStart).Seconds())
			totalStart = time.Now()
			clear(itemBuffer)
			itemBuffer = itemBuffer[:0]
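
The difference between truncating and clearing is easy to check on a small pointer slice. This is a standalone illustration with a hypothetical `event` type and helper names; it requires Go 1.21 or later for the `clear` built-in.

```go
package main

import "fmt"

// event is a hypothetical payload type held by pointer in the buffer.
type event struct{ id int }

// truncateOnly resets the length but leaves the pointers in the backing array.
func truncateOnly(buf []*event) []*event {
	return buf[:0]
}

// clearAndTruncate nils out the live elements first, then resets the length.
func clearAndTruncate(buf []*event) []*event {
	clear(buf)
	return buf[:0]
}

// liveRefs counts non-nil pointers still reachable through the backing array.
func liveRefs(buf []*event) int {
	n := 0
	for _, e := range buf[:cap(buf)] {
		if e != nil {
			n++
		}
	}
	return n
}

func fill() []*event {
	buf := make([]*event, 0, 4)
	return append(buf, &event{id: 1}, &event{id: 2})
}

func main() {
	fmt.Println(liveRefs(truncateOnly(fill())))     // 2: events stay reachable
	fmt.Println(liveRefs(clearAndTruncate(fill()))) // 0: events can be collected
}
```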

ti-chi-bot · 2026-06-12T09:09:19Z

[FORMAT CHECKER NOTIFICATION]

Notice: To remove the do-not-merge/needs-linked-issue label, please provide the linked issue number on one line in the PR body, for example: Issue Number: close #123 or Issue Number: ref #456.

📖 For more info, you can check the "Contribute Code" section in the development guide.

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@downstreamadapter/sink/mysql/causality/barrier_test.go`:
- Around line 155-162: The test currently calls cache.forceAdd directly and then
sets barrier.Fail itself, masking the production behavior; instead, locate and
invoke the real production caller that calls forceAdd (search for usages of
forceAdd) — call that higher-level method with NewBarrierItem(barrier) after
cache.out().Close(), then assert that barrier.err was set by that production
path (verify barrier.err is non-nil) rather than calling barrier.Fail manually;
keep the test focused and deterministic (use newTestBarrier, forceAdd caller,
and require.Error on barrier.err).

In `@pkg/sink/mysql/mysql_writer.go`:
- Around line 252-254: The failpoint Inject call is inside the per-event loop so
it runs once per event; move failpoint.Inject("MySQLSinkDelayDMLPostFlush", nil)
to just before the for _, event := range events { loop (the block that calls
event.PostFlush()) so the delay is injected once per flush instead of N times;
ensure you keep the event.PostFlush() calls unchanged and only relocate the
inject call.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 964400f9-9eda-461e-afff-d48fe3694f45

📥 Commits

Reviewing files that changed from the base of the PR and between 15a8703 and 05931f9.

📒 Files selected for processing (12)

deployments/Dockerfile
design.md
downstreamadapter/sink/mysql/causality/barrier_test.go
downstreamadapter/sink/mysql/causality/conflict_detector.go
downstreamadapter/sink/mysql/causality/helper_test.go
downstreamadapter/sink/mysql/causality/node.go
downstreamadapter/sink/mysql/causality/slot.go
downstreamadapter/sink/mysql/causality/txn_cache.go
downstreamadapter/sink/mysql/sink.go
downstreamadapter/sink/mysql/sink_test.go
pkg/sink/mysql/mysql_writer.go
utils/chann/unlimited_chann.go

coderabbitai · 2026-06-12T09:13:43Z

+func TestTxnCacheForceAddFailsWhenClosed(t *testing.T) {
+	cache := newTxnCache(TxnCacheOption{Count: 1, Size: 1, BlockStrategy: BlockStrategyWaitEmpty})
+	cache.out().Close()
+
+	barrier := newTestBarrier(1)
+	require.False(t, cache.forceAdd(NewBarrierItem(barrier)))
+	barrier.Fail(errors.New("closed"))
+	require.Error(t, barrier.err)


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

This test masks the barrier-failure path.

Lines 160-162 only prove that forceAdd returns false; the test then sets the barrier error itself. That means it still passes if the production close path forgets to propagate the failure to the barrier, so the regression you care about remains untested. Please drive the failure through the real caller that reacts to forceAdd == false and assert on that outcome instead.

As per coding guidelines, **/*_test.go: Prefer focused deterministic tests; see docs/agents/testing.md before adding or changing tests.

Source: Coding guidelines
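
One possible shape for such a test, sketched with hypothetical types: `queue.forceAdd` stands in for the cache method, and `broadcast` stands in for whichever production caller reacts to a `false` return. The actual caller in the PR is not shown in this thread, so treat the names and the error message as assumptions. The key property is that the assertion checks an error set by the caller, never by the test itself.

```go
package main

import (
	"errors"
	"fmt"
)

// barrier is a hypothetical barrier that records the first failure.
type barrier struct{ err error }

func (b *barrier) Fail(err error) {
	if b.err == nil {
		b.err = err
	}
}

// queue is a hypothetical writer queue; forceAdd refuses items once closed.
type queue struct {
	closed bool
	items  []*barrier
}

func (q *queue) forceAdd(b *barrier) bool {
	if q.closed {
		return false
	}
	q.items = append(q.items, b)
	return true
}

// broadcast is the hypothetical production caller: it is responsible for
// failing the barrier when any queue rejects the token.
func broadcast(queues []*queue, b *barrier) {
	for _, q := range queues {
		if !q.forceAdd(b) {
			b.Fail(errors.New("writer queue closed"))
			return
		}
	}
}

func main() {
	b := &barrier{}
	broadcast([]*queue{{closed: true}}, b)
	// The check below would fail if broadcast forgot to propagate the error.
	fmt.Println(b.err != nil) // true
}
```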

coderabbitai · 2026-06-12T09:13:43Z

 	for _, event := range events {
+		failpoint.Inject("MySQLSinkDelayDMLPostFlush", nil)
 		event.PostFlush()


⚠️ Potential issue | 🟡 Minor

Inject MySQLSinkDelayDMLPostFlush once per flush, not once per event

failpoint.Inject("MySQLSinkDelayDMLPostFlush", nil) is inside the for _, event := range events loop, so a batch of N DML events triggers the failpoint N times; move the inject before the loop to gate the post-flush phase once per flush.

Suggested change

-	for _, event := range events {
-		failpoint.Inject("MySQLSinkDelayDMLPostFlush", nil)
+	failpoint.Inject("MySQLSinkDelayDMLPostFlush", nil)
+	for _, event := range events {
 		event.PostFlush()
 	}
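
The effect of the relocation can be counted with a plain callback in place of the failpoint. The sketch below does not use the `failpoint` package; `inject` is a hypothetical hook and the `event` type is a placeholder, so it only demonstrates the per-event versus per-flush call count.

```go
package main

import "fmt"

// event is a placeholder for a DML event with a PostFlush callback.
type event struct{ flushed bool }

func (e *event) PostFlush() { e.flushed = true }

// postFlushPerEvent calls the hook inside the loop, once per event.
func postFlushPerEvent(events []*event, inject func()) {
	for _, e := range events {
		inject()
		e.PostFlush()
	}
}

// postFlushPerBatch calls the hook once, before the loop.
func postFlushPerBatch(events []*event, inject func()) {
	inject()
	for _, e := range events {
		e.PostFlush()
	}
}

// countInjections runs f over n fresh events and returns the hook call count.
func countInjections(n int, f func([]*event, func())) int {
	events := make([]*event, n)
	for i := range events {
		events[i] = &event{}
	}
	calls := 0
	f(events, func() { calls++ })
	return calls
}

func main() {
	fmt.Println(countInjections(3, postFlushPerEvent)) // 3
	fmt.Println(countInjections(3, postFlushPerBatch)) // 1
}
```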

pingyu added 3 commits June 12, 2026 15:30

simple version & design

cb2f859

Signed-off-by: Ping Yu <yuping@pingcap.com>

implement design

852df37

Signed-off-by: Ping Yu <yuping@pingcap.com>

add file

05931f9

Signed-off-by: Ping Yu <yuping@pingcap.com>

ti-chi-bot Bot added do-not-merge/needs-linked-issue release-note Denotes a PR that will be considered when it comes time to generate release notes. labels Jun 12, 2026

pingyu marked this pull request as draft June 12, 2026 09:06

ti-chi-bot Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 12, 2026

ti-chi-bot Bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Jun 12, 2026

gemini-code-assist Bot reviewed Jun 12, 2026

coderabbitai Bot reviewed Jun 12, 2026
