[DNM] Branch for CSE testing #5367

Draft
pingyu wants to merge 3 commits into pingcap:master from pingyu:mysql-sink-flush-dml

Conversation

@pingyu

@pingyu pingyu commented Jun 12, 2026

Contributor

What problem does this PR solve?

Issue Number: close #xxx

What is changed and how it works?

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Questions

Will it cause performance regression or break compatibility?
Do you need to update user documentation, design documentation or monitoring documentation?

Release note

Please refer to [Release Notes Language Style Guide](https://pingcap.github.io/tidb-dev-guide/contribute-to-tidb/release-notes-style-guide.html) to write a quality release note.

If you don't think this PR needs a release note then fill it with `None`.

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced a DML barrier mechanism for MySQL sinks to enhance synchronization of database operations
    • Improved DML event handling in the sink pipeline
  • Documentation

    • Added design documentation for the DML barrier behavior and implementation strategy
  • Tests

    • Added comprehensive test coverage for barrier functionality, error handling, and DML cache behavior

pingyu added 3 commits June 12, 2026 15:30
Signed-off-by: Ping Yu <yuping@pingcap.com>
Signed-off-by: Ping Yu <yuping@pingcap.com>
Signed-off-by: Ping Yu <yuping@pingcap.com>
@ti-chi-bot ti-chi-bot Bot added do-not-merge/needs-linked-issue release-note Denotes a PR that will be considered when it comes time to generate release notes. labels Jun 12, 2026
@ti-chi-bot

ti-chi-bot Bot commented Jun 12, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign flowbehappy for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@pingyu pingyu marked this pull request as draft June 12, 2026 09:06
@ti-chi-bot ti-chi-bot Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 12, 2026
@coderabbitai

coderabbitai Bot commented Jun 12, 2026

Contributor

Walkthrough

This PR implements a broadcast barrier mechanism in the MySQL sink causality detector to ensure FlushDMLBeforeBlock waits for all prior enqueued DMLs to be flushed downstream. The implementation adds removal-only fence nodes, in-band writer-queue barrier tokens, and sink-level barrier lifecycle management with comprehensive test coverage.

Changes

MySQL Sink DML Barrier Implementation

| Layer / File(s) | Summary |
|---|---|
| Design and Type Contracts: `design.md`, `downstreamadapter/sink/mysql/causality/txn_cache.go`, `downstreamadapter/sink/mysql/causality/helper_test.go` | Design document defines the full barrier protocol. Barrier interface and WriterItem union type (wrapping DMLEvent or Barrier) are introduced with constructors NewDMLItem and NewBarrierItem. Test helper testChangefeedID() supports barrier tests. |
| Channel Synchronization: `utils/chann/unlimited_chann.go` | UnlimitedChannel.PushIfNotClosed method enables safe enqueue of barrier tokens when the channel is open, returning false on close. |
| Node Removal-Only Fence Resolution: `downstreamadapter/sink/mysql/causality/node.go` | Node gains resolveByRemovalOnly flag to prevent dependency resolution assignment for fence nodes. When set, dependency assignment is skipped, depender notifications are suppressed, and tryResolve short-circuits to prevent cache assignment. |
| Slot Operations for Barrier Support: `downstreamadapter/sink/mysql/causality/slot.go` | AddWithDependencies method injects extra dependencies (e.g., from active fences) into conflict detection. SnapshotTailNodes atomically captures all tail nodes across slots for barrier broadcast dependency collection. |
| Causality Detector Barrier Orchestration: `downstreamadapter/sink/mysql/causality/conflict_detector.go` | ConflictDetector adds admissionMu and activeFence to serialize DML admission with fence state. Add locks admission, uses AddWithDependencies, and wires node notifications. Refactored BroadcastBarrier installs a removal-only fence node, snapshots dependencies, broadcasts BarrierItem tokens to all queues via forceAdd, and clears the fence on completion. GetOutChByCacheID returns WriterItem channels. |
| Transaction Cache Writer Items: `downstreamadapter/sink/mysql/causality/txn_cache.go` | txnCache interface and implementations refactored to handle WriterItem instead of raw DML events. Cache gains forceAdd method to enqueue barrier tokens even when blocked, using PushIfNotClosed for safe insertion. Channels parameterized on WriterItem. |
| Sink DML Barrier Lifecycle: `downstreamadapter/sink/mysql/sink.go` | Implements dmlBarrier type for DML coordination with mutex-protected error state, completion channel, and deferred done callbacks. runDMLWriter refactored to buffer WriterItem entries, flush pending DMLs before acking barriers, and fail barriers on context cancellation. FlushDMLBeforeBlock creates, broadcasts, and waits for barrier completion. |
| Barrier Test Coverage: `downstreamadapter/sink/mysql/causality/barrier_test.go`, `downstreamadapter/sink/mysql/sink_test.go` | Tests for barrier broadcast enqueue, detector closure handling, removal-only fence semantics, forceAdd bypass/closure behavior, and throughput benchmark. Sink tests verify FlushDMLBeforeBlock error propagation, post-flush blocking, barrier idempotency, and close-induced unblocking. |
| Supporting Infrastructure: `pkg/sink/mysql/mysql_writer.go`, `deployments/Dockerfile` | DML writer adds failpoint injection before PostFlush invocation. Dockerfile sets LEGACY_SAFEPOINT=1 when NEXT_GEN build is enabled. |
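To make the WriterItem idea in the table concrete, here is a minimal sketch of a union-style item and a writer loop that flushes buffered DMLs before acknowledging a barrier. Only the names `WriterItem`, `NewDMLItem`, `NewBarrierItem`, and `Barrier` come from the walkthrough; the field layout, the `Ack`/`Fail` method set, `countingBarrier`, and `drain` are assumptions made for illustration and are not taken from the PR.

```go
package main

import "fmt"

// DMLEvent is a simplified stand-in for the real DML event type (assumption).
type DMLEvent struct{ CommitTs uint64 }

// Barrier is a hypothetical minimal contract for a barrier token.
type Barrier interface {
	Ack(writerID int)
	Fail(err error)
}

// WriterItem carries exactly one of a DML event or a barrier token.
type WriterItem struct {
	dml     *DMLEvent
	barrier Barrier
}

func NewDMLItem(e *DMLEvent) WriterItem   { return WriterItem{dml: e} }
func NewBarrierItem(b Barrier) WriterItem { return WriterItem{barrier: b} }

// IsBarrier reports whether the item is a barrier token.
func (i WriterItem) IsBarrier() bool { return i.barrier != nil }

// countingBarrier records acknowledgements and the first failure.
type countingBarrier struct {
	acks int
	err  error
}

func (b *countingBarrier) Ack(int)        { b.acks++ }
func (b *countingBarrier) Fail(err error) { b.err = err }

// drain mimics one writer: it buffers DMLs and flushes everything buffered
// so far before acknowledging a barrier. DMLs that arrive after the last
// barrier stay pending and are not returned.
func drain(writerID int, items []WriterItem) (flushed []uint64) {
	var pending []*DMLEvent
	for _, it := range items {
		if it.IsBarrier() {
			for _, e := range pending {
				flushed = append(flushed, e.CommitTs)
			}
			pending = pending[:0]
			it.barrier.Ack(writerID)
			continue
		}
		pending = append(pending, it.dml)
	}
	return flushed
}

func main() {
	b := &countingBarrier{}
	items := []WriterItem{
		NewDMLItem(&DMLEvent{CommitTs: 1}),
		NewDMLItem(&DMLEvent{CommitTs: 2}),
		NewBarrierItem(b),
		NewDMLItem(&DMLEvent{CommitTs: 3}),
	}
	fmt.Println(drain(0, items), b.acks)
}
```

Because the barrier travels in the same queue as the DMLs, the acknowledgement is ordered after every DML enqueued ahead of it on that writer, which is the property the walkthrough attributes to the in-band tokens.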

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested labels

lgtm, size/XXL, release-note-none

Suggested reviewers

  • wk989898
  • asddongmen
  • hongyunyan

🐰 A barrier in the queue, DMLs flush through,
No PostFlush per item, just when needed—it's true!
Removal-only fences keep order so tight,
While WriterItems carry the flow just right.
hops The sink now waits before the block takes flight! ✨

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Description check | ⚠️ Warning | The PR description is largely incomplete, containing only placeholder template text with no actual implementation details filled in. Critical sections like 'What is changed and how it works?' are empty, and the 'Issue Number' field is not properly populated. | Complete the PR description by filling in the issue number, explaining the DML barrier mechanism changes, listing the affected files, confirming tests are included, and providing a release note or marking as 'None'. |
| Title check | ❓ Inconclusive | The title '[DNM] Branch for CSE testing' is vague and does not clearly describe the main changes. It uses a non-descriptive term and appears to be a temporary or work-in-progress label rather than a meaningful summary of the changeset. | Replace with a descriptive title that summarizes the main changes, such as 'Add MySQL sink DML barrier mechanism for causality ordering' or similar that reflects the actual implementation in the changeset. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@ti-chi-bot ti-chi-bot Bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Jun 12, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request implements a DML barrier design for the MySQL sink, allowing block events (such as syncpoints) to wait until all prior enqueued DMLs are flushed downstream. It introduces a WriterItem struct to carry either DML events or barrier tokens, updates the ConflictDetector to broadcast barriers using a removal-only fence, and adapts the writer loop to process these items sequentially. The review feedback highlights critical improvements: fixing a potential deadlock in newDMLBarrier when workerCount is zero, using a sync.RWMutex instead of a standard Mutex to avoid serializing the DML ingestion hot path in ConflictDetector.Add, and utilizing Go's clear() built-in to prevent memory leaks when resetting pointer-holding buffers.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +85 to +95
func newDMLBarrier(workerCount int) *dmlBarrier {
	barrier := &dmlBarrier{
		done:      make(chan struct{}),
		remaining: workerCount,
		acked:     make([]bool, workerCount),
	}
	if workerCount == 0 {
		barrier.Fail(errors.ErrMySQLTxnError.GenWithStackByArgs("mysql DML barrier has no writers"))
	}
	return barrier
}

critical

If workerCount is 0, barrier.Fail is called when remaining is already 0. In Fail, the check if b.err != nil || b.remaining == 0 will trigger an early return, meaning b.err is never set and b.done is never closed. This causes Wait() to block forever (deadlock). Directly initialize the failed state when workerCount == 0 to avoid this.

Suggested change
 func newDMLBarrier(workerCount int) *dmlBarrier {
 	barrier := &dmlBarrier{
 		done:      make(chan struct{}),
 		remaining: workerCount,
 		acked:     make([]bool, workerCount),
 	}
 	if workerCount == 0 {
-		barrier.Fail(errors.ErrMySQLTxnError.GenWithStackByArgs("mysql DML barrier has no writers"))
+		barrier.err = errors.ErrMySQLTxnError.GenWithStackByArgs("mysql DML barrier has no writers")
+		close(barrier.done)
 	}
 	return barrier
 }
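The failure mode described in this comment can be reproduced in isolation. The sketch below is a simplified model, not the PR's code: the `Fail` guard is copied from the condition quoted in the review (`b.err != nil || b.remaining == 0`), while the rest of the `dmlBarrier` body, `buggyBarrier`, and `isDone` are assumptions added so the example runs on its own, and a plain `errors.New` replaces the project's error type.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// dmlBarrier is a reduced model of the barrier under review (assumption).
type dmlBarrier struct {
	mu        sync.Mutex
	done      chan struct{}
	remaining int
	err       error
}

// Fail uses the guard quoted in the review: it does nothing once the barrier
// has already failed or has no outstanding writers.
func (b *dmlBarrier) Fail(err error) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.err != nil || b.remaining == 0 {
		return
	}
	b.err = err
	b.remaining = 0
	close(b.done)
}

// buggyBarrier follows the original constructor: with zero writers, Fail
// returns early, so done is never closed and err is never set.
func buggyBarrier() *dmlBarrier {
	b := &dmlBarrier{done: make(chan struct{}), remaining: 0}
	b.Fail(errors.New("mysql DML barrier has no writers"))
	return b
}

// newDMLBarrier follows the suggested change: the failed state is set
// directly when there are no writers.
func newDMLBarrier(workerCount int) *dmlBarrier {
	b := &dmlBarrier{done: make(chan struct{}), remaining: workerCount}
	if workerCount == 0 {
		b.err = errors.New("mysql DML barrier has no writers")
		close(b.done)
	}
	return b
}

// isDone reports whether done is closed, without blocking.
func isDone(b *dmlBarrier) bool {
	select {
	case <-b.done:
		return true
	default:
		return false
	}
}

func main() {
	buggy := buggyBarrier()
	fmt.Println("buggy: done =", isDone(buggy), "err =", buggy.err)

	fixed := newDMLBarrier(0)
	fmt.Println("fixed: done =", isDone(fixed), "err =", fixed.err)
}
```

In this model a waiter on the buggy barrier would block indefinitely, while the fixed constructor returns a barrier that is already complete and carries the error.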

changefeedID common.ChangeFeedID
metricConflictDetectDuration prometheus.Observer

admissionMu sync.Mutex

high

Using a standard sync.Mutex for admissionMu serializes all concurrent Add calls on the DML ingestion hot path. Since barriers are rare, changing this to sync.RWMutex allows concurrent Add calls to proceed in parallel under a read lock, significantly improving throughput.

Suggested change
-	admissionMu sync.Mutex
+	admissionMu sync.RWMutex

Comment on lines +106 to +107
d.admissionMu.Lock()
defer d.admissionMu.Unlock()

high

Acquire a read lock (RLock) instead of a write lock (Lock) to allow concurrent Add calls to run in parallel. This avoids serializing the DML ingestion hot path.

Suggested change
-	d.admissionMu.Lock()
-	defer d.admissionMu.Unlock()
+	d.admissionMu.RLock()
+	defer d.admissionMu.RUnlock()
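A rough sketch of the locking pattern proposed in these two comments follows. The `detector` type, its fields, and `run` are hypothetical stand-ins for `ConflictDetector`; only the name `admissionMu` and the RLock/Lock split come from the review. The sketch assumes the work done under the read lock is itself safe for concurrent callers, which the review does not establish for the real `Add`.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// detector is a hypothetical, reduced model of the conflict detector.
type detector struct {
	admissionMu sync.RWMutex
	fenceActive bool // guarded by admissionMu
	admitted    atomic.Int64
}

// Add takes the read lock: many Add calls may overlap with each other, but
// none can overlap with setFence, which takes the write lock.
func (d *detector) Add() (sawFence bool) {
	d.admissionMu.RLock()
	defer d.admissionMu.RUnlock()
	d.admitted.Add(1)
	return d.fenceActive
}

// setFence models the rare barrier path that changes fence state exclusively.
func (d *detector) setFence(active bool) {
	d.admissionMu.Lock()
	defer d.admissionMu.Unlock()
	d.fenceActive = active
}

// run admits n items concurrently while a fence is installed and cleared,
// and returns how many items were admitted.
func run(n int) int64 {
	d := &detector{}
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			d.Add()
		}()
	}
	d.setFence(true)
	d.setFence(false)
	wg.Wait()
	return d.admitted.Load()
}

func main() {
	fmt.Println(run(100))
}
```

Whether this is a safe substitution in the PR depends on what `Add` mutates while holding the lock: if it updates shared slot state that relies on `admissionMu` for mutual exclusion between `Add` calls, a read lock would not be sufficient.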

Comment on lines +361 to +373
if err := flushDMLs(dmlBuffer, rowCount); err != nil {
s.failOutstandingBarriers(err)
return err
}
dmlBuffer = dmlBuffer[:0]
workerBatchFlushDuration.Observe(time.Since(start).Seconds())

// we record total time to calculate the worker busy ratio.
// so we record the total time after flushing, to unified statistics on
// flush time and total time
workerTotalDuration.Observe(time.Since(totalStart).Seconds())
totalStart = time.Now()
buffer = buffer[:0]
itemBuffer = itemBuffer[:0]

medium

Resetting dmlBuffer and itemBuffer using [:0] does not clear the underlying array elements. Since these slices hold pointers (*commonEvent.DMLEvent and Barrier), the referenced objects will not be garbage collected while the goroutine is blocked waiting for new items. Use Go's built-in clear() to nil out the elements before resetting the slice length.

			if err := flushDMLs(dmlBuffer, rowCount); err != nil {
				s.failOutstandingBarriers(err)
				return err
			}
			clear(dmlBuffer)
			dmlBuffer = dmlBuffer[:0]
			workerBatchFlushDuration.Observe(time.Since(start).Seconds())

			// we record total time to calculate the worker busy ratio.
			// so we record the total time after flushing, to unified statistics on
			// flush time and total time
			workerTotalDuration.Observe(time.Since(totalStart).Seconds())
			totalStart = time.Now()
			clear(itemBuffer)
			itemBuffer = itemBuffer[:0]
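The difference between truncating and clearing a pointer slice can be checked directly. This is a standalone sketch with hypothetical names (`event`, `resetKeepRefs`, `resetDropRefs`, `liveRefs`); it relies on the `clear` built-in, which requires Go 1.21 or later.

```go
package main

import "fmt"

// event is a hypothetical stand-in for a buffered pointer element.
type event struct{ id int }

// resetKeepRefs only truncates: the backing array still points at the old
// events, so they stay reachable.
func resetKeepRefs(buf []*event) []*event { return buf[:0] }

// resetDropRefs zeroes every element in buf[:len(buf)] before truncating,
// releasing the references held by the backing array.
func resetDropRefs(buf []*event) []*event {
	clear(buf)
	return buf[:0]
}

// liveRefs counts non-nil pointers in the whole backing array up to capacity.
func liveRefs(buf []*event) int {
	n := 0
	for _, e := range buf[:cap(buf)] {
		if e != nil {
			n++
		}
	}
	return n
}

func main() {
	a := resetKeepRefs([]*event{{id: 1}, {id: 2}, {id: 3}})
	fmt.Println(len(a), liveRefs(a))

	b := resetDropRefs([]*event{{id: 1}, {id: 2}, {id: 3}})
	fmt.Println(len(b), liveRefs(b))
}
```

Note that `clear` only zeroes elements up to the slice's current length, so it has to run before the slice is truncated, as in the suggestion above.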

@ti-chi-bot

ti-chi-bot Bot commented Jun 12, 2026

[FORMAT CHECKER NOTIFICATION]

Notice: To remove the do-not-merge/needs-linked-issue label, please provide the linked issue number on one line in the PR body, for example: Issue Number: close #123 or Issue Number: ref #456.

📖 For more info, you can check the "Contribute Code" section in the development guide.

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@downstreamadapter/sink/mysql/causality/barrier_test.go`:
- Around line 155-162: The test currently calls cache.forceAdd directly and then
sets barrier.Fail itself, masking the production behavior; instead, locate and
invoke the real production caller that calls forceAdd (search for usages of
forceAdd) — call that higher-level method with NewBarrierItem(barrier) after
cache.out().Close(), then assert that barrier.err was set by that production
path (verify barrier.err is non-nil) rather than calling barrier.Fail manually;
keep the test focused and deterministic (use newTestBarrier, forceAdd caller,
and require.Error on barrier.err).

In `@pkg/sink/mysql/mysql_writer.go`:
- Around line 252-254: The failpoint Inject call is inside the per-event loop so
it runs once per event; move failpoint.Inject("MySQLSinkDelayDMLPostFlush", nil)
to just before the for _, event := range events { loop (the block that calls
event.PostFlush()) so the delay is injected once per flush instead of N times;
ensure you keep the event.PostFlush() calls unchanged and only relocate the
inject call.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 964400f9-9eda-461e-afff-d48fe3694f45

📥 Commits

Reviewing files that changed from the base of the PR and between 15a8703 and 05931f9.

📒 Files selected for processing (12)
  • deployments/Dockerfile
  • design.md
  • downstreamadapter/sink/mysql/causality/barrier_test.go
  • downstreamadapter/sink/mysql/causality/conflict_detector.go
  • downstreamadapter/sink/mysql/causality/helper_test.go
  • downstreamadapter/sink/mysql/causality/node.go
  • downstreamadapter/sink/mysql/causality/slot.go
  • downstreamadapter/sink/mysql/causality/txn_cache.go
  • downstreamadapter/sink/mysql/sink.go
  • downstreamadapter/sink/mysql/sink_test.go
  • pkg/sink/mysql/mysql_writer.go
  • utils/chann/unlimited_chann.go

Comment on lines +155 to +162
func TestTxnCacheForceAddFailsWhenClosed(t *testing.T) {
	cache := newTxnCache(TxnCacheOption{Count: 1, Size: 1, BlockStrategy: BlockStrategyWaitEmpty})
	cache.out().Close()

	barrier := newTestBarrier(1)
	require.False(t, cache.forceAdd(NewBarrierItem(barrier)))
	barrier.Fail(errors.New("closed"))
	require.Error(t, barrier.err)

Contributor

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

This test masks the barrier-failure path.

Lines 160-162 only prove that forceAdd returns false; the test then sets the barrier error itself. That means it still passes if the production close path forgets to propagate the failure to the barrier, so the regression you care about remains untested. Please drive the failure through the real caller that reacts to forceAdd == false and assert on that outcome instead.

As per coding guidelines, **/*_test.go: Prefer focused deterministic tests; see docs/agents/testing.md before adding or changing tests.

Source: Coding guidelines
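The shape of test this comment asks for can be sketched with toy types. Everything here is hypothetical: `queue`, `barrier`, `broadcast`, and `errQueueClosed` are illustrative stand-ins, not the PR's `txnCache`, `dmlBarrier`, or `BroadcastBarrier`. The point is only that the assertion targets the error set by the caller that reacts to a rejected enqueue, rather than an error the test sets itself.

```go
package main

import (
	"errors"
	"fmt"
)

// barrier is a toy barrier that keeps the first failure.
type barrier struct{ err error }

func (b *barrier) Fail(err error) {
	if b.err == nil {
		b.err = err
	}
}

// queue is a toy per-writer queue that can be closed.
type queue struct {
	closed bool
	items  []*barrier
}

// forceAdd enqueues unless the queue is closed and reports whether the item
// was accepted.
func (q *queue) forceAdd(b *barrier) bool {
	if q.closed {
		return false
	}
	q.items = append(q.items, b)
	return true
}

var errQueueClosed = errors.New("writer queue closed")

// broadcast is the hypothetical production caller: it fails the barrier as
// soon as any queue rejects the token.
func broadcast(queues []*queue, b *barrier) {
	for _, q := range queues {
		if !q.forceAdd(b) {
			b.Fail(errQueueClosed)
			return
		}
	}
}

func main() {
	b := &barrier{}
	broadcast([]*queue{{}, {closed: true}}, b)
	fmt.Println(b.err)
}
```

A test written against `broadcast` would fail if the failure propagation were removed, which is the regression the direct `forceAdd` plus manual `Fail` sequence cannot detect.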

Comment on lines 252 to 254
for _, event := range events {
failpoint.Inject("MySQLSinkDelayDMLPostFlush", nil)
event.PostFlush()

Contributor

⚠️ Potential issue | 🟡 Minor

Inject MySQLSinkDelayDMLPostFlush once per flush, not once per event

failpoint.Inject("MySQLSinkDelayDMLPostFlush", nil) is inside the for _, event := range events loop, so a batch of N DML events triggers the failpoint N times; move the inject before the loop to gate the post-flush phase once per flush.

Suggested change
-	for _, event := range events {
-		failpoint.Inject("MySQLSinkDelayDMLPostFlush", nil)
+	failpoint.Inject("MySQLSinkDelayDMLPostFlush", nil)
+	for _, event := range events {
 		event.PostFlush()
 	}
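The effect of relocating the call can be illustrated without the failpoint library. In this sketch a plain callback named `hook` stands in for `failpoint.Inject("MySQLSinkDelayDMLPostFlush", nil)`; `event`, `postFlushPerEvent`, `postFlushOnce`, and `countHookCalls` are hypothetical names introduced for the example.

```go
package main

import "fmt"

// event is a hypothetical stand-in for a DML event with a PostFlush callback.
type event struct{ flushed bool }

func (e *event) PostFlush() { e.flushed = true }

// postFlushPerEvent mirrors the code under review: the hook runs inside the
// loop, once per event.
func postFlushPerEvent(events []*event, hook func()) {
	for _, e := range events {
		hook()
		e.PostFlush()
	}
}

// postFlushOnce mirrors the suggestion: the hook runs once before the loop.
func postFlushOnce(events []*event, hook func()) {
	hook()
	for _, e := range events {
		e.PostFlush()
	}
}

// countHookCalls runs f over n fresh events and returns how often the hook fired.
func countHookCalls(n int, f func([]*event, func())) int {
	events := make([]*event, n)
	for i := range events {
		events[i] = &event{}
	}
	calls := 0
	f(events, func() { calls++ })
	return calls
}

func main() {
	fmt.Println(countHookCalls(5, postFlushPerEvent))
	fmt.Println(countHookCalls(5, postFlushOnce))
}
```

One side effect worth checking against the real call site: with an empty batch the per-event version never fires the hook, whereas the relocated version fires it once.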
Labels

  • do-not-merge/needs-linked-issue
  • do-not-merge/work-in-progress: Indicates that a PR should not merge because it is a work in progress.
  • release-note: Denotes a PR that will be considered when it comes time to generate release notes.
  • size/XXL: Denotes a PR that changes 1000+ lines, ignoring generated files.

1 participant