dnm: cp pr 4030 4460 4950 and add switch for it by haiboumich · Pull Request #5369 · pingcap/ticdc

haiboumich · 2026-06-12T10:03:14Z

What problem does this PR solve?

Issue Number: close #xxx

What is changed and how it works?

Check List

Tests

Unit test
Integration test
Manual test (add detailed scripts or steps below)
No code

Questions

Will it cause performance regression or break compatibility?

Do you need to update user documentation, design documentation or monitoring documentation?

Release note

Please refer to [Release Notes Language Style Guide](https://pingcap.github.io/tidb-dev-guide/contribute-to-tidb/release-notes-style-guide.html) to write a quality release note.

If you don't think this PR needs a release note then fill it with `None`.

close pingcap#4172

close pingcap#4365

close pingcap#5041

ti-chi-bot · 2026-06-12T10:03:18Z

This cherry pick PR is for a release branch and has not yet been approved by triage owners.
Adding the do-not-merge/cherry-pick-not-approved label.

To merge this cherry pick:

It must first be LGTMed and approved by the reviewers.
For pull requests to TiDB-x branches, it must have no failed tests.
After it has the lgtm and approved labels, please wait for the cherry-pick merging approval from triage owners.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

coderabbitai · 2026-06-12T10:03:23Z

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: b4e9da06-916a-44db-8faa-b4f5cb06b8fc

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


gemini-code-assist

Code Review

This pull request introduces an adaptive scan window controller to dynamically adjust the scan interval based on memory usage ratios and release counts, updates congestion control to version 2 to support these metrics, and adds new Grafana panels for monitoring. It also introduces a DDL workload runner to the workload tool. Key feedback points include fixing serialization in AvailableMemory to respect the Version field, resolving connection retry and recovery issues in the DDL worker loop, ensuring nil-safety by using GetTableSpan() instead of direct field access, and making the TOML file extension check case-insensitive.


gemini-code-assist · 2026-06-12T10:05:56Z

 func (m AvailableMemory) Marshal() []byte {
+	return m.marshalV1()
+}
+
+func (m *AvailableMemory) Unmarshal(buf *bytes.Buffer) {
+	m.unmarshalV1(buf)
+}
+
+func (m AvailableMemory) GetSize() int {
+	return m.sizeV1()
+}


The AvailableMemory struct's Marshal() and GetSize() methods always default to V1 serialization and size calculation, completely ignoring the Version field. Additionally, unmarshalV1 and unmarshalV2 do not set the Version field on the unmarshaled struct. This leads to silent data loss or incorrect size calculations if AvailableMemory is marshaled or its size is queried directly after being unmarshaled as V2. Please ensure Version is set during unmarshaling, and respected during marshaling and size calculation.

func (m AvailableMemory) Marshal() []byte {
	if m.Version == CongestionControlVersion2 {
		return m.marshalV2()
	}
	return m.marshalV1()
}

func (m *AvailableMemory) Unmarshal(buf *bytes.Buffer) {
	_ = m.unmarshalV1(buf)
}

func (m AvailableMemory) GetSize() int {
	if m.Version == CongestionControlVersion2 {
		return m.sizeV2()
	}
	return m.sizeV1()
}
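The round-trip concern in this comment can be illustrated with a small, self-contained sketch. Everything here is an assumption made for illustration: the `availableMemory` type, its `ReleaseCount` field, the version constants and the byte layout are hypothetical and are not taken from the ticdc code. The only point is the pattern: each unmarshal path records the version it decoded, and `Marshal` and `GetSize` branch on that version.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
)

// Hypothetical version tags; the real constants in ticdc may differ.
const (
	versionV1 byte = 1
	versionV2 byte = 2
)

// availableMemory is an illustrative stand-in, not the real struct.
type availableMemory struct {
	Version      byte
	Available    uint64
	ReleaseCount uint32 // assumed to exist only in the V2 layout
}

// GetSize returns the encoded size for the struct's own version.
func (m availableMemory) GetSize() int {
	if m.Version == versionV2 {
		return 8 + 4
	}
	return 8
}

// Marshal encodes the fields that belong to the struct's own version.
func (m availableMemory) Marshal() []byte {
	buf := make([]byte, 0, m.GetSize())
	buf = binary.BigEndian.AppendUint64(buf, m.Available)
	if m.Version == versionV2 {
		buf = binary.BigEndian.AppendUint32(buf, m.ReleaseCount)
	}
	return buf
}

// unmarshalV1 decodes the V1 layout and records the version it decoded.
func (m *availableMemory) unmarshalV1(buf *bytes.Buffer) error {
	if buf.Len() < 8 {
		return errors.New("short buffer for V1")
	}
	m.Version = versionV1
	m.Available = binary.BigEndian.Uint64(buf.Next(8))
	return nil
}

// unmarshalV2 decodes the V2 layout and records the version it decoded.
func (m *availableMemory) unmarshalV2(buf *bytes.Buffer) error {
	if buf.Len() < 12 {
		return errors.New("short buffer for V2")
	}
	m.Version = versionV2
	m.Available = binary.BigEndian.Uint64(buf.Next(8))
	m.ReleaseCount = binary.BigEndian.Uint32(buf.Next(4))
	return nil
}

// roundTripV2 marshals in and decodes the bytes again as V2.
func roundTripV2(in availableMemory) (availableMemory, error) {
	var out availableMemory
	err := out.unmarshalV2(bytes.NewBuffer(in.Marshal()))
	return out, err
}

func main() {
	in := availableMemory{Version: versionV2, Available: 1024, ReleaseCount: 3}
	out, err := roundTripV2(in)
	fmt.Println(out == in, out.GetSize(), err)
}
```

With the version recorded on decode, a value that was unmarshaled as V2 can be re-marshaled or sized directly without dropping the V2-only field.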

gemini-code-assist · 2026-06-12T10:05:56Z

+		go func(workerID int, db *DBWrapper) {
+			defer func() {
+				plog.Info("ddl worker exited", zap.Int("worker", workerID))
+				wg.Done()
+			}()
+
+			conn, err := getConnWithTimeout(db.DB, 10*time.Second)
+			if err != nil {
+				plog.Info("get connection failed for ddl worker", zap.Error(err))
+				time.Sleep(5 * time.Second)
+				return
+			}
+			defer func() {
+				if conn != nil {
+					conn.Close()
+				}
+			}()
+
+			plog.Info("start ddl worker", zap.Int("worker", workerID), zap.String("db", db.Name))
+
+			for {
+				task := <-r.taskCh
+				if err := r.executeTask(conn, task); err != nil {
+					if r.app.isConnectionError(err) {
+						conn.Close()
+						time.Sleep(2 * time.Second)
+						newConn, err := getConnWithTimeout(db.DB, 10*time.Second)
+						if err != nil {
+							plog.Info("reconnect failed for ddl worker", zap.Error(err))
+							time.Sleep(5 * time.Second)
+							continue
+						}
+						conn = newConn
+					}
+				}
+			}
+		}(workerID, db)


There are two critical connection handling issues in the worker loop:

If the initial connection attempt fails, the worker goroutine logs the error and exits permanently. If all workers fail initially, the runner will end up with 0 active workers.

If reconnection fails inside the loop, the worker continues to the next iteration and attempts to execute the next task on the closed connection, which immediately fails and inflates the error metrics incorrectly.

To fix both issues, the worker should retry connecting in a loop before starting task execution, and also retry in a loop upon connection loss before taking the next task.

go func(workerID int, db *DBWrapper) {
	defer func() {
		plog.Info("ddl worker exited", zap.Int("worker", workerID))
		wg.Done()
	}()

	var conn *sql.Conn
	var err error
	for {
		conn, err = getConnWithTimeout(db.DB, 10*time.Second)
		if err == nil {
			break
		}
		plog.Info("get connection failed for ddl worker, retrying...", zap.Error(err))
		time.Sleep(5 * time.Second)
	}
	defer func() {
		if conn != nil {
			conn.Close()
		}
	}()

	plog.Info("start ddl worker", zap.Int("worker", workerID), zap.String("db", db.Name))

	for {
		task := <-r.taskCh
		if err := r.executeTask(conn, task); err != nil {
			if r.app.isConnectionError(err) {
				conn.Close()
				for {
					time.Sleep(2 * time.Second)
					newConn, err := getConnWithTimeout(db.DB, 10*time.Second)
					if err == nil {
						conn = newConn
						break
					}
					plog.Info("reconnect failed for ddl worker, retrying...", zap.Error(err))
				}
			}
		}
	}
}(workerID, db)
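Both retry loops in this suggestion follow the same shape, so they could be factored into one helper. The following is a minimal sketch under stated assumptions: `connectWithRetry` and `dialAfterFailures` are hypothetical names, the dialer is a fake that stands in for `getConnWithTimeout`, and the context-based cancellation is an addition that the suggestion itself does not include (its loops retry forever).

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// connectWithRetry keeps calling dial until it succeeds or ctx is done.
// The backoff wait also aborts as soon as ctx is cancelled.
func connectWithRetry[T any](
	ctx context.Context,
	dial func(context.Context) (T, error),
	backoff time.Duration,
) (T, error) {
	var zero T
	for {
		conn, err := dial(ctx)
		if err == nil {
			return conn, nil
		}
		select {
		case <-ctx.Done():
			return zero, ctx.Err()
		case <-time.After(backoff):
		}
	}
}

// dialAfterFailures is a hypothetical harness: the fake dialer fails
// `failures` times and then returns the handle "conn". The whole attempt
// is bounded by timeoutMillis.
func dialAfterFailures(failures, timeoutMillis int) (conn string, attempts int, err error) {
	timeout := time.Duration(timeoutMillis) * time.Millisecond
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	dial := func(context.Context) (string, error) {
		attempts++
		if attempts <= failures {
			return "", errors.New("connection refused")
		}
		return "conn", nil
	}
	conn, err = connectWithRetry(ctx, dial, time.Millisecond)
	return conn, attempts, err
}

func main() {
	conn, attempts, err := dialAfterFailures(2, 1000)
	fmt.Println(conn, attempts, err) // conn 3 <nil>
}
```

In a worker like the one above, such a helper would be called once before the task loop and again after a connection error, before the next task is taken from the channel. Passing a context lets the worker exit on shutdown instead of sleeping in an unbounded loop.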

gemini-code-assist · 2026-06-12T10:05:56Z

+				zap.Int64("oldTableID", d.tableSpan.GetTableID()),
+				zap.Int64("currentTableID", ddl.GetTableID()),


Accessing d.tableSpan directly may bypass concurrency synchronization or nil-safety checks that are normally provided by d.GetTableSpan(). To prevent potential data races or nil-pointer panics, it is safer to use d.GetTableSpan().GetTableID().

Suggested change

-	zap.Int64("oldTableID", d.tableSpan.GetTableID()),
+	zap.Int64("oldTableID", d.GetTableSpan().GetTableID()),
 	zap.Int64("currentTableID", ddl.GetTableID()),
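For context, the getter-over-field argument relies on a common Go pattern: methods with pointer receivers can check for nil and take a lock before reading. The sketch below is illustrative only. The `dispatcher` and `tableSpan` types here are hypothetical stand-ins, and whether the real `GetTableSpan()` locks or checks for nil is an assumption that should be verified against the ticdc source.

```go
package main

import (
	"fmt"
	"sync"
)

// tableSpan is a hypothetical stand-in for the real span type.
type tableSpan struct {
	TableID int64
}

// GetTableID is safe to call on a nil receiver and returns 0 in that case.
func (s *tableSpan) GetTableID() int64 {
	if s == nil {
		return 0
	}
	return s.TableID
}

// dispatcher is a hypothetical stand-in that guards its span with a lock.
type dispatcher struct {
	mu        sync.RWMutex
	tableSpan *tableSpan
}

// GetTableSpan reads the span under the read lock and tolerates a nil
// dispatcher.
func (d *dispatcher) GetTableSpan() *tableSpan {
	if d == nil {
		return nil
	}
	d.mu.RLock()
	defer d.mu.RUnlock()
	return d.tableSpan
}

func main() {
	var missing *dispatcher
	ready := &dispatcher{tableSpan: &tableSpan{TableID: 42}}

	// Chained getters never dereference a nil pointer.
	fmt.Println(missing.GetTableSpan().GetTableID()) // 0
	fmt.Println(ready.GetTableSpan().GetTableID())   // 42
}
```

If the real getter is a plain field return with no lock or nil check, the two spellings are equivalent and the suggestion is only a consistency improvement.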

gemini-code-assist · 2026-06-12T10:05:56Z

+	if filepath.Ext(path) != ".toml" {
+		return nil, errors.Errorf("ddl config must be a .toml file: %s", path)
+	}


The file extension check is case-sensitive. If a user provides a config file with an uppercase extension (e.g., .TOML), the validation will fail. Consider making the check case-insensitive.

Suggested change

-	if filepath.Ext(path) != ".toml" {
+	if strings.ToLower(filepath.Ext(path)) != ".toml" {
 		return nil, errors.Errorf("ddl config must be a .toml file: %s", path)
 	}
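An equivalent way to write the case-insensitive check is `strings.EqualFold`, which compares without allocating a lowered copy of the extension. This is a small sketch; `isTOMLPath` is a hypothetical helper name, not a function in the PR.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// isTOMLPath reports whether path ends in a ".toml" extension,
// ignoring letter case.
func isTOMLPath(path string) bool {
	return strings.EqualFold(filepath.Ext(path), ".toml")
}

func main() {
	for _, p := range []string{"ddl.toml", "ddl.TOML", "ddl.Toml", "ddl.yaml", "ddl"} {
		fmt.Printf("%-10s %v\n", p, isTOMLPath(p))
	}
}
```

Either form accepts `ddl.TOML`. The choice between `strings.ToLower` and `strings.EqualFold` is mostly stylistic for a one-off config check.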

ti-chi-bot · 2026-06-12T10:06:06Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign nongfushanquan for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS
pkg/config/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Add a per-changefeed `enable-scan-window` replica config (default false) and plumb it through the changefeed config, dispatcher info and event service.

When the switch is off the adaptive scan-window feature is fully inert and the changefeed behaves as if it was never introduced:

- event service: memory control, adaptive scan interval, base-ts capping, empty-range signal, pending-DDL local advance and scan-window metrics are all gated.
- dynstream: the memory release ratio follows the switch (0.4 off / 0.6 on); the deadlock high-water-mark stays 0.6 in both modes.
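The dynstream part of this commit message can be pictured as a single function of the switch. The sketch below is an assumption-laden illustration: `replicaConfig`, `memoryReleaseRatio` and the constant names are hypothetical, and only the numeric values (0.4, 0.6, and a default of false) come from the commit message.

```go
package main

import "fmt"

// Values taken from the commit message; the names are hypothetical.
const (
	releaseRatioScanWindowOff = 0.4
	releaseRatioScanWindowOn  = 0.6
	deadlockHighWaterMark     = 0.6 // unchanged in both modes
)

// replicaConfig is an illustrative stand-in for the changefeed replica
// config. The zero value matches the documented default (switch off).
type replicaConfig struct {
	EnableScanWindow bool // corresponds to `enable-scan-window`
}

// memoryReleaseRatio returns the release ratio selected by the switch.
func memoryReleaseRatio(cfg replicaConfig) float64 {
	if cfg.EnableScanWindow {
		return releaseRatioScanWindowOn
	}
	return releaseRatioScanWindowOff
}

func main() {
	off := replicaConfig{}
	on := replicaConfig{EnableScanWindow: true}
	fmt.Println(memoryReleaseRatio(off), memoryReleaseRatio(on), deadlockHighWaterMark)
}
```

Keeping the default at false means an upgraded changefeed with no explicit setting keeps the 0.4 ratio, which matches the "as if never introduced" goal stated above.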

asddongmen added 3 commits June 12, 2026 17:17

*:improve memory control (pingcap#4030)

20a625f

close pingcap#4172

eventservice: fix changefeed getting stuck (pingcap#4460)

ba48071

close pingcap#4365

eventservice: optimize scanwindow (pingcap#4950)

da7abf2

close pingcap#5041

ti-chi-bot Bot added the first-time-contributor (indicates that the PR was contributed by an external member and is a first-time contributor), do-not-merge/cherry-pick-not-approved, and release-note (denotes a PR that will be considered when it comes time to generate release notes) labels Jun 12, 2026

ti-chi-bot Bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Jun 12, 2026

gemini-code-assist Bot reviewed Jun 12, 2026


haiboumich force-pushed the haiboumich/add-switch-for-scanwindow branch from 7d0fe3a to 81e8500 Compare June 13, 2026 17:00

haiboumich wants to merge 4 commits into pingcap:release-8.5 from haiboumich:haiboumich/add-switch-for-scanwindow
