server: honor scheduler concurrency for coordinator by wlwilliamx · Pull Request #4832 · pingcap/ticdc

wlwilliamx · 2026-04-15T16:21:45Z

What problem does this PR solve?

Issue Number: close #4831

What is changed and how it works?

The coordinator was constructed with hard-coded scheduling settings:

max task concurrency: 10000
balance interval: time.Minute

This bypassed Debug.Scheduler.MaxTaskConcurrency and Debug.Scheduler.CheckBalanceInterval. During bulk changefeed creation, the basic scheduler could therefore schedule a very large number of absent changefeeds at once, causing many maintainer bootstraps to run concurrently.

This PR reads the coordinator scheduler settings from the validated global server config before constructing the coordinator. With the default config, maintainer scheduling concurrency returns to 10, which reduces creation-time memory and CPU spikes by throttling concurrent bootstrap work.

Check List

Tests

Unit test
Manual test

go test ./server

Questions

Will it cause performance regression or break compatibility?

It should reduce CPU and memory spikes during bulk changefeed creation by honoring the existing scheduler concurrency config. It may make very large bulk creation finish more gradually compared with the previous hard-coded 10000 concurrency, but that behavior matches the intended configurable scheduler limit and can be tuned via server config.

Do you need to update user documentation, design documentation or monitoring documentation?

No. This PR makes existing scheduler config effective for coordinator scheduling.

Release note

Fix TiCDC coordinator to honor scheduler max task concurrency when scheduling changefeed maintainers.

Summary by CodeRabbit

Chores
- Election setup now accepts scheduler settings from server config, enabling configurable task concurrency and balance-check intervals.
Bug Fixes / Behavior Changes
- Resume now uses latest persisted changefeed metadata from storage for validation and updates, avoiding stale in-memory overwrites.
- Changefeed retrieval returns a safe copy to prevent races when callers mutate returned data.
Tests
- Added tests covering scheduler settings capture, resume behavior with persisted metadata, and mutation/race safety of returned changefeed info.

Use the validated server scheduler config when constructing the coordinator so maintainer scheduling concurrency respects max-task-concurrency instead of a hard-coded value.

ti-chi-bot · 2026-04-15T16:21:49Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

coderabbitai · 2026-04-15T16:21:54Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: be625267-443b-454a-a3df-ca1786785c21

📥 Commits

Reviewing files that changed from the base of the PR and between 9f8de82 and e42cb78.

📒 Files selected for processing (2)

coordinator/changefeed/etcd_backend.go
coordinator/changefeed/etcd_backend_test.go

🚧 Files skipped from review as they are similar to previous changes (1)

coordinator/changefeed/etcd_backend.go

📝 Walkthrough

Walkthrough

Adds persisted-changefeed retrieval and makes ResumeChangefeed return the backend’s resumed ChangeFeedInfo; Controller clones returned in-memory info and applies backend-returned info on resume. API resume handler re-fetches persisted info. Elector captures SchedulerConfig and passes scheduler settings into coordinator creation.

Changes

Changefeed resume & Controller defensive copy

Layer / File(s)	Summary
Backend interface: add persisted getter & resume return type `coordinator/changefeed/changefeed_db_backend.go`	Add `Backend.GetChangefeedInfo(ctx, id)` and change `ResumeChangefeed` to return `(*config.ChangeFeedInfo, error)` instead of only `error`.
Etcd backend implementation and tests `coordinator/changefeed/etcd_backend.go`, `coordinator/changefeed/etcd_backend_test.go`	Implement `EtcdBackend.GetChangefeedInfo`, normalize/ensure ChangefeedID, update `ResumeChangefeed` to load/normalize info, update job/status when checkpoint provided, and return updated info; tests check normalized state, ID, and legacy scheduler defaults.
Mocks updated `coordinator/changefeed/mock/changefeed_db_backend.go`	Generated mock adds `GetChangefeedInfo` and updates `ResumeChangefeed` mock signatures/return handling.
Coordinator API extension `coordinator/coordinator.go`, `pkg/server/coordinator.go`	Add `GetPersistedChangefeedInfo` forwarding method and extend `Coordinator` interface with the new method.
Controller: resume uses backend-returned info & defensive GetChangefeed `coordinator/controller.go`, `coordinator/controller_test.go`	`ResumeChangefeed` uses and clones backend-returned info to update in-memory state; `GetChangefeed` clones `cf.GetInfo()` and returns the clone; tests added/extended to validate resume behavior and concurrent-safety of returned info.
API handler: use persisted info for resume validation `api/v2/changefeed.go`, `api/v2/changefeed_test.go`	`ResumeChangefeed` handler re-fetches persisted changefeed metadata via `co.GetPersistedChangefeedInfo` and uses it for subsequent resume checks; test coordinator stub implements the persisted getter. (If fetch fails, return error.)

Coordinator scheduler configurability

Layer / File(s)	Summary
Elector captures scheduler settings and helper `server/module_election.go`, `server/module_election_test.go`	`NewElector` accepts `*config.SchedulerConfig`, stores `coordinatorMaxTaskConcurrency` and `coordinatorCheckBalanceInterval`, and adds `coordinatorSchedulerSettings(schedulerCfg)` with a unit test verifying behavior.
Campaign wiring and server init `server/module_election.go`, `server/server.go`	`campaignCoordinator` passes captured scheduler fields into `coordinator.New`; server initialization now calls `NewElector(c, conf.Debug.Scheduler)`.

Sequence Diagram(s)

sequenceDiagram
  participant API as ResumeChangefeed handler
  participant Coordinator
  participant Controller
  participant EtcdBackend
  API->>Coordinator: GetChangefeed(ctx, id)
  Coordinator->>Controller: GetChangefeed(ctx, id)
  Controller-->>Coordinator: cloned in-memory ChangeFeedInfo
  API->>Coordinator: Resume request (id, newCheckpointTs)
  Coordinator->>Controller: ResumeChangefeed(ctx, id, newCheckpointTs)
  Controller->>EtcdBackend: ResumeChangefeed(ctx, id, newCheckpointTs)
  EtcdBackend-->>Controller: (*config.ChangeFeedInfo, error)
  Controller->>Controller: clone resumedInfo and update in-memory state
  Controller-->>Coordinator: resume result
  Coordinator-->>API: final resume response / error

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

wk989898
lidezhu
asddongmen

Poem

A rabbit clones a careful trace,
So returned info keeps its space.
Resume asks the persisted spring,
Scheduler knobs slow startup's swing.
Tests hop round to guard the race. 🐇

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 5.56% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly summarizes the main change: making the coordinator honor scheduler concurrency settings instead of using hard-coded values.
Description check	✅ Passed	The description includes all required sections: linked issue (`#4831`), detailed explanation of the problem and solution, test types, performance impact analysis, and release notes.
Linked Issues check	✅ Passed	All code changes directly address the objective in `#4831`: passing scheduler config to coordinator construction, implementing persisted metadata retrieval during resume, and updating backend interfaces to support the changes.
Out of Scope Changes check	✅ Passed	All file changes are directly scoped to the issue: scheduler config threading, coordinator info retrieval, backend interface updates, and related tests. No unrelated refactoring or feature additions detected.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

gemini-code-assist

Code Review

This pull request replaces hardcoded values for coordinator task concurrency and balance interval with dynamic settings retrieved from the global server configuration. It introduces a helper function coordinatorSchedulerSettings and includes a corresponding unit test to ensure the configuration is correctly applied. I have no feedback to provide.

wlwilliamx · 2026-06-09T02:13:08Z

/test all

ti-chi-bot · 2026-06-09T03:08:34Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hongyunyan, lidezhu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [hongyunyan,lidezhu]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ti-chi-bot · 2026-06-09T03:08:37Z

[LGTM Timeline notifier]

Timeline:

2026-06-09 03:03:18.39757272 +0000 UTC m=+842699.467890100: ☑️ agreed by hongyunyan.
2026-06-09 03:08:36.126568685 +0000 UTC m=+843017.196886075: ☑️ agreed by lidezhu.

wlwilliamx · 2026-06-09T03:11:53Z

/retest

coderabbitai

🧹 Nitpick comments (1)

coordinator/controller_test.go (1)

395-417: ⚡ Quick win

Strengthen this concurrency test to avoid scheduler-dependent false negatives.

Line 409 snapshots storedInfo once, and both goroutines start unsynchronized; the read loop can finish before mutations begin, so an aliasing regression may slip through.

Suggested refinement

 	var (
 		wg            sync.WaitGroup
 		storedChanged bool
 	)
+	start := make(chan struct{})
 	wg.Add(2)
 	go func() {
 		defer wg.Done()
+		<-start
 		for i := 0; i < 1000; i++ {
 			ret.SinkURI = "kafka://127.0.0.1:9092"
 			ret.TargetTs = uint64(i + 1)
 		}
 	}()
 	go func() {
 		defer wg.Done()
-		storedInfo := changefeedDB.GetByID(cfID).GetInfo()
+		<-start
 		for i := 0; i < 1000; i++ {
+			storedInfo := changefeedDB.GetByID(cfID).GetInfo()
 			if storedInfo.SinkURI != "mysql://127.0.0.1:3306" || storedInfo.TargetTs != 0 {
 				storedChanged = true
 				return
 			}
 		}
 	}()
+	close(start)
 	wg.Wait()

As per coding guidelines, "Prefer focused deterministic tests; see docs/agents/testing.md before adding or changing tests."

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@coordinator/controller_test.go` around lines 395 - 417, The current
concurrency test reads storedInfo once before the mutation loop, which lets the
reader finish before writers run and yields flaky/scheduler-dependent behavior;
change the reader to sample changefeedDB.GetByID(cfID).GetInfo() inside its loop
(or use a simple start barrier such as a sync.WaitGroup/channel so both
goroutines begin simultaneously) and then check for SinkURI/TargetTs changes on
each iteration (update the storedChanged flag when a change is observed);
reference the reader call changefeedDB.GetByID(cfID).GetInfo(), the writer
updates to ret.SinkURI/ret.TargetTs, and the existing wg sync to coordinate
starts.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@coordinator/controller_test.go`:
- Around line 395-417: The current concurrency test reads storedInfo once before
the mutation loop, which lets the reader finish before writers run and yields
flaky/scheduler-dependent behavior; change the reader to sample
changefeedDB.GetByID(cfID).GetInfo() inside its loop (or use a simple start
barrier such as a sync.WaitGroup/channel so both goroutines begin
simultaneously) and then check for SinkURI/TargetTs changes on each iteration
(update the storedChanged flag when a change is observed); reference the reader
call changefeedDB.GetByID(cfID).GetInfo(), the writer updates to
ret.SinkURI/ret.TargetTs, and the existing wg sync to coordinate starts.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 17782c56-d583-4a96-a3f3-02d6c2dcf96e

📥 Commits

Reviewing files that changed from the base of the PR and between 641a7ec and 7811631.

📒 Files selected for processing (5)

coordinator/controller.go
coordinator/controller_test.go
server/module_election.go
server/module_election_test.go
server/server.go

🚧 Files skipped from review as they are similar to previous changes (1)

server/module_election_test.go

wlwilliamx · 2026-06-12T03:51:23Z

/test all

ti-chi-bot · 2026-06-15T03:22:33Z

In response to a cherrypick label: new pull request created to branch release-8.5: #5396.

server: honor scheduler concurrency for coordinator

641a7ec

Use the validated server scheduler config when constructing the coordinator so maintainer scheduling concurrency respects max-task-concurrency instead of a hard-coded value.

ti-chi-bot Bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note Denotes a PR that will be considered when it comes time to generate release notes. labels Apr 15, 2026

ti-chi-bot Bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Apr 15, 2026

gemini-code-assist Bot reviewed Apr 15, 2026

View reviewed changes

wlwilliamx marked this pull request as ready for review June 9, 2026 02:12

ti-chi-bot Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 9, 2026

hongyunyan approved these changes Jun 9, 2026

View reviewed changes

ti-chi-bot Bot added needs-1-more-lgtm Indicates a PR needs 1 more LGTM. approved labels Jun 9, 2026

lidezhu approved these changes Jun 9, 2026

View reviewed changes

ti-chi-bot Bot added lgtm and removed needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Jun 9, 2026

coordinator,server: fix changefeed update data race

7811631

ti-chi-bot Bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jun 9, 2026

coderabbitai Bot reviewed Jun 9, 2026

View reviewed changes

coordinator,api: validate resume with persisted metadata

a6b7b83

ti-chi-bot Bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 10, 2026

wlwilliamx added 2 commits June 10, 2026 12:04

Merge upstream/master into PR 4832

9f8de82

coordinator: complete changefeed metadata on resume

e42cb78

ti-chi-bot Bot merged commit 072f1bd into pingcap:master Jun 12, 2026
25 of 26 checks passed

wlwilliamx added the needs-cherry-pick-release-8.5 Should cherry pick this PR to release-8.5 branch. label Jun 15, 2026

ti-chi-bot mentioned this pull request Jun 15, 2026

server: honor scheduler concurrency for coordinator (#4832) #5396

Open

Conversation

wlwilliamx commented Apr 15, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What problem does this PR solve?

What is changed and how it works?

Check List

Tests

Questions

Will it cause performance regression or break compatibility?

Do you need to update user documentation, design documentation or monitoring documentation?

Release note

Summary by CodeRabbit

Uh oh!

ti-chi-bot Bot commented Apr 15, 2026

Uh oh!

coderabbitai Bot commented Apr 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

wlwilliamx commented Jun 9, 2026

Uh oh!

ti-chi-bot Bot commented Jun 9, 2026

Uh oh!

ti-chi-bot Bot commented Jun 9, 2026

[LGTM Timeline notifier]

Uh oh!

wlwilliamx commented Jun 9, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

wlwilliamx commented Jun 12, 2026

Uh oh!

Uh oh!

ti-chi-bot commented Jun 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

wlwilliamx commented Apr 15, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Apr 15, 2026 •

edited

Loading