Skip to content

server: honor scheduler concurrency for coordinator#4832

Merged
ti-chi-bot[bot] merged 5 commits into
pingcap:masterfrom
wlwilliamx:codex/create-cf-oom-optimization-20260415
Jun 12, 2026
Merged

server: honor scheduler concurrency for coordinator#4832
ti-chi-bot[bot] merged 5 commits into
pingcap:masterfrom
wlwilliamx:codex/create-cf-oom-optimization-20260415

Conversation

@wlwilliamx

@wlwilliamx wlwilliamx commented Apr 15, 2026

Copy link
Copy Markdown
Collaborator

What problem does this PR solve?

Issue Number: close #4831

What is changed and how it works?

The coordinator was constructed with hard-coded scheduling settings:

  • max task concurrency: 10000
  • balance interval: time.Minute

This bypassed Debug.Scheduler.MaxTaskConcurrency and Debug.Scheduler.CheckBalanceInterval. During bulk changefeed creation, the basic scheduler could therefore schedule a very large number of absent changefeeds at once, causing many maintainer bootstraps to run concurrently.

This PR reads the coordinator scheduler settings from the validated global server config before constructing the coordinator. With the default config, maintainer scheduling concurrency returns to 10, which reduces creation-time memory and CPU spikes by throttling concurrent bootstrap work.

Check List

Tests

  • Unit test
  • Manual test
go test ./server

Questions

Will it cause performance regression or break compatibility?

It should reduce CPU and memory spikes during bulk changefeed creation by honoring the existing scheduler concurrency config. It may make very large bulk creation finish more gradually compared with the previous hard-coded 10000 concurrency, but that behavior matches the intended configurable scheduler limit and can be tuned via server config.

Do you need to update user documentation, design documentation or monitoring documentation?

No. This PR makes existing scheduler config effective for coordinator scheduling.

Release note

Fix TiCDC coordinator to honor scheduler max task concurrency when scheduling changefeed maintainers.

Summary by CodeRabbit

  • Chores

    • Election setup now accepts scheduler settings from server config, enabling configurable task concurrency and balance-check intervals.
  • Bug Fixes / Behavior Changes

    • Resume now uses latest persisted changefeed metadata from storage for validation and updates, avoiding stale in-memory overwrites.
    • Changefeed retrieval returns a safe copy to prevent races when callers mutate returned data.
  • Tests

    • Added tests covering scheduler settings capture, resume behavior with persisted metadata, and mutation/race safety of returned changefeed info.

Use the validated server scheduler config when constructing the coordinator so maintainer scheduling concurrency respects max-task-concurrency instead of a hard-coded value.
@ti-chi-bot

ti-chi-bot Bot commented Apr 15, 2026

Copy link
Copy Markdown

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@ti-chi-bot ti-chi-bot Bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note Denotes a PR that will be considered when it comes time to generate release notes. labels Apr 15, 2026
@coderabbitai

coderabbitai Bot commented Apr 15, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: be625267-443b-454a-a3df-ca1786785c21

📥 Commits

Reviewing files that changed from the base of the PR and between 9f8de82 and e42cb78.

📒 Files selected for processing (2)
  • coordinator/changefeed/etcd_backend.go
  • coordinator/changefeed/etcd_backend_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • coordinator/changefeed/etcd_backend.go

📝 Walkthrough

Walkthrough

Adds persisted-changefeed retrieval and makes ResumeChangefeed return the backend’s resumed ChangeFeedInfo; Controller clones returned in-memory info and applies backend-returned info on resume. API resume handler re-fetches persisted info. Elector captures SchedulerConfig and passes scheduler settings into coordinator creation.

Changes

Changefeed resume & Controller defensive copy

Layer / File(s) Summary
Backend interface: add persisted getter & resume return type
coordinator/changefeed/changefeed_db_backend.go
Add Backend.GetChangefeedInfo(ctx, id) and change ResumeChangefeed to return (*config.ChangeFeedInfo, error) instead of only error.
Etcd backend implementation and tests
coordinator/changefeed/etcd_backend.go, coordinator/changefeed/etcd_backend_test.go
Implement EtcdBackend.GetChangefeedInfo, normalize/ensure ChangefeedID, update ResumeChangefeed to load/normalize info, update job/status when checkpoint provided, and return updated info; tests check normalized state, ID, and legacy scheduler defaults.
Mocks updated
coordinator/changefeed/mock/changefeed_db_backend.go
Generated mock adds GetChangefeedInfo and updates ResumeChangefeed mock signatures/return handling.
Coordinator API extension
coordinator/coordinator.go, pkg/server/coordinator.go
Add GetPersistedChangefeedInfo forwarding method and extend Coordinator interface with the new method.
Controller: resume uses backend-returned info & defensive GetChangefeed
coordinator/controller.go, coordinator/controller_test.go
ResumeChangefeed uses and clones backend-returned info to update in-memory state; GetChangefeed clones cf.GetInfo() and returns the clone; tests added/extended to validate resume behavior and concurrent-safety of returned info.
API handler: use persisted info for resume validation
api/v2/changefeed.go, api/v2/changefeed_test.go
ResumeChangefeed handler re-fetches persisted changefeed metadata via co.GetPersistedChangefeedInfo and uses it for subsequent resume checks; test coordinator stub implements the persisted getter. (If fetch fails, return error.)

Coordinator scheduler configurability

Layer / File(s) Summary
Elector captures scheduler settings and helper
server/module_election.go, server/module_election_test.go
NewElector accepts *config.SchedulerConfig, stores coordinatorMaxTaskConcurrency and coordinatorCheckBalanceInterval, and adds coordinatorSchedulerSettings(schedulerCfg) with a unit test verifying behavior.
Campaign wiring and server init
server/module_election.go, server/server.go
campaignCoordinator passes captured scheduler fields into coordinator.New; server initialization now calls NewElector(c, conf.Debug.Scheduler).

Sequence Diagram(s)

sequenceDiagram
  participant API as ResumeChangefeed handler
  participant Coordinator
  participant Controller
  participant EtcdBackend
  API->>Coordinator: GetChangefeed(ctx, id)
  Coordinator->>Controller: GetChangefeed(ctx, id)
  Controller-->>Coordinator: cloned in-memory ChangeFeedInfo
  API->>Coordinator: Resume request (id, newCheckpointTs)
  Coordinator->>Controller: ResumeChangefeed(ctx, id, newCheckpointTs)
  Controller->>EtcdBackend: ResumeChangefeed(ctx, id, newCheckpointTs)
  EtcdBackend-->>Controller: (*config.ChangeFeedInfo, error)
  Controller->>Controller: clone resumedInfo and update in-memory state
  Controller-->>Coordinator: resume result
  Coordinator-->>API: final resume response / error
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

  • wk989898
  • lidezhu
  • asddongmen

Poem

A rabbit clones a careful trace,
So returned info keeps its space.
Resume asks the persisted spring,
Scheduler knobs slow startup's swing.
Tests hop round to guard the race. 🐇

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 5.56% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main change: making the coordinator honor scheduler concurrency settings instead of using hard-coded values.
Description check ✅ Passed The description includes all required sections: linked issue (#4831), detailed explanation of the problem and solution, test types, performance impact analysis, and release notes.
Linked Issues check ✅ Passed All code changes directly address the objective in #4831: passing scheduler config to coordinator construction, implementing persisted metadata retrieval during resume, and updating backend interfaces to support the changes.
Out of Scope Changes check ✅ Passed All file changes are directly scoped to the issue: scheduler config threading, coordinator info retrieval, backend interface updates, and related tests. No unrelated refactoring or feature additions detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@ti-chi-bot ti-chi-bot Bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Apr 15, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request replaces hardcoded values for coordinator task concurrency and balance interval with dynamic settings retrieved from the global server configuration. It introduces a helper function coordinatorSchedulerSettings and includes a corresponding unit test to ensure the configuration is correctly applied. I have no feedback to provide.

@wlwilliamx wlwilliamx marked this pull request as ready for review June 9, 2026 02:12
@ti-chi-bot ti-chi-bot Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 9, 2026
@wlwilliamx

Copy link
Copy Markdown
Collaborator Author

/test all

@ti-chi-bot ti-chi-bot Bot added needs-1-more-lgtm Indicates a PR needs 1 more LGTM. approved labels Jun 9, 2026
@ti-chi-bot

ti-chi-bot Bot commented Jun 9, 2026

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hongyunyan, lidezhu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot Bot added lgtm and removed needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Jun 9, 2026
@ti-chi-bot

ti-chi-bot Bot commented Jun 9, 2026

Copy link
Copy Markdown

[LGTM Timeline notifier]

Timeline:

  • 2026-06-09 03:03:18.39757272 +0000 UTC m=+842699.467890100: ☑️ agreed by hongyunyan.
  • 2026-06-09 03:08:36.126568685 +0000 UTC m=+843017.196886075: ☑️ agreed by lidezhu.

@wlwilliamx

Copy link
Copy Markdown
Collaborator Author

/retest

@ti-chi-bot ti-chi-bot Bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jun 9, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
coordinator/controller_test.go (1)

395-417: ⚡ Quick win

Strengthen this concurrency test to avoid scheduler-dependent false negatives.

Line 409 snapshots storedInfo once, and both goroutines start unsynchronized; the read loop can finish before mutations begin, so an aliasing regression may slip through.

Suggested refinement
 	var (
 		wg            sync.WaitGroup
 		storedChanged bool
 	)
+	start := make(chan struct{})
 	wg.Add(2)
 	go func() {
 		defer wg.Done()
+		<-start
 		for i := 0; i < 1000; i++ {
 			ret.SinkURI = "kafka://127.0.0.1:9092"
 			ret.TargetTs = uint64(i + 1)
 		}
 	}()
 	go func() {
 		defer wg.Done()
-		storedInfo := changefeedDB.GetByID(cfID).GetInfo()
+		<-start
 		for i := 0; i < 1000; i++ {
+			storedInfo := changefeedDB.GetByID(cfID).GetInfo()
 			if storedInfo.SinkURI != "mysql://127.0.0.1:3306" || storedInfo.TargetTs != 0 {
 				storedChanged = true
 				return
 			}
 		}
 	}()
+	close(start)
 	wg.Wait()

As per coding guidelines, "Prefer focused deterministic tests; see docs/agents/testing.md before adding or changing tests."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@coordinator/controller_test.go` around lines 395 - 417, The current
concurrency test reads storedInfo once before the mutation loop, which lets the
reader finish before writers run and yields flaky/scheduler-dependent behavior;
change the reader to sample changefeedDB.GetByID(cfID).GetInfo() inside its loop
(or use a simple start barrier such as a sync.WaitGroup/channel so both
goroutines begin simultaneously) and then check for SinkURI/TargetTs changes on
each iteration (update the storedChanged flag when a change is observed);
reference the reader call changefeedDB.GetByID(cfID).GetInfo(), the writer
updates to ret.SinkURI/ret.TargetTs, and the existing wg sync to coordinate
starts.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@coordinator/controller_test.go`:
- Around line 395-417: The current concurrency test reads storedInfo once before
the mutation loop, which lets the reader finish before writers run and yields
flaky/scheduler-dependent behavior; change the reader to sample
changefeedDB.GetByID(cfID).GetInfo() inside its loop (or use a simple start
barrier such as a sync.WaitGroup/channel so both goroutines begin
simultaneously) and then check for SinkURI/TargetTs changes on each iteration
(update the storedChanged flag when a change is observed); reference the reader
call changefeedDB.GetByID(cfID).GetInfo(), the writer updates to
ret.SinkURI/ret.TargetTs, and the existing wg sync to coordinate starts.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 17782c56-d583-4a96-a3f3-02d6c2dcf96e

📥 Commits

Reviewing files that changed from the base of the PR and between 641a7ec and 7811631.

📒 Files selected for processing (5)
  • coordinator/controller.go
  • coordinator/controller_test.go
  • server/module_election.go
  • server/module_election_test.go
  • server/server.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • server/module_election_test.go

@ti-chi-bot ti-chi-bot Bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 10, 2026
@wlwilliamx

Copy link
Copy Markdown
Collaborator Author

/test all

@ti-chi-bot ti-chi-bot Bot merged commit 072f1bd into pingcap:master Jun 12, 2026
25 of 26 checks passed
@wlwilliamx wlwilliamx added the needs-cherry-pick-release-8.5 Should cherry pick this PR to release-8.5 branch. label Jun 15, 2026
@ti-chi-bot

Copy link
Copy Markdown
Member

In response to a cherrypick label: new pull request created to branch release-8.5: #5396.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved lgtm needs-cherry-pick-release-8.5 Should cherry pick this PR to release-8.5 branch. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

TiCDC can spike memory and CPU when bulk-creating idle changefeeds

4 participants