maintainer: optimize maintainer checkpoint tracking by hongyunyan · Pull Request #4947 · pingcap/ticdc

hongyunyan · 2026-04-29T03:44:14Z

Background

Issue Number: ref #4951

During a 1M-table bootstrap test, maintainer checkpoint calculation became increasingly expensive because every 100ms tick scanned all absent and scheduling spans to compute the minimum checkpointTs. The same path also held back dispatcher scheduling progress as the number of non-replicating spans grew.

This PR focuses on replacing the repeated scans with incremental state tracking and small scheduler-side query optimizations. It does not change the checkpoint calculation rules themselves.

Motivation

The normal checkpointTs update frequency should remain unchanged because it controls externally visible checkpoint freshness. During large bootstrap waves, checkpoint calculation should avoid repeatedly scanning all absent and scheduling spans while thousands of spans are still being added.

This PR removes the repeated O(N) scans from non-replicating span checkpoint calculation and reduces scheduler batch-query overhead during operator-heavy bootstrap.

Summary of Changes

Add an incremental non-replicating checkpoint tracker for absent and scheduling spans.
Keep tracker updates on existing span state transition paths and on accepted status checkpoint advances.
Make scheduler replica map length queries O(1).
Limit GetAbsentByGroup scanning to the requested batch size.
Replace slow maintainer event logs that printed full messages with compact message metadata.
Add focused tests for tracker behavior, span state transitions, and scheduler batch scanning.

Validation

make fmt
git fetch --prune upstream master && git diff --check upstream/master...HEAD
git diff --check
go test -count=1 ./maintainer/span
go test -count=1 ./pkg/scheduler/replica
go test --tags=intest -count=1 ./downstreamadapter/sink/redo -run TestRedoSinkInProcessor
Independent subagent review found no meaningful correctness, performance, comment, or test coverage issues before the final tracker naming and comment cleanup.

Release Note

None

Summary by CodeRabbit

Bug Fixes
- Improved synchronization in scheduler operations for more consistent behavior
Refactor
- Optimized checkpoint timestamp tracking for better performance and reduced scanning
- Improved scheduler internal efficiency for faster size/count operations
- Enhanced event logging to include more contextual details for diagnostics
Tests
- Added extensive tests for checkpoint tracking and replica scheduling scenarios

ti-chi-bot · 2026-04-29T03:44:16Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

ti-chi-bot · 2026-04-29T03:44:18Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign hicqu for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai · 2026-04-29T03:44:22Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ee0f2621-cfb8-4824-972e-787f69738f28

📥 Commits

Reviewing files that changed from the base of the PR and between 7cfdb8d and 1d19b72.

📒 Files selected for processing (5)

maintainer/span/checkpoint_ts_tracker.go
maintainer/span/checkpoint_ts_tracker_test.go
maintainer/span/span_controller.go
pkg/scheduler/replica/replication_group.go
pkg/scheduler/replica/replication_group_test.go

🚧 Files skipped from review as they are similar to previous changes (1)

pkg/scheduler/replica/replication_group.go

📝 Walkthrough

Walkthrough

This PR adds an internal heap-backed checkpoint timestamp tracker and integrates it into the span Controller to avoid full-span scans. It changes SpanReplication.UpdateStatus to return whether checkpointTs changed, adds unit tests for the tracker and controller behaviors, and introduces atomic size tracking and batch retrieval in scheduler replication groups. It also changes one slow-path log to record MessageType instead of the full message.
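The `UpdateStatus` contract mentioned in this walkthrough could look roughly like the sketch below. It is an illustration only, not the PR's code: `tableSpanStatus` and `spanReplication` are hypothetical stand-ins for `heartbeatpb.TableSpanStatus` and `SpanReplication`, and the nil and regression handling follows the description in the changes table rather than the actual source.

```go
package main

import "fmt"

// tableSpanStatus is a hypothetical stand-in for heartbeatpb.TableSpanStatus.
type tableSpanStatus struct {
	CheckpointTs uint64
}

// spanReplication is a hypothetical stand-in for SpanReplication.
type spanReplication struct {
	status tableSpanStatus
}

// UpdateStatus stores newStatus and reports whether the stored checkpointTs
// changed. A nil status or a smaller checkpointTs is rejected and reports false.
func (r *spanReplication) UpdateStatus(newStatus *tableSpanStatus) bool {
	if newStatus == nil {
		return false
	}
	if newStatus.CheckpointTs < r.status.CheckpointTs {
		return false
	}
	changed := newStatus.CheckpointTs != r.status.CheckpointTs
	r.status = *newStatus
	return changed
}

func main() {
	r := &spanReplication{status: tableSpanStatus{CheckpointTs: 100}}
	fmt.Println(r.UpdateStatus(nil))                                 // nil status
	fmt.Println(r.UpdateStatus(&tableSpanStatus{CheckpointTs: 90}))  // regression
	fmt.Println(r.UpdateStatus(&tableSpanStatus{CheckpointTs: 120})) // advance
}
```

Returning a boolean lets the caller touch the tracker only when the stored checkpointTs really moved, which keeps the common "no change" heartbeat cheap.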

Changes

Minor Logging Enhancement

Layer / File(s)	Summary
Logging `maintainer/maintainer.go`	Slow-path EventMessage logging now records the `MessageType` (`event.message.Type`) instead of logging the entire Message payload.

Checkpoint Timestamp Tracking Infrastructure & Integration

Layer / File(s)	Summary
API Contract `maintainer/replica/replication_span.go`	`SpanReplication.UpdateStatus(newStatus *heartbeatpb.TableSpanStatus) bool` — now returns `false` for `nil` and indicates whether stored checkpointTs changed; rejects smaller checkpointTs.
Core Tracking `maintainer/span/checkpoint_ts_tracker.go`	New unexported `checkpointTsTracker` with maps and a min-heap (`checkpointTsHeap`) supporting track/update/untrack, refcounting, O(log n) heap ops, `min()`, and reset.
Tracker Tests `maintainer/span/checkpoint_ts_tracker_test.go`	Unit tests covering min behavior, ignored updates for missing entries, and heap stability when removing blocking entries.
Controller Integration `maintainer/span/span_controller.go`	Controller gains `nonReplicatingCheckpointTs *checkpointTsTracker`; initialized in `NewController`; `GetMinCheckpointTsForNonReplicatingSpans` uses tracker under RLock; status updates push/pop tracker entries via new `trackNonReplicatingSpan`/`untrackNonReplicatingSpan`; state transitions updated to maintain tracker.
Controller Tests `maintainer/span/span_controller_test.go`	New tests and helpers validating min checkpointTs for non-replicating spans across scheduling/replicating/bind/replace/remove scenarios.
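
To make the "maps and a min-heap with refcounting" description above concrete, here is a minimal sketch of such a tracker. It assumes one heap entry per distinct checkpointTs with a reference count, and string span IDs; the names (`tracker`, `tsEntry`, `track`, `update`, `untrack`) are illustrative and the real `checkpointTsTracker` may be structured differently.

```go
package main

import (
	"container/heap"
	"fmt"
)

// tsEntry is one distinct checkpointTs plus the number of spans sitting at it.
type tsEntry struct {
	ts    uint64
	refs  int
	index int // position in the heap, kept current by Swap and Push
}

// tsHeap is a min-heap of entries ordered by ts.
type tsHeap []*tsEntry

func (h tsHeap) Len() int           { return len(h) }
func (h tsHeap) Less(i, j int) bool { return h[i].ts < h[j].ts }
func (h tsHeap) Swap(i, j int) {
	h[i], h[j] = h[j], h[i]
	h[i].index, h[j].index = i, j
}
func (h *tsHeap) Push(x any) {
	e := x.(*tsEntry)
	e.index = len(*h)
	*h = append(*h, e)
}
func (h *tsHeap) Pop() any {
	old := *h
	e := old[len(old)-1]
	old[len(old)-1] = nil
	*h = old[:len(old)-1]
	return e
}

// tracker keeps the minimum checkpointTs over all tracked spans.
type tracker struct {
	bySpan  map[string]uint64   // span ID -> its current checkpointTs
	byTs    map[uint64]*tsEntry // checkpointTs -> refcounted heap entry
	minHeap tsHeap
}

func newTracker() *tracker {
	return &tracker{bySpan: map[string]uint64{}, byTs: map[uint64]*tsEntry{}}
}

// track adds a span or moves an already tracked span to a new checkpointTs.
func (t *tracker) track(id string, ts uint64) {
	if old, ok := t.bySpan[id]; ok {
		if old == ts {
			return
		}
		t.release(old)
	}
	t.bySpan[id] = ts
	if e, ok := t.byTs[ts]; ok {
		e.refs++
		return
	}
	e := &tsEntry{ts: ts, refs: 1}
	t.byTs[ts] = e
	heap.Push(&t.minHeap, e) // O(log n)
}

// update is ignored for spans that are not tracked.
func (t *tracker) update(id string, ts uint64) {
	if _, ok := t.bySpan[id]; ok {
		t.track(id, ts)
	}
}

func (t *tracker) untrack(id string) {
	if ts, ok := t.bySpan[id]; ok {
		delete(t.bySpan, id)
		t.release(ts)
	}
}

// release drops one reference and removes the heap entry when none remain.
func (t *tracker) release(ts uint64) {
	e := t.byTs[ts]
	if e.refs--; e.refs == 0 {
		heap.Remove(&t.minHeap, e.index) // O(log n)
		delete(t.byTs, ts)
	}
}

// min returns the smallest tracked checkpointTs, or false when empty.
func (t *tracker) min() (uint64, bool) {
	if len(t.minHeap) == 0 {
		return 0, false
	}
	return t.minHeap[0].ts, true
}

func main() {
	tr := newTracker()
	tr.track("a", 100)
	tr.track("b", 80)
	tr.untrack("b")
	fmt.Println(tr.min())
}
```

Refcounting by timestamp keeps the heap size bounded by the number of distinct checkpointTs values, which matters during bootstrap when many absent spans may share the same starting timestamp.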

Scheduler Replica Optimization

Layer / File(s)	Summary
Data Shape / Counters `pkg/scheduler/replica/replication_group.go`	`iMap` gains an `atomic.Int64` `size` counter to enable O(1) `Len()`; `Set`/`Delete` update the counter.
Batch Retrieval `pkg/scheduler/replica/replication_group.go`	New `GetAbsentBatch(batch int) []R` to return up to `batch` runnable absent replicas.
Synchronization `pkg/scheduler/replica/replication.go`	`GetAbsentByGroup` now uses `group.GetAbsentBatch()` under the read lock; `GetSchedulingSize` iterates under `withRLock`.
Batch Operation Tests `pkg/scheduler/replica/replication_group_test.go`	Tests for iMap length behavior, batch-limited absent retrieval, and skipping non-runnable tasks via `ShouldRun` checks.
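
The batch-limited retrieval described in this table can be sketched as below. This is a simplified illustration under assumptions: it scans a plain slice instead of the group's internal map, and `replica`, its `checked` field, and `getAbsentBatch` are hypothetical names used to show that scanning stops once the batch is full.

```go
package main

import "fmt"

// replica is a hypothetical stand-in for an absent replication task.
type replica struct {
	id       string
	runnable bool
	checked  bool // set when ShouldRun was consulted, for illustration only
}

// ShouldRun reports whether the replica can be scheduled right now.
func (r *replica) ShouldRun() bool {
	r.checked = true
	return r.runnable
}

// getAbsentBatch returns up to batch runnable replicas and stops scanning
// as soon as the batch is full, so later entries are never visited.
func getAbsentBatch(absent []*replica, batch int) []*replica {
	if batch <= 0 {
		return nil
	}
	capacity := batch
	if len(absent) < capacity {
		capacity = len(absent)
	}
	out := make([]*replica, 0, capacity)
	for _, r := range absent {
		if !r.ShouldRun() {
			continue // skip tasks that are not runnable yet
		}
		out = append(out, r)
		if len(out) >= batch {
			break
		}
	}
	return out
}

func main() {
	absent := []*replica{
		{id: "a", runnable: false},
		{id: "b", runnable: true},
		{id: "c", runnable: true},
		{id: "d", runnable: true},
	}
	for _, r := range getAbsentBatch(absent, 2) {
		fmt.Println(r.id)
	}
	fmt.Println("d visited:", absent[3].checked)
}
```

With a batch of 2, the scan touches only as many entries as it needs, so the per-tick cost depends on the batch size and the number of skipped tasks rather than on the total number of absent replicas.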

Sequence Diagram(s)

(skip — conditions for diagram generation not met)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

pingcap/ticdc#4548: Related — also modifies SpanReplication and span controller checkpoint-related behavior.

Suggested labels

lgtm, approved

Suggested reviewers

lidezhu
wk989898
3AceShowHand

Poem

🐰 I hop through heaps of timestamps, small and neat,
I track the mins so spans don't skip a beat.
Controllers store, and schedulers fetch a few,
Replicas dance, the tracker hums anew. ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 5.56% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The pull request title accurately summarizes the main change: optimizing maintainer checkpoint tracking through incremental tracking and scheduler optimizations.
Description check	✅ Passed	The pull request description is comprehensive and well-structured, covering background, motivation, detailed summary of changes, validation steps, and release notes, though it deviates from the template structure.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.12.1)

Command failed

gemini-code-assist

Code Review

This pull request optimizes the calculation of the minimum checkpoint timestamp for non-replicating spans by introducing a heap-based checkpointTsTracker, replacing a linear scan. It also improves the efficiency of the iMap structure in the scheduler by using an atomic counter for size tracking and enhances logging for slow maintainer events with detailed message fields. Review feedback suggests removing an unnecessary state reset in the tracker's removal logic and refining the atomic counter update in the iMap.Set method to avoid redundant operations.

hongyunyan · 2026-04-29T04:09:09Z

/test all

hongyunyan · 2026-05-06T03:49:51Z

@coderabbitai review

coderabbitai · 2026-05-06T03:49:56Z

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai

🧹 Nitpick comments (3)

maintainer/span/checkpoint_ts_tracker_test.go (1)

16-100: ⚡ Quick win

Use require assertions in this new test file.

These tests are deterministic, but the new cases still use manual if/t.Fatalf checks instead of the repo’s standard assertion style.

♻️ Example cleanup

 import (
 	"testing"
 
 	"github.com/pingcap/ticdc/pkg/common"
+	"github.com/stretchr/testify/require"
 )
@@
-	got, ok := tracker.min()
-	if !ok || got != 80 {
-		t.Fatalf("checkpointTsTracker.min() = %d, %v, want 80, true", got, ok)
-	}
+	got, ok := tracker.min()
+	require.True(t, ok)
+	require.Equal(t, uint64(80), got)

As per coding guidelines, "Unit tests should use `*_test.go` file naming convention and favor deterministic tests using `testify/require`."

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@maintainer/span/checkpoint_ts_tracker_test.go` around lines 16 - 100, Replace
manual t.Fatalf checks with testify/require assertions in the three tests
(TestCheckpointTsTrackerMin, TestCheckpointTsTrackerIgnoresMissingUpdate,
TestCheckpointTsTrackerRemovesStaleCheckpointTs): add an import for
"github.com/stretchr/testify/require" and use functions like require.True/False,
require.Equal, and require.NoError (as appropriate) to assert expected values
and booleans instead of the current if/.../t.Fatalf patterns, keeping the same
expectation messages/values but leveraging require for concise, standard
assertions.

pkg/scheduler/replica/replication_group_test.go (1)

61-137: ⚡ Quick win

Use testify/require instead of raw t.Fatalf assertions

All three tests use bare t.Fatalf for assertions. Per coding guidelines, tests should favor testify/require for assertions.

♻️ Example migration for `TestIMapLenTracksOverwriteAndDelete`

 import (
 	"fmt"
 	"sync/atomic"
 	"testing"
 
 	"github.com/pingcap/ticdc/pkg/node"
+	"github.com/stretchr/testify/require"
 )

 func TestIMapLenTracksOverwriteAndDelete(t *testing.T) {
 	t.Parallel()

 	replicaMap := newIMap[testReplicationID, *testReplication]()
 	id := testReplicationID("a")

 	replicaMap.Set(id, &testReplication{id: id})
 	replicaMap.Set(id, &testReplication{id: id})
-	if got := replicaMap.Len(); got != 1 {
-		t.Fatalf("iMap.Len() after overwrite = %d, want 1", got)
-	}
+	require.Equal(t, 1, replicaMap.Len(), "iMap.Len() after overwrite")

 	replicaMap.Delete(testReplicationID("missing"))
-	if got := replicaMap.Len(); got != 1 {
-		t.Fatalf("iMap.Len() after deleting missing key = %d, want 1", got)
-	}
+	require.Equal(t, 1, replicaMap.Len(), "iMap.Len() after deleting missing key")

 	replicaMap.Delete(id)
-	if got := replicaMap.Len(); got != 0 {
-		t.Fatalf("iMap.Len() after delete = %d, want 0", got)
-	}
+	require.Equal(t, 0, replicaMap.Len(), "iMap.Len() after delete")
 }

As per coding guidelines: "Unit tests should use *_test.go file naming convention and favor deterministic tests using testify/require."

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/scheduler/replica/replication_group_test.go` around lines 61 - 137,
Replace raw t.Fatalf assertions in TestIMapLenTracksOverwriteAndDelete,
TestGetAbsentByGroupStopsAtBatch, and TestGetAbsentByGroupSkipsNotRunnableTasks
with testify/require assertions: import "github.com/stretchr/testify/require"
and use require.Equal/require.Len (e.g., require.Equal(t, 1, replicaMap.Len())
or require.Len(t, absent, 3)) for value/length checks and require.Equal or
require.EqualValues for atomic counts (e.g., require.Equal(t, int64(3),
shouldRunCalls.Load())). Ensure the test signatures remain the same and remove
the t.Fatalf calls.

pkg/scheduler/replica/replication_group.go (1)

311-317: ⚡ Quick win

iMap.Set can be simplified using sync.Map.Swap

The current LoadOrStore + conditional Store pattern performs two map writes when updating an existing key. sync.Map.Swap atomically stores the value and returns whether the key existed, collapsing this into a single operation:

♻️ Proposed refactoring

 func (m *iMap[T, R]) Set(key T, value R) {
-	if _, loaded := m.inner.LoadOrStore(key, value); loaded {
-		m.inner.Store(key, value)
-		return
-	}
-	m.size.Add(1)
+	if _, loaded := m.inner.Swap(key, value); !loaded {
+		m.size.Add(1)
+	}
 }

The module targets Go 1.25.8, which exceeds the 1.20 minimum required for Swap.
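
For reference, a self-contained version of the suggested `Swap`-based counter might look like the sketch below. It assumes `iMap` wraps a `sync.Map` with an `atomic.Int64` size, as the review describes; the `Delete` implementation using `LoadAndDelete` is an assumption added for completeness, not code from the PR.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// iMap is a sketch of a concurrent map with an O(1) Len.
type iMap[K comparable, V any] struct {
	inner sync.Map
	size  atomic.Int64
}

// Set stores value and bumps the counter only when the key is new.
// sync.Map.Swap (Go 1.20+) returns loaded == true if the key already existed.
func (m *iMap[K, V]) Set(key K, value V) {
	if _, loaded := m.inner.Swap(key, value); !loaded {
		m.size.Add(1)
	}
}

// Delete removes key and decrements the counter only if it was present.
func (m *iMap[K, V]) Delete(key K) {
	if _, loaded := m.inner.LoadAndDelete(key); loaded {
		m.size.Add(-1)
	}
}

// Len reads the counter instead of ranging over the map.
func (m *iMap[K, V]) Len() int {
	return int(m.size.Load())
}

func main() {
	m := &iMap[string, int]{}
	m.Set("a", 1)
	m.Set("a", 2) // overwrite, size unchanged
	m.Delete("missing")
	fmt.Println(m.Len())
	m.Delete("a")
	fmt.Println(m.Len())
}
```

Because the map write and the counter update are two separate atomic steps, `Len` can briefly lag a concurrent `Set` or `Delete`; that is usually acceptable for scheduler size queries, but it is worth keeping in mind.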

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/scheduler/replica/replication_group.go` around lines 311 - 317, The Set
method in iMap currently uses LoadOrStore then Store causing two writes on
updates; replace that logic by calling m.inner.Swap(key, value) to atomically
set the value and get whether the key existed, then only call m.size.Add(1) when
Swap reports the key did not previously exist (loaded == false). Update the
implementation in iMap.Set to use m.inner.Swap(key, value) and remove the extra
Store/LoadOrStore branch so size is incremented exactly when a new key is
inserted.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 27cb394a-23ab-4a24-a989-a9da26601d4f

📥 Commits

Reviewing files that changed from the base of the PR and between 5518eb2 and 7cfdb8d.

📒 Files selected for processing (9)

maintainer/maintainer.go
maintainer/replica/replication_span.go
maintainer/span/checkpoint_ts_tracker.go
maintainer/span/checkpoint_ts_tracker_test.go
maintainer/span/span_controller.go
maintainer/span/span_controller_test.go
pkg/scheduler/replica/replication.go
pkg/scheduler/replica/replication_group.go
pkg/scheduler/replica/replication_group_test.go

hongyunyan · 2026-05-06T04:28:27Z

@coderabbitai review

coderabbitai · 2026-05-06T04:28:33Z

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

…yan/tigate into codex/checkpoint-tracker

optimize maintainer checkpoint tracking

7f3a7b5

ti-chi-bot Bot added the size/XXL label (denotes a PR that changes 1000+ lines, ignoring generated files) on Apr 29, 2026

gemini-code-assist Bot reviewed Apr 29, 2026

Comment thread maintainer/span/checkpoint_ts_tracker.go Outdated

Comment thread pkg/scheduler/replica/replication_group.go Outdated

add checkpoint tracker coverage

2339054

hongyunyan added 5 commits April 29, 2026 14:46

update

7cfdb8d

Throttle checkpoint calculation under operator backlog

002a79a

update

66babac

update

0f8d278

update

f0fd45f

hongyunyan force-pushed the codex/checkpoint-tracker branch from f0fd45f to 7cfdb8d Compare May 6, 2026 03:44

ti-chi-bot Bot added the release-note-none label (denotes a PR that doesn't merit a release note) and removed the do-not-merge/needs-linked-issue and do-not-merge/release-note-label-needed labels (the latter indicates that a PR should not merge because it's missing one of the release note labels) on May 6, 2026

coderabbitai Bot reviewed May 6, 2026

maintainer: address checkpoint tracker review comments

8e4aeb4

hongyunyan added 4 commits May 6, 2026 16:14

maintainer: clarify checkpoint tracker naming

e573949

Merge branch 'codex/checkpoint-tracker' of https://github.com/hongyun…

52fe352

…yan/tigate into codex/checkpoint-tracker

update

f0dd19b

update

e15cd17

update

1d19b72

hongyunyan marked this pull request as ready for review May 6, 2026 09:08

ti-chi-bot Bot removed the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) on May 6, 2026

hongyunyan changed the title ~~optimize maintainer checkpoint tracking~~ maintainer: optimize maintainer checkpoint tracking May 6, 2026
