maintainer: optimize maintainer checkpoint tracking #4947

Open
hongyunyan wants to merge 13 commits into pingcap:master from
hongyunyan:codex/checkpoint-tracker

Conversation

@hongyunyan
Collaborator

@hongyunyan hongyunyan commented Apr 29, 2026

Background

Issue Number: ref #4951

During a 1M-table bootstrap test, maintainer checkpoint calculation became increasingly expensive because every 100ms tick scanned all absent and scheduling spans to compute the minimum checkpointTs. The same path also held back dispatcher scheduling progress as the number of non-replicating spans grew.

This PR focuses on replacing the repeated scans with incremental state tracking and small scheduler-side query optimizations. It does not change the checkpoint calculation rules themselves.

Motivation

The normal checkpointTs update frequency should remain unchanged because it controls externally visible checkpoint freshness. During large bootstrap waves, checkpoint calculation should avoid repeatedly scanning all absent and scheduling spans while thousands of spans are still being added.

This PR removes the repeated O(N) scans from non-replicating span checkpoint calculation and reduces scheduler batch-query overhead during operator-heavy bootstrap.

Summary of Changes

  • Add an incremental non-replicating checkpoint tracker for absent and scheduling spans.
  • Keep tracker updates on existing span state transition paths and on accepted status checkpoint advances.
  • Make scheduler replica map length queries O(1).
  • Limit GetAbsentByGroup scanning to the requested batch size.
  • Replace slow maintainer event logs that printed full messages with compact message metadata.
  • Add focused tests for tracker behavior, span state transitions, and scheduler batch scanning.
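
To make the first bullet concrete, here is a minimal sketch of an incremental minimum tracker. It is not the PR's `checkpointTsTracker`; the names `minTracker`, `tsEntry`, and `tsHeap` and the string span IDs are assumptions made for illustration. It keeps one heap entry per distinct timestamp with a reference count, so reading the minimum never scans the spans.

```go
package main

import (
	"container/heap"
	"fmt"
)

// tsEntry is one distinct timestamp, shared by every span that reports it.
type tsEntry struct {
	ts    uint64
	refs  int // number of tracked spans at this timestamp
	index int // position in the heap, kept current by Swap and Push
}

type tsHeap []*tsEntry // min-heap ordered by ts

func (h tsHeap) Len() int           { return len(h) }
func (h tsHeap) Less(i, j int) bool { return h[i].ts < h[j].ts }
func (h tsHeap) Swap(i, j int) {
	h[i], h[j] = h[j], h[i]
	h[i].index, h[j].index = i, j
}
func (h *tsHeap) Push(x any) {
	e := x.(*tsEntry)
	e.index = len(*h)
	*h = append(*h, e)
}
func (h *tsHeap) Pop() any {
	old, n := *h, len(*h)
	e := old[n-1]
	*h = old[:n-1]
	return e
}

// minTracker is a hypothetical stand-in for the tracker described above.
type minTracker struct {
	bySpan map[string]uint64   // span ID -> last tracked timestamp
	byTs   map[uint64]*tsEntry // timestamp -> shared heap entry
	heap   tsHeap
}

func newMinTracker() *minTracker {
	return &minTracker{bySpan: map[string]uint64{}, byTs: map[uint64]*tsEntry{}}
}

// track starts tracking a span, or moves an already tracked span to ts.
func (t *minTracker) track(id string, ts uint64) {
	old, tracked := t.bySpan[id]
	t.bySpan[id] = ts
	if e, ok := t.byTs[ts]; ok {
		e.refs++
	} else {
		e = &tsEntry{ts: ts, refs: 1}
		t.byTs[ts] = e
		heap.Push(&t.heap, e) // O(log n)
	}
	if tracked {
		t.release(old) // drop the reference held at the previous timestamp
	}
}

// update moves a tracked span; updates for unknown spans are ignored.
func (t *minTracker) update(id string, ts uint64) {
	if _, ok := t.bySpan[id]; ok {
		t.track(id, ts)
	}
}

// untrack stops tracking a span, for example once it starts replicating.
func (t *minTracker) untrack(id string) {
	if old, ok := t.bySpan[id]; ok {
		delete(t.bySpan, id)
		t.release(old)
	}
}

// release drops one reference and removes the timestamp when none remain.
func (t *minTracker) release(ts uint64) {
	e := t.byTs[ts]
	e.refs--
	if e.refs == 0 {
		heap.Remove(&t.heap, e.index) // O(log n)
		delete(t.byTs, ts)
	}
}

// min returns the smallest tracked timestamp without scanning any span.
func (t *minTracker) min() (uint64, bool) {
	if len(t.heap) == 0 {
		return 0, false
	}
	return t.heap[0].ts, true
}

func main() {
	tr := newMinTracker()
	tr.track("a", 100)
	tr.track("b", 80)
	tr.untrack("b")       // the minimum moves back to "a"
	fmt.Println(tr.min()) // 100 true
}
```

Reference counting keeps the heap size bounded by the number of distinct timestamps, which may matter when many freshly added spans share the same start checkpointTs. The sketch is not safe for concurrent use; a caller would be expected to hold its own lock.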

Validation

  • make fmt
  • git fetch --prune upstream master && git diff --check upstream/master...HEAD
  • git diff --check
  • go test -count=1 ./maintainer/span
  • go test -count=1 ./pkg/scheduler/replica
  • go test --tags=intest -count=1 ./downstreamadapter/sink/redo -run TestRedoSinkInProcessor
  • Independent subagent review found no meaningful correctness, performance, comment, or test coverage issues before the final tracker naming and comment cleanup.

Release Note

None

Summary by CodeRabbit

  • Bug Fixes

    • Improved synchronization in scheduler operations for more consistent behavior
  • Refactor

    • Optimized checkpoint timestamp tracking for better performance and reduced scanning
    • Improved scheduler internal efficiency for faster size/count operations
    • Enhanced event logging to include more contextual details for diagnostics
  • Tests

    • Added extensive tests for checkpoint tracking and replica scheduling scenarios

@ti-chi-bot

ti-chi-bot Bot commented Apr 29, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@ti-chi-bot ti-chi-bot Bot added do-not-merge/needs-linked-issue do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Apr 29, 2026
@ti-chi-bot

ti-chi-bot Bot commented Apr 29, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign hicqu for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai
Contributor

coderabbitai Bot commented Apr 29, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ee0f2621-cfb8-4824-972e-787f69738f28

📥 Commits

Reviewing files that changed from the base of the PR and between 7cfdb8d and 1d19b72.

📒 Files selected for processing (5)
  • maintainer/span/checkpoint_ts_tracker.go
  • maintainer/span/checkpoint_ts_tracker_test.go
  • maintainer/span/span_controller.go
  • pkg/scheduler/replica/replication_group.go
  • pkg/scheduler/replica/replication_group_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/scheduler/replica/replication_group.go

Walkthrough

This PR adds an internal heap-backed checkpoint timestamp tracker and integrates it into the span Controller to avoid full-span scans, changes SpanReplication.UpdateStatus to return whether checkpointTs changed, adds unit tests for the tracker and controller behaviors, introduces atomic size tracking and batch retrieval in scheduler replication groups, and tweaks one slow-path log to include MessageType.

Changes

Minor Logging Enhancement

| Layer / File(s) | Summary |
| --- | --- |
| Logging (maintainer/maintainer.go) | Slow-path EventMessage logging now records the MessageType (event.message.Type) instead of logging the entire Message payload. |

Checkpoint Timestamp Tracking Infrastructure & Integration

| Layer / File(s) | Summary |
| --- | --- |
| API Contract (maintainer/replica/replication_span.go) | SpanReplication.UpdateStatus(newStatus *heartbeatpb.TableSpanStatus) bool now returns false for nil and indicates whether the stored checkpointTs changed; rejects smaller checkpointTs. |
| Core Tracking (maintainer/span/checkpoint_ts_tracker.go) | New unexported checkpointTsTracker with maps and a min-heap (checkpointTsHeap) supporting track/update/untrack, refcounting, O(log n) heap ops, min(), and reset. |
| Tracker Tests (maintainer/span/checkpoint_ts_tracker_test.go) | Unit tests covering min behavior, ignored updates for missing entries, and heap stability when removing blocking entries. |
| Controller Integration (maintainer/span/span_controller.go) | Controller gains nonReplicatingCheckpointTs *checkpointTsTracker; initialized in NewController; GetMinCheckpointTsForNonReplicatingSpans uses the tracker under RLock; status updates push/pop tracker entries via new trackNonReplicatingSpan/untrackNonReplicatingSpan; state transitions updated to maintain the tracker. |
| Controller Tests (maintainer/span/span_controller_test.go) | New tests and helpers validating min checkpointTs for non-replicating spans across scheduling/replicating/bind/replace/remove scenarios. |
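
The UpdateStatus contract summarized above can be sketched as follows. The types are simplified stand-ins (`tableSpanStatus` for heartbeatpb.TableSpanStatus, `recorder` for the tracker), and the sketch assumes an equal checkpointTs counts as "not changed"; none of this is copied from the PR.

```go
package main

import "fmt"

// tableSpanStatus is a simplified stand-in for heartbeatpb.TableSpanStatus.
type tableSpanStatus struct {
	CheckpointTs uint64
}

type spanReplication struct {
	id           string
	checkpointTs uint64
}

// updateStatus reports whether the stored checkpointTs advanced. A nil
// status and a checkpointTs that does not move forward both return false.
func (r *spanReplication) updateStatus(s *tableSpanStatus) bool {
	if s == nil || s.CheckpointTs <= r.checkpointTs {
		return false
	}
	r.checkpointTs = s.CheckpointTs
	return true
}

// recorder stands in for the tracker and records every update it receives.
type recorder struct{ updates []uint64 }

func (rec *recorder) update(id string, ts uint64) {
	rec.updates = append(rec.updates, ts)
}

// applyStatus forwards only accepted advances to the tracker, so the
// tracker is touched once per real change instead of once per heartbeat.
func applyStatus(r *spanReplication, s *tableSpanStatus, rec *recorder) {
	if r.updateStatus(s) {
		rec.update(r.id, r.checkpointTs)
	}
}

func main() {
	r := &spanReplication{id: "span-1"}
	rec := &recorder{}
	for _, ts := range []uint64{100, 90, 100, 120} {
		applyStatus(r, &tableSpanStatus{CheckpointTs: ts}, rec)
	}
	fmt.Println(rec.updates) // [100 120]
}
```

Returning a boolean lets the caller skip tracker work for heartbeats that carry no new information, which is the common case while a span is still being scheduled.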

Scheduler Replica Optimization

| Layer / File(s) | Summary |
| --- | --- |
| Data Shape / Counters (pkg/scheduler/replica/replication_group.go) | iMap gains an atomic.Int64 size counter to enable O(1) Len(); Set/Delete update the counter. |
| Batch Retrieval (pkg/scheduler/replica/replication_group.go) | New GetAbsentBatch(batch int) []R to return up to batch runnable absent replicas. |
| Synchronization (pkg/scheduler/replica/replication.go) | GetAbsentByGroup now uses group.GetAbsentBatch() under the read lock; GetSchedulingSize iterates under withRLock. |
| Batch Operation Tests (pkg/scheduler/replica/replication_group_test.go) | Tests for iMap length behavior, batch-limited absent retrieval, and skipping non-runnable tasks via ShouldRun checks. |

Sequence Diagram(s)

(skip — conditions for diagram generation not met)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • pingcap/ticdc#4548: Related — also modifies SpanReplication and span controller checkpoint-related behavior.

Suggested labels

lgtm, approved

Suggested reviewers

  • lidezhu
  • wk989898
  • 3AceShowHand

Poem

🐰 I hop through heaps of timestamps, small and neat,
I track the mins so spans don't skip a beat.
Controllers store, and schedulers fetch a few,
Replicas dance, the tracker hums anew. ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 5.56%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The pull request title accurately summarizes the main change: optimizing maintainer checkpoint tracking through incremental tracking and scheduler optimizations. |
| Description check | ✅ Passed | The pull request description is comprehensive and well-structured, covering background, motivation, detailed summary of changes, validation steps, and release notes, though it deviates from the template structure. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.12.1)

Command failed



@ti-chi-bot ti-chi-bot Bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Apr 29, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request optimizes the calculation of the minimum checkpoint timestamp for non-replicating spans by introducing a heap-based checkpointTsTracker, replacing a linear scan. It also improves the efficiency of the iMap structure in the scheduler by using an atomic counter for size tracking and enhances logging for slow maintainer events with detailed message fields. Review feedback suggests removing an unnecessary state reset in the tracker's removal logic and refining the atomic counter update in the iMap.Set method to avoid redundant operations.

Comment thread maintainer/span/checkpoint_ts_tracker.go Outdated
Comment thread pkg/scheduler/replica/replication_group.go Outdated
@hongyunyan
Collaborator Author

/test all

@hongyunyan hongyunyan force-pushed the codex/checkpoint-tracker branch from f0fd45f to 7cfdb8d Compare May 6, 2026 03:44
@ti-chi-bot ti-chi-bot Bot added release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge/needs-linked-issue do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels May 6, 2026
@hongyunyan
Collaborator Author

@coderabbitai review

@coderabbitai
Contributor

coderabbitai Bot commented May 6, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Contributor

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (3)
maintainer/span/checkpoint_ts_tracker_test.go (1)

16-100: ⚡ Quick win

Use require assertions in this new test file.

These tests are deterministic, but the new cases still use manual if/t.Fatalf checks instead of the repo’s standard assertion style.

♻️ Example cleanup
 import (
 	"testing"
 
 	"github.com/pingcap/ticdc/pkg/common"
+	"github.com/stretchr/testify/require"
 )
@@
-	got, ok := tracker.min()
-	if !ok || got != 80 {
-		t.Fatalf("checkpointTsTracker.min() = %d, %v, want 80, true", got, ok)
-	}
+	got, ok := tracker.min()
+	require.True(t, ok)
+	require.Equal(t, uint64(80), got)
As per coding guidelines, "Unit tests should use `*_test.go` file naming convention and favor deterministic tests using `testify/require`."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@maintainer/span/checkpoint_ts_tracker_test.go` around lines 16 - 100, Replace
manual t.Fatalf checks with testify/require assertions in the three tests
(TestCheckpointTsTrackerMin, TestCheckpointTsTrackerIgnoresMissingUpdate,
TestCheckpointTsTrackerRemovesStaleCheckpointTs): add an import for
"github.com/stretchr/testify/require" and use functions like require.True/False,
require.Equal, and require.NoError (as appropriate) to assert expected values
and booleans instead of the current if/.../t.Fatalf patterns, keeping the same
expectation messages/values but leveraging require for concise, standard
assertions.
pkg/scheduler/replica/replication_group_test.go (1)

61-137: ⚡ Quick win

Use testify/require instead of raw t.Fatalf assertions

All three tests use bare t.Fatalf for assertions. Per coding guidelines, tests should favor testify/require for assertions.

♻️ Example migration for `TestIMapLenTracksOverwriteAndDelete`
 import (
 	"fmt"
 	"sync/atomic"
 	"testing"
 
 	"github.com/pingcap/ticdc/pkg/node"
+	"github.com/stretchr/testify/require"
 )

 func TestIMapLenTracksOverwriteAndDelete(t *testing.T) {
 	t.Parallel()

 	replicaMap := newIMap[testReplicationID, *testReplication]()
 	id := testReplicationID("a")

 	replicaMap.Set(id, &testReplication{id: id})
 	replicaMap.Set(id, &testReplication{id: id})
-	if got := replicaMap.Len(); got != 1 {
-		t.Fatalf("iMap.Len() after overwrite = %d, want 1", got)
-	}
+	require.Equal(t, 1, replicaMap.Len(), "iMap.Len() after overwrite")

 	replicaMap.Delete(testReplicationID("missing"))
-	if got := replicaMap.Len(); got != 1 {
-		t.Fatalf("iMap.Len() after deleting missing key = %d, want 1", got)
-	}
+	require.Equal(t, 1, replicaMap.Len(), "iMap.Len() after deleting missing key")

 	replicaMap.Delete(id)
-	if got := replicaMap.Len(); got != 0 {
-		t.Fatalf("iMap.Len() after delete = %d, want 0", got)
-	}
+	require.Equal(t, 0, replicaMap.Len(), "iMap.Len() after delete")
 }

As per coding guidelines: "Unit tests should use *_test.go file naming convention and favor deterministic tests using testify/require."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/scheduler/replica/replication_group_test.go` around lines 61 - 137,
Replace raw t.Fatalf assertions in TestIMapLenTracksOverwriteAndDelete,
TestGetAbsentByGroupStopsAtBatch, and TestGetAbsentByGroupSkipsNotRunnableTasks
with testify/require assertions: import "github.com/stretchr/testify/require"
and use require.Equal/require.Len (e.g., require.Equal(t, 1, replicaMap.Len())
or require.Len(t, absent, 3)) for value/length checks and require.Equal or
require.EqualValues for atomic counts (e.g., require.Equal(t, int64(3),
shouldRunCalls.Load())). Ensure the test signatures remain the same and remove
the t.Fatalf calls.
pkg/scheduler/replica/replication_group.go (1)

311-317: ⚡ Quick win

iMap.Set can be simplified using sync.Map.Swap

The current LoadOrStore + conditional Store pattern performs two map operations when updating an existing key: a LoadOrStore that only loads, followed by a Store. sync.Map.Swap atomically stores the value and reports whether the key already existed, collapsing this into a single operation:

♻️ Proposed refactoring
 func (m *iMap[T, R]) Set(key T, value R) {
-	if _, loaded := m.inner.LoadOrStore(key, value); loaded {
-		m.inner.Store(key, value)
-		return
-	}
-	m.size.Add(1)
+	if _, loaded := m.inner.Swap(key, value); !loaded {
+		m.size.Add(1)
+	}
 }

The module targets Go 1.25.8, which exceeds the 1.20 minimum required for Swap.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/scheduler/replica/replication_group.go` around lines 311 - 317, The Set
method in iMap currently uses LoadOrStore then Store, which costs two map
operations on updates; replace that logic by calling m.inner.Swap(key, value) to
atomically set the value and learn whether the key existed, then only call
m.size.Add(1) when Swap reports the key did not previously exist (loaded ==
false). Update the implementation in iMap.Set to use m.inner.Swap(key, value)
and remove the extra Store/LoadOrStore branch so size is incremented exactly
when a new key is inserted.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 27cb394a-23ab-4a24-a989-a9da26601d4f

📥 Commits

Reviewing files that changed from the base of the PR and between 5518eb2 and 7cfdb8d.

📒 Files selected for processing (9)
  • maintainer/maintainer.go
  • maintainer/replica/replication_span.go
  • maintainer/span/checkpoint_ts_tracker.go
  • maintainer/span/checkpoint_ts_tracker_test.go
  • maintainer/span/span_controller.go
  • maintainer/span/span_controller_test.go
  • pkg/scheduler/replica/replication.go
  • pkg/scheduler/replica/replication_group.go
  • pkg/scheduler/replica/replication_group_test.go

@hongyunyan
Collaborator Author

@coderabbitai review

@coderabbitai
Contributor

coderabbitai Bot commented May 6, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@hongyunyan hongyunyan marked this pull request as ready for review May 6, 2026 09:08
@ti-chi-bot ti-chi-bot Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 6, 2026
@hongyunyan hongyunyan changed the title optimize maintainer checkpoint tracking maintainer: optimize maintainer checkpoint tracking May 6, 2026

Labels

release-note-none Denotes a PR that doesn't merit a release note. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.


1 participant