coordinator: persist maintainer epochs before ownership changes#5434
Conversation
|
Skipping CI for Draft Pull Request. |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughThis PR introduces a monotonically increasing ChangesMaintainer Epoch Fencing
Sequence Diagram(s)sequenceDiagram
participant Coordinator
participant OperatorController
participant Backend as EtcdBackend
participant etcd
rect rgba(70, 130, 180, 0.5)
Note over Coordinator,etcd: AddOperator with epoch bump
Coordinator->>OperatorController: AddOperator(addMaintainerOp)
OperatorController->>OperatorController: precheck slot free, reserve epochBumping[id]
OperatorController->>Backend: BumpChangefeedEpoch(id, candidateEpoch, opts)
loop CAS retry
Backend->>etcd: GET info + status
Backend->>Backend: AdvanceChangefeedEpoch(candidate, current)
Backend->>etcd: CAS PUT info [+ status]
end
Backend-->>OperatorController: updated ChangeFeedInfo (bumped epoch)
OperatorController->>OperatorController: recheck slot, push operator, clear epochBumping[id]
end
rect rgba(60, 179, 113, 0.5)
Note over Coordinator,OperatorController: Heartbeat epoch admission
Coordinator->>Coordinator: handleSingleMaintainerStatus(status)
Coordinator->>Coordinator: MaintainerEpochMatches(status.MaintainerEpoch, cf.Epoch)?
alt mismatch
Coordinator->>Coordinator: drop status with warning
else match
Coordinator->>OperatorController: UpdateOperatorStatus(status)
end
end
Estimated code review effort🎯 5 (Critical) | ⏱️ ~120 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 golangci-lint (2.12.2)Command failed Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces a fencing mechanism using maintainer epochs to prevent stale maintainers from making decisions or blocking actions during coordinator failover or rolling upgrades. It replaces the resume changefeed operation with an atomic epoch-bumping mechanism in the backend, updates the coordinator and operators to validate epochs, and adds extensive tests. The review feedback highlights several critical reliability issues, including potential nil pointer dereference panics in NewAddMaintainerMessage and the Check methods of the stop and add operators when handling nil statuses, as well as an out-of-sync state issue in AddOperator if the operator is rejected after an epoch has already been bumped in etcd.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| if !oc.recheckAddOperatorLocked(op, cf) { | ||
| return false | ||
| } | ||
| cf.SetInfo(info) | ||
| oc.pushOperator(op) | ||
| return true |
There was a problem hiding this comment.
If recheckAddOperatorLocked fails (e.g., because another operator was added concurrently), AddOperator returns false without updating the in-memory changefeed info. However, the epoch has already been successfully bumped in etcd via bumpChangefeedEpoch. This leaves the in-memory changefeed epoch out of sync with etcd, which can cause subsequent coordinator actions or messages to use a stale epoch and get rejected. We should update the in-memory info if the changefeed still exists and is the same instance.
current := oc.changefeedDB.GetByID(op.ID())
if current == cf {
cf.SetInfo(info)
}
if !oc.recheckAddOperatorLocked(op, cf) {
return false
}
oc.pushOperator(op)
return true| return messaging.NewSingleTargetMessage(server, | ||
| messaging.MaintainerManagerTopic, | ||
| &heartbeatpb.AddMaintainerRequest{ | ||
| Id: c.ID.ToPB(), | ||
| CheckpointTs: c.GetStatus().CheckpointTs, | ||
| Config: c.configBytes, | ||
| Config: []byte(configData), | ||
| IsNewChangefeed: c.isNew, | ||
| KeyspaceId: c.GetKeyspaceID(), | ||
| KeyspaceId: info.KeyspaceID, | ||
| MaintainerEpoch: info.Epoch, | ||
| }) |
There was a problem hiding this comment.
c.GetStatus() can return nil (as explicitly handled in GetStatusForResume). Calling c.GetStatus().CheckpointTs directly without a nil check can cause a nil pointer dereference panic. We should safely fall back to c.GetLastSavedCheckPointTs() if the status is nil.
checkpointTs := c.GetLastSavedCheckPointTs()
if status := c.GetStatus(); status != nil {
checkpointTs = status.CheckpointTs
}
return messaging.NewSingleTargetMessage(server,
messaging.MaintainerManagerTopic,
&heartbeatpb.AddMaintainerRequest{
Id: c.ID.ToPB(),
CheckpointTs: checkpointTs,
Config: []byte(configData),
IsNewChangefeed: c.isNew,
KeyspaceId: info.KeyspaceID,
MaintainerEpoch: info.Epoch,
})| func (m *StopChangefeedOperator) Check(from node.ID, status *heartbeatpb.MaintainerStatus) { | ||
| if !m.finished.Load() && | ||
| from == m.nodeID && | ||
| common.MaintainerEpochMatches(status.MaintainerEpoch, m.maintainerEpoch) && | ||
| status.State != heartbeatpb.ComponentState_Working { | ||
| log.Info("maintainer report non-working status", | ||
| zap.Stringer("maintainer", m.cfID)) | ||
| m.finished.Store(true) | ||
| } | ||
| } |
There was a problem hiding this comment.
The Check method does not check if status is nil before accessing status.MaintainerEpoch and status.State. This can lead to a nil pointer dereference panic. We should add a nil check at the beginning of the method.
func (m *StopChangefeedOperator) Check(from node.ID, status *heartbeatpb.MaintainerStatus) {
if status == nil {
return
}
if !m.finished.Load() &&
from == m.nodeID &&
common.MaintainerEpochMatches(status.MaintainerEpoch, m.maintainerEpoch) &&
status.State != heartbeatpb.ComponentState_Working {
log.Info("maintainer report non-working status",
zap.Stringer("maintainer", m.cfID))
m.finished.Store(true)
}
}| if !m.finished.Load() && from == m.dest && | ||
| common.MaintainerEpochMatches(status.MaintainerEpoch, m.cf.GetInfo().Epoch) && | ||
| status.State == heartbeatpb.ComponentState_Working && |
There was a problem hiding this comment.
The Check method does not check if status is nil before accessing status.MaintainerEpoch and status.State. This can lead to a nil pointer dereference panic. We should add a nil check at the beginning of the method.
if status == nil {
return
}
if !m.finished.Load() && from == m.dest &&
common.MaintainerEpochMatches(status.MaintainerEpoch, m.cf.GetInfo().Epoch) &&
status.State == heartbeatpb.ComponentState_Working &&There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (5)
coordinator/operator/operator_stop_test.go (1)
41-82: ⚡ Quick winAdd a focused test for stop-operator epoch gating.
These updates only adapt constructor callsites. Please add a direct
Checktest that confirms mismatchedMaintainerEpochdoes not finish, and matched epoch does.As per coding guidelines, "Prefer focused deterministic tests; see docs/agents/testing.md before adding or changing tests."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@coordinator/operator/operator_stop_test.go` around lines 41 - 82, Add a new focused test function (e.g., TestStopChangefeedOperator_Check) that directly tests the epoch gating behavior of the StopChangefeedOperator. This test should create instances of StopChangefeedOperator with different MaintainerEpoch values, invoke the Check method, and verify two scenarios: first, that when the operator's MaintainerEpoch does not match the changefeed's epoch, the finished flag remains false, and second, that when the epochs match, the Check method properly completes the operation and sets the finished flag to true. This ensures deterministic testing of the core epoch gating logic as referenced in the testing guidelines.Source: Coding guidelines
coordinator/operator/operator_controller_test.go (1)
91-91: ⚡ Quick winRename the new tests to avoid underscores.
These new function names conflict with the repository’s Go naming rule; e.g. use
TestControllerStopChangefeedWithMaintainerEpoch.Proposed rename diff
-func TestController_StopChangefeedWithMaintainerEpoch(t *testing.T) { +func TestControllerStopChangefeedWithMaintainerEpoch(t *testing.T) { @@ -func TestController_StopRemoteMaintainerWithMaintainerEpoch(t *testing.T) { +func TestControllerStopRemoteMaintainerWithMaintainerEpoch(t *testing.T) { @@ -func TestController_AddOperatorBumpsAndPersistsOwnershipEpoch(t *testing.T) { +func TestControllerAddOperatorBumpsAndPersistsOwnershipEpoch(t *testing.T) { @@ -func TestController_AddOperatorRejectsConcurrentEpochBump(t *testing.T) { +func TestControllerAddOperatorRejectsConcurrentEpochBump(t *testing.T) { @@ -func TestController_StopChangefeedDuringMoveUsesOriginEpoch(t *testing.T) { +func TestControllerStopChangefeedDuringMoveUsesOriginEpoch(t *testing.T) { @@ -func TestController_AddOperatorEpochBumpDoesNotBlockStatusAndStop(t *testing.T) { +func TestControllerAddOperatorEpochBumpDoesNotBlockStatusAndStop(t *testing.T) {As per coding guidelines, "Functions should use camelCase naming and do not include underscores (e.g.,
getPartitionNum, notget_partition_num)."Also applies to: 111-111, 226-226, 309-309, 367-367, 445-445
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@coordinator/operator/operator_controller_test.go` at line 91, The test function names in the file use underscores which violates the repository's Go naming convention requiring camelCase without underscores. Rename all test functions to remove underscores: change TestController_StopChangefeedWithMaintainerEpoch to TestControllerStopChangefeedWithMaintainerEpoch, and apply the same camelCase formatting to all other affected test functions at the specified locations. Replace all underscores between words with direct concatenation while maintaining proper camelCase capitalization.Source: Coding guidelines
coordinator/controller_test.go (2)
744-751: ⚡ Quick winAssert resume clears the persisted runtime error.
ResumeChangefeedsetsUpdateError: truewith a nil error; add that assertion to the shared helper so the resume tests catch regressions that leave stale errors persisted.Proposed test assertion
require.NotNil(t, options.State) require.Equal(t, config.StateNormal, *options.State) require.True(t, options.UpdateStatus) + require.True(t, options.UpdateError) + require.Nil(t, options.Error) require.Equal(t, checkpointTs, options.CheckpointTs)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@coordinator/controller_test.go` around lines 744 - 751, The test for ResumeChangefeed is missing an assertion to verify that resuming a changefeed clears any persisted runtime errors. In the DoAndReturn closure of the BumpChangefeedEpoch mock expectation, add two assertions after the existing require statements: one to verify that options.UpdateError is true, and another to verify that options.Error is nil. This ensures that the resume operation properly clears stale errors that may have been persisted from previous failures.
36-36: ⚡ Quick winUse the default
schedulerimport name.There is no conflicting
schedulerimport in this file, so the alias is unnecessary.As per coding guidelines,
**/*.go: “Do not rename imports unless required to resolve a package name conflict or to follow an existing local convention.”🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@coordinator/controller_test.go` at line 36, The import statement for github.com/pingcap/ticdc/pkg/scheduler is using an unnecessary alias pkgscheduler when there are no conflicting scheduler imports in this file. Remove the pkgscheduler alias prefix from the import and use the default scheduler import name instead, following the coding guideline that aliases should only be used to resolve package name conflicts or follow existing local conventions.Source: Coding guidelines
coordinator/coordinator_test.go (1)
291-298: ⚡ Quick winPropagate the add-request epoch in the mock maintainer.
The mock reports epoch
0, so coordinator scheduling tests pass via compatibility matching and would miss a regression whereAddMaintainerRequest.MaintainerEpochis wrong. Copyreq.MaintainerEpochinto the created status.Proposed mock update
cf = &heartbeatpb.MaintainerStatus{ ChangefeedID: req.GetId(), FeedState: "normal", State: heartbeatpb.ComponentState_Working, CheckpointTs: req.CheckpointTs, + MaintainerEpoch: req.MaintainerEpoch, // In these coordinator tests, the mock maintainer is considered immediately bootstrapped. BootstrapDone: true, }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@coordinator/coordinator_test.go` around lines 291 - 298, The mock maintainer status in the heartbeatpb.MaintainerStatus struct creation is not propagating the epoch from the request, which allows tests to pass even if MaintainerEpoch is incorrectly set. Add a field assignment to the MaintainerStatus struct that copies the epoch value from req.MaintainerEpoch to ensure the mock properly reflects the request epoch and prevents regressions in epoch handling.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@coordinator/controller.go`:
- Around line 1190-1194: In the code block where BumpChangefeedEpoch is called,
after checking for the error returned by c.backend.BumpChangefeedEpoch, add a
nil check for the info result before calling cf.SetInfo(info). If info is nil,
return an error using errors.Trace or errors.New to prevent nil values from
being set, as this could cause nil dereferences later when cf.GetInfo() is
called in heartbeat or state operations. This guards against the case where
BumpChangefeedEpoch returns a nil result even though no error was reported,
mirroring the same validation pattern that ResumeChangefeed already implements.
- Around line 616-623: The RemoveMaintainerMessage command sent via
c.messageCenter.SendCommand is currently discarding any error returned (using
underscore assignment), which means stale maintainers will not be cleaned up if
the command fails. Replace the underscore assignment with proper error handling
by capturing the error returned from SendCommand, logging the failure with
appropriate context, and routing the failure into a retry mechanism for cleanup
instead of silently ignoring it. Apply this same fix to all occurrences at the
mentioned line ranges (760-767 and 847-854) where similar
RemoveMaintainerMessage commands are being sent.
In `@coordinator/operator/operator_stop.go`:
- Around line 63-67: The Check method in the StopChangefeedOperator class
dereferences the status parameter at line 66 when accessing
status.MaintainerEpoch without first verifying that status is not nil. Add a nil
check for the status parameter at the beginning of the Check method before any
dereferencing occurs, and early return if status is nil, following the same
pattern used in the Check methods of operator_add.go and operator_move.go to
maintain consistency across similar operator implementations.
---
Nitpick comments:
In `@coordinator/controller_test.go`:
- Around line 744-751: The test for ResumeChangefeed is missing an assertion to
verify that resuming a changefeed clears any persisted runtime errors. In the
DoAndReturn closure of the BumpChangefeedEpoch mock expectation, add two
assertions after the existing require statements: one to verify that
options.UpdateError is true, and another to verify that options.Error is nil.
This ensures that the resume operation properly clears stale errors that may
have been persisted from previous failures.
- Line 36: The import statement for github.com/pingcap/ticdc/pkg/scheduler is
using an unnecessary alias pkgscheduler when there are no conflicting scheduler
imports in this file. Remove the pkgscheduler alias prefix from the import and
use the default scheduler import name instead, following the coding guideline
that aliases should only be used to resolve package name conflicts or follow
existing local conventions.
In `@coordinator/coordinator_test.go`:
- Around line 291-298: The mock maintainer status in the
heartbeatpb.MaintainerStatus struct creation is not propagating the epoch from
the request, which allows tests to pass even if MaintainerEpoch is incorrectly
set. Add a field assignment to the MaintainerStatus struct that copies the epoch
value from req.MaintainerEpoch to ensure the mock properly reflects the request
epoch and prevents regressions in epoch handling.
In `@coordinator/operator/operator_controller_test.go`:
- Line 91: The test function names in the file use underscores which violates
the repository's Go naming convention requiring camelCase without underscores.
Rename all test functions to remove underscores: change
TestController_StopChangefeedWithMaintainerEpoch to
TestControllerStopChangefeedWithMaintainerEpoch, and apply the same camelCase
formatting to all other affected test functions at the specified locations.
Replace all underscores between words with direct concatenation while
maintaining proper camelCase capitalization.
In `@coordinator/operator/operator_stop_test.go`:
- Around line 41-82: Add a new focused test function (e.g.,
TestStopChangefeedOperator_Check) that directly tests the epoch gating behavior
of the StopChangefeedOperator. This test should create instances of
StopChangefeedOperator with different MaintainerEpoch values, invoke the Check
method, and verify two scenarios: first, that when the operator's
MaintainerEpoch does not match the changefeed's epoch, the finished flag remains
false, and second, that when the epochs match, the Check method properly
completes the operation and sets the finished flag to true. This ensures
deterministic testing of the core epoch gating logic as referenced in the
testing guidelines.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: aab53221-9828-4d4f-b61b-96ea9539967f
⛔ Files ignored due to path filters (1)
heartbeatpb/heartbeat.pb.gois excluded by!**/*.pb.go
📒 Files selected for processing (28)
coordinator/changefeed/changefeed.gocoordinator/changefeed/changefeed_db_backend.gocoordinator/changefeed/changefeed_test.gocoordinator/changefeed/etcd_backend.gocoordinator/changefeed/etcd_backend_test.gocoordinator/changefeed/mock/changefeed_db_backend.gocoordinator/controller.gocoordinator/controller_drain_test.gocoordinator/controller_test.gocoordinator/coordinator.gocoordinator/coordinator_test.gocoordinator/create_changefeed_gc_test.gocoordinator/operator/operator_add.gocoordinator/operator/operator_add_test.gocoordinator/operator/operator_controller.gocoordinator/operator/operator_controller_test.gocoordinator/operator/operator_move.gocoordinator/operator/operator_move_test.gocoordinator/operator/operator_stop.gocoordinator/operator/operator_stop_test.gocoordinator/scheduler/balance_test.gocoordinator/scheduler/basic_test.gocoordinator/scheduler/drain_test.goheartbeatpb/heartbeat.protopkg/common/format.gopkg/common/maintainer_epoch.gopkg/pdutil/utils.gopkg/pdutil/utils_test.go
| _ = c.messageCenter.SendCommand(changefeed.RemoveMaintainerMessage( | ||
| common.DefaultKeyspaceID, | ||
| cfID, | ||
| from, | ||
| true, | ||
| true, | ||
| status.MaintainerEpoch, | ||
| )) |
There was a problem hiding this comment.
Retry or at least surface failed stale-maintainer removals.
These direct remove commands are the only cleanup path in the no-local-changefeed and operator-slot-occupied branches. If SendCommand fails, the stale maintainer can keep heartbeating while future reports are only ignored; log the failure and route it into a retried cleanup path instead of discarding the error.
Also applies to: 760-767, 847-854
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@coordinator/controller.go` around lines 616 - 623, The
RemoveMaintainerMessage command sent via c.messageCenter.SendCommand is
currently discarding any error returned (using underscore assignment), which
means stale maintainers will not be cleaned up if the command fails. Replace the
underscore assignment with proper error handling by capturing the error returned
from SendCommand, logging the failure with appropriate context, and routing the
failure into a retry mechanism for cleanup instead of silently ignoring it.
Apply this same fix to all occurrences at the mentioned line ranges (760-767 and
847-854) where similar RemoveMaintainerMessage commands are being sent.
| func (m *StopChangefeedOperator) Check(from node.ID, status *heartbeatpb.MaintainerStatus) { | ||
| if !m.finished.Load() && | ||
| from == m.nodeID && | ||
| common.MaintainerEpochMatches(status.MaintainerEpoch, m.maintainerEpoch) && | ||
| status.State != heartbeatpb.ComponentState_Working { |
There was a problem hiding this comment.
Guard against nil maintainer status in Check.
Line 66 dereferences status without a nil check. coordinator/operator/operator_add.go (Line 68) and coordinator/operator/operator_move.go (Line 76) both early-return on nil, so this path can panic if the same callback contract passes a nil status.
💡 Suggested fix
func (m *StopChangefeedOperator) Check(from node.ID, status *heartbeatpb.MaintainerStatus) {
+ if status == nil {
+ return
+ }
if !m.finished.Load() &&
from == m.nodeID &&
common.MaintainerEpochMatches(status.MaintainerEpoch, m.maintainerEpoch) &&
status.State != heartbeatpb.ComponentState_Working {🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@coordinator/operator/operator_stop.go` around lines 63 - 67, The Check method
in the StopChangefeedOperator class dereferences the status parameter at line 66
when accessing status.MaintainerEpoch without first verifying that status is not
nil. Add a nil check for the status parameter at the beginning of the Check
method before any dereferencing occurs, and early return if status is nil,
following the same pattern used in the Check methods of operator_add.go and
operator_move.go to maintain consistency across similar operator
implementations.
| } | ||
|
|
||
| // AdvanceChangefeedEpoch returns max(candidate, current+1). | ||
| func AdvanceChangefeedEpoch(candidate, current uint64) (uint64, error) { |
There was a problem hiding this comment.
Changefeed epoch should not be a part of the pd util?
There was a problem hiding this comment.
Thanks, I agree that the non-PD part should not live in pdutil. I moved AdvanceChangefeedEpoch to pkg/common because it only compares persisted/candidate epochs and has no PD dependency. GenerateChangefeedEpoch stays in pdutil since it is specifically backed by PD TSO generation.
| if candidate > current { | ||
| return candidate, nil | ||
| } | ||
| if current == ^uint64(0) { |
There was a problem hiding this comment.
Will this really happen ?
There was a problem hiding this comment.
It should not happen in normal operation because the candidate epoch is based on PD TSO/local timestamp. I kept the guard as a defensive invariant check because wrapping MaxUint64 back to 0 would make an old owner look newer or compatible again. The updated comment now calls this out as defensive rather than expected behavior.
| FeedState: status.FeedState, | ||
| State: status.State, | ||
| MaintainerEpoch: status.MaintainerEpoch, | ||
| // Old errors are meaningless for resume and can only block the resumed task. |
There was a problem hiding this comment.
This comment lack of context and misleading.
|
|
||
| func (c *Changefeed) NewAddMaintainerMessage(server node.ID) *messaging.TargetMessage { | ||
| info := c.GetInfo() | ||
| if info == nil { |
There was a problem hiding this comment.
info is nil should be checked when new the changefeed ?
| // SetChangefeedProgress persists the operation progress status to db for a changefeed | ||
| SetChangefeedProgress(ctx context.Context, id common.ChangeFeedID, progress config.Progress) error | ||
| // ResumeChangefeed persists the resumed status to db for a changefeed and returns the resumed info. | ||
| ResumeChangefeed(ctx context.Context, id common.ChangeFeedID, newCheckpointTs uint64) (*config.ChangeFeedInfo, error) |
There was a problem hiding this comment.
Why remove this method ?
| return nil | ||
| } | ||
|
|
||
| // BumpChangefeedEpoch atomically persists a strictly newer ownership epoch. |
There was a problem hiding this comment.
It looks the resume chanegefeed method is replaced by this one? Looks not ideal, bump changefeed epoch should be a internal operation of the maintainer.
There was a problem hiding this comment.
Agreed that the resume path should not be expressed as a raw bump at the controller call site. I restored Backend.ResumeChangefeed and made it own the resume-specific transition: StateNormal, ProgressNone, checkpoint update, error clear, and epoch bump in one backend operation. I kept BumpChangefeedEpoch as a lower-level backend boundary because other coordinator-driven ownership changes, such as add/move/retry, also must persist a fresh epoch before creating or moving a maintainer owner. So resume is semantic again, while the bump helper remains the shared metadata fence for non-resume owner-change paths.
[LGTM Timeline notifier]Timeline:
|
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: 3AceShowHand, asddongmen, lidezhu The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/retest |
|
/test all |
What problem does this PR solve?
Issue Number: ref #5083
This is PR 1 of 3 split from #5182.
Background:
During maintainer failover, the coordinator can schedule a new maintainer while
delayed requests from the previous maintainer are still in flight. Receiver-side
fencing only works if every new ownership attempt has a persisted, monotonic
maintainer epoch before any add or move request is sent.
Motivation:
The old code generated owner epochs from in-memory state and resumed changefeeds
with a normal metadata write. That made it hard to prove that a new maintainer
owner is always ordered after the previously persisted owner, especially across
coordinator failover, resume, retry, and move paths.
What is changed and how it works?
This PR adds the coordinator-side epoch persistence foundation:
heartbeat, and close messages.
BumpChangefeedEpochto atomically persist a strictly newer changefeedepoch, optionally with status updates under the same etcd transaction.
retry scheduling can create a new maintainer owner.
ChangeFeedInfo.reported owners with their original epoch.
fenced by the epoch they actually own.
Stack:
Check List
Tests
Questions
Will it cause performance regression or break compatibility?
No expected performance regression. Epoch bumps happen on ownership-changing
control paths such as add, move, resume, and retry scheduling, not in the event
write path.
The protobuf fields are optional. Epoch 0 remains the compatibility value used
by old components until later PRs enforce strict receiver-side fences.
Do you need to update user documentation, design documentation or monitoring documentation?
No.
Release note
Validation
make generate-protobufmake fmtgo test ./coordinator/changefeed ./coordinator ./coordinator/operator ./pkg/pdutilSummary by CodeRabbit