ISSUE-161: implement reconfiguration for raft#167
ISSUE-161: implement reconfiguration for raft#167DongbinCheng wants to merge 2 commits intoeBay:masterfrom
Conversation
8076a7f to
c15a64c
Compare
There was a problem hiding this comment.
Pull request overview
Implements Raft cluster reconfiguration support by introducing a control-plane reconfigure RPC, persisting cluster configuration in control state, and updating Raft/forwarding components to apply membership/role changes at runtime.
Changes:
- Add
ctrl.protocontrol service (split + reconfigure) and persistClusterConfiguration/version inCtrlState. - Extend
ClusterInfoto carry initial roles, joint-consensus member sets, in-sync learners, and proto sync helpers. - Update Raft v2 to support reconfiguration events (membership double-buffering, new request/event plumbing) and adjust tests/mocks accordingly.
Reviewed changes
Copilot reviewed 41 out of 41 changed files in this pull request and generated 11 comments.
Show a summary per file
| File | Description |
|---|---|
| test/infra/util/ClusterInfoTest.cpp | Updates tests for new resolveAllClusters return tuple/params. |
| test/infra/raft/v2/RaftCoreTest.cpp | Adds follower id into AE response metrics for updated logging/metrics. |
| test/infra/raft/v2/ClusterTestUtil.cpp | Adapts test utility to new cluster members storage. |
| test/infra/es/store/RaftCommandEventStoreTest.cpp | Extends Raft interface mock with role query methods. |
| src/infra/util/ClusterInfo.h | Adds roles/joint members/learners + proto sync + new resolve API. |
| src/infra/util/ClusterInfo.cpp | Implements learners parsing, config versioning, proto sync, and self-id deduction. |
| src/infra/raft/v2/RaftService.h | Adds reconfigure event type; fixes method typo. |
| src/infra/raft/v2/RaftService.cpp | Renames refresh method; fixes grpc error counter increment. |
| src/infra/raft/v2/RaftCore.h | Adds membership double-buffer model, role queries, reconfigure request, majority helpers. |
| src/infra/raft/v2/RaftCore.cpp | Implements reconfigure signal handling + configuration switching logic. |
| src/infra/raft/RaftSignal.h | Adds ReconfigureSignal for propagating cluster config changes. |
| src/infra/raft/RaftInterface.h | Extends interface with role queries and reconfigure queueing + new role. |
| src/infra/raft/DoubleBuffer.h | Introduces double-buffer helper for swapping membership snapshots. |
| src/infra/forward/ForwardSignal.h | Adds forward-core reconfigure signal. |
| src/infra/forward/ForwardCore.h | Makes forward core updatable on reconfiguration; adds locking + TLS/DNS ownership. |
| src/infra/forward/ForwardClient.h | Minor member ordering + metadata comment. |
| src/infra/CMakeLists.txt | Updates infra util deps/links for new proto libraries. |
| src/app_util/protos/scale.proto | Removes old scale proto in favor of ctrl proto. |
| src/app_util/protos/route.proto | Adds cluster configuration + raft roles to shared route/control state proto. |
| src/app_util/protos/ctrl.proto | Adds new control-plane service and reconfigure RPC/messages. |
| src/app_util/control/split/SplitEvent.h | Switches include to ctrl proto. |
| src/app_util/control/split/SplitCommand.h | Switches include to ctrl proto. |
| src/app_util/control/split/SplitCommand.cpp | Uses renamed hasSplitState() API. |
| src/app_util/control/split/ScaleReceiver.h | Removes obsolete receiver in favor of multi-receiver. |
| src/app_util/control/reconfigure/ReconfigureEvent.h | Introduces reconfigure event persisted in ctrl log. |
| src/app_util/control/reconfigure/ReconfigureEvent.cpp | Applies persisted config + emits raft/forward reconfigure signals. |
| src/app_util/control/reconfigure/ReconfigureCommand.h | Adds reconfigure command type + request parsing. |
| src/app_util/control/reconfigure/ReconfigureCommand.cpp | Validates reconfigure requests and builds events. |
| src/app_util/control/CtrlState.h | Adds persisted cluster configuration/version fields + renamed split state predicate. |
| src/app_util/control/CtrlState.cpp | Serializes/deserializes cluster configuration and enhances pretty print. |
| src/app_util/control/CtrlReceiver.h | Adds multi-method gRPC receiver for split + reconfigure. |
| src/app_util/control/CtrlReceiver.cpp | Implements reconfigure call handler. |
| src/app_util/control/CtrlDecoder.h | Adds decoding support for reconfigure command/event. |
| src/app_util/control/CtrlCommandEvent.h | Adds onApplied() hook for post-persist notifications. |
| src/app_util/RequestCallData.h | Switches to ctrl grpc generated header. |
| src/app_util/NetAdminServer.h | Uses renamed hasSplitState() API. |
| src/app_util/MultiRequestReceiver.h | Adds reusable async multi-method receiver base. |
| src/app_util/CMakeLists.txt | Generates/links ctrl proto library instead of scale proto. |
| src/app_util/AppStateMachine.h | Adds on-applied flow and state machine type marker. |
| src/app_util/AppInfo.h | Adds cluster config versioning, locking, ctrl port accessor, init role mapping. |
| src/app_util/AppInfo.cpp | Initializes AppInfo with cluster version/state-provided config. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| std::shared_ptr<const T> constCurrent() const { | ||
| return mBuffer[mCurrentLoc]; | ||
| } | ||
|
|
||
| std::shared_ptr<T> getCurrent() { | ||
| return mBuffer[mCurrentLoc]; | ||
| } | ||
|
|
There was a problem hiding this comment.
mCurrentLoc is read/written without synchronization. In RaftCore, constCurrent() is used from gRPC handler threads while swap() (and getCurrent() mutations) can occur in the raft thread/signal handling, which creates a data race. Make mCurrentLoc atomic (at minimum) and ensure readers never observe concurrent mutation of the same buffer (e.g., mutate only bypass buffer + swap under a mutex / RW lock, and avoid exposing mutable access to the current buffer across threads).
| void swap() { | ||
| mCurrentLoc = 1 - mCurrentLoc; | ||
| SPDLOG_INFO("Swap DoubleBuffer, new location: {}", mCurrentLoc); | ||
| } |
There was a problem hiding this comment.
mCurrentLoc is read/written without synchronization. In RaftCore, constCurrent() is used from gRPC handler threads while swap() (and getCurrent() mutations) can occur in the raft thread/signal handling, which creates a data race. Make mCurrentLoc atomic (at minimum) and ensure readers never observe concurrent mutation of the same buffer (e.g., mutate only bypass buffer + swap under a mutex / RW lock, and avoid exposing mutable access to the current buffer across threads).
| if (selfId == nodeId) { | ||
| mSelfInfo.mId = selfId; | ||
| mSelfInfo.mAddress = addr; | ||
| mAddressForRaftSvc = "0.0.0.0:" + port; | ||
| members->mSelfInfo.mId = selfId; | ||
| members->mSelfInfo.mAddress = "0.0.0.0:" + port; | ||
| mStreamingPort = node.mPortForStream; | ||
| } else { | ||
| Peer peer; | ||
| peer.mId = nodeId; | ||
| peer.mAddress = addr; | ||
| mPeers[nodeId] = peer; | ||
| members->mPeers[nodeId] = peer; | ||
| } |
There was a problem hiding this comment.
mSelfInfo.mAddress is being set to the bind address (0.0.0.0:port) rather than the advertised peer address (hostname:port). This leaks into getClusterMembers() and other surfaces and makes the self member address inconsistent/unroutable to other nodes/tools. Keep a separate bind address for RaftServer, but store mSelfInfo.mAddress as the advertised addr (e.g., host:port).
| return members->mInitialRoles.at(id) == RaftRole::Learner; | ||
| } | ||
|
|
||
| bool isPreFollower(uint64_t id) const override { | ||
| auto members = mClusterMembers->constCurrent(); | ||
| return members->mInitialRoles.at(id) == RaftRole::PreFollower; | ||
| } | ||
|
|
||
| bool isVoter(uint64_t id) const override { | ||
| auto members = mClusterMembers->constCurrent(); | ||
| return members->mInitialRoles.at(id) == RaftRole::Follower; |
There was a problem hiding this comment.
These role helpers use map::at(), which will throw if id is not present in mInitialRoles. Several call sites pass remote IDs (e.g., request.leader_id(), candidate_id()) that can be absent during/after reconfiguration, causing a crash. Use find() and treat unknown IDs as non-voters / not-learner / not-prefollower (or handle explicitly as "unknown").
| return members->mInitialRoles.at(id) == RaftRole::Learner; | |
| } | |
| bool isPreFollower(uint64_t id) const override { | |
| auto members = mClusterMembers->constCurrent(); | |
| return members->mInitialRoles.at(id) == RaftRole::PreFollower; | |
| } | |
| bool isVoter(uint64_t id) const override { | |
| auto members = mClusterMembers->constCurrent(); | |
| return members->mInitialRoles.at(id) == RaftRole::Follower; | |
| auto it = members->mInitialRoles.find(id); | |
| if (it == members->mInitialRoles.end()) { | |
| // Unknown member ID: treat as not a learner. | |
| return false; | |
| } | |
| return it->second == RaftRole::Learner; | |
| } | |
| bool isPreFollower(uint64_t id) const override { | |
| auto members = mClusterMembers->constCurrent(); | |
| auto it = members->mInitialRoles.find(id); | |
| if (it == members->mInitialRoles.end()) { | |
| // Unknown member ID: treat as not a pre-follower. | |
| return false; | |
| } | |
| return it->second == RaftRole::PreFollower; | |
| } | |
| bool isVoter(uint64_t id) const override { | |
| auto members = mClusterMembers->constCurrent(); | |
| auto it = members->mInitialRoles.find(id); | |
| if (it == members->mInitialRoles.end()) { | |
| // Unknown member ID: treat as not a voter. | |
| return false; | |
| } | |
| return it->second == RaftRole::Follower; |
No description provided.