Skip to content

ISSUE-161: implement reconfiguration for raft#167

Open
DongbinCheng wants to merge 2 commits intoeBay:masterfrom
DongbinCheng:ISSUE-161
Open

ISSUE-161: implement reconfiguration for raft#167
DongbinCheng wants to merge 2 commits intoeBay:masterfrom
DongbinCheng:ISSUE-161

Conversation

@DongbinCheng
Copy link
Collaborator

No description provided.

@DongbinCheng DongbinCheng force-pushed the ISSUE-161 branch 3 times, most recently from 8076a7f to c15a64c Compare February 12, 2026 08:29
@DongbinCheng DongbinCheng requested a review from Copilot February 12, 2026 09:19
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Implements Raft cluster reconfiguration support by introducing a control-plane reconfigure RPC, persisting cluster configuration in control state, and updating Raft/forwarding components to apply membership/role changes at runtime.

Changes:

  • Add ctrl.proto control service (split + reconfigure) and persist ClusterConfiguration/version in CtrlState.
  • Extend ClusterInfo to carry initial roles, joint-consensus member sets, in-sync learners, and proto sync helpers.
  • Update Raft v2 to support reconfiguration events (membership double-buffering, new request/event plumbing) and adjust tests/mocks accordingly.

Reviewed changes

Copilot reviewed 41 out of 41 changed files in this pull request and generated 11 comments.

Show a summary per file
File Description
test/infra/util/ClusterInfoTest.cpp Updates tests for new resolveAllClusters return tuple/params.
test/infra/raft/v2/RaftCoreTest.cpp Adds follower id into AE response metrics for updated logging/metrics.
test/infra/raft/v2/ClusterTestUtil.cpp Adapts test utility to new cluster members storage.
test/infra/es/store/RaftCommandEventStoreTest.cpp Extends Raft interface mock with role query methods.
src/infra/util/ClusterInfo.h Adds roles/joint members/learners + proto sync + new resolve API.
src/infra/util/ClusterInfo.cpp Implements learners parsing, config versioning, proto sync, and self-id deduction.
src/infra/raft/v2/RaftService.h Adds reconfigure event type; fixes method typo.
src/infra/raft/v2/RaftService.cpp Renames refresh method; fixes grpc error counter increment.
src/infra/raft/v2/RaftCore.h Adds membership double-buffer model, role queries, reconfigure request, majority helpers.
src/infra/raft/v2/RaftCore.cpp Implements reconfigure signal handling + configuration switching logic.
src/infra/raft/RaftSignal.h Adds ReconfigureSignal for propagating cluster config changes.
src/infra/raft/RaftInterface.h Extends interface with role queries and reconfigure queueing + new role.
src/infra/raft/DoubleBuffer.h Introduces double-buffer helper for swapping membership snapshots.
src/infra/forward/ForwardSignal.h Adds forward-core reconfigure signal.
src/infra/forward/ForwardCore.h Makes forward core updatable on reconfiguration; adds locking + TLS/DNS ownership.
src/infra/forward/ForwardClient.h Minor member ordering + metadata comment.
src/infra/CMakeLists.txt Updates infra util deps/links for new proto libraries.
src/app_util/protos/scale.proto Removes old scale proto in favor of ctrl proto.
src/app_util/protos/route.proto Adds cluster configuration + raft roles to shared route/control state proto.
src/app_util/protos/ctrl.proto Adds new control-plane service and reconfigure RPC/messages.
src/app_util/control/split/SplitEvent.h Switches include to ctrl proto.
src/app_util/control/split/SplitCommand.h Switches include to ctrl proto.
src/app_util/control/split/SplitCommand.cpp Uses renamed hasSplitState() API.
src/app_util/control/split/ScaleReceiver.h Removes obsolete receiver in favor of multi-receiver.
src/app_util/control/reconfigure/ReconfigureEvent.h Introduces reconfigure event persisted in ctrl log.
src/app_util/control/reconfigure/ReconfigureEvent.cpp Applies persisted config + emits raft/forward reconfigure signals.
src/app_util/control/reconfigure/ReconfigureCommand.h Adds reconfigure command type + request parsing.
src/app_util/control/reconfigure/ReconfigureCommand.cpp Validates reconfigure requests and builds events.
src/app_util/control/CtrlState.h Adds persisted cluster configuration/version fields + renamed split state predicate.
src/app_util/control/CtrlState.cpp Serializes/deserializes cluster configuration and enhances pretty print.
src/app_util/control/CtrlReceiver.h Adds multi-method gRPC receiver for split + reconfigure.
src/app_util/control/CtrlReceiver.cpp Implements reconfigure call handler.
src/app_util/control/CtrlDecoder.h Adds decoding support for reconfigure command/event.
src/app_util/control/CtrlCommandEvent.h Adds onApplied() hook for post-persist notifications.
src/app_util/RequestCallData.h Switches to ctrl grpc generated header.
src/app_util/NetAdminServer.h Uses renamed hasSplitState() API.
src/app_util/MultiRequestReceiver.h Adds reusable async multi-method receiver base.
src/app_util/CMakeLists.txt Generates/links ctrl proto library instead of scale proto.
src/app_util/AppStateMachine.h Adds on-applied flow and state machine type marker.
src/app_util/AppInfo.h Adds cluster config versioning, locking, ctrl port accessor, init role mapping.
src/app_util/AppInfo.cpp Initializes AppInfo with cluster version/state-provided config.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +33 to +40
std::shared_ptr<const T> constCurrent() const {
return mBuffer[mCurrentLoc];
}

std::shared_ptr<T> getCurrent() {
return mBuffer[mCurrentLoc];
}

Copy link

Copilot AI Feb 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mCurrentLoc is read/written without synchronization. In RaftCore, constCurrent() is used from gRPC handler threads while swap() (and getCurrent() mutations) can occur in the raft thread/signal handling, which creates a data race. Make mCurrentLoc atomic (at minimum) and ensure readers never observe concurrent mutation of the same buffer (e.g., mutate only bypass buffer + swap under a mutex / RW lock, and avoid exposing mutable access to the current buffer across threads).

Copilot uses AI. Check for mistakes.
Comment on lines +53 to +56
void swap() {
mCurrentLoc = 1 - mCurrentLoc;
SPDLOG_INFO("Swap DoubleBuffer, new location: {}", mCurrentLoc);
}
Copy link

Copilot AI Feb 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mCurrentLoc is read/written without synchronization. In RaftCore, constCurrent() is used from gRPC handler threads while swap() (and getCurrent() mutations) can occur in the raft thread/signal handling, which creates a data race. Make mCurrentLoc atomic (at minimum) and ensure readers never observe concurrent mutation of the same buffer (e.g., mutate only bypass buffer + swap under a mutex / RW lock, and avoid exposing mutable access to the current buffer across threads).

Copilot uses AI. Check for mistakes.
Comment on lines 151 to 160
if (selfId == nodeId) {
mSelfInfo.mId = selfId;
mSelfInfo.mAddress = addr;
mAddressForRaftSvc = "0.0.0.0:" + port;
members->mSelfInfo.mId = selfId;
members->mSelfInfo.mAddress = "0.0.0.0:" + port;
mStreamingPort = node.mPortForStream;
} else {
Peer peer;
peer.mId = nodeId;
peer.mAddress = addr;
mPeers[nodeId] = peer;
members->mPeers[nodeId] = peer;
}
Copy link

Copilot AI Feb 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mSelfInfo.mAddress is being set to the bind address (0.0.0.0:port) rather than the advertised peer address (hostname:port). This leaks into getClusterMembers() and other surfaces and makes the self member address inconsistent/unroutable to other nodes/tools. Keep a separate bind address for RaftServer, but store mSelfInfo.mAddress as the advertised addr (e.g., host:port).

Copilot uses AI. Check for mistakes.
Comment on lines +189 to +199
return members->mInitialRoles.at(id) == RaftRole::Learner;
}

bool isPreFollower(uint64_t id) const override {
auto members = mClusterMembers->constCurrent();
return members->mInitialRoles.at(id) == RaftRole::PreFollower;
}

bool isVoter(uint64_t id) const override {
auto members = mClusterMembers->constCurrent();
return members->mInitialRoles.at(id) == RaftRole::Follower;
Copy link

Copilot AI Feb 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These role helpers use map::at(), which will throw if id is not present in mInitialRoles. Several call sites pass remote IDs (e.g., request.leader_id(), candidate_id()) that can be absent during/after reconfiguration, causing a crash. Use find() and treat unknown IDs as non-voters / not-learner / not-prefollower (or handle explicitly as "unknown").

Suggested change
return members->mInitialRoles.at(id) == RaftRole::Learner;
}
bool isPreFollower(uint64_t id) const override {
auto members = mClusterMembers->constCurrent();
return members->mInitialRoles.at(id) == RaftRole::PreFollower;
}
bool isVoter(uint64_t id) const override {
auto members = mClusterMembers->constCurrent();
return members->mInitialRoles.at(id) == RaftRole::Follower;
auto it = members->mInitialRoles.find(id);
if (it == members->mInitialRoles.end()) {
// Unknown member ID: treat as not a learner.
return false;
}
return it->second == RaftRole::Learner;
}
bool isPreFollower(uint64_t id) const override {
auto members = mClusterMembers->constCurrent();
auto it = members->mInitialRoles.find(id);
if (it == members->mInitialRoles.end()) {
// Unknown member ID: treat as not a pre-follower.
return false;
}
return it->second == RaftRole::PreFollower;
}
bool isVoter(uint64_t id) const override {
auto members = mClusterMembers->constCurrent();
auto it = members->mInitialRoles.find(id);
if (it == members->mInitialRoles.end()) {
// Unknown member ID: treat as not a voter.
return false;
}
return it->second == RaftRole::Follower;

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant