
[Feature] Coordinator Server Supports coordinator epoch protect #2781

Merged
wuchong merged 2 commits into apache:main from zcoo:20260302_coordinator_ha_epoch
Apr 9, 2026

Conversation

Contributor

@zcoo zcoo commented Mar 3, 2026

Purpose

Linked issue: close #2778

This is part 2 of the coordinator high-availability work, focusing on the "coordinator epoch" logic.
It will be marked ready for review once part 1 is finished and merged.

Brief change log

see https://cwiki.apache.org/confluence/display/FLUSS/FIP-9%3A+Support+CoordinatorServer+High+Availability

Tests

API and Format

Documentation

@zcoo zcoo marked this pull request as draft March 3, 2026 06:56
@zcoo zcoo changed the title from "[server] Coordinator Server Supports High-Available" to "[Feature] Coordinator Server Supports High-Available" Mar 3, 2026
@zcoo zcoo changed the title from "[Feature] Coordinator Server Supports High-Available" to "[Feature] Coordinator Server Supports coordinator epoch protect" Mar 4, 2026
Contributor

@swuferhong swuferhong left a comment

@zcoo Thanks for the great contributions. I left some comments.

@zcoo zcoo force-pushed the 20260302_coordinator_ha_epoch branch from 8fa3cf7 to bc8713c April 1, 2026 12:17
@zcoo zcoo marked this pull request as ready for review April 1, 2026 12:18
@zcoo zcoo force-pushed the 20260302_coordinator_ha_epoch branch 2 times, most recently from e8dff9d to c41673e April 1, 2026 13:07
@wuchong wuchong requested a review from Copilot April 4, 2026 03:25
Contributor

Copilot AI left a comment

Pull request overview

This PR implements “coordinator epoch” fencing to protect ZooKeeper metadata writes and tablet-server RPCs during coordinator leader changes (HA), aligning with the epoch-protection portion of issue #2778 / FIP-9.

Changes:

  • Add a persistent /coordinators/epoch znode and leader-side epoch bumping (“fencing”) on leadership acquisition.
  • Wrap key ZK metadata mutations (table assignment, leader/isr) with an epoch znode version check via Curator transactions.
  • Propagate coordinator epoch through UpdateMetadataRequest, and update unit/integration tests accordingly.
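The epoch-version check in the list above can be pictured with a small in-memory model. This is only a sketch under stated assumptions: `FakeZk`, `checkAndSet`, and `EpochFencingDemo` are hypothetical names standing in for a ZooKeeper multi-op transaction, and none of them are the PR's classes or Curator APIs. The `/tables/t1/assignment` path is likewise illustrative.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

/** In-memory stand-in for ZooKeeper (hypothetical; not the Fluss or Curator API). */
final class FakeZk {
    private final Map<String, byte[]> data = new HashMap<>();
    private final Map<String, Integer> versions = new HashMap<>();

    /** Creates a node at version 0 if it does not exist yet. */
    synchronized void createIfAbsent(String path, byte[] value) {
        if (!data.containsKey(path)) {
            data.put(path, value);
            versions.put(path, 0);
        }
    }

    synchronized int version(String path) {
        return versions.get(path);
    }

    synchronized byte[] read(String path) {
        return data.get(path);
    }

    /**
     * Atomically checks that guardPath is still at expectedVersion and, only then,
     * writes targetPath. Returns false and writes nothing on a version mismatch,
     * mimicking a failed check-plus-write transaction.
     */
    synchronized boolean checkAndSet(
            String guardPath, int expectedVersion, String targetPath, byte[] value) {
        Integer guardVersion = versions.get(guardPath);
        if (guardVersion == null || guardVersion != expectedVersion) {
            return false;
        }
        Integer targetVersion = versions.get(targetPath);
        data.put(targetPath, value);
        versions.put(targetPath, targetVersion == null ? 0 : targetVersion + 1);
        return true;
    }
}

final class EpochFencingDemo {
    static final String EPOCH_PATH = "/coordinators/epoch";

    /** Bumps the epoch; returns the new epoch-znode version, or -1 if the bump lost a race. */
    static int becomeLeader(FakeZk zk) {
        int version = zk.version(EPOCH_PATH);
        int epoch = Integer.parseInt(new String(zk.read(EPOCH_PATH), StandardCharsets.UTF_8));
        byte[] next = Integer.toString(epoch + 1).getBytes(StandardCharsets.UTF_8);
        return zk.checkAndSet(EPOCH_PATH, version, EPOCH_PATH, next) ? version + 1 : -1;
    }

    public static void main(String[] args) {
        FakeZk zk = new FakeZk();
        zk.createIfAbsent(EPOCH_PATH, "0".getBytes(StandardCharsets.UTF_8));
        int versionA = becomeLeader(zk); // coordinator A wins first
        int versionB = becomeLeader(zk); // coordinator B takes over and bumps the epoch again
        byte[] payload = "assignment".getBytes(StandardCharsets.UTF_8);
        boolean stale = zk.checkAndSet(EPOCH_PATH, versionA, "/tables/t1/assignment", payload);
        boolean fresh = zk.checkAndSet(EPOCH_PATH, versionB, "/tables/t1/assignment", payload);
        System.out.println("stale coordinator write accepted: " + stale);
        System.out.println("current coordinator write accepted: " + fresh);
    }
}
```

The point of the design is that a deposed coordinator needs no notification to be stopped: its cached epoch-znode version no longer matches, so every guarded write fails on its own.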

Reviewed changes

Copilot reviewed 31 out of 31 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| fluss-server/src/test/java/org/apache/fluss/server/zk/ZooKeeperClientTest.java | Updates tests to use expected ZK versions and coordinatorEpoch in LeaderAndIsr. |
| fluss-server/src/test/java/org/apache/fluss/server/testutils/FlussClusterExtension.java | Resets tablet server replica-manager coordinator epoch between tests. |
| fluss-server/src/test/java/org/apache/fluss/server/tablet/TabletServiceITCase.java | Updates makeUpdateMetadataRequest calls to include the new epoch parameter. |
| fluss-server/src/test/java/org/apache/fluss/server/metadata/ZkBasedMetadataProviderTest.java | Updates LeaderAndIsr registration calls to pass expected ZK version. |
| fluss-server/src/test/java/org/apache/fluss/server/coordinator/TestCoordinatorContext.java | Adds a test-only CoordinatorContext that bypasses epoch-version checks. |
| fluss-server/src/test/java/org/apache/fluss/server/coordinator/TableManagerTest.java | Uses TestCoordinatorContext + passes expected versions to ZK writes. |
| fluss-server/src/test/java/org/apache/fluss/server/coordinator/TableManagerITCase.java | Updates makeUpdateMetadataRequest calls to include the new epoch parameter. |
| fluss-server/src/test/java/org/apache/fluss/server/coordinator/statemachine/TableBucketStateMachineTest.java | Uses TestCoordinatorContext and passes expected versions to ZK writes. |
| fluss-server/src/test/java/org/apache/fluss/server/coordinator/statemachine/ReplicaStateMachineTest.java | Uses TestCoordinatorContext and passes expected versions to ZK writes. |
| fluss-server/src/test/java/org/apache/fluss/server/coordinator/CoordinatorServerITCase.java | Minor formatting cleanup in test config. |
| fluss-server/src/test/java/org/apache/fluss/server/coordinator/CoordinatorServerElectionTest.java | Asserts coordinator epoch increments across leader transitions. |
| fluss-server/src/test/java/org/apache/fluss/server/coordinator/CoordinatorEventProcessorTest.java | Uses TestCoordinatorContext in event-processor test setup. |
| fluss-server/src/test/java/org/apache/fluss/server/coordinator/CoordinatorChannelManagerTest.java | Updates makeUpdateMetadataRequest usage with the new signature. |
| fluss-server/src/main/java/org/apache/fluss/server/zk/ZooKeeperOp.java | Adds helpers to build Curator transaction ops (check/create/update/delete). |
| fluss-server/src/main/java/org/apache/fluss/server/zk/ZooKeeperClient.java | Implements epoch znode handling + expected-version checks for key metadata mutations. |
| fluss-server/src/main/java/org/apache/fluss/server/zk/ZkEpoch.java | Introduces a small value object for epoch + epoch-znode-version. |
| fluss-server/src/main/java/org/apache/fluss/server/zk/data/ZkVersion.java | Adds special version constants (match-any/unknown). |
| fluss-server/src/main/java/org/apache/fluss/server/zk/data/ZkData.java | Adds CoordinatorEpochZNode path + encode/decode helpers. |
| fluss-server/src/main/java/org/apache/fluss/server/utils/ServerRpcMessageUtils.java | Extends UpdateMetadataRequest builder to optionally include coordinatorEpoch. |
| fluss-server/src/main/java/org/apache/fluss/server/replica/ReplicaManager.java | Adds test-only reset hook for tablet-server coordinator epoch. |
| fluss-server/src/main/java/org/apache/fluss/server/metrics/group/CoordinatorMetricGroup.java | Ensures server_id metric variable is stringified. |
| fluss-server/src/main/java/org/apache/fluss/server/coordinator/statemachine/TableBucketStateMachine.java | Passes expected epoch-znode version into ZK LeaderAndIsr mutations. |
| fluss-server/src/main/java/org/apache/fluss/server/coordinator/statemachine/ReplicaStateMachine.java | Passes expected epoch-znode version into batch ZK LeaderAndIsr updates. |
| fluss-server/src/main/java/org/apache/fluss/server/coordinator/MetadataManager.java | Uses match-any ZK version for deletions / initial table-assignment registration. |
| fluss-server/src/main/java/org/apache/fluss/server/coordinator/CoordinatorServer.java | Wires CoordinatorContext into leader election construction. |
| fluss-server/src/main/java/org/apache/fluss/server/coordinator/CoordinatorRequestBatch.java | Includes coordinatorEpoch in UpdateMetadata requests sent to tablet servers. |
| fluss-server/src/main/java/org/apache/fluss/server/coordinator/CoordinatorLeaderElection.java | Adds leader fencing step that attempts to bump coordinator epoch in ZK. |
| fluss-server/src/main/java/org/apache/fluss/server/coordinator/CoordinatorEventProcessor.java | Passes expected epoch-znode version into ZK assignment/LeaderAndIsr writes. |
| fluss-server/src/main/java/org/apache/fluss/server/coordinator/CoordinatorContext.java | Tracks coordinatorEpoch and coordinatorEpochZkVersion. |
| fluss-common/src/main/java/org/apache/fluss/exception/CoordinatorEpochFencedException.java | Adds a runtime exception type for fencing failures. |
| fluss-common/src/main/java/org/apache/fluss/config/ConfigOptions.java | Adds coordinator.id configuration option. |
Comments suppressed due to low confidence (1)

fluss-server/src/main/java/org/apache/fluss/server/coordinator/CoordinatorLeaderElection.java:162

  • Leader fencing is effectively ignored: fenceBecomeCoordinatorLeader(serverId) can return Optional.empty() (BadVersion) but initLeaderServices.run() still executes and isLeader is set to true, which can allow a fenced coordinator to continue acting as leader. Consider treating Optional.empty() as a hard failure (e.g., throw a CoordinatorEpochFencedException / skip init and trigger leadership relinquish) and only marking isLeader true after fencing succeeds.
                                    try {
                                        // to avoid split-brain
                                        Optional<ZkEpoch> optionalEpoch =
                                                zkClient.fenceBecomeCoordinatorLeader(serverId);
                                        optionalEpoch.ifPresent(
                                                integer ->
                                                        coordinatorContext
                                                                .setCoordinatorEpochAndZkVersion(
                                                                        optionalEpoch
                                                                                .get()
                                                                                .getCoordinatorEpoch(),
                                                                        optionalEpoch
                                                                                .get()
                                                                                .getCoordinatorEpochZkVersion()));
                                        initLeaderServices.run();
                                    } catch (CoordinatorEpochFencedException e) {
                                        LOG.warn(
                                                "Coordinator server {} has been fenced and not become leader successfully.",
                                                serverId);
                                        throw e;
                                    } catch (Exception e) {
                                        LOG.error(
                                                "Failed to initialize leader services for server {}",
                                                serverId,
                                                e);
                                    }
                                });
                        // Set leader flag before init completes, so when zk found this leader, the
                        // leader can accept requests
                        isLeader.set(true);
                    }
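The review note above suggests treating an empty fencing result as a hard failure and raising the leader flag only after initialization. A minimal sketch of that ordering follows, assuming hypothetical names throughout: `FencedException` stands in for CoordinatorEpochFencedException, and `LeaderActivation` is not a class from the PR.

```java
import java.util.Optional;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical stand-in for CoordinatorEpochFencedException. */
final class FencedException extends RuntimeException {
    FencedException(String message) {
        super(message);
    }
}

/** Hypothetical helper showing the ordering: fence, then init, then mark as leader. */
final class LeaderActivation {
    private final AtomicBoolean isLeader = new AtomicBoolean(false);

    /** fencedEpoch is empty when the epoch bump lost to another coordinator. */
    void onLeadershipAcquired(Optional<Integer> fencedEpoch, Runnable initLeaderServices) {
        // An empty result is a hard failure: no init, no leader flag.
        int epoch = fencedEpoch.orElseThrow(
                () -> new FencedException("epoch bump failed; refusing to act as leader"));
        System.out.println("fenced at coordinator epoch " + epoch);
        initLeaderServices.run();
        // Only now may the server start accepting requests as leader.
        isLeader.set(true);
    }

    boolean isLeader() {
        return isLeader.get();
    }
}

final class LeaderActivationDemo {
    public static void main(String[] args) {
        LeaderActivation winner = new LeaderActivation();
        winner.onLeadershipAcquired(
                Optional.of(2), () -> System.out.println("leader services started"));
        System.out.println("winner isLeader = " + winner.isLeader());

        LeaderActivation loser = new LeaderActivation();
        try {
            loser.onLeadershipAcquired(
                    Optional.empty(), () -> System.out.println("must not run"));
        } catch (FencedException e) {
            System.out.println("loser fenced: " + e.getMessage());
        }
        System.out.println("loser isLeader = " + loser.isLeader());
    }
}
```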

public void createRecursiveWithEpochCheck(
        String path, byte[] data, int expectedZkVersion, boolean throwIfPathExists)
        throws Exception {
    CuratorOp createOp = zkOp.createOp(path, data, CreateMode.PERSISTENT);

Copilot AI Apr 4, 2026

createRecursiveWithEpochCheck recursively calls itself with data = null for parent paths, but ZooKeeperOp.createOp(...).forPath(path, data) typically expects a non-null byte array. This can cause NPE/IAE during parent creation. Consider using Curator's create builder that omits data for parent nodes (or pass an explicit empty byte array) and only attach data to the final target node.

Suggested change
- CuratorOp createOp = zkOp.createOp(path, data, CreateMode.PERSISTENT);
+ byte[] nodeData = data == null ? new byte[0] : data;
+ CuratorOp createOp = zkOp.createOp(path, nodeData, CreateMode.PERSISTENT);

Comment on lines +1921 to +1922
createRecursiveWithEpochCheck(
        parentPath, null, expectedZkVersion, throwIfPathExists);

Copilot AI Apr 4, 2026

createRecursiveWithEpochCheck propagates throwIfPathExists into the recursive parent creation call. If throwIfPathExists is true for the leaf node, this will also throw when parent paths already exist, which breaks the usual contract of recursive-create (only the target path should be subject to the existence check). Consider passing false when creating parent paths, and applying throwIfPathExists only to the final path creation attempt.

Suggested change
- createRecursiveWithEpochCheck(
-         parentPath, null, expectedZkVersion, throwIfPathExists);
+ createRecursiveWithEpochCheck(parentPath, null, expectedZkVersion, false);
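The recursive-create contract described in the two comments above (parents are created leniently with empty data, and only the target path is subject to the existence check) can be sketched against an in-memory map. This is an illustration only: `RecursiveCreateSketch` is a hypothetical class, it omits the epoch check entirely, and it walks parents up front rather than reacting to a missing-node error as a real ZooKeeper client would.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical in-memory model of recursive node creation (no ZooKeeper involved). */
final class RecursiveCreateSketch {
    private final Map<String, byte[]> nodes = new TreeMap<>();

    /** Creates path and any missing parents; throwIfPathExists applies to path only. */
    void createRecursive(String path, byte[] data, boolean throwIfPathExists) {
        if (nodes.containsKey(path)) {
            if (throwIfPathExists) {
                throw new IllegalStateException("Node already exists: " + path);
            }
            return;
        }
        int slash = path.lastIndexOf('/');
        if (slash > 0) {
            // Parents: never fail on existence, and use empty data instead of null.
            createRecursive(path.substring(0, slash), new byte[0], false);
        }
        nodes.put(path, data == null ? new byte[0] : data);
    }

    boolean exists(String path) {
        return nodes.containsKey(path);
    }

    byte[] get(String path) {
        return nodes.get(path);
    }

    public static void main(String[] args) {
        RecursiveCreateSketch zk = new RecursiveCreateSketch();
        zk.createRecursive("/a/b/c", new byte[] {1}, true);
        // Parents /a and /a/b already exist here, yet a strict sibling create still works.
        zk.createRecursive("/a/b/d", new byte[] {2}, true);
        System.out.println("/a/b exists: " + zk.exists("/a/b"));
        try {
            zk.createRecursive("/a/b/c", new byte[] {3}, true);
        } catch (IllegalStateException e) {
            System.out.println("rejected duplicate target: " + e.getMessage());
        }
    }
}
```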

Member

@wuchong wuchong left a comment

Thanks @zcoo for the contribution. I left some comments below. In addition, I think we should add a test case that covers the epoch evolution in the cluster.

Missing test coverage for coordinator epoch propagation to TabletServer after leader switch

There is no test that verifies the end-to-end flow:

  1. Coordinator A is leader with epoch=1, sends requests to TabletServer
  2. Coordinator A loses leadership
  3. Coordinator B becomes leader with epoch=2
  4. TabletServer accepts requests from B (epoch=2) and rejects stale requests from A (epoch=1)

Existing tests cover individual pieces (ZK epoch increment in CoordinatorServerElectionTest, epoch fencing in ReplicaManagerTest) but not the full flow.

Suggestion: Add an integration test in CoordinatorHighAvailabilityITCase:

@Test
void testTabletServerRejectsStaleCoordinatorEpochAfterLeaderSwitch() {
    // 1. Start two coordinators, confirm leader
    // 2. Record current coordinator epoch
    // 3. Kill leader's ZK session, trigger leader switch
    // 4. Wait for new leader election (epoch should increment)
    // 5. Verify new leader can send requests to TabletServer
    // 6. Verify requests with old epoch are rejected with InvalidCoordinatorException
}
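The tablet-server side of the flow described above comes down to a monotonic epoch check. The following is a minimal sketch, assuming hypothetical names: `TabletServerEpochGuard` is not the PR's ReplicaManager, and `StaleEpochException` merely stands in for the InvalidCoordinatorException mentioned in the test outline.

```java
/** Hypothetical stand-in for the exception a tablet server raises on a stale epoch. */
final class StaleEpochException extends RuntimeException {
    StaleEpochException(String message) {
        super(message);
    }
}

/** Hypothetical guard: remembers the highest coordinator epoch seen so far. */
final class TabletServerEpochGuard {
    private int coordinatorEpoch = -1; // nothing seen yet

    /** Accepts requests at the current or a newer epoch; rejects older ones. */
    synchronized void validateAndUpdate(int requestEpoch) {
        if (requestEpoch < coordinatorEpoch) {
            throw new StaleEpochException(
                    "request epoch " + requestEpoch
                            + " is older than current epoch " + coordinatorEpoch);
        }
        coordinatorEpoch = requestEpoch;
    }

    synchronized int currentEpoch() {
        return coordinatorEpoch;
    }

    public static void main(String[] args) {
        TabletServerEpochGuard guard = new TabletServerEpochGuard();
        guard.validateAndUpdate(1); // coordinator A, epoch 1
        guard.validateAndUpdate(2); // coordinator B takes over, epoch 2
        try {
            guard.validateAndUpdate(1); // late request from A
        } catch (StaleEpochException e) {
            System.out.println("rejected: " + e.getMessage());
        }
        System.out.println("current epoch = " + guard.currentEpoch());
    }
}
```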

Contributor Author

zcoo commented Apr 8, 2026

@wuchong @swuferhong Thanks for all your comments. They have all been addressed now. PTAL~

@zcoo zcoo force-pushed the 20260302_coordinator_ha_epoch branch from 0122c9f to 9108e75 April 8, 2026 15:37
Contributor Author

zcoo commented Apr 9, 2026

@wuchong To improve test coverage for coordinator epoch propagation to TabletServer/ZooKeeper after a leader switch, I just added two test cases in CoordinatorHighAvailabilityITCase:

testTabletServerRejectsStaleCoordinatorEpochAfterLeaderSwitch
testZooKeeperRejectsStaleCoordinatorRequestAfterLeaderSwitch

@wuchong wuchong force-pushed the 20260302_coordinator_ha_epoch branch from 200ef9a to bbe55c6 April 9, 2026 07:54
Member

wuchong commented Apr 9, 2026

I pushed a commit that fixes a race condition in leader election and refactors the coordinator epoch to be immutable.

Key Changes

  • CoordinatorLeaderElection: Moved isLeader.set(true) from before to after initLeaderServices.run(), preventing the coordinator from accepting requests before initialization completes. Removed unnecessary zkClient and coordinatorContext fields.

  • CoordinatorContext: Made coordinatorEpoch and coordinatorEpochZkVersion final fields initialized via ZkEpoch in the constructor. Removed the mutable setter. Renamed getCoordinatorEpochZkVersion() to getCoordinatorZkVersion().

  • CoordinatorServer: CoordinatorContext is now created locally in initCoordinatorLeader() per election cycle instead of being a long-lived server field. Removed redundant resetContext() calls from cleanup/close paths.

  • CoordinatorEventProcessor: Removed ZkEpoch constructor parameter (epoch now comes via CoordinatorContext). Added resetContext() in close().

  • ReplicaManager: Added getCoordinatorEpoch() getter and epoch update logging; removed resetCoordinatorEpoch().

  • ZkEpoch: Added INITIAL_EPOCH constant for test convenience.

  • ZooKeeperClient: Fixed Javadoc formatting (inline text → <pre> and <ul> blocks).

  • Tests: Deleted TestCoordinatorContext; updated HA tests to waitUntil leader is ready (since isLeader is now set after init); replaced manual UpdateMetadataRequest workaround with natural epoch propagation wait.

  • ITCase: Updated CoordinatorHighAvailabilityITCase#testTabletServerRejectsStaleCoordinatorEpochAfterLeaderSwitch so that the TabletServer's coordinator epoch is updated by the coordinator itself, rather than being set manually in the IT case.
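The CoordinatorContext change listed above (final epoch fields, one context per election cycle) can be illustrated with a small value class. This is a sketch, not the merged code: `EpochContext` and `ElectionCycleSketch` are hypothetical names, and the real constructor takes a ZkEpoch rather than two ints.

```java
/** Hypothetical immutable holder for the epoch and epoch-znode version of one leadership term. */
final class EpochContext {
    private final int coordinatorEpoch;
    private final int coordinatorZkVersion;

    EpochContext(int coordinatorEpoch, int coordinatorZkVersion) {
        this.coordinatorEpoch = coordinatorEpoch;
        this.coordinatorZkVersion = coordinatorZkVersion;
    }

    int getCoordinatorEpoch() {
        return coordinatorEpoch;
    }

    int getCoordinatorZkVersion() {
        return coordinatorZkVersion;
    }
}

final class ElectionCycleSketch {
    /** Called once per won election; builds a fresh context instead of mutating a shared one. */
    static EpochContext initCoordinatorLeader(int fencedEpoch, int fencedZkVersion) {
        return new EpochContext(fencedEpoch, fencedZkVersion);
    }

    public static void main(String[] args) {
        EpochContext firstTerm = initCoordinatorLeader(1, 1);
        EpochContext secondTerm = initCoordinatorLeader(2, 2);
        // The first term's context is unaffected by the second election.
        System.out.println("first term epoch = " + firstTerm.getCoordinatorEpoch());
        System.out.println("second term epoch = " + secondTerm.getCoordinatorEpoch());
    }
}
```

Because no setter exists, code that still holds a context from an earlier term can never observe a newer epoch by accident; it keeps presenting the old version and is fenced by the version check.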

@wuchong wuchong force-pushed the 20260302_coordinator_ha_epoch branch from bbe55c6 to 1a4384b April 9, 2026 08:50
@wuchong wuchong merged commit 2c031c4 into apache:main Apr 9, 2026
6 checks passed
Development

Successfully merging this pull request may close these issues.

[Feature] Support Epoch and ZkEpoch protect for Coordinator leader change

4 participants