PHOENIX-7562 HAGroupStore peer cache: fail-closed replay on peer loss#2547
Open
ritegarg wants to merge 1 commit into
Extract peer-connection handling from HAGroupStoreClient into a dedicated PeerClusterWatcher: peer cache lifecycle, background retry when peer ZK is unreachable, connection-state handling, de-duplicated delivery with one forced redelivery after reconnect, and a visible/blind state machine. While this RegionServer is STANDBY and cannot see the peer, present an effective local DEGRADED_STANDBY so replication replay fails closed. The overlay is in-memory only; the shared HA record is never modified. Replay consumes the effective HA state rather than peer-connectivity details, and peer reconcile runs off Curator event threads. Add phoenix.ha.group.store.peer.cache.retry.interval.seconds (default 60s) with retry jitter and rate-limited, reason-tagged logging. Co-authored-by: Cursor <cursoragent@cursor.com>
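The in-memory overlay described in the commit message can be sketched as follows. This is a minimal illustration, not the code in this PR: `HaState`, `HaRecord`, `withState`, and `EffectiveStateOverlay` are hypothetical, simplified stand-ins for the real Phoenix state enum, `HAGroupStoreRecord`, `withHAGroupState(...)`, and `getEffectiveHAGroupStoreRecord()`, whose actual fields and signatures are not shown here.

```java
// Hypothetical, reduced state set for illustration only; the real Phoenix
// enum has more states.
enum HaState { ACTIVE, STANDBY, DEGRADED_STANDBY }

// Hypothetical immutable record standing in for HAGroupStoreRecord.
final class HaRecord {
    final String haGroupName;
    final HaState state;

    HaRecord(String haGroupName, HaState state) {
        this.haGroupName = haGroupName;
        this.state = state;
    }

    // Returns a copy with a different state; the original is never mutated,
    // mirroring the "immutable copy for the overlay" idea.
    HaRecord withState(HaState newState) {
        return new HaRecord(haGroupName, newState);
    }
}

final class EffectiveStateOverlay {
    private EffectiveStateOverlay() {}

    // Computes the record that replay should consume. Only a local STANDBY
    // is overlaid, and only while the peer is not visible. The persisted
    // record is passed through untouched in every other case.
    static HaRecord effective(HaRecord persisted, boolean peerVisible) {
        if (persisted.state == HaState.STANDBY && !peerVisible) {
            return persisted.withState(HaState.DEGRADED_STANDBY);
        }
        return persisted;
    }
}
```

Because the overlay is computed on read and returns a copy, nothing in this sketch writes back to the shared record, which matches the stated intent that the shared HA record is never modified.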
d2b1191 to
717b765
What changes were proposed in this pull request?
Make a STANDBY RegionServer's replication replay fail closed while its peer cluster is not visible, and move all peer-connection handling out of HAGroupStoreClient into a dedicated PeerClusterWatcher.
- PeerClusterWatcher (new): owns the peer PathChildrenCache + PhoenixHAAdmin for one HA group and reports peer state via a Listener (onPeerStateChanged / onPeerVisible / onPeerBlind):
  - [...] BLIND, and a per-client daemon retries the build on phoenix.ha.group.store.peer.cache.retry.interval.seconds until the peer returns, then goes VISIBLE; no restart needed.
  - [...] transitionLock → connectionStateLock; the blocking cache build and listener callbacks run outside connectionStateLock.
- HAGroupStoreCacheUtil (new): shared helpers to parse a znode into (record, stat) and build/start a PathChildrenCache (init latch released in finally).
- HAGroupStoreClient: delegates peer handling to PeerClusterWatcher via a PeerListener; adds the in-memory effective-state overlay getEffectiveHAGroupStoreRecord(), which reports a local STANDBY as DEGRADED_STANDBY whenever the peer is blind (whether the peer drops while already STANDBY or the role reaches STANDBY after the peer is blind) and is never persisted; and suppresses a redundant real STANDBY while degraded.
- HAGroupStoreManager / HAGroupStoreRecord: add getEffectiveHAGroupStoreRecord(haGroupName) and withHAGroupState(...) (immutable copy for the overlay), and document DEGRADED_STANDBY's dual nature. The DEGRADED_STANDBY state and the subscription framework are pre-existing.
- ReplicationLogGroup.init and ReplicationLogDiscoveryReplay.getHAGroupRecord read the effective record; mode mapping is otherwise unchanged.
- phoenix.ha.group.store.peer.cache.retry.interval.seconds (default 60s; 0 disables retry), with jittered retry and rate-limited WARN (1st + every 10th attempt).
Why are the changes needed?
When the peer ZK is unreachable (including down at startup), a STANDBY can't reliably determine peer state, so replay risked proceeding as if in sync (fail-open). It must instead fail closed (STORE_AND_FORWARD) until the peer is reachable, then recover automatically. Extracting the peer lifecycle into PeerClusterWatcher decouples replay from peer connectivity and keeps HAGroupStoreClient focused on the effective HA view.
Does this PR introduce any user-facing change?
Yes, within the unreleased consistent-failover feature branch (no change vs released Phoenix):
- phoenix.ha.group.store.peer.cache.retry.interval.seconds (default 60s).
- A STANDBY whose peer ZK is unreachable presents an effective DEGRADED_STANDBY (in-memory only, never written to ZK), so replay fails closed until the peer is visible again. Persisted wire format is unchanged.
How was this patch tested?
New and existing unit/integration tests:
HAGroupStoreClientIT, HAGroupStoreManagerIT, HAGroupStateSubscriptionIT, ReplicationLogDiscoveryReplayTestIT, HAGroupStoreRecordTest, PeerClusterWatcherTest.
Key cases:
- peer-loss degrade/recover;
- the role entering STANDBY while the peer is already blind, and cold start with peer ZK down, both present DEGRADED_STANDBY while the persisted record stays STANDBY;
- peer ZK down at startup, then retry rebuilds the cache;
- in-memory-only overlay (persisted stays STANDBY);
- STANDBY re-entry suppressed while peer-blind;
- forced reconnect redelivery;
- visible/blind serialization;
- close() idempotency;
- replayer degrade → abort → recover;
- a listener throwing on the cache INITIALIZED event still initializes healthy.
Was this patch authored or co-authored using generative AI tooling?
Generated-by: Cursor
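
For readers who want a concrete picture of the retry behaviour described under "What changes were proposed", here is a rough sketch of a jittered retry delay plus the "1st + every 10th attempt" WARN rate limit. Everything in it is an assumption made for illustration: the class `PeerRetryPolicy`, its method names, the ±20% jitter fraction, and the reading of "every 10th" as attempts 10, 20, 30 and so on. The PR itself may compute jitter and count attempts differently.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper, not part of this PR.
final class PeerRetryPolicy {
    // Assumed jitter of +/-20% around the configured interval.
    static final double JITTER_FRACTION = 0.2;

    private PeerRetryPolicy() {}

    // An interval of 0 (or less) disables retry, per the config description.
    static boolean retryEnabled(long intervalSeconds) {
        return intervalSeconds > 0;
    }

    // Deterministic core: maps unit in [0, 1) to a delay in
    // [base * (1 - JITTER_FRACTION), base * (1 + JITTER_FRACTION)).
    static long jitteredDelayMillis(long intervalSeconds, double unit) {
        long baseMillis = intervalSeconds * 1000L;
        double factor = 1.0 + JITTER_FRACTION * (2.0 * unit - 1.0);
        return Math.round(baseMillis * factor);
    }

    // Convenience overload drawing the jitter from a thread-local RNG.
    static long jitteredDelayMillis(long intervalSeconds) {
        return jitteredDelayMillis(intervalSeconds,
            ThreadLocalRandom.current().nextDouble());
    }

    // WARN on the first attempt and on every 10th attempt after that
    // (assumed to mean attempts 1, 10, 20, ...).
    static boolean shouldWarn(int attempt) {
        return attempt == 1 || attempt % 10 == 0;
    }
}
```

Splitting the jitter into a deterministic core and a random overload keeps the delay calculation unit-testable without controlling the random source.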