PHOENIX-7872 :- HA observability metrics for poller, CRR refresh/age, failover, mutation-block + RegionServer bypass counter#2502
Open
lokiore wants to merge 2 commits into
Conversation
… failover, mutation-block + RegionServer bypass counter Adds client-side and server-side observability metrics for the Consistent Failover (CCF) high-availability path: Tier-1 client-side counters (4): - HA_POLLER_TICK_COUNT — total poller ticks across all HA groups - HA_POLLER_TICK_FAILURES — per-tick CRR fetch failures - HA_FAILOVER_COUNT — failover transitions executed by the client - HA_MUTATION_BLOCKED_COUNT — MutationBlockedIOException occurrences detected via the wrap-and-propagate path Tier-2 client-side metrics (4): - HA_FAILOVER_DURATION_MS — failover end-to-end latency histogram - HA_STALE_CRR_DETECTED_COUNT — StaleClusterRoleRecordException occurrences - HA_CRR_CACHE_AGE_MS — gauge of staleness of the in-memory CRR cache - (HA_FAILOVER_COUNT moved to applyClusterRoleRecord with role-transition guard so it only fires on actual ACTIVE -> STANDBY or STANDBY -> ACTIVE transitions) Tier-2 server-side counter (1): - BYPASSED_MUTATION_BLOCK_COUNT — emitted from IndexRegionObserver when a mutation bypasses the mutation-block check because no log group is present. Implemented as 3-file Hadoop-metrics2 source: interface + static factory (DefaultMetricsSystem.instance()) + impl. Tests: - HAGroupMetricsIT — IT covering all 8 client-side metrics - BypassedMutationBlockMetricsIT — IT covering server-side bypass counter - HighAvailabilityUtilTest — UT covering RetriesExhaustedWithDetailsException + IOException cause-chain MBIOE detection - MetricsHaBypassSourceFactoryTest — UT covering factory thread-safety Generated-by: Claude Code (Opus 4.7)
…+ IT catch narrowing + review nits) Five must-fixes from the PR apache#2502 review: 1. HA_FAILOVER_COUNT semantic — moved increment inside the transition try and gated on transitionSucceeded so only successful policy transitions count; preserves the existing metric name. Gate decision factored into the package-private static HighAvailabilityGroup#shouldCountFailover for direct unit-test coverage of the negative path (see Nit 1 below). 2. HA_CRR_REFRESH_COUNT semantic — moved increment to after a successful getClusterRoleRecordFromEndpoint() so no-op refreshes inside the cache window do NOT inflate the counter (counter now measures fresh fetches, not refresh-method invocations). 3. HA_CRR_CACHE_AGE_MS sampling — added a poller-tick sample site so the gauge updates on every poller iteration via a new HighAvailabilityGroup #getCacheAgeMs() accessor. The connect()-site sample is retained. 4. BYPASSED_MUTATION_BLOCK_COUNT framing — rewrote the IndexRegionObserver inline comment and MetricsHaBypassSource Javadoc/descriptions as a path-coverage detector (counts the short-circuit code path; does NOT imply any safety property was breached). 5. HAGroupMetricsIT catch narrowing — narrowed the stale-CRR catch from the broad catch(Exception) to catch(SQLException) and asserted the error code is FAILOVER_IN_PROGRESS (the contracted surface from FailoverPhoenixConnection.wrapActionDuringFailover); added a LOG.info so the expected exception is recorded. Tests adjusted to match the new main-code semantics: - HAGroupMetricsIT.testCrrRefreshCount — switched both refresh calls to force-refresh so the assertion exercises the actual-fetch path the counter now measures. Inline comment explains WHY non-force inside the cache window intentionally no longer increments. Review-nit follow-ups: - Nit 1 (negative-path coverage for Fix apache#1): extracted the failover-gate decision into a pure, package-private static helper HighAvailabilityGroup#shouldCountFailover(boolean, ClusterRoleRecord, ClusterRoleRecord) and added HighAvailabilityGroupTest #testShouldCountFailoverGate covering 5 cases: (a) real ACTIVE-URL move counts, (b) same-active-URL no-op does NOT count, (c) transition INTO no-active does NOT count, (d) transitionSucceeded=false (failed policy callback) does NOT count — the regression guard, and (e) recovery from no-active back to ACTIVE counts. - Nit 2 (poller-tick gauge value): HAGroupMetricsIT.testPollerTickCount now also asserts GLOBAL_HA_CRR_CACHE_AGE_MS.getValue() >= 0L after the poller has ticked (guards against the -1L never-refreshed sentinel leaking out through the poller-tick sample site). - Nit 3 (never-refreshed disambiguation): HighAvailabilityGroup #getCacheAgeMs() now returns -1L when lastClusterRoleRecordRefreshTime is 0, instead of 0L. This disambiguates "never sampled" from "refreshed within the same millisecond" on the gauge, and supersedes a latent bug: because the CRR poller is scheduled with initial-delay 0, its first tick can fire before init() seeds the timestamp; under the prior code the raw arithmetic now - 0 would publish a giant value (~currentTimeMillis()) to the gauge and spuriously trip every age > threshold alert. -1L publishes a clean "not yet sampled" marker. Javadoc documents the rationale + warns future readers not to revert to return 0. Connect()-site (state-gated) is unaffected and continues to use raw arithmetic. - Nit 4 (logger arg): HAGroupMetricsIT.testStaleCrrDetectedCount LOG.info now passes testName.getMethodName() (was relying on TestName.toString() to do the right thing). Generated-by: Claude Code (Opus 4.7)
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What changes were proposed in this pull request?
Adds client-side and server-side observability metrics for the Consistent Failover (CCF) high-availability path.
JIRA: https://issues.apache.org/jira/browse/PHOENIX-7872
Tier-1 client-side counters (in
GlobalClientMetrics+MetricType):HA_POLLER_TICK_COUNT— total poller ticks across all HA groups (incremented inGetClusterRoleRecordUtil's polling task)HA_POLLER_TICK_FAILURES— per-tick CRR fetch failuresHA_FAILOVER_COUNT— failover transitions executed by the client. Emitted fromHighAvailabilityGroup.applyClusterRoleRecordonly on actual ACTIVE → STANDBY or STANDBY → ACTIVE role transitionsHA_MUTATION_BLOCKED_COUNT—MutationBlockedIOExceptionoccurrences detected via the wrap-and-propagate path inFailoverPhoenixConnection.wrapActionWhileFailoverTier-2 client-side metrics:
HA_FAILOVER_DURATION_MS— failover end-to-end latency histogram (try/finally wrapper inFailoverPhoenixConnection.failover)HA_STALE_CRR_DETECTED_COUNT—StaleClusterRoleRecordExceptionoccurrences detected in the wrap pathHA_CRR_CACHE_AGE_MS— gauge of staleness of the in-memory CRR cache, set on every successful CRR refresh inHighAvailabilityGroupTier-2 server-side counter (new 3-file Hadoop-metrics2 source under
phoenix-core-server/.../hbase/index/metrics):BYPASSED_MUTATION_BLOCK_COUNT— emitted fromIndexRegionObserverwhen a mutation bypasses the mutation-block check because no log group is present for the data table. Implemented asMetricsHaBypassSource(interface) +MetricsHaBypassSourceFactory(DefaultMetricsSystem-anchored, double-checked lock) +MetricsHaBypassSourceImpl.Why are the changes needed?
The CCF HA path previously had no observability for client-side polling cadence, failover frequency, failover latency, mutation-block fail-fast counts, stale-CRR detection, or CRR cache age. Operators investigating slow failovers or unexpected mutation rejections had to reconstruct event timelines from scattered DEBUG logs.
These metrics close the gap on the dimensions the platform team needs for HA SLO tracking and incident triage:
Does this PR introduce any user-facing change?
No
The new metrics are emitted via the existing
GlobalClientMetrics(client-side) and Hadoop metrics2 (server-side) pipelines. No public-API change, no SQL surface change, no behavior change on the failover/poller paths beyondgetMetric().increment()/.update()/.set()calls.How was this patch tested?
New unit tests:
HighAvailabilityUtilTest— coversRetriesExhaustedWithDetailsExceptioncause-chain MBIOE detection (the wrap-and-propagate path that firesHA_MUTATION_BLOCKED_COUNT)MetricsHaBypassSourceFactoryTest— covers factory thread-safety (single-instance under concurrentgetInstance())New ITs:
HAGroupMetricsIT— covers all 8 client-side metrics across the CCF failover lifecycle (poller ticks, failover transitions, CRR cache age gauge, stale-CRR detection, MBIOE detection on the wrap path)BypassedMutationBlockMetricsIT— covers server-sideBYPASSED_MUTATION_BLOCK_COUNTemission when a mutation hits a region without an HA log groupLocal 13/13 PASS reproduction:
Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code (Opus 4.7)