PHOENIX-7872 :- HA observability metrics for poller, CRR refresh/age, failover, mutation-block + RegionServer bypass counter by lokiore · Pull Request #2502 · apache/phoenix

lokiore · 2026-06-08T22:15:36Z

What changes were proposed in this pull request?

Adds client-side and server-side observability metrics for the Consistent Failover (CCF) high-availability path.

JIRA: https://issues.apache.org/jira/browse/PHOENIX-7872

Tier-1 client-side counters (in GlobalClientMetrics + MetricType):

HA_POLLER_TICK_COUNT — total poller ticks across all HA groups (incremented in GetClusterRoleRecordUtil's polling task)
HA_POLLER_TICK_FAILURES — per-tick ClusterRoleRecord (CRR) fetch failures
HA_FAILOVER_COUNT — failover transitions executed by the client. Emitted from HighAvailabilityGroup.applyClusterRoleRecord only on actual ACTIVE → STANDBY or STANDBY → ACTIVE role transitions
HA_MUTATION_BLOCKED_COUNT — MutationBlockedIOException (MBIOE) occurrences detected via the wrap-and-propagate path in FailoverPhoenixConnection.wrapActionDuringFailover
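
The role-transition guard described for HA_FAILOVER_COUNT could look roughly like the sketch below. This is not the code from this PR: the Role enum and FailoverTransitionGuard class are hypothetical stand-ins, and the sketch assumes the guard compares only the previous and the new role of one cluster.

```java
/** Hypothetical stand-in for a cluster role; not the Phoenix ClusterRole type. */
enum Role { ACTIVE, STANDBY, OFFLINE }

/** Sketch of a guard that counts only real ACTIVE <-> STANDBY flips. */
final class FailoverTransitionGuard {
    private FailoverTransitionGuard() { }

    /** Returns true only for ACTIVE -> STANDBY or STANDBY -> ACTIVE. */
    static boolean isFailoverTransition(Role oldRole, Role newRole) {
        if (oldRole == null || newRole == null) {
            return false; // no previous or no new record: nothing to compare
        }
        return (oldRole == Role.ACTIVE && newRole == Role.STANDBY)
            || (oldRole == Role.STANDBY && newRole == Role.ACTIVE);
    }
}
```

A caller would increment the counter only when isFailoverTransition returns true, so re-applying an unchanged record leaves the count alone.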

Tier-2 client-side metrics:

HA_FAILOVER_DURATION_MS — failover end-to-end latency histogram (try/finally wrapper in FailoverPhoenixConnection.failover)
HA_STALE_CRR_DETECTED_COUNT — StaleClusterRoleRecordException occurrences detected in the wrap path
HA_CRR_CACHE_AGE_MS — gauge of staleness of the in-memory CRR cache, set on every successful CRR refresh in HighAvailabilityGroup
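
As an illustration of the try/finally wrapper mentioned for HA_FAILOVER_DURATION_MS, here is a minimal sketch. The TimedAction class and the durationMsSink parameter are hypothetical; the real code would presumably pass the elapsed time to the metric's update call rather than to a LongConsumer.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.LongConsumer;
import java.util.function.Supplier;

/** Sketch of a try/finally latency wrapper; all names here are hypothetical. */
final class TimedAction {
    private TimedAction() { }

    /** Runs the action and reports its duration whether it returns or throws. */
    static <T> T timed(Supplier<T> action, LongConsumer durationMsSink) {
        long startNanos = System.nanoTime(); // monotonic, unaffected by wall-clock changes
        try {
            return action.get();
        } finally {
            long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNanos);
            durationMsSink.accept(elapsedMs);
        }
    }
}
```

Because the sample is taken in the finally block, failed attempts are recorded as well as successful ones, which is usually what a latency histogram for incident triage should show.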

Tier-2 server-side counter (new 3-file Hadoop-metrics2 source under phoenix-core-server/.../hbase/index/metrics):

BYPASSED_MUTATION_BLOCK_COUNT — emitted from IndexRegionObserver when a mutation bypasses the mutation-block check because no log group is present for the data table. Implemented as MetricsHaBypassSource (interface) + MetricsHaBypassSourceFactory (DefaultMetricsSystem-anchored, double-checked lock) + MetricsHaBypassSourceImpl.
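
The double-checked-lock factory pattern named above can be sketched as follows. This is a self-contained illustration, not the PR's MetricsHaBypassSourceFactory: BypassCounterSource, BypassCounterSourceFactory and the in-memory implementation are hypothetical, and the registration with DefaultMetricsSystem is left out.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical metrics source; stands in for the real interface. */
interface BypassCounterSource {
    void incrementBypassed();
    long bypassedCount();
}

/** Sketch of a lazily initialised singleton factory using double-checked locking. */
final class BypassCounterSourceFactory {
    // volatile is required so a fully constructed instance is visible to all threads
    private static volatile BypassCounterSource instance;

    private BypassCounterSourceFactory() { }

    static BypassCounterSource getInstance() {
        BypassCounterSource local = instance;          // first check, no lock
        if (local == null) {
            synchronized (BypassCounterSourceFactory.class) {
                local = instance;                      // second check, under the lock
                if (local == null) {
                    local = new InMemoryBypassCounterSource();
                    instance = local;
                }
            }
        }
        return local;
    }

    /** Hypothetical in-memory implementation backed by an AtomicLong. */
    private static final class InMemoryBypassCounterSource implements BypassCounterSource {
        private final AtomicLong count = new AtomicLong();
        @Override public void incrementBypassed() { count.incrementAndGet(); }
        @Override public long bypassedCount() { return count.get(); }
    }
}
```

Only the first caller pays for the lock; later calls return after a single volatile read.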

Why are the changes needed?

The CCF HA path previously had no observability for client-side polling cadence, failover frequency, failover latency, mutation-block fail-fast counts, stale-CRR detection, or CRR cache age. Operators investigating slow failovers or unexpected mutation rejections had to reconstruct event timelines from scattered DEBUG logs.

These metrics close the gap on the dimensions the platform team needs for HA SLO tracking and incident triage:

Poller liveness & health (tick count + failures)
Failover frequency, duration, and trigger (CRR transition vs MBIOE-driven)
Stale-CRR detection rate as a leading indicator of failover-in-progress windows
Server-side rate of mutation-block bypass (regions without an HA log group attached)

Does this PR introduce any user-facing change?

No

The new metrics are emitted via the existing GlobalClientMetrics (client-side) and Hadoop metrics2 (server-side) pipelines. No public-API change, no SQL surface change, no behavior change on the failover/poller paths beyond getMetric().increment() / .update() / .set() calls.

How was this patch tested?

New unit tests:

HighAvailabilityUtilTest — covers RetriesExhaustedWithDetailsException cause-chain MBIOE detection (the wrap-and-propagate path that fires HA_MUTATION_BLOCKED_COUNT)
MetricsHaBypassSourceFactoryTest — covers factory thread-safety (single-instance under concurrent getInstance())
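
The cause-chain detection that HighAvailabilityUtilTest exercises can be pictured with the sketch below. It does not use the HBase or Phoenix classes: BlockedException stands in for MutationBlockedIOException and BatchFailure for a batch exception that carries a list of per-action causes, and both are hypothetical, as is the CauseChain helper.

```java
import java.io.IOException;
import java.util.List;

/** Hypothetical stand-in for MutationBlockedIOException. */
class BlockedException extends IOException {
    BlockedException(String msg) { super(msg); }
}

/** Hypothetical stand-in for a batch exception that carries per-action causes. */
class BatchFailure extends IOException {
    private final List<Throwable> causes;
    BatchFailure(List<Throwable> causes) { super("batch failed"); this.causes = causes; }
    List<Throwable> getCauses() { return causes; }
}

/** Sketch of a cause-chain walk that also looks inside batch failures. */
final class CauseChain {
    private CauseChain() { }

    static boolean containsBlocked(Throwable t) {
        int depth = 0;
        while (t != null && depth++ < 32) { // bound guards against cyclic getCause() chains
            if (t instanceof BlockedException) {
                return true;
            }
            if (t instanceof BatchFailure) {
                for (Throwable cause : ((BatchFailure) t).getCauses()) {
                    if (containsBlocked(cause)) {
                        return true;
                    }
                }
            }
            t = t.getCause();
        }
        return false;
    }
}
```

The point of checking both the linear chain and the per-action list is that a blocked mutation may be wrapped by a generic IOException or buried inside a batch-level failure.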

New ITs:

HAGroupMetricsIT — covers all 8 client-side metrics across the CCF failover lifecycle (poller ticks, failover transitions, CRR cache age gauge, stale-CRR detection, MBIOE detection on the wrap path)
BypassedMutationBlockMetricsIT — covers server-side BYPASSED_MUTATION_BLOCK_COUNT emission when a mutation hits a region without an HA log group

Local 13/13 PASS reproduction:

[INFO] Running org.apache.phoenix.jdbc.HighAvailabilityUtilTest
[INFO] Tests run: 4, Failures: 0, Errors: 0, Skipped: 0
[INFO] Running org.apache.phoenix.hbase.index.metrics.MetricsHaBypassSourceFactoryTest
[INFO] Tests run: 3, Failures: 0, Errors: 0, Skipped: 0
[INFO] Running org.apache.phoenix.jdbc.HAGroupMetricsIT
[INFO] Tests run: 5, Failures: 0, Errors: 0, Skipped: 0
[INFO] Running org.apache.phoenix.end2end.index.BypassedMutationBlockMetricsIT
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0
[INFO] BUILD SUCCESS

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Opus 4.7)

… failover, mutation-block + RegionServer bypass counter

Adds client-side and server-side observability metrics for the Consistent Failover (CCF) high-availability path:

Tier-1 client-side counters (4):
- HA_POLLER_TICK_COUNT — total poller ticks across all HA groups
- HA_POLLER_TICK_FAILURES — per-tick CRR fetch failures
- HA_FAILOVER_COUNT — failover transitions executed by the client
- HA_MUTATION_BLOCKED_COUNT — MutationBlockedIOException occurrences detected via the wrap-and-propagate path

Tier-2 client-side metrics (4):
- HA_FAILOVER_DURATION_MS — failover end-to-end latency histogram
- HA_STALE_CRR_DETECTED_COUNT — StaleClusterRoleRecordException occurrences
- HA_CRR_CACHE_AGE_MS — gauge of staleness of the in-memory CRR cache
- (HA_FAILOVER_COUNT moved to applyClusterRoleRecord with role-transition guard so it only fires on actual ACTIVE -> STANDBY or STANDBY -> ACTIVE transitions)

Tier-2 server-side counter (1):
- BYPASSED_MUTATION_BLOCK_COUNT — emitted from IndexRegionObserver when a mutation bypasses the mutation-block check because no log group is present. Implemented as 3-file Hadoop-metrics2 source: interface + static factory (DefaultMetricsSystem.instance()) + impl.

Tests:
- HAGroupMetricsIT — IT covering all 8 client-side metrics
- BypassedMutationBlockMetricsIT — IT covering server-side bypass counter
- HighAvailabilityUtilTest — UT covering RetriesExhaustedWithDetailsException + IOException cause-chain MBIOE detection
- MetricsHaBypassSourceFactoryTest — UT covering factory thread-safety

Generated-by: Claude Code (Opus 4.7)

…+ IT catch narrowing + review nits)

Five must-fixes from the PR #2502 review:

1. HA_FAILOVER_COUNT semantic — moved the increment inside the transition try and gated it on transitionSucceeded so only successful policy transitions count; preserves the existing metric name. The gate decision is factored into the package-private static HighAvailabilityGroup#shouldCountFailover for direct unit-test coverage of the negative path (see Nit 1 below).
2. HA_CRR_REFRESH_COUNT semantic — moved the increment to after a successful getClusterRoleRecordFromEndpoint() so no-op refreshes inside the cache window do NOT inflate the counter (the counter now measures fresh fetches, not refresh-method invocations).
3. HA_CRR_CACHE_AGE_MS sampling — added a poller-tick sample site so the gauge updates on every poller iteration via a new HighAvailabilityGroup#getCacheAgeMs() accessor. The connect()-site sample is retained.
4. BYPASSED_MUTATION_BLOCK_COUNT framing — rewrote the IndexRegionObserver inline comment and the MetricsHaBypassSource Javadoc/descriptions as a path-coverage detector (counts the short-circuit code path; does NOT imply any safety property was breached).
5. HAGroupMetricsIT catch narrowing — narrowed the stale-CRR catch from the broad catch(Exception) to catch(SQLException) and asserted that the error code is FAILOVER_IN_PROGRESS (the contracted surface from FailoverPhoenixConnection.wrapActionDuringFailover); added a LOG.info so the expected exception is recorded.

Tests adjusted to match the new main-code semantics:
- HAGroupMetricsIT.testCrrRefreshCount — switched both refresh calls to force-refresh so the assertion exercises the actual-fetch path the counter now measures. An inline comment explains WHY a non-force refresh inside the cache window intentionally no longer increments.

Review-nit follow-ups:
- Nit 1 (negative-path coverage for Fix #1): extracted the failover-gate decision into a pure, package-private static helper HighAvailabilityGroup#shouldCountFailover(boolean, ClusterRoleRecord, ClusterRoleRecord) and added HighAvailabilityGroupTest#testShouldCountFailoverGate covering 5 cases: (a) real ACTIVE-URL move counts, (b) same-active-URL no-op does NOT count, (c) transition INTO no-active does NOT count, (d) transitionSucceeded=false (failed policy callback) does NOT count — the regression guard, and (e) recovery from no-active back to ACTIVE counts.
- Nit 2 (poller-tick gauge value): HAGroupMetricsIT.testPollerTickCount now also asserts GLOBAL_HA_CRR_CACHE_AGE_MS.getValue() >= 0L after the poller has ticked (guards against the -1L never-refreshed sentinel leaking out through the poller-tick sample site).
- Nit 3 (never-refreshed disambiguation): HighAvailabilityGroup#getCacheAgeMs() now returns -1L when lastClusterRoleRecordRefreshTime is 0, instead of 0L. This disambiguates "never sampled" from "refreshed within the same millisecond" on the gauge, and fixes a latent bug: because the CRR poller is scheduled with initial-delay 0, its first tick can fire before init() seeds the timestamp; under the prior code the raw arithmetic now - 0 would publish a giant value (~currentTimeMillis()) to the gauge and spuriously trip every age > threshold alert. -1L publishes a clean "not yet sampled" marker. The Javadoc documents the rationale and warns future readers not to revert to return 0. The connect()-site sample (state-gated) is unaffected and continues to use raw arithmetic.
- Nit 4 (logger arg): HAGroupMetricsIT.testStaleCrrDetectedCount LOG.info now passes testName.getMethodName() (was relying on TestName.toString() to do the right thing).

Generated-by: Claude Code (Opus 4.7)
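
The never-refreshed sentinel from Nit 3 can be illustrated with a small sketch. The CacheAge class and its explicit nowMs parameter are hypothetical (the real accessor is described as HighAvailabilityGroup#getCacheAgeMs() and presumably reads the clock itself); only the -1L convention and the "timestamp 0 means never refreshed" rule come from the commit message.

```java
/** Sketch of the cache-age computation with a never-refreshed sentinel. */
final class CacheAge {
    /** Published when no refresh has been recorded yet. */
    static final long NEVER_REFRESHED = -1L;

    private CacheAge() { }

    /**
     * @param lastRefreshMs wall-clock time of the last refresh, or 0 if none happened yet
     * @param nowMs         current wall-clock time (a parameter here so the sketch is testable)
     */
    static long cacheAgeMs(long lastRefreshMs, long nowMs) {
        if (lastRefreshMs == 0L) {
            // Without this branch the result would be nowMs - 0, a huge bogus age.
            return NEVER_REFRESHED;
        }
        return nowMs - lastRefreshMs;
    }
}
```

With this shape, an alert written as age > threshold cannot fire on a group that has simply not been sampled yet, and a dashboard can still tell -1 apart from a genuine age of 0 ms.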

lokiore added 2 commits June 8, 2026 15:13

lokiore wants to merge 2 commits into apache:PHOENIX-7562-feature-new from lokiore:PHOENIX-7872-ha-observability-metrics
