
PHOENIX-7872 :- HA observability metrics for poller, CRR refresh/age, failover, mutation-block + RegionServer bypass counter #2502

Open
lokiore wants to merge 2 commits into apache:PHOENIX-7562-feature-new from lokiore:PHOENIX-7872-ha-observability-metrics

@lokiore lokiore commented Jun 8, 2026


What changes were proposed in this pull request?

Adds client-side and server-side observability metrics for the Consistent Failover (CCF) high-availability path.

JIRA: https://issues.apache.org/jira/browse/PHOENIX-7872

Tier-1 client-side counters (in GlobalClientMetrics + MetricType):

  • HA_POLLER_TICK_COUNT — total poller ticks across all HA groups (incremented in GetClusterRoleRecordUtil's polling task)
  • HA_POLLER_TICK_FAILURES — per-tick CRR fetch failures
  • HA_FAILOVER_COUNT — failover transitions executed by the client. Emitted from HighAvailabilityGroup.applyClusterRoleRecord only on actual ACTIVE → STANDBY or STANDBY → ACTIVE role transitions
  • HA_MUTATION_BLOCKED_COUNT: MutationBlockedIOException occurrences detected via the wrap-and-propagate path in FailoverPhoenixConnection.wrapActionDuringFailover

Tier-2 client-side metrics:

  • HA_FAILOVER_DURATION_MS — failover end-to-end latency histogram (try/finally wrapper in FailoverPhoenixConnection.failover)
  • HA_STALE_CRR_DETECTED_COUNT: StaleClusterRoleRecordException occurrences detected in the wrap path
  • HA_CRR_CACHE_AGE_MS — gauge of staleness of the in-memory CRR cache, set on every successful CRR refresh in HighAvailabilityGroup

Tier-2 server-side counter (new 3-file Hadoop-metrics2 source under phoenix-core-server/.../hbase/index/metrics):

  • BYPASSED_MUTATION_BLOCK_COUNT — emitted from IndexRegionObserver when a mutation bypasses the mutation-block check because no log group is present for the data table. Implemented as MetricsHaBypassSource (interface) + MetricsHaBypassSourceFactory (DefaultMetricsSystem-anchored, double-checked lock) + MetricsHaBypassSourceImpl.
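For readers unfamiliar with the double-checked-lock factory shape named above, a self-contained sketch follows. All names here (`HaBypassFactorySketch`, `Source`, `SourceImpl`) are hypothetical, and the sketch deliberately omits the Hadoop metrics2 registration that the real factory performs; it only shows the lazy, thread-safe single-instance pattern.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Illustrative only: lazy singleton factory using double-checked locking. */
final class HaBypassFactorySketch {

    /** Hypothetical stand-in for the metrics source interface. */
    interface Source {
        void incrementBypassedMutationBlockCount();
        long getBypassedMutationBlockCount();
    }

    /** Hypothetical in-memory implementation; the real one would publish via metrics2. */
    static final class SourceImpl implements Source {
        private final AtomicLong count = new AtomicLong();
        @Override public void incrementBypassedMutationBlockCount() { count.incrementAndGet(); }
        @Override public long getBypassedMutationBlockCount() { return count.get(); }
    }

    // volatile is required for double-checked locking to be safe in the Java memory model.
    private static volatile Source instance;

    private HaBypassFactorySketch() { }

    static Source getInstance() {
        Source local = instance;              // first check, no lock
        if (local == null) {
            synchronized (HaBypassFactorySketch.class) {
                local = instance;             // second check, under the lock
                if (local == null) {
                    local = new SourceImpl();
                    instance = local;
                }
            }
        }
        return local;
    }

    public static void main(String[] args) {
        getInstance().incrementBypassedMutationBlockCount();
        System.out.println(getInstance().getBypassedMutationBlockCount());
    }
}
```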

Why are the changes needed?

The CCF HA path previously had no observability for client-side polling cadence, failover frequency, failover latency, mutation-block fail-fast counts, stale-CRR detection, or CRR cache age. Operators investigating slow failovers or unexpected mutation rejections had to reconstruct event timelines from scattered DEBUG logs.

These metrics close the gap on the dimensions the platform team needs for HA SLO tracking and incident triage:

  • Poller liveness & health (tick count + failures)
  • Failover frequency, duration, and trigger (CRR transition vs MBIOE-driven)
  • Stale-CRR detection rate as a leading indicator of failover-in-progress windows
  • Server-side rate of mutation-block bypass (regions without an HA log group attached)

Does this PR introduce any user-facing change?

No

The new metrics are emitted via the existing GlobalClientMetrics (client-side) and Hadoop metrics2 (server-side) pipelines. No public-API change, no SQL surface change, no behavior change on the failover/poller paths beyond getMetric().increment() / .update() / .set() calls.

How was this patch tested?

New unit tests:

  • HighAvailabilityUtilTest — covers RetriesExhaustedWithDetailsException cause-chain MBIOE detection (the wrap-and-propagate path that fires HA_MUTATION_BLOCKED_COUNT)
  • MetricsHaBypassSourceFactoryTest — covers factory thread-safety (single-instance under concurrent getInstance())
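The cause-chain detection exercised by HighAvailabilityUtilTest can be pictured with the generic sketch below. It is an assumption-laden illustration, not the PR's helper: `CauseChainSketch` and `BlockedExceptionSketch` are hypothetical, and the sketch only walks the standard `getCause()` chain. Any per-action causes carried by RetriesExhaustedWithDetailsException would need additional handling that is not shown here.

```java
import java.io.IOException;

/** Illustrative only: detects a target exception type anywhere in a getCause() chain. */
final class CauseChainSketch {

    /** Hypothetical stand-in for MutationBlockedIOException. */
    static final class BlockedExceptionSketch extends IOException {
        BlockedExceptionSketch(String message) { super(message); }
    }

    /** Returns true if t, or any exception reachable through getCause(), is of the given type. */
    static boolean hasCause(Throwable t, Class<? extends Throwable> type) {
        int depth = 0;
        // The depth cap guards against pathological cyclic cause chains.
        for (Throwable cur = t; cur != null && depth < 64; cur = cur.getCause(), depth++) {
            if (type.isInstance(cur)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Throwable wrapped =
            new RuntimeException(new IOException(new BlockedExceptionSketch("blocked")));
        System.out.println(hasCause(wrapped, BlockedExceptionSketch.class));                 // prints true
        System.out.println(hasCause(new IOException("other"), BlockedExceptionSketch.class)); // prints false
    }
}
```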

New ITs:

  • HAGroupMetricsIT — covers all 8 client-side metrics across the CCF failover lifecycle (poller ticks, failover transitions, CRR cache age gauge, stale-CRR detection, MBIOE detection on the wrap path)
  • BypassedMutationBlockMetricsIT — covers server-side BYPASSED_MUTATION_BLOCK_COUNT emission when a mutation hits a region without an HA log group

Local 13/13 PASS reproduction:

[INFO] Running org.apache.phoenix.jdbc.HighAvailabilityUtilTest
[INFO] Tests run: 4, Failures: 0, Errors: 0, Skipped: 0
[INFO] Running org.apache.phoenix.hbase.index.metrics.MetricsHaBypassSourceFactoryTest
[INFO] Tests run: 3, Failures: 0, Errors: 0, Skipped: 0
[INFO] Running org.apache.phoenix.jdbc.HAGroupMetricsIT
[INFO] Tests run: 5, Failures: 0, Errors: 0, Skipped: 0
[INFO] Running org.apache.phoenix.end2end.index.BypassedMutationBlockMetricsIT
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0
[INFO] BUILD SUCCESS

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Opus 4.7)

lokiore added 2 commits June 8, 2026 15:13
… failover, mutation-block + RegionServer bypass counter

Adds client-side and server-side observability metrics for the
Consistent Failover (CCF) high-availability path:

Tier-1 client-side counters (4):
- HA_POLLER_TICK_COUNT — total poller ticks across all HA groups
- HA_POLLER_TICK_FAILURES — per-tick CRR fetch failures
- HA_FAILOVER_COUNT — failover transitions executed by the client
- HA_MUTATION_BLOCKED_COUNT — MutationBlockedIOException occurrences
  detected via the wrap-and-propagate path

Tier-2 client-side metrics (4):
- HA_FAILOVER_DURATION_MS — failover end-to-end latency histogram
- HA_STALE_CRR_DETECTED_COUNT — StaleClusterRoleRecordException occurrences
- HA_CRR_CACHE_AGE_MS — gauge of staleness of the in-memory CRR cache
- (HA_FAILOVER_COUNT moved to applyClusterRoleRecord with role-transition guard
  so it only fires on actual ACTIVE -> STANDBY or STANDBY -> ACTIVE transitions)

Tier-2 server-side counter (1):
- BYPASSED_MUTATION_BLOCK_COUNT — emitted from IndexRegionObserver when a
  mutation bypasses the mutation-block check because no log group is present.
  Implemented as 3-file Hadoop-metrics2 source: interface +
  static factory (DefaultMetricsSystem.instance()) + impl.

Tests:
- HAGroupMetricsIT — IT covering all 8 client-side metrics
- BypassedMutationBlockMetricsIT — IT covering server-side bypass counter
- HighAvailabilityUtilTest — UT covering RetriesExhaustedWithDetailsException
  + IOException cause-chain MBIOE detection
- MetricsHaBypassSourceFactoryTest — UT covering factory thread-safety

Generated-by: Claude Code (Opus 4.7)
…+ IT catch narrowing + review nits)

Five must-fixes from the PR apache#2502 review:

1. HA_FAILOVER_COUNT semantic — moved increment inside the transition try and
   gated on transitionSucceeded so only successful policy transitions count;
   preserves the existing metric name. Gate decision factored into the
   package-private static HighAvailabilityGroup#shouldCountFailover for
   direct unit-test coverage of the negative path (see Nit 1 below).

2. HA_CRR_REFRESH_COUNT semantic — moved increment to after a successful
   getClusterRoleRecordFromEndpoint() so no-op refreshes inside the cache
   window do NOT inflate the counter (counter now measures fresh fetches,
   not refresh-method invocations).

3. HA_CRR_CACHE_AGE_MS sampling — added a poller-tick sample site so the
   gauge updates on every poller iteration via a new HighAvailabilityGroup
   #getCacheAgeMs() accessor. The connect()-site sample is retained.

4. BYPASSED_MUTATION_BLOCK_COUNT framing — rewrote the IndexRegionObserver
   inline comment and MetricsHaBypassSource Javadoc/descriptions as a
   path-coverage detector (counts the short-circuit code path; does NOT
   imply any safety property was breached).

5. HAGroupMetricsIT catch narrowing — narrowed the stale-CRR catch from
   the broad catch(Exception) to catch(SQLException) and asserted the
   error code is FAILOVER_IN_PROGRESS (the contracted surface from
   FailoverPhoenixConnection.wrapActionDuringFailover); added a LOG.info
   so the expected exception is recorded.
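
The gate from fix 1 can be sketched roughly as follows. This is a guess at the shape inferred from the five cases listed under Nit 1, not the actual HighAvailabilityGroup#shouldCountFailover: the real helper takes two ClusterRoleRecord arguments, whereas this sketch substitutes a hypothetical `Optional<String>` holding the active cluster URL (empty when no cluster is active).

```java
import java.util.Optional;

/** Illustrative only: decides whether a role-record change should count as a failover. */
final class FailoverGateSketch {

    static boolean shouldCountFailover(boolean transitionSucceeded,
                                       Optional<String> oldActiveUrl,
                                       Optional<String> newActiveUrl) {
        if (!transitionSucceeded) {
            return false; // failed policy callback: never count
        }
        if (!newActiveUrl.isPresent()) {
            return false; // transition into "no active cluster": not a completed failover
        }
        // Count only when the active URL actually changed, including recovery from no-active.
        return !newActiveUrl.equals(oldActiveUrl);
    }

    public static void main(String[] args) {
        Optional<String> a = Optional.of("cluster-a");
        Optional<String> b = Optional.of("cluster-b");
        Optional<String> none = Optional.empty();
        System.out.println(shouldCountFailover(true, a, b));     // prints true
        System.out.println(shouldCountFailover(true, a, a));     // prints false
        System.out.println(shouldCountFailover(true, a, none));  // prints false
        System.out.println(shouldCountFailover(false, a, b));    // prints false
        System.out.println(shouldCountFailover(true, none, a));  // prints true
    }
}
```

Keeping the decision in a pure static function is what makes the negative paths testable without standing up a cluster.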

Tests adjusted to match the new main-code semantics:

- HAGroupMetricsIT.testCrrRefreshCount — switched both refresh calls to
  force-refresh so the assertion exercises the actual-fetch path the
  counter now measures. Inline comment explains WHY non-force inside
  the cache window intentionally no longer increments.

Review-nit follow-ups:

- Nit 1 (negative-path coverage for Fix 1): extracted the failover-gate
  decision into a pure, package-private static helper
  HighAvailabilityGroup#shouldCountFailover(boolean, ClusterRoleRecord,
  ClusterRoleRecord) and added HighAvailabilityGroupTest
  #testShouldCountFailoverGate covering 5 cases: (a) real ACTIVE-URL
  move counts, (b) same-active-URL no-op does NOT count, (c) transition
  INTO no-active does NOT count, (d) transitionSucceeded=false (failed
  policy callback) does NOT count — the regression guard, and
  (e) recovery from no-active back to ACTIVE counts.

- Nit 2 (poller-tick gauge value): HAGroupMetricsIT.testPollerTickCount
  now also asserts GLOBAL_HA_CRR_CACHE_AGE_MS.getValue() >= 0L after
  the poller has ticked (guards against the -1L never-refreshed
  sentinel leaking out through the poller-tick sample site).

- Nit 3 (never-refreshed disambiguation): HighAvailabilityGroup
  #getCacheAgeMs() now returns -1L when lastClusterRoleRecordRefreshTime
  is 0, instead of 0L. This disambiguates "never sampled" from
  "refreshed within the same millisecond" on the gauge, and fixes
  a latent bug: because the CRR poller is scheduled with initial-delay
  0, its first tick can fire before init() seeds the timestamp; under
  the prior code the raw arithmetic now - 0 would publish a giant value
  (~currentTimeMillis()) to the gauge and spuriously trip every
  age > threshold alert. -1L publishes a clean "not yet sampled"
  marker. Javadoc documents the rationale + warns future readers not
  to revert to return 0. Connect()-site (state-gated) is unaffected
  and continues to use raw arithmetic.

- Nit 4 (logger arg): HAGroupMetricsIT.testStaleCrrDetectedCount LOG.info
  now passes testName.getMethodName() (was relying on TestName.toString()
  to do the right thing).
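
The never-refreshed sentinel from Nit 3 can be illustrated with the sketch below. It is an assumption-based stand-in, not the real HighAvailabilityGroup#getCacheAgeMs(): the class name, the `markRefreshed` method, and passing the current time as a parameter (to keep the example deterministic) are all hypothetical.

```java
/** Illustrative only: cache-age accessor with a -1 "never refreshed" sentinel. */
final class CacheAgeSketch {

    static final long NEVER_REFRESHED = -1L;

    // 0 means "no refresh has happened yet"; written by the refresher, read by the poller.
    private volatile long lastRefreshTimeMs;

    void markRefreshed(long nowMs) {
        lastRefreshTimeMs = nowMs;
    }

    long getCacheAgeMs(long nowMs) {
        long last = lastRefreshTimeMs;
        if (last == 0L) {
            // Without this guard, nowMs - 0 would publish a huge bogus age to the gauge.
            return NEVER_REFRESHED;
        }
        return nowMs - last;
    }

    public static void main(String[] args) {
        CacheAgeSketch cache = new CacheAgeSketch();
        System.out.println(cache.getCacheAgeMs(1_000L)); // prints -1
        cache.markRefreshed(1_000L);
        System.out.println(cache.getCacheAgeMs(1_000L)); // prints 0
        System.out.println(cache.getCacheAgeMs(1_500L)); // prints 500
    }
}
```

An alert written as `age > threshold` then ignores the sentinel naturally, while a dashboard can still distinguish "not yet sampled" from a fresh cache.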

Generated-by: Claude Code (Opus 4.7)