PHOENIX-7883 : More metrics for EC index consumer #2543

Open
palashc wants to merge 3 commits into apache:master from palashc:PHOENIX-7883

@palashc palashc commented Jun 23, 2026

What changes were proposed in this pull request?

Builds on PHOENIX-7884 (lag tracking refactor, #2506) by adding four new operational metrics for the eventually-consistent (EC) index IndexCDCConsumer, plus two correctness fixes in the lag watermark and a larger default lag-sampling interval.

New metrics (all per-table + global, following the existing dual-emit pattern in MetricsIndexCDCConsumerSourceImpl):

| Metric | Type | Purpose |
| --- | --- | --- |
| cdcEventSkippedCount | Counter | Increments at the processCDCBatchGenerated give-up site when maxDataVisibilityRetries is exhausted and the consumer permanently advances past unprocessable CDC events. Surfaces silent data divergence between the data table and its EC indexes. |
| cdcParentReplayActiveRegions | Gauge | "How many regions on this RS are currently in post-split / post-merge parent-region replay for this table?" Lets operators distinguish the by-design lag spike during catch-up from a broken consumer. Incremented in run() around the top-level replayAndCompleteParentRegions call (outside the recursive descent so ancestor recursion does not double-count). |
| cdcParentReplayDuration | Histogram (ms) | One sample per ancestor partition when processPartitionToCompletion reaches a terminal state (marked COMPLETE here, or observed COMPLETE-by-sibling). Stopped/interrupted exits emit nothing. |
| cdcConsumerActiveRegions | Gauge | "How many consumers are in the steady-state poll loop for this table on this RS?" Incremented immediately before the main while (!stopped) loop in run() and decremented in a finally, so it strictly reflects steady-state processing and is semantically disjoint from cdcParentReplayActiveRegions. Sum of the two gauges = "consumer is doing useful work". |
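
The lifecycle of the two gauges described above can be pictured with a minimal sketch. This is not the code from the patch: `ConsumerGauges`, `ConsumerLifecycleSketch`, and the `Runnable` parameters are hypothetical stand-ins, and plain `AtomicLong` fields replace the real metrics source. It only illustrates the symmetric increment/decrement in `finally` blocks and why the two gauges never overlap for a single consumer.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical stand-in for the two gauges; not the Phoenix metrics source API. */
final class ConsumerGauges {
    final AtomicLong parentReplayActiveRegions = new AtomicLong();
    final AtomicLong consumerActiveRegions = new AtomicLong();
}

/** Hypothetical consumer skeleton mirroring the run() structure described in the table. */
final class ConsumerLifecycleSketch {
    private final ConsumerGauges gauges;
    private volatile boolean stopped;

    ConsumerLifecycleSketch(ConsumerGauges gauges) {
        this.gauges = gauges;
    }

    void stop() {
        stopped = true;
    }

    void run(Runnable replayParents, Runnable pollOnce) {
        // Phase 1: parent-region replay. The gauge wraps only the top-level call,
        // so any recursion inside replayParents cannot double-count.
        gauges.parentReplayActiveRegions.incrementAndGet();
        try {
            replayParents.run();
        } finally {
            gauges.parentReplayActiveRegions.decrementAndGet();
        }

        // Phase 2: steady-state poll loop. The finally block guarantees the
        // decrement on normal exit and on an exception alike.
        gauges.consumerActiveRegions.incrementAndGet();
        try {
            while (!stopped) {
                pollOnce.run();
            }
        } finally {
            gauges.consumerActiveRegions.decrementAndGet();
        }
    }
}
```

Because each phase owns its own try/finally, at most one of the two gauges is non-zero for a given consumer at any instant, which is what makes their sum meaningful.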

Lag-tracking fixes (built on PHOENIX-7884 watermark plumbing):

  • processCDCBatchGenerated give-up path: pull progress.recordProcessed(newLastTimestamp) out of the !batchStates.isEmpty() gate into the existing newLastTimestamp > lastProcessedTimestamp gate so the in-memory watermark advances in lockstep with the durable tracker. Previously the watermark stayed stale until the next empty poll or successful batch, causing cdcIndexUpdateLag to over-report.
  • processCDCBatch inner loop: when the CDC scan returns rows that are all empty IndexMutations protos (no-op CDC entries), advance progress.recordProcessed(newLastTimestamp) — we have definitively scanned past those timestamps and the watermark would otherwise stay fixed for the burn-through.
  • Bump DEFAULT_LAG_SAMPLE_INTERVAL_MS from 1000 ms to 5000 ms to cut background histogram-update load on RegionServers hosting many EC-indexed regions. Tunable via phoenix.index.cdc.consumer.lag.sample.interval.ms.
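
The give-up-path fix in the first bullet can be sketched as follows. The class and field names here are hypothetical, the durable tracker is reduced to a single `long`, and the lag formula (wall-clock time minus watermark) is an assumption about how the watermark feeds cdcIndexUpdateLag; only `recordProcessed`, `batchStates`, `newLastTimestamp`, and `lastProcessedTimestamp` are names taken from the description.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical model of the give-up path; not the actual IndexCDCConsumer code. */
final class WatermarkSketch {

    /** Hypothetical in-memory progress holder. Monotonic max is an assumption. */
    static final class Progress {
        private final AtomicLong lastProcessed = new AtomicLong();

        void recordProcessed(long timestamp) {
            lastProcessed.accumulateAndGet(timestamp, Math::max);
        }

        long watermark() {
            return lastProcessed.get();
        }
    }

    long durableTimestamp; // stands in for the durable tracker row
    final Progress progress = new Progress();

    /** Give-up path after the fix: both trackers advance under the same gate. */
    void giveUp(List<String> batchStates, long newLastTimestamp, long lastProcessedTimestamp) {
        if (newLastTimestamp > lastProcessedTimestamp) {
            durableTimestamp = newLastTimestamp;
            // Before the fix this call sat behind !batchStates.isEmpty(),
            // so an empty batch left the in-memory watermark stale.
            progress.recordProcessed(newLastTimestamp);
        }
    }

    /** Assumed lag definition: how far the watermark trails the current time. */
    static long lagMs(long nowMs, long watermarkMs) {
        return Math.max(0L, nowMs - watermarkMs);
    }
}
```

With an empty `batchStates` and `newLastTimestamp` ahead of `lastProcessedTimestamp`, the sketch moves both values together, so a lag computed from the in-memory watermark no longer includes the skipped range.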

Why are the changes needed?

After PHOENIX-7884 the lag histogram became more accurate but several operational blind spots remained:

  1. Silent data divergence: the give-up branch in processCDCBatchGenerated permanently drops index updates with only a WARN log — no metric an SRE can alert on.
  2. Post-split lag spikes are indistinguishable from broken consumers: parent-region replay deliberately does not advance the freshness watermark (parent freshness ≠ child freshness), so cdcIndexUpdateLag inflates by design during catch-up; nothing else fires to disambiguate it.
  3. Liveness ambiguity: a daemon consumer thread that exits cleanly (no EC index) is indistinguishable from one that crashed mid-run, and the lag histogram is silent in both cases.
  4. The give-up-path watermark staleness and no-op-burn-through watermark stagnation (both fixed here) caused the new lag metric to over-report under exactly the conditions where accuracy matters most.

Does this PR introduce any user-facing change?

No user-facing API/behavior change. New JMX metrics are additive (under RegionServer,sub=IndexCDCConsumer). One default config value changed (phoenix.index.cdc.consumer.lag.sample.interval.ms 1000 → 5000), overridable via the existing config knob.

How was this patch tested?

  • mvn -pl phoenix-core-client,phoenix-core-server spotless:apply (no further changes required).
  • mvn -pl phoenix-core-client,phoenix-core-server -am install -DskipTests (builds cleanly).
  • No new test infrastructure added for the new metrics — they are additive per-table+global counters/gauges/histogram following the established dual-emit pattern, wired at unique, single-purpose call sites (one each for the skip counter and the histogram; symmetric inc/dec in finally blocks for the two gauges).
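
For readers unfamiliar with the dual-emit pattern mentioned above, here is a minimal sketch of the idea: every update is applied to both a global counter and a per-table counter. `DualEmitCounterSketch` and its methods are hypothetical; the actual MetricsIndexCDCConsumerSourceImpl API is not shown in this PR description and may differ.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical dual-emit counter: one global value plus one value per table. */
final class DualEmitCounterSketch {
    private final AtomicLong global = new AtomicLong();
    private final Map<String, AtomicLong> perTable = new ConcurrentHashMap<>();

    /** Applies the same delta to the global counter and the table's counter. */
    void increment(String tableName, long delta) {
        global.addAndGet(delta);
        perTable.computeIfAbsent(tableName, t -> new AtomicLong()).addAndGet(delta);
    }

    long global() {
        return global.get();
    }

    /** Returns 0 for a table that has never been incremented. */
    long forTable(String tableName) {
        AtomicLong counter = perTable.get(tableName);
        return counter == null ? 0L : counter.get();
    }
}
```

A check along the lines of "increment for two tables, then compare the global value with the sum of the per-table values" is the kind of assertion such a structure makes easy to write.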

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Cursor (Claude Opus 4.7)

palashc and others added 2 commits June 23, 2026 13:57
Co-authored-by: Cursor <cursoragent@cursor.com>
- Keep in-memory progress in sync with durable tracker on the
  processCDCBatchGenerated give-up path (max data-visibility retries
  exhausted): previously the durable tracker advanced via
  updateTrackerProgress but the in-memory watermark stayed stale, so
  cdcIndexUpdateLag over-reported until the next empty poll or
  successful batch.
- In the serialized processCDCBatch inner loop, advance the in-memory
  watermark when rows exist but are all empty IndexMutations protos
  (no-op CDC entries). We have definitively scanned past
  newLastTimestamp; without this the watermark stayed fixed across
  the no-op burn-through.
- Bump DEFAULT_LAG_SAMPLE_INTERVAL_MS from 1000ms to 5000ms to reduce
  background histogram-update load on RegionServers hosting many
  EC-indexed regions. Operators can dial it back down via
  phoenix.index.cdc.consumer.lag.sample.interval.ms.

Co-authored-by: Cursor <cursoragent@cursor.com>
@palashc palashc requested a review from virajjasani June 23, 2026 21:40
…ion DESC

- cdcEventSkippedCount now increments by the number of CDC events
  dropped in each give-up event (not by 1). getDataRowStatesAndTimestamp
  reports the scanned row count via a new long[] out-param, mirroring
  the existing lastScannedTimestamp idiom. The WARN log line also
  carries the count.
- cdcParentReplayDuration DESC clarifies that the histogram measures
  this consumer's time inside processPartitionToCompletion, which may
  be less than end-to-end partition replay time when another consumer
  marks the partition COMPLETE first.
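
The out-parameter idiom described in the first bullet of this commit can be sketched as below. The `scan` helper is a hypothetical stand-in for getDataRowStatesAndTimestamp (whose real signature is not shown here), rows are modelled as strings, and an empty string stands in for a row that yields no state. The point is only that the counter advances by the number of scanned rows rather than by 1.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical model of counting dropped CDC events on the give-up path. */
final class SkipCountSketch {
    final AtomicLong cdcEventSkippedCount = new AtomicLong();

    /**
     * Hypothetical scan helper: returns the non-empty row states and reports
     * how many rows were scanned through a single-element long[] out-param.
     */
    static List<String> scan(List<String> rows, long[] scannedRowCount) {
        List<String> states = new ArrayList<>();
        for (String row : rows) {
            scannedRowCount[0]++;
            if (!row.isEmpty()) {
                states.add(row);
            }
        }
        return states;
    }

    /** On give-up, count every scanned row as skipped, not just one event. */
    void onGiveUp(List<String> rows) {
        long[] scanned = new long[1];
        scan(rows, scanned);
        cdcEventSkippedCount.addAndGet(scanned[0]);
    }
}
```

Counting rows rather than give-up occurrences lets an alert threshold reflect how much data diverged, not merely how often the consumer gave up.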