PHOENIX-7883 : More metrics for EC index consumer by palashc · Pull Request #2543 · apache/phoenix

palashc · 2026-06-23T21:40:39Z

What changes were proposed in this pull request?

Builds on PHOENIX-7884 (lag tracking refactor, #2506) with four new operational metrics for the eventually-consistent (EC) index IndexCDCConsumer, plus two correctness fixes in the lag watermark.

New metrics (all per-table + global, following the existing dual-emit pattern in MetricsIndexCDCConsumerSourceImpl):

| Metric | Type | Purpose |
|---|---|---|
| `cdcEventSkippedCount` | Counter | Increments at the `processCDCBatchGenerated` give-up site when `maxDataVisibilityRetries` is exhausted and the consumer permanently advances past unprocessable CDC events. Surfaces silent data divergence between the data table and its EC indexes. |
| `cdcParentReplayActiveRegions` | Gauge | "How many regions on this RS are currently in post-split / post-merge parent-region replay for this table?" Lets operators distinguish the by-design lag spike during catch-up from a broken consumer. Incremented before and decremented after the top-level `replayAndCompleteParentRegions` call in `run()` (outside the recursive descent, so ancestor recursion does not double-count). |
| `cdcParentReplayDuration` | Histogram (ms) | One sample per ancestor partition when `processPartitionToCompletion` reaches a terminal state (marked COMPLETE here, or observed COMPLETE-by-sibling). Stopped/interrupted exits emit nothing. |
| `cdcConsumerActiveRegions` | Gauge | "How many consumers are in the steady-state poll loop for this table on this RS?" Incremented immediately before the main `while (!stopped)` loop in `run()` and decremented in a `finally`, so it strictly reflects steady-state processing and is semantically disjoint from `cdcParentReplayActiveRegions`. The sum of the two gauges answers "is the consumer doing useful work?" |
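The disjoint lifecycle of the two gauges can be pictured with a minimal sketch. This is not the Phoenix implementation: the class name, the `AtomicLong` stand-ins for the gauges, and the `run(...)` signature are all hypothetical, and the replay and poll steps are passed in as plain `Runnable`s.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical model of the consumer's run() lifecycle; names and types are illustrative only. */
class ConsumerLifecycleSketch {
    // Stand-ins for cdcParentReplayActiveRegions and cdcConsumerActiveRegions.
    static final AtomicLong PARENT_REPLAY_ACTIVE = new AtomicLong();
    static final AtomicLong CONSUMER_ACTIVE = new AtomicLong();

    private volatile boolean stopped;

    void stop() {
        stopped = true;
    }

    void run(Runnable replayParents, Runnable pollOnce, int maxPolls) {
        // Replay gauge wraps only the top-level replay call, so any recursion
        // inside replayParents is counted once.
        PARENT_REPLAY_ACTIVE.incrementAndGet();
        try {
            replayParents.run();
        } finally {
            PARENT_REPLAY_ACTIVE.decrementAndGet();
        }

        // Steady-state gauge wraps only the poll loop, so the two gauges
        // are never both raised by the same consumer.
        CONSUMER_ACTIVE.incrementAndGet();
        try {
            for (int polls = 0; !stopped && polls < maxPolls; polls++) {
                pollOnce.run();
            }
        } finally {
            CONSUMER_ACTIVE.decrementAndGet();
        }
    }
}
```

Because both decrements sit in `finally` blocks, an exception thrown during replay or polling still returns the corresponding gauge to its previous value.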

Lag-tracking fixes (built on PHOENIX-7884 watermark plumbing):

- `processCDCBatchGenerated` give-up path: pull `progress.recordProcessed(newLastTimestamp)` out of the `!batchStates.isEmpty()` gate into the existing `newLastTimestamp > lastProcessedTimestamp` gate so the in-memory watermark advances in lockstep with the durable tracker. Previously the watermark stayed stale until the next empty poll or successful batch, causing `cdcIndexUpdateLag` to over-report.
- `processCDCBatch` inner loop: when the CDC scan returns rows that are all empty `IndexMutations` protos (no-op CDC entries), call `progress.recordProcessed(newLastTimestamp)`. The consumer has definitively scanned past those timestamps, and the watermark would otherwise stay fixed for the duration of the burn-through.
- Bump `DEFAULT_LAG_SAMPLE_INTERVAL_MS` from 1000 ms to 5000 ms to cut background histogram-update load on RegionServers hosting many EC-indexed regions. Tunable via `phoenix.index.cdc.consumer.lag.sample.interval.ms`.
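The give-up-path fix amounts to moving one call from one gate to another. The sketch below models that under stated assumptions: the class, its two fields, and the `giveUp` signature are hypothetical stand-ins, with `inMemoryWatermark` playing the role of what `progress.recordProcessed` updates and `lastProcessedTimestamp` playing the role of the durable tracker.

```java
/** Hypothetical model of the give-up path after the fix; not the actual Phoenix code. */
class GiveUpPathSketch {
    long lastProcessedTimestamp; // stand-in for the durable tracker position
    long inMemoryWatermark;      // stand-in for the watermark the lag metric reads

    void giveUp(long newLastTimestamp, boolean batchStatesEmpty) {
        if (!batchStatesEmpty) {
            // Work that depends on batch state would go here (elided).
            // Before the fix, the watermark update also lived in this gate,
            // so an empty batch left the watermark stale.
        }
        if (newLastTimestamp > lastProcessedTimestamp) {
            lastProcessedTimestamp = newLastTimestamp; // durable tracker advances
            inMemoryWatermark = newLastTimestamp;      // watermark advances in the same gate
        }
    }
}
```

With both updates in the same gate, the two positions cannot drift apart on this path, and a timestamp that does not move the tracker forward does not move the watermark either.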

Why are the changes needed?

After PHOENIX-7884 the lag histogram became more accurate but several operational blind spots remained:

- Silent data divergence: the give-up branch in `processCDCBatchGenerated` permanently drops index updates with only a WARN log and no metric an SRE can alert on.
- Post-split lag spikes are indistinguishable from broken consumers: parent-region replay deliberately does not advance the freshness watermark (parent freshness ≠ child freshness), so `cdcIndexUpdateLag` inflates by design during catch-up; nothing else fires to disambiguate it.
- Liveness ambiguity: a daemon consumer thread that exits cleanly (no EC index) is indistinguishable from one that crashed mid-run, and the lag histogram is silent in both cases.
- The give-up-path watermark staleness and no-op-burn-through watermark stagnation (both fixed here) caused the new lag metric to over-report under exactly the conditions where accuracy matters most.

Does this PR introduce any user-facing change?

No user-facing API/behavior change. New JMX metrics are additive (under RegionServer,sub=IndexCDCConsumer). One default config value changed (phoenix.index.cdc.consumer.lag.sample.interval.ms 1000 → 5000), overridable via existing config knob.

How was this patch tested?

- `mvn -pl phoenix-core-client,phoenix-core-server spotless:apply` (no further changes required).
- `mvn -pl phoenix-core-client,phoenix-core-server -am install -DskipTests` completed cleanly.
- No new test infrastructure was added for the new metrics. They are additive per-table and global counters, gauges, and a histogram that follow the established dual-emit pattern and are wired at unique, single-purpose call sites (one each for the skip counter and the histogram; symmetric inc/dec in `finally` blocks for the two gauges).
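For readers unfamiliar with the dual-emit pattern mentioned above, a rough sketch follows. It is an assumption about the general shape only (every update lands in both a global metric and a per-table metric); the class and method names are hypothetical and do not reflect `MetricsIndexCDCConsumerSourceImpl`.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical dual-emit counter: each update is recorded globally and per table. */
class DualEmitSketch {
    private final LongAdder global = new LongAdder();
    private final Map<String, LongAdder> perTable = new ConcurrentHashMap<>();

    void incrementSkipped(String tableName, long events) {
        global.add(events);
        perTable.computeIfAbsent(tableName, t -> new LongAdder()).add(events);
    }

    long globalCount() {
        return global.sum();
    }

    long tableCount(String tableName) {
        LongAdder adder = perTable.get(tableName);
        return adder == null ? 0L : adder.sum();
    }
}
```

A unit test for a metric wired this way would typically assert that one call moves both the global value and the value for the named table, and leaves other tables untouched.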

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Cursor (Claude Opus 4.7)

Co-authored-by: Cursor <cursoragent@cursor.com>

- Keep in-memory progress in sync with durable tracker on the processCDCBatchGenerated give-up path (max data-visibility retries exhausted): previously the durable tracker advanced via updateTrackerProgress but the in-memory watermark stayed stale, so cdcIndexUpdateLag over-reported until the next empty poll or successful batch.
- In the serialized processCDCBatch inner loop, advance the in-memory watermark when rows exist but are all empty IndexMutations protos (no-op CDC entries). We have definitively scanned past newLastTimestamp; without this the watermark stayed fixed across the no-op burn-through.
- Bump DEFAULT_LAG_SAMPLE_INTERVAL_MS from 1000ms to 5000ms to reduce background histogram-update load on RegionServers hosting many EC-indexed regions. Operators can dial it back down via phoenix.index.cdc.consumer.lag.sample.interval.ms.

Co-authored-by: Cursor <cursoragent@cursor.com>

…ion DESC

- cdcEventSkippedCount now increments by the number of CDC events dropped in each give-up event (not by 1). getDataRowStatesAndTimestamp reports the scanned row count via a new long[] out-param, mirroring the existing lastScannedTimestamp idiom. The WARN log line also carries the count.
- cdcParentReplayDuration DESC clarifies that the histogram measures this consumer's time inside processPartitionToCompletion, which may be less than end-to-end partition replay time when another consumer marks the partition COMPLETE first.
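The `long[]` out-param idiom described in this commit can be sketched as follows. Everything here is illustrative: `scanRows` is a hypothetical stand-in for the scanning method (it is not `getDataRowStatesAndTimestamp`), rows are reduced to bare timestamps, and an `AtomicLong` stands in for the counter.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch of counting dropped events through a long[] out-param. */
class SkipCountSketch {
    // Stand-in for cdcEventSkippedCount.
    static final AtomicLong SKIPPED = new AtomicLong();

    /** Returns the last scanned timestamp and adds the row count to scannedRowCount[0]. */
    static long scanRows(List<Long> rowTimestamps, long[] scannedRowCount) {
        long lastScannedTimestamp = -1L;
        for (long ts : rowTimestamps) {
            lastScannedTimestamp = Math.max(lastScannedTimestamp, ts);
            scannedRowCount[0]++;
        }
        return lastScannedTimestamp;
    }

    /** Give-up path: the counter grows by the number of events dropped, not by 1. */
    static long giveUp(List<Long> rowTimestamps) {
        long[] scannedRowCount = new long[1];
        long lastScannedTimestamp = scanRows(rowTimestamps, scannedRowCount);
        SKIPPED.addAndGet(scannedRowCount[0]);
        return lastScannedTimestamp;
    }
}
```

The single-element array lets the scan return its primary result (the timestamp) while still reporting a second value, without introducing a dedicated result type.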

virajjasani · 2026-06-24T16:35:39Z

https://ci-hadoop.apache.org/job/Phoenix/job/Phoenix-PreCommit-GitHub-PR/job/PR-2543/

palashc and others added 2 commits June 23, 2026 13:57

PHOENIX-7883 : More metrics for EC index consumer

852d845

Co-authored-by: Cursor <cursoragent@cursor.com>

palashc requested a review from virajjasani June 23, 2026 21:40

palashc wants to merge 3 commits into apache:master from palashc:PHOENIX-7883


2 participants
