PHOENIX-7883: More metrics for EC index consumer #2543
Open
palashc wants to merge 3 commits into
Co-authored-by: Cursor <cursoragent@cursor.com>
- Keep in-memory progress in sync with durable tracker on the processCDCBatchGenerated give-up path (max data-visibility retries exhausted): previously the durable tracker advanced via updateTrackerProgress but the in-memory watermark stayed stale, so cdcIndexUpdateLag over-reported until the next empty poll or successful batch.
- In the serialized processCDCBatch inner loop, advance the in-memory watermark when rows exist but are all empty IndexMutations protos (no-op CDC entries). We have definitively scanned past newLastTimestamp; without this the watermark stayed fixed across the no-op burn-through.
- Bump DEFAULT_LAG_SAMPLE_INTERVAL_MS from 1000ms to 5000ms to reduce background histogram-update load on RegionServers hosting many EC-indexed regions. Operators can dial it back down via phoenix.index.cdc.consumer.lag.sample.interval.ms.
Co-authored-by: Cursor <cursoragent@cursor.com>
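The give-up-path fix in this commit can be pictured with a small sketch. It is not Phoenix code: `ProgressSketch`, its fields, and both method names are hypothetical stand-ins, and the exact nesting of the pre-fix gates is an assumption. It only illustrates why a watermark that lags behind the durable tracker makes the reported lag too large.

```java
/** Hypothetical stand-in for the consumer's progress state; not the Phoenix implementation. */
final class ProgressSketch {
    private long durableTimestamp;   // models what the durable tracker has persisted
    private long inMemoryWatermark;  // models what the lag metric reads

    long durableTimestamp() { return durableTimestamp; }
    long inMemoryWatermark() { return inMemoryWatermark; }

    /** Lag as the metric would see it: current time minus the in-memory watermark. */
    long lagMs(long nowMs) { return nowMs - inMemoryWatermark; }

    /** Assumed pre-fix shape: the watermark only moves when the batch produced row states. */
    void giveUpBeforeFix(long newLastTimestamp, boolean batchStatesEmpty) {
        if (newLastTimestamp > durableTimestamp) {
            durableTimestamp = newLastTimestamp;
            if (!batchStatesEmpty) {
                inMemoryWatermark = newLastTimestamp;
            }
        }
    }

    /** Post-fix shape: tracker and watermark advance under the same timestamp gate. */
    void giveUpAfterFix(long newLastTimestamp) {
        if (newLastTimestamp > durableTimestamp) {
            durableTimestamp = newLastTimestamp;
            inMemoryWatermark = newLastTimestamp;
        }
    }

    public static void main(String[] args) {
        ProgressSketch before = new ProgressSketch();
        before.giveUpBeforeFix(100L, true);
        ProgressSketch after = new ProgressSketch();
        after.giveUpAfterFix(100L);
        System.out.println("lag before fix: " + before.lagMs(150L));
        System.out.println("lag after fix:  " + after.lagMs(150L));
    }
}
```

In the pre-fix shape the durable timestamp moves to 100 while the watermark stays at its old value, so the computed lag keeps growing until some later event moves the watermark.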
…ion DESC
- cdcEventSkippedCount now increments by the number of CDC events dropped in each give-up event (not by 1). getDataRowStatesAndTimestamp reports the scanned row count via a new long[] out-param, mirroring the existing lastScannedTimestamp idiom. The WARN log line also carries the count.
- cdcParentReplayDuration DESC clarifies that the histogram measures this consumer's time inside processPartitionToCompletion, which may be less than end-to-end partition replay time when another consumer marks the partition COMPLETE first.
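For readers unfamiliar with the long[] out-param idiom mentioned above, here is a minimal sketch under stated assumptions. `SkipCountSketch`, `scanBatch`, and `giveUp` are hypothetical names, and the real getDataRowStatesAndTimestamp signature is not reproduced here; the sketch only shows a single-element array carrying a count back to the caller so the counter can grow by the number of events rather than by 1.

```java
/** Hypothetical illustration of single-element long[] out-params; not the Phoenix implementation. */
final class SkipCountSketch {
    private long cdcEventSkippedCount;

    long skippedCount() { return cdcEventSkippedCount; }

    /**
     * Walks a batch of event timestamps and reports two results through out-params:
     * the highest timestamp seen and the number of rows scanned.
     */
    static void scanBatch(long[] eventTimestamps, long[] lastScannedTimestamp,
            long[] scannedRowCount) {
        for (long ts : eventTimestamps) {
            lastScannedTimestamp[0] = Math.max(lastScannedTimestamp[0], ts);
            scannedRowCount[0]++;
        }
    }

    /** Models one give-up event: the counter grows by the number of dropped events. */
    long giveUp(long[] eventTimestamps) {
        long[] lastScannedTimestamp = {0L};
        long[] scannedRowCount = {0L};
        scanBatch(eventTimestamps, lastScannedTimestamp, scannedRowCount);
        cdcEventSkippedCount += scannedRowCount[0];
        return lastScannedTimestamp[0];
    }

    public static void main(String[] args) {
        SkipCountSketch sketch = new SkipCountSketch();
        long last = sketch.giveUp(new long[] {10L, 30L, 20L});
        System.out.println("advanced to " + last + ", skipped " + sketch.skippedCount());
    }
}
```

Counting events instead of give-up occurrences means an alert threshold on the counter tracks how many index updates were actually lost.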
What changes were proposed in this pull request?
Builds on PHOENIX-7884 (lag tracking refactor, #2506) with four new operational metrics for the eventually-consistent (EC) index IndexCDCConsumer, plus two correctness fixes in the lag watermark.
New metrics (all per-table + global, following the existing dual-emit pattern in MetricsIndexCDCConsumerSourceImpl):
- cdcEventSkippedCount: processCDCBatchGenerated give-up site when maxDataVisibilityRetries is exhausted and the consumer permanently advances past unprocessable CDC events. Surfaces silent data divergence between the data table and its EC indexes.
- cdcParentReplayActiveRegions: run() around the top-level replayAndCompleteParentRegions call (outside the recursive descent so ancestor recursion does not double-count).
- cdcParentReplayDuration: processPartitionToCompletion reaches a terminal state (marked COMPLETE here, or observed COMPLETE-by-sibling). Stopped/interrupted exits emit nothing.
- cdcConsumerActiveRegions: while (!stopped) loop in run() and decremented in a finally, so it strictly reflects steady-state processing and is semantically disjoint from cdcParentReplayActiveRegions. Sum of the two gauges = "consumer is doing useful work".
Lag-tracking fixes (built on PHOENIX-7884 watermark plumbing):
- processCDCBatchGenerated give-up path: pull progress.recordProcessed(newLastTimestamp) out of the !batchStates.isEmpty() gate into the existing newLastTimestamp > lastProcessedTimestamp gate so the in-memory watermark advances in lockstep with the durable tracker. Previously the watermark stayed stale until the next empty poll or successful batch, causing cdcIndexUpdateLag to over-report.
- processCDCBatch inner loop: when the CDC scan returns rows that are all empty IndexMutations protos (no-op CDC entries), advance progress.recordProcessed(newLastTimestamp). We have definitively scanned past those timestamps, and the watermark would otherwise stay fixed for the burn-through.
- Bump DEFAULT_LAG_SAMPLE_INTERVAL_MS from 1000 ms to 5000 ms to cut background histogram-update load on RegionServers hosting many EC-indexed regions. Tunable via phoenix.index.cdc.consumer.lag.sample.interval.ms.
Why are the changes needed?
After PHOENIX-7884 the lag histogram became more accurate but several operational blind spots remained:
- processCDCBatchGenerated permanently drops index updates with only a WARN log; there is no metric an SRE can alert on.
- cdcIndexUpdateLag inflates by design during catch-up; nothing else fires to disambiguate it.
Does this PR introduce any user-facing change?
No user-facing API/behavior change. New JMX metrics are additive (under RegionServer,sub=IndexCDCConsumer). One default config value changed (phoenix.index.cdc.consumer.lag.sample.interval.ms 1000 → 5000), overridable via the existing config knob.
How was this patch tested?
- mvn -pl phoenix-core-client,phoenix-core-server spotless:apply (no further changes required).
- mvn -pl phoenix-core-client,phoenix-core-server -am install -DskipTests: clean.
- finally blocks for the two gauges).
Was this patch authored or co-authored using generative AI tooling?
Generated-by: Cursor (Claude Opus 4.7)
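As a reviewer aid, the gauge discipline described in the proposed changes (two disjoint gauges, each decremented in a finally block) can be sketched as follows. This is an illustration under assumptions, not the patch itself: `ActiveRegionGaugesSketch` and the two `Runnable` parameters are hypothetical, and the real run() method does considerably more.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical illustration of two disjoint "active" gauges; not the Phoenix implementation. */
final class ActiveRegionGaugesSketch {
    final AtomicInteger parentReplayActive = new AtomicInteger();
    final AtomicInteger consumerActive = new AtomicInteger();
    private volatile boolean stopped;

    void stop() { stopped = true; }

    void run(Runnable parentReplay, Runnable steadyStateIteration) {
        // Gauge 1 covers only the top-level parent replay call.
        parentReplayActive.incrementAndGet();
        try {
            parentReplay.run();
        } finally {
            parentReplayActive.decrementAndGet();
        }
        // Gauge 2 covers only the steady-state loop, so the two never overlap.
        consumerActive.incrementAndGet();
        try {
            while (!stopped) {
                steadyStateIteration.run();
            }
        } finally {
            // Runs on normal exit and on exceptions, so the gauge cannot leak.
            consumerActive.decrementAndGet();
        }
    }

    public static void main(String[] args) {
        ActiveRegionGaugesSketch gauges = new ActiveRegionGaugesSketch();
        gauges.run(
            () -> System.out.println("replaying, gauge = " + gauges.parentReplayActive.get()),
            () -> {
                System.out.println("consuming, gauge = " + gauges.consumerActive.get());
                gauges.stop();
            });
    }
}
```

Because each gauge is scoped to exactly one phase, their sum at any instant is either 0 or 1 per region, which is what makes the "consumer is doing useful work" reading possible.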