[changelog-consumer] Opt-in periodic offset commit to enable Kafka consumer-group lag monitoring#2789

Open
minhmo1620 wants to merge 6 commits into
linkedin:mainfrom
minhmo1620:minnguyen/expose-consumer-group-veng-12668

@minhmo1620 minhmo1620 commented May 11, 2026

Problem Statement

We want to leverage Kafka's existing consumer-group lag mechanism to detect laggy VeniceChangelogConsumer (DVRT) jobs. Kafka already computes lag for any consumer that commits to __consumer_offsets — the broker has the latest committed offset, the high-watermark, and exposes the difference through every standard consumer-group tool (kafka-consumer-groups.sh, Burrow, lag dashboards, regex-based alerts). Reusing this means we don't have to stand up a parallel lag-monitoring stack just for DVRT.

Today DVRT can't plug into this. It never writes to __consumer_offsets, so from the broker's point of view the consumer group doesn't exist, and every tool that reads broker-side lag returns empty.
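
The lag that these tools report is, per partition, the latest offset on the broker minus the group's committed offset. A minimal sketch of that arithmetic follows; `ConsumerGroupLagSketch` and `lagByPartition` are hypothetical names used only for illustration and are not part of Venice or Kafka.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Venice or Kafka: illustrates the
// per-partition lag arithmetic that consumer-group tools apply to broker data.
final class ConsumerGroupLagSketch {
  /**
   * lag = latest offset - committed offset, per partition. A partition with no
   * committed offset is left out, which mirrors why tools report nothing for a
   * group that never commits.
   */
  static Map<Integer, Long> lagByPartition(
      Map<Integer, Long> latestOffsets,
      Map<Integer, Long> committedOffsets) {
    Map<Integer, Long> lag = new HashMap<>();
    for (Map.Entry<Integer, Long> entry : latestOffsets.entrySet()) {
      Long committed = committedOffsets.get(entry.getKey());
      if (committed != null) {
        lag.put(entry.getKey(), Math.max(0L, entry.getValue() - committed));
      }
    }
    return lag;
  }
}
```

With no committed offsets at all, the returned map is empty, which is the situation DVRT is in today.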

Solution

Let the changelog consumer optionally publish its current position to __consumer_offsets purely so external tools can see it. Source-of-truth for offsets stays with the caller — Kafka-side commits are never read back on restart and a failed commit is swallowed.

  • Add commitOffsets() to VeniceChangelogConsumer (default no-op; Kafka-backed impl delegates to the underlying consumer's commitSync()).
  • Call it periodically from poll() at most once per consumerOffsetCommitIntervalMs. The interval is measured after each commit returns so a slow commit can't trigger an immediate re-commit on the next poll.

Opting in is fully explicit and requires two knobs. Commits happen only when both of the following hold:

  1. ChangelogClientConfig.consumerOffsetCommitIntervalMs > 0 (default 0 = off), and
  2. A Kafka group.id is set on the consumer properties (caller configures this on the existing ChangelogClientConfig.consumerProperties bag).

If either is missing, the path short-circuits silently: no Kafka call, no log noise. Existing callers see zero behavior change. No new factory overload or Properties parameter is introduced; callers wire both knobs through the existing ChangelogClientConfig API.
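
The two-knob rule can be sketched as a single predicate. This is an illustration only: `OffsetCommitOptInSketch` and `commitsEnabled` are hypothetical names, the plain `group.id` key is taken from the description above (the real property key in Venice may carry a prefix), and in the PR the two checks live in different layers rather than in one method.

```java
import java.util.Properties;

// Hypothetical sketch of the two-knob opt-in; the class and method names are
// illustrative and are not taken from the Venice code base.
final class OffsetCommitOptInSketch {
  // Assumption: the unprefixed Kafka consumer property name.
  static final String GROUP_ID_KEY = "group.id";

  /** Commits are enabled only when the interval is positive AND group.id is set. */
  static boolean commitsEnabled(long consumerOffsetCommitIntervalMs, Properties consumerProperties) {
    if (consumerOffsetCommitIntervalMs <= 0) {
      return false; // default 0 = off
    }
    String groupId = consumerProperties.getProperty(GROUP_ID_KEY);
    return groupId != null && !groupId.isEmpty();
  }

  public static void main(String[] args) {
    Properties props = new Properties();
    System.out.println(commitsEnabled(30_000L, props)); // interval set, group.id missing
    props.setProperty(GROUP_ID_KEY, "my-view-job");     // hypothetical group name
    System.out.println(commitsEnabled(30_000L, props)); // both knobs set
    System.out.println(commitsEnabled(0L, props));      // interval left at its default
  }
}
```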

Code changes

  • Added new code behind a config. ChangelogClientConfig.consumerOffsetCommitIntervalMs (default 0 = off). To enable, the caller must additionally set group.id on the consumer properties.
  • Introduced new log lines. One LOGGER.warn (with stack trace) for actual broker-side commit failures. No log noise when commits are disabled or when group.id is unset — the path short-circuits silently.

Concurrency-Specific Checks

  • Uses existing locks (consumerLock, subscriptionLock) — no new synchronization primitives.
  • No new blocking calls inside critical sections.
  • No new shared collections.
  • Commit failures are caught and logged in two layers; they never propagate to the caller.

How was this PR tested?

  • New unit tests added across the adapter and the changelog impl:
    • commitSync() delegates to the underlying Kafka consumer when group.id is configured.
    • commitSync() short-circuits silently when group.id is missing (no Kafka call, no log).
    • commitSync() swallows real broker exceptions (logged once with stack).
    • commitOffsets() delegates to the underlying pub-sub consumer.
    • Periodic commit cadence is respected when the interval is positive; the default-zero / explicit-zero path doesn't commit at all.
  • Modified or extended existing tests.
  • Verified backward compatibility — interface additions are default no-ops; the commit path is fully off by default.

Does this PR introduce any user-facing or breaking changes?

  • No. All additions are backwards-compatible and the new commit path is off by default.

…Kafka property overrides

VeniceChangelogConsumer can now commit its current positions back to the
underlying pub-sub broker so external tools (e.g. xinfra consumer-group lag
dashboards) can observe progress. This is monitoring-only — the caller's
checkpoint state remains the source of truth for offsets; on restart the
consumer re-seeks to the caller's checkpoint and the Kafka-side committed
offset is ignored. A failed commit is non-fatal.

Wires VeniceChangelogConsumerImpl.internalPoll(long) to call a new
commitOffsets() at most once per
ChangelogClientConfig.consumerOffsetCommitIntervalMs (default 30s, set to 0
to disable). VeniceChangelogConsumerClientFactory gets a new overload
accepting a Properties bag of per-call consumer overrides so callers can
set e.g. pubsub.kafka.consumer.group.id deterministically per view job.

Context: enables Proteus to monitor Venice CDC Flink jobs via the same
Kafka consumer-group regex pattern they use for every other Flink job.
See VENG-12668.

Tests:
- ApacheKafkaConsumerAdapterTest: commitSync delegates + swallows
- VeniceChangelogConsumerImplTest: config default, commitOffsets delegation,
  periodic commit cadence, zero-interval escape hatch
- VeniceChangelogConsumerClientFactoryTest: per-call overrides land on the
  resolved per-store config and don't leak globally
@minhmo1620 minhmo1620 marked this pull request as ready for review May 12, 2026 20:33
Copilot AI review requested due to automatic review settings May 12, 2026 20:33

Copilot AI left a comment

Pull request overview

This PR adds an optional “monitoring-only” offset commit path for Venice changelog consumers so external Kafka consumer-group monitoring tools can observe lag, and adds factory overloads to allow per-call Kafka consumer property overrides (e.g., group.id) without mutating global config.

Changes:

  • Add commitSync() to PubSubConsumerAdapter (default no-op) and implement it in ApacheKafkaConsumerAdapter by delegating to Kafka commitSync() while swallowing failures.
  • Add commitOffsets() to VeniceChangelogConsumer (default no-op) and periodically invoke it from poll() in the changelog consumer implementation, gated by consumerOffsetCommitIntervalMs.
  • Add ChangelogClientConfig.consumerOffsetCommitIntervalMs (default 30s; 0 disables) and add Properties override overloads in VeniceChangelogConsumerClientFactory with tests.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| internal/venice-common/src/main/java/com/linkedin/venice/pubsub/api/PubSubConsumerAdapter.java | Adds a monitoring-only commitSync() default method to the pub-sub consumer abstraction. |
| internal/venice-common/src/main/java/com/linkedin/venice/pubsub/adapter/kafka/consumer/ApacheKafkaConsumerAdapter.java | Implements monitoring-only commits by delegating to Kafka commitSync() and swallowing errors. |
| clients/da-vinci-client/src/main/java/com/linkedin/davinci/consumer/VeniceChangelogConsumer.java | Adds a monitoring-only commitOffsets() default method to the changelog consumer API. |
| clients/da-vinci-client/src/main/java/com/linkedin/davinci/consumer/VeniceChangelogConsumerImpl.java | Periodically calls commitOffsets() from poll() based on the new interval config. |
| clients/da-vinci-client/src/main/java/com/linkedin/davinci/consumer/VeniceChangelogConsumerClientFactory.java | Adds overloads to merge per-call consumer Properties into resolved per-store configs without leaking globally. |
| clients/da-vinci-client/src/main/java/com/linkedin/davinci/consumer/ChangelogClientConfig.java | Introduces consumerOffsetCommitIntervalMs config with clone support. |
| internal/venice-common/src/test/java/com/linkedin/venice/pubsub/adapter/kafka/consumer/ApacheKafkaConsumerAdapterTest.java | Adds unit tests for commitSync() delegation and exception swallowing. |
| clients/da-vinci-client/src/test/java/com/linkedin/davinci/consumer/VeniceChangelogConsumerImplTest.java | Adds unit tests for commit interval config, delegation, cadence, and interval=0 behavior. |
| clients/da-vinci-client/src/test/java/com/linkedin/davinci/consumer/VeniceChangelogConsumerClientFactoryTest.java | Adds test verifying per-call consumer overrides merge without mutating global config. |

Minh Nguyen added 2 commits May 12, 2026 13:45
group.id; preserve cadence after slow commit

Two fixes from the Copilot review on PR linkedin#2789:

1. ApacheKafkaConsumerAdapter.commitSync() now silently no-ops when
   group.id is not configured on the consumer. Without this, existing
   callers that didn't opt into broker-side monitoring saw a WARN every
   commit interval (default 30s) as Kafka threw IllegalStateException.
   For real broker failures, log the exception with stack (not just
   getMessage) so production issues are debuggable.

2. VeniceChangelogConsumerImpl.maybeCommitOffsets() now captures
   lastCommitTimeMs *after* the commit returns instead of before.
   Previously, a slow commit (e.g. broker hiccup) would cause the next
   poll to immediately re-commit because the timestamp didn't account
   for commit duration.

Tests:
- testCommitSyncShortCircuitsWhenNoGroupId — new
- testCommitSyncDelegatesToKafkaConsumerWhenGroupIdConfigured —
  renamed, now explicitly sets group.id in mocked consumer properties
- testCommitSyncSwallowsExceptions — updated to set group.id so the
  swallow path is actually exercised
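
The adapter-level behavior in fix 1 can be sketched roughly as follows. `BrokerClient` and `MonitoringCommitAdapterSketch` are hypothetical stand-ins for the Kafka consumer and ApacheKafkaConsumerAdapter, and a counter stands in for the LOGGER.warn call; the real implementation may differ in how it detects group.id.

```java
import java.util.Properties;

// Hypothetical stand-in for the underlying Kafka consumer.
interface BrokerClient {
  void commitSync(); // may throw on a broker-side failure
}

// Hypothetical sketch, not the real ApacheKafkaConsumerAdapter.
final class MonitoringCommitAdapterSketch {
  private final BrokerClient client;
  private final boolean groupIdConfigured;
  int warnCount = 0; // stands in for LOGGER.warn(message, exception)

  MonitoringCommitAdapterSketch(BrokerClient client, Properties consumerProperties) {
    this.client = client;
    String groupId = consumerProperties.getProperty("group.id");
    this.groupIdConfigured = groupId != null && !groupId.isEmpty();
  }

  /** Monitoring-only commit: never throws, and stays silent when group.id is unset. */
  void commitSync() {
    if (!groupIdConfigured) {
      return; // no broker call and no log line for callers that did not opt in
    }
    try {
      client.commitSync();
    } catch (RuntimeException e) {
      // Real code would log the exception itself (with stack), not just getMessage().
      warnCount++;
    }
  }
}
```

Checking for group.id up front avoids relying on the client to throw, which is what produced the repeated WARN described above.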
… — commits require two explicit opt-ins

Previously the interval defaulted to 30s. Even though
ApacheKafkaConsumerAdapter.commitSync() short-circuits silently when no
group.id is set, every poll cycle was still walking the
maybeCommitOffsets path for callers who had no monitoring need.

Now: commits actually happen only when *both* are true — the caller
sets a positive consumerOffsetCommitIntervalMs *and* a Kafka group.id
on the consumer properties. No magic auto-enable, no implicit
behavior tied to property presence.

Default = 0 (off). Test assertion updated to match.
Copilot AI review requested due to automatic review settings May 12, 2026 20:54

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Minh Nguyen added 2 commits May 12, 2026 14:39
…ry — callers configure group.id via the global ChangelogClientConfig

The Properties overload on VeniceChangelogConsumerClientFactory turned out
to be redundant. The Flink-side consumer (flink-li-venice-connectors) sets
its default group.id by injecting it into the global cfg.backendConfig
Properties bag in OffspringFactory, which flows through to
ChangelogClientConfig.consumerProperties unchanged. No per-call factory
parameter needed.

Keeps everything else from the original PR intact: commitSync() on
PubSubConsumerAdapter/ApacheKafkaConsumerAdapter, commitOffsets() on
VeniceChangelogConsumer/Impl, the maybeCommitOffsets cadence in
internalPoll, and the consumerOffsetCommitIntervalMs setter on
ChangelogClientConfig.

Tests:
- testGetChangelogConsumerMergesConsumerOverrides — removed (no longer applicable)
- All commit-path tests still pass.
maybeCommitOffsets

Matches the existing pattern in the same file (lines 274, 398, 1285) —
the protected `time` field is reserved for cases that need a fake-clock
injection point (e.g. version-swap timeout checks). The periodic-commit
cadence doesn't need that abstraction: the tests rely on
lastCommitTimeMs starting at 0 and any real clock easily exceeding any
positive interval.
Copilot AI review requested due to automatic review settings May 12, 2026 22:00

Copilot AI left a comment

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

clients/da-vinci-client/src/main/java/com/linkedin/davinci/consumer/VeniceChangelogConsumerImpl.java:789

  • maybeCommitOffsets() uses System.currentTimeMillis() even though this class already has an injectable Time time abstraction (used elsewhere via time.getMilliseconds()). Using the Time field here would keep time handling consistent and make commit-cadence behavior easier to test deterministically (e.g., with TestMockTime).
  private void maybeCommitOffsets() {
    if (consumerOffsetCommitIntervalMs <= 0) {
      return;
    }
    long now = System.currentTimeMillis();
    if (now - lastCommitTimeMs >= consumerOffsetCommitIntervalMs) {
      commitOffsets();
      lastCommitTimeMs = System.currentTimeMillis();
    }
  }
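
For illustration, the suggestion can be sketched with an injectable clock. This is not the code in the PR, which per the commit message above keeps System.currentTimeMillis(); `PeriodicCommitSketch` is a hypothetical class and a `LongSupplier` stands in for Venice's Time abstraction. The sketch also shows why the timestamp is read again after the commit returns.

```java
import java.util.function.LongSupplier;

// Hypothetical sketch: a LongSupplier stands in for an injectable clock so the
// cadence can be driven deterministically in a test.
final class PeriodicCommitSketch {
  private final long intervalMs;
  private final LongSupplier clock;
  private final Runnable commit;
  private long lastCommitTimeMs = 0L;

  PeriodicCommitSketch(long intervalMs, LongSupplier clock, Runnable commit) {
    this.intervalMs = intervalMs;
    this.clock = clock;
    this.commit = commit;
  }

  /** Intended to be called on every poll; commits at most once per interval. */
  void maybeCommitOffsets() {
    if (intervalMs <= 0) {
      return; // feature is off
    }
    long now = clock.getAsLong();
    if (now - lastCommitTimeMs >= intervalMs) {
      commit.run();
      // Read the clock again so the interval is measured from when the commit
      // returned; a slow commit then cannot trigger an immediate re-commit.
      lastCommitTimeMs = clock.getAsLong();
    }
  }
}
```

A test can pass a lambda over a mutable counter as the clock and advance it by hand, including inside the commit callback to simulate a slow broker.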

…ire + adapter-level too

Two follow-ups from PR review on linkedin#2789:

1. ApacheKafkaConsumerAdapter.commitSync(): acquireLockWithTimeout() was
   called before the try block, so a lock-acquisition failure
   (PubSubOpTimeoutException, PubSubClientException from interrupt) would
   propagate out of monitoring-only commitSync. Moved the acquire inside
   the try; releaseLock() is already safe to call on a non-held lock.

2. VeniceChangelogConsumerImpl.commitOffsets(): added defensive try/catch
   as a second safety layer. If any PubSubConsumerAdapter implementation
   ever throws unexpectedly, the exception now gets logged and swallowed
   here instead of bubbling out of poll().

Tests:
- testCommitOffsetsSwallowsExceptions — new
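
The two safety layers described above might look roughly like this. `SafeMonitoringCommitSketch` is a hypothetical class, a plain ReentrantLock stands in for the adapter's acquireLockWithTimeout()/releaseLock() helpers, and a counter stands in for logging. Unlike the releaseLock() described in the commit message, ReentrantLock.unlock() throws when the lock is not held, so the sketch guards the release explicitly.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch of the two swallow layers; not the Venice implementation.
final class SafeMonitoringCommitSketch {
  private final ReentrantLock consumerLock = new ReentrantLock();
  private final Runnable brokerCommit;
  int swallowed = 0; // stands in for "logged and swallowed"

  SafeMonitoringCommitSketch(Runnable brokerCommit) {
    this.brokerCommit = brokerCommit;
  }

  /** Adapter layer: lock acquisition sits inside the try so its failures are swallowed too. */
  void commitSync() {
    try {
      if (!consumerLock.tryLock(100, TimeUnit.MILLISECONDS)) {
        throw new IllegalStateException("timed out waiting for the consumer lock");
      }
      brokerCommit.run();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt flag
      swallowed++;
    } catch (RuntimeException e) {
      swallowed++;
    } finally {
      if (consumerLock.isHeldByCurrentThread()) {
        consumerLock.unlock();
      }
    }
  }

  /** Changelog-consumer layer: a second guard in case an adapter throws unexpectedly. */
  void commitOffsets() {
    try {
      commitSync();
    } catch (RuntimeException e) {
      swallowed++;
    }
  }
}
```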
@minhmo1620 minhmo1620 changed the title [changelog] Expose monitoring-only offset commit + per-call Kafka property overrides [changelog] Expose monitoring-only offset commit for job monitoring May 14, 2026
@minhmo1620 minhmo1620 changed the title [changelog] Expose monitoring-only offset commit for job monitoring [changelog-consumer] Opt-in periodic offset commit to enable Kafka consumer-group lag monitoring May 14, 2026
2 participants