Skip to content

[vpj] Support running VTConsistencyCheckerJob via VenicePushJob#2805

Open
Mohith22 wants to merge 1 commit into
linkedin:mainfrom
Mohith22:mdamarap/add-vt-consistency-checker-venice-operator
Open

[vpj] Support running VTConsistencyCheckerJob via VenicePushJob#2805
Mohith22 wants to merge 1 commit into
linkedin:mainfrom
Mohith22:mdamarap/add-vt-consistency-checker-venice-operator

Conversation

@Mohith22
Copy link
Copy Markdown
Collaborator

@Mohith22 Mohith22 commented May 20, 2026

Problem Statement

VTConsistencyCheckerJob (recently introduced) is a Spark job that scans Version Topics across two DCs to detect cross-region inconsistencies. Currently, we integrate it via VBnP using HadoopJavaOperator which complicates the wiring process. We want to integrate it to VenicePushOperator so it can be seamlessly added to the airflow DAGs. Also, the spark job today doesn't fail the operator when inconsistencies are found, rather it silently ends by writing the errors to parquet. We need to throw an error, so the DAG step can be failed.

Solution

  1. VPJ as the entry point. Added a vt.consistency.check.only flag. When set, VenicePushJob.run() short-circuits the push and delegates to VTConsistencyCheckerJob.run(props) via runVTConsistencyCheck().
  2. Fail-fast on findings. Added a LongAccumulator inconsistenciesFound in the checker, incremented per detected mismatch. After the Parquet write completes (so forensics are preserved on disk), the job throws VeniceException if the count is non-zero.

Code changes

  • Added new code behind a config. If so list the config names and their default values in the PR description.
  • Introduced new log lines.
  • Confirmed if logs need to be rate limited to avoid excessive logging.

Concurrency-Specific Checks

Both reviewer and PR author to verify

  • Code has no race conditions or thread safety issues.
  • Proper synchronization mechanisms (e.g., synchronized, RWLock) are used where needed.
  • No blocking calls inside critical sections that could lead to deadlocks or performance degradation.
  • Verified thread-safe collections are used (e.g., ConcurrentHashMap, CopyOnWriteArrayList).
  • Validated proper exception handling in multi-threaded code to avoid silent thread termination.

How was this PR tested?

  • New unit tests added.
  • New integration tests added.
  • Modified or extended existing tests.
  • Verified backward compatibility (if applicable).

Does this PR introduce any user-facing or breaking changes?

  • No. You can skip the rest of this section.
  • Yes. Clearly explain the behavior change and its impact.

Comment on lines +327 to +332
/**
* When set to {@code true}, {@link com.linkedin.venice.hadoop.VenicePushJob#run()} skips the
* push and instead invokes {@link com.linkedin.venice.spark.consistency.VTConsistencyCheckerJob}
* against the store's current version topic.
*/
public static final String VT_CONSISTENCY_CHECK_ONLY = "vt.consistency.check.only";
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For heartbeat push jobs that want produce + verify in one run, the natural UX is: flip a single flag and the job does the push and then the consistency check. Today VT_CONSISTENCY_CHECK_ONLY=true is mutually exclusive with the push, so the user has to schedule a second VPJ run with the flag flipped. The second run also targets whatever the current version is at that later time, not necessarily the version just pushed.

A simpler shape: keep this flag for the "skip push, check only" path, and add a sibling vt.consistency.check.after.push that runs the checker as a follow-on phase inside the same run() after a successful push. Heartbeat jobs then flip one config and get produce + post-push consistency check in one invocation.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I intentionally kept this separately because of two reasons:

  • It doesn't really make sense to run this right after VPJ because we need nearline writes for this validation to make sense. This is more relevant for AA stores.
  • I want it to be light weighted so we can trigger and quickly do the validation, instead of waiting for the VPJ to finish.

@sushantmane sushantmane added the requested-changes Reviewer requested changes label May 20, 2026
@Mohith22 Mohith22 force-pushed the mdamarap/add-vt-consistency-checker-venice-operator branch from 7b6450b to 8f9e7be Compare May 20, 2026 18:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

requested-changes Reviewer requested changes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants