Skip to content

[controller] Keep prior current version during backup cleanup#2803

Open
misyel wants to merge 4 commits into
linkedin:mainfrom
misyel:mkwong/cleanup-preserve-prior-current-version
Open

[controller] Keep prior current version during backup cleanup#2803
misyel wants to merge 4 commits into
linkedin:mainfrom
misyel:mkwong/cleanup-preserve-prior-current-version

Conversation

@misyel
Copy link
Copy Markdown
Contributor

@misyel misyel commented May 18, 2026

Problem Statement

StoreBackupVersionCleanupService.cleanupBackupVersion could delete the previous current version when a higher non-current version (e.g. one left at PUSHED by a kill that finished bootstrap before propagating KILLED to child metadata) sat between the prior current and the current. The keep-newest fallback sorts sub-current versions in descending order and removes index 0 from the deletion list, which preserves the highest version but lets the actually-served prior current through to deletion. DaVinci nodes still subscribed to the prior current then fail at version swap.

Solution

Use the existing Version.previousCurrentVersion field, auto-stamped by ZKStore.setCurrentVersion at swap time since StoreMetaValue v40, to preserve the prior current in the non-repush keep-newest branch. Falls back to keep-newest when the field is unset (pre-v40 stores or rollback paths that bypass the auto-stamp).

For sequential pushes the prior current is the newest sub-current version, so behavior matches the legacy keep-newest path; only the lingering-higher-version case changes.

Tracks: VENG-12676

Code changes

  • Added new code behind a config.
  • Introduced new log lines.

Concurrency-Specific Checks

  • Code has no race conditions or thread safety issues.
  • Proper synchronization mechanisms are used where needed.
  • No blocking calls inside critical sections.
  • Verified thread-safe collections are used.
  • Validated proper exception handling in multi-threaded code.

How was this PR tested?

  • New unit tests added.
  • New integration tests added.
  • Modified or extended existing tests.
  • Verified backward compatibility (if applicable).

Unit tests cover four scenarios via a data provider: the bug reproduction, a longer-history variant, sequential pushes (legacy match), and back-compat when priorCurrent is unset. Integration test reproduces the full lix-member flow across two regions with a target-region deferred-swap push, kill, follow-up regular push, and asserts the cleanup picks v2 (lingering PUSHED) for deletion while preserving v1 (prior current).

Does this PR introduce any user-facing or breaking changes?

  • No. You can skip the rest of this section.

StoreBackupVersionCleanupService.cleanupBackupVersion could delete the
previous current version when a higher non-current version (e.g. one
left at PUSHED by a kill that finished bootstrap before propagating
KILLED to child metadata) sat between the prior current and the
current. The keep-newest fallback sorts sub-current versions in
descending order and removes index 0 from the deletion list, which
preserves the highest version but lets the actually-served prior
current through to deletion. DaVinci nodes still subscribed to the
prior current then fail at version swap.

Use the existing Version.previousCurrentVersion field, auto-stamped by
ZKStore.setCurrentVersion at swap time since StoreMetaValue v40, to
preserve the prior current in the non-repush keep-newest branch. Falls
back to keep-newest when the field is unset (pre-v40 stores or
rollback paths that bypass the auto-stamp). For sequential pushes the
prior current is the newest sub-current version, so behavior matches
the legacy keep-newest path; only the lingering-higher-version case
changes.

Unit tests cover four scenarios via a data provider: the bug
reproduction, a longer-history variant, sequential pushes (legacy
match), and back-compat when priorCurrent is unset. Integration test
reproduces the full lix-member flow across two regions with a
target-region deferred-swap push, kill, follow-up regular push, and
asserts the cleanup picks v2 (lingering PUSHED) for deletion while
preserving v1 (prior current).

Tracks: VENG-12676
Copilot AI review requested due to automatic review settings May 18, 2026 21:50
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes backup-version cleanup in the Venice controller to avoid deleting the prior current version when there is a higher non-current “stale” version (e.g., lingering at PUSHED) between the prior current and the current version, which can break DaVinci version swaps.

Changes:

  • Update StoreBackupVersionCleanupService.cleanupBackupVersion to preferentially preserve currentVersionMeta.getPreviousCurrentVersion() (with a keep-newest fallback).
  • Add unit coverage for the prior-current preservation logic via a new data provider.
  • Add an end-to-end integration test reproducing the multi-region deferred-swap + kill scenario.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
services/venice-controller/src/main/java/com/linkedin/venice/controller/StoreBackupVersionCleanupService.java Preserve the prior current version during retention-based cleanup when multiple sub-current versions are eligible for deletion.
services/venice-controller/src/test/java/com/linkedin/venice/controller/TestStoreBackupVersionCleanupService.java Adds unit tests covering bug reproduction, sequential-push behavior parity, and fallback when prior-current is unset.
internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/endToEnd/TestStoreBackupVersionDeletion.java Adds a multi-region integration test reproducing the deferred-swap + kill flow and asserting cleanup preserves the prior current.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

misyel added 2 commits May 18, 2026 15:23
- Correct sentinel description: priorCurrent unset is
  Store.NON_EXISTING_VERSION (0), not -1.
- Integration test: force-stamp v2 to PUSHED via storeMetadataUpdate
  after the parent kill so the test exercises the cleanup-service fix
  deterministically, independent of dc-1's kill-handler behavior.
- Integration test: assert the background v2 VPJ thread surfaced an
  error after join (kill should fail the push), guarding against silent
  regressions.
The integration test's value is reproducing the full VENG-12676 flow
(kill no-op leaves v2 PUSHED, then cleanup mishandles prior current).
Force-stamping v2 to PUSHED bypasses the kill-handler interaction the
test is meant to exercise and makes it a redundant unit test. Unit
tests already cover the cleanup logic deterministically.
Copilot AI review requested due to automatic review settings May 18, 2026 22:48
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

- Guard Optional dereference on parentStore.getVersion(2) inside the
  waitForNonDeterministicAssertion loop; absence would throw
  NoSuchElementException, which the helper does not retry on, leading
  to flaky failures.
- Fail the test if the v2 VPJ background thread is still alive after
  interrupt/join, instead of silently proceeding into the v3 push.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants