[controller] Keep prior current version during backup cleanup#2803
Open
misyel wants to merge 4 commits into
Open
[controller] Keep prior current version during backup cleanup#2803misyel wants to merge 4 commits into
misyel wants to merge 4 commits into
Conversation
StoreBackupVersionCleanupService.cleanupBackupVersion could delete the previous current version when a higher non-current version (e.g. one left at PUSHED by a kill that finished bootstrap before propagating KILLED to child metadata) sat between the prior current and the current. The keep-newest fallback sorts sub-current versions in descending order and removes index 0 from the deletion list, which preserves the highest version but lets the actually-served prior current through to deletion. DaVinci nodes still subscribed to the prior current then fail at version swap. Use the existing Version.previousCurrentVersion field, auto-stamped by ZKStore.setCurrentVersion at swap time since StoreMetaValue v40, to preserve the prior current in the non-repush keep-newest branch. Falls back to keep-newest when the field is unset (pre-v40 stores or rollback paths that bypass the auto-stamp). For sequential pushes the prior current is the newest sub-current version, so behavior matches the legacy keep-newest path; only the lingering-higher-version case changes. Unit tests cover four scenarios via a data provider: the bug reproduction, a longer-history variant, sequential pushes (legacy match), and back-compat when priorCurrent is unset. Integration test reproduces the full lix-member flow across two regions with a target-region deferred-swap push, kill, follow-up regular push, and asserts the cleanup picks v2 (lingering PUSHED) for deletion while preserving v1 (prior current). Tracks: VENG-12676
There was a problem hiding this comment.
Pull request overview
This PR fixes backup-version cleanup in the Venice controller to avoid deleting the prior current version when there is a higher non-current “stale” version (e.g., lingering at PUSHED) between the prior current and the current version, which can break DaVinci version swaps.
Changes:
- Update
StoreBackupVersionCleanupService.cleanupBackupVersionto preferentially preservecurrentVersionMeta.getPreviousCurrentVersion()(with a keep-newest fallback). - Add unit coverage for the prior-current preservation logic via a new data provider.
- Add an end-to-end integration test reproducing the multi-region deferred-swap + kill scenario.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| services/venice-controller/src/main/java/com/linkedin/venice/controller/StoreBackupVersionCleanupService.java | Preserve the prior current version during retention-based cleanup when multiple sub-current versions are eligible for deletion. |
| services/venice-controller/src/test/java/com/linkedin/venice/controller/TestStoreBackupVersionCleanupService.java | Adds unit tests covering bug reproduction, sequential-push behavior parity, and fallback when prior-current is unset. |
| internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/endToEnd/TestStoreBackupVersionDeletion.java | Adds a multi-region integration test reproducing the deferred-swap + kill flow and asserting cleanup preserves the prior current. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Correct sentinel description: priorCurrent unset is Store.NON_EXISTING_VERSION (0), not -1. - Integration test: force-stamp v2 to PUSHED via storeMetadataUpdate after the parent kill so the test exercises the cleanup-service fix deterministically, independent of dc-1's kill-handler behavior. - Integration test: assert the background v2 VPJ thread surfaced an error after join (kill should fail the push), guarding against silent regressions.
The integration test's value is reproducing the full VENG-12676 flow (kill no-op leaves v2 PUSHED, then cleanup mishandles prior current). Force-stamping v2 to PUSHED bypasses the kill-handler interaction the test is meant to exercise and makes it a redundant unit test. Unit tests already cover the cleanup logic deterministically.
- Guard Optional dereference on parentStore.getVersion(2) inside the waitForNonDeterministicAssertion loop; absence would throw NoSuchElementException, which the helper does not retry on, leading to flaky failures. - Fail the test if the v2 VPJ background thread is still alive after interrupt/join, instead of silently proceeding into the v3 push.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem Statement
StoreBackupVersionCleanupService.cleanupBackupVersioncould delete the previous current version when a higher non-current version (e.g. one left atPUSHEDby a kill that finished bootstrap before propagatingKILLEDto child metadata) sat between the prior current and the current. The keep-newest fallback sorts sub-current versions in descending order and removes index 0 from the deletion list, which preserves the highest version but lets the actually-served prior current through to deletion. DaVinci nodes still subscribed to the prior current then fail at version swap.Solution
Use the existing
Version.previousCurrentVersionfield, auto-stamped byZKStore.setCurrentVersionat swap time since StoreMetaValue v40, to preserve the prior current in the non-repush keep-newest branch. Falls back to keep-newest when the field is unset (pre-v40 stores or rollback paths that bypass the auto-stamp).For sequential pushes the prior current is the newest sub-current version, so behavior matches the legacy keep-newest path; only the lingering-higher-version case changes.
Tracks: VENG-12676
Code changes
Concurrency-Specific Checks
How was this PR tested?
Unit tests cover four scenarios via a data provider: the bug reproduction, a longer-history variant, sequential pushes (legacy match), and back-compat when
priorCurrentis unset. Integration test reproduces the full lix-member flow across two regions with a target-region deferred-swap push, kill, follow-up regular push, and asserts the cleanup picks v2 (lingering PUSHED) for deletion while preserving v1 (prior current).Does this PR introduce any user-facing or breaking changes?