[SPARK-56216][SS] Integrate checkpoint V2 with auto-repair snapshot #55015
Open
ericm-db wants to merge 3 commits into apache:master from
### What changes were proposed in this pull request?

This PR adds auto-repair snapshot support to the checkpoint V2 (state store checkpoint IDs) load path. Previously, auto-repair and V2 were completely disjoint: `loadWithCheckpointId` had no recovery logic for corrupt snapshots, and the auto-repair path (`loadSnapshotWithoutCheckpointId`) only handled V1 files without UUID awareness.

Changes:

- **RocksDBFileManager**: Added `getSnapshotVersionsAndUniqueIdsFromLineage()`, which returns all lineage-matching snapshots sorted in descending order (the V2 equivalent of `getEligibleSnapshotsForVersion`). Refactored `getLatestSnapshotVersionAndUniqueIdFromLineage()` to delegate to it.
- **RocksDB**: Added `loadSnapshotWithCheckpointId()`, which uses `AutoSnapshotLoader` with V2-specific callbacks that map version to uniqueId via a side-channel map. Changelog replay is included inside the load callback, so corrupt changelogs also trigger fallback to the next older snapshot.
- **RocksDB**: Wrapped the snapshot load + changelog replay block in `loadWithCheckpointId()` with a try-catch that delegates to the new auto-repair method when enabled. Uses `getFullLineage()` to build the complete lineage chain (back to version 1) so that version 0 fallback with full changelog replay works correctly.

### Why are the changes needed?

Without this change, any corrupt or missing snapshot file in V2 mode causes a hard query failure with no recovery path. V1 mode already had auto-repair (falling back to older snapshots and replaying changelogs), but V2's `loadWithCheckpointId` bypassed that entirely. This is especially important because speculative execution can produce orphaned or incomplete snapshot files that V2 is designed to handle, but corruption of the "winning" snapshot had no fallback.

### Does this PR introduce _any_ user-facing change?

No. This is an internal improvement to fault tolerance. Queries using checkpoint V2 that previously would fail on corrupt snapshots will now automatically recover when `autoSnapshotRepair.enabled` is true (the production default).

### How was this patch tested?

Added integration test "Auto snapshot repair with checkpoint format V2" in `RocksDBSuite` covering:

- Single corrupt V2 snapshot: falls back to older snapshot in lineage
- All V2 snapshots corrupt: falls back to version 0 with full changelog replay
- Verified state correctness and `numSnapshotsAutoRepaired` metric after repair

Also verified existing tests pass:

- `AutoSnapshotLoaderSuite` (5/5)
- `RocksDBSuite` V1 auto-repair test
- `RocksDBStateStoreCheckpointFormatV2Suite` (24/24)

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Claude Opus 4.6)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
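For readers unfamiliar with the fallback flow described above, here is a minimal, self-contained sketch of the idea: filter on-disk snapshots against the lineage's version-to-uniqueId map, try them newest first, and fall back to version 0 if every candidate fails. All names here (`SnapshotId`, `eligibleSnapshots`, `loadWithFallback`) are hypothetical and chosen for illustration; they are not the signatures used in `RocksDB`, `RocksDBFileManager`, or `AutoSnapshotLoader`.

```scala
import scala.util.Try

object V2SnapshotFallbackSketch {
  // Hypothetical stand-in for a V2 snapshot file identified by (version, uniqueId).
  final case class SnapshotId(version: Long, uniqueId: String)

  /** Keep only snapshots whose (version, uniqueId) pair appears in the lineage, newest first. */
  def eligibleSnapshots(onDisk: Seq[SnapshotId], lineage: Map[Long, String]): Seq[SnapshotId] =
    onDisk
      .filter(s => lineage.get(s.version).contains(s.uniqueId))
      .sortBy(s => -s.version)

  /**
   * Try each candidate from newest to oldest. `load` is assumed to load the snapshot
   * AND replay changelogs, so a corrupt changelog also moves on to the next candidate.
   * If every candidate fails, `loadFromEmpty` replays everything from version 0.
   * Returns (snapshot version actually used, number of candidates skipped).
   */
  def loadWithFallback(
      candidates: Seq[SnapshotId],
      load: SnapshotId => Unit,
      loadFromEmpty: () => Unit): (Long, Int) = {
    @annotation.tailrec
    def loop(remaining: List[SnapshotId], skipped: Int): (Long, Int) = remaining match {
      case Nil =>
        loadFromEmpty()
        (0L, skipped)
      case s :: rest =>
        if (Try(load(s)).isSuccess) (s.version, skipped)
        else loop(rest, skipped + 1)
    }
    loop(candidates.toList, 0)
  }
}
```

In this sketch a snapshot written by a losing speculative task (same version, different uniqueId) is filtered out before any load is attempted, which is the role the lineage plays in the description above.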
Member
Thank you, @ericm-db. Please use a valid JIRA ID and convert to the normal PR.
1adfe92 to 1c33bc8
…atch

Instead of wrapping loadWithCheckpointId's snapshot load in a try-catch that delegates to AutoSnapshotLoader on failure, route V2 through AutoSnapshotLoader from the start (like V1 already does).

Added a beforeRepair() callback to AutoSnapshotLoader (default no-op) that is called once when repair begins. V2 overrides it to enrich the lineage via getFullLineage(). getEligibleSnapshots() is re-called after beforeRepair() so the enriched lineage is picked up automatically.

This eliminates the try-catch and makes V2 follow the same structural pattern as V1: AutoSnapshotLoader drives the entire load path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
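The hook pattern in this commit message can be sketched roughly as follows. This is an illustrative shape only, assuming a loader that works on plain version numbers: the class name `SnapshotLoaderSketch` and the method signatures are hypothetical and do not reproduce the actual `AutoSnapshotLoader` API.

```scala
import scala.util.Try

object BeforeRepairSketch {
  /** Hypothetical loader shape; not the real AutoSnapshotLoader signature. */
  abstract class SnapshotLoaderSketch {
    /** Snapshot versions that may be tried for the target version, in any order. */
    protected def getEligibleSnapshots(targetVersion: Long): Seq[Long]

    /** Load the snapshot and replay changelogs up to targetVersion; throws on corruption. */
    protected def loadSnapshot(snapshotVersion: Long, targetVersion: Long): Unit

    /** Called once when repair begins; default no-op. */
    protected def beforeRepair(): Unit = ()

    /** Returns the snapshot version that loaded (0 means full replay from empty). */
    def load(targetVersion: Long): Long = {
      val first = getEligibleSnapshots(targetVersion).sorted.lastOption.getOrElse(0L)
      if (Try(loadSnapshot(first, targetVersion)).isSuccess) {
        first
      } else {
        // Repair starts here: give the subclass one chance to widen its view
        // (for V2, enriching the lineage), then re-query the eligible snapshots.
        beforeRepair()
        val older = getEligibleSnapshots(targetVersion)
          .filter(v => v > 0L && v < first)
          .sorted
          .reverse
        val retry = if (first > 0L) older :+ 0L else older
        retry
          .find(v => Try(loadSnapshot(v, targetVersion)).isSuccess)
          .getOrElse(throw new IllegalStateException(
            s"no loadable snapshot for version $targetVersion"))
      }
    }
  }
}
```

Because the eligible set is queried again after the hook runs, a subclass only has to update its own state inside `beforeRepair()`; the loader picks up the wider candidate list without a separate try-catch at the call site.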
1c33bc8 to 3d9d889
- Fix setMaxSeenVersion to use target version instead of snapshot version
- Add logWarning when getFullLineage fails during auto-repair
- Remove dead loadedSnapshotUniqueId variable
- Propagate enriched lineage back from loadSnapshotWithCheckpointId
- Restore version 0 assertion for stateStoreCkptId
- Add tests: maxChangeFileReplay limit, load-after-repair roundtrip, no snapshots + full replay, getFullLineage failure handling

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
b63d359 to 1580bde