[SPARK-56216][SS] Integrate checkpoint V2 with auto-repair snapshot by ericm-db · Pull Request #55015 · apache/spark

ericm-db · 2026-03-25T21:00:12Z

What changes were proposed in this pull request?

This PR adds auto-repair snapshot support to the checkpoint V2 (state store checkpoint IDs) load path. Previously, auto-repair and V2 were completely disjoint: loadWithCheckpointId had no recovery logic for corrupt snapshots, and the auto-repair path (loadSnapshotWithoutCheckpointId) only handled V1 files without UUID awareness.

Changes:

RocksDBFileManager: Added getSnapshotVersionsAndUniqueIdsFromLineage() which returns all lineage-matching snapshots sorted descending (the V2 equivalent of getEligibleSnapshotsForVersion). Refactored getLatestSnapshotVersionAndUniqueIdFromLineage() to delegate to it.
RocksDB: Added loadSnapshotWithCheckpointId() which uses AutoSnapshotLoader with V2-specific callbacks that map version to uniqueId via a side-channel map. Changelog replay is included inside the load callback so corrupt changelogs also trigger fallback to the next older snapshot.
RocksDB: Wrapped the snapshot load + changelog replay block in loadWithCheckpointId() with a try-catch that delegates to the new auto-repair method when enabled. Uses getFullLineage() to build the complete lineage chain (back to version 1) so that version 0 fallback with full changelog replay works correctly.
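The lineage matching described above can be sketched in a few lines of Scala. This is a minimal illustration under stated assumptions, not the code in this PR: LineageItem, snapshotsFromLineage, latestSnapshotFromLineage, and versionToUniqueId are hypothetical names, and snapshots are modeled as plain (version, uniqueId) pairs rather than files on disk.

```scala
object LineageSnapshotSketch {
  // Hypothetical stand-in for one lineage entry: a state store version and
  // the checkpoint unique ID that committed it.
  final case class LineageItem(version: Long, checkpointUniqueId: String)

  // All snapshots whose (version, uniqueId) pair appears in the lineage,
  // newest version first. Snapshots whose uniqueId is not in the lineage
  // (for example, files left behind by a losing speculative attempt) are dropped.
  def snapshotsFromLineage(
      snapshotsOnDisk: Seq[(Long, String)],
      lineage: Seq[LineageItem]): Seq[(Long, String)] = {
    val inLineage: Set[(Long, String)] =
      lineage.map(item => (item.version, item.checkpointUniqueId)).toSet
    snapshotsOnDisk
      .filter(inLineage.contains)
      .sortBy { case (version, _) => -version }
  }

  // The "latest snapshot" lookup is then just the head of the sorted list.
  def latestSnapshotFromLineage(
      snapshotsOnDisk: Seq[(Long, String)],
      lineage: Seq[LineageItem]): Option[(Long, String)] =
    snapshotsFromLineage(snapshotsOnDisk, lineage).headOption

  // Side-channel map: lets a callback that only receives a version
  // recover the uniqueId of the matching snapshot.
  def versionToUniqueId(snapshots: Seq[(Long, String)]): Map[Long, String] =
    snapshots.toMap
}
```

Having the "latest" lookup delegate to the "all matches" lookup keeps a single definition of what lineage-matching means, which is the shape of the refactoring the first bullet describes.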

Why are the changes needed?

Without this change, any corrupt or missing snapshot file in V2 mode causes a hard query failure with no recovery path. V1 mode already had auto-repair (falling back to older snapshots and replaying changelogs), but V2's loadWithCheckpointId bypassed that entirely. This is especially important because speculative execution can produce orphaned or incomplete snapshot files that V2 is designed to handle, but corruption of the "winning" snapshot had no fallback.

Does this PR introduce any user-facing change?

No. This is an internal improvement to fault tolerance. Queries using checkpoint V2 that previously would fail on corrupt snapshots will now automatically recover when autoSnapshotRepair.enabled is true (the production default).

How was this patch tested?

Added integration test "Auto snapshot repair with checkpoint format V2" in RocksDBSuite covering:

Single corrupt V2 snapshot: falls back to older snapshot in lineage
All V2 snapshots corrupt: falls back to version 0 with full changelog replay
Verified state correctness and numSnapshotsAutoRepaired metric after repair
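The fallback behavior these cases exercise can be modeled with a small, self-contained Scala sketch. It is a simplified model, not the AutoSnapshotLoader API: loadWithFallback, LoadResult, and the loadAndReplay callback are hypothetical, and snapshotsSkipped is only an illustrative counter, not the definition of the numSnapshotsAutoRepaired metric.

```scala
import scala.annotation.tailrec
import scala.util.{Failure, Success, Try}

object AutoRepairSketch {
  // Which version the load started from, and how many candidates failed first.
  final case class LoadResult(startVersion: Long, snapshotsSkipped: Int) {
    def autoRepaired: Boolean = snapshotsSkipped > 0
  }

  // eligibleSnapshots: snapshot versions, newest first.
  // loadAndReplay: hypothetical callback that loads the given snapshot
  // (version 0 means "start from an empty store") and replays changelogs up
  // to the target version. It throws if the snapshot or a changelog is corrupt,
  // so a bad changelog also moves the loader on to the next older snapshot.
  def loadWithFallback(
      eligibleSnapshots: Seq[Long],
      loadAndReplay: Long => Unit): LoadResult = {
    @tailrec
    def attempt(
        remaining: List[Long],
        skipped: Int,
        lastError: Option[Throwable]): LoadResult =
      remaining match {
        case Nil =>
          throw new IllegalStateException("no snapshot could be loaded", lastError.orNull)
        case version :: older =>
          Try(loadAndReplay(version)) match {
            case Success(_) => LoadResult(version, skipped)
            case Failure(e) => attempt(older, skipped + 1, Some(e))
          }
      }

    // Version 0 is always the last resort: an empty store plus full changelog replay.
    attempt((eligibleSnapshots :+ 0L).distinct.toList, 0, None)
  }
}
```

With snapshots at versions 7, 5, and 3, a callback that fails only for version 7 yields LoadResult(5, 1), and one that fails for every non-zero version yields LoadResult(0, 3), mirroring the two scenarios listed above.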

Also verified existing tests pass:

AutoSnapshotLoaderSuite (5/5)
RocksDBSuite V1 auto-repair test
RocksDBStateStoreCheckpointFormatV2Suite (24/24)

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Claude Opus 4.6)

dongjoon-hyun · 2026-03-25T21:01:14Z

Thank you, @ericm-db. Please use a valid JIRA ID and convert this to a normal PR.

…atch

Instead of wrapping loadWithCheckpointId's snapshot load in a try-catch that delegates to AutoSnapshotLoader on failure, route V2 through AutoSnapshotLoader from the start (like V1 already does).

Added a beforeRepair() callback to AutoSnapshotLoader (default no-op) that is called once when repair begins. V2 overrides it to enrich the lineage via getFullLineage(). getEligibleSnapshots() is re-called after beforeRepair() so the enriched lineage is picked up automatically.

This eliminates the try-catch and makes V2 follow the same structural pattern as V1: AutoSnapshotLoader drives the entire load path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
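The beforeRepair() hook described in this commit message can be illustrated with a simplified Scala model. This is a sketch under assumptions, not the real AutoSnapshotLoader: SnapshotLoaderSketch, V2LoaderSketch, and the load() control flow are hypothetical, and lineages are reduced to lists of snapshot versions.

```scala
import scala.util.Try

object BeforeRepairSketch {
  // Simplified model of a loader that drives the whole load path.
  abstract class SnapshotLoaderSketch {
    // Snapshot versions that may be loaded, newest first.
    protected def getEligibleSnapshots(): Seq[Long]
    // Loads one snapshot (0 = empty store plus full replay); throws on corruption.
    protected def loadSnapshot(version: Long): Unit
    // Hook invoked once, when the first load attempt fails. No-op by default.
    protected def beforeRepair(): Unit = ()

    // Returns the version that was finally loaded.
    def load(): Long = {
      val first = getEligibleSnapshots().headOption.getOrElse(0L)
      if (Try(loadSnapshot(first)).isSuccess) first
      else {
        beforeRepair()
        // Re-read the candidates so anything the hook added becomes visible.
        val older = (getEligibleSnapshots() :+ 0L).distinct.filter(_ < first)
        older.find(v => Try(loadSnapshot(v)).isSuccess).getOrElse(
          throw new IllegalStateException(s"no loadable snapshot older than $first"))
      }
    }
  }

  // V2-flavored subclass: starts from a short lineage and widens it on repair.
  final class V2LoaderSketch(
      corrupt: Set[Long],
      shortLineage: Seq[Long],
      fullLineage: Seq[Long]) extends SnapshotLoaderSketch {
    private var lineage: Seq[Long] = shortLineage
    var repairCalls: Int = 0

    protected def getEligibleSnapshots(): Seq[Long] = lineage
    protected def loadSnapshot(version: Long): Unit =
      if (corrupt.contains(version)) throw new RuntimeException(s"corrupt snapshot $version")
    override protected def beforeRepair(): Unit = {
      repairCalls += 1
      lineage = fullLineage // stands in for the getFullLineage() enrichment
    }
  }
}
```

In this model, a loader whose short lineage holds only a corrupt version 7 falls back to version 5 once the hook has widened the lineage to 7, 5, 3. Without the hook it would drop straight to version 0, and on a healthy load the hook is never called.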

- Fix setMaxSeenVersion to use target version instead of snapshot version
- Add logWarning when getFullLineage fails during auto-repair
- Remove dead loadedSnapshotUniqueId variable
- Propagate enriched lineage back from loadSnapshotWithCheckpointId
- Restore version 0 assertion for stateStoreCkptId
- Add tests: maxChangeFileReplay limit, load-after-repair roundtrip, no snapshots + full replay, getFullLineage failure handling

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

dongjoon-hyun marked this pull request as draft March 25, 2026 21:00

ericm-db force-pushed the integrate-v2-auto-repair-snapshot branch 5 times, most recently from 1adfe92 to 1c33bc8, on March 25, 2026 21:49

ericm-db changed the title ~~[SS] Integrate checkpoint V2 with auto-repair snapshot~~ [SPARK-56216][SS] Integrate checkpoint V2 with auto-repair snapshot Mar 25, 2026

ericm-db force-pushed the integrate-v2-auto-repair-snapshot branch from 1c33bc8 to 3d9d889 on March 25, 2026 22:13

ericm-db marked this pull request as ready for review March 25, 2026 22:20

ericm-db force-pushed the integrate-v2-auto-repair-snapshot branch from b63d359 to 1580bde on March 26, 2026 00:07

ericm-db wants to merge 3 commits into apache:master from ericm-db:integrate-v2-auto-repair-snapshot
