[SPARK-56216][SS] Integrate checkpoint V2 with auto-repair snapshot #55015

Open
ericm-db wants to merge 3 commits into apache:master from ericm-db:integrate-v2-auto-repair-snapshot

@ericm-db
Contributor

What changes were proposed in this pull request?

This PR adds auto-repair snapshot support to the checkpoint V2 (state store checkpoint IDs) load path. Previously, auto-repair and V2 were completely disjoint: loadWithCheckpointId had no recovery logic for corrupt snapshots, and the auto-repair path (loadSnapshotWithoutCheckpointId) only handled V1 files without UUID awareness.

Changes:

  • RocksDBFileManager: Added getSnapshotVersionsAndUniqueIdsFromLineage() which returns all lineage-matching snapshots sorted descending (the V2 equivalent of getEligibleSnapshotsForVersion). Refactored getLatestSnapshotVersionAndUniqueIdFromLineage() to delegate to it.
  • RocksDB: Added loadSnapshotWithCheckpointId() which uses AutoSnapshotLoader with V2-specific callbacks that map version to uniqueId via a side-channel map. Changelog replay is included inside the load callback so corrupt changelogs also trigger fallback to the next older snapshot.
  • RocksDB: Wrapped the snapshot load + changelog replay block in loadWithCheckpointId() with a try-catch that delegates to the new auto-repair method when enabled. Uses getFullLineage() to build the complete lineage chain (back to version 1) so that version 0 fallback with full changelog replay works correctly.
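
The fallback order described in the list above can be sketched roughly as follows. This is a minimal, self-contained illustration, not the code in this PR: `LineageItem`, `eligibleSnapshots`, and `loadWithFallback` are hypothetical names, and the real `AutoSnapshotLoader` and `RocksDBFileManager` signatures may differ. It assumes that only snapshots whose (version, uniqueId) pair appears in the lineage are eligible, that they are tried newest first, and that version 0 (empty store plus full changelog replay) is the last resort.

```scala
import scala.util.{Failure, Success, Try}

// Hypothetical stand-in for one lineage entry: a version plus its checkpoint unique ID.
final case class LineageItem(version: Long, uniqueId: String)

object V2AutoRepairSketch {
  // Keep only on-disk snapshots whose (version, uniqueId) pair is in the lineage,
  // sorted by version in descending order.
  def eligibleSnapshots(
      snapshotsOnDisk: Seq[(Long, String)],
      lineage: Seq[LineageItem]): Seq[(Long, String)] = {
    val inLineage = lineage.map(i => (i.version, i.uniqueId)).toSet
    snapshotsOnDisk.filter(inLineage.contains).sortBy { case (v, _) => -v }
  }

  // Try each candidate in order. `loadAndReplay` is assumed to load the snapshot
  // AND replay changelogs, so a corrupt changelog also triggers a fallback.
  // Returns the snapshot version that loaded and the number of failed attempts.
  def loadWithFallback(
      candidates: Seq[(Long, String)],
      loadAndReplay: (Long, Option[String]) => Unit): (Long, Int) = {
    // Version 0 has no snapshot file and therefore no unique ID.
    val attempts: Seq[(Long, Option[String])] =
      candidates.map { case (v, id) => (v, Option(id)) } :+ ((0L, None))

    @annotation.tailrec
    def go(remaining: List[(Long, Option[String])], failures: Int): (Long, Int) =
      remaining match {
        case (v, id) :: rest =>
          Try(loadAndReplay(v, id)) match {
            case Success(_) => (v, failures)
            case Failure(_) if rest.nonEmpty => go(rest, failures + 1)
            case Failure(e) => throw e // nothing older left to try
          }
        case Nil => throw new IllegalStateException("no candidates")
      }

    go(attempts.toList, 0)
  }
}
```

The failure count returned here is only illustrative; how the real `numSnapshotsAutoRepaired` metric is updated is not shown in this description.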

Why are the changes needed?

Without this change, any corrupt or missing snapshot file in V2 mode causes a hard query failure with no recovery path. V1 mode already had auto-repair (falling back to older snapshots and replaying changelogs), but V2's loadWithCheckpointId bypassed that entirely. This is especially important because speculative execution can produce orphaned or incomplete snapshot files that V2 is designed to handle, but corruption of the "winning" snapshot had no fallback.

Does this PR introduce any user-facing change?

No. This is an internal improvement to fault tolerance. Queries using checkpoint V2 that previously would fail on corrupt snapshots will now automatically recover when autoSnapshotRepair.enabled is true (the production default).

How was this patch tested?

Added integration test "Auto snapshot repair with checkpoint format V2" in RocksDBSuite covering:

  • Single corrupt V2 snapshot: falls back to older snapshot in lineage
  • All V2 snapshots corrupt: falls back to version 0 with full changelog replay
  • Verified state correctness and numSnapshotsAutoRepaired metric after repair

Also verified existing tests pass:

  • AutoSnapshotLoaderSuite (5/5)
  • RocksDBSuite V1 auto-repair test
  • RocksDBStateStoreCheckpointFormatV2Suite (24/24)

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Claude Opus 4.6)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@dongjoon-hyun dongjoon-hyun marked this pull request as draft March 25, 2026 21:00
@dongjoon-hyun
Member

Thank you, @ericm-db. Please use a valid JIRA ID and convert this to a normal PR.

@ericm-db ericm-db force-pushed the integrate-v2-auto-repair-snapshot branch 5 times, most recently from 1adfe92 to 1c33bc8 Compare March 25, 2026 21:49
@ericm-db ericm-db changed the title from "[SS] Integrate checkpoint V2 with auto-repair snapshot" to "[SPARK-56216][SS] Integrate checkpoint V2 with auto-repair snapshot" Mar 25, 2026
…atch

Instead of wrapping loadWithCheckpointId's snapshot load in a try-catch
that delegates to AutoSnapshotLoader on failure, route V2 through
AutoSnapshotLoader from the start (like V1 already does).

Added a beforeRepair() callback to AutoSnapshotLoader (default no-op)
that is called once when repair begins. V2 overrides it to enrich the
lineage via getFullLineage(). getEligibleSnapshots() is re-called after
beforeRepair() so the enriched lineage is picked up automatically.

This eliminates the try-catch and makes V2 follow the same structural
pattern as V1: AutoSnapshotLoader drives the entire load path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
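
The `beforeRepair()` hook described in the commit message above might look roughly like the sketch below. The class names, method signatures, and the use of plain `Long` versions are assumptions made for illustration; this is not the real `AutoSnapshotLoader` API. The sketch shows the two points the commit message makes: the hook is a no-op by default and runs once when repair begins, and the eligible snapshots are fetched again afterwards so an enriched lineage is picked up.

```scala
import scala.util.control.NonFatal

// Hypothetical, simplified loader; not the real AutoSnapshotLoader.
abstract class SketchSnapshotLoader {
  /** Snapshot versions that may be tried, newest first. */
  protected def getEligibleSnapshots(): Seq[Long]

  /** Loads one snapshot (and replays changelogs); throws if it is unusable. */
  protected def loadSnapshot(version: Long): Unit

  /** Called once, when the first load attempt fails. Default: no-op. */
  protected def beforeRepair(): Unit = ()

  /** Returns the snapshot version that was finally loaded. */
  def load(): Long = {
    val first = getEligibleSnapshots().headOption.getOrElse(0L)
    try { loadSnapshot(first); first }
    catch {
      case NonFatal(e) =>
        beforeRepair()
        // Re-read the candidates so anything beforeRepair() added is visible,
        // then fall back through older versions, ending at version 0.
        val candidates = (getEligibleSnapshots() :+ 0L).distinct.filter(_ < first)
        candidates.find { v =>
          try { loadSnapshot(v); true } catch { case NonFatal(_) => false }
        }.getOrElse(throw e)
    }
  }
}

// Hypothetical V2-style subclass: starts with a short lineage and widens it on repair.
final class V2SketchLoader(
    snapshotsOnDisk: Seq[Long],
    corrupt: Set[Long],
    shortLineage: Set[Long],
    fullLineage: Set[Long]) extends SketchSnapshotLoader {

  private var lineage: Set[Long] = shortLineage

  override protected def getEligibleSnapshots(): Seq[Long] =
    snapshotsOnDisk.filter(lineage.contains).sorted(Ordering[Long].reverse)

  override protected def loadSnapshot(version: Long): Unit =
    if (corrupt.contains(version)) throw new RuntimeException(s"corrupt snapshot $version")

  // Stand-in for enriching the lineage via getFullLineage().
  override protected def beforeRepair(): Unit = { lineage = fullLineage }
}
```

With snapshots at versions 2, 4, and 6 on disk, a short lineage containing only 6, and snapshot 6 corrupt, `load()` in this sketch calls `beforeRepair()`, sees versions 4 and 2 in the widened lineage, and loads version 4.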
@ericm-db ericm-db force-pushed the integrate-v2-auto-repair-snapshot branch from 1c33bc8 to 3d9d889 Compare March 25, 2026 22:13
@ericm-db ericm-db marked this pull request as ready for review March 25, 2026 22:20
- Fix setMaxSeenVersion to use target version instead of snapshot version
- Add logWarning when getFullLineage fails during auto-repair
- Remove dead loadedSnapshotUniqueId variable
- Propagate enriched lineage back from loadSnapshotWithCheckpointId
- Restore version 0 assertion for stateStoreCkptId
- Add tests: maxChangeFileReplay limit, load-after-repair roundtrip,
  no snapshots + full replay, getFullLineage failure handling

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@ericm-db ericm-db force-pushed the integrate-v2-auto-repair-snapshot branch from b63d359 to 1580bde Compare March 26, 2026 00:07