Skip to content

Make crucible-downstairs dump tolerant of MissingContextSlot errors#1905

Open
leftwo wants to merge 5 commits intomainfrom
alan/downstairs-dump-tolerant
Open

Make crucible-downstairs dump tolerant of MissingContextSlot errors#1905
leftwo wants to merge 5 commits intomainfrom
alan/downstairs-dump-tolerant

Conversation

@leftwo
Copy link
Contributor

@leftwo leftwo commented Mar 14, 2026

When a raw extent is dirty and its block data does not match either stored context slot, opening the region fails with MissingContextSlot. Previously this caused the dump subcommand to exit with an error even if a healthy extent was requested.

Refactor Region::open to share setup logic with a new open_for_dump function that uses try_open_extents internally. try_open_extents tolerates MissingContextSlot errors per extent, leaving failed extents as ExtentState::Closed and returning a list of (ExtentId, error) to the caller.

dump_region now uses open_for_dump and checks the error list before proceeding: if the user requested a specific extent or block that maps to a failed extent, it returns an error; otherwise it prints a warning and skips the failed extent. show_extent and show_extent_block are updated similarly.

verify_region is unchanged and still fails on any corrupt extent.

Five tests are added covering: open_for_dump skipping a corrupt extent, dump succeeding when a corrupt extent is not requested, dump failing when a corrupt extent is specifically requested by extent number or block number, and dump of a good extent succeeding when another extent is corrupt.

When a raw extent is dirty and its block data does not match either
stored context slot, opening the region fails with MissingContextSlot.
Previously this caused the dump subcommand to exit with an error even
if a healthy extent was requested.

Refactor Region::open to share setup logic with a new open_for_dump
function that uses try_open_extents internally. try_open_extents
tolerates MissingContextSlot errors per extent, leaving failed extents
as ExtentState::Closed and returning a list of (ExtentId, error) to
the caller.

dump_region now uses open_for_dump and checks the error list before
proceeding: if the user requested a specific extent or block that maps
to a failed extent, it returns an error; otherwise it prints a warning
and skips the failed extent. show_extent and show_extent_block are
updated similarly.

verify_region is unchanged and still fails on any corrupt extent.

Five tests are added covering: open_for_dump skipping a corrupt extent,
dump succeeding when a corrupt extent is not requested, dump failing
when a corrupt extent is specifically requested by extent number or
block number, and dump of a good extent succeeding when another extent
is corrupt.
@leftwo leftwo requested review from jmpesp and mkeeter March 15, 2026 15:28
@mkeeter
Copy link
Contributor

mkeeter commented Mar 16, 2026

When a raw extent is dirty and its block data does not match either stored context slot,

Are you seeing this in the wild? This falls into the "really shouldn't happen" category of events: we must always have a valid context slot, otherwise the data can be lost!

///
/// Any error other than `MissingContextSlot` (e.g. missing files, I/O
/// errors, version mismatches) is still fatal.
pub fn open_for_dump<P: AsRef<Path>>(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is only used in downstairs/src/dump.rs, could it be pub(crate)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in ecb4474

}
self.extents.push(ExtentState::Opened(extent));
}
Err(e) => match e.downcast::<CrucibleError>() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes me sad that we're using anyhow in library code – the typical pattern is thiserror or similar for strongly-typed errors in library code, then anyhow at the top level of application code (where multiple errors types must be combined). This isn't a problem introduced in the PR, though!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see your tears.

/// [`ExtentState::Closed`] (preserving Vec indices) and the error is
/// collected into the returned Vec. Any other error causes an
/// immediate failure.
fn try_open_extents(&mut self) -> Result<Vec<(ExtentId, CrucibleError)>> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To avoid duplicate code, you could write open_extents in terms of try_open_extents, something like

fn open_extents(&mut self) -> Result<()> {
    let errs = self.try_open_extents()?;
    if let Some((_id, err)) = errs.first() {
        Err(err)
    } else {
        Ok(())
    }
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 2554368

// Open Region read only
let mut region = Region::open(dir, false, true, &log)?;
// Open Region read only, tolerating MissingContextSlot on
// extents other than the one we're comparing.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment seems locally incorrect (same with the one below), because we're discarding all invalid extents reported by open_for_dump.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the comment and just handle the error better here and below, see 4d4cc95

*/
pub fn open<P: AsRef<Path>>(
/// Read region config and validate versions, returning a `Region` with
/// an empty extents Vec. Callers must open extents separately.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Vibes: should this instead return a Region with an extent Vec that's all closed?

I don't remember what invariants we try to maintain for the Region object, but I could imagine "the extents vec is sized based on the number of extents" would be a reasonable one.

@leftwo
Copy link
Contributor Author

leftwo commented Mar 16, 2026

When a raw extent is dirty and its block data does not match either stored context slot,

Are you seeing this in the wild? This falls into the "really shouldn't happen" category of events: we must always have a valid context slot, otherwise the data can be lost!

Yeah... 😢 #1906

leftwo added 4 commits March 16, 2026 13:13
Remove wrong comment, but also check the result for open_for_dump and
handle it better if we can't open our extent instead of a panic.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants