Make crucible-downstairs dump tolerant of MissingContextSlot errors#1905
Make crucible-downstairs dump tolerant of MissingContextSlot errors#1905
Conversation
When a raw extent is dirty and its block data does not match either stored context slot, opening the region fails with MissingContextSlot. Previously this caused the dump subcommand to exit with an error even if a healthy extent was requested. Refactor Region::open to share setup logic with a new open_for_dump function that uses try_open_extents internally. try_open_extents tolerates MissingContextSlot errors per extent, leaving failed extents as ExtentState::Closed and returning a list of (ExtentId, error) to the caller. dump_region now uses open_for_dump and checks the error list before proceeding: if the user requested a specific extent or block that maps to a failed extent, it returns an error; otherwise it prints a warning and skips the failed extent. show_extent and show_extent_block are updated similarly. verify_region is unchanged and still fails on any corrupt extent. Five tests are added covering: open_for_dump skipping a corrupt extent, dump succeeding when a corrupt extent is not requested, dump failing when a corrupt extent is specifically requested by extent number or block number, and dump of a good extent succeeding when another extent is corrupt.
Are you seeing this in the wild? This falls into the "really shouldn't happen" category of events: we must always have a valid context slot, otherwise the data can be lost! |
downstairs/src/region.rs
Outdated
| /// | ||
| /// Any error other than `MissingContextSlot` (e.g. missing files, I/O | ||
| /// errors, version mismatches) is still fatal. | ||
| pub fn open_for_dump<P: AsRef<Path>>( |
There was a problem hiding this comment.
Since this is only used in downstairs/src/dump.rs, could it be pub(crate)?
| } | ||
| self.extents.push(ExtentState::Opened(extent)); | ||
| } | ||
| Err(e) => match e.downcast::<CrucibleError>() { |
There was a problem hiding this comment.
It makes me sad that we're using anyhow in library code – the typical pattern is thiserror or similar for strongly-typed errors in library code, then anyhow at the top level of application code (where multiple errors types must be combined). This isn't a problem introduced in the PR, though!
| /// [`ExtentState::Closed`] (preserving Vec indices) and the error is | ||
| /// collected into the returned Vec. Any other error causes an | ||
| /// immediate failure. | ||
| fn try_open_extents(&mut self) -> Result<Vec<(ExtentId, CrucibleError)>> { |
There was a problem hiding this comment.
To avoid duplicate code, you could write open_extents in terms of try_open_extents, something like
fn open_extents(&mut self) -> Result<()> {
let errs = self.try_open_extents()?;
if let Some((_id, err)) = errs.first() {
Err(err)
} else {
Ok(())
}
}
downstairs/src/dump.rs
Outdated
| // Open Region read only | ||
| let mut region = Region::open(dir, false, true, &log)?; | ||
| // Open Region read only, tolerating MissingContextSlot on | ||
| // extents other than the one we're comparing. |
There was a problem hiding this comment.
This comment seems locally incorrect (same with the one below), because we're discarding all invalid extents reported by open_for_dump.
There was a problem hiding this comment.
I removed the comment and just handle the error better here and below, see 4d4cc95
| */ | ||
| pub fn open<P: AsRef<Path>>( | ||
| /// Read region config and validate versions, returning a `Region` with | ||
| /// an empty extents Vec. Callers must open extents separately. |
There was a problem hiding this comment.
Vibes: should this instead return a Region with an extent Vec that's all closed?
I don't remember what invariants we try to maintain for the Region object, but I could imagine "the extents vec is sized based on the number of extents" would be a reasonable one.
Yeah... 😢 #1906 |
Remove wrong comment, but also check the result for open_for_dump and handle it better if we can't open our extent instead of a panic.
When a raw extent is dirty and its block data does not match either stored context slot, opening the region fails with MissingContextSlot. Previously this caused the dump subcommand to exit with an error even if a healthy extent was requested.
Refactor Region::open to share setup logic with a new open_for_dump function that uses try_open_extents internally. try_open_extents tolerates MissingContextSlot errors per extent, leaving failed extents as ExtentState::Closed and returning a list of (ExtentId, error) to the caller.
dump_region now uses open_for_dump and checks the error list before proceeding: if the user requested a specific extent or block that maps to a failed extent, it returns an error; otherwise it prints a warning and skips the failed extent. show_extent and show_extent_block are updated similarly.
verify_region is unchanged and still fails on any corrupt extent.
Five tests are added covering: open_for_dump skipping a corrupt extent, dump succeeding when a corrupt extent is not requested, dump failing when a corrupt extent is specifically requested by extent number or block number, and dump of a good extent succeeding when another extent is corrupt.