feat(rust): add cleanup explain API #7147
@claude review
Xuanwo
left a comment
I think this should follow SQL EXPLAIN semantics: the cleanup plan should be a dry-run/audit report, not a materialized deletion plan.
Execution should re-evaluate cleanup from the current dataset/ref state instead of trusting an old file list. For example, a tag or branch can be added after planning without advancing the manifest version, so the current read_version check can still pass while the old plan deletes files that are now protected.
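The race described above can be reproduced in a small, self-contained model. This is only a sketch: `DatasetState`, `StalePlan`, the helper functions, and the file paths are hypothetical stand-ins and not Lance types. It assumes a tag pins a version without bumping the manifest version, as the comment describes.

```rust
use std::collections::BTreeSet;

/// Toy stand-in for dataset state. All names here are hypothetical.
#[derive(Clone, Debug)]
struct DatasetState {
    manifest_version: u64,
    /// (version, path) pairs for files reachable only from old versions.
    old_version_files: Vec<(u64, String)>,
    /// Versions pinned by a tag or branch.
    protected_versions: BTreeSet<u64>,
}

/// Compute the deletable files from the state as it is right now.
fn deletable_files(state: &DatasetState) -> Vec<String> {
    state
        .old_version_files
        .iter()
        .filter(|(v, _)| !state.protected_versions.contains(v))
        .map(|(_, path)| path.clone())
        .collect()
}

/// A materialized plan: a frozen file list plus the version it was read at.
struct StalePlan {
    read_version: u64,
    files: Vec<String>,
}

fn make_plan(state: &DatasetState) -> StalePlan {
    StalePlan {
        read_version: state.manifest_version,
        files: deletable_files(state),
    }
}

/// Unsafe variant: trusts the old file list if the manifest version matches.
fn execute_stale_plan(state: &DatasetState, plan: &StalePlan) -> Option<Vec<String>> {
    if state.manifest_version != plan.read_version {
        return None;
    }
    Some(plan.files.clone())
}

/// Safe variant: ignores any earlier report and recomputes from current state.
fn execute_reevaluating(state: &DatasetState) -> Vec<String> {
    deletable_files(state)
}

/// Returns (files the stale plan would delete, files a fresh evaluation deletes).
fn demo_tag_added_after_planning() -> (Vec<String>, Vec<String>) {
    let mut state = DatasetState {
        manifest_version: 5,
        old_version_files: vec![
            (1, "data/a.lance".to_string()),
            (2, "data/b.lance".to_string()),
        ],
        protected_versions: BTreeSet::new(),
    };
    let plan = make_plan(&state);
    // A tag on version 2 is created after planning; the manifest version is unchanged.
    state.protected_versions.insert(2);
    let stale = execute_stale_plan(&state, &plan).expect("version check still passes");
    let fresh = execute_reevaluating(&state);
    (stale, fresh)
}
```

In this model the stale plan still lists `data/b.lance` even though version 2 is now tagged, while the re-evaluating path leaves it alone.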
Sounds reasonable. My original idea is also for
force-pushed from 4714005 to ca62f3a
Xuanwo
left a comment
Thanks for refactoring this away from the previous executable CleanupPlan design. I think the current behavior is much closer to the semantics we want.
However, I’m still not sure CleanupMode::{Execute, DryRun} is the right public abstraction. This makes dry-run look like a mode of the destructive cleanup API, while what we want is closer to SQL EXPLAIN: a read-only explanation of what cleanup would do, not an alternate execution mode.
For Rust, I’d prefer a builder-style API that models cleanup as an operation with two terminal actions, for example:
    let cleanup = dataset.cleanup(policy);
    let explanation = cleanup.explain().await?;
    let stats = cleanup.execute().await?;

explain() should return a CleanupExplanation with fields like read_version, RemovalStats, candidate files, referenced-branch details, warnings, and truncation info. It should not be accepted later as an executable deletion plan; execute() should still re-evaluate current dataset/ref state.
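To make the shape of that proposal concrete, here is a minimal synchronous sketch of a builder with two terminal actions. It is an illustration under assumptions, not the Lance implementation: the toy `Dataset`, the `retain_versions` policy field, the `files_removed` counter, and the in-memory file list are all hypothetical, the proposed API is async, and referenced-branch details and truncation info are omitted for brevity.

```rust
use std::cell::RefCell;

/// Hypothetical policy: keep the latest `retain_versions` versions.
#[derive(Clone, Debug)]
struct CleanupPolicy {
    retain_versions: u64,
}

#[derive(Debug, PartialEq)]
struct RemovalStats {
    files_removed: u64,
}

/// Read-only report; it carries no capability to delete anything.
#[derive(Debug)]
struct CleanupExplanation {
    read_version: u64,
    stats: RemovalStats,
    candidate_files: Vec<String>,
    warnings: Vec<String>,
}

/// Toy dataset: (version, path) pairs held in memory.
struct Dataset {
    version: u64,
    files: RefCell<Vec<(u64, String)>>,
}

struct Cleanup<'a> {
    dataset: &'a Dataset,
    policy: CleanupPolicy,
}

impl Dataset {
    fn cleanup(&self, policy: CleanupPolicy) -> Cleanup<'_> {
        Cleanup { dataset: self, policy }
    }
}

impl Cleanup<'_> {
    /// Shared evaluation step, always run against the current state.
    fn candidates(&self) -> Vec<String> {
        let cutoff = self.dataset.version.saturating_sub(self.policy.retain_versions);
        self.dataset
            .files
            .borrow()
            .iter()
            .filter(|(v, _)| *v <= cutoff)
            .map(|(_, path)| path.clone())
            .collect()
    }

    /// Terminal action 1: report what would be removed, change nothing.
    fn explain(&self) -> CleanupExplanation {
        let candidate_files = self.candidates();
        let mut warnings = Vec::new();
        if candidate_files.is_empty() {
            warnings.push("nothing to clean up".to_string());
        }
        CleanupExplanation {
            read_version: self.dataset.version,
            stats: RemovalStats { files_removed: candidate_files.len() as u64 },
            candidate_files,
            warnings,
        }
    }

    /// Terminal action 2: re-evaluate, then delete. Takes no explanation as input.
    fn execute(&self) -> RemovalStats {
        let doomed = self.candidates();
        self.dataset.files.borrow_mut().retain(|(_, p)| !doomed.contains(p));
        RemovalStats { files_removed: doomed.len() as u64 }
    }
}

/// Returns (explanation, file count after explain, stats, file count after execute).
fn demo_cleanup() -> (CleanupExplanation, usize, RemovalStats, usize) {
    let dataset = Dataset {
        version: 5,
        files: RefCell::new(vec![
            (1, "v1.lance".to_string()),
            (2, "v2.lance".to_string()),
            (4, "v4.lance".to_string()),
            (5, "v5.lance".to_string()),
        ]),
    };
    let cleanup = dataset.cleanup(CleanupPolicy { retain_versions: 3 });
    let explanation = cleanup.explain();
    let after_explain = dataset.files.borrow().len();
    let stats = cleanup.execute();
    let after_execute = dataset.files.borrow().len();
    (explanation, after_explain, stats, after_execute)
}
```

The point of the design is visible in the signatures: `execute()` has no parameter through which an old `CleanupExplanation` could be replayed.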
For Python, I think we should keep the existing Python style instead of exposing Rust builders/modes directly:
    explanation = ds.explain_cleanup_old_versions(
        older_than=timedelta(days=7),
        retain_versions=3,
        include_files=True,
        max_files=1000,
    )
    stats = ds.cleanup_old_versions(
        older_than=timedelta(days=7),
        retain_versions=3,
    )

This also gives us a cleaner path to extend the same concept to other maintenance operations later, e.g. Rust builder-style dataset.compaction(options).explain() / .execute(), while Python can expose a Pythonic ds.optimize.explain_compaction(...) alongside ds.optimize.compact_files(...). That feels more extensible than adding operation-specific mode enums such as CleanupMode.
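One possible way to picture that extensibility is a shared trait with the same two terminal actions, implemented here for a toy compaction. Nothing in the review proposes this trait; `MaintenanceOp`, `Compaction`, `min_rows`, and the stats fields are hypothetical names used only for illustration, and the model is synchronous.

```rust
use std::cell::RefCell;

/// Hypothetical common shape for maintenance operations.
trait MaintenanceOp {
    type Explanation;
    type Stats;
    /// Read-only report of what `execute` would do right now.
    fn explain(&self) -> Self::Explanation;
    /// Re-evaluates current state, then applies the change.
    fn execute(&self) -> Self::Stats;
}

#[derive(Debug, PartialEq)]
struct CompactionExplanation {
    fragments_to_merge: usize,
    rows_to_rewrite: u64,
}

#[derive(Debug, PartialEq)]
struct CompactionStats {
    fragments_removed: usize,
    fragments_added: usize,
}

/// Toy compaction: merge every fragment with fewer than `min_rows` rows.
struct Compaction<'a> {
    /// Row count per fragment.
    fragments: &'a RefCell<Vec<u64>>,
    min_rows: u64,
}

impl Compaction<'_> {
    fn small_fragments(&self) -> Vec<u64> {
        self.fragments
            .borrow()
            .iter()
            .copied()
            .filter(|rows| *rows < self.min_rows)
            .collect()
    }
}

impl MaintenanceOp for Compaction<'_> {
    type Explanation = CompactionExplanation;
    type Stats = CompactionStats;

    fn explain(&self) -> CompactionExplanation {
        let small = self.small_fragments();
        CompactionExplanation {
            fragments_to_merge: small.len(),
            rows_to_rewrite: small.iter().sum(),
        }
    }

    fn execute(&self) -> CompactionStats {
        let small = self.small_fragments();
        if small.is_empty() {
            return CompactionStats { fragments_removed: 0, fragments_added: 0 };
        }
        let merged: u64 = small.iter().sum();
        let mut fragments = self.fragments.borrow_mut();
        fragments.retain(|rows| *rows >= self.min_rows);
        fragments.push(merged);
        CompactionStats { fragments_removed: small.len(), fragments_added: 1 }
    }
}

/// Returns (explanation, fragments after explain, stats, fragments after execute).
fn demo_compaction() -> (CompactionExplanation, Vec<u64>, CompactionStats, Vec<u64>) {
    let fragments = RefCell::new(vec![10, 1000, 20, 30]);
    let op = Compaction { fragments: &fragments, min_rows: 100 };
    let explanation = op.explain();
    let after_explain = fragments.borrow().clone();
    let stats = op.execute();
    let after_execute = fragments.borrow().clone();
    (explanation, after_explain, stats, after_execute)
}
```

Compared with a per-operation mode enum, each operation here keeps its own explanation and stats types while sharing the explain/execute split.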
Sounds reasonable, good suggestion. Will refactor it.
Summary
- `Dataset::cleanup(policy)` with two terminal actions:
  - `explain()` returns a read-only `CleanupExplanation`
  - `execute()` re-evaluates current dataset/ref state before deleting files
- … referenced branch details, and warnings.
- … `cleanup_old_versions`, `cleanup_with_policy`) compatible.

Tests
- `cargo fmt --all`
- `cargo test -p lance dataset::cleanup::tests -- --nocapture`
- `cargo check -p lance --tests`
- `cargo clippy -p lance --all-targets -- -D warnings` was attempted, but currently fails on unrelated pre-existing lint errors outside this change.