feat(rust): add cleanup explain API#7147

Draft
yanghua wants to merge 3 commits into lance-format:main from yanghua:cleanup-plan

@yanghua yanghua commented Jun 8, 2026

Collaborator

Summary

  • Add Dataset::cleanup(policy) with two terminal actions:
    • explain() returns a read-only CleanupExplanation
    • execute() re-evaluates current dataset/ref state before deleting files
  • Add explanation details including read version, aggregate removal stats, candidate files, truncation metadata,
    referenced branch details, and warnings.
  • Keep existing Rust execution APIs (cleanup_old_versions, cleanup_with_policy) compatible.
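
A minimal sketch of the two-terminal-action shape described in the summary, using only the standard library. It is not the code in this PR: the real API is async and operates on a Lance dataset, while the `Dataset`, `CleanupPolicy`, and `CleanupBuilder` types and the `retain_versions` field here are hypothetical stand-ins. The point is only that `explain()` reads and `execute()` recomputes before deleting.

```rust
use std::collections::BTreeMap;

/// Hypothetical policy: keep the newest `retain_versions` versions.
#[derive(Clone, Copy)]
pub struct CleanupPolicy {
    pub retain_versions: usize,
}

/// Toy stand-in for a dataset: version number -> files owned by that version.
pub struct Dataset {
    pub versions: BTreeMap<u64, Vec<String>>,
}

/// Read-only report returned by `explain()`.
#[derive(Debug, PartialEq)]
pub struct CleanupExplanation {
    pub read_version: u64,
    pub candidate_files: Vec<String>,
}

/// Builder returned by `Dataset::cleanup`; it holds no precomputed file list.
pub struct CleanupBuilder<'a> {
    dataset: &'a mut Dataset,
    policy: CleanupPolicy,
}

impl Dataset {
    pub fn cleanup(&mut self, policy: CleanupPolicy) -> CleanupBuilder<'_> {
        CleanupBuilder { dataset: self, policy }
    }

    /// Versions that fall outside the retention window, oldest first.
    fn expired_versions(&self, policy: CleanupPolicy) -> Vec<u64> {
        let keep = policy.retain_versions.max(1); // never drop the latest version
        let n = self.versions.len().saturating_sub(keep);
        self.versions.keys().copied().take(n).collect()
    }
}

impl CleanupBuilder<'_> {
    /// Terminal action 1: report what would be removed, without mutating anything.
    pub fn explain(&self) -> CleanupExplanation {
        let expired = self.dataset.expired_versions(self.policy);
        let candidate_files = expired
            .iter()
            .flat_map(|v| self.dataset.versions[v].iter().cloned())
            .collect();
        CleanupExplanation {
            read_version: self.dataset.versions.keys().next_back().copied().unwrap_or(0),
            candidate_files,
        }
    }

    /// Terminal action 2: recompute from current state, then delete.
    /// Returns the number of files removed.
    pub fn execute(self) -> usize {
        let expired = self.dataset.expired_versions(self.policy);
        expired
            .iter()
            .filter_map(|v| self.dataset.versions.remove(v))
            .map(|files| files.len())
            .sum()
    }
}
```

In this sketch `execute()` never accepts a `CleanupExplanation`, so a report cannot be replayed as a deletion list.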

Tests

  • cargo fmt --all
  • cargo test -p lance dataset::cleanup::tests -- --nocapture
  • cargo check -p lance --tests

cargo clippy -p lance --all-targets -- -D warnings was attempted, but currently fails on unrelated pre-existing
lint errors outside this change.

@github-actions github-actions Bot added the enhancement (New feature or request) label Jun 8, 2026

codecov Bot commented Jun 8, 2026

Codecov Report

❌ Patch coverage is 93.08357% with 24 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| rust/lance/src/dataset/cleanup.rs | 93.25% | 13 Missing and 10 partials ⚠️ |
| rust/lance/src/dataset.rs | 83.33% | 0 Missing and 1 partial ⚠️ |

yanghua commented Jun 8, 2026

Collaborator Author

@claude review

Comment thread rust/lance/src/dataset/cleanup.rs Outdated
Comment thread rust/lance/src/dataset/cleanup.rs Outdated
@yanghua yanghua marked this pull request as ready for review June 8, 2026 11:38

@claude claude Bot left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@yanghua yanghua requested a review from Xuanwo June 8, 2026 13:57

@Xuanwo Xuanwo left a comment

Collaborator

I think this should follow SQL EXPLAIN semantics: the cleanup plan should be a dry-run/audit report, not a materialized deletion plan.

Execution should re-evaluate cleanup from the current dataset/ref state instead of trusting an old file list. For example, a tag or branch can be added after planning without advancing the manifest version, so the current read_version check can still pass while the old plan deletes files that are now protected.
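
To make the race concrete, here is a toy model, not the Lance implementation. The `State`, `StalePlan`, and function names are hypothetical, and the rule "a version is protected if it is the latest or tagged" is a simplifying assumption. It shows a stale file list passing a `read_version` check after a tag is added, while re-evaluation keeps the newly protected file.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Toy dataset state. Tags live outside the manifest, so adding one does not
/// change `manifest_version` (the situation described above).
pub struct State {
    pub manifest_version: u64,
    pub files_by_version: BTreeMap<u64, Vec<&'static str>>,
    pub tagged_versions: BTreeSet<u64>,
}

/// A materialized plan: a frozen file list plus the version it was computed at.
pub struct StalePlan {
    pub read_version: u64,
    pub files: Vec<&'static str>,
}

/// Files that are safe to delete right now: everything owned by a version that
/// is neither the latest nor tagged.
pub fn deletable_now(state: &State) -> Vec<&'static str> {
    state
        .files_by_version
        .iter()
        .filter(|(v, _)| **v != state.manifest_version && !state.tagged_versions.contains(*v))
        .flat_map(|(_, files)| files.iter().copied())
        .collect()
}

/// Freeze the current deletable set into a plan.
pub fn plan(state: &State) -> StalePlan {
    StalePlan {
        read_version: state.manifest_version,
        files: deletable_now(state),
    }
}

/// Unsafe variant: trusts the old list if the manifest version is unchanged.
pub fn execute_plan(state: &State, plan: &StalePlan) -> Option<Vec<&'static str>> {
    (plan.read_version == state.manifest_version).then(|| plan.files.clone())
}

/// Safe variant: ignores any earlier report and recomputes from current state.
pub fn execute_fresh(state: &State) -> Vec<&'static str> {
    deletable_now(state)
}
```

If version 1 is tagged after `plan()` runs, `execute_plan` still returns the file owned by version 1, because the version check alone cannot see the new tag. `execute_fresh` drops it from the deletion set.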

yanghua commented Jun 9, 2026

Collaborator Author

Sounds reasonable. My original intent was also a dry run. I think I misunderstood what you meant by "plan" when we talked offline earlier. I will refactor this later.

@yanghua yanghua marked this pull request as draft June 9, 2026 04:45
@yanghua yanghua force-pushed the cleanup-plan branch 2 times, most recently from 4714005 to ca62f3a June 10, 2026 14:48
@yanghua yanghua changed the title from "feat: support planning cleanup" to "feat: support dry-run cleanup" Jun 11, 2026
@yanghua yanghua marked this pull request as ready for review June 11, 2026 03:23

@Xuanwo Xuanwo left a comment

Collaborator

Thanks for refactoring this away from the previous executable CleanupPlan design. I think the current behavior is much closer to the semantics we want.

However, I’m still not sure CleanupMode::{Execute, DryRun} is the right public abstraction. This makes dry-run look like a mode of the destructive cleanup API, while what we want is closer to SQL EXPLAIN: a read-only explanation of what cleanup would do, not an alternate execution mode.

For Rust, I’d prefer a builder-style API that models cleanup as an operation with two terminal actions, for example:

let cleanup = dataset.cleanup(policy);

let explanation = cleanup.explain().await?;
let stats = cleanup.execute().await?;

explain() should return a CleanupExplanation with fields like read_version, RemovalStats, candidate files, referenced-branch details, warnings, and truncation info. It should not be accepted later as an executable deletion plan; execute() should still re-evaluate current dataset/ref state.
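
One possible shape for such an explanation, sketched with the standard library only. Apart from `CleanupExplanation`, `RemovalStats`, and `read_version`, every field and function name below is an assumption, referenced-branch details are left out for brevity, and recording truncation as a warning is just one option. The idea shown is that aggregate stats cover all candidates while the listed files are capped by a `max_files`-style limit.

```rust
/// Aggregate totals; computed over every candidate, even when the listed
/// files are truncated. Field names are assumptions.
#[derive(Debug, Default, PartialEq)]
pub struct RemovalStats {
    pub files_removed: u64,
    pub bytes_removed: u64,
}

/// One file that cleanup would delete (hypothetical shape).
#[derive(Debug, PartialEq)]
pub struct CandidateFile {
    pub path: String,
    pub size_bytes: u64,
}

/// Read-only report; it carries no method that turns it into a deletion.
#[derive(Debug)]
pub struct CleanupExplanation {
    pub read_version: u64,
    pub stats: RemovalStats,
    pub candidate_files: Vec<CandidateFile>,
    pub truncated: bool,
    pub omitted_files: usize,
    pub warnings: Vec<String>,
}

/// Build a report that lists at most `max_files` candidates.
pub fn build_explanation(
    read_version: u64,
    candidates: Vec<CandidateFile>,
    max_files: usize,
) -> CleanupExplanation {
    // Stats are taken before truncation so they describe the full cleanup.
    let stats = RemovalStats {
        files_removed: candidates.len() as u64,
        bytes_removed: candidates.iter().map(|f| f.size_bytes).sum(),
    };
    let total = candidates.len();
    let candidate_files: Vec<_> = candidates.into_iter().take(max_files).collect();
    let omitted_files = total - candidate_files.len();

    let mut warnings = Vec::new();
    if omitted_files > 0 {
        // Assumption: truncation is also surfaced as a human-readable warning.
        warnings.push(format!(
            "file list truncated: {} of {} files omitted",
            omitted_files, total
        ));
    }

    CleanupExplanation {
        read_version,
        stats,
        candidate_files,
        truncated: omitted_files > 0,
        omitted_files,
        warnings,
    }
}
```

Keeping `stats` independent of the cap means a caller can still see the total size of a cleanup when only a sample of files is listed.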

For Python, I think we should keep the existing Python style instead of exposing Rust builders/modes directly:

explanation = ds.explain_cleanup_old_versions(
    older_than=timedelta(days=7),
    retain_versions=3,
    include_files=True,
    max_files=1000,
)

stats = ds.cleanup_old_versions(
    older_than=timedelta(days=7),
    retain_versions=3,
)

This also gives us a cleaner path to extend the same concept to other maintenance operations later, e.g. Rust builder-style dataset.compaction(options).explain() / .execute(), while Python can expose a Pythonic ds.optimize.explain_compaction(...) alongside ds.optimize.compact_files(...). That feels more extensible than adding operation-specific mode enums such as CleanupMode.
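
As a rough illustration of that extensibility argument, the two terminal actions could be captured by a shared trait. This is a sketch under assumptions, not a proposal for the actual Lance types: `MaintenanceOp`, `Cleanup`, `Compaction`, and their fields are hypothetical, and the toy compaction simply merges fragments below a row threshold.

```rust
/// Hypothetical trait: any maintenance operation offers the same two terminal actions.
pub trait MaintenanceOp {
    type Explanation;
    type Stats;
    /// Read-only report of what `execute` would do right now.
    fn explain(&self) -> Self::Explanation;
    /// Perform the operation, re-deriving the work from current state.
    fn execute(self) -> Self::Stats;
}

/// Toy cleanup: drop all but the newest `retain` versions (sorted ascending).
pub struct Cleanup<'a> {
    pub versions: &'a mut Vec<u64>,
    pub retain: usize,
}

impl MaintenanceOp for Cleanup<'_> {
    type Explanation = Vec<u64>; // versions that would be removed
    type Stats = usize; // number of versions removed

    fn explain(&self) -> Vec<u64> {
        let n = self.versions.len().saturating_sub(self.retain);
        self.versions[..n].to_vec()
    }

    fn execute(self) -> usize {
        let n = self.versions.len().saturating_sub(self.retain);
        self.versions.drain(..n).count()
    }
}

/// Toy compaction: merge all fragments smaller than `min_rows` into one.
pub struct Compaction<'a> {
    pub fragment_rows: &'a mut Vec<u64>,
    pub min_rows: u64,
}

impl Compaction<'_> {
    /// Number of fragments that would be merged; merging needs at least two.
    fn small_count(&self) -> usize {
        let n = self.fragment_rows.iter().filter(|r| **r < self.min_rows).count();
        if n >= 2 { n } else { 0 }
    }
}

impl MaintenanceOp for Compaction<'_> {
    type Explanation = usize; // fragments that would be merged
    type Stats = usize; // fragments actually merged

    fn explain(&self) -> usize {
        self.small_count()
    }

    fn execute(self) -> usize {
        let merged = self.small_count();
        if merged == 0 {
            return 0;
        }
        let min_rows = self.min_rows;
        let total: u64 = self.fragment_rows.iter().filter(|r| **r < min_rows).sum();
        self.fragment_rows.retain(|r| *r >= min_rows);
        self.fragment_rows.push(total);
        merged
    }
}
```

With a shape like this, adding a new operation means adding a type, not another operation-specific mode enum.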

yanghua commented Jun 11, 2026

Collaborator Author

Sounds reasonable, good suggestion. Will refactor it.

@yanghua yanghua marked this pull request as draft June 11, 2026 12:17
@yanghua yanghua changed the title from "feat: support dry-run cleanup" to "feat(rust): add cleanup explain API" Jun 12, 2026
Labels

enhancement New feature or request

2 participants