Skip to content

feat(transaction): Add SnapshotValidator#2590

Open
CTTY wants to merge 2 commits into
apache:mainfrom
CTTY:ctty/snapshot-validator
Open

feat(transaction): Add SnapshotValidator#2590
CTTY wants to merge 2 commits into
apache:mainfrom
CTTY:ctty/snapshot-validator

Conversation

@CTTY

@CTTY CTTY commented Jun 5, 2026

Copy link
Copy Markdown
Collaborator

Which issue does this PR close?

Here is a diagram to picture what I have in mind about the class heirachy for the snapshot producing.

flowchart TB
    subgraph entry["Transaction entry point"]
        TA["«trait» TransactionAction"]
        FAA["FastAppendAction"]
        RFA["RewriteFilesAction<br/>(planned)"]
    end

    subgraph behavior["Behavior traits — passed into SnapshotProducer::commit"]
        SPO["«trait» SnapshotProduceOperation<br/>what to include / tombstone"]
        SV["«trait» SnapshotValidator<br/>conflict checks (default no-op)"]
        MP["«trait» ManifestProcess<br/>how to assemble manifests"]
    end

    subgraph impls["Concrete behavior"]
        FAO["FastAppendOperation"]
        RFO["RewriteFilesOperation<br/>(planned)"]
        DMP["DefaultManifestProcess"]
    end

    SP["SnapshotProducer<br/>(engine)"]

    %% who implements the entry trait
    FAA -- impl --> TA
    RFA -. impl planned .-> TA

    %% supertrait
    SPO -- "requires (supertrait)" --> SV

    %% concrete impls of behavior traits
    FAO -- impl --> SPO
    RFO -. impl planned .-> SPO
    DMP -- impl --> MP

    %% actions drive the producer with an operation + manifest process
    FAA -- "commit() builds" --> SP
    SP -- "uses OP" --> SPO
    SP -- "uses MP" --> MP
    SP -- "calls validate() via supertrait" --> SV

    FAA -. "passes" .-> FAO
    FAA -. "passes" .-> DMP
    RFA -. "passes (planned)" .-> RFO


Loading

What changes are included in this PR?

  • Added a new trait SnapshotValidator to help validate the snapshot to avoid conflicts

Are these changes tested?

Not yet, this should be tested when concrete implementations that validate history or no new delete files are added

let metadata = base.metadata_ref();
let snapshots = ancestors_between(&metadata, to_snapshot_id, from_snapshot_id);

for current_snapshot in snapshots {

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This and other for-loops have opportunity to be parallelized using runtime, we can revisit this once we are aligned on the general design or as a follow-up

/// # Errors
///
/// Returns an error if the history between the snapshots cannot be determined.
#[allow(dead_code)]

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

marking them as dead_code for now and they will be used by delete or rewrite actions later on

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add some unit tests?

use crate::{Error, ErrorKind};

/// Operations whose snapshots may add delete files.
static VALIDATE_ADDED_DELETE_FILES_OPERATIONS: Lazy<HashSet<Operation>> =

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: The price for pay for defining hash set is the changes in public-api.txt etc. No strong opinion but it seems that we could avoid it by defining

/// Operations whose snapshots may add delete files.
fn adds_delete_files(op: &Operation) -> bool {
    matches!(op, Operation::Overwrite | Operation::Delete)
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, I think having a reference to that function improves readability anyway.

Maybe not worth it since I don't expect a new operation type anytime soon, but I was thinking along the lines of this.

/// Operations whose snapshots may add delete files.
fn adds_delete_files(op: &Operation) -> bool {
    match op {
        Operation::Append | Operation::Replace => false,
        Operation::Overwrite | Operation::Delete => true,
    }
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this mirrors the Java implementation, however I find it confusing as a replace operation will add delete files, when it rewrites the delete files to retain deletes it could not apply.

What we're really checking for here is that no new deletes apply to the data files we've rewritten.

Is adds_deletes(op: &Operation) better here?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I fully understand, if we add multiple APIs to Operation like this, how do we tell which API to call in SnapshotValidator?

Right now we only need one implementation of validation_history across multiple table actions, and I think this is convenient. But would be happy to switch to the cleaner design if it doesn't really complicate the implementation

/// # Errors
///
/// Returns an error if the history between the snapshots cannot be determined.
#[allow(dead_code)]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add some unit tests?

@blackmwk

blackmwk commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

I'm still quite confused about how the overall things works. I think the key point is the design of MergingSnapshotProduder, which contains a lot of indices to speed up the confliction check.

@dannycjones dannycjones left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks Shawn!

Comment on lines +52 to +66
/// Validates a snapshot against a table.
///
/// # Arguments
///
/// * `base` - The base table to validate against.
/// * `parent_snapshot_id` - The ID of the parent snapshot, if any. This is usually
/// the latest snapshot of the base table, unless it's a non-main branch
/// (note: writing to branches is not currently supported).
///
/// # Returns
///
/// A `Result` indicating success or an error if validation fails.
async fn validate(&self, _base: &Table, _parent_snapshot_id: Option<i64>) -> Result<()> {
Ok(())
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not leave this abstract, such that append is documented as requiring no validation in its implementation?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that would also work. I don't really have a preference here

use crate::{Error, ErrorKind};

/// Operations whose snapshots may add delete files.
static VALIDATE_ADDED_DELETE_FILES_OPERATIONS: Lazy<HashSet<Operation>> =

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, I think having a reference to that function improves readability anyway.

Maybe not worth it since I don't expect a new operation type anytime soon, but I was thinking along the lines of this.

/// Operations whose snapshots may add delete files.
fn adds_delete_files(op: &Operation) -> bool {
    match op {
        Operation::Append | Operation::Replace => false,
        Operation::Overwrite | Operation::Delete => true,
    }
}

use crate::{Error, ErrorKind};

/// Operations whose snapshots may add delete files.
static VALIDATE_ADDED_DELETE_FILES_OPERATIONS: Lazy<HashSet<Operation>> =

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this mirrors the Java implementation, however I find it confusing as a replace operation will add delete files, when it rewrites the delete files to retain deletes it could not apply.

What we're really checking for here is that no new deletes apply to the data files we've rewritten.

Is adds_deletes(op: &Operation) better here?

Comment on lines +221 to +224
match base.metadata().snapshot_by_id(from_snapshot_id) {
Some(snapshot) => snapshot.sequence_number(),
None => INITIAL_SEQUENCE_NUMBER,
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it not an error if we cannot find the snapshot associated with from_snapshot_id? Is it safe to default to INITIAL_SEQUENCE_NUMBER here?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It shouldn't hurt if we validate against the entire snapshot history. the conflict rate would be higher if we were trying to pass an invalid starting snapshot id

Comment on lines +240 to +246
return Err(Error::new(
ErrorKind::DataInvalid,
format!(
"Cannot commit, found new positional delete for added data file: {}",
data_file.file_path
),
));

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The data files are not 'added', right?

I was reviewing the Java implementation, the wording is: "Cannot commit, found new position delete for replaced data file: ". Shall we align with that?

https://github.com/apache/iceberg/blob/bead8dfab3249cbc6ba7713e862ff2698fd2122c/core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java#L538-L549

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think "replaced data file" is more precise

@CTTY

CTTY commented Jun 8, 2026

Copy link
Copy Markdown
Collaborator Author

I think the key point is the design of MergingSnapshotProduder, which contains a lot of indices to speed up the confliction check.

I think you meant DeleteFileIndex in java implementation and the conflictDetectionFilter. In rust's DeleteFileIndex, we don't allow filter to be pushed down at this point, so I didn't include that change.

In the future, we could store the filter in snapshot operations like RowDeltaAction and pass the filter to SnapshotValidator::validate_no_new_deletes easily.

The current implementation won't block that change, we will only need to change the API in SnapshotValidator after adding conflict_detecting_filter support to DeleteFileIndex

@blackmwk

blackmwk commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

I think the key point is the design of MergingSnapshotProduder, which contains a lot of indices to speed up the confliction check.

I think you meant DeleteFileIndex in java implementation and the conflictDetectionFilter. In rust's DeleteFileIndex, we don't allow filter to be pushed down at this point, so I didn't include that change.

In the future, we could store the filter in snapshot operations like RowDeltaAction and pass the filter to SnapshotValidator::validate_no_new_deletes easily.

The current implementation won't block that change, we will only need to change the API in SnapshotValidator after adding conflict_detecting_filter support to DeleteFileIndex

No, I mean ManifestMergeManager, ManifestFilterManager, which are critial data structures enable efficient concurrency. Also I don't understand what a crate private SnapshotValidator is used for?
If you look at java api, each snapshot tx action interface contains interface to allow standalone check. For example, RowDelta has methods like validateDeletedFiles, validateNoConflictingDataFiles, validateNoConflictingDeleteFiles, but they are just marks about what to check during committing. This design is quite flexible since they allow different checks in differnt isolation levels.

@CTTY

CTTY commented Jun 9, 2026

Copy link
Copy Markdown
Collaborator Author
  • ManifestMergeManager: This should be handled by the existing ManifestProcess, we can implement a MergingManifestProcess
  • ManifestFilterManager: It should live in each transaction operation imo. We can add another trait ManifestFilterManagement and some accessors like fn filter_manager() and fn delete_filter_manager to standardize it. But I think this is more of an implementation details. Basically the filter manager can be built in each operation similar to what we have in Java
struct RowDelta {
  fn removeRows(file: DataFile) {
   ...
   self.delete_data_file(file)
}
}

impl ManifestFilterManagement for RowDelta {
  fn delete_data_file(file: DataFile) {
   self.filter_manager().delete_data_file(file);
  }
}
  • If we are adding RowDelta in the future, it can implements its own validation logic by implementing the SnapshotValidator trait. These validation flags can also be set in the action like the following. It will still be additive
row_delta.validate_deleted_files() // set the flag to be true in operation

...
impl SnapshotValidator for RowDelta {
 fn validate() {
    if self.validate_deleted_files {...}
 }
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Implement SnapshotValidator

4 participants