Reject negative positional delete positions #2631
Although it makes sense to reject negative position values, we might want to be more liberal on the read side. For example, we might want to be able to read negative values in order to repair them.
viirya
left a comment
Confirmed the bug: parse_positional_deletes_record_batch_stream reads positions from an Int64Array and inserts each with pos as u64, so a negative -1 wraps to u64::MAX and silently poisons the DeleteVector's roaring bitmap rather than being rejected. The fix rejects pos < 0 with DataInvalid before the cast, which is the right call — the Iceberg spec defines pos as a non-negative row position, so a negative value means the delete file is corrupt and failing fast is correct.
The check is placed correctly (before pos as u64) and the regression test exercises the exact -1 case. LGTM.
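To make the wraparound concrete, here is a minimal standalone sketch of the two conversions being contrasted. The function names `cast_position` and `checked_position` are hypothetical and are not taken from the PR; only the `as u64` cast and the standard-library `u64::try_from` are real.

```rust
use std::convert::TryFrom;

/// Lossy conversion: what an unchecked `as` cast does to a signed position.
/// A negative value is reinterpreted as a very large unsigned value.
fn cast_position(pos: i64) -> u64 {
    pos as u64
}

/// Checked conversion: returns `None` for a negative position instead of wrapping.
fn checked_position(pos: i64) -> Option<u64> {
    u64::try_from(pos).ok()
}
```

With these, `cast_position(-1)` yields `u64::MAX`, while `checked_position(-1)` yields `None`, which a caller can turn into an error before anything is inserted into the bitmap.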
I checked the Java implementation for comparison. It looks like Java is liberal at the raw positional-delete row layer: The strict validation happens later when the rows are converted into a So I think Rust should follow a similar split:
The current PR fixes the wraparound bug, but because the validation is inside the parser that directly builds |
Thanks @kevinjqliu — I looked into whether the Rust side can mirror Java's split here, and I think this PR is actually validating at the same layer Java does, just with less ceremony. The key difference: Rust has no equivalent of Java's raw So a malformed delete file isn't being made "impossible to inspect" by a path that previously could; no such inspect/repair path exists yet. Building one would mean adding a separate reader that yields raw signed That said, two of your suggestions are worth folding in regardless:
kevinjqliu
left a comment
LGTM
@viirya is right: this takes the file stream and directly creates the DeleteVector, so adding a check here makes sense.
Summary
- Return ErrorKind::DataInvalid instead of casting negative i64 values to huge u64 positions.
- Regression test for pos = -1.

Root cause
The positional-delete parser reads positions from an Int64Array, but inserted each value with pos as u64. That allowed malformed negative positions to wrap into very large row offsets instead of failing validation.

Tests
- cargo fmt --check
- CARGO_REGISTRIES_CRATES_IO_PROTOCOL=sparse cargo test -p iceberg test_parse_positional_deletes_rejects_negative_positions --locked
- CARGO_REGISTRIES_CRATES_IO_PROTOCOL=sparse cargo test -p iceberg arrow::caching_delete_file_loader::tests --locked
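
The validate-before-cast pattern described under Root cause can be sketched without the crate's own types. This is an illustration under stated assumptions, not the PR's code: `InvalidPosition` is a hypothetical stand-in for an error of kind ErrorKind::DataInvalid, `collect_positions` is a hypothetical name, a plain slice stands in for the Int64Array column, and a `BTreeSet<u64>` stands in for the DeleteVector's bitmap.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for an error of kind `ErrorKind::DataInvalid`.
#[derive(Debug, PartialEq, Eq)]
struct InvalidPosition {
    pos: i64,
}

/// Collects delete positions, rejecting the whole batch on the first
/// negative value. The slice stands in for an `Int64Array` column and the
/// `BTreeSet` stands in for the delete vector's bitmap (both assumptions).
fn collect_positions(positions: &[i64]) -> Result<BTreeSet<u64>, InvalidPosition> {
    let mut out = BTreeSet::new();
    for &pos in positions {
        // Validate before the cast so a negative value can never wrap.
        if pos < 0 {
            return Err(InvalidPosition { pos });
        }
        out.insert(pos as u64);
    }
    Ok(out)
}
```

In this sketch, `collect_positions(&[3, -1])` returns an error carrying the offending value and no partial set, mirroring the fail-fast behaviour the regression test checks for pos = -1.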