fix(reader): filter_row_groups_by_byte_range duplicates rows for sub-row-group file splits#2615
fix(reader): filter_row_groups_by_byte_range duplicates rows for sub-row-group file splits#2615mbutrovich wants to merge 2 commits into
filter_row_groups_by_byte_range duplicates rows for sub-row-group file splits#2615Conversation
| /// | ||
| /// Iceberg splits large files at row group boundaries, so we only read row groups | ||
| /// whose byte ranges overlap with [start, start+length). | ||
| /// External engines (e.g. Spark via Comet) split a data file into multiple scan tasks, |
There was a problem hiding this comment.
I think the comment could be updated to reflect the fact: at most(normal) cases the iceberg parquet files are split at row group boundaries. It only split parquet files at request size if the splitOffsets metadata is missing when planning.
There was a problem hiding this comment.
That's good context, thank you!
| let length = split_size.min(file_size - start); | ||
| let task = FileScanTask::builder() | ||
| .with_file_size_in_bytes(file_size) | ||
| .with_start(start) |
There was a problem hiding this comment.
it might be helpful to test cases where the start of file scan task being exactly of the mid of one row group.
There was a problem hiding this comment.
Will do, good suggestion!
advancedxy
left a comment
There was a problem hiding this comment.
LGTM, the fix looks reasonable. I provided some more context from the iceberg side.
blackmwk
left a comment
There was a problem hiding this comment.
Hi, @mbutrovich I don't think this is the problem of row filter. Instead, this should be a bug of task planning algorithm.
Hi, @blackmwk I don't think the task planning algorithm can always produce the correct row group offset for parquet files. As I commented in the linked issue apache/datafusion-comet#4590, if the datafile entry's split offset is missing which could be possible for bad writer behavior or it's created manually. |
I rechecked iceberg java's behavior and this change aligns with it.
blackmwk
left a comment
There was a problem hiding this comment.
Thanks @mbutrovich for this pr, and @advancedxy for the review! It would be nice if @advancedxy 's comments could be addressed.
Which issue does this PR close?
filter_row_groups_by_byte_rangeduplicates rows for sub-row-group file splits #2614.What changes are included in this PR?
ArrowReader::filter_row_groups_by_byte_range(added in #1779) selected a row group for everyFileScanTaskbyte range that overlapped it. When a data file is split into byte ranges smaller than a row group, multiple tasks overlap the same row group, so each reads it and the rows are duplicated. This surfaced in Apache DataFusion Comet (apache/datafusion-comet#4590), where Spark tiles a file bysplit-sizeregardless of row-group layout.This PR switches selection to midpoint ownership, matching parquet-java's
ParquetMetadataConverter.filterFileMetaDataByMidpoint: a split owns a row group only if its[start, start+length)range contains the row group's midpoint. Because the splits tile the file contiguously and disjointly, exactly one split contains a given midpoint, so each row group is read once. The comparison is half-open (start <= midpoint < end) so a midpoint landing on a split boundary belongs to the upper split.This is a no-op for whole-file tasks (
start=0, length=file_size, all iceberg-rust's own planner emits), since every midpoint lies in range and all row groups are selected. It only changes the externally-planned sub-row-group split case, completing the byte-range work from #1779.Are these changes tested?
Yes. A new test
test_sub_row_group_splits_do_not_duplicate_rowswrites a 3-row-group file, tiles it into 64-byte splits, reads every split throughArrowReader, and asserts each row appears exactly once. It returns ~2800 rows before the fix and exactly 300 after. The existingtest_file_splits_respect_byte_ranges(boundary-aligned splits) continues to pass.Running Iceberg Java tests via Comet against this change in apache/datafusion-comet#4621.