fix(reader): filter_row_groups_by_byte_range duplicates rows for sub-row-group file splits #2615

Open
mbutrovich wants to merge 2 commits into apache:main from mbutrovich:filescantask_midpoint

@mbutrovich mbutrovich commented Jun 10, 2026

Collaborator

Which issue does this PR close?

What changes are included in this PR?

ArrowReader::filter_row_groups_by_byte_range (added in #1779) selected a row group for every FileScanTask byte range that overlapped it. When a data file is split into byte ranges smaller than a row group, multiple tasks overlap the same row group, so each reads it and the rows are duplicated. This surfaced in Apache DataFusion Comet (apache/datafusion-comet#4590), where Spark tiles a file by split-size regardless of row-group layout.
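The duplication can be illustrated with a minimal sketch. This is not the iceberg-rust implementation: `ByteRange`, `overlaps`, `tile`, and `overlap_readers` are hypothetical names, and the offsets used below are made up for illustration.

```rust
/// Half-open byte range [start, end) within a data file (hypothetical type).
#[derive(Clone, Copy, Debug)]
pub struct ByteRange {
    pub start: u64,
    pub end: u64,
}

/// Overlap-based selection: true when the two ranges share at least one byte.
pub fn overlaps(split: ByteRange, row_group: ByteRange) -> bool {
    split.start < row_group.end && row_group.start < split.end
}

/// Tile [0, file_size) into contiguous splits of at most `split_size` bytes,
/// the way an external planner might when it ignores row-group layout.
pub fn tile(file_size: u64, split_size: u64) -> Vec<ByteRange> {
    assert!(split_size > 0, "split_size must be positive");
    let mut splits = Vec::new();
    let mut start = 0;
    while start < file_size {
        let end = (start + split_size).min(file_size);
        splits.push(ByteRange { start, end });
        start = end;
    }
    splits
}

/// Number of splits that would read `row_group` under overlap-based selection.
pub fn overlap_readers(splits: &[ByteRange], row_group: ByteRange) -> usize {
    splits.iter().filter(|s| overlaps(**s, row_group)).count()
}
```

With a 300-byte file tiled into 64-byte splits, a row group occupying bytes [4, 104) overlaps both [0, 64) and [64, 128), so `overlap_readers` reports two readers and its rows would be returned twice.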

This PR switches selection to midpoint ownership, matching parquet-java's ParquetMetadataConverter.filterFileMetaDataByMidpoint: a split owns a row group only if its [start, start+length) range contains the row group's midpoint. Because the splits tile the file contiguously and disjointly, exactly one split contains a given midpoint, so each row group is read once. The comparison is half-open (start <= midpoint < end) so a midpoint landing on a split boundary belongs to the upper split.
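As a rough sketch of the ownership rule described above (not the code in this PR), the check can be written as below. The function names are hypothetical, and the midpoint is assumed to be the row group's start offset plus half its compressed size.

```rust
/// Midpoint of a row group, assumed here to be its start offset plus half
/// of its compressed size.
pub fn midpoint(rg_start: u64, rg_compressed_size: u64) -> u64 {
    rg_start + rg_compressed_size / 2
}

/// A split owns a row group only if start <= midpoint < start + length.
/// The upper bound is exclusive, so a midpoint that lands exactly on a
/// split boundary belongs to the upper split.
pub fn split_owns_row_group(
    split_start: u64,
    split_length: u64,
    rg_start: u64,
    rg_compressed_size: u64,
) -> bool {
    let mid = midpoint(rg_start, rg_compressed_size);
    let split_end = split_start + split_length;
    split_start <= mid && mid < split_end
}
```

For example, a row group at offset 14 with a compressed size of 100 has midpoint 64. With 64-byte splits, [0, 64) does not own it and [64, 128) does, so the boundary case resolves to exactly one owner.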

This is a no-op for whole-file tasks (start=0, length=file_size), which are all that iceberg-rust's own planner emits: every midpoint lies in range, so all row groups are selected. It only changes the externally planned sub-row-group split case, completing the byte-range work from #1779.
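A small sketch of why the whole-file case is unaffected, using a hypothetical `select_by_midpoint` helper and a `(start_offset, compressed_size)` tuple as a stand-in for the Parquet row-group metadata:

```rust
/// (start_offset, compressed_size) of one row group; a hypothetical stand-in
/// for the real row-group metadata.
pub type RowGroup = (u64, u64);

/// Indices of the row groups whose midpoint falls in [start, start + length).
pub fn select_by_midpoint(start: u64, length: u64, row_groups: &[RowGroup]) -> Vec<usize> {
    let end = start + length;
    row_groups
        .iter()
        .enumerate()
        .filter(|&(_, &(rg_start, rg_size))| {
            let mid = rg_start + rg_size / 2;
            start <= mid && mid < end
        })
        .map(|(idx, _)| idx)
        .collect()
}
```

With start=0 and length=file_size, every non-empty row group inside the file has its midpoint in range, so the returned indices cover all row groups, which is the same result the overlap rule gives for a whole-file task.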

Are these changes tested?

Yes. A new test test_sub_row_group_splits_do_not_duplicate_rows writes a 3-row-group file, tiles it into 64-byte splits, reads every split through ArrowReader, and asserts each row appears exactly once. It returns ~2800 rows before the fix and exactly 300 after. The existing test_file_splits_respect_byte_ranges (boundary-aligned splits) continues to pass.
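The invariant the new test checks can be sketched without Parquet or ArrowReader at all. The following is a simplified model, not the test from this PR: `RowGroup` and `rows_read_by_midpoint` are hypothetical, and the offsets are invented rather than taken from the test file.

```rust
/// Simplified row-group description (hypothetical; the real test reads
/// this information from Parquet metadata).
#[derive(Clone, Copy)]
pub struct RowGroup {
    pub start: u64,
    pub compressed_size: u64,
    pub num_rows: u64,
}

/// Tile the file into splits of at most `split_size` bytes and sum the rows
/// of every row group each split owns under the midpoint rule.
pub fn rows_read_by_midpoint(file_size: u64, split_size: u64, row_groups: &[RowGroup]) -> u64 {
    assert!(split_size > 0, "split_size must be positive");
    let mut total = 0;
    let mut start = 0;
    while start < file_size {
        // Same clamping as the test: the last split may be shorter.
        let length = split_size.min(file_size - start);
        let end = start + length;
        for rg in row_groups {
            let mid = rg.start + rg.compressed_size / 2;
            if start <= mid && mid < end {
                total += rg.num_rows;
            }
        }
        start = end;
    }
    total
}
```

For three row groups of 100 rows each, the total stays at 300 for any positive split size, because the splits tile the file and each midpoint falls in exactly one of them.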

Running Iceberg Java tests via Comet against this change in apache/datafusion-comet#4621.

///
/// Iceberg splits large files at row group boundaries, so we only read row groups
/// whose byte ranges overlap with [start, start+length).
/// External engines (e.g. Spark via Comet) split a data file into multiple scan tasks,


I think the comment could be updated to reflect the facts: in most (normal) cases, Iceberg Parquet files are split at row-group boundaries. Files are split at the requested split size only if the splitOffsets metadata is missing at planning time.

Collaborator Author

That's good context, thank you!

let length = split_size.min(file_size - start);
let task = FileScanTask::builder()
.with_file_size_in_bytes(file_size)
.with_start(start)


It might be helpful to test the case where the start of a file scan task falls exactly on the midpoint of a row group.

Collaborator Author

Will do, good suggestion!

@advancedxy advancedxy left a comment


LGTM, the fix looks reasonable. I provided some more context from the iceberg side.

@blackmwk blackmwk left a comment

Contributor

Hi @mbutrovich, I don't think this is a problem with the row filter. Instead, it looks like a bug in the task planning algorithm.

@advancedxy


> Hi @mbutrovich, I don't think this is a problem with the row filter. Instead, it looks like a bug in the task planning algorithm.

Hi @blackmwk, I don't think the task planning algorithm can always produce the correct row-group offsets for Parquet files. As I commented in the linked issue apache/datafusion-comet#4590, the data file entry's split offsets can be missing, which is possible when a writer misbehaves or the entry was created manually.

@blackmwk blackmwk dismissed their stale review June 11, 2026 08:48

I rechecked iceberg java's behavior and this change aligns with it.

@blackmwk blackmwk left a comment

Contributor

Thanks @mbutrovich for this PR, and @advancedxy for the review! It would be nice if @advancedxy's comments could be addressed.

Development

Successfully merging this pull request may close these issues.

filter_row_groups_by_byte_range duplicates rows for sub-row-group file splits
