fix(reader): `filter_row_groups_by_byte_range` duplicates rows for sub-row-group file splits by mbutrovich · Pull Request #2615 · apache/iceberg-rust

mbutrovich · 2026-06-10T18:04:07Z

Which issue does this PR close?

Closes #2614 (`filter_row_groups_by_byte_range` duplicates rows for sub-row-group file splits).

What changes are included in this PR?

ArrowReader::filter_row_groups_by_byte_range (added in #1779) selected a row group for every FileScanTask byte range that overlapped it. When a data file is split into byte ranges smaller than a row group, multiple tasks overlap the same row group, so each reads it and the rows are duplicated. This surfaced in Apache DataFusion Comet (apache/datafusion-comet#4590), where Spark tiles a file by split-size regardless of row-group layout.

This PR switches selection to midpoint ownership, matching parquet-java's ParquetMetadataConverter.filterFileMetaDataByMidpoint: a split owns a row group only if its [start, start+length) range contains the row group's midpoint. Because the splits tile the file contiguously and disjointly, exactly one split contains a given midpoint, so each row group is read once. The comparison is half-open (start <= midpoint < end) so a midpoint landing on a split boundary belongs to the upper split.
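As a rough illustration of the difference between the two selection rules, here is a minimal standalone sketch. `RowGroupSpan`, `overlaps`, and `owns_by_midpoint` are hypothetical names for this example, not the iceberg-rust API, and the midpoint is assumed to be the row group's start offset plus half its byte length.

```rust
/// Hypothetical, simplified description of where a row group sits in a file.
#[derive(Debug, Clone, Copy)]
struct RowGroupSpan {
    /// Byte offset at which the row group starts.
    start: u64,
    /// Size of the row group in bytes.
    len: u64,
}

/// Overlap rule (the behaviour being replaced): the split selects the row
/// group if the two byte ranges intersect at all.
fn overlaps(rg: RowGroupSpan, split_start: u64, split_len: u64) -> bool {
    let split_end = split_start + split_len;
    let rg_end = rg.start + rg.len;
    rg.start < split_end && split_start < rg_end
}

/// Midpoint rule: the split selects the row group only if the row group's
/// midpoint lies in the half-open range [split_start, split_start + split_len).
fn owns_by_midpoint(rg: RowGroupSpan, split_start: u64, split_len: u64) -> bool {
    // Assumption: midpoint = start + len / 2 (integer division).
    let midpoint = rg.start + rg.len / 2;
    let split_end = split_start + split_len;
    split_start <= midpoint && midpoint < split_end
}
```

For a row group at bytes 4..104 and splits `[0, 50)`, `[50, 100)`, and `[100, 150)`, `overlaps` is true for all three splits, while `owns_by_midpoint` is true only for `[50, 100)`, which contains the midpoint 54.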

This is a no-op for whole-file tasks (start=0, length=file_size, which is all that iceberg-rust's own planner emits), since every midpoint lies in range and all row groups are selected. It only changes the externally-planned sub-row-group split case, completing the byte-range work from #1779.

Are these changes tested?

Yes. A new test test_sub_row_group_splits_do_not_duplicate_rows writes a 3-row-group file, tiles it into 64-byte splits, reads every split through ArrowReader, and asserts each row appears exactly once. It returns ~2800 rows before the fix and exactly 300 after. The existing test_file_splits_respect_byte_ranges (boundary-aligned splits) continues to pass.
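The idea behind the test can be sketched without Parquet at all: tile a file into fixed-size splits and count how many splits select each row group. This is a simplified model, not the actual test. The helper names (`tile`, `overlap_counts`, `midpoint_counts`) and the `(start, length)` tuples are assumptions made for the example.

```rust
/// Tile [0, file_size) into contiguous, disjoint (start, length) splits.
fn tile(file_size: u64, split_size: u64) -> Vec<(u64, u64)> {
    assert!(split_size > 0, "split_size must be positive");
    let mut splits = Vec::new();
    let mut start = 0;
    while start < file_size {
        // The last split is truncated at the end of the file.
        let length = split_size.min(file_size - start);
        splits.push((start, length));
        start += length;
    }
    splits
}

/// For each row group (start, length), count the splits that overlap it.
fn overlap_counts(row_groups: &[(u64, u64)], splits: &[(u64, u64)]) -> Vec<usize> {
    row_groups
        .iter()
        .map(|&(rg_start, rg_len)| {
            splits
                .iter()
                .filter(|&&(s, l)| rg_start < s + l && s < rg_start + rg_len)
                .count()
        })
        .collect()
}

/// For each row group (start, length), count the splits whose half-open
/// range contains the row group's midpoint.
fn midpoint_counts(row_groups: &[(u64, u64)], splits: &[(u64, u64)]) -> Vec<usize> {
    row_groups
        .iter()
        .map(|&(rg_start, rg_len)| {
            let midpoint = rg_start + rg_len / 2;
            splits
                .iter()
                .filter(|&&(s, l)| s <= midpoint && midpoint < s + l)
                .count()
        })
        .collect()
}
```

With three made-up 300-byte row groups at offsets 4, 304, and 604 in a 1000-byte file tiled by `tile(1000, 64)`, `overlap_counts` reports several splits per row group, whereas `midpoint_counts` reports exactly one split for each.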

Running Iceberg Java tests via Comet against this change in apache/datafusion-comet#4621.

advancedxy · 2026-06-11T06:40:34Z

    ///
-    /// Iceberg splits large files at row group boundaries, so we only read row groups
-    /// whose byte ranges overlap with [start, start+length).
+    /// External engines (e.g. Spark via Comet) split a data file into multiple scan tasks,


I think the comment could be updated to reflect that in most (normal) cases Iceberg Parquet files are split at row group boundaries. Files are only split at the requested split size when the splitOffsets metadata is missing at planning time.

That's good context, thank you!

advancedxy · 2026-06-11T06:50:43Z

+            let length = split_size.min(file_size - start);
+            let task = FileScanTask::builder()
+                .with_file_size_in_bytes(file_size)
+                .with_start(start)


It might be helpful to test the case where the start of a file scan task falls exactly on the midpoint of a row group.

Will do, good suggestion!
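The boundary case raised here can be reasoned about with a small sketch. `owning_split` is a hypothetical helper for this example, not part of the PR. It applies the half-open comparison from the description to find which split, if any, contains a given midpoint.

```rust
/// Return the index of the split whose half-open range
/// [start, start + length) contains `midpoint`, if there is one.
/// Splits are given as hypothetical (start, length) pairs.
fn owning_split(splits: &[(u64, u64)], midpoint: u64) -> Option<usize> {
    splits
        .iter()
        .position(|&(start, length)| start <= midpoint && midpoint < start + length)
}
```

For a row group at bytes 100..300 (midpoint 200) and splits `(0, 200)` and `(200, 200)`, the second split starts exactly on the midpoint. Because the upper bound is exclusive, the first split does not contain 200, so `owning_split` returns `Some(1)`: the upper split owns the row group and the lower one skips it.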

advancedxy

LGTM, the fix looks reasonable. I provided some more context from the iceberg side.

blackmwk

Hi @mbutrovich, I don't think this is a problem with the row filter. Instead, it looks like a bug in the task planning algorithm.

advancedxy · 2026-06-11T08:33:14Z

> Hi, @mbutrovich I don't think this is the problem of row filter. Instead, this should be a bug of task planning algorithm.

Hi @blackmwk, I don't think the task planning algorithm can always produce the correct row group offsets for Parquet files. As I commented in the linked issue apache/datafusion-comet#4590, the data file entry's split offsets can be missing, for example because of bad writer behavior or because the entry was created manually.

I rechecked iceberg java's behavior and this change aligns with it.

blackmwk

Thanks @mbutrovich for this PR, and @advancedxy for the review! It would be nice if @advancedxy's comments could be addressed.

mbutrovich added 2 commits June 10, 2026 13:46

fix(arrow): assign row groups to splits by midpoint, not byte-range overlap

7832d39

refactor test, cleanup

2b27ebe

mbutrovich mentioned this pull request Jun 10, 2026

Comet native Iceberg scan duplicates rows when splitting a single-row-group Parquet file into multiple byte-range tasks apache/datafusion-comet#4590

Open

mbutrovich requested review from CTTY and blackmwk June 10, 2026 18:05

mbutrovich mentioned this pull request Jun 10, 2026

fix: [DO NOT MERGE] Iceberg scan duplicates rows when splitting a single-row-group Parquet file into multiple byte-range tasks apache/datafusion-comet#4621

Draft

advancedxy reviewed Jun 11, 2026

advancedxy approved these changes Jun 11, 2026

blackmwk previously requested changes Jun 11, 2026

blackmwk approved these changes Jun 11, 2026

dannycjones mentioned this pull request Jun 11, 2026

Tracking issues of Iceberg Rust 0.10.0 Release #2527

Open

19 tasks

mbutrovich wants to merge 2 commits into apache:main from mbutrovich:filescantask_midpoint
