Skip to content

fix: [DO NOT MERGE] Iceberg scan duplicates rows when splitting a single-row-group Parquet file into multiple byte-range tasks#4621

Draft
mbutrovich wants to merge 3 commits into
apache:mainfrom
mbutrovich:iceberg_midpoint_fix
Draft

fix: [DO NOT MERGE] Iceberg scan duplicates rows when splitting a single-row-group Parquet file into multiple byte-range tasks#4621
mbutrovich wants to merge 3 commits into
apache:mainfrom
mbutrovich:iceberg_midpoint_fix

Conversation

@mbutrovich

Copy link
Copy Markdown
Contributor

Which issue does this PR close?

Closes #4590.

Rationale for this change

The native Iceberg scan returns duplicate rows when an Iceberg FileScanTask splits a data file into byte ranges smaller than a row group. Spark/Iceberg planning tiles a file by split-size regardless of row-group layout, so a single row group can be covered by several FileScanTasks. The root cause is in iceberg-rust's ArrowReader: it selected a row group for every task byte range that overlapped it, so each overlapping split read the row group and its rows were duplicated.

Comet only forwards the start/length Spark planned; serialization and the native round-trip are correct. The fix belongs in iceberg-rust, which now assigns each row group to the single split whose range contains the row group's midpoint (matching parquet-java semantics). See apache/iceberg-rust#2614 and the fix in apache/iceberg-rust#2615.

The Parquet path (native_datafusion) is unaffected: arrow-rs/DataFusion already use midpoint ownership, confirmed by the regression test added here.

What changes are included in this PR?

  • Bump the iceberg / iceberg-storage-opendal dependency to a rev that includes the row-group midpoint fix.
  • Adapt IcebergScanExec to the updated iceberg-rust API: ArrowReaderBuilder::new now takes an iceberg::Runtime. execute() builds the reader on the Spark executor thread (outside any tokio context), so the runtime is constructed from Comet's shared global runtime via IcebergRuntime::new(get_runtime()) rather than Runtime::current().
  • Add regression tests for both scan paths.

Note

Draft: native/Cargo.toml currently points at mbutrovich/iceberg-rust branch filescantask_midpoint so CI can exercise apache/iceberg-rust#2615 before it merges. This will be repointed to apache/iceberg-rust at the merged SHA before this PR leaves draft.

How are these changes tested?

Two new tests, both verified to fail before the fix and pass after:

  • CometIcebergNativeSuite "native Iceberg scan does not duplicate a row group split by byte range" — writes a single-row-group file, registers it as an Iceberg table, reads it with split-size/file-open-cost of 64 to force sub-row-group splits, and asserts the matching row is returned exactly once. Returned 2 rows before the fix.
  • CometNativeReaderSuite "native scan does not duplicate a row group split by byte range" — the same scenario on the native_datafusion Parquet path, guarding against regression. Asserts the file is actually split into multiple partitions and still returns each row once.

@mbutrovich mbutrovich added this to the 0.17.0 milestone Jun 10, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Comet native Iceberg scan duplicates rows when splitting a single-row-group Parquet file into multiple byte-range tasks

1 participant