[Data] Support incremental reads for Iceberg tables by tanmayrauth · Pull Request #64485 · ray-project/ray

tanmayrauth · 2026-07-01T21:41:59Z

Extend ray.data.read_iceberg(...) with start_snapshot_id / end_snapshot_id to read only the rows appended between two Iceberg snapshots, built on PyIceberg's Table.incremental_append_scan(...) (apache/iceberg-python#3512).

- IcebergDatasource: incremental-scan branch reusing the existing plan_files()/projection() read path; validates mutual exclusivity with snapshot_id; guards on the PyIceberg capability with a clear error.
- read_iceberg: additive params, docstring, and example.
- Tests: validation, capability guard (runs on all PyIceberg versions), and functional incremental reads (skipped until the pin ships apache/iceberg-python#3512).

Related to #64464

Signed-off-by: Tanmay Rauth <t_rauth@apple.com>

gemini-code-assist

Code Review

This pull request adds support for incremental append scans in the Iceberg datasource by introducing start_snapshot_id and end_snapshot_id parameters to read_iceberg and IcebergDatasource. It includes compatibility checks for PyIceberg and comprehensive unit tests. The feedback recommends raising a clear ValueError when end_snapshot_id is provided without start_snapshot_id to prevent undefined behavior, along with adding a corresponding unit test to verify this validation.

gemini-code-assist · 2026-07-01T21:44:18Z

+        # An incremental (append) scan is requested when either snapshot bound is set.
+        self._incremental = start_snapshot_id is not None or end_snapshot_id is not None
+        if snapshot_id is not None and self._incremental:
+            raise ValueError(
+                "`snapshot_id` is mutually exclusive with `start_snapshot_id` / "
+                "`end_snapshot_id`. Pass `snapshot_id` to read a single snapshot, or "
+                "the start/end pair to read data appended between two snapshots."
+            )
+        self._start_snapshot_id = start_snapshot_id
+        self._end_snapshot_id = end_snapshot_id


An incremental read is not well-defined without a starting snapshot ID. If a user provides end_snapshot_id but omits start_snapshot_id, we should raise a clear ValueError to prevent silent misbehavior or confusing errors from PyIceberg's incremental_append_scan.

Suggested change

         # An incremental (append) scan is requested when either snapshot bound is set.
         self._incremental = start_snapshot_id is not None or end_snapshot_id is not None
         if snapshot_id is not None and self._incremental:
             raise ValueError(
                 "`snapshot_id` is mutually exclusive with `start_snapshot_id` / "
                 "`end_snapshot_id`. Pass `snapshot_id` to read a single snapshot, or "
                 "the start/end pair to read data appended between two snapshots."
             )
+        if self._incremental and start_snapshot_id is None:
+            raise ValueError(
+                "`start_snapshot_id` must be provided when performing an incremental read."
+            )
         self._start_snapshot_id = start_snapshot_id
         self._end_snapshot_id = end_snapshot_id
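
To make the combined rules easier to reason about outside the datasource, here is a standalone sketch of the same checks. `validate_snapshot_bounds` is a hypothetical helper, not part of Ray; it only mirrors the two conditions in the suggestion above, with the first error message shortened.

```python
from typing import Optional


def validate_snapshot_bounds(
    snapshot_id: Optional[int] = None,
    start_snapshot_id: Optional[int] = None,
    end_snapshot_id: Optional[int] = None,
) -> bool:
    """Return True if an incremental scan is requested, False otherwise.

    Raises ValueError for argument combinations the suggestion rejects.
    """
    # Either bound being set signals an incremental (append) scan.
    incremental = start_snapshot_id is not None or end_snapshot_id is not None

    # A single-snapshot read and an incremental read cannot be mixed.
    if snapshot_id is not None and incremental:
        raise ValueError(
            "`snapshot_id` is mutually exclusive with `start_snapshot_id` / "
            "`end_snapshot_id`."
        )

    # An end bound alone leaves the range without a starting point.
    if incremental and start_snapshot_id is None:
        raise ValueError(
            "`start_snapshot_id` must be provided when performing an incremental read."
        )

    return incremental


def rejected(**kwargs) -> bool:
    """Report whether the given combination raises ValueError."""
    try:
        validate_snapshot_bounds(**kwargs)
    except ValueError:
        return True
    return False


results = {
    "no bounds": validate_snapshot_bounds(),
    "single snapshot": validate_snapshot_bounds(snapshot_id=123),
    "start only": validate_snapshot_bounds(start_snapshot_id=456),
    "start and end": validate_snapshot_bounds(start_snapshot_id=456, end_snapshot_id=789),
    "end only rejected": rejected(end_snapshot_id=789),
    "snapshot plus start rejected": rejected(snapshot_id=123, start_snapshot_id=456),
}
```

With these rules, only the last two combinations raise; the first two are non-incremental reads and the middle two are incremental.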

gemini-code-assist · 2026-07-01T21:44:19Z

+def test_incremental_snapshot_id_mutually_exclusive():
+    """`snapshot_id` cannot be combined with the incremental snapshot bounds."""
+    with pytest.raises(ValueError, match="mutually exclusive"):
+        IcebergDatasource(
+            table_identifier=f"{_DB_NAME}.{_TABLE_NAME}",
+            snapshot_id=123,
+            start_snapshot_id=456,
+            catalog_kwargs=_CATALOG_KWARGS.copy(),
+        )
+
+    with pytest.raises(ValueError, match="mutually exclusive"):
+        IcebergDatasource(
+            table_identifier=f"{_DB_NAME}.{_TABLE_NAME}",
+            snapshot_id=123,
+            end_snapshot_id=789,
+            catalog_kwargs=_CATALOG_KWARGS.copy(),
+        )


Add a test case to verify that a ValueError is raised when end_snapshot_id is provided without start_snapshot_id.

def test_incremental_snapshot_id_mutually_exclusive():
    """`snapshot_id` cannot be combined with the incremental snapshot bounds."""
    with pytest.raises(ValueError, match="mutually exclusive"):
        IcebergDatasource(
            table_identifier=f"{_DB_NAME}.{_TABLE_NAME}",
            snapshot_id=123,
            start_snapshot_id=456,
            catalog_kwargs=_CATALOG_KWARGS.copy(),
        )

    with pytest.raises(ValueError, match="mutually exclusive"):
        IcebergDatasource(
            table_identifier=f"{_DB_NAME}.{_TABLE_NAME}",
            snapshot_id=123,
            end_snapshot_id=789,
            catalog_kwargs=_CATALOG_KWARGS.copy(),
        )

    # The error message wraps the parameter name in backticks, so match on the
    # part of the message that follows it.
    with pytest.raises(ValueError, match="must be provided"):
        IcebergDatasource(
            table_identifier=f"{_DB_NAME}.{_TABLE_NAME}",
            end_snapshot_id=789,
            catalog_kwargs=_CATALOG_KWARGS.copy(),
        )
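
Since `pytest.raises(match=...)` applies `re.search` to the exception text, the backticks in the suggested error message matter when picking a pattern. The short check below uses the message text from the suggested change and compares three candidate patterns; the variable names are illustrative only.

```python
import re

# Message text as written in the suggested ValueError.
message = "`start_snapshot_id` must be provided when performing an incremental read."

# The bare parameter name followed by a space never occurs: a backtick sits in between.
bare_name = re.search("start_snapshot_id must be provided", message)

# Matching only the text after the parameter name sidesteps the backticks.
suffix_only = re.search("must be provided", message)

# Including the backticks (escaped for safety) matches the message exactly as written.
with_backticks = re.search(re.escape("`start_snapshot_id` must be provided"), message)

pattern_matches = {
    "bare_name": bare_name is not None,
    "suffix_only": suffix_only is not None,
    "with_backticks": with_backticks is not None,
}
```

Either of the last two patterns is a reasonable choice for the test; the first one would make the assertion fail even though the right exception is raised.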

tanmayrauth · 2026-07-02T05:07:42Z

@abrarsheikh @rueian Can you please review this when you get a chance?

tanmayrauth requested a review from a team as a code owner July 1, 2026 21:42

ray-gardener Bot added the docs, data, and community-contribution labels Jul 2, 2026

tanmayrauth wants to merge 1 commit into ray-project:master from tanmayrauth:data/iceberg-incremental-read-64464
