From 1d8a329f622c4d9f50977a3971e8aa982379d50b Mon Sep 17 00:00:00 2001
From: Evgeni Burovski
Date: Mon, 15 Dec 2025 11:32:23 +0100
Subject: [PATCH 1/2] DOC: combine DataSources docs/tutorials, remove
 duplicates

---
 docs/data_sources.md                  | 108 +-------------------------
 docs/tutorials/data_sources/index.rst |  68 ++++++++++++++++
 2 files changed, 71 insertions(+), 105 deletions(-)

diff --git a/docs/data_sources.md b/docs/data_sources.md
index 91a1b2239..16cac182d 100644
--- a/docs/data_sources.md
+++ b/docs/data_sources.md
@@ -1,109 +1,7 @@
 # Data Sources
 
 A Grain data source is responsible for retrieving individual records. Records
-could be in a file/storage system or generated on the fly. Data sources need to
-implement the following protocol:
+could be in a file/storage system or generated on the fly.
 
-```python
-class RandomAccessDataSource(Protocol, Generic[T]):
-  """Interface for datasources where storage supports efficient random access."""
-
-  def __len__(self) -> int:
-    """Number of records in the dataset."""
-
-  def __getitem__(self, record_key: SupportsIndex) -> T:
-    """Retrieves record for the given record_key."""
-```
-
-## File Format
-
-Note that the underlying file format/storage system needs to support efficient
-random access. Grain currently supports random-access file format [ArrayRecord](https://github.com/google/array_record).
-
-## Available Data Sources
-
-We provide a variety of data sources for Grain,
-which we discuss in the following sections.
-
-### Range Data Source
-
-This data source mimics the built-in Python
-[range class](https://docs.python.org/3/library/functions.html#func-range). It
-can be used for initial Grain testing or if your use case involves generating
-records on the fly (for example if you only want to generate synthetic records
-online rather than read records from storage.)
-
-```python
-range_data_source = grain.python.RangeDataSource(start=1, stop=10, step=2)
-print(list(range_data_source))  # prints [1, 3, 5, 7, 9]
-```
-
-### ArrayRecord Data Source
-
-This is a data source for [ArrayRecord](https://github.com/google/array_record) files.
-The data source accepts a single/list of PathLike or File Instruction objects.
-
-PathLike are objects implementing
-[os.PathLike](https://docs.python.org/3/library/os.html#os.PathLike). For these
-objects, the data source starts by opening the files to get the number of
-records in each file. It uses this information to build a global index over all
-files.
-
-On the other hand, File Instruction objects are objects implementing the
-following protocol:
-
-```python
-class FileInstruction(Protocol):
-  """Protocol with same interface as FileInstruction objects returned by Tfds."""
-
-  filename: str
-  skip: int  # Number of examples in the beginning of the shard to skip.
-  take: int  # Number of examples to include.
-  examples_in_shard: int  # Total number of records in the shard.
-```
-
-File instruction objects enable a few use cases:
-
-*   Selecting only a subset of records within a file.
-*   Saving startup time when the file sizes are known in advance (since the data
-    source skips opening files in that case.)
-
-### TFDS Data Source
-
-TFDS provides Grain compatible data sources via `tfds.data_source()`.
-Arguments are equivalent to `tfds.load()`. For more information see
-
-```python
-tfds_data_source = tfds.data_source("imagenet2012", split="train[:75%]")
-```
-
-### Parquet Data Source
-
-This data source reads [Parquet](https://parquet.apache.org/docs/) files,
-accepting a file path within any PyArrow-supported
-[file system](https://arrow.apache.org/docs/python/api/filesystems.html).
-
-```python
-parquet_data_source = grain.experimental.ParquetIterDataset(path="/.parquet")
-```
-
-## Implement your own Data Source
-
-You can implement your own data source and use it with Grain. It needs to
-implement the `RandomAccessDataSource` protocol defined above. In addition, you
-need to pay attention to the following:
-
-*   **Data Sources should be pickleable.** This is because in the multi-worker
-    setting, data sources are pickled and sent to child processes, where each
-    child process reads only the records it needs to process. File reader
-    objects are usually not pickleable. In our data sources, we implement
-    `__getstate__` and `__setstate__` to ensure that file readers aren't part of
-    the state when the data source is pickled, but rather are recreated upon
-    unpickling.
-*   **Open file handles should be closed after use.** Data sources typically
-    open underlying files in order to read records from them. We recommend
-    implementing data sources as context managers that close their open file
-    handles within the `__exit__` method. When opening a data source, the
-    `DataLoader` will first attempt to use the data source as a context manager.
-    If the data source doesn't implement the context manager protocol, it will
-    be used as-is, without a `with` statement.
+We provide a variety of data sources for Grain, which we discuss in the
+[tutorials](tutorials/data_sources/index) section.
diff --git a/docs/tutorials/data_sources/index.rst b/docs/tutorials/data_sources/index.rst
index 775fca9d9..1b74e94df 100644
--- a/docs/tutorials/data_sources/index.rst
+++ b/docs/tutorials/data_sources/index.rst
@@ -3,6 +3,74 @@
 Data Sources
 ============
 
+A `Grain` data source is responsible for retrieving individual records. Records
+could be in a file/storage system or generated on the fly. Data sources need to
+implement the following protocol:
+
+.. code-block:: python
+
+    class RandomAccessDataSource(Protocol, Generic[T]):
+        """Interface for datasources where storage supports efficient random access."""
+
+        def __len__(self) -> int:
+            """Number of records in the dataset."""
+
+        def __getitem__(self, record_key: SupportsIndex) -> T:
+            """Retrieves record for the given record_key."""
+
+
+File formats and available Data Sources
+---------------------------------------
+
+The underlying file format/storage system needs to support efficient random access.
+We provide a variety of data sources for `Grain`, which we discuss in the
+:ref:`tutorials-label` section below.
+
+
+Range Data Source
+-----------------
+
+This data source mimics the built-in Python
+`range class <https://docs.python.org/3/library/functions.html#func-range>`_. It
+can be used for initial `Grain` testing or if your use case involves generating
+records on the fly (for example if you only want to generate synthetic records
+online rather than read records from storage).
+
+.. code-block:: python
+
+    range_data_source = grain.python.RangeDataSource(start=1, stop=10, step=2)
+    print(list(range_data_source))  # prints [1, 3, 5, 7, 9]
+
+
+Implement your own Data Source
+------------------------------
+
+You can implement your own data source and use it with `Grain`. It needs to
+implement the ``RandomAccessDataSource`` protocol defined above. In addition, you
+need to pay attention to the following:
+
+* **Data Sources should be pickleable.** This is because in the multi-worker
+  setting, data sources are pickled and sent to child processes, where each
+  child process reads only the records it needs to process. File reader
+  objects are usually not pickleable. In our data sources, we implement
+  ``__getstate__`` and ``__setstate__`` to ensure that file readers aren't part of
+  the state when the data source is pickled, but rather are recreated upon
+  unpickling.
+* **Open file handles should be closed after use.** Data sources typically
+  open underlying files in order to read records from them. We recommend
+  implementing data sources as context managers that close their open file
+  handles within the ``__exit__`` method. When opening a data source, the
+  ``DataLoader`` will first attempt to use the data source as a context manager.
+  If the data source doesn't implement the context manager protocol, it will
+  be used as-is, without a ``with`` statement.
+
+
+
+.. _tutorials-label:
+
+Tutorials
+=========
+
 This section contains tutorials for using Grain to read data from various sources.
 
 .. toctree::

From f8e852bbf43832d1ec91640219d2428a56684c1a Mon Sep 17 00:00:00 2001
From: Evgeni Burovski
Date: Mon, 22 Dec 2025 13:19:55 +0100
Subject: [PATCH 2/2] address review comments

---
 docs/data_sources.md                  |  7 ------
 docs/index.md                         |  3 +--
 docs/tutorials/data_sources/index.rst | 35 ++++++++++++++++++++++-----
 3 files changed, 30 insertions(+), 15 deletions(-)
 delete mode 100644 docs/data_sources.md

diff --git a/docs/data_sources.md b/docs/data_sources.md
deleted file mode 100644
index 16cac182d..000000000
--- a/docs/data_sources.md
+++ /dev/null
@@ -1,7 +0,0 @@
-# Data Sources
-
-A Grain data source is responsible for retrieving individual records. Records
-could be in a file/storage system or generated on the fly.
-
-We provide a variety of data sources for Grain, which we discuss in the
-[tutorials](tutorials/data_sources/index) section.
diff --git a/docs/index.md b/docs/index.md
index 447904900..0b126c9d1 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -46,7 +46,6 @@ not depend on TensorFlow.
 :caption: Get started
 
 installation
 api_choice
-data_sources
 behind_the_scenes
 ```
@@ -83,4 +82,4 @@ changelog
 :hidden:
 :caption: Contributor guides
 CONTRIBUTING
-```
\ No newline at end of file
+```
diff --git a/docs/tutorials/data_sources/index.rst b/docs/tutorials/data_sources/index.rst
index 1b74e94df..719283e14 100644
--- a/docs/tutorials/data_sources/index.rst
+++ b/docs/tutorials/data_sources/index.rst
@@ -4,8 +4,11 @@ Data Sources
 ============
 
 A `Grain` data source is responsible for retrieving individual records. Records
-could be in a file/storage system or generated on the fly. Data sources need to
-implement the following protocol:
+could be in a file/storage system or generated on the fly. There are two main kinds of
+data sources: those supporting efficient random access, and those supporting only
+sequential access, which can be iterated over.
+
+Data sources with random access need to implement the following protocol:
 
 .. code-block:: python
 
@@ -18,6 +21,17 @@ implement the following protocol:
         def __getitem__(self, record_key: SupportsIndex) -> T:
             """Retrieves record for the given record_key."""
 
+Data sources / datasets with no random access should implement the ``grain.IterDataset``
+interface (see the *Dataset basics* page for further details):
+
+.. code-block:: python
+
+    class IterDataset(_Dataset, Iterable[T]):
+        """Interface for datasets which can be iterated over."""
+        def __iter__(self) -> DatasetIterator[T]:
+            """Returns an iterator for this dataset."""
+
+
 File formats and available Data Sources
 ---------------------------------------
 
@@ -46,8 +60,8 @@ Implement your own Data Source
 ------------------------------
 
 You can implement your own data source and use it with `Grain`. It needs to
-implement the ``RandomAccessDataSource`` protocol defined above. In addition, you
-need to pay attention to the following:
+implement either the ``RandomAccessDataSource`` or the ``IterDataset`` protocol
+defined above. In addition, you need to pay attention to the following:
 
 * **Data Sources should be pickleable.** This is because in the multi-worker
   setting, data sources are pickled and sent to child processes, where each
   child process reads only the records it needs to process. File reader
@@ -79,7 +93,16 @@ This section contains tutorials for using Grain to read data from various source
 
    parquet_dataset_tutorial.md
   arrayrecord_data_source_tutorial.md
   bagz_data_source_tutorial.md
-  load_from_s3_tutorial.md
-  load_from_gcs_tutorial.md
   huggingface_dataset_tutorial.md
   pytorch_dataset_tutorial.md
+
+File systems
+------------
+
+.. toctree::
+   :maxdepth: 1
+
+   load_from_s3_tutorial.md
+   load_from_gcs_tutorial.md
+
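As a reviewer's note, the pickling and file-handle guidance that both patches carry over can be sketched as a minimal custom data source. This is an illustrative example, not part of the patch: the `LineFileSource` class, its one-record-per-line file layout, and every name in it are hypothetical, but it follows the `RandomAccessDataSource` protocol and the `__getstate__`/`__setstate__` and context-manager patterns the documentation describes.

```python
import os
import pickle
import tempfile


class LineFileSource:
    """Hypothetical random-access source: one record per line of a text file."""

    def __init__(self, path):
        self._path = path
        self._file = None  # opened lazily, so the object stays pickleable
        # Index the byte offset of every line once, so __getitem__ is one seek.
        self._offsets = []
        with open(path, "rb") as f:
            pos = 0
            line = f.readline()
            while line:
                self._offsets.append(pos)
                pos = f.tell()
                line = f.readline()

    def __len__(self):
        return len(self._offsets)

    def __getitem__(self, record_key):
        if self._file is None:
            self._file = open(self._path, "rb")  # recreated after unpickling
        self._file.seek(self._offsets[record_key])
        return self._file.readline().rstrip(b"\n")

    # Keep the open file handle out of the pickled state, as the docs recommend.
    def __getstate__(self):
        state = self.__dict__.copy()
        state["_file"] = None
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)

    # Context-manager protocol: close the handle in __exit__, as the docs recommend.
    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        if self._file is not None:
            self._file.close()
            self._file = None


# Usage sketch with a temporary file:
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "records.txt")
    with open(path, "w") as f:
        f.write("alpha\nbeta\ngamma\n")

    with LineFileSource(path) as source:
        print(len(source))   # 3
        print(source[1])     # b'beta'
        # Pickling works because __getstate__ drops the file handle.
        clone = pickle.loads(pickle.dumps(source))
        with clone:
            print(clone[2])  # b'gamma'
```

Because the handle is recreated lazily in `__getitem__`, the unpickled copy in a worker process opens its own file, which is exactly why the docs advise excluding readers from the pickled state.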