From 1d8a329f622c4d9f50977a3971e8aa982379d50b Mon Sep 17 00:00:00 2001
From: Evgeni Burovski
Date: Mon, 15 Dec 2025 11:32:23 +0100
Subject: [PATCH 1/2] DOC: combine DataSources docs/tutorials, remove
 duplicates

---
 docs/data_sources.md                  | 108 +-------------------------
 docs/tutorials/data_sources/index.rst |  68 ++++++++++++++++
 2 files changed, 71 insertions(+), 105 deletions(-)

diff --git a/docs/data_sources.md b/docs/data_sources.md
index 91a1b2239..16cac182d 100644
--- a/docs/data_sources.md
+++ b/docs/data_sources.md
@@ -1,109 +1,7 @@
 # Data Sources
 
 A Grain data source is responsible for retrieving individual records. Records
-could be in a file/storage system or generated on the fly. Data sources need to
-implement the following protocol:
+could be in a file/storage system or generated on the fly.
 
-```python
-class RandomAccessDataSource(Protocol, Generic[T]):
-  """Interface for datasources where storage supports efficient random access."""
-
-  def __len__(self) -> int:
-    """Number of records in the dataset."""
-
-  def __getitem__(self, record_key: SupportsIndex) -> T:
-    """Retrieves record for the given record_key."""
-```
-
-## File Format
-
-Note that the underlying file format/storage system needs to support efficient
-random access. Grain currently supports random-access file format [ArrayRecord](https://github.com/google/array_record).
-
-## Available Data Sources
-
-We provide a variety of data sources for Grain,
-which we discuss in the following sections.
-
-### Range Data Source
-
-This data source mimics the built-in Python
-[range class](https://docs.python.org/3/library/functions.html#func-range). It
-can be used for initial Grain testing or if your use case involves generating
-records on the fly (for example if you only want to generate synthetic records
-online rather than read records from storage.)
-
-```python
-range_data_source = grain.python.RangeDataSource(start=1, stop=10, step=2)
-print(list(range_data_source))  # prints [1, 3, 5, 7, 9]
-```
-
-### ArrayRecord Data Source
-
-This is a data source for [ArrayRecord](https://github.com/google/array_record) files.
-The data source accepts a single/list of PathLike or File Instruction objects.
-
-PathLike are objects implementing
-[os.PathLike](https://docs.python.org/3/library/os.html#os.PathLike). For these
-objects, the data source starts by opening the files to get the number of
-records in each file. It uses this information to build a global index over all
-files.
-
-On the other hand, File Instruction objects are objects implementing the
-following protocol:
-
-```python
-class FileInstruction(Protocol):
-  """Protocol with same interface as FileInstruction objects returned by Tfds."""
-
-  filename: str
-  skip: int  # Number of examples in the beginning of the shard to skip.
-  take: int  # Number of examples to include.
-  examples_in_shard: int  # Total number of records in the shard.
-```
-
-File instruction objects enable a few use cases:
-
-*   Selecting only a subset of records within a file.
-*   Saving startup time when the file sizes are known in advance (since the data
-    source skips opening files in that case.)
-
-### TFDS Data Source
-
-TFDS provides Grain compatible data sources via `tfds.data_source()`.
-Arguments are equivalent to `tfds.load()`. For more information see
-
-```python
-tfds_data_source = tfds.data_source("imagenet2012", split="train[:75%]")
-```
-
-### Parquet Data Source
-
-This data source reads [Parquet](https://parquet.apache.org/docs/) files,
-accepting a file path within any PyArrow-supported
-[file system](https://arrow.apache.org/docs/python/api/filesystems.html).
-
-```python
-parquet_data_source = grain.experimental.ParquetIterDataset(path="/.parquet")
-```
-
-## Implement your own Data Source
-
-You can implement your own data source and use it with Grain. It needs to
-implement the `RandomAccessDataSource` protocol defined above. In addition, you
-need to pay attention to the following:
-
-*   **Data Sources should be pickleable.** This is because in the multi-worker
-    setting, data sources are pickled and sent to child processes, where each
-    child process reads only the records it needs to process. File reader
-    objects are usually not pickleable. In our data sources, we implement
-    `__getstate__` and `__setstate__` to ensure that file readers aren't part of
-    the state when the data source is pickled, but rather are recreated upon
-    unpickling.
-*   **Open file handles should be closed after use.** Data sources typically
-    open underlying files in order to read records from them. We recommend
-    implementing data sources as context managers that close their open file
-    handles within the `__exit__` method. When opening a data source, the
-    `DataLoader` will first attempt to use the data source as a context manager.
-    If the data source doesn't implement the context manager protocol, it will
-    be used as-is, without a `with` statement.
+We provide a variety of data sources for Grain, which we discuss in the
+[tutorials](tutorials/data_sources/index) section.
diff --git a/docs/tutorials/data_sources/index.rst b/docs/tutorials/data_sources/index.rst
index 775fca9d9..1b74e94df 100644
--- a/docs/tutorials/data_sources/index.rst
+++ b/docs/tutorials/data_sources/index.rst
@@ -3,6 +3,74 @@
 Data Sources
 ============
 
+A `Grain` data source is responsible for retrieving individual records. Records
+could be in a file/storage system or generated on the fly. Data sources need to
+implement the following protocol:
+
+.. code-block:: python
+
+    class RandomAccessDataSource(Protocol, Generic[T]):
+        """Interface for datasources where storage supports efficient random access."""
+
+        def __len__(self) -> int:
+            """Number of records in the dataset."""
+
+        def __getitem__(self, record_key: SupportsIndex) -> T:
+            """Retrieves record for the given record_key."""
+
+
+File formats and available Data Sources
+---------------------------------------
+
+The underlying file format/storage system needs to support efficient random access.
+We provide a variety of data sources for `Grain`, which we discuss in the
+:ref:`tutorials-label` section below.
+
+
+Range Data Source
+-----------------
+
+This data source mimics the built-in Python
+`range class <https://docs.python.org/3/library/functions.html#func-range>`_. It
+can be used for initial `Grain` testing or if your use case involves generating
+records on the fly (for example if you only want to generate synthetic records
+online rather than read records from storage).
+
+.. code-block:: python
+
+    range_data_source = grain.python.RangeDataSource(start=1, stop=10, step=2)
+    print(list(range_data_source))  # prints [1, 3, 5, 7, 9]
+
+
+Implement your own Data Source
+------------------------------
+
+You can implement your own data source and use it with `Grain`. It needs to
+implement the ``RandomAccessDataSource`` protocol defined above. In addition, you
+need to pay attention to the following:
+
+* **Data Sources should be pickleable.** This is because in the multi-worker
+  setting, data sources are pickled and sent to child processes, where each
+  child process reads only the records it needs to process. File reader
+  objects are usually not pickleable. In our data sources, we implement
+  ``__getstate__`` and ``__setstate__`` to ensure that file readers aren't part of
+  the state when the data source is pickled, but rather are recreated upon
+  unpickling.
+* **Open file handles should be closed after use.** Data sources typically
+  open underlying files in order to read records from them. We recommend
+  implementing data sources as context managers that close their open file
+  handles within the ``__exit__`` method. When opening a data source, the
+  ``DataLoader`` will first attempt to use the data source as a context manager.
+  If the data source doesn't implement the context manager protocol, it will
+  be used as-is, without a ``with`` statement.
+
+
+
+.. _tutorials-label:
+
+Tutorials
+=========
+
 This section contains tutorials for using Grain to read data from various sources.
 
 .. toctree::

From f8e852bbf43832d1ec91640219d2428a56684c1a Mon Sep 17 00:00:00 2001
From: Evgeni Burovski
Date: Mon, 22 Dec 2025 13:19:55 +0100
Subject: [PATCH 2/2] address review comments

---
 docs/data_sources.md                  |  7 ------
 docs/index.md                         |  3 +--
 docs/tutorials/data_sources/index.rst | 35 ++++++++++++++++++++++-----
 3 files changed, 30 insertions(+), 15 deletions(-)
 delete mode 100644 docs/data_sources.md

diff --git a/docs/data_sources.md b/docs/data_sources.md
deleted file mode 100644
index 16cac182d..000000000
--- a/docs/data_sources.md
+++ /dev/null
@@ -1,7 +0,0 @@
-# Data Sources
-
-A Grain data source is responsible for retrieving individual records. Records
-could be in a file/storage system or generated on the fly.
-
-We provide a variety of data sources for Grain, which we discuss in the
-[tutorials](tutorials/data_sources/index) section.
diff --git a/docs/index.md b/docs/index.md
index 447904900..0b126c9d1 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -46,7 +46,6 @@ not depend on TensorFlow.
 :caption: Get started
 
 installation
 api_choice
-data_sources
 behind_the_scenes
 ```
@@ -83,4 +82,4 @@ changelog
 :hidden:
 :caption: Contributor guides
 CONTRIBUTING
-```
\ No newline at end of file
+```
diff --git a/docs/tutorials/data_sources/index.rst b/docs/tutorials/data_sources/index.rst
index 1b74e94df..719283e14 100644
--- a/docs/tutorials/data_sources/index.rst
+++ b/docs/tutorials/data_sources/index.rst
@@ -4,8 +4,11 @@ Data Sources
 ============
 
 A `Grain` data source is responsible for retrieving individual records. Records
-could be in a file/storage system or generated on the fly. Data sources need to
-implement the following protocol:
+could be in a file/storage system or generated on the fly. There are two main kinds of
+data sources: those supporting efficient random access, and those supporting only
+sequential access, which can be iterated over.
+
+Data sources with random access need to implement the following protocol:
 
 .. code-block:: python
 
@@ -18,6 +21,17 @@ implement the following protocol:
         def __getitem__(self, record_key: SupportsIndex) -> T:
             """Retrieves record for the given record_key."""
 
+Data sources / datasets with no random access should implement the ``grain.IterDataset``
+interface (see the *Dataset basics* page for further details):
+
+.. code-block:: python
+
+    class IterDataset(_Dataset, Iterable[T]):
+        """Interface for datasets which can be iterated over."""
+        def __iter__(self) -> DatasetIterator[T]:
+            """Returns an iterator for this dataset."""
+
+
 File formats and available Data Sources
 ---------------------------------------
 
@@ -46,8 +60,8 @@ Implement your own Data Source
 ------------------------------
 
 You can implement your own data source and use it with `Grain`. It needs to
-implement the ``RandomAccessDataSource`` protocol defined above. In addition, you
-need to pay attention to the following:
+implement either the ``RandomAccessDataSource`` or the ``IterDataset`` protocol
+defined above. In addition, you need to pay attention to the following:
 
 * **Data Sources should be pickleable.** This is because in the multi-worker
   setting, data sources are pickled and sent to child processes, where each
   child process reads only the records it needs to process. File reader
@@ -79,7 +93,16 @@ This section contains tutorials for using Grain to read data from various source
 
    parquet_dataset_tutorial.md
   arrayrecord_data_source_tutorial.md
   bagz_data_source_tutorial.md
-  load_from_s3_tutorial.md
-  load_from_gcs_tutorial.md
   huggingface_dataset_tutorial.md
   pytorch_dataset_tutorial.md
+
+File systems
+------------
+
+.. toctree::
+   :maxdepth: 1
+
+   load_from_s3_tutorial.md
+   load_from_gcs_tutorial.md
+
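As a reviewer's note, the pickling and file-handle guidance that both patches carry over can be sketched as a minimal custom data source. This is an illustrative example, not part of the patch: the `LineFileSource` class, its one-record-per-line file layout, and every name in it are hypothetical, but it follows the `RandomAccessDataSource` protocol and the `__getstate__`/`__setstate__` and context-manager patterns the documentation describes.

```python
import os
import pickle
import tempfile


class LineFileSource:
    """Hypothetical random-access source: one record per line of a text file."""

    def __init__(self, path):
        self._path = path
        self._file = None  # opened lazily, so the object stays pickleable
        # Index the byte offset of every line once, so __getitem__ is one seek.
        self._offsets = []
        with open(path, "rb") as f:
            pos = 0
            line = f.readline()
            while line:
                self._offsets.append(pos)
                pos = f.tell()
                line = f.readline()

    def __len__(self):
        return len(self._offsets)

    def __getitem__(self, record_key):
        if self._file is None:
            self._file = open(self._path, "rb")  # recreated after unpickling
        self._file.seek(self._offsets[record_key])
        return self._file.readline().rstrip(b"\n")

    # Keep the open file handle out of the pickled state, as the docs recommend.
    def __getstate__(self):
        state = self.__dict__.copy()
        state["_file"] = None
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)

    # Context-manager protocol: close the handle in __exit__, as the docs recommend.
    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        if self._file is not None:
            self._file.close()
            self._file = None


# Usage sketch with a temporary file:
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "records.txt")
    with open(path, "w") as f:
        f.write("alpha\nbeta\ngamma\n")

    with LineFileSource(path) as source:
        print(len(source))   # 3
        print(source[1])     # b'beta'
        # Pickling works because __getstate__ drops the file handle.
        clone = pickle.loads(pickle.dumps(source))
        with clone:
            print(clone[2])  # b'gamma'
```

Because the handle is recreated lazily in `__getitem__`, the unpickled copy in a worker process opens its own file, which is exactly why the docs advise excluding readers from the pickled state.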