From 6824cd772622915f74e9631ab68ec2b6000fe29a Mon Sep 17 00:00:00 2001 From: "mintlify[bot]" <109931778+mintlify[bot]@users.noreply.github.com> Date: Mon, 1 Jun 2026 09:55:39 +0000 Subject: [PATCH 1/4] docs: document multi-worker DataLoader support for remote tables --- docs/training/torch.mdx | 77 +++++++++++++++++++++++++++++++++++++++++ 1 file changed, 77 insertions(+) diff --git a/docs/training/torch.mdx b/docs/training/torch.mdx index 9a9dbe5b..e9c571f1 100644 --- a/docs/training/torch.mdx +++ b/docs/training/torch.mdx @@ -96,3 +96,80 @@ dataloader = torch.utils.data.DataLoader( for batch in dataloader: print(batch.schema) ``` + +## Using multiple DataLoader workers + +PyTorch's `DataLoader` can fan out reads across worker processes by setting `num_workers > 0`. LanceDB tables and `Permutation` objects are picklable, so each worker reopens its own connection after the worker process starts. + +Because LanceDB is multi-threaded internally, use the `spawn` start method (not `fork`) when running with multiple workers. See [the performance guide](/performance) for more on safe multiprocessing patterns. + +```py Python icon=Python +from lancedb.permutation import Permutation + +permutation = Permutation.identity(table) +dataloader = torch.utils.data.DataLoader( + permutation, + batch_size=1024, + shuffle=True, + num_workers=4, + multiprocessing_context="spawn", + persistent_workers=True, +) +``` + +### Remote tables in DataLoader workers + +Tables opened from a remote LanceDB Enterprise connection (`db://...`) also work with multi-worker DataLoaders. The connection details needed to reopen the table — `db_url`, `api_key`, `region`, `host_override`, and the serializable parts of `client_config` — travel with the pickled table and are used to rebuild the connection in each worker. 
+ +```py Python icon=Python +import lancedb +from lancedb.permutation import Permutation + +db = lancedb.connect( + "db://my-database", + api_key="sk-...", + region="us-east-1", +) +table = db.open_table("my_table") + +permutation = Permutation.identity(table).select_columns(["id", "image"]) +dataloader = torch.utils.data.DataLoader( + permutation, + batch_size=512, + num_workers=4, + multiprocessing_context="spawn", +) +``` + + +This embeds the API key in the pickle sent to each worker. If you'd rather load credentials inside the worker — for example, from an environment variable or a secret manager — use the connection factory escape hatch described below. A factory is also required when your `client_config` uses a non-serializable `header_provider`. + + +### Providing a custom connection factory + +`Permutation.with_connection_factory` lets you control how each worker reopens the base table. The factory takes the base table name and returns a LanceDB table. It must be picklable, which in practice means a top-level function, a `functools.partial` of one, or an instance of a picklable class with `__call__` — lambdas and closures over local variables will not work. 
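The picklability rule above can be checked without LanceDB or PyTorch. The following is a minimal sketch using only the standard library. The names `open_table`, `TableOpener`, `make_closure`, and `is_picklable` are hypothetical and exist only for illustration, and the returned strings stand in for the table a real factory would return.

```python
import functools
import pickle


def open_table(name: str, region: str = "us-east-1") -> str:
    # Hypothetical stand-in for a connection factory. A real factory would
    # connect and return a LanceDB table instead of a string.
    return f"{region}/{name}"


class TableOpener:
    """Callable instance whose state is plain, picklable attributes."""

    def __init__(self, region: str):
        self.region = region

    def __call__(self, name: str) -> str:
        return open_table(name, region=self.region)


def make_closure(region: str):
    # Returns a function defined inside another function; pickle cannot
    # locate it by qualified name, so it cannot be sent to a worker.
    def inner(name: str) -> str:
        return open_table(name, region=region)

    return inner


def is_picklable(obj) -> bool:
    # pickle raises different exception types depending on the Python
    # version and the kind of object, so catch the common ones.
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


candidates = {
    "top-level function": open_table,
    "functools.partial": functools.partial(open_table, region="eu-west-1"),
    "callable instance": TableOpener("eu-west-1"),
    "lambda": lambda name: open_table(name),
    "closure": make_closure("eu-west-1"),
}
results = {label: is_picklable(obj) for label, obj in candidates.items()}

for label, ok in results.items():
    print(f"{label}: {'picklable' if ok else 'not picklable'}")
```

When run as a script, the first three candidates report as picklable and the lambda and closure do not. Running a check like `is_picklable` on a factory before handing it to a `DataLoader` with `spawn` workers surfaces the problem in the parent process rather than at worker startup.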
+ +```py Python icon=Python +import os +import lancedb +from lancedb.permutation import Permutation + +def open_table(name: str): + db = lancedb.connect( + "db://my-database", + api_key=os.environ["LANCEDB_API_KEY"], + region="us-east-1", + ) + return db.open_table(name) + +permutation = ( + Permutation.identity(table) + .with_connection_factory(open_table) +) +dataloader = torch.utils.data.DataLoader( + permutation, + batch_size=512, + num_workers=4, + multiprocessing_context="spawn", +) +``` From f449aa6b91f31098a88ddda2ed7dbae301021e56 Mon Sep 17 00:00:00 2001 From: prrao87 <35005448+prrao87@users.noreply.github.com> Date: Tue, 9 Jun 2026 13:32:53 -0400 Subject: [PATCH 2/4] docs: tighten multi-worker torch guidance --- docs/training/torch.mdx | 18 ++++++++++-------- 1 file changed, 10 insertions(+), 8 deletions(-) diff --git a/docs/training/torch.mdx b/docs/training/torch.mdx index e9c571f1..ba50db34 100644 --- a/docs/training/torch.mdx +++ b/docs/training/torch.mdx @@ -99,11 +99,12 @@ for batch in dataloader: ## Using multiple DataLoader workers -PyTorch's `DataLoader` can fan out reads across worker processes by setting `num_workers > 0`. LanceDB tables and `Permutation` objects are picklable, so each worker reopens its own connection after the worker process starts. +Set `num_workers > 0` to read from LanceDB in multiple PyTorch worker processes. LanceDB tables and `Permutation` objects are picklable, so each worker reopens the table after it starts. -Because LanceDB is multi-threaded internally, use the `spawn` start method (not `fork`) when running with multiple workers. See [the performance guide](/performance) for more on safe multiprocessing patterns. +Prefer the `spawn` start method when using multiple workers; LanceDB uses internal threads. See [the performance guide](/performance) for more multiprocessing guidance. 
```py Python icon=Python +import torch from lancedb.permutation import Permutation permutation = Permutation.identity(table) @@ -119,11 +120,11 @@ dataloader = torch.utils.data.DataLoader( ### Remote tables in DataLoader workers -Tables opened from a remote LanceDB Enterprise connection (`db://...`) also work with multi-worker DataLoaders. The connection details needed to reopen the table — `db_url`, `api_key`, `region`, `host_override`, and the serializable parts of `client_config` — travel with the pickled table and are used to rebuild the connection in each worker. +Remote LanceDB Enterprise tables (`db://...`) work the same way: workers reopen the table from the pickled connection state. ```py Python icon=Python import lancedb -from lancedb.permutation import Permutation +import torch db = lancedb.connect( "db://my-database", @@ -132,9 +133,8 @@ db = lancedb.connect( ) table = db.open_table("my_table") -permutation = Permutation.identity(table).select_columns(["id", "image"]) dataloader = torch.utils.data.DataLoader( - permutation, + table, batch_size=512, num_workers=4, multiprocessing_context="spawn", @@ -142,16 +142,17 @@ dataloader = torch.utils.data.DataLoader( ``` -This embeds the API key in the pickle sent to each worker. If you'd rather load credentials inside the worker — for example, from an environment variable or a secret manager — use the connection factory escape hatch described below. A factory is also required when your `client_config` uses a non-serializable `header_provider`. +This sends the connection state, including the API key, to each worker. Use a connection factory if credentials should be loaded inside the worker or your `client_config` contains a non-serializable `header_provider`. ### Providing a custom connection factory -`Permutation.with_connection_factory` lets you control how each worker reopens the base table. The factory takes the base table name and returns a LanceDB table. 
It must be picklable, which in practice means a top-level function, a `functools.partial` of one, or an instance of a picklable class with `__call__` — lambdas and closures over local variables will not work. +`Permutation.with_connection_factory` lets each worker reopen the base table with custom logic. The factory takes the table name, returns a LanceDB table, and must be picklable. ```py Python icon=Python import os import lancedb +import torch from lancedb.permutation import Permutation def open_table(name: str): @@ -162,6 +163,7 @@ def open_table(name: str): ) return db.open_table(name) +table = open_table("my_table") permutation = ( Permutation.identity(table) .with_connection_factory(open_table) From 70291f29a2ff31fbae6bf31272d2eaabe5656a27 Mon Sep 17 00:00:00 2001 From: prrao87 <35005448+prrao87@users.noreply.github.com> Date: Thu, 2 Jul 2026 14:21:24 -0400 Subject: [PATCH 3/4] docs: fix PyTorch DataLoader table collation examples --- docs/training/torch.mdx | 20 ++++++++++++++------ 1 file changed, 14 insertions(+), 6 deletions(-) diff --git a/docs/training/torch.mdx b/docs/training/torch.mdx index ba50db34..1c5d1626 100644 --- a/docs/training/torch.mdx +++ b/docs/training/torch.mdx @@ -17,13 +17,14 @@ The `Table` class in LanceDB implements a contract for a PyTorch import lancedb import torch import pyarrow as pa +from lancedb.util import tbl_to_tensor mem_db = lancedb.connect("memory://") table = mem_db.create_table("test_table", pa.table({"a": range(1000)})) # Any LanceDB table can be used as a PyTorch Dataset dataloader = torch.utils.data.DataLoader( - table, batch_size=1024, shuffle=True + table, batch_size=1024, shuffle=True, collate_fn=tbl_to_tensor ) for batch in dataloader: @@ -42,12 +43,17 @@ dataloader = torch.utils.data.DataLoader(permutation) ## Output Formats -By default, a `Table` data loader will emit a `pyarrow.RecordBatch`. To convert to a different format (such as a -`pytorch.Tensor`), you will need to provide a custom collate function. 
+By default, a `Table` data loader will emit Arrow data. PyTorch calls the `collate_fn` argument to turn the fetched +items into one batch. Its default collate function only knows how to combine tensors, NumPy arrays, numbers, dicts, and +lists, so it does not accept Arrow data directly. Direct `Table` data loaders should provide a custom collate function +such as `lancedb.util.tbl_to_tensor`, which converts numeric Arrow columns into a column-major `torch.Tensor` with shape +`(columns, rows)`. -The `Permutation` class is more flexible. By default, the output will be a list of dicts. This is the default output -format of standard data loaders and usually more convenient when you are getting started. However, there is a -significant performance penalty converting from Arrow, Lance's internal representation, to this default format. +`Permutation` works differently: its default output is a list of Python dicts, which PyTorch's default collate function +can batch into a dict of tensors. This is usually more convenient when you are getting started. However, there is a +significant performance penalty converting from Arrow, Lance's internal representation, to this default format. Use a +direct `Table` with `collate_fn` when you want Arrow-to-tensor conversion, or a `Permutation` when you want the default +PyTorch dict-of-tensors behavior. To address this, the `Permutation` class provides a set of builtin transform functions that can be applied to map the Arrow data in different ways. The `arrow` and `polars` formats will always avoid data copies. 
However, `numpy`, @@ -125,6 +131,7 @@ Remote LanceDB Enterprise tables (`db://...`) work the same way: workers reopen ```py Python icon=Python import lancedb import torch +from lancedb.util import tbl_to_tensor db = lancedb.connect( "db://my-database", @@ -138,6 +145,7 @@ dataloader = torch.utils.data.DataLoader( batch_size=512, num_workers=4, multiprocessing_context="spawn", + collate_fn=tbl_to_tensor, ) ``` From 55e040f6b3e145e30e6f100e59927d08c0c4b616 Mon Sep 17 00:00:00 2001 From: prrao87 <35005448+prrao87@users.noreply.github.com> Date: Thu, 2 Jul 2026 14:23:54 -0400 Subject: [PATCH 4/4] docs: clarify PyTorch collate function usage --- docs/training/torch.mdx | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/docs/training/torch.mdx b/docs/training/torch.mdx index 1c5d1626..c04d2760 100644 --- a/docs/training/torch.mdx +++ b/docs/training/torch.mdx @@ -43,11 +43,11 @@ dataloader = torch.utils.data.DataLoader(permutation) ## Output Formats -By default, a `Table` data loader will emit Arrow data. PyTorch calls the `collate_fn` argument to turn the fetched -items into one batch. Its default collate function only knows how to combine tensors, NumPy arrays, numbers, dicts, and -lists, so it does not accept Arrow data directly. Direct `Table` data loaders should provide a custom collate function -such as `lancedb.util.tbl_to_tensor`, which converts numeric Arrow columns into a column-major `torch.Tensor` with shape -`(columns, rows)`. +By default, a `Table` data loader will emit Arrow data. `collate_fn` is PyTorch's batching hook: PyTorch calls it to +turn the fetched items into one batch. PyTorch's default collate function only knows how to combine tensors, NumPy +arrays, numbers, dicts, and lists, so it does not accept Arrow data directly. 
When using a `Table` directly, pass +LanceDB's `lancedb.util.tbl_to_tensor` helper as PyTorch's `collate_fn`; it converts numeric Arrow columns into a +column-major `torch.Tensor` with shape `(columns, rows)`. `Permutation` works differently: its default output is a list of Python dicts, which PyTorch's default collate function can batch into a dict of tensors. This is usually more convenient when you are getting started. However, there is a