From 6824cd772622915f74e9631ab68ec2b6000fe29a Mon Sep 17 00:00:00 2001
From: "mintlify[bot]" <109931778+mintlify[bot]@users.noreply.github.com>
Date: Mon, 1 Jun 2026 09:55:39 +0000
Subject: [PATCH 1/4] docs: document multi-worker DataLoader support for remote
 tables

---
 docs/training/torch.mdx | 77 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 77 insertions(+)
diff --git a/docs/training/torch.mdx b/docs/training/torch.mdx
index 9a9dbe5b..e9c571f1 100644
--- a/docs/training/torch.mdx
+++ b/docs/training/torch.mdx
@@ -96,3 +96,80 @@ dataloader = torch.utils.data.DataLoader(
 for batch in dataloader:
     print(batch.schema)
 ```
+
+## Using multiple DataLoader workers
+
+PyTorch's `DataLoader` can fan out reads across worker processes by setting `num_workers > 0`. LanceDB tables and `Permutation` objects are picklable, so each worker reopens its own connection after the worker process starts.
+
+Because LanceDB is multi-threaded internally, use the `spawn` start method (not `fork`) when running with multiple workers. See [the performance guide](/performance) for more on safe multiprocessing patterns.
+
+```py Python icon=Python 
+from lancedb.permutation import Permutation
+
+permutation = Permutation.identity(table)
+dataloader = torch.utils.data.DataLoader(
+    permutation,
+    batch_size=1024,
+    shuffle=True,
+    num_workers=4,
+    multiprocessing_context="spawn",
+    persistent_workers=True,
+)
+```
+
+### Remote tables in DataLoader workers
+
+Tables opened from a remote LanceDB Enterprise connection (`db://...`) also work with multi-worker DataLoaders. The connection details needed to reopen the table — `db_url`, `api_key`, `region`, `host_override`, and the serializable parts of `client_config` — travel with the pickled table and are used to rebuild the connection in each worker.
+
+```py Python icon=Python 
+import lancedb
+from lancedb.permutation import Permutation
+
+db = lancedb.connect(
+    "db://my-database",
+    api_key="sk-...",
+    region="us-east-1",
+)
+table = db.open_table("my_table")
+
+permutation = Permutation.identity(table).select_columns(["id", "image"])
+dataloader = torch.utils.data.DataLoader(
+    permutation,
+    batch_size=512,
+    num_workers=4,
+    multiprocessing_context="spawn",
+)
+```
+
+<Note>
+This embeds the API key in the pickle sent to each worker. If you'd rather load credentials inside the worker — for example, from an environment variable or a secret manager — use the connection factory escape hatch described below. A factory is also required when your `client_config` uses a non-serializable `header_provider`.
+</Note>
+
+### Providing a custom connection factory
+
+`Permutation.with_connection_factory` lets you control how each worker reopens the base table. The factory takes the base table name and returns a LanceDB table. It must be picklable, which in practice means a top-level function, a `functools.partial` of one, or an instance of a picklable class with `__call__` — lambdas and closures over local variables will not work.
+
+```py Python icon=Python 
+import os
+import lancedb
+from lancedb.permutation import Permutation
+
+def open_table(name: str):
+    db = lancedb.connect(
+        "db://my-database",
+        api_key=os.environ["LANCEDB_API_KEY"],
+        region="us-east-1",
+    )
+    return db.open_table(name)
+
+permutation = (
+    Permutation.identity(table)
+    .with_connection_factory(open_table)
+)
+dataloader = torch.utils.data.DataLoader(
+    permutation,
+    batch_size=512,
+    num_workers=4,
+    multiprocessing_context="spawn",
+)
+```

From f449aa6b91f31098a88ddda2ed7dbae301021e56 Mon Sep 17 00:00:00 2001
From: prrao87 <35005448+prrao87@users.noreply.github.com>
Date: Tue, 9 Jun 2026 13:32:53 -0400
Subject: [PATCH 2/4] docs: tighten multi-worker torch guidance

---
 docs/training/torch.mdx | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)
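
Review note (not part of the commit): this revision shortens the picklability guidance for connection factories to "must be picklable". As a reminder of what that means under `spawn`, here is a minimal stdlib-only sketch. `open_table` below is a hypothetical stand-in that returns a string instead of calling `lancedb.connect`, and `is_picklable` is an illustrative helper, not a LanceDB or PyTorch API.

```python
import functools
import pickle


def open_table(name, region="us-east-1"):
    # Hypothetical stand-in for a real factory, which would call
    # lancedb.connect(...) and return db.open_table(name).
    return f"{region}/{name}"


def is_picklable(obj):
    """Return True if obj can be serialized with pickle.dumps."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError):
        return False
    return True


# A top-level function pickles by reference to its module-level name.
top_level = open_table
# A partial of a top-level function pickles along with its bound arguments.
bound = functools.partial(open_table, region="eu-west-1")
# A lambda has no importable name, so pickle cannot look it up.
anonymous = lambda name: open_table(name)

results = {
    "top_level": is_picklable(top_level),
    "partial": is_picklable(bound),
    "lambda": is_picklable(anonymous),
}
```

Running a candidate factory through a check like this before handing it to a multi-worker `DataLoader` surfaces pickling problems in the parent process rather than at worker startup.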

diff --git a/docs/training/torch.mdx b/docs/training/torch.mdx
index e9c571f1..ba50db34 100644
--- a/docs/training/torch.mdx
+++ b/docs/training/torch.mdx
@@ -99,11 +99,12 @@ for batch in dataloader:
 
 ## Using multiple DataLoader workers
 
-PyTorch's `DataLoader` can fan out reads across worker processes by setting `num_workers > 0`. LanceDB tables and `Permutation` objects are picklable, so each worker reopens its own connection after the worker process starts.
+Set `num_workers > 0` to read from LanceDB in multiple PyTorch worker processes. LanceDB tables and `Permutation` objects are picklable, so each worker reopens the table after it starts.
 
-Because LanceDB is multi-threaded internally, use the `spawn` start method (not `fork`) when running with multiple workers. See [the performance guide](/performance) for more on safe multiprocessing patterns.
+Prefer the `spawn` start method when using multiple workers: LanceDB uses internal threads, and forking a multi-threaded process is unsafe. See [the performance guide](/performance) for more multiprocessing guidance.
 
 ```py Python icon=Python 
+import torch
 from lancedb.permutation import Permutation
 
 permutation = Permutation.identity(table)
@@ -119,11 +120,11 @@ dataloader = torch.utils.data.DataLoader(
 
 ### Remote tables in DataLoader workers
 
-Tables opened from a remote LanceDB Enterprise connection (`db://...`) also work with multi-worker DataLoaders. The connection details needed to reopen the table — `db_url`, `api_key`, `region`, `host_override`, and the serializable parts of `client_config` — travel with the pickled table and are used to rebuild the connection in each worker.
+Remote LanceDB Enterprise tables (`db://...`) work the same way: workers reopen the table from the pickled connection state.
 
 ```py Python icon=Python 
 import lancedb
-from lancedb.permutation import Permutation
+import torch
 
 db = lancedb.connect(
     "db://my-database",
@@ -132,9 +133,8 @@ db = lancedb.connect(
 )
 table = db.open_table("my_table")
 
-permutation = Permutation.identity(table).select_columns(["id", "image"])
 dataloader = torch.utils.data.DataLoader(
-    permutation,
+    table,
     batch_size=512,
     num_workers=4,
     multiprocessing_context="spawn",
@@ -142,16 +142,17 @@ dataloader = torch.utils.data.DataLoader(
 ```
 
 <Note>
-This embeds the API key in the pickle sent to each worker. If you'd rather load credentials inside the worker — for example, from an environment variable or a secret manager — use the connection factory escape hatch described below. A factory is also required when your `client_config` uses a non-serializable `header_provider`.
+This sends the connection state, including the API key, to each worker. Use a connection factory if credentials should be loaded inside the worker or if your `client_config` contains a non-serializable `header_provider`.
 </Note>
 
 ### Providing a custom connection factory
 
-`Permutation.with_connection_factory` lets you control how each worker reopens the base table. The factory takes the base table name and returns a LanceDB table. It must be picklable, which in practice means a top-level function, a `functools.partial` of one, or an instance of a picklable class with `__call__` — lambdas and closures over local variables will not work.
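+`Permutation.with_connection_factory` lets each worker reopen the base table with custom logic. The factory takes the table name, returns a LanceDB table, and must be picklable: use a top-level function, a `functools.partial` of one, or an instance of a picklable class with `__call__`, not a lambda or a closure over local variables.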
 
 ```py Python icon=Python 
 import os
 import lancedb
+import torch
 from lancedb.permutation import Permutation
 
 def open_table(name: str):
@@ -162,6 +163,7 @@ def open_table(name: str):
     )
     return db.open_table(name)
 
+table = open_table("my_table")
 permutation = (
     Permutation.identity(table)
     .with_connection_factory(open_table)

From 70291f29a2ff31fbae6bf31272d2eaabe5656a27 Mon Sep 17 00:00:00 2001
From: prrao87 <35005448+prrao87@users.noreply.github.com>
Date: Thu, 2 Jul 2026 14:21:24 -0400
Subject: [PATCH 3/4] docs: fix PyTorch DataLoader table collation examples

---
 docs/training/torch.mdx | 20 ++++++++++++++------
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/docs/training/torch.mdx b/docs/training/torch.mdx
index ba50db34..1c5d1626 100644
--- a/docs/training/torch.mdx
+++ b/docs/training/torch.mdx
@@ -17,13 +17,14 @@ The `Table` class in LanceDB implements a contract for a PyTorch
 import lancedb
 import torch
 import pyarrow as pa
+from lancedb.util import tbl_to_tensor
 
 mem_db = lancedb.connect("memory://")
 table = mem_db.create_table("test_table", pa.table({"a": range(1000)}))
 
 # Any LanceDB table can be used as a PyTorch Dataset
 dataloader = torch.utils.data.DataLoader(
-    table, batch_size=1024, shuffle=True
+    table, batch_size=1024, shuffle=True, collate_fn=tbl_to_tensor
 )
 
 for batch in dataloader:
@@ -42,12 +43,17 @@ dataloader = torch.utils.data.DataLoader(permutation)
 
 ## Output Formats
 
-By default, a `Table` data loader will emit a `pyarrow.RecordBatch`.  To convert to a different format (such as a
-`pytorch.Tensor`), you will need to provide a custom collate function.
+By default, a `Table` data loader will emit Arrow data. PyTorch calls the `collate_fn` argument to turn the fetched
+items into one batch. Its default collate function only knows how to combine tensors, NumPy arrays, numbers, dicts, and
+lists, so it does not accept Arrow data directly. Direct `Table` data loaders should provide a custom collate function
+such as `lancedb.util.tbl_to_tensor`, which converts numeric Arrow columns into a column-major `torch.Tensor` with shape
+`(columns, rows)`.
 
-The `Permutation` class is more flexible.  By default, the output will be a list of dicts.  This is the default output
-format of standard data loaders and usually more convenient when you are getting started.  However, there is a
-significant performance penalty converting from Arrow, Lance's internal representation, to this default format.
+`Permutation` works differently: its default output is a list of Python dicts, which PyTorch's default collate function
+can batch into a dict of tensors. This is usually more convenient when you are getting started. However, there is a
+significant performance penalty converting from Arrow, Lance's internal representation, to this default format. Use a
+direct `Table` with `collate_fn` when you want Arrow-to-tensor conversion, or a `Permutation` when you want the default
+PyTorch dict-of-tensors behavior.
 
 To address this, the `Permutation` class provides a set of builtin transform functions that can be applied to map
 the Arrow data in different ways.  The `arrow` and `polars` formats will always avoid data copies.  However, `numpy`,
@@ -125,6 +131,7 @@ Remote LanceDB Enterprise tables (`db://...`) work the same way: workers reopen
 ```py Python icon=Python 
 import lancedb
 import torch
+from lancedb.util import tbl_to_tensor
 
 db = lancedb.connect(
     "db://my-database",
@@ -138,6 +145,7 @@ dataloader = torch.utils.data.DataLoader(
     batch_size=512,
     num_workers=4,
     multiprocessing_context="spawn",
+    collate_fn=tbl_to_tensor,
 )
 ```
 

From 55e040f6b3e145e30e6f100e59927d08c0c4b616 Mon Sep 17 00:00:00 2001
From: prrao87 <35005448+prrao87@users.noreply.github.com>
Date: Thu, 2 Jul 2026 14:23:54 -0400
Subject: [PATCH 4/4] docs: clarify PyTorch collate function usage

---
 docs/training/torch.mdx | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)
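
Review note (not part of the commit): the reworded paragraph contrasts two batch layouts, a dict of batched columns for `Permutation` and a `(columns, rows)` tensor for a direct `Table`. The plain-Python sketch below may help reviewers picture the shapes. It deliberately avoids PyTorch and Arrow: `collate_rows` and `stack_columns` are hypothetical helpers that use lists in place of tensors, and they are simplified illustrations rather than the behavior of `torch.utils.data.default_collate` or `lancedb.util.tbl_to_tensor`.

```python
def collate_rows(rows):
    """Group a list of row dicts into one dict of per-column lists."""
    if not rows:
        raise ValueError("cannot collate an empty batch")
    return {key: [row[key] for row in rows] for key in rows[0]}


def stack_columns(columns):
    """Stack equal-length columns into a (columns, rows) nested list."""
    matrix = [list(values) for values in columns.values()]
    if len({len(col) for col in matrix}) > 1:
        raise ValueError("all columns must have the same length")
    return matrix


# Three rows with two numeric columns each.
rows = [{"a": 0, "b": 10}, {"a": 1, "b": 11}, {"a": 2, "b": 12}]

batch = collate_rows(rows)       # dict-of-columns layout
matrix = stack_columns(batch)    # one inner list per column
shape = (len(matrix), len(matrix[0]))
```

Here `shape` is `(2, 3)`: two columns by three rows, so indexing the first axis selects a column, not a row.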

diff --git a/docs/training/torch.mdx b/docs/training/torch.mdx
index 1c5d1626..c04d2760 100644
--- a/docs/training/torch.mdx
+++ b/docs/training/torch.mdx
@@ -43,11 +43,11 @@ dataloader = torch.utils.data.DataLoader(permutation)
 
 ## Output Formats
 
-By default, a `Table` data loader will emit Arrow data. PyTorch calls the `collate_fn` argument to turn the fetched
-items into one batch. Its default collate function only knows how to combine tensors, NumPy arrays, numbers, dicts, and
-lists, so it does not accept Arrow data directly. Direct `Table` data loaders should provide a custom collate function
-such as `lancedb.util.tbl_to_tensor`, which converts numeric Arrow columns into a column-major `torch.Tensor` with shape
-`(columns, rows)`.
+By default, a `Table` data loader will emit Arrow data. `collate_fn` is PyTorch's batching hook: PyTorch calls it to
+turn the fetched items into one batch. PyTorch's default collate function only knows how to combine tensors, NumPy
+arrays, numbers, dicts, and lists, so it does not accept Arrow data directly. When using a `Table` directly, pass
+LanceDB's `lancedb.util.tbl_to_tensor` helper as PyTorch's `collate_fn`; it converts numeric Arrow columns into a
+column-major `torch.Tensor` with shape `(columns, rows)`.
 
 `Permutation` works differently: its default output is a list of Python dicts, which PyTorch's default collate function
 can batch into a dict of tensors. This is usually more convenient when you are getting started. However, there is a