docs: document multi-worker DataLoader support for remote tables by mintlify[bot] · Pull Request #260 · lancedb/docs

mintlify · 2026-06-01T09:55:50Z

Summary

Document that remote LanceDB tables can now be used directly with multi-worker PyTorch DataLoaders, and refresh guidance on the optional with_connection_factory escape hatch.

Changes

Add a "Using multiple DataLoader workers" section to the PyTorch integration page covering num_workers, spawn, and persistent_workers.
Add a subsection showing remote (db://) tables working in worker processes out of the box.
Add a subsection showing how to provide a custom connection factory when credentials should be loaded inside the worker.

Context

Triggered by an upstream change that lets remote tables carry their connection state through pickling so they reopen correctly in PyTorch DataLoader workers, while keeping the connection factory available for custom credential loading.
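
For readers unfamiliar with the mechanism, the sketch below shows one general way an object can carry its connection details through pickling and reconnect in the receiving process. It is only an illustration: `RemoteTableHandle`, its attributes, and the placeholder connection are hypothetical and are not LanceDB's actual classes or the upstream implementation.

```python
import pickle


class RemoteTableHandle:
    """Hypothetical stand-in for a remote table; not LanceDB's real class."""

    def __init__(self, uri, table_name):
        self.uri = uri
        self.table_name = table_name
        self._conn = None  # live connection, created lazily in each process

    def _connect(self):
        # Placeholder for opening a real network connection (assumption).
        if self._conn is None:
            self._conn = object()
        return self._conn

    def __getstate__(self):
        # Pickle only what is needed to reconnect, never the live handle.
        return {"uri": self.uri, "table_name": self.table_name}

    def __setstate__(self, state):
        self.uri = state["uri"]
        self.table_name = state["table_name"]
        self._conn = None  # the receiving process reconnects on first use


handle = RemoteTableHandle("db://example", "training_data")
handle._connect()

# A DataLoader worker receives a pickled copy of the dataset object.
clone = pickle.loads(pickle.dumps(handle))
```

After the round trip, `clone` keeps the URI and table name but holds no connection until `_connect()` is called, which mirrors the "reopen in the worker" behavior described above.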

mintlify · 2026-06-01T09:56:06Z

Preview deployment for your docs. Learn more about Mintlify Previews.

Project	Status	Preview	Updated (UTC)
lancedb-bcbb4faf	🟢 Ready	View Preview	Jun 1, 2026, 9:57 AM

prrao87 · 2026-06-09T17:35:23Z

cc @westonpace - I presume these docs make sense based on your latest additions.

prrao87 · 2026-07-02T18:21:58Z

I pushed a follow-up commit to make the PyTorch examples match current LanceDB behavior.

What I tested:

Built the latest lancedb main locally from /Users/prrao/code/lancedb with maturin develop.
Installed torch==2.12.1.
Ran python/tests/test_torch.py: 9 passed, 2 skipped.
Added temporary docs-level repros in /private/tmp/test_lancedb_torch_docs_snippets.py: 4 passed.
Reproduced the direct DataLoader(table, ...) example without collate_fn; it fails with PyTorch default collation:

TypeError: default_collate: batch must contain tensors, numpy arrays, numbers, dicts or lists; found <class 'pyarrow.lib.ChunkedArray'>

Why this change is necessary:

Direct LanceDB Table datasets emit Arrow data, and PyTorch's default collate_fn does not accept Arrow batches directly.
The direct Table examples need collate_fn=tbl_to_tensor to produce tensors.
Permutation examples work as written with PyTorch's default collation because the default Permutation output is a list of Python dicts, which PyTorch batches into a dict of tensors.
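
As a dependency-free illustration of the last point, the sketch below regroups a list of row dicts into a single dict of columns. `collate_rows` is a hypothetical helper written for this comment, not PyTorch's `default_collate`, which performs a similar regrouping for dict inputs and additionally converts each column into a tensor.

```python
def collate_rows(rows):
    """Turn a list of row dicts into one dict of column lists."""
    if not rows:
        return {}
    # Use the first row's keys as the column names.
    columns = {key: [] for key in rows[0]}
    for row in rows:
        for key in columns:
            columns[key].append(row[key])
    return columns


# Two rows shaped like a list-of-dicts batch.
rows = [{"id": 1, "label": 0}, {"id": 2, "label": 1}]
batch = collate_rows(rows)
```

Arrow batches do not have this list-of-dicts shape, which is consistent with the `TypeError` above and with the need for an explicit `collate_fn` on direct `Table` datasets.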

prrao87 · 2026-07-02T18:28:18Z

@westonpace could you please review this PyTorch docs update when possible? While I was testing the code that the agent wrote, some changes that we merged recently caused this error to surface:

TypeError: default_collate: batch must contain tensors, numpy arrays, numbers, dicts or lists; found <class 'pyarrow.lib.ChunkedArray'>

I've patched a fix that uses tbl_to_tensor from the utils that were added, and it works now. Just want to be sure the explanation in the PyTorch page makes sense.

Also, are we missing anything in the documentation about how to handle these on RemoteTable specifically?

westonpace

Some suggestions, but nice examples.

westonpace · 2026-07-02T22:45:29Z

+significant performance penalty converting from Arrow, Lance's internal representation, to this default format. Use a
+direct `Table` with `collate_fn` when you want Arrow-to-tensor conversion, or a `Permutation` when you want the default
+PyTorch dict-of-tensors behavior.


Hmm, Permutation can also support direct Arrow-to-tensor conversion. It just isn't the default. This makes it sound like you'd have to use a Table.

westonpace · 2026-07-02T22:46:08Z

+
+Set `num_workers > 0` to read from LanceDB in multiple PyTorch worker processes. LanceDB tables and `Permutation` objects are picklable, so each worker reopens the table after it starts.
+
+Prefer the `spawn` start method when using multiple workers; LanceDB uses internal threads. See [the performance guide](/performance) for more multiprocessing guidance.


Actually, forkserver is probably better than spawn (and will be the new Python default). I'd say that should be our preference.

Once the streaming dataset is available, my guidance would be to use forkserver with num_workers=1 unless you can prove you have GIL contention in your transform function.
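
A minimal sketch of how that preference could be expressed with the standard library, assuming a fallback to `spawn` on platforms where `forkserver` is not offered. The helper name `preferred_start_method` is hypothetical.

```python
import multiprocessing as mp


def preferred_start_method():
    """Return "forkserver" when the platform offers it, else "spawn"."""
    available = mp.get_all_start_methods()
    return "forkserver" if "forkserver" in available else "spawn"


method = preferred_start_method()
# A context object scoped to this start method; it does not change the
# process-wide default.
ctx = mp.get_context(method)
```

Either `method` or `ctx` could then be passed to `DataLoader` through its `multiprocessing_context` argument.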

docs: document multi-worker DataLoader support for remote tables

6824cd7

mintlify Bot mentioned this pull request Jun 1, 2026

feat(python): support remote tables in PyTorch dataloaders lancedb/lancedb#3432

Merged

mintlify Bot deployed to staging - docs June 1, 2026 09:57 View deployment

prrao87 added the needs_new_release Only merge once we release a new version of LanceDB label Jun 1, 2026

docs: tighten multi-worker torch guidance

f449aa6

mintlify Bot deployed to staging - docs June 9, 2026 17:34 View deployment

prrao87 approved these changes Jun 9, 2026

docs: fix PyTorch DataLoader table collation examples

70291f2

mintlify Bot deployed to staging - docs July 2, 2026 18:23 View deployment

docs: clarify PyTorch collate function usage

55e040f

mintlify Bot deployed to staging - docs July 2, 2026 18:24 View deployment

prrao87 requested a review from westonpace July 2, 2026 18:28

westonpace approved these changes Jul 2, 2026

mintlify[bot] wants to merge 4 commits into main from mintlify/f5da8d82

2 participants

