diff --git a/docs/docs.json b/docs/docs.json
index 7f237c5b..7d1519f7 100644
--- a/docs/docs.json
+++ b/docs/docs.json
@@ -63,6 +63,16 @@
"performance"
]
},
+ {
+ "group": "Model training",
+ "pages": [
+ "training/why-lancedb",
+ "training/index",
+ "training/torch",
+ "training/object-detection",
+ "training/vlm-finetuning"
+ ]
+ },
{
"group": "Guides",
"pages": [
@@ -141,15 +151,6 @@
"storage/index",
"storage/configuration"
]
- },
- {
- "group": "Training",
- "pages": [
- "training/index",
- "training/torch",
- "training/object-detection",
- "training/vlm-finetuning"
- ]
}
]
},
diff --git a/docs/index.mdx b/docs/index.mdx
index 83f4e19d..c9cf4f75 100644
--- a/docs/index.mdx
+++ b/docs/index.mdx
@@ -3,51 +3,85 @@ title: LanceDB
sidebarTitle: "LanceDB"
description: "Multimodal lakehouse for AI."
icon: "/static/assets/logo/lancedb-icon-gray.svg"
-keywords: ["open source", "oss"]
+keywords: ["multimodal lakehouse", "training", "feature engineering", "search", "open source", "oss"]
---
-**LanceDB** is a [multimodal lakehouse](https://lancedb.com/blog/multimodal-lakehouse/) for
-AI, built on top of [Lance](/lance), an open-source lakehouse format. Below, we list a few
-ways LanceDB can help you build and scale your AI and ML workloads.
+**LanceDB** is a [multimodal lakehouse](https://lancedb.com/blog/multimodal-lakehouse/) for AI teams that need
+one data layer for curation, feature engineering, search and retrieval, and model training.
+It is built on top of [Lance](/lance), an open-source lakehouse format designed for multimodal AI data.
+
+Move from data exploration to model training on one unified platform, without managing a
+fragmented stack of storage, feature, retrieval, and training systems.
+
+## Build better models, faster
+
+Data preparation and experimentation slow down when raw data, metadata, embeddings, features, and governance
+artifacts live in separate systems. LanceDB keeps them together in one versioned multimodal table, so AI teams spend less
+time stitching infrastructure together and more time improving datasets, testing features, and keeping GPUs fed.
+
+
+
+Use the same table to curate training data, add derived features, retrieve examples, and feed training jobs that rely on expensive GPUs.
+Training workloads can sample, shuffle, and scan projected columns from local storage or object storage, then assemble
+GPU-ready batches from a tagged dataset version.
+
+For a deeper look at how this works in training pipelines, start with [Why LanceDB for training](/training/why-lancedb).
+
+## LanceDB suite
+
+The LanceDB suite includes LanceDB OSS, an open-source embedded retrieval library, and LanceDB Enterprise,
+a multimodal lakehouse platform for the full AI data lifecycle.
+OSS is easy to set up on a local machine for search and regular-scale workflows. LanceDB Enterprise is built
+for teams that need scale without building bespoke infrastructure for curation,
+feature engineering, search and retrieval, and efficient training data access.
+
+
+
+## Why teams use LanceDB
-
- Use LanceDB to curate, explore and distribute very large multimodal datasets for training and fine-tuning models.
- LanceDB comes with built-in table versioning, schema evolution, and fast random access, making it far more efficient to do
- dataset slicing, sampling, filtering and shuffles on large, rapidly evolving datasets.
+
+ Store images, video, audio, text, annotations, embeddings, and model-generated features together in one schema-enforced table.
+ The same table can support dataset curation, feature backfills, experiment splits, retrieval, and training.
+
+
+ Training workloads mix fast random access with high-throughput sequential scans. LanceDB is designed for both, so
+ teams can shuffle data into GPU-ready batches more efficiently, improve input throughput, and iterate on experiments faster.
-
- Use LanceDB as the data + retrieval layer for production AI workloads: RAG, agents, semantic search,
- recommendation systems, and more.
- Keep multimodal data, metadata, and embeddings in the same table and query them via vector search,
- full-text search or SQL. Easily add new features (columns in your tables) as your
- application evolves, without copying existing data.
+
+ Whether the end user is a human or an agent, LanceDB powers production retrieval workloads such as semantic search,
+ hybrid search, RAG, agent memory, and recommendation systems. Retrieval runs against the same LanceDB tables used
+ for curation, feature engineering, and training workflows.
-LanceDB is designed for a variety of workloads and deployment scenarios, and supports use cases
-that are way beyond traditional vector search. The LanceDB suite includes LanceDB OSS, an open-source embedded library,
-and LanceDB Enterprise, a distributed and managed multimodal lakehouse.
-Both are built on top of the same open-source Lance format and table abstractions.
-
-
+## Start with your workload
-## Use cases
-
-- **Search**: Build high-performance search and retrieval applications using LanceDB's optimized storage, including vector search, full-text search, and hybrid search with secondary indexes.
-- **Data Curation**: Manage and filter on petabyte-scale multimodal datasets, including video and point cloud data, to gain insights, explore data and inform model development.
-- **Feature engineering**: Add new columns (features), create embeddings, and transform your data at
-scale. LanceDB lets you extend tables both vertically and horizontally with minimal I/O overhead.
-- **Training**: Efficiently access and manage large-scale multimodal datasets for training and fine-tuning AI models.
+
+
+ Learn why LanceDB works well as the data layer for training workloads.
+
+
+ Use LanceDB tables and permutations for projected, shuffled, random-access training reads.
+
+
+ Explore Lance-formatted multimodal datasets with raw bytes, metadata, embeddings, and indices.
+
+
+ Use vector search, full-text search, hybrid search, reranking, filtering, and SQL.
+
+
-## Choose how you run LanceDB
+## From local development to production scale
-Depending on your needs, you can choose one of the following ways to run LanceDB.
+LanceDB OSS and LanceDB Enterprise share the same Lance format and table model. Start locally with the embedded OSS
+library, then move to Enterprise when your team needs distributed scale, managed infrastructure, private deployment,
+or higher-throughput curation, feature engineering, search and retrieval, and training workflows.
### 1. LanceDB OSS
The fastest way to get started is the open-source embedded library, with client SDKs in Python, TypeScript
-and Rust. Run it locally during development, then use the same data model and APIs as you scale up
-and need a managed solution. Start here:
+and Rust. Run it locally in a few steps to explore datasets, curate data, and run search and retrieval workloads
+for agents. Start here:
- Create tables, search vectors, and modify data in LanceDB.
+ Create tables, evolve schemas, version data, and modify rows in LanceDB.
### 2. LanceDB Enterprise
-[LanceDB Enterprise](/enterprise) is a distributed and managed **multimodal lakehouse** built for
-search, curation, feature engineering, and training-oriented data access workflows
-on top of the same core table abstraction. This eliminates the need for teams to build bespoke
-infrastructure to manage petabyte-scale multimodal datasets.
+[LanceDB Enterprise](/enterprise) is a distributed **multimodal lakehouse** platform for petabyte-scale data and beyond. It supports
+search, curation, feature engineering, and high-throughput training data access workflows on top of the same core table
+abstraction, so teams do not need to build bespoke infrastructure to manage large multimodal datasets.
To set up LanceDB Enterprise in your organization, reach out to us at
[contact@lancedb.com](mailto:contact@lancedb.com).
@@ -88,4 +121,4 @@ private deployments, and can operate under strict [security requirements](/enter
href="/enterprise/quickstart"
>
Get started with LanceDB Enterprise in minutes.
-
\ No newline at end of file
+
diff --git a/docs/lance.mdx b/docs/lance.mdx
index 88c26722..e7a650e5 100644
--- a/docs/lance.mdx
+++ b/docs/lance.mdx
@@ -5,15 +5,15 @@ description: "Open-source lakehouse format for multimodal AI."
icon: "/static/assets/logo/lance-logo-gray.svg"
---
-[Lance](https://lance.org/) is an open-source lakehouse format, which provides the
-foundation for LanceDB's capabilities. It provides a file format,
-table format, and catalog spec with multimodal data at the center of its design, allowing developers
+[Lance](https://lance.org/) is an open-source, columnar lakehouse format for multimodal AI.
+It provides a file format, table format, and lightweight catalog spec, allowing developers
to build a complete open lakehouse on top of object storage.
-Building on top of open foundations and optimizing the format for AI workloads brings
-high-performance vector search, full-text search, random access, and feature engineering capabilities
-to a single unified system ([LanceDB](/enterprise)), eliminating the need for bespoke ETL and data pipelines that move data
-to multiple other specialized data systems.
+Building on top of open foundations and optimizing the format for random access
+(without compromising scan performance) enables
+high-performance vector search, full-text search, indexing, and feature engineering capabilities.
+[LanceDB](/enterprise) builds on these capabilities so teams can work with one multimodal data layer
+instead of moving data across separate storage, search, feature, and training systems.
-## Advantages of the Lance format
+## Capabilities of the Lance format
-Advantage | Description
+Capability | What it enables
--- | ---
-Multimodal storage | Efficiently holds vectors, images, videos, audio, text, and more
-Version control | Built-in data versioning for reproducible ML experiments and data lineage
-ML-optimized | Designed for training and inference workloads with fast random access
-Query performance | Columnar storage enables blazing-fast vector search and analytics
-Cloud-native | Seamless integration with cloud object stores (S3, GCS, Azure Blob)
+Multimodal storage | Store images, video, audio, text, embeddings, annotations, metadata, features, and more, all in one table.
+First-class blob API | Store large binary objects such as images, video, audio, and model artifacts in blob columns with lazy reads and streaming byte access.
+Fast random access and scans | Sample, shuffle, and retrieve individual rows efficiently without giving up high-throughput sequential reads.
+Flexible data evolution | Add, drop, rename, or alter columns as datasets change, often without rewriting existing data files.
+Versioned tables | Reproduce experiments, restore previous states, and tie downstream artifacts to the exact table version they used.
+Hybrid search and indexing | Combine vector search, full-text search, and scalar filters on the same dataset with Lance indexes.
+Open lakehouse interoperability | Build on object storage and connect Lance tables to open engines and frameworks such as PyTorch, Ray, Spark, Trino, DuckDB, and Polars.
## Key concepts
diff --git a/docs/static/assets/images/overview/lancedb-suite.svg b/docs/static/assets/images/overview/lancedb-suite.svg
new file mode 100644
index 00000000..92a9335b
--- /dev/null
+++ b/docs/static/assets/images/overview/lancedb-suite.svg
@@ -0,0 +1,95 @@
+
diff --git a/docs/static/assets/images/overview/training-data-lifecycle.svg b/docs/static/assets/images/overview/training-data-lifecycle.svg
new file mode 100644
index 00000000..a533c70e
--- /dev/null
+++ b/docs/static/assets/images/overview/training-data-lifecycle.svg
@@ -0,0 +1,52 @@
+
diff --git a/docs/training/why-lancedb.mdx b/docs/training/why-lancedb.mdx
new file mode 100644
index 00000000..0adfc81a
--- /dev/null
+++ b/docs/training/why-lancedb.mdx
@@ -0,0 +1,118 @@
+---
+title: "Why LanceDB for Training"
+sidebarTitle: "Why LanceDB for training"
+description: "Use LanceDB as the multimodal data layer for model training, fine-tuning, curation, and feature engineering workflows."
+icon: fire
+---
+
+LanceDB is built for AI teams that need a practical data layer between raw multimodal datasets and model training.
+Instead of moving data through separate systems for curation, feature engineering, search, manifests, and training,
+you can keep the whole workflow attached to one versioned LanceDB table.
+
+That table can hold images, video, audio, text, annotations, metadata, embeddings, tokenized fields, model outputs,
+quality signals, and training-ready tensors. As the dataset evolves, LanceDB lets you add new columns, filter rows,
+pin versions, and read batches without rewriting the original data.
+
+
+
+LanceDB gives these stages one platform, so curation, feature engineering, retrieval, and training stay connected.
+
+## A connected data lifecycle
+
+Training pipelines usually need more than a pile of files. They need curation, derived features, reproducible splits,
+fast random access, and a clean path into frameworks such as PyTorch. LanceDB keeps these pieces connected through
+the same table model, whether you organize a workflow as one table or several related tables.
+
+
+
+ Use filters, vector search, full-text search, and retrieval workflows to find the examples that matter: hard negatives,
+ long-tail failure modes, duplicate clusters, low-quality samples, or targeted fine-tuning slices.
+
+
+ Add embeddings, detections, OCR output, labels, token IDs, hidden states, deduplication flags, or quality scores as
+ new columns. Lance's columnar layout and schema evolution avoid rewriting large raw media columns when you add features.
+
+
+ Build filtered splits and materialized views from the table instead of exporting CSV manifests. Data versions and tags
+ make it possible to tie a checkpoint back to the exact rows and features used for training.
+
+
+ Use fast random access and column projection to read only the columns a training step needs. LanceDB tables can be read
+ from local storage or object storage, and integrate with data loading patterns such as PyTorch datasets.
+
+
+
+## Lance as the foundation
+
+LanceDB is built on [Lance](https://lance.org/), an open-source lakehouse format designed for multimodal AI data.
+The table below highlights the Lance features that make the multimodal lakehouse built on top of it possible.
+
+| Capability | Why it matters for training |
+|---|---|
+| **Multimodal columns** | Store raw bytes, annotations, metadata, embeddings, and features together. |
+| **Fast random access** | Support shuffled and sampled reads without reshuffling the dataset on disk. |
+| **Column projection** | Read only the columns a given run needs, such as images, tokens, labels, embeddings, or hidden states. |
+| **Schema evolution** | Add new feature columns without rewriting existing media columns. |
+| **Versioning** | Reproduce experiments against the same table snapshot, even as the dataset evolves. |
+| **Search and filtering** | Find and materialize useful training slices directly from the table. |
+
+## Search inside training workflows
+
+Search is not limited to QA systems, agents, or production retrieval apps. It is also a practical way to inspect,
+curate, and improve training data:
+
+- Find visually similar examples when debugging model failures.
+- Retrieve hard negatives or near-duplicates for contrastive training.
+- Combine vector search, full-text search, and metadata filters to build targeted fine-tuning slices.
+- Reuse the same table for both offline curation and production retrieval.
+
+In LanceDB, retrieval and training workflows can operate over the same multimodal tables instead of forcing teams to
+manage separate data systems for each stage.
+
+## Projects using LanceDB for training workflows
+
+
+
+ A platform for reproducible world-model research built on a LanceDB data layer, reporting faster data loading on Push-T workloads.
+
+
+ A joint-embedding predictive world model from pixels, trained on the stable-worldmodel platform and its LanceDB data layer.
+
+
+ A drop-in LanceDB backend for Hugging Face LeRobot datasets with faster loading across robotics datasets.
+
+
+
+In the world-model ecosystem, [stable-worldmodel](https://github.com/galilai-group/stable-worldmodel) reports
+3-4x faster data loading on Push-T versus HDF5 / MP4 at a fraction of the disk footprint. Across these projects,
+LanceDB and Lance provide the multimodal data layer that keeps raw observations, annotations, features, and training
+access patterns in one format instead of scattering them across task-specific stores.
+
+## Next steps
+
+
+
+ Learn how to use LanceDB permutations to select rows, project columns, split datasets, and shuffle training reads.
+
+
+ Use LanceDB tables and permutations with `torch.utils.data.DataLoader`.
+
+
+ Fine-tune an AV perception model on curated failure-mode slices backed by one LanceDB table.
+
+
+ Fine-tune a VLM on TextVQA using LanceDB and Geneva to cache expensive training features.
+
+