From 1071a330adc3e563c73a4e9dd335b3b8d0d92f90 Mon Sep 17 00:00:00 2001 From: prrao87 <35005448+prrao87@users.noreply.github.com> Date: Sun, 28 Jun 2026 17:40:31 -0400 Subject: [PATCH 1/5] docs: emphasize training data lifecycle --- docs/docs.json | 19 +-- docs/index.mdx | 109 ++++++++++------ .../assets/images/overview/lancedb-suite.svg | 92 ++++++++++++++ .../overview/training-data-lifecycle.svg | 52 ++++++++ docs/training/why-lancedb.mdx | 118 ++++++++++++++++++ 5 files changed, 343 insertions(+), 47 deletions(-) create mode 100644 docs/static/assets/images/overview/lancedb-suite.svg create mode 100644 docs/static/assets/images/overview/training-data-lifecycle.svg create mode 100644 docs/training/why-lancedb.mdx diff --git a/docs/docs.json b/docs/docs.json index 7f237c5b..7d1519f7 100644 --- a/docs/docs.json +++ b/docs/docs.json @@ -63,6 +63,16 @@ "performance" ] }, + { + "group": "Model training", + "pages": [ + "training/why-lancedb", + "training/index", + "training/torch", + "training/object-detection", + "training/vlm-finetuning" + ] + }, { "group": "Guides", "pages": [ @@ -141,15 +151,6 @@ "storage/index", "storage/configuration" ] - }, - { - "group": "Training", - "pages": [ - "training/index", - "training/torch", - "training/object-detection", - "training/vlm-finetuning" - ] } ] }, diff --git a/docs/index.mdx b/docs/index.mdx index 83f4e19d..78f13201 100644 --- a/docs/index.mdx +++ b/docs/index.mdx @@ -3,51 +3,85 @@ title: LanceDB sidebarTitle: "LanceDB" description: "Multimodal lakehouse for AI." icon: "/static/assets/logo/lancedb-icon-gray.svg" -keywords: ["open source", "oss"] +keywords: ["multimodal lakehouse", "training", "feature engineering", "search", "open source", "oss"] --- -**LanceDB** is a [multimodal lakehouse](https://lancedb.com/blog/multimodal-lakehouse/) for -AI, built on top of [Lance](/lance), an open-source lakehouse format. Below, we list a few -ways LanceDB can help you build and scale your AI and ML workloads. +**LanceDB** is a [multimodal lakehouse](https://lancedb.com/blog/multimodal-lakehouse/) for AI teams that need +one data layer for curation, feature engineering, search and retrieval, and model training. +It is built on top of [Lance](/lance), an open-source lakehouse format designed for multimodal AI data. + +Move from data exploration to model training on one, unified platform without needing to manage a +fragmented stack of storage, feature, retrieval, and training systems. + +## Build better models, faster + +Training data pipelines and experimentation slow down when raw data, metadata, embeddings, features, and governance +artifacts live in separate systems. LanceDB keeps them together in one versioned multimodal table, so teams spend less +time stitching infrastructure together and more time improving datasets, testing features, and keeping GPUs fed. + +![Training data lifecycle: Curation, Feature Engineering, Search and Retrieval, Training](/static/assets/images/overview/training-data-lifecycle.svg) + +Use the same table to curate training data, add derived features, retrieve examples, and feed training jobs that rely on expensive GPUs. +Training workloads can sample, shuffle, and scan projected columns from local storage or object storage, then assemble +GPU-ready batches from a tagged dataset version. + +For a deeper look at how this works in training pipelines, start with [Why LanceDB for training](/training/why-lancedb). + +## LanceDB suite + +The LanceDB suite includes LanceDB OSS, an open-source embedded retrieval library, and LanceDB Enterprise, +a multimodal lakehouse platform for the full AI data lifecycle. +OSS is easy to set up on a local machine for search and regular-scale workflows. LanceDB Enterprise is built +for teams that need scale without building bespoke infrastructure for curation, +feature engineering, search and retrieval, and efficient training data access. + +![LanceDB suite: OSS search and Enterprise multimodal lakehouse on Lance format](/static/assets/images/overview/lancedb-suite.svg) + +## Why teams use LanceDB - - Use LanceDB to curate, explore and distribute very large multimodal datasets for training and fine-tuning models. - LanceDB comes with built-in table versioning, schema evolution, and fast random access, making it far more efficient to do - dataset slicing, sampling, filtering and shuffles on large, rapidly evolving datasets. + + Store images, video, audio, text, annotations, embeddings, and model-generated features together in one schema-enforced table. + The same table can support dataset curation, feature backfills, experiment splits, retrieval, and training. + + + Training workloads mix fast random access with high-throughput sequential scans. LanceDB is designed for both, so + teams can shuffle data into GPU-ready batches more efficiently, improve input throughput, and iterate on experiments faster. - - Use LanceDB as the data + retrieval layer for production AI workloads: RAG, agents, semantic search, - recommendation systems, and more. - Keep multimodal data, metadata, and embeddings in the same table and query them via vector search, - full-text search or SQL. Easily add new features (columns in your tables) as your - application evolves, without copying existing data. + + Whether the end user is a human or an agent, LanceDB powers production retrieval workloads such as semantic search, + hybrid search, RAG, agent memory, and recommendation systems. Retrieval runs against the same LanceDB tables used + for curation, feature engineering, and training workflows. -LanceDB is designed for a variety of workloads and deployment scenarios, and supports use cases -that are way beyond traditional vector search. The LanceDB suite includes LanceDB OSS, an open-source embedded library, -and LanceDB Enterprise, a distributed and managed multimodal lakehouse. -Both are built on top of the same open-source Lance format and table abstractions. - -![](/static/assets/images/overview/lancedb-suite.png) +## Start with your workload -## Use cases - -- **Search**: Build high-performance search and retrieval applications using LanceDB's optimized storage, including vector search, full-text search, and hybrid search with secondary indexes. -- **Data Curation**: Manage and filter on petabyte-scale multimodal datasets, including video and point cloud data, to gain insights, explore data and inform model development. -- **Feature engineering**: Add new columns (features), create embeddings, and transform your data at -scale. LanceDB lets you extend tables both vertically and horizontally with minimal I/O overhead. -- **Training**: Efficiently access and manage large-scale multimodal datasets for training and fine-tuning AI models. + + + Learn why LanceDB works well as the data layer for training workloads. + + + Use LanceDB tables and permutations for projected, shuffled, random-access training reads. + + + Explore Lance-formatted multimodal datasets with raw bytes, metadata, embeddings, and indices. + + + Use vector search, full-text search, hybrid search, reranking, filtering, and SQL. + + -## Choose how you run LanceDB +## From local development to production scale -Depending on your needs, you can choose one of the following ways to run LanceDB. +LanceDB OSS and LanceDB Enterprise share the same Lance format and table model. Start locally with the embedded OSS +library, then move to Enterprise when your team needs distributed scale, managed infrastructure, private deployment, +or higher-throughput curation, feature engineering, search and retrieval, and training workflows. ### 1. LanceDB OSS The fastest way to get started is the open-source embedded library, with client SDKs in Python, TypeScript -and Rust. Run it locally during development, then use the same data model and APIs as you scale up -and need a managed solution. Start here: +and Rust. Run it locally in just a few steps, which lets you explore datasets, curate data, and run search and retrieval workloads +for agents. Start here: - Create tables, search vectors, and modify data in LanceDB. + Create tables, evolve schemas, version data, and modify rows in LanceDB. ### 2. LanceDB Enterprise -[LanceDB Enterprise](/enterprise) is a distributed and managed **multimodal lakehouse** built for -search, curation, feature engineering, and training-oriented data access workflows -on top of the same core table abstraction. This eliminates the need for teams to build bespoke -infrastructure to manage petabyte-scale multimodal datasets. +[LanceDB Enterprise](/enterprise) is a petabyte-scale (and beyond), distributed **multimodal lakehouse** platform built for +search, curation, feature engineering, and high-throughput training data access workflows on top of the same core table +abstraction. This eliminates the need for teams to build bespoke infrastructure to manage large multimodal datasets. To set up LanceDB Enterprise in your organization, reach out to us at [contact@lancedb.com](mailto:contact@lancedb.com). @@ -88,4 +121,4 @@ private deployments, and can operate under strict [security requirements](/enter href="/enterprise/quickstart" > Get started with LanceDB Enterprise in minutes. - \ No newline at end of file + diff --git a/docs/static/assets/images/overview/lancedb-suite.svg b/docs/static/assets/images/overview/lancedb-suite.svg new file mode 100644 index 00000000..79805fc3 --- /dev/null +++ b/docs/static/assets/images/overview/lancedb-suite.svg @@ -0,0 +1,92 @@ + + LanceDB suite + LanceDB OSS supports search. LanceDB Enterprise is a multimodal lakehouse for curation, feature engineering, search and retrieval, and training. Both are built on the Lance open lakehouse format for multimodal AI. + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + LanceDB OSS + Search + + + + + + + + + + + + + + + LanceDB Enterprise - Multimodal Lakehouse + + Curation + Feature + Engineering + Search & + Retrieval + Training + + + + + + + + + + Lance Open Lakehouse Format for Multimodal AI + + diff --git a/docs/static/assets/images/overview/training-data-lifecycle.svg b/docs/static/assets/images/overview/training-data-lifecycle.svg new file mode 100644 index 00000000..6127b44a --- /dev/null +++ b/docs/static/assets/images/overview/training-data-lifecycle.svg @@ -0,0 +1,52 @@ + + Training data lifecycle + Four pillars of the LanceDB training data lifecycle: Curation, Feature Engineering, Search and Retrieval, and Training. + + + + + + + + + + + + + + + + + + + + + + + + + + + + Curation + + + + + + Feature + Engineering + + + + + + Search & + Retrieval + + + + + + Training + diff --git a/docs/training/why-lancedb.mdx b/docs/training/why-lancedb.mdx new file mode 100644 index 00000000..0adfc81a --- /dev/null +++ b/docs/training/why-lancedb.mdx @@ -0,0 +1,118 @@ +--- +title: "Why LanceDB for Training" +sidebarTitle: "Why LanceDB for training" +description: "Use LanceDB as the multimodal data layer for model training, fine-tuning, curation, and feature engineering workflows." +icon: fire +--- + +LanceDB is built for AI teams that need a practical data layer between raw multimodal datasets and model training. +Instead of moving data through separate systems for curation, feature engineering, search, manifests, and training, +you can keep the whole workflow attached to one versioned LanceDB table. + +That table can hold images, video, audio, text, annotations, metadata, embeddings, tokenized fields, model outputs, +quality signals, and training-ready tensors. As the dataset evolves, LanceDB lets you add new columns, filter rows, +pin versions, and read batches without rewriting the original data. + +![Training data lifecycle: Curation, Feature Engineering, Search and Retrieval, Training](/static/assets/images/overview/training-data-lifecycle.svg) + +LanceDB gives these stages one platform, so curation, feature engineering, retrieval, and training stay connected. + +## A connected data lifecycle + +Training pipelines usually need more than a pile of files. They need curation, derived features, reproducible splits, +fast random access, and a clean path into frameworks such as PyTorch. LanceDB keeps these pieces connected through +the same table model, whether you organize a workflow as one table or several related tables. + + + + Use filters, vector search, full-text search, and retrieval workflows to find the examples that matter: hard negatives, + long-tail failure modes, duplicate clusters, low-quality samples, or targeted fine-tuning slices. + + + Add embeddings, detections, OCR output, labels, token IDs, hidden states, deduplication flags, or quality scores as + new columns. Lance's columnar layout and schema evolution avoid rewriting large raw media columns when you add features. + + + Build filtered splits and materialized views from the table instead of exporting CSV manifests. Data versions and tags + make it possible to tie a checkpoint back to the exact rows and features used for training. + + + Use fast random access and column projection to read only the columns a training step needs. LanceDB tables can be read + from local storage or object storage, and integrate with data loading patterns such as PyTorch datasets. + + + +## Lance as the foundation + +LanceDB is built on [Lance](https://lance.org/), an open-source lakehouse format designed for multimodal AI data. +The table below highlights the Lance features that enable the multimodal lakehouse on top. + +| Capability | Why it matters for training | +|---|---| +| **Multimodal columns** | Store raw bytes, annotations, metadata, embeddings, and features together. | +| **Fast random access** | Support shuffled and sampled reads without reshuffling the dataset on disk. | +| **Column projection** | Read only images, tokens, labels, embeddings, or hidden states needed by a given run. | +| **Schema evolution** | Add new feature columns without rewriting existing media columns. | +| **Versioning** | Reproduce experiments against the same table snapshot, even as the dataset evolves. | +| **Search and filtering** | Find and materialize useful training slices directly from the table. | + +## Search inside training workflows + +Search is not limited to QA systems, agents, or production retrieval apps. It is also a practical way to inspect, +curate, and improve training data: + +- Find visually similar examples when debugging model failures. +- Retrieve hard negatives or near-duplicates for contrastive training. +- Combine vector search, full-text search, and metadata filters to build targeted fine-tuning slices. +- Reuse the same table for both offline curation and production retrieval. + +In LanceDB, retrieval and training workflows can operate over the same multimodal tables instead of forcing teams to +manage separate data systems for each stage. + +## Projects using LanceDB for training workflows + + + + A platform for reproducible world-model research built on a LanceDB data layer, reporting faster data loading on Push-T workloads. + + + A joint-embedding predictive world model from pixels, trained on the stable-worldmodel platform and its LanceDB data layer. + + + A drop-in LanceDB backend for Hugging Face LeRobot datasets with faster loading across robotics datasets. + + + +In the world-model ecosystem, [stable-worldmodel](https://github.com/galilai-group/stable-worldmodel) reports +3-4x faster data loading on Push-T versus HDF5 / MP4 at a fraction of the disk footprint. Across these projects, +LanceDB and Lance provide the multimodal data layer that keeps raw observations, annotations, features, and training +access patterns in one format instead of scattering them across task-specific stores. + +## Next steps + + + + Learn how to use LanceDB permutations to select rows, project columns, split datasets, and shuffle training reads. + + + Use LanceDB tables and permutations with `torch.utils.data.DataLoader`. + + + Fine-tune an AV perception model on curated failure-mode slices backed by one LanceDB table. + + + Fine-tune a VLM on TextVQA using LanceDB and Geneva to cache expensive training features. + + From 19001a0964ef26b41aaa756e5f419afb79338901 Mon Sep 17 00:00:00 2001 From: prrao87 <35005448+prrao87@users.noreply.github.com> Date: Mon, 29 Jun 2026 09:03:52 -0400 Subject: [PATCH 2/5] docs: fix overview SVG dark mode rendering --- docs/index.mdx | 4 +- .../assets/images/overview/lancedb-suite.svg | 45 ++++++++++--------- .../overview/training-data-lifecycle.svg | 20 ++++----- 3 files changed, 36 insertions(+), 33 deletions(-) diff --git a/docs/index.mdx b/docs/index.mdx index 78f13201..c9cf4f75 100644 --- a/docs/index.mdx +++ b/docs/index.mdx @@ -15,8 +15,8 @@ fragmented stack of storage, feature, retrieval, and training systems. ## Build better models, faster -Training data pipelines and experimentation slow down when raw data, metadata, embeddings, features, and governance -artifacts live in separate systems. LanceDB keeps them together in one versioned multimodal table, so teams spend less +Training data and experimentation slow down when raw data, metadata, embeddings, features, and governance +artifacts live in separate systems. LanceDB keeps them together in one versioned multimodal table, so AI teams spend less time stitching infrastructure together and more time improving datasets, testing features, and keeping GPUs fed. ![Training data lifecycle: Curation, Feature Engineering, Search and Retrieval, Training](/static/assets/images/overview/training-data-lifecycle.svg) diff --git a/docs/static/assets/images/overview/lancedb-suite.svg b/docs/static/assets/images/overview/lancedb-suite.svg index 79805fc3..92a9335b 100644 --- a/docs/static/assets/images/overview/lancedb-suite.svg +++ b/docs/static/assets/images/overview/lancedb-suite.svg @@ -28,20 +28,23 @@ - + + - - + + + + + + + + + + - - - - - + - - - + @@ -54,10 +57,10 @@ - LanceDB OSS - Search + LanceDB OSS + Search - + @@ -70,14 +73,14 @@ - LanceDB Enterprise - Multimodal Lakehouse + LanceDB Enterprise - Multimodal Lakehouse - Curation - Feature - Engineering - Search & - Retrieval - Training + Curation + Feature + Engineering + Search & + Retrieval + Training diff --git a/docs/static/assets/images/overview/training-data-lifecycle.svg b/docs/static/assets/images/overview/training-data-lifecycle.svg index 6127b44a..a533c70e 100644 --- a/docs/static/assets/images/overview/training-data-lifecycle.svg +++ b/docs/static/assets/images/overview/training-data-lifecycle.svg @@ -11,13 +11,13 @@ - + - + - + - + @@ -28,25 +28,25 @@ - Curation + Curation - Feature - Engineering + Feature + Engineering - Search & - Retrieval + Search & + Retrieval - Training + Training From cba407d7b80464e24886dda6140a662bf1e0168e Mon Sep 17 00:00:00 2001 From: prrao87 <35005448+prrao87@users.noreply.github.com> Date: Mon, 29 Jun 2026 09:15:05 -0400 Subject: [PATCH 3/5] docs: revamp Lance capabilities table --- docs/lance.mdx | 30 ++++++++++++++++-------------- 1 file changed, 16 insertions(+), 14 deletions(-) diff --git a/docs/lance.mdx b/docs/lance.mdx index 88c26722..e7922563 100644 --- a/docs/lance.mdx +++ b/docs/lance.mdx @@ -5,15 +5,15 @@ description: "Open-source lakehouse format for multimodal AI." icon: "/static/assets/logo/lance-logo-gray.svg" --- -[Lance](https://lance.org/) is an open-source lakehouse format, which provides the -foundation for LanceDB's capabilities. It provides a file format, -table format, and catalog spec with multimodal data at the center of its design, allowing developers +[Lance](https://lance.org/) is an open-source, columnar lakehouse format for multimodal AI. +It provides a file format, table format, and lightweight catalog spec, allowing developers to build a complete open lakehouse on top of object storage. -Building on top of open foundations and optimizing the format for AI workloads brings -high-performance vector search, full-text search, random access, and feature engineering capabilities -to a single unified system ([LanceDB](/enterprise)), eliminating the need for bespoke ETL and data pipelines that move data -to multiple other specialized data systems. +Building on top of open foundations and optimizing the format for random access +(without compromising scan performance) enables +high-performance vector search, full-text search, indexing, and feature engineering capabilities. +[LanceDB](/enterprise) builds on these capabilities so teams can work with one multimodal data layer +instead of moving data across separate storage, search, feature, and training systems. -## Advantages of the Lance format +## Capabilities of the Lance format -Advantage | Description +Capability | What it enables --- | --- -Multimodal storage | Efficiently holds vectors, images, videos, audio, text, and more -Version control | Built-in data versioning for reproducible ML experiments and data lineage -ML-optimized | Designed for training and inference workloads with fast random access -Query performance | Columnar storage enables blazing-fast vector search and analytics -Cloud-native | Seamless integration with cloud object stores (S3, GCS, Azure Blob) +Multimodal columns | Store images, video, audio, text, embeddings, annotations, metadata, and features in one table. +Fast random access and scans | Sample, shuffle, and retrieve individual rows efficiently without giving up high-throughput sequential reads. +Column projection | Read only the raw media, labels, tokens, embeddings, or feature columns needed by a workload. +Data evolution | Add or backfill embeddings, features, quality signals, and derived columns without rewriting the original dataset. +Versioned tables | Reproduce experiments, restore previous states, and tie downstream artifacts to the exact table version they used. +Hybrid search and indexing | Combine vector search, full-text search, and scalar filters on the same dataset with Lance indexes. +Open lakehouse interoperability | Build on object storage and connect Lance tables to open engines such as PyTorch, Ray, Spark, DuckDB, and Trino. ## Key concepts From b700eaa1de6802b442442929e40454f11c6d0adf Mon Sep 17 00:00:00 2001 From: prrao87 <35005448+prrao87@users.noreply.github.com> Date: Mon, 29 Jun 2026 09:26:33 -0400 Subject: [PATCH 4/5] docs: highlight Lance data evolution and blob APIs --- docs/lance.mdx | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/docs/lance.mdx b/docs/lance.mdx index e7922563..224d45c6 100644 --- a/docs/lance.mdx +++ b/docs/lance.mdx @@ -27,10 +27,11 @@ instead of moving data across separate storage, search, feature, and training sy Capability | What it enables --- | --- -Multimodal columns | Store images, video, audio, text, embeddings, annotations, metadata, and features in one table. +Multimodal storage | Store images, video, audio, text, embeddings, annotations, metadata, features, and more, all in one table. +First-class blob API | Store large binary objects such as images, video, audio, and model artifacts in blob columns with lazy reads and streaming byte access. Fast random access and scans | Sample, shuffle, and retrieve individual rows efficiently without giving up high-throughput sequential reads. -Column projection | Read only the raw media, labels, tokens, embeddings, or feature columns needed by a workload. -Data evolution | Add or backfill embeddings, features, quality signals, and derived columns without rewriting the original dataset. +Flexible data evolution | Add, drop, rename, or alter columns as datasets change, often without rewriting existing data files. +Feature backfills | Populate new embedding, label, quality signal, or feature columns with SQL expressions, batch UDFs, or merges without rewriting the original dataset. Versioned tables | Reproduce experiments, restore previous states, and tie downstream artifacts to the exact table version they used. Hybrid search and indexing | Combine vector search, full-text search, and scalar filters on the same dataset with Lance indexes. Open lakehouse interoperability | Build on object storage and connect Lance tables to open engines such as PyTorch, Ray, Spark, DuckDB, and Trino. From e20352a55bc16ce186c06db0ea856b21bdaf6c7c Mon Sep 17 00:00:00 2001 From: prrao87 <35005448+prrao87@users.noreply.github.com> Date: Mon, 29 Jun 2026 09:34:17 -0400 Subject: [PATCH 5/5] Update Lance page --- docs/lance.mdx | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/docs/lance.mdx b/docs/lance.mdx index 224d45c6..e7a650e5 100644 --- a/docs/lance.mdx +++ b/docs/lance.mdx @@ -31,10 +31,9 @@ Multimodal storage | Store images, video, audio, text, embeddings, annotations, First-class blob API | Store large binary objects such as images, video, audio, and model artifacts in blob columns with lazy reads and streaming byte access. Fast random access and scans | Sample, shuffle, and retrieve individual rows efficiently without giving up high-throughput sequential reads. Flexible data evolution | Add, drop, rename, or alter columns as datasets change, often without rewriting existing data files. -Feature backfills | Populate new embedding, label, quality signal, or feature columns with SQL expressions, batch UDFs, or merges without rewriting the original dataset. Versioned tables | Reproduce experiments, restore previous states, and tie downstream artifacts to the exact table version they used. Hybrid search and indexing | Combine vector search, full-text search, and scalar filters on the same dataset with Lance indexes. -Open lakehouse interoperability | Build on object storage and connect Lance tables to open engines such as PyTorch, Ray, Spark, DuckDB, and Trino. +Open lakehouse interoperability | Build on object storage and connect Lance tables to open engines such as PyTorch, Ray, Spark, Trino, DuckDB and Polars. ## Key concepts