diff --git a/docs/docs.json b/docs/docs.json
index 7f237c5b..7d1519f7 100644
--- a/docs/docs.json
+++ b/docs/docs.json
@@ -63,6 +63,16 @@
           "performance"
         ]
       },
+      {
+        "group": "Model training",
+        "pages": [
+          "training/why-lancedb",
+          "training/index",
+          "training/torch",
+          "training/object-detection",
+          "training/vlm-finetuning"
+        ]
+      },
       {
         "group": "Guides",
         "pages": [
@@ -141,15 +151,6 @@
           "storage/index",
           "storage/configuration"
         ]
-      },
-      {
-        "group": "Training",
-        "pages": [
-          "training/index",
-          "training/torch",
-          "training/object-detection",
-          "training/vlm-finetuning"
-        ]
       }
     ]
   },
diff --git a/docs/index.mdx b/docs/index.mdx
index 83f4e19d..c9cf4f75 100644
--- a/docs/index.mdx
+++ b/docs/index.mdx
@@ -3,51 +3,85 @@ title: LanceDB
 sidebarTitle: "LanceDB"
 description: "Multimodal lakehouse for AI."
 icon: "/static/assets/logo/lancedb-icon-gray.svg"
-keywords: ["open source", "oss"]
+keywords: ["multimodal lakehouse", "training", "feature engineering", "search", "open source", "oss"]
 ---
-**LanceDB** is a [multimodal lakehouse](https://lancedb.com/blog/multimodal-lakehouse/) for
-AI, built on top of [Lance](/lance), an open-source lakehouse format. Below, we list a few
-ways LanceDB can help you build and scale your AI and ML workloads.
+**LanceDB** is a [multimodal lakehouse](https://lancedb.com/blog/multimodal-lakehouse/) for AI teams that need
+one data layer for curation, feature engineering, search and retrieval, and model training.
+It is built on top of [Lance](/lance), an open-source lakehouse format designed for multimodal AI data.
+
+Move from data exploration to model training on one, unified platform without needing to manage a
+fragmented stack of storage, feature, retrieval, and training systems.
+
+## Build better models, faster
+
+Training data and experimentation slow down when raw data, metadata, embeddings, features, and governance
+artifacts live in separate systems.
LanceDB keeps them together in one versioned multimodal table, so AI teams spend less
+time stitching infrastructure together and more time improving datasets, testing features, and keeping GPUs fed.
+
+![Training data lifecycle: Curation, Feature Engineering, Search and Retrieval, Training](/static/assets/images/overview/training-data-lifecycle.svg)
+
+Use the same table to curate training data, add derived features, retrieve examples, and feed training jobs that rely on expensive GPUs.
+Training workloads can sample, shuffle, and scan projected columns from local storage or object storage, then assemble
+GPU-ready batches from a tagged dataset version.
+
+For a deeper look at how this works in training pipelines, start with [Why LanceDB for training](/training/why-lancedb).
+
+## LanceDB suite
+
+The LanceDB suite includes LanceDB OSS, an open-source embedded retrieval library, and LanceDB Enterprise,
+a multimodal lakehouse platform for the full AI data lifecycle.
+OSS is easy to set up on a local machine for search and regular-scale workflows. LanceDB Enterprise is built
+for teams that need scale without building bespoke infrastructure for curation,
+feature engineering, search and retrieval, and efficient training data access.
+
+![LanceDB suite: OSS search and Enterprise multimodal lakehouse on Lance format](/static/assets/images/overview/lancedb-suite.svg)
+
+## Why teams use LanceDB
-
- Use LanceDB to curate, explore and distribute very large multimodal datasets for training and fine-tuning models.
- LanceDB comes with built-in table versioning, schema evolution, and fast random access, making it far more efficient to do
- dataset slicing, sampling, filtering and shuffles on large, rapidly evolving datasets.
+
+ Store images, video, audio, text, annotations, embeddings, and model-generated features together in one schema-enforced table.
+ The same table can support dataset curation, feature backfills, experiment splits, retrieval, and training.
+
+
+ Training workloads mix fast random access with high-throughput sequential scans. LanceDB is designed for both, so
+ teams can shuffle data into GPU-ready batches more efficiently, improve input throughput, and iterate on experiments faster.
-
- Use LanceDB as the data + retrieval layer for production AI workloads: RAG, agents, semantic search,
- recommendation systems, and more.
- Keep multimodal data, metadata, and embeddings in the same table and query them via vector search,
- full-text search or SQL. Easily add new features (columns in your tables) as your
- application evolves, without copying existing data.
+
+ Whether the end user is a human or an agent, LanceDB powers production retrieval workloads such as semantic search,
+ hybrid search, RAG, agent memory, and recommendation systems. Retrieval runs against the same LanceDB tables used
+ for curation, feature engineering, and training workflows.
-LanceDB is designed for a variety of workloads and deployment scenarios, and supports use cases
-that are way beyond traditional vector search. The LanceDB suite includes LanceDB OSS, an open-source embedded library,
-and LanceDB Enterprise, a distributed and managed multimodal lakehouse.
-Both are built on top of the same open-source Lance format and table abstractions.
-
-![](/static/assets/images/overview/lancedb-suite.png)
+## Start with your workload
-## Use cases
-
-- **Search**: Build high-performance search and retrieval applications using LanceDB's optimized storage, including vector search, full-text search, and hybrid search with secondary indexes.
-- **Data Curation**: Manage and filter on petabyte-scale multimodal datasets, including video and point cloud data, to gain insights, explore data and inform model development.
-- **Feature engineering**: Add new columns (features), create embeddings, and transform your data at
-scale. LanceDB lets you extend tables both vertically and horizontally with minimal I/O overhead.
-- **Training**: Efficiently access and manage large-scale multimodal datasets for training and fine-tuning AI models.
+
+
+ Learn why LanceDB works well as the data layer for training workloads.
+
+
+ Use LanceDB tables and permutations for projected, shuffled, random-access training reads.
+
+
+ Explore Lance-formatted multimodal datasets with raw bytes, metadata, embeddings, and indices.
+
+
+ Use vector search, full-text search, hybrid search, reranking, filtering, and SQL.
+
+
-## Choose how you run LanceDB
+## From local development to production scale
-Depending on your needs, you can choose one of the following ways to run LanceDB.
+LanceDB OSS and LanceDB Enterprise share the same Lance format and table model. Start locally with the embedded OSS
+library, then move to Enterprise when your team needs distributed scale, managed infrastructure, private deployment,
+or higher-throughput curation, feature engineering, search and retrieval, and training workflows.
 ### 1. LanceDB OSS
 The fastest way to get started is the open-source embedded library, with client SDKs in Python, TypeScript
-and Rust. Run it locally during development, then use the same data model and APIs as you scale up
-and need a managed solution. Start here:
+and Rust. Run it locally in just a few steps, which lets you explore datasets, curate data, and run search and retrieval workloads
+for agents. Start here:
- Create tables, search vectors, and modify data in LanceDB.
+ Create tables, evolve schemas, version data, and modify rows in LanceDB.
 ### 2. LanceDB Enterprise
-[LanceDB Enterprise](/enterprise) is a distributed and managed **multimodal lakehouse** built for
-search, curation, feature engineering, and training-oriented data access workflows
-on top of the same core table abstraction. This eliminates the need for teams to build bespoke
-infrastructure to manage petabyte-scale multimodal datasets.
+[LanceDB Enterprise](/enterprise) is a petabyte-scale (and beyond), distributed **multimodal lakehouse** platform built for
+search, curation, feature engineering, and high-throughput training data access workflows on top of the same core table
+abstraction. This eliminates the need for teams to build bespoke infrastructure to manage large multimodal datasets.
 To set up LanceDB Enterprise in your organization, reach out to us at [contact@lancedb.com](mailto:contact@lancedb.com).
@@ -88,4 +121,4 @@ private deployments, and can operate under strict [security requirements](/enter
 href="/enterprise/quickstart"
 >
 Get started with LanceDB Enterprise in minutes.
-
\ No newline at end of file
+
diff --git a/docs/lance.mdx b/docs/lance.mdx
index 88c26722..e7a650e5 100644
--- a/docs/lance.mdx
+++ b/docs/lance.mdx
@@ -5,15 +5,15 @@ description: "Open-source lakehouse format for multimodal AI."
 icon: "/static/assets/logo/lance-logo-gray.svg"
 ---
-[Lance](https://lance.org/) is an open-source lakehouse format, which provides the
-foundation for LanceDB's capabilities. It provides a file format,
-table format, and catalog spec with multimodal data at the center of its design, allowing developers
+[Lance](https://lance.org/) is an open-source, columnar lakehouse format for multimodal AI.
+It provides a file format, table format, and lightweight catalog spec, allowing developers
 to build a complete open lakehouse on top of object storage.
-Building on top of open foundations and optimizing the format for AI workloads brings
-high-performance vector search, full-text search, random access, and feature engineering capabilities
-to a single unified system ([LanceDB](/enterprise)), eliminating the need for bespoke ETL and data pipelines that move data
-to multiple other specialized data systems.
+Building on top of open foundations and optimizing the format for random access
+(without compromising scan performance) enables
+high-performance vector search, full-text search, indexing, and feature engineering capabilities.
+[LanceDB](/enterprise) builds on these capabilities so teams can work with one multimodal data layer
+instead of moving data across separate storage, search, feature, and training systems.
-## Advantages of the Lance format
+## Capabilities of the Lance format
-Advantage | Description
+Capability | What it enables
 --- | ---
-Multimodal storage | Efficiently holds vectors, images, videos, audio, text, and more
-Version control | Built-in data versioning for reproducible ML experiments and data lineage
-ML-optimized | Designed for training and inference workloads with fast random access
-Query performance | Columnar storage enables blazing-fast vector search and analytics
-Cloud-native | Seamless integration with cloud object stores (S3, GCS, Azure Blob)
+Multimodal storage | Store images, video, audio, text, embeddings, annotations, metadata, features, and more, all in one table.
+First-class blob API | Store large binary objects such as images, video, audio, and model artifacts in blob columns with lazy reads and streaming byte access.
+Fast random access and scans | Sample, shuffle, and retrieve individual rows efficiently without giving up high-throughput sequential reads.
+Flexible data evolution | Add, drop, rename, or alter columns as datasets change, often without rewriting existing data files.
+Versioned tables | Reproduce experiments, restore previous states, and tie downstream artifacts to the exact table version they used.
+Hybrid search and indexing | Combine vector search, full-text search, and scalar filters on the same dataset with Lance indexes.
+Open lakehouse interoperability | Build on object storage and connect Lance tables to open engines such as PyTorch, Ray, Spark, Trino, DuckDB and Polars.
 ## Key concepts
diff --git a/docs/static/assets/images/overview/lancedb-suite.svg b/docs/static/assets/images/overview/lancedb-suite.svg
new file mode 100644
index 00000000..92a9335b
--- /dev/null
+++ b/docs/static/assets/images/overview/lancedb-suite.svg
@@ -0,0 +1,95 @@
+ LanceDB suite
+ LanceDB OSS supports search. LanceDB Enterprise is a multimodal lakehouse for curation, feature engineering, search and retrieval, and training. Both are built on the Lance open lakehouse format for multimodal AI.
+ LanceDB OSS
+ Search
+ LanceDB Enterprise - Multimodal Lakehouse
+ Curation
+ Feature
+ Engineering
+ Search &
+ Retrieval
+ Training
+ Lance Open Lakehouse Format for Multimodal AI
diff --git a/docs/static/assets/images/overview/training-data-lifecycle.svg b/docs/static/assets/images/overview/training-data-lifecycle.svg
new file mode 100644
index 00000000..a533c70e
--- /dev/null
+++ b/docs/static/assets/images/overview/training-data-lifecycle.svg
@@ -0,0 +1,52 @@
+ Training data lifecycle
+ Four pillars of the LanceDB training data lifecycle: Curation, Feature Engineering, Search and Retrieval, and Training.
+ Curation
+ Feature
+ Engineering
+ Search &
+ Retrieval
+ Training
diff --git a/docs/training/why-lancedb.mdx b/docs/training/why-lancedb.mdx
new file mode 100644
index 00000000..0adfc81a
--- /dev/null
+++ b/docs/training/why-lancedb.mdx
@@ -0,0 +1,118 @@
+---
+title: "Why LanceDB for Training"
+sidebarTitle: "Why LanceDB for training"
+description: "Use LanceDB as the multimodal data layer for model training, fine-tuning, curation, and feature engineering workflows."
+icon: fire
+---
+
+LanceDB is built for AI teams that need a practical data layer between raw multimodal datasets and model training.
+Instead of moving data through separate systems for curation, feature engineering, search, manifests, and training,
+you can keep the whole workflow attached to one versioned LanceDB table.
+
+That table can hold images, video, audio, text, annotations, metadata, embeddings, tokenized fields, model outputs,
+quality signals, and training-ready tensors. As the dataset evolves, LanceDB lets you add new columns, filter rows,
+pin versions, and read batches without rewriting the original data.
+
+![Training data lifecycle: Curation, Feature Engineering, Search and Retrieval, Training](/static/assets/images/overview/training-data-lifecycle.svg)
+
+LanceDB gives these stages one platform, so curation, feature engineering, retrieval, and training stay connected.
+
+## A connected data lifecycle
+
+Training pipelines usually need more than a pile of files. They need curation, derived features, reproducible splits,
+fast random access, and a clean path into frameworks such as PyTorch. LanceDB keeps these pieces connected through
+the same table model, whether you organize a workflow as one table or several related tables.
+
+
+
+ Use filters, vector search, full-text search, and retrieval workflows to find the examples that matter: hard negatives,
+ long-tail failure modes, duplicate clusters, low-quality samples, or targeted fine-tuning slices.
+
+
+ Add embeddings, detections, OCR output, labels, token IDs, hidden states, deduplication flags, or quality scores as
+ new columns. Lance's columnar layout and schema evolution avoid rewriting large raw media columns when you add features.
+
+
+ Build filtered splits and materialized views from the table instead of exporting CSV manifests. Data versions and tags
+ make it possible to tie a checkpoint back to the exact rows and features used for training.
+
+
+ Use fast random access and column projection to read only the columns a training step needs.
LanceDB tables can be read
+ from local storage or object storage, and integrate with data loading patterns such as PyTorch datasets.
+
+
+
+## Lance as the foundation
+
+LanceDB is built on [Lance](https://lance.org/), an open-source lakehouse format designed for multimodal AI data.
+The table below highlights the Lance features that enable the multimodal lakehouse on top.
+
+| Capability | Why it matters for training |
+|---|---|
+| **Multimodal columns** | Store raw bytes, annotations, metadata, embeddings, and features together. |
+| **Fast random access** | Support shuffled and sampled reads without reshuffling the dataset on disk. |
+| **Column projection** | Read only images, tokens, labels, embeddings, or hidden states needed by a given run. |
+| **Schema evolution** | Add new feature columns without rewriting existing media columns. |
+| **Versioning** | Reproduce experiments against the same table snapshot, even as the dataset evolves. |
+| **Search and filtering** | Find and materialize useful training slices directly from the table. |
+
+## Search inside training workflows
+
+Search is not limited to QA systems, agents, or production retrieval apps. It is also a practical way to inspect,
+curate, and improve training data:
+
+- Find visually similar examples when debugging model failures.
+- Retrieve hard negatives or near-duplicates for contrastive training.
+- Combine vector search, full-text search, and metadata filters to build targeted fine-tuning slices.
+- Reuse the same table for both offline curation and production retrieval.
+
+In LanceDB, retrieval and training workflows can operate over the same multimodal tables instead of forcing teams to
+manage separate data systems for each stage.
+
+## Projects using LanceDB for training workflows
+
+
+
+ A platform for reproducible world-model research built on a LanceDB data layer, reporting faster data loading on Push-T workloads.
+
+
+ A joint-embedding predictive world model from pixels, trained on the stable-worldmodel platform and its LanceDB data layer.
+
+
+ A drop-in LanceDB backend for Hugging Face LeRobot datasets with faster loading across robotics datasets.
+
+
+
+In the world-model ecosystem, [stable-worldmodel](https://github.com/galilai-group/stable-worldmodel) reports
+3-4x faster data loading on Push-T versus HDF5 / MP4 at a fraction of the disk footprint. Across these projects,
+LanceDB and Lance provide the multimodal data layer that keeps raw observations, annotations, features, and training
+access patterns in one format instead of scattering them across task-specific stores.
+
+## Next steps
+
+
+
+ Learn how to use LanceDB permutations to select rows, project columns, split datasets, and shuffle training reads.
+
+
+ Use LanceDB tables and permutations with `torch.utils.data.DataLoader`.
+
+
+ Fine-tune an AV perception model on curated failure-mode slices backed by one LanceDB table.
+
+
+ Fine-tune a VLM on TextVQA using LanceDB and Geneva to cache expensive training features.
+