From 1071a330adc3e563c73a4e9dd335b3b8d0d92f90 Mon Sep 17 00:00:00 2001
From: prrao87 <35005448+prrao87@users.noreply.github.com>
Date: Sun, 28 Jun 2026 17:40:31 -0400
Subject: [PATCH 1/5] docs: emphasize training data lifecycle

---
 docs/docs.json                                |  19 +--
 docs/index.mdx                                | 109 ++++++++++------
 .../assets/images/overview/lancedb-suite.svg  |  92 ++++++++++++++
 .../overview/training-data-lifecycle.svg      |  52 ++++++++
 docs/training/why-lancedb.mdx                 | 118 ++++++++++++++++++
 5 files changed, 343 insertions(+), 47 deletions(-)
 create mode 100644 docs/static/assets/images/overview/lancedb-suite.svg
 create mode 100644 docs/static/assets/images/overview/training-data-lifecycle.svg
 create mode 100644 docs/training/why-lancedb.mdx

diff --git a/docs/docs.json b/docs/docs.json
index 7f237c5b..7d1519f7 100644
--- a/docs/docs.json
+++ b/docs/docs.json
@@ -63,6 +63,16 @@
               "performance"
             ]
           },
+          {
+            "group": "Model training",
+            "pages": [
+              "training/why-lancedb",
+              "training/index",
+              "training/torch",
+              "training/object-detection",
+              "training/vlm-finetuning"
+            ]
+          },
           {
             "group": "Guides",
             "pages": [
@@ -141,15 +151,6 @@
                   "storage/index",
                   "storage/configuration"
                 ]
-              },
-              {
-                "group": "Training",
-                "pages": [
-                  "training/index",
-                  "training/torch",
-                  "training/object-detection",
-                  "training/vlm-finetuning"
-                ]
               }
             ]
           },
diff --git a/docs/index.mdx b/docs/index.mdx
index 83f4e19d..78f13201 100644
--- a/docs/index.mdx
+++ b/docs/index.mdx
@@ -3,51 +3,85 @@ title: LanceDB
 sidebarTitle: "LanceDB"
 description: "Multimodal lakehouse for AI."
 icon: "/static/assets/logo/lancedb-icon-gray.svg"
-keywords: ["open source", "oss"]
+keywords: ["multimodal lakehouse", "training", "feature engineering", "search", "open source", "oss"]
 ---
 
-**LanceDB** is a [multimodal lakehouse](https://lancedb.com/blog/multimodal-lakehouse/) for
-AI, built on top of [Lance](/lance), an open-source lakehouse format. Below, we list a few
-ways LanceDB can help you build and scale your AI and ML workloads.
+**LanceDB** is a [multimodal lakehouse](https://lancedb.com/blog/multimodal-lakehouse/) for AI teams that need
+one data layer for curation, feature engineering, search and retrieval, and model training.
+It is built on top of [Lance](/lance), an open-source lakehouse format designed for multimodal AI data.
+
+Move from data exploration to model training on one unified platform without managing a
+fragmented stack of storage, feature, retrieval, and training systems.
+
+## Build better models, faster
+
+Training data pipelines and experimentation slow down when raw data, metadata, embeddings, features, and governance
+artifacts live in separate systems. LanceDB keeps them together in one versioned multimodal table, so teams spend less
+time stitching infrastructure together and more time improving datasets, testing features, and keeping GPUs fed.
+
+![Training data lifecycle: Curation, Feature Engineering, Search and Retrieval, Training](/static/assets/images/overview/training-data-lifecycle.svg)
+
+Use the same table to curate training data, add derived features, retrieve examples, and feed training jobs that rely on expensive GPUs.
+Training workloads can sample, shuffle, and scan projected columns from local storage or object storage, then assemble
+GPU-ready batches from a tagged dataset version.
+
+For a deeper look at how this works in training pipelines, start with [Why LanceDB for training](/training/why-lancedb).
+
+## LanceDB suite
+
+The LanceDB suite includes LanceDB OSS, an open-source embedded retrieval library, and LanceDB Enterprise,
+a multimodal lakehouse platform for the full AI data lifecycle.
+OSS is easy to set up on a local machine for search and regular-scale workflows. LanceDB Enterprise is built
+for teams that need scale without building bespoke infrastructure for curation,
+feature engineering, search and retrieval, and efficient training data access.
+
+![LanceDB suite: OSS search and Enterprise multimodal lakehouse on Lance format](/static/assets/images/overview/lancedb-suite.svg)
+
+## Why teams use LanceDB
 
 <Steps>
-  <Step title="High-performance random access and data management for model training">
-    Use LanceDB to curate, explore and distribute very large multimodal datasets for training and fine-tuning models.
-    LanceDB comes with built-in table versioning, schema evolution, and fast random access, making it far more efficient to do
-    dataset slicing, sampling, filtering and shuffles on large, rapidly evolving datasets.
+  <Step title="One table for the whole AI data loop">
+    Store images, video, audio, text, annotations, embeddings, and model-generated features together in one schema-enforced table.
+    The same table can support dataset curation, feature backfills, experiment splits, retrieval, and training.
+  </Step>
+  <Step title="High-throughput data access for training">
+    Training workloads mix fast random access with high-throughput sequential scans. LanceDB is designed for both, so
+    teams can shuffle data into GPU-ready batches more efficiently, improve input throughput, and iterate on experiments faster.
   </Step>
-  <Step title="Massively scalable, fast and high-quality retrieval − without breaking the bank">
-    Use LanceDB as the data + retrieval layer for production AI workloads: RAG, agents, semantic search,
-    recommendation systems, and more.
-    Keep multimodal data, metadata, and embeddings in the same table and query them via vector search,
-    full-text search or SQL. Easily add new features (columns in your tables) as your
-    application evolves, without copying existing data.
+  <Step title="Fast, versatile search and retrieval">
+    Whether the end user is a human or an agent, LanceDB powers production retrieval workloads such as semantic search,
+    hybrid search, RAG, agent memory, and recommendation systems. Retrieval runs against the same LanceDB tables used
+    for curation, feature engineering, and training workflows.
   </Step>
 </Steps>
 
-LanceDB is designed for a variety of workloads and deployment scenarios, and supports use cases
-that are way beyond traditional vector search. The LanceDB suite includes LanceDB OSS, an open-source embedded library,
-and LanceDB Enterprise, a distributed and managed multimodal lakehouse.
-Both are built on top of the same open-source Lance format and table abstractions.
-
-![](/static/assets/images/overview/lancedb-suite.png)
+## Start with your workload
 
-## Use cases
-
-- **Search**: Build high-performance search and retrieval applications using LanceDB's optimized storage, including vector search, full-text search, and hybrid search with secondary indexes.
-- **Data Curation**: Manage and filter on petabyte-scale multimodal datasets, including video and point cloud data, to gain insights, explore data and inform model development.
-- **Feature engineering**: Add new columns (features), create embeddings, and transform your data at
-scale. LanceDB lets you extend tables both vertically and horizontally with minimal I/O overhead.
-- **Training**: Efficiently access and manage large-scale multimodal datasets for training and fine-tuning AI models.
+<CardGroup cols={2}>
+  <Card title="Train and fine-tune models" icon="fire" href="/training/why-lancedb">
+    Learn why LanceDB works well as the data layer for training workloads.
+  </Card>
+  <Card title="Load data into PyTorch" icon="boxes-stacked" href="/training/">
+    Use LanceDB tables and permutations for projected, shuffled, random-access training reads.
+  </Card>
+  <Card title="Browse ready-to-use datasets" icon="database" href="/datasets">
+    Explore Lance-formatted multimodal datasets with raw bytes, metadata, embeddings, and indices.
+  </Card>
+  <Card title="Build search and retrieval" icon="search" href="/search/">
+    Use vector search, full-text search, hybrid search, reranking, filtering, and SQL.
+  </Card>
+</CardGroup>
 
-## Choose how you run LanceDB
+## From local development to production scale
 
-Depending on your needs, you can choose one of the following ways to run LanceDB.
+LanceDB OSS and LanceDB Enterprise share the same Lance format and table model. Start locally with the embedded OSS
+library, then move to Enterprise when your team needs distributed scale, managed infrastructure, private deployment,
+or higher-throughput curation, feature engineering, search and retrieval, and training workflows.
 
 ### 1. LanceDB OSS
 The fastest way to get started is the open-source embedded library, with client SDKs in Python, TypeScript
-and Rust. Run it locally during development, then use the same data model and APIs as you scale up
-and need a managed solution. Start here:
+and Rust. Run it locally in a few steps to explore datasets, curate data, and run search and retrieval workloads
+for agents. Start here:
 
 <Columns cols={2}>
   <Card
@@ -59,19 +93,18 @@ and need a managed solution. Start here:
 </Card>
   <Card
     title="Basic Table Operations"
-    icon="search"
+    icon="table"
     href="/tables/"
   >
-    Create tables, search vectors, and modify data in LanceDB.
+    Create tables, evolve schemas, version data, and modify rows in LanceDB.
   </Card>
 </Columns>
 
 ### 2. LanceDB Enterprise
 
-[LanceDB Enterprise](/enterprise) is a distributed and managed **multimodal lakehouse** built for
-search, curation, feature engineering, and training-oriented data access workflows
-on top of the same core table abstraction. This eliminates the need for teams to build bespoke
-infrastructure to manage petabyte-scale multimodal datasets.
+[LanceDB Enterprise](/enterprise) is a distributed **multimodal lakehouse** platform that scales to petabytes and beyond, built for
+search, curation, feature engineering, and high-throughput training data access workflows on top of the same core table
+abstraction. This eliminates the need for teams to build bespoke infrastructure to manage large multimodal datasets.
 To set up LanceDB Enterprise in your organization, reach out to us at
 [contact@lancedb.com](mailto:contact@lancedb.com).
 
@@ -88,4 +121,4 @@ private deployments, and can operate under strict [security requirements](/enter
   href="/enterprise/quickstart"
 >
   Get started with LanceDB Enterprise in minutes.
-</Card>
\ No newline at end of file
+</Card>
diff --git a/docs/static/assets/images/overview/lancedb-suite.svg b/docs/static/assets/images/overview/lancedb-suite.svg
new file mode 100644
index 00000000..79805fc3
--- /dev/null
+++ b/docs/static/assets/images/overview/lancedb-suite.svg
@@ -0,0 +1,92 @@
+<svg width="2048" height="540" viewBox="0 0 2048 540" fill="none" xmlns="http://www.w3.org/2000/svg" role="img" aria-labelledby="title desc">
+  <title id="title">LanceDB suite</title>
+  <desc id="desc">LanceDB OSS supports search. LanceDB Enterprise is a multimodal lakehouse for curation, feature engineering, search and retrieval, and training. Both are built on the Lance open lakehouse format for multimodal AI.</desc>
+  <defs>
+    <filter id="cardShadow" x="-8%" y="-18%" width="116%" height="136%" color-interpolation-filters="sRGB">
+      <feDropShadow dx="0" dy="16" stdDeviation="18" flood-color="#4F2A1A" flood-opacity="0.16"/>
+    </filter>
+    <linearGradient id="enterpriseGradient" x1="512" y1="90" x2="2020" y2="90" gradientUnits="userSpaceOnUse">
+      <stop offset="0" stop-color="#E9C9F6"/>
+      <stop offset="0.52" stop-color="#F39A9A"/>
+      <stop offset="1" stop-color="#FF744F"/>
+    </linearGradient>
+    <linearGradient id="curationGradient" x1="512" y1="272" x2="872" y2="272" gradientUnits="userSpaceOnUse">
+      <stop offset="0" stop-color="#E8C9F4"/>
+      <stop offset="1" stop-color="#EFC5E2"/>
+    </linearGradient>
+    <linearGradient id="featureGradient" x1="894" y1="272" x2="1254" y2="272" gradientUnits="userSpaceOnUse">
+      <stop offset="0" stop-color="#F0ABB8"/>
+      <stop offset="1" stop-color="#F49A92"/>
+    </linearGradient>
+    <linearGradient id="searchGradient" x1="1276" y1="272" x2="1636" y2="272" gradientUnits="userSpaceOnUse">
+      <stop offset="0" stop-color="#F78F85"/>
+      <stop offset="1" stop-color="#FA8265"/>
+    </linearGradient>
+    <linearGradient id="trainingGradient" x1="1658" y1="272" x2="2018" y2="272" gradientUnits="userSpaceOnUse">
+      <stop offset="0" stop-color="#FF7C5D"/>
+      <stop offset="1" stop-color="#FF704E"/>
+    </linearGradient>
+  </defs>
+
+  <rect width="2048" height="540" fill="#FAF5F0"/>
+
+  <rect x="28" y="28" width="464" height="146" rx="8" fill="#3E3A35" filter="url(#cardShadow)"/>
+  <rect x="28" y="202" width="464" height="146" rx="8" fill="#3E3A35" filter="url(#cardShadow)"/>
+
+  <rect x="512" y="28" width="1506" height="146" rx="8" fill="url(#enterpriseGradient)" filter="url(#cardShadow)"/>
+  <rect x="512" y="202" width="360" height="146" rx="8" fill="url(#curationGradient)" filter="url(#cardShadow)"/>
+  <rect x="894" y="202" width="360" height="146" rx="8" fill="url(#featureGradient)" filter="url(#cardShadow)"/>
+  <rect x="1276" y="202" width="360" height="146" rx="8" fill="url(#searchGradient)" filter="url(#cardShadow)"/>
+  <rect x="1658" y="202" width="360" height="146" rx="8" fill="url(#trainingGradient)" filter="url(#cardShadow)"/>
+
+  <rect x="28" y="376" width="1990" height="116" rx="8" fill="#FAF5F0" stroke="#665FFF" stroke-width="3" filter="url(#cardShadow)"/>
+
+  <g fill="#FFFFFF">
+    <circle cx="78" cy="78" r="9"/>
+    <circle cx="94" cy="78" r="9"/>
+    <circle cx="110" cy="78" r="9"/>
+    <circle cx="78" cy="94" r="9"/>
+    <circle cx="110" cy="94" r="9"/>
+    <circle cx="126" cy="94" r="9"/>
+    <circle cx="78" cy="110" r="9"/>
+    <circle cx="94" cy="110" r="9"/>
+    <circle cx="110" cy="110" r="9"/>
+    <circle cx="94" cy="126" r="9"/>
+    <circle cx="126" cy="126" r="9"/>
+  </g>
+  <text x="162" y="117" fill="#FFFFFF" font-family="Inter, Arial, sans-serif" font-size="44" font-weight="760">LanceDB OSS</text>
+  <text x="260" y="291" text-anchor="middle" fill="#FFFFFF" font-family="Inter, Arial, sans-serif" font-size="42" font-weight="500">Search</text>
+
+  <g fill="#241712">
+    <circle cx="748" cy="78" r="9"/>
+    <circle cx="764" cy="78" r="9"/>
+    <circle cx="780" cy="78" r="9"/>
+    <circle cx="748" cy="94" r="9"/>
+    <circle cx="780" cy="94" r="9"/>
+    <circle cx="796" cy="94" r="9"/>
+    <circle cx="748" cy="110" r="9"/>
+    <circle cx="764" cy="110" r="9"/>
+    <circle cx="780" cy="110" r="9"/>
+    <circle cx="764" cy="126" r="9"/>
+    <circle cx="796" cy="126" r="9"/>
+  </g>
+  <text x="834" y="117" fill="#241712" font-family="Inter, Arial, sans-serif" font-size="44" font-weight="760">LanceDB Enterprise - Multimodal Lakehouse</text>
+
+  <text x="692" y="285" text-anchor="middle" fill="#241712" font-family="Inter, Arial, sans-serif" font-size="38" font-weight="500">Curation</text>
+  <text x="1074" y="265" text-anchor="middle" fill="#241712" font-family="Inter, Arial, sans-serif" font-size="38" font-weight="500">Feature</text>
+  <text x="1074" y="309" text-anchor="middle" fill="#241712" font-family="Inter, Arial, sans-serif" font-size="38" font-weight="500">Engineering</text>
+  <text x="1456" y="265" text-anchor="middle" fill="#241712" font-family="Inter, Arial, sans-serif" font-size="38" font-weight="500">Search &amp;</text>
+  <text x="1456" y="309" text-anchor="middle" fill="#241712" font-family="Inter, Arial, sans-serif" font-size="38" font-weight="500">Retrieval</text>
+  <text x="1838" y="285" text-anchor="middle" fill="#241712" font-family="Inter, Arial, sans-serif" font-size="38" font-weight="500">Training</text>
+
+  <rect x="456" y="414" width="42" height="42" rx="2" fill="#665FFF"/>
+  <g stroke="#FFFFFF" stroke-width="2.4" stroke-linecap="round" stroke-linejoin="round">
+    <path d="M467 446L487 426"/>
+    <path d="M467 438L479 426"/>
+    <path d="M475 446L487 434"/>
+    <path d="M482 420L493 424L493 438L482 434V420Z"/>
+  </g>
+  <text x="516" y="451" fill="#665FFF" font-family="Inter, Arial, sans-serif" font-size="44" font-weight="760">
+    <tspan font-weight="760">Lance</tspan><tspan font-weight="500"> Open Lakehouse Format for Multimodal AI</tspan>
+  </text>
+</svg>
diff --git a/docs/static/assets/images/overview/training-data-lifecycle.svg b/docs/static/assets/images/overview/training-data-lifecycle.svg
new file mode 100644
index 00000000..6127b44a
--- /dev/null
+++ b/docs/static/assets/images/overview/training-data-lifecycle.svg
@@ -0,0 +1,52 @@
+<svg width="1280" height="280" viewBox="0 0 1280 280" fill="none" xmlns="http://www.w3.org/2000/svg" role="img" aria-labelledby="title desc">
+  <title id="title">Training data lifecycle</title>
+  <desc id="desc">Four pillars of the LanceDB training data lifecycle: Curation, Feature Engineering, Search and Retrieval, and Training.</desc>
+  <defs>
+    <filter id="shadow" x="-20%" y="-20%" width="140%" height="140%" color-interpolation-filters="sRGB">
+      <feDropShadow dx="0" dy="14" stdDeviation="18" flood-color="#4f2a1a" flood-opacity="0.10"/>
+    </filter>
+    <linearGradient id="accent" x1="0" y1="0" x2="0" y2="1">
+      <stop offset="0" stop-color="#FFB08E"/>
+      <stop offset="1" stop-color="#FF7E4F"/>
+    </linearGradient>
+  </defs>
+
+  <rect x="20" y="24" width="260" height="220" rx="16" fill="#FFFFFF" stroke="#F29A75" stroke-width="1.5" filter="url(#shadow)"/>
+  <rect x="20" y="48" width="5" height="170" rx="2.5" fill="url(#accent)"/>
+  <rect x="350" y="24" width="260" height="220" rx="16" fill="#FFFFFF" stroke="#F29A75" stroke-width="1.5" filter="url(#shadow)"/>
+  <rect x="350" y="48" width="5" height="170" rx="2.5" fill="url(#accent)"/>
+  <rect x="680" y="24" width="260" height="220" rx="16" fill="#FFFFFF" stroke="#F29A75" stroke-width="1.5" filter="url(#shadow)"/>
+  <rect x="680" y="48" width="5" height="170" rx="2.5" fill="url(#accent)"/>
+  <rect x="1010" y="24" width="260" height="220" rx="16" fill="#FFFFFF" stroke="#F29A75" stroke-width="1.5" filter="url(#shadow)"/>
+  <rect x="1010" y="48" width="5" height="170" rx="2.5" fill="url(#accent)"/>
+
+  <path d="M292 134H338" stroke="#FF8A5C" stroke-width="2.5" stroke-linecap="round"/>
+  <path d="M622 134H668" stroke="#FF8A5C" stroke-width="2.5" stroke-linecap="round"/>
+  <path d="M952 134H998" stroke="#FF8A5C" stroke-width="2.5" stroke-linecap="round"/>
+
+  <g stroke="#EF7B4E" stroke-width="3" stroke-linecap="round" stroke-linejoin="round">
+    <rect x="112" y="84" width="26" height="26" rx="4"/>
+    <rect x="138" y="62" width="26" height="26" rx="4"/>
+  </g>
+  <text x="150" y="172" text-anchor="middle" fill="#241712" font-family="Inter, Arial, sans-serif" font-size="28" font-weight="700">Curation</text>
+
+  <g stroke="#EF7B4E" stroke-width="3" stroke-linecap="round" stroke-linejoin="round">
+    <rect x="447" y="72" width="66" height="46" rx="5"/>
+    <path d="M447 87H513M447 103H513M469 72V118M491 72V118"/>
+  </g>
+  <text x="480" y="165" text-anchor="middle" fill="#241712" font-family="Inter, Arial, sans-serif" font-size="28" font-weight="700">Feature</text>
+  <text x="480" y="200" text-anchor="middle" fill="#241712" font-family="Inter, Arial, sans-serif" font-size="28" font-weight="700">Engineering</text>
+
+  <g stroke="#EF7B4E" stroke-width="3" stroke-linecap="round" stroke-linejoin="round">
+    <circle cx="795" cy="90" r="19"/>
+    <path d="M809 104L828 123"/>
+  </g>
+  <text x="810" y="165" text-anchor="middle" fill="#241712" font-family="Inter, Arial, sans-serif" font-size="28" font-weight="700">Search &amp;</text>
+  <text x="810" y="200" text-anchor="middle" fill="#241712" font-family="Inter, Arial, sans-serif" font-size="28" font-weight="700">Retrieval</text>
+
+  <g stroke="#EF7B4E" stroke-width="3" stroke-linecap="round" stroke-linejoin="round">
+    <path d="M1106 124C1126 119 1141 107 1153 84C1160 96 1174 99 1190 92"/>
+    <circle cx="1191" cy="91" r="3" fill="#EF7B4E" stroke="none"/>
+  </g>
+  <text x="1140" y="172" text-anchor="middle" fill="#241712" font-family="Inter, Arial, sans-serif" font-size="28" font-weight="700">Training</text>
+</svg>
diff --git a/docs/training/why-lancedb.mdx b/docs/training/why-lancedb.mdx
new file mode 100644
index 00000000..0adfc81a
--- /dev/null
+++ b/docs/training/why-lancedb.mdx
@@ -0,0 +1,118 @@
+---
+title: "Why LanceDB for Training"
+sidebarTitle: "Why LanceDB for training"
+description: "Use LanceDB as the multimodal data layer for model training, fine-tuning, curation, and feature engineering workflows."
+icon: fire
+---
+
+LanceDB is built for AI teams that need a practical data layer between raw multimodal datasets and model training.
+Instead of moving data through separate systems for curation, feature engineering, search, manifests, and training,
+you can keep the whole workflow attached to one versioned LanceDB table.
+
+That table can hold images, video, audio, text, annotations, metadata, embeddings, tokenized fields, model outputs,
+quality signals, and training-ready tensors. As the dataset evolves, LanceDB lets you add new columns, filter rows,
+pin versions, and read batches without rewriting the original data.
+
+![Training data lifecycle: Curation, Feature Engineering, Search and Retrieval, Training](/static/assets/images/overview/training-data-lifecycle.svg)
+
+LanceDB gives these stages one platform, so curation, feature engineering, retrieval, and training stay connected.
+
+## A connected data lifecycle
+
+Training pipelines usually need more than a pile of files. They need curation, derived features, reproducible splits,
+fast random access, and a clean path into frameworks such as PyTorch. LanceDB keeps these pieces connected through
+the same table model, whether you organize a workflow as one table or several related tables.
+
+<Steps>
+  <Step title="Curate and slice the dataset">
+    Use filters, vector search, full-text search, and retrieval workflows to find the examples that matter: hard negatives,
+    long-tail failure modes, duplicate clusters, low-quality samples, or targeted fine-tuning slices.
+  </Step>
+  <Step title="Engineer features in place">
+    Add embeddings, detections, OCR output, labels, token IDs, hidden states, deduplication flags, or quality scores as
+    new columns. Lance's columnar layout and schema evolution avoid rewriting large raw media columns when you add features.
+  </Step>
+  <Step title="Create reproducible splits">
+    Build filtered splits and materialized views from the table instead of exporting CSV manifests. Data versions and tags
+    make it possible to tie a checkpoint back to the exact rows and features used for training.
+  </Step>
+  <Step title="Load batches for training">
+    Use fast random access and column projection to read only the columns a training step needs. LanceDB tables can be read
+    from local storage or object storage, and integrate with data loading patterns such as PyTorch datasets.
+  </Step>
+</Steps>
+
+## Lance as the foundation
+
+LanceDB is built on [Lance](https://lance.org/), an open-source lakehouse format designed for multimodal AI data.
+The table below highlights the Lance features that LanceDB's multimodal lakehouse builds on.
+
+| Capability | Why it matters for training |
+|---|---|
+| **Multimodal columns** | Store raw bytes, annotations, metadata, embeddings, and features together. |
+| **Fast random access** | Support shuffled and sampled reads without reshuffling the dataset on disk. |
+| **Column projection** | Read only images, tokens, labels, embeddings, or hidden states needed by a given run. |
+| **Schema evolution** | Add new feature columns without rewriting existing media columns. |
+| **Versioning** | Reproduce experiments against the same table snapshot, even as the dataset evolves. |
+| **Search and filtering** | Find and materialize useful training slices directly from the table. |
+
+## Search inside training workflows
+
+Search is not limited to QA systems, agents, or production retrieval apps. It is also a practical way to inspect,
+curate, and improve training data:
+
+- Find visually similar examples when debugging model failures.
+- Retrieve hard negatives or near-duplicates for contrastive training.
+- Combine vector search, full-text search, and metadata filters to build targeted fine-tuning slices.
+- Reuse the same table for both offline curation and production retrieval.
+
+In LanceDB, retrieval and training workflows can operate over the same multimodal tables instead of forcing teams to
+manage separate data systems for each stage.
+
+## Projects using LanceDB for training workflows
+
+<CardGroup cols={1}>
+  <Card
+    title="stable-worldmodel"
+    icon="github"
+    href="https://github.com/galilai-group/stable-worldmodel"
+  >
+    A platform for reproducible world-model research built on a LanceDB data layer, reporting faster data loading on Push-T workloads.
+  </Card>
+  <Card
+    title="le-wm"
+    icon="github"
+    href="https://github.com/lucas-maes/le-wm"
+  >
+    A joint-embedding predictive world model from pixels, trained on the stable-worldmodel platform and its LanceDB data layer.
+  </Card>
+  <Card
+    title="lerobot-lancedb"
+    icon="github"
+    href="https://github.com/lancedb/lerobot-lancedb"
+  >
+    A drop-in LanceDB backend for Hugging Face LeRobot datasets with faster loading across robotics datasets.
+  </Card>
+</CardGroup>
+
+In the world-model ecosystem, [stable-worldmodel](https://github.com/galilai-group/stable-worldmodel) reports
+3-4x faster data loading on Push-T versus HDF5 / MP4 at a fraction of the disk footprint. Across these projects,
+LanceDB and Lance provide the multimodal data layer that keeps raw observations, annotations, features, and training
+access patterns in one format instead of scattering them across task-specific stores.
+
+## Next steps
+
+<CardGroup cols={2}>
+  <Card title="Data loading and shuffles" icon="boxes-stacked" href="/training/">
+    Learn how to use LanceDB permutations to select rows, project columns, split datasets, and shuffle training reads.
+  </Card>
+  <Card title="PyTorch integration" icon="fire" href="/training/torch">
+    Use LanceDB tables and permutations with `torch.utils.data.DataLoader`.
+  </Card>
+  <Card title="Object detection example" icon="car" href="/training/object-detection">
+    Fine-tune an AV perception model on curated failure-mode slices backed by one LanceDB table.
+  </Card>
+  <Card title="VLM fine-tuning example" icon="image" href="/training/vlm-finetuning">
+    Fine-tune a VLM on TextVQA using LanceDB and Geneva to cache expensive training features.
+  </Card>
+</CardGroup>

From 19001a0964ef26b41aaa756e5f419afb79338901 Mon Sep 17 00:00:00 2001
From: prrao87 <35005448+prrao87@users.noreply.github.com>
Date: Mon, 29 Jun 2026 09:03:52 -0400
Subject: [PATCH 2/5] docs: fix overview SVG dark mode rendering

---
 docs/index.mdx                                |  4 +-
 .../assets/images/overview/lancedb-suite.svg  | 45 ++++++++++---------
 .../overview/training-data-lifecycle.svg      | 20 ++++-----
 3 files changed, 36 insertions(+), 33 deletions(-)

diff --git a/docs/index.mdx b/docs/index.mdx
index 78f13201..c9cf4f75 100644
--- a/docs/index.mdx
+++ b/docs/index.mdx
@@ -15,8 +15,8 @@ fragmented stack of storage, feature, retrieval, and training systems.
 
 ## Build better models, faster
 
-Training data pipelines and experimentation slow down when raw data, metadata, embeddings, features, and governance
-artifacts live in separate systems. LanceDB keeps them together in one versioned multimodal table, so teams spend less
+Training data and experimentation slow down when raw data, metadata, embeddings, features, and governance
+artifacts live in separate systems. LanceDB keeps them together in one versioned multimodal table, so AI teams spend less
 time stitching infrastructure together and more time improving datasets, testing features, and keeping GPUs fed.
 
 ![Training data lifecycle: Curation, Feature Engineering, Search and Retrieval, Training](/static/assets/images/overview/training-data-lifecycle.svg)
diff --git a/docs/static/assets/images/overview/lancedb-suite.svg b/docs/static/assets/images/overview/lancedb-suite.svg
index 79805fc3..92a9335b 100644
--- a/docs/static/assets/images/overview/lancedb-suite.svg
+++ b/docs/static/assets/images/overview/lancedb-suite.svg
@@ -28,20 +28,23 @@
     </linearGradient>
   </defs>
 
-  <rect width="2048" height="540" fill="#FAF5F0"/>
+  <rect x="28" y="28" width="464" height="146" rx="8" style="fill:#FFFFFF !important;stroke:#3E3A35 !important;" stroke-width="2" filter="url(#cardShadow)"/>
+  <rect x="28" y="202" width="464" height="146" rx="8" style="fill:#FFFFFF !important;stroke:#3E3A35 !important;" stroke-width="2" filter="url(#cardShadow)"/>
 
-  <rect x="28" y="28" width="464" height="146" rx="8" fill="#3E3A35" filter="url(#cardShadow)"/>
-  <rect x="28" y="202" width="464" height="146" rx="8" fill="#3E3A35" filter="url(#cardShadow)"/>
+  <rect x="512" y="28" width="1506" height="146" rx="8" style="fill:url(#enterpriseGradient) !important;" filter="url(#cardShadow)"/>
+  <rect x="516" y="32" width="1498" height="138" rx="6" style="fill:#FFFFFF !important;"/>
+  <rect x="512" y="202" width="360" height="146" rx="8" style="fill:url(#curationGradient) !important;" filter="url(#cardShadow)"/>
+  <rect x="516" y="206" width="352" height="138" rx="6" style="fill:#FFFFFF !important;"/>
+  <rect x="894" y="202" width="360" height="146" rx="8" style="fill:url(#featureGradient) !important;" filter="url(#cardShadow)"/>
+  <rect x="898" y="206" width="352" height="138" rx="6" style="fill:#FFFFFF !important;"/>
+  <rect x="1276" y="202" width="360" height="146" rx="8" style="fill:url(#searchGradient) !important;" filter="url(#cardShadow)"/>
+  <rect x="1280" y="206" width="352" height="138" rx="6" style="fill:#FFFFFF !important;"/>
+  <rect x="1658" y="202" width="360" height="146" rx="8" style="fill:url(#trainingGradient) !important;" filter="url(#cardShadow)"/>
+  <rect x="1662" y="206" width="352" height="138" rx="6" style="fill:#FFFFFF !important;"/>
 
-  <rect x="512" y="28" width="1506" height="146" rx="8" fill="url(#enterpriseGradient)" filter="url(#cardShadow)"/>
-  <rect x="512" y="202" width="360" height="146" rx="8" fill="url(#curationGradient)" filter="url(#cardShadow)"/>
-  <rect x="894" y="202" width="360" height="146" rx="8" fill="url(#featureGradient)" filter="url(#cardShadow)"/>
-  <rect x="1276" y="202" width="360" height="146" rx="8" fill="url(#searchGradient)" filter="url(#cardShadow)"/>
-  <rect x="1658" y="202" width="360" height="146" rx="8" fill="url(#trainingGradient)" filter="url(#cardShadow)"/>
+  <rect x="28" y="376" width="1990" height="116" rx="8" style="fill:#FFFFFF !important;stroke:#665FFF !important;" stroke-width="3" filter="url(#cardShadow)"/>
 
-  <rect x="28" y="376" width="1990" height="116" rx="8" fill="#FAF5F0" stroke="#665FFF" stroke-width="3" filter="url(#cardShadow)"/>
-
-  <g fill="#FFFFFF">
+  <g style="fill:#241712 !important;">
     <circle cx="78" cy="78" r="9"/>
     <circle cx="94" cy="78" r="9"/>
     <circle cx="110" cy="78" r="9"/>
@@ -54,10 +57,10 @@
     <circle cx="94" cy="126" r="9"/>
     <circle cx="126" cy="126" r="9"/>
   </g>
-  <text x="162" y="117" fill="#FFFFFF" font-family="Inter, Arial, sans-serif" font-size="44" font-weight="760">LanceDB OSS</text>
-  <text x="260" y="291" text-anchor="middle" fill="#FFFFFF" font-family="Inter, Arial, sans-serif" font-size="42" font-weight="500">Search</text>
+  <text x="162" y="117" style="fill:#241712 !important;" font-family="Inter, Arial, sans-serif" font-size="44" font-weight="760">LanceDB OSS</text>
+  <text x="260" y="291" text-anchor="middle" style="fill:#241712 !important;" font-family="Inter, Arial, sans-serif" font-size="42" font-weight="500">Search</text>
 
-  <g fill="#241712">
+  <g style="fill:#241712 !important;">
     <circle cx="748" cy="78" r="9"/>
     <circle cx="764" cy="78" r="9"/>
     <circle cx="780" cy="78" r="9"/>
@@ -70,14 +73,14 @@
     <circle cx="764" cy="126" r="9"/>
     <circle cx="796" cy="126" r="9"/>
   </g>
-  <text x="834" y="117" fill="#241712" font-family="Inter, Arial, sans-serif" font-size="44" font-weight="760">LanceDB Enterprise - Multimodal Lakehouse</text>
+  <text x="834" y="117" style="fill:#241712 !important;" font-family="Inter, Arial, sans-serif" font-size="44" font-weight="760">LanceDB Enterprise - Multimodal Lakehouse</text>
 
-  <text x="692" y="285" text-anchor="middle" fill="#241712" font-family="Inter, Arial, sans-serif" font-size="38" font-weight="500">Curation</text>
-  <text x="1074" y="265" text-anchor="middle" fill="#241712" font-family="Inter, Arial, sans-serif" font-size="38" font-weight="500">Feature</text>
-  <text x="1074" y="309" text-anchor="middle" fill="#241712" font-family="Inter, Arial, sans-serif" font-size="38" font-weight="500">Engineering</text>
-  <text x="1456" y="265" text-anchor="middle" fill="#241712" font-family="Inter, Arial, sans-serif" font-size="38" font-weight="500">Search &amp;</text>
-  <text x="1456" y="309" text-anchor="middle" fill="#241712" font-family="Inter, Arial, sans-serif" font-size="38" font-weight="500">Retrieval</text>
-  <text x="1838" y="285" text-anchor="middle" fill="#241712" font-family="Inter, Arial, sans-serif" font-size="38" font-weight="500">Training</text>
+  <text x="692" y="285" text-anchor="middle" style="fill:#241712 !important;" font-family="Inter, Arial, sans-serif" font-size="38" font-weight="500">Curation</text>
+  <text x="1074" y="265" text-anchor="middle" style="fill:#241712 !important;" font-family="Inter, Arial, sans-serif" font-size="38" font-weight="500">Feature</text>
+  <text x="1074" y="309" text-anchor="middle" style="fill:#241712 !important;" font-family="Inter, Arial, sans-serif" font-size="38" font-weight="500">Engineering</text>
+  <text x="1456" y="265" text-anchor="middle" style="fill:#241712 !important;" font-family="Inter, Arial, sans-serif" font-size="38" font-weight="500">Search &amp;</text>
+  <text x="1456" y="309" text-anchor="middle" style="fill:#241712 !important;" font-family="Inter, Arial, sans-serif" font-size="38" font-weight="500">Retrieval</text>
+  <text x="1838" y="285" text-anchor="middle" style="fill:#241712 !important;" font-family="Inter, Arial, sans-serif" font-size="38" font-weight="500">Training</text>
 
   <rect x="456" y="414" width="42" height="42" rx="2" fill="#665FFF"/>
   <g stroke="#FFFFFF" stroke-width="2.4" stroke-linecap="round" stroke-linejoin="round">
diff --git a/docs/static/assets/images/overview/training-data-lifecycle.svg b/docs/static/assets/images/overview/training-data-lifecycle.svg
index 6127b44a..a533c70e 100644
--- a/docs/static/assets/images/overview/training-data-lifecycle.svg
+++ b/docs/static/assets/images/overview/training-data-lifecycle.svg
@@ -11,13 +11,13 @@
     </linearGradient>
   </defs>
 
-  <rect x="20" y="24" width="260" height="220" rx="16" fill="#FFFFFF" stroke="#F29A75" stroke-width="1.5" filter="url(#shadow)"/>
+  <rect x="20" y="24" width="260" height="220" rx="16" style="fill:#FFFFFF !important;stroke:#F29A75 !important;" stroke-width="1.5" filter="url(#shadow)"/>
   <rect x="20" y="48" width="5" height="170" rx="2.5" fill="url(#accent)"/>
-  <rect x="350" y="24" width="260" height="220" rx="16" fill="#FFFFFF" stroke="#F29A75" stroke-width="1.5" filter="url(#shadow)"/>
+  <rect x="350" y="24" width="260" height="220" rx="16" style="fill:#FFFFFF !important;stroke:#F29A75 !important;" stroke-width="1.5" filter="url(#shadow)"/>
   <rect x="350" y="48" width="5" height="170" rx="2.5" fill="url(#accent)"/>
-  <rect x="680" y="24" width="260" height="220" rx="16" fill="#FFFFFF" stroke="#F29A75" stroke-width="1.5" filter="url(#shadow)"/>
+  <rect x="680" y="24" width="260" height="220" rx="16" style="fill:#FFFFFF !important;stroke:#F29A75 !important;" stroke-width="1.5" filter="url(#shadow)"/>
   <rect x="680" y="48" width="5" height="170" rx="2.5" fill="url(#accent)"/>
-  <rect x="1010" y="24" width="260" height="220" rx="16" fill="#FFFFFF" stroke="#F29A75" stroke-width="1.5" filter="url(#shadow)"/>
+  <rect x="1010" y="24" width="260" height="220" rx="16" style="fill:#FFFFFF !important;stroke:#F29A75 !important;" stroke-width="1.5" filter="url(#shadow)"/>
   <rect x="1010" y="48" width="5" height="170" rx="2.5" fill="url(#accent)"/>
 
   <path d="M292 134H338" stroke="#FF8A5C" stroke-width="2.5" stroke-linecap="round"/>
@@ -28,25 +28,25 @@
     <rect x="112" y="84" width="26" height="26" rx="4"/>
     <rect x="138" y="62" width="26" height="26" rx="4"/>
   </g>
-  <text x="150" y="172" text-anchor="middle" fill="#241712" font-family="Inter, Arial, sans-serif" font-size="28" font-weight="700">Curation</text>
+  <text x="150" y="172" text-anchor="middle" style="fill:#241712 !important;" font-family="Inter, Arial, sans-serif" font-size="28" font-weight="700">Curation</text>
 
   <g stroke="#EF7B4E" stroke-width="3" stroke-linecap="round" stroke-linejoin="round">
     <rect x="447" y="72" width="66" height="46" rx="5"/>
     <path d="M447 87H513M447 103H513M469 72V118M491 72V118"/>
   </g>
-  <text x="480" y="165" text-anchor="middle" fill="#241712" font-family="Inter, Arial, sans-serif" font-size="28" font-weight="700">Feature</text>
-  <text x="480" y="200" text-anchor="middle" fill="#241712" font-family="Inter, Arial, sans-serif" font-size="28" font-weight="700">Engineering</text>
+  <text x="480" y="165" text-anchor="middle" style="fill:#241712 !important;" font-family="Inter, Arial, sans-serif" font-size="28" font-weight="700">Feature</text>
+  <text x="480" y="200" text-anchor="middle" style="fill:#241712 !important;" font-family="Inter, Arial, sans-serif" font-size="28" font-weight="700">Engineering</text>
 
   <g stroke="#EF7B4E" stroke-width="3" stroke-linecap="round" stroke-linejoin="round">
     <circle cx="795" cy="90" r="19"/>
     <path d="M809 104L828 123"/>
   </g>
-  <text x="810" y="165" text-anchor="middle" fill="#241712" font-family="Inter, Arial, sans-serif" font-size="28" font-weight="700">Search &amp;</text>
-  <text x="810" y="200" text-anchor="middle" fill="#241712" font-family="Inter, Arial, sans-serif" font-size="28" font-weight="700">Retrieval</text>
+  <text x="810" y="165" text-anchor="middle" style="fill:#241712 !important;" font-family="Inter, Arial, sans-serif" font-size="28" font-weight="700">Search &amp;</text>
+  <text x="810" y="200" text-anchor="middle" style="fill:#241712 !important;" font-family="Inter, Arial, sans-serif" font-size="28" font-weight="700">Retrieval</text>
 
   <g stroke="#EF7B4E" stroke-width="3" stroke-linecap="round" stroke-linejoin="round">
     <path d="M1106 124C1126 119 1141 107 1153 84C1160 96 1174 99 1190 92"/>
     <circle cx="1191" cy="91" r="3" fill="#EF7B4E" stroke="none"/>
   </g>
-  <text x="1140" y="172" text-anchor="middle" fill="#241712" font-family="Inter, Arial, sans-serif" font-size="28" font-weight="700">Training</text>
+  <text x="1140" y="172" text-anchor="middle" style="fill:#241712 !important;" font-family="Inter, Arial, sans-serif" font-size="28" font-weight="700">Training</text>
 </svg>

From cba407d7b80464e24886dda6140a662bf1e0168e Mon Sep 17 00:00:00 2001
From: prrao87 <35005448+prrao87@users.noreply.github.com>
Date: Mon, 29 Jun 2026 09:15:05 -0400
Subject: [PATCH 3/5] docs: revamp Lance capabilities table

---
 docs/lance.mdx | 30 ++++++++++++++++--------------
 1 file changed, 16 insertions(+), 14 deletions(-)

diff --git a/docs/lance.mdx b/docs/lance.mdx
index 88c26722..e7922563 100644
--- a/docs/lance.mdx
+++ b/docs/lance.mdx
@@ -5,15 +5,15 @@ description: "Open-source lakehouse format for multimodal AI."
 icon: "/static/assets/logo/lance-logo-gray.svg"
 ---
 
-[Lance](https://lance.org/) is an open-source lakehouse format, which provides the
-foundation for LanceDB's capabilities. It provides a file format,
-table format, and catalog spec with multimodal data at the center of its design, allowing developers
+[Lance](https://lance.org/) is an open-source, columnar lakehouse format for multimodal AI.
+It provides a file format, table format, and lightweight catalog spec, allowing developers
 to build a complete open lakehouse on top of object storage.
 
-Building on top of open foundations and optimizing the format for AI workloads brings
-high-performance vector search, full-text search, random access, and feature engineering capabilities
-to a single unified system ([LanceDB](/enterprise)), eliminating the need for bespoke ETL and data pipelines that move data
-to multiple other specialized data systems.
+Building on open foundations and optimizing the format for random access
+(without compromising scan performance) enable
+high-performance vector search, full-text search, indexing, and feature engineering capabilities.
+[LanceDB](/enterprise) builds on these capabilities so teams can work with one multimodal data layer
+instead of moving data across separate storage, search, feature, and training systems.
 
 <Card
   title="Lance format documentation"
@@ -23,15 +23,17 @@ to multiple other specialized data systems.
   Visit the Lance format documentation to learn more about its design, features, and how it enables the multimodal lakehouse.
 </Card>
 
-## Advantages of the Lance format
+## Capabilities of the Lance format
 
-Advantage | Description
+Capability | What it enables
 --- | ---
-Multimodal storage | Efficiently holds vectors, images, videos, audio, text, and more
-Version control | Built-in data versioning for reproducible ML experiments and data lineage
-ML-optimized | Designed for training and inference workloads with fast random access
-Query performance | Columnar storage enables blazing-fast vector search and analytics
-Cloud-native | Seamless integration with cloud object stores (S3, GCS, Azure Blob)
+Multimodal columns | Store images, video, audio, text, embeddings, annotations, metadata, and features in one table.
+Fast random access and scans | Sample, shuffle, and retrieve individual rows efficiently without giving up high-throughput sequential reads.
+Column projection | Read only the raw media, labels, tokens, embeddings, or feature columns needed by a workload.
+Data evolution | Add or backfill embeddings, features, quality signals, and derived columns without rewriting the original dataset.
+Versioned tables | Reproduce experiments, restore previous states, and tie downstream artifacts to the exact table version they used.
+Hybrid search and indexing | Combine vector search, full-text search, and scalar filters on the same dataset with Lance indexes.
+Open lakehouse interoperability | Build on object storage and connect Lance tables to open engines such as PyTorch, Ray, Spark, DuckDB, and Trino.
 
 ## Key concepts
 

From b700eaa1de6802b442442929e40454f11c6d0adf Mon Sep 17 00:00:00 2001
From: prrao87 <35005448+prrao87@users.noreply.github.com>
Date: Mon, 29 Jun 2026 09:26:33 -0400
Subject: [PATCH 4/5] docs: highlight Lance data evolution and blob APIs

---
 docs/lance.mdx | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/docs/lance.mdx b/docs/lance.mdx
index e7922563..224d45c6 100644
--- a/docs/lance.mdx
+++ b/docs/lance.mdx
@@ -27,10 +27,11 @@ instead of moving data across separate storage, search, feature, and training sy
 
 Capability | What it enables
 --- | ---
-Multimodal columns | Store images, video, audio, text, embeddings, annotations, metadata, and features in one table.
+Multimodal storage | Store images, video, audio, text, embeddings, annotations, metadata, features, and more, all in one table.
+First-class blob API | Store large binary objects such as images, video, audio, and model artifacts in blob columns with lazy reads and streaming byte access.
 Fast random access and scans | Sample, shuffle, and retrieve individual rows efficiently without giving up high-throughput sequential reads.
-Column projection | Read only the raw media, labels, tokens, embeddings, or feature columns needed by a workload.
-Data evolution | Add or backfill embeddings, features, quality signals, and derived columns without rewriting the original dataset.
+Flexible data evolution | Add, drop, rename, or alter columns as datasets change, often without rewriting existing data files.
+Feature backfills | Populate new embedding, label, quality signal, or feature columns with SQL expressions, batch UDFs, or merges without rewriting the original dataset.
 Versioned tables | Reproduce experiments, restore previous states, and tie downstream artifacts to the exact table version they used.
 Hybrid search and indexing | Combine vector search, full-text search, and scalar filters on the same dataset with Lance indexes.
 Open lakehouse interoperability | Build on object storage and connect Lance tables to open engines such as PyTorch, Ray, Spark, DuckDB, and Trino.

From e20352a55bc16ce186c06db0ea856b21bdaf6c7c Mon Sep 17 00:00:00 2001
From: prrao87 <35005448+prrao87@users.noreply.github.com>
Date: Mon, 29 Jun 2026 09:34:17 -0400
Subject: [PATCH 5/5] Update Lance page

---
 docs/lance.mdx | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/docs/lance.mdx b/docs/lance.mdx
index 224d45c6..e7a650e5 100644
--- a/docs/lance.mdx
+++ b/docs/lance.mdx
@@ -31,10 +31,9 @@ Multimodal storage | Store images, video, audio, text, embeddings, annotations,
 First-class blob API | Store large binary objects such as images, video, audio, and model artifacts in blob columns with lazy reads and streaming byte access.
 Fast random access and scans | Sample, shuffle, and retrieve individual rows efficiently without giving up high-throughput sequential reads.
 Flexible data evolution | Add, drop, rename, or alter columns as datasets change, often without rewriting existing data files.
-Feature backfills | Populate new embedding, label, quality signal, or feature columns with SQL expressions, batch UDFs, or merges without rewriting the original dataset.
 Versioned tables | Reproduce experiments, restore previous states, and tie downstream artifacts to the exact table version they used.
 Hybrid search and indexing | Combine vector search, full-text search, and scalar filters on the same dataset with Lance indexes.
-Open lakehouse interoperability | Build on object storage and connect Lance tables to open engines such as PyTorch, Ray, Spark, DuckDB, and Trino.
+Open lakehouse interoperability | Build on object storage and connect Lance tables to open engines such as PyTorch, Ray, Spark, Trino, DuckDB, and Polars.
 
 ## Key concepts