Skip to content

perf(memory): enable vector index usage by checking for existing indexes#514

Open
aspotton wants to merge 4 commits intospacedriveapp:mainfrom
aspotton:feat/memory-index-persistence-fix
Open

perf(memory): enable vector index usage by checking for existing indexes#514
aspotton wants to merge 4 commits intospacedriveapp:mainfrom
aspotton:feat/memory-index-persistence-fix

Conversation

@aspotton
Copy link
Copy Markdown

Problem

Spacebot was experiencing extended startup time due to attempting to rebuild vector indexes on every startup, even though index files already existed on disk. This caused:

  • Expensive KMeans training on every startup (IVF model, quantizer training)
  • index_comparisons=0 in query logs (index not being used)
  • High CPU usage during startup due to unnecessary index rebuilding

Root Cause

The code called create_index() unconditionally. LanceDB's create_index() always attempts to rebuild the index from scratch when called, even if an index already exists. The previous error-handling approach was too late - the expensive training had already completed.

Solution

Check for existing indexes using list_indices() BEFORE attempting creation. Only create the index if it doesn't exist. This allows LanceDB to load and use the existing index immediately, enabling efficient vector searches.

Changes

  • src/memory/lance.rs: Replace create_indexes() with ensure_indexes_exist()
    • Uses table.list_indices() to check for existing indexes
    • Only creates vector index if not present on "embedding" column
    • Only creates FTS index if not present on "content" column
    • Adds optimize_indexes() for incremental updates after data insertion
  • src/main.rs: Call ensure_indexes_exist() instead of create_indexes()
  • src/api/agents.rs: Call ensure_indexes_exist() instead of ensure_fts_index()
  • src/tools/memory_save.rs: Call ensure_indexes_exist() instead of ensure_fts_index()

Performance Impact

  • Index usage: Queries now use the HNSW index (index_comparisons > 0)
  • Eliminates unnecessary index rebuilding when index already exists
  • Reduces startup time by skipping expensive training when index exists
  • Note: This fix enables index usage but does not stop the separate Knowledge Synthesis regeneration process

Verification

After deployment, verify:

  1. Startup with existing index: Logs show "Vector index already exists, skipping creation"
  2. Query logs show index_comparisons > 0 (index is being used)
  3. No KMeans training logs appear when index already exists
  4. CPU usage remains normal during startup

Problem
Spacebot was experiencing extended startup time due to attempting to
rebuild vector indexes on every startup, even though index files
already existed on disk. This caused:
- Expensive KMeans training on every startup (IVF model, quantizer)
- index_comparisons=0 in query logs (index not being used)
- High CPU usage during startup due to unnecessary index rebuilding

Root Cause
The code called create_index() unconditionally. LanceDB's create_index()
always attempts to rebuild the index from scratch when called, even if an
index already exists. The previous error-handling approach was too late -
the expensive training had already completed.

Solution
Check for existing indexes using list_indices() BEFORE attempting creation.
Only create the index if it doesn't exist. This allows LanceDB to load and
use the existing index immediately, enabling efficient vector searches.

Changes
- src/memory/lance.rs: Replace create_indexes() with ensure_indexes_exist()
  * Uses table.list_indices() to check for existing indexes
  * Only creates vector index if not present on embedding column
  * Only creates FTS index if not present on content column
  * Adds optimize_indexes() for incremental updates after data insertion
- src/main.rs: Call ensure_indexes_exist() instead of create_indexes()
- src/api/agents.rs: Call ensure_indexes_exist() instead of ensure_fts_index()
- src/tools/memory_save.rs: Call ensure_indexes_exist() instead of ensure_fts_index()

Performance Impact
- Index usage: Queries now use the HNSW index (index_comparisons > 0)
- Eliminates unnecessary index rebuilding when index already exists
- Reduces startup time by skipping expensive training when index exists
- Note: This fix enables index usage but does not stop the separate
  Knowledge Synthesis regeneration process

Verification
After deployment, verify:
1. Startup with existing index: Logs show 'Vector index already exists'
2. Query logs show index_comparisons > 0 (index is being used)
3. No KMeans training logs appear when index already exists
4. CPU usage remains normal during startup
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Mar 30, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 3927c5a5-ab42-4932-9f27-e83f7155f290

📥 Commits

Reviewing files that changed from the base of the PR and between 6267575 and 23b49c0.

📒 Files selected for processing (3)
  • src/api/agents.rs
  • src/main.rs
  • src/memory/lance.rs
✅ Files skipped from review due to trivial changes (2)
  • src/api/agents.rs
  • src/main.rs
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/memory/lance.rs

Walkthrough

Replaces FTS-only index creation with a unified index-ensuring API that creates or skips vector and FTS indexes with process-wide concurrency guards; adds an index optimization method and updates callers to invoke ensure_indexes_exist() and adjust warning messages.

Changes

Cohort / File(s) Summary
Index Method Calls
src/api/agents.rs, src/main.rs, src/tools/memory_save.rs
Replaced calls to ensure_fts_index().await with ensure_indexes_exist().await and updated warning log text from “failed to create FTS index” to “failed to ensure indexes exist”.
Index Implementation
src/memory/lance.rs
Removed create_indexes() and ensure_fts_index(); added ensure_indexes_exist() and optimize_indexes(). Introduced module-level OnceCell guards to serialize index builds across the process, list_indices() checks to skip existing indexes, and switched vector index creation to Index::IvfHnswSq. Minor formatting/control-flow tweaks in open_or_create().

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main change: replacing unconditional index creation with existence checks to enable vector index usage and improve performance.
Description check ✅ Passed The description is comprehensively related to the changeset, clearly explaining the problem, root cause, solution, and verification steps for the index persistence optimization.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/memory/lance.rs`:
- Around line 318-352: The ensure_indexes_exist() flow is racy; add a shared
async mutex on EmbeddingTable (e.g., a field like ensure_indexes_lock:
Arc<tokio::sync::Mutex<()>>) and acquire it at the start of
ensure_indexes_exist() so only one clone performs list_indices() →
create_index() at a time; initialize the Arc in the EmbeddingTable constructor
and ensure it is preserved when the struct is cloned (or implement Clone to
clone the Arc), then await the lock before calling self.table.list_indices() and
hold it until after the index creation logic completes.

In `@src/tools/memory_save.rs`:
- Around line 382-390: After ensure_indexes_exist() returns Ok in the memory
save path, spawn a background task to run the incremental index maintenance so
you don't lose post-insert optimize work; specifically, after calling
self.memory_search.embedding_table().ensure_indexes_exist().await, call
embedding_table().optimize_indexes() (or the incremental optimizer helper)
inside a tokio::spawn so it runs fire-and-forget off the request path, and log
errors from the spawned task (use tracing::warn or similar) to surface failures
without blocking the save response.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 8aa986ff-de2d-42e0-b7ae-028f2fcd7d55

📥 Commits

Reviewing files that changed from the base of the PR and between b7d5dd2 and a08d1b6.

📒 Files selected for processing (4)
  • src/api/agents.rs
  • src/main.rs
  • src/memory/lance.rs
  • src/tools/memory_save.rs

Comment on lines +382 to +390
// Ensure vector and FTS indexes exist (prevents 30-minute rebuild loop)
// Safe to call repeatedly — skips creation if indexes already exist.
if let Err(error) = self
.memory_search
.embedding_table()
.ensure_fts_index()
.ensure_indexes_exist()
.await
{
tracing::warn!(%error, "failed to ensure FTS index after memory save");
tracing::warn!(%error, "failed to ensure indexes after memory save");
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Don’t lose post-insert maintenance on the save path.

After switching this branch to ensure_indexes_exist(), it becomes a pure existence check once the indexes are present, so later writes never reach the new optimize_indexes() flow. Please chain the incremental optimize step from here as well, ideally off the request path.

As per coding guidelines "Use tokio::spawn for fire-and-forget database writes (conversation history saves, memory writes, worker log persistence) so the user gets their response immediately."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/tools/memory_save.rs` around lines 382 - 390, After
ensure_indexes_exist() returns Ok in the memory save path, spawn a background
task to run the incremental index maintenance so you don't lose post-insert
optimize work; specifically, after calling
self.memory_search.embedding_table().ensure_indexes_exist().await, call
embedding_table().optimize_indexes() (or the incremental optimizer helper)
inside a tokio::spawn so it runs fire-and-forget off the request path, and log
errors from the spawned task (use tracing::warn or similar) to surface failures
without blocking the save response.

Copy link
Copy Markdown
Collaborator

@vsumner vsumner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One blocking correctness issue remains in the new index-creation flow: the existence check and create path is still racy across EmbeddingTable clones, so concurrent callers can still trigger duplicate rebuilds.

use lancedb::index::Index;

// Check for existing indexes
let indices = self
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This list_indices() -> create_index() flow is still race-prone across EmbeddingTable clones. Startup, agent creation, and memory-save paths can all hold cloned tables, so concurrent callers can all observe “missing” and kick off the same expensive rebuild anyway. That breaks the main invariant this PR is trying to restore: at most one index build per missing index. Please add a shared single-flight guard on EmbeddingTable and hold it across the full existence-check/create sequence.

…ex creation

The previous fix using list_indices() check was still race-prone when
multiple concurrent callers (startup, agent creation, memory-save paths)
all held cloned EmbeddingTable instances and simultaneously detected
missing indexes, triggering duplicate expensive rebuilds.

This adds two OnceCell guards (one for vector index, one for FTS) that:
- Ensure only ONE index build runs at a time per index type
- Allow concurrent callers to wait for the first to complete
- Share state across all cloned EmbeddingTable instances
- Include a double-check after acquiring the guard to handle external creation

The fix maintains the original goal of preventing unnecessary rebuilds
while eliminating the race condition that could still trigger duplicate builds.
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/memory/lance.rs`:
- Around line 394-400: The log message is misleading because create_index(...,
Index::Auto) will choose an IVF-PQ style index, not HNSW; update the
tracing::info() calls around the create_index call to accurately describe the
index being created (e.g., "Creating vector index (Index::Auto / IVF-PQ)" or
similar), or if HNSW is required replace Index::Auto with the explicit variant
Index::IvfHnswSq in the self.table.create_index(&["embedding"],
Index::Auto).execute().await call and adjust the log to "Creating HNSW vector
index" to match the index type.
- Around line 25-43: The per-instance Arc<OnceCell> guards
(EmbeddingTable::vector_index_guard and ::fts_index_guard) only serialize across
clones of the same instance; to fully serialize index creation across separate
open_or_create() instances, promote these guards to module-level statics (e.g.,
static VECTOR_INDEX_GUARD: OnceCell<()> and static FTS_INDEX_GUARD:
OnceCell<()>) and use those statics inside the get_or_try_init() closures
instead of the instance fields; remove the instance fields from EmbeddingTable
and update all references (including calls in open_or_create(),
get_or_try_init(), list_indices(), and create_index() usage) to use the
module-level guards so index creation is single-flight across the whole process.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 8873cea5-2239-44ee-b3d5-a1011b55a150

📥 Commits

Reviewing files that changed from the base of the PR and between a08d1b6 and 6267575.

📒 Files selected for processing (1)
  • src/memory/lance.rs

…or serialization

Changes:
- Replace Index::Auto with explicit Index::IvfHnswSq for predictable behavior
- Update log messages to accurately describe IVF-HNSW-SQ index type
- Move guards from instance fields to module-level statics for defense-in-depth
- Remove unnecessary Arc<OnceCell> fields from EmbeddingTable struct

Rationale:
- Index::IvfHnswSq provides the best recall/latency trade-off for Spacebot's
  workload (7.6k embeddings, 384 dimensions)
- Module-level static guards ensure serialization across ALL EmbeddingTable
  instances, not just clones of the same instance
- Explicit index type prevents behavior changes across LanceDB versions
- Log messages now accurately describe the actual index being created

This addresses the upstream PR review feedback that identified:
1. Misleading log messages claiming HNSW when using Index::Auto
2. Instance-level guards don't serialize across separate open_or_create() calls
@aspotton
Copy link
Copy Markdown
Author

aspotton commented Apr 1, 2026

@vsumner - Great catch on both issues! I've implemented fixes for each point you raised:

1. ✅ Fixed Misleading Log Messages

Changed the log message from misleading 'Creating HNSW vector index' to accurately describing the actual index type:

tracing::info!("Creating vector index (IVF-HNSW with Scalar Quantization)");

And updated the code to use explicit Index::IvfHnswSq instead of Index::Auto:

.create_index(&["embedding"], Index::IvfHnswSq(
    lancedb::index::vector::IvfHnswSqIndexBuilder::default()
))

Why: Index::Auto creates an IVF-based index, and the previous log was misleading. Using explicit IvfHnswSq ensures predictable behavior across LanceDB versions and matches the documented intent.


2. ✅ Fixed Race Condition with Module-Level Static Guards

Changed from instance-level guards (which only serialize across clones of the same instance) to module-level static guards:

// Before: Guards were fields on EmbeddingTable
pub struct EmbeddingTable {
    table: lancedb::Table,
    vector_index_guard: Arc<OnceCell<()>>,
    fts_index_guard: Arc<OnceCell<()>>,
}

// After: Guards are module-level statics
static VECTOR_INDEX_GUARD: OnceCell<()> = OnceCell::const_new();
static FTS_INDEX_GUARD: OnceCell<()> = OnceCell::const_new();

pub struct EmbeddingTable {
    table: lancedb::Table,
}

Why: Instance-level guards don't prevent race conditions if multiple independent EmbeddingTable instances are created (e.g., by different threads calling open_or_create()). Module-level statics ensure only ONE index build runs across the entire process.


Verification

  • ✅ Build passes with no errors or warnings
  • ✅ Log messages now accurately describe the index type
  • ✅ Guards are truly process-wide
  • ✅ No breaking changes to the API

Let me know if this addresses your concerns!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants