perf(memory): enable vector index usage by checking for existing indexes#514
perf(memory): enable vector index usage by checking for existing indexes#514aspotton wants to merge 4 commits intospacedriveapp:mainfrom
Conversation
Problem Spacebot was experiencing extended startup time due to attempting to rebuild vector indexes on every startup, even though index files already existed on disk. This caused: - Expensive KMeans training on every startup (IVF model, quantizer) - index_comparisons=0 in query logs (index not being used) - High CPU usage during startup due to unnecessary index rebuilding Root Cause The code called create_index() unconditionally. LanceDB's create_index() always attempts to rebuild the index from scratch when called, even if an index already exists. The previous error-handling approach was too late - the expensive training had already completed. Solution Check for existing indexes using list_indices() BEFORE attempting creation. Only create the index if it doesn't exist. This allows LanceDB to load and use the existing index immediately, enabling efficient vector searches. Changes - src/memory/lance.rs: Replace create_indexes() with ensure_indexes_exist() * Uses table.list_indices() to check for existing indexes * Only creates vector index if not present on embedding column * Only creates FTS index if not present on content column * Adds optimize_indexes() for incremental updates after data insertion - src/main.rs: Call ensure_indexes_exist() instead of create_indexes() - src/api/agents.rs: Call ensure_indexes_exist() instead of ensure_fts_index() - src/tools/memory_save.rs: Call ensure_indexes_exist() instead of ensure_fts_index() Performance Impact - Index usage: Queries now use the HNSW index (index_comparisons > 0) - Eliminates unnecessary index rebuilding when index already exists - Reduces startup time by skipping expensive training when index exists - Note: This fix enables index usage but does not stop the separate Knowledge Synthesis regeneration process Verification After deployment, verify: 1. Startup with existing index: Logs show 'Vector index already exists' 2. Query logs show index_comparisons > 0 (index is being used) 3. No KMeans training logs appear when index already exists 4. CPU usage remains normal during startup
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (3)
✅ Files skipped from review due to trivial changes (2)
🚧 Files skipped from review as they are similar to previous changes (1)
WalkthroughReplaces FTS-only index creation with a unified index-ensuring API that creates or skips vector and FTS indexes with process-wide concurrency guards; adds an index optimization method and updates callers to invoke Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes 🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/memory/lance.rs`:
- Around line 318-352: The ensure_indexes_exist() flow is racy; add a shared
async mutex on EmbeddingTable (e.g., a field like ensure_indexes_lock:
Arc<tokio::sync::Mutex<()>>) and acquire it at the start of
ensure_indexes_exist() so only one clone performs list_indices() →
create_index() at a time; initialize the Arc in the EmbeddingTable constructor
and ensure it is preserved when the struct is cloned (or implement Clone to
clone the Arc), then await the lock before calling self.table.list_indices() and
hold it until after the index creation logic completes.
In `@src/tools/memory_save.rs`:
- Around line 382-390: After ensure_indexes_exist() returns Ok in the memory
save path, spawn a background task to run the incremental index maintenance so
you don't lose post-insert optimize work; specifically, after calling
self.memory_search.embedding_table().ensure_indexes_exist().await, call
embedding_table().optimize_indexes() (or the incremental optimizer helper)
inside a tokio::spawn so it runs fire-and-forget off the request path, and log
errors from the spawned task (use tracing::warn or similar) to surface failures
without blocking the save response.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 8aa986ff-de2d-42e0-b7ae-028f2fcd7d55
📒 Files selected for processing (4)
src/api/agents.rssrc/main.rssrc/memory/lance.rssrc/tools/memory_save.rs
| // Ensure vector and FTS indexes exist (prevents 30-minute rebuild loop) | ||
| // Safe to call repeatedly — skips creation if indexes already exist. | ||
| if let Err(error) = self | ||
| .memory_search | ||
| .embedding_table() | ||
| .ensure_fts_index() | ||
| .ensure_indexes_exist() | ||
| .await | ||
| { | ||
| tracing::warn!(%error, "failed to ensure FTS index after memory save"); | ||
| tracing::warn!(%error, "failed to ensure indexes after memory save"); |
There was a problem hiding this comment.
Don’t lose post-insert maintenance on the save path.
After switching this branch to ensure_indexes_exist(), it becomes a pure existence check once the indexes are present, so later writes never reach the new optimize_indexes() flow. Please chain the incremental optimize step from here as well, ideally off the request path.
As per coding guidelines "Use tokio::spawn for fire-and-forget database writes (conversation history saves, memory writes, worker log persistence) so the user gets their response immediately."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/tools/memory_save.rs` around lines 382 - 390, After
ensure_indexes_exist() returns Ok in the memory save path, spawn a background
task to run the incremental index maintenance so you don't lose post-insert
optimize work; specifically, after calling
self.memory_search.embedding_table().ensure_indexes_exist().await, call
embedding_table().optimize_indexes() (or the incremental optimizer helper)
inside a tokio::spawn so it runs fire-and-forget off the request path, and log
errors from the spawned task (use tracing::warn or similar) to surface failures
without blocking the save response.
vsumner
left a comment
There was a problem hiding this comment.
One blocking correctness issue remains in the new index-creation flow: the existence check and create path is still racy across EmbeddingTable clones, so concurrent callers can still trigger duplicate rebuilds.
src/memory/lance.rs
Outdated
| use lancedb::index::Index; | ||
|
|
||
| // Check for existing indexes | ||
| let indices = self |
There was a problem hiding this comment.
This list_indices() -> create_index() flow is still race-prone across EmbeddingTable clones. Startup, agent creation, and memory-save paths can all hold cloned tables, so concurrent callers can all observe “missing” and kick off the same expensive rebuild anyway. That breaks the main invariant this PR is trying to restore: at most one index build per missing index. Please add a shared single-flight guard on EmbeddingTable and hold it across the full existence-check/create sequence.
…ex creation The previous fix using list_indices() check was still race-prone when multiple concurrent callers (startup, agent creation, memory-save paths) all held cloned EmbeddingTable instances and simultaneously detected missing indexes, triggering duplicate expensive rebuilds. This adds two OnceCell guards (one for vector index, one for FTS) that: - Ensure only ONE index build runs at a time per index type - Allow concurrent callers to wait for the first to complete - Share state across all cloned EmbeddingTable instances - Include a double-check after acquiring the guard to handle external creation The fix maintains the original goal of preventing unnecessary rebuilds while eliminating the race condition that could still trigger duplicate builds.
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/memory/lance.rs`:
- Around line 394-400: The log message is misleading because create_index(...,
Index::Auto) will choose an IVF-PQ style index, not HNSW; update the
tracing::info() calls around the create_index call to accurately describe the
index being created (e.g., "Creating vector index (Index::Auto / IVF-PQ)" or
similar), or if HNSW is required replace Index::Auto with the explicit variant
Index::IvfHnswSq in the self.table.create_index(&["embedding"],
Index::Auto).execute().await call and adjust the log to "Creating HNSW vector
index" to match the index type.
- Around line 25-43: The per-instance Arc<OnceCell> guards
(EmbeddingTable::vector_index_guard and ::fts_index_guard) only serialize across
clones of the same instance; to fully serialize index creation across separate
open_or_create() instances, promote these guards to module-level statics (e.g.,
static VECTOR_INDEX_GUARD: OnceCell<()> and static FTS_INDEX_GUARD:
OnceCell<()>) and use those statics inside the get_or_try_init() closures
instead of the instance fields; remove the instance fields from EmbeddingTable
and update all references (including calls in open_or_create(),
get_or_try_init(), list_indices(), and create_index() usage) to use the
module-level guards so index creation is single-flight across the whole process.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 8873cea5-2239-44ee-b3d5-a1011b55a150
📒 Files selected for processing (1)
src/memory/lance.rs
…or serialization Changes: - Replace Index::Auto with explicit Index::IvfHnswSq for predictable behavior - Update log messages to accurately describe IVF-HNSW-SQ index type - Move guards from instance fields to module-level statics for defense-in-depth - Remove unnecessary Arc<OnceCell> fields from EmbeddingTable struct Rationale: - Index::IvfHnswSq provides the best recall/latency trade-off for Spacebot's workload (7.6k embeddings, 384 dimensions) - Module-level static guards ensure serialization across ALL EmbeddingTable instances, not just clones of the same instance - Explicit index type prevents behavior changes across LanceDB versions - Log messages now accurately describe the actual index being created This addresses the upstream PR review feedback that identified: 1. Misleading log messages claiming HNSW when using Index::Auto 2. Instance-level guards don't serialize across separate open_or_create() calls
|
@vsumner - Great catch on both issues! I've implemented fixes for each point you raised: 1. ✅ Fixed Misleading Log MessagesChanged the log message from misleading 'Creating HNSW vector index' to accurately describing the actual index type: tracing::info!("Creating vector index (IVF-HNSW with Scalar Quantization)");And updated the code to use explicit .create_index(&["embedding"], Index::IvfHnswSq(
lancedb::index::vector::IvfHnswSqIndexBuilder::default()
))Why: 2. ✅ Fixed Race Condition with Module-Level Static GuardsChanged from instance-level guards (which only serialize across clones of the same instance) to module-level static guards: // Before: Guards were fields on EmbeddingTable
pub struct EmbeddingTable {
table: lancedb::Table,
vector_index_guard: Arc<OnceCell<()>>,
fts_index_guard: Arc<OnceCell<()>>,
}
// After: Guards are module-level statics
static VECTOR_INDEX_GUARD: OnceCell<()> = OnceCell::const_new();
static FTS_INDEX_GUARD: OnceCell<()> = OnceCell::const_new();
pub struct EmbeddingTable {
table: lancedb::Table,
}Why: Instance-level guards don't prevent race conditions if multiple independent Verification
Let me know if this addresses your concerns! |
Problem
Spacebot was experiencing extended startup time due to attempting to rebuild vector indexes on every startup, even though index files already existed on disk. This caused:
index_comparisons=0in query logs (index not being used)Root Cause
The code called
create_index()unconditionally. LanceDB'screate_index()always attempts to rebuild the index from scratch when called, even if an index already exists. The previous error-handling approach was too late - the expensive training had already completed.Solution
Check for existing indexes using
list_indices()BEFORE attempting creation. Only create the index if it doesn't exist. This allows LanceDB to load and use the existing index immediately, enabling efficient vector searches.Changes
src/memory/lance.rs: Replacecreate_indexes()withensure_indexes_exist()table.list_indices()to check for existing indexesoptimize_indexes()for incremental updates after data insertionsrc/main.rs: Callensure_indexes_exist()instead ofcreate_indexes()src/api/agents.rs: Callensure_indexes_exist()instead ofensure_fts_index()src/tools/memory_save.rs: Callensure_indexes_exist()instead ofensure_fts_index()Performance Impact
index_comparisons > 0)Verification
After deployment, verify:
index_comparisons > 0(index is being used)