fix(store): use vec0 partition columns directly in search queries #384
Open
tkstang wants to merge 1 commit into arabold:main
The hybrid search query filtered documents_vec by library/version through 4 JOINs (documents → pages → versions → libraries → WHERE l.name = ?). sqlite-vec cannot see through JOINs to apply partition pruning, so every search did a brute-force scan of ALL vectors in the table regardless of which library was requested.

Fix: resolve library/version names to their IDs upfront using the existing getVersionId prepared statement, then pass library_id and version_id directly as WHERE constraints on the vec0 virtual table. This enables sqlite-vec partition pruning: only vectors belonging to the requested library are scanned.

On a 451K-vector database (45GB), this reduces the search scope from the entire table to just the requested library's partition (typically 5-15K vectors), avoiding a multi-second brute-force scan that caused search timeouts in production.

The same optimization is applied to the FTS CTEs for consistency, removing unnecessary JOINs through the libraries table.
Problem
We're using this tool to index both documentation and code repositories, so we have a large data set and a large set of libraries. The hybrid search query in `findByContent()` filters `documents_vec` by library/version through 4 JOINs (documents → pages → versions → libraries → WHERE l.name = ?).

sqlite-vec's `vec0` virtual table supports partition pruning via the `library_id` and `version_id` columns, but only when they appear as direct equality constraints in the WHERE clause on the virtual table itself. The query planner cannot see through JOINs to apply partition pruning, so every search does a brute-force scan of all vectors regardless of which library was requested.

At small scale this is unnoticeable. At 451K vectors (45GB database across ~30 indexed repositories), search queries hang indefinitely: the full brute-force scan takes longer than any reasonable timeout.
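To illustrate why the placement of the constraint matters, here is a toy in-memory model of partition pruning. It is not sqlite-vec's implementation; `PartitionedIndex`, `StoredVector`, and the `libraryId:versionId` bucket key are hypothetical names invented for this sketch. It only counts how many vectors a brute-force KNN scan has to touch with and without a partition constraint.

```typescript
// Toy model only: NOT how sqlite-vec is implemented internally.
// All names here are hypothetical and exist only for illustration.
interface StoredVector {
  rowid: number;
  libraryId: number;
  versionId: number;
  embedding: number[];
}

interface ScanResult {
  scanned: number; // number of distance computations performed
  rowids: number[]; // nearest rowids, closest first
}

function squaredDistance(a: number[], b: number[]): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    const d = (a[i] ?? 0) - (b[i] ?? 0);
    sum += d * d;
  }
  return sum;
}

class PartitionedIndex {
  // Vectors are grouped into buckets keyed by "libraryId:versionId".
  private partitions = new Map<string, StoredVector[]>();

  add(v: StoredVector): void {
    const key = `${v.libraryId}:${v.versionId}`;
    const bucket = this.partitions.get(key) ?? [];
    bucket.push(v);
    this.partitions.set(key, bucket);
  }

  // With a partition constraint only one bucket is scanned;
  // without it, every bucket is scanned.
  knn(
    query: number[],
    k: number,
    partition?: { libraryId: number; versionId: number },
  ): ScanResult {
    const buckets = partition
      ? [this.partitions.get(`${partition.libraryId}:${partition.versionId}`) ?? []]
      : Array.from(this.partitions.values());
    const scored: { rowid: number; dist: number }[] = [];
    for (const bucket of buckets) {
      for (const v of bucket) {
        scored.push({ rowid: v.rowid, dist: squaredDistance(query, v.embedding) });
      }
    }
    scored.sort((a, b) => a.dist - b.dist);
    return { scanned: scored.length, rowids: scored.slice(0, k).map((s) => s.rowid) };
  }
}

// Three libraries with 100 vectors each (synthetic data).
const index = new PartitionedIndex();
let nextRowid = 0;
for (let libraryId = 1; libraryId <= 3; libraryId++) {
  for (let i = 0; i < 100; i++) {
    index.add({ rowid: nextRowid++, libraryId, versionId: 1, embedding: [libraryId, i] });
  }
}

const unfiltered = index.knn([2, 0], 5); // touches all 300 vectors
const pruned = index.knn([2, 0], 5, { libraryId: 2, versionId: 1 }); // touches 100
```

In this model the filtered query does a third of the work. In the scenario described above the ratio is far larger, since a single library holds only a small fraction of the 451K vectors.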
Fix
Resolve library/version names to their IDs upfront using the existing `getVersionId` prepared statement (read-only, not `resolveVersionId`, which inserts), then pass `library_id` and `version_id` directly to the vec0 WHERE clause.

This enables sqlite-vec partition pruning: only vectors in the requested library's partition are scanned (typically 5-15K vectors instead of 451K).
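The resolve-then-filter flow described above could look roughly like the following sketch. It is not the code from this PR: `VersionLookup`, `buildVectorSearch`, and the exact SQL text are assumptions made for illustration, and the lookup function stands in for the real `getVersionId` prepared statement. The `embedding MATCH ? AND k = ?` form is assumed from sqlite-vec's KNN query syntax; the `embedding` column name is also an assumption.

```typescript
// Hypothetical shapes; the real store resolves IDs via a prepared statement.
interface VersionIds {
  libraryId: number;
  versionId: number;
}

// Read-only lookup: returns undefined when the library/version pair is unknown.
type VersionLookup = (library: string, version: string) => VersionIds | undefined;

interface VectorSearch {
  sql: string;
  params: (number | string)[];
}

function buildVectorSearch(
  lookup: VersionLookup,
  library: string,
  version: string,
  embeddingJson: string,
  limit: number,
): VectorSearch | null {
  // Step 1: resolve names to IDs before touching the vector table.
  const ids = lookup(library, version);
  if (!ids) return null; // unknown library/version: nothing to search

  // Step 2: constrain the vec0 table directly, with no JOIN in the filter path.
  // Column names other than library_id/version_id are assumptions.
  const sql = `
    SELECT rowid, distance
    FROM documents_vec
    WHERE embedding MATCH ?
      AND k = ?
      AND library_id = ?
      AND version_id = ?`;
  return { sql, params: [embeddingJson, limit, ids.libraryId, ids.versionId] };
}

// Usage with a fake in-memory lookup (illustrative data only).
const known = new Map<string, VersionIds>([
  ["react@18.2.0", { libraryId: 7, versionId: 42 }],
]);
const fakeLookup: VersionLookup = (library, version) =>
  known.get(`${library}@${version}`);

const found = buildVectorSearch(fakeLookup, "react", "18.2.0", "[0.1,0.2]", 10);
const missing = buildVectorSearch(fakeLookup, "unknown", "1.0.0", "[0.1,0.2]", 10);
```

Returning `null` for an unknown pair is one possible design choice: it lets the caller skip the vector query entirely instead of running a search that cannot match anything.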
The same optimization is applied to the FTS CTEs for consistency, replacing JOINs through `versions`/`libraries` with a direct `p.version_id = ?` filter.

Impact
Test plan
`npx vitest run src/store/DocumentStore.test.ts`: 57 tests pass