
fix(store): use vec0 partition columns directly in search queries #384

Open
tkstang wants to merge 1 commit into arabold:main from voxmedia:fix/vec-partition-search


tkstang commented Apr 7, 2026

Problem

We use this tool to index both documentation and code repositories, so our database holds a large number of vectors spread across many libraries. The hybrid search query in findByContent() filters documents_vec by library/version through 4 JOINs:

FROM documents_vec dv
  JOIN documents d ON dv.rowid = d.id
  JOIN pages p ON d.page_id = p.id
  JOIN versions v ON p.version_id = v.id
  JOIN libraries l ON v.library_id = l.id
WHERE l.name = ?
  AND COALESCE(v.name, '') = COALESCE(?, '')
  AND dv.embedding MATCH ?

sqlite-vec's vec0 virtual table supports partition pruning via library_id and version_id columns, but only when they appear as direct equality constraints in the WHERE clause on the virtual table itself. The query planner cannot see through JOINs to apply partition pruning, so every search does a brute-force scan of all vectors regardless of which library was requested.

At small scale this is unnoticeable. At 451K vectors (a 45GB database across ~30 indexed repositories), search queries hang: the full brute-force scan runs past the 60s timeout.
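
To make the difference in scan scope concrete, here is a toy in-memory model. It is a sketch under stated assumptions, not a description of sqlite-vec's internals: the class ToyPartitionedIndex and its method names are hypothetical, and the "through JOIN" path is simplified to "compute every distance, then filter". It only shows that both paths can return the same nearest match while doing very different amounts of distance work.

```typescript
// Toy model only: names are hypothetical and this is not sqlite-vec's storage layout.
interface Entry {
  libraryId: number;
  versionId: number;
  vector: number[];
}

class ToyPartitionedIndex {
  // Counts distance computations so the two search paths can be compared.
  distanceCalls = 0;
  private all: Entry[] = [];
  private buckets: Record<string, Entry[]> = {};

  add(entry: Entry): void {
    this.all.push(entry);
    const key = `${entry.libraryId}:${entry.versionId}`;
    const bucket = this.buckets[key] ?? [];
    bucket.push(entry);
    this.buckets[key] = bucket;
  }

  // Euclidean distance; every call is counted.
  private distance(a: number[], b: number[]): number {
    this.distanceCalls += 1;
    let sum = 0;
    a.forEach((x, i) => {
      const d = x - (b[i] ?? 0);
      sum += d * d;
    });
    return Math.sqrt(sum);
  }

  // Filter is invisible to the index: every vector is compared, then filtered.
  searchThroughJoin(libraryId: number, versionId: number, query: number[]): number {
    let best = Infinity;
    for (const e of this.all) {
      const d = this.distance(query, e.vector);
      if (e.libraryId === libraryId && e.versionId === versionId) {
        best = Math.min(best, d);
      }
    }
    return best;
  }

  // Filter is a direct constraint: only the matching partition is compared.
  searchPartition(libraryId: number, versionId: number, query: number[]): number {
    let best = Infinity;
    for (const e of this.buckets[`${libraryId}:${versionId}`] ?? []) {
      best = Math.min(best, this.distance(query, e.vector));
    }
    return best;
  }
}
```

In this model the pruned search performs one distance computation per vector in the requested partition, while the unpruned search performs one per vector in the whole index, which is the gap the numbers above describe.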

Fix

Resolve library/version names to their IDs upfront using the existing getVersionId prepared statement (read-only, unlike resolveVersionId, which inserts), then pass library_id and version_id directly to the vec0 WHERE clause:

FROM documents_vec dv
WHERE dv.library_id = ?
  AND dv.version_id = ?
  AND dv.embedding MATCH ?

This enables sqlite-vec partition pruning — only vectors in the requested library's partition are scanned (typically 5-15K vectors instead of 451K).
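
The resolve-then-query flow described above could be sketched roughly as follows. This is an illustration, not the code in this PR: VersionLookup stands in for the getVersionId prepared statement, and buildVectorQuery, the selected columns, and the `k = ?` constraint are assumptions about how the surrounding query might look.

```typescript
// Hypothetical shapes for illustration; the real DocumentStore types may differ.
interface VersionIds {
  libraryId: number;
  versionId: number;
}

// Stand-in for the read-only getVersionId lookup. It returns undefined when
// the library/version pair is unknown and never inserts anything.
type VersionLookup = (library: string, version: string) => VersionIds | undefined;

interface VectorQuery {
  sql: string;
  params: (number | string)[];
}

// Builds a vec0 query whose partition columns are direct equality constraints.
function buildVectorQuery(
  lookup: VersionLookup,
  library: string,
  version: string | null,
  embeddingJson: string,
  k: number,
): VectorQuery | null {
  // A missing version is treated as '', mirroring COALESCE(?, '') in the old query.
  const ids = lookup(library, version ?? "");
  if (!ids) {
    return null; // unknown library/version: nothing to search
  }
  const sql = [
    "SELECT dv.rowid AS id, dv.distance", // assumed projection
    "FROM documents_vec dv",
    "WHERE dv.library_id = ?",
    "  AND dv.version_id = ?",
    "  AND dv.embedding MATCH ?",
    "  AND k = ?", // assumed KNN limit; the real query may use LIMIT instead
  ].join("\n");
  return { sql, params: [ids.libraryId, ids.versionId, embeddingJson, k] };
}
```

Returning null for an unknown library/version lets the caller return an empty result set without touching the vector table at all, which is one reasonable way to handle the case where the read-only lookup finds nothing.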

The same optimization is applied to the FTS CTEs for consistency, replacing JOINs through versions/libraries with a direct p.version_id = ? filter.
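
For the FTS side, a rough sketch of what the simplified CTE might look like is below. The FTS table name documents_fts, the bm25 scoring, and the helper buildFtsCte are all assumptions made for illustration; only the documents/pages join path and the p.version_id = ? filter come from the description above.

```typescript
interface FtsQuery {
  sql: string;
  params: (number | string)[];
}

// Builds an FTS CTE that filters on an already-resolved version ID instead of
// joining through versions and libraries. Table and column names other than
// documents, pages, and p.version_id are hypothetical.
function buildFtsCte(versionId: number, matchQuery: string): FtsQuery {
  const sql = [
    "WITH fts AS (",
    "  SELECT d.id, bm25(documents_fts) AS score", // assumed FTS5 table and ranking
    "  FROM documents_fts",
    "  JOIN documents d ON documents_fts.rowid = d.id",
    "  JOIN pages p ON d.page_id = p.id",
    "  WHERE p.version_id = ?",
    "    AND documents_fts MATCH ?",
    ")",
    "SELECT id, score FROM fts",
  ].join("\n");
  return { sql, params: [versionId, matchQuery] };
}
```

Because the version ID is resolved once before the search, the same value can be bound in both the vector query and the FTS CTE.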

Impact

  • Search goes from hanging (60s+ timeout) to sub-second on large databases
  • No change in search results — same vectors, same ranking, just scanned efficiently
  • FTS-only fallback also benefits from fewer JOINs
  • All 57 existing tests pass

Test plan

  • npx vitest run src/store/DocumentStore.test.ts — 57 tests pass

The hybrid search query filtered documents_vec by library/version through
4 JOINs (documents → pages → versions → libraries → WHERE l.name = ?).
sqlite-vec cannot see through JOINs to apply partition pruning, so every
search did a brute-force scan of ALL vectors in the table regardless of
which library was requested.

Fix: resolve library/version names to their IDs upfront using the
existing getVersionId prepared statement, then pass library_id and
version_id directly as WHERE constraints on the vec0 virtual table.
This enables sqlite-vec partition pruning — only vectors belonging to
the requested library are scanned.

On a 451K-vector database (45GB), this reduces the search scope from
the entire table to just the requested library's partition (typically
5-15K vectors), avoiding a multi-second brute-force scan that caused
search timeouts in production.

The same optimization is applied to the FTS CTEs for consistency,
removing unnecessary JOINs through the libraries table.