Skip to content

perf(db): sparse store uses O(corpus) brute-force scan — needs inverted postings index #560

@ohdearquant

Description

@ohdearquant

Surfaced by the khive-audit play (UNIFIED_AUDIT_REPORT) and confirmed in code.

Problem

SparseStore (ADR-031) ships the correct contract — SparseVector{indices,values} (CSR), insert_sparse/search_sparse, RRF-fused with dense + BM25 in khive-score. But the SQLite impl search_sparse_vectors (khive-db/src/stores/sparse.rs:~328) is a brute-force scan: it loads EVERY candidate row for the namespace (+ optional kind), deserializes indices_json + values_blob, scores all of them, then truncates to top_k. That's O(corpus) per query — the opposite of what sparse retrieval is for.

Fix

Replace the row scan with an inverted postings table: term(dimension index) → posting list of (subject_id, weight). At query time, touch only the postings for the query vector's non-zero indices, accumulate scores via a merge/accumulator, top-k. This is the standard SPLADE/sparse-retrieval structure and turns query cost from O(corpus) into O(sum of posting-list lengths for query terms).

Alternative (if learned-sparse isn't a near-term priority): retire this path and rely on BM25 + dense ANN fusion, and mark ADR-031 deferred. Decide explicitly rather than leave an O(N) scan wired into hybrid search.

Acceptance

  • Sparse search no longer scans all namespace rows; query cost scales with query nnz × posting lengths.
  • p95 sparse-search benchmark on a representative corpus (add it — none exists today).
  • RRF fusion parity preserved (same ranking as the scan for a fixed input).
  • If retired instead: ADR-031 marked deferred + hybrid path updated to skip sparse.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions