
ruvector-router-core HNSW: search returns 0 results from inserted vectors at scale (recall@1 << 1.0) #430

@carheart

Description


Summary

ruvector-router-core HNSW (crates/ruvector-router-core/src/index.rs + vector_db.rs) silently loses vectors at scale on three orthogonal axes:

  • Recall@1 collapses once the inserted set size crosses ~1k–2k under a biased insertion order: searching with a query equal to a previously inserted vector returns 0 hits or unrelated hits. The vector is present in vectors: HashMap<String, Vec<f32>>, but the graph can't reach it.
  • Filtered search returns zero rows for small buckets even at modest scale. With 14k vectors, Search(k=10, filter={node_type: T}) for any T whose bucket is <5 % of the corpus reliably returns 0: the index post-filters on top-K after HNSW selection, and the top-10 of 14k is dominated by whichever bucket has the most vectors.
  • Caller-driven k is silently capped at ef_search. Setting query.k = 500 with default ef_search = 200 returns at most 200 results — the inner search visits only ef_search nodes regardless of k.

The recall collapse is caused by three independent correctness bugs in index.rs plus a lock-order issue in the insert path (a 3-cycle deadlock under concurrent inserts that only surfaces once the third bug is fixed). The filtered-search and ef-vs-k issues are separate design gaps in the search API.

The existing test suite (test_hnsw_insert_and_search, test_hnsw_multiple_inserts_no_deadlock, test_hnsw_concurrent_inserts) covers ≤40 vectors — small enough that the graph-construction bugs don't bite. None of the existing tests exercise filtered search, recall@1, or k > ef_search. Adding a recall@1 test at scale and a small-bucket filter test (proposed below) makes all three regressions visible.

Reproduction

Add this test to crates/ruvector-router-core/src/index.rs under mod tests:

#[test]
fn test_recall_at_1_with_biased_insertion_order() {
    let dimensions = 64;
    let config = HnswConfig {
        m: 16, ef_construction: 200, ef_search: 200,
        metric: DistanceMetric::Cosine, dimensions,
    };
    let index = HnswIndex::new(config);

    fn random_vec(seed: u64, dim: usize) -> Vec<f32> {
        let mut s = seed.wrapping_add(0x9E37_79B9_7F4A_7C15);
        (0..dim).map(|_| {
            s = s.wrapping_add(0x9E37_79B9_7F4A_7C15);
            let mut z = s;
            z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
            z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
            z ^= z >> 31;
            ((z >> 40) as f32 / (1u64 << 23) as f32) - 1.0
        }).collect()
    }

    // Biased insertion: tiny early bucket sets the entry point, then a
    // moderate bucket, then a large late bucket — exactly the shape that
    // surfaces the bug.
    let buckets: &[(&str, usize)] = &[("a", 1), ("b", 80), ("c", 1500)];
    let mut all_ids = vec![]; let mut all_vecs = vec![]; let mut i: u64 = 0;
    for (lab, n) in buckets {
        for _ in 0..*n {
            let id = format!("{}_{}", lab, i);
            let v = random_vec(i, dimensions);
            index.insert(id.clone(), v.clone()).unwrap();
            all_ids.push(id); all_vecs.push(v); i += 1;
        }
    }

    // Recall@1: querying with a vector exactly equal to an inserted one
    // must return that vector as top-1.
    let probes = 30; let mut misses = 0;
    for k in 0..probes {
        let idx = (k * (all_ids.len() - 1)) / (probes - 1);
        let q = SearchQuery {
            vector: all_vecs[idx].clone(), k: 1,
            filters: None, threshold: None, ef_search: None,
        };
        let r = index.search(&q).unwrap();
        if r.first().map(|x| &x.id) != Some(&all_ids[idx]) { misses += 1; }
    }
    assert_eq!(misses, 0, "recall@1 regression: {}/{} probes missed", misses, probes);
}

Expected on main: most probes from the late "c" bucket fail. Expected after the fix: 0 misses across all probes. Runs in <5s.

Root cause — three bugs in index.rs, plus three smaller issues

Bug A: result BinaryHeap orientation is inverted

Neighbor's Ord is reversed for min-heap behaviour:

impl Ord for Neighbor {
    fn cmp(&self, other: &Self) -> Ordering {
        // Reverse ordering for min-heap behavior
        other.distance.partial_cmp(&self.distance).unwrap_or(Ordering::Equal)
    }
}

This is correct for the candidates heap (peek = closest unvisited). It is wrong for the result heap. Standard HNSW search needs result to be a max-heap on distance so peek() returns the furthest current result — that's what gets evicted when a closer neighbour arrives. With the reversed Ord, both heaps are min-heaps and result.peek() returns the closest. Two consequences inside search_knn_internal:

if let Some(furthest) = result.peek() {
    if current.distance > furthest.distance && result.len() >= ef {
        break;
    }
}

With the inverted Ord, furthest is actually the closest kept result. The condition becomes "is the next candidate worse than my best result?", which is true almost immediately, so as soon as result.len() >= ef the search bails out. It visits a tiny region around the entry point and returns.

} else if let Some(worst) = result.peek() {
    if dist < worst.distance {
        result.pop();
        result.push(neighbor);
    }
}

worst is likewise actually the closest. The branch evicts the current best result whenever an even closer neighbour arrives, and once the heap is full it never admits a candidate that is merely better than the true furthest; the final set is a churned beam, not the top-ef.
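For reference, a self-contained sketch of the intended heap orientation, assuming Neighbor is switched to natural ascending Ord on distance as proposed in the fix below; the type here is an illustrative stand-in, not the crate's definition:

use std::cmp::{Ordering, Reverse};
use std::collections::BinaryHeap;

// Illustrative stand-in for Neighbor with natural ascending Ord on distance
// (id tie-break), i.e. the ordering proposed in fix item 1 below.
#[derive(PartialEq)]
struct Neighbor { id: String, distance: f32 }
impl Eq for Neighbor {}
impl Ord for Neighbor {
    fn cmp(&self, other: &Self) -> Ordering {
        self.distance.partial_cmp(&other.distance).unwrap_or(Ordering::Equal)
            .then_with(|| self.id.cmp(&other.id))
    }
}
impl PartialOrd for Neighbor {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> { Some(self.cmp(other)) }
}

fn main() {
    let n = |id: &str, d: f32| Neighbor { id: id.into(), distance: d };

    // result: a plain BinaryHeap is a max-heap, so peek() is the FURTHEST kept
    // result, i.e. the correct eviction candidate and termination reference.
    let mut result: BinaryHeap<Neighbor> = BinaryHeap::new();
    result.extend([n("a", 0.1), n("b", 0.9), n("c", 0.5)]);
    assert_eq!(result.peek().unwrap().distance, 0.9);

    // candidates: wrap in Reverse for a min-heap, so pop() yields the CLOSEST
    // unvisited node first.
    let mut candidates: BinaryHeap<Reverse<Neighbor>> = BinaryHeap::new();
    candidates.extend([Reverse(n("a", 0.1)), Reverse(n("b", 0.9)), Reverse(n("c", 0.5))]);
    assert_eq!(candidates.pop().unwrap().0.distance, 0.1);
}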

Bug B: insert-time beam is clamped to m * 2

let neighbors =
    self.search_knn_internal(&vector, self.config.ef_construction.min(self.config.m * 2));

.min() returns the smaller of ef_construction and m*2. With defaults ef_construction=200, m=16, the insert beam is 32, not 200. New nodes are wired to neighbours from a 32-node beam departing the (single, fixed) entry point. At scale that beam is dominated by whatever's near the entry point, so late-inserted clusters end up linked through the wrong nodes.
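A tiny check of the arithmetic with the default config quoted above (only the numbers are taken from the issue; nothing here is crate API):

#[test]
fn insert_beam_width_with_default_config() {
    // Defaults from HnswConfig above: ef_construction = 200, m = 16.
    let (ef_construction, m) = (200usize, 16usize);
    assert_eq!(ef_construction.min(m * 2), 32);  // current: insert beam clamped to 32
    assert_eq!(ef_construction.max(m), 200);     // proposed: ef_construction with m as a floor
}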

Bug C: neighbour pruning is FIFO, not distance-based

if neighbor_connections.len() > self.config.m * 2 {
    neighbor_connections.truncate(self.config.m);   // keeps FIRST m
}

Vec::truncate keeps the first M. Real HNSW does heuristic-1 selection: keep the M closest to the owning node by distance. With FIFO truncation, hub nodes that receive many incoming edges over time end up retaining only their oldest connections — which point back to whichever vectors were inserted early. The graph's connectivity sediments around early inserts, and late clusters become unreachable from the entry point.
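A minimal sketch of the intended selection, written as a free function over assumed shapes (String ids, a HashMap of vectors, a pluggable distance); names are illustrative, not the crate's API:

use std::cmp::Ordering;
use std::collections::HashMap;

// Keep the m connections closest to the owning node by distance, instead of the
// first m ever inserted (FIFO). This is the heuristic-1 style selection referred
// to above; `distance` stands in for the configured metric.
fn prune_closest(
    owner: &[f32],
    connections: &mut Vec<String>,
    vectors: &HashMap<String, Vec<f32>>,
    m: usize,
    distance: impl Fn(&[f32], &[f32]) -> f32,
) {
    connections.sort_by(|a, b| {
        let da = vectors.get(a).map(|v| distance(owner, v.as_slice())).unwrap_or(f32::INFINITY);
        let db = vectors.get(b).map(|v| distance(owner, v.as_slice())).unwrap_or(f32::INFINITY);
        da.partial_cmp(&db).unwrap_or(Ordering::Equal)
    });
    connections.truncate(m); // drops the furthest links, not the newest ones
}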

Bug D (latent until C is fixed): lock order in pruning

The minimal fix for C requires reading vectors while holding graph.write() to compute distances during pruning. The natural lock order to write is graph.write() → vectors.read(), but search_knn_internal already takes vectors.read() → graph.read() → entry_point.read(). With parking_lot's writer-preferring policy and concurrent inserts, that order mismatch produces a 3-cycle deadlock (caught by the existing test_hnsw_concurrent_inserts). The pruning step must take vectors.read() before graph.write() so the prefix order matches the read path.
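A structural sketch of the order that avoids the cycle; the types and field names are placeholders mirroring the ones quoted above, not the crate's real definitions:

use parking_lot::RwLock;
use std::collections::HashMap;

// Placeholder shape: the point is only the acquisition order. Any path that needs
// both locks takes `vectors` before `graph`, matching the read path
// (vectors -> graph -> entry_point) used by search_knn_internal.
struct IndexShape {
    vectors: RwLock<HashMap<String, Vec<f32>>>,
    graph: RwLock<HashMap<String, Vec<String>>>,
}

impl IndexShape {
    fn prune_after_insert(&self, node: &str, m: usize) {
        let vectors = self.vectors.read();   // 1st: vectors (read)
        let mut graph = self.graph.write();  // 2nd: graph (write)
        if let (Some(owner), Some(conns)) = (vectors.get(node), graph.get_mut(node)) {
            // ... distance-based pruning of `conns` against `owner`, as in the
            // prune_closest sketch above ...
            let _ = (owner, conns, m);
        }
        // Guards drop here in reverse order; graph is never taken before vectors.
    }
}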

Bug E: caller-driven k is silently clamped to ef_search

HnswIndex::search reads ef_search = query.ef_search.unwrap_or(self.config.ef_search) and passes it to search_knn_internal regardless of query.k. The inner search visits only ef_search nodes, then take(query.k) truncates to min(visited, k). With default ef_search = 200, asking for k = 500 returns at most 200 results.

This blocks any caller from driving recall via larger k — for example the post-filter overfetch in Bug F can't get more candidates than ef_search allows. Fix: let ef_search = query.ef_search.unwrap_or(...).max(query.k);.

Bug F: filtered search returns zero for small buckets due to post-filter on top-K

VectorDB::search calls index.search(&query) to get top-K by raw distance, then results.retain(|r| matches_filter(r)) to trim by metadata. With imbalanced bucket sizes — production reality on any heterogeneous corpus — top-K is dominated by whichever bucket has the most vectors. A filter on a 4 %-of-corpus bucket reliably returns 0 hits for k=10 because expected matches in top-10 ≈ 0.4.

This is a design gap: filter pushdown happens after top-K selection. The fix has two layers:

  1. Overfetch: when query.filters is set, internally bump k to max(user_k * 20, 500).min(2000) (matches application-layer overfetch policies that downstream callers typically already implement, e.g. Atlas Data Fabric's filter_pushdown.go). Trim back to user_k after post-filter.
  2. Brute-force fallback: when even the overfetched top-K under-delivers (singleton/very-small buckets at <0.5 % of corpus, where Cap=2000 of n=14000 still expects 0.14 hits), iterate the storage layer directly: for each id, check metadata, compute distance to query, sort, return top-K. O(n) per query but only fires when the fast path didn't find enough — which is precisely the case where it's needed.

Without both layers, Search(k=10, filter={small_bucket}) cannot reliably surface the small bucket's vectors at scale.
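A small sketch of the layer-1 policy as a pure function, using the constants proposed above; the layer-2 brute-force fallback is described in the proposed fix below and is elided here. Nothing in this snippet is the crate's actual API:

// Candidate count to request from the index when a metadata filter is present,
// before post-filtering and truncating back to the caller's k.
fn overfetch_k(user_k: usize) -> usize {
    (user_k * 20).max(500).min(2000)
}

#[test]
fn overfetch_examples() {
    assert_eq!(overfetch_k(10), 500);    // k=10  -> fetch 500 candidates
    assert_eq!(overfetch_k(50), 1000);   // k=50  -> fetch 1000
    assert_eq!(overfetch_k(200), 2000);  // k=200 -> capped at 2000
}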

Why this is missed by the existing tests

Test                                        Vectors  Why it doesn't catch the bug
test_hnsw_insert_and_search                 3        Trivially small; every vector is reachable
test_hnsw_multiple_inserts_no_deadlock      20       Small; tests deadlock, not recall
test_hnsw_concurrent_inserts                40       Small; tests deadlock, not recall
test_vector_db_basic_operations             1        First-vector path; never reaches the buggy logic

None of them exercise insertion-order bias or recall at >100 vectors. The bug is asymptotic in n — at small n the navigable graph is dense enough that even broken termination still finds neighbours.

Proposed fix

Available as a patch — happy to open a PR. ~140 lines net change across two files (index.rs and vector_db.rs); no API change, no storage format change, no FFI impact.

index.rs (Bugs A–E):

  1. Switch Neighbor::Ord to natural ordering by distance with an id tie-break (this makes the ordering total and keeps it consistent with Eq).
  2. Use BinaryHeap<Reverse<Neighbor>> for the candidates heap so pop() returns the closest unvisited; BinaryHeap<Neighbor> (max-heap) for the result heap so peek() returns the furthest, evictable item. Restores standard HNSW search invariants.
  3. Drop the .min(m*2) clamp in the insert beam — use ef_construction.max(m) (effectively just ef_construction with a sane floor).
  4. Replace truncate(m) with distance-based pruning: when a neighbour's connection list exceeds 2M, keep the M connections closest to that neighbour by distance (heuristic-1 selection).
  5. Acquire vectors.read() before graph.write() during pruning so the prefix order matches search_knn_internal's read path.
  6. In HnswIndex::search, set ef = max(ef_search, query.k) so the candidate pool can grow with the caller's request.

vector_db.rs (Bug F):

  1. In VectorDB::search, when query.filters is set, derive fetch_k = max(user_k * 20, 500).min(2000) and pass that as the new query.k (and ef_search) into index.search. Post-filter on returned candidates, then truncate(user_k).
  2. Fallback: if results.len() < user_k after the overfetched fast path, run a brute-force filtered scan over storage.get_all_ids() — check metadata, compute distance, sort, take top-K. Guarantees correctness for very small filter buckets.

Tests:

Add the recall@1 test above plus a top-K cluster-coverage variant to mod tests. Add a small-bucket filter test that asserts Search(k=10, filter=singleton_bucket) returns the singleton from a 14k-vector corpus. None of these tests pass on main; all pass on the patch.

Affected versions / scope

Reproduces on main at d209fc4c (current as of 2026-05-07). The bugs are in code that has been stable for many commits; downstream users running ruvector-router-core HNSW for any corpus larger than a few hundred vectors with non-uniform insertion order are likely affected.

Environmental note unrelated to this issue

Separately, vector_db::tests::test_vector_db_basic_operations hangs in our dev container (Linux x86_64, rustc 1.94, parking_lot, redb default features). Two threads park on futex_wait_queue_me. Reproduces on unmodified main and is independent of the HNSW bug. Will file separately if it's not already known.
