Summary
ruvector-router-core HNSW (crates/ruvector-router-core/src/index.rs + vector_db.rs) silently loses vectors at scale on three orthogonal axes:
- Recall@1 collapses as the inserted set size crosses ~1k–2k with a biased insertion order — searching with a query equal to a previously-inserted vector returns 0 hits or unrelated hits. The vector is in vectors: HashMap<String, Vec<f32>> but the graph can't reach it.
- Filtered search returns zero rows for small buckets even at modest scale. With 14k vectors, asking Search(k=10, filter={node_type: T}) for any T whose bucket is < 5% of the corpus reliably returns 0 — the index post-filters on top-K after HNSW selection, and the top-10 of 14k is dominated by whichever bucket has the most vectors.
- Caller-driven k is silently capped at ef_search. Setting query.k = 500 with the default ef_search = 200 returns at most 200 results — the inner search visits only ef_search nodes regardless of k.
The recall collapse is caused by three independent correctness bugs in index.rs plus a lock-order issue in the insert path (a 3-cycle deadlock under concurrent inserts that surfaces once the third bug is fixed). The filtered-search and ef-vs-k issues are separate design gaps in the search API.
The existing test suite (test_hnsw_insert_and_search, test_hnsw_multiple_inserts_no_deadlock, test_hnsw_concurrent_inserts) covers ≤40 vectors — small enough that the graph-construction bugs don't bite. None of the existing tests exercise filtered search, recall@1, or k > ef_search. Adding a recall@1 test at scale and a small-bucket filter test (proposed below) makes all three regressions visible.
Reproduction
Add this test to crates/ruvector-router-core/src/index.rs under mod tests:
#[test]
fn test_recall_at_1_with_biased_insertion_order() {
    let dimensions = 64;
    let config = HnswConfig {
        m: 16, ef_construction: 200, ef_search: 200,
        metric: DistanceMetric::Cosine, dimensions,
    };
    let index = HnswIndex::new(config);

    fn random_vec(seed: u64, dim: usize) -> Vec<f32> {
        let mut s = seed.wrapping_add(0x9E37_79B9_7F4A_7C15);
        (0..dim).map(|_| {
            s = s.wrapping_add(0x9E37_79B9_7F4A_7C15);
            let mut z = s;
            z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
            z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
            z ^= z >> 31;
            ((z >> 40) as f32 / (1u64 << 23) as f32) - 1.0
        }).collect()
    }

    // Biased insertion: tiny early bucket sets the entry point, then a
    // moderate bucket, then a large late bucket — exactly the shape that
    // surfaces the bug.
    let buckets: &[(&str, usize)] = &[("a", 1), ("b", 80), ("c", 1500)];
    let mut all_ids = vec![]; let mut all_vecs = vec![]; let mut i: u64 = 0;
    for (lab, n) in buckets {
        for _ in 0..*n {
            let id = format!("{}_{}", lab, i);
            let v = random_vec(i, dimensions);
            index.insert(id.clone(), v.clone()).unwrap();
            all_ids.push(id); all_vecs.push(v); i += 1;
        }
    }

    // Recall@1: querying with a vector exactly equal to an inserted one
    // must return that vector as top-1.
    let probes = 30; let mut misses = 0;
    for k in 0..probes {
        let idx = (k * (all_ids.len() - 1)) / (probes - 1);
        let q = SearchQuery {
            vector: all_vecs[idx].clone(), k: 1,
            filters: None, threshold: None, ef_search: None,
        };
        let r = index.search(&q).unwrap();
        if r.first().map(|x| &x.id) != Some(&all_ids[idx]) { misses += 1; }
    }
    assert_eq!(misses, 0, "recall@1 regression: {}/{} probes missed", misses, probes);
}
Expected on main: most probes from the late "c" bucket fail. Expected after the fix: 0 misses across all probes. Runs in <5s.
Root cause — three bugs in index.rs, plus three smaller issues
Bug A: result BinaryHeap orientation is inverted
Neighbor's Ord is reversed for min-heap behaviour:
impl Ord for Neighbor {
    fn cmp(&self, other: &Self) -> Ordering {
        // Reverse ordering for min-heap behavior
        other.distance.partial_cmp(&self.distance).unwrap_or(Ordering::Equal)
    }
}
This is correct for the candidates heap (peek = closest unvisited). It is wrong for the result heap. Standard HNSW search needs result to be a max-heap on distance so peek() returns the furthest current result — that's what gets evicted when a closer neighbour arrives. With the reversed Ord, both heaps are min-heaps and result.peek() returns the closest. Two consequences inside search_knn_internal:
if let Some(furthest) = result.peek() {
    if current.distance > furthest.distance && result.len() >= ef {
        break;
    }
}
furthest is the misnamed closest. The condition becomes "is the next candidate worse than my best result?" — true almost immediately. Once result.len() >= ef, the search bails out. Search visits a tiny region around the entry point and returns.
} else if let Some(worst) = result.peek() {
    if dist < worst.distance {
        result.pop();
        result.push(neighbor);
    }
}
worst is also actually the closest. The branch evicts the best result whenever a slightly closer neighbour arrives — the final set is a churned beam, not the top-ef.
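For reference, a minimal sketch of the intended orientation, assuming the Neighbor shape this report quotes (the fields and the surrounding expansion loop in index.rs differ in detail; here the candidate set is pre-populated rather than grown by graph expansion):

use std::cmp::{Ordering, Reverse};
use std::collections::BinaryHeap;

#[derive(Clone, PartialEq)]
struct Neighbor { id: String, distance: f32 }
impl Eq for Neighbor {}
impl PartialOrd for Neighbor {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> { Some(self.cmp(other)) }
}
impl Ord for Neighbor {
    // Natural order on distance with an id tie-break, so Ord stays total.
    fn cmp(&self, other: &Self) -> Ordering {
        self.distance.partial_cmp(&other.distance).unwrap_or(Ordering::Equal)
            .then_with(|| self.id.cmp(&other.id))
    }
}

// candidates: min-heap via Reverse (pop = closest unvisited);
// result: max-heap (peek = furthest kept entry, the one eligible for eviction).
fn top_ef(ef: usize, mut candidates: BinaryHeap<Reverse<Neighbor>>) -> Vec<Neighbor> {
    let mut result: BinaryHeap<Neighbor> = BinaryHeap::new();
    while let Some(Reverse(current)) = candidates.pop() {
        if let Some(furthest) = result.peek() {
            // Terminate only when the closest remaining candidate is already worse
            // than the furthest result we intend to keep.
            if current.distance > furthest.distance && result.len() >= ef {
                break;
            }
        }
        if result.len() < ef {
            result.push(current);
        } else if result.peek().map_or(false, |f| current.distance < f.distance) {
            result.pop();        // evict the furthest, never the best
            result.push(current);
        }
    }
    result.into_sorted_vec()     // ascending by distance: the true top-ef
}

With this orientation, the termination check and the eviction branch both do what their variable names already claim.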
Bug B: insert-time beam is clamped to m * 2
let neighbors =
    self.search_knn_internal(&vector, self.config.ef_construction.min(self.config.m * 2));
.min() returns the smaller of ef_construction and m*2. With defaults ef_construction=200, m=16, the insert beam is 32, not 200. New nodes are wired to neighbours from a 32-node beam departing the (single, fixed) entry point. At scale that beam is dominated by whatever's near the entry point, so late-inserted clusters end up linked through the wrong nodes.
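The fix is a one-liner; a sketch of the intended beam width with the defaults above (the helper name is illustrative, not code from index.rs):

/// Proposed insert-time beam width: the full construction beam, floored at m
/// rather than capped at m * 2.
fn insert_beam_width(ef_construction: usize, m: usize) -> usize {
    ef_construction.max(m)
}

#[test]
fn insert_beam_is_not_clamped() {
    // The current expression ef_construction.min(m * 2) yields 32 with these defaults.
    assert_eq!(insert_beam_width(200, 16), 200);
}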
Bug C: neighbour pruning is FIFO, not distance-based
if neighbor_connections.len() > self.config.m * 2 {
    neighbor_connections.truncate(self.config.m); // keeps FIRST m
}
Vec::truncate keeps the first M. Real HNSW does heuristic-1 selection: keep the M closest to the owning node by distance. With FIFO truncation, hub nodes that receive many incoming edges over time end up retaining only their oldest connections — which point back to whichever vectors were inserted early. The graph's connectivity sediments around early inserts, and late clusters become unreachable from the entry point.
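A minimal sketch of the distance-based pruning, assuming a distance function matching the index's metric (names and signature are illustrative; index.rs keeps connections per node in its graph map):

use std::cmp::Ordering;
use std::collections::HashMap;

// Keep the m connections closest to the owning node instead of the m oldest.
fn prune_connections(
    own: &[f32],
    connections: &mut Vec<String>,
    vectors: &HashMap<String, Vec<f32>>,
    m: usize,
    distance: impl Fn(&[f32], &[f32]) -> f32,
) {
    if connections.len() <= m * 2 {
        return; // only prune once the list has overflowed, as the current code does
    }
    connections.sort_by(|a, b| {
        let da = vectors.get(a).map(|v| distance(own, v.as_slice())).unwrap_or(f32::MAX);
        let db = vectors.get(b).map(|v| distance(own, v.as_slice())).unwrap_or(f32::MAX);
        da.partial_cmp(&db).unwrap_or(Ordering::Equal)
    });
    connections.truncate(m); // now keeps the m CLOSEST, not the first m inserted
}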
Bug D (latent until C is fixed): lock order in pruning
The minimal fix for C requires reading vectors while holding graph.write() to compute distances during pruning. The natural lock order to write is graph.write() → vectors.read(), but search_knn_internal already takes vectors.read() → graph.read() → entry_point.read(). With parking_lot's writer-preferring policy and concurrent inserts, that order mismatch produces a 3-cycle deadlock (caught by the existing test_hnsw_concurrent_inserts). The pruning step must take vectors.read() before graph.write() so the prefix order matches the read path.
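A sketch of that lock order around the pruning step, reusing prune_connections from the Bug C sketch above (the struct here is a stand-in with only the two relevant fields; cosine_distance stands in for whatever metric implementation index.rs already has):

use parking_lot::RwLock;
use std::collections::HashMap;

struct IndexSketch {
    vectors: RwLock<HashMap<String, Vec<f32>>>,
    graph: RwLock<HashMap<String, Vec<String>>>, // node id -> connection ids
}

impl IndexSketch {
    fn prune_node(&self, node_id: &str, m: usize) {
        // vectors.read() BEFORE graph.write(), so the lock prefix matches
        // search_knn_internal's read path (vectors.read() -> graph.read() -> ...).
        // Taking graph.write() first re-introduces the 3-cycle deadlock under
        // concurrent inserts.
        let vectors = self.vectors.read();
        let mut graph = self.graph.write();
        if let Some(connections) = graph.get_mut(node_id) {
            if let Some(own) = vectors.get(node_id) {
                prune_connections(own, connections, &*vectors, m, cosine_distance);
            }
        }
    }
}

// Stand-in for the index's metric; index.rs has its own distance implementation.
fn cosine_distance(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    1.0 - dot / (na * nb).max(f32::EPSILON)
}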
Bug E: caller-driven k is silently clamped to ef_search
HnswIndex::search reads ef_search = query.ef_search.unwrap_or(self.config.ef_search) and passes it to search_knn_internal regardless of query.k. The inner search visits only ef_search nodes, then take(query.k) truncates to min(visited, k). With default ef_search = 200, asking for k = 500 returns at most 200 results.
This blocks any caller from driving recall via larger k — for example the post-filter overfetch in Bug F can't get more candidates than ef_search allows. Fix: let ef_search = query.ef_search.unwrap_or(...).max(query.k);.
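As a sketch, the clamp as a pure function (illustrative only; in HnswIndex::search it is the single line above):

// Candidate-pool size: never smaller than the caller's k.
fn effective_ef(query_ef: Option<usize>, config_ef: usize, k: usize) -> usize {
    query_ef.unwrap_or(config_ef).max(k)
}

#[test]
fn k_larger_than_ef_search_is_honoured() {
    // Today this query visits only 200 nodes and can never return 500 results.
    assert_eq!(effective_ef(None, 200, 500), 500);
}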
Bug F: filtered search returns zero for small buckets due to post-filter on top-K
VectorDB::search calls index.search(&query) to get top-K by raw distance, then results.retain(|r| matches_filter(r)) to trim by metadata. With imbalanced bucket sizes — production reality on any heterogeneous corpus — top-K is dominated by whichever bucket has the most vectors. A filter on a 4 %-of-corpus bucket reliably returns 0 hits for k=10 because expected matches in top-10 ≈ 0.4.
This is a design gap: filter pushdown happens after top-K selection. The fix has two layers:
- Overfetch: when query.filters is set, internally bump k to max(user_k * 20, 500).min(2000) (matches the application-layer overfetch policies downstream callers typically already implement, e.g. Atlas Data Fabric's filter_pushdown.go). Trim back to user_k after the post-filter.
- Brute-force fallback: when even the overfetched top-K under-delivers (singleton or very-small buckets at < 0.5% of the corpus, where for a singleton a cap of 2000 out of n=14000 still expects only 0.14 hits), iterate the storage layer directly: for each id, check metadata, compute the distance to the query, sort, return top-K. O(n) per query, but it only fires when the fast path didn't find enough — which is precisely the case where it's needed. (Both layers are sketched below.)
Without both layers, Search(k=10, filter={small_bucket}) cannot reliably surface the small bucket's vectors at scale.
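A sketch of both layers, with Hit, the metadata map, and the matches/distance closures as stand-ins for whatever vector_db.rs actually uses; the constants mirror the policy above:

use std::cmp::Ordering;
use std::collections::HashMap;

#[derive(Clone)]
struct Hit { id: String, distance: f32 }

// Layer 1: overfetch factor applied to the HNSW fast path whenever a filter is set.
fn fetch_k_for_filtered(user_k: usize) -> usize {
    (user_k * 20).max(500).min(2000)
}

// Layer 2: brute-force fallback over the storage layer; fires only when the
// post-filtered fast path returned fewer than user_k hits.
fn brute_force_filtered(
    query: &[f32],
    user_k: usize,
    rows: &[(String, Vec<f32>, HashMap<String, String>)],
    matches: impl Fn(&HashMap<String, String>) -> bool,
    distance: impl Fn(&[f32], &[f32]) -> f32,
) -> Vec<Hit> {
    let mut hits: Vec<Hit> = rows
        .iter()
        .filter(|(_, _, meta)| matches(meta))
        .map(|(id, vec, _)| Hit { id: id.clone(), distance: distance(query, vec.as_slice()) })
        .collect();
    hits.sort_by(|a, b| a.distance.partial_cmp(&b.distance).unwrap_or(Ordering::Equal));
    hits.truncate(user_k); // O(n) scan, but only for the buckets the fast path can't see
    hits
}

VectorDB::search would call the fast path with fetch_k_for_filtered(user_k), post-filter, and fall through to brute_force_filtered only when fewer than user_k results survive.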
Why this is missed by the existing tests
| Test | Vectors | Why it doesn't catch the bug |
| --- | --- | --- |
| test_hnsw_insert_and_search | 3 | Trivially small; everyone is reachable |
| test_hnsw_multiple_inserts_no_deadlock | 20 | Small; tests deadlock, not recall |
| test_hnsw_concurrent_inserts | 40 | Small; tests deadlock, not recall |
| test_vector_db_basic_operations | 1 | First-vector path — never reaches the buggy logic |
None of them exercise insertion-order bias or recall at >100 vectors. The bug is asymptotic in n — at small n the navigable graph is dense enough that even broken termination still finds neighbours.
Proposed fix
Available as a patch — happy to open a PR. ~140 lines net change across two files (index.rs and vector_db.rs); no API change, no storage format change, no FFI impact.
index.rs (Bugs A–E):
- Switch Neighbor::Ord to natural ordering by distance with an id tie-break (totalises Ord, keeps it consistent with Eq).
- Use BinaryHeap<Reverse<Neighbor>> for the candidates heap so pop() returns the closest unvisited node; BinaryHeap<Neighbor> (max-heap) for the result heap so peek() returns the furthest, evictable item. Restores the standard HNSW search invariants.
- Drop the .min(m*2) clamp on the insert beam — use ef_construction.max(m) (effectively just ef_construction with a sane floor).
- Replace truncate(m) with distance-based pruning: when a neighbour's connection list exceeds 2M, keep the M connections closest to that neighbour by distance (heuristic-1 selection).
- Acquire vectors.read() before graph.write() during pruning so the prefix order matches search_knn_internal's read path.
- In HnswIndex::search, set ef = max(ef_search, query.k) so the candidate pool can grow with the caller's request.
vector_db.rs (Bug F):
- In VectorDB::search, when query.filters is set, derive fetch_k = max(user_k * 20, 500).min(2000) and pass that as the new query.k (and ef_search) into index.search. Post-filter the returned candidates, then truncate(user_k).
- Fallback: if results.len() < user_k after the overfetched fast path, run a brute-force filtered scan over storage.get_all_ids() — check metadata, compute distance, sort, take top-K. Guarantees correctness for very small filter buckets.
Tests:
Add the recall@1 test above plus a top-K cluster-coverage variant to mod tests. Add a small-bucket filter test that asserts Search(k=10, filter=singleton_bucket) returns the singleton from a 14k-vector corpus. None of these tests pass on main; all pass on the patch.
Affected versions / scope
Reproduces on main at d209fc4c (current as of 2026-05-07). The bugs are in code that has been stable for many commits; downstream users running ruvector-router-core HNSW for any corpus larger than a few hundred vectors with non-uniform insertion order are likely affected.
Environmental note unrelated to this issue
Separately, vector_db::tests::test_vector_db_basic_operations hangs in our dev container (Linux x86_64, rustc 1.94, parking_lot, redb default features). Two threads park on futex_wait_queue_me. Reproduces on unmodified main and is independent of the HNSW bug. Will file separately if it's not already known.