
ruvector-router-core HNSW: search returns 0 results from inserted vectors at scale (recall@1 << 1.0) #430

@carheart

Description


Summary

ruvector-router-core HNSW (crates/ruvector-router-core/src/index.rs + vector_db.rs) silently loses vectors at scale on three orthogonal axes:

  • Recall@1 collapses once the inserted set size crosses ~1k–2k under a biased insertion order: searching with a query equal to a previously inserted vector returns 0 hits or unrelated hits. The vector is present in vectors: HashMap<String, Vec<f32>>, but the graph can't reach it.
  • Filtered search returns zero rows for small buckets even at modest scale. With 14k vectors, Search(k=10, filter={node_type: T}) for any T whose bucket is <5 % of the corpus reliably returns 0: the index post-filters on top-K after HNSW selection, and the top-10 of 14k is dominated by whichever bucket has the most vectors.
  • Caller-driven k is silently capped at ef_search. Setting query.k = 500 with default ef_search = 200 returns at most 200 results — the inner search visits only ef_search nodes regardless of k.

The recall collapse is caused by three independent correctness bugs in index.rs plus a lock-order issue in the insert path (a 3-cycle deadlock under concurrent inserts that only surfaces once the third bug is fixed). The filtered-search and ef-vs-k issues are separate design gaps in the search API.

The existing test suite (test_hnsw_insert_and_search, test_hnsw_multiple_inserts_no_deadlock, test_hnsw_concurrent_inserts) covers ≤40 vectors — small enough that the graph-construction bugs don't bite. None of the existing tests exercise filtered search, recall@1, or k > ef_search. Adding a recall@1 test at scale and a small-bucket filter test (proposed below) makes all three regressions visible.

Reproduction

Add this test to crates/ruvector-router-core/src/index.rs under mod tests:

#[test]
fn test_recall_at_1_with_biased_insertion_order() {
    let dimensions = 64;
    let config = HnswConfig {
        m: 16, ef_construction: 200, ef_search: 200,
        metric: DistanceMetric::Cosine, dimensions,
    };
    let index = HnswIndex::new(config);

    fn random_vec(seed: u64, dim: usize) -> Vec<f32> {
        let mut s = seed.wrapping_add(0x9E37_79B9_7F4A_7C15);
        (0..dim).map(|_| {
            s = s.wrapping_add(0x9E37_79B9_7F4A_7C15);
            let mut z = s;
            z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
            z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
            z ^= z >> 31;
            ((z >> 40) as f32 / (1u64 << 23) as f32) - 1.0
        }).collect()
    }

    // Biased insertion: tiny early bucket sets the entry point, then a
    // moderate bucket, then a large late bucket — exactly the shape that
    // surfaces the bug.
    let buckets: &[(&str, usize)] = &[("a", 1), ("b", 80), ("c", 1500)];
    let mut all_ids = vec![]; let mut all_vecs = vec![]; let mut i: u64 = 0;
    for (lab, n) in buckets {
        for _ in 0..*n {
            let id = format!("{}_{}", lab, i);
            let v = random_vec(i, dimensions);
            index.insert(id.clone(), v.clone()).unwrap();
            all_ids.push(id); all_vecs.push(v); i += 1;
        }
    }

    // Recall@1: querying with a vector exactly equal to an inserted one
    // must return that vector as top-1.
    let probes = 30; let mut misses = 0;
    for k in 0..probes {
        let idx = (k * (all_ids.len() - 1)) / (probes - 1);
        let q = SearchQuery {
            vector: all_vecs[idx].clone(), k: 1,
            filters: None, threshold: None, ef_search: None,
        };
        let r = index.search(&q).unwrap();
        if r.first().map(|x| &x.id) != Some(&all_ids[idx]) { misses += 1; }
    }
    assert_eq!(misses, 0, "recall@1 regression: {}/{} probes missed", misses, probes);
}

Expected on main: most probes from the late "c" bucket fail. Expected after the fix: 0 misses across all probes. Runs in <5s.

Root cause — three bugs in index.rs, plus three smaller issues

Bug A: result BinaryHeap orientation is inverted

Neighbor's Ord is reversed for min-heap behaviour:

impl Ord for Neighbor {
    fn cmp(&self, other: &Self) -> Ordering {
        // Reverse ordering for min-heap behavior
        other.distance.partial_cmp(&self.distance).unwrap_or(Ordering::Equal)
    }
}

This is correct for the candidates heap (peek = closest unvisited). It is wrong for the result heap. Standard HNSW search needs result to be a max-heap on distance so peek() returns the furthest current result — that's what gets evicted when a closer neighbour arrives. With the reversed Ord, both heaps are min-heaps and result.peek() returns the closest. Two consequences inside search_knn_internal:

if let Some(furthest) = result.peek() {
    if current.distance > furthest.distance && result.len() >= ef {
        break;
    }
}

With the inverted Ord, furthest is actually the closest kept result. The condition becomes "is the next candidate worse than my best result?", which is true almost immediately, so as soon as result.len() >= ef the search bails out. It visits a tiny region around the entry point and returns.

} else if let Some(worst) = result.peek() {
    if dist < worst.distance {
        result.pop();
        result.push(neighbor);
    }
}

worst is likewise actually the closest. The branch evicts the current best result whenever an even closer neighbour arrives, and once the heap is full it never admits a candidate that is merely better than the true furthest; the final set is a churned beam, not the top-ef.
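For reference, a self-contained sketch of the intended heap orientation, assuming Neighbor is switched to natural ascending Ord on distance as proposed in the fix below; the type here is an illustrative stand-in, not the crate's definition:

use std::cmp::{Ordering, Reverse};
use std::collections::BinaryHeap;

// Illustrative stand-in for Neighbor with natural ascending Ord on distance
// (id tie-break), i.e. the ordering proposed in fix item 1 below.
#[derive(PartialEq)]
struct Neighbor { id: String, distance: f32 }
impl Eq for Neighbor {}
impl Ord for Neighbor {
    fn cmp(&self, other: &Self) -> Ordering {
        self.distance.partial_cmp(&other.distance).unwrap_or(Ordering::Equal)
            .then_with(|| self.id.cmp(&other.id))
    }
}
impl PartialOrd for Neighbor {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> { Some(self.cmp(other)) }
}

fn main() {
    let n = |id: &str, d: f32| Neighbor { id: id.into(), distance: d };

    // result: a plain BinaryHeap is a max-heap, so peek() is the FURTHEST kept
    // result, i.e. the correct eviction candidate and termination reference.
    let mut result: BinaryHeap<Neighbor> = BinaryHeap::new();
    result.extend([n("a", 0.1), n("b", 0.9), n("c", 0.5)]);
    assert_eq!(result.peek().unwrap().distance, 0.9);

    // candidates: wrap in Reverse for a min-heap, so pop() yields the CLOSEST
    // unvisited node first.
    let mut candidates: BinaryHeap<Reverse<Neighbor>> = BinaryHeap::new();
    candidates.extend([Reverse(n("a", 0.1)), Reverse(n("b", 0.9)), Reverse(n("c", 0.5))]);
    assert_eq!(candidates.pop().unwrap().0.distance, 0.1);
}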

Bug B: insert-time beam is clamped to m * 2

let neighbors =
    self.search_knn_internal(&vector, self.config.ef_construction.min(self.config.m * 2));

.min() returns the smaller of ef_construction and m*2. With defaults ef_construction=200, m=16, the insert beam is 32, not 200. New nodes are wired to neighbours from a 32-node beam departing the (single, fixed) entry point. At scale that beam is dominated by whatever's near the entry point, so late-inserted clusters end up linked through the wrong nodes.
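A tiny check of the arithmetic with the default config quoted above (only the numbers are taken from the issue; nothing here is crate API):

#[test]
fn insert_beam_width_with_default_config() {
    // Defaults from HnswConfig above: ef_construction = 200, m = 16.
    let (ef_construction, m) = (200usize, 16usize);
    assert_eq!(ef_construction.min(m * 2), 32);  // current: insert beam clamped to 32
    assert_eq!(ef_construction.max(m), 200);     // proposed: ef_construction with m as a floor
}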

Bug C: neighbour pruning is FIFO, not distance-based

if neighbor_connections.len() > self.config.m * 2 {
    neighbor_connections.truncate(self.config.m);   // keeps FIRST m
}

Vec::truncate keeps the first M. Real HNSW does heuristic-1 selection: keep the M closest to the owning node by distance. With FIFO truncation, hub nodes that receive many incoming edges over time end up retaining only their oldest connections — which point back to whichever vectors were inserted early. The graph's connectivity sediments around early inserts, and late clusters become unreachable from the entry point.
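A minimal sketch of the intended selection, written as a free function over assumed shapes (String ids, a HashMap of vectors, a pluggable distance); names are illustrative, not the crate's API:

use std::cmp::Ordering;
use std::collections::HashMap;

// Keep the m connections closest to the owning node by distance, instead of the
// first m ever inserted (FIFO). This is the heuristic-1 style selection referred
// to above; `distance` stands in for the configured metric.
fn prune_closest(
    owner: &[f32],
    connections: &mut Vec<String>,
    vectors: &HashMap<String, Vec<f32>>,
    m: usize,
    distance: impl Fn(&[f32], &[f32]) -> f32,
) {
    connections.sort_by(|a, b| {
        let da = vectors.get(a).map(|v| distance(owner, v.as_slice())).unwrap_or(f32::INFINITY);
        let db = vectors.get(b).map(|v| distance(owner, v.as_slice())).unwrap_or(f32::INFINITY);
        da.partial_cmp(&db).unwrap_or(Ordering::Equal)
    });
    connections.truncate(m); // drops the furthest links, not the newest ones
}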

Bug D (latent until C is fixed): lock order in pruning

The minimal fix for C requires reading vectors while holding graph.write() to compute distances during pruning. The natural lock order to write is graph.write() → vectors.read(), but search_knn_internal already takes vectors.read() → graph.read() → entry_point.read(). With parking_lot's writer-preferring policy and concurrent inserts, that order mismatch produces a 3-cycle deadlock (caught by the existing test_hnsw_concurrent_inserts). The pruning step must take vectors.read() before graph.write() so the prefix order matches the read path.
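A structural sketch of the order that avoids the cycle; the types and field names are placeholders mirroring the ones quoted above, not the crate's real definitions:

use parking_lot::RwLock;
use std::collections::HashMap;

// Placeholder shape: the point is only the acquisition order. Any path that needs
// both locks takes `vectors` before `graph`, matching the read path
// (vectors -> graph -> entry_point) used by search_knn_internal.
struct IndexShape {
    vectors: RwLock<HashMap<String, Vec<f32>>>,
    graph: RwLock<HashMap<String, Vec<String>>>,
}

impl IndexShape {
    fn prune_after_insert(&self, node: &str, m: usize) {
        let vectors = self.vectors.read();   // 1st: vectors (read)
        let mut graph = self.graph.write();  // 2nd: graph (write)
        if let (Some(owner), Some(conns)) = (vectors.get(node), graph.get_mut(node)) {
            // ... distance-based pruning of `conns` against `owner`, as in the
            // prune_closest sketch above ...
            let _ = (owner, conns, m);
        }
        // Guards drop here in reverse order; graph is never taken before vectors.
    }
}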

Bug E: caller-driven k is silently clamped to ef_search

HnswIndex::search reads ef_search = query.ef_search.unwrap_or(self.config.ef_search) and passes it to search_knn_internal regardless of query.k. The inner search visits only ef_search nodes, then take(query.k) truncates to min(visited, k). With default ef_search = 200, asking for k = 500 returns at most 200 results.

This blocks any caller from driving recall via larger k — for example the post-filter overfetch in Bug F can't get more candidates than ef_search allows. Fix: let ef_search = query.ef_search.unwrap_or(...).max(query.k);.

Bug F: filtered search returns zero for small buckets due to post-filter on top-K

VectorDB::search calls index.search(&query) to get top-K by raw distance, then results.retain(|r| matches_filter(r)) to trim by metadata. With imbalanced bucket sizes — production reality on any heterogeneous corpus — top-K is dominated by whichever bucket has the most vectors. A filter on a 4 %-of-corpus bucket reliably returns 0 hits for k=10 because expected matches in top-10 ≈ 0.4.

This is a design gap: filter pushdown happens after top-K selection. The fix has two layers:

  1. Overfetch: when query.filters is set, internally bump k to max(user_k * 20, 500).min(2000) (matches application-layer overfetch policies that downstream callers typically already implement, e.g. Atlas Data Fabric's filter_pushdown.go). Trim back to user_k after post-filter.
  2. Brute-force fallback: when even the overfetched top-K under-delivers (singleton/very-small buckets at <0.5 % of corpus, where Cap=2000 of n=14000 still expects 0.14 hits), iterate the storage layer directly: for each id, check metadata, compute distance to query, sort, return top-K. O(n) per query but only fires when the fast path didn't find enough — which is precisely the case where it's needed.

Without both layers, Search(k=10, filter={small_bucket}) cannot reliably surface the small bucket's vectors at scale.
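A small sketch of the layer-1 policy as a pure function, using the constants proposed above; the layer-2 brute-force fallback is described in the proposed fix below and is elided here. Nothing in this snippet is the crate's actual API:

// Candidate count to request from the index when a metadata filter is present,
// before post-filtering and truncating back to the caller's k.
fn overfetch_k(user_k: usize) -> usize {
    (user_k * 20).max(500).min(2000)
}

#[test]
fn overfetch_examples() {
    assert_eq!(overfetch_k(10), 500);    // k=10  -> fetch 500 candidates
    assert_eq!(overfetch_k(50), 1000);   // k=50  -> fetch 1000
    assert_eq!(overfetch_k(200), 2000);  // k=200 -> capped at 2000
}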

Why this is missed by the existing tests

Test                                        Vectors  Why it doesn't catch the bug
test_hnsw_insert_and_search                 3        Trivially small; every vector is reachable
test_hnsw_multiple_inserts_no_deadlock      20       Small; tests deadlock, not recall
test_hnsw_concurrent_inserts                40       Small; tests deadlock, not recall
test_vector_db_basic_operations             1        First-vector path; never reaches the buggy logic

None of them exercise insertion-order bias or recall at >100 vectors. The bug is asymptotic in n — at small n the navigable graph is dense enough that even broken termination still finds neighbours.

Proposed fix

Available as a patch — happy to open a PR. ~140 lines net change across two files (index.rs and vector_db.rs); no API change, no storage format change, no FFI impact.

index.rs (Bugs A–E):

  1. Switch Neighbor::Ord to natural ordering by distance with an id tie-break (this makes the ordering total and keeps it consistent with Eq).
  2. Use BinaryHeap<Reverse<Neighbor>> for the candidates heap so pop() returns the closest unvisited; BinaryHeap<Neighbor> (max-heap) for the result heap so peek() returns the furthest, evictable item. Restores standard HNSW search invariants.
  3. Drop the .min(m*2) clamp in the insert beam — use ef_construction.max(m) (effectively just ef_construction with a sane floor).
  4. Replace truncate(m) with distance-based pruning: when a neighbour's connection list exceeds 2M, keep the M connections closest to that neighbour by distance (heuristic-1 selection).
  5. Acquire vectors.read() before graph.write() during pruning so the prefix order matches search_knn_internal's read path.
  6. In HnswIndex::search, set ef = max(ef_search, query.k) so the candidate pool can grow with the caller's request.

vector_db.rs (Bug F):

  1. In VectorDB::search, when query.filters is set, derive fetch_k = max(user_k * 20, 500).min(2000) and pass that as the new query.k (and ef_search) into index.search. Post-filter on returned candidates, then truncate(user_k).
  2. Fallback: if results.len() < user_k after the overfetched fast path, run a brute-force filtered scan over storage.get_all_ids() — check metadata, compute distance, sort, take top-K. Guarantees correctness for very small filter buckets.

Tests:

Add the recall@1 test above plus a top-K cluster-coverage variant to mod tests. Add a small-bucket filter test that asserts Search(k=10, filter=singleton_bucket) returns the singleton from a 14k-vector corpus. None of these tests pass on main; all pass on the patch.

Affected versions / scope

Reproduces on main at d209fc4c (current as of 2026-05-07). The bugs are in code that has been stable for many commits; downstream users running ruvector-router-core HNSW for any corpus larger than a few hundred vectors with non-uniform insertion order are likely affected.

Environmental note unrelated to this issue

Separately, vector_db::tests::test_vector_db_basic_operations hangs in our dev container (Linux x86_64, rustc 1.94, parking_lot, redb default features). Two threads park on futex_wait_queue_me. Reproduces on unmodified main and is independent of the HNSW bug. Will file separately if it's not already known.
