bftree lazy neighbor init #1215
Treat a bf-tree `NotFound` on neighbor read as an empty adjacency list rather than an error, and drop the O(max_points) eager initialization that wrote an empty list for every id at construction. The eager loop was a stop-gap for a missing "exists" query, but bf-tree already distinguishes `Found` / `NotFound` / `Deleted` natively. Mapping `NotFound` to an empty list (identical to the existing `Found(0)` case) lets neighbor lists be created lazily on first write. This makes construction O(num_start_points) instead of O(max_points), eliminating a billion-insert startup wall for large, larger-than-memory datasets, and removes the artificial coupling that required capacity to be fully materialized up front. The consolidate paths that motivated the stop-gap remain covered by the delete-and-search tests, which exercise inplace_delete end to end.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add coverage for the NotFound -> empty mapping introduced by lazy neighbor-list initialization: a serial sweep over a wide range of never-written ids (including sparse written ids and ids beyond any write) and a concurrent multi-thread hammer. Both assert that absent ids yield an empty adjacency list with no errors or panics.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## main #1215 +/- ##
==========================================
- Coverage 90.85% 89.82% -1.04%
==========================================
Files 489 489
Lines 93575 93618 +43
==========================================
- Hits 85020 84093 -927
- Misses 8555 9525 +970
nice! can you post perf and recall results for this change?
/// `neighbors` is cleared first upon each invocation. A vector with no
/// stored neighbor list yet (bf-tree `NotFound`) yields an empty list rather
/// than an error, so neighbor lists are created lazily on first write.
/// One data copy is involved which copies the data from bf-tree to `neighbors`
// Sweep a wide id range, including the written ids and the gaps between
// and beyond them. The buffer is reused across calls to confirm it is
// cleared on every read.
for id in 0u32..200_000 {
    neighbor_provider.get_neighbors(id, &mut result).unwrap();
    if written.contains(&id) {
        assert_eq!(&[1, 2, 3], &*result, "written id {id} lost its list");
    } else {
        assert!(result.is_empty(), "unwritten id {id} should be empty");
    }
}
// Overlapping id ranges across threads to maximize contention on
// the shared bf-tree read path for absent keys.
for id in (t * 1_000)..(t * 1_000 + 10_000) {
    neighbor_provider.get_neighbors(id, &mut result).unwrap();
    assert!(result.is_empty());
}
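For readers without the repository at hand, the concurrent pattern above can be approximated in a self-contained form. This is only a sketch: `SharedNeighbors` and `hammer_absent_ids` are hypothetical names, and an `RwLock<HashMap>` stands in for the real bf-tree-backed provider, so it illustrates the shape of the test rather than the actual read path.

```rust
use std::collections::HashMap;
use std::sync::RwLock;
use std::thread;

/// Hypothetical in-memory stand-in for the neighbor provider.
struct SharedNeighbors {
    lists: RwLock<HashMap<u32, Vec<u32>>>,
}

impl SharedNeighbors {
    /// Clears `out`, then copies the stored list if one exists.
    /// An absent id leaves `out` empty instead of reporting an error.
    fn get_neighbors(&self, id: u32, out: &mut Vec<u32>) {
        out.clear();
        if let Some(list) = self.lists.read().unwrap().get(&id) {
            out.extend_from_slice(list);
        }
    }
}

/// Reads overlapping id ranges from `threads` threads and returns how many
/// reads produced an empty list.
fn hammer_absent_ids(store: &SharedNeighbors, threads: u32) -> usize {
    thread::scope(|s| {
        let handles: Vec<_> = (0..threads)
            .map(|t| {
                s.spawn(move || {
                    // Start with a dirty buffer so a missing `clear` would show up.
                    let mut result = vec![u32::MAX];
                    let mut empty_reads = 0usize;
                    for id in (t * 1_000)..(t * 1_000 + 10_000) {
                        store.get_neighbors(id, &mut result);
                        if result.is_empty() {
                            empty_reads += 1;
                        }
                    }
                    empty_reads
                })
            })
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().unwrap())
            .sum::<usize>()
    })
}
```

With an empty store every read should come back empty, so the returned count equals `threads * 10_000`; a single written id inside a thread's range lowers that thread's count by one.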
@harsha-simhadri here's a copilot summary of 3 A/B runs for this lazy init change. Ran the wikipedia-100K streaming benchmark (full-precision bf-tree, 100 insert+search stages, 100K vectors, L=200, k=100, 8 threads), 3 trials per branch, all release builds.
Conclusion: Lazy neighbor init is result-neutral; recall is identical to baseline on every metric. The ~3–4% timing lean toward baseline is within run-to-run noise: per-trial ranges overlap on all timing metrics (e.g. QPS: A-best 718 vs B-worst 698; insert/op: A-best 521ms vs B-worst 540ms), and the trial spread is as large as the mean difference. No statistically confident perf regression.
What
Make bf-tree neighbor lists initialize lazily instead of eagerly materializing
an empty list for every id at construction.
Two changes in `diskann-bftree`:
- `neighbors.rs`: On neighbor read, treat a bf-tree `NotFound` result as an empty adjacency list rather than an error. (`Deleted`/`InvalidKey` still error.)
- `provider.rs`: Drop the `O(max_points)` eager loop in `new_empty` that wrote an empty list for every id up front.
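The read-side mapping can be sketched roughly as follows. This is not the code in `neighbors.rs`: `Lookup`, `Slot`, `NeighborStore`, and `NeighborError` are hypothetical stand-ins, a `HashMap` replaces bf-tree, and treating an out-of-capacity id as `InvalidKey` is an assumption made purely for illustration.

```rust
use std::collections::HashMap;

/// Errors surfaced to the caller of `get_neighbors` in this sketch.
#[derive(Debug, PartialEq, Eq)]
enum NeighborError {
    Deleted,
    InvalidKey,
}

/// Hypothetical stand-in for the outcome of a bf-tree point read.
enum Lookup<'a> {
    Found(&'a [u32]),
    NotFound,
    Deleted,
    InvalidKey,
}

/// One slot per id that has been written or deleted; never-written ids have none.
enum Slot {
    Live(Vec<u32>),
    Tombstone,
}

struct NeighborStore {
    slots: HashMap<u32, Slot>,
    max_points: u32,
}

impl NeighborStore {
    fn lookup(&self, id: u32) -> Lookup<'_> {
        // Assumption: an out-of-capacity id plays the role of `InvalidKey`.
        if id >= self.max_points {
            return Lookup::InvalidKey;
        }
        match self.slots.get(&id) {
            Some(Slot::Live(list)) => Lookup::Found(list.as_slice()),
            Some(Slot::Tombstone) => Lookup::Deleted,
            None => Lookup::NotFound,
        }
    }

    /// Clears `neighbors`, then fills it from the store.
    fn get_neighbors(&self, id: u32, neighbors: &mut Vec<u32>) -> Result<(), NeighborError> {
        neighbors.clear();
        match self.lookup(id) {
            Lookup::Found(list) => {
                neighbors.extend_from_slice(list);
                Ok(())
            }
            // Never written: same observable result as a stored empty list.
            Lookup::NotFound => Ok(()),
            Lookup::Deleted => Err(NeighborError::Deleted),
            Lookup::InvalidKey => Err(NeighborError::InvalidKey),
        }
    }
}
```

Clearing the buffer before the match is what makes the `NotFound` arm trivial: the caller sees an empty list whether the id was never written or was written with zero neighbors.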
Why
The eager loop was a stop-gap for a missing "exists" query. But bf-tree already
distinguishes `Found`/`NotFound`/`Deleted` natively, so mapping `NotFound` to
an empty list (identical to the existing `Found(0)` case) lets neighbor lists be
created on first write.

This makes construction `O(num_start_points)` instead of `O(max_points)`,
eliminating a billion-insert startup wall for large, larger-than-memory datasets,
and removes the artificial coupling that required capacity to be fully
materialized up front.
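To make the cost argument concrete, here is a minimal sketch of lazy construction. `LazyNeighborLists` and its methods are hypothetical, a `BTreeMap` stands in for bf-tree, and writing one empty list per start point at construction is an assumption based on the stated `O(num_start_points)` bound.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a lazily initialized neighbor store.
struct LazyNeighborLists {
    lists: BTreeMap<u32, Vec<u32>>,
    max_points: u64,
}

impl LazyNeighborLists {
    /// Writes one empty list per start point; no work proportional to `max_points`.
    fn new_empty(max_points: u64, start_points: &[u32]) -> Self {
        let lists = start_points.iter().map(|&id| (id, Vec::new())).collect();
        Self { lists, max_points }
    }

    /// The first write for an id creates its list; later writes replace it.
    fn set_neighbors(&mut self, id: u32, neighbors: &[u32]) {
        self.lists.insert(id, neighbors.to_vec());
    }

    /// Number of lists that physically exist in the store.
    fn materialized(&self) -> usize {
        self.lists.len()
    }

    /// Declared capacity, which no longer bounds construction work.
    fn capacity(&self) -> u64 {
        self.max_points
    }
}
```

In this model, a store declared with a capacity of one billion and a single start point holds exactly one list after construction, and the count grows only as ids are written.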
Correctness
The consolidation paths that motivated the stop-gap (`consolidate_vector` /
`on_neighbors`) read neighbor lists during delete; a `NotFound`→empty result is
identical to the eager-init outcome for an unwritten id. These remain covered by
the delete-and-search tests, which exercise `inplace_delete` end to end.