VEctor SeARch (VESAR)

Diving into the rabbit hole of vector search. Core idea is to explore vector search from first principles and implement, benchmark, experiment with known approaches. I've picked up 3 papers to begin with, that leads to HNSW and one accelerated implementation by NVIDIA. This repo would act as an accompanying code for a rather detailed blog I will publish soon on my portfolio mrutyunjaybiswal.com (or substack). No immediate commitment on when I will able to wrap this up, I would instead treat this as a exploration, not an item to tick off.

I would appreciate more relevant paper suggestions for the reading list, I will try to implement so and benchmark.

EDIT: I expected to wrap the blog first, then the code, but the script seems to have flipped. So here it is, Rust implementation of NSW and HNSW papers.

Note that, I am still quite afresh to Rust, so kindly bear with the ugly code and strucutre. I wish to improve as I write more in time. Also, I wouldn't prefer bringing in any coding agent to this repo just yet, as it's for learning and exploration (as of now).

Set up

Prerequisites

Rust

Install Rust if you haven't already.

Project structure

vesar/
    /vesar
        /benches        -- contains benchmarks
        /src            -- core libraries and implementation
            main.rs
            .......
        /tests          -- tests
    .gitignore
    README.md

Benchmark

To run benchmarks on insertions, queries per second (qps) and recall with 1_00_000 16-dimensional data points:

N=1_00_000 DIM=16 cargo bench --bench insert --bench qps --bench recall -- --sample-size 10

You can explore more about usage of environment variables for benchmarks in the respective files. To keep things consistent, here's an abstract nomencleture:

N : Number of data points in the dataset
DIM : Dimension of each datapoint in the dataset
M : Allowed connections per node per layer (per HNSW), or neighbours to connect to per insertion (NSW)
EFC : Exploration factor for Construction (which is typically higher than query in HNSW, but more or same in NSW with some tweaks)
EFQ : Exploration factor for Query (equivalant to beam width W, or multi-search params m in NSW papers)
K : top-K query (search) results.

Tests

To run tests,

cargo test --test hnsw_tests  # for hnsw tests
cargo test --test nsw_tests   # for nsw tests

Results

TODO: to add benchmarking snaps and results aggregation.

TODOs (High-level)

Algorithms & Papers

Approximate Nearest Neighbour (greedy + multi search, insertion) initial implementation
HNSW insertions + query

Datasets

Dataset 1: Simulated data points upto 1M elements with d upto 16; L2 as proximity metric. (all configurable)
Dataste 2: CoPHiR dataset (208-d, k ~ 30 neighoburs, L1 metric).

Benchmarking

Benchmark against experimental set up of Malkov et. al. [2].

Insertion latency.
QPS.
Correctness (top-K Recall).
Number of Proximity Metric Calculations as a fraction of Dataset size.
Scale with/against Dimensionality.

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
vesar		vesar
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

VEctor SeARch (VESAR)

Set up

Prerequisites

Project structure

Benchmark

Tests

Results

TODOs (High-level)

Algorithms & Papers

Datasets

Benchmarking

References

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

VEctor SeARch (VESAR)

Set up

Prerequisites

Project structure

Benchmark

Tests

Results

TODOs (High-level)

Algorithms & Papers

Datasets

Benchmarking

References

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages