This repository provides a comprehensive framework for researching, benchmarking, and analyzing vector retrieval algorithms, with a special focus on retrieval guarantees. The codebase is designed to facilitate reproducible experiments and in-depth performance comparisons across various datasets and algorithm configurations.
- Extensible Algorithm Framework: Easily add new vector search algorithms by inheriting from a `BaseAlgorithm` class.
- Automated Benchmark Suite: A single script (`scripts/run_full_benchmark.py`) to run a full suite of experiments across multiple datasets.
- Modular Index/Search Pipelines: Combine any indexing strategy with any search strategy through declarative config (e.g., pair FAISS HNSW indexing with linear or FAISS searchers).
- Expanded FAISS Coverage: Benchmark flat, IVF-Flat, IVF-PQ, IVF-SQ8, and stand-alone PQ indexes side by side without code changes by updating YAML configs.
- Locality-Sensitive Hashing Baseline: Compare an LSH retriever (cosine or Euclidean) with tunable recall guarantees using the same declarative pipeline, including a FAISS-backed IndexLSH variant that reranks expanded candidate sets for improved recall.
- Cover Tree Prototype: Run a lightweight cover tree baseline (random + subsampled GloVe) via `configs/covertree_smoke.yaml` to vet hierarchical metric search behavior; candidate/visit limits are currently disabled while we validate recall.
- Standard Datasets: Built-in support for benchmark datasets like SIFT1M, GloVe, and MS MARCO (TF-IDF projection or pre-embedded Cohere vectors), with automated download and preprocessing.
- Comprehensive Metrics: Tracks key performance indicators including recall, queries per second (QPS), index build time, and index memory usage.
- Automated Reporting: Automatically generates detailed Markdown summary reports and raw JSON results for each benchmark run.
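To make the headline metric concrete, recall@k is the fraction of true top-k neighbors recovered in each returned top-k list, averaged over queries. A minimal sketch (the repository's own metric code in `src/benchmark/` may differ in details; `recall_at_k` is a hypothetical helper name):

```python
import numpy as np

def recall_at_k(retrieved: np.ndarray, ground_truth: np.ndarray, k: int) -> float:
    """Mean fraction of the true top-k neighbors present in each retrieved top-k list."""
    hits = 0
    for ret, gt in zip(retrieved[:, :k], ground_truth[:, :k]):
        hits += len(set(ret.tolist()) & set(gt.tolist()))
    return hits / (ground_truth.shape[0] * k)

# Two queries, k=2: the first list recovers both true neighbors, the second one of two.
retrieved = np.array([[3, 1], [5, 2]])
ground_truth = np.array([[1, 3], [5, 9]])
print(recall_at_k(retrieved, ground_truth, k=2))  # 0.75
```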
- `src/`: Source code for the experimental framework.
  - `algorithms/`: Implementations of vector retrieval algorithms (e.g., ExactSearch, HNSW).
  - `benchmark/`: Utilities for dataset handling, evaluation, and metrics.
  - `experiments/`: The core experimental runner and configuration management.
- `scripts/`: High-level scripts for automating experiments.
  - `run_full_benchmark.py`: The main entry point for running the full benchmark suite.
- `configs/`: Directory for experiment configuration files (in YAML format).
- `data/`: Default directory for storing downloaded and processed datasets (configurable via `data_dir`).
- `benchmark_results/`: Default output directory for benchmark reports and raw results (configurable via `output_dir`).
```bash
# Create and activate a virtual environment (e.g., using conda)
conda create -n vectordb-env python=3.10
conda activate vectordb-env

# Install dependencies
pip install -r requirements.txt
```

Fast smoke checks are available via pytest. This runs lightweight algorithm/indexer tests without needing full dataset downloads.

```bash
pytest

# Skip FAISS-dependent tests if the backend is unavailable
pytest -m "not requires_faiss"
```

The primary way to run experiments is using the full benchmark runner script.
First, generate the default benchmark configuration file. This file defines which datasets and algorithms to test.
```bash
python scripts/run_full_benchmark.py --create-config
```

This will create `configs/benchmark_config.yaml`. You can edit this file to customize the benchmark run.
Once the configuration is ready, launch the benchmark suite:
```bash
# below is WIP
python scripts/run_full_benchmark.py --config configs/benchmark_config.yaml

# below works
python scripts/run_full_benchmark.py --config configs/benchmark_config_test1.yaml
python scripts/run_full_benchmark.py --config configs/benchmark_config_ms.yaml
```

The script will automatically download the required datasets if they are not found in the configured `data_dir`, run all experiments, and save the results under the configured `output_dir`.
PACE deployment note: the repository configuration (`configs/benchmark_config.yaml`) points to the shared storage locations:

- Datasets: `/storage/ice-shared/cs8903onl/vectordb-retrieval/datasets`
- Benchmark results: `/storage/ice-shared/cs8903onl/vectordb-retrieval/results`
- MS MARCO assets generated by `src/dataprep/`:
  - Raw cache (ir_datasets): `/storage/ice-shared/cs8903onl/vectordb-retrieval/ms_marco_v1_raw`
  - Subsampled TSV export: `/storage/ice-shared/cs8903onl/vectordb-retrieval/msmarco_v1_subsampled`
  - Sentence-transformer embeddings: `/storage/ice-shared/cs8903onl/vectordb-retrieval/msmarco_v1_embeddings`

Adjust those paths if you are running on a different machine or prefer a different layout.
Dataset entries can carry bespoke options via the `dataset_options` key. The MS MARCO entry in `configs/benchmark_config.yaml` now points `embedded_dataset_dir` at the directory produced by `src/dataprep/embed_msmarco.py`, which contains `passage_embeddings.npy`, `query_embeddings.npy`, `ground_truth.npy`, and the corresponding ID files. The loader validates that ground-truth indices stay within bounds and will memory-map `passage_embeddings.npy` whenever `use_memmap_cache` is enabled. Keep `ground_truth_k` in sync with the value used during embedding and continue to adjust `cache_dir`, `query_batch_size`, or `topk` based on your evaluation budget.
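Put together, an MS MARCO entry might look like the following sketch. The key names mirror the options described above; the paths and values are illustrative, not the shipped defaults:

```yaml
datasets:
  - name: msmarco
    dataset_options:
      embedded_dataset_dir: /storage/ice-shared/cs8903onl/vectordb-retrieval/msmarco_v1_embeddings
      use_memmap_cache: true    # memory-map passage_embeddings.npy
      ground_truth_k: 100       # must match the value used during embedding
      cache_dir: ./cache        # illustrative
      query_batch_size: 256     # illustrative
```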
Memory-bound runs: with the new `.npy` layout you still get the best mileage by leaving `use_memmap_cache: true`, which opens `passage_embeddings.npy` as a read-only memmap. Pair this with `query_batch_size` to bound how many vectors are searched at once.
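The effect of these two settings can be sketched outside the framework. A minimal, hypothetical `batched_search` helper (not the loader's actual code): the memmap keeps the passage matrix on disk and pages vectors in on access, while batching bounds how many query rows are processed at once.

```python
import numpy as np

def batched_search(passages_path: str, queries: np.ndarray, k: int,
                   batch_size: int = 256) -> np.ndarray:
    """Brute-force top-k over a memory-mapped .npy passage matrix, in query mini-batches."""
    # mmap_mode="r" opens the file read-only without loading it wholesale into RAM.
    passages = np.load(passages_path, mmap_mode="r")
    out = []
    for start in range(0, len(queries), batch_size):
        batch = queries[start:start + batch_size]
        # Squared-L2 distances for this batch only; limits peak memory.
        d = ((batch[:, None, :] - passages[None, :, :]) ** 2).sum(-1)
        out.append(np.argsort(d, axis=1)[:, :k])
    return np.vstack(out)
```

Smaller `batch_size` trades throughput for a tighter memory ceiling, which is the same trade `query_batch_size` controls in the benchmark runner.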
The `glove50` loader also accepts new smoke-friendly knobs: `test_size`, `test_limit`, `train_limit`, `ground_truth_k`, and `seed`. They make it easy to subsample a few thousand base vectors plus a couple hundred queries; perfect for validating slower research prototypes such as the cover tree without processing all ~400k embeddings.
If you want to run the same dataset multiple times in a single benchmark job (for example, comparing different `train_size` values), set a unique `experiment_name` per dataset entry:
```yaml
datasets:
  - name: random
    experiment_name: random_train32
    dataset_options:
      train_size: 32
      test_size: 8
      ground_truth_k: 5
      dimensions: 4
      seed: 101
  - name: random
    experiment_name: random_train64
    dataset_options:
      train_size: 64
      test_size: 8
      ground_truth_k: 5
      dimensions: 4
      seed: 101
```

Each `experiment_name` becomes its own output subdirectory and key in `all_results.json`, so runs do not overwrite each other.
To validate the CoverTree baseline end-to-end (random dataset first, then a subsampled GloVe split), run:
```bash
python scripts/run_full_benchmark.py --config configs/covertree_smoke.yaml
```

The config limits the random dataset to ~20k train / 512 queries and trims GloVe to ~20k train / 256 queries while switching the metric to cosine. It also mirrors the repo-standard paths from AGENTS.md, so all data reads come from `/storage/ice-shared/cs8903onl/vectordb-retrieval/datasets` (no fresh downloads) and results land in `/storage/ice-shared/cs8903onl/vectordb-retrieval/results`, right next to the other benchmark suites.
Memory-bound runs: set `use_memmap_cache: true` under `dataset_options` to stream large pre-embedded datasets (MS MARCO) directly into a memory-mapped file instead of materialising all passages in RAM. The loader now writes `<dataset>_<digest>_train.memmap` alongside JSON metadata inside `cache_dir`, while queries/ground-truth stay in compact `.npy` files. This avoids the double-copy previously required for `np.vstack` + FAISS warm-up and is especially helpful on PACE nodes with tight memory quotas. You can still cap the working set via `base_limit`, `query_limit`, and lower `batch_size` if the parquet reader spikes memory during iteration. Combine this with `query_batch_size` (global or per-dataset) to execute searches in controllable mini-batches and keep runtime under cluster limits. Short on walltime? Disable strict relevance resolution (`strict_relevance_resolution: false`) and/or bound the parquet scan (`max_passage_scan`) so loading stops once the `base_limit` budget is filled; any missing positives are reported and skipped.
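The cache layout described above can be sketched as follows. The file-name pattern comes from the text; the digest computation and metadata schema here are assumptions for illustration, not the loader's actual implementation:

```python
import hashlib
import json
import numpy as np
from pathlib import Path

def write_memmap_cache(cache_dir: str, dataset: str, vectors: np.ndarray) -> Path:
    """Persist train vectors as <dataset>_<digest>_train.memmap plus JSON metadata."""
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    # Hypothetical digest: a short fingerprint so distinct subsets get distinct files.
    digest = hashlib.sha1(np.ascontiguousarray(vectors[:1]).tobytes()).hexdigest()[:8]
    path = cache / f"{dataset}_{digest}_train.memmap"
    # Write straight into the memmap: one copy on disk, no np.vstack double-copy in RAM.
    mm = np.memmap(path, dtype=vectors.dtype, mode="w+", shape=vectors.shape)
    mm[:] = vectors
    mm.flush()
    # Shape/dtype metadata lets a later run reopen the raw memmap correctly.
    meta = {"shape": list(vectors.shape), "dtype": str(vectors.dtype)}
    path.with_suffix(".json").write_text(json.dumps(meta))
    return path
```

A subsequent run reopens the file with `np.memmap(path, mode="r", ...)` using the stored shape and dtype, so passages never need to be fully resident.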
Need airtight recall for CoverTree? Use the v2 implementation, which mirrors the prototype from `feature/covertree` but plugs into the benchmarking stack:

```bash
sbatch slurm_jobs/singlerun_nomsma_benchmarking_c_v2_pat.sbatch
```

The job spins up a uv environment, installs `requirements.txt`, and evaluates both CoverTree (with candidate limits disabled) and CoverTreeV2 alongside FAISS, HNSW, and LSH using `configs/benchmark_nomsma_c_v2.yaml` (random + GloVe datasets, topk=200). Results and plots land under `benchmark_results/benchmark_<timestamp>/`, and the SLURM log is written to `slurm_jobs/slurm_logs/VectorDB-Retrieval-Guarantee_FULL-<jobid>-<node>.log`. See `methodology/covertree_v2_benchmarking.md` for the latest recall/QPS snapshot and troubleshooting tips.
The legacy Cohere parquet dump is retired. We now derive a reproducible MS MARCO subset via the scripts under `src/dataprep/`, both of which read `configs/ms_marco_subset_embed.yaml`:
`python src/dataprep/subsample_msmarco.py`
- Uses `ir_datasets` to sample `CORPUS_SAMPLE_SIZE` passages and `QUERY_SAMPLE_SIZE` dev queries.
- Writes `corpus.tsv` and `queries.tsv` to the configured `subset.OUTPUT_DIR`.
- If `IR_DATASETS_HOME` is unset the script applies the default from the config (`/storage/ice-shared/.../ms_marco_v1_raw`) so repeated runs reuse the shared cache.

`python src/dataprep/embed_msmarco.py`
- Embeds the subsampled TSVs with `SentenceTransformer(config.embeddings.MODEL_NAME)` on CPU or GPU.
- Filters queries without positives, aligns qrels, and saves `passage_embeddings.npy`, `query_embeddings.npy`, `ground_truth.npy`, the corresponding ID arrays, and `metadata.json` under `embeddings.OUTPUT_DIR`.

PACE users can trigger the same steps through `slurm_jobs/ms_marco_subsample_generate.slurm.sh` and `slurm_jobs/ms_marco_subsample_embed.sh`. Point `dataset_options.embedded_dataset_dir` at the resulting embedding directory to plug the artefacts into the benchmark suite.
Each benchmark configuration can declare reusable indexers and searchers, then mix-and-match them per algorithm via references. For example, `exact` uses the `brute_force_l2` indexer together with the `linear_l2` searcher, while the MS MARCO override swaps in cosine-compatible variants. This structure lets you explore new combinations (e.g., FAISS IVF indexer + linear searcher) without touching code: just add a new entry under `algorithms` with the desired `indexer_ref` / `searcher_ref` or inline overrides.
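A sketch of the pattern follows. The section names, `exact`, `brute_force_l2`, `linear_l2`, and the `*_ref` keys come from the description above; the `type` values and the second algorithm entry are illustrative, so check the shipped `configs/benchmark_config.yaml` for the exact schema:

```yaml
indexers:
  brute_force_l2:
    type: brute_force      # illustrative
  faiss_ivf:
    type: faiss_ivf_flat   # illustrative

searchers:
  linear_l2:
    type: linear           # illustrative

algorithms:
  - name: exact
    indexer_ref: brute_force_l2
    searcher_ref: linear_l2
  - name: ivf_linear       # new combination, no code changes
    indexer_ref: faiss_ivf
    searcher_ref: linear_l2
```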
The shipping `configs/benchmark_config.yaml` now illustrates this by instantiating multiple FAISS retrieval families called out in `Methods(UnitVectorDB).pdf`: IVF-Flat, IVF-PQ, IVF-SQ8, and a stand-alone PQ baseline, each plugged into the same evaluation harness for apples-to-apples comparisons.
After a benchmark run completes, you will find the following in the `benchmark_results/benchmark_<timestamp>/` directory:
- `benchmark_summary.md`: A human-readable summary of the results in Markdown format.
- `all_results.json`: The complete raw results in JSON format.
- `benchmark.log`: A detailed log of the entire benchmark run.
An example of the performance table from the summary report:
| Algorithm | Recall@10 | QPS | Mean Query Time (ms) | Build Time (s) | Index Memory (MB) |
|---|---|---|---|---|---|
| exact | 1.0000 | 1.83 | 545.39 | 0.01 | 500.00 |
| hnsw | … | … | … | … | … |
To add a new algorithm for benchmarking:
- Create a new Python file in `src/algorithms/`.
- Define a class that inherits from `src.algorithms.BaseAlgorithm`.
- Implement the required abstract methods:
  - `build_index(self, vectors)`: To build the search index from a set of vectors.
  - `search(self, query, k)`: To find the `k` nearest neighbors for a single query vector.
  - `batch_search(self, queries, k)`: To find neighbors for a batch of query vectors.
- Add your new algorithm to the `algorithms` section in your `benchmark_config.yaml` file (reference the new `lsh` entry for a complete modular example).
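A minimal skeleton of the steps above, assuming only the three-method interface just listed. The brute-force body is illustrative, and the stand-in base class here merely mimics `src.algorithms.BaseAlgorithm` so the sketch is self-contained; in the repository you would inherit from the real class instead:

```python
import numpy as np

class BaseAlgorithm:
    """Stand-in for src.algorithms.BaseAlgorithm (assumed interface)."""
    def build_index(self, vectors): raise NotImplementedError
    def search(self, query, k): raise NotImplementedError
    def batch_search(self, queries, k): raise NotImplementedError

class ExactL2(BaseAlgorithm):
    """Brute-force L2 baseline implementing the three required methods."""

    def build_index(self, vectors):
        # "Index" for exact search is just the raw matrix.
        self.vectors = np.asarray(vectors, dtype=np.float32)

    def search(self, query, k):
        d = ((self.vectors - np.asarray(query, dtype=np.float32)) ** 2).sum(axis=1)
        return np.argsort(d)[:k]

    def batch_search(self, queries, k):
        return np.stack([self.search(q, k) for q in queries])
```

From there, registering the class under `algorithms` in `benchmark_config.yaml` makes it visible to the runner.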
We use the `msmarco-v1-passage` dataset from `ir_datasets`; see the `ir_datasets` documentation for more details.
- Run `python src/dataprep/subsample_msmarco.py` to create a reproducible TSV snapshot driven by `subset.*` in `configs/ms_marco_subset_embed.yaml`.
- The script expects `IR_DATASETS_HOME` to point at the raw MSMARCO cache; if it is absent the config default is applied automatically.
- Outputs (`corpus.tsv`, `queries.tsv`) are written to `subset.OUTPUT_DIR` (default `/storage/ice-shared/cs8903onl/vectordb-retrieval/msmarco_v1_subsampled`).
- On PACE you can launch the step with `slurm_jobs/ms_marco_subsample_generate.slurm.sh`.
- Run `python src/dataprep/embed_msmarco.py` to embed the snapshot and derive `ground_truth.npy`.
- Configure the model, batch size, and `GROUND_TRUTH_K` via `configs/ms_marco_subset_embed.yaml`.
- Artefacts (`passage_embeddings.npy`, `query_embeddings.npy`, `passage_ids.npy`, `query_ids.npy`, `ground_truth.npy`, `metadata.json`) land in `embeddings.OUTPUT_DIR` (default `/storage/ice-shared/cs8903onl/vectordb-retrieval/msmarco_v1_embeddings`).
- Use `slurm_jobs/ms_marco_subsample_embed.sh` for the GPU-backed PACE workflow.