Vector DB Retrieval Guarantee Research

This repository provides a comprehensive framework for researching, benchmarking, and analyzing vector retrieval algorithms, with a special focus on retrieval guarantees. The codebase is designed to facilitate reproducible experiments and in-depth performance comparisons across various datasets and algorithm configurations.

Features

  • Extensible Algorithm Framework: Easily add new vector search algorithms by inheriting from a BaseAlgorithm class.
  • Automated Benchmark Suite: A single script (scripts/run_full_benchmark.py) to run a full suite of experiments across multiple datasets.
  • Modular Index/Search Pipelines: Combine any indexing strategy with any search strategy through declarative config (e.g., pair FAISS HNSW indexing with linear or FAISS searchers).
  • Expanded FAISS Coverage: Benchmark flat, IVF-Flat, IVF-PQ, IVF-SQ8, and stand-alone PQ indexes side by side without code changes by updating YAML configs.
  • Locality-Sensitive Hashing Baseline: Compare an LSH retriever (cosine or Euclidean) with tunable recall guarantees using the same declarative pipeline, including a FAISS-backed IndexLSH variant that reranks expanded candidate sets for improved recall.
  • Cover Tree Prototype: Run a lightweight cover tree baseline (random + subsampled GloVe) via configs/covertree_smoke.yaml to vet hierarchical metric search behavior; candidate/visit limits are currently disabled while we validate recall.
  • Standard Datasets: Built-in support for benchmark datasets like SIFT1M, GloVe, and MS MARCO (TF-IDF projection or pre-embedded Cohere vectors), with automated download and preprocessing.
  • Comprehensive Metrics: Tracks key performance indicators including recall, queries per second (QPS), index build time, and index memory usage.
  • Automated Reporting: Automatically generates detailed Markdown summary reports and raw JSON results for each benchmark run.

Project Structure

  • src/: Source code for the experimental framework.
    • algorithms/: Implementations of vector retrieval algorithms (e.g., ExactSearch, HNSW).
    • benchmark/: Utilities for dataset handling, evaluation, and metrics.
    • experiments/: The core experimental runner and configuration management.
  • scripts/: High-level scripts for automating experiments.
    • run_full_benchmark.py: The main entry point for running the full benchmark suite.
  • configs/: Directory for experiment configuration files (in YAML format).
  • data/: Default directory for storing downloaded and processed datasets (configurable via data_dir).
  • benchmark_results/: Default output directory for benchmark reports and raw results (configurable via output_dir).

Setup

# Create and activate a virtual environment (e.g., using conda)
conda create -n vectordb-env python=3.10
conda activate vectordb-env

# Install dependencies
pip install -r requirements.txt

Testing

Fast smoke checks are available via pytest. These lightweight algorithm/indexer tests run without requiring full dataset downloads.

pytest

# Skip FAISS-dependent tests if the backend is unavailable
pytest -m "not requires_faiss"

Running the Benchmark

The primary way to run experiments is using the full benchmark runner script.

1. Create Default Configuration

First, generate the default benchmark configuration file. This file defines which datasets and algorithms to test.

python scripts/run_full_benchmark.py --create-config

This will create configs/benchmark_config.yaml. You can edit this file to customize the benchmark run.
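The generated file drives the whole run. An illustrative skeleton of its top-level shape is shown below; regenerate the real defaults with --create-config, since exact keys and values may differ:

# Illustrative skeleton only; the generated default is authoritative.
data_dir: data                  # where datasets are downloaded and processed
output_dir: benchmark_results   # where reports and raw results are written

datasets:
  - name: glove50               # one entry per dataset to benchmark

algorithms:
  - name: exact                 # one entry per algorithm configuration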

2. Run the Benchmark Suite

Once the configuration is ready, launch the benchmark suite:

# Default config: still a work in progress
python scripts/run_full_benchmark.py --config configs/benchmark_config.yaml

# Known-good configs
python scripts/run_full_benchmark.py --config configs/benchmark_config_test1.yaml
python scripts/run_full_benchmark.py --config configs/benchmark_config_ms.yaml

The script will automatically download the required datasets if they are not found in the configured data_dir, run all experiments, and save the results under the configured output_dir.

PACE deployment note: the repository configuration (configs/benchmark_config.yaml) points to the shared storage locations:

  • Datasets: /storage/ice-shared/cs8903onl/vectordb-retrieval/datasets
  • Benchmark results: /storage/ice-shared/cs8903onl/vectordb-retrieval/results
  • MS MARCO assets generated by src/dataprep/:
    • Raw cache (ir_datasets): /storage/ice-shared/cs8903onl/vectordb-retrieval/ms_marco_v1_raw
    • Subsampled TSV export: /storage/ice-shared/cs8903onl/vectordb-retrieval/msmarco_v1_subsampled
    • Sentence-transformer embeddings: /storage/ice-shared/cs8903onl/vectordb-retrieval/msmarco_v1_embeddings

Adjust those paths if you are running on a different machine or prefer a different layout.

Dataset-specific Options

Dataset entries can carry bespoke options via the dataset_options key. The MS MARCO entry in configs/benchmark_config.yaml now points embedded_dataset_dir at the directory produced by src/dataprep/embed_msmarco.py, which contains passage_embeddings.npy, query_embeddings.npy, ground_truth.npy, and the corresponding ID files. The loader validates that ground-truth indices stay within bounds and will memory-map passage_embeddings.npy whenever use_memmap_cache is enabled. Keep ground_truth_k in sync with the value used during embedding and continue to adjust cache_dir, query_batch_size, or topk based on your evaluation budget.

Memory-bound runs: with the new .npy layout you still get the best mileage by leaving use_memmap_cache: true, which opens passage_embeddings.npy as a read-only memmap. Pair this with query_batch_size to bound how many vectors are searched at once.
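For reference, here is a hedged sketch of an MS MARCO dataset entry using the options named above; the entry name and all values are illustrative, so check the shipping configs/benchmark_config.yaml for the real defaults:

datasets:
  - name: msmarco                 # illustrative entry name
    dataset_options:
      embedded_dataset_dir: /storage/ice-shared/cs8903onl/vectordb-retrieval/msmarco_v1_embeddings
      use_memmap_cache: true      # memory-map passage_embeddings.npy instead of loading it
      ground_truth_k: 100         # illustrative; keep in sync with the embedding run
      cache_dir: data/cache       # illustrative cache location
      query_batch_size: 256       # bound how many query vectors are searched at once
      topk: 10                    # evaluation depth; adjust to your budget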

The glove50 loader also accepts new smoke-friendly knobs: test_size, test_limit, train_limit, ground_truth_k, and seed. They make it easy to subsample a few thousand base vectors plus a couple hundred queries—perfect for validating slower research prototypes such as the cover tree without processing all ~400k embeddings.
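A smoke-sized glove50 entry using those knobs might look like the following sketch (sizes and seed are illustrative):

datasets:
  - name: glove50
    dataset_options:
      train_limit: 20000    # subsample the base vectors
      test_size: 256        # number of queries to keep
      ground_truth_k: 10    # neighbors per query in the recomputed ground truth
      seed: 42              # fix the subsampling for reproducibility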

If you want to run the same dataset multiple times in a single benchmark job (for example, comparing different train_size values), set a unique experiment_name per dataset entry:

datasets:
  - name: random
    experiment_name: random_train32
    dataset_options:
      train_size: 32
      test_size: 8
      ground_truth_k: 5
      dimensions: 4
      seed: 101
  - name: random
    experiment_name: random_train64
    dataset_options:
      train_size: 64
      test_size: 8
      ground_truth_k: 5
      dimensions: 4
      seed: 101

Each experiment_name becomes its own output subdirectory and key in all_results.json, so runs do not overwrite each other.

CoverTree Smoke Run

To validate the CoverTree baseline end-to-end (random dataset first, then a subsampled GloVe split), run:

python scripts/run_full_benchmark.py --config configs/covertree_smoke.yaml

The config limits the random dataset to ~20k train / 512 queries and trims GloVe to ~20k train / 256 queries while switching the metric to cosine. It also mirrors the repo-standard paths from AGENTS.md, so all data reads come from /storage/ice-shared/cs8903onl/vectordb-retrieval/datasets (no fresh downloads) and results land in /storage/ice-shared/cs8903onl/vectordb-retrieval/results, right next to the other benchmark suites.

Memory-bound runs: set use_memmap_cache: true under dataset_options to stream large pre-embedded datasets (MS MARCO) through a memory-mapped file instead of materialising all passages in RAM.

  • The loader writes <dataset>_<digest>_train.memmap alongside JSON metadata inside cache_dir, while queries and ground truth stay in compact .npy files. This avoids the double copy previously required for np.vstack + FAISS warm-up and is especially helpful on PACE nodes with tight memory quotas.
  • You can still cap the working set via base_limit and query_limit, and lower batch_size if the parquet reader spikes memory during iteration.
  • Combine this with query_batch_size (global or per-dataset) to execute searches in controllable mini-batches and keep runtime under cluster limits.
  • Short on walltime? Disable strict relevance resolution (strict_relevance_resolution: false) and/or bound the parquet scan (max_passage_scan) so loading stops once the base_limit budget is filled; any missing positives are reported and skipped.
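Putting those knobs together, a hedged dataset_options sketch for a memory-bound MS MARCO run (all key names appear above; the values are purely illustrative):

dataset_options:
  use_memmap_cache: true              # stream passages through a read-only memmap
  base_limit: 500000                  # cap how many base passages are loaded
  query_limit: 2000                   # cap how many queries are loaded
  batch_size: 1024                    # parquet-reader batch size during iteration
  query_batch_size: 256               # search queries in controllable mini-batches
  strict_relevance_resolution: false  # report and skip unresolved positives
  max_passage_scan: 1000000           # stop scanning once base_limit is filled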

CoverTreeV2 Perfect-Recall Benchmark

Need airtight recall for CoverTree? Use the v2 implementation, which mirrors the prototype from feature/covertree but plugs into the benchmarking stack:

sbatch slurm_jobs/singlerun_nomsma_benchmarking_c_v2_pat.sbatch

The job spins up a uv environment, installs requirements.txt, and evaluates both CoverTree (with candidate limits disabled) and CoverTreeV2 alongside FAISS, HNSW, and LSH using configs/benchmark_nomsma_c_v2.yaml (random + GloVe datasets, topk=200). Results and plots land under benchmark_results/benchmark_<timestamp>/, and the SLURM log is written to slurm_jobs/slurm_logs/VectorDB-Retrieval-Guarantee_FULL-<jobid>-<node>.log. See methodology/covertree_v2_benchmarking.md for the latest recall/QPS snapshot and troubleshooting tips.

MS MARCO Subset and Embedding Pipeline

The legacy Cohere parquet dump is retired. We now derive a reproducible MS MARCO subset via the scripts under src/dataprep/, both of which read configs/ms_marco_subset_embed.yaml:

  1. python src/dataprep/subsample_msmarco.py
    • Uses ir_datasets to sample CORPUS_SAMPLE_SIZE passages and QUERY_SAMPLE_SIZE dev queries.
    • Writes corpus.tsv and queries.tsv to the configured subset.OUTPUT_DIR.
    • If IR_DATASETS_HOME is unset the script applies the default from the config (/storage/ice-shared/.../ms_marco_v1_raw) so repeated runs reuse the shared cache.
  2. python src/dataprep/embed_msmarco.py
    • Embeds the subsampled TSVs with SentenceTransformer(config.embeddings.MODEL_NAME) on CPU or GPU.
    • Filters queries without positives, aligns qrels, and saves passage_embeddings.npy, query_embeddings.npy, ground_truth.npy, the corresponding ID arrays, and metadata.json under embeddings.OUTPUT_DIR.

PACE users can trigger the same steps through slurm_jobs/ms_marco_subsample_generate.slurm.sh and slurm_jobs/ms_marco_subsample_embed.sh. Point dataset_options.embedded_dataset_dir at the resulting embedding directory to plug the artefacts into the benchmark suite.
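Both steps read configs/ms_marco_subset_embed.yaml; the sketch below shows its rough shape using only the keys referenced in this README (nesting and values are assumptions, so treat the shipped file as authoritative):

subset:
  CORPUS_SAMPLE_SIZE: 100000   # illustrative passage sample size
  QUERY_SAMPLE_SIZE: 1000      # illustrative dev-query sample size
  OUTPUT_DIR: /storage/ice-shared/cs8903onl/vectordb-retrieval/msmarco_v1_subsampled

embeddings:
  MODEL_NAME: sentence-transformers/all-MiniLM-L6-v2   # illustrative model choice
  GROUND_TRUTH_K: 100          # illustrative; mirror it in the benchmark config
  OUTPUT_DIR: /storage/ice-shared/cs8903onl/vectordb-retrieval/msmarco_v1_embeddings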

Modular Indexing & Searching

Each benchmark configuration can declare reusable indexers and searchers, then mix-and-match them per algorithm via references. For example, exact uses the brute_force_l2 indexer together with the linear_l2 searcher, while the MS MARCO override swaps in cosine-compatible variants. This structure lets you explore new combinations (e.g., FAISS IVF indexer + linear searcher) without touching code—just add a new entry under algorithms with the desired indexer_ref / searcher_ref or inline overrides.
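As a hedged illustration of that shape: the exact, brute_force_l2, and linear_l2 names and the indexer_ref/searcher_ref keys come from this README, while the per-entry fields and the faiss_ivf entry are hypothetical:

indexers:
  brute_force_l2:
    type: brute_force   # hypothetical field names
    metric: l2
  faiss_ivf:
    type: faiss_ivf     # hypothetical entry
    nlist: 128

searchers:
  linear_l2:
    type: linear        # hypothetical field names
    metric: l2

algorithms:
  - name: exact
    indexer_ref: brute_force_l2
    searcher_ref: linear_l2
  - name: ivf_linear    # a new combination, added without code changes
    indexer_ref: faiss_ivf
    searcher_ref: linear_l2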

The shipping configs/benchmark_config.yaml now illustrates this by instantiating multiple FAISS retrieval families called out in MethodsUnitVectorDB.pdf: IVF-Flat, IVF-PQ, IVF-SQ8, and a stand-alone PQ baseline, each plugged into the same evaluation harness for apples-to-apples comparisons.

Benchmark Results

After a benchmark run completes, you will find the following in the benchmark_results/benchmark_<timestamp>/ directory:

  • benchmark_summary.md: A human-readable summary of the results in Markdown format.
  • all_results.json: The complete raw results in JSON format.
  • benchmark.log: A detailed log of the entire benchmark run.

An example of the performance table from the summary report:

Algorithm Performance

| Algorithm | Recall@10 | QPS  | Mean Query Time (ms) | Build Time (s) | Index Memory (MB) |
|-----------|-----------|------|----------------------|----------------|-------------------|
| exact     | 1.0000    | 1.83 | 545.39               | 0.01           | 500.00            |
| hnsw      | …         | …    | …                    | …              | …                 |

Adding New Algorithms

To add a new algorithm for benchmarking (a minimal sketch follows the list below):

  1. Create a new Python file in src/algorithms/.
  2. Define a class that inherits from src.algorithms.BaseAlgorithm.
  3. Implement the required abstract methods:
    • build_index(self, vectors): To build the search index from a set of vectors.
    • search(self, query, k): To find the k nearest neighbors for a single query vector.
    • batch_search(self, queries, k): To find neighbors for a batch of query vectors.
  4. Add your new algorithm to the algorithms section in your benchmark_config.yaml file (reference the new lsh entry for a complete modular example).
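The snippet below is a minimal sketch of such a subclass, assuming BaseAlgorithm exposes exactly the three abstract methods listed above; the brute-force NumPy internals and the return convention (index arrays only) are illustrative, not the repository's actual implementation:

import numpy as np

from src.algorithms import BaseAlgorithm  # base class named in this README


class BruteForceL2(BaseAlgorithm):
    """Illustrative exact L2 retriever used only to demonstrate the interface."""

    def build_index(self, vectors):
        # Keep the raw vectors; a real algorithm would build its index structure here.
        self.vectors = np.asarray(vectors, dtype=np.float32)

    def search(self, query, k):
        # Rank every stored vector by squared L2 distance and return the top-k indices.
        dists = np.sum((self.vectors - np.asarray(query, dtype=np.float32)) ** 2, axis=1)
        return np.argsort(dists)[:k]

    def batch_search(self, queries, k):
        # Naive per-query loop; real implementations would vectorise or parallelise.
        return [self.search(q, k) for q in queries]

Once registered under algorithms in benchmark_config.yaml, the class runs through the same evaluation harness as the built-in retrievers.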

Embedding MS MARCO v1

We use the msmarco-v1-passage dataset from ir_datasets; see the ir_datasets documentation for more details.

Subsampling

  • Run python src/dataprep/subsample_msmarco.py to create a reproducible TSV snapshot driven by subset.* in configs/ms_marco_subset_embed.yaml.
  • The script expects IR_DATASETS_HOME to point at the raw MSMARCO cache; if it is absent the config default is applied automatically.
  • Outputs (corpus.tsv, queries.tsv) are written to subset.OUTPUT_DIR (default /storage/ice-shared/cs8903onl/vectordb-retrieval/msmarco_v1_subsampled).
  • On PACE you can launch the step with slurm_jobs/ms_marco_subsample_generate.slurm.sh.

Embedding

  • Run python src/dataprep/embed_msmarco.py to embed the snapshot and derive ground_truth.npy.
  • Configure the model, batch size, and GROUND_TRUTH_K via configs/ms_marco_subset_embed.yaml.
  • Artefacts (passage_embeddings.npy, query_embeddings.npy, passage_ids.npy, query_ids.npy, ground_truth.npy, metadata.json) land in embeddings.OUTPUT_DIR (default /storage/ice-shared/cs8903onl/vectordb-retrieval/msmarco_v1_embeddings).
  • Use slurm_jobs/ms_marco_subsample_embed.sh for the GPU-backed PACE workflow.
