Vector DB Retrieval Guarantee Research

This repository provides a comprehensive framework for researching, benchmarking, and analyzing vector retrieval algorithms, with a special focus on retrieval guarantees. The codebase is designed to facilitate reproducible experiments and in-depth performance comparisons across various datasets and algorithm configurations.

Features

  • Extensible Algorithm Framework: Easily add new vector search algorithms by inheriting from a BaseAlgorithm class.
  • Automated Benchmark Suite: A single script (scripts/run_full_benchmark.py) to run a full suite of experiments across multiple datasets.
  • Modular Index/Search Pipelines: Combine any indexing strategy with any search strategy through declarative config (e.g., pair FAISS HNSW indexing with linear or FAISS searchers).
  • Expanded FAISS Coverage: Benchmark flat, IVF-Flat, IVF-PQ, IVF-SQ8, and stand-alone PQ indexes side by side without code changes by updating YAML configs.
  • Locality-Sensitive Hashing Baseline: Compare an LSH retriever (cosine or Euclidean) with tunable recall guarantees using the same declarative pipeline, including a FAISS-backed IndexLSH variant that reranks expanded candidate sets for improved recall.
  • Cover Tree Prototype: Run a lightweight cover tree baseline (random + subsampled GloVe) via configs/covertree_smoke.yaml to vet hierarchical metric search behavior; candidate/visit limits are currently disabled while we validate recall.
  • Standard Datasets: Built-in support for benchmark datasets like SIFT1M, GloVe, and MS MARCO (TF-IDF projection or pre-embedded Cohere vectors), with automated download and preprocessing.
  • Comprehensive Metrics: Tracks key performance indicators including recall, queries per second (QPS), index build time, and index memory usage.
  • Automated Reporting: Automatically generates detailed Markdown summary reports and raw JSON results for each benchmark run.

Project Structure

  • src/: Source code for the experimental framework.
    • algorithms/: Implementations of vector retrieval algorithms (e.g., ExactSearch, HNSW).
    • benchmark/: Utilities for dataset handling, evaluation, and metrics.
    • experiments/: The core experimental runner and configuration management.
  • scripts/: High-level scripts for automating experiments.
    • run_full_benchmark.py: The main entry point for running the full benchmark suite.
  • configs/: Directory for experiment configuration files (in YAML format).
  • data/: Default directory for storing downloaded and processed datasets (configurable via data_dir).
  • benchmark_results/: Default output directory for benchmark reports and raw results (configurable via output_dir).

Setup

# Create and activate a virtual environment (e.g., using conda)
conda create -n vectordb-env python=3.10
conda activate vectordb-env

# Install dependencies
pip install -r requirements.txt

Testing

Fast smoke checks are available via pytest. These lightweight algorithm/indexer tests run without requiring full dataset downloads.

pytest

# Skip FAISS-dependent tests if the backend is unavailable
pytest -m "not requires_faiss"

Running the Benchmark

The primary way to run experiments is using the full benchmark runner script.

1. Create Default Configuration

First, generate the default benchmark configuration file. This file defines which datasets and algorithms to test.

python scripts/run_full_benchmark.py --create-config

This will create configs/benchmark_config.yaml. You can edit this file to customize the benchmark run.
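The generated file drives the whole run. An illustrative skeleton of its top-level shape is shown below; regenerate the real defaults with --create-config, since exact keys and values may differ:

# Illustrative skeleton only; the generated default is authoritative.
data_dir: data                  # where datasets are downloaded and processed
output_dir: benchmark_results   # where reports and raw results are written

datasets:
  - name: glove50               # one entry per dataset to benchmark

algorithms:
  - name: exact                 # one entry per algorithm configuration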

2. Run the Benchmark Suite

Once the configuration is ready, launch the benchmark suite:

# Default config: still a work in progress
python scripts/run_full_benchmark.py --config configs/benchmark_config.yaml

# Known-good configs
python scripts/run_full_benchmark.py --config configs/benchmark_config_test1.yaml
python scripts/run_full_benchmark.py --config configs/benchmark_config_ms.yaml

The script will automatically download the required datasets if they are not found in the configured data_dir, run all experiments, and save the results under the configured output_dir.

PACE deployment note: the repository configuration (configs/benchmark_config.yaml) points to the shared storage locations:

  • Datasets: /storage/ice-shared/cs8903onl/vectordb-retrieval/datasets
  • Benchmark results: /storage/ice-shared/cs8903onl/vectordb-retrieval/results
  • MS MARCO assets generated by src/dataprep/:
    • Raw cache (ir_datasets): /storage/ice-shared/cs8903onl/vectordb-retrieval/ms_marco_v1_raw
    • Subsampled TSV export: /storage/ice-shared/cs8903onl/vectordb-retrieval/msmarco_v1_subsampled
    • Sentence-transformer embeddings: /storage/ice-shared/cs8903onl/vectordb-retrieval/msmarco_v1_embeddings

Adjust those paths if you are running on a different machine or prefer a different layout.

Dataset-specific Options

Dataset entries can carry bespoke options via the dataset_options key. The MS MARCO entry in configs/benchmark_config.yaml now points embedded_dataset_dir at the directory produced by src/dataprep/embed_msmarco.py, which contains passage_embeddings.npy, query_embeddings.npy, ground_truth.npy, and the corresponding ID files. The loader validates that ground-truth indices stay within bounds and will memory-map passage_embeddings.npy whenever use_memmap_cache is enabled. Keep ground_truth_k in sync with the value used during embedding and continue to adjust cache_dir, query_batch_size, or topk based on your evaluation budget.

Memory-bound runs: with the new .npy layout you still get the best mileage by leaving use_memmap_cache: true, which opens passage_embeddings.npy as a read-only memmap. Pair this with query_batch_size to bound how many vectors are searched at once.
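For reference, here is a hedged sketch of an MS MARCO dataset entry using the options named above; the entry name and all values are illustrative, so check the shipping configs/benchmark_config.yaml for the real defaults:

datasets:
  - name: msmarco                 # illustrative entry name
    dataset_options:
      embedded_dataset_dir: /storage/ice-shared/cs8903onl/vectordb-retrieval/msmarco_v1_embeddings
      use_memmap_cache: true      # memory-map passage_embeddings.npy instead of loading it
      ground_truth_k: 100         # illustrative; keep in sync with the embedding run
      cache_dir: data/cache       # illustrative cache location
      query_batch_size: 256       # bound how many query vectors are searched at once
      topk: 10                    # evaluation depth; adjust to your budget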

The glove50 loader also accepts new smoke-friendly knobs: test_size, test_limit, train_limit, ground_truth_k, and seed. They make it easy to subsample a few thousand base vectors plus a couple hundred queries—perfect for validating slower research prototypes such as the cover tree without processing all ~400k embeddings.
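A smoke-sized glove50 entry using those knobs might look like the following sketch (sizes and seed are illustrative):

datasets:
  - name: glove50
    dataset_options:
      train_limit: 20000    # subsample the base vectors
      test_size: 256        # number of queries to keep
      ground_truth_k: 10    # neighbors per query in the recomputed ground truth
      seed: 42              # fix the subsampling for reproducibility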

If you want to run the same dataset multiple times in a single benchmark job (for example, comparing different train_size values), set a unique experiment_name per dataset entry:

datasets:
  - name: random
    experiment_name: random_train32
    dataset_options:
      train_size: 32
      test_size: 8
      ground_truth_k: 5
      dimensions: 4
      seed: 101
  - name: random
    experiment_name: random_train64
    dataset_options:
      train_size: 64
      test_size: 8
      ground_truth_k: 5
      dimensions: 4
      seed: 101

Each experiment_name becomes its own output subdirectory and key in all_results.json, so runs do not overwrite each other.

CoverTree Smoke Run

To validate the CoverTree baseline end-to-end (random dataset first, then a subsampled GloVe split), run:

python scripts/run_full_benchmark.py --config configs/covertree_smoke.yaml

The config limits the random dataset to ~20k train / 512 queries and trims GloVe to ~20k train / 256 queries while switching the metric to cosine. It also mirrors the repo-standard paths from AGENTS.md, so all data reads come from /storage/ice-shared/cs8903onl/vectordb-retrieval/datasets (no fresh downloads) and results land in /storage/ice-shared/cs8903onl/vectordb-retrieval/results, right next to the other benchmark suites.

Memory-bound runs: set use_memmap_cache: true under dataset_options to stream large pre-embedded datasets (MS MARCO) through a memory-mapped file instead of materialising all passages in RAM.

  • The loader writes <dataset>_<digest>_train.memmap alongside JSON metadata inside cache_dir, while queries and ground truth stay in compact .npy files. This avoids the double copy previously required for np.vstack + FAISS warm-up and is especially helpful on PACE nodes with tight memory quotas.
  • You can still cap the working set via base_limit and query_limit, and lower batch_size if the parquet reader spikes memory during iteration.
  • Combine this with query_batch_size (global or per-dataset) to execute searches in controllable mini-batches and keep runtime under cluster limits.
  • Short on walltime? Disable strict relevance resolution (strict_relevance_resolution: false) and/or bound the parquet scan (max_passage_scan) so loading stops once the base_limit budget is filled; any missing positives are reported and skipped.
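Putting those knobs together, a hedged dataset_options sketch for a memory-bound MS MARCO run (all key names appear above; the values are purely illustrative):

dataset_options:
  use_memmap_cache: true              # stream passages through a read-only memmap
  base_limit: 500000                  # cap how many base passages are loaded
  query_limit: 2000                   # cap how many queries are loaded
  batch_size: 1024                    # parquet-reader batch size during iteration
  query_batch_size: 256               # search queries in controllable mini-batches
  strict_relevance_resolution: false  # report and skip unresolved positives
  max_passage_scan: 1000000           # stop scanning once base_limit is filled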

CoverTreeV2 Perfect-Recall Benchmark

Need airtight recall for CoverTree? Use the v2 implementation, which mirrors the prototype from feature/covertree but plugs into the benchmarking stack:

sbatch slurm_jobs/singlerun_nomsma_benchmarking_c_v2_pat.sbatch

The job spins up a uv environment, installs requirements.txt, and evaluates both CoverTree (with candidate limits disabled) and CoverTreeV2 alongside FAISS, HNSW, and LSH using configs/benchmark_nomsma_c_v2.yaml (random + GloVe datasets, topk=200). Results and plots land under benchmark_results/benchmark_<timestamp>/, and the SLURM log is written to slurm_jobs/slurm_logs/VectorDB-Retrieval-Guarantee_FULL-<jobid>-<node>.log. See methodology/covertree_v2_benchmarking.md for the latest recall/QPS snapshot and troubleshooting tips.

MS MARCO Subset and Embedding Pipeline

The legacy Cohere parquet dump is retired. We now derive a reproducible MS MARCO subset via the scripts under src/dataprep/, both of which read configs/ms_marco_subset_embed.yaml:

  1. python src/dataprep/subsample_msmarco.py
    • Uses ir_datasets to sample CORPUS_SAMPLE_SIZE passages and QUERY_SAMPLE_SIZE dev queries.
    • Writes corpus.tsv and queries.tsv to the configured subset.OUTPUT_DIR.
    • If IR_DATASETS_HOME is unset the script applies the default from the config (/storage/ice-shared/.../ms_marco_v1_raw) so repeated runs reuse the shared cache.
  2. python src/dataprep/embed_msmarco.py
    • Embeds the subsampled TSVs with SentenceTransformer(config.embeddings.MODEL_NAME) on CPU or GPU.
    • Filters queries without positives, aligns qrels, and saves passage_embeddings.npy, query_embeddings.npy, ground_truth.npy, the corresponding ID arrays, and metadata.json under embeddings.OUTPUT_DIR.

PACE users can trigger the same steps through slurm_jobs/ms_marco_subsample_generate.slurm.sh and slurm_jobs/ms_marco_subsample_embed.sh. Point dataset_options.embedded_dataset_dir at the resulting embedding directory to plug the artefacts into the benchmark suite.
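Both steps read configs/ms_marco_subset_embed.yaml; the sketch below shows its rough shape using only the keys referenced in this README (nesting and values are assumptions, so treat the shipped file as authoritative):

subset:
  CORPUS_SAMPLE_SIZE: 100000   # illustrative passage sample size
  QUERY_SAMPLE_SIZE: 1000      # illustrative dev-query sample size
  OUTPUT_DIR: /storage/ice-shared/cs8903onl/vectordb-retrieval/msmarco_v1_subsampled

embeddings:
  MODEL_NAME: sentence-transformers/all-MiniLM-L6-v2   # illustrative model choice
  GROUND_TRUTH_K: 100          # illustrative; mirror it in the benchmark config
  OUTPUT_DIR: /storage/ice-shared/cs8903onl/vectordb-retrieval/msmarco_v1_embeddings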

Modular Indexing & Searching

Each benchmark configuration can declare reusable indexers and searchers, then mix-and-match them per algorithm via references. For example, exact uses the brute_force_l2 indexer together with the linear_l2 searcher, while the MS MARCO override swaps in cosine-compatible variants. This structure lets you explore new combinations (e.g., FAISS IVF indexer + linear searcher) without touching code—just add a new entry under algorithms with the desired indexer_ref / searcher_ref or inline overrides.
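As a hedged illustration of that shape: the exact, brute_force_l2, and linear_l2 names and the indexer_ref/searcher_ref keys come from this README, while the per-entry fields and the faiss_ivf entry are hypothetical:

indexers:
  brute_force_l2:
    type: brute_force   # hypothetical field names
    metric: l2
  faiss_ivf:
    type: faiss_ivf     # hypothetical entry
    nlist: 128

searchers:
  linear_l2:
    type: linear        # hypothetical field names
    metric: l2

algorithms:
  - name: exact
    indexer_ref: brute_force_l2
    searcher_ref: linear_l2
  - name: ivf_linear    # a new combination, added without code changes
    indexer_ref: faiss_ivf
    searcher_ref: linear_l2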

The shipping configs/benchmark_config.yaml now illustrates this by instantiating multiple FAISS retrieval families called out in MethodsUnitVectorDB.pdf: IVF-Flat, IVF-PQ, IVF-SQ8, and a stand-alone PQ baseline, each plugged into the same evaluation harness for apples-to-apples comparisons.

Benchmark Results

After a benchmark run completes, you will find the following in the benchmark_results/benchmark_<timestamp>/ directory:

  • benchmark_summary.md: A human-readable summary of the results in Markdown format.
  • all_results.json: The complete raw results in JSON format.
  • benchmark.log: A detailed log of the entire benchmark run.

An example of the performance table from the summary report:

Algorithm Performance

| Algorithm | Recall@10 | QPS  | Mean Query Time (ms) | Build Time (s) | Index Memory (MB) |
|-----------|-----------|------|----------------------|----------------|-------------------|
| exact     | 1.0000    | 1.83 | 545.39               | 0.01           | 500.00            |
| hnsw      | …         | …    | …                    | …              | …                 |

Adding New Algorithms

To add a new algorithm for benchmarking (a minimal sketch follows the list below):

  1. Create a new Python file in src/algorithms/.
  2. Define a class that inherits from src.algorithms.BaseAlgorithm.
  3. Implement the required abstract methods:
    • build_index(self, vectors): To build the search index from a set of vectors.
    • search(self, query, k): To find the k nearest neighbors for a single query vector.
    • batch_search(self, queries, k): To find neighbors for a batch of query vectors.
  4. Add your new algorithm to the algorithms section in your benchmark_config.yaml file (reference the new lsh entry for a complete modular example).
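The snippet below is a minimal sketch of such a subclass, assuming BaseAlgorithm exposes exactly the three abstract methods listed above; the brute-force NumPy internals and the return convention (index arrays only) are illustrative, not the repository's actual implementation:

import numpy as np

from src.algorithms import BaseAlgorithm  # base class named in this README


class BruteForceL2(BaseAlgorithm):
    """Illustrative exact L2 retriever used only to demonstrate the interface."""

    def build_index(self, vectors):
        # Keep the raw vectors; a real algorithm would build its index structure here.
        self.vectors = np.asarray(vectors, dtype=np.float32)

    def search(self, query, k):
        # Rank every stored vector by squared L2 distance and return the top-k indices.
        dists = np.sum((self.vectors - np.asarray(query, dtype=np.float32)) ** 2, axis=1)
        return np.argsort(dists)[:k]

    def batch_search(self, queries, k):
        # Naive per-query loop; real implementations would vectorise or parallelise.
        return [self.search(q, k) for q in queries]

Once registered under algorithms in benchmark_config.yaml, the class runs through the same evaluation harness as the built-in retrievers.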

Embedding MS MARCO v1

We use the msmarco-v1-passage dataset from ir_datasets; see the ir_datasets documentation for more details.

Subsampling

  • Run python src/dataprep/subsample_msmarco.py to create a reproducible TSV snapshot driven by subset.* in configs/ms_marco_subset_embed.yaml.
  • The script expects IR_DATASETS_HOME to point at the raw MSMARCO cache; if it is absent the config default is applied automatically.
  • Outputs (corpus.tsv, queries.tsv) are written to subset.OUTPUT_DIR (default /storage/ice-shared/cs8903onl/vectordb-retrieval/msmarco_v1_subsampled).
  • On PACE you can launch the step with slurm_jobs/ms_marco_subsample_generate.slurm.sh.

Embedding

  • Run python src/dataprep/embed_msmarco.py to embed the snapshot and derive ground_truth.npy.
  • Configure the model, batch size, and GROUND_TRUTH_K via configs/ms_marco_subset_embed.yaml.
  • Artefacts (passage_embeddings.npy, query_embeddings.npy, passage_ids.npy, query_ids.npy, ground_truth.npy, metadata.json) land in embeddings.OUTPUT_DIR (default /storage/ice-shared/cs8903onl/vectordb-retrieval/msmarco_v1_embeddings).
  • Use slurm_jobs/ms_marco_subsample_embed.sh for the GPU-backed PACE workflow.
