IVF-SQ C++ API by viclafargue · Pull Request #1865 · rapidsai/cuvs

viclafargue · 2026-03-03T09:49:30Z

Closes #1291.

Overview

IVF-SQ combines an inverted file (IVF) partitioning scheme with 8-bit scalar quantization (SQ8) of residuals. Each float32 dimension is compressed to a single uint8 code, giving a 4x memory reduction over IVF-Flat while retaining high recall. The index supports several metrics (L2, inner product, and cosine distance), two input data types (float, half), and filtering.

Build

K-Means clustering : A training subset (controlled by kmeans_trainset_fraction) is sampled from the dataset. Balanced K-Means is run on it to produce n_lists centroids that partition the vector space.
SQ parameter training : The training vectors are assigned to their nearest centroid and residuals are computed. A custom CUDA kernel computes the per-dimension min/max of these residuals (using a fast CUB reduction in shared memory). The observed range is expanded by a 5% margin on each side to reduce clipping on unseen data. The delta for each dimension is computed as sq_delta[d] = (range + 2*margin) / 255. These two per-dimension parameters (sq_vmin, the lower end of the range, and sq_delta, the scale or quantization step) are stored in the index and are all that is needed to encode/decode any vector.
Data insertion : If add_data_on_build is true (the default), the full dataset is inserted via the extend path described below.
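The SQ parameter training step can be illustrated with a small host-side sketch. This is not the CUDA kernel from the PR: the `SqParams` struct and the `train_sq`, `encode` and `decode` helpers are hypothetical names. It also assumes that `sq_vmin` is the lower end of the expanded range (observed min minus the margin), that codes are rounded to nearest, and that a constant dimension falls back to a delta of 1.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical container for the two per-dimension SQ parameters.
struct SqParams {
  std::vector<float> vmin;   // lower end of the expanded range (assumption)
  std::vector<float> delta;  // quantization step
};

// residuals: row-major [n_rows x dim], i.e. vector minus its assigned centroid.
SqParams train_sq(const std::vector<float>& residuals, std::size_t dim,
                  float margin_frac = 0.05f)
{
  const std::size_t n_rows = residuals.size() / dim;
  std::vector<float> lo(dim, std::numeric_limits<float>::max());
  std::vector<float> hi(dim, std::numeric_limits<float>::lowest());
  // Per-dimension min/max over all training residuals.
  for (std::size_t i = 0; i < n_rows; ++i) {
    for (std::size_t d = 0; d < dim; ++d) {
      const float r = residuals[i * dim + d];
      lo[d] = std::min(lo[d], r);
      hi[d] = std::max(hi[d], r);
    }
  }
  SqParams p{std::vector<float>(dim), std::vector<float>(dim)};
  for (std::size_t d = 0; d < dim; ++d) {
    const float range  = hi[d] - lo[d];
    const float margin = margin_frac * range;  // 5% on each side by default
    p.vmin[d]  = lo[d] - margin;
    p.delta[d] = (range + 2.0f * margin) / 255.0f;
    // Assumption: avoid a zero step for constant dimensions.
    if (p.delta[d] <= 0.0f) { p.delta[d] = 1.0f; }
  }
  return p;
}

// Quantize one residual component to a uint8 code, clamping out-of-range values.
std::uint8_t encode(float r, float vmin, float delta)
{
  const float code = std::round((r - vmin) / delta);
  return static_cast<std::uint8_t>(std::min(std::max(code, 0.0f), 255.0f));
}

// Reconstruct an approximate residual component from its code.
float decode(std::uint8_t code, float vmin, float delta)
{
  return vmin + static_cast<float>(code) * delta;
}
```

With round-to-nearest, the reconstruction error of an in-range value is at most `delta / 2` per dimension; values outside the expanded range are clipped to code 0 or 255.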

Extend

Extend adds new vectors to an existing index in batches, without retraining the centroids or the SQ parameters:

Cluster assignment : New vectors are assigned to their nearest centroid via K-Means predict. ~~(adaptive centers : when enabled, centroids are incrementally updated as new data arrives, and center norms are recomputed)~~
List resizing : Per-list sizes are histogrammed and the IVF lists are grown to accommodate the new vectors, respecting the interleaved group alignment (kIndexGroupSize = 32).
Residual computation + SQ encoding : A first kernel computes residuals. A second kernel quantizes each residual dimension to uint8 and writes the code into the interleaved list layout.
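The list-resizing step can be sketched on the host as follows. It only illustrates the histogram and the alignment constraint: `round_up_to_group` and `grown_list_capacities` are hypothetical helpers, and the actual allocation policy in cuVS may reserve more than this minimum.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::uint32_t kIndexGroupSize = 32;

// Smallest multiple of kIndexGroupSize that is >= n.
constexpr std::uint32_t round_up_to_group(std::uint32_t n)
{
  return (n + kIndexGroupSize - 1) / kIndexGroupSize * kIndexGroupSize;
}

// old_sizes[l]: current number of vectors in list l.
// labels[i]:    list assigned to new vector i by the cluster-assignment step.
// Returns the minimum capacity of each list after the extend.
std::vector<std::uint32_t> grown_list_capacities(
  const std::vector<std::uint32_t>& old_sizes,
  const std::vector<std::uint32_t>& labels)
{
  // Histogram the new vectors on top of the existing list sizes.
  std::vector<std::uint32_t> new_sizes = old_sizes;
  for (std::uint32_t label : labels) { ++new_sizes[label]; }

  // Each list must hold a whole number of interleaved groups.
  std::vector<std::uint32_t> capacities(new_sizes.size());
  for (std::size_t l = 0; l < new_sizes.size(); ++l) {
    capacities[l] = round_up_to_group(new_sizes[l]);
  }
  return capacities;
}
```

For example, a list that grows from 30 to 33 vectors needs two groups, so its capacity becomes 64.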

Search

Search proceeds in three stages:

Coarse search (GEMM-based) : All query-to-centroid distances are computed in a single batched GEMM (queries x centers^T), with metric-specific pre/post-processing. The top n_probes nearest clusters per query are selected via select_k.
Fine scan (ivf_sq_scan_kernel) : This is the performance-critical kernel. The grid is (n_queries, n_probes) with one block per (query, probe) pair:
- Shared-memory precomputation : Per-dimension constants that are invariant across all vectors in the cluster (query[d], centroid[d], vmin[d] and delta[d]) are pre-loaded into shared memory once per block in a metric-specific fashion (metric templated). This avoids redundant global memory reads in the hot loop.
- Warp-coalesced vector reads : Each warp processes one interleaved group of 32 vectors. Within a dimension block, each lane loads a uint4 (16 bytes), so the 32 lanes together issue a fully coalesced 512-byte read (4 cache lines) per dimension block, achieving full memory-bandwidth utilization.
- Fused distance accumulation : Each lane accumulates the distance for its vector inline as it decodes each dimension block -- there is no separate decode-then-compute pass.
Final top-k selection : Per-query distances from all probed lists are merged and the global top-k neighbors are selected via a second select_k call, followed by index postprocessing to map list-local positions back to original dataset IDs.
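The idea behind the shared-memory precomputation and the fused distance accumulation can be shown with a scalar CPU reference for the L2 metric. This is a sketch, not the `ivf_sq_scan_kernel` itself: it ignores the interleaved layout and the warp-level loads, and `L2Precomp`, `precompute_l2` and `fused_l2` are hypothetical names. It assumes a vector is reconstructed as `centroid[d] + vmin[d] + code * delta[d]`, so the terms that do not depend on the code can be folded into a single per-dimension offset.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Per-dimension constants shared by every vector of one (query, probe) pair.
struct L2Precomp {
  std::vector<float> offset;  // query[d] - centroid[d] - vmin[d]
  std::vector<float> delta;   // quantization step
};

// Done once per (query, probe) pair; in the kernel this would live in shared memory.
L2Precomp precompute_l2(const std::vector<float>& query,
                        const std::vector<float>& centroid,
                        const std::vector<float>& vmin,
                        const std::vector<float>& delta)
{
  L2Precomp pre{std::vector<float>(query.size()), delta};
  for (std::size_t d = 0; d < query.size(); ++d) {
    pre.offset[d] = query[d] - centroid[d] - vmin[d];
  }
  return pre;
}

// Squared L2 distance between the query and one encoded vector.
// Each code is decoded and consumed in the same loop iteration, so the
// reconstructed vector is never materialized.
float fused_l2(const L2Precomp& pre, const std::uint8_t* codes)
{
  float acc = 0.0f;
  for (std::size_t d = 0; d < pre.offset.size(); ++d) {
    const float diff = pre.offset[d] - static_cast<float>(codes[d]) * pre.delta[d];
    acc += diff * diff;
  }
  return acc;
}
```

With this formulation the inner loop costs one multiply-subtract and one multiply-add per dimension. Inner-product and cosine distance would need different precomputed constants, which is consistent with the kernel being specialized per metric.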

Benchmarks on B200

jinsolp

Thanks @viclafargue ! Sharing my first batch of comments for build and extend.
Aside from this, I also think ivf_sq.hpp overall needs some documentation!

cpp/include/cuvs/neighbors/ivf_sq.hpp