IVF-SQ C++ API #1865

Open

viclafargue wants to merge 13 commits into rapidsai:main from viclafargue:ivf-sq

viclafargue (Contributor) commented Mar 3, 2026

Closes #1291.

Overview

IVF-SQ combines an inverted file (IVF) partitioning scheme with 8-bit scalar quantization (SQ8) of residuals. Each float32 dimension is compressed to a single uint8 code, giving a 4x memory reduction over IVF-Flat while retaining high recall. The index supports the L2, inner-product, and cosine distance metrics, float and half input data types, and filtering.

Build

  • K-Means clustering : A training subset (controlled by kmeans_trainset_fraction) is sampled from the dataset. Balanced K-Means is run on it to produce n_lists centroids that partition the vector space.
  • SQ parameter training : The training vectors are assigned to their nearest centroid and their residuals are computed. A custom CUDA kernel computes the per-dimension min/max of these residuals (a fast CUB reduction in shared memory). The observed range is then expanded by a 5% margin on each side to reduce clipping on unseen data, so the delta for each dimension is sq_delta[d] = (range + 2*margin) / 255. These two per-dimension parameters, sq_vmin (the lower end of the expanded range) and sq_delta (the scale, or quantization step), are stored in the index and are all that is needed to encode or decode any vector.
  • Data insertion : If add_data_on_build is true (the default), the full dataset is inserted via the extend path described below.

Extend

Extend adds new vectors to an existing index in batches, without retraining the centroids or the SQ parameters:

  • Cluster assignment : New vectors are assigned to their nearest centroid via K-Means predict. When adaptive centers are enabled, the centroids are also updated incrementally as new data arrives, and the center norms are recomputed.
  • List resizing : Per-list sizes are histogrammed and the IVF lists are grown to accommodate the new vectors, respecting the interleaved group alignment (kIndexGroupSize = 32).
  • Residual computation + SQ encoding : A first kernel computes the residuals. A second kernel quantizes each residual dimension to uint8 and writes the code into the interleaved list layout.

Search

Search proceeds in three stages:

  • Coarse search (GEMM-based) : All query-to-centroid distances are computed in a single batched GEMM (queries x centers^T), with metric-specific pre/post-processing. The top n_probes nearest clusters per query are selected via select_k.
  • Fine scan (ivf_sq_scan_kernel) : This is the performance-critical kernel. The grid is (n_queries, n_probes) with one block per (query, probe) pair:
    • Shared-memory precomputation : Per-dimension constants that are invariant across all vectors in the cluster (query[d], centroid[d], vmin[d] and delta[d]) are loaded into shared memory once per block, in a metric-specific form (the kernel is templated on the metric). This avoids redundant global memory reads in the hot loop.
    • Warp-coalesced vector reads : Each warp processes one interleaved group of 32 vectors. Within a dimension block, each lane loads a uint4 (16 bytes), so the 32 lanes together issue a fully coalesced 512-byte read (four 128-byte cache lines) per dimension block, which lets the scan approach full memory-bandwidth utilization.
    • Fused distance accumulation : Each lane accumulates the distance for its vector inline as it decodes each dimension block; there is no separate decode-then-compute pass.
  • Final top-k selection : Per-query distances from all probed lists are merged and the global top-k neighbors are selected via a second select_k call, followed by index postprocessing to map list-local positions back to original dataset IDs.

Benchmarks on B200

[Figure: ivf-sq-wiki-all-10M-bench]

[Figure: ivf-sq-deep-100m-bench]

viclafargue requested review from a team as code owners March 3, 2026 09:49
viclafargue self-assigned this Mar 3, 2026
viclafargue added the feature request, non-breaking, and C++ labels Mar 3, 2026
viclafargue mentioned this pull request Mar 11, 2026
jinsolp (Contributor) left a comment

Thanks @viclafargue! Sharing my first batch of comments for build and extend.
Aside from this, I also think ivf_sq.hpp needs some documentation overall!

auto orig_centroids_view = raft::make_device_matrix_view<const float, int64_t>(
index->centers().data_handle(), n_lists, dim);

constexpr size_t kReasonableMaxBatchSize = 65536;
Contributor

how is this determined? 👀

viclafargue (Contributor Author) commented Mar 13, 2026

This is an arbitrary value that is also used in other index types. I think build performance was historically not the main concern in cuVS, and this was probably a safe value that ensured we never ran out of memory on smaller systems. But thanks for pointing this out: it might be interesting to see whether we could determine this value better (e.g. from the dataset dimensions and available VRAM). cc @achirkin @tfeher

}
}

if (params.add_data_on_build) { detail::extend<T, IdxT>(handle, &idx, dataset, nullptr, n_rows); }
Contributor

Maybe check that adaptive_centers=true is not set when we're adding data on build? I think it would take us through unnecessary code paths in the extend function.

viclafargue (Contributor Author) commented Mar 13, 2026

I actually decided to drop the adaptive_centers config parameter and feature altogether, because moving the centroids without re-encoding the residuals stored in the lists is not a good idea, and updating the lists would be prohibitive in terms of performance. Addressed in 206cb2e.
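
The concern can be made concrete with the decoding rule implied by the PR description, where a stored vector i in list l decodes as its centroid plus the dequantized residual. If the centroid moves by Δ and the codes are left untouched, every reconstruction in that list moves by the same Δ:

```math
\hat{x}_i = c_l + v_{\min} + \mathrm{code}_i \odot \delta,
\qquad
c_l \to c_l + \Delta
\;\Longrightarrow\;
\hat{x}_i \to \hat{x}_i + \Delta \quad \text{for every } i \in l .
```

Keeping the index consistent would therefore require recomputing the residual and codes of every vector in the list, which is the cost referred to above.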

cjnolet (Member) commented Mar 13, 2026

@viclafargue can you please share the build-time speedups over Faiss IVFSQ on GPU and CPU? Those are going to be crucial, as the major value prop of cuVS over Faiss (both CPU and GPU) is index build speedup.

viclafargue requested a review from a team as a code owner March 13, 2026 16:32
Development

Successfully merging this pull request may close these issues.

[FEA] Support IVF-SQ index