Skip to content

CAGRA search: default filtering_rate=-1.0 adds hidden popcount kernel + host sync on every call #1960

@maxwbuckley

Description

@maxwbuckley

Summary

When search_params.filtering_rate is left at its default value of -1.0, the CAGRA search dispatch in cagra.cuh auto-detects the filtering rate on every search call by launching a GPU reduction kernel and synchronizing the stream to read the result back to the host. This adds significant latency that is invisible to users.

Details

In cpp/src/neighbors/cagra.cuh, the bitset_filter dispatch path (lines 371-377) does:

if (params.filtering_rate < 0.0) {
    const auto num_set_bits = sample_filter.bitset_view_.count(res);  // GPU kernel + sync
    auto filtering_rate     = (float)(idx.data().n_rows() - num_set_bits) / idx.data().n_rows();
    // ...
}

bitset_view_.count(res) launches a coalescedSumMediumKernel reduction on the GPU, copies the scalar result to the host, and synchronizes the stream. This happens on every cagra::search() call when filtering_rate < 0.

Impact

Benchmarking on an RTX 5090 at 1M vectors (dim=128, 50% pass rate, k=10, itopk_size=256):

Condition Median latency QPS
filtering_rate = -1.0 (default) 1.111 ms 90,009
filtering_rate = 0.5 (pre-set) 0.944 ms 105,911

The auto-detection adds ~18% overhead at this scale. The overhead is proportionally larger when the search kernel itself is fast (high selectivity, small dataset).

The actual CAGRA search kernel takes ~550 µs (confirmed via nsys profiling). The remaining ~450 µs in the default path is dominated by the popcount kernel launch, D2H copy, and stream synchronization.

Suggested fix

A few options (not mutually exclusive):

  1. Cache the count in bitset_filter — compute it once at construction or on first use, store it as a member. This is the simplest fix and eliminates the per-call overhead entirely.

  2. Document the overhead — add a note to the search_params::filtering_rate docstring explaining that the default triggers a per-call GPU reduction for bitset_filter, and recommend pre-setting it in latency-sensitive applications.

  3. Lazy single-computation — compute on first call with filtering_rate < 0, cache the result, reuse on subsequent calls.

How I found this

While benchmarking a custom filter implementation against bitset_filter, I observed a consistent 5-33% "speedup" that I couldn't explain from the kernel code. Nsight Systems profiling revealed both search kernels had identical register usage (63), launch parameters, and GPU execution time (~1% difference). The entire measured difference was the bitset_view_.count() overhead in the default dispatch path.

Environment

  • cuVS branch-25.12
  • CUDA 12.4
  • RTX 5090 (Blackwell, 170 SMs)
  • WSL2 (Linux 6.6.87)

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    Status
    Todo

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions