Summary
When search_params.filtering_rate is left at its default value of -1.0, the CAGRA search dispatch in cagra.cuh auto-detects the filtering rate on every search call by launching a GPU reduction kernel and synchronizing the stream to read the result back to the host. This adds significant latency that is invisible to users.
Details
In cpp/src/neighbors/cagra.cuh, the bitset_filter dispatch path (lines 371-377) does:
if (params.filtering_rate < 0.0) {
const auto num_set_bits = sample_filter.bitset_view_.count(res); // GPU kernel + sync
auto filtering_rate = (float)(idx.data().n_rows() - num_set_bits) / idx.data().n_rows();
// ...
}
bitset_view_.count(res) launches a coalescedSumMediumKernel reduction on the GPU, copies the scalar result to the host, and synchronizes the stream. This happens on every cagra::search() call when filtering_rate < 0.
Impact
Benchmarking on an RTX 5090 at 1M vectors (dim=128, 50% pass rate, k=10, itopk_size=256):
| Condition |
Median latency |
QPS |
filtering_rate = -1.0 (default) |
1.111 ms |
90,009 |
filtering_rate = 0.5 (pre-set) |
0.944 ms |
105,911 |
The auto-detection adds ~18% overhead at this scale. The overhead is proportionally larger when the search kernel itself is fast (high selectivity, small dataset).
The actual CAGRA search kernel takes ~550 µs (confirmed via nsys profiling). The remaining ~450 µs in the default path is dominated by the popcount kernel launch, D2H copy, and stream synchronization.
Suggested fix
A few options (not mutually exclusive):
-
Cache the count in bitset_filter — compute it once at construction or on first use, store it as a member. This is the simplest fix and eliminates the per-call overhead entirely.
-
Document the overhead — add a note to the search_params::filtering_rate docstring explaining that the default triggers a per-call GPU reduction for bitset_filter, and recommend pre-setting it in latency-sensitive applications.
-
Lazy single-computation — compute on first call with filtering_rate < 0, cache the result, reuse on subsequent calls.
How I found this
While benchmarking a custom filter implementation against bitset_filter, I observed a consistent 5-33% "speedup" that I couldn't explain from the kernel code. Nsight Systems profiling revealed both search kernels had identical register usage (63), launch parameters, and GPU execution time (~1% difference). The entire measured difference was the bitset_view_.count() overhead in the default dispatch path.
Environment
- cuVS branch-25.12
- CUDA 12.4
- RTX 5090 (Blackwell, 170 SMs)
- WSL2 (Linux 6.6.87)
Summary
When
search_params.filtering_rateis left at its default value of-1.0, the CAGRA search dispatch incagra.cuhauto-detects the filtering rate on every search call by launching a GPU reduction kernel and synchronizing the stream to read the result back to the host. This adds significant latency that is invisible to users.Details
In
cpp/src/neighbors/cagra.cuh, thebitset_filterdispatch path (lines 371-377) does:bitset_view_.count(res)launches acoalescedSumMediumKernelreduction on the GPU, copies the scalar result to the host, and synchronizes the stream. This happens on everycagra::search()call whenfiltering_rate < 0.Impact
Benchmarking on an RTX 5090 at 1M vectors (dim=128, 50% pass rate, k=10, itopk_size=256):
filtering_rate = -1.0(default)filtering_rate = 0.5(pre-set)The auto-detection adds ~18% overhead at this scale. The overhead is proportionally larger when the search kernel itself is fast (high selectivity, small dataset).
The actual CAGRA search kernel takes ~550 µs (confirmed via nsys profiling). The remaining ~450 µs in the default path is dominated by the popcount kernel launch, D2H copy, and stream synchronization.
Suggested fix
A few options (not mutually exclusive):
Cache the count in
bitset_filter— compute it once at construction or on first use, store it as a member. This is the simplest fix and eliminates the per-call overhead entirely.Document the overhead — add a note to the
search_params::filtering_ratedocstring explaining that the default triggers a per-call GPU reduction forbitset_filter, and recommend pre-setting it in latency-sensitive applications.Lazy single-computation — compute on first call with
filtering_rate < 0, cache the result, reuse on subsequent calls.How I found this
While benchmarking a custom filter implementation against
bitset_filter, I observed a consistent 5-33% "speedup" that I couldn't explain from the kernel code. Nsight Systems profiling revealed both search kernels had identical register usage (63), launch parameters, and GPU execution time (~1% difference). The entire measured difference was thebitset_view_.count()overhead in the default dispatch path.Environment