CAGRA search: default filtering_rate=-1.0 adds hidden popcount kernel + host sync on every call

## Summary

When `search_params.filtering_rate` is left at its default value of `-1.0`, the CAGRA search dispatch in `cagra.cuh` auto-detects the filtering rate on **every search call** by launching a GPU reduction kernel and synchronizing the stream to read the result back to the host. This adds significant latency that is invisible to users.

## Details

In `cpp/src/neighbors/cagra.cuh`, the `bitset_filter` dispatch path (lines 371-377) does:

```cpp
if (params.filtering_rate < 0.0) {
    const auto num_set_bits = sample_filter.bitset_view_.count(res);  // GPU kernel + sync
    auto filtering_rate     = (float)(idx.data().n_rows() - num_set_bits) / idx.data().n_rows();
    // ...
}
```

`bitset_view_.count(res)` launches a `coalescedSumMediumKernel` reduction on the GPU, copies the scalar result to the host, and synchronizes the stream. This happens on every `cagra::search()` call when `filtering_rate < 0`.

## Impact

Benchmarking on an RTX 5090 at 1M vectors (dim=128, 50% pass rate, k=10, itopk_size=256):

| Condition | Median latency | QPS |
|-----------|---------------|-----|
| `filtering_rate = -1.0` (default) | 1.111 ms | 90,009 |
| `filtering_rate = 0.5` (pre-set) | 0.944 ms | 105,911 |

The auto-detection adds ~18% overhead at this scale. The overhead is proportionally larger when the search kernel itself is fast (high selectivity, small dataset).

The actual CAGRA search kernel takes ~550 µs (confirmed via nsys profiling). The remaining ~450 µs in the default path is dominated by the popcount kernel launch, D2H copy, and stream synchronization.

## Suggested fix

A few options (not mutually exclusive):

1. **Cache the count in `bitset_filter`** — compute it once at construction or on first use, store it as a member. This is the simplest fix and eliminates the per-call overhead entirely.

2. **Document the overhead** — add a note to the `search_params::filtering_rate` docstring explaining that the default triggers a per-call GPU reduction for `bitset_filter`, and recommend pre-setting it in latency-sensitive applications.

3. **Lazy single-computation** — compute on first call with `filtering_rate < 0`, cache the result, reuse on subsequent calls.

## How I found this

While benchmarking a custom filter implementation against `bitset_filter`, I observed a consistent 5-33% "speedup" that I couldn't explain from the kernel code. Nsight Systems profiling revealed both search kernels had identical register usage (63), launch parameters, and GPU execution time (~1% difference). The entire measured difference was the `bitset_view_.count()` overhead in the default dispatch path.

## Environment

- cuVS branch-25.12
- CUDA 12.4
- RTX 5090 (Blackwell, 170 SMs)
- WSL2 (Linux 6.6.87)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CAGRA search: default filtering_rate=-1.0 adds hidden popcount kernel + host sync on every call #1960

Summary

Details

Impact

Suggested fix

How I found this

Environment

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Condition	Median latency	QPS
`filtering_rate = -1.0` (default)	1.111 ms	90,009
`filtering_rate = 0.5` (pre-set)	0.944 ms	105,911

CAGRA search: default filtering_rate=-1.0 adds hidden popcount kernel + host sync on every call #1960

Description

Summary

Details

Impact

Suggested fix

How I found this

Environment

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions