
[REVIEW] Generalize and improve cagra::optimize#1830

Open
mfoerste4 wants to merge 28 commits into rapidsai:main from mfoerste4:cagra_optimize

Conversation

@mfoerste4
Contributor

@mfoerste4 mfoerste4 commented Feb 20, 2026

In preparation for large-scale graph creation, this PR makes several changes to cagra::optimize:

  • adding a full device path for pruning and merging, and discarding the host fallback code
  • fusing two pruning steps into one batch-able kernel, reducing memory requirements
  • batched reverse graph creation
  • batched merging
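
The generalized API described above can be sketched as follows. This is a hedged illustration, not the PR's actual code: the exact header paths, namespaces, and overload set (`cuvs::neighbors::cagra::optimize`, `raft::device_matrix_view`, `raft::make_host_matrix`) are assumptions based on the cuVS/raft API conventions; per the PR description, `knn_graph` and `cagra_graph` may each live in host, device, pinned, or managed memory.

```cuda
// Hedged sketch: mixing memory locations across the two graph arguments.
// Internally optimize buffers batches in device memory (see the PR
// description), so the caller is free to keep either graph on the host.
#include <cuvs/neighbors/cagra.hpp>
#include <raft/core/device_mdarray.hpp>
#include <raft/core/host_mdarray.hpp>
#include <raft/core/resources.hpp>

void build_pruned_graph(raft::resources const& res,
                        raft::device_matrix_view<uint32_t, int64_t> knn_graph,
                        int64_t n_rows,
                        int64_t out_degree)
{
  // Input kNN graph on the device, output CAGRA graph on the host: one of
  // the location combinations this PR now supports.
  auto cagra_graph = raft::make_host_matrix<uint32_t, int64_t>(n_rows, out_degree);
  cuvs::neighbors::cagra::optimize(res, knn_graph, cagra_graph.view());
}
```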

Thanks to the batching in all substeps, the memory footprint decreases even as computation time improves significantly.

The optimize API now supports all combinations of memory locations for knn_graph and cagra_graph.
Internally, the data is buffered in device memory for best performance, since directly accessing managed/pinned/HMM memory from the device showed severe performance degradation on first access (x86/H200 with HMM):

=== Benchmarks (256 MiB, 10 iterations) ===

  [malloc] 1. Copy to device:     133.783 ms total (10 iters) -> 18.69 GB/s
  [malloc] 2. Kernel direct read: 4648.468 ms total (10 iters) -> 0.54 GB/s
  [malloc] 3. Kernel subsequent read: 15.164 ms total (10 iters) -> 164.87 GB/s
  [cudaMalloc] 1. Copy to device:     1.294 ms total (10 iters) -> 1932.35 GB/s
  [cudaMalloc] 2. Kernel direct read: 14.945 ms total (10 iters) -> 167.28 GB/s
  [cudaMalloc] 3. Kernel subsequent read: 14.963 ms total (10 iters) -> 167.07 GB/s
  [cudaMallocHost] 1. Copy to device:     95.002 ms total (10 iters) -> 26.32 GB/s
  [cudaMallocHost] 2. Kernel direct read: 290.486 ms total (10 iters) -> 8.61 GB/s
  [cudaMallocHost] 3. Kernel subsequent read: 290.789 ms total (10 iters) -> 8.60 GB/s
  [cudaMallocManaged] 1. Copy to device:     136.737 ms total (10 iters) -> 18.28 GB/s
  [cudaMallocManaged] 2. Kernel direct read: 766.153 ms total (10 iters) -> 3.26 GB/s
  [cudaMallocManaged] 3. Kernel subsequent read: 15.002 ms total (10 iters) -> 166.65 GB/s
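
The measurement methodology above can be sketched roughly as below. This is an illustrative reconstruction, not the actual benchmark code: allocator coverage is reduced to `cudaMallocManaged`, and kernel/variable names are invented. It times (1) a copy into a `cudaMalloc`'d buffer, then (2) a first direct kernel read (which pays page migration for managed memory) and (3) a repeat read.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Grid-stride read of the whole buffer; the conditional store keeps the
// compiler from optimizing the reads away.
__global__ void read_kernel(const float* src, float* sink, size_t n)
{
  float acc = 0.f;
  for (size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
       i += (size_t)gridDim.x * blockDim.x) {
    acc += src[i];
  }
  if (acc != 0.f) *sink = acc;
}

int main()
{
  const size_t n = (256ull << 20) / sizeof(float);  // 256 MiB of floats
  float *managed, *dev, *sink;
  cudaMallocManaged(&managed, n * sizeof(float));
  cudaMalloc(&dev, n * sizeof(float));
  cudaMalloc(&sink, sizeof(float));

  cudaEvent_t t0, t1;
  cudaEventCreate(&t0);
  cudaEventCreate(&t1);
  float ms;

  // 1. Copy to a device-resident buffer.
  cudaEventRecord(t0);
  cudaMemcpy(dev, managed, n * sizeof(float), cudaMemcpyDefault);
  cudaEventRecord(t1);
  cudaEventSynchronize(t1);
  cudaEventElapsedTime(&ms, t0, t1);
  printf("[cudaMallocManaged] 1. Copy to device: %.3f ms\n", ms);

  // 2./3. Direct kernel reads: the first touch migrates pages to the GPU,
  // the subsequent read hits device-resident pages.
  for (int pass = 2; pass <= 3; ++pass) {
    cudaEventRecord(t0);
    read_kernel<<<1024, 256>>>(managed, sink, n);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    cudaEventElapsedTime(&ms, t0, t1);
    printf("[cudaMallocManaged] %d. Kernel read: %.3f ms\n", pass, ms);
  }
  return 0;
}
```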

The new kernels are based on experiments by @bpark-nvidia.

CC @tfeher , @irina-resh-nvda

@copy-pr-bot

copy-pr-bot bot commented Feb 20, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


@aamijar aamijar added non-breaking Introduces a non-breaking change feature request New feature or request labels Feb 24, 2026
@mfoerste4 mfoerste4 marked this pull request as ready for review March 2, 2026 23:35
@mfoerste4 mfoerste4 requested a review from a team as a code owner March 2, 2026 23:35
@mfoerste4 mfoerste4 requested a review from a team as a code owner March 10, 2026 22:50
@mfoerste4 mfoerste4 changed the title WIP: Prepare cagra::optimize for use with pinned memory [REVIEW] Prepare cagra::optimize for use with pinned memory Mar 10, 2026
@mfoerste4 mfoerste4 changed the title [REVIEW] Prepare cagra::optimize for use with pinned memory [REVIEW] Generalize and improve cagra::optimize Mar 10, 2026