
[REVIEW] Generalize and improve cagra::optimize#1830

Open
mfoerste4 wants to merge 28 commits into rapidsai:main from mfoerste4:cagra_optimize

Conversation

@mfoerste4
Contributor

@mfoerste4 mfoerste4 commented Feb 20, 2026

In preparation for large-scale graph creation, this PR makes several changes to cagra::optimize:

  • adding a full device path for pruning and merging, and discarding the host fallback code
  • fusing two pruning steps into one batch-able kernel, reducing memory requirements
  • batched reverse graph creation
  • batched merging
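
The generalized API described above can be sketched as follows. This is a hedged illustration, not the PR's actual code: the exact header paths, namespaces, and overload set (`cuvs::neighbors::cagra::optimize`, `raft::device_matrix_view`, `raft::make_host_matrix`) are assumptions based on the cuVS/raft API conventions; per the PR description, `knn_graph` and `cagra_graph` may each live in host, device, pinned, or managed memory.

```cuda
// Hedged sketch: mixing memory locations across the two graph arguments.
// Internally optimize buffers batches in device memory (see the PR
// description), so the caller is free to keep either graph on the host.
#include <cuvs/neighbors/cagra.hpp>
#include <raft/core/device_mdarray.hpp>
#include <raft/core/host_mdarray.hpp>
#include <raft/core/resources.hpp>

void build_pruned_graph(raft::resources const& res,
                        raft::device_matrix_view<uint32_t, int64_t> knn_graph,
                        int64_t n_rows,
                        int64_t out_degree)
{
  // Input kNN graph on the device, output CAGRA graph on the host: one of
  // the location combinations this PR now supports.
  auto cagra_graph = raft::make_host_matrix<uint32_t, int64_t>(n_rows, out_degree);
  cuvs::neighbors::cagra::optimize(res, knn_graph, cagra_graph.view());
}
```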

Thanks to the batching in all substeps, the memory footprint decreases even as computation time improves significantly.

The optimize API now supports all combinations of memory locations for knn_graph and cagra_graph.
Internally, the data is buffered in device memory for best performance, since directly accessing managed/pinned/HMM memory from the device showed severe performance degradation on first access (x86/H200 with HMM):

=== Benchmarks (256 MiB, 10 iterations) ===

  [malloc] 1. Copy to device:     133.783 ms total (10 iters) -> 18.69 GB/s
  [malloc] 2. Kernel direct read: 4648.468 ms total (10 iters) -> 0.54 GB/s
  [malloc] 3. Kernel subsequent read: 15.164 ms total (10 iters) -> 164.87 GB/s
  [cudaMalloc] 1. Copy to device:     1.294 ms total (10 iters) -> 1932.35 GB/s
  [cudaMalloc] 2. Kernel direct read: 14.945 ms total (10 iters) -> 167.28 GB/s
  [cudaMalloc] 3. Kernel subsequent read: 14.963 ms total (10 iters) -> 167.07 GB/s
  [cudaMallocHost] 1. Copy to device:     95.002 ms total (10 iters) -> 26.32 GB/s
  [cudaMallocHost] 2. Kernel direct read: 290.486 ms total (10 iters) -> 8.61 GB/s
  [cudaMallocHost] 3. Kernel subsequent read: 290.789 ms total (10 iters) -> 8.60 GB/s
  [cudaMallocManaged] 1. Copy to device:     136.737 ms total (10 iters) -> 18.28 GB/s
  [cudaMallocManaged] 2. Kernel direct read: 766.153 ms total (10 iters) -> 3.26 GB/s
  [cudaMallocManaged] 3. Kernel subsequent read: 15.002 ms total (10 iters) -> 166.65 GB/s
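
The measurement methodology above can be sketched roughly as below. This is an illustrative reconstruction, not the actual benchmark code: allocator coverage is reduced to `cudaMallocManaged`, and kernel/variable names are invented. It times (1) a copy into a `cudaMalloc`'d buffer, then (2) a first direct kernel read (which pays page migration for managed memory) and (3) a repeat read.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Grid-stride read of the whole buffer; the conditional store keeps the
// compiler from optimizing the reads away.
__global__ void read_kernel(const float* src, float* sink, size_t n)
{
  float acc = 0.f;
  for (size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
       i += (size_t)gridDim.x * blockDim.x) {
    acc += src[i];
  }
  if (acc != 0.f) *sink = acc;
}

int main()
{
  const size_t n = (256ull << 20) / sizeof(float);  // 256 MiB of floats
  float *managed, *dev, *sink;
  cudaMallocManaged(&managed, n * sizeof(float));
  cudaMalloc(&dev, n * sizeof(float));
  cudaMalloc(&sink, sizeof(float));

  cudaEvent_t t0, t1;
  cudaEventCreate(&t0);
  cudaEventCreate(&t1);
  float ms;

  // 1. Copy to a device-resident buffer.
  cudaEventRecord(t0);
  cudaMemcpy(dev, managed, n * sizeof(float), cudaMemcpyDefault);
  cudaEventRecord(t1);
  cudaEventSynchronize(t1);
  cudaEventElapsedTime(&ms, t0, t1);
  printf("[cudaMallocManaged] 1. Copy to device: %.3f ms\n", ms);

  // 2./3. Direct kernel reads: the first touch migrates pages to the GPU,
  // the subsequent read hits device-resident pages.
  for (int pass = 2; pass <= 3; ++pass) {
    cudaEventRecord(t0);
    read_kernel<<<1024, 256>>>(managed, sink, n);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    cudaEventElapsedTime(&ms, t0, t1);
    printf("[cudaMallocManaged] %d. Kernel read: %.3f ms\n", pass, ms);
  }
  return 0;
}
```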

The new kernels are based on experiments by @bpark-nvidia.

CC @tfeher , @irina-resh-nvda

@copy-pr-bot

copy-pr-bot bot commented Feb 20, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


@aamijar aamijar added non-breaking Introduces a non-breaking change feature request New feature or request labels Feb 24, 2026
@mfoerste4 mfoerste4 marked this pull request as ready for review March 2, 2026 23:35
@mfoerste4 mfoerste4 requested a review from a team as a code owner March 2, 2026 23:35
@mfoerste4 mfoerste4 requested a review from a team as a code owner March 10, 2026 22:50
@mfoerste4 mfoerste4 changed the title WIP: Prepare cagra::optimize for use with pinned memory [REVIEW] Prepare cagra::optimize for use with pinned memory Mar 10, 2026
@mfoerste4 mfoerste4 changed the title [REVIEW] Prepare cagra::optimize for use with pinned memory [REVIEW] Generalize and improve cagra::optimize Mar 10, 2026