Skip to content

ScaNN: Fix AVQ prefetch#1899

Open
rmaschal wants to merge 4 commits intorapidsai:mainfrom
rmaschal:fix-avq-prefetch
Open

ScaNN: Fix AVQ prefetch#1899
rmaschal wants to merge 4 commits intorapidsai:mainfrom
rmaschal:fix-avq-prefetch

Conversation

@rmaschal
Copy link
Contributor

@rmaschal rmaschal commented Mar 9, 2026

Switching to usage of modern RAFT in cuVS (#1837) introduced a bug where the prefetched gather for AVQ is performed using the stream associated with raft::device_resources rather than the provided stream for copying. This led to two issues:

  1. Elimination of the benefit for prefetching, as copies where scheduled on the same stream as other gpu work
  2. Possible recall loss. Synchronization was still performed against the copy stream, potentially allowing host to proceed before the prefetch copy is complete.

This PR sets the stream associated with the resource to the copy stream before prefetching, and back when done.

@rmaschal rmaschal requested a review from a team as a code owner March 9, 2026 20:18
@rmaschal rmaschal added improvement Improves an existing functionality non-breaking Introduces a non-breaking change labels Mar 9, 2026
@rmaschal rmaschal self-assigned this Mar 9, 2026
Copy link
Member

@aamijar aamijar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch @rmaschal, and LGTM!

Comment on lines +514 to +541
@@ -532,6 +537,8 @@ class cluster_loader {
raft::copy(res, cluster_vectors, raft::make_const_mdspan(pinned_cluster));
raft::resource::sync_stream(res, stream_);

// reset stream back to previous value
raft::resource::set_cuda_stream(res, stream);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can create a new resources object for no cost:

Suggested change
// For prefetching to overlap with other gpu work
// we need to schedule copies on the provided copy stream stream_
auto copy_res = raft::resources(stream_);
// htod
auto h_cluster_ids =
raft::make_pinned_vector_view<LabelT, int64_t>(cluster_ids_buf_.data_handle(), size);
@@ -532,6 +537,8 @@ class cluster_loader {
raft::copy(copy_res, cluster_vectors, raft::make_const_mdspan(pinned_cluster));
raft::resource::sync_stream(res, stream_);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I vote on this too. Although not very relevant here, another benefit of creating a temporary resources handle is the automatic reset of the changes if an exception is raised. However, I'd like to warn that there are caveats in doing this:

  1. I'd suggest to use a copy the resources handle rather than creating a new resources handle to inherit all other initialized resources from the current handle.
auto copy_res = raft::resources(res);
raft::resource::set_cuda_stream(res, stream);
  1. Even when you do a copy: if you've used any other not-yet-initialized resources from the copy_res handle, it would create those resources on every invocation of prefetch_cluster, which would incur significant overhead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the comments and info. I now copy the passed resources, set the stream, and use that for scheduling copies

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

improvement Improves an existing functionality non-breaking Introduces a non-breaking change

Projects

Development

Successfully merging this pull request may close these issues.

4 participants