@BBC-Esq
Contributor

@BBC-Esq BBC-Esq commented Feb 1, 2026

CUDA: Replace thrust::reduce with CUB DeviceReduce::Sum in primitives::sum

Summary

Inspired by the discussion in NVIDIA/cccl#520.

Replaces the CUDA implementation of primitives<Device::CUDA>::sum with cub::DeviceReduce::Sum using cudaMallocAsync for temp storage. Interface and behavior are unchanged.

While both Thrust and CUB use the same kernel backend, this removes the hidden sync cost of cudaMalloc in Thrust's default allocator path.
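For readers unfamiliar with the pattern, here is a minimal sketch of a CUB-based sum with stream-ordered temp storage. It is an illustration only, not the PR's actual diff: the helper name `device_sum`, its signature, and the `float`-only element type are assumptions, and `cudaMallocAsync`/`cudaFreeAsync` require CUDA 11.2 or newer.

```cuda
#include <cstddef>
#include <cub/cub.cuh>
#include <cuda_runtime.h>

// Hypothetical helper (not the project's real code): sums `n` floats stored in
// device memory at `d_in` and writes the result to the device scalar `d_out`.
// All work, including temp-storage allocation, is ordered on `stream`.
inline cudaError_t device_sum(const float* d_in, float* d_out, int n,
                              cudaStream_t stream) {
  void* d_temp = nullptr;
  size_t temp_bytes = 0;

  // With a null temp-storage pointer, CUB only reports the required size.
  cudaError_t err =
      cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, d_out, n, stream);
  if (err != cudaSuccess) return err;

  // Stream-ordered allocation from the device memory pool, used here in place
  // of a plain cudaMalloc.
  err = cudaMallocAsync(&d_temp, temp_bytes, stream);
  if (err != cudaSuccess) return err;

  // Second call launches the actual reduction on `stream`.
  err = cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, d_out, n, stream);

  // Release the temp storage in stream order, after the reduction.
  cudaError_t free_err = cudaFreeAsync(d_temp, stream);
  return err != cudaSuccess ? err : free_err;
}
```

The caller still has to synchronize the stream, or copy `d_out` back in stream order, before reading the result on the host.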

This PR updates only sum. Similar changes could be made for max, max_element, logsumexp, etc.

Happy to share benchmarks if helpful!

@jordimas
Collaborator

jordimas commented Feb 1, 2026

Close it. We will implement it manually with somebody who understands the context.

@jordimas jordimas closed this Feb 1, 2026
@BBC-Esq
Contributor Author

BBC-Esq commented Feb 1, 2026

??? Why was this closed? Was it incorrect, or are you now closing PRs whenever you think I don't understand, even though there's nothing wrong with them? I put a fair amount of time into this and would like to know for future pull requests. Thanks.
