@BBC-Esq BBC-Esq commented Feb 1, 2026

CUDA: Replace thrust::reduce with CUB DeviceReduce::Sum in primitives::sum

Summary

Inspired by the discussion in NVIDIA/cccl#520.

This PR switches the CUDA implementation of primitives<Device::CUDA>::sum from thrust::reduce to cub::DeviceReduce::Sum, using cudaMallocAsync/cudaFreeAsync for temporary storage. No interfaces change, and no other code paths are modified.

The reduction itself is performed by the same family of CUB kernels that Thrust ultimately dispatches to; calling CUB directly only removes the overhead of Thrust's wrapper layer.

This PR intentionally changes only "sum". Other Thrust-based reductions, such as "max", "max_element", or "logsumexp", could be migrated to CUB in the same way if desired.

Hope this helps! I can provide benchmarks on request.

@BBC-Esq BBC-Esq closed this Feb 1, 2026
@BBC-Esq BBC-Esq deleted the Is-Thrust-Outdated branch February 1, 2026 05:09