Improve Primitives a Little Bit #2001

BBC-Esq · 2026-02-01T04:41:31Z

CUDA: Replace thrust::reduce with CUB DeviceReduce::Sum in primitives::sum

Summary

Inspired by the discussion in NVIDIA/cccl#520.

This PR switches the CUDA implementation of primitives<Device::CUDA>::sum from thrust::reduce to cub::DeviceReduce::Sum, using cudaMallocAsync/cudaFreeAsync for the temporary storage. No interfaces change. No other code paths are modified.

The underlying GPU reduction kernel comes from the same family of CUB kernels that Thrust would eventually dispatch to; calling CUB directly just skips the Thrust wrapper layer and its overhead.
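
For readers unfamiliar with the pattern, here is a rough sketch of the two-phase CUB call with stream-ordered temporary storage. It is not the PR's actual diff: `device_sum` and its parameters are hypothetical names, the element type is fixed to float for brevity, and it assumes CUDA 11.2 or newer (required for cudaMallocAsync/cudaFreeAsync).

```cuda
#include <cub/cub.cuh>
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical helper, for illustration only (not the code in this PR).
// Sums n floats from d_in and writes the result to d_out[0].
// Both d_in and d_out are device pointers; all work is enqueued on `stream`.
cudaError_t device_sum(const float* d_in, float* d_out, int n, cudaStream_t stream) {
  void* d_temp = nullptr;
  size_t temp_bytes = 0;

  // Phase 1: with a null workspace pointer, CUB only reports the size it needs.
  cudaError_t err = cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, d_out, n, stream);
  if (err != cudaSuccess) return err;

  // Stream-ordered allocation of the workspace.
  err = cudaMallocAsync(&d_temp, temp_bytes, stream);
  if (err != cudaSuccess) return err;

  // Phase 2: the same call with a real workspace launches the reduction.
  err = cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, d_out, n, stream);

  // Stream-ordered free: the workspace is released after prior work on
  // this stream (including the reduction) has completed.
  cudaError_t free_err = cudaFreeAsync(d_temp, stream);
  return err != cudaSuccess ? err : free_err;
}
```

The result stays in device memory, so the caller decides when to copy it back and synchronize. thrust::reduce, by contrast, returns the value to the host directly.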

This PR intentionally changes only "sum." Other Thrust-based reductions, such as "max," "max_element," or "logsumexp," can be migrated to CUB later if desired.

Hope this helps! I can provide some benchmarks if you want.

Improve Primitives a Little Bit

114d903

BBC-Esq closed this Feb 1, 2026

BBC-Esq deleted the Is-Thrust-Outdated branch February 1, 2026 05:09
