For input sizes of about 2^27 or larger, the `gpu__accumulate_previous_coupled_preblocks_` call starts to heavily dominate the runtime of the scan, according to `CUDA.@profile` on my system (tested with int32 and float32).
For example, using 2^27 elements,
- 3.37 ms are spent in the top-level `gpu__accumulate_block_`
- 13.11 µs are spent in the recursed `gpu__accumulate_block_`
- 11.16 ms are spent in `gpu__accumulate_previous_coupled_preblocks_`
Since `accumulate_previous_coupled_preblocks` is essentially just a vectorized add, it should not be this slow. The problem gets much worse for larger vectors: for 2^30 elements, `gpu__accumulate_previous_coupled_preblocks_` takes 580 ms, which is 92% of the total accumulate time on my system.
Compared against a simple C++ cub reference implementation, AcceleratedKernels.jl keeps up with cub performance very well for smaller inputs, but then suddenly falls off a cliff at these larger sizes. The cub reference takes ~12 ms for 2^30 elements, and an alpaka3 implementation of the same coupled lookback takes ~27 ms at this size.
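For reference, this is roughly how I measure the scan; the exact `accumulate!` signature (in-place vs. destination form, the `init` keyword) is assumed from the AcceleratedKernels.jl docs and may differ between versions, so treat this as a sketch rather than the exact benchmark script:

```julia
using CUDA
import AcceleratedKernels as AK

n = 2^27                                  # slowdown appears around this size
v = CuArray(rand(Int32(1):Int32(100), n)) # int32 input; float32 behaves the same
dst = similar(v)

# Warm-up call so compilation time does not show up in the profile
AK.accumulate!(+, dst, v; init = Int32(0))

# Per-kernel timings above come from the profiler's kernel summary
CUDA.@profile AK.accumulate!(+, dst, v; init = Int32(0))
```

Switching `n` to 2^30 reproduces the 580 ms case in the kernel summary.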