performance mistakes in CUDA kernels #59
What are the most common performance mistakes in CUDA kernels?
Replies: 1 comment
Most slow CUDA kernels aren’t just “a little inefficient”; they usually violate one of a few core rules of how GPUs work. Here are the mistakes that show up again and again, with why they hurt and what to do instead.
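Problem: Adjacent threads in a warp read scattered memory locations (uncoalesced access)
Why it hurts: The hardware cannot coalesce the warp's accesses, so a single load instruction turns into many global memory transactions instead of one or a few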
Fix: Map adjacent threads → contiguous data
int i = blockIdx.x * blockDim.x + threadIdx.x;
float x = A[i]; // good: adjacent threads read adjacent elements
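To make the difference concrete, here is a minimal sketch that runs the same copy with a contiguous mapping and with a scattered one. The kernel names, the `compare_access_patterns` helper, and the `stride` parameter are all made up for illustration, and the timings it prints are rough and will vary by GPU, so treat it as something to experiment with rather than a benchmark.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Coalesced: thread i reads element i, so the 32 threads of a warp
// touch one contiguous span of floats.
__global__ void copy_coalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided (hypothetical bad mapping): adjacent threads read elements
// `stride` apart, so one warp's loads are spread across memory.
__global__ void copy_strided(const float* in, float* out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        long long j = (static_cast<long long>(i) * stride) % n;  // wrap inside the array
        out[i] = in[j];
    }
}

// Hypothetical helper: runs both kernels on n floats, prints rough timings,
// and returns true if both produced the expected output.
bool compare_access_patterns(int n, int stride) {
    std::vector<float> h_in(n), h_out(n);
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i);

    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, n * sizeof(float)) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, n * sizeof(float)) != cudaSuccess) {
        cudaFree(d_in);
        return false;
    }
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float ms_coalesced = 0.0f, ms_strided = 0.0f;
    bool ok = true;

    // Warm-up launch so one-time startup cost is not attributed to either kernel.
    copy_coalesced<<<blocks, threads>>>(d_in, d_out, n);
    cudaDeviceSynchronize();

    // Time the coalesced copy and verify out[i] == in[i].
    cudaEventRecord(start);
    copy_coalesced<<<blocks, threads>>>(d_in, d_out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms_coalesced, start, stop);
    cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i) ok = ok && (h_out[i] == h_in[i]);

    // Time the strided copy and verify out[i] == in[(i * stride) % n].
    cudaEventRecord(start);
    copy_strided<<<blocks, threads>>>(d_in, d_out, n, stride);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms_strided, start, stop);
    cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i) {
        long long j = (static_cast<long long>(i) * stride) % n;
        ok = ok && (h_out[i] == h_in[j]);
    }

    printf("coalesced: %.3f ms, stride %d: %.3f ms\n", ms_coalesced, stride, ms_strided);

    ok = ok && (cudaGetLastError() == cudaSuccess);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_in);
    cudaFree(d_out);
    return ok;
}
```

Calling something like `compare_access_patterns(1 << 20, 32)` from `main` and varying the stride is a quick way to see how the access pattern alone changes the runtime; for real measurements, a profiler such as Nsight Compute reports the memory transactions directly.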