What are the most common performance mistakes in CUDA kernels?

Most slow CUDA kernels aren’t just “a little inefficient”; they usually violate one of a few core rules of how GPUs work. Here are the mistakes that show up again and again, why they hurt, and what to do instead.

  1. Uncoalesced global memory access

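Problem: Adjacent threads in a warp read scattered memory locations.
Why it hurts: The hardware can no longer combine the warp's loads into a few wide transactions, so one request turns into many, and most of the bytes fetched in each transaction go unused.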

Fix:
Map consecutive threads to consecutive elements, so that each warp reads one contiguous range:

int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
A[i];  // good: consecutive threads touch consecutive elements
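
To make the contrast concrete, here is a minimal sketch that copies the same array twice: once with the thread-to-element mapping above, and once with a deliberately strided mapping. The kernel names, the helper `count_copy_mismatches`, the array size, and the stride of 33 are all made up for illustration and are not from the original answer. Both kernels copy every element exactly once (33 is odd, so `i * 33 mod n` is a permutation when `n` is a power of two), which means any difference in run time comes from the access pattern rather than the amount of work.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Coalesced: thread i touches element i, so the 32 threads of a warp
// access 32 consecutive floats (one contiguous 128-byte range).
__global__ void copy_coalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided (illustrative counter-example): neighbouring threads touch
// elements `stride` apart, so a warp's accesses are spread over many segments.
__global__ void copy_strided(const float* in, float* out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int j = (int)(((long long)i * stride) % n);  // permutation if gcd(stride, n) == 1
        out[j] = in[j];
    }
}

// Hypothetical helper: runs one of the two kernels and returns the number of
// elements that were copied incorrectly (0 on success, -1 on a CUDA error).
int count_copy_mismatches(bool coalesced) {
    const int n = 1 << 20;   // assumed size, power of two
    const int stride = 33;   // assumed stride, odd so it is coprime with n
    const size_t bytes = n * sizeof(float);

    std::vector<float> h_in(n), h_out(n, -1.0f);
    for (int i = 0; i < n; ++i) h_in[i] = (float)i;

    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return -1; }
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    if (coalesced) copy_coalesced<<<blocks, threads>>>(d_in, d_out, n);
    else           copy_strided<<<blocks, threads>>>(d_in, d_out, n, stride);

    cudaError_t err = cudaDeviceSynchronize();
    if (err == cudaSuccess)
        err = cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (h_out[i] != h_in[i]) ++mismatches;
    return mismatches;
}
```

Calling `count_copy_mismatches(true)` and `count_copy_mismatches(false)` from a `main` and running the binary under a profiler such as Nsight Compute should show the strided kernel issuing noticeably more memory transactions per request. How large the gap is depends on the GPU and its caches, so treat this as a pattern to measure rather than a fixed number.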

Answer selected by sudoUgando