Back to Table of Content | Previous: Necessary Steps for Host-Device Heterogeneous Code | Next:Matrix Multiplication Example
CUDA is an API with minimal extensions to C
- `__device__`: Marks a function that runs on the device (GPU) and can be called only from other `__device__` or `__global__` functions; applied to a variable, it places the variable in device memory.
- `__global__`: Marks a function that is called from the host and runs on the device. Such a function is a kernel and must be launched with the triple angle bracket syntax.
- `__shared__`: Marks a variable that is shared among the threads of one block, which lets them communicate and coordinate efficiently.
- `gridDim`: Dimensions of the grid, in blocks. There is no built-in grid index, because each kernel launch executes as a single grid.
- `blockIdx`: Index of the current block within the grid.
- `blockDim`: Dimensions of the block, in threads.
- `threadIdx`: Index of the current thread within the block.
- `__syncthreads()`: A barrier synchronization function that ensures all threads in a block have reached the same point in the code before any of them proceeds.
- `cudaMalloc(...)`: Allocates memory on the device.
- `cudaMemcpy(...)`: Copies memory between host and device.
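- Kernel launch syntax: `kernelname<<<gridspec, tbspec>>>(args)`, where `gridspec` gives the grid dimensions (number of blocks) and `tbspec` gives the thread-block dimensions (threads per block).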
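The pieces listed above can be combined into a small vector addition. This is a minimal sketch rather than a reference implementation: the names `vecAdd` and `runVecAdd`, the block size of 256 and the test data are all assumptions chosen for illustration.

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

// Kernel: each thread adds one pair of elements.
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    // Global index of this thread across the whole grid.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)   // the last block may contain threads past the end
        c[i] = a[i] + b[i];
}

// Hypothetical host driver: returns the number of wrong results,
// or -1 if the copy back from the device failed.
int runVecAdd(int n)
{
    size_t bytes = n * sizeof(float);
    float *hA = (float *)malloc(bytes);
    float *hB = (float *)malloc(bytes);
    float *hC = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { hA[i] = (float)i; hB[i] = 2.0f * i; }

    // Allocate device (global) memory and copy the inputs over.
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc((void **)&dA, bytes);
    cudaMalloc((void **)&dB, bytes);
    cudaMalloc((void **)&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    // Launch enough blocks of 256 threads to cover all n elements.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(dA, dB, dC, n);

    // This blocking copy waits until the kernel has finished.
    cudaError_t err = cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);

    int wrong = 0;
    if (err != cudaSuccess) {
        wrong = -1;
    } else {
        for (int i = 0; i < n; ++i)
            if (hC[i] != 3.0f * i) ++wrong;
    }

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB); free(hC);
    return wrong;
}
```

Calling `runVecAdd(1000)` from `main` and checking for a return value of 0 is one way to exercise it. The bounds test `i < n` matters because the grid is rounded up to a whole number of blocks.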
CUDA Function Declarations
| Declaration Type | Executed on the | Callable from the | Notes |
|---|---|---|---|
| `__device__ float DeviceFunc()` | Device | Device | Can only be called/executed from the device. |
| `__global__ void KernelFunc()` | Device | Host | Defines a kernel function; must return void. |
| `__host__ float HostFunc()` | Host | Host | Normal C/C++ function. Default. |
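- `__device__` and `__host__` can be used together, in which case the function is compiled for both the host and the device.
- The address of a `__device__` function cannot be taken in host code.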
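The three function qualifiers can be seen side by side in a short sketch. The function names (`square`, `squarePlusOne`, `applyKernel`, `runOnDevice`) are hypothetical and exist only to illustrate who may call whom.

```cuda
#include <cuda_runtime.h>

// __host__ __device__: compiled twice, once for the CPU and once for the GPU.
__host__ __device__ float square(float x) { return x * x; }

// __device__: runs on the GPU and is callable only from device code.
__device__ float squarePlusOne(float x) { return square(x) + 1.0f; }

// __global__: a kernel; launched from the host, runs on the GPU, returns void.
__global__ void applyKernel(float *out, float x)
{
    *out = squarePlusOne(x);
}

// No qualifier: an ordinary host function (implicitly __host__).
// Returns the value computed on the device, or -1.0f if allocation fails.
float runOnDevice(float x)
{
    float *dOut = nullptr;
    float hOut = -1.0f;
    if (cudaMalloc((void **)&dOut, sizeof(float)) != cudaSuccess)
        return -1.0f;
    applyKernel<<<1, 1>>>(dOut, x);   // a single thread is enough here
    cudaMemcpy(&hOut, dOut, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dOut);
    return hOut;
}
```

Here `square(3.0f)` can be called directly on the host, while `runOnDevice(3.0f)` reaches the same function through the kernel and the `__device__` helper. Because a kernel must return `void`, its result comes back through a pointer to device memory.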
| Qualifier | Memory | Scope | Lifetime |
|---|---|---|---|
| `__device__ __local__ int LocalVar;` | local | thread | thread |
| `__device__ __shared__ int SharedVar;` | shared | block | block |
| `__device__ int GlobalVar;` | global | grid | application |
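The scope and lifetime columns can be illustrated with a small sketch. It is written under stated assumptions: the names `launchCount`, `countKernel` and `runCounting` are hypothetical, only `__device__` and `__shared__` are declared explicitly, and the per-thread value is a plain automatic variable.

```cuda
#include <cuda_runtime.h>

// Global memory: visible to every thread of every grid, and it keeps its
// value between kernel launches for the lifetime of the application.
__device__ int launchCount = 0;

__global__ void countKernel(int *blockSums)
{
    // Shared memory: one copy per block, alive only while the block runs.
    __shared__ int blockTotal;

    // Automatic scalar: private to this thread, normally held in a register.
    int mine = threadIdx.x + 1;

    if (threadIdx.x == 0) blockTotal = 0;
    __syncthreads();               // blockTotal is initialised before any add
    atomicAdd(&blockTotal, mine);  // threads of the block update the shared copy
    __syncthreads();               // all adds are finished before it is read

    if (threadIdx.x == 0) {
        blockSums[blockIdx.x] = blockTotal;
        if (blockIdx.x == 0) launchCount += 1;  // a single thread per grid writes
    }
}

// Hypothetical host driver: launches the kernel twice with 2 blocks of
// 4 threads, stores the per-block sums of the last launch in sums[0..1]
// and returns the value of the global launch counter.
int runCounting(int sums[2])
{
    int zero = 0, count = -1;
    int *dSums = nullptr;
    cudaMemcpyToSymbol(launchCount, &zero, sizeof(int));  // reset the counter
    if (cudaMalloc((void **)&dSums, 2 * sizeof(int)) != cudaSuccess)
        return -1;
    countKernel<<<2, 4>>>(dSums);
    countKernel<<<2, 4>>>(dSums);
    cudaMemcpy(sums, dSums, 2 * sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpyFromSymbol(&count, launchCount, sizeof(int));
    cudaFree(dSums);
    return count;
}
```

With 4 threads per block, each block's shared total is 1 + 2 + 3 + 4 = 10 and starts afresh in every block, whereas the global counter survives both launches and ends at 2.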
- `__device__` is optional when used with `__local__` or `__shared__`.
- Automatic variables without any qualifier reside in registers:
  - Except arrays, which the compiler may place in local memory. Avoid this: declare every array explicitly as shared (`__shared__`) or global (`__device__`) so that it is clear whether access is fast or slow.
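- Pointers passed from the host to a kernel can only point to global memory, that is, memory allocated with `cudaMalloc(...)` or declared with `__device__`. The host cannot obtain a pointer into shared memory; device code can take the address of a `__shared__` variable, but that pointer is meaningful only within the owning block.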
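- `__device__`: resides in global memory.
- `__shared__`: resides in shared memory.
- `__local__`: private to a thread, but it may be held either in a register or in (cached) global memory; its access time is therefore unpredictable, so it is not recommended. Current CUDA toolkits do not document a `__local__` keyword; the compiler chooses this placement itself.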
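The advice to declare arrays explicitly can be sketched as follows, assuming a single block of `N` threads. `N`, `reverseBlock` and `runReverse` are hypothetical names: the kernel receives a pointer to global memory obtained from `cudaMalloc`, stages the data in an explicitly `__shared__` array, and writes it back reversed.

```cuda
#include <cuda_runtime.h>

#define N 64  // hypothetical block size; the kernel assumes exactly N threads

// data points to global memory that the host allocated with cudaMalloc.
__global__ void reverseBlock(int *data)
{
    // Explicitly shared array: its placement and speed are known.
    __shared__ int tile[N];

    int t = threadIdx.x;
    tile[t] = data[t];          // global -> shared
    __syncthreads();            // the whole tile is loaded before anyone reads
    data[t] = tile[N - 1 - t];  // shared -> global, in reverse order
}

// Hypothetical host driver: returns true if the array came back reversed.
bool runReverse()
{
    int h[N];
    for (int i = 0; i < N; ++i) h[i] = i;

    int *d = nullptr;
    if (cudaMalloc((void **)&d, sizeof(h)) != cudaSuccess) return false;
    cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);

    reverseBlock<<<1, N>>>(d);  // one block of N threads

    cudaError_t err = cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
    cudaFree(d);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < N; ++i)
        if (h[i] != N - 1 - i) return false;
    return true;
}
```

The barrier is what makes the shared array safe to read: without `__syncthreads()`, a thread could read an element of `tile` that its neighbour has not yet written.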