CUDA Syntax

Back to Table of Content | Previous: Necessary Steps for Host-Device Heterogeneous Code | Next: Matrix Multiplication Example

CUDA is an API with minimal extensions to C/C++.

Declaration Specifications

  • __device__: Marks a function that runs on the device (GPU) and can be called only from other __device__ or __global__ functions, or a variable that resides in device memory.
  • __global__: Marks a kernel: a function that is called from the host, runs on the device, and must be launched with the triple angle bracket syntax (<<<...>>>).
  • __shared__: Marks a variable placed in shared memory and visible to all threads of the same block, so they can exchange data and coordinate efficiently.

Special Variables

  • gridDim: Dimensions of the grid, in blocks. There is no gridIdx counterpart, because a kernel launch executes a single grid.
  • blockIdx: Index of the current block within the grid.
  • blockDim: Dimensions of each block, in threads.
  • threadIdx: Index of the current thread within its block.
  • Each of these variables has .x, .y, and .z components.
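
The built-in variables above are usually combined into a global index. The following is a minimal sketch of that pattern using a grid-stride loop; the kernel name writeGlobalIndex and the helper checkGlobalIndex are hypothetical names chosen for illustration, and the launch sizes are arbitrary.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Grid-stride loop: every thread starts at its global index and advances by
// the total number of threads in the grid until the array is covered.
__global__ void writeGlobalIndex(int *out, int n)
{
    int first  = blockIdx.x * blockDim.x + threadIdx.x; // this thread's global index
    int stride = gridDim.x * blockDim.x;                // total threads in the grid
    for (int i = first; i < n; i += stride) {
        out[i] = i;
    }
}

// Hypothetical helper: returns true if every output element i holds the value i.
bool checkGlobalIndex(int n, int blocks, int threadsPerBlock)
{
    int *dOut = nullptr;
    if (cudaMalloc((void **)&dOut, n * sizeof(int)) != cudaSuccess) return false;

    writeGlobalIndex<<<blocks, threadsPerBlock>>>(dOut, n);

    std::vector<int> hOut(n, -1);
    // cudaMemcpy waits for the kernel to finish before it copies.
    cudaError_t err = cudaMemcpy(hOut.data(), dOut, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dOut);

    bool ok = (err == cudaSuccess);
    for (int i = 0; ok && i < n; ++i) ok = (hOut[i] == i);
    return ok;
}

int main()
{
    // 4 blocks of 64 threads cover 1000 elements through the stride loop.
    printf("%s\n", checkGlobalIndex(1000, 4, 64) ? "ok" : "failed");
    return 0;
}
```

Because the loop strides by gridDim.x * blockDim.x, the grid does not have to contain one thread per element.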

Intrinsics

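  • __syncthreads(): A barrier that makes every thread in a block wait until all threads of that block have reached the same point in the code. Every thread of the block must reach the call, so it should not sit inside a branch that only some threads take.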
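
As a rough illustration of why the barrier matters, the sketch below reverses an array inside one block through shared memory. Each thread reads an element that a different thread wrote, so the read has to wait until all writes are done. The names reverseInBlock, checkReverse, and BLOCK_SIZE are assumptions made for this example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define BLOCK_SIZE 64  // assumed block size for this sketch

// Reverses BLOCK_SIZE elements in place using a single thread block.
__global__ void reverseInBlock(int *data)
{
    __shared__ int tile[BLOCK_SIZE];
    int t = threadIdx.x;

    tile[t] = data[t];                   // each thread stages one element
    __syncthreads();                     // wait until the whole tile is filled
    data[t] = tile[BLOCK_SIZE - 1 - t];  // read an element staged by another thread
}

// Hypothetical helper: returns true if 0..63 comes back as 63..0.
bool checkReverse()
{
    int h[BLOCK_SIZE];
    for (int i = 0; i < BLOCK_SIZE; ++i) h[i] = i;

    int *d = nullptr;
    if (cudaMalloc((void **)&d, sizeof(h)) != cudaSuccess) return false;
    cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);

    reverseInBlock<<<1, BLOCK_SIZE>>>(d);

    cudaError_t err = cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
    cudaFree(d);

    bool ok = (err == cudaSuccess);
    for (int i = 0; ok && i < BLOCK_SIZE; ++i) ok = (h[i] == BLOCK_SIZE - 1 - i);
    return ok;
}

int main()
{
    printf("%s\n", checkReverse() ? "ok" : "failed");
    return 0;
}
```

Without the barrier, a thread could read its mirrored slot before the owning thread has written it.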

Runtime API Functions

  • cudaMalloc(...): Allocates memory in device global memory.
  • cudaMemcpy(...): Copies memory between host and device; the direction is selected by its last argument (for example, cudaMemcpyHostToDevice).
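
A minimal sketch of how these two calls fit together: allocate a device buffer, copy data in, copy it back, and release the buffer. cudaFree is not listed above but is the runtime call that pairs with cudaMalloc. The helper name checkRoundTrip and the buffer size are assumptions for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: copies an array to the device and back, then
// verifies that the data survived the round trip.
bool checkRoundTrip()
{
    const int n = 256;
    float src[n], dst[n];
    for (int i = 0; i < n; ++i) {
        src[i] = 0.5f * i;
        dst[i] = -1.0f;
    }

    float *dBuf = nullptr;
    size_t bytes = n * sizeof(float);
    if (cudaMalloc((void **)&dBuf, bytes) != cudaSuccess) return false;

    // Destination comes first, then source, size in bytes, and direction.
    cudaError_t up   = cudaMemcpy(dBuf, src, bytes, cudaMemcpyHostToDevice);
    cudaError_t down = cudaMemcpy(dst, dBuf, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dBuf);

    bool ok = (up == cudaSuccess) && (down == cudaSuccess);
    for (int i = 0; ok && i < n; ++i) ok = (dst[i] == src[i]);
    return ok;
}

int main()
{
    printf("%s\n", checkRoundTrip() ? "ok" : "failed");
    return 0;
}
```

Both calls return a cudaError_t, so checking against cudaSuccess is a cheap way to catch allocation or copy failures early.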

Kernel Execution

  • Kernel Launch Syntax: kernelname<<<gridspec,tbspec>>>(args), where gridspec is the grid size in blocks and tbspec is the thread-block size in threads.
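
The launch syntax can be sketched with a two-dimensional configuration, where both the grid and the block are given as dim3 values. The kernel fillSum, the helper checkLaunch, and the 16x16 block shape are assumptions for this example; cudaGetLastError and cudaDeviceSynchronize are used here to surface launch and execution errors.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Writes x + y into a row-major width x height array.
__global__ void fillSum(int *out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {       // guard threads that fall outside the array
        out[y * width + x] = x + y;
    }
}

// Hypothetical helper: launches fillSum and verifies the result on the host.
bool checkLaunch(int width, int height)
{
    size_t bytes = (size_t)width * height * sizeof(int);
    int *dOut = nullptr;
    if (cudaMalloc((void **)&dOut, bytes) != cudaSuccess) return false;

    dim3 block(16, 16);                               // threads per block
    dim3 grid((width  + block.x - 1) / block.x,       // blocks per grid, rounded up
              (height + block.y - 1) / block.y);
    fillSum<<<grid, block>>>(dOut, width, height);

    cudaError_t launchErr = cudaGetLastError();       // invalid launch configuration
    cudaError_t syncErr   = cudaDeviceSynchronize();  // errors during execution

    std::vector<int> hOut((size_t)width * height, -1);
    cudaError_t copyErr = cudaMemcpy(hOut.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dOut);

    bool ok = launchErr == cudaSuccess && syncErr == cudaSuccess && copyErr == cudaSuccess;
    for (int y = 0; ok && y < height; ++y)
        for (int x = 0; ok && x < width; ++x)
            ok = (hOut[y * width + x] == x + y);
    return ok;
}

int main()
{
    printf("%s\n", checkLaunch(100, 37) ? "ok" : "failed");
    return 0;
}
```

Rounding the grid size up means some threads land outside the array, which is why the kernel checks its coordinates before writing.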

CUDA Function Declarations

| Declaration | Executed on | Callable from | Notes |
|---|---|---|---|
| `__device__ float DeviceFunc()` | Device | Device | Can only be called and executed on the device. |
| `__global__ void KernelFunc()` | Device | Host | Defines a kernel function; must return void. |
| `__host__ float HostFunc()` | Host | Host | Normal C/C++ function; the default when no qualifier is given. |

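  • __device__ and __host__ can be used together, in which case the function is compiled for both the host and the device.
  • The address of a __device__ function cannot be taken in host code.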
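
The sketch below puts the function qualifiers side by side. All function names (square, addOne, squarePlusOne, runOnDevice) are hypothetical and chosen only for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Compiled for both sides: callable from host code and from device code.
__host__ __device__ float square(float x) { return x * x; }

// Device only: callable from kernels and other device functions.
__device__ float addOne(float x) { return x + 1.0f; }

// Kernel: launched from the host, runs on the device, returns void.
__global__ void squarePlusOne(float *out, float x)
{
    *out = addOne(square(x));
}

// Plain host function (implicitly __host__). Returns x*x + 1 as computed
// on the device, or -1 if the allocation or the copy fails.
float runOnDevice(float x)
{
    float *dOut = nullptr;
    float result = -1.0f;
    if (cudaMalloc((void **)&dOut, sizeof(float)) != cudaSuccess) return result;

    squarePlusOne<<<1, 1>>>(dOut, x);
    if (cudaMemcpy(&result, dOut, sizeof(float), cudaMemcpyDeviceToHost) != cudaSuccess) {
        result = -1.0f;
    }
    cudaFree(dOut);
    return result;
}

int main()
{
    printf("host:   %f\n", square(3.0f));       // square() called on the host
    printf("device: %f\n", runOnDevice(3.0f));  // square() and addOne() called on the device
    return 0;
}
```

Calling addOne directly from main would be rejected by the compiler, since it exists only as device code.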

CUDA Variable Type Qualifiers

| Qualifier | Memory | Scope | Lifetime |
|---|---|---|---|
| `int LocalVar;` (automatic, no qualifier) | register or local | thread | thread |
| `__device__ __shared__ int SharedVar;` | shared | block | block |
| `__device__ int GlobalVar;` | global | grid | application |

  • __device__ is optional when used with __shared__.
  • CUDA C has no __local__ qualifier. Thread-private variables are simply automatic variables declared without a qualifier.
  • Automatic variables without any qualifier normally reside in registers:
    • The exceptions are arrays that the compiler cannot keep in registers (for example, arrays indexed with values unknown at compile time) and spilled variables, which go to local memory. Local memory is private to the thread but physically lives in device memory behind the caches, so its access time is hard to predict. Where performance matters, declare arrays explicitly as shared (__shared__) or global (__device__) so that their cost is known.
  • Pointers passed from the host to a kernel must point to global memory, either allocated with cudaMalloc or declared with __device__. On current GPUs, device code can also take the address of a shared variable, but such a pointer is only meaningful within the block that owns it.
  • A __device__ variable resides in global memory.
  • A __shared__ variable resides in shared memory.
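
To make the memory spaces concrete, here is a small sketch that touches each of them: a __device__ variable in global memory, a __shared__ array per block, and automatic variables per thread. The names globalScale, scaledBlockSum, and runScaledBlockSum are hypothetical; cudaMemcpyToSymbol is the runtime call assumed here for writing to a __device__ variable from the host.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define THREADS 32  // assumed block size for this sketch

__device__ int globalScale;  // global memory: one copy, visible to the whole grid

// Each block sums its THREADS inputs and writes sum * globalScale.
__global__ void scaledBlockSum(const int *in, int *out)
{
    __shared__ int tile[THREADS];  // shared memory: one copy per block
    int t = threadIdx.x;           // automatic variable: private to the thread

    tile[t] = in[blockIdx.x * blockDim.x + t];
    __syncthreads();

    if (t == 0) {
        int sum = 0;               // automatic variable, normally kept in a register
        for (int k = 0; k < THREADS; ++k) sum += tile[k];
        out[blockIdx.x] = sum * globalScale;
    }
}

// Hypothetical helper: feeds two blocks of ones through the kernel and
// returns the first block's result, or -1 on any runtime error.
int runScaledBlockSum(int scale)
{
    const int blocks = 2;
    const int n = blocks * THREADS;
    int hIn[n];
    int hOut[blocks] = {0, 0};
    for (int i = 0; i < n; ++i) hIn[i] = 1;

    int *dIn = nullptr, *dOut = nullptr;
    if (cudaMalloc((void **)&dIn, sizeof(hIn)) != cudaSuccess) return -1;
    if (cudaMalloc((void **)&dOut, sizeof(hOut)) != cudaSuccess) {
        cudaFree(dIn);
        return -1;
    }

    cudaMemcpyToSymbol(globalScale, &scale, sizeof(int));  // host -> __device__ variable
    cudaMemcpy(dIn, hIn, sizeof(hIn), cudaMemcpyHostToDevice);

    scaledBlockSum<<<blocks, THREADS>>>(dIn, dOut);

    cudaError_t err = cudaMemcpy(hOut, dOut, sizeof(hOut), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return (err == cudaSuccess) ? hOut[0] : -1;
}

int main()
{
    printf("block sum times scale: %d\n", runScaledBlockSum(3));
    return 0;
}
```

The pointers in and out refer to global memory allocated with cudaMalloc, while tile never leaves the block that declared it.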
