Back to Table of Contents | Previous: Increment Array Example | Next: CUDA Syntax
Here we present a comprehensive overview of the full host-device heterogeneous programming model using CUDA. We outline the essential steps involved in implementing a kernel that increments each element of an array on the GPU, including memory allocation, data initialization, and data transfer between the host and device. The accompanying code example illustrates the complete process, highlighting how parallelism is leveraged on the GPU to achieve efficient computation.
- Allocate Host Memory: Allocate memory on the host (CPU).
- Initialize Host Memory: Initialize the host memory with values.
- Allocate Device Memory: Allocate memory on the device (GPU).
- Copy Host Memory to Device: Transfer data from host to device.
- Setup Execution Parameters: Define the number of blocks and threads.
- Execute the Kernel: Launch the kernel on the GPU.
- Copy Result from Device to Host: Retrieve the results back to the host.
- Clean Up Memory: Free allocated memory on both host and device.
// File name: gpu_increment.cu
#include <stdio.h>
#include <cuda.h>
#include <math.h>
__global__ void inc_gpu(int* a, int N) {
    int tx = threadIdx.x;      // Thread index within the block
    int bx = blockIdx.x;       // Block index within the grid
    int tbsize = blockDim.x;   // Number of threads per block
    int i = bx * tbsize + tx;  // Global index of this thread
    if (i < N) {
        a[i] = a[i] + 1;
    }
}
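// Note: each thread computes one global index i = blockIdx.x * blockDim.x + threadIdx.x;
// for example, with 32 threads per block, thread 5 of block 2 handles element
// 2 * 32 + 5 = 69. The bounds check i < N is required because the last block
// may contain more threads than there are remaining elements.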
int main(int argc, char** argv) {
    int *h_a;              // Host array
    int dimA = 100000000;  // Number of array elements
    int *d_a;              // Device array
    size_t memSize = dimA * sizeof(int);

    // Allocate host memory
    h_a = (int *) malloc(memSize);

    // Initialize host memory
    for (int i = 0; i < dimA; ++i) {
        h_a[i] = i;
        if (i < 10)
            printf("%d,", h_a[i]);
    }

    // Allocate device memory
    cudaMalloc((void **) &d_a, memSize);

    // Copy host memory to device
    cudaMemcpy(d_a, h_a, memSize, cudaMemcpyHostToDevice);

    // Setup execution parameters
    int tbsize = 32;                            // Thread block size
    int numBlocks = ceil(dimA / (float)tbsize); // Number of blocks

    printf("\ncalling the kernel...\n");

    // Execute the kernel
    inc_gpu<<<numBlocks, tbsize>>>(d_a, dimA);

    // Wait for the kernel to finish
    cudaDeviceSynchronize();

    // Copy result from device back to host
    cudaMemcpy(h_a, d_a, memSize, cudaMemcpyDeviceToHost);

    // Verify the result
    printf("\nVerification of results, first 10 elements after increment:\n");
    for (int i = 0; i < 10; ++i) {
        printf("%d,", h_a[i]);
    }
    printf("\n");

    // Clean up memory
    cudaFree(d_a);
    free(h_a);
    return 0;
}

Compile CUDA:
- In order to compile the CUDA code you need NVIDIA's nvcc compiler, which ships with the CUDA Toolkit (available for download from NVIDIA's website).
- You can then compile and run the code above as follows:
$ nvcc gpu_increment.cu -o gpu_increment
$ ./gpu_increment
0,1,2,3,4,5,6,7,8,9,
calling the kernel...
Verification of results, first 10 elements after increment:
1,2,3,4,5,6,7,8,9,10,
$
Back to Table of Contents | Previous: Increment Array Example | Next: CUDA Syntax