GPU_Programming_Tutorial/CUDA Programming Model/Chapter 2: CUDA Programming In Practice/4.steps_Host_Device_Code.md at main · omidasudeh/GPU_Programming_Tutorial

Necessary Steps for Host-Device Heterogeneous Code

Back to Table of Contents | Previous: Increment Array Example | Next: CUDA Syntax

This section gives an overview of the host-device heterogeneous programming model in CUDA. It outlines the steps needed to implement a kernel that increments each element of an array on the GPU: memory allocation, data initialization, and data transfer between the host and the device. The accompanying code example walks through the complete process and shows how the work is spread across GPU threads, with each thread incrementing one array element.

Steps:

Allocate Host Memory: Allocate memory on the host (CPU).
Initialize Host Memory: Initialize the host memory with values.
Allocate Device Memory: Allocate memory on the device (GPU).
Copy Host Memory to Device: Transfer data from host to device.
Setup Execution Parameters: Define the number of blocks and threads.
Execute the Kernel: Launch the kernel on the GPU.
Copy Result from Device to Host: Retrieve the results back to the host.
Clean Up Memory: Free allocated memory on both host and device.
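
The "Setup Execution Parameters" step usually comes down to a ceiling division: enough blocks to cover every element, with the last block possibly only partly used. The following is a minimal host-side sketch of that calculation. The helper name `blocksFor` is hypothetical (it is not part of CUDA), and the kernel is assumed to guard against out-of-range indices with a check such as `if (i < N)`.

```cpp
#include <cassert>

// Hypothetical helper (not a CUDA API): number of thread blocks needed so that
// numBlocks * threadsPerBlock >= n, computed with integer ceiling division.
inline int blocksFor(int n, int threadsPerBlock) {
    assert(n >= 0 && threadsPerBlock > 0);
    // Full blocks, plus one partial block if n is not an exact multiple.
    return n / threadsPerBlock + (n % threadsPerBlock != 0 ? 1 : 0);
}
```

For the values used in this section (100000000 elements, 32 threads per block), `blocksFor(100000000, 32)` evaluates to 3125000. Staying in integer arithmetic avoids the rounding that a `float` division can introduce for large element counts.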

Code Example:

// File name: gpu_increment.cu
#include <stdio.h>
#include <stdlib.h>        // malloc, free
#include <cuda_runtime.h>  // CUDA runtime API (cudaMalloc, cudaMemcpy, ...)
#include <math.h>

__global__ void inc_gpu(int* a, int N) {
    int tx = threadIdx.x;
    int bx = blockIdx.x;
    int tbsize = blockDim.x;
    int i = bx * tbsize + tx;
    if (i < N) {
        a[i] = a[i] + 1;
    }
}

int main(int argc, char** argv) {
    int *h_a; // Host array
    int dimA = 100000000;
    int *d_a; // Device array
    size_t memSize = dimA * sizeof(int);

    // Allocate host memory
    h_a = (int *) malloc(memSize);
    if (h_a == NULL) {
        fprintf(stderr, "Host allocation of %zu bytes failed\n", memSize);
        return 1;
    }

    // Initialize host memory
    for (int i = 0; i < dimA; ++i) {
        h_a[i] = i;
        if (i < 10)
            printf("%d,", h_a[i]);
    }

    // Allocate device memory
    cudaMalloc((void **) &d_a, memSize);

    // Copy host memory to device
    cudaMemcpy(d_a, h_a, memSize, cudaMemcpyHostToDevice);

    // Setup execution parameters
    int tbsize = 32; // Thread block size
    int numBlocks = (dimA + tbsize - 1) / tbsize; // Number of blocks (integer ceiling division)

    printf("\ncalling the kernel...\n");
    // Execute the kernel
    inc_gpu<<<numBlocks, tbsize>>>(d_a, dimA);

    // Synchronize the device
    cudaDeviceSynchronize();

    // Copy result from device to host (needed here to verify the result on the CPU)
    cudaMemcpy(h_a, d_a, memSize, cudaMemcpyDeviceToHost);

    // Verify the result
    printf("\nVerification of results, first 10 elements after increment:\n");
    for (int i = 0; i < 10; ++i) {
        printf("%d,", h_a[i]);
    }

    printf("\n");

    // Clean up memory
    cudaFree(d_a);
    free(h_a);

    return 0;
}
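
The listing above ignores the status codes returned by the CUDA runtime calls, so a failed allocation or kernel launch would go unnoticed. The sketch below shows one common way to check them. The macro name `CUDA_CHECK` is an illustrative choice, not something the CUDA Toolkit provides; `cudaGetLastError`, `cudaGetErrorString`, `cudaMemset`, and `cudaDeviceSynchronize` are runtime API calls. The array size is kept small here purely for illustration.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Illustrative macro (not part of CUDA): abort with a message if a runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error at %s:%d: %s\n", __FILE__,    \
                    __LINE__, cudaGetErrorString(err_));              \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

__global__ void inc_gpu(int* a, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) a[i] += 1;
}

int main() {
    const int n = 1024;                     // small size, chosen for illustration
    const size_t bytes = n * sizeof(int);
    int* d_a = NULL;

    CUDA_CHECK(cudaMalloc((void**)&d_a, bytes));
    CUDA_CHECK(cudaMemset(d_a, 0, bytes));  // start from all zeros

    inc_gpu<<<(n + 31) / 32, 32>>>(d_a, n);
    CUDA_CHECK(cudaGetLastError());         // reports errors detected at launch
    CUDA_CHECK(cudaDeviceSynchronize());    // reports errors raised while the kernel ran

    int first = -1;
    CUDA_CHECK(cudaMemcpy(&first, d_a, sizeof(int), cudaMemcpyDeviceToHost));
    printf("a[0] after increment: %d\n", first);

    CUDA_CHECK(cudaFree(d_a));
    return 0;
}
```

A kernel launch returns no status itself, which is why the sketch queries `cudaGetLastError()` right after the launch and then checks the result of `cudaDeviceSynchronize()`.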

Compile CUDA:

To compile the CUDA code you need NVIDIA's nvcc compiler, which is included in the CUDA Toolkit (available for download from NVIDIA's developer website).

You can then compile and run the code above as follows:

  $ nvcc gpu_increment.cu -o gpu_increment
  $ ./gpu_increment 
  0,1,2,3,4,5,6,7,8,9,
  calling the kernel...

  Verification of results, first 10 elements after increment:
  1,2,3,4,5,6,7,8,9,10,
  $
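
The run above confirms only the first ten elements by eye. As a sketch, a host-side check over the whole array could look like the following. `firstMismatch` is a hypothetical helper name, and it assumes the array was initialized with `h_a[i] = i` as in the listing, so each element should equal `i + 1` after the kernel has run.

```cpp
// Hypothetical helper: returns the index of the first element that does not
// equal i + 1, or -1 if every element was incremented correctly.
// Assumes the input was initialized as h_a[i] = i before the kernel ran.
inline int firstMismatch(const int* h_a, int n) {
    for (int i = 0; i < n; ++i) {
        if (h_a[i] != i + 1) {
            return i;
        }
    }
    return -1;
}
```

In `main`, this could be called after the device-to-host copy, for example as `firstMismatch(h_a, dimA)`, treating any return value other than -1 as a failed run.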

