Necessary Steps for Host-Device Heterogeneous Code

Back to Table of Contents | Previous: Increment Array Example | Next: CUDA Syntax

This section gives an overview of the host-device heterogeneous programming model in CUDA. It outlines the essential steps for implementing a kernel that increments each element of an array on the GPU: memory allocation, data initialization, and data transfer between the host and device. The accompanying code example walks through the complete process, with one GPU thread handling each array element.

Steps:

  1. Allocate Host Memory: Allocate memory on the host (CPU).
  2. Initialize Host Memory: Initialize the host memory with values.
  3. Allocate Device Memory: Allocate memory on the device (GPU).
  4. Copy Host Memory to Device: Transfer data from host to device.
  5. Set Up Execution Parameters: Define the number of blocks and threads.
  6. Execute the Kernel: Launch the kernel on the GPU.
  7. Copy Result from Device to Host: Retrieve the results back to the host.
  8. Clean Up Memory: Free allocated memory on both host and device.
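
Step 5 comes down to one piece of arithmetic: choosing enough blocks that every array element gets a thread. A minimal sketch of that calculation follows. The helper names `blocksFor` and `describeLaunch` are hypothetical and not part of CUDA; they only illustrate why the kernel needs its bounds check (`if (i < N)`).

```cuda
#include <stdio.h>

// Hypothetical helper (not a CUDA API): the smallest number of blocks of
// `threadsPerBlock` threads that covers `n` elements.
// Integer ceiling division avoids float rounding for large n.
static int blocksFor(int n, int threadsPerBlock) {
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Hypothetical helper: prints how many threads are launched versus how many
// are needed. Any surplus threads must be filtered out by a bounds check
// inside the kernel.
static void describeLaunch(int n, int threadsPerBlock) {
    int blocks = blocksFor(n, threadsPerBlock);
    long long launched = (long long)blocks * threadsPerBlock;
    printf("n=%d -> %d blocks x %d threads = %lld threads (%lld idle)\n",
           n, blocks, threadsPerBlock, launched, launched - n);
}
```

For 1,000 elements and 32 threads per block this gives 32 blocks, that is 1,024 threads, of which the last 24 fail the bounds check and do nothing.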

Code Example:

// File name: gpu_increment.cu
#include <stdio.h>
#include <stdlib.h> // malloc, free
#include <cuda.h>
#include <math.h>

__global__ void inc_gpu(int* a, int N) {
    int tx = threadIdx.x;
    int bx = blockIdx.x;
    int tbsize = blockDim.x;
    int i = bx * tbsize + tx;
    if (i < N) {
        a[i] = a[i] + 1;
    }
}

int main(int argc, char** argv) {
    int *h_a; // Host array
    int dimA = 100000000;
    int *d_a; // Device array
    size_t memSize = dimA * sizeof(int);

    // Allocate host memory
    h_a = (int *) malloc(memSize);

    // Initialize host memory
    for (int i = 0; i < dimA; ++i) {
        h_a[i] = i;
        if (i < 10)
            printf("%d,", h_a[i]);
    }

    // Allocate device memory
    cudaMalloc((void **) &d_a, memSize);

    // Copy host memory to device
    cudaMemcpy(d_a, h_a, memSize, cudaMemcpyHostToDevice);

    // Set up execution parameters
    int tbsize = 32; // Thread block size
    int numBlocks = (dimA + tbsize - 1) / tbsize; // Number of blocks (integer ceiling division)

    printf("\ncalling the kernel...\n");
    // Execute the kernel
    inc_gpu<<<numBlocks, tbsize>>>(d_a, dimA);

    // Synchronize the device
    cudaDeviceSynchronize();

    // Copy result from device to host (needed for the verification below)
    cudaMemcpy(h_a, d_a, memSize, cudaMemcpyDeviceToHost);

    // Verify the result
    printf("\nVerification of results, first 10 elements after increment:\n");
    for (int i = 0; i < 10; ++i) {
        printf("%d,", h_a[i]);
    }

    printf("\n");

    // Clean up memory
    cudaFree(d_a);
    free(h_a);

    return 0;
}
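
The listing above ignores the status codes that the CUDA runtime returns, so a failed allocation, copy, or launch would go unnoticed. A minimal sketch of one way to check them is below. `cudaOk` and `launchOk` are hypothetical helper names introduced here for illustration; `cudaGetErrorString`, `cudaGetLastError`, and `cudaDeviceSynchronize` are CUDA runtime API calls.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical helper: report a failed runtime call and tell the caller.
// Returns true when `err` is cudaSuccess.
static bool cudaOk(cudaError_t err, const char* what) {
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

// Hypothetical helper: a kernel launch returns no status, so check twice.
// cudaGetLastError() reports launch errors (e.g. an invalid configuration);
// cudaDeviceSynchronize() reports errors raised while the kernel ran.
static bool launchOk(const char* kernelName) {
    return cudaOk(cudaGetLastError(), kernelName) &&
           cudaOk(cudaDeviceSynchronize(), kernelName);
}
```

Each runtime call in `main` could then be wrapped, for example `if (!cudaOk(cudaMalloc((void **) &d_a, memSize), "cudaMalloc")) return 1;`, and `launchOk("inc_gpu")` could follow the kernel launch in place of the bare `cudaDeviceSynchronize()`.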

Compile CUDA:

  • To compile CUDA code you need NVIDIA's nvcc compiler, which ships with the CUDA Toolkit (available for download from NVIDIA's website).
  • You can then compile and run the code above as follows:
      $ nvcc gpu_increment.cu -o gpu_increment
      $ ./gpu_increment 
      0,1,2,3,4,5,6,7,8,9,
      calling the kernel...
    
      Verification of results, first 10 elements after increment:
      1,2,3,4,5,6,7,8,9,10,
      $ 
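
If the program compiles but the second line of numbers is not incremented, one thing worth checking is whether the runtime can see a GPU at all. The sketch below is an assumption about how one might check this and is not part of the original example. `visibleDeviceCount` and `printDevices` are hypothetical names built on the runtime calls `cudaGetDeviceCount` and `cudaGetDeviceProperties`.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical helper: number of CUDA devices the runtime can see,
// or 0 if the query itself fails (for example, no driver installed).
static int visibleDeviceCount(void) {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) {
        return 0;
    }
    return count;
}

// Hypothetical helper: print the name and compute capability of each device.
static void printDevices(void) {
    int count = visibleDeviceCount();
    printf("CUDA devices found: %d\n", count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) == cudaSuccess) {
            printf("  %d: %s (compute capability %d.%d)\n",
                   dev, prop.name, prop.major, prop.minor);
        }
    }
}
```

Calling `printDevices()` at the top of `main` is enough to confirm that at least one device is listed before any memory is allocated.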
    
