Gabriel-Pedde/GPU_Offloading_MHPC


GPU Offloading — MHPC Coursework

GPU programming portfolio from the Master in High Performance Computing (MHPC) programme. Covers the full progression from first CUDA kernels to a production-quality GPU-accelerated fluid solver, with performance analysis at each stage.

Technologies: CUDA · OpenACC · MPI · cuBLAS · OpenMP · C/C++


Repository Structure

GPU_Offloading_MHPC/
├── gpu_training/     # Progressive CUDA/MPI training exercises (day1 → mpi_jacobi)
└── gpu_project/      # LBM fluid simulation project (CPU / CUDA / OpenACC)

gpu_training/ — Training Path

A four-stage hands-on sequence building from memory fundamentals to multi-node GPU computing.

Day 1 — Memory, Kernels & Shared Memory (day1/)

First CUDA kernels: a 1D array reverse (cuda1.cu) and a 2D matrix transpose (cuda2.cu). The transpose exercise compares three implementations — naive copy, unoptimized transpose, and shared-memory transpose with padding to avoid bank conflicts — across different thread block sizes (8×8 to 32×32) using CUDA events and NVTX markers for profiling.

Key result (50 000 × 50 000 matrix, 16×16 block):

  • Copy baseline: 16.9 ms
  • Transpose (no shared memory): 21.1 ms
  • Transpose (shared memory): 19.2 ms
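The shared-memory trick can be sketched on the CPU as a cache-blocked transpose. This is a sketch, not the repo's kernel: `transpose_tiled` and `TILE` are illustrative names, and in cuda2.cu the staging tile lives in `__shared__` memory, with the extra `+1` column of padding avoiding shared-memory bank conflicts.

```c
#include <stddef.h>

#define TILE 16  /* mirrors the 16x16 thread block used in the timing above */

/* Cache-blocked transpose: the CPU analogue of the shared-memory CUDA
 * kernel. Reads and writes each touch memory in contiguous runs; on the
 * GPU this staging pattern is what makes both the load and the store
 * coalesced. */
void transpose_tiled(const float *in, float *out, size_t n)
{
    for (size_t bi = 0; bi < n; bi += TILE)
        for (size_t bj = 0; bj < n; bj += TILE) {
            float tile[TILE][TILE + 1];  /* +1 padding, as in the kernel */
            size_t ih = (bi + TILE < n) ? TILE : n - bi;  /* edge tiles */
            size_t jw = (bj + TILE < n) ? TILE : n - bj;
            for (size_t i = 0; i < ih; i++)          /* load tile row-wise */
                for (size_t j = 0; j < jw; j++)
                    tile[i][j] = in[(bi + i) * n + (bj + j)];
            for (size_t j = 0; j < jw; j++)          /* store transposed */
                for (size_t i = 0; i < ih; i++)
                    out[(bj + j) * n + (bi + i)] = tile[i][j];
        }
}
```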

Day 2 — Distributed Matrix Multiplication (day2/)

Parallel dense matrix multiplication using MPI + cuBLAS, implemented with a Cannon's algorithm-style decomposition (CCanMatrix). Processes exchange sub-blocks via MPI while computation is offloaded to the GPU through cuBLAS.

Timing (1000×1000 matrix, 4 MPI ranks):

Phase                  Max      Avg
Communication          15 ms    13.3 ms
Computation (cuBLAS)   23 ms    22.3 ms
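Cannon's block schedule can be illustrated with a single-process emulation (hypothetical names; in the repo the local product is a cuBLAS GEMM on the GPU and the block rotations are MPI exchanges between ranks). Each "rank" (r, c) accumulates its block of C from P shifted pairs of A and B blocks:

```c
#include <string.h>

#define P 2          /* process grid dimension (P*P "ranks") */
#define B 3          /* block size held per rank */
#define N (P * B)

/* Local block product: the role played by cublasSgemm in the repo. */
static void local_gemm(const float *a, const float *b, float *c)
{
    for (int i = 0; i < B; i++)
        for (int k = 0; k < B; k++)
            for (int j = 0; j < B; j++)
                c[i*B + j] += a[i*B + k] * b[k*B + j];
}

/* Copy global block (r,c) of an N x N row-major matrix into a B x B buffer. */
static void get_block(const float *m, int r, int c, float *blk)
{
    for (int i = 0; i < B; i++)
        memcpy(blk + i*B, m + (size_t)(r*B + i)*N + c*B, B * sizeof(float));
}

void cannon_matmul(const float *A, const float *Bm, float *C)
{
    memset(C, 0, (size_t)N * N * sizeof(float));
    for (int r = 0; r < P; r++)
        for (int c = 0; c < P; c++) {
            float cblk[B*B] = {0};
            for (int step = 0; step < P; step++) {
                /* After the initial skew plus `step` shifts, rank (r,c)
                 * holds A(r, r+c+step) and B(r+c+step, c) -- the sum over
                 * steps covers every k exactly once. */
                int k = (r + c + step) % P;
                float ablk[B*B], bblk[B*B];
                get_block(A, r, k, ablk);
                get_block(Bm, k, c, bblk);
                local_gemm(ablk, bblk, cblk);
            }
            for (int i = 0; i < B; i++)
                memcpy(C + (size_t)(r*B + i)*N + c*B, cblk + i*B,
                       B * sizeof(float));
        }
}
```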

(Figure: matrix multiplication timing comparison)

Day 3 — Jacobi Solver on GPU (day3/)

2D heat equation solved iteratively with Jacobi relaxation on a 40 000×40 000 grid, using MPI domain decomposition + OpenACC GPU offloading. Halo exchanges between MPI ranks are overlapped with GPU computation where possible.
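One relaxation sweep can be sketched in plain C (illustrative function name; in day3/ this loop is split across MPI ranks and offloaded with OpenACC, with halo rows exchanged between neighbouring ranks each iteration):

```c
#include <math.h>
#include <stddef.h>

/* One Jacobi sweep for the 2D heat/Laplace problem: each interior point
 * becomes the average of its four neighbours. Returns the largest
 * pointwise update, a common convergence measure. Boundary values are
 * left untouched. */
double jacobi_sweep(const double *u, double *unew, size_t nx, size_t ny)
{
    double maxdiff = 0.0;
    for (size_t i = 1; i < ny - 1; i++)
        for (size_t j = 1; j < nx - 1; j++) {
            double v = 0.25 * (u[(i-1)*nx + j] + u[(i+1)*nx + j]
                             + u[i*nx + j-1] + u[i*nx + j+1]);
            double d = fabs(v - u[i*nx + j]);
            if (d > maxdiff) maxdiff = d;
            unew[i*nx + j] = v;
        }
    return maxdiff;
}
```

The solver alternates `u` and `unew` each iteration and stops when the returned residual drops below a tolerance.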

(Figure: Jacobi convergence)

MPI Jacobi — Scaling Analysis (mpi_jacobi/)

Strong-scaling study comparing:

  • Pure MPI (CPU only)
  • Non-blocking MPI (overlapping communication and computation on CPU)
  • Hybrid MPI + GPU (OpenACC offloading, 8 and 16 processes)

The hybrid approach demonstrates significant throughput gains at higher process counts, where the GPU compute time dominates over inter-node communication costs.


gpu_project/ — LBM Fluid Simulation

A full 2D Lattice Boltzmann Method (LBM) simulation of the Taylor-Green vortex decay, implemented in three backends and benchmarked systematically.

The Physics

LBM tracks a particle distribution function f_i(x, t) on a D2Q9 lattice (9 discrete velocity directions in 2D). Each time step alternates between:

  • Collision — each node relaxes toward local thermodynamic equilibrium via BGK:
    f_i ← f_i − (f_i − f_i^eq) / τ
  • Streaming — post-collision distributions propagate to neighbouring nodes

The simulation starts from a Taylor-Green vortex, a smooth sinusoidal flow that decays exponentially. This provides an analytical reference for validation.
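The two steps can be sketched as a minimal D2Q9 update in C. This is a sketch with illustrative names, not the repo's code: it uses periodic boundaries, SoA-style per-direction arrays, and the standard second-order BGK equilibrium.

```c
#define NX 8
#define NY 8
#define Q  9

/* D2Q9 lattice: 9 discrete velocities and their weights. */
static const int    cx[Q] = {0, 1, 0, -1, 0, 1, -1, -1, 1};
static const int    cy[Q] = {0, 0, 1, 0, -1, 1, 1, -1, -1};
static const double w[Q]  = {4.0/9, 1.0/9, 1.0/9, 1.0/9, 1.0/9,
                             1.0/36, 1.0/36, 1.0/36, 1.0/36};

/* f, fnew: Q arrays of NX*NY populations (SoA); tau: BGK relaxation time. */
void lbm_step(double f[Q][NX*NY], double fnew[Q][NX*NY], double tau)
{
    for (int y = 0; y < NY; y++)
        for (int x = 0; x < NX; x++) {
            int n = y*NX + x;
            /* macroscopic density and velocity at this node */
            double rho = 0.0, ux = 0.0, uy = 0.0;
            for (int i = 0; i < Q; i++) {
                rho += f[i][n];
                ux  += cx[i] * f[i][n];
                uy  += cy[i] * f[i][n];
            }
            ux /= rho; uy /= rho;
            double usq = ux*ux + uy*uy;
            for (int i = 0; i < Q; i++) {
                /* collision: f_i <- f_i - (f_i - f_i^eq) / tau */
                double cu  = cx[i]*ux + cy[i]*uy;
                double feq = w[i]*rho*(1.0 + 3.0*cu + 4.5*cu*cu - 1.5*usq);
                double fpost = f[i][n] - (f[i][n] - feq) / tau;
                /* streaming: push to the periodic neighbour along (cx,cy) */
                int xd = (x + cx[i] + NX) % NX;
                int yd = (y + cy[i] + NY) % NY;
                fnew[i][yd*NX + xd] = fpost;
            }
        }
}
```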

(Figures: velocity field, streamlines, and fluid evolution)

Implementations & Memory Layouts

File           Backend   Layout   Key technique
lbm.c          CPU       AoS      Reference implementation
lbm.cu (AoS)   CUDA      AoS      GPU kernels, __constant__ memory
lbm.cu (SoA)   CUDA      SoA      Shared memory with padding, CUDA streams
lbmACC.c       OpenACC   SoA      Fused collision+streaming kernel, #pragma acc data

The SoA (Structure-of-Arrays) layout stores each velocity direction's data as a separate contiguous array, enabling coalesced memory access on the GPU — the dominant factor driving the performance gap over AoS.
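The difference between the two layouts comes down to indexing (illustrative helper names). With AoS, adjacent GPU threads reading the same direction at consecutive nodes touch addresses 9 elements apart; with SoA they touch consecutive addresses:

```c
#include <stddef.h>

#define NDIR 9  /* D2Q9 directions */

/* AoS: all 9 populations of one node are adjacent, so threads reading
 * direction `dir` at consecutive nodes hit a stride-9 access pattern
 * (uncoalesced on the GPU). */
static inline size_t idx_aos(size_t node, int dir)
{
    return node * NDIR + (size_t)dir;
}

/* SoA: each direction is its own contiguous array, so consecutive
 * threads reading direction `dir` at consecutive nodes hit consecutive
 * addresses (coalesced). */
static inline size_t idx_soa(size_t node, int dir, size_t nnodes)
{
    return (size_t)dir * nnodes + node;
}
```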

Performance (2048×2048 grid, 1000 steps)

Backend   Collision   Streaming   Collision GFLOP/s
CPU       91.5 s      49.7 s      9.6
GPU AoS   1.51 s      13.4 s      578.9
GPU SoA   0.75 s      0.75 s      1174.4

GPU SoA achieves a ~120× speedup over the CPU for collision and cuts streaming from 13.4 s (GPU AoS) to under 1 s, the gain coming from coalesced global memory access.

(Figures: collision step, streaming step, and total runtime)

Physics Validation

Simulated kinetic energy decay is compared against the analytical prediction E_k(t) = E_k(0)·exp(−4νk²t):

cd gpu_project/LB_project
python validate.py    # produces validation.png

See the full project documentation in gpu_project/README.md.

About

GPU programming coursework from the MHPC programme — covering CUDA kernels, OpenACC offloading, MPI+GPU hybrid computing, and a full 2D Lattice Boltzmann fluid solver with CPU/CUDA/OpenACC backends. Includes performance benchmarks, physics validation, and scaling analysis across K80/P100/V100 GPUs.
