GPU programming portfolio from the Master in High Performance Computing (MHPC) programme. Covers the full progression from first CUDA kernels to a production-quality GPU-accelerated fluid solver, with performance analysis at each stage.
Technologies: CUDA · OpenACC · MPI · cuBLAS · OpenMP · C/C++
```
GPU_Offloading_MHPC/
├── gpu_training/   # Progressive CUDA/MPI training exercises (day1 → mpi_jacobi)
└── gpu_project/    # LBM fluid simulation project (CPU / CUDA / OpenACC)
```
A four-stage hands-on sequence building from memory fundamentals to multi-node GPU computing.
First CUDA kernels: a 1D array reverse (`cuda1.cu`) and a 2D matrix transpose (`cuda2.cu`). The transpose exercise compares three implementations — naive copy, unoptimized transpose, and shared-memory transpose with padding to avoid bank conflicts — across thread block sizes from 8×8 to 32×32, using CUDA events and NVTX markers for profiling.
Key result (50 000 × 50 000 matrix, 16×16 block):
- Copy baseline: 16.9 ms
- Transpose (no shared memory): 21.1 ms
- Transpose (shared memory): 19.2 ms
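The shared-memory variant can be sketched as follows (an illustrative kernel, not necessarily the exact code in `cuda2.cu`): the tile is staged through on-chip shared memory so that both the global read and the global write are coalesced, and the `+1` padding column shifts each row into a different bank, avoiding bank conflicts on the transposed read.

```cuda
#define TILE 16

__global__ void transpose_smem(float *out, const float *in, int n)
{
    __shared__ float tile[TILE][TILE + 1];   // +1 column avoids bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < n && y < n)
        tile[threadIdx.y][threadIdx.x] = in[y * n + x];   // coalesced load

    __syncthreads();

    // Swap the block indices so the store is also coalesced.
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < n && y < n)
        out[y * n + x] = tile[threadIdx.x][threadIdx.y];  // coalesced store
}
```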
Parallel dense matrix multiplication using MPI + cuBLAS, implemented with a Cannon's-algorithm-style block decomposition (`CCanMatrix`). Processes exchange sub-blocks via MPI while the computation is offloaded to the GPU through cuBLAS.
Timing (1000×1000 matrix, 4 MPI ranks):
| Phase | Max | Avg |
|---|---|---|
| Communication | 15 ms | 13.3 ms |
| Computation (cuBLAS) | 23 ms | 22.3 ms |
2D heat equation solved iteratively with Jacobi relaxation on a 40 000×40 000 grid, using MPI domain decomposition + OpenACC GPU offloading. Halo exchanges between MPI ranks are overlapped with GPU computation where possible.
Strong-scaling study comparing:
- Pure MPI (CPU only)
- Non-blocking MPI (overlapping communication and computation on CPU)
- Hybrid MPI + GPU (OpenACC offloading, 8 and 16 processes)
The hybrid approach demonstrates significant throughput gains at higher process counts, where the GPU compute time dominates over inter-node communication costs.
A full 2D Lattice Boltzmann Method (LBM) simulation of the Taylor-Green vortex decay, implemented in three backends and benchmarked systematically.
LBM tracks a particle distribution function f_i(x, t) on a D2Q9 lattice (9 discrete velocity directions in 2D). Each time step alternates between:
- Collision — each node relaxes toward local thermodynamic equilibrium via BGK:

  f_i ← f_i − (f_i − f_i^eq) / τ

- Streaming — post-collision distributions propagate to neighbouring nodes
The simulation starts from a Taylor-Green vortex, a smooth sinusoidal flow that decays exponentially. This provides an analytical reference for validation.
*(Figures: velocity field and streamline view of the simulated flow.)*
| File | Backend | Layout | Key technique |
|---|---|---|---|
| `lbm.c` | CPU | AoS | Reference implementation |
| `lbm.cu` (AoS) | CUDA | AoS | GPU kernels, `__constant__` memory |
| `lbm.cu` (SoA) | CUDA | SoA | Shared memory with padding, CUDA streams |
| `lbmACC.c` | OpenACC | SoA | Fused collision+streaming kernel, `#pragma acc data` |
The SoA (Structure-of-Arrays) layout stores each velocity direction's data as a separate contiguous array, enabling coalesced memory access on the GPU — the dominant factor driving the performance gap over AoS.
| Backend | Collision | Streaming | Collision GFLOP/s |
|---|---|---|---|
| CPU | 91.5 s | 49.7 s | 9.6 |
| GPU AoS | 1.51 s | 13.4 s | 578.9 |
| GPU SoA | 0.75 s | 0.75 s | 1174.4 |
The GPU SoA version achieves a ~120× collision speedup over the CPU and cuts streaming from 13.4 s (GPU AoS) to 0.75 s, in both cases by enabling coalesced global memory access.
*(Figures: collision-step, streaming-step, and total runtime plots across backends.)*
Simulated kinetic energy decay is compared against the analytical prediction E_k(t) = E_k(0)·exp(−4νk²t):
```shell
cd gpu_project/LB_project
python validate.py   # produces validation.png
```

See the full project documentation in `gpu_project/README.md`.