GPU programming portfolio from the Master in High Performance Computing (MHPC) programme. Covers the full progression from first CUDA kernels to a production-quality GPU-accelerated fluid solver, with performance analysis at each stage.
Technologies: CUDA · OpenACC · MPI · cuBLAS · OpenMP · C/C++
```
GPU_Offloading_MHPC/
├── gpu_training/   # Progressive CUDA/MPI training exercises (day1 → mpi_jacobi)
└── gpu_project/    # LBM fluid simulation project (CPU / CUDA / OpenACC)
```
A four-stage hands-on sequence building from memory fundamentals to multi-node GPU computing.
First CUDA kernels: a 1D array reverse (`cuda1.cu`) and a 2D matrix transpose (`cuda2.cu`). The transpose exercise compares three implementations — naive copy, unoptimized transpose, and shared-memory transpose with padding to avoid bank conflicts — across thread block sizes from 8×8 to 32×32, using CUDA events and NVTX markers for profiling.
Key result (50 000 × 50 000 matrix, 16×16 block):
- Copy baseline: 16.9 ms
- Transpose (no shared memory): 21.1 ms
- Transpose (shared memory): 19.2 ms
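The shared-memory variant can be sketched as follows (an illustrative kernel, not necessarily the exact code in `cuda2.cu`): the tile is staged through on-chip shared memory so that both the global read and the global write are coalesced, and the `+1` padding column shifts each row into a different bank, avoiding bank conflicts on the transposed read.

```cuda
#define TILE 16

__global__ void transpose_smem(float *out, const float *in, int n)
{
    __shared__ float tile[TILE][TILE + 1];   // +1 column avoids bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < n && y < n)
        tile[threadIdx.y][threadIdx.x] = in[y * n + x];   // coalesced load

    __syncthreads();

    // Swap the block indices so the store is also coalesced.
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < n && y < n)
        out[y * n + x] = tile[threadIdx.x][threadIdx.y];  // coalesced store
}
```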
Parallel dense matrix multiplication using MPI + cuBLAS, implemented with a Cannon's-algorithm-style block decomposition (`CCanMatrix`). Processes exchange sub-blocks via MPI while the computation is offloaded to the GPU through cuBLAS.
Timing (1000×1000 matrix, 4 MPI ranks):
| Phase | Max | Avg |
|---|---|---|
| Communication | 15 ms | 13.3 ms |
| Computation (cuBLAS) | 23 ms | 22.3 ms |
2D heat equation solved iteratively with Jacobi relaxation on a 40 000×40 000 grid, using MPI domain decomposition + OpenACC GPU offloading. Halo exchanges between MPI ranks are overlapped with GPU computation where possible.
Strong-scaling study comparing:
- Pure MPI (CPU only)
- Non-blocking MPI (overlapping communication and computation on CPU)
- Hybrid MPI + GPU (OpenACC offloading, 8 and 16 processes)
The hybrid approach demonstrates significant throughput gains at higher process counts, where the GPU compute time dominates over inter-node communication costs.
A full 2D Lattice Boltzmann Method (LBM) simulation of the Taylor-Green vortex decay, implemented in three backends and benchmarked systematically.
LBM tracks a particle distribution function f_i(x, t) on a D2Q9 lattice (9 discrete velocity directions in 2D). Each time step alternates between:
- Collision — each node relaxes toward local thermodynamic equilibrium via BGK:

  f_i ← f_i − (f_i − f_i^eq) / τ

- Streaming — post-collision distributions propagate to neighbouring nodes
The simulation starts from a Taylor-Green vortex, a smooth sinusoidal flow that decays exponentially. This provides an analytical reference for validation.
*(Figures: velocity field and streamline view of the simulated flow.)*
| File | Backend | Layout | Key technique |
|---|---|---|---|
| `lbm.c` | CPU | AoS | Reference implementation |
| `lbm.cu` (AoS) | CUDA | AoS | GPU kernels, `__constant__` memory |
| `lbm.cu` (SoA) | CUDA | SoA | Shared memory with padding, CUDA streams |
| `lbmACC.c` | OpenACC | SoA | Fused collision+streaming kernel, `#pragma acc data` |
The SoA (Structure-of-Arrays) layout stores each velocity direction's data as a separate contiguous array, enabling coalesced memory access on the GPU — the dominant factor driving the performance gap over AoS.
| Backend | Collision | Streaming | Collision GFLOP/s |
|---|---|---|---|
| CPU | 91.5 s | 49.7 s | 9.6 |
| GPU AoS | 1.51 s | 13.4 s | 578.9 |
| GPU SoA | 0.75 s | 0.75 s | 1174.4 |
The GPU SoA version achieves a ~120× collision speedup over the CPU and cuts streaming from 13.4 s (GPU AoS) to 0.75 s, in both cases by enabling coalesced global memory access.
*(Figures: collision-step, streaming-step, and total runtime plots across backends.)*
Simulated kinetic energy decay is compared against the analytical prediction E_k(t) = E_k(0)·exp(−4νk²t):
```shell
cd gpu_project/LB_project
python validate.py   # produces validation.png
```

See the full project documentation in `gpu_project/README.md`.