A 2D fluid simulation of the Taylor-Green vortex decay using the Lattice Boltzmann Method (LBM), implemented in C with CPU, CUDA, and OpenACC backends. The project studies GPU offloading strategies and memory layout optimizations (AoS vs. SoA) through performance benchmarking.
Rather than solving the Navier-Stokes equations directly, LBM simulates fluid dynamics at the mesoscale by tracking a particle distribution function f_i(x, t) — the probability of finding a fluid "particle" at position x, moving in discrete direction i, at time t.
This implementation uses the D2Q9 model: 2 spatial dimensions, 9 discrete velocity directions. Each lattice node holds 9 distribution values pointing in the cardinal and diagonal directions (plus rest).
```
6   2   5
  \ | /
3 — 0 — 1
  / | \
7   4   8
```
Lattice weights: w₀ = 4/9, w₁₋₄ = 1/9, w₅₋₈ = 1/36.
Each simulation step consists of two stages:
1. Collision — each node relaxes toward the local equilibrium distribution:
f_i(x, t+1) = f_i(x, t) - [f_i(x, t) - f_i^eq(x, t)] / τ
where τ is the relaxation time controlling viscosity (ν = (τ - 0.5) / 3), and the equilibrium distribution is:
f_i^eq = w_i · ρ · [1 + 3(e_i·u) + 4.5(e_i·u)² - 1.5|u|²]
2. Streaming — post-collision distributions propagate to neighboring nodes along their velocity direction, with periodic boundary conditions.
Macroscopic fluid variables (density ρ and velocity u) are recovered as moments of the distribution:
ρ = Σᵢ f_i
ρu = Σᵢ f_i · e_i
The simulation initializes as a Taylor-Green vortex — a smooth, analytically known flow with a sinusoidal velocity field:
u_x(x, y) = U₀ · sin(kx) · cos(ky)
u_y(x, y) = -U₀ · cos(kx) · sin(ky)
This vortex decays over time due to viscosity, making it an ideal benchmark: kinetic energy follows a known exponential decay that can be compared against the numerical solution.
E_k(t) = E_k(0) · exp(-4 · ν · k² · t)
The Taylor-Green vortex gradually dissipates energy, transitioning from an ordered vortex structure to progressively smoothed flow:
| Velocity field evolution | Streamline visualization |
|---|---|
| ![](data/fluid_evolution.gif) | ![](data/fluid_streamlines.gif) |
```
gpu_project/
├── LB_project/
│   ├── lbm.c               # CPU reference implementation
│   ├── lbm.cu              # CUDA implementation (AoS and SoA variants)
│   ├── lbmACC.c            # OpenACC implementation (SoA, fused collision+stream)
│   ├── lbmACC2.c           # OpenACC alternative variant
│   ├── validate.py         # Validates Ek.dat against analytical decay
│   ├── validation.png      # Validation plot (LBM vs theory)
│   └── jobs/               # SLURM job scripts for HPC clusters
├── data/
│   ├── fluid_evolution.gif     # Animated velocity field evolution
│   ├── fluid_evolution2.gif    # Alternative view of flow evolution
│   ├── fluid_streamlines.gif   # Animated streamline visualization
│   ├── collision_time.png      # Collision step timing comparison
│   ├── streaming_time.png      # Streaming step timing comparison
│   ├── total_time.png          # Total runtime comparison
│   ├── lbm_performance_log.png # Scaling and performance logs
│   ├── *_comparison_*.png      # Per-backend and grid-size comparisons
│   ├── compare.txt             # Raw benchmark output
│   └── *.py                    # Plotting scripts
└── lab/
    ├── axpy.cu                 # Day 1: CUDA AXPY exercise
    ├── day2/                   # Matrix-vector, reduction, transpose
    ├── day3/                   # Fused multiply-add
    └── coka-slurm-scripts/     # SLURM scripts for K80, P100, V100
```
Plain C reference: separate `collision()` and `streaming()` functions operating on an Array-of-Structures (AoS) layout. Kinetic energy is sampled every 20 steps and written to `Ek.dat`.
Three backends in a single file, run sequentially for comparison:
- **GPU AoS** — direct port of the CPU AoS layout to CUDA kernels (`gpu_collision`, `gpu_streaming`). Each thread handles one cell. Lattice constants are stored in GPU `__constant__` memory.
- **GPU SoA** — Structure-of-Arrays layout (`LBMCellSoa`), where each velocity direction's data is a separate contiguous array. The collision kernel (`gpu_collisionSoa`) uses shared memory with padding to avoid bank conflicts; streaming runs via `gpu_streamingSoa`. Uses CUDA streams and events for precise kernel timing.
Grid size is configurable at runtime: `./lbm.x NX NY`.
Directive-based GPU offloading with SoA layout. A single `#pragma acc data` region transfers data once; the main loop runs as a fused collision+streaming kernel under `#pragma acc parallel loop gang worker vector collapse(2)`.
Benchmarked on a 2048×2048 grid (1000 time steps):
| Backend | Collision | Streaming | Collision GFLOP/s | Streaming GB/s |
|---|---|---|---|---|
| CPU | 91.5 s | 49.7 s | 9.6 | 12.1 |
| GPU AoS | 1.51 s | 13.4 s | 578.9 | 45.2 |
| GPU SoA | 0.75 s | 0.75 s | 1174.4 | 806.2 |
The SoA layout on GPU achieves a ~122× speedup over CPU for collision and ~66× for streaming, driven by coalesced memory access patterns; by contrast, the AoS streaming kernel is roughly 18× slower than its SoA counterpart because of its strided accesses.
| Collision step | Streaming step | Total runtime |
|---|---|---|
| ![](data/collision_time.png) | ![](data/streaming_time.png) | ![](data/total_time.png) |
After running, validate.py plots the simulated kinetic energy decay against the analytical solution:
```
python validate.py   # reads Ek.dat, produces validation.png
```

The LBM output closely follows E_k(t) = E_k(0) · exp(-4νk²t), confirming the solver is physically correct.
CPU version:

```
gcc -O3 -o lbm lbm.c -lm
./lbm
```

CUDA version:

```
nvcc -O3 -o lbm.x lbm.cu
./lbm.x 256 256   # optional: specify grid size
```

OpenACC version:

```
pgcc -acc -Minfo=accel -o lbmACC lbmACC.c -lm
./lbmACC
```

The `lab/` directory contains progressive CUDA exercises to review the basics of GPU offloading:
- Day 1 — AXPY kernel, thread indexing basics
- Day 2 — Matrix-vector product, parallel reduction, matrix transpose
- Day 3 — Fused multiply-add, performance analysis





