2D Lattice Boltzmann Method — GPU Offloading Study

A 2D fluid simulation of the Taylor-Green vortex decay using the Lattice Boltzmann Method (LBM), implemented in C with CPU, CUDA, and OpenACC backends. The project studies GPU offloading strategies and memory layout optimizations (AoS vs. SoA) through performance benchmarking.


Theory: Lattice Boltzmann Method

Rather than solving the Navier-Stokes equations directly, LBM simulates fluid dynamics at the mesoscale by tracking a particle distribution function f_i(x, t) — the probability of finding a fluid "particle" at position x, moving in discrete direction i, at time t.

D2Q9 Lattice

This implementation uses the D2Q9 model: 2 spatial dimensions, 9 discrete velocity directions. Each lattice node holds 9 distribution values: one rest population plus eight pointing along the cardinal and diagonal directions.

  6  2  5
   \ | /
 3 — 0 — 1
   / | \
  7  4  8

Lattice weights: w₀ = 4/9, w₁₋₄ = 1/9, w₅₋₈ = 1/36.

Time-Stepping: Collision + Streaming

Each simulation step consists of two stages:

1. Collision — each node relaxes toward the local equilibrium distribution:

f_i(x, t+1) = f_i(x, t) - [f_i(x, t) - f_i^eq(x, t)] / τ

where τ is the relaxation time controlling viscosity (ν = (τ - 0.5) / 3), and the equilibrium distribution is:

f_i^eq = w_i · ρ · [1 + 3(e_i·u) + 4.5(e_i·u)² - 1.5|u|²]

2. Streaming — post-collision distributions propagate to neighboring nodes along their velocity direction, with periodic boundary conditions.

Macroscopic fluid variables (density ρ and velocity u) are recovered as moments of the distribution:

ρ = Σ f_i        ρu = Σ f_i · e_i

Taylor-Green Vortex Initial Condition

The simulation initializes as a Taylor-Green vortex — a smooth, analytically known flow with a sinusoidal velocity field:

u_x(x, y) =  U₀ · sin(kx) · cos(ky)
u_y(x, y) = -U₀ · cos(kx) · sin(ky)

This vortex decays over time due to viscosity, making it an ideal benchmark: kinetic energy follows a known exponential decay that can be compared against the numerical solution.

E_k(t) = E_k(0) · exp(-4 · ν · k² · t)

Fluid Evolution

The Taylor-Green vortex gradually dissipates energy, transitioning from an ordered vortex structure to progressively smoothed flow:

Animations of the velocity field evolution and the streamline visualization are in data/fluid_evolution.gif, data/fluid_evolution2.gif, and data/fluid_streamlines.gif.

Code Structure

gpu_project/
├── LB_project/
│   ├── lbm.c          # CPU reference implementation
│   ├── lbm.cu         # CUDA implementation (AoS and SoA variants)
│   ├── lbmACC.c       # OpenACC implementation (SoA, fused collision+stream)
│   ├── lbmACC2.c      # OpenACC alternative variant
│   ├── validate.py    # Validates Ek.dat against analytical decay
│   ├── validation.png # Validation plot (LBM vs theory)
│   └── jobs/          # SLURM job scripts for HPC clusters
├── data/
│   ├── fluid_evolution.gif      # Animated velocity field evolution
│   ├── fluid_evolution2.gif     # Alternative view of flow evolution
│   ├── fluid_streamlines.gif    # Animated streamline visualization
│   ├── collision_time.png       # Collision step timing comparison
│   ├── streaming_time.png       # Streaming step timing comparison
│   ├── total_time.png           # Total runtime comparison
│   ├── lbm_performance_log.png  # Scaling and performance logs
│   ├── *_comparison_*.png       # Per-backend and grid-size comparisons
│   ├── compare.txt              # Raw benchmark output
│   └── *.py                    # Plotting scripts
└── lab/
    ├── axpy.cu                  # Day 1: CUDA AXPY exercise
    ├── day2/                    # Matrix-vector, reduction, transpose
    ├── day3/                    # Fused multiply-add
    └── coka-slurm-scripts/      # SLURM scripts for K80, P100, V100

Implementations

CPU (lbm.c)

Plain C reference: separate collision() and streaming() functions operating on an Array-of-Structures (AoS) layout. Kinetic energy is sampled every 20 steps and written to Ek.dat.

CUDA (lbm.cu)

Three backends in a single file, run sequentially for comparison:

  • GPU AoS — direct port of the CPU AoS layout to CUDA kernels (gpu_collision, gpu_streaming). Each thread handles one cell. Lattice constants stored in GPU __constant__ memory.
  • GPU SoA — Structure-of-Arrays layout (LBMCellSoa), where each velocity direction's data is a separate contiguous array. The collision kernel (gpu_collisionSoa) uses shared memory with padding to avoid bank conflicts. Streams via gpu_streamingSoa. Uses CUDA streams and events for precise kernel timing.

Grid size is configurable at runtime: ./lbm.x NX NY.

OpenACC (lbmACC.c)

Directive-based GPU offloading with SoA layout. A single #pragma acc data region transfers data once; the main loop runs as a fused collision+streaming kernel under #pragma acc parallel loop gang worker vector collapse(2).


Performance Results

Benchmarked on a 2048×2048 grid (1000 time steps):

| Backend | Collision time | Streaming time | Collision GFLOP/s | Streaming GB/s |
|---------|----------------|----------------|-------------------|----------------|
| CPU     | 91.5 s         | 49.7 s         | 9.6               | 12.1           |
| GPU AoS | 1.51 s         | 13.4 s         | 578.9             | 45.2           |
| GPU SoA | 0.75 s         | 0.75 s         | 1174.4            | 806.2          |

The SoA layout on GPU achieves a ~122× speedup over CPU for collision and ~66× for streaming, driven by coalesced memory access patterns; the AoS GPU port reaches only ~60× for collision and ~3.7× for streaming.

Performance Plots

Timing plots for the collision step, streaming step, and total runtime are in data/collision_time.png, data/streaming_time.png, and data/total_time.png; the full performance log is in data/lbm_performance_log.png.


Physics Validation

After running, validate.py plots the simulated kinetic energy decay against the analytical solution:

python validate.py   # reads Ek.dat, produces validation.png

The LBM output closely follows E_k(t) = E_k(0) · exp(-4νk²t), confirming the solver is physically correct.


Running

CPU version:

gcc -O3 -o lbm lbm.c -lm
./lbm

CUDA version:

nvcc -O3 -o lbm.x lbm.cu
./lbm.x 256 256    # optional: specify grid size

OpenACC version:

pgcc -acc -Minfo=accel -o lbmACC lbmACC.c -lm
./lbmACC

Lab Exercises

The lab/ directory contains progressive CUDA exercises to review the basics of GPU offloading:

  • Day 1 — AXPY kernel, thread indexing basics
  • Day 2 — Matrix-vector product, parallel reduction, matrix transpose
  • Day 3 — Fused multiply-add, performance analysis