Gabriel-Pedde/Parallel-computing

Parallel Computing

A collection of HPC implementations covering MPI, OpenMP, and hybrid parallelism, developed as part of the P1.5 Parallel Computing course in the HPC Master programme at SISSA/ICTP (2025-26).

Technologies

| Technology | Usage |
| --- | --- |
| MPI | Distributed-memory communication (point-to-point, collective, non-blocking) |
| OpenMP | Shared-memory thread parallelism |
| Hybrid MPI+OpenMP | Two-level parallelism targeting multi-core cluster nodes |
| HDF5 (parallel) | Scalable parallel I/O for large datasets |
| FFTW3-MPI | Distributed 3D FFTs for spectral PDE solving |
| C++20 / C | Implementation languages |
| CMake / Make | Build systems |

Projects

1. Distributed Identity Matrix — 01_identity_matrix/

An N×N identity matrix is distributed in block rows across MPI processes using a templated CMatrix<T> class. Three communication strategies for collecting and printing the matrix from the root process are explored:

  • Blocking MPI_Send / MPI_Recv
  • Non-blocking MPI_Isend / MPI_Irecv with double-buffering (overlap communication and printing)
  • Binary file I/O via parallel writes and MPI_Recv gather

The load imbalance from sizes not divisible by the process count is handled through the remainder term: the first N mod P processes each receive one extra row.

Build:

```
cd 01_identity_matrix && mkdir build && cd build
cmake .. && make
mpirun -np 4 ./idMat
```

2. Matrix Multiplication — 1D Block-Row Distribution — 02_matrix_multiplication/

Parallel dense matrix multiplication (A × B = C) with a 1D block-row data layout using MPI collectives (MPI_Scatterv, MPI_Allgather). The inner loop is further parallelised with OpenMP to exploit shared memory within each node (hybrid MPI+OpenMP).

Performance is benchmarked against the serial baseline on a 50 000 × 50 000 matrix across multiple node counts and thread configurations:

| Configuration | Plot |
| --- | --- |
| Pure MPI scaling | data/PureMPI.png |
| Pure MPI efficiency | data/efficiency.png |
| Hybrid, 4 tasks/node | data/hybrid4npes.png |
| Hybrid, 8 tasks/node | data/hybrid8npes.png |
| Hybrid efficiency | data/hybrideff.png |
| Cannon vs. 1D comparison | data/cannonAlg.png |

A Google Test suite (gtest.cpp) validates correctness by comparing the parallel result against a serial reference multiplication.

Build:

```
cd 02_matrix_multiplication && mkdir build && cd build
cmake .. && make
mpirun -np 4 ./matMul
```

3. Cannon's Algorithm — 2D Block Distribution — 03_cannon_algorithm/

Implementation of Cannon's algorithm for matrix multiplication with a 2D process grid. Each process owns a square subblock; blocks are cyclically shifted along rows and columns to compute the product with O(N²/P) memory per process, avoiding the O(N²) replication of B required by the 1D Allgather layout.

Requires the number of MPI processes to be a perfect square. Supports hybrid execution (OpenMP inner loop). Benchmark scripts compare Cannon vs. 1D distribution performance and efficiency.

Build:

```
cd 03_cannon_algorithm && mkdir build && cd build
cmake .. && make
mpirun -np 16 ./cannon   # P must be a perfect square
```

4. OpenMP Introduction — 04_openmp_intro/

Two standalone programs illustrating OpenMP fundamentals:

  • hello_threads.cpp — spawns a fixed number of threads and prints a greeting from each, demonstrating #pragma omp parallel and omp_get_thread_num().
  • matmul_omp.c — benchmarks five OpenMP strategies for the inner loop of matrix multiplication: collapse(3) with atomic, collapse(3) with reduction, collapse(2), collapse(1), and collapse(3) with critical, comparing wall-clock time across strategies.

5. Jacobi Solver — Pure MPI — 05_jacobi_mpi/

Parallel iterative Jacobi solver for the 2D steady-state heat equation on a square domain, using a 1D (row-wise) domain decomposition. Boundary conditions are applied via an injected lambda.

Each iteration requires exchanging halo rows with neighbouring processes. Two communication variants are implemented and compared:

  • Blocking MPI_Sendrecv — simple and deadlock-free
  • Non-blocking MPI_Isend / MPI_Irecv — overlaps halo exchange with interior computation

Scaling results on the Leonardo HPC cluster (CINECA):

| Plot | Description |
| --- | --- |
| plots/pureMPI.png | Time vs. process count, blocking |
| plots/MPINonBlock.png | Time vs. process count, non-blocking |
| plots/hybrid8procs.png | Hybrid, 8 processes |
| plots/hybrid16procs.png | Hybrid, 16 processes |

Build:

```
cd 05_jacobi_mpi && mkdir build && cd build
cmake .. && make
mpirun -np 8 ./jacobi
```

6. Jacobi Solver — Hybrid MPI+OpenMP — 06_jacobi_hybrid/

Extension of the pure-MPI Jacobi solver to a two-level hybrid model. The MPI halo exchange is non-blocking (MPI_Isend/MPI_Irecv) and the inner Jacobi sweep is parallelised with #pragma omp parallel for collapse(2), allowing communication and interior computation to overlap at the thread level.

Benchmarked at 8 and 16 MPI processes with varying OMP thread counts per rank on Leonardo, illustrating the trade-off between MPI granularity and shared-memory efficiency.

Build:

```
cd 06_jacobi_hybrid && mkdir build && cd build
cmake .. && make
OMP_NUM_THREADS=4 mpirun -np 4 ./hybJacobi
```

7. Jacobi Solver with HDF5 Parallel I/O — 07_jacobi_hdf5/

Two variants of the Jacobi solver that checkpoint the solution field with parallel HDF5: each MPI process writes its subdomain to a shared file collectively, without gathering data to rank 0, enabling scalable I/O and post-processing/visualisation at scale.

| Sub-project | Description |
| --- | --- |
| 1d_jacobi/ | 1D domain decomposition, HDF5 checkpoint every 100 iterations |
| 2d_jacobi/ | 2D domain decomposition (process grid), HDF5 output per block |

I/O performance benchmarks in the plots/ directories show write throughput scaling with process count.

Dependencies: parallel HDF5 library.

Build:

```
cd 07_jacobi_hdf5/1d_jacobi && mkdir build && cd build
cmake .. && make
mpirun -np 8 ./jacobi_hdf5
```

8. 3D Diffusion Equation with FFTW-MPI — 08_fftw_diffusion/

Parallel solution of the 3D diffusion equation with spatially varying diffusivity, using spectral (Fourier) spatial derivatives and forward Euler time integration:

∂c/∂t = ∇·(D(r) ∇c)

The 3D domain is distributed with a 1D slab decomposition along the first dimension via fftw_mpi_local_size_3d. Forward and inverse FFTs use fftw_mpi_plan_dft_3d / fftw_mpi_execute_dft. Global reductions (MPI_Allreduce) maintain concentration normalisation at each diagnostic step.

The serial_reference/ subdirectory contains the equivalent single-process code.

Scaling benchmarks on grids of 256³, 512³, and 1024³ points:

| Plot | Grid |
| --- | --- |
| plots/n256.png | 256³ |
| plots/n512.png | 512³ |
| plots/n1024.png | 1024³ |

Dependencies: FFTW3 with MPI support.

Build:

```
cd 08_fftw_diffusion
make           # edit MPI_CC / FFTW_DIR variables as needed
mpirun -np 8 ./diffusion
```

Extras — extras/

  • alltoall.c — Demonstrates MPI_Alltoall to perform a distributed matrix transpose, sending equal-sized blocks between all process pairs.
  • par_identity.c — A lightweight C implementation of the parallel identity matrix using non-blocking point-to-point communication.

Repository Structure

```
.
├── 01_identity_matrix/        # MPI parallel identity matrix (blocking / non-blocking / binary I/O)
├── 02_matrix_multiplication/  # 1D MPI+OpenMP matrix multiply with scaling benchmarks
├── 03_cannon_algorithm/       # Cannon's 2D algorithm for matrix multiplication
├── 04_openmp_intro/           # OpenMP thread basics and matmul strategy comparison
├── 05_jacobi_mpi/             # Pure MPI Jacobi heat equation solver
├── 06_jacobi_hybrid/          # Hybrid MPI+OpenMP Jacobi solver
├── 07_jacobi_hdf5/            # Jacobi solver with parallel HDF5 checkpoint I/O
│   ├── 1d_jacobi/
│   └── 2d_jacobi/
├── 08_fftw_diffusion/         # FFTW-MPI 3D spectral diffusion solver
│   └── serial_reference/
└── extras/                    # MPI_Alltoall transpose, C identity matrix
```

Building

Projects 01–07 use CMake:

```
cd <project_dir>
mkdir build && cd build
cmake ..
make
```

Project 08 uses a plain Makefile — edit MPI_CC and FFTW_DIR to match your system installation.

Running on a Cluster

All MPI executables accept standard mpirun / srun invocations:

```
# OpenMPI / generic
mpirun -np <N> ./<executable>

# SLURM (e.g. Leonardo @ CINECA)
srun --mpi=pmix -n <N> ./<executable>

# Hybrid MPI+OpenMP
export OMP_NUM_THREADS=<T>
mpirun -np <N> --map-by socket:PE=<T> ./<executable>
```

Dependencies

| Library | Required by |
| --- | --- |
| MPI (OpenMPI ≥ 4 or MPICH ≥ 3) | all projects |
| OpenMP (GCC / Clang) | 02, 03, 04, 06 |
| CMake ≥ 3.21 | 01–07 |
| HDF5 with parallel support | 07 |
| FFTW3 with MPI support | 08 |

License

MIT — see LICENSE.
