A collection of HPC implementations covering MPI, OpenMP, and hybrid parallelism, developed as part of the P1.5 Parallel Computing course in the HPC Master programme at SISSA/ICTP (2025-26).
| Technology | Usage |
|---|---|
| MPI | Distributed-memory communication (point-to-point, collective, non-blocking) |
| OpenMP | Shared-memory thread parallelism |
| Hybrid MPI+OpenMP | Two-level parallelism targeting multi-core cluster nodes |
| HDF5 (parallel) | Scalable parallel I/O for large datasets |
| FFTW3-MPI | Distributed 3D FFTs for spectral PDE solving |
| C++20 / C | Implementation languages |
| CMake / Make | Build systems |
An N×N identity matrix is distributed in block rows across MPI processes using a templated `CMatrix<T>` class. The project explores three communication strategies for collecting and printing the matrix on the root process:
- Blocking `MPI_Send`/`MPI_Recv`
- Non-blocking `MPI_Isend`/`MPI_Irecv` with double buffering (overlapping communication and printing)
- Binary file I/O via parallel writes and an `MPI_Recv` gather
Load imbalance from sizes not divisible by the process count is handled via the remainder term, distributing the extra rows one each to the lowest-ranked processes.
Build:

```bash
cd 01_identity_matrix && mkdir build && cd build
cmake .. && make
mpirun -np 4 ./idMat
```

Parallel dense matrix multiplication (A × B = C) with a 1D block-row data layout using MPI collectives (`MPI_Scatterv`, `MPI_Allgather`). The inner loop is further parallelised with OpenMP to exploit shared memory within each node (hybrid MPI+OpenMP).
Performance is benchmarked against the serial baseline on a 50 000 × 50 000 matrix across multiple node counts and thread configurations:
| Configuration | Plot |
|---|---|
| Pure MPI scaling | data/PureMPI.png |
| Pure MPI efficiency | data/efficiency.png |
| Hybrid 4 tasks/node | data/hybrid4npes.png |
| Hybrid 8 tasks/node | data/hybrid8npes.png |
| Hybrid efficiency | data/hybrideff.png |
| Cannon vs 1D comparison | data/cannonAlg.png |
A Google Test suite (gtest.cpp) validates correctness by comparing the parallel result against a serial reference multiplication.
Build:

```bash
cd 02_matrix_multiplication && mkdir build && cd build
cmake .. && make
mpirun -np 4 ./matMul
```

Implementation of Cannon's algorithm for matrix multiplication on a 2D process grid. Each process owns a square subblock; blocks are cyclically shifted along rows and columns to compute the product, so each process stores only O(N²/P) elements rather than the full O(N²) operand gathered in the 1D layout.
Requires the number of MPI processes to be a perfect square. Supports hybrid execution (OpenMP inner loop). Benchmark scripts compare Cannon vs. 1D distribution performance and efficiency.
Build:

```bash
cd 03_cannon_algorithm && mkdir build && cd build
cmake .. && make
mpirun -np 16 ./cannon   # P must be a perfect square
```

Two standalone programs illustrating OpenMP fundamentals:
- `hello_threads.cpp` — spawns a fixed number of threads and prints a greeting from each, demonstrating `#pragma omp parallel` and `omp_get_thread_num()`.
- `matmul_omp.c` — benchmarks five OpenMP strategies for the inner loop of matrix multiplication: `collapse(3)` with `atomic`, `collapse(3)` with `reduction`, `collapse(2)`, `collapse(1)`, and `collapse(3)` with `critical`, comparing wall-clock time across strategies.
Parallel iterative Jacobi solver for the 2D steady-state heat equation on a square domain, using a 1D (row-wise) domain decomposition. Boundary conditions are applied via an injected lambda.
Each iteration requires exchanging halo rows with neighbouring processes. Two communication variants are implemented and compared:
- Blocking `MPI_Sendrecv` — simple and deadlock-free
- Non-blocking `MPI_Isend`/`MPI_Irecv` — overlaps halo exchange with interior computation
Scaling results on the Leonardo HPC cluster (CINECA):
| Plot | Description |
|---|---|
| plots/pureMPI.png | Time vs. process count, blocking |
| plots/MPINonBlock.png | Time vs. process count, non-blocking |
| plots/hybrid8procs.png | Hybrid, 8 processes |
| plots/hybrid16procs.png | Hybrid, 16 processes |
Build:

```bash
cd 05_jacobi_mpi && mkdir build && cd build
cmake .. && make
mpirun -np 8 ./jacobi
```

Extension of the pure-MPI Jacobi solver to a two-level hybrid model. The MPI halo exchange is non-blocking (`MPI_Isend`/`MPI_Irecv`) and the inner Jacobi sweep is parallelised with `#pragma omp parallel for collapse(2)`, allowing communication and interior computation to overlap at the thread level.
Benchmarked at 8 and 16 MPI processes with varying OMP thread counts per rank on Leonardo, illustrating the trade-off between MPI granularity and shared-memory efficiency.
Build:

```bash
cd 06_jacobi_hybrid && mkdir build && cd build
cmake .. && make
OMP_NUM_THREADS=4 mpirun -np 4 ./hybJacobi
```

Two variants of the Jacobi solver that checkpoint the solution field to HDF5 files using parallel HDF5: each MPI process writes its subdomain collectively without gathering data to rank 0, enabling scalable I/O and post-processing/visualisation at scale.
| Sub-project | Description |
|---|---|
| 1d_jacobi/ | 1D domain decomposition, HDF5 checkpoint every 100 iterations |
| 2d_jacobi/ | 2D domain decomposition (process grid), HDF5 output per block |
I/O performance benchmarks in the plots/ directories show write throughput scaling with process count.
Dependencies: parallel HDF5 library.
Build:

```bash
cd 07_jacobi_hdf5/1d_jacobi && mkdir build && cd build
cmake .. && make
mpirun -np 8 ./jacobi_hdf5
```

Parallel solution of the 3D diffusion equation with spatially varying diffusivity, using spectral (Fourier) spatial derivatives and forward Euler time integration:
∂c/∂t = ∇·(D(r) ∇c)
The 3D domain is distributed with a 1D slab decomposition along the first dimension via fftw_mpi_local_size_3d. Forward and inverse FFTs use fftw_mpi_plan_dft_3d / fftw_mpi_execute_dft. Global reductions (MPI_Allreduce) maintain concentration normalisation at each diagnostic step.
The serial_reference/ subdirectory contains the equivalent single-process code.
Scaling benchmarks on grids of 256³, 512³, and 1024³ points:
| Plot | Grid |
|---|---|
| plots/n256.png | 256³ |
| plots/n512.png | 512³ |
| plots/n1024.png | 1024³ |
Dependencies: FFTW3 with MPI support.
Build:

```bash
cd 08_fftw_diffusion
make         # edit MPI_CC / FFTW_DIR variables as needed
mpirun -np 8 ./diffusion
```

- `alltoall.c` — demonstrates `MPI_Alltoall` performing a distributed matrix transpose, sending equal-sized blocks between all process pairs.
- `par_identity.c` — a lightweight C implementation of the parallel identity matrix using non-blocking point-to-point communication.
```
.
├── 01_identity_matrix/       # MPI parallel identity matrix (blocking / non-blocking / binary I/O)
├── 02_matrix_multiplication/ # 1D MPI+OpenMP matrix multiply with scaling benchmarks
├── 03_cannon_algorithm/      # Cannon's 2D algorithm for matrix multiplication
├── 04_openmp_intro/          # OpenMP thread basics and matmul strategy comparison
├── 05_jacobi_mpi/            # Pure MPI Jacobi heat equation solver
├── 06_jacobi_hybrid/         # Hybrid MPI+OpenMP Jacobi solver
├── 07_jacobi_hdf5/           # Jacobi solver with parallel HDF5 checkpoint I/O
│   ├── 1d_jacobi/
│   └── 2d_jacobi/
├── 08_fftw_diffusion/        # FFTW-MPI 3D spectral diffusion solver
│   └── serial_reference/
└── extras/                   # MPI_Alltoall transpose, C identity matrix
```
Projects 01–07 use CMake:

```bash
cd <project_dir>
mkdir build && cd build
cmake ..
make
```

Project 08 uses a plain Makefile — edit `MPI_CC` and `FFTW_DIR` to match your system installation.
All MPI executables accept standard mpirun / srun invocations:

```bash
# OpenMPI / generic
mpirun -np <N> ./<executable>

# SLURM (e.g. Leonardo @ CINECA)
srun --mpi=pmix -n <N> ./<executable>

# Hybrid MPI+OpenMP
export OMP_NUM_THREADS=<T>
mpirun -np <N> --map-by socket:PE=<T> ./<executable>
```

| Library | Required by |
|---|---|
| MPI (OpenMPI ≥ 4 or MPICH ≥ 3) | all projects |
| OpenMP (GCC / Clang) | 02, 03, 04, 06 |
| CMake ≥ 3.21 | 01–07 |
| HDF5 with parallel support | 07 |
| FFTW3 with MPI support | 08 |
MIT — see LICENSE.