Small C++23 + OpenMP matrix-multiplication experiments intended for profiling with Intel VTune.
The repo keeps the same simple layout style as omp-experiments: headers under include/,
implementations under src/, a root Makefile, and small helper scripts for build, format,
and run.
The project builds three executables:
- mult-sync: intentionally synchronization-heavy version
- mult-redux: reduction-based version with an OpenMP region inside each output cell
- mult-good: cleaner outer-loop parallel version
All three use square matrices of size 1000 x 1000 and the same deterministic seed.
Install a toolchain:

```sh
sudo pacman -S --needed base-devel gcc make
```

or, for Clang:

```sh
sudo pacman -S --needed base-devel clang make
sudo pacman -S --needed libomp
```

With GCC, OpenMP support is usually available directly through `-fopenmp`. With Clang, `libomp` is often needed as well.
Build:

```sh
make -j
```

Useful overrides:

```sh
make -j CXX=clang++
make -j CXXFLAGS='-O3 -g -std=c++23 -Wall -Wextra -Wpedantic -Iinclude -fopenmp'
```

Clean:

```sh
make clean
```

Run everything:

```sh
./run.sh
```

Or select one executable:

```sh
MODE=mult-sync ./run.sh
MODE=mult-redux ./run.sh
MODE=mult-good ./run.sh
```

Control the number of threads:

```sh
OMP_NUM_THREADS=8 ./mult-good
THREADS=8 MODE=mult-good ./run.sh
```

Format the sources:

```sh
./format.sh
```

Typical collection flow:
```sh
./build.sh
vtune -collect hotspots -- ./mult-good
vtune -collect hotspots -- ./mult-redux
vtune -collect hotspots -- ./mult-sync
```

You can also keep the results in dedicated folders:

```sh
vtune -collect hotspots -result-dir vtune_results/mult-good -- ./mult-good
vtune -collect hotspots -result-dir vtune_results/mult-redux -- ./mult-redux
vtune -collect hotspots -result-dir vtune_results/mult-sync -- ./mult-sync
```

Repository layout:

```
include/demo/
  matrix_common.hpp
  mult_good.hpp
  mult_redux.hpp
  mult_sync.hpp
src/
  matrix_common.cpp
  mult_good.cpp
  mult_good_main.cpp
  mult_redux.cpp
  mult_redux_main.cpp
  mult_sync.cpp
  mult_sync_main.cpp
Makefile
build.sh
format.sh
run.sh
```
mult-sync is over-synchronized. It uses `single` regions to reset and publish the shared accumulator, and it forces synchronization around every output-cell computation. This introduces barriers and serialized sections into the hottest part of the algorithm.
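The pattern looks roughly like the sketch below. This is a hypothetical reconstruction of the behavior described above, not the repo's exact code; the function name, flat row-major storage, and variable names are assumptions.

```cpp
#include <vector>

// Hypothetical reconstruction of the over-synchronized pattern: a shared
// accumulator `acc` is reset and published via `single` regions, with a
// worksharing reduction (and its implicit barrier) for every output cell.
void mult_sync_like(const std::vector<double>& A,
                    const std::vector<double>& B,
                    std::vector<double>& C, int n) {
    double acc = 0.0; // shared across the team: the root of the problem
    #pragma omp parallel
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            #pragma omp single
            acc = 0.0;                    // one thread resets, the rest wait
            #pragma omp for reduction(+ : acc)
            for (int k = 0; k < n; ++k)   // implicit barrier at loop end
                acc += A[i * n + k] * B[k * n + j];
            #pragma omp single
            C[i * n + j] = acc;           // one thread publishes, rest wait again
        }
    }
}
```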
mult-redux removes some of that explicit synchronization, but it still parallelizes the inner dot product for every output cell. That means the program repeatedly enters a parallel reduction on very fine-grained work. The result is high overhead from creating and synchronizing OpenMP work for each `(i, j)` pair.
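The shape of that pattern, sketched under the same illustrative assumptions (flat row-major storage, hypothetical names):

```cpp
#include <vector>

// Sketch of the fine-grained pattern described for mult-redux: a fresh
// parallel reduction is launched for every output cell, so with n = 1000
// the runtime sets up and synchronizes one million tiny reductions.
void mult_redux_like(const std::vector<double>& A,
                     const std::vector<double>& B,
                     std::vector<double>& C, int n) {
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            double acc = 0.0;
            #pragma omp parallel for reduction(+ : acc)
            for (int k = 0; k < n; ++k)
                acc += A[i * n + k] * B[k * n + j];
            C[i * n + j] = acc;
        }
    }
}
```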
The threads spend a large fraction of their time waiting instead of computing. In mult-sync, the use of `single` sections and repeated barriers means only one thread is doing useful work at some points while the others sit idle. VTune therefore reports low Effective CPU Utilization, because the parallel region is dominated by synchronization rather than sustained computation.
The costly directives are the `single` regions and the implicit barriers that follow them, along with the repeated barrier behavior in the inner computation pattern. The core problem is the design choice of coordinating threads for every output cell instead of giving each thread an independent chunk of work. The implementation is parallel in syntax, but much of that parallelism is lost to coordination overhead.
The synchronization point is the reduction on the inner `k` loop. For each output cell `C(i, j)`, all participating threads must combine their partial sums before the final value can be written. The program spends so much time there because this reduction happens once per cell, so a large number of very small parallel reductions are executed. The overhead of synchronizing and combining partial results becomes comparable to, or larger than, the arithmetic work itself.
mult-good parallelizes the outer loop so that each thread owns a distinct subset of output cells `C(i, j)`. For each cell, the accumulator `acc` is declared inside the loop body, so it is naturally private to the thread computing that cell. The thread performs the entire sum over `k` locally and then writes the final result directly to `C(i, j)`.
Because of that ownership model, mult-good does not need:
- `single` regions to reset or publish `acc`
- barriers for each cell
- atomics on `C(i, j)`
- reductions on `acc` in the outer-loop-parallel design
The work is independent by construction, so the implementation avoids unnecessary synchronization
and lets threads spend most of their time on arithmetic. That is why VTune shows a much cleaner and
more efficient execution profile for mult-good.
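A minimal sketch of that design, under the same illustrative assumptions as the earlier snippets:

```cpp
#include <vector>

// Sketch of the outer-loop ownership model: each thread gets whole rows of
// C, the accumulator is a local variable (private by construction), and the
// only write to shared memory is the final store into C(i, j).
void mult_good_like(const std::vector<double>& A,
                    const std::vector<double>& B,
                    std::vector<double>& C, int n) {
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            double acc = 0.0;            // naturally private to this thread
            for (int k = 0; k < n; ++k)
                acc += A[i * n + k] * B[k * n + j];
            C[i * n + j] = acc;          // one final write, no coordination
        }
    }
}
```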
The key takeaway from these experiments is that the "good" implementation needs no extra synchronization at all. Once the outer loop is parallelized, each thread owns a distinct subset of output cells `C(i, j)`, the accumulator `acc` lives inside the loop body and is therefore naturally private, and every thread finishes its sum over `k` locally before writing the final value directly to `C(i, j)`.
Those properties already give us the required correctness. Adding `single` regions, barriers, atomics, or reductions would only synchronize threads around work that is already independent. In this setup, the cleanest approach is also the fastest one: keep `acc` local, finish the dot product privately, and store the result once.
This is exactly why mult-good is better than the synchronization-heavy version. It preserves the
same mathematical result while avoiding unnecessary coordination between threads, which reduces
barrier time, lowers runtime overhead, and gives VTune a much cleaner execution profile.