
vtune-experiments (Manjaro)

Small C++23 + OpenMP matrix-multiplication experiments intended for profiling with Intel VTune. The repo keeps the same simple layout style as omp-experiments: headers under include/, implementations under src/, a root Makefile, and small helper scripts for build, format, and run.

Experiments

The project builds three executables:

  • mult-sync: intentionally synchronization-heavy version
  • mult-redux: reduction-based version with an OpenMP region inside each output cell
  • mult-good: cleaner outer-loop parallel version

All three use square matrices of size 1000 x 1000 and the same deterministic seed.

Requirements (Manjaro)

Option A (recommended): GCC toolchain

sudo pacman -S --needed base-devel gcc make

Option B: Clang toolchain

sudo pacman -S --needed base-devel clang make
sudo pacman -S --needed libomp

With GCC, OpenMP support is built in and enabled via -fopenmp. With Clang, the separate libomp runtime package is typically required in addition to the flag.

Build

make -j

Useful overrides:

make -j CXX=clang++
make -j CXXFLAGS='-O3 -g -std=c++23 -Wall -Wextra -Wpedantic -Iinclude -fopenmp'

Clean:

make clean

Run

Run everything:

./run.sh

Or select one executable:

MODE=mult-sync  ./run.sh
MODE=mult-redux ./run.sh
MODE=mult-good  ./run.sh

Control the number of threads:

OMP_NUM_THREADS=8 ./mult-good
THREADS=8 MODE=mult-good ./run.sh

Format

./format.sh

VTune examples

Typical collection flow:

./build.sh
vtune -collect hotspots -- ./mult-good
vtune -collect hotspots -- ./mult-redux
vtune -collect hotspots -- ./mult-sync

You can also keep the results in dedicated folders:

vtune -collect hotspots -result-dir vtune_results/mult-good -- ./mult-good
vtune -collect hotspots -result-dir vtune_results/mult-redux -- ./mult-redux
vtune -collect hotspots -result-dir vtune_results/mult-sync -- ./mult-sync

Project layout

include/demo/
  matrix_common.hpp
  mult_good.hpp
  mult_redux.hpp
  mult_sync.hpp
src/
  matrix_common.cpp
  mult_good.cpp
  mult_good_main.cpp
  mult_redux.cpp
  mult_redux_main.cpp
  mult_sync.cpp
  mult_sync_main.cpp
Makefile
build.sh
format.sh
run.sh

Answers to the Threading Questions

1) Which problems can we identify in mult_sync.cpp and mult_redux.cpp?

mult-sync is over-synchronized. It uses single regions to reset and publish the shared accumulator, and it forces synchronization around each output-cell computation. This introduces barriers and serialized sections in the hottest part of the algorithm.

mult-redux removes some of that explicit synchronization, but it still parallelizes the inner dot product for every output cell. That means the program repeatedly enters a parallel reduction on very fine-grained work. The result is high overhead from creating and synchronizing OpenMP work for each (i, j) pair.

2) Why is the Effective CPU Utilization of mult-sync so low?

The threads spend a large fraction of their time waiting instead of computing. In mult-sync, the use of single sections and repeated barriers means only one thread is doing useful work at some points while the others are idle. VTune therefore reports low Effective CPU Utilization because the parallel region is dominated by synchronization rather than sustained computation.

3) Which OpenMP directives have high overhead in mult-sync? What is the main problem?

The costly directives are the single regions and the implied synchronization around them, along with the repeated barrier behavior in the inner computation pattern. The main problem is the design choice of coordinating threads for every output cell instead of giving each thread an independent chunk of work. The implementation is parallel in syntax, but much of that parallelism is lost to coordination overhead.

4) Where is the synchronization point in mult-redux? Why does the program spend so much time on it?

The synchronization point is the reduction on the inner k loop. For each output cell C(i, j), all participating threads must combine their partial sums before the final value can be written. The program spends so much time there because this reduction happens once per cell, so a large number of very small parallel reductions are executed. The overhead of synchronizing and combining partial results becomes comparable to, or larger than, the arithmetic work itself.

5) Why is mult-good more efficient?

mult-good parallelizes the outer loop so that each thread owns a distinct subset of output cells C(i, j). For each cell, the accumulator acc is declared inside the loop body, so it is naturally private to the thread computing that cell. The thread performs the entire sum over k locally and then writes the final result directly to C(i, j).

Because of that ownership model, mult-good does not need:

  • single regions to reset or publish acc
  • barriers for each cell
  • atomics on C(i, j)
  • reductions on acc in the outer-loop-parallel design

The work is independent by construction, so the implementation avoids unnecessary synchronization and lets threads spend most of their time on arithmetic. That is why VTune shows a much cleaner and more efficient execution profile for mult-good.

Conclusion

The key takeaway from these experiments is that we do not need extra synchronization for the "good" implementation.

As explained above, outer-loop parallelization gives each thread exclusive ownership of a subset of output cells C(i, j), with the accumulator acc private to the loop body; each thread finishes its dot product over k locally and writes the final value to C(i, j) exactly once.

Those properties already give us the required correctness. Adding single, barriers, atomics, or reductions would only synchronize threads around work that is already independent. In this setup, the cleanest approach is also the fastest one: keep acc local, finish the dot product privately, and store the result once.

This is exactly why mult-good is better than the synchronization-heavy version. It preserves the same mathematical result while avoiding unnecessary coordination between threads, which reduces barrier time, lowers runtime overhead, and gives VTune a much cleaner execution profile.

About

Performance profiling experiments using Intel VTune to analyze hotspots, synchronization overhead, and thread behavior in HPC workloads.
