A collection of annotated C examples covering single-core performance optimization, profiling workflows, and floating-point arithmetic. Developed as part of the Master in High Performance Computing (MHPC) curriculum at SISSA/ICTP, Trieste.
The goal is to build intuition for why code performs the way it does on modern hardware, from cache hierarchies and branch predictors to instruction-level parallelism and floating-point rounding.
| Example | What it demonstrates |
|---|---|
| ex_0__array_traversal | Row-major vs. column-major access; measuring bandwidth with PAPI |
| ex_1__memory_mountain | Sweeping stride and working-set size to visualize L1/L2/L3/DRAM bandwidth |
| ex_2__matrix_transpose | Naive transpose → blocked (tiled) transpose; benchmarked on a laptop, Leonardo HPC, and the LUMI supercomputer |
| ex_3__hot_and_cold_fields | Separating frequently accessed fields from cold data; pointer-chasing vs. contiguous layouts |
| ex_4__AoS_vs_SoA | Array of Structures vs. Struct of Arrays — sparse and dense variants |
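The row-major vs. column-major contrast from ex_0__array_traversal can be sketched as below; the function names are illustrative, not taken from the example sources.

```c
#include <stddef.h>

#define N 1024

/* Row-major traversal: the inner loop walks consecutive addresses,
 * so every byte of each fetched cache line is used. */
double sum_row_major(double a[N][N]) {
    double s = 0.0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Column-major traversal of the same C array: the inner loop strides
 * by N * sizeof(double) bytes, touching a new cache line per access. */
double sum_col_major(double a[N][N]) {
    double s = 0.0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            s += a[i][j];
    return s;
}
```

Both functions compute the same sum; only the memory access order, and hence the cache behavior, differs.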
Performance plots from three different architectures (including the LUMI supercomputer) are included in ex_2__matrix_transpose/matrix_transpose/.
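The blocking idea benchmarked in ex_2__matrix_transpose can be sketched as follows; the tile size and function name are illustrative assumptions, not taken from the example source.

```c
#include <stddef.h>

/* Tiled transpose: process the matrix in BLK x BLK tiles so that the
 * source rows and destination columns of one tile fit in cache at
 * the same time. */
#define BLK 32

void transpose_blocked(size_t n, const double *src, double *dst) {
    for (size_t ii = 0; ii < n; ii += BLK)
        for (size_t jj = 0; jj < n; jj += BLK)
            /* transpose one tile; the "&& < n" guards the matrix edge */
            for (size_t i = ii; i < ii + BLK && i < n; i++)
                for (size_t j = jj; j < jj + BLK && j < n; j++)
                    dst[j * n + i] = src[i * n + j];
}
```

The naive version is the same code with the two outer tile loops removed; it streams through src row-wise but writes dst column-wise, missing in cache on nearly every store for large n.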
| Example | What it demonstrates |
|---|---|
| ex_0__if_forest_in_loops | Dense conditionals in loops and how the compiler handles them |
| ex_1__branch_prediction | Cost of mispredicted branches; measured with gprof |
| ex_2__crosssort_arrays | Branch-free alternatives using bit manipulation |
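A typical branch-free bit-manipulation trick of the kind explored in ex_2__crosssort_arrays is a mask-based minimum; this is a generic sketch, not code from the example.

```c
#include <stdint.h>

/* Branchy version: the compiler may emit a conditional jump that the
 * branch predictor must guess on random data. */
int32_t min_branchy(int32_t a, int32_t b) {
    return (a < b) ? a : b;
}

/* Branch-free version: (a < b) evaluates to 0 or 1; negating it gives
 * an all-zeros or all-ones mask that selects a or b with no jump. */
int32_t min_branchfree(int32_t a, int32_t b) {
    int32_t mask = -(int32_t)(a < b);   /* 0x00000000 or 0xFFFFFFFF */
    return (a & mask) | (b & ~mask);
}
```

On unpredictable inputs the branch-free form avoids misprediction penalties; on predictable inputs the branchy form is usually at least as fast.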
| Example | What it demonstrates |
|---|---|
| ex_1__matrix_multiplication | Schoolbook → loop-reordered → block-tiled matmul; hardware counter data (CPE, IPC, L1 misses) included |
| ex_2__array_reduction | Array reduction with 2×1, 4×2, and 8×4 loop unrolling; exploiting instruction-level parallelism |
| ex_3__multiply_arrays | Multiply-accumulate with pipeline-aware formulations and vectorization hints |
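The multiple-accumulator idea behind the unrolled reduction variants can be sketched with a 4-way unroll; the function names are illustrative, not taken from ex_2__array_reduction.

```c
#include <stddef.h>

/* Baseline: one accumulator. Each add depends on the previous one,
 * so throughput is limited by the floating-point add latency. */
double sum_simple(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) s += a[i];
    return s;
}

/* Unroll by 4 with 4 independent accumulators: the four dependency
 * chains can overlap in the pipeline (instruction-level parallelism). */
double sum_unrolled(const double *a, size_t n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i;
    for (i = 0; i + 3 < n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++) s0 += a[i];   /* leftover elements */
    return (s0 + s1) + (s2 + s3);
}
```

Note that the two versions may round differently on general inputs, since floating-point addition is not associative; this is exactly the effect revisited in the summation example below.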
- Explicit software prefetch hints (__builtin_prefetch) to hide memory latency.
- Low-level memory layout: endianness, byte-level inspection, and stack-frame exploration with GDB.
- Compiler-visible vs. compiler-invisible inefficiencies — before and after -O3.
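The prefetch-hint item can be sketched with GCC/Clang's `__builtin_prefetch`; the prefetch distance here is a tunable assumption, not a value from the examples.

```c
#include <stddef.h>

/* Prefetch a[i + DIST] while working on a[i]. DIST is tuned so the
 * requested cache line arrives from memory roughly when the loop
 * reaches it; too small hides nothing, too large evicts useful data. */
#define DIST 16

double sum_with_prefetch(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + DIST < n)
            __builtin_prefetch(&a[i + DIST], 0, 3); /* read, high locality */
        s += a[i];
    }
    return s;
}
```

For a simple contiguous sweep like this the hardware prefetcher usually wins anyway; explicit hints pay off mainly on irregular, software-predictable access patterns.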
| File | What it demonstrates |
|---|---|
| debug/gdb_try_breaks.c | GDB tutorial: breakpoints, watchpoints, and backtraces on a nested function call stack |
| profiling/Nbody/Nbody.c | N-body gravitational simulation with AoS and SoA layouts (compile-time switch via -DUSE_SOA); designed for gprof/perf/callgrind profiling |
| profiling/Nbody/Nbody.scatter.c | Scatter-gather access-pattern variant |
| profiling/Mandelbrot/Mandelbrot.tasks.c | Mandelbrot set generator as a compute-intensive profiling target; outputs PNG via stb_image_write.h |
| gprof2dot.py | Converts gprof output to a call-graph visualization (Graphviz) |
The N-body example is particularly useful: compiling with and without -DUSE_SOA produces measurably different performance, illustrating how data layout drives real-world speedup.
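The layout switch can be sketched like this; the field and type names are illustrative, not the ones used in Nbody.c.

```c
#include <stddef.h>

#define NBODY 4096

/* Array of Structures: each particle's fields sit together, so a loop
 * that reads only x drags the unused fields through the cache too. */
typedef struct { double x, y, z, vx, vy, vz, mass; } BodyAoS;

/* Structure of Arrays: each field is a contiguous array, so a loop
 * over x alone streams through memory using full cache lines. */
typedef struct {
    double x[NBODY], y[NBODY], z[NBODY];
    double vx[NBODY], vy[NBODY], vz[NBODY];
    double mass[NBODY];
} BodiesSoA;

double sum_x_aos(const BodyAoS *b, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) s += b[i].x;
    return s;
}

double sum_x_soa(const BodiesSoA *b, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) s += b->x[i];
    return s;
}
```

The SoA layout is also what auto-vectorizers prefer: unit-stride loads map directly onto SIMD registers.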
Demonstrates floating-point precision loss and the Kahan compensated summation algorithm. Three approaches are compared:
- Naive `float` accumulator
- `double` accumulator (wider type)
- Kahan algorithm (error compensation)
Results vary with summation order (sorted/unsorted, forward/reverse), making this a concrete illustration of IEEE 754 non-associativity.
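Kahan's compensated summation keeps a running correction term alongside the sum; a minimal sketch (the function name is illustrative):

```c
#include <stddef.h>

/* Kahan compensated summation with a float accumulator: c captures
 * the low-order bits lost when a small term is added to a large sum,
 * and feeds them back into the next iteration. */
float kahan_sum(const float *a, size_t n) {
    float sum = 0.0f, c = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float y = a[i] - c;   /* apply the stored correction */
        float t = sum + y;    /* big + small: low bits of y are lost */
        c = (t - sum) - y;    /* recover the lost part (negated) */
        sum = t;
    }
    return sum;
}
```

Note this relies on strict IEEE evaluation: compiling with -ffast-math lets the compiler reassociate the expressions and optimize the compensation away.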
| Tool | Purpose |
|---|---|
| GCC / Clang / ICX | C compilation; examples use -O0, -O2, -O3, -march=native |
| PAPI | Hardware performance counters (optional; examples fall back gracefully) |
| Python 3 + matplotlib | Performance visualization scripts (plotmountain.py, trans.py, tiling.py, compare.py) |
| GDB | Debugging examples |
| gprof / perf / Valgrind | Profiling examples |
PAPI is optional — examples that use it include mypapi.h, which wraps the PAPI calls and lets the code compile without the library when NO_PAPI is defined.
Most examples compile with a single command, e.g.:
```shell
gcc -O3 -march=native -o nbody Nbody.c -lm
gcc -O3 -DUSE_SOA -o nbody_soa Nbody.c -lm   # SoA layout variant
gcc -O3 -o matmul matmul.c
```

The memory mountain has a Makefile with targets for GCC, ICX, and Clang:

```shell
cd Single-Core-Optimization/Examples/cache/ex_1__memory_mountain
make
```

Master in High Performance Computing (MHPC)
SISSA / ICTP / University of Trieste — 2025/26
Course: P1.3 — Foundations of HPC