A collection of annotated C examples covering single-core performance optimization, profiling workflows, and floating-point arithmetic. Developed as part of the Master in High Performance Computing (MHPC) curriculum at SISSA/ICTP, Trieste.
The goal is to build intuition for why code performs the way it does on modern hardware, from cache hierarchies and branch predictors to instruction-level parallelism and floating-point rounding.
| Example | What it demonstrates |
|---|---|
| ex_0__array_traversal | Row-major vs. column-major access; measuring bandwidth with PAPI |
| ex_1__memory_mountain | Sweeping stride and working-set size to visualize L1/L2/L3/DRAM bandwidth |
| ex_2__matrix_transpose | Naive transpose → blocked (tiled) transpose; benchmarked on a laptop, Leonardo HPC, and the LUMI supercomputer |
| ex_3__hot_and_cold_fields | Separating frequently accessed fields from cold data; pointer-chasing vs. contiguous layouts |
| ex_4__AoS_vs_SoA | Array of Structures vs. Struct of Arrays — sparse and dense variants |
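The row-major vs. column-major contrast from ex_0__array_traversal can be sketched as below; the function names are illustrative, not taken from the example sources.

```c
#include <stddef.h>

#define N 1024

/* Row-major traversal: the inner loop walks consecutive addresses,
 * so every byte of each fetched cache line is used. */
double sum_row_major(double a[N][N]) {
    double s = 0.0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Column-major traversal of the same C array: the inner loop strides
 * by N * sizeof(double) bytes, touching a new cache line per access. */
double sum_col_major(double a[N][N]) {
    double s = 0.0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            s += a[i][j];
    return s;
}
```

Both functions compute the same sum; only the memory access order, and hence the cache behavior, differs.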
Performance plots from three different architectures (including the LUMI supercomputer) are included in ex_2__matrix_transpose/matrix_transpose/.
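The blocking idea benchmarked in ex_2__matrix_transpose can be sketched as follows; the tile size and function name are illustrative assumptions, not taken from the example source.

```c
#include <stddef.h>

/* Tiled transpose: process the matrix in BLK x BLK tiles so that the
 * source rows and destination columns of one tile fit in cache at
 * the same time. */
#define BLK 32

void transpose_blocked(size_t n, const double *src, double *dst) {
    for (size_t ii = 0; ii < n; ii += BLK)
        for (size_t jj = 0; jj < n; jj += BLK)
            /* transpose one tile; the "&& < n" guards the matrix edge */
            for (size_t i = ii; i < ii + BLK && i < n; i++)
                for (size_t j = jj; j < jj + BLK && j < n; j++)
                    dst[j * n + i] = src[i * n + j];
}
```

The naive version is the same code with the two outer tile loops removed; it streams through src row-wise but writes dst column-wise, missing in cache on nearly every store for large n.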
| Example | What it demonstrates |
|---|---|
| ex_0__if_forest_in_loops | Dense conditionals in loops and how the compiler handles them |
| ex_1__branch_prediction | Cost of mispredicted branches; measured with gprof |
| ex_2__crosssort_arrays | Branch-free alternatives using bit manipulation |
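A typical branch-free bit-manipulation trick of the kind explored in ex_2__crosssort_arrays is a mask-based minimum; this is a generic sketch, not code from the example.

```c
#include <stdint.h>

/* Branchy version: the compiler may emit a conditional jump that the
 * branch predictor must guess on random data. */
int32_t min_branchy(int32_t a, int32_t b) {
    return (a < b) ? a : b;
}

/* Branch-free version: (a < b) evaluates to 0 or 1; negating it gives
 * an all-zeros or all-ones mask that selects a or b with no jump. */
int32_t min_branchfree(int32_t a, int32_t b) {
    int32_t mask = -(int32_t)(a < b);   /* 0x00000000 or 0xFFFFFFFF */
    return (a & mask) | (b & ~mask);
}
```

On unpredictable inputs the branch-free form avoids misprediction penalties; on predictable inputs the branchy form is usually at least as fast.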
| Example | What it demonstrates |
|---|---|
| ex_1__matrix_multiplication | Schoolbook → loop-reordered → block-tiled matmul; hardware counter data (CPE, IPC, L1 misses) included |
| ex_2__array_reduction | Array reduction with 2×1, 4×2, and 8×4 loop unrolling; exploiting instruction-level parallelism |
| ex_3__multiply_arrays | Multiply-accumulate with pipeline-aware formulations and vectorization hints |
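The multiple-accumulator idea behind the unrolled reduction variants can be sketched with a 4-way unroll; the function names are illustrative, not taken from ex_2__array_reduction.

```c
#include <stddef.h>

/* Baseline: one accumulator. Each add depends on the previous one,
 * so throughput is limited by the floating-point add latency. */
double sum_simple(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) s += a[i];
    return s;
}

/* Unroll by 4 with 4 independent accumulators: the four dependency
 * chains can overlap in the pipeline (instruction-level parallelism). */
double sum_unrolled(const double *a, size_t n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i;
    for (i = 0; i + 3 < n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++) s0 += a[i];   /* leftover elements */
    return (s0 + s1) + (s2 + s3);
}
```

Note that the two versions may round differently on general inputs, since floating-point addition is not associative; this is exactly the effect revisited in the summation example below.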
- Explicit software prefetch hints (__builtin_prefetch) to hide memory latency.
- Low-level memory layout: endianness, byte-level inspection, and stack-frame exploration with GDB.
- Compiler-visible vs. compiler-invisible inefficiencies — before and after -O3.
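The prefetch-hint item can be sketched with GCC/Clang's `__builtin_prefetch`; the prefetch distance here is a tunable assumption, not a value from the examples.

```c
#include <stddef.h>

/* Prefetch a[i + DIST] while working on a[i]. DIST is tuned so the
 * requested cache line arrives from memory roughly when the loop
 * reaches it; too small hides nothing, too large evicts useful data. */
#define DIST 16

double sum_with_prefetch(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + DIST < n)
            __builtin_prefetch(&a[i + DIST], 0, 3); /* read, high locality */
        s += a[i];
    }
    return s;
}
```

For a simple contiguous sweep like this the hardware prefetcher usually wins anyway; explicit hints pay off mainly on irregular, software-predictable access patterns.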
| File | What it demonstrates |
|---|---|
| debug/gdb_try_breaks.c | GDB tutorial: breakpoints, watchpoints, and backtraces on a nested function call stack |
| profiling/Nbody/Nbody.c | N-body gravitational simulation with AoS and SoA layouts (compile-time switch via -DUSE_SOA); designed for gprof/perf/callgrind profiling |
| profiling/Nbody/Nbody.scatter.c | Scatter-gather access-pattern variant |
| profiling/Mandelbrot/Mandelbrot.tasks.c | Mandelbrot set generator as a compute-intensive profiling target; outputs PNG via stb_image_write.h |
| gprof2dot.py | Converts gprof output to a call-graph visualization (Graphviz) |
The N-body example is particularly useful: compiling with and without -DUSE_SOA produces measurably different performance, illustrating how data layout drives real-world speedup.
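The layout switch can be sketched like this; the field and type names are illustrative, not the ones used in Nbody.c.

```c
#include <stddef.h>

#define NBODY 4096

/* Array of Structures: each particle's fields sit together, so a loop
 * that reads only x drags the unused fields through the cache too. */
typedef struct { double x, y, z, vx, vy, vz, mass; } BodyAoS;

/* Structure of Arrays: each field is a contiguous array, so a loop
 * over x alone streams through memory using full cache lines. */
typedef struct {
    double x[NBODY], y[NBODY], z[NBODY];
    double vx[NBODY], vy[NBODY], vz[NBODY];
    double mass[NBODY];
} BodiesSoA;

double sum_x_aos(const BodyAoS *b, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) s += b[i].x;
    return s;
}

double sum_x_soa(const BodiesSoA *b, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) s += b->x[i];
    return s;
}
```

The SoA layout is also what auto-vectorizers prefer: unit-stride loads map directly onto SIMD registers.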
Demonstrates floating-point precision loss and the Kahan compensated summation algorithm. Three approaches are compared:
- Naive `float` accumulator
- `double` accumulator (wider type)
- Kahan algorithm (error compensation)
Results vary with summation order (sorted/unsorted, forward/reverse), making this a concrete illustration of IEEE 754 non-associativity.
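Kahan's compensated summation keeps a running correction term alongside the sum; a minimal sketch (the function name is illustrative):

```c
#include <stddef.h>

/* Kahan compensated summation with a float accumulator: c captures
 * the low-order bits lost when a small term is added to a large sum,
 * and feeds them back into the next iteration. */
float kahan_sum(const float *a, size_t n) {
    float sum = 0.0f, c = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float y = a[i] - c;   /* apply the stored correction */
        float t = sum + y;    /* big + small: low bits of y are lost */
        c = (t - sum) - y;    /* recover the lost part (negated) */
        sum = t;
    }
    return sum;
}
```

Note this relies on strict IEEE evaluation: compiling with -ffast-math lets the compiler reassociate the expressions and optimize the compensation away.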
| Tool | Purpose |
|---|---|
| GCC / Clang / ICX | C compilation; examples use -O0, -O2, -O3, -march=native |
| PAPI | Hardware performance counters (optional; examples fall back gracefully) |
| Python 3 + matplotlib | Performance visualization scripts (plotmountain.py, trans.py, tiling.py, compare.py) |
| GDB | Debugging examples |
| gprof / perf / Valgrind | Profiling examples |
PAPI is optional — examples that use it include mypapi.h, which wraps the PAPI calls and lets the code compile without the library when NO_PAPI is defined.
Most examples compile with a single command, e.g.:
```shell
gcc -O3 -march=native -o nbody Nbody.c -lm
gcc -O3 -DUSE_SOA -o nbody_soa Nbody.c -lm   # SoA layout variant
gcc -O3 -o matmul matmul.c
```

The memory mountain has a Makefile with targets for GCC, ICX, and Clang:

```shell
cd Single-Core-Optimization/Examples/cache/ex_1__memory_mountain
make
```

Master in High Performance Computing (MHPC)
SISSA / ICTP / University of Trieste — 2025/26
Course: P1.3 — Foundations of HPC