Gabriel-Pedde/Single_Core_Optimization

Single-Core Performance Optimization

A collection of annotated C examples covering single-core performance optimization, profiling workflows, and floating-point arithmetic. Developed as part of the Master in High Performance Computing (MHPC) curriculum at SISSA/ICTP, Trieste.

The goal is to build intuition for why code performs the way it does on modern hardware, from cache hierarchies and branch predictors to instruction-level parallelism and floating-point rounding.


Contents

| Example | What it demonstrates |
|---|---|
| ex_0__array_traversal | Row-major vs. column-major access; measuring bandwidth with PAPI |
| ex_1__memory_mountain | Sweeping stride and working-set size to visualize L1/L2/L3/DRAM bandwidth |
| ex_2__matrix_transpose | Naive transpose → blocked (tiled) transpose; benchmarked on a laptop, the Leonardo HPC system, and the LUMI supercomputer |
| ex_3__hot_and_cold_fields | Separating frequently accessed fields from cold data; pointer-chasing vs. contiguous layouts |
| ex_4__AoS_vs_SoA | Array of Structures vs. Structure of Arrays, in sparse and dense variants |

Performance plots from three different architectures (including the LUMI supercomputer) are included in ex_2__matrix_transpose/matrix_transpose/.
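
As an illustration of the naive → blocked transpose step, here is a minimal sketch. It is not the code from ex_2__matrix_transpose; the function names and the `tile` parameter are assumptions made for this example, and the best tile size depends on the cache sizes of the machine.

```c
#include <assert.h>
#include <stddef.h>

/* Naive transpose of an n x n row-major matrix.
   Reads of src are contiguous, but writes to dst jump by n elements,
   so for large n almost every store touches a different cache line. */
void transpose_naive(const double *src, double *dst, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            dst[j * n + i] = src[i * n + j];
}

/* Blocked (tiled) transpose: process tile x tile sub-blocks so that the
   source and destination lines of one block can stay in cache together.
   `tile` is a hypothetical tuning parameter and must be positive. */
void transpose_blocked(const double *src, double *dst, size_t n, size_t tile)
{
    assert(tile > 0);
    for (size_t ii = 0; ii < n; ii += tile) {
        for (size_t jj = 0; jj < n; jj += tile) {
            /* Clamp the block so n need not be a multiple of tile. */
            size_t i_end = (ii + tile < n) ? ii + tile : n;
            size_t j_end = (jj + tile < n) ? jj + tile : n;
            for (size_t i = ii; i < i_end; i++)
                for (size_t j = jj; j < j_end; j++)
                    dst[j * n + i] = src[i * n + j];
        }
    }
}
```

Both functions produce the same result; only the order in which memory is touched differs, which is what the timing comparison isolates.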

| Example | What it demonstrates |
|---|---|
| ex_0__if_forest_in_loops | Dense conditionals in loops and compiler handling |
| ex_1__branch_prediction | Cost of mispredicted branches; measured with gprof |
| ex_2__crosssort_arrays | Branch-free alternatives using bit manipulation |
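
To make "branch-free alternatives using bit manipulation" concrete, the sketch below orders pairs of values with a mask instead of an `if`. This is an assumed illustration, not the code in ex_2__crosssort_arrays; `min_branchless` and `order_pairs` are hypothetical names. Note that an optimizing compiler may already turn a simple ternary into a conditional move, so any gain has to be measured.

```c
#include <stddef.h>
#include <stdint.h>

/* Select the smaller value without a data-dependent branch.
   (a < b) is 0 or 1; negating it gives a mask of all zeros or all ones. */
int32_t min_branchless(int32_t a, int32_t b)
{
    int32_t mask = -(int32_t)(a < b);
    return (a & mask) | (b & ~mask);
}

/* Make lo[i] <= hi[i] for every i using a masked XOR swap.
   diff is nonzero only when the pair is out of order, so the two
   XORs either swap the values or leave them untouched. */
void order_pairs(int32_t *lo, int32_t *hi, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int32_t mask = -(int32_t)(hi[i] < lo[i]);
        int32_t diff = (lo[i] ^ hi[i]) & mask;
        lo[i] ^= diff;
        hi[i] ^= diff;
    }
}
```

The loop body executes the same instructions for every element, so its cost does not depend on how predictable the input ordering is.
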

| Example | What it demonstrates |
|---|---|
| ex_1__matrix_multiplication | Schoolbook → loop-reordered → block-tiled matmul; hardware counter data (CPEs, IPCs, L1 misses) included |
| ex_2__array_reduction | Array reduction with 2×1, 4×2, 8×4 loop unrolling; exploiting instruction-level parallelism |
| ex_3__multiply_arrays | Multiply-accumulate with pipeline-aware formulations and vectorization hints |
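
The unrolled reduction idea can be sketched as follows. Reading "4×2" as "unroll by four, two independent accumulators" is an assumption about the notation, and the function names are hypothetical; the repository's own variants may differ.

```c
#include <stddef.h>

/* Baseline: one accumulator, so every add waits for the previous one. */
double sum_simple(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Unroll by 4 with 2 accumulators: s0 and s1 form two independent
   dependency chains that the CPU can execute in parallel. */
double sum_unroll4x2(const double *a, size_t n)
{
    double s0 = 0.0, s1 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i]     + a[i + 2];
        s1 += a[i + 1] + a[i + 3];
    }
    for (; i < n; i++)   /* remainder when n is not a multiple of 4 */
        s0 += a[i];
    return s0 + s1;
}
```

Because floating-point addition is not associative, the unrolled version can differ from the baseline in the last bits. That is also why a compiler will not apply this reordering on its own without flags such as -ffast-math.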

Explicit software prefetch hints (__builtin_prefetch) to hide memory latency.
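
A minimal sketch of a software prefetch, assuming GCC or Clang (both provide `__builtin_prefetch(addr, rw, locality)`). The indirect-gather loop, the function name, and the prefetch distance of 16 are assumptions chosen for illustration; the useful distance depends on memory latency and on how much work each iteration does.

```c
#include <stddef.h>

#define PREFETCH_DISTANCE 16   /* hypothetical value; tune per machine */

/* Sum data[idx[i]] for i in [0, n). The indices are irregular, so the
   hardware prefetcher cannot guess the next address; we request the
   element needed PREFETCH_DISTANCE iterations ahead instead. */
double gather_sum(const double *data, const size_t *idx, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n)
            /* rw = 0: prefetch for reading; locality = 1: low temporal reuse */
            __builtin_prefetch(&data[idx[i + PREFETCH_DISTANCE]], 0, 1);
        s += data[idx[i]];
    }
    return s;
}
```

The prefetch is only a hint: it never changes the result, and on small or cache-resident inputs it can make the loop slightly slower.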

Low-level memory layout: endianness, byte-level inspection, and stack frame exploration with GDB.
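
Byte-level inspection can be sketched like this. The helper names are hypothetical and are not taken from the repository; `memcpy` is used to read the object representation without relying on pointer casts.

```c
#include <stdint.h>
#include <string.h>

/* Copy the in-memory bytes of a 32-bit value into out[0..3],
   lowest address first. */
void bytes_of_u32(uint32_t v, unsigned char out[4])
{
    memcpy(out, &v, sizeof v);
}

/* Returns 1 on a little-endian machine (least significant byte stored
   at the lowest address), 0 otherwise. */
int is_little_endian(void)
{
    unsigned char b[4];
    bytes_of_u32(1u, b);
    return b[0] == 1;
}
```

On a little-endian machine, `bytes_of_u32(0x11223344, b)` fills `b` with `44 33 22 11`; the same layout is visible in GDB by examining the variable's address byte by byte, for example with `x/4xb &v`.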

Compiler-visible vs. compiler-invisible inefficiencies, compared before and after -O3.
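
One common example of an inefficiency the compiler cannot remove on its own is possible pointer aliasing. The sketch below is an assumed illustration of that category, not code from the repository, and the function names are hypothetical.

```c
#include <stddef.h>

/* dst might point into src, so the compiler must assume each store to
   *dst can change a later src[i]. In general it has to write *dst back
   to memory on every iteration, even at -O3. */
void sum_into_aliasable(const double *src, size_t n, double *dst)
{
    *dst = 0.0;
    for (size_t i = 0; i < n; i++)
        *dst += src[i];
}

/* The programmer knows the intent: accumulate in a local variable
   (which can live in a register) and store once at the end. */
void sum_into_local(const double *src, size_t n, double *dst)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; i++)
        acc += src[i];
    *dst = acc;
}
```

The two functions agree when `dst` does not overlap `src`, but give different results when it does, which is exactly why the compiler is not allowed to rewrite the first into the second.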


| File | What it demonstrates |
|---|---|
| debug/gdb_try_breaks.c | GDB tutorial: breakpoints, watchpoints, backtraces on a nested function call stack |
| profiling/Nbody/Nbody.c | N-body gravitational simulation with AoS and SoA layouts (compile-time switch via -DUSE_SOA); designed for gprof/perf/callgrind profiling |
| profiling/Nbody/Nbody.scatter.c | Scatter-gather access pattern variant |
| profiling/Mandelbrot/Mandelbrot.tasks.c | Mandelbrot set generator as a compute-intensive profiling target; outputs PNG via stb_image_write.h |
| gprof2dot.py | Convert gprof output to a call-graph visualization (graphviz) |

The N-body example is particularly useful: compiling with and without -DUSE_SOA produces measurably different performance, illustrating how data layout drives real-world speedup.
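
A compile-time layout switch of this kind can be sketched as below. Only the `USE_SOA` macro comes from the README; the struct fields, the accessor macros, `N_BODIES`, and `center_of_mass_x` are assumptions for illustration and do not reproduce Nbody.c.

```c
#include <stddef.h>

#define N_BODIES 4   /* hypothetical problem size for this sketch */

#ifdef USE_SOA
/* Structure of Arrays: each field is contiguous, so a loop that reads
   only x and m streams through exactly the data it needs. */
typedef struct {
    double x[N_BODIES];
    double vx[N_BODIES];
    double m[N_BODIES];
} Bodies;
#define BODY_X(b, i) ((b)->x[i])
#define BODY_M(b, i) ((b)->m[i])
#else
/* Array of Structures: all fields of one body sit together, so the same
   loop also drags the unused vx field through the cache. */
typedef struct { double x, vx, m; } Body;
typedef struct { Body body[N_BODIES]; } Bodies;
#define BODY_X(b, i) ((b)->body[i].x)
#define BODY_M(b, i) ((b)->body[i].m)
#endif

/* The kernel is written once against the accessors and works with
   either layout. */
double center_of_mass_x(const Bodies *b)
{
    double mx = 0.0, m = 0.0;
    for (size_t i = 0; i < N_BODIES; i++) {
        mx += BODY_M(b, i) * BODY_X(b, i);
        m  += BODY_M(b, i);
    }
    return mx / m;
}
```

Building the same file with and without -DUSE_SOA then yields two binaries that differ only in data layout, which keeps the profiling comparison clean.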


Demonstrates floating-point precision loss and the Kahan compensated summation algorithm. Three approaches are compared:

  1. Naive float accumulator
  2. double accumulator (wider type)
  3. Kahan algorithm (error compensation)

Results vary with summation order (sorted/unsorted, forward/reverse), making this a concrete illustration of IEEE 754 non-associativity.
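
The three approaches listed above can be sketched as follows. The function names are hypothetical and this is not the repository's implementation. The sketch assumes `float` arithmetic is evaluated in single precision and that the code is built without -ffast-math, since that flag allows the compiler to reassociate the arithmetic and cancel the compensation term.

```c
#include <stddef.h>

/* 1. Naive float accumulator: small terms are lost once the running
      sum is large compared to them. */
float sum_naive(const float *a, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* 2. Wider accumulator: same loop, but the sum is kept in double. */
double sum_double(const float *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += (double)a[i];
    return s;
}

/* 3. Kahan compensated summation: c carries the rounding error of the
      previous addition and feeds it back into the next term. */
float sum_kahan(const float *a, size_t n)
{
    float sum = 0.0f, c = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float y = a[i] - c;      /* term corrected by the stored error */
        float t = sum + y;       /* low-order bits of y may be lost here */
        c = (t - sum) - y;       /* recover what was lost (negated) */
        sum = t;
    }
    return sum;
}
```

For example, summing `1e8f` followed by eight `1.0f` values: under the assumptions above, the naive float sum stays at 100000000 because each added 1 is below half the spacing between adjacent floats at that magnitude, while the double accumulator and the Kahan sum both reach 100000008.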


Tools & Prerequisites

| Tool | Purpose |
|---|---|
| GCC / Clang / ICX | C compilation; examples use -O0, -O2, -O3, -march=native |
| PAPI | Hardware performance counters (optional; examples fall back gracefully) |
| Python 3 + matplotlib | Performance visualization scripts (plotmountain.py, trans.py, tiling.py, compare.py) |
| GDB | Debugging examples |
| gprof / perf / Valgrind | Profiling examples |

PAPI is optional: examples that use it include mypapi.h, which wraps the PAPI calls, and they can be compiled without the library by defining NO_PAPI.


Building

Most examples compile with a single command, e.g.:

gcc -O3 -march=native -o nbody Nbody.c -lm
gcc -O3 -DUSE_SOA -o nbody_soa Nbody.c -lm   # SoA layout variant

gcc -O3 -o matmul matmul.c

The memory mountain has a Makefile with targets for GCC, ICX, and Clang:

cd Single-Core-Optimization/Examples/cache/ex_1__memory_mountain
make

Part Of

Master in High Performance Computing (MHPC)
SISSA / ICTP / University of Trieste — 2025/26
Course: P1.3 — Foundations of HPC

About

C examples from the MHPC program (SISSA/ICTP) exploring single-core performance optimization on modern hardware. Covers cache locality, branch prediction, loop unrolling, AoS vs. SoA data layouts, and floating-point arithmetic. Includes profiling workflows with gprof, perf, and Valgrind on real HPC systems (Leonardo, LUMI).
