# Reference Guide

Detailed documentation for the PaHiS toolkit. See the README for setup and scenario guides.

## Runner Parameters

The `runner.py` script accepts the following command-line parameters:

| Parameter | Default | Description |
|---|---|---|
| `--model_coeffs` | `header` | Source of model coefficients. `header` = use a pre-generated `hw_params_{system}.hpp` file in `models/`. `microbenchmarks` = run the microbenchmark suite and produce the `hw_params_{system}.hpp` header file from scratch. `topology` = NOT IMPLEMENTED; a future release will support loading coefficients from topology files. |
| `--threads` | `auto` | Comma-separated list of thread counts to benchmark. Each thread count generates a separate hardware model. `auto` = 1, 2, 4, ..., sysmax/4, sysmax/2, sysmax (sysmax is read from the topology file). For example, a 96-core system yields `1,2,4,8,16,24,48,96`. |
| `--core-policy` | `close` | NUMA thread-allocation policy: `close` or `spread`. `close` = threads on consecutive cores; `spread` = threads across NUMA nodes. Comma-separated for multiple policies (e.g., `close,spread`). |
| `--create-sync-level` | `empirical_poly` | Method for creating a synthetic `GLOBAL_SYNC` level to model primitive synchronization overhead. See the options below. |
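The exact `auto` thread-list rule is not spelled out beyond the 96-core example; one plausible reading (powers of two up to sysmax/4, then sysmax/4, sysmax/2, and sysmax) can be sketched as follows. The function name and logic are hypothetical, not the toolkit's actual code:

```python
# Hypothetical sketch of the `auto` thread-list rule, inferred from the
# 96-core example (1,2,4,8,16,24,48,96); not PaHiS's actual implementation.
def auto_threads(sysmax: int) -> list[int]:
    counts = []
    t = 1
    while t < sysmax // 4:          # powers of two below sysmax/4
        counts.append(t)
        t *= 2
    for t in (sysmax // 4, sysmax // 2, sysmax):
        if t not in counts:         # avoid duplicates on small systems
            counts.append(t)
    return counts
```

Under this reading, `auto_threads(96)` reproduces the documented `1,2,4,8,16,24,48,96` list.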

## Additional Parameters

| Parameter | Default | Description |
|---|---|---|
| `--consecutive-bytes` | `1,64` | Comma-separated list of consecutive byte sizes used in benchmarks. `1` captures latency (random access); `64` captures bandwidth (cache-line-sized access). |
| `--keep-levels` | (all levels) | Comma-separated list of memory-hierarchy level names to include in models. Must always include `NodeMem`. Example: `L2Cache,L3Cache,NUMANode,NodeMem,GLOBAL_SYNC`. |
| `--bw-kernels` | `copy,daxpy,triad` | Comma-separated list of kernel names to use for bandwidth (gi) measurements. Available: `chase`, `sync`, `copy`, `daxpy`, `dot`, `triad`. Do not use `chase` or `sync` for bandwidth measurements. |
| `--kernel-aggregator` | `mean` | Aggregation method applied across kernel measurements (e.g. `copy`, `daxpy`, `triad`) when computing a single gi/ls value per level. Options: `min`, `max`, `mean`. |
| `--clean-dataset` | `no` | Clean runner data from previous runs. `no` = do nothing. `models` = delete `models/` subdirectories (statistics are reapplied). `plots` = delete `plots/` subdirectories. `results` = delete `results/` subdirectories (per-kernel CSV logs). `all` = delete everything under the system results directory (full clean). |
| `--generate-subsets` | `False` | Generate all subset combinations of levels (always including `NodeMem`). By default, only the exact specified levels are used. |
| `--verbose` | `False` | Enable verbose debug output to the console. By default, debug information is only written to the log file (`build/runner.log`). |

## `--create-sync-level` options

| Value | Description |
|---|---|
| (empty string) | No sync level is added. |
| `empirical` | Runs synchronization benchmarks and uses the measured latency directly. |
| `empirical_poly` (default) | Runs synchronization benchmarks, then fits a polynomial model (a·t² + b·t + c) to predict sync latency as a function of thread count. Boundary points are averaged across allocation policies for consistency. |
| `empirical_monotony` | Runs synchronization benchmarks, averages per thread count, enforces weighted monotonicity (ls(tᵢ) ≥ 1.2 · ls(tⱼ) for tᵢ > tⱼ), and uses the adjusted values directly, without polynomial fitting. |
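To make the two empirical modes concrete, here is a minimal Python sketch of how the polynomial fit and the monotonicity adjustment described above could be realized. Function names and the sample workflow are illustrative assumptions, not the toolkit's internal API:

```python
# Illustrative sketch of the empirical_poly and empirical_monotony ideas
# described above; not PaHiS's actual code.
import numpy as np

def fit_sync_poly(threads, latencies):
    """empirical_poly: fit ls(t) = a*t^2 + b*t + c to measured sync latencies."""
    a, b, c = np.polyfit(threads, latencies, deg=2)
    return lambda t: a * t**2 + b * t + c

def enforce_monotony(threads, latencies, factor=1.2):
    """empirical_monotony: enforce ls(t_i) >= factor * ls(t_j) for t_i > t_j."""
    adjusted = []
    for _, ls in sorted(zip(threads, latencies)):
        if adjusted and ls < factor * adjusted[-1]:
            ls = factor * adjusted[-1]  # bump up to satisfy the constraint
        adjusted.append(ls)
    return adjusted
```

For example, measured latencies `[10, 25, 24, 80]` at thread counts `[1, 2, 4, 8]` would have the third value bumped to 30 (1.2 × 25) by `enforce_monotony`, since 24 violates the weighted monotonicity constraint.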

## Examples

```bash
# Auto-detect threads from topology (e.g. 96-core → 1,2,4,8,16,24,48,96)
python3 runner.py

# Use a pre-initialized header file (skip benchmarks, compile models directly)
python3 runner.py --model_coeffs header

# Explicit thread counts
python3 runner.py --threads "1,4,16,48"

# Full characterization with both NUMA policies
python3 runner.py --threads "1,2,4,8,16,32,64" \
  --consecutive-bytes "1,64" \
  --core-policy "close,spread" \
  --keep-levels "L1Cache,L2Cache,L3Cache,NodeMem"

# No sync level
python3 runner.py --threads "1,16,48" \
  --keep-levels "L2Cache,L3Cache,NodeMem" \
  --create-sync-level ""

# Generate all subset combinations of levels
python3 runner.py --threads "1,16,48" \
  --keep-levels "L1Cache,L2Cache,L3Cache,NodeMem" \
  --generate-subsets

# Include the dot kernel in bandwidth measurements
python3 runner.py --threads "1,4,16" --bw-kernels "copy,daxpy,dot,triad"

# Use min aggregation across kernels
python3 runner.py --threads "1,4,16" --kernel-aggregator min

# Clean HW_model.csv files and re-run from raw benchmark logs
python3 runner.py --clean-dataset models

# Full clean re-run (deletes all benchmark data, re-runs everything)
python3 runner.py --clean-dataset all

# Verbose output for debugging
python3 runner.py --threads "1,4" --verbose
```

## Output Structure

### Hardware Characterization Results

```
hw_characterization/results/AMDEPYC9634/
├── close/                     # NUMA allocation policy
│   ├── k_1/t_1/               # 1-byte consecutive access, 1 thread
│   │   ├── results/           # Raw benchmark data (CSV)
│   │   ├── plots/             # Generated visualizations (PNG)
│   │   │   ├── time/          # Time-based regression plots
│   │   │   └── gbytes/        # Bandwidth plots
│   │   └── models/            # CSV model files
│   ├── k_64/t_1/              # 64-byte consecutive access, 1 thread
│   └── ...
├── spread/                    # Alternative NUMA allocation policy
│   └── ...
└── ...
```

### Validation Benchmark Results

```
$DATADIR/
├── conjugate_gradient/       # Algorithm name
│   ├── $BENCHMARK_NAME/      # Benchmark name (e.g., default, model1, poly_agrsum)
│   │   ├── close/            # NUMA allocation policy
│   │   │   ├── t96/          # 96 threads
│   │   │   │   ├── outputs/  # Raw benchmark logs
│   │   │   │   └── results/  # Analysis and plots
│   │   │   ├── t64/          # 64 threads
│   │   │   └── ...
│   │   ├── spread/           # Alternative NUMA allocation policy
│   │   │   ├── t96/
│   │   │   └── ...
│   │   └── ...
├── bicgstab/                 # Algorithm name
│   ├── $BENCHMARK_NAME/      # Benchmark name (e.g., default, model1)
│   │   ├── close/
│   │   │   ├── t96/
│   │   │   └── ...
│   │   └── ...
└── ...
```

## Configuration

### Environment Variables

| Variable | Required | Description |
|---|---|---|
| `HW_MODEL` | Yes | System name (must match a topology file in `hw_characterization/topologies/`) |
| `ROOT_DIR` | No | Repository root directory (set automatically in `config_{HW_MODEL}.sh`) |
| `ALP_PATH` | No | Path to the ALP repository (required only for ALP integration and validation) |
| `HW_BUILD_DIR` | No | Build directory (set automatically by CMake in `config_{HW_MODEL}.sh`) |

Automatic paths:

- `ROOT_DIR`: set automatically to the absolute path of the repository root
- Topologies: `hw_characterization/topologies/`
- Results: `hw_characterization/results/`
- Generated headers: `hw_characterization/results/{HW_MODEL}/output_headers/`

## Topology File Format

Create topology files at `hw_characterization/topologies/HW_hierarchy_<HW_MODEL>.log`:

### Step 1: View System Topology

```bash
# Generate an ASCII topology visualization
lstopo --output-format ascii --no-io --no-bridges
```

### Step 2: Create PaHiS Format

```
HW_LEVELS=<N>                    # Number of memory hierarchy levels
HW_THREAD_OFFSET=0               # Thread offset (usually 0; change for hybrid systems)
r_scalar = <VALUE>               # Scalar compute rate in ops/s (e.g. 4.16e10)
r_vec = <VALUE>                  # Vector compute rate in ops/s (e.g. 8.32e10)
level: <N-1> | level_name: <NAME> | pi: <COUNT> | mi: <SIZE_KB> | gi: 0 | Li: 0
level: <N-2> | level_name: <NAME> | pi: <COUNT> | mi: <SIZE_KB> | gi: 0 | Li: 0
...
level: 0 | level_name: PU | pi: <COUNT> | mi: <SIZE_KB> | gi: 0 | Li: 0
```
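For concreteness, here is a hypothetical filled-in topology file for an imaginary 3-level system. All counts and sizes are illustrative placeholders following the template above, not measurements from any real machine:

```
HW_LEVELS=3
HW_THREAD_OFFSET=0
r_scalar = 4.16e10
r_vec = 8.32e10
level: 2 | level_name: NodeMem | pi: 1 | mi: 134217728 | gi: 0 | Li: 0
level: 1 | level_name: L3Cache | pi: 2 | mi: 32768 | gi: 0 | Li: 0
level: 0 | level_name: PU | pi: 16 | mi: 0 | gi: 0 | Li: 0
```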

### Parameter Reference

| Parameter | Description |
|---|---|
| `r_scalar` | Scalar compute rate in ops/s (`0.0` if unknown) |
| `r_vec` | Vector (SIMD) compute rate in ops/s (`0.0` if unknown) |
| `pi` | Number of hierarchical sub-components |
| `mi` | Memory size in KB |
| `gi`, `Li` | Benchmark outputs (leave as `0`) |

### Level Name Mapping

| hwloc Type | PaHiS Name | Description |
|---|---|---|
| Machine | `NodeMem` | System memory level |
| Package | `Socket` | CPU socket level |
| NUMANode | `NUMANode` | NUMA node level |
| L3 | `L3Cache` | L3 cache level |
| L2 | `L2Cache` | L2 cache level |
| L1(d) | `L1Cache` | L1 cache level |
| PU | `PU` | Processing unit level |