Detailed documentation for the PaHiS toolkit. See the README for setup and scenario guides.
## Runner Parameters

The `runner.py` script accepts the following command-line parameters:

| Parameter | Default | Description |
|---|---|---|
| `--model_coeffs` | `header` | Source of model coefficients. `header` = use the pre-generated `hw_params_{system}.hpp` file in `models/`. `microbenchmarks` = run the microbenchmark suite and produce the `hw_params_{system}.hpp` header file from scratch. `topology` = not implemented; a future release will support loading coefficients from topology files. |
| `--threads` | `auto` | Comma-separated list of thread counts to benchmark. Each thread count generates a separate hardware model. `auto` = 1, 2, 4, …, sysmax/4, sysmax/2, sysmax (sysmax taken from the topology file). For example, on a 96-core system: `1,2,4,8,16,24,48,96`. |
| `--core-policy` | `close` | NUMA thread-allocation policy: `close` or `spread`. `close` = threads on consecutive cores; `spread` = threads across NUMA nodes. Comma-separated for multiple policies (e.g., `close,spread`). |
| `--create-sync-level` | `empirical_poly` | Method for creating a synthetic `GLOBAL_SYNC` level to model primitive synchronization overhead. See the options below. |
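The `auto` thread-list rule above (powers of two up to sysmax/4, then sysmax/4, sysmax/2, and sysmax) can be sketched as follows. This is an illustrative reading of the documented rule, not the actual code in `runner.py`; the function name is hypothetical.

```python
def auto_thread_counts(sysmax):
    """Sketch of '--threads auto': powers of two below sysmax/4,
    followed by sysmax/4, sysmax/2, and sysmax itself."""
    counts = []
    t = 1
    while t < sysmax // 4:
        counts.append(t)
        t *= 2
    for t in (sysmax // 4, sysmax // 2, sysmax):
        if t not in counts:
            counts.append(t)
    return counts

print(auto_thread_counts(96))  # → [1, 2, 4, 8, 16, 24, 48, 96]
```

This reproduces the 96-core example from the table: `1,2,4,8,16,24,48,96`.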
## Additional Parameters

| Parameter | Default | Description |
|---|---|---|
| `--consecutive-bytes` | `1,64` | Comma-separated list of consecutive byte sizes used in benchmarks. `1` captures latency (random access); `64` captures bandwidth (cache-line-sized access). |
| `--keep-levels` | (all levels) | Comma-separated list of memory-hierarchy level names to include in models. Must always include `NodeMem`. Example: `L2Cache,L3Cache,NUMANode,NodeMem,GLOBAL_SYNC`. |
| `--bw-kernels` | `copy,daxpy,triad` | Comma-separated list of kernel names to use for bandwidth (gi) measurements. Available: `chase`, `sync`, `copy`, `daxpy`, `dot`, `triad`. Do not use `chase` or `sync` for bandwidth measurements. |
| `--kernel-aggregator` | `mean` | Aggregation method applied across kernel measurements (e.g., `copy`, `daxpy`, `triad`) when computing a single gi/ls value per level. Options: `min`, `max`, `mean`. |
| `--clean-dataset` | `no` | Clean runner data from previous runs. `no` = do nothing. `models` = delete `models/` subdirectories (statistics are reapplied). `plots` = delete `plots/` subdirectories. `results` = delete `results/` subdirectories (per-kernel CSV logs). `all` = delete everything under the system results directory (full clean). |
| `--generate-subsets` | `False` | Generate all subset combinations of levels (always including `NodeMem`). By default, only the exact specified levels are used. |
| `--verbose` | `False` | Enable verbose debug output on the console. By default, debug information is written only to the log file (`build/runner.log`). |
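The `--generate-subsets` option above enumerates every subset of the requested levels that still contains the mandatory `NodeMem` level. A minimal sketch of that enumeration (the helper name is hypothetical, not part of the toolkit):

```python
from itertools import combinations

def level_subsets(levels, required="NodeMem"):
    """All subsets of `levels` that keep the mandatory NodeMem level."""
    optional = [lvl for lvl in levels if lvl != required]
    subsets = []
    for r in range(len(optional) + 1):
        for combo in combinations(optional, r):
            subsets.append(list(combo) + [required])
    return subsets

for s in level_subsets(["L2Cache", "L3Cache", "NodeMem"]):
    print(",".join(s))
```

With n levels (one of them `NodeMem`), this yields 2^(n-1) level combinations, so the `--generate-subsets` runs grow quickly with the number of kept levels.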
## `--create-sync-level` options

| Value | Description |
|---|---|
| (empty string) | No sync level is added. |
| `empirical` | Runs the synchronization benchmarks and uses the measured latency directly. |
| `empirical_poly` | (default) Runs the synchronization benchmarks, then fits a polynomial model (a·t² + b·t + c) to predict sync latency as a function of thread count t. Boundary points are averaged across allocation policies for consistency. |
| `empirical_monotony` | Runs the synchronization benchmarks, averages per thread count, enforces weighted monotonicity (ls(tᵢ) ≥ 1.2 · ls(tⱼ) for tᵢ > tⱼ), and uses the adjusted values directly, without polynomial fitting. |
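The two post-processing steps in the table can be sketched as below. This is an illustrative reading of the descriptions, assuming the 1.2 monotonicity factor applies between consecutive thread counts; the function names are hypothetical and `numpy` is used here only for the quadratic fit.

```python
import numpy as np

def fit_sync_poly(threads, latencies):
    """empirical_poly: fit a*t^2 + b*t + c to measured sync latencies
    and return a callable predicting latency for any thread count."""
    a, b, c = np.polyfit(threads, latencies, deg=2)
    return lambda t: a * t * t + b * t + c

def enforce_monotonicity(latencies, factor=1.2):
    """empirical_monotony (assumed reading): each ls value must be at
    least `factor` times the value at the previous thread count."""
    adjusted = [latencies[0]]
    for ls in latencies[1:]:
        adjusted.append(max(ls, factor * adjusted[-1]))
    return adjusted
```

For example, `enforce_monotonicity([1.0, 1.0, 2.0])` lifts the flat middle point to `1.2` while leaving the already-monotone last point untouched.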
## Examples

```bash
# Auto-detect threads from topology (e.g. 96-core → 1,2,4,8,16,24,48,96)
python3 runner.py

# Use a pre-initialized header file (skip benchmarks, compile models directly)
python3 runner.py --model_coeffs header

# Explicit thread counts
python3 runner.py --threads "1,4,16,48"

# Full characterization with both NUMA policies
python3 runner.py --threads "1,2,4,8,16,32,64" \
    --consecutive-bytes "1,64" \
    --core-policy "close,spread" \
    --keep-levels "L1Cache,L2Cache,L3Cache,NodeMem"

# No sync level
python3 runner.py --threads "1,16,48" \
    --keep-levels "L2Cache,L3Cache,NodeMem" \
    --create-sync-level ""

# Generate all subset combinations of levels
python3 runner.py --threads "1,16,48" \
    --keep-levels "L1Cache,L2Cache,L3Cache,NodeMem" \
    --generate-subsets

# Include the dot kernel in bandwidth measurements
python3 runner.py --threads "1,4,16" --bw-kernels "copy,daxpy,dot,triad"

# Use min aggregation across kernels
python3 runner.py --threads "1,4,16" --kernel-aggregator min

# Clean HW_model.csv files and re-run from raw benchmark logs
python3 runner.py --clean-dataset models

# Full clean re-run (deletes all benchmark data, re-runs everything)
python3 runner.py --clean-dataset all

# Verbose output for debugging
python3 runner.py --threads "1,4" --verbose
```