# Reference Guide

Detailed documentation for the PaHiS toolkit. See the README for setup and scenario guides.

## Runner Parameters

The `runner.py` script accepts the following command-line parameters:

| Parameter | Default | Description |
|---|---|---|
| `--model_coeffs` | `header` | Source of model coefficients. `header` = use a pre-generated `hw_params_{system}.hpp` file in `models/`. `microbenchmarks` = run the microbenchmark suite and produce the `hw_params_{system}.hpp` header file from scratch. `topology` = NOT IMPLEMENTED; a future release will support loading coefficients from topology files. |
| `--threads` | `auto` | Comma-separated list of thread counts to benchmark. Each thread count generates a separate hardware model. `auto` = 1, 2, 4, ..., sysmax/4, sysmax/2, sysmax (sysmax is read from the topology file). For example, a 96-core system yields `1,2,4,8,16,24,48,96`. |
| `--core-policy` | `close` | NUMA thread-allocation policy: `close` or `spread`. `close` = threads on consecutive cores; `spread` = threads across NUMA nodes. Comma-separated for multiple policies (e.g., `close,spread`). |
| `--create-sync-level` | `empirical_poly` | Method for creating a synthetic `GLOBAL_SYNC` level to model primitive synchronization overhead. See the options below. |
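The exact `auto` thread-list rule is not spelled out beyond the 96-core example; one plausible reading (powers of two up to sysmax/4, then sysmax/4, sysmax/2, and sysmax) can be sketched as follows. The function name and logic are hypothetical, not the toolkit's actual code:

```python
# Hypothetical sketch of the `auto` thread-list rule, inferred from the
# 96-core example (1,2,4,8,16,24,48,96); not PaHiS's actual implementation.
def auto_threads(sysmax: int) -> list[int]:
    counts = []
    t = 1
    while t < sysmax // 4:          # powers of two below sysmax/4
        counts.append(t)
        t *= 2
    for t in (sysmax // 4, sysmax // 2, sysmax):
        if t not in counts:         # avoid duplicates on small systems
            counts.append(t)
    return counts
```

Under this reading, `auto_threads(96)` reproduces the documented `1,2,4,8,16,24,48,96` list.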

## Additional Parameters

| Parameter | Default | Description |
|---|---|---|
| `--consecutive-bytes` | `1,64` | Comma-separated list of consecutive byte sizes used in benchmarks. `1` captures latency (random access); `64` captures bandwidth (cache-line-sized access). |
| `--keep-levels` | (all levels) | Comma-separated list of memory-hierarchy level names to include in models. Must always include `NodeMem`. Example: `L2Cache,L3Cache,NUMANode,NodeMem,GLOBAL_SYNC`. |
| `--bw-kernels` | `copy,daxpy,triad` | Comma-separated list of kernel names to use for bandwidth (gi) measurements. Available: `chase`, `sync`, `copy`, `daxpy`, `dot`, `triad`. Do not use `chase` or `sync` for bandwidth measurements. |
| `--kernel-aggregator` | `mean` | Aggregation method applied across kernel measurements (e.g. `copy`, `daxpy`, `triad`) when computing a single gi/ls value per level. Options: `min`, `max`, `mean`. |
| `--clean-dataset` | `no` | Clean runner data from previous runs. `no` = do nothing. `models` = delete `models/` subdirectories (statistics are reapplied). `plots` = delete `plots/` subdirectories. `results` = delete `results/` subdirectories (per-kernel CSV logs). `all` = delete everything under the system results directory (full clean). |
| `--generate-subsets` | `False` | Generate all subset combinations of levels (always including `NodeMem`). By default, only the exact specified levels are used. |
| `--verbose` | `False` | Enable verbose debug output to the console. By default, debug information is only written to the log file (`build/runner.log`). |

## `--create-sync-level` options

| Value | Description |
|---|---|
| (empty string) | No sync level is added. |
| `empirical` | Runs synchronization benchmarks and uses the measured latency directly. |
| `empirical_poly` (default) | Runs synchronization benchmarks, then fits a polynomial model (a·t² + b·t + c) to predict sync latency as a function of thread count. Boundary points are averaged across allocation policies for consistency. |
| `empirical_monotony` | Runs synchronization benchmarks, averages per thread count, enforces weighted monotonicity (ls(tᵢ) ≥ 1.2 · ls(tⱼ) for tᵢ > tⱼ), and uses the adjusted values directly, without polynomial fitting. |
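To make the two empirical modes concrete, here is a minimal Python sketch of how the polynomial fit and the monotonicity adjustment described above could be realized. Function names and the sample workflow are illustrative assumptions, not the toolkit's internal API:

```python
# Illustrative sketch of the empirical_poly and empirical_monotony ideas
# described above; not PaHiS's actual code.
import numpy as np

def fit_sync_poly(threads, latencies):
    """empirical_poly: fit ls(t) = a*t^2 + b*t + c to measured sync latencies."""
    a, b, c = np.polyfit(threads, latencies, deg=2)
    return lambda t: a * t**2 + b * t + c

def enforce_monotony(threads, latencies, factor=1.2):
    """empirical_monotony: enforce ls(t_i) >= factor * ls(t_j) for t_i > t_j."""
    adjusted = []
    for _, ls in sorted(zip(threads, latencies)):
        if adjusted and ls < factor * adjusted[-1]:
            ls = factor * adjusted[-1]  # bump up to satisfy the constraint
        adjusted.append(ls)
    return adjusted
```

For example, measured latencies `[10, 25, 24, 80]` at thread counts `[1, 2, 4, 8]` would have the third value bumped to 30 (1.2 × 25) by `enforce_monotony`, since 24 violates the weighted monotonicity constraint.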

## Examples

```bash
# Auto-detect threads from topology (e.g. 96-core → 1,2,4,8,16,24,48,96)
python3 runner.py

# Use a pre-initialized header file (skip benchmarks, compile models directly)
python3 runner.py --model_coeffs header

# Explicit thread counts
python3 runner.py --threads "1,4,16,48"

# Full characterization with both NUMA policies
python3 runner.py --threads "1,2,4,8,16,32,64" \
  --consecutive-bytes "1,64" \
  --core-policy "close,spread" \
  --keep-levels "L1Cache,L2Cache,L3Cache,NodeMem"

# No sync level
python3 runner.py --threads "1,16,48" \
  --keep-levels "L2Cache,L3Cache,NodeMem" \
  --create-sync-level ""

# Generate all subset combinations of levels
python3 runner.py --threads "1,16,48" \
  --keep-levels "L1Cache,L2Cache,L3Cache,NodeMem" \
  --generate-subsets

# Include the dot kernel in bandwidth measurements
python3 runner.py --threads "1,4,16" --bw-kernels "copy,daxpy,dot,triad"

# Use min aggregation across kernels
python3 runner.py --threads "1,4,16" --kernel-aggregator min

# Clean HW_model.csv files and re-run from raw benchmark logs
python3 runner.py --clean-dataset models

# Full clean re-run (deletes all benchmark data, re-runs everything)
python3 runner.py --clean-dataset all

# Verbose output for debugging
python3 runner.py --threads "1,4" --verbose
```

## Output Structure

### Hardware Characterization Results

```
hw_characterization/results/AMDEPYC9634/
├── close/                     # NUMA allocation policy
│   ├── k_1/t_1/               # 1-byte consecutive access, 1 thread
│   │   ├── results/           # Raw benchmark data (CSV)
│   │   ├── plots/             # Generated visualizations (PNG)
│   │   │   ├── time/          # Time-based regression plots
│   │   │   └── gbytes/        # Bandwidth plots
│   │   └── models/            # CSV model files
│   ├── k_64/t_1/              # 64-byte consecutive access, 1 thread
│   └── ...
├── spread/                    # Alternative NUMA allocation policy
│   └── ...
└── ...
```

### Validation Benchmark Results

```
$DATADIR/
├── conjugate_gradient/       # Algorithm name
│   ├── $BENCHMARK_NAME/      # Benchmark name (e.g., default, model1, poly_agrsum)
│   │   ├── close/            # NUMA allocation policy
│   │   │   ├── t96/          # 96 threads
│   │   │   │   ├── outputs/  # Raw benchmark logs
│   │   │   │   └── results/  # Analysis and plots
│   │   │   ├── t64/          # 64 threads
│   │   │   └── ...
│   │   ├── spread/           # Alternative NUMA allocation policy
│   │   │   ├── t96/
│   │   │   └── ...
│   │   └── ...
├── bicgstab/                 # Algorithm name
│   ├── $BENCHMARK_NAME/      # Benchmark name (e.g., default, model1)
│   │   ├── close/
│   │   │   ├── t96/
│   │   │   └── ...
│   │   └── ...
└── ...
```

## Configuration

### Environment Variables

| Variable | Required | Description |
|---|---|---|
| `HW_MODEL` | Yes | System name (must match a topology file in `hw_characterization/topologies/`) |
| `ROOT_DIR` | No | Repository root directory (set automatically in `config_{HW_MODEL}.sh`) |
| `ALP_PATH` | No | Path to the ALP repository (required only for ALP integration and validation) |
| `HW_BUILD_DIR` | No | Build directory (set automatically by CMake in `config_{HW_MODEL}.sh`) |

Automatic paths:

- `ROOT_DIR`: set automatically to the absolute path of the repository root
- Topologies: `hw_characterization/topologies/`
- Results: `hw_characterization/results/`
- Generated headers: `hw_characterization/results/{HW_MODEL}/output_headers/`

## Topology File Format

Create topology files at `hw_characterization/topologies/HW_hierarchy_<HW_MODEL>.log`:

### Step 1: View System Topology

```bash
# Generate an ASCII topology visualization
lstopo --output-format ascii --no-io --no-bridges
```

### Step 2: Create PaHiS Format

```
HW_LEVELS=<N>                    # Number of memory hierarchy levels
HW_THREAD_OFFSET=0               # Thread offset (usually 0; change for hybrid systems)
r_scalar = <VALUE>               # Scalar compute rate in ops/s (e.g. 4.16e10)
r_vec = <VALUE>                  # Vector compute rate in ops/s (e.g. 8.32e10)
level: <N-1> | level_name: <NAME> | pi: <COUNT> | mi: <SIZE_KB> | gi: 0 | Li: 0
level: <N-2> | level_name: <NAME> | pi: <COUNT> | mi: <SIZE_KB> | gi: 0 | Li: 0
...
level: 0 | level_name: PU | pi: <COUNT> | mi: <SIZE_KB> | gi: 0 | Li: 0
```
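For concreteness, here is a hypothetical filled-in topology file for an imaginary 3-level system. All counts and sizes are illustrative placeholders following the template above, not measurements from any real machine:

```
HW_LEVELS=3
HW_THREAD_OFFSET=0
r_scalar = 4.16e10
r_vec = 8.32e10
level: 2 | level_name: NodeMem | pi: 1 | mi: 134217728 | gi: 0 | Li: 0
level: 1 | level_name: L3Cache | pi: 2 | mi: 32768 | gi: 0 | Li: 0
level: 0 | level_name: PU | pi: 16 | mi: 0 | gi: 0 | Li: 0
```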

### Parameter Reference

| Parameter | Description |
|---|---|
| `r_scalar` | Scalar compute rate in ops/s (`0.0` if unknown) |
| `r_vec` | Vector (SIMD) compute rate in ops/s (`0.0` if unknown) |
| `pi` | Number of hierarchical sub-components |
| `mi` | Memory size in KB |
| `gi`, `Li` | Benchmark outputs (leave as `0`) |

### Level Name Mapping

| hwloc Type | PaHiS Name | Description |
|---|---|---|
| Machine | `NodeMem` | System memory level |
| Package | `Socket` | CPU socket level |
| NUMANode | `NUMANode` | NUMA node level |
| L3 | `L3Cache` | L3 cache level |
| L2 | `L2Cache` | L2 cache level |
| L1(d) | `L1Cache` | L1 cache level |
| PU | `PU` | Processing unit level |