```
 _____              _       _   _
|_   _|__  _ __ ___| |__   | | | | __ _ _ __ ___  _ __ ___   ___ _ __
  | |/ _ \| '__/ __| '_ \  | |_| |/ _` | '_ ` _ \| '_ ` _ \ / _ \ '__|
  | | (_) | | | (__| | | | |  _  | (_| | | | | | | | | | | |  __/ |
  |_|\___/|_|  \___|_| |_| |_| |_|\__,_|_| |_| |_|_| |_| |_|\___|_|

                 Forged with PyTorch
          GPU/CPU/APU Micro-Benchmark Suite
```
# Torch Hammer

A portable PyTorch micro-benchmark suite for stress testing CPUs, GPUs, and APUs.
© Copyright 2025-2026 Hewlett Packard Enterprise Development LP
Licensed under the Apache License, Version 2.0. See LICENSE for details.
Torch Hammer is a benchmarking utility designed to stress test and evaluate the performance of CPUs, GPUs, and APUs using PyTorch. It grew out of the increasing heterogeneity of workloads, hardware, and runtime environments across the HPC & AI industry, and aims to:
- Offer a variety of highly-parametrized tests that can push hardware power/thermal limits
- Characterize hardware performance and identify slow components
- Be portable so that it can be run across different platforms
- Provide an easy to maintain mechanism to add new tests
When building Torch Hammer, I was inspired by my undergraduate work in quantum chemistry, differential equations, and my more recent interest in AI. Some of the tests you will find in Torch Hammer are small kernels of the types of problems I would work out by hand for my assignments as a student at the University of Minnesota. Others are reflective of patterns commonly used in AI/ML workloads.
## Table of Contents

- Key Features
- Tested Platforms
- Installation
- Quick Start
- Command-line Reference
- Verbose Mode
- Compact Mode
- Syslog Mode
- Telemetry Back-ends
- Reporting
- Examples
- Contributing
- Project Governance
- License
## Key Features

- Cross-platform: CUDA, ROCm, Metal (MPS), CPU – automatically selected.
- Nine micro-benchmarks (a minimal timing sketch follows this feature list):
  - Batched GEMM (matrix multiply)
  - 2-D Convolution
  - 3-D FFT
  - Einsum (attention-style tensor contraction)
  - Random memory traffic
  - Laplacian heat-equation solver
  - 1-D time-dependent Schrödinger equation
  - Atomic contention (L2 cache stress via `scatter_add`)
  - Sparse matrix multiplication (SpMM)
- Precision choices: `float16`, `bfloat16`, `float32`, `float64`, `complex64`, `complex128`.
- Tensor Core support: optional TF32 mode on newer platforms.
- Live telemetry:
  - NVIDIA: power, GPU temperature, SM utilization, clocks, memory-controller activity, memory utilization
  - AMD ROCm: power, temperature, utilization, clock
  - CPU: basic vendor & model info.
- Verbose logging: every iteration, every telemetry field, one comma-separated line.
- Compact CSV mode: machine-readable CSV to stdout — one row per benchmark, pipe-friendly.
- Syslog / dmesg mode: structured `key=value` entries to syslog with auto-derived severity; optional `/dev/kmsg` output for kernel ring-buffer correlation.
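For a feel of what these tests do, here is a minimal, self-contained sketch of a batched-GEMM timing loop in the same warm-up/synchronize/measure spirit as the suite (an illustration only, not Torch Hammer's actual kernel; the sizes mirror the documented defaults):

```python
# Illustrative only; not Torch Hammer's implementation.
import time
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
a = torch.randn(128, 512, 512, device=device)   # batch_count x M x K
b = torch.randn(128, 512, 512, device=device)   # batch_count x K x N

for _ in range(10):                              # warm-up iterations
    torch.bmm(a, b)
if device == "cuda":
    torch.cuda.synchronize()

t0 = time.perf_counter()
for _ in range(10):                              # timed inner loop
    c = torch.bmm(a, b)
if device == "cuda":
    torch.cuda.synchronize()
elapsed = time.perf_counter() - t0

flops = 2 * 128 * 512**3 * 10                    # 2·M·N·K per matmul, x batch, x iters
print(f"{flops / elapsed / 1e9:.1f} GFLOP/s")
```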
## Tested Platforms

Torch Hammer has been validated on 50+ hardware configurations spanning NVIDIA, AMD, Arm, and Apple platforms.
**Accelerators**

| Vendor | Accelerator |
|---|---|
| AMD | Instinct MI300A, Instinct MI250X |
| NVIDIA | GB300, B200, GH200, H100, A100 |
**NVIDIA GPUs**

| Family | GPUs |
|---|---|
| GeForce (Turing) | GTX 1660 SUPER, RTX 2060 SUPER, RTX 2080 Ti, RTX 2080 SUPER |
| GeForce (Ampere) | RTX 3060 Ti, RTX 3060 Mobile, RTX 3070, RTX 3070 Ti, RTX 3080, RTX 3080 Ti, RTX 3090 |
| GeForce (Ada Lovelace) | RTX 4060 Ti, RTX 4070 Ti SUPER, RTX 4080, RTX 4080 SUPER, RTX 4090, RTX 4090 D |
| GeForce (Blackwell) | RTX 5060 Ti, RTX 5070, RTX 5070 Ti, RTX 5080, RTX 5090 |
| Professional | Quadro RTX 6000, Titan RTX, RTX A4000, RTX A5000, RTX A6000, RTX 6000 Ada, RTX PRO 4000, RTX PRO 5000, RTX PRO 6000 Server, RTX PRO 6000 Workstation |
| Data Center | A10, A40, A100 PCIe, A100 SXM, A800, L4, L40S, H100 NVL, H100 SXM, H200, B200 |
**CPUs**

| Vendor | Processor |
|---|---|
| AMD | EPYC Rome, EPYC Milan, EPYC Trento |
| Arm | Cortex-A76 |
| NVIDIA | Grace |
**Apple Silicon**

| Chip |
|---|
| M3 Pro |
## Installation

Requirements:

- Python ≥ 3.8
- PyTorch ≥ 1.10 (see PyTorch installation guide)
```bash
# Clone the repository
git clone https://github.com/HPE/torch-hammer.git
cd torch-hammer

# Create and activate virtual environment (recommended)
python3 -m venv .venv
source .venv/bin/activate

# Make executable
chmod +x torch-hammer.py
```

Note: After the steps above, complete the Platform-Specific Setup below for your hardware (NVIDIA, AMD, Apple Silicon, or CPU-only) before running benchmarks.
### Platform-Specific Setup

**NVIDIA (CUDA)**

```bash
# Install PyTorch with CUDA support
# Replace 'cu126' with your CUDA version: cu118, cu121, cu124, cu126, etc.
# Check your CUDA version with: nvcc --version or nvidia-smi
# Examples:
#   CUDA 11.8 → cu118
#   CUDA 12.1 → cu121
#   CUDA 12.6 → cu126
#   CUDA 13.x → cu130 (when available)
pip install torch --index-url https://download.pytorch.org/whl/cu126

# Install telemetry library
pip install nvidia-ml-py
```

**AMD (ROCm)**

```bash
# Detect ROCm version and install matching PyTorch
ROCM_VER=$(cat /opt/rocm/.info/version | cut -d'-' -f1 | cut -d'.' -f1,2)
pip install torch --index-url https://download.pytorch.org/whl/rocm${ROCM_VER}

# Add ROCm's amdsmi to Python path (auto-detects Python version)
export PYTHONPATH="/opt/rocm/share/amd_smi${PYTHONPATH:+:${PYTHONPATH}}"
```

**Apple Silicon**

```bash
# Install PyTorch (MPS backend included automatically)
pip install torch
```

**CPU-only**

```bash
# Install PyTorch CPU-only build
pip install torch --index-url https://download.pytorch.org/whl/cpu
```

### Optional Packages

| Package | Purpose | Installation |
|---|---|---|
| `pyyaml` | YAML config file support | `pip install pyyaml` |
| `setproctitle` | Custom process names in `ps`/`top` | `pip install setproctitle` |

### Telemetry Libraries

| Platform | Telemetry Library | Installation |
|---|---|---|
| NVIDIA | `pynvml` | `pip install nvidia-ml-py` |
| AMD ROCm | `amdsmi` | Bundled with ROCm (must match ROCm version) |
| Apple/CPU | Built-in | No additional install needed |
## Quick Start

Benchmark a single matrix-multiply workload on the first device:

```bash
./torch-hammer.py --batched-gemm
```

Default parameters target GPUs with 80GB+ VRAM. For GPUs with less memory, reduce problem sizes to avoid out-of-memory errors:
| GPU VRAM | Recommended GEMM Size | Example |
|---|---|---|
| 80GB+ (H100, A100-80GB, MI300X) | 16384×16384 | --m 16384 --n 16384 --k 16384 |
| 40-48GB (A100-40GB, A6000) | 8192×8192 | --m 8192 --n 8192 --k 8192 |
| 16-24GB (RTX 4090, A5000) | 4096×4096 | --m 4096 --n 4096 --k 4096 |
| 8-12GB (RTX 3080, T4) | 2048×2048 | --m 2048 --n 2048 --k 2048 |
Auto-scaling option: use `--stress-test` to automatically scale problem sizes based on available GPU memory:

```bash
./torch-hammer.py --batched-gemm --stress-test
```

Other memory-sensitive parameters:

- FFT: `--nx`/`--ny`/`--nz` (default 128³ uses ~16MB; try 64 for small GPUs)
- Sparse MM: `--sparse-m`, `--sparse-n`, `--sparse-k` (default 8192)
- Heat/Schrödinger: `--heat-grid-size`, `--schrodinger-grid-size`
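For intuition on the sizing table above: a single 16384 × 16384 float32 matrix occupies 16384² × 4 B = 1 GiB, so the A, B, and output operands of one GEMM already consume about 3 GiB before batching is applied; halving each dimension cuts that footprint by 4×.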
## Command-line Reference

`./torch-hammer.py -h` prints the full help. The most important switches are summarised below, with defaults noted inline.

### General Options
| Option | Description |
|---|---|
| **Logging & Output** | |
| `--no-log` | Disable all logging. |
| `--log-file <path>` | Append all log lines to *path*. |
| `--log-dir <path>` | Directory for per-GPU log files (multi-GPU runs). |
| `--verbose` | One line per iteration (see examples). |
| `--verbose-file-only` | With `--log-file` or `--log-dir`, suppress stdout (file only). |
| `--compact` | Machine-readable CSV to stdout (one row per benchmark). See Compact Mode. |
| `--syslog` | Emit structured key=value entries to syslog after each benchmark. See Syslog Mode. |
| `--syslog-dmesg` | Also write syslog messages to `/dev/kmsg` (requires `--syslog`). Needs root or CAP_SYSLOG. |
| `--banner` | Show ASCII banner at startup. |
| `--json-output <path>` | Write all results and telemetry to a JSON file. |
| `--summary-csv <path>` | Write benchmark summary table to a CSV file. |
| **Configuration** | |
| `--config <path>` | Path to YAML configuration file (see YAML Configuration). |
| `--list-profiles` | List available configuration profiles and exit. |
| `--dry-run` | Show configuration and exit without running benchmarks. |
| **Device Selection** | |
| `--device-index <int>` | GPU index to use (default: 0). Ignored if `--all-gpus` or `--gpu-list`. |
| `--all-gpus` | Run on all available GPUs in parallel. |
| `--gpu-list <indices>` | Comma-separated GPU indices (e.g., 0,2,3). |
| **CPU Affinity & NUMA** | |
| `--cpu-affinity` | NUMA-aware CPU binding (default: enabled, Linux only). |
| `--no-cpu-affinity` | Disable NUMA-aware CPU binding. |
| `--cpu-gpu-map <mapping>` | Manual CPU-GPU binding (e.g., 0:0-15,1:16-31). |
| `--cpu-list <cores>` | CPU cores for CPU-only mode (e.g., 0-23,48-71 or all). Default: all physical cores. |
| `--parent-cpu <int>` | Pin parent process to a specific CPU core (default: last core of first NUMA node; -1 to disable). |
| **Iteration & Duration** | |
| `--warmup <int>` | Warm-up iterations before timing (default: 10). |
| `--duration <float>` | Run each benchmark for the specified seconds (overrides iteration counts). |
| `--min-iterations <int>` | Minimum iterations even if `--duration` is met (default: 10). |
| `--max-iterations <int>` | Maximum iterations regardless of `--duration`. |
| `--repeats <int>` | Number of times to repeat the entire benchmark suite (default: 1). |
| `--repeat-delay <float>` | Delay in seconds between repeats for thermal stabilization (default: 0). |
| **Stress & Scheduling** | |
| `--stress-test` | Automatically calculate maximum stress parameters based on available GPU memory. |
| `--shuffle` | Randomize benchmark execution order. |
| `--startup-delay-per-gpu <float>` | Staggered startup delay per GPU in seconds (GPU N waits N × delay). Helps avoid ROCm memory-allocator contention (default: 0). |
| **Telemetry Tuning** | |
| `--skip-telemetry-first-n <int>` | Skip first N telemetry readings when calculating statistics (default: 10). |
| `--telemetry-interval-ms <int>` | Telemetry polling interval in milliseconds (default: 100). Higher values reduce resolution but may improve performance. |
| `--no-telemetry-thread` | Disable the background telemetry thread entirely (for debugging). |
| **Thermal / Performance Thresholds** | |
| `--temp-warn-C <float>` | Temperature warning threshold in °C (default: 90). |
| `--temp-critical-C <float>` | Temperature critical threshold in °C (default: 95). |
| `--power-warn-pct <float>` | Power-limit warning threshold in % (default: 98). |
| `--outlier-threshold-pct <float>` | Multi-GPU outlier detection threshold in % (default: 15). |
| `--efficiency-warn-pct <float>` | Hardware efficiency warning threshold in % (default: 70). |
| `--baseline-file <path>` | Load hardware baselines from a JSON/YAML file for performance validation. |
| `--no-validation` | Disable hardware performance validation. |
| **Miscellaneous** | |
| `--temp-dir <path>` | Directory for temp files (multi-GPU result collection). Falls back to `TORCH_HAMMER_TEMP_DIR` env var, then system temp. |
### Supported Precisions

Each benchmark has its own `--precision-<test>` flag. The available data types vary by benchmark:

| Benchmark | `float16` | `bfloat16` | `float32` | `float64` | `complex64` | `complex128` | Default |
|---|---|---|---|---|---|---|---|
| Batched GEMM | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | float32 |
| Convolution | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | float32 |
| FFT | ✅ | ❌ | ✅ | ✅ | ✅ | ✅ | float32 |
| Einsum | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | float32 |
| Memory Traffic | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | float32 |
| Heat Equation | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | float32 |
| Schrödinger | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | float32 |
| Atomic Contention | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ | float32 |
| Sparse MM | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ | float32 |

> Note: FFT does not support `bfloat16` (a `torch.fft.fftn` limitation). Atomic Contention and Sparse MM only support `float32` and `float64`. Using TF32 mode (`--batched-gemm-TF32-mode`) forces `float32` regardless of `--precision-gemm`.
### Batched GEMM

| Option | Description |
|---|---|
| `--batched-gemm` | Enable this benchmark. |
| `--batch-count-gemm <int>` | Batch size (default: 128). |
| `--m` / `--n` / `--k` `<int>` | Matrix sizes M×K · K×N (default: 512 each). |
| `--inner-loop-batched-gemm <int>` | Timed iterations (default: 10). |
| `--precision-gemm <dtype>` | Data type (default: float32). See Supported Precisions. |
| `--batched-gemm-TF32-mode` | Allow TF32 (if hardware support exists). |
### Convolution

| Option | Description |
|---|---|
| `--convolution` | Enable convolution test. |
| `--batch-count-convolution <int>` | Batch size (default: 128). |
| `--in-channels` / `--out-channels` `<int>` | Default: 3 / 64. |
| `--height` / `--width` `<int>` | Input size (default: 128×128). |
| `--kernel-size <int>` | Kernel size (default: 3). |
| `--inner-loop-convolution <int>` | Timed iterations (default: 10). |
| `--precision-convolution <dtype>` | Data type (default: float32). See Supported Precisions. |
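A minimal sketch of the 2-D convolution pass with the documented default shapes (illustrative; the tool may construct its layer differently):

```python
# Illustrative conv2d pass matching the documented defaults.
import torch
import torch.nn.functional as F

x = torch.randn(128, 3, 128, 128)   # batch x in_channels x H x W
w = torch.randn(64, 3, 3, 3)        # out_channels x in_channels x kH x kW
y = F.conv2d(x, w, padding=1)       # padding=1 preserves 128x128 for k=3
print(y.shape)                      # (128, 64, 128, 128)
```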
### 3-D FFT

| Option | Description |
|---|---|
| `--fft` | Enable 3-D FFT test. |
| `--batch-count-fft <int>` | Batch size (default: 128). |
| `--nx` / `--ny` / `--nz` `<int>` | Grid size (default: 128³). |
| `--inner-loop-fft <int>` | Timed iterations (default: 10). |
| `--precision-fft <dtype>` | Data type (default: float32). See Supported Precisions. |
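A quick illustration of the batched 3-D transform (shapes are scaled down for brevity; Torch Hammer's exact call may differ):

```python
# Illustrative batched 3-D FFT via torch.fft.fftn.
import torch

x = torch.randn(8, 64, 64, 64, dtype=torch.complex64)   # batch x nx x ny x nz
y = torch.fft.fftn(x, dim=(-3, -2, -1))                  # transform the spatial dims
print(y.shape)
```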
### Einsum

| Option | Description |
|---|---|
| `--einsum` | Enable einsum test. |
| `--batch-count-einsum <int>` | Batch size (default: 128). |
| `--heads` / `--seq-len` / `--d-model` `<int>` | Default: 8 / 128 / 64. |
| `--inner-loop-einsum <int>` | Timed iterations (default: 10). |
| `--precision-einsum <dtype>` | Data type (default: float32). See Supported Precisions. |
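As a rough illustration of the attention-style contraction this test exercises (the exact einsum subscripts Torch Hammer uses may differ), here is a Q·Kᵀ score computation with the documented default shapes:

```python
# Illustrative attention-style einsum; the subscripts are an assumption.
import torch

b, h, s, d = 128, 8, 128, 64          # batch, heads, seq-len, d-model defaults
q = torch.randn(b, h, s, d)
k = torch.randn(b, h, s, d)
scores = torch.einsum("bhqd,bhkd->bhqk", q, k)   # -> (b, h, s, s)
print(scores.shape)
```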
### Memory Traffic

| Option | Description |
|---|---|
| `--memory-traffic` | Enable random traffic test. |
| `--memory-size <int>` | Elements in array (default: 1024). |
| `--memory-iterations <int>` | Inner loop per timing (default: 10). |
| `--memory-pattern <pattern>` | Access pattern: random (random indexing), streaming (sequential), unit (stride-1). Default: random. |
| `--inner-loop-memory-traffic <int>` | Timed iterations (default: 10). |
| `--precision-memory <dtype>` | Data type (default: float32). Supports all 6 precisions. |
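The three access patterns map naturally onto simple tensor operations; a hedged sketch of the assumed semantics (Torch Hammer's kernels may differ):

```python
# Assumed semantics of the three --memory-pattern modes; illustrative only.
import torch

n = 1 << 20
data = torch.randn(n)

idx = torch.randint(0, n, (n,))
random_read = data[idx]        # "random": data-dependent gather
streaming = data.clone()       # "streaming": sequential bulk copy
unit = data + 1.0              # "unit": stride-1 read-modify-write
```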
### Heat Equation

| Option | Description |
|---|---|
| `--heat-equation` | Enable Laplacian stencil solver. |
| `--heat-grid-size <int>` | Grid size (default: 128). |
| `--heat-time-steps <int>` | Steps (default: 100). |
| `--alpha <float>` | Thermal diffusivity (default: 0.01). |
| `--delta-t <float>` | Time increment (default: 0.01). |
| `--inner-loop-heat-equation <int>` | Timed iterations (default: 10). |
| `--precision-heat <dtype>` | Data type (default: float32). See Supported Precisions. |
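The heat-equation test is a Laplacian stencil; below is a minimal sketch of a 2-D five-point explicit update with the documented defaults (illustrative; grid dimensionality and boundary handling in Torch Hammer's kernel may differ):

```python
# Illustrative explicit heat-equation step; not the tool's kernel.
import torch

n, alpha, dt = 128, 0.01, 0.01
u = torch.rand(n, n)

for _ in range(100):                        # --heat-time-steps
    lap = (torch.roll(u, 1, 0) + torch.roll(u, -1, 0)
           + torch.roll(u, 1, 1) + torch.roll(u, -1, 1) - 4 * u)
    u = u + alpha * dt * lap                # periodic boundaries via roll
```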
### Schrödinger

| Option | Description |
|---|---|
| `--schrodinger` | Enable quantum simulation. |
| `--schrodinger-grid-size <int>` | Grid points (default: 128). |
| `--schrodinger-time-steps <int>` | Steps (default: 100). |
| `--schrodinger-delta-x` / `--schrodinger-delta-t` `<float>` | Default: 0.1 / 0.01. |
| `--schrodinger-hbar` / `--schrodinger-mass` `<float>` | Default: 1.0 / 1.0. |
| `--schrodinger-potential {harmonic, barrier}` | Potential (default: harmonic). |
| `--inner-loop-schrodinger <int>` | Timed iterations (default: 10). |
| `--precision-schrodinger <dtype>` | Data type (default: float32). See Supported Precisions. |
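For reference, one explicit finite-difference step of the 1-D time-dependent Schrödinger equation iħ ∂ψ/∂t = −ħ²/(2m) ∂²ψ/∂x² + Vψ looks like the sketch below (purely illustrative of the physics; Torch Hammer's integrator and stability handling may differ):

```python
# Illustrative 1-D TDSE step (explicit Euler); not the tool's integrator.
import torch

n, dx, dt, hbar, m = 128, 0.1, 0.01, 1.0, 1.0
x = torch.arange(n, dtype=torch.float64) * dx
V = 0.5 * (x - x.mean()) ** 2                          # harmonic potential
psi = torch.exp(-((x - x.mean()) ** 2)).to(torch.complex128)

lap = (torch.roll(psi, 1) + torch.roll(psi, -1) - 2 * psi) / dx**2
dpsi_dt = (1j * hbar / (2 * m)) * lap - (1j / hbar) * V * psi
psi = psi + dt * dpsi_dt                               # one explicit Euler step
```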
### Atomic Contention

| Option | Description |
|---|---|
| `--atomic-contention` | Enable L2 cache atomic stress test. |
| `--atomic-target-size <int>` | Size of target array (default: 1,000,000). |
| `--atomic-num-updates <int>` | Number of scatter_add updates per iter (default: 10,000,000). |
| `--atomic-contention-range <int>` | Max unique indices; lower = more contention (default: 1024). |
| `--inner-loop-atomic <int>` | Timed iterations (default: 50). |
| `--precision-atomic <dtype>` | Data type (default: float32). No complex types — choices: float16, bfloat16, float32, float64. |
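The core operation here is `scatter_add_` with indices confined to a small range, which concentrates atomic updates on few cache lines; a hedged sketch using the documented defaults:

```python
# Assumed shape of the contention workload; illustrative only.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
target = torch.zeros(1_000_000, device=device)               # --atomic-target-size
idx = torch.randint(0, 1024, (10_000_000,), device=device)   # --atomic-contention-range
src = torch.ones(10_000_000, device=device)                  # --atomic-num-updates

target.scatter_add_(0, idx, src)   # atomics pile up on just 1024 slots
```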
### Sparse Matrix Multiplication

| Option | Description |
|---|---|
| `--sparse-mm` | Enable sparse matrix multiply test. |
| `--sparse-m <int>` | Sparse matrix rows (default: 8192). |
| `--sparse-n <int>` | Dense/output columns (default: 8192). |
| `--sparse-k <int>` | Sparse cols / dense rows (default: 8192). |
| `--sparse-density <float>` | Density (0.10 = 10% non-zeros; default: 0.10). |
| `--inner-loop-sparse <int>` | Timed iterations (default: 50). |
| `--precision-sparse <dtype>` | Data type (default: float32). No complex types — choices: float16, bfloat16, float32, float64. |
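A hedged sketch of the SpMM pattern, using sizes smaller than the 8192 defaults so it runs quickly on CPU (Torch Hammer's sparse format and kernel may differ):

```python
# Illustrative sparse x dense multiply; not the tool's implementation.
import torch

m, n, k, density = 1024, 1024, 1024, 0.10
mask = (torch.rand(m, k) < density).float()
a_sparse = (torch.randn(m, k) * mask).to_sparse()   # ~10% non-zeros, COO format
b = torch.randn(k, n)

c = torch.sparse.mm(a_sparse, b)                    # dense (m, n) result
print(c.shape)
```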
### YAML Configuration

For complex test suites, use YAML configuration files instead of long command lines. All benchmarks are supported, with full parameter control and multiple instances of the same test.

Example `config.yaml`:
```yaml
# Global settings
verbose: true
log_file: "stress-test.log"
device_index: 0
warmup: 20
repeats: 3
repeat_delay: 10

# Output modes (uncomment to enable)
# compact: true        # Machine-readable CSV to stdout
# syslog: true         # Structured key=value entries to syslog
# syslog_dmesg: true   # Also write to /dev/kmsg (requires syslog: true)

# Benchmark suite - can specify same test multiple times with different params
benchmarks:
  - name: batched_gemm
    precision: float32
    batch_count: 128
    m: 4096
    n: 4096
    k: 4096
    inner_loop: 100

  - name: batched_gemm
    precision: float64
    batch_count: 64
    m: 8192
    n: 8192
    k: 8192
    TF32_mode: false

  - name: convolution
    precision: bfloat16
    batch_count: 256
    in_channels: 64
    out_channels: 128

  - name: fft
    precision: float32
    nx: 256
    ny: 256
    nz: 256

  - name: heat_equation
    precision: float64
    grid_size: 32768
    time_steps: 100
```

Run with:

```bash
./torch-hammer.py --config config.yaml
```

CLI arguments override YAML settings:
```bash
# Use config but override to run on GPU 2
./torch-hammer.py --config config.yaml --device-index 2

# Suppress stdout while keeping verbose CSV in log file
./torch-hammer.py --config config.yaml --verbose-file-only
```

See the `config-examples/` directory for ready-to-use configurations (`quick-test.yaml`, `stress-test.yaml`, `platform-stress.yaml`).
### Hardware Performance Validation

Torch Hammer can compare your measured results against expected hardware performance and flag issues automatically. This is opt-in — no validation runs unless you supply a baseline file.
```bash
# Run with baseline validation
./torch-hammer.py --batched-gemm --baseline-file baselines/example.yaml

# Run without validation (the default)
./torch-hammer.py --batched-gemm

# Explicitly disable validation even if a baseline file is loaded
./torch-hammer.py --batched-gemm --baseline-file my_baselines.yaml --no-validation
```

The model name in the baseline file must match what Torch Hammer detects.
Run a quick test and look for the model field:

```bash
./torch-hammer.py --batched-gemm --inner-loop-batched-gemm 1 --verbose 2>&1 | grep model
# → 'model': 'NVIDIA GH200 120GB'

# Or use vendor tools directly:
nvidia-smi --query-gpu=name --format=csv,noheader   # NVIDIA
rocm-smi --showproductname                          # AMD
```

YAML (recommended — supports comments):
"NVIDIA GH200 120GB":
benchmarks:
batched_gemm:
float32:
target_gflops: 49000.0
min_efficiency: 90.0
float64:
target_gflops: 43000.0
min_efficiency: 85.0
tf32:
target_gflops: 252000.0
min_efficiency: 85.0
heat_equation:
float64:
target_mlups: 20000.0
min_efficiency: 80.0
memory_traffic:
float32:
target_gbps: 4500.0
min_efficiency: 75.0This target-based format lets you set per-benchmark, per-dtype expected values
and minimum efficiency thresholds. If a benchmark/dtype pair isn't found in
benchmarks:, validation falls back to the top-level fp32_tflops/fp64_tflops
values automatically.
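A sketch of that lookup order (hypothetical helper, not the tool's code; `resolve_target` and its arguments are made up for illustration):

```python
# Hypothetical illustration of the documented fallback behaviour.
def resolve_target(baseline: dict, bench: str, dtype: str):
    try:
        return baseline["benchmarks"][bench][dtype]["target_gflops"]
    except KeyError:
        key = {"float32": "fp32_tflops", "float64": "fp64_tflops"}.get(dtype)
        tflops = baseline.get(key) if key else None
        return tflops * 1000.0 if tflops is not None else None

gh200 = {"fp32_tflops": 51.0, "benchmarks": {}}
print(resolve_target(gh200, "batched_gemm", "float32"))   # -> 51000.0
```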
JSON alternative:

```json
{
  "NVIDIA GH200 120GB": {
    "fp32_tflops": 51.0,
    "fp64_tflops": 25.5,
    "tf32_tflops": 756.0,
    "memory_bandwidth_gbps": 4800.0,
    "tdp_watts": 900
  }
}
```

For sites with multiple GPU types, list them all in one file. Keys must match the model string exactly (case-sensitive):
```yaml
# my_baselines.yaml
# Keys must match the model string exactly (case-sensitive)
# The performance figures below are examples; please check vendor
# datasheets for specific figures

"NVIDIA GH200 120GB":
  fp32_tflops: 51.0               # FP32 peak TFLOPS
  fp64_tflops: 25.5               # FP64 peak TFLOPS
  tf32_tflops: 756.0              # TF32 peak TFLOPS (Tensor Core)
  memory_bandwidth_gbps: 4800.0   # HBM3e bandwidth in GB/s
  tdp_watts: 900                  # Thermal Design Power in Watts

"AMD Instinct MI300X":
  fp32_tflops: 163.4
  fp64_tflops: 81.7
  memory_bandwidth_gbps: 5300.0
  tdp_watts: 750

"NVIDIA A100-SXM4-80GB":
  fp32_tflops: 19.5
  fp64_tflops: 9.7
  tf32_tflops: 156.0
  memory_bandwidth_gbps: 2039.0
  tdp_watts: 400

"AMD Instinct MI250X":
  fp32_tflops: 47.9
  fp64_tflops: 47.9
  memory_bandwidth_gbps: 3276.8
  tdp_watts: 500
```

| Field | Unit | Used By |
|---|---|---|
| `fp32_tflops` | TFLOPS | GEMM, Convolution, FFT, Einsum (float32) |
| `fp64_tflops` | TFLOPS | GEMM, Convolution, FFT, Einsum (float64) |
| `fp16_tflops` | TFLOPS | GEMM, Convolution, FFT, Einsum (float16/bfloat16) |
| `tf32_tflops` | TFLOPS | GEMM with `--batched-gemm-TF32-mode` |
| `memory_bandwidth_gbps` | GB/s | Memory Traffic, Heat Equation, Schrödinger |
| `tdp_watts` | Watts | Informational (not used for validation today) |
When baselines are loaded, Torch Hammer reports efficiency alongside results:

```
✅ Excellent performance: 96.3% of target (49000.0 gflops)
✅ Good performance: 82.1% of target (43000.0 gflops)
[WARN] Performance below target: 58.2% of expected 49000.0 gflops (threshold: 70%)
```
The warning thresholds can be adjusted per-run via CLI flags:

| Flag | Default | Purpose |
|---|---|---|
| `--efficiency-warn-pct` | 70% | Flag results below this % of peak/target |
| `--temp-warn-C` | 90°C | Temperature warning threshold |
| `--temp-critical-C` | 95°C | Temperature critical threshold |
| `--power-warn-pct` | 98% | Power-limit proximity warning |
| `--outlier-threshold-pct` | 15% | Multi-GPU outlier detection |
See baselines/ directory for ready-to-use example files.
## Verbose Mode

`--verbose` prints every iteration as a single CSV row (see the examples below). Save it directly to a file via `--log-file myrun.txt`.
## Compact Mode

`--compact` produces machine-readable CSV on stdout — one row per benchmark, with all log chatter suppressed.
```bash
# CSV to stdout, warnings-only on stderr
./torch-hammer.py --compact --batched-gemm --fft > results.csv

# Pipe-friendly: only CSV reaches the file
./torch-hammer.py --compact --batched-gemm --fft 2>/dev/null > results.csv
```

| Column | Description |
|---|---|
| `hostname` | Node hostname |
| `gpu` | GPU index (0, 1, …) |
| `gpu_model` | GPU model string |
| `serial` | GPU serial number |
| `benchmark` | Benchmark name (e.g. Batched GEMM) |
| `dtype` | Data type used |
| `iterations` | Number of timed iterations completed |
| `runtime_s` | Wall-clock time for the timed loop (seconds) |
| `min` | Minimum performance value |
| `mean` | Mean performance value |
| `max` | Maximum performance value |
| `unit` | Performance unit (GFLOP/s, GB/s, etc.) |
| `power_avg_w` | Average power draw (watts) |
| `temp_max_c` | Peak GPU temperature (°C) |
Adding `--verbose` appends five telemetry columns:

| Column | Description |
|---|---|
| `sm_util_mean` | Mean SM / CU utilisation (%) |
| `mem_bw_util_mean` | Mean memory-bandwidth utilisation (%) |
| `gpu_clock_mean` | Mean GPU clock (MHz) |
| `mem_used_gb_mean` | Mean memory used (GB) |
| `throttled` | true / false — whether throttling was detected |
```bash
# 19-column CSV with extra telemetry
./torch-hammer.py --compact --verbose --batched-gemm > detailed.csv
```

In multi-GPU mode (`--all-gpus` / `--gpu-list`) the parent process prints a single CSV header before spawning workers; each worker appends its own data rows. The normal multi-GPU summary table is suppressed — the CSV is the summary.
```bash
./torch-hammer.py --compact --all-gpus --batched-gemm > results.csv
```

- stdout = pure CSV (header + data rows).
- stderr = warnings / errors only (log level `WARNING`).
- `--compact` does not emit per-iteration lines; only `--verbose` does.
- Combine `--compact --verbose` to get extra telemetry columns on the summary row (not extra rows).
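Because the output is plain CSV with the documented header, post-processing needs nothing beyond the standard library; for example:

```python
# Read a compact-mode CSV and print one summary line per benchmark row.
import csv

with open("results.csv", newline="") as f:
    for row in csv.DictReader(f):
        print(f'{row["hostname"]} gpu{row["gpu"]} {row["benchmark"]}: '
              f'{row["mean"]} {row["unit"]}')
```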
## Syslog Mode

`--syslog` emits structured key=value entries to the system log after each benchmark completes. This enables integration with log aggregators such as Splunk, Elastic, and journald without scraping CSV files.
```bash
# Emit to syslog (LOG_USER facility)
./torch-hammer.py --syslog --batched-gemm --fft

# Also write to dmesg ring buffer (requires root / CAP_SYSLOG)
sudo ./torch-hammer.py --syslog --syslog-dmesg --all-gpus

# Combine with compact CSV output
./torch-hammer.py --syslog --compact --batched-gemm > results.csv
```

| Tag | When | Example |
|---|---|---|
| `RUN_START` | Before first benchmark | `RUN_START run_id=a3f1b2c4 ts=2025-… host=node01 gpus=4` |
| `BENCH_RESULT` | After each benchmark | `BENCH_RESULT run_id=a3f1b2c4 hostname=node01 gpu=0 benchmark=Batched_GEMM mean=42.5 unit=TFLOPS …` |
| `RUN_END` | After last benchmark | `RUN_END run_id=a3f1b2c4 ts=2025-… passed=9 failed=0 total_elapsed=123.5s` |
Syslog priority is derived from your existing threshold flags — no extra configuration required:
| Condition | Severity |
|---|---|
| `status=FAIL` | LOG_ERR |
| `temp_max_c` ≥ `--temp-critical-C` | LOG_CRIT |
| `temp_max_c` ≥ `--temp-warn-C` | LOG_WARNING |
| `throttled=true` | LOG_WARNING |
| `efficiency_pct` < `--efficiency-warn-pct` | LOG_WARNING |
| Clean pass | LOG_INFO |
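The mapping is easy to reproduce downstream; here is a sketch using Python's standard `syslog` module (illustrative of the documented rules, not Torch Hammer's internals):

```python
# Illustrative severity derivation mirroring the table above.
import syslog

def derive_priority(status, temp_max_c, throttled, efficiency_pct,
                    temp_warn=90.0, temp_crit=95.0, eff_warn=70.0):
    if status == "FAIL":
        return syslog.LOG_ERR
    if temp_max_c >= temp_crit:
        return syslog.LOG_CRIT
    if temp_max_c >= temp_warn or throttled or efficiency_pct < eff_warn:
        return syslog.LOG_WARNING
    return syslog.LOG_INFO

syslog.openlog("torch-hammer", facility=syslog.LOG_USER)
syslog.syslog(derive_priority("PASS", 56, False, 96.3),
              "BENCH_RESULT benchmark=Batched_GEMM mean=49032.3 unit=GFLOPS")
```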
Every message includes a `run_id` — the first 8 hex characters of a UUID4 generated at invocation time. Use it to correlate all messages from a single run (e.g. `journalctl | grep run_id=a3f1b2c4`).
Each `BENCH_RESULT` entry contains the same fields as Compact Mode plus `status`, `efficiency_pct` (when baselines are loaded), and `throttled` (when detected). Spaces in values are replaced with underscores.
`--syslog-dmesg` writes the same messages to `/dev/kmsg` so they appear in `dmesg` alongside kernel events. This is useful for correlating GPU issues with Xid/MCE errors.
If `/dev/kmsg` is unavailable or the process lacks permissions, a warning is printed and execution continues with syslog only — a dmesg failure never crashes the benchmark.
In multi-GPU mode the parent generates a single `run_id` shared by all workers and the parent framing messages. Every `RUN_START`, `BENCH_RESULT`, and `RUN_END` from the same invocation carries the same `run_id`:

```bash
journalctl -t torch-hammer -t torch-hammer-mgpu | grep run_id=c4d4a27d
```

The parent emits a framing `RUN_START` (with `gpus=N`) and `RUN_END` with aggregate pass/fail counts using the `torch-hammer-mgpu` syslog tag.
- `--syslog-dmesg` requires `--syslog` (same pattern as `--verbose-file-only` requires `--log-file`).
- Syslog uses the `LOG_USER` facility with tag `torch-hammer` (or `torch-hammer-mgpu` for the parent frame).
- All messages are emitted per-benchmark (not batched) for crash survivability.
- `--syslog` and `--compact` can be combined — they are independent output channels.
## Telemetry Back-ends

| Hardware | Requirements | Reported Fields* |
|---|---|---|
| NVIDIA GPU | `pynvml` (nvidia-ml-py) | power, GPU temperature, HBM temperature, GPU utilization, SM utilization, clock, memory controller activity, memory utilization, thermal throttling status, power limit warnings, ECC errors |
| AMD ROCm GPU/APU | `amdsmi` Python library | GPU utilization, memory utilization, GPU clock, memory clock, power, temperature (edge/hotspot/memory), serial, throttle status, power cap, thermal warnings |
| CPU only | None | Vendor, architecture |
* AMD telemetry uses native amdsmi Python API for ROCm 6.1+ with comprehensive metrics including throttle detection, power limit monitoring, and thermal warnings. Per-processor-type field sets automatically adapt for GPU vs CPU/APU.
## Reporting

`reports/hammer_report.py` aggregates results from multi-node runs into CLI summaries, static HTML reports, or interactive Plotly dashboards. It accepts any torch-hammer output format (compact CSV, summary CSV, JSON, verbose logs, or raw shell captures).

```bash
# CLI summary with outlier detection
python reports/hammer_report.py results/ --quiet --outlier-threshold 10

# Static HTML report
python reports/hammer_report.py results/ -o report.html

# Interactive Plotly dashboard
python reports/hammer_report.py results/ --interactive -o dashboard.html
```

See `reports/README.md` for the full CLI reference, input format details, and chart descriptions.
## Examples

### Run the GEMM test against an NVIDIA GH200 module

```bash
./torch-hammer.py --batched-gemm --k 16384 --m 4224 --n 2048 --verbose
```
```
2025-10-28T21:19:11 INFO Using device cuda:0
2025-10-28T21:19:11 INFO Initial telemetry {'vendor': 'NVIDIA', 'model': 'NVIDIA GH200 120GB', 'device_id': 0, 'hostname': 'nid001000', 'serial': '1652422128547', 'sm_util': 2, 'mem_bw_util': 0, 'mem_util': '0.6', 'gpu_clock': 1980, 'mem_clock': 2619, 'power_W': 106.075, 'temp_gpu_C': 34}
2025-10-28T21:19:20 INFO iter, test, dtype, gflops, vendor, model, hostname, device_id, serial, sm_util, mem_bw_util, mem_util, gpu_clock, mem_clock, power_W, temp_gpu_C, temp_hbm_C
2025-10-28T21:19:20 INFO 1, gemm, float32, 49098.63, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 29, 56.6, 1770, 2619, 567.324, 55,
2025-10-28T21:19:20 INFO 2, gemm, float32, 49008.41, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 29, 56.6, 1770, 2619, 566.19, 55,
2025-10-28T21:19:21 INFO 3, gemm, float32, 49011.46, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 29, 56.6, 1785, 2619, 566.174, 55,
2025-10-28T21:19:22 INFO 4, gemm, float32, 49035.82, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 29, 56.6, 1785, 2619, 567.226, 55,
2025-10-28T21:19:23 INFO 5, gemm, float32, 49054.14, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 29, 56.6, 1785, 2619, 566.38, 55,
2025-10-28T21:19:23 INFO 6, gemm, float32, 49003.10, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 29, 56.6, 1770, 2619, 565.378, 55,
2025-10-28T21:19:24 INFO 7, gemm, float32, 49041.47, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 29, 56.6, 1770, 2619, 565.83, 55,
2025-10-28T21:19:25 INFO 8, gemm, float32, 49069.36, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 29, 56.6, 1785, 2619, 567.327, 55,
2025-10-28T21:19:26 INFO 9, gemm, float32, 48968.03, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 29, 56.6, 1755, 2619, 565.995, 56,
2025-10-28T21:19:26 INFO 10, gemm, float32, 49032.55, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 29, 56.6, 1770, 2619, 566.18, 56,
2025-10-28T21:19:26 INFO [Batched GEMM] 48968.03/49032.30/49098.63 GFLOP/s (min/mean/max) {'vendor': 'NVIDIA', 'model': 'NVIDIA GH200 120GB', 'device_id': 0, 'hostname': 'nid001000', 'serial': '1652422128547', 'sm_util': 100, 'mem_bw_util': 29, 'mem_util': '56.6', 'gpu_clock': 1770, 'mem_clock': 2619, 'power_W': 566.18, 'temp_gpu_C': 56}
2025-10-28T21:19:26 INFO Final telemetry {'vendor': 'NVIDIA', 'model': 'NVIDIA GH200 120GB', 'device_id': 0, 'hostname': 'nid001000', 'serial': '1652422128547', 'sm_util': 100, 'mem_bw_util': 29, 'mem_util': '56.6', 'gpu_clock': 1770, 'mem_clock': 2619, 'power_W': 566.18, 'temp_gpu_C': 56}
2025-10-28T21:19:26 INFO Benchmark run finished.
```
### Run the GEMM test against an NVIDIA GH200 module using the float64 data type

```bash
./torch-hammer.py --batch-count-gemm=106 --batched-gemm --k 16384 --m 4224 --n 2048 --precision-gemm float64 --verbose
```
```
2025-10-28T23:10:54 INFO Using device cuda:0
2025-10-28T23:10:54 INFO Initial telemetry {'vendor': 'NVIDIA', 'model': 'NVIDIA GH200 120GB', 'device_id': 0, 'hostname': 'nid001000', 'serial': '1652422128547', 'sm_util': 2, 'mem_bw_util': 0, 'mem_util': '0.6', 'gpu_clock': 1980, 'mem_clock': 2619, 'power_W': 106.409, 'temp_gpu_C': 35}
2025-10-28T23:11:01 INFO iter, test, dtype, gflops, vendor, model, hostname, device_id, serial, sm_util, mem_bw_util, mem_util, gpu_clock, mem_clock, power_W, temp_gpu_C, temp_hbm_C
2025-10-28T23:11:01 INFO 1, gemm, float64, 43615.45, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 50, 93.3, 1335, 2619, 565.706, 53,
2025-10-28T23:11:02 INFO 2, gemm, float64, 43564.93, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 50, 93.3, 1320, 2619, 565.575, 53,
2025-10-28T23:11:03 INFO 3, gemm, float64, 43479.01, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 51, 93.3, 1320, 2619, 565.031, 53,
2025-10-28T23:11:03 INFO 4, gemm, float64, 43452.38, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 51, 93.3, 1320, 2619, 564.582, 54,
2025-10-28T23:11:04 INFO 5, gemm, float64, 43375.87, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 50, 93.3, 1320, 2619, 564.172, 54,
2025-10-28T23:11:05 INFO 6, gemm, float64, 43448.23, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 51, 93.3, 1320, 2619, 563.456, 54,
2025-10-28T23:11:05 INFO 7, gemm, float64, 43404.56, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 51, 93.3, 1320, 2619, 563.842, 54,
2025-10-28T23:11:06 INFO 8, gemm, float64, 43397.94, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 50, 93.3, 1305, 2619, 563.7, 54,
2025-10-28T23:11:07 INFO 9, gemm, float64, 43472.86, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 51, 93.3, 1320, 2619, 563.753, 54,
2025-10-28T23:11:08 INFO 10, gemm, float64, 43436.16, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 51, 93.3, 1335, 2619, 563.484, 54,
2025-10-28T23:11:08 INFO [Batched GEMM] 43375.87/43464.74/43615.45 GFLOP/s (min/mean/max) {'vendor': 'NVIDIA', 'model': 'NVIDIA GH200 120GB', 'device_id': 0, 'hostname': 'nid001000', 'serial': '1652422128547', 'sm_util': 100, 'mem_bw_util': 51, 'mem_util': '93.3', 'gpu_clock': 1335, 'mem_clock': 2619, 'power_W': 563.484, 'temp_gpu_C': 54}
2025-10-28T23:11:08 INFO Final telemetry {'vendor': 'NVIDIA', 'model': 'NVIDIA GH200 120GB', 'device_id': 0, 'hostname': 'nid001000', 'serial': '1652422128547', 'sm_util': 100, 'mem_bw_util': 51, 'mem_util': '93.3', 'gpu_clock': 1335, 'mem_clock': 2619, 'power_W': 563.484, 'temp_gpu_C': 54}
2025-10-28T23:11:08 INFO Benchmark run finished.
```
### Run the GEMM test, utilizing Tensor Cores (if available)

```bash
./torch-hammer.py --batch-count-gemm=106 --batched-gemm --k 16384 --m 4224 --n 2048 --batched-gemm-TF32-mode --verbose
```
```
2025-10-28T23:23:40 INFO Using device cuda:0
2025-10-28T23:23:40 INFO Initial telemetry {'vendor': 'NVIDIA', 'model': 'NVIDIA GH200 120GB', 'device_id': 0, 'hostname': 'nid001000', 'serial': '1652422128547', 'sm_util': 0, 'mem_bw_util': 0, 'mem_util': '0.6', 'gpu_clock': 1980, 'mem_clock': 2619, 'power_W': 106.708, 'temp_gpu_C': 34}
2025-10-28T23:23:41 INFO iter, test, dtype, gflops, vendor, model, hostname, device_id, serial, sm_util, mem_bw_util, mem_util, gpu_clock, mem_clock, power_W, temp_gpu_C, temp_hbm_C
2025-10-28T23:23:41 INFO 1, gemm, float32, 256246.11, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 67, 47.0, 1260, 2619, 559.151, 48,
2025-10-28T23:23:41 INFO 2, gemm, float32, 253168.00, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 67, 47.0, 1260, 2619, 558.875, 48,
2025-10-28T23:23:41 INFO 3, gemm, float32, 251230.15, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 70, 47.0, 1230, 2619, 559.197, 48,
2025-10-28T23:23:41 INFO 4, gemm, float32, 252279.34, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 70, 47.0, 1245, 2619, 557.965, 48,
2025-10-28T23:23:42 INFO 5, gemm, float32, 253690.52, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 72, 47.0, 1245, 2619, 557.214, 48,
2025-10-28T23:23:42 INFO 6, gemm, float32, 251942.72, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 68, 47.0, 1260, 2619, 556.396, 49,
2025-10-28T23:23:42 INFO 7, gemm, float32, 252252.57, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 68, 47.0, 1245, 2619, 558.838, 49,
2025-10-28T23:23:42 INFO 8, gemm, float32, 252393.06, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 69, 47.0, 1245, 2619, 561.048, 49,
2025-10-28T23:23:42 INFO 9, gemm, float32, 252073.11, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 69, 47.0, 1260, 2619, 559.898, 49,
2025-10-28T23:23:42 INFO 10, gemm, float32, 252572.02, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 67, 47.0, 1260, 2619, 558.2, 49,
2025-10-28T23:23:42 INFO [Batched GEMM] 251230.15/252784.76/256246.11 GFLOP/s (min/mean/max) {'vendor': 'NVIDIA', 'model': 'NVIDIA GH200 120GB', 'device_id': 0, 'hostname': 'nid001000', 'serial': '1652422128547', 'sm_util': 100, 'mem_bw_util': 67, 'mem_util': '47.0', 'gpu_clock': 1260, 'mem_clock': 2619, 'power_W': 558.2, 'temp_gpu_C': 49}
2025-10-28T23:23:42 INFO Final telemetry {'vendor': 'NVIDIA', 'model': 'NVIDIA GH200 120GB', 'device_id': 0, 'hostname': 'nid001000', 'serial': '1652422128547', 'sm_util': 100, 'mem_bw_util': 67, 'mem_util': '47.0', 'gpu_clock': 1260, 'mem_clock': 2619, 'power_W': 558.2, 'temp_gpu_C': 49}
2025-10-28T23:23:42 INFO Benchmark run finished.
```
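The roughly 5× jump over the plain float32 run comes from routing matmuls through Tensor Cores. The switch that `--batched-gemm-TF32-mode` flips is presumably PyTorch's standard TF32 knob, which you can also set by hand:

```python
# Standard PyTorch TF32 controls (Ampere and newer NVIDIA GPUs).
import torch

torch.backends.cuda.matmul.allow_tf32 = True   # TF32 for matmuls
torch.backends.cudnn.allow_tf32 = True         # TF32 for cuDNN convolutions
```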
### Run the Laplacian Heat Equation

```bash
./torch-hammer.py --heat-equation --heat-grid-size 32768 --heat-time-steps 100 --alpha .000127 --delta-t .001 --precision-heat float64 --verbose
```
```
2025-10-28T23:41:52 INFO Using device cuda:0
2025-10-28T23:41:52 INFO Initial telemetry {'vendor': 'NVIDIA', 'model': 'NVIDIA GH200 120GB', 'device_id': 0, 'hostname': 'nid001000', 'serial': '1652422128547', 'sm_util': 2, 'mem_bw_util': 0, 'mem_util': '0.6', 'gpu_clock': 1980, 'mem_clock': 2619, 'power_W': 105.937, 'temp_gpu_C': 34}
2025-10-28T23:41:52 INFO iter, test, dtype, vendor, model, hostname, device_id, serial, sm_util, mem_bw_util, mem_util, gpu_clock, mem_clock, power_W, temp_gpu_C, temp_hbm_C
2025-10-28T23:41:52 INFO 0.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 2, 0, 42.8, 1980, 2619, 113.762, 35,
2025-10-28T23:41:52 INFO 1.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 2, 0, 42.8, 1980, 2619, 113.762, 35,
2025-10-28T23:41:52 INFO 2.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 2, 0, 42.8, 1980, 2619, 113.762, 35,
2025-10-28T23:41:52 INFO 3.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 2, 0, 42.8, 1980, 2619, 113.762, 35,
2025-10-28T23:41:52 INFO 4.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 2, 0, 42.8, 1980, 2619, 113.762, 35,
2025-10-28T23:41:52 INFO 5.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 2, 0, 42.8, 1980, 2619, 113.762, 35,
2025-10-28T23:41:52 INFO 6.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 2, 0, 42.8, 1980, 2619, 113.762, 35,
2025-10-28T23:41:52 INFO 7.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 2, 0, 42.8, 1980, 2619, 113.762, 35,
2025-10-28T23:41:52 INFO 8.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 2, 0, 42.8, 1980, 2619, 113.762, 35,
2025-10-28T23:41:52 INFO 9.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 2, 0, 42.8, 1980, 2619, 113.762, 35,
2025-10-28T23:41:52 INFO 10.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 2, 0, 42.8, 1980, 2619, 113.762, 35,
2025-10-28T23:41:52 INFO 11.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 2, 0, 42.8, 1980, 2619, 113.762, 35,
2025-10-28T23:41:52 INFO 12.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 2, 0, 42.8, 1980, 2619, 113.762, 35,
2025-10-28T23:41:52 INFO 13.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 2, 0, 42.8, 1980, 2619, 113.762, 35,
2025-10-28T23:41:52 INFO 14.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 2, 0, 42.8, 1980, 2619, 113.762, 35,
2025-10-28T23:41:52 INFO 15.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 2, 0, 42.8, 1980, 2619, 113.762, 35,
2025-10-28T23:41:52 INFO 16.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 2, 0, 42.8, 1980, 2619, 113.762, 35,
2025-10-28T23:41:52 INFO 17.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 2, 0, 42.8, 1980, 2619, 113.762, 35,
2025-10-28T23:41:52 INFO 18.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 2, 0, 42.8, 1980, 2619, 113.762, 35,
2025-10-28T23:41:52 INFO 19.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 2, 0, 42.8, 1980, 2619, 113.762, 35,
2025-10-28T23:41:52 INFO 20.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 2, 0, 42.8, 1980, 2619, 113.762, 35,
2025-10-28T23:41:52 INFO 21.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 2, 0, 42.8, 1980, 2619, 113.762, 35,
2025-10-28T23:41:52 INFO 22.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 2, 0, 42.8, 1980, 2619, 113.762, 35,
2025-10-28T23:41:52 INFO 23.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 2, 0, 42.8, 1980, 2619, 113.762, 35,
2025-10-28T23:41:52 INFO 24.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 2, 0, 42.8, 1980, 2619, 113.762, 35,
2025-10-28T23:41:52 INFO 25.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 2, 0, 42.8, 1980, 2619, 113.762, 35,
2025-10-28T23:41:52 INFO 26.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 2, 0, 42.8, 1980, 2619, 113.762, 35,
2025-10-28T23:41:52 INFO 27.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 2, 0, 42.8, 1980, 2619, 113.762, 35,
2025-10-28T23:41:52 INFO 28.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 2, 0, 42.8, 1980, 2619, 113.762, 35,
2025-10-28T23:41:52 INFO 29.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 2, 0, 42.8, 1980, 2619, 113.762, 35,
2025-10-28T23:41:52 INFO 30.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 2, 0, 42.8, 1980, 2619, 113.762, 35,
2025-10-28T23:41:52 INFO 31.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 2, 0, 42.8, 1980, 2619, 154.74, 35,
2025-10-28T23:41:52 INFO 32.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 93, 100, 42.8, 1980, 2619, 154.74, 37,
2025-10-28T23:41:52 INFO 33.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 93, 100, 42.8, 1635, 2619, 204.161, 37,
2025-10-28T23:41:52 INFO 34.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 93, 100, 42.8, 1635, 2619, 204.161, 39,
2025-10-28T23:41:52 INFO 35.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 100, 42.8, 1575, 2619, 256.126, 39,
2025-10-28T23:41:52 INFO 36.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 100, 42.8, 1575, 2619, 256.126, 41,
2025-10-28T23:41:52 INFO 37.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 100, 42.8, 1605, 2619, 305.641, 41,
2025-10-28T23:41:52 INFO 38.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 100, 42.8, 1605, 2619, 305.641, 42,
2025-10-28T23:41:52 INFO 39.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 100, 42.8, 1605, 2619, 365.741, 42,
2025-10-28T23:41:52 INFO 40.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 100, 42.8, 1605, 2619, 365.741, 42,
2025-10-28T23:41:52 INFO 41.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 100, 42.8, 1620, 2619, 409.326, 42,
2025-10-28T23:41:52 INFO 42.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 100, 42.8, 1620, 2619, 409.326, 44,
2025-10-28T23:41:52 INFO 43.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 100, 42.8, 1545, 2619, 464.929, 44,
2025-10-28T23:41:53 INFO 44.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 100, 42.8, 1575, 2619, 464.929, 44,
2025-10-28T23:41:53 INFO 45.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 100, 42.8, 1575, 2619, 506.784, 44,
2025-10-28T23:41:53 INFO 46.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 100, 42.8, 1560, 2619, 506.784, 44,
2025-10-28T23:41:53 INFO 47.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 100, 42.8, 1560, 2619, 552.12, 44,
2025-10-28T23:41:53 INFO 48.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 100, 42.8, 1395, 2619, 552.12, 44,
2025-10-28T23:41:53 INFO 49.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 100, 42.8, 1395, 2619, 589.225, 44,
2025-10-28T23:41:53 INFO 50.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 100, 42.8, 1380, 2619, 585.284, 44,
2025-10-28T23:41:53 INFO 51.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 100, 42.8, 1380, 2619, 585.284, 44,
2025-10-28T23:41:53 INFO 52.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 100, 42.8, 1380, 2619, 576.641, 44,
2025-10-28T23:41:53 INFO 53.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 100, 42.8, 1380, 2619, 576.641, 44,
2025-10-28T23:41:53 INFO 54.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 100, 42.8, 1485, 2619, 572.522, 44,
2025-10-28T23:41:53 INFO 55.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 100, 42.8, 1485, 2619, 572.522, 45,
2025-10-28T23:41:53 INFO 56.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 100, 42.8, 1530, 2619, 571.509, 45,
2025-10-28T23:41:53 INFO 57.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 100, 42.8, 1530, 2619, 571.509, 45,
2025-10-28T23:41:53 INFO 58.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 100, 42.8, 1530, 2619, 570.536, 45,
2025-10-28T23:41:53 INFO 59.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 100, 42.8, 1530, 2619, 570.536, 45,
2025-10-28T23:41:53 INFO 60.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 100, 42.8, 1530, 2619, 569.789, 45,
2025-10-28T23:41:53 INFO 61.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 100, 42.8, 1530, 2619, 569.789, 46,
2025-10-28T23:41:53 INFO 62.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 100, 42.8, 1545, 2619, 570.795, 46,
2025-10-28T23:41:54 INFO 63.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 100, 42.8, 1545, 2619, 570.795, 46,
2025-10-28T23:41:54 INFO 64.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 100, 42.8, 1515, 2619, 569.745, 46,
2025-10-28T23:41:54 INFO 65.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 100, 42.8, 1515, 2619, 569.745, 46,
2025-10-28T23:41:54 INFO 66.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 100, 42.8, 1425, 2619, 566.433, 46,
2025-10-28T23:41:54 INFO 67.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 100, 42.8, 1395, 2619, 566.433, 46,
2025-10-28T23:41:54 INFO 68.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 100, 42.8, 1395, 2619, 567.298, 46,
2025-10-28T23:41:54 INFO 69.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 100, 42.8, 1395, 2619, 567.298, 46,
2025-10-28T23:41:54 INFO 70.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 100, 42.8, 1395, 2619, 568.042, 46,
2025-10-28T23:41:54 INFO 71.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 100, 42.8, 1425, 2619, 568.042, 46,
2025-10-28T23:41:54 INFO 72.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 100, 42.8, 1425, 2619, 569.252, 46,
2025-10-28T23:41:54 INFO 73.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 100, 42.8, 1440, 2619, 570.087, 46,
2025-10-28T23:41:54 INFO 74.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 100, 42.8, 1440, 2619, 570.087, 46,
2025-10-28T23:41:54 INFO 75.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 100, 42.8, 1515, 2619, 569.602, 46,
2025-10-28T23:41:54 INFO 76.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 100, 42.8, 1515, 2619, 569.602, 46,
2025-10-28T23:41:54 INFO 77.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 100, 42.8, 1500, 2619, 568.971, 46,
2025-10-28T23:41:54 INFO 78.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 100, 42.8, 1500, 2619, 568.971, 46,
2025-10-28T23:41:54 INFO 79.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 100, 42.8, 1530, 2619, 569.572, 46,
2025-10-28T23:41:54 INFO 80.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 100, 42.8, 1530, 2619, 569.572, 47,
2025-10-28T23:41:55 INFO 81.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 100, 42.8, 1500, 2619, 569.063, 47,
2025-10-28T23:41:55 INFO 82.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 100, 42.8, 1500, 2619, 569.063, 47,
2025-10-28T23:41:55 INFO 83.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 100, 42.8, 1455, 2619, 567.604, 47,
2025-10-28T23:41:55 INFO 84.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 100, 42.8, 1455, 2619, 567.604, 47,
2025-10-28T23:41:55 INFO 85.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 100, 42.8, 1425, 2619, 567.397, 47,
2025-10-28T23:41:55 INFO 86.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 100, 42.8, 1425, 2619, 567.397, 47,
2025-10-28T23:41:55 INFO 87.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 100, 42.8, 1410, 2619, 567.496, 47,
2025-10-28T23:41:55 INFO 88.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 100, 42.8, 1410, 2619, 567.496, 47,
2025-10-28T23:41:55 INFO 89.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 100, 42.8, 1410, 2619, 568.607, 47,
2025-10-28T23:41:55 INFO 90.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 100, 42.8, 1410, 2619, 568.607, 46,
2025-10-28T23:41:55 INFO 91.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 100, 42.8, 1425, 2619, 569.332, 46,
2025-10-28T23:41:55 INFO 92.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 100, 42.8, 1485, 2619, 569.332, 46,
2025-10-28T23:41:55 INFO 93.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 100, 42.8, 1485, 2619, 569.982, 46,
2025-10-28T23:41:55 INFO 94.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 100, 42.8, 1500, 2619, 569.982, 47,
2025-10-28T23:41:55 INFO 95.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 100, 42.8, 1500, 2619, 568.653, 47,
2025-10-28T23:41:55 INFO 96.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 100, 42.8, 1500, 2619, 568.653, 47,
2025-10-28T23:41:55 INFO 97.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 100, 42.8, 1500, 2619, 568.576, 47,
2025-10-28T23:41:55 INFO 98.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 100, 42.8, 1500, 2619, 568.576, 47,
2025-10-28T23:41:55 INFO 99.00, heat, float64, NVIDIA, NVIDIA GH200 120GB, nid001000, 0, 1652422128547, 100, 100, 42.8, 1500, 2619, 568.041, 47,
2025-10-28T23:41:57 INFO [Heat] 20048.3 MLUPS total 5.36s {'vendor': 'NVIDIA', 'model': 'NVIDIA GH200 120GB', 'device_id': 0, 'hostname': 'nid001000', 'serial': '1652422128547', 'sm_util': 100, 'mem_bw_util': 100, 'mem_util': '42.8', 'gpu_clock': 1455, 'mem_clock': 2619, 'power_W': 567.965, 'temp_gpu_C': 47}
2025-10-28T23:41:57 INFO Final telemetry {'vendor': 'NVIDIA', 'model': 'NVIDIA GH200 120GB', 'device_id': 0, 'hostname': 'nid001000', 'serial': '1652422128547', 'sm_util': 100, 'mem_bw_util': 100, 'mem_util': '42.8', 'gpu_clock': 1455, 'mem_clock': 2619, 'power_W': 567.965, 'temp_gpu_C': 47}
2025-10-28T23:41:57 INFO Benchmark run finished.
```
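As a sanity check on the MLUPS figure: assuming the 32768-point grid is two-dimensional, 100 time steps perform 32768² × 100 ≈ 1.07 × 10¹¹ lattice updates; over the reported 5.36 s that is ≈ 2.0 × 10¹⁰ updates/s, i.e. ≈ 20,000 MLUPS, consistent with the 20048.3 MLUPS summary line.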
### Multi-GPU runs

Run benchmarks on all GPUs simultaneously with per-GPU log files:

```bash
./torch-hammer.py --batched-gemm --all-gpus --log-dir ./logs --verbose-file-only
# Each GPU writes to a separate file: logs/gpu0_<timestamp>.log, logs/gpu1_<timestamp>.log, etc.
# No stdout pollution - all CSV data goes to files only
```

Run on a specific GPU subset with NUMA-aware CPU binding (enabled by default):
```bash
./torch-hammer.py --batched-gemm --gpu-list "0,2,3"
```

```
2025-12-08T15:23:10 INFO GPU 0 on NUMA node 0, using all NUMA CPUs: 0-31
2025-12-08T15:23:10 INFO GPU 2 on NUMA node 1, using all NUMA CPUs: 32-63
2025-12-08T15:23:10 INFO GPU 3 on NUMA node 1, using all NUMA CPUs: 32-63
```

Manual CPU-GPU mapping for fine-grained control:
```bash
./torch-hammer.py --batched-gemm --gpu-list "0,1" --cpu-gpu-map "0:0-15,1:16-31"
```

Run the benchmark suite multiple times for stability testing:
```bash
# Run 10 times to gather statistical distribution
./torch-hammer.py --batched-gemm --repeats 10
```

```
============================================================
REPEAT 1/10
============================================================
[... benchmark output ...]
============================================================
REPEAT 2/10
============================================================
[... benchmark output ...]
```

With thermal stabilization delay between repeats:
```bash
# Run 5 times with a 30-second cooling period between runs
./torch-hammer.py --batched-gemm --repeats 5 --repeat-delay 30
```

Verbose mode includes the repeat number in CSV output:
```bash
./torch-hammer.py --batched-gemm --repeats 3 --verbose --log-file stability.csv
# CSV format:
# repeat, iter, test, dtype, gflops, vendor, model, ...
# 1, 1, gemm, float32, 49123.45, NVIDIA, ...
# 1, 2, gemm, float32, 49087.23, NVIDIA, ...
# ...
# 2, 1, gemm, float32, 48956.12, NVIDIA, ...
# 2, 2, gemm, float32, 48998.76, NVIDIA, ...
```

Run each benchmark for a specific duration instead of fixed iterations:
```bash
# Run GEMM for 60 seconds per repeat, repeated 3 times
./torch-hammer.py --batched-gemm --duration 60 --repeats 3
# Total runtime: ~60s × 3 repeats = ~180s
# (Plus optional --repeat-delay between repeats)
```

Combined with multi-GPU for cluster stress testing:
```bash
# All 8 GPUs run for 5 minutes each (NUMA binding enabled by default)
./torch-hammer.py --all-gpus \
    --batched-gemm --convolution --fft \
    --duration 300 \
    --log-dir ./stress-test-logs \
    --verbose-file-only
```

## Contributing

We welcome contributions from the community! Please read our Contributing Guidelines before submitting pull requests.
Key requirements:

- All commits must be signed off (DCO) - see CONTRIBUTING.md
- Follow the Code of Conduct
- Run the syntax check: `python3 -m py_compile torch-hammer.py`
Areas where help is especially welcome:
- Additional benchmarks or kernels
- Extended telemetry support:
  - Additional AMD metrics (PCIe bandwidth, per-CU utilization)
  - macOS Metal Performance Shaders (MPS) telemetry
- Packaging / CI for wheels and Homebrew formula
- Windows support and testing
Communication:
- Issues & PRs: Use the GitHub Issues and Pull Requests tabs
### Testing

Torch Hammer includes a comprehensive test suite to ensure reliability across updates.
```bash
# Install test dependencies
pip install pytest pytest-cov

# Run all tests
pytest tests/ -v

# Run specific test categories
pytest tests/test_parsing.py -v     # CLI argument parsing
pytest tests/test_compact.py -v     # Compact CSV output mode
pytest tests/test_syslog.py -v      # Syslog output mode
pytest tests/test_utilities.py -v   # Timer, helpers, validation
pytest tests/test_telemetry.py -v   # Telemetry classes
pytest tests/test_smoke.py -v       # Benchmark smoke tests (CPU)

# Run with coverage report
pytest tests/ --cov=. --cov-report=term-missing
```

| Test File | Coverage | Description |
|---|---|---|
| `test_parsing.py` | CLI & config | Argument parsing, CPU-GPU mapping, validation |
| `test_compact.py` | Compact mode | CSV output, columns, header control, logging suppression |
| `test_syslog.py` | Syslog mode | SyslogReporter, DmesgWriter, KV formatting, priority derivation, pass/fail counting, run_id |
| `test_utilities.py` | Core helpers | Timer, VerbosePrinter, GFLOP calculations |
| `test_telemetry.py` | Telemetry | Class structure, thread behavior, factory |
| `test_smoke.py` | Benchmarks | Run each benchmark on CPU with minimal iterations |
| `test_report.py` | Fleet report | Parsing, stats, outlier detection, HTML/SVG/interactive output, XSS safety, throttling |
When contributing new features, please include tests:

```python
# tests/test_myfeature.py
def test_my_new_function(th):
    """Test description."""
    result = th.my_new_function(args)
    assert result is not None
    assert result["expected_key"] == expected_value
```

The `th` fixture (defined in `conftest.py`) provides access to the torch-hammer module.
Note: GPU-specific tests are skipped on machines without GPU hardware. The smoke tests run all benchmarks on CPU to verify basic functionality without requiring specialized hardware.
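If you need a starting point for the fixture itself, one hypothetical way to load a dash-named script as a module looks like this (the repository's actual `conftest.py` may do it differently):

```python
# Hypothetical conftest.py sketch; the real fixture may differ.
import importlib.util
import pytest

@pytest.fixture(scope="session")
def th():
    spec = importlib.util.spec_from_file_location("torch_hammer",
                                                  "torch-hammer.py")
    mod = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(mod)
    return mod
```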
### Continuous Integration

The GitHub Actions workflow (`.github/workflows/gpu-functional.yml`) runs the full pytest suite on ubuntu-latest across multiple Python versions (3.9, 3.11, 3.12) using CPU-only PyTorch. No GPU hardware is required.

```bash
# Run locally — identical to what CI runs
pytest tests/ -v
```

For GPU-specific regression testing on HPC clusters, an exhaustive 92-test ReFrame suite lives in `reframe/ci_functional_checks.py`. See `reframe/README.md` for details on running it with real GPU hardware.
## Project Governance

This project is maintained by Hewlett Packard Enterprise. Contributions are reviewed by the maintainer team and merged following the process described in CONTRIBUTING.md.
## License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
```
Copyright 2024-2026 Hewlett Packard Enterprise Development LP

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
```