GPU PCIe Validator v4.1
Usage:
  ./gpu_pcie_validator --list-devices
  ./gpu_pcie_validator --device N [--memory-mode MODE] [--window-ms MS] [--interval-ms MS] [--size-mib MiB]
  ./gpu_pcie_validator --all-devices [--memory-mode MODE] [--window-ms MS] [--interval-ms MS] [--size-mib MiB]
Options:
  --list-devices       List all GPUs with index, BDF, NUMA, PCIe link, name
  --device N           CUDA device index to validate (default: 0)
  --all-devices        Validate all GPUs sequentially
  --memory-mode MODE   pinned (default) or unpinned
  --window-ms MS       NVML sampling window in ms (default: 2000)
  --interval-ms MS     NVML poll interval in ms (default: 100)
  --size-mib MiB       Transfer buffer size in MiB (default: 1024)
Exit codes:
  0   All GPUs HEALTHY
  1   Runtime error (NVML/CUDA failure, bad arguments)
  2   One or more GPUs DEGRADED or LINK_DEGRADED
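A wrapper script can branch on these exit codes. The following is a minimal sketch: the function name describe_exit is hypothetical and not part of the tool, and treating any code other than 0, 1, or 2 as unknown is an assumption.

```shell
# Hypothetical helper (not part of gpu_pcie_validator): turn an exit code
# into a one-line status message, following the exit-code list above.
describe_exit() {
    case "$1" in
        0) echo "HEALTHY: all GPUs passed" ;;
        1) echo "ERROR: runtime failure (NVML/CUDA failure or bad arguments)" ;;
        2) echo "DEGRADED: one or more GPUs DEGRADED or LINK_DEGRADED" ;;
        *) echo "UNKNOWN: exit code $1 is not documented" ;;  # assumption
    esac
}

# Usage:
#   ./gpu_pcie_validator --all-devices; describe_exit "$?"
```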
Establish a known-good reference before any deployment or after hardware changes.
./gpu_pcie_validator --device 0 --size-mib 1024 --window-ms 2000 --interval-ms 100
Tests:
- Sustained payload throughput
- Negotiated Gen/width correctness
- Replay counter integrity
- AER error state
Healthy baseline indicators:
- Replay counter increase: NONE
- AER correctable delta: 0
- Link consistency: TRUE
- Efficiency ratio at or above platform typical for that link
Detects thermal or signal degradation over time.
./gpu_pcie_validator --device 0 --size-mib 1024 --window-ms 60000 --interval-ms 100
Watch for:
- Replay counter increase
- Efficiency drift
- Link speed or width changes
- AER correctable error accumulation
Pushes DMA engines and link stability under heavy transfer load.
./gpu_pcie_validator --device 0 --size-mib 4096 --window-ms 60000 --interval-ms 50
Watch for:
- Speed Change: YES
- Width Change: YES
- Efficiency degradation relative to baseline
- AER non-fatal or fatal errors
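Efficiency degradation relative to baseline can be expressed as a percentage drop. The sketch below assumes you have already extracted two efficiency ratios (for example, the efficiency_ratio field of a baseline report and of a stress-run report). The function name efficiency_drop_pct is hypothetical, and the document does not define a pass/fail threshold, so none is applied here.

```shell
# Hypothetical helper: percentage drop of the current efficiency ratio
# relative to a known-good baseline. A negative result means the current
# run is more efficient than the baseline.
efficiency_drop_pct() {
    awk -v base="$1" -v cur="$2" 'BEGIN {
        if (base <= 0) exit 1          # a baseline must be positive
        printf "%.1f\n", (base - cur) / base * 100
    }'
}

# Usage:
#   efficiency_drop_pct 0.90 0.81
```

Comparing the same GPU, slot, and memory mode across runs keeps the ratio meaningful.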
Evaluates control-path efficiency and PCIe transaction overhead.
./gpu_pcie_validator --device 0 --size-mib 64 --window-ms 10000 --interval-ms 50
Healthy systems:
- 64 KiB H2D memcpy: single-digit microseconds
- 64 KiB D2H memcpy: single-digit microseconds
Elevated latency relative to a known-good baseline suggests link congestion, ASPM interference, or scheduling overhead.
Detects host memory configuration inefficiencies.
# Pinned — page-locked host memory
./gpu_pcie_validator --device 0 --memory-mode pinned --size-mib 1024
# Unpinned — pageable host memory
./gpu_pcie_validator --device 0 --memory-mode unpinned --size-mib 1024
Both unpinned and pageable are accepted (pageable is treated as an alias).
On systems with IOMMU remapping active, pinned memory typically outperforms unpinned. On systems with IOMMU disabled or in passthrough mode, both modes may perform similarly.
Validate all detected GPUs in one command:
./gpu_pcie_validator --all-devices --memory-mode pinned --window-ms 2000
echo "Fleet status: $?"
Output structure:
./logs/runs/<timestamp>_ALL/
  report.txt
  report.json
  gpu0.json
  gpu1.json
  ...
Exit code 0 = all HEALTHY
Exit code 2 = one or more DEGRADED or LINK_DEGRADED
Summary across GPUs:
for j in logs/runs/*_ALL/gpu*.json; do
  # Emit one tab-separated row per GPU report
  jq -r '[.gpu_model, .pcie_negotiated_post, .bulk_avg_gbs,
          .efficiency_ratio, .assessment_state] | @tsv' "$j"
done | column -t -s "$(printf '\t')"  # split on tabs only; model names contain spaces
Detects burst instability and jitter in NVML throughput counters.
./gpu_pcie_validator --device 0 --size-mib 1024 --window-ms 30000 --interval-ms 10
Useful when intermittent drops are suspected but not reproducible at standard intervals.
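To see how much finer the sampling becomes, divide the window by the poll interval. The sketch below assumes one NVML poll per interval with no per-poll overhead, so the real count may be lower. The function name expected_samples is hypothetical.

```shell
# Hypothetical helper: upper bound on the number of NVML polls in a run,
# computed as window / interval (integer division).
expected_samples() {
    window_ms=$1
    interval_ms=$2
    if [ "$interval_ms" -le 0 ]; then
        echo "interval must be positive" >&2
        return 1
    fi
    echo $(( window_ms / interval_ms ))
}

# Usage:
#   expected_samples 30000 10    # high-resolution run
#   expected_samples 2000 100    # default settings
```

Under that assumption, the high-resolution run above yields up to 3000 polls, against 20 for the defaults.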
AER (PCIe Advanced Error Reporting) counters are captured as deltas.
Correctable Errors (total delta): 0
Non-Fatal Errors (total delta): 0
Fatal Errors (total delta): 0
Interpretation:
- Correctable = 0 → Normal
- Correctable > 0 → Monitor across runs
- Non-Fatal > 0 → Investigate slot, cable, riser
- Fatal > 0 → Replace hardware
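The interpretation rules above can be sketched as a small triage function. The name aer_action is hypothetical, the three arguments are the correctable, non-fatal, and fatal deltas from a report, and reporting only the most severe category when several are non-zero is an assumption.

```shell
# Hypothetical helper: map AER deltas to the recommended action.
# Arguments: correctable, non-fatal, fatal (non-negative integers).
aer_action() {
    correctable=$1
    nonfatal=$2
    fatal=$3
    # Assumption: the most severe non-zero counter wins.
    if [ "$fatal" -gt 0 ]; then
        echo "replace hardware"
    elif [ "$nonfatal" -gt 0 ]; then
        echo "investigate slot, cable, riser"
    elif [ "$correctable" -gt 0 ]; then
        echo "monitor across runs"
    else
        echo "normal"
    fi
}

# Usage:
#   aer_action 3 0 0
```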
Clocks are captured before and after the transfer window.
- Pre-load P8, post-load P2 → Normal boost behavior
- Pre-load P0, post-load P0 → Persistence mode or active workload
- Post-load remains P8+ → GPU failed to boost
- Memory clock drops → Thermal throttle
Correlate against power and thermal deltas in the report.
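The P-state rules above can be sketched as follows. The function name pstate_verdict is hypothetical, and arguments are expected in the form P0 through P15. The document does not say how to read a post-load state between P4 and P7, so the sketch reports those as indeterminate by assumption.

```shell
# Hypothetical helper: interpret pre-load and post-load P-states.
# Arguments: pre-load and post-load state, e.g. P8 P2.
pstate_verdict() {
    pre=${1#P}     # strip the leading "P" to get the numeric level
    post=${2#P}
    if [ "$post" -ge 8 ]; then
        echo "failed to boost"
    elif [ "$pre" -eq 0 ] && [ "$post" -eq 0 ]; then
        echo "persistence mode or active workload"
    elif [ "$post" -le 3 ]; then
        echo "normal"
    else
        echo "indeterminate"   # assumption: P4-P7 is not covered above
    fi
}

# Usage:
#   pstate_verdict P8 P2
```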
Values are sourced from PCIe config space via lspci, which needs root access to read capability registers.
Suppress lspci sudo prompt:
echo "$(whoami) ALL=(ALL) NOPASSWD: /usr/bin/lspci" \
| sudo tee /etc/sudoers.d/pcie-validator
Typical MPS (Max Payload Size):
- 128–256 bytes → consumer/workstation
- 512–4096 bytes → server platforms
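A reported MPS value can be checked against these ranges with a short lookup. The function name mps_class is hypothetical. PCIe only allows power-of-two payload sizes from 128 to 4096 bytes, so any other value is flagged as unexpected.

```shell
# Hypothetical helper: classify a Max Payload Size value (in bytes)
# using the typical ranges listed above.
mps_class() {
    case "$1" in
        128|256)            echo "consumer/workstation" ;;
        512|1024|2048|4096) echo "server" ;;
        *)                  echo "unexpected" ;;  # not a valid PCIe MPS
    esac
}

# Usage:
#   mps_class 256
```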
A power/control value of auto means runtime power management may reduce the link power state.
Disable for benchmarking:
sudo bash -c 'echo on > /sys/bus/pci/devices/<bdf>/power/control'
No IOMMU groups → IOMMU disabled or in passthrough mode (normal for GPU workloads)
If IOMMU strict mode is enabled and throughput is low, address translation overhead may contribute.
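To check the group count by hand, count the entries under the kernel's IOMMU group directory. This is a sketch, assuming the standard Linux sysfs location /sys/kernel/iommu_groups. The function name iommu_group_count is hypothetical, and whether the validator derives its own value the same way is not stated in this document.

```shell
# Hypothetical helper: count IOMMU groups exposed by the kernel.
# Optional argument: directory to inspect (defaults to the sysfs path).
iommu_group_count() {
    dir=${1:-/sys/kernel/iommu_groups}
    if [ ! -d "$dir" ]; then
        echo 0
        return 0
    fi
    set -- "$dir"/*            # expand entries into positional parameters
    if [ -e "$1" ]; then
        echo "$#"
    else
        echo 0                 # glob matched nothing: directory is empty
    fi
}

# Usage:
#   iommu_group_count
```

A result of 0 corresponds to the "no groups" case above.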
| Signal | Healthy | Investigate |
|---|---|---|
| Link consistency | TRUE | FALSE |
| Replay counter increase | NONE | > 0 |
| AER correctable delta | 0 | > 0 across runs |
| AER fatal delta | 0 | > 0 |
| Efficiency ratio | Platform typical | Sustained drop vs baseline |
| Post-load P-state | P0–P3 | P8+ |
| Thermal delta | Stable | Sustained rise with clock throttling |