## Release v0.9
New features:
- Add coefficient of variation to bandwidth output statistics
- Add huge page support for host memory (disabled on Windows)
- Add option to sample pairs in device-to-device tests
- Add troubleshooting guide
- Unify multinode and single-node execution paths
Improvements:
- Improve CUDA architecture detection without requiring GPU access
- Deprecate Volta (sm_70/sm_72) support for CUDA toolkit >=13.0
Bug fixes:
- Fix JSON output aggregation
Platform:
- Skip Boost static libs on Azure Linux
For large multinode systems, testing all possible GPU pairs can be time-consuming and resource-intensive. nvbandwidth provides a sampling option to reduce test time while maintaining good coverage of the GPU topology.
#### Sampling Options
- **`--targetNumPairs -1`** (default): Test all possible pairs (N×(N-1) for N GPUs)
- **`--targetNumPairs <number>`**: Test exactly `<number>` pairs using intelligent sampling
For example, to run multinode bandwidth tests on a system with 4 nodes and 8 GPUs per node while sampling only a subset of the possible GPU pairs:
```
# For a 4-node, 8-GPU system: 992 total pairs available
# Using --targetNumPairs 100 will test ~10% of pairs while covering all GPUs
```
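
As a sketch only (the MPI launcher, rank count, hostfile, and binary path here are assumptions that depend on how your cluster is set up), such a sampled run might be launched like this:

```
# Hypothetical launch: one rank per GPU (32 ranks across 4 nodes),
# sampling 100 of the 992 possible GPU pairs
mpirun -np 32 -hostfile hosts.txt ./nvbandwidth --targetNumPairs 100
```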
**Note**: The `--targetNumPairs` parameter only affects multinode device-to-device tests. In single-node mode, this parameter is ignored and a warning will be displayed if a positive value is specified.
### Local testing
You can test the multinode path on a single-node machine (Ampere+ GPU required):
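
A minimal sketch of such a local run, assuming MPI is installed and the nvbandwidth binary sits in the current directory (the rank count of 2 is an arbitrary assumption):

```
# Hypothetical local test: two MPI ranks on one machine exercising the multinode path
mpirun -np 2 ./nvbandwidth
```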
SM copies will truncate the copy size to fit uniformly on the target device.
`threadsPerBlock` is set to 512.
### Latency Measurements
nvbandwidth uses **pointer chasing** to measure memory latency rather than simple sequential access patterns. This methodology provides more realistic latency measurements by forcing random memory access patterns that prevent prefetching optimizations.
#### How Pointer Chasing Works
1. **Setup**: Memory is organized as a linked list where each node contains a pointer to the next node
2. **Pattern**: The chain follows a strided pattern through memory: `Node[i] -> Node[(i + stride) % total_nodes]`
3. **Execution**: The kernel follows this pointer chain for a specified number of accesses
4. **Measurement**: Total time divided by the number of accesses gives the latency per access (see the sketch after this list)
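
As an illustration only (this is a minimal sketch, not nvbandwidth's actual kernel; the buffer size, stride, and access count below are arbitrary assumptions), a pointer-chasing latency measurement in CUDA might look like this:

```
// Minimal pointer-chasing sketch (illustrative, not nvbandwidth's implementation).
// Each chain[i] holds the index of the next node, so every load depends on the
// previous one and the accesses serialize.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void chase(const unsigned int *chain, unsigned int start,
                      unsigned int numAccesses, unsigned int *sink) {
    unsigned int idx = start;
    for (unsigned int i = 0; i < numAccesses; ++i) {
        idx = chain[idx];  // dependent load: the next address is unknown until this completes
    }
    *sink = idx;           // keep the loop from being optimized away
}

int main() {
    const unsigned int totalNodes = 1u << 20;  // assumed buffer of 2^20 nodes
    const unsigned int stride = 9973;          // assumed odd stride: visits every node once per cycle
    const unsigned int numAccesses = 1u << 20;

    // Build the strided chain: Node[i] -> Node[(i + stride) % total_nodes]
    unsigned int *hChain = new unsigned int[totalNodes];
    for (unsigned int i = 0; i < totalNodes; ++i)
        hChain[i] = (i + stride) % totalNodes;

    unsigned int *dChain, *dSink;
    cudaMalloc(&dChain, totalNodes * sizeof(unsigned int));
    cudaMalloc(&dSink, sizeof(unsigned int));
    cudaMemcpy(dChain, hChain, totalNodes * sizeof(unsigned int), cudaMemcpyHostToDevice);

    cudaEvent_t startEv, stopEv;
    cudaEventCreate(&startEv);
    cudaEventCreate(&stopEv);

    cudaEventRecord(startEv);
    chase<<<1, 1>>>(dChain, 0, numAccesses, dSink);  // one thread: pure latency, no overlap
    cudaEventRecord(stopEv);
    cudaEventSynchronize(stopEv);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, startEv, stopEv);
    printf("Average latency: %.1f ns per access\n", ms * 1e6f / numAccesses);

    delete[] hChain;
    cudaFree(dChain);
    cudaFree(dSink);
    return 0;
}
```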
#### Important Notes
- **TLB costs are excluded**: Because the same buffer is accessed across all tests, TLB entries stay cached, eliminating address translation overhead from the latency measurements.
- **Data cache hits are prevented**: Random pointer-chasing patterns prevent data cache benefits by design.
This approach provides latency measurements that benefit from TLB optimization while maintaining random memory access patterns that prevent data cache effects.