Efficient QMC calculations on GPU machines require understanding throughput (samples/second) as a function of the number of walkers per GPU. For a small number of walkers, the GPUs are underutilized and spend significant time idle (although the actual wall time to take a particular number of steps may still be very small). Adding walkers therefore improves throughput by giving each GPU more work to do, but at the cost of increased wall time. Eventually the GPU is saturated with walkers and the throughput reaches its maximum. By optimizing a QMC calculation for throughput, one can work backwards from the total number of samples required for a target energy error bar (after equilibration) to the ideal number of blocks and steps.
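As a back-of-envelope illustration of what I mean by working backwards (this is not QMCPACK code; the variance and autocorrelation numbers are placeholders one would estimate from a short run):

```python
# Back-of-envelope sample budgeting (not QMCPACK code).
# sigma2 (variance per sample) and kappa (autocorrelation factor) are
# placeholder numbers estimated from a short run.

def samples_needed(target_errorbar, sigma2, kappa):
    # error bar ~ sqrt(kappa * sigma2 / N)  =>  N = kappa * sigma2 / err^2
    return kappa * sigma2 / target_errorbar**2

def blocks_needed(n_samples, total_walkers, steps_per_block):
    # each walker contributes one sample per step:
    # samples = total_walkers * steps_per_block * blocks
    return n_samples / (total_walkers * steps_per_block)

target  = 1.0e-3        # Ha, desired error bar (placeholder)
sigma2  = 4.0           # Ha^2, variance per sample (placeholder)
kappa   = 5.0           # autocorrelation factor (placeholder)
walkers = 4 * 2048      # 4 ranks x 2048 walkers_per_rank
steps   = 5

n = samples_needed(target, sigma2, kappa)
print(f"samples needed              : {n:.2e}")
print(f"blocks at steps={steps}          : {blocks_needed(n, walkers, steps):.0f}")
print(f"wall time at 1790 samples/s : {n / 1790.21 / 3600:.1f} h")
```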
While running these kinds of tests on both 'Hops' (Intel Sapphire Rapids x2 + NVIDIA H100 x4 per node) and 'Eldorado / El Capitan' (AMD Zen x2 + MI300X x4 per node), I have noticed that the v1 and v3 T-moves schemes yield, in some cases, throughput that is 10x lower than VMC, DMC (locality), or DMC (v0 T-moves). Running on a CPU-only machine shows a slowdown of only around 1.5-2x. So some T-moves schemes are far slower on GPUs than expected. Why is this?
The problem I've been working on is a 16-atom supercell of bcc tungsten, and I'm using a standard Slater-Jastrow wave function with optimized 1-, 2-, and 3-body Jastrow factors. This is a 224-electron problem and, if I were to run it on a typical CPU machine, it would complete in a few hours on a couple of nodes with a few thousand walkers. Not a heroic calculation by any stretch. But my initial (naive) attempts on GPU showed that this job took more than 48 hours to finish. After many tests, I've narrowed the problem down to the nonlocalmoves scheme.
On Eldorado/El Capitan, I used this input block for the VMC/DMC:
<qmc method="vmc" move="pbyp">
<parameter name="walkers_per_rank" > 2048 </parameter>
<parameter name="blocks" > 10 </parameter>
<parameter name="steps" > 5 </parameter>
<parameter name="timestep" > 0.50 </parameter>
<parameter name="useDrift" > yes </parameter>
</qmc>
<qmc method="dmc" move="pbyp">
<parameter name="walkers_per_rank" > 2048 </parameter>
<parameter name="blocks" > 10 </parameter>
<parameter name="steps" > 5 </parameter>
<parameter name="timestep" > 0.10 </parameter>
<parameter name="nonlocalmoves" > no </parameter>
</qmc>
and am running the code on 1 node using all 4 GPUs with 1 MPI task per GPU and 8 threads per task as follows:
flux run --exclusive -N 1 -n 4 -c 24 -g 1 -o mpibind=verbose:1 -o gpu-affinity=per-task -o cpu-affinity=per-task ${QMC_ROOT}\
/bin/qmcpack_complex --enable-timers=fine qmc.in.xml > qmc.output
QMCPACK output:
Global options
Total number of MPI ranks = 4
Number of MPI groups = 1
MPI group ID = 0
Number of ranks in group = 4
MPI ranks per node = 4
Accelerators per rank = 1
OMP 1st level threads = 8
OMP nested threading disabled or only 1 thread on the 2nd level
Throughput measured in samples per second on Eldorado/El Capitan 1 node (AMD Zen x2 + MI300X x4):
| VMC | Locality | Tmoves-v0 | Tmoves-v1 | Tmoves-v3 |
| --- | --- | --- | --- | --- |
| 1790.21 | 1646.37 | 1534.43 | 158.82 | 160.14 |
On Hops, I used this input block for the VMC/DMC:
<qmc method="vmc" move="pbyp">
<parameter name="walkers_per_rank" > 2044 </parameter>
<parameter name="blocks" > 10 </parameter>
<parameter name="steps" > 5 </parameter>
<parameter name="timestep" > 0.50 </parameter>
<parameter name="useDrift" > yes </parameter>
</qmc>
<qmc method="dmc" move="pbyp">
<parameter name="walkers_per_rank" > 2044 </parameter>
<parameter name="blocks" > 10 </parameter>
<parameter name="steps" > 5 </parameter>
<parameter name="timestep" > 0.10 </parameter>
<parameter name="nonlocalmoves" > no </parameter>
</qmc>
and am running the code on 1 node using all 4 GPUs with 1 MPI task per GPU and 7 threads per task as follows:
srun --ntasks=4 --ntasks-per-node=4 --cpus-per-task=14 --hint=nomultithread --gpus-per-task=1 qmcpack_complex qmc.in.xml > qmc.output
QMCPACK output:
Global options
Total number of MPI ranks = 4
Number of MPI groups = 1
MPI group ID = 0
Number of ranks in group = 4
MPI ranks per node = 4
Accelerators per node = 1
OMP 1st level threads = 7
OMP nested threading disabled or only 1 thread on the 2nd level
Throughput measured in samples per second on Hops 1 node (Intel Sapphire Rapids x2 + NVIDIA H100 x4):

| VMC | Locality | Tmoves-v0 | Tmoves-v1 | Tmoves-v3 |
| --- | --- | --- | --- | --- |
| 983.71 | 921.56 | 898.44 | 248.06 | 329.77 |
These tests used a fixed walkers_per_rank of roughly 2000, adjusted on a per-machine basis so that the walkers divide evenly into crowds for the initial VMC run. For DMC the population fluctuates, so in general there is some load imbalance.
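The adjustment just means picking a walkers_per_rank that is a multiple of the crowd count. A minimal sketch (assuming the crowd count equals the OpenMP threads per rank, which I believe is the batched drivers' default when the crowds parameter is not set):

```python
# Round a target walkers_per_rank to the nearest multiple of the crowd count.
# Assumes crowds = OpenMP threads per rank (I believe that is the batched
# drivers' default when the crowds parameter is not set).

def round_to_crowds(target_walkers_per_rank, crowds):
    per_crowd = max(1, round(target_walkers_per_rank / crowds))
    return per_crowd * crowds

print(round_to_crowds(2045, 8))    # Eldorado,  8 threads/rank -> 2048
print(round_to_crowds(2045, 7))    # Hops,      7 threads/rank -> 2044
print(round_to_crowds(250, 14))    # Flight,   14 threads/rank -> 252
```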
For a point of comparison, here are the same calculations on 'Flight', a CPU-only machine with Intel Sapphire Rapids x2 per node.
On Flight, I used this input block:
<qmc method="vmc" move="pbyp">
<parameter name="walkers_per_rank" > 252 </parameter>
<parameter name="blocks" > 10 </parameter>
<parameter name="steps" > 5 </parameter>
<parameter name="timestep" > 0.50 </parameter>
<parameter name="useDrift" > yes </parameter>
</qmc>
<qmc method="dmc" move="pbyp">
<parameter name="walkers_per_rank" > 252 </parameter>
<parameter name="blocks" > 10 </parameter>
<parameter name="steps" > 5 </parameter>
<parameter name="timestep" > 0.10 </parameter>
<parameter name="nonlocalmoves" > no </parameter>
</qmc>
and am running the code on 4 nodes using 8 MPI tasks per node (1 per NUMA domain) and 14 threads per MPI task as follows:
export OMP_PLACES=threads
export OMP_PROC_BIND=spread
export OMP_NUM_THREADS=14
mpirun --bind-to socket ${QMC_ROOT}/bin/qmcpack_complex qmc.in.xml > qmcpack.output
QMCPACK output:
Global options
Total number of MPI ranks = 32
Number of MPI groups = 1
MPI group ID = 0
Number of ranks in group = 32
MPI ranks per node = 8
OMP 1st level threads = 14
OMP nested threading disabled or only 1 thread on the 2nd level
Throughput measured in samples per second on Flight 4 nodes (Intel Sapphire Rapids x2):

| VMC | Locality | Tmoves-v0 | Tmoves-v1 | Tmoves-v3 |
| --- | --- | --- | --- | --- |
| 738.33 | 665.91 | 672.06 | 424.91 | 431.85 |
In all cases, v1 and v3 T-moves yield lower throughput, but the difference relative to VMC or DMC (locality) is much larger on the GPU machines than on the CPU-only machine. In some cases I see as much as a 10x slowdown (on Eldorado, for example, 1534.43 / 158.82 ≈ 9.7x going from v0 to v1 T-moves)!
Comparing the timers from v0 and v3, the salient difference is the DMCBatched::Tmove timer and its children (the columns are, I believe, inclusive time, exclusive time, number of calls, and time per call, in seconds):
Relevant timers for nonlocalmoves=v0 (Eldorado data):
DMCBatched::Tmove 12.9077 0.2959 50 0.258154554
ParticleSet:none::acceptMove 0.1184 0.0203 12232 0.000009678
DTAAOMPTarget::update_e_e 0.0968 0.0968 12232 0.000007916
DTAB::update_ion0_e 0.0013 0.0013 12232 0.000000105
ParticleSet:none::computeNewPosDT 0.1112 0.0285 12232 0.000009089
DTAAOMPTarget::move_e_e 0.0692 0.0692 12232 0.000005654
DTAB::move_ion0_e 0.0135 0.0135 12232 0.000001107
ParticleSet:none::donePbyP 0.2694 0.2694 12232 0.000022022
WaveFunction:psi0::VGL 2.0880 0.0325 12232 0.000170703
J1OrbitalSoA:J1::VGL 0.0188 0.0188 12232 0.000001534
JeeIOrbitalSoA:J3::VGL 0.1150 0.1150 12232 0.000009399
SlaterDet::VGL 1.8569 0.0251 12232 0.000151803
DiracDeterminantBatched::ratio 0.0140 0.0140 12232 0.000001148
DiracDeterminantBatched::spovgl 1.8177 0.0658 12232 0.000148604
SplineC2COMPTarget::offload 1.7519 1.7519 12232 0.000143226
TwoBodyJastrow:J2::VGL 0.0650 0.0650 12232 0.000005314
WaveFunction:psi0::accept 8.7706 0.0399 24464 0.000358511
J1OrbitalSoA:J1::accept 0.0039 0.0039 24464 0.000000159
JeeIOrbitalSoA:J3::accept 0.0871 0.0871 24464 0.000003559
SlaterDet::accept 8.5815 0.0238 24464 0.000350781
DiracDeterminantBatched::update 8.5577 8.5577 36696 0.000233205
TwoBodyJastrow:J2::accept 0.0583 0.0583 24464 0.000002382
WaveFunction:psi0::buffer 1.1991 0.0655 50 0.023981916
J1OrbitalSoA:J1::buffer 0.0413 0.0413 50 0.000826967
JeeIOrbitalSoA:J3::buffer 0.0378 0.0378 50 0.000755661
SlaterDet::buffer 1.0298 1.0298 50 0.020596545
TwoBodyJastrow:J2::buffer 0.0246 0.0246 50 0.000491827
WaveFunction:psi0::preparegroup 0.0551 0.0367 12232 0.000004507
J1OrbitalSoA:J1::preparegroup 0.0031 0.0031 12232 0.000000252
JeeIOrbitalSoA:J3::preparegroup 0.0030 0.0030 12232 0.000000247
SlaterDet::preparegroup 0.0094 0.0094 12232 0.000000768
TwoBodyJastrow:J2::preparegroup 0.0029 0.0029 12232 0.000000237
Relevant timers for nonlocalmoves=v3 (Eldorado data):
DMCBatched::Tmove 2271.8685 4.9963 50 45.437370043
ParticleSet:none::acceptMove 1.1712 0.1037 189277 0.000006188
DTAAOMPTarget::update_e_e 1.0526 1.0526 189277 0.000005561
DTAB::update_ion0_e 0.0149 0.0149 189277 0.000000079
ParticleSet:none::computeNewPosDT 0.5353 0.0827 189278 0.000002828
DTAAOMPTarget::move_e_e 0.4095 0.4095 189278 0.000002163
DTAB::move_ion0_e 0.0432 0.0432 189278 0.000000228
ParticleSet:none::donePbyP 0.3338 0.3338 12821 0.000026039
ParticleSet:none::update 1273.2887 2.4717 3353609 0.000379677
DTABOMPTarget::evaluate_e_virtual 662.9313 0.5147 3353609 0.000197677
DTABOMPTarget::offload_e_virtual 662.4166 662.4166 3353609 0.000197524
DTABOMPTarget::evaluate_ion0_virtual 607.8857 0.5879 3353609 0.000181263
DTABOMPTarget::offload_ion0_virtual 607.2978 607.2978 3353609 0.000181088
WaveFunction:psi0::NLratio 785.6095 1.1359 3353609 0.000234258
J1OrbitalSoA:J1::NLratio 12.9144 12.9144 3353609 0.000003851
JeeIOrbitalSoA:J3::NLratio 33.6203 33.6203 3353609 0.000010025
SlaterDet::NLratio 647.7075 0.4950 3353609 0.000193137
DiracDeterminantBatched::spoval 647.2125 3.1713 3353609 0.000192990
SplineC2COMPTarget::offload 644.0412 644.0412 3353609 0.000192044
TwoBodyJastrow:J2::NLratio 90.2314 90.2314 3353609 0.000026906
WaveFunction:psi0::VGL 40.6931 0.1909 189277 0.000214993
J1OrbitalSoA:J1::VGL 0.1401 0.1401 189277 0.000000740
JeeIOrbitalSoA:J3::VGL 0.8388 0.8388 189277 0.000004432
SlaterDet::VGL 38.9646 0.0981 189277 0.000205860
DiracDeterminantBatched::ratio 0.1087 0.1087 189277 0.000000574
DiracDeterminantBatched::spovgl 38.7578 0.6637 189277 0.000204767
SplineC2COMPTarget::offload 38.0941 38.0941 189277 0.000201261
TwoBodyJastrow:J2::VGL 0.5587 0.5587 189277 0.000002952
WaveFunction:psi0::accept 163.8894 0.2902 202098 0.000810940
J1OrbitalSoA:J1::accept 0.0215 0.0215 202098 0.000000106
JeeIOrbitalSoA:J3::accept 0.8342 0.8342 202098 0.000004128
SlaterDet::accept 162.1243 0.0978 202098 0.000802206
DiracDeterminantBatched::update 162.0264 162.0264 214919 0.000753895
TwoBodyJastrow:J2::accept 0.6193 0.6193 202098 0.000003064
WaveFunction:psi0::buffer 1.2684 0.0751 50 0.025367192
J1OrbitalSoA:J1::buffer 0.0434 0.0434 50 0.000867111
JeeIOrbitalSoA:J3::buffer 0.0455 0.0455 50 0.000910542
SlaterDet::buffer 1.0708 1.0708 50 0.021416095
TwoBodyJastrow:J2::buffer 0.0336 0.0336 50 0.000671632
WaveFunction:psi0::preparegroup 0.0826 0.0569 25642 0.000003222
J1OrbitalSoA:J1::preparegroup 0.0046 0.0046 25642 0.000000179
JeeIOrbitalSoA:J3::preparegroup 0.0045 0.0045 25642 0.000000176
SlaterDet::preparegroup 0.0122 0.0122 25642 0.000000476
TwoBodyJastrow:J2::preparegroup 0.0045 0.0045 25642 0.000000174
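For reference, the comparison above can be made with a small script along these lines (a rough sketch: it assumes the whitespace-separated name / inclusive / exclusive / calls / per-call layout shown above, and the file names are placeholders):

```python
# Rough sketch for diffing two QMCPACK fine-timer dumps like the ones above.
# Assumes each line is: name  inclusive  exclusive  calls  per_call
# File names are placeholders; repeated child names (e.g. the two
# SplineC2COMPTarget::offload entries) simply keep the last occurrence.

def read_timers(path):
    timers = {}
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) < 5:
                continue
            name = parts[0]
            timers[name] = (float(parts[1]), float(parts[2]), int(parts[3]))
    return timers

v0 = read_timers("timers_v0.txt")
v3 = read_timers("timers_v3.txt")

common = v0.keys() & v3.keys()
for name in sorted(common, key=lambda n: v3[n][0] - v0[n][0], reverse=True):
    incl0, _, calls0 = v0[name]
    incl3, _, calls3 = v3[name]
    print(f"{name:45s} incl {incl0:9.2f} -> {incl3:9.2f}"
          f"   calls {calls0:8d} -> {calls3:8d}")
```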
While the values reported above were obtained for a particular number of walkers, the result is general: it holds whether there are 16 walkers per rank or 16,000, although the exact ratios differ slightly. Is this expected behavior?
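The walker scan behind that statement is just a loop over walkers_per_rank. A minimal sketch of how such a sweep can be driven (the template file, its WALKERS token, and the abbreviated flux command are placeholders; substitute the full launch line for the machine being used):

```python
# Minimal sketch of a walkers_per_rank sweep.  qmc.template.xml and its
# WALKERS token are placeholders, and the flux command is abbreviated.

import subprocess

template = open("qmc.template.xml").read()

for w in [16, 64, 256, 1024, 2048, 4096, 16000]:
    fname = f"qmc.w{w}.xml"
    with open(fname, "w") as f:
        f.write(template.replace("WALKERS", str(w)))
    subprocess.run(["flux", "run", "-N", "1", "-n", "4", "-c", "24", "-g", "1",
                    "qmcpack_complex", fname], check=True)
```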
To Reproduce
Steps to reproduce the behavior:
- Using up-to-date QMCPACK develop (git hash: 6dcfc43-dirty)
Expected behavior
I expected reasonably optimal GPU settings to yield performance comparable to, or better than, the CPU-only benchmark.
System:
- Eldorado/El Capitan built using ROCm 6.4.2 (based on Clang 19, I think) with Cray MPICH 9.0.1
- Hops built using CUDA Toolkit 12.9 with Clang 19.1.7 and OpenMPI 4
- Flight built using GNU 13.3.1, MKL 24.0.2, and OpenMPI 4.1
- Can provide full build scripts if needed.
Additional context
Happy to provide as much detail as needed.