Some 'nonlocalmoves' schemes are unexpectedly slow on GPUs #5699

@jptowns

Description

Efficient QMC calculations on GPU machines require understanding throughput (samples/second) as a function of the number of walkers per GPU. With a small number of walkers, the GPUs are underutilized and spend significant time idle (although the actual wall time to take a particular number of steps may still be very small). Adding walkers improves throughput by giving each GPU more work to do, but at the cost of increased wall time. Eventually the GPU is saturated with walkers and throughput reaches its maximum. By optimizing a QMC calculation for throughput, one can work backwards from the total number of samples required for a target energy error bar (after equilibration) to determine the ideal number of blocks/samples.
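
As an aside, the "work backwards" sizing can be sketched as follows. This is only illustrative: the variance, autocorrelation time, and target error bar below are placeholders (not values from these runs), and the error model err ≈ sigma·sqrt(tau/N) is the usual rule of thumb for correlated samples, not anything QMCPACK-specific.

```python
# Hedged sketch of sizing a run from a target error bar. All inputs are
# illustrative placeholders, not measurements from the runs in this report.
import math

def samples_needed(sigma, target_err, tau_corr=1.0):
    # Rule of thumb for correlated samples: err ~ sigma * sqrt(tau_corr / N),
    # so N ~ tau_corr * (sigma / err)^2.
    return math.ceil(tau_corr * (sigma / target_err) ** 2)

def blocks_needed(n_samples, walkers_per_rank, ranks, steps_per_block):
    # Each block contributes walkers * ranks * steps samples.
    samples_per_block = walkers_per_rank * ranks * steps_per_block
    return math.ceil(n_samples / samples_per_block)

n = samples_needed(sigma=0.5, target_err=0.001, tau_corr=2.0)  # placeholder values
print(n)                                                        # 500000
print(blocks_needed(n, walkers_per_rank=2048, ranks=4, steps_per_block=5))  # 13
```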

While conducting these kinds of tests on both 'hops' (Intel Sapphire Rapids x2 + Nvidia H100 x4 per node) and 'Eldorado / El Capitan' (AMD Zen x2 + MI300X x4 per node) hardware, I have noticed that the v1 and v3 T-moves schemes yield, in some cases, throughput that is 10x slower than VMC, DMC(locality), or DMC(v0 T-moves). Running on a CPU-only machine shows a slowdown of only around 1.5-2x. So some T-moves schemes are far slower on GPUs than expected. Why is this?

The problem I've been working on is a 16-atom supercell of bcc tungsten, using a standard Slater-Jastrow wave function with optimized 1-, 2-, and 3-body Jastrow factors. This is a 224-electron problem and, if I were to run it on a typical CPU machine, it would complete in a few hours on a couple of nodes with a few thousand walkers. Not a heroic calculation by any stretch. But my initial (naive) attempts on GPU took more than 48 hours to finish. After many tests, I've narrowed the problem down to the nonlocalmoves scheme.

On Eldorado/El Capitan, I used this input block for the VMC/DMC:

   <qmc method="vmc" move="pbyp">
     <parameter name="walkers_per_rank"    >    2048            </parameter>
     <parameter name="blocks"              >    10              </parameter>
     <parameter name="steps"               >    5               </parameter>
     <parameter name="timestep"            >    0.50            </parameter>
     <parameter name="useDrift"            >    yes             </parameter>
   </qmc>
   <qmc method="dmc" move="pbyp">
     <parameter name="walkers_per_rank"    >    2048            </parameter>
     <parameter name="blocks"              >    10              </parameter>
     <parameter name="steps"               >    5               </parameter>
     <parameter name="timestep"            >    0.10            </parameter>
     <parameter name="nonlocalmoves"       >    no              </parameter>
   </qmc>

and am running the code on 1 node using all 4 GPUs with 1 MPI task per GPU and 8 threads per task as follows:

    flux run --exclusive -N 1 -n 4 -c 24 -g 1 -o mpibind=verbose:1 -o gpu-affinity=per-task -o cpu-affinity=per-task ${QMC_ROOT}/bin/qmcpack_complex --enable-timers=fine qmc.in.xml > qmc.output

QMCPACK output:

  Global options

  Total number of MPI ranks = 4
  Number of MPI groups      = 1
  MPI group ID              = 0
  Number of ranks in group  = 4
  MPI ranks per node        = 4
  Accelerators per rank     = 1
  OMP 1st level threads     = 8
  OMP nested threading disabled or only 1 thread on the 2nd level

Throughput measured in samples per second on Eldorado/El Capitan 1 node (AMD Zen x2 + MI300X x4):

VMC        Locality   Tmoves-v0   Tmoves-v1   Tmoves-v3
1790.21    1646.37    1534.43     158.82      160.14

On hops, I used this input block for the VMC/DMC:

   <qmc method="vmc" move="pbyp">
     <parameter name="walkers_per_rank"    >    2044            </parameter>
     <parameter name="blocks"              >    10              </parameter>
     <parameter name="steps"               >    5               </parameter>
     <parameter name="timestep"            >    0.50            </parameter>
     <parameter name="useDrift"            >    yes             </parameter>
   </qmc>
   <qmc method="dmc" move="pbyp">
     <parameter name="walkers_per_rank"    >    2044            </parameter>
     <parameter name="blocks"              >    10              </parameter>
     <parameter name="steps"               >    5               </parameter>
     <parameter name="timestep"            >    0.10            </parameter>
     <parameter name="nonlocalmoves"       >    no              </parameter>
   </qmc>

and am running the code on 1 node using all 4 GPUs with 1 MPI task per GPU and 7 threads per task as follows:

srun --ntasks=4 --ntasks-per-node=4 --cpus-per-task=14 --hint=nomultithread --gpus-per-task=1 qmcpack_complex qmc.in.xml > qmc.output

QMCPACK output:

  Global options

  Total number of MPI ranks = 4
  Number of MPI groups      = 1
  MPI group ID              = 0
  Number of ranks in group  = 4
  MPI ranks per node        = 4
  Accelerators per node     = 1
  OMP 1st level threads     = 7
  OMP nested threading disabled or only 1 thread on the 2nd level

Throughput measured in samples per second on Hops 1 node (Intel Sapphire Rapids x2 + Nvidia H100 x4):

VMC        Locality   Tmoves-v0   Tmoves-v1   Tmoves-v3
983.71     921.56     898.44      248.06      329.77

These tests used a fixed walkers_per_rank (2048 on Eldorado, 2044 on hops), adjusted on a per-machine basis so that the walkers divide evenly into crowds for the initial VMC run. For DMC the population fluctuates, so in general there is some load imbalance.
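
For what it's worth, the per-machine walker counts line up with the thread counts if each 1st-level OpenMP thread hosts one crowd (my assumption about the crowd count; it is not stated in the outputs above):

```python
# Check that the chosen walkers_per_rank divides evenly into crowds, assuming
# (my assumption) one crowd per 1st-level OpenMP thread.
def divides_evenly(walkers_per_rank, crowds):
    return walkers_per_rank % crowds == 0

print(divides_evenly(2048, 8))  # Eldorado, 8 threads/rank: True (256 walkers/crowd)
print(divides_evenly(2044, 7))  # hops, 7 threads/rank:     True (292 walkers/crowd)
```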

For a point of comparison, here are the same calculations on 'flight', a CPU-only machine with Intel Sapphire Rapids x2 per node.

On flight, I used this input block:

   <qmc method="vmc" move="pbyp">
     <parameter name="walkers_per_rank"    >    252             </parameter>
     <parameter name="blocks"              >    10              </parameter>
     <parameter name="steps"               >    5               </parameter>
     <parameter name="timestep"            >    0.50            </parameter>
     <parameter name="useDrift"            >    yes             </parameter>
   </qmc>
   <qmc method="dmc" move="pbyp">
     <parameter name="walkers_per_rank"    >    252             </parameter>
     <parameter name="blocks"              >    10              </parameter>
     <parameter name="steps"               >    5               </parameter>
     <parameter name="timestep"            >    0.10            </parameter>
     <parameter name="nonlocalmoves"       >    no              </parameter>
   </qmc>

and am running the code on 4 nodes using 8 MPI tasks per node (1 per NUMA domain) and 14 threads per MPI task as follows:

export OMP_PLACES=threads
export OMP_PROC_BIND=spread
export OMP_NUM_THREADS=14
mpirun --bind-to socket ${QMC_ROOT}/bin/qmcpack_complex qmc.in.xml > qmcpack.output

QMCPACK output:

  Global options

  Total number of MPI ranks = 32
  Number of MPI groups      = 1
  MPI group ID              = 0
  Number of ranks in group  = 32
  MPI ranks per node        = 8
  OMP 1st level threads     = 14
  OMP nested threading disabled or only 1 thread on the 2nd level

Throughput measured in samples per second on Flight 4 nodes (Intel Sapphire Rapids x2):

VMC        Locality   Tmoves-v0   Tmoves-v1   Tmoves-v3
738.33     665.91     672.06      424.91      431.85

In all cases, v1 and v3 t-moves yield lower throughput. But the differences w.r.t. VMC or DMC locality are much larger for GPU machines than for CPU only. In some cases, I find as much as a 10x slowdown!

Comparing the timers from v0 and v3, the salient difference is in the DMCBatched::Tmove timer and its children.

Relevant timers for nonlocalmoves=v0 (Eldorado data):

        DMCBatched::Tmove                                12.9077     0.2959             50       0.258154554
          ParticleSet:none::acceptMove                    0.1184     0.0203          12232       0.000009678
            DTAAOMPTarget::update_e_e                     0.0968     0.0968          12232       0.000007916
            DTAB::update_ion0_e                           0.0013     0.0013          12232       0.000000105
          ParticleSet:none::computeNewPosDT               0.1112     0.0285          12232       0.000009089
            DTAAOMPTarget::move_e_e                       0.0692     0.0692          12232       0.000005654
            DTAB::move_ion0_e                             0.0135     0.0135          12232       0.000001107
          ParticleSet:none::donePbyP                      0.2694     0.2694          12232       0.000022022
          WaveFunction:psi0::VGL                          2.0880     0.0325          12232       0.000170703
            J1OrbitalSoA:J1::VGL                          0.0188     0.0188          12232       0.000001534
            JeeIOrbitalSoA:J3::VGL                        0.1150     0.1150          12232       0.000009399
            SlaterDet::VGL                                1.8569     0.0251          12232       0.000151803
              DiracDeterminantBatched::ratio              0.0140     0.0140          12232       0.000001148
              DiracDeterminantBatched::spovgl             1.8177     0.0658          12232       0.000148604
                SplineC2COMPTarget::offload               1.7519     1.7519          12232       0.000143226
            TwoBodyJastrow:J2::VGL                        0.0650     0.0650          12232       0.000005314
          WaveFunction:psi0::accept                       8.7706     0.0399          24464       0.000358511
            J1OrbitalSoA:J1::accept                       0.0039     0.0039          24464       0.000000159
            JeeIOrbitalSoA:J3::accept                     0.0871     0.0871          24464       0.000003559
            SlaterDet::accept                             8.5815     0.0238          24464       0.000350781
              DiracDeterminantBatched::update             8.5577     8.5577          36696       0.000233205
            TwoBodyJastrow:J2::accept                     0.0583     0.0583          24464       0.000002382
          WaveFunction:psi0::buffer                       1.1991     0.0655             50       0.023981916
            J1OrbitalSoA:J1::buffer                       0.0413     0.0413             50       0.000826967
            JeeIOrbitalSoA:J3::buffer                     0.0378     0.0378             50       0.000755661
            SlaterDet::buffer                             1.0298     1.0298             50       0.020596545
            TwoBodyJastrow:J2::buffer                     0.0246     0.0246             50       0.000491827
          WaveFunction:psi0::preparegroup                 0.0551     0.0367          12232       0.000004507
            J1OrbitalSoA:J1::preparegroup                 0.0031     0.0031          12232       0.000000252
            JeeIOrbitalSoA:J3::preparegroup               0.0030     0.0030          12232       0.000000247
            SlaterDet::preparegroup                       0.0094     0.0094          12232       0.000000768
            TwoBodyJastrow:J2::preparegroup               0.0029     0.0029          12232       0.000000237

Relevant timers for nonlocalmoves=v3 (Eldorado data):

        DMCBatched::Tmove                              2271.8685     4.9963             50      45.437370043
          ParticleSet:none::acceptMove                    1.1712     0.1037         189277       0.000006188
            DTAAOMPTarget::update_e_e                     1.0526     1.0526         189277       0.000005561
            DTAB::update_ion0_e                           0.0149     0.0149         189277       0.000000079
          ParticleSet:none::computeNewPosDT               0.5353     0.0827         189278       0.000002828
            DTAAOMPTarget::move_e_e                       0.4095     0.4095         189278       0.000002163
            DTAB::move_ion0_e                             0.0432     0.0432         189278       0.000000228
          ParticleSet:none::donePbyP                      0.3338     0.3338          12821       0.000026039
          ParticleSet:none::update                     1273.2887     2.4717        3353609       0.000379677
            DTABOMPTarget::evaluate_e_virtual           662.9313     0.5147        3353609       0.000197677
              DTABOMPTarget::offload_e_virtual          662.4166   662.4166        3353609       0.000197524
            DTABOMPTarget::evaluate_ion0_virtual        607.8857     0.5879        3353609       0.000181263
              DTABOMPTarget::offload_ion0_virtual       607.2978   607.2978        3353609       0.000181088
          WaveFunction:psi0::NLratio                    785.6095     1.1359        3353609       0.000234258
            J1OrbitalSoA:J1::NLratio                     12.9144    12.9144        3353609       0.000003851
            JeeIOrbitalSoA:J3::NLratio                   33.6203    33.6203        3353609       0.000010025
            SlaterDet::NLratio                          647.7075     0.4950        3353609       0.000193137
              DiracDeterminantBatched::spoval           647.2125     3.1713        3353609       0.000192990
                SplineC2COMPTarget::offload             644.0412   644.0412        3353609       0.000192044
            TwoBodyJastrow:J2::NLratio                   90.2314    90.2314        3353609       0.000026906
          WaveFunction:psi0::VGL                         40.6931     0.1909         189277       0.000214993
            J1OrbitalSoA:J1::VGL                          0.1401     0.1401         189277       0.000000740
            JeeIOrbitalSoA:J3::VGL                        0.8388     0.8388         189277       0.000004432
            SlaterDet::VGL                               38.9646     0.0981         189277       0.000205860
              DiracDeterminantBatched::ratio              0.1087     0.1087         189277       0.000000574
              DiracDeterminantBatched::spovgl            38.7578     0.6637         189277       0.000204767
                SplineC2COMPTarget::offload              38.0941    38.0941         189277       0.000201261
            TwoBodyJastrow:J2::VGL                        0.5587     0.5587         189277       0.000002952
          WaveFunction:psi0::accept                     163.8894     0.2902         202098       0.000810940
            J1OrbitalSoA:J1::accept                       0.0215     0.0215         202098       0.000000106
            JeeIOrbitalSoA:J3::accept                     0.8342     0.8342         202098       0.000004128
            SlaterDet::accept                           162.1243     0.0978         202098       0.000802206
              DiracDeterminantBatched::update           162.0264   162.0264         214919       0.000753895
            TwoBodyJastrow:J2::accept                     0.6193     0.6193         202098       0.000003064
          WaveFunction:psi0::buffer                       1.2684     0.0751             50       0.025367192
            J1OrbitalSoA:J1::buffer                       0.0434     0.0434             50       0.000867111
            JeeIOrbitalSoA:J3::buffer                     0.0455     0.0455             50       0.000910542
            SlaterDet::buffer                             1.0708     1.0708             50       0.021416095
            TwoBodyJastrow:J2::buffer                     0.0336     0.0336             50       0.000671632
          WaveFunction:psi0::preparegroup                 0.0826     0.0569          25642       0.000003222
            J1OrbitalSoA:J1::preparegroup                 0.0046     0.0046          25642       0.000000179
            JeeIOrbitalSoA:J3::preparegroup               0.0045     0.0045          25642       0.000000176
            SlaterDet::preparegroup                       0.0122     0.0122          25642       0.000000476
            TwoBodyJastrow:J2::preparegroup               0.0045     0.0045          25642       0.000000174
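
Reading the two timer dumps side by side (assuming the columns are inclusive time, exclusive time, call count, and per-call time), the per-call cost of the offloaded spline evaluation is roughly the same in both schemes; what explodes is the call count, which looks consistent with many tiny per-virtual-move offload launches rather than batched evaluation. A quick sanity check on the numbers:

```python
# Compare SplineC2COMPTarget::offload between the v0 run (VGL path) and the
# v3 run (NLratio path), using the totals and call counts from the timer
# dumps above. Column interpretation is my assumption.
v0_calls, v0_total = 12232, 1.7519        # v0: offload under WaveFunction:psi0::VGL
v3_calls, v3_total = 3353609, 644.0412    # v3: offload under WaveFunction:psi0::NLratio

print(f"call ratio : {v3_calls / v0_calls:.0f}x")       # 274x more offload calls
print(f"per-call v0: {v0_total / v0_calls * 1e6:.0f} us")  # ~143 us
print(f"per-call v3: {v3_total / v3_calls * 1e6:.0f} us")  # ~192 us
```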

While the values reported above were obtained for a particular number of walkers, the result is general: it holds whether there are 16 walkers per rank or 16,000, although the exact ratios differ slightly. Is this expected behavior?

To Reproduce
Steps to reproduce the behavior:

  1. Build up-to-date QMCPACK develop (git hash: 6dcfc43-dirty) and run the inputs above.

Expected behavior
I expected reasonably optimal GPU settings to yield performance comparable to, or better than, the CPU-only benchmark.

System:

  • Eldorado/El Capitan: built with ROCm 6.4.2 (based on Clang 19, I think) and Cray MPICH 9.0.1
  • Hops: built with CUDA Toolkit 12.9, Clang 19.1.7, and OpenMPI 4
  • Flight: built with GNU 13.3.1, MKL 24.0.2, and OpenMPI 4.1
  • Can provide full build scripts if needed.

Additional context
Happy to provide as much detail as needed.
