LAMMPS on multiple GPUs #195

@Felixrccs

Everything works on one GPU, but I would like to run my LAMMPS simulation across multiple GPUs.
My LAMMPS submission command for two GPUs:
srun -n 4 lmp -partition 1 1 1 1 -l lammps.log -sc screen -k on g 2 -sf kk -i lammps.in

When I run it on multiple GPUs, LAMMPS returns the following error:

Starting MPS on ravg1112
Exception: Specified device cuda:0 does not match device of data cuda:1
Exception raised from make_tensor at aten/src/ATen/Functions.cpp:26 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6b (0x14f6528553cb in /u/----/LAMMPS/lammps_mace_v1/libtorch-gpu/lib/libc10.so)
frame #1: at::TensorMaker::make_tensor() + 0xa1d (0x14f604c2641d in /u/----/LAMMPS/lammps_mace_v1/libtorch-gpu/lib/libtorch_cpu.so)
frame #2: torch::from_blob(void*, c10::ArrayRef<long>, c10::TensorOptions const&) + 0xfb (0x14f656649e2b in /u/----/LAMMPS/lammps_mace_v1/lammps/build-kokkos-cuda/liblammps.so.0)
frame #3: LAMMPS_NS::PairMACEKokkos<Kokkos::Cuda>::compute(int, int) + 0xce1 (0x14f6566620f1 in /u/----/LAMMPS/lammps_mace_v1/lammps/build-kokkos-cuda/liblammps.so.0)
frame #4: LAMMPS_NS::VerletKokkos::setup(int) + 0x5b2 (0x14f655d85fb2 in /u/----/LAMMPS/lammps_mace_v1/lammps/build-kokkos-cuda/liblammps.so.0)
frame #5: LAMMPS_NS::Temper::command(int, char**) + 0x6c4 (0x14f655583c84 in /u/----/LAMMPS/lammps_mace_v1/lammps/build-kokkos-cuda/liblammps.so.0)
frame #6: LAMMPS_NS::Input::execute_command() + 0xaec (0x14f6551e55ec in /u/----/LAMMPS/lammps_mace_v1/lammps/build-kokkos-cuda/liblammps.so.0)
frame #7: LAMMPS_NS::Input::file() + 0x155 (0x14f6551e5995 in /u/----/LAMMPS/lammps_mace_v1/lammps/build-kokkos-cuda/liblammps.so.0)
frame #8: /u/----/LAMMPS/lammps_mace_v1/lammps/build-kokkos-cuda/lmp() [0x4049e8]
frame #9: __libc_start_main + 0xef (0x14f653e3e24d in /lib64/libc.so.6)
frame #10: /u/----/LAMMPS/lammps_mace_v1/lammps/build-kokkos-cuda/lmp() [0x404b9a]

Exception: Specified device cuda:0 does not match device of data cuda:1
Exception raised from make_tensor at aten/src/ATen/Functions.cpp:26 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6b (0x14dd2ea553cb in /u/----/LAMMPS/lammps_mace_v1/libtorch-gpu/lib/libc10.so)
frame #1: at::TensorMaker::make_tensor() + 0xa1d (0x14dce0e2641d in /u/----/LAMMPS/lammps_mace_v1/libtorch-gpu/lib/libtorch_cpu.so)
frame #2: torch::from_blob(void*, c10::ArrayRef<long>, c10::TensorOptions const&) + 0xfb (0x14dd32849e2b in /u/----/LAMMPS/lammps_mace_v1/lammps/build-kokkos-cuda/liblammps.so.0)
frame #3: LAMMPS_NS::PairMACEKokkos<Kokkos::Cuda>::compute(int, int) + 0xce1 (0x14dd328620f1 in /u/----/LAMMPS/lammps_mace_v1/lammps/build-kokkos-cuda/liblammps.so.0)
frame #4: LAMMPS_NS::VerletKokkos::setup(int) + 0x5b2 (0x14dd31f85fb2 in /u/----/LAMMPS/lammps_mace_v1/lammps/build-kokkos-cuda/liblammps.so.0)
frame #5: LAMMPS_NS::Temper::command(int, char**) + 0x6c4 (0x14dd31783c84 in /u/----/LAMMPS/lammps_mace_v1/lammps/build-kokkos-cuda/liblammps.so.0)
frame #6: LAMMPS_NS::Input::execute_command() + 0xaec (0x14dd313e55ec in /u/----/LAMMPS/lammps_mace_v1/lammps/build-kokkos-cuda/liblammps.so.0)
frame #7: LAMMPS_NS::Input::file() + 0x155 (0x14dd313e5995 in /u/----/LAMMPS/lammps_mace_v1/lammps/build-kokkos-cuda/liblammps.so.0)
frame #8: /u/----/LAMMPS/lammps_mace_v1/lammps/build-kokkos-cuda/lmp() [0x4049e8]
frame #9: __libc_start_main + 0xef (0x14dd3003e24d in /lib64/libc.so.6)
frame #10: /u/----/LAMMPS/lammps_mace_v1/lammps/build-kokkos-cuda/lmp() [0x404b9a]

Abort(1) on node 1 (rank 1 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1
Abort(1) on node 3 (rank 3 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 3
slurmstepd: error: *** STEP 7590002.0 ON ravg1112 CANCELLED AT 2023-10-19T14:03:48 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: ravg1112: tasks 0,2: Killed
srun: launch/slurm: _step_signal: Terminating StepId=7590002.0
srun: error: ravg1112: task 3: Killed
srun: error: ravg1112: task 1: Killed
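
One possible reading of the log, offered as an unconfirmed guess: only ranks 1 and 3 abort, which is what you would expect if each local MPI rank is placed on GPU `local_rank % gpus_per_node` while the tensor options passed to `torch::from_blob` still name `cuda:0`. The sketch below only illustrates that arithmetic. The modulo mapping, the hard-coded device index, and the helper names `assign_device` and `mismatched_ranks` are assumptions for illustration, not code taken from LAMMPS, Kokkos, or the MACE pair style.

```python
from typing import List


def assign_device(local_rank: int, gpus_per_node: int) -> int:
    """Assumed round-robin placement: local rank r uses GPU r % gpus_per_node."""
    if gpus_per_node < 1:
        raise ValueError("gpus_per_node must be at least 1")
    return local_rank % gpus_per_node


def mismatched_ranks(n_ranks: int, gpus_per_node: int, hardcoded_device: int = 0) -> List[int]:
    """Ranks whose data would live on a GPU other than the (assumed) hard-coded one."""
    return [
        rank
        for rank in range(n_ranks)
        if assign_device(rank, gpus_per_node) != hardcoded_device
    ]


if __name__ == "__main__":
    # Values from the submission command: srun -n 4 ... -k on g 2
    print(mismatched_ranks(n_ranks=4, gpus_per_node=2))
```

With four ranks and two GPUs this flags ranks 1 and 3, the same ranks that call `MPI_Abort` in the log; with a single GPU no rank is flagged, consistent with the one-GPU run working.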
