Overview
The SLURM Pyxis container plugin has a very clean UX and first-class support for using containers with SLURM: just one additional argument, `--container-image`, passed into `srun`.
Many NVIDIA customers that use SLURM are already using the Pyxis plugin or are in the process of switching to it.
Most of CoreWeave's SLURM customers use containers through Pyxis, as do most of AWS/GCP/OCI's SLURM customers.
This issue may seem minor, but it is one of many small quirks across the ROCm ecosystem that hurt the user experience, raise the activation energy for porting multi-node ML applications to ROCm, and add one more thing to worry about on top of the many other ROCm porting quirks.
ROCm should work towards making the container+SLURM experience much better by supporting Pyxis, so that using containers with SLURM is as simple as adding `--container-image`.
cc: @hliuca and @powderluv
Pyxis Benefits
Here are the Pyxis benefits as described in the Pyxis README:
Benefits
- Seamlessly execute the user's task in an unprivileged container.
- Simple command-line interface.
- Fast Docker image download with support for layers caching and layers sharing across users.
- Supports multi-node MPI jobs through [PMI2](https://slurm.schedmd.com/mpi_guide.html) or [PMIx](https://pmix.org/) (requires Slurm support).
- Allows users to install packages inside the container.
- Works with shared filesystems.
- Does not require cluster-wide management of subordinate user/group ids.
Slides from SchedMD about Pyxis/enroot: https://slurm.schedmd.com/SLUG19/NVIDIA_Containers.pdf
Current ROCm SLURM+Container situation
On the AMD website, there is a blog post about running multi-node torch jobs with containers+SLURM, but the UX is noticeably subpar compared to NVIDIA's, where Pyxis reduces it to a single `--container-image` argument. Furthermore, AMD's current `docker run` + `srun` solution does not automatically clean up containers on exit, so users need to explicitly delete existing containers before starting new ones.
InstellaVL's sbatch docker cleanup

The current AMD UX for multi-node containers+SLURM has the sbatch script call another script that launches the containers, which is an extra layer of indirection the user needs to worry about:
https://rocm.blogs.amd.com/artificial-intelligence/mosaicml-composer/README.html#multinode-composer-training-using-slurm
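For concreteness, the indirection pattern looks roughly like this; this is only a sketch, and the file names, image name, and flags are illustrative rather than taken from the blog:

```shell
#!/bin/bash
# Sketch of the two-layer pattern: sbatch -> srun -> wrapper -> docker run.
# Everything here (names, image, flags) is illustrative, not from the blog.
cat > run_in_container.sh <<'EOF'
#!/bin/bash
# Runs on every node; the user manages pulls and cleanup by hand.
docker pull rocm/pytorch:latest
docker run --rm --device=/dev/kfd --device=/dev/dri \
  rocm/pytorch:latest torchrun "$@"
EOF
chmod +x run_in_container.sh
# The sbatch script then calls the wrapper instead of the workload itself:
#   srun bash ./run_in_container.sh train.py
echo "wrapper written: run_in_container.sh"
```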

InstellaVL's training launcher script runs into the same situation: there is an additional layer of indirection, which may also make ML applications harder to debug.
InstellaVL's sbatch launch script

Pyxis Features
Example 1 - Multi-Node PyTorch
Compared to AMD's current solution of stuffing `docker run torchrun` into `srun`, in my torchrun example using Pyxis the only two lines I needed to add were the following:
Multi-Node PyTorch Code Additions
--container-image=nvcr.io#nvidia/pytorch:25.02-py3 \
--container-mounts=$PWD:/workspace \
Multi-Node PyTorch Example Code
#!/bin/bash
#SBATCH --job-name=all_reduce_bench_pyxis
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:8
#SBATCH --time=01:00:00
# Set up environment variables for torchrun
GPUS_PER_NODE=8
NNODES=2
MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
MASTER_PORT=6000
srun --container-image=nvcr.io#nvidia/pytorch:25.02-py3 \
--container-mounts=$PWD:/workspace \
python -u -m torch.distributed.run \
--nproc_per_node $GPUS_PER_NODE \
--nnodes $NNODES \
--rdzv_endpoint ${MASTER_ADDR}:${MASTER_PORT} \
--rdzv_backend c10d \
--max_restarts 0 \
--role `hostname -s`':' \
--tee 3 \
all_reduce_bench.py
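One optional tweak to the script above (my own sketch, not part of the original example): the node and GPU counts can be derived from SLURM's environment instead of being hardcoded, so they stay in sync with the `#SBATCH` lines. `SLURM_NNODES` and `SLURM_GPUS_ON_NODE` are standard SLURM-exported variables:

```shell
# Derive counts from SLURM's environment, falling back to the
# hardcoded values when run outside a SLURM allocation.
NNODES=${SLURM_NNODES:-2}
GPUS_PER_NODE=${SLURM_GPUS_ON_NODE:-8}
echo "nodes=${NNODES} gpus_per_node=${GPUS_PER_NODE}"
```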
Example 2 - Launching interactive dev jobs
With Pyxis, launching interactive containerized dev jobs is quick and easy:
srun --gpus=8 --pty --container-image=nvcr.io#nvidia/pytorch:24.12-py3 bash -i
Example Terminal Outputs
user-1@slurm-login-0:~$ srun --gpus=8 --pty --container-image=nvcr.io#nvidia/pytorch:24.12-py3 bash -i
pyxis: importing docker image: nvcr.io/nvidia/pytorch:24.12-py3
pyxis: imported docker image: nvcr.io/nvidia/pytorch:24.12-py3
user-1@slurm-compute-node-123:/workspace# nvidia-smi
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.230.02 Driver Version: 535.230.02 CUDA Version: 12.6 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA H100 80GB HBM3 On | 00000000:1A:00.0 Off | 0 |
| N/A 39C P0 71W / 700W | 0MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA H100 80GB HBM3 On | 00000000:40:00.0 Off | 0 |
| N/A 31C P0 69W / 700W | 0MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
...
Example 3 - Auto-Saving Interactive Containers
With Pyxis, the `--container-save` argument makes any changes you make inside the interactive container automatically get saved to the `.sqsh` container image, so that when you relaunch the container, all the installed packages and files are still there.
Example container-save
srun --container-image=./nightly-torch.sqsh \
     --container-save=./nightly-torch.sqsh bash
# then, inside the container (Debian/Ubuntu package name is net-tools):
apt update && apt install -y net-tools
Example 4 - multi-node nccl-tests
Using the Pyxis container plugin, launching reproducible multi-node nccl-tests is quite intuitive: all the dependencies, such as NCCL, the OFED userspace libraries, nccl-tests, and the CUDA toolkit (everything except the GPU drivers), live inside the container, so performance is reproducible and easier to debug. As AMD and the AMD ML community work towards making the multi-node user experience and performance great, this kind of reproducibility seems important.
All my nccl-tests runs are easily reproducible via `srun --container-image`, but on AMD it is currently harder to create reproducible multi-node scripts because things aren't easily packaged up inside a container that SLURM can execute. I wish rccl-tests could be launched just as easily.
srun --container-image=... \
/opt/nccl_tests/build/all_reduce_perf -b 8M -e 8G -f 2 -g 1
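To make that wish concrete, here is what a hypothetical rccl-tests equivalent could look like if ROCm supported Pyxis; the image name and binary path below are made up purely for illustration:

```shell
# Hypothetical: no such ROCm/pyxis image exists today; names are illustrative.
IMAGE="ghcr.io#example/rccl-tests:rocm-latest"
CMD="srun --container-image=${IMAGE} /opt/rccl-tests/build/all_reduce_perf -b 8M -e 8G -f 2 -g 1"
echo "${CMD}"
```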
Example 5 - multi-node MPI
With Pyxis, SLURM+container jobs that require MPI are easily launched too, by adding the `--mpi=pmix` flag (this requires a SLURM build with PMIx support):
srun --container-image=ghcr.io#coreweave/nccl-tests:12.8.0-devel-ubuntu22.04-nccl2.25.1-1-57fa979 \
--mpi=pmix ... \
/opt/nccl_tests/build/all_reduce_perf -b 8M -e 8G -f 2 -g 1
Example 6 - Pyxis Container srun args
$ srun --help
...
--container-image=[USER@][REGISTRY#]IMAGE[:TAG]|PATH
[pyxis] the image to use for the container
filesystem. Can be either a docker image given as
an enroot URI, or a path to a squashfs file on the
remote host filesystem.
--container-mounts=SRC:DST[:FLAGS][,SRC:DST...]
[pyxis] bind mount[s] inside the container. Mount
flags are separated with "+", e.g. "ro+rprivate"
--container-workdir=PATH
[pyxis] working directory inside the container
--container-name=NAME [pyxis] name to use for saving and loading the
container on the host. Unnamed containers are
removed after the slurm task is complete; named
containers are not. If a container with this name
already exists, the existing container is used and
the import is skipped.
--container-save=PATH [pyxis] Save the container state to a squashfs
file on the remote host filesystem.
--container-mount-home [pyxis] bind mount the user's home directory.
System-level enroot settings might cause this
directory to be already-mounted.
--no-container-mount-home
[pyxis] do not bind mount the user's home
directory
--container-remap-root [pyxis] ask to be remapped to root inside the
container. Does not grant elevated system
permissions, despite appearances.
--no-container-remap-root
[pyxis] do not remap to root inside the container
--container-entrypoint [pyxis] execute the entrypoint from the container
image
--no-container-entrypoint
[pyxis] do not execute the entrypoint from the
container image
--container-entrypoint-log
[pyxis] print the output of the entrypoint script
--container-writable [pyxis] make the container filesystem writable
--container-readonly [pyxis] make the container filesystem read-only
--container-env=NAME[,NAME...]
[pyxis] names of environment variables to override
with the host environment and set at the entrypoint.
By default, all exported host environment variables
are set in the container after the entrypoint is run,
but their existing values in the image take precedence;
the variables specified with this flag are preserved