
[Feature]: ROCm Pyxis SLURM Container Plugin - Achieve User Experience Parity with NVIDIA #153

@functionstackx


Overview

The SLURM Pyxis Container Plugin has a very clean UX and first-class support for using containers with SLURM: just one additional argument, --container-image, passed into srun.

Many NVIDIA customers that use SLURM are already using the SLURM Pyxis Container Plugin, or are in the process of switching to it.

Most of CoreWeave's SLURM customers use containers through Pyxis, as do most of AWS/GCP/OCI's SLURM customers.

This issue may seem minor, but it is one of many small quirks across the ROCm ecosystem that hurt the user experience and increase the activation energy of porting multi-node ML applications to ROCm: one more thing to worry about on top of the many other ROCm porting quirks.

ROCm should work towards making containers+SLURM much better by supporting Pyxis, so that using containers with SLURM is as simple as adding the --container-image argument.

cc: @hliuca and @powderluv

Pyxis Benefits

Here are the Pyxis benefits as described by the Pyxis README:

Benefits
- Seamlessly execute the user's task in an unprivileged container.
- Simple command-line interface.
- Fast Docker image download with support for layers caching and layers sharing across users.
- Supports multi-node MPI jobs through [PMI2](https://slurm.schedmd.com/mpi_guide.html) or [PMIx](https://pmix.org/) (requires Slurm support).
- Allows users to install packages inside the container.
- Works with shared filesystems.
- Does not require cluster-wide management of subordinate user/group ids.

Slides from SchedMD about Pyxis/enroot: https://slurm.schedmd.com/SLUG19/NVIDIA_Containers.pdf

Current ROCm SLURM+Container situation

On the AMD website, there is a blog post about running multi-node torch jobs with containers+SLURM, but the UX is decidedly subpar compared to NVIDIA's, where Pyxis only requires adding the single --container-image argument. Furthermore, AMD's current docker run + srun solution does not automatically clean up containers on exit, so users need to explicitly delete existing containers before starting new ones.

InstellaVL's sbatch docker cleanup (screenshot)
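
The kind of cleanup boilerplate this forces looks roughly like the following (a minimal sketch, not the actual InstellaVL code; training_container is a placeholder name):

# before each run, remove any leftover container from a previous
# (possibly crashed) job on every node
srun --ntasks-per-node=1 bash -c 'docker rm -f training_container 2>/dev/null || true'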

AMD's current UX for multi-node containers+SLURM has the sbatch script call another script that launches the containers, which is yet another indirection the user needs to worry about.

https://rocm.blogs.amd.com/artificial-intelligence/mosaicml-composer/README.html#multinode-composer-training-using-slurm


InstellaVL's training launcher script runs into the same situation: an additional layer of indirection. That extra layer may also make ML applications harder to debug. The general shape of the pattern is sketched below.

InstellaVL's sbatch launch script (screenshot)
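
A simplified sketch of that pattern (not the actual InstellaVL or AMD blog code; run_container.sh, the image, and the training entrypoint are placeholders):

# inside the sbatch script: srun fans a helper script out to every node...
srun bash run_container.sh

# run_container.sh: ...and each node launches docker itself
docker run --rm --network=host \
    --device=/dev/kfd --device=/dev/dri \
    -v "$PWD":/workspace \
    rocm/pytorch:latest \
    torchrun --nnodes "$SLURM_NNODES" /workspace/train.py  # rendezvous flags omitted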

Pyxis Features

Example 1 - Multi-Node PyTorch

Compared to AMD's current solution of stuffing docker run torchrun into srun, my torchrun example using Pyxis only needed the following two additional lines:

Multi-Node PyTorch Additional Lines

--container-image=nvcr.io#nvidia/pytorch:25.02-py3 \
     --container-mounts=$PWD:/workspace \

Multi-Node PyTorch Example Code

#!/bin/bash
#SBATCH --job-name=all_reduce_bench_pyxis
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:8
#SBATCH --time=01:00:00

# Set up environment variables for torchrun
GPUS_PER_NODE=8
NNODES=2
MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
MASTER_PORT=6000

srun --container-image=nvcr.io#nvidia/pytorch:25.02-py3 \
     --container-mounts=$PWD:/workspace \
     python -u -m torch.distributed.run \
         --nproc_per_node $GPUS_PER_NODE \
         --nnodes $NNODES \
         --rdzv_endpoint ${MASTER_ADDR}:${MASTER_PORT} \
         --rdzv_backend c10d \
         --max_restarts 0 \
         --role `hostname -s`':' \
         --tee 3 \
         all_reduce_bench.py
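
If ROCm supported Pyxis end-to-end, the same script should port by changing only the image URI (a hypothetical sketch; rocm/pytorch:latest is a stand-in for whatever image AMD would recommend):

srun --container-image=rocm/pytorch:latest \
     --container-mounts=$PWD:/workspace \
     python -u -m torch.distributed.run \
         --nproc_per_node $GPUS_PER_NODE --nnodes $NNODES \
         --rdzv_endpoint ${MASTER_ADDR}:${MASTER_PORT} --rdzv_backend c10d \
         all_reduce_bench.py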

Example 2 - Launching interactive dev jobs

With Pyxis, launching interactive container dev jobs is quick and easy:

srun --gpus=8 --pty --container-image=nvcr.io#nvidia/pytorch:24.12-py3 bash -i 

Example Terminal Outputs

user-1@slurm-login-0:~$ srun --gpus=8 --pty --container-image=nvcr.io#nvidia/pytorch:24.12-py3 bash -i 
pyxis: importing docker image: nvcr.io/nvidia/pytorch:24.12-py3
pyxis: imported docker image: nvcr.io/nvidia/pytorch:24.12-py3
user-1@slurm-compute-node-123:/workspace# nvidia-smi
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.230.02             Driver Version: 535.230.02   CUDA Version: 12.6     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA H100 80GB HBM3          On  | 00000000:1A:00.0 Off |                    0 |
| N/A   39C    P0              71W / 700W |      0MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          On  | 00000000:40:00.0 Off |                    0 |
| N/A   31C    P0              69W / 700W |      0MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
...
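
The hoped-for ROCm equivalent would be just as short (hypothetical; assumes a working Pyxis/enroot stack importing a ROCm image such as rocm/pytorch:latest):

srun --gpus=8 --pty --container-image=rocm/pytorch:latest bash -i
# inside the container, rocm-smi takes the place of nvidia-smi
rocm-smi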

Example 3 - Auto-Saving Interactive Containers

With Pyxis, using the --container-save argument, any changes you make inside the interactive container are automatically saved back to the .sqsh container image, so that when you re-launch the container, all the installed packages and files will still be there.

Example container-save

srun --container-image=./nightly-torch.sqsh \
     --container-save=./nightly-torch.sqsh \
     --pty bash
# inside the container, e.g. install a package; the change is saved
# back to nightly-torch.sqsh when the job ends
apt update && apt install -y net-tools
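
Relatedly, per the --container-name help text quoted in Example 6 below, a named container persists after the task completes, so re-running the same command reuses it and skips the image import (a small sketch; devbox is an arbitrary name):

srun --container-image=nvcr.io#nvidia/pytorch:24.12-py3 \
     --container-name=devbox --pty bash -i
# running the same srun again later reuses the existing "devbox"
# container; the import is skipped and installed packages persist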

Example 4 - multi-node nccl-tests

Using the Pyxis container plugin, launching reproducible multi-node nccl-tests is quite intuitive: all the dependencies, such as NCCL, the OFED userspace libraries, nccl-tests, and the CUDA toolkit (everything except the GPU drivers), live inside the container, so the performance is reproducible and easier to debug. As AMD and the AMD ML community work towards a great multi-node user experience and great performance, this kind of reproducibility seems important.

All my nccl-tests are easily reproducible via srun --container-image, but on AMD today, multi-node work is harder to capture in reproducible scripts because things aren't easily packaged up inside a container that SLURM can execute. With Pyxis, running nccl-tests is just:

srun --container-image=... \
     /opt/nccl_tests/build/all_reduce_perf -b 8M -e 8G -f 2 -g 1
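
I wish rccl-tests could be launched just as easily. A hypothetical equivalent could mirror this one-to-one (assuming an image with rccl-tests prebuilt; rccl-tests uses the same binary names as nccl-tests):

srun --container-image=... \
     /opt/rccl-tests/build/all_reduce_perf -b 8M -e 8G -f 2 -g 1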

Example 5 - multi-node MPI

With Pyxis, SLURM+container jobs that require MPI are easily launched too, by adding the --mpi=pmix flag:

srun --container-image=ghcr.io#coreweave/nccl-tests:12.8.0-devel-ubuntu22.04-nccl2.25.1-1-57fa979 \
     --mpi=pmix ... \
     /opt/nccl_tests/build/all_reduce_perf -b 8M -e 8G -f 2 -g 1

Example 6 - Pyxis Container srun args

$ srun --help
...
      --container-image=[USER@][REGISTRY#]IMAGE[:TAG]|PATH
                              [pyxis] the image to use for the container
                              filesystem. Can be either a docker image given as
                              an enroot URI, or a path to a squashfs file on the
                              remote host filesystem.

      --container-mounts=SRC:DST[:FLAGS][,SRC:DST...]
                              [pyxis] bind mount[s] inside the container. Mount
                              flags are separated with "+", e.g. "ro+rprivate"

      --container-workdir=PATH
                              [pyxis] working directory inside the container
      --container-name=NAME   [pyxis] name to use for saving and loading the
                              container on the host. Unnamed containers are
                              removed after the slurm task is complete; named
                              containers are not. If a container with this name
                              already exists, the existing container is used and
                              the import is skipped.
      --container-save=PATH   [pyxis] Save the container state to a squashfs
                              file on the remote host filesystem.
      --container-mount-home  [pyxis] bind mount the user's home directory.
                              System-level enroot settings might cause this
                              directory to be already-mounted.

      --no-container-mount-home
                              [pyxis] do not bind mount the user's home
                              directory
      --container-remap-root  [pyxis] ask to be remapped to root inside the
                              container. Does not grant elevated system
                              permissions, despite appearances.

      --no-container-remap-root
                              [pyxis] do not remap to root inside the container
      --container-entrypoint  [pyxis] execute the entrypoint from the container
                              image

      --no-container-entrypoint
                              [pyxis] do not execute the entrypoint from the
                              container image

      --container-entrypoint-log
                              [pyxis] print the output of the entrypoint script
      --container-writable    [pyxis] make the container filesystem writable
      --container-readonly    [pyxis] make the container filesystem read-only

      --container-env=NAME[,NAME...]
                              [pyxis] names of environment variables to override
                              with the host environment and set at the entrypoint.
                              By default, all exported host environment variables
                              are set in the container after the entrypoint is run,
                              but their existing values in the image take precedence;
                              the variables specified with this flag are preserved
