
[Feature]: ROCm Pyxis SLURM Container Plugin - Achieve User Experience Parity with NVIDIA #153

@functionstackx


Overview

The SLURM Pyxis Container Plugin has a very clean UX and first-class support for using containers with SLURM: just one additional argument, --container-image, passed into srun.

Many NVIDIA customers that use SLURM are already using the SLURM Pyxis Container Plugin, or are in the process of switching to it.

Most of CoreWeave's SLURM customers use containers through Pyxis, as do most of AWS/GCP/OCI's SLURM customers.

This issue may seem minor, but it is one of many small quirks across the ROCm ecosystem that hurt the user experience and increase the activation energy of porting multi-node ML applications to ROCm: one more thing to worry about on top of the many other ROCm porting quirks.

ROCm should work towards making containers+SLURM much better by supporting Pyxis, so that using containers with SLURM is as simple as adding the --container-image argument.

cc: @hliuca and @powderluv

Pyxis Benefits

Here are the Pyxis benefits as described by the Pyxis README:

Benefits
- Seamlessly execute the user's task in an unprivileged container.
- Simple command-line interface.
- Fast Docker image download with support for layers caching and layers sharing across users.
- Supports multi-node MPI jobs through [PMI2](https://slurm.schedmd.com/mpi_guide.html) or [PMIx](https://pmix.org/) (requires Slurm support).
- Allows users to install packages inside the container.
- Works with shared filesystems.
- Does not require cluster-wide management of subordinate user/group ids.

Slides from SchedMD about Pyxis/enroot: https://slurm.schedmd.com/SLUG19/NVIDIA_Containers.pdf

Current ROCm SLURM+Container situation

On the AMD website, there is a blog post about running multi-node torch jobs with containers+SLURM, but the UX is decidedly subpar compared to NVIDIA's, where Pyxis only requires adding the single --container-image argument. Furthermore, AMD's current docker run + srun solution does not automatically clean up containers on exit, so users need to explicitly delete existing containers before starting new ones.

InstellaVL's sbatch docker cleanup (screenshot)
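
The kind of cleanup boilerplate this forces looks roughly like the following (a minimal sketch, not the actual InstellaVL code; training_container is a placeholder name):

# before each run, remove any leftover container from a previous
# (possibly crashed) job on every node
srun --ntasks-per-node=1 bash -c 'docker rm -f training_container 2>/dev/null || true'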

AMD's current UX for multi-node containers+SLURM has the sbatch script call another script that launches the containers, which is yet another indirection the user needs to worry about.

https://rocm.blogs.amd.com/artificial-intelligence/mosaicml-composer/README.html#multinode-composer-training-using-slurm


InstellaVL's training launcher script runs into the same situation: an additional layer of indirection. That extra layer may also make ML applications harder to debug. The general shape of the pattern is sketched below.

InstellaVL's sbatch launch script (screenshot)
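
A simplified sketch of that pattern (not the actual InstellaVL or AMD blog code; run_container.sh, the image, and the training entrypoint are placeholders):

# inside the sbatch script: srun fans a helper script out to every node...
srun bash run_container.sh

# run_container.sh: ...and each node launches docker itself
docker run --rm --network=host \
    --device=/dev/kfd --device=/dev/dri \
    -v "$PWD":/workspace \
    rocm/pytorch:latest \
    torchrun --nnodes "$SLURM_NNODES" /workspace/train.py  # rendezvous flags omitted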

Pyxis Features

Example 1 - Multi-Node PyTorch

Compared to AMD's current solution of stuffing docker run torchrun into srun, my torchrun example using Pyxis only needed the following two additional lines:

Multi-Node PyTorch Additional Lines

--container-image=nvcr.io#nvidia/pytorch:25.02-py3 \
     --container-mounts=$PWD:/workspace \

Multi-Node PyTorch Example Code

#!/bin/bash
#SBATCH --job-name=all_reduce_bench_pyxis
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:8
#SBATCH --time=01:00:00

# Set up environment variables for torchrun
GPUS_PER_NODE=8
NNODES=2
MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
MASTER_PORT=6000

srun --container-image=nvcr.io#nvidia/pytorch:25.02-py3 \
     --container-mounts=$PWD:/workspace \
     python -u -m torch.distributed.run \
         --nproc_per_node $GPUS_PER_NODE \
         --nnodes $NNODES \
         --rdzv_endpoint ${MASTER_ADDR}:${MASTER_PORT} \
         --rdzv_backend c10d \
         --max_restarts 0 \
         --role `hostname -s`':' \
         --tee 3 \
         all_reduce_bench.py
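
If ROCm supported Pyxis end-to-end, the same script should port by changing only the image URI (a hypothetical sketch; rocm/pytorch:latest is a stand-in for whatever image AMD would recommend):

srun --container-image=rocm/pytorch:latest \
     --container-mounts=$PWD:/workspace \
     python -u -m torch.distributed.run \
         --nproc_per_node $GPUS_PER_NODE --nnodes $NNODES \
         --rdzv_endpoint ${MASTER_ADDR}:${MASTER_PORT} --rdzv_backend c10d \
         all_reduce_bench.py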

Example 2 - Launching interactive dev jobs

With Pyxis, launching interactive container dev jobs is quick and easy:

srun --gpus=8 --pty --container-image=nvcr.io#nvidia/pytorch:24.12-py3 bash -i 

Example Terminal Outputs

user-1@slurm-login-0:~$ srun --gpus=8 --pty --container-image=nvcr.io#nvidia/pytorch:24.12-py3 bash -i 
pyxis: importing docker image: nvcr.io/nvidia/pytorch:24.12-py3
pyxis: imported docker image: nvcr.io/nvidia/pytorch:24.12-py3
user-1@slurm-compute-node-123:/workspace# nvidia-smi
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.230.02             Driver Version: 535.230.02   CUDA Version: 12.6     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA H100 80GB HBM3          On  | 00000000:1A:00.0 Off |                    0 |
| N/A   39C    P0              71W / 700W |      0MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          On  | 00000000:40:00.0 Off |                    0 |
| N/A   31C    P0              69W / 700W |      0MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
...
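
The hoped-for ROCm equivalent would be just as short (hypothetical; assumes a working Pyxis/enroot stack importing a ROCm image such as rocm/pytorch:latest):

srun --gpus=8 --pty --container-image=rocm/pytorch:latest bash -i
# inside the container, rocm-smi takes the place of nvidia-smi
rocm-smi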

Example 3 - Auto-Saving Interactive Containers

With Pyxis, using the --container-save argument, any changes you make inside the interactive container are automatically saved back to the .sqsh container image, so that when you re-launch the container, all the installed packages and files will still be there.

Example container-save

srun --container-image=./nightly-torch.sqsh \
     --container-save=./nightly-torch.sqsh \
     --pty bash
# inside the container, e.g. install a package; the change is saved
# back to nightly-torch.sqsh when the job ends
apt update && apt install -y net-tools
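
Relatedly, per the --container-name help text quoted in Example 6 below, a named container persists after the task completes, so re-running the same command reuses it and skips the image import (a small sketch; devbox is an arbitrary name):

srun --container-image=nvcr.io#nvidia/pytorch:24.12-py3 \
     --container-name=devbox --pty bash -i
# running the same srun again later reuses the existing "devbox"
# container; the import is skipped and installed packages persist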

Example 4 - multi-node nccl-tests

Using the Pyxis container plugin, launching reproducible multi-node nccl-tests is quite intuitive: all the dependencies, such as NCCL, the OFED userspace libraries, nccl-tests, and the CUDA toolkit (everything except the GPU drivers), live inside the container, so the performance is reproducible and easier to debug. As AMD and the AMD ML community work towards a great multi-node user experience and great performance, this kind of reproducibility seems important.

All my nccl-tests are easily reproducible via srun --container-image, but on AMD today, multi-node work is harder to capture in reproducible scripts because things aren't easily packaged up inside a container that SLURM can execute. With Pyxis, running nccl-tests is just:

srun --container-image=... \
     /opt/nccl_tests/build/all_reduce_perf -b 8M -e 8G -f 2 -g 1
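
I wish rccl-tests could be launched just as easily. A hypothetical equivalent could mirror this one-to-one (assuming an image with rccl-tests prebuilt; rccl-tests uses the same binary names as nccl-tests):

srun --container-image=... \
     /opt/rccl-tests/build/all_reduce_perf -b 8M -e 8G -f 2 -g 1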

Example 5 - multi-node MPI

With Pyxis, SLURM+container jobs that require MPI are easily launched too, by adding the --mpi=pmix flag:

srun --container-image=ghcr.io#coreweave/nccl-tests:12.8.0-devel-ubuntu22.04-nccl2.25.1-1-57fa979 \
     --mpi=pmix ... \
     /opt/nccl_tests/build/all_reduce_perf -b 8M -e 8G -f 2 -g 1

Example 6 - Pyxis Container srun args

$ srun --help
...
      --container-image=[USER@][REGISTRY#]IMAGE[:TAG]|PATH
                              [pyxis] the image to use for the container
                              filesystem. Can be either a docker image given as
                              an enroot URI, or a path to a squashfs file on the
                              remote host filesystem.

      --container-mounts=SRC:DST[:FLAGS][,SRC:DST...]
                              [pyxis] bind mount[s] inside the container. Mount
                              flags are separated with "+", e.g. "ro+rprivate"

      --container-workdir=PATH
                              [pyxis] working directory inside the container
      --container-name=NAME   [pyxis] name to use for saving and loading the
                              container on the host. Unnamed containers are
                              removed after the slurm task is complete; named
                              containers are not. If a container with this name
                              already exists, the existing container is used and
                              the import is skipped.
      --container-save=PATH   [pyxis] Save the container state to a squashfs
                              file on the remote host filesystem.
      --container-mount-home  [pyxis] bind mount the user's home directory.
                              System-level enroot settings might cause this
                              directory to be already-mounted.

      --no-container-mount-home
                              [pyxis] do not bind mount the user's home
                              directory
      --container-remap-root  [pyxis] ask to be remapped to root inside the
                              container. Does not grant elevated system
                              permissions, despite appearances.

      --no-container-remap-root
                              [pyxis] do not remap to root inside the container
      --container-entrypoint  [pyxis] execute the entrypoint from the container
                              image

      --no-container-entrypoint
                              [pyxis] do not execute the entrypoint from the
                              container image

      --container-entrypoint-log
                              [pyxis] print the output of the entrypoint script
      --container-writable    [pyxis] make the container filesystem writable
      --container-readonly    [pyxis] make the container filesystem read-only

      --container-env=NAME[,NAME...]
                              [pyxis] names of environment variables to override
                              with the host environment and set at the entrypoint.
                              By default, all exported host environment variables
                              are set in the container after the entrypoint is run,
                              but their existing values in the image take precedence;
                              the variables specified with this flag are preserved
