
# HPC

RAPIDS can be deployed on HPC clusters managed by [SLURM](https://slurm.schedmd.com/).

## SLURM

SLURM is a job scheduler that manages access to compute nodes on HPC clusters.
Instead of logging into a GPU machine directly, you ask SLURM for resources
(CPUs, GPUs, memory, time) and it allocates a node for you when one becomes
available.

Nodes are organized into **partitions**, groups of machines with similar
hardware. For example, your cluster might have a `gpu` partition with A100 nodes
and a `cpu` partition with CPU-only nodes.

For a more comprehensive overview, see the [SLURM quickstart guide](https://slurm.schedmd.com/quickstart.html).

### Partitions

Check which partitions are available and what GPUs they have. The `-o` flag
customizes the output format: `%P` shows the partition name, `%G` the
generic resources (such as GPUs), `%D` the number of nodes, and `%T` the
node state.

```bash
sinfo -o "%P %G %D %T"
PARTITION GRES       NODES STATE
gpu       gpu:a100:4 10    idle
gpu-dev   gpu:v100:2 4     idle
```

Your cluster admin can tell you which partition to use. Throughout this guide
we use `-p gpu`. Replace this with your partition name.

### Interactive Jobs

An interactive job gives you a shell on a compute node where you can run
commands directly. This is useful for development, debugging, and testing
before submitting longer batch jobs.

Use `srun` to request a GPU node. The `--gres=gpu:1` flag requests one GPU,
`--time` sets the maximum walltime, and `--pty bash` gives you a terminal.

```bash
srun -p gpu --gres=gpu:1 --time=01:00:00 --pty bash
```

This will queue until a node is available, then drop you into a shell on
the allocated node.
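Once on the node, it is worth confirming the allocated GPU is visible before doing any work (a quick check; it assumes the NVIDIA driver is installed on the compute node):

```shell
# List the GPUs visible to this job (runs on the allocated compute node).
nvidia-smi --query-gpu=name,memory.total --format=csv,noheader
```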

### Batch Jobs

For longer-running work, write a script and submit it with `sbatch`. SLURM
runs the script when resources become available and you don't need to stay
connected.

```bash
sbatch my_job.sh
Submitted batch job 12345
```
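The script itself is an ordinary shell script whose `#SBATCH` comment lines declare the resources. A minimal sketch of `my_job.sh` (the partition and GPU count here are assumptions; adjust them to your cluster):

```shell
#!/usr/bin/env bash
#SBATCH --job-name=hello-gpu
#SBATCH -p gpu
#SBATCH --gres=gpu:1
#SBATCH --time=00:10:00

# Everything below the directives runs on the allocated compute node.
echo "Running on $(hostname)"
```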

Check the status of your jobs with `squeue`. The `-u` flag filters by your
username.

```bash
squeue -u $USER
```

### Keeping Sessions Alive

If your SSH connection drops while in an interactive job, the job is
terminated and you lose your work. To avoid this, start a
[tmux](https://github.com/tmux/tmux) or
[screen](https://www.gnu.org/software/screen/) session on the login node
**before** requesting your interactive job.

```bash
tmux new -s rapids
srun -p gpu --gres=gpu:1 --time=01:00:00 --pty bash
```

To detach from the tmux session without ending your job, press `Ctrl+b`
then `d`. Your interactive job continues running in the background. When
you reconnect via SSH, reattach to the session with:

```bash
tmux attach -t rapids
```

## Install RAPIDS

### Environment Modules

[Environment modules](https://modules.readthedocs.io/) are the standard way
to manage software on HPC clusters. We'll create a
[conda](https://docs.conda.io/) environment containing both CUDA and RAPIDS,
then wrap it in an [Lmod](https://lmod.readthedocs.io/) module file so it can
be loaded with a single command.

We use conda here because it handles the CUDA toolkit and RAPIDS dependencies
together, avoiding version conflicts with system libraries.

Lastly, we can now run a job on the established Dask Cluster.
```{note}
Conda installs the CUDA **toolkit** (runtime libraries), but
the NVIDIA **kernel driver** must already be installed on the cluster's compute
nodes. This is typically managed by your cluster admin. You can verify the
driver is available by running `nvidia-smi` on a compute node.
```

#### Create the environment

```bash
conda create -n rapids-{{ rapids_version }} {{ rapids_conda_channels }} \
    {{ rapids_conda_packages }}
```

#### Create the module file

Place a module file in your cluster's module path so that users can load
the environment. Replace `<path to miniconda3>` with the absolute path to
your conda installation.

```bash
mkdir -p /opt/modulefiles/rapids
cat << 'EOF' > /opt/modulefiles/rapids/{{ rapids_version }}.lua
help([[RAPIDS {{ rapids_version }} - GPU-accelerated data science libraries.]])

whatis("Name: RAPIDS")
whatis("Version: {{ rapids_version }}")
whatis("Description: GPU-accelerated data science libraries")

family("rapids")

local conda_root = "<path to miniconda3>"
local env = "rapids-{{ rapids_version }}"
local env_prefix = pathJoin(conda_root, "envs", env)

prepend_path("PATH", pathJoin(env_prefix, "bin"))
prepend_path("LD_LIBRARY_PATH", pathJoin(env_prefix, "lib"))

setenv("CONDA_PREFIX", env_prefix)
setenv("CONDA_DEFAULT_ENV", env)
EOF
```

#### Verify

```bash
module load rapids/{{ rapids_version }}
python -c "import cudf; print(cudf.__version__)"
```

### Containers

Many HPC clusters support running containers through runtimes such as
[Apptainer](https://apptainer.org/) (formerly Singularity),
[Pyxis](https://github.com/NVIDIA/pyxis) + [Enroot](https://github.com/NVIDIA/enroot),
[Podman](https://podman.io/), or
[Charliecloud](https://hpc.github.io/charliecloud/). This is an alternative
to environment modules, as the RAPIDS container image ships with CUDA and all
RAPIDS libraries pre-installed and does not need any additional configuration.

Check with your cluster admin which container runtime is available. The
examples below cover Apptainer and Pyxis + Enroot, two of the most common
setups on HPC clusters.

#### Apptainer

[Apptainer](https://apptainer.org/) is a container runtime designed for HPC.
The `--nv` flag exposes the host GPU drivers to the container.

```bash
apptainer pull rapids.sif docker://{{ rapids_container }}
```
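After pulling, you can sanity-check the image by running a command inside it (a quick check; it assumes a GPU and the NVIDIA driver are present on the machine where you run it):

```shell
# Import cudf inside the container; --nv exposes the host GPU driver.
apptainer exec --nv rapids.sif python -c "import cudf; print(cudf.__version__)"
```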

#### Pyxis + Enroot

[Enroot](https://github.com/NVIDIA/enroot) is NVIDIA's lightweight container
runtime for HPC. [Pyxis](https://github.com/NVIDIA/pyxis) is a SLURM plugin
that integrates Enroot into SLURM, adding `--container-*` flags to `srun` and
`sbatch` so you can launch containerized jobs directly through the scheduler.
Pyxis + Enroot is pre-installed on many GPU clusters including NVIDIA DGX
systems.

Import the RAPIDS container image as a squashfs file. We recommend
pre-importing large images to avoid re-downloading on every job.

Note that Enroot uses `#` instead of `/` to separate the registry hostname
from the image path.

```bash
enroot import --output rapids.sqsh 'docker://{{ rapids_container.replace("/", "#", 1) }}'
```

## Run a Single GPU Job

[cudf.pandas](https://docs.rapids.ai/api/cudf/stable/cudf_pandas/) lets you
accelerate existing pandas code on a GPU with no code changes. You run your
script with `python -m cudf.pandas` instead of `python` and pandas operations
are automatically dispatched to the GPU.

The following example uses pandas to generate and aggregate random data.

```python
# my_script.py
import pandas as pd

df = pd.DataFrame({"x": range(1_000_000), "y": range(1_000_000)})
df["group"] = df["x"] % 100
result = df.groupby("group").agg(["mean", "sum", "count"])
print(result)
```

### Interactive

#### With modules

```bash
srun -p gpu --gres=gpu:1 --pty bash
module load rapids/{{ rapids_version }}
python -m cudf.pandas my_script.py
```

#### With containers

`````{tab-set}

````{tab-item} Apptainer

The `--nv` flag exposes the host GPU drivers to the container.

```bash
srun -p gpu --gres=gpu:1 apptainer exec --nv rapids.sif \
  python -m cudf.pandas my_script.py
```

````

````{tab-item} Pyxis + Enroot

The `--container-image` flag is provided by Pyxis. Use `--container-mounts`
to make your data and scripts available inside the container.

```bash
srun -p gpu --gres=gpu:1 \
  --container-image=./rapids.sqsh \
  --container-mounts=$(pwd):/work --container-workdir=/work \
  python -m cudf.pandas /work/my_script.py
```

````

`````

### Batch

Write a SLURM batch script (saved here as `rapids_job.sh`) to run the same workload
without an interactive session. This is the typical workflow for production jobs.

```bash
#!/usr/bin/env bash
#SBATCH --job-name=rapids-cudf
#SBATCH --gres=gpu:1
#SBATCH --time=01:00:00

module load rapids/{{ rapids_version }}
python -m cudf.pandas my_script.py
```

```bash
sbatch rapids_job.sh
```

```{relatedexamples}

```