From ac3243bc14dad3991df95de7c1500a1ffe1dc46f Mon Sep 17 00:00:00 2001 From: Jaya Venkatesh Date: Wed, 11 Feb 2026 08:14:44 -0800 Subject: [PATCH 1/3] added hpc docs page Signed-off-by: Jaya Venkatesh --- source/hpc.md | 318 ++++++++++++++++++++++++++++++++++++-------------- 1 file changed, 230 insertions(+), 88 deletions(-) diff --git a/source/hpc.md b/source/hpc.md index bbfdae15..4430a4d3 100644 --- a/source/hpc.md +++ b/source/hpc.md @@ -1,132 +1,274 @@ --- -review_priority: "index" +review_priority: "p1" --- # HPC -RAPIDS works extremely well in traditional HPC (High Performance Computing) environments where GPUs are often co-located with accelerated networking hardware such as InfiniBand. Deploying on HPC often means using queue management systems such as SLURM, LSF, PBS, etc. +RAPIDS can be deployed on HPC clusters managed by [SLURM](https://slurm.schedmd.com/). -## SLURM +## SLURM Basics -```{warning} -This is a legacy page and may contain outdated information. We are working hard to update our documentation with the latest and greatest information, thank you for bearing with us. -``` +SLURM is a job scheduler that manages access to compute nodes on HPC clusters. +Instead of logging into a GPU machine directly, you ask SLURM for resources +(CPUs, GPUs, memory, time) and it allocates a node for you when one becomes +available. -If you are unfamiliar with SLURM or need a refresher, we recommend the [quickstart guide](https://slurm.schedmd.com/quickstart.html). -Depending on how your nodes are configured, additional settings may be required such as defining the number of GPUs `(--gpus)` desired or the number of gpus per node `(--gpus-per-node)`. -In the following example, we assume each allocation runs on a DGX1 with access to all eight GPUs. +Nodes are organized into **partitions**, groups of machines with similar +hardware. For example, your cluster might have a `gpu` partition with A100 nodes +and a `cpu` partition with CPU-only nodes. 
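+Day to day, you will interact with SLURM through a handful of commands, most
+of which are covered in more detail below. This overview is a quick sketch;
+`scancel` is included for completeness, and the job ID `12345` is a
+placeholder for one of your own job IDs from `squeue`.
+
+```bash
+sinfo                # list partitions and node states
+srun --pty bash      # run an interactive shell on an allocated node
+sbatch my_job.sh     # submit a batch script
+squeue -u $USER      # show your pending and running jobs
+scancel 12345        # cancel a job by ID
+```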
-### Start Scheduler +### Partitions -First, start the scheduler with the following SLURM script. This and the following scripts can deployed with `salloc` for interactive usage or `sbatch` for batched run. +Check which partitions are available and what GPUs they have. The `-o` flag +customizes the output format: `%P` shows the partition name, `%G` the +generic resources (such as GPUs), `%D` the number of nodes, and `%T` the +node state. ```bash -#!/usr/bin/env bash +sinfo -o "%P %G %D %T" +PARTITION GRES NODES STATE +gpu gpu:a100:4 10 idle +gpu-dev gpu:v100:2 4 idle +``` -#SBATCH -J dask-scheduler -#SBATCH -n 1 -#SBATCH -t 00:10:00 +Your cluster admin can tell you which partition to use. Throughout this guide +we use `-p gpu`. Replace this with your partition name. -module load cuda/11.0.3 -CONDA_ROOT=/nfs-mount/user/miniconda3 -source $CONDA_ROOT/etc/profile.d/conda.sh -conda activate rapids +### Interactive Jobs -LOCAL_DIRECTORY=/nfs-mount/dask-local-directory -mkdir $LOCAL_DIRECTORY -CUDA_VISIBLE_DEVICES=0 dask-scheduler \ - --protocol tcp \ - --scheduler-file "$LOCAL_DIRECTORY/dask-scheduler.json" & +An interactive job gives you a shell on a compute node where you can run +commands directly. This is useful for development, debugging, and testing +before submitting longer batch jobs. -dask-cuda-worker \ - --rmm-pool-size 14GB \ - --scheduler-file "$LOCAL_DIRECTORY/dask-scheduler.json" +Use `srun` to request a GPU node. The `--gres=gpu:1` flag requests one GPU, +`--time` sets the maximum walltime, and `--pty bash` gives you a terminal. + +```bash +srun -p gpu --gres=gpu:1 --time=01:00:00 --pty bash ``` -Notice that we configure the scheduler to write a `scheduler-file` to a NFS accessible location. This file contains metadata about the scheduler and will -include the IP address and port for the scheduler. The file will serve as input to the workers informing them what address and port to connect. 
+This will queue until a node is available, then drop you into a shell on +the allocated node. -The scheduler doesn't need the whole node to itself so we can also start a worker on this node to fill out the unused resources. +### Batch Jobs -### Start Dask CUDA Workers +For longer-running work, write a script and submit it with `sbatch`. SLURM +runs the script when resources become available and you don't need to stay +connected. -Next start the other [dask-cuda workers](https://docs.rapids.ai/api/dask-cuda/~~~rapids_api_docs_version~~~/). Dask-CUDA extends the traditional Dask `Worker` class with specific options and enhancements for GPU environments. Unlike the scheduler and client, the workers script should be scalable and allow the users to tune how many workers are created. -For example, we can scale the number of nodes to 3: `sbatch/salloc -N3 dask-cuda-worker.script` . In this case, because we have 8 GPUs per node and we have 3 nodes, -our job will have 24 workers. +```bash +sbatch my_job.sh +Submitted batch job 12345 +``` + +Check the status of your jobs with `squeue`. The `-u` flag filters by your +username. ```bash -#!/usr/bin/env bash +squeue -u $USER +``` -#SBATCH -J dask-cuda-workers -#SBATCH -t 00:10:00 +### Keeping Sessions Alive -module load cuda/11.0.3 -CONDA_ROOT=/nfs-mount/miniconda3 -source $CONDA_ROOT/etc/profile.d/conda.sh -conda activate rapids +If your SSH connection drops while in an interactive job, the job is +terminated and you lose your work. To avoid this, start a +[tmux](https://github.com/tmux/tmux) or +[screen](https://www.gnu.org/software/screen/) session on the login node +**before** requesting your interactive job. -LOCAL_DIRECTORY=/nfs-mount/dask-local-directory -mkdir $LOCAL_DIRECTORY -dask-cuda-worker \ - --rmm-pool-size 14GB \ - --scheduler-file "$LOCAL_DIRECTORY/dask-scheduler.json" +```bash +tmux new -s rapids +srun -p gpu --gres=gpu:1 --time=01:00:00 --pty bash +# ... work ... 
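+# For example, confirm the GPU assigned to this job is visible
+# (assumes the NVIDIA driver is installed on the compute node):
+nvidia-smi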
+# Detach with Ctrl+b, d and your job keeps running
+# Reattach later:
+tmux attach -t rapids
```

-### cuDF Example Workflow
+## Install RAPIDS
+
+### Environment Modules
+
+[Environment modules](https://modules.readthedocs.io/) are the standard way
+to manage software on HPC clusters. We'll create a
+[conda](https://docs.conda.io/) environment containing both CUDA and RAPIDS,
+then wrap it in an [Lmod](https://lmod.readthedocs.io/) module file so it can
+be loaded with a single command.

-Lastly, we can now run a job on the established Dask Cluster.
+We use conda here because it handles the CUDA toolkit and RAPIDS dependencies
+together, avoiding version conflicts with system libraries.
+
+```{note}
+Conda installs the CUDA **toolkit** (runtime libraries), but
+the NVIDIA **kernel driver** must already be installed on the cluster's compute
+nodes. This is typically managed by your cluster admin. You can verify the
+driver is available by running `nvidia-smi` on a compute node.
+```
+
+#### Create the environment

```bash
-#!/usr/bin/env bash
+conda create -n rapids-{{ rapids_version }} {{ rapids_conda_channels }} \
+    {{ rapids_conda_packages }}
+```

-#SBATCH -J dask-client
-#SBATCH -n 1
-#SBATCH -t 00:10:00
+#### Create the module file

-module load cuda/11.0.3
-CONDA_ROOT=/nfs-mount/miniconda3
-source $CONDA_ROOT/etc/profile.d/conda.sh
-conda activate rapids
+Place a module file in your cluster's module path so that users can load
+the environment. Replace `<CONDA_ROOT>` with the absolute path to
+your conda installation. 
-LOCAL_DIRECTORY=/nfs-mount/dask-local-directory
+```bash
+mkdir -p /opt/modulefiles/rapids
+cat << 'EOF' > /opt/modulefiles/rapids/{{ rapids_version }}.lua
+help([[RAPIDS {{ rapids_version }} - GPU-accelerated data science libraries.]])

-cat <>/tmp/dask-cudf-example.py
-import cudf
-import dask.dataframe as dd
-from dask.distributed import Client
+whatis("Name: RAPIDS")
+whatis("Version: {{ rapids_version }}")
+whatis("Description: GPU-accelerated data science libraries")

-client = Client(scheduler_file="$LOCAL_DIRECTORY/dask-scheduler.json")
-cdf = cudf.datasets.timeseries()
+family("rapids")

-ddf = dd.from_pandas(cdf, npartitions=10)
-res = ddf.groupby(['id', 'name']).agg(['mean', 'sum', 'count']).compute()
-print(res)
+local conda_root = "<CONDA_ROOT>"
+local env = "rapids-{{ rapids_version }}"
+local env_prefix = pathJoin(conda_root, "envs", env)
+
+prepend_path("PATH", pathJoin(env_prefix, "bin"))
+prepend_path("LD_LIBRARY_PATH", pathJoin(env_prefix, "lib"))
+
+setenv("CONDA_PREFIX", env_prefix)
+setenv("CONDA_DEFAULT_ENV", env)
 EOF
+```
+
+#### Verify
+
+```bash
+module load rapids/{{ rapids_version }}
+python -c "import cudf; print(cudf.__version__)"
+```
+
+### Containers

+Many HPC clusters support running containers through runtimes such as
+[Apptainer](https://apptainer.org/) (formerly Singularity),
+[Pyxis](https://github.com/NVIDIA/pyxis) + [Enroot](https://github.com/NVIDIA/enroot),
+[Podman](https://podman.io/), or
+[Charliecloud](https://hpc.github.io/charliecloud/). This is an alternative
+to environment modules, as the RAPIDS container image ships with CUDA and all
+RAPIDS libraries pre-installed and does not need any additional configuration.
+
+Check with your cluster admin which container runtime is available. The
+examples below cover Apptainer and Pyxis + Enroot, two of the most common
+setups on HPC clusters.
+
+#### Apptainer
+
+[Apptainer](https://apptainer.org/) is a container runtime designed for HPC. 
+Pull the RAPIDS container image once on a login node; `apptainer pull`
+converts it to a local SIF file that your jobs can reuse without
+re-downloading.
+
+```bash
+apptainer pull rapids.sif docker://{{ rapids_container }}
+```
+
+#### Pyxis + Enroot
+
+[Enroot](https://github.com/NVIDIA/enroot) is NVIDIA's lightweight container
+runtime for HPC. [Pyxis](https://github.com/NVIDIA/pyxis) is a SLURM plugin
+that integrates Enroot with SLURM, adding `--container-*` flags to `srun` and
+`sbatch` so you can launch containerized jobs directly through the scheduler.
+Pyxis + Enroot is pre-installed on many GPU clusters including NVIDIA DGX
+systems.
+
+Import the RAPIDS container image as a squashfs file. We recommend
+pre-importing large images to avoid re-downloading on every job.

-python /tmp/dask-cudf-example.py
+Note that Enroot uses `#` instead of `/` to separate the registry hostname
+from the image path.
+
+```bash
+enroot import --output rapids.sqsh 'docker://{{ rapids_container.replace("/", "#", 1) }}'
+```
+
+## Run a Single GPU Job
+
+[cudf.pandas](https://docs.rapids.ai/api/cudf/stable/cudf_pandas/) lets you
+accelerate existing pandas code on a GPU with no code changes. You run your
+script with `python -m cudf.pandas` instead of `python` and pandas operations
+are automatically dispatched to the GPU.
+
+The following example uses pandas to generate and aggregate a small
+synthetic dataset. 
+ +```python +# my_script.py +import pandas as pd + +df = pd.DataFrame({"x": range(1_000_000), "y": range(1_000_000)}) +df["group"] = df["x"] % 100 +result = df.groupby("group").agg(["mean", "sum", "count"]) +print(result) ``` -### Confirm Output +### Interactive -Putting the above together will result in the following output: +#### With modules ```bash - x y - mean sum count mean sum count -id name -1077 Laura 0.028305 1.868120 66 -0.098905 -6.527731 66 -1026 Frank 0.001536 1.414839 921 -0.017223 -15.862306 921 -1082 Patricia 0.072045 3.602228 50 0.081853 4.092667 50 -1007 Wendy 0.009837 11.676199 1187 0.022978 27.275216 1187 -976 Wendy -0.003663 -3.267674 892 0.008262 7.369577 892 -... ... ... ... ... ... ... -912 Michael 0.012409 0.459119 37 0.002528 0.093520 37 -1103 Ingrid -0.132714 -1.327142 10 0.108364 1.083638 10 -998 Tim 0.000587 0.747745 1273 0.001777 2.262094 1273 -941 Yvonne 0.050258 11.358393 226 0.080584 18.212019 226 -900 Michael -0.134216 -1.073729 8 0.008701 0.069610 8 - -[6449 rows x 6 columns] -``` - -

+srun -p gpu --gres=gpu:1 --pty bash +module load rapids/{{ rapids_version }} +python -m cudf.pandas my_script.py +``` + +#### With containers + +`````{tab-set} + +````{tab-item} Apptainer + +The `--nv` flag exposes the host GPU drivers to the container. + +```bash +srun -p gpu --gres=gpu:1 apptainer exec --nv rapids.sif \ + python -m cudf.pandas my_script.py +``` + +```` + +````{tab-item} Pyxis + Enroot + +The `--container-image` flag is provided by Pyxis. Use `--container-mounts` +to make your data and scripts available inside the container. + +```bash +srun -p gpu --gres=gpu:1 \ + --container-image=./rapids.sqsh \ + --container-mounts=$(pwd):/work --container-workdir=/work \ + python -m cudf.pandas /work/my_script.py +``` + +```` + +````` + +### Batch + +Write a SLURM batch script to run the same workload without an interactive +session. This is the typical workflow for production jobs. + +```bash +#!/usr/bin/env bash +#SBATCH --job-name=rapids-cudf +#SBATCH --gres=gpu:1 +#SBATCH --time=01:00:00 + +module load rapids/{{ rapids_version }} +python -m cudf.pandas my_script.py +``` + +```bash +sbatch rapids_job.sh +``` + +```{relatedexamples} + +``` From aaf45f7074d5c168a9d4223a15b38ddb3079fe72 Mon Sep 17 00:00:00 2001 From: Jaya Venkatesh Date: Wed, 11 Feb 2026 08:23:55 -0800 Subject: [PATCH 2/3] added links to SLURM docs Signed-off-by: Jaya Venkatesh --- source/hpc.md | 14 ++++++++++---- 1 file changed, 10 insertions(+), 4 deletions(-) diff --git a/source/hpc.md b/source/hpc.md index 4430a4d3..b44be9a1 100644 --- a/source/hpc.md +++ b/source/hpc.md @@ -6,7 +6,7 @@ review_priority: "p1" RAPIDS can be deployed on HPC clusters managed by [SLURM](https://slurm.schedmd.com/). -## SLURM Basics +## SLURM SLURM is a job scheduler that manages access to compute nodes on HPC clusters. Instead of logging into a GPU machine directly, you ask SLURM for resources @@ -17,6 +17,8 @@ Nodes are organized into **partitions**, groups of machines with similar hardware. 
For example, your cluster might have a `gpu` partition with A100 nodes and a `cpu` partition with CPU-only nodes. +For a more comprehensive overview, see the [SLURM quickstart guide](https://slurm.schedmd.com/quickstart.html). + ### Partitions Check which partitions are available and what GPUs they have. The `-o` flag @@ -79,9 +81,13 @@ terminated and you lose your work. To avoid this, start a ```bash tmux new -s rapids srun -p gpu --gres=gpu:1 --time=01:00:00 --pty bash -# ... work ... -# Detach with Ctrl+b, d and your job keeps running -# Reattach later: +``` + +To detach from the tmux session without ending your job, press `Ctrl+b` +then `d`. Your interactive job continues running in the background. When +you reconnect via SSH, reattach to the session with: + +```bash tmux attach -t rapids ``` From 730d0d24283d3d358e5ee696dcbef3f40f554f53 Mon Sep 17 00:00:00 2001 From: Jaya Venkatesh Date: Wed, 11 Feb 2026 08:25:22 -0800 Subject: [PATCH 3/3] make review priority index Signed-off-by: Jaya Venkatesh --- source/hpc.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/source/hpc.md b/source/hpc.md index b44be9a1..06b45753 100644 --- a/source/hpc.md +++ b/source/hpc.md @@ -1,5 +1,5 @@ --- -review_priority: "p1" +review_priority: "index" --- # HPC