diff --git a/source/hpc.md b/source/hpc.md
index bbfdae15..06b45753 100644
--- a/source/hpc.md
+++ b/source/hpc.md
@@ -4,129 +4,277 @@ review_priority: "index"

 # HPC

-RAPIDS works extremely well in traditional HPC (High Performance Computing) environments where GPUs are often co-located with accelerated networking hardware such as InfiniBand. Deploying on HPC often means using queue management systems such as SLURM, LSF, PBS, etc.
+RAPIDS can be deployed on HPC clusters managed by [SLURM](https://slurm.schedmd.com/).

 ## SLURM

-```{warning}
-This is a legacy page and may contain outdated information. We are working hard to update our documentation with the latest and greatest information, thank you for bearing with us.
-```
+SLURM is a job scheduler that manages access to compute nodes on HPC clusters.
+Instead of logging into a GPU machine directly, you ask SLURM for resources
+(CPUs, GPUs, memory, time) and it allocates a node for you when one becomes
+available.
+
+Nodes are organized into **partitions**, groups of machines with similar
+hardware. For example, your cluster might have a `gpu` partition with A100 nodes
+and a `cpu` partition with CPU-only nodes.

-If you are unfamiliar with SLURM or need a refresher, we recommend the [quickstart guide](https://slurm.schedmd.com/quickstart.html).
-Depending on how your nodes are configured, additional settings may be required such as defining the number of GPUs `(--gpus)` desired or the number of gpus per node `(--gpus-per-node)`.
-In the following example, we assume each allocation runs on a DGX1 with access to all eight GPUs.
+For a more comprehensive overview, see the [SLURM quickstart guide](https://slurm.schedmd.com/quickstart.html).

-### Start Scheduler
+### Partitions

-First, start the scheduler with the following SLURM script. This and the following scripts can deployed with `salloc` for interactive usage or `sbatch` for batched run.
+Check which partitions are available and what GPUs they have.
+The `-o` flag customizes the output format: `%P` shows the partition name,
+`%G` the generic resources (such as GPUs), `%D` the number of nodes, and `%T`
+the node state.

 ```bash
-#!/usr/bin/env bash
+sinfo -o "%P %G %D %T"
+PARTITION GRES NODES STATE
+gpu gpu:a100:4 10 idle
+gpu-dev gpu:v100:2 4 idle
+```
+
+Your cluster admin can tell you which partition to use. Throughout this guide
+we use `-p gpu`. Replace this with your partition name.

-#SBATCH -J dask-scheduler
-#SBATCH -n 1
-#SBATCH -t 00:10:00
+### Interactive Jobs

-module load cuda/11.0.3
-CONDA_ROOT=/nfs-mount/user/miniconda3
-source $CONDA_ROOT/etc/profile.d/conda.sh
-conda activate rapids
+An interactive job gives you a shell on a compute node where you can run
+commands directly. This is useful for development, debugging, and testing
+before submitting longer batch jobs.

-LOCAL_DIRECTORY=/nfs-mount/dask-local-directory
-mkdir $LOCAL_DIRECTORY
-CUDA_VISIBLE_DEVICES=0 dask-scheduler \
-    --protocol tcp \
-    --scheduler-file "$LOCAL_DIRECTORY/dask-scheduler.json" &
+Use `srun` to request a GPU node. The `--gres=gpu:1` flag requests one GPU,
+`--time` sets the maximum walltime, and `--pty bash` gives you a terminal.

-dask-cuda-worker \
-    --rmm-pool-size 14GB \
-    --scheduler-file "$LOCAL_DIRECTORY/dask-scheduler.json"
+```bash
+srun -p gpu --gres=gpu:1 --time=01:00:00 --pty bash
 ```

-Notice that we configure the scheduler to write a `scheduler-file` to a NFS accessible location. This file contains metadata about the scheduler and will
-include the IP address and port for the scheduler. The file will serve as input to the workers informing them what address and port to connect.
+This will queue until a node is available, then drop you into a shell on
+the allocated node.

-The scheduler doesn't need the whole node to itself so we can also start a worker on this node to fill out the unused resources.
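Once the shell starts, it is worth confirming what was actually allocated. A minimal sketch (the `SLURM_*` variables are standard exports SLURM sets inside a job; fallbacks are included here so the commands also run outside one):

```shell
# SLURM exports job metadata into the job's environment.
job_id="${SLURM_JOB_ID:-not inside a SLURM job}"
partition="${SLURM_JOB_PARTITION:-unknown}"
echo "job:       ${job_id}"
echo "partition: ${partition}"

# List the GPUs visible to this job (skipped if the NVIDIA driver is absent).
if command -v nvidia-smi >/dev/null 2>&1; then nvidia-smi -L; fi
```

Inside a real allocation, `nvidia-smi -L` should list exactly the GPUs you requested with `--gres`.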
+### Batch Jobs

-### Start Dask CUDA Workers
+For longer-running work, write a script and submit it with `sbatch`. SLURM
+runs the script when resources become available and you don't need to stay
+connected.

-Next start the other [dask-cuda workers](https://docs.rapids.ai/api/dask-cuda/~~~rapids_api_docs_version~~~/). Dask-CUDA extends the traditional Dask `Worker` class with specific options and enhancements for GPU environments. Unlike the scheduler and client, the workers script should be scalable and allow the users to tune how many workers are created.
-For example, we can scale the number of nodes to 3: `sbatch/salloc -N3 dask-cuda-worker.script` . In this case, because we have 8 GPUs per node and we have 3 nodes,
-our job will have 24 workers.
+```bash
+sbatch my_job.sh
+Submitted batch job 12345
+```
+
+Check the status of your jobs with `squeue`. The `-u` flag filters by your
+username.

 ```bash
-#!/usr/bin/env bash
+squeue -u $USER
 ```

-#SBATCH -J dask-cuda-workers
-#SBATCH -t 00:10:00
+### Keeping Sessions Alive

-module load cuda/11.0.3
-CONDA_ROOT=/nfs-mount/miniconda3
-source $CONDA_ROOT/etc/profile.d/conda.sh
-conda activate rapids
+If your SSH connection drops while in an interactive job, the job is
+terminated and you lose your work. To avoid this, start a
+[tmux](https://github.com/tmux/tmux) or
+[screen](https://www.gnu.org/software/screen/) session on the login node
+**before** requesting your interactive job.

-LOCAL_DIRECTORY=/nfs-mount/dask-local-directory
-mkdir $LOCAL_DIRECTORY
-dask-cuda-worker \
-    --rmm-pool-size 14GB \
-    --scheduler-file "$LOCAL_DIRECTORY/dask-scheduler.json"
+```bash
+tmux new -s rapids
+srun -p gpu --gres=gpu:1 --time=01:00:00 --pty bash
 ```

-### cuDF Example Workflow
+To detach from the tmux session without ending your job, press `Ctrl+b`
+then `d`. Your interactive job continues running in the background.
+When you reconnect via SSH, reattach to the session with:
+
+```bash
+tmux attach -t rapids
+```
+
+## Install RAPIDS
+
+### Environment Modules
+
+[Environment modules](https://modules.readthedocs.io/) are the standard way
+to manage software on HPC clusters. We'll create a
+[conda](https://docs.conda.io/) environment containing both CUDA and RAPIDS,
+then wrap it in an [Lmod](https://lmod.readthedocs.io/) module file so it can
+be loaded with a single command.
+
+We use conda here because it handles the CUDA toolkit and RAPIDS dependencies
+together, avoiding version conflicts with system libraries.

-Lastly, we can now run a job on the established Dask Cluster.
+```{note}
+Conda installs the CUDA **toolkit** (runtime libraries), but
+the NVIDIA **kernel driver** must already be installed on the cluster's compute
+nodes. This is typically managed by your cluster admin. You can verify the
+driver is available by running `nvidia-smi` on a compute node.
+```
+
+#### Create the environment

 ```bash
-#!/usr/bin/env bash
+conda create -n rapids-{{ rapids_version }} {{ rapids_conda_channels }} \
+    {{ rapids_conda_packages }}
+```
+
+#### Create the module file

-#SBATCH -J dask-client
-#SBATCH -n 1
-#SBATCH -t 00:10:00
+Place a module file in your cluster's module path so that users can load
+the environment. Set `conda_root` in the module file to the absolute path
+of your conda installation.
-module load cuda/11.0.3
-CONDA_ROOT=/nfs-mount/miniconda3
-source $CONDA_ROOT/etc/profile.d/conda.sh
-conda activate rapids
+```bash
+mkdir -p /opt/modulefiles/rapids
+cat << 'EOF' > /opt/modulefiles/rapids/{{ rapids_version }}.lua
+help([[RAPIDS {{ rapids_version }} - GPU-accelerated data science libraries.]])
+
+whatis("Name: RAPIDS")
+whatis("Version: {{ rapids_version }}")
+whatis("Description: GPU-accelerated data science libraries")

-LOCAL_DIRECTORY=/nfs-mount/dask-local-directory
+family("rapids")

-cat <<EOF >>/tmp/dask-cudf-example.py
-import cudf
-import dask.dataframe as dd
-from dask.distributed import Client
+local conda_root = ""  -- absolute path to your conda installation
+local env = "rapids-{{ rapids_version }}"
+local env_prefix = pathJoin(conda_root, "envs", env)

-client = Client(scheduler_file="$LOCAL_DIRECTORY/dask-scheduler.json")
-cdf = cudf.datasets.timeseries()
+prepend_path("PATH", pathJoin(env_prefix, "bin"))
+prepend_path("LD_LIBRARY_PATH", pathJoin(env_prefix, "lib"))

-ddf = dd.from_pandas(cdf, npartitions=10)
-res = ddf.groupby(['id', 'name']).agg(['mean', 'sum', 'count']).compute()
-print(res)
+setenv("CONDA_PREFIX", env_prefix)
+setenv("CONDA_DEFAULT_ENV", env)
 EOF
+```
+
+#### Verify
+
+```bash
+module load rapids/{{ rapids_version }}
+python -c "import cudf; print(cudf.__version__)"
+```
+
+### Containers
+
+Many HPC clusters support running containers through runtimes such as
+[Apptainer](https://apptainer.org/) (formerly Singularity),
+[Pyxis](https://github.com/NVIDIA/pyxis) + [Enroot](https://github.com/NVIDIA/enroot),
+[Podman](https://podman.io/), or
+[Charliecloud](https://hpc.github.io/charliecloud/). This is an alternative
+to environment modules, as the RAPIDS container image ships with CUDA and all
+RAPIDS libraries pre-installed and does not need any additional configuration.
+
+Check with your cluster admin which container runtime is available. The
+examples below cover Apptainer and Pyxis + Enroot, two of the most common
+setups on HPC clusters.
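You can also probe for a runtime yourself. A rough sketch that checks the login node's `PATH` for the runtime binaries (Pyxis is a scheduler plugin rather than a standalone binary, so look for `--container-*` options in `srun --help` instead):

```shell
# Probe the login node for common container runtime binaries.
for cmd in apptainer enroot podman ch-run; do
  if command -v "$cmd" >/dev/null 2>&1; then
    echo "$cmd: available"
  else
    echo "$cmd: not found"
  fi
done
```

This is only a heuristic; some sites hide runtimes behind modules (`module avail apptainer`), so the admin's answer is authoritative.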
+#### Apptainer
+
+[Apptainer](https://apptainer.org/) is a container runtime designed for HPC.
+Pull the RAPIDS container image to a local SIF file:
+
+```bash
+apptainer pull rapids.sif docker://{{ rapids_container }}
+```
+
+#### Pyxis + Enroot
+
+[Enroot](https://github.com/NVIDIA/enroot) is NVIDIA's lightweight container
+runtime for HPC. [Pyxis](https://github.com/NVIDIA/pyxis) is a SLURM plugin
+that integrates Enroot into SLURM, adding `--container-*` flags to `srun` and
+`sbatch` so you can launch containerized jobs directly through the scheduler.
+Pyxis + Enroot is pre-installed on many GPU clusters including NVIDIA DGX
+systems.
+
+Import the RAPIDS container image as a squashfs file. We recommend
+pre-importing large images to avoid re-downloading on every job.
+
+Note that Enroot uses `#` instead of `/` to separate the registry hostname
+from the image path.
+
+```bash
+enroot import --output rapids.sqsh 'docker://{{ rapids_container.replace("/", "#", 1) }}'
+```
+
+## Run a Single GPU Job
+
+[cudf.pandas](https://docs.rapids.ai/api/cudf/stable/cudf_pandas/) lets you
+accelerate existing pandas code on a GPU with no code changes. You run your
+script with `python -m cudf.pandas` instead of `python` and pandas operations
+are automatically dispatched to the GPU.
+
+The following example uses pandas to generate and aggregate a simple dataset.
+
+```python
+# my_script.py
+import pandas as pd
+
+df = pd.DataFrame({"x": range(1_000_000), "y": range(1_000_000)})
+df["group"] = df["x"] % 100
+result = df.groupby("group").agg(["mean", "sum", "count"])
+print(result)
+```
+
+### Interactive
+
+#### With modules
+
+```bash
+srun -p gpu --gres=gpu:1 --pty bash
+module load rapids/{{ rapids_version }}
+python -m cudf.pandas my_script.py
+```

-python /tmp/dask-cudf-example.py
+#### With containers
+
+`````{tab-set}
+
+````{tab-item} Apptainer
+
+The `--nv` flag exposes the host GPU drivers to the container.
+```bash
+srun -p gpu --gres=gpu:1 apptainer exec --nv rapids.sif \
+    python -m cudf.pandas my_script.py
 ```

-### Confirm Output
+````

-Putting the above together will result in the following output:
+````{tab-item} Pyxis + Enroot
+
+The `--container-image` flag is provided by Pyxis. Use `--container-mounts`
+to make your data and scripts available inside the container.

 ```bash
-                  x                         y
-               mean        sum count      mean        sum count
-id   name
-1077 Laura     0.028305   1.868120    66 -0.098905  -6.527731    66
-1026 Frank     0.001536   1.414839   921 -0.017223 -15.862306   921
-1082 Patricia  0.072045   3.602228    50  0.081853   4.092667    50
-1007 Wendy     0.009837  11.676199  1187  0.022978  27.275216  1187
-976  Wendy    -0.003663  -3.267674   892  0.008262   7.369577   892
-...                 ...        ...   ...       ...        ...   ...
-912  Michael   0.012409   0.459119    37  0.002528   0.093520    37
-1103 Ingrid   -0.132714  -1.327142    10  0.108364   1.083638    10
-998  Tim       0.000587   0.747745  1273  0.001777   2.262094  1273
-941  Yvonne    0.050258  11.358393   226  0.080584  18.212019   226
-900  Michael  -0.134216  -1.073729     8  0.008701   0.069610     8
+srun -p gpu --gres=gpu:1 \
+    --container-image=./rapids.sqsh \
+    --container-mounts=$(pwd):/work --container-workdir=/work \
+    python -m cudf.pandas /work/my_script.py
+```
+
+````
+
+`````
+
+### Batch

-[6449 rows x 6 columns]
+Write a SLURM batch script to run the same workload without an interactive
+session. This is the typical workflow for production jobs.
+
+```bash
+#!/usr/bin/env bash
+#SBATCH --job-name=rapids-cudf
+#SBATCH --partition=gpu
+#SBATCH --gres=gpu:1
+#SBATCH --time=01:00:00
+
+module load rapids/{{ rapids_version }}
+python -m cudf.pandas my_script.py
 ```
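By default, SLURM writes the script's stdout and stderr to `slurm-<jobid>.out` in the directory you submitted from. To collect logs somewhere predictable, add an output directive to the script (the `logs/` path here is an assumption to adapt; `%x` expands to the job name, `%j` to the job ID, and the directory must exist before the job starts):

```shell
#SBATCH --output=logs/%x-%j.out
```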

+Save the script as `rapids_job.sh` and submit it:
+
+```bash
+sbatch rapids_job.sh
+```
+
+```{relatedexamples}
+
+```
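Because `cudf.pandas` requires no code changes, the same logic can be sanity-checked with plain `python` on a CPU node before it is submitted to the GPU partition. A small sketch of such a check (the shapes asserted here are specific to the `my_script.py` example above):

```python
# cpu_check.py -- run with plain `python` to validate the aggregation logic
# before launching it on a GPU node with `python -m cudf.pandas`.
import pandas as pd

df = pd.DataFrame({"x": range(1_000_000), "y": range(1_000_000)})
df["group"] = df["x"] % 100
result = df.groupby("group").agg(["mean", "sum", "count"])

# 1,000,000 rows split across 100 groups -> 10,000 rows per group,
# and three aggregations over each of the two value columns.
assert result.shape == (100, 6)
assert (result[("x", "count")] == 10_000).all()
print(result.head())
```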