---
sidebar_position: 3
title: DeltaAI (NCSA)
description: Getting started with IOWarp on the NCSA DeltaAI GH200 supercomputer.
---

# Getting Started with DeltaAI

> For IOWarp team members on project CIS250329
> DeltaAI: NVIDIA GH200 Grace Hopper Supercomputer at NCSA

---

## What is DeltaAI?

DeltaAI is a 152-node supercomputer at NCSA; each node packs 4x NVIDIA GH200 superchips (an H100 GPU paired with a Grace ARM CPU). Our allocation provides ~1,000 GPU Hours on H100 GPUs with 120 GB of HBM3 each.

**Important:** DeltaAI runs on **ARM (aarch64)** CPUs, not x86. This affects everything you compile.

---

## Step 1: Get Your Credentials

You need three things before you can log in:

### 1a. NCSA Username
Your PI or allocation manager has already added you to the project. Your NCSA username is typically your university NetID (e.g., `jdoe3`). Check with the PI if unsure.

### 1b. NCSA Kerberos Password
This is **separate** from your university password. Set it at:

https://identity.ncsa.illinois.edu/reset

Enter your NCSA username and follow the email verification flow.

### 1c. NCSA Duo MFA
You need a second factor for every login. The easiest method:

1. Go to https://duo.security.ncsa.illinois.edu
2. Generate **emergency backup recovery codes**
3. Save these codes somewhere safe — you'll type one each time you SSH in

Alternatively, install the Duo Mobile app and enroll your phone.

---

## Step 2: SSH In

```bash
ssh YOUR_USERNAME@dtai-login.delta.ncsa.illinois.edu
```

You'll be prompted for:
1. Your NCSA Kerberos password
2. A Duo passcode (type a recovery code or `1` for a push notification)

### Pro tip: Use tmux for persistent sessions

tmux sessions live on the specific login node you landed on, so note its hostname (run `hostname`) before detaching, and reconnect to that node directly.

```bash
# After logging in, immediately start tmux
tmux new -s work

# Detach with Ctrl-b d. If you disconnect, reconnect to the SAME login
# node your session lives on, e.g.:
ssh YOUR_USERNAME@gh-login04.delta.ncsa.illinois.edu
tmux attach -t work
```

### SSH config shortcut

Add this to your `~/.ssh/config`:
```
Host delta-ai
    HostName dtai-login.delta.ncsa.illinois.edu
    User YOUR_USERNAME
    PreferredAuthentications keyboard-interactive,password
    ServerAliveInterval 60
    ServerAliveCountMax 3
```

Then just: `ssh delta-ai`

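Since there are no SSH keys here (password + Duo on every connection), OpenSSH connection multiplexing can reuse one authenticated connection for additional shells and `scp`/`rsync` transfers. A sketch using standard OpenSSH options — the `delta-ai-mux` alias is just an example name, and run `mkdir -p ~/.ssh/sockets` first:

```
Host delta-ai-mux
    HostName dtai-login.delta.ncsa.illinois.edu
    User YOUR_USERNAME
    PreferredAuthentications keyboard-interactive,password
    ControlMaster auto
    ControlPath ~/.ssh/sockets/%r@%h-%p
    ControlPersist 4h
```

While the master connection is alive, further `ssh delta-ai-mux` sessions reuse it and skip the password/Duo prompt. Whether multiplexing stays permitted is up to NCSA's server configuration, so treat this as a convenience to try, not a guarantee.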
---

## Step 3: Understand Your Storage

| Path | Quota | Use For |
|------|-------|---------|
| `/u/YOUR_USERNAME` | ~100 GB | Dotfiles, scripts, small configs |
| `/work/hdd/bekn/YOUR_USERNAME/` | 1 TB | **Your primary workspace** — code, builds, data |
| `/work/nvme/bekn/` | 500 GB | Fast I/O scratch (shared across team) |
| `/projects/bekn/` | 500 GB | Shared project files |
| `/tmp` | 3.9 TB | Compute-node-local scratch (deleted after your job ends) |

**Rule of thumb:** Do everything in `/work/hdd/bekn/YOUR_USERNAME/`. Home is too small for builds.

Check your quota: `quota`

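Since every build and dataset belongs under that long workspace path, it helps to export it once from your shell startup file. A minimal sketch — the `WORK_DIR` name is my own convention, not a DeltaAI one:

```shell
# Append to ~/.bashrc on DeltaAI. /work/hdd/bekn is the project's HDD
# workspace from the table above; $USER is your NCSA username.
export WORK_DIR="/work/hdd/bekn/${USER:-$(id -un)}"

# Example usage:
#   cd "$WORK_DIR"
#   mkdir -p "$WORK_DIR/builds"
echo "$WORK_DIR"
```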
---

## Step 4: Run Your First Job

### Interactive session (for exploration)

```bash
srun --account=bekn-dtai-gh --partition=ghx4-interactive \
     --nodes=1 --gpus-per-node=1 --cpus-per-task=16 \
     --mem=64G --time=00:30:00 --pty bash
```

This gives you a shell on a compute node with 1 GPU for 30 minutes.

Once on the compute node:
```bash
nvidia-smi   # See your GPU (GH200 120GB)
uname -m     # Should print "aarch64"
```

### Batch job

Create `job.slurm`:
```bash
#!/bin/bash
#SBATCH --account=bekn-dtai-gh
#SBATCH --partition=ghx4
#SBATCH --nodes=1
#SBATCH --gpus-per-node=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=64G
#SBATCH --time=01:00:00
#SBATCH --job-name=my-experiment
#SBATCH --output=logs/%j.out
#SBATCH --error=logs/%j.err

# Load your environment
source ~/miniconda3/etc/profile.d/conda.sh
conda activate myenv

# Run your code
srun python train.py
```

Create the log directory before submitting (`mkdir -p logs`) — Slurm will not create it for you, and the job fails when it cannot open its output files.

Submit: `sbatch job.slurm`
Check status: `squeue -u $USER`
Cancel: `scancel JOB_ID`

### Cost awareness

| Action | Cost |
|--------|------|
| 1 GPU for 1 hour (batch) | 1 GPU Hour |
| 1 GPU for 1 hour (interactive) | **2 GPU Hours** |
| Full node (4 GPUs) for 1 hour | 4 GPU Hours |

We have ~1,000 GPU Hours. Use interactive sessions for debugging, batch for real work.

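The charge model above reduces to GPUs × hours, doubled for interactive sessions. A throwaway shell helper for budgeting (my own sketch, not an NCSA tool; whole hours only, since it uses integer arithmetic):

```shell
# gpu_hours NUM_GPUS WALL_HOURS MODE  -> estimated GPU Hours charged.
# MODE is "batch" (1x) or "interactive" (2x), per the table above.
gpu_hours() {
  gpus=$1; hours=$2; mode=$3
  if [ "$mode" = "interactive" ]; then mult=2; else mult=1; fi
  echo $((gpus * hours * mult))
}

gpu_hours 1 4 batch         # 1 GPU for 4 hours in batch -> 4
gpu_hours 4 2 interactive   # full node for 2 hours interactive -> 16
```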
---

## Step 5: Set Up Python / Conda

DeltaAI doesn't ship Anaconda. Install Miniconda — note the **aarch64** installer, not x86_64:

```bash
curl -L -o /tmp/mc.sh https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-aarch64.sh
bash /tmp/mc.sh -b -p $HOME/miniconda3
source $HOME/miniconda3/etc/profile.d/conda.sh
conda init bash
```

Create an environment:
```bash
conda create -n myenv python=3.11 -y
conda activate myenv
conda install -c conda-forge pytorch numpy scipy matplotlib -y
```

For large environments, install to `/work` to avoid the HOME quota:
```bash
conda create --prefix /work/hdd/bekn/$USER/envs/myenv python=3.11 -y
conda activate /work/hdd/bekn/$USER/envs/myenv
```

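You can also make `/work` the default home for all new environments and the package cache via `~/.condarc`. These are standard conda settings (`envs_dirs`, `pkgs_dirs`); the paths below assume the `bekn` layout from Step 3:

```yaml
# ~/.condarc — keep env and package storage off the small HOME quota
envs_dirs:
  - /work/hdd/bekn/YOUR_USERNAME/envs
pkgs_dirs:
  - /work/hdd/bekn/YOUR_USERNAME/conda-pkgs
```

With this in place, plain `conda create -n myenv ...` lands under `/work` automatically.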
---

## Step 6: Build IOWarp Clio Core

:::warning ARM Architecture
DeltaAI uses aarch64 ARM CPUs. The default system GCC is 7.5 (too old). You **must** use `gcc-13`/`g++-13` explicitly.
:::

```bash
# Activate conda with all deps
source ~/miniconda3/etc/profile.d/conda.sh
conda activate iowarp

# Clone
cd /work/hdd/bekn/$USER
git clone --recurse-submodules https://github.com/iowarp/clio-core.git
cd clio-core

# Configure (must use gcc-13 explicitly!)
cmake \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_C_COMPILER=/usr/bin/gcc-13 \
  -DCMAKE_CXX_COMPILER=/usr/bin/g++-13 \
  -DCMAKE_C_FLAGS="-I$CONDA_PREFIX/include" \
  -DCMAKE_CXX_FLAGS="-I$CONDA_PREFIX/include" \
  -DCMAKE_EXE_LINKER_FLAGS="-L$CONDA_PREFIX/lib" \
  -DCMAKE_SHARED_LINKER_FLAGS="-L$CONDA_PREFIX/lib" \
  -DCMAKE_PREFIX_PATH=$CONDA_PREFIX \
  -DCMAKE_INSTALL_PREFIX=$CONDA_PREFIX \
  -DWRP_CORE_ENABLE_RUNTIME=ON -DWRP_CORE_ENABLE_CTE=ON \
  -DWRP_CORE_ENABLE_CAE=ON -DWRP_CORE_ENABLE_CEE=ON \
  -DWRP_CORE_ENABLE_TESTS=OFF -DWRP_CORE_ENABLE_PYTHON=OFF \
  -DWRP_CORE_ENABLE_MPI=OFF -DWRP_CORE_ENABLE_IO_URING=OFF \
  -DWRP_CORE_ENABLE_ZMQ=ON -DWRP_CORE_ENABLE_CEREAL=ON \
  -DWRP_CORE_ENABLE_HDF5=ON \
  -Wno-dev -B build -G Ninja

# Build and install
cmake --build build -j16
cmake --install build
```

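Because everything installs into the conda prefix rather than a system path, binaries run outside an activated env can fail with "cannot open shared object file". One common workaround is to point the runtime loader at the prefix (assumes the env from Step 5 is active):

```shell
# Prepend the conda env's lib directory to the loader search path.
# ${VAR:+...} keeps the old value only when it was already set.
export LD_LIBRARY_PATH="$CONDA_PREFIX/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
echo "$LD_LIBRARY_PATH"
```

Put this in the same shell (or batch script) that launches the binaries; it does not persist across logins unless added to `~/.bashrc`.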
### Known build issues

- **msgpack CMake naming** — conda's `msgpack-cxx` ships `msgpack-cxx-config.cmake`, but the build looks for `msgpackConfig.cmake`. Create symlinks:
  ```bash
  mkdir -p $CONDA_PREFIX/lib/cmake/msgpack
  ln -sf $CONDA_PREFIX/lib/cmake/msgpack-cxx/msgpack-cxx-config.cmake \
         $CONDA_PREFIX/lib/cmake/msgpack/msgpackConfig.cmake
  ln -sf $CONDA_PREFIX/lib/cmake/msgpack-cxx/msgpack-cxx-config-version.cmake \
         $CONDA_PREFIX/lib/cmake/msgpack/msgpackConfigVersion.cmake
  ln -sf $CONDA_PREFIX/lib/cmake/msgpack-cxx/msgpack-cxx-targets.cmake \
         $CONDA_PREFIX/lib/cmake/msgpack/msgpack-cxx-targets.cmake
  ```
- **No io_uring** — the SLES 15.6 kernel may not support it. Disable with `-DWRP_CORE_ENABLE_IO_URING=OFF` (already set in the configure command above).
- **CMake/Ninja from conda** — the system CMake is 3.20 (old). Install `cmake` and `ninja` from conda-forge for better compatibility.

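Before running the configure step, a quick sanity check catches the two most common mistakes (wrong architecture, missing `gcc-13`). A sketch of my own — on a DeltaAI node with the env active, all three checks should report OK:

```shell
# Pre-build sanity checks for the clio-core build on DeltaAI.
arch=$(uname -m)
if [ "$arch" = "aarch64" ]; then
  echo "OK: building on ARM ($arch)"
else
  echo "WARN: expected aarch64, got $arch (are you on DeltaAI?)"
fi

if command -v gcc-13 >/dev/null 2>&1; then
  echo "OK: gcc-13 found ($(gcc-13 -dumpversion))"
else
  echo "WARN: gcc-13 not on PATH; the build needs it explicitly"
fi

if [ -n "$CONDA_PREFIX" ]; then
  echo "OK: conda env active at $CONDA_PREFIX"
else
  echo "WARN: no conda env active; run 'conda activate iowarp' first"
fi
```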
---

## Key Things to Remember

1. **This is ARM, not x86.** Binaries from your laptop won't run here. Compile everything on DeltaAI.
2. **No `mpirun`.** Use `srun` for everything.
3. **Use `gcc-13`/`g++-13` explicitly.** The default system GCC is 7.5 (too old).
4. **No SSH keys.** Password + Duo every time. Use tmux.
5. **Interactive = 2x cost.** Use batch jobs for anything longer than quick debugging.
6. **No backups on `/work`.** Only HOME has snapshots. Back up important work yourself.
7. **Keep builds off HOME.** Use `/work/hdd/bekn/YOUR_USERNAME/` for everything.

---

## Useful Commands Cheat Sheet

```bash
accounts              # Check GPU hour balance
quota                 # Check storage usage
sinfo -a              # See partition status
squeue -u $USER       # Your running/queued jobs
scancel JOB_ID        # Cancel a job
nvidia-smi            # GPU status (compute nodes only)
module list           # Loaded software modules
module spider PACKAGE # Search for available software
```

## GPU Info

- NVIDIA GH200 with 120 GB HBM3 per superchip
- 4 superchips per node (4 GPUs)
- CUDA 12.8, driver 570.172
- SM architecture: 9.0 (Hopper)
- Use `nvidia-smi` on compute nodes (no GPUs on login nodes)

## Getting Help

- **NCSA Support:** http://help.ncsa.illinois.edu or email help@ncsa.illinois.edu
- **DeltaAI Docs:** https://docs.ncsa.illinois.edu/systems/deltaai/en/latest/
- **Team Slack/Chat:** Ask the PI or allocation managers (Jaime, Luke)

## Required Acknowledgment

If you publish results using DeltaAI, include:

> "This research used the DeltaAI system at the National Center for Supercomputing Applications through allocation CIS250329 from the ACCESS program, supported by NSF award OAC 2320345."