Commit d5d9f3b (parent 5caefd7): Add DeltaAI getting-started guide to Deployment section

2 files changed: 283 additions, 0 deletions

docs/deployment/deltaai.md (new file, 282 additions, 0 deletions)
---
sidebar_position: 3
title: DeltaAI (NCSA)
description: Getting started with IOWarp on the NCSA DeltaAI GH200 supercomputer.
---

# Getting Started with DeltaAI

> For IOWarp team members on project CIS250329.
> DeltaAI: NVIDIA GH200 Grace Hopper Supercomputer at NCSA.

---

## What is DeltaAI?

DeltaAI is a 152-node supercomputer at NCSA; each node packs four NVIDIA GH200 superchips (an H100 GPU paired with a Grace ARM CPU). Our allocation provides roughly 1,000 GPU Hours on H100s with 120 GB of HBM3 each.

**Important:** DeltaAI runs on **ARM (aarch64)** CPUs, not x86. This affects everything you compile.

---
## Step 1: Get Your Credentials

You need three things before you can log in:

### 1a. NCSA Username

Your PI or allocation manager has already added you to the project. Your NCSA username is typically your university NetID (e.g., `jdoe3`). Check with the PI if unsure.

### 1b. NCSA Kerberos Password

This is **separate** from your university password. Set it at:

https://identity.ncsa.illinois.edu/reset

Enter your NCSA username and follow the email verification flow.

### 1c. NCSA Duo MFA

You need a second factor for every login. The easiest method:

1. Go to https://duo.security.ncsa.illinois.edu
2. Generate **emergency backup recovery codes**
3. Save these codes somewhere safe; you'll type one each time you SSH in

Alternatively, install the Duo Mobile app and enroll your phone.

---
## Step 2: SSH In

```bash
ssh YOUR_USERNAME@dtai-login.delta.ncsa.illinois.edu
```

You'll be prompted for:
1. Your NCSA Kerberos password
2. A Duo passcode (type a recovery code, or `1` for a push notification)

### Pro tip: Use tmux for persistent sessions

A tmux session lives on the specific login node where you started it, while `dtai-login` may land you on any login node. Run `hostname` after logging in and note which node you're on (e.g., `gh-login04`), so you can reconnect to that same node later.

```bash
# After logging in, immediately start tmux
tmux new -s work

# If you disconnect, reconnect to the SAME login node, then reattach:
ssh YOUR_USERNAME@gh-login04.delta.ncsa.illinois.edu
tmux attach -t work
```

### SSH config shortcut

Add this to your `~/.ssh/config`:
```
Host delta-ai
    HostName dtai-login.delta.ncsa.illinois.edu
    User YOUR_USERNAME
    PreferredAuthentications keyboard-interactive,password
    ServerAliveInterval 60
    ServerAliveCountMax 3
```

Then just: `ssh delta-ai`

---
## Step 3: Understand Your Storage

| Path | Quota | Use For |
|------|-------|---------|
| `/u/YOUR_USERNAME` | ~100 GB | Dotfiles, scripts, small configs |
| `/work/hdd/bekn/YOUR_USERNAME/` | 1 TB | **Your primary workspace**: code, builds, data |
| `/work/nvme/bekn/` | 500 GB | Fast I/O scratch (shared across team) |
| `/projects/bekn/` | 500 GB | Shared project files |
| `/tmp` | 3.9 TB | Compute-node-local scratch (deleted after your job ends) |

**Rule of thumb:** Do everything in `/work/hdd/bekn/YOUR_USERNAME/`. Home is too small for builds.

Check your quota with: `quota`
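To follow the rule of thumb without hard-coding paths in every script, you can define the table's locations once; the variable names (`WORKSPACE`, `SCRATCH`, `PROJECT`) are our own convention, not anything DeltaAI sets for you:

```shell
# Our convention (not system-defined): one variable per storage area.
export WORKSPACE="/work/hdd/bekn/$USER"   # primary workspace
export SCRATCH="/work/nvme/bekn"          # fast shared scratch
export PROJECT="/projects/bekn"           # shared project files

# On DeltaAI, create your workspace once (guarded so the line is
# harmless on machines without /work):
[ -d /work ] && mkdir -p "$WORKSPACE"

echo "workspace: $WORKSPACE"
```

Putting these lines in `~/.bashrc` keeps job scripts and build commands short and consistent across the team.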
---

## Step 4: Run Your First Job

### Interactive session (for exploration)

```bash
srun --account=bekn-dtai-gh --partition=ghx4-interactive \
     --nodes=1 --gpus-per-node=1 --cpus-per-task=16 \
     --mem=64G --time=00:30:00 --pty bash
```

This gives you a shell on a compute node with 1 GPU for 30 minutes.

Once on the compute node:
```bash
nvidia-smi   # See your GPU (GH200 120GB)
uname -m     # Should print "aarch64"
```

### Batch job

Create `job.slurm`:
```bash
#!/bin/bash
#SBATCH --account=bekn-dtai-gh
#SBATCH --partition=ghx4
#SBATCH --nodes=1
#SBATCH --gpus-per-node=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=64G
#SBATCH --time=01:00:00
#SBATCH --job-name=my-experiment
#SBATCH --output=logs/%j.out
#SBATCH --error=logs/%j.err

# Load your environment
source ~/miniconda3/etc/profile.d/conda.sh
conda activate myenv

# Run your code
srun python train.py
```

Note: create the `logs/` directory (`mkdir -p logs`) before submitting. Slurm does not create output directories for you, and the job can fail or lose its output if the path doesn't exist.

- Submit: `sbatch job.slurm`
- Check status: `squeue -u $USER`
- Cancel: `scancel JOB_ID`

### Cost awareness

| Action | Cost |
|--------|------|
| 1 GPU for 1 hour (batch) | 1 GPU Hour |
| 1 GPU for 1 hour (interactive) | **2 GPU Hours** |
| Full node (4 GPUs) for 1 hour | 4 GPU Hours |

We have ~1,000 GPU Hours. Use interactive sessions for debugging and batch jobs for real work.
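The charge model above is simple enough to sanity-check before submitting; here is a quick sketch with hypothetical job sizes:

```shell
# Estimate GPU Hours for a hypothetical job.
gpus=2         # --gpus-per-node * --nodes
hours=3        # walltime, rounded up
interactive=2  # interactive sessions bill at 2x

batch_cost=$((gpus * hours))
interactive_cost=$((gpus * hours * interactive))
echo "batch: ${batch_cost} GPU Hours; interactive: ${interactive_cost} GPU Hours"
# prints: batch: 6 GPU Hours; interactive: 12 GPU Hours
```

At this rate a 1,000 GPU Hour allocation disappears quickly on full-node interactive sessions, which is why debugging should use single-GPU requests and short walltimes.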
---

## Step 5: Set Up Python / Conda

DeltaAI doesn't provide Anaconda. Install the aarch64 build of Miniconda:

```bash
curl -L -o /tmp/mc.sh https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-aarch64.sh
bash /tmp/mc.sh -b -p $HOME/miniconda3
source $HOME/miniconda3/etc/profile.d/conda.sh
conda init bash
```

Create an environment:
```bash
conda create -n myenv python=3.11 -y
conda activate myenv
conda install -c conda-forge pytorch numpy scipy matplotlib -y
```

For large environments, install to `/work` to stay under the HOME quota:
```bash
conda create --prefix /work/hdd/bekn/$USER/envs/myenv python=3.11 -y
```

Activate a prefix environment by its path: `conda activate /work/hdd/bekn/$USER/envs/myenv`

---
## Step 6: Build IOWarp Clio Core
184+
185+
:::warning ARM Architecture
186+
DeltaAI uses aarch64 ARM CPUs. The default system GCC is 7.5 (too old). You **must** use `gcc-13`/`g++-13` explicitly.
187+
:::
188+
189+
```bash
190+
# Activate conda with all deps
191+
source ~/miniconda3/etc/profile.d/conda.sh
192+
conda activate iowarp
193+
194+
# Clone
195+
cd /work/hdd/bekn/$USER
196+
git clone --recurse-submodules https://github.com/iowarp/clio-core.git
197+
cd clio-core
198+
199+
# Build (must use gcc-13 explicitly!)
200+
cmake \
201+
-DCMAKE_BUILD_TYPE=Release \
202+
-DCMAKE_C_COMPILER=/usr/bin/gcc-13 \
203+
-DCMAKE_CXX_COMPILER=/usr/bin/g++-13 \
204+
-DCMAKE_C_FLAGS="-I$CONDA_PREFIX/include" \
205+
-DCMAKE_CXX_FLAGS="-I$CONDA_PREFIX/include" \
206+
-DCMAKE_EXE_LINKER_FLAGS="-L$CONDA_PREFIX/lib" \
207+
-DCMAKE_SHARED_LINKER_FLAGS="-L$CONDA_PREFIX/lib" \
208+
-DCMAKE_PREFIX_PATH=$CONDA_PREFIX \
209+
-DCMAKE_INSTALL_PREFIX=$CONDA_PREFIX \
210+
-DWRP_CORE_ENABLE_RUNTIME=ON -DWRP_CORE_ENABLE_CTE=ON \
211+
-DWRP_CORE_ENABLE_CAE=ON -DWRP_CORE_ENABLE_CEE=ON \
212+
-DWRP_CORE_ENABLE_TESTS=OFF -DWRP_CORE_ENABLE_PYTHON=OFF \
213+
-DWRP_CORE_ENABLE_MPI=OFF -DWRP_CORE_ENABLE_IO_URING=OFF \
214+
-DWRP_CORE_ENABLE_ZMQ=ON -DWRP_CORE_ENABLE_CEREAL=ON \
215+
-DWRP_CORE_ENABLE_HDF5=ON \
216+
-Wno-dev -B build -G Ninja
217+
218+
cmake --build build -j16
219+
cmake --install build
220+
```
221+
222+
### Known build issues
223+
224+
- **msgpack cmake naming** — conda `msgpack-cxx` provides `msgpack-cxx-config.cmake` but CMake expects `msgpackConfig.cmake`. Create symlinks:
225+
```bash
226+
mkdir -p $CONDA_PREFIX/lib/cmake/msgpack
227+
ln -sf $CONDA_PREFIX/lib/cmake/msgpack-cxx/msgpack-cxx-config.cmake \
228+
$CONDA_PREFIX/lib/cmake/msgpack/msgpackConfig.cmake
229+
ln -sf $CONDA_PREFIX/lib/cmake/msgpack-cxx/msgpack-cxx-config-version.cmake \
230+
$CONDA_PREFIX/lib/cmake/msgpack/msgpackConfigVersion.cmake
231+
ln -sf $CONDA_PREFIX/lib/cmake/msgpack-cxx/msgpack-cxx-targets.cmake \
232+
$CONDA_PREFIX/lib/cmake/msgpack/msgpack-cxx-targets.cmake
233+
```
234+
- **No io_uring** — SLES 15.6 kernel may not support it. Disable with `-DWRP_CORE_ENABLE_IO_URING=OFF`.
235+
- **Ninja from conda** — system cmake is 3.20 (old). Install cmake + ninja from conda for better compatibility.
236+
237+
---
## Key Things to Remember

1. **This is ARM, not x86.** Binaries from your laptop won't run here. Compile everything on DeltaAI.
2. **No `mpirun`.** Use `srun` for everything.
3. **Use `gcc-13`/`g++-13` explicitly.** The default system GCC is 7.5 (too old).
4. **No SSH keys.** Password + Duo every time. Use tmux.
5. **Interactive = 2x cost.** Use batch jobs for anything longer than quick debugging.
6. **No backups on `/work`.** Only HOME has snapshots. Back up important work yourself.
7. **Keep builds off HOME.** Use `/work/hdd/bekn/YOUR_USERNAME/` for everything.
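For item 2, an `mpirun` invocation translates mechanically to `srun`; here is a sketch (the application name and rank count are hypothetical):

```shell
# Elsewhere you might run:   mpirun -np 8 ./my_app
# On DeltaAI, build the equivalent srun command instead:
ntasks=8
cmd="srun --ntasks=${ntasks} ./my_app"
echo "$cmd"
# prints: srun --ntasks=8 ./my_app
```

Inside a batch script the `--ntasks` value can also come from the `#SBATCH` directives, in which case a bare `srun ./my_app` inherits the job's allocation.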
---

## Useful Commands Cheat Sheet

```bash
accounts              # Check GPU hour balance
quota                 # Check storage usage
sinfo -a              # See partition status
squeue -u $USER       # Your running/queued jobs
scancel JOB_ID        # Cancel a job
nvidia-smi            # GPU status (compute nodes only)
module list           # Loaded software modules
module spider PACKAGE # Search for available software
```

## GPU Info

- NVIDIA GH200, 120 GB per superchip
- 4 superchips per node (4 GPUs)
- CUDA 12.8, Driver 570.172
- SM architecture: 9.0 (Hopper)
- Use `nvidia-smi` on compute nodes (there are no GPUs on login nodes)

## Getting Help

- **NCSA Support:** http://help.ncsa.illinois.edu or email help@ncsa.illinois.edu
- **DeltaAI Docs:** https://docs.ncsa.illinois.edu/systems/deltaai/en/latest/
- **Team Slack/Chat:** Ask the PI or allocation managers (Jaime, Luke)

## Required Acknowledgment

If you publish results using DeltaAI, include:

> "This research used the DeltaAI system at the National Center for Supercomputing Applications through allocation CIS250329 from the ACCESS program, supported by NSF award OAC 2320345."

sidebars.ts (1 addition, 0 deletions)

```diff
@@ -26,6 +26,7 @@ const sidebars: SidebarsConfig = {
       'deployment/configuration',
       'deployment/dashboard',
       'deployment/hpc-cluster',
+      'deployment/deltaai',
       'deployment/performance',
       'deployment/monitoring',
     ],
```
