Commit 415897e

Merge pull request #138 from AI-Hypercomputer/rishabh/qwen
add qwen on slurm
2 parents 14184a9 + bd14735 commit 415897e

4 files changed

Lines changed: 301 additions & 1 deletion

File tree

training/a3ultra/llama3-1-70b/megatron-bridge-pretraining-slurm/4node-FP8CS-GBS1024/recipe/README.md
training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-slurm/16node-BF16-GBS4096/recipe/README.md
training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-slurm/16node-BF16-GBS4096/recipe/launch_script.sh
training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-slurm/16node-BF16-GBS4096/recipe/sbatch_script.sh

training/a3ultra/llama3-1-70b/megatron-bridge-pretraining-slurm/4node-FP8CS-GBS1024/recipe/README.md

Lines changed: 1 addition & 1 deletion
@@ -39,7 +39,7 @@ Set the environment variables to match your environment:
 export PROJECT_ID=<PROJECT_ID>
 export CLUSTER_REGION=<CLUSTER_REGION>
 export CLUSTER_NAME=<CLUSTER_NAME>
-gcloud compute ssh $CLUSTER_NAME --project supercomputer-testing --zone $CLUSTER_REGION -- -o Hostname=nic0.$CLUSTER_NAME.$CLUSTER_REGION.c.$PROJECT_ID.internal.gcpnode.com
+gcloud compute ssh $CLUSTER_NAME --project <project-name> --zone $CLUSTER_REGION -- -o Hostname=nic0.$CLUSTER_NAME.$CLUSTER_REGION.c.$PROJECT_ID.internal.gcpnode.com
 
 ```
 
training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-slurm/16node-BF16-GBS4096/recipe/README.md

Lines changed: 100 additions & 0 deletions
<!-- mdformat global-off -->
# Pretrain Qwen 3 235B workloads on A4 Slurm Cluster with NVIDIA Megatron-Bridge

This recipe outlines the steps for running a Qwen 3 235B pretraining workload on [Google Cloud A4 Slurm clusters](https://docs.cloud.google.com/ai-hypercomputer/docs/create/create-slurm-cluster) by using [NVIDIA Megatron-Bridge](https://github.com/NVIDIA-NeMo/Megatron-Bridge).

## Orchestration and deployment tools

For this recipe, the following setup is used:

- Orchestration - [Slurm Workload Manager](https://slurm.schedmd.com/)
- Deployment - [Cluster Toolkit](https://cloud.google.com/cluster-toolkit/docs/overview)

## Test environment

This recipe has been optimized for and tested with the following configuration:

- A4 Slurm cluster (16 nodes, 128 GPUs)
- Machine type: `a4-highgpu-8g`
- Lustre filesystem

Follow the instructions in the [Cluster Toolkit A4 example README](https://github.com/GoogleCloudPlatform/cluster-toolkit/blob/main/examples/machine-learning/a4-highgpu-8g) to provision an A4 Slurm cluster.

## Docker container image

This recipe uses the following container image:

- `nvcr.io/nvidia/nemo:25.11`

## Run the recipe
### Configure environment settings

Set the environment variables to match your environment:

```bash
export PROJECT_ID=<PROJECT_ID>
export CLUSTER_REGION=<CLUSTER_REGION>
export CLUSTER_NAME=<CLUSTER_NAME>
gcloud compute ssh $CLUSTER_NAME --project <project-name> --zone $CLUSTER_REGION -- -o Hostname=nic0.$CLUSTER_NAME.$CLUSTER_REGION.c.$PROJECT_ID.internal.gcpnode.com
```

Replace the following values:

- `<PROJECT_ID>`: your Google Cloud project ID.
- `<CLUSTER_REGION>`: the region where your cluster is located.
- `<CLUSTER_NAME>`: the name of your Slurm cluster.
- `<project-name>`: the name of your Google Cloud project.

Set the default project:

```bash
gcloud config set project $PROJECT_ID
```
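
To confirm the active project, you can optionally read the setting back:

```bash
gcloud config get-value project
```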

From your cluster login node, complete the following steps:

### Get the recipe

Clone the `gpu-recipes` repository and set a reference to the recipe folder.

```bash
git clone https://github.com/ai-hypercomputer/gpu-recipes.git
cd gpu-recipes
export REPO_ROOT=`git rev-parse --show-toplevel`
export RECIPE_ROOT=$REPO_ROOT/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-slurm/16node-BF16-GBS4096/recipe
cd $RECIPE_ROOT
```

### Submit a pretraining job

```bash
# Set your HF_TOKEN inside launch_script.sh.
export HF_TOKEN="YOUR_HF_TOKEN"  # Replace with your Hugging Face token.

cd ..
sbatch ./recipe/sbatch_script.sh
```
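
If submission succeeds, `sbatch` replies with the ID of the new job; keep it for the monitoring and cancellation commands below:

```
Submitted batch job <JOB_ID>
```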

### Monitor the job

To check the status of your job, run the following command:

```bash
squeue --me
```

To get the logs for the job, run the following command:

```bash
tail -f slurm-<JOB_ID>.out
```
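
After the job leaves the queue, `sacct` can summarize its outcome; a minimal sketch, with `<JOB_ID>` being the ID printed by `sbatch`:

```bash
# Show the job's final state, exit code, and elapsed time.
sacct -j <JOB_ID> --format=JobID,JobName%30,State,ExitCode,Elapsed
```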

### Cancel the job

```bash
scancel -u $USER
```
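
Note that `scancel -u $USER` cancels all of your queued and running jobs. To stop only the pretraining job, pass its job ID instead:

```bash
scancel <JOB_ID>
```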

training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-slurm/16node-BF16-GBS4096/recipe/launch_script.sh

Lines changed: 148 additions & 0 deletions
#!/bin/bash

usage()
{
cat << EOF
usage: bash ./launch_script.sh [config-override [config-override ...]]
       config-override  (Optional) A NeMo configuration override. E.g. trainer.max_steps=10000.
EOF
}

parse_args() {
  while [[ "$1" != "" ]]; do
    case $(grep -o "=" <<< "$1" | wc -l) in
      1 )
        config_overrides+=("$1")
        ;;
      * )
        echo "Invalid config override: $1"
        usage
        exit 1
    esac
    shift
  done
  config_overrides="${config_overrides[*]}"
}

config_overrides=()
parse_args "$@"

if [[ -z "${config_overrides[*]}" ]]; then
  echo "No NeMo config overrides specified"
else
  echo "NeMo config overrides:"
  echo "  ${config_overrides}"
fi
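
# Example invocation with overrides (each override must contain exactly one "="):
#   bash ./launch_script.sh trainer.max_steps=10000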

# Make the CUDA compat libraries and the NCCL plugin visible to the dynamic linker.
export LD_LIBRARY_PATH="/usr/local/cuda/compat/lib:$NCCL_PLUGIN_PATH:$LD_LIBRARY_PATH"
ldconfig "$LD_LIBRARY_PATH"
echo "Added $LD_LIBRARY_PATH to ldconfig:"
ldconfig -p | grep libcuda | sed 's/^/  /'
echo ""

if [[ -n "${EXPLICIT_LOG_DIR}" ]]; then
  explicit_log_dir=${EXPLICIT_LOG_DIR}
else
  # Default to an absolute path so later `cd`s do not change where logs land.
  explicit_log_dir=$(pwd)/workload_logs
fi
echo "Logging to ${explicit_log_dir}"

if [[ -n "${TOKENIZER_PATH}" ]]; then
  echo "Getting tokenizer files"
  cp "${TOKENIZER_PATH}"/* .
  echo ""
fi

echo "Launching Torch distributed on node rank $JOB_COMPLETION_INDEX out of $NNODES nodes"

pip install git+https://github.com/NVIDIA/dllogger#egg=dllogger

# Create the log output directories.
mkdir -p "${explicit_log_dir}/nsys" "${explicit_log_dir}/${JOB_IDENTIFIER}"

# Collect version diagnostics into a single JSON line.
kv="\"kernel_version\": \"$(uname --kernel-release)\""
if command -v nvidia-smi &> /dev/null; then
  cuda_v=$(nvidia-smi -q -x | grep -Po '(?<=<cuda_version>).*(?=</cuda_version>)' || true)
  driver_v=$(nvidia-smi -q -x | grep -Po '(?<=<driver_version>).*(?=</driver_version>)' || true)
  vbios_v=$(nvidia-smi -q -x | grep -Po '(?<=<vbios_version>).*(?=</vbios_version>)' | head -n1 || true)
  kv="${kv}, \"cuda_version\": \"${cuda_v}\""
  kv="${kv}, \"driver_version\": \"${driver_v}\""
  kv="${kv}, \"vbios_version\": \"${vbios_v}\""
fi
echo "VERSION_DIAGNOSTICS: {${kv}}"

export HF_TOKEN="<HF_TOKEN>"  # Replace with your Hugging Face token.

# Pin Megatron-Bridge to a known-good commit.
cd /opt
rm -rf Megatron-Bridge
git clone https://github.com/NVIDIA-NeMo/Megatron-Bridge.git
cd Megatron-Bridge
git checkout 7695d4acbfac19353d20e456509117efe4733d6b
ls
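
# The per-rank command below is what each torchrun worker executes: numactl
# binds each local rank's CPU and memory to its NUMA node (ranks 0-3 to node 0,
# ranks 4-7 to node 1), nsys wraps training in a profiling session that only
# records between cudaProfilerStart/Stop (capture-range=cudaProfilerApi), and
# the Megatron-Bridge performance script runs the Qwen3 235B recipe.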
worker_command=$(cat <<- EOM
if [ "\$RANK" -eq "0" ]; then
  echo "Worker 0 is stalling for a few seconds..." ;
  sleep 3 ;
  echo "The detected environment within worker rank 0 is:" ;
  env | sed 's/^/  /' ;
fi ;

cd /opt/Megatron-Bridge ;

numactl \
  --cpunodebind=\$((LOCAL_RANK/4)) \
  --membind=\$((LOCAL_RANK/4)) nsys profile \
  -t nvtx,cuda \
  --cuda-event-trace=false \
  --sample=none \
  --capture-range=cudaProfilerApi \
  --capture-range-end=stop \
  --kill none \
  -o "${explicit_log_dir}/$JOB_IDENTIFIER/rank-\$RANK" \
  --force-overwrite true \
  --session-new "nsys-\$RANDOM-\$RANK" \
  nice -10 \
  python scripts/performance/run_script.py \
    --gpu b200 \
    --model_family_name qwen \
    --model_recipe_name qwen3_235b_a22b \
    --gpus_per_node 8 \
    --num_gpus 128 \
    --seq_length 4096 \
    --compute_dtype bf16 \
    --global_batch_size 4096 \
    --tensor_model_parallel_size 1 \
    --pipeline_model_parallel_size 8 \
    --virtual_pipeline_model_parallel_size 4 \
    --expert_model_parallel_size 8 \
    --expert_tensor_parallel_size 1 \
    --moe_a2a_overlap True \
    --max_steps 30
EOM
)

echo "$worker_command" > worker_command.sh
chmod 777 worker_command.sh
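
# One torchrun launcher per node: 8 local workers rendezvous with their peers
# at MASTER_ADDR:MASTER_PORT; the node rank comes from JOB_COMPLETION_INDEX,
# which the sbatch wrapper derives from SLURM_NODEID.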
torchrun \
  --nproc-per-node="8" \
  --nnodes="16" \
  --node_rank="${JOB_COMPLETION_INDEX}" \
  --rdzv_id="${JOB_IDENTIFIER}" \
  --master_addr="${MASTER_ADDR}" \
  --master_port="${MASTER_PORT}" \
  --no-python bash worker_command.sh

# Node 0 copies the logs and an environment snapshot to the artifact directory.
if [[ "$JOB_COMPLETION_INDEX" == "0" ]]; then
  mkdir -p "${ARTIFACT_DIR}"
  cp -r "${explicit_log_dir}"/* "${ARTIFACT_DIR}/"
  env > "${ARTIFACT_DIR}/environ.txt"
  ls "${ARTIFACT_DIR}"
fi
echo "Training completed"
echo "Node $(hostname --fqdn) is exiting"
training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-slurm/16node-BF16-GBS4096/recipe/sbatch_script.sh

Lines changed: 52 additions & 0 deletions
#!/bin/bash
#SBATCH --job-name=qwen3_235b_bf16_b200_128gpus-8jje
#SBATCH --nodes=16
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:8
#SBATCH --mem=0

# Exit early on failures.
set -e

# Validate that the recipe location is set up correctly.
# The recipe is expected to be in the "recipe" folder inside the current working directory.
RECIPE_DIR="$(pwd)/recipe"
LAUNCH_SCRIPT="${RECIPE_DIR}/launch_script.sh"
if [[ ! -f "${LAUNCH_SCRIPT}" ]]; then
  echo "Error: recipe is not located correctly. It is expected to be in the 'recipe' folder inside the current working directory; the launch script was not found there." >&2
  exit 1
fi
chmod +x "${LAUNCH_SCRIPT}"

# Import the container image with enroot if it has not been imported already.
export ENROOT_CONFIG_PATH=${HOME}/.config/enroot
ORIG_IMAGE=nvcr.io#nvidia/nemo:25.11
SQSH_IMAGE_PATH=${RECIPE_DIR}/sqsh/nvcr.io_nvidia_nemo:25.11
if [[ ! -f "${SQSH_IMAGE_PATH}" ]]; then
  mkdir -p "$(dirname "${SQSH_IMAGE_PATH}")"
  echo "Importing $ORIG_IMAGE to ${SQSH_IMAGE_PATH}"
  enroot import --output "${SQSH_IMAGE_PATH}" -- "docker://${ORIG_IMAGE}"
fi

# Use the first node in the allocation as the rendezvous master.
master_addr=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
master_port=29500

ARTIFACT_DIR_HOME="/home/$USER/job_artifacts/${SLURM_JOB_ID}"
mkdir -p "$ARTIFACT_DIR_HOME"

export NNODES=$SLURM_NNODES
export MASTER_ADDR=$master_addr
export MASTER_PORT=$master_port
export ARTIFACT_DIR=/artifacts
export JOB_NAME=qwen3_235b_bf16_b200_128gpus-8jje
export JOB_IDENTIFIER=qwen3_235b_bf16_b200_128gpus-8jje
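
# Launch one task per node inside the container (Slurm with the pyxis/enroot
# plugin provides the --container-* flags). SLURM_NODEID is re-exported as
# JOB_COMPLETION_INDEX so launch_script.sh can compute its torchrun node rank.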
srun --container-image="$SQSH_IMAGE_PATH" \
  --container-mounts="${RECIPE_DIR}:/recipe:mkdir,${ARTIFACT_DIR_HOME}:${ARTIFACT_DIR}:mkdir" \
  --container-workdir=/recipe \
  --container-writable \
  bash -c 'export JOB_COMPLETION_INDEX=$SLURM_NODEID; ./launch_script.sh'