
Commit c0caacc

Merge pull request #136 from AI-Hypercomputer/wan2_1_14B_4node_b200

2 parents 0485fa3 + 7f03353 commit c0caacc

9 files changed: 754 additions & 0 deletions

Lines changed: 20 additions & 0 deletions
# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: v2
name: a4_jobset_workload
description: a4_jobset_workload
type: application
version: 0.1.0
appVersion: "1.16.0"
Lines changed: 147 additions & 0 deletions
<!-- mdformat global-off -->
# Pretrain wan2-1-14b-bf16-gbs64-gpus32 workloads on a4 GKE node pools with NVIDIA DFM & Megatron-Bridge

This recipe outlines the steps for running a wan2-1-14b-bf16-gbs64-gpus32 pretraining workload on a4 GKE node pools by using NeMo DFM (Diffusion Foundation Models) and Megatron-Bridge within the NeMo Framework.

## Orchestration and deployment tools

For this recipe, the following setup is used:

- Orchestration - [Google Kubernetes Engine (GKE)](https://cloud.google.com/kubernetes-engine)
- Pretraining job configuration and deployment - A Helm chart is used to configure and deploy the Kubernetes JobSet resource, which manages the execution of the DFM pretraining workload.

## Test environment

This recipe has been optimized for and tested with the following configuration:

- GKE cluster: follow the Cluster Toolkit [instructions](https://github.com/GoogleCloudPlatform/cluster-toolkit/tree/main/examples/gke-a4) to create your a4 GKE cluster.
- Node configuration: 4 nodes (8 GPUs per node, 32 GPUs total).
- GPU architecture: NVIDIA Blackwell B200.
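
The workload name encodes a global batch size of 64 on 32 GPUs. Combined with the parallelism settings applied in `launcher.sh` (tensor parallel 1, context parallel 2), the implied data-parallel layout can be sketched as follows; treating pipeline parallelism as 1 is an assumption, since the launcher does not set it:

```bash
# Implied parallel layout for this recipe. TP and CP match launcher.sh;
# PP=1 is an assumption (the launcher leaves pipeline parallelism unset).
NNODES=4; GPUS_PER_NODE=8
TP=1; CP=2; PP=1
GBS=64
WORLD_SIZE=$((NNODES * GPUS_PER_NODE))      # 32 GPUs in total
DP=$((WORLD_SIZE / (TP * CP * PP)))         # 16 data-parallel replicas
SAMPLES_PER_REPLICA=$((GBS / DP))           # 4 samples per replica per step
echo "world_size=$WORLD_SIZE dp=$DP samples_per_replica=$SAMPLES_PER_REPLICA"
```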

## Training dataset

This recipe uses a mock pretraining dataset. To support long-duration stress testing, the launcher patches the WanMockDataModule via sed to extend the mock data length from 1,024 to effectively infinite (10^12).
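
The patch is a plain textual substitution. As a minimal illustration of what the launcher's sed command does (the sample line below is hypothetical; the real target is `wan_mock_datamodule.py` inside the container):

```bash
# Apply the same substitution the launcher uses to a sample line.
sample='self.dataset = _MockDataset(length=1024)'
patched=$(echo "$sample" | sed 's/length=1024/length=10**12/g')
echo "$patched"   # prints: self.dataset = _MockDataset(length=10**12)
```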

## Docker container images

This recipe uses the following Docker images:

- `nvcr.io/nvidia/nemo:25.11.00`
- `us-docker.pkg.dev/gce-ai-infra/gpudirect-gib/nccl-plugin-gib-arm64:v1.1.1`

## Run the recipe

From your client workstation, complete the following steps:

### Configure environment settings

Set the environment variables to match your environment:

```bash
export PROJECT_ID=<PROJECT_ID>
export CLUSTER_REGION=<CLUSTER_REGION>
export CLUSTER_NAME=<CLUSTER_NAME>
export GCS_BUCKET=<GCS_BUCKET> # Note: the path should not be prefixed with gs://
export KUEUE_NAME=<KUEUE_NAME>
```

Replace the following values:

- `<PROJECT_ID>`: your Google Cloud project ID.
- `<CLUSTER_REGION>`: the region where your cluster is located.
- `<CLUSTER_NAME>`: the name of your GKE cluster.
- `<GCS_BUCKET>`: the name of your Cloud Storage bucket. Don't include the `gs://` prefix.
- `<KUEUE_NAME>`: the name of the Kueue local queue. The default queue created by the Cluster Toolkit is `a4`.

Set the default project:

```bash
gcloud config set project $PROJECT_ID
```

### Get cluster credentials

```bash
gcloud container clusters get-credentials $CLUSTER_NAME --region $CLUSTER_REGION
```

### Get the recipe

Clone the `gpu-recipes` repository and set a reference to the recipe folder:

```bash
git clone https://github.com/ai-hypercomputer/gpu-recipes.git
cd gpu-recipes
export REPO_ROOT=`git rev-parse --show-toplevel`
export RECIPE_ROOT=$REPO_ROOT/training/a4/wan2-1-14b/nemo-pretraining-gke/4node-BF16-GBS64/recipe
cd $RECIPE_ROOT
```

### Configure and submit a pretraining job

#### Using 4 nodes (32 GPUs) with BF16 precision

To execute the job with the default settings, run the following command from your client:

```bash
cd $RECIPE_ROOT
export WORKLOAD_NAME=$USER-a4-wan2-1-14b-4node
helm install $WORKLOAD_NAME . -f values.yaml \
  --set-file workload_launcher=launcher.sh \
  --set workload.image=nvcr.io/nvidia/nemo:25.11.00 \
  --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
  --set volumes.gcsMounts[0].mountPath=/job-logs \
  --set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \
  --set queue=${KUEUE_NAME}
```
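
The `--set` flags above override keys that the chart's `values.yaml` exposes. Inferred purely from those flags (a sketch, not the actual file shipped with the recipe), the relevant part of `values.yaml` plausibly has this shape:

```yaml
# Hypothetical values.yaml fragment inferred from the --set overrides;
# the file in the recipe folder is authoritative.
queue: null                  # Kueue local queue name
workload:
  image: nvcr.io/nvidia/nemo:25.11.00
  envs:
    - value: null            # log directory, e.g. /job-logs/<WORKLOAD_NAME>
volumes:
  gcsMounts:
    - bucketName: null       # Cloud Storage bucket, without gs://
      mountPath: /job-logs
```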

**Examples**

- To set the number of training steps to 100, run the following command from your client:

  ```bash
  cd $RECIPE_ROOT
  export WORKLOAD_NAME=$USER-a4-wan2-1-14b-4node
  helm install $WORKLOAD_NAME . -f values.yaml \
    --set-file workload_launcher=launcher.sh \
    --set workload.image=nvcr.io/nvidia/nemo:25.11.00 \
    --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
    --set volumes.gcsMounts[0].mountPath=/job-logs \
    --set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \
    --set queue=${KUEUE_NAME} \
    --set workload.arguments[0]="train.train_iters=100"
  ```

### Monitor the job

To check the status of the pods in your job, run the following command:

```bash
kubectl get pods | grep $USER-a4-wan2-1-14b-4node
```

To get the logs for one of the pods, run the following command:

```bash
kubectl logs POD_NAME
```

Information about the training job's progress, including crucial details such as loss, step count, and step time, is generated by the rank 0 process. This process runs on the pod whose name begins with `JOB_NAME_PREFIX-workload-0-0`, where `JOB_NAME_PREFIX` is your job name prefix (for example, `$USER-a4-wan2-1-14b-4node`). A full pod name looks like `$USER-a4-wan2-1-14b-4node-workload-0-0-s9zrv`.

### Uninstall the Helm release

You can delete the job and other resources created by the Helm chart. To uninstall the Helm release, run the following command from your client:

```bash
helm uninstall $USER-a4-wan2-1-14b-4node
```
Lines changed: 144 additions & 0 deletions
usage()
{
cat << EOF
usage: bash ./launcher.sh [config-override [config-override ...]]
  config-override  (Optional) A NeMo configuration override, e.g. trainer.max_steps=10000.
EOF
}

parse_args() {
  while [ "$1" != "" ]; do
    case $(grep -o "=" <<< "$1" | wc -l) in
      1 )
        config_overrides+=("$1")
        ;;
      * )
        echo "Invalid config override: $1"
        usage
        exit 1
    esac
    shift
  done
  config_overrides="${config_overrides[*]}"
}

config_overrides=()
parse_args "$@"

if [ -z "${config_overrides}" ]; then
  echo "No NeMo config overrides specified"
else
  echo "NeMo config overrides:"
  echo "  ${config_overrides}"
fi

export LD_LIBRARY_PATH="$NCCL_PLUGIN_PATH"
ldconfig $LD_LIBRARY_PATH
echo "Added $LD_LIBRARY_PATH to ldconfig:"
ldconfig -p | grep libcuda | sed 's/^/  /'
echo ""

if [[ -n "${EXPLICIT_LOG_DIR}" ]]; then
  explicit_log_dir=${EXPLICIT_LOG_DIR}
else
  explicit_log_dir=/workload_logs
fi
echo "Logging to ${explicit_log_dir}"

if [[ -n "${TOKENIZER_PATH}" ]]; then
  echo "Getting tokenizer files"
  cp ${TOKENIZER_PATH}/* .
  echo ""
fi

echo "Launching Torch distributed on node rank $JOB_COMPLETION_INDEX out of $NNODES nodes"

pip install git+https://github.com/NVIDIA/dllogger#egg=dllogger

# Create the nsys directory.
mkdir -p ${explicit_log_dir}/nsys

export NVTE_FUSED_ATTN=1

# Repos & pinning (Wan stable build)
cd /opt
rm -rf DFM Megatron-Bridge Megatron-LM

# Fresh clone of DFM
git clone https://github.com/NVIDIA-NeMo/DFM.git /opt/DFM

# Pin Bridge & LM to known stable commits for this workflow
git clone --no-checkout https://github.com/NVIDIA-NeMo/Megatron-Bridge.git /opt/Megatron-Bridge
git -C /opt/Megatron-Bridge checkout 953aabf75c0500180dc14a6a76cf9e7e7c4baec7

git clone --no-checkout https://github.com/NVIDIA/Megatron-LM.git /opt/Megatron-LM
git -C /opt/Megatron-LM checkout 2d398b42fd4237fffb553109563d73ac099751c3

# Fix automodel info logging flood
sed -i 's/logger.info/logger.warning/g' /opt/DFM/dfm/src/automodel/flow_matching/flow_matching_pipeline.py

# Fix data length issue with DFM
sed -i 's/length=1024/length=10**12/g' /opt/DFM/dfm/src/megatron/data/wan/wan_mock_datamodule.py
sed -i 's/length: int/length: int = 10**12/g' /opt/DFM/dfm/src/megatron/data/wan/wan_mock_datamodule.py

export PYTHONPATH="/opt/DFM:/opt/Megatron-Bridge:/opt/Megatron-LM"

pip install --no-cache-dir \
  diffusers==0.35.1 \
  easydict \
  imageio \
  imageio-ffmpeg \
  peft \
  "transformers<4.57.0" \
  nvidia-modelopt[hf] \
  > /dev/null 2>&1

# Note: ${config_overrides} is expanded now, while worker_command.sh is
# written, so the parsed overrides reach every worker (the variable is
# not exported, so it would be empty if expansion were deferred).
worker_command=$(cat <<- EOM
if [ "\$RANK" -eq "0" ]; then
  echo "Worker 0 is stalling for a few seconds.." ;
  sleep 3 ;
  echo "The detected environment within worker rank 0 is:" ;
  env | sed 's/^/  /' ;
fi ;

cd /opt/DFM ;

numactl \
  --cpunodebind=\$((LOCAL_RANK/4)) \
  --membind=\$((LOCAL_RANK/4)) \
  nice -10 \
  python examples/megatron/recipes/wan/pretrain_wan.py \
    --config-file examples/megatron/recipes/wan/conf/gb200_perf_pretrain_mock.yaml \
    --training-mode pretrain \
    --mock \
    checkpoint.save_interval=0 \
    model.tensor_model_parallel_size=1 \
    model.context_parallel_size=2 \
    model.sequence_parallel=false \
    model.recompute_granularity=full \
    model.recompute_method=block \
    model.recompute_num_layers=40 \
    ${config_overrides}
EOM
)

echo "$worker_command" > worker_command.sh
chmod 777 worker_command.sh

torchrun \
  --nproc-per-node="8" \
  --nnodes="4" \
  --node_rank="${JOB_COMPLETION_INDEX}" \
  --rdzv_id="${JOB_IDENTIFIER}" \
  --master_addr="${MASTER_ADDR}" \
  --master_port="${MASTER_PORT}" \
  --no-python bash worker_command.sh

if [[ "$JOB_COMPLETION_INDEX" == "0" ]]; then
  mkdir -p ${ARTIFACT_DIR}
  cp -r ${explicit_log_dir}/* ${ARTIFACT_DIR}/
  env > ${ARTIFACT_DIR}/environ.txt
  ls ${ARTIFACT_DIR}
fi
echo "Training completed"
echo "Pod on $(hostname --fqdn) is exiting"
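
The `numactl` flags in the worker command bind each group of four local ranks to one NUMA node via integer division (`LOCAL_RANK/4`). A quick sketch of the resulting mapping on an 8-GPU node:

```bash
# Local ranks 0-3 bind to NUMA node 0 and ranks 4-7 to NUMA node 1,
# mirroring --cpunodebind/--membind=$((LOCAL_RANK/4)) in the launcher.
for LOCAL_RANK in 0 1 2 3 4 5 6 7; do
  echo "local_rank=$LOCAL_RANK -> numa_node=$((LOCAL_RANK / 4))"
done
```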
Lines changed: 1 addition & 0 deletions
helm install ninggu-ubench-59tr . -f values.yaml --set-file workload_launcher=launcher.sh --set workload.image=nvcr.io/nvidia/nemo:25.11.01 --set volumes.gcsMounts[0].bucketName=ubench-logs --set volumes.gcsMounts[0].mountPath=/job-logs --set workload.envs[0].value=/job-logs/ninggu-ubench-59tr --set queue=a4
Lines changed: 28 additions & 0 deletions
# yamllint disable
# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

{{- if .Values.workload.configFile }}
apiVersion: v1
kind: ConfigMap
metadata:
  name: "{{ .Release.Name }}-config"
data:
  workload-configuration: |-
{{- if .Values.workload_config }}
{{ .Values.workload_config | nindent 4 }}
{{- else }}
{{ "config: null" | nindent 4 }}
{{- end }}
{{- end }}
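
For reference, when `workload.configFile` is set and a `workload_config` value is supplied, the template above renders roughly as follows (the release name and config content here are hypothetical):

```yaml
# Sketch of the rendered ConfigMap for a release named "my-workload"
# with workload_config set to a two-line YAML snippet.
apiVersion: v1
kind: ConfigMap
metadata:
  name: "my-workload-config"
data:
  workload-configuration: |-
    train:
      train_iters: 100
```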
