
Commit c0caacc

Merge pull request #136 from AI-Hypercomputer/wan2_1_14B_4node_b200

2 parents 0485fa3 + 7f03353 commit c0caacc

9 files changed: 754 additions & 0 deletions

Lines changed: 20 additions & 0 deletions
# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: v2
name: a4_jobset_workload
description: a4_jobset_workload
type: application
version: 0.1.0
appVersion: "1.16.0"
Lines changed: 147 additions & 0 deletions
<!-- mdformat global-off -->
# Pretrain wan2-1-14b-bf16-gbs64-gpus32 workloads on a4 GKE node pools with NVIDIA DFM & Megatron-Bridge

This recipe outlines the steps for running a wan2-1-14b-bf16-gbs64-gpus32 pretraining workload on a4 GKE node pools by using NeMo DFM (Diffusion Foundation Models) and Megatron-Bridge within the NeMo Framework.

## Orchestration and deployment tools

For this recipe, the following setup is used:

- Orchestration - [Google Kubernetes Engine (GKE)](https://cloud.google.com/kubernetes-engine)
- Pretraining job configuration and deployment - A Helm chart is used to configure and deploy the Kubernetes JobSet resource, which manages the execution of the DFM pretraining workload.

## Test environment

This recipe has been optimized for and tested with the following configuration:

- GKE cluster: follow the Cluster Toolkit [instructions](https://github.com/GoogleCloudPlatform/cluster-toolkit/tree/main/examples/gke-a4) to create your a4 GKE cluster.
- Node configuration: 4 nodes (8 GPUs per node, 32 GPUs total).
- GPU architecture: NVIDIA Blackwell B200.
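
The workload name encodes a global batch size of 64 on 32 GPUs. Combined with the parallelism settings applied in `launcher.sh` (tensor parallel 1, context parallel 2), the implied data-parallel layout can be sketched as follows; treating pipeline parallelism as 1 is an assumption, since the launcher does not set it:

```bash
# Implied parallel layout for this recipe. TP and CP match launcher.sh;
# PP=1 is an assumption (the launcher leaves pipeline parallelism unset).
NNODES=4; GPUS_PER_NODE=8
TP=1; CP=2; PP=1
GBS=64
WORLD_SIZE=$((NNODES * GPUS_PER_NODE))      # 32 GPUs in total
DP=$((WORLD_SIZE / (TP * CP * PP)))         # 16 data-parallel replicas
SAMPLES_PER_REPLICA=$((GBS / DP))           # 4 samples per replica per step
echo "world_size=$WORLD_SIZE dp=$DP samples_per_replica=$SAMPLES_PER_REPLICA"
```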

## Training dataset

This recipe uses a mock pretraining dataset. To support long-duration stress testing, the launcher patches the WanMockDataModule via sed to extend the mock data length from 1,024 to effectively infinite (10^12).
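
The patch is a plain textual substitution. As a minimal illustration of what the launcher's sed command does (the sample line below is hypothetical; the real target is `wan_mock_datamodule.py` inside the container):

```bash
# Apply the same substitution the launcher uses to a sample line.
sample='self.dataset = _MockDataset(length=1024)'
patched=$(echo "$sample" | sed 's/length=1024/length=10**12/g')
echo "$patched"   # prints: self.dataset = _MockDataset(length=10**12)
```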

## Docker container images

This recipe uses the following Docker images:

- `nvcr.io/nvidia/nemo:25.11.00`
- `us-docker.pkg.dev/gce-ai-infra/gpudirect-gib/nccl-plugin-gib-arm64:v1.1.1`

## Run the recipe

From your client workstation, complete the following steps:

### Configure environment settings

Set the environment variables to match your environment:

```bash
export PROJECT_ID=<PROJECT_ID>
export CLUSTER_REGION=<CLUSTER_REGION>
export CLUSTER_NAME=<CLUSTER_NAME>
export GCS_BUCKET=<GCS_BUCKET> # Note: the path should not be prefixed with gs://
export KUEUE_NAME=<KUEUE_NAME>
```

Replace the following values:

- `<PROJECT_ID>`: your Google Cloud project ID.
- `<CLUSTER_REGION>`: the region where your cluster is located.
- `<CLUSTER_NAME>`: the name of your GKE cluster.
- `<GCS_BUCKET>`: the name of your Cloud Storage bucket. Don't include the `gs://` prefix.
- `<KUEUE_NAME>`: the name of the Kueue local queue. The default queue created by the Cluster Toolkit is `a4`.

Set the default project:

```bash
gcloud config set project $PROJECT_ID
```

### Get cluster credentials

```bash
gcloud container clusters get-credentials $CLUSTER_NAME --region $CLUSTER_REGION
```

### Get the recipe

Clone the `gpu-recipes` repository and set a reference to the recipe folder:

```bash
git clone https://github.com/ai-hypercomputer/gpu-recipes.git
cd gpu-recipes
export REPO_ROOT=`git rev-parse --show-toplevel`
export RECIPE_ROOT=$REPO_ROOT/training/a4/wan2-1-14b/nemo-pretraining-gke/4node-BF16-GBS64/recipe
cd $RECIPE_ROOT
```

### Configure and submit a pretraining job

#### Using 4 nodes (32 GPUs) with BF16 precision

To execute the job with the default settings, run the following command from your client:

```bash
cd $RECIPE_ROOT
export WORKLOAD_NAME=$USER-a4-wan2-1-14b-4node
helm install $WORKLOAD_NAME . -f values.yaml \
  --set-file workload_launcher=launcher.sh \
  --set workload.image=nvcr.io/nvidia/nemo:25.11.00 \
  --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
  --set volumes.gcsMounts[0].mountPath=/job-logs \
  --set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \
  --set queue=${KUEUE_NAME}
```
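
The `--set` flags above override keys that the chart's `values.yaml` exposes. Inferred purely from those flags (a sketch, not the actual file shipped with the recipe), the relevant part of `values.yaml` plausibly has this shape:

```yaml
# Hypothetical values.yaml fragment inferred from the --set overrides;
# the file in the recipe folder is authoritative.
queue: null                  # Kueue local queue name
workload:
  image: nvcr.io/nvidia/nemo:25.11.00
  envs:
    - value: null            # log directory, e.g. /job-logs/<WORKLOAD_NAME>
volumes:
  gcsMounts:
    - bucketName: null       # Cloud Storage bucket, without gs://
      mountPath: /job-logs
```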

**Examples**

- To set the number of training steps to 100, run the following command from your client:

  ```bash
  cd $RECIPE_ROOT
  export WORKLOAD_NAME=$USER-a4-wan2-1-14b-4node
  helm install $WORKLOAD_NAME . -f values.yaml \
    --set-file workload_launcher=launcher.sh \
    --set workload.image=nvcr.io/nvidia/nemo:25.11.00 \
    --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
    --set volumes.gcsMounts[0].mountPath=/job-logs \
    --set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \
    --set queue=${KUEUE_NAME} \
    --set workload.arguments[0]="train.train_iters=100"
  ```

### Monitor the job

To check the status of the pods in your job, run the following command:

```bash
kubectl get pods | grep $USER-a4-wan2-1-14b-4node
```

To get the logs for one of the pods, run the following command:

```bash
kubectl logs POD_NAME
```

Information about the training job's progress, including crucial details such as loss, step count, and step time, is generated by the rank 0 process. This process runs on the pod whose name begins with `JOB_NAME_PREFIX-workload-0-0`, where `JOB_NAME_PREFIX` is your job name prefix (for example, `$USER-a4-wan2-1-14b-4node`). A full pod name looks like `$USER-a4-wan2-1-14b-4node-workload-0-0-s9zrv`.

### Uninstall the Helm release

You can delete the job and other resources created by the Helm chart. To uninstall the Helm release, run the following command from your client:

```bash
helm uninstall $USER-a4-wan2-1-14b-4node
```
Lines changed: 144 additions & 0 deletions
usage()
{
cat << EOF
usage: bash ./launcher.sh [config-override [config-override ...]]
  config-override  (Optional) A NeMo configuration override, e.g. trainer.max_steps=10000.
EOF
}

parse_args() {
  while [ "$1" != "" ]; do
    case $(grep -o "=" <<< "$1" | wc -l) in
      1 )
        config_overrides+=("$1")
        ;;
      * )
        echo "Invalid config override: $1"
        usage
        exit 1
    esac
    shift
  done
  config_overrides="${config_overrides[*]}"
}

config_overrides=()
parse_args "$@"

if [ -z "${config_overrides}" ]; then
  echo "No NeMo config overrides specified"
else
  echo "NeMo config overrides:"
  echo "  ${config_overrides}"
fi

export LD_LIBRARY_PATH="$NCCL_PLUGIN_PATH"
ldconfig $LD_LIBRARY_PATH
echo "Added $LD_LIBRARY_PATH to ldconfig:"
ldconfig -p | grep libcuda | sed 's/^/  /'
echo ""

if [[ -n "${EXPLICIT_LOG_DIR}" ]]; then
  explicit_log_dir=${EXPLICIT_LOG_DIR}
else
  explicit_log_dir=/workload_logs
fi
echo "Logging to ${explicit_log_dir}"

if [[ -n "${TOKENIZER_PATH}" ]]; then
  echo "Getting tokenizer files"
  cp ${TOKENIZER_PATH}/* .
  echo ""
fi

echo "Launching Torch distributed on node rank $JOB_COMPLETION_INDEX out of $NNODES nodes"

pip install git+https://github.com/NVIDIA/dllogger#egg=dllogger

# Create the nsys directory.
mkdir -p ${explicit_log_dir}/nsys

export NVTE_FUSED_ATTN=1

# Repos & pinning (Wan stable build)
cd /opt
rm -rf DFM Megatron-Bridge Megatron-LM

# Fresh clone of DFM
git clone https://github.com/NVIDIA-NeMo/DFM.git /opt/DFM

# Pin Bridge & LM to known stable commits for this workflow
git clone --no-checkout https://github.com/NVIDIA-NeMo/Megatron-Bridge.git /opt/Megatron-Bridge
git -C /opt/Megatron-Bridge checkout 953aabf75c0500180dc14a6a76cf9e7e7c4baec7

git clone --no-checkout https://github.com/NVIDIA/Megatron-LM.git /opt/Megatron-LM
git -C /opt/Megatron-LM checkout 2d398b42fd4237fffb553109563d73ac099751c3

# Fix automodel info logging flood
sed -i 's/logger.info/logger.warning/g' /opt/DFM/dfm/src/automodel/flow_matching/flow_matching_pipeline.py

# Fix data length issue with DFM
sed -i 's/length=1024/length=10**12/g' /opt/DFM/dfm/src/megatron/data/wan/wan_mock_datamodule.py
sed -i 's/length: int/length: int = 10**12/g' /opt/DFM/dfm/src/megatron/data/wan/wan_mock_datamodule.py

export PYTHONPATH="/opt/DFM:/opt/Megatron-Bridge:/opt/Megatron-LM"

pip install --no-cache-dir \
  diffusers==0.35.1 \
  easydict \
  imageio \
  imageio-ffmpeg \
  peft \
  "transformers<4.57.0" \
  nvidia-modelopt[hf] \
  > /dev/null 2>&1

# Note: ${config_overrides} is expanded now, while worker_command.sh is
# written, so the parsed overrides reach every worker (the variable is
# not exported, so it would be empty if expansion were deferred).
worker_command=$(cat <<- EOM
if [ "\$RANK" -eq "0" ]; then
  echo "Worker 0 is stalling for a few seconds.." ;
  sleep 3 ;
  echo "The detected environment within worker rank 0 is:" ;
  env | sed 's/^/  /' ;
fi ;

cd /opt/DFM ;

numactl \
  --cpunodebind=\$((LOCAL_RANK/4)) \
  --membind=\$((LOCAL_RANK/4)) \
  nice -10 \
  python examples/megatron/recipes/wan/pretrain_wan.py \
    --config-file examples/megatron/recipes/wan/conf/gb200_perf_pretrain_mock.yaml \
    --training-mode pretrain \
    --mock \
    checkpoint.save_interval=0 \
    model.tensor_model_parallel_size=1 \
    model.context_parallel_size=2 \
    model.sequence_parallel=false \
    model.recompute_granularity=full \
    model.recompute_method=block \
    model.recompute_num_layers=40 \
    ${config_overrides}
EOM
)

echo "$worker_command" > worker_command.sh
chmod 777 worker_command.sh

torchrun \
  --nproc-per-node="8" \
  --nnodes="4" \
  --node_rank="${JOB_COMPLETION_INDEX}" \
  --rdzv_id="${JOB_IDENTIFIER}" \
  --master_addr="${MASTER_ADDR}" \
  --master_port="${MASTER_PORT}" \
  --no-python bash worker_command.sh

if [[ "$JOB_COMPLETION_INDEX" == "0" ]]; then
  mkdir -p ${ARTIFACT_DIR}
  cp -r ${explicit_log_dir}/* ${ARTIFACT_DIR}/
  env > ${ARTIFACT_DIR}/environ.txt
  ls ${ARTIFACT_DIR}
fi
echo "Training completed"
echo "Pod on $(hostname --fqdn) is exiting"
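
The `numactl` flags in the worker command bind each group of four local ranks to one NUMA node via integer division (`LOCAL_RANK/4`). A quick sketch of the resulting mapping on an 8-GPU node:

```bash
# Local ranks 0-3 bind to NUMA node 0 and ranks 4-7 to NUMA node 1,
# mirroring --cpunodebind/--membind=$((LOCAL_RANK/4)) in the launcher.
for LOCAL_RANK in 0 1 2 3 4 5 6 7; do
  echo "local_rank=$LOCAL_RANK -> numa_node=$((LOCAL_RANK / 4))"
done
```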
Lines changed: 1 addition & 0 deletions
helm install ninggu-ubench-59tr . -f values.yaml --set-file workload_launcher=launcher.sh --set workload.image=nvcr.io/nvidia/nemo:25.11.01 --set volumes.gcsMounts[0].bucketName=ubench-logs --set volumes.gcsMounts[0].mountPath=/job-logs --set workload.envs[0].value=/job-logs/ninggu-ubench-59tr --set queue=a4
Lines changed: 28 additions & 0 deletions
# yamllint disable
# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

{{- if .Values.workload.configFile }}
apiVersion: v1
kind: ConfigMap
metadata:
  name: "{{ .Release.Name }}-config"
data:
  workload-configuration: |-
{{- if .Values.workload_config }}
{{ .Values.workload_config | nindent 4 }}
{{- else }}
{{ "config: null" | nindent 4 }}
{{- end }}
{{- end }}
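
For reference, when `workload.configFile` is set and a `workload_config` value is supplied, the template above renders roughly as follows (the release name and config content here are hypothetical):

```yaml
# Sketch of the rendered ConfigMap for a release named "my-workload"
# with workload_config set to a two-line YAML snippet.
apiVersion: v1
kind: ConfigMap
metadata:
  name: "my-workload-config"
data:
  workload-configuration: |-
    train:
      train_iters: 100
```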
