feat(kubeflow): add setup_commands run once per pod before the job #545

Open
ko3n1g wants to merge 3 commits into main from ko3n1g/feat/kubeflow-setup-commands

ko3n1g commented Jun 12, 2026

Background / motivation

  • When running synced code via the Kubeflow workdir_pvc data-mover (no image rebuild), the container image may be missing a dependency, for example when the image is a broken release candidate. There was no hook to run a command once per pod before the job.

What changed

  • New KubeflowExecutor.setup_commands: list[str], a list of shell commands rendered into the generated launch.sh between the /nemo_run symlink and the training command.

Details

  • launch.sh runs once per pod, before torchrun spawns the per-GPU ranks, so each command executes exactly once per node (not once per rank). The script runs under set -e, so a failing setup command aborts the pod. setup_commands is empty by default, so existing launch scripts are unchanged.
  • Typical use: setup_commands=["uv pip install nvidia-resiliency-ext==0.6.0"] to patch a missing dep into the container venv without rebuilding the image.
# rendered launch.sh (excerpt)
ln -sfn <code_dir> /nemo_run
echo "Running setup commands..."
uv pip install nvidia-resiliency-ext==0.6.0
echo "Starting training command..."
...
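
The rendering order described above can be sketched in plain Python. This is only an illustration, not the Jinja2 template from this PR: the function name `render_launch_script`, its parameters, and the assumption that the "Starting training command..." echo is always emitted are hypothetical. It shows that the setup commands land between the symlink and the training command, and that an empty list leaves no setup section.

```python
import shlex


def render_launch_script(code_dir, training_command, setup_commands=None):
    """Build a launch.sh-like script as a string.

    Hypothetical sketch: the PR renders launch.sh from a Jinja2 template;
    this helper only mirrors the ordering described in the PR text.
    """
    lines = [
        "#!/bin/bash",
        # errexit: the first failing command stops the script (and the pod)
        "set -e",
        f"ln -sfn {shlex.quote(code_dir)} /nemo_run",
    ]
    if setup_commands:
        # Emitted once per script, so once per pod, before any rank starts
        lines.append('echo "Running setup commands..."')
        lines.extend(setup_commands)
    # Assumption: this echo is printed whether or not setup commands exist
    lines.append('echo "Starting training command..."')
    lines.append(training_command)
    return "\n".join(lines) + "\n"


script = render_launch_script(
    "/workdir/code",
    "torchrun train.py",
    setup_commands=["uv pip install nvidia-resiliency-ext==0.6.0"],
)
print(script)
```

Because every setup command is written verbatim as its own line, a command that needs several steps can be passed either as several list entries or as one entry joined with `&&`.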

Tested

  • Jinja2 render verified for both populated and empty setup_commands (valid bash either way). End-to-end validation pending via a Megatron-Bridge K8s job that sets it.

KubeflowExecutor.setup_commands is a list of shell commands rendered into the
generated launch.sh between the /nemo_run symlink and the training command.
launch.sh runs once per pod before torchrun spawns the per-GPU ranks, so each
setup command executes exactly once per node (not per rank), under errexit.

Use case: when running synced code via the workdir data-mover, install a
dependency that is missing from the container image (e.g. a broken
release-candidate image) into the container venv without rebuilding.

Signed-off-by: oliver könig <okoenig@nvidia.com>

On Kubernetes, the training pods write every recipe output to the shared
workdir PVC, including the PyTorch profiler chrome trace and CUDA memory
snapshot, which land under /nemo_run (the PVC code_dir). The launcher that
collects artifacts and parses logs only sees the local job_dir, so without
a copy-back those outputs are stranded on the PVC and never reach CI
artifacts.

Override KubeflowExecutor.cleanup() to reuse the existing pull_results()
data-mover, mirroring code_dir back to job_dir before teardown. Best-effort:
a failed pull never breaks cleanup. Gated on workdir_pvc, so non-PVC
(slurm/local) runs are unaffected.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
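
The best-effort copy-back described in this commit message can be sketched as follows. The class `SketchExecutor`, the `teardown` method, the `job_dir` parameter, and the `calls` list are hypothetical stand-ins; the real `KubeflowExecutor.cleanup()` and `pull_results()` signatures are not shown in this PR text and may differ. The sketch only illustrates the control flow: pull results when `workdir_pvc` is set, swallow any failure, and always proceed to teardown.

```python
import logging
from dataclasses import dataclass, field
from typing import Optional

logger = logging.getLogger(__name__)


@dataclass
class SketchExecutor:
    """Hypothetical stand-in for KubeflowExecutor (cleanup flow only)."""

    workdir_pvc: Optional[str] = None
    calls: list = field(default_factory=list)  # records what ran, for the demo

    def pull_results(self, job_dir: str) -> None:
        # Placeholder for the data-mover that mirrors code_dir back to job_dir
        self.calls.append(("pull_results", job_dir))

    def teardown(self) -> None:
        # Placeholder for the pre-existing cleanup work
        self.calls.append(("teardown",))

    def cleanup(self, job_dir: str) -> None:
        # Gated on workdir_pvc: runs without a PVC skip the copy-back
        if self.workdir_pvc:
            try:
                self.pull_results(job_dir)
            except Exception:
                # Best-effort: log and continue so teardown still happens
                logger.warning("pull_results failed; continuing", exc_info=True)
        self.teardown()


executor = SketchExecutor(workdir_pvc="shared-workdir")
executor.cleanup("/tmp/job_dir")
print(executor.calls)
```

Catching a broad `Exception` is deliberate in this sketch: the copy-back is a convenience for artifact collection, so any failure in it should be logged rather than allowed to block resource teardown.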

ko3n1g added the r0.10.0 label (Auto-cherrypick to release branch. Apply before merge; cherrypick happens after merge.) Jun 15, 2026
ko3n1g marked this pull request as ready for review June 15, 2026 15:59

test_cleanup_noop_without_workdir_pvc constructed a KubeflowExecutor
without the mock_k8s_clients fixture, so __post_init__ ran the real
config.load_kube_config()/load_incluster_config() and crashed in CI
(no kubeconfig). Request the fixture like every other executor test.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>