What happened?
When an MPIJob is created with .spec.runPolicy.suspend=true and later updated (e.g., via kubectl patch) to modify .spec.mpiReplicaSpecs["Launcher"].template fields alongside setting suspend=false, the changes are not propagated to the already-created batch/v1 Job.
The MPIJob spec correctly reflects all updates, but the launcher Job retains its original pod template — only Job.Spec.Suspend is toggled to false.
This affects all fields in the Launcher's PodTemplateSpec, including but not limited to:
- .template.metadata.annotations
- .template.metadata.labels
- .template.spec.containers[*].image
- .template.spec.containers[*].command / args
- .template.spec.containers[*].resources
- .template.spec.containers[*].env
- .template.spec.volumes
What did you expect to happen?
Updates to .spec.mpiReplicaSpecs["Launcher"].template should be reflected in the owned launcher batch/v1 Job's .spec.template when the MPIJob is resumed.
How to reproduce
The following steps use annotations as a concrete example, but the same behavior applies to any Launcher template field.
Environment
- MPI Operator: v0.7.0
- Kubernetes: v1.35.0 (kind v0.31.0)
Steps
1. Install MPI Operator
kubectl apply --server-side -f https://raw.githubusercontent.com/kubeflow/mpi-operator/v0.7.0/deploy/v2beta1/mpi-operator.yaml
kubectl -n mpi-operator wait --for=condition=available deployment/mpi-operator --timeout=120s
2. Deploy the pi example with suspend=true
curl -sL https://raw.githubusercontent.com/kubeflow/mpi-operator/refs/heads/master/examples/v2beta1/pi/pi.yaml | yq '.spec.runPolicy.suspend = true' | kubectl apply -f -
Verify the launcher Job is created in suspended state:
$ kubectl get job pi-launcher -o jsonpath='{.spec.suspend}'
true
3. Confirm the Launcher template has no annotations
$ kubectl get mpijob pi -o jsonpath='{.spec.mpiReplicaSpecs.Launcher.template.metadata.annotations}'
# (empty)
4. Patch MPIJob: unsuspend + add annotation (single command)
kubectl patch mpijob pi --type=merge -p '{"spec":{"runPolicy":{"suspend":false},"mpiReplicaSpecs":{"Launcher":{"template":{"metadata":{"annotations":{"alpha":"beta"}}}}}}}'
5. Verify the MPIJob spec was updated
$ kubectl get mpijob pi -o jsonpath='{.spec.runPolicy.suspend}'
false
$ kubectl get mpijob pi -o jsonpath='{.spec.mpiReplicaSpecs.Launcher.template.metadata.annotations}'
{"alpha":"beta"}
Both fields are correctly updated on the MPIJob.
6. Check the launcher Job
$ kubectl get job pi-launcher -o jsonpath='{.spec.suspend}'
false
$ kubectl get job pi-launcher -o jsonpath='{.spec.template.metadata.annotations}'
# (empty)
The Job's suspend field was correctly set to false, but the alpha: "beta" annotation is missing from the Job's pod template. The same would occur for any other Launcher template field change.
Root cause
In pkg/controller/mpi_job_controller.go, when the launcher Job already exists, the controller only syncs the suspension state:
if launcher != nil {
	if isMPIJobSuspended(mpiJob) != isJobSuspended(launcher) {
		launcher.Spec.Suspend = ptr.To(isMPIJobSuspended(mpiJob))
		if _, err := c.kubeClient.BatchV1().Jobs(namespace).Update(context.TODO(), launcher, metav1.UpdateOptions{}); err != nil {
			return err
		}
	}
}
There is no reconciliation of the launcher Job's pod template (.spec.template) against the desired state from mpiJob.Spec.MPIReplicaSpecs["Launcher"].Template. Changes made to the MPIJob's Launcher template after initial Job creation are silently ignored.
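One possible direction for a fix, as a minimal sketch only: copy the desired Launcher template onto the launcher Job while it is still suspended, and flip the suspend flag in the same update. The types `PodTemplate`, `LauncherJob`, and `MPIJob` and the function `syncLauncher` below are hypothetical stand-ins, not the operator's real types or API; they only model the fields used in the reproduction. The sketch is limited to labels and annotations because, as far as I know, the Kubernetes API only permits a small set of pod template fields to change on a suspended Job, so fields such as the container image would likely need the Job to be deleted and recreated instead.

```go
package main

import (
	"fmt"
	"reflect"
)

// PodTemplate is a hypothetical, trimmed-down stand-in for a pod template.
// Only metadata fields are modeled here.
type PodTemplate struct {
	Annotations map[string]string
	Labels      map[string]string
}

// LauncherJob is a hypothetical stand-in for the owned batch/v1 Job.
type LauncherJob struct {
	Suspend  bool
	Template PodTemplate
}

// MPIJob is a hypothetical stand-in holding only the fields relevant here.
type MPIJob struct {
	Suspend          bool
	LauncherTemplate PodTemplate
}

// copyMap returns an independent copy and preserves nil, so that a later
// reflect.DeepEqual comparison does not report a spurious difference.
func copyMap(in map[string]string) map[string]string {
	if in == nil {
		return nil
	}
	out := make(map[string]string, len(in))
	for k, v := range in {
		out[k] = v
	}
	return out
}

// syncLauncher brings the launcher in line with the MPIJob and reports
// whether anything changed, i.e. whether one Update call would be needed.
func syncLauncher(mpiJob *MPIJob, launcher *LauncherJob) bool {
	changed := false
	// Sync the template only while the Job is still suspended, so the
	// pods created on resume already carry the new metadata.
	if launcher.Suspend && !reflect.DeepEqual(launcher.Template, mpiJob.LauncherTemplate) {
		launcher.Template = PodTemplate{
			Annotations: copyMap(mpiJob.LauncherTemplate.Annotations),
			Labels:      copyMap(mpiJob.LauncherTemplate.Labels),
		}
		changed = true
	}
	// Existing behavior: sync the suspension state.
	if launcher.Suspend != mpiJob.Suspend {
		launcher.Suspend = mpiJob.Suspend
		changed = true
	}
	return changed
}

func main() {
	mpiJob := &MPIJob{Suspend: true}
	launcher := &LauncherJob{Suspend: true}

	// Equivalent of step 4: unsuspend and add an annotation in one update.
	mpiJob.Suspend = false
	mpiJob.LauncherTemplate.Annotations = map[string]string{"alpha": "beta"}

	changed := syncLauncher(mpiJob, launcher)
	fmt.Println(changed, launcher.Suspend, launcher.Template.Annotations)
}
```

In this model the launcher ends up resumed and carrying the alpha: "beta" annotation after a single sync, which is the behavior described under "What did you expect to happen?". Doing the template sync before the suspend toggle matters: once the Job is resumed, pods are created from whatever template it holds at that moment.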
/kind bug