Can't get MPIJob status when pod template is invalid #604

@congpeiqing

I created an MPIJob with an invalid pod template, and its status is never populated (I think the status should be Failed).
As a result, I can't distinguish MPIJobs that are too new to have a status from MPIJobs with an invalid pod template.
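
As a possible client-side workaround until the status is populated, the two cases could be told apart by looking at the job's events as well as its status. The sketch below is only an illustration: `classify_mpijob` is a hypothetical helper, it assumes the MPIJob and its events have already been fetched and parsed into plain dicts (for example from `kubectl ... -o json`), and it relies on the `MPIJobFailed` warning reason that appears in the describe output of this issue.

```python
def classify_mpijob(mpijob, events):
    """Return a coarse state for an MPIJob dict and a list of its event dicts.

    Hypothetical helper: the returned labels "NoStatusYet" and
    "PodCreationFailed" are made up for this sketch, not operator values.
    """
    # Prefer real status conditions when the controller has written any.
    conditions = (mpijob.get("status") or {}).get("conditions") or []
    active = [c["type"] for c in conditions if c.get("status") == "True"]
    if active:
        return active[-1]

    # No status: look for the warning event the controller emits when
    # pod creation is rejected (reason taken from this issue's output).
    for event in events:
        if event.get("type") == "Warning" and event.get("reason") == "MPIJobFailed":
            return "PodCreationFailed: " + event.get("message", "")

    # No status and no warning: the job is probably just too new.
    return "NoStatusYet"
```

Events are kept for a limited time only, so this is a stopgap and does not replace a proper Failed condition on the job.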

My MPIJob is shown below:
kubectl get mpijob ai62da0dbe-6406-4252-85d6-51ef87eab10d -n cpod -oyaml
The output is:

apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  creationTimestamp: "2023-11-15T02:01:44Z"
  generation: 1
  labels:
    deadline: 2023-11-15_02-06-44
  name: ai62da0dbe-6406-4252-85d6-51ef87eab10d
  namespace: cpod
  resourceVersion: "2787007"
  uid: e5703c73-f27e-45ef-9049-fd40c152d4d6
spec:
  launcherCreationPolicy: WaitForWorkersReady
  mpiImplementation: OpenMPI
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - image: "111"
            imagePullPolicy: IfNotPresent
            name: launcher
          hostIPC: true
    Worker:
      replicas: 1
      template:
        spec:
          containers:
          - image: "111"
            imagePullPolicy: IfNotPresent
            name: worker
            resources:
              limits:
                nvidia.com/gpu: 1
            volumeMounts:
            - mountPath: "111"
              name: ckpt-pv
            - mountPath: "111"
              name: saved-model-pv
          hostIPC: true
          nodeSelector:
            nvidia.com/gpu.product: NVIDIA-GeForce-RTX-3090
          volumes:
          - name: ckpt-pv
            persistentVolumeClaim:
              claimName: ai62da0dbe-6406-4252-85d6-51ef87eab10d-ckpt
              readOnly: false
          - name: saved-model-pv
            persistentVolumeClaim:
              claimName: ai62da0dbe-6406-4252-85d6-51ef87eab10d-modelsave
              readOnly: false
  runPolicy:
    cleanPodPolicy: Running
    schedulingPolicy:
      minAvailable: 1
    suspend: false
  slotsPerWorker: 1
  sshAuthMountPath: /root/.ssh

When I describe the MPIJob:

kubectl describe mpijob ai62da0dbe-6406-4252-85d6-51ef87eab10d -n cpod

The output is:

Name:         ai62da0dbe-6406-4252-85d6-51ef87eab10d
Namespace:    cpod
Labels:       deadline=2023-11-15_02-06-44
Annotations:  <none>
API Version:  kubeflow.org/v2beta1
Kind:         MPIJob
Metadata:
  Creation Timestamp:  2023-11-15T02:01:44Z
  Generation:          1
  Managed Fields:
    API Version:  kubeflow.org/v2beta1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          .:
          f:deadline:
      f:spec:
        .:
        f:launcherCreationPolicy:
        f:mpiImplementation:
        f:mpiReplicaSpecs:
          .:
          f:Launcher:
            .:
            f:replicas:
            f:template:
              .:
              f:spec:
                .:
                f:containers:
                f:hostIPC:
          f:Worker:
            .:
            f:replicas:
            f:template:
              .:
              f:spec:
                .:
                f:containers:
                f:hostIPC:
                f:nodeSelector:
                f:volumes:
        f:runPolicy:
          .:
          f:cleanPodPolicy:
          f:schedulingPolicy:
            .:
            f:minAvailable:
          f:suspend:
        f:slotsPerWorker:
        f:sshAuthMountPath:
    Manager:         cpodmanager
    Operation:       Update
    Time:            2023-11-15T02:01:44Z
  Resource Version:  2787007
  UID:               e5703c73-f27e-45ef-9049-fd40c152d4d6
Spec:
  Launcher Creation Policy:  WaitForWorkersReady
  Mpi Implementation:        OpenMPI
  Mpi Replica Specs:
    Launcher:
      Replicas:  1
      Template:
        Spec:
          Containers:
            Image:              111
            Image Pull Policy:  IfNotPresent
            Name:               launcher
          Host IPC:             true
    Worker:
      Replicas:  1
      Template:
        Spec:
          Containers:
            Image:              111
            Image Pull Policy:  IfNotPresent
            Name:               worker
            Resources:
              Limits:
                nvidia.com/gpu:  1
            Volume Mounts:
              Mount Path:  111
              Name:        ckpt-pv
              Mount Path:  111
              Name:        saved-model-pv
          Host IPC:        true
          Node Selector:
            nvidia.com/gpu.product:  NVIDIA-GeForce-RTX-3090
          Volumes:
            Name:  ckpt-pv
            Persistent Volume Claim:
              Claim Name:  ai62da0dbe-6406-4252-85d6-51ef87eab10d-ckpt
              Read Only:   false
            Name:          saved-model-pv
            Persistent Volume Claim:
              Claim Name:  ai62da0dbe-6406-4252-85d6-51ef87eab10d-modelsave
              Read Only:   false
  Run Policy:
    Clean Pod Policy:  Running
    Scheduling Policy:
      Min Available:    1
    Suspend:            false
  Slots Per Worker:     1
  Ssh Auth Mount Path:  /root/.ssh
Events:
  Type     Reason         Age                   From                Message
  ----     ------         ----                  ----                -------
  Normal   MPIJobCreated  5m48s (x12 over 27m)  mpi-job-controller  MPIJob cpod/ai62da0dbe-6406-4252-85d6-51ef87eab10d is created.
  Warning  MPIJobFailed   5m48s (x12 over 27m)  mpi-job-controller  worker pod created failed: Pod "ai62da0dbe-6406-4252-85d6-51ef87eab10d-worker-0" is invalid: spec.containers[0].volumeMounts[1].mountPath: Invalid value: "111": must be unique
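
The warning above says the second volume mount reuses the mount path `"111"`. A client could catch this particular mistake before submitting the job. The following is a minimal sketch under that assumption: `duplicate_mount_paths` is a hypothetical helper that checks only for repeated `mountPath` values within one container, which is a small part of what the API server validates.

```python
from collections import Counter


def duplicate_mount_paths(pod_spec):
    """Map container name -> sorted list of mountPath values used more than once.

    Hypothetical pre-submission check; covers only the "must be unique"
    rule reported in the event above, not full pod validation.
    """
    problems = {}
    for container in pod_spec.get("containers", []):
        paths = [m["mountPath"] for m in container.get("volumeMounts", [])]
        dupes = sorted(p for p, n in Counter(paths).items() if n > 1)
        if dupes:
            problems[container.get("name", "")] = dupes
    return problems


# Worker pod spec reduced to the fields relevant to the error in this issue.
worker_spec = {
    "containers": [
        {
            "name": "worker",
            "image": "111",
            "volumeMounts": [
                {"mountPath": "111", "name": "ckpt-pv"},
                {"mountPath": "111", "name": "saved-model-pv"},
            ],
        }
    ]
}

print(duplicate_mount_paths(worker_spec))  # {'worker': ['111']}
```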
