Currently (plugin v0.6.0), when the underlying Slurm job reaches its time limit, the k8s pod goes into `Error: 15` status.
It would be great to have some k8s-native way to handle this, for example:
- treat it as a crash of the pod's containers: run the pod termination routine and let the user handle resubmission via a Deployment restart policy, etc.
- OR treat it as a node failure and evict/restart the pod, or do whatever Kubernetes normally does in that case.
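For the first option, if the plugin reported the timed-out job as a container termination with a non-zero exit code, stock Kubernetes controllers would already handle resubmission. A minimal sketch of what I mean (names and image are placeholders, not from the plugin):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: slurm-job-example        # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: slurm-job-example
  template:
    metadata:
      labels:
        app: slurm-job-example
    spec:
      # Deployments require restartPolicy: Always, so a container that
      # "crashes" on Slurm time limit would be restarted with backoff,
      # i.e. the job gets resubmitted automatically.
      restartPolicy: Always
      containers:
      - name: job
        image: example/slurm-job:latest   # placeholder image
```

A Job with `restartPolicy: OnFailure` and a `backoffLimit` would be another natural fit if the workload is run-to-completion rather than long-running.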