
Graceful termination/restart when Slurm job hits time limit #111

@kondratyevd

Description


Currently (plugin v0.6.0), when the underlying Slurm job reaches its time limit, the Kubernetes pod goes into `Error: 15` status.

It would be great to have a Kubernetes-native way to handle this, for example:

  • treat it as a crash of the pod's containers: run the pod termination routine and let the user handle resubmission via a Deployment restart policy, etc.
  • OR treat it as a node failure and evict/restart the pod, or whatever Kubernetes normally does in that case.
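The first option above could be sketched as ordinary Kubernetes configuration: if the plugin reported the time-limit termination as a normal container exit, a Deployment's controller would already handle resubmission. A hedged sketch, assuming that behavior — the image name and labels below are placeholders, not part of the plugin's API:

```yaml
# Hypothetical sketch: run the Slurm-backed workload under a Deployment so
# Kubernetes replaces the pod when its container exits at the time limit.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: slurm-workload
spec:
  replicas: 1
  selector:
    matchLabels:
      app: slurm-workload
  template:
    metadata:
      labels:
        app: slurm-workload
    spec:
      containers:
        - name: worker
          image: my-workload:latest   # placeholder image
      # If the pod terminates when the Slurm job hits its time limit,
      # the Deployment controller creates a replacement pod, which the
      # plugin would then submit as a fresh Slurm job.
      restartPolicy: Always
```

For batch-style workloads, a `Job` with `backoffLimit` would be the analogous resubmission mechanism.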
