
Implement SIGTERM-only, ordered, configurable pod preStop lifecycle in SLURM plugin #110

@dciangot

Summary

Implement a robust Kubernetes preStop lifecycle feature in the interlink-slurm-plugin with the following requirements:


Key Features:

  • PreStop handlers run only when the job receives SIGTERM (never on normal completion or during EXIT cleanup)
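  • PreStop execution is synchronous and ordered: handlers run one at a time, in the order containers are declared in the pod spec (a deterministic choice for this plugin; Kubernetes itself does not guarantee a shutdown order among regular containers)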
  • PreStop always runs to completion (with per-container timeout) before container kill and probe cleanup
  • Both HTTP and Exec preStop lifecycle handlers supported (same as probe subsystem, no TCP for now)
  • Each preStop action runs in its own dedicated bash function and is called via a global runner that covers all containers
  • Job shell script (job.sh) traps SIGTERM and invokes all preStop handlers before probe cleanup
  • New config flags in SlurmConfig.yaml:
    • EnablePreStop (bool): enables/disables preStop lifecycle processing
    • PreStopTimeoutSeconds (int): maximum run time for each preStop handler (default: 5 seconds)
  • References and basic how-to/example added to docs/README-probes.md
  • Backwards compatible: default is disabled, so jobs behave exactly as before until enabled.
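
The SIGTERM-only behavior listed above could be wired roughly as follows. This is a minimal sketch, not the plugin's actual generated script: `runAllPreStops` and `cleanup_probes` are the names used in this issue, while `on_sigterm`, `on_exit`, and the placeholder function bodies are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the trap wiring in a generated job.sh.
# Function bodies are placeholders; on_sigterm and on_exit are assumed names.

runAllPreStops() {
  echo "running preStop handlers"
}

cleanup_probes() {
  echo "cleaning up probes"
}

# SIGTERM path: preStop handlers first, then probe cleanup, then exit.
on_sigterm() {
  runAllPreStops
  cleanup_probes
  trap - EXIT   # avoid running the probe cleanup a second time on exit
  exit 143      # 128 + 15, the conventional exit status for SIGTERM
}

# Normal completion / EXIT path: probe cleanup only, never preStop.
on_exit() {
  cleanup_probes
}

trap on_sigterm TERM
trap on_exit EXIT
```

Note that bash runs a trap only after the current foreground command returns, so a real job script would likely start its workload in the background and `wait` on it for the SIGTERM handler to fire promptly.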

Implementation Tasks

  • Extend SlurmConfig to support EnablePreStop and PreStopTimeoutSeconds (default: 5)
  • Translate container.Lifecycle.PreStop into a PreStopCommand during container parsing, only if EnablePreStop is set
  • Generate a runPreStop_<container>() shell function for each container that defines a preStop handler
  • Generate a runAllPreStops() function that runs all preStop handlers in order, each with the configured timeout; log errors but do not abort on failure
  • Install a SIGTERM trap that calls runAllPreStops, then cleanup_probes, then exits
  • Add documentation and usage examples.
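
The generated functions described in these tasks might look something like the sketch below. It assumes GNU coreutils `timeout` is available on the compute node and that container names have already been sanitized into valid shell identifiers. The container names (`web`, `worker`), the handler bodies, and the `PRESTOP_TIMEOUT_SECONDS` and `PRESTOP_CONTAINERS` variables are hypothetical; only `runPreStop_<container>` and `runAllPreStops` come from this issue.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of generated preStop functions.
# Variable names, container names, and handler bodies are assumptions.

# Would be filled in from PreStopTimeoutSeconds (default 5).
PRESTOP_TIMEOUT_SECONDS="${PRESTOP_TIMEOUT_SECONDS:-5}"

# One function per container that declares a preStop handler.
runPreStop_web() {
  echo "web: draining connections"   # stand-in for an exec handler
}

runPreStop_worker() {
  sleep 30                            # stand-in for a handler that hangs
}

# Containers with a preStop handler, in pod-spec declaration order.
PRESTOP_CONTAINERS=(web worker)

runAllPreStops() {
  local name rc
  for name in "${PRESTOP_CONTAINERS[@]}"; do
    rc=0
    # Export the function so the child bash started by timeout can call it.
    export -f "runPreStop_${name}"
    # Send SIGKILL to the handler once the per-handler timeout expires.
    timeout -s KILL "${PRESTOP_TIMEOUT_SECONDS}" \
      bash -c "runPreStop_${name}" || rc=$?
    if [ "$rc" -ne 0 ]; then
      # Log and continue: a failing preStop must not block the others.
      echo "preStop for ${name} failed or timed out (exit ${rc})" >&2
    fi
  done
  return 0
}
```

Running each handler through `timeout` in a child shell keeps the runner itself alive when a handler is killed, which is what allows the loop to continue with the remaining containers and then hand over to probe cleanup.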

Acceptance Criteria

  • When preStop is set in a pod (exec or httpGet), it is run only if the SLURM job is signaled with SIGTERM, before background probe processes are killed
  • PreStops execute in the same order as containers are defined in the pod spec
  • Each preStop is forcibly killed if it is still running when the configured timeout expires (default 5s, configurable)
  • If no preStop is defined, or the config disables the feature, the generated job script and its behavior are unchanged, preserving backward compatibility
  • Example(s) in README-probes.md demonstrate usage and config

References


PRs/branches: Please associate any PR implementing this feature to this issue.

CC: @dciangot
