Cancel queued Slurm allocations when no jobs remain to run #257

@nkeilbart

Summary

When a workflow has multiple Slurm allocations open, Torc does not cancel allocations that are
still queued once all of the workflow's work has finished in other allocations.

Current behavior

One concrete case:

  1. Torc submits 10 Slurm allocations for a workflow.
  2. Jobs are claimed, and all of the workflow's work finishes inside the first 9 allocations.
  3. The 10th allocation is still waiting in the Slurm queue.
  4. Torc continues to treat that queued allocation as valid instead of canceling it, even though
    there are no jobs left for it to run.

Based on the current watch/orphan cleanup logic, queued Slurm allocations are considered valid as
long as Slurm still reports them as Queued/Running, and pending allocations are only cleaned up once
they are already gone from Slurm. That means a queued-but-now-unnecessary allocation can sit in the
queue until it starts or is canceled manually.

Expected behavior

If a workflow no longer has runnable work, Torc should detect any still-waiting Slurm allocations
and cancel them automatically.

Possible trigger conditions:

  • No ready, pending, or running jobs remain for the workflow.
  • There are still scheduled_compute_nodes in pending state backed by Slurm jobs that are
    still queued.

At that point, Torc could scancel those queued allocations and update the corresponding scheduled
compute node records so the workflow can finish cleanly without waiting on unnecessary queue slots.
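The two trigger conditions can be written as a small predicate. This is only a sketch: `ScheduledComputeNode`, the status strings, and both function names are hypothetical stand-ins based on the wording of this issue, not Torc's actual types. A real check would also need to cover any other non-terminal job status Torc defines.

```python
from dataclasses import dataclass

# Hypothetical status names taken from the wording of this issue; Torc's real
# enums and record types may differ.
RUNNABLE_JOB_STATUSES = {"ready", "pending", "running"}


@dataclass
class ScheduledComputeNode:
    """Hypothetical stand-in for a scheduled_compute_nodes record."""
    slurm_job_id: str
    status: str  # e.g. "pending" or "active"


def has_runnable_work(job_statuses):
    """Return True if any workflow job could still use an allocation."""
    return any(status in RUNNABLE_JOB_STATUSES for status in job_statuses)


def cancellation_candidates(job_statuses, nodes):
    """Return the pending nodes that are candidates for cancellation.

    An empty list means either work remains or nothing is waiting.
    """
    if has_runnable_work(job_statuses):
        return []
    return [node for node in nodes if node.status == "pending"]
```

The result is only a candidate list: each entry would still need to be confirmed as queued in Slurm before it is canceled.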

Suggested direction

Decide whether this should live in the watch loop, the orphan cleanup path, or both. Either way, the logic would need to:

  • detect that the workflow has no runnable jobs left
  • enumerate pending Slurm allocations
  • confirm they are still queued in Slurm
  • cancel them and mark the Torc-side scheduled compute nodes accordingly
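A rough sketch of those steps, shelling out to `squeue` and `scancel`, might look like the following. The function names and the injectable `run` parameter are assumptions made for illustration, and the caller is assumed to already know whether runnable jobs remain and which Slurm job IDs back the pending scheduled compute nodes. It is not how Torc currently talks to Slurm.

```python
import subprocess


def run_command(args):
    """Run a command and return its stdout as text."""
    return subprocess.run(args, capture_output=True, text=True, check=True).stdout


def queued_job_ids(slurm_job_ids, run=run_command):
    """Return the subset of job IDs that Slurm still reports as PENDING."""
    if not slurm_job_ids:
        return []
    # -h drops the header; %i is the job ID and %T the job state.
    output = run(["squeue", "-h", "-j", ",".join(slurm_job_ids), "-o", "%i %T"])
    queued = []
    for line in output.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1] == "PENDING":
            queued.append(parts[0])
    return queued


def cancel_unneeded_allocations(has_runnable_jobs, pending_job_ids, run=run_command):
    """Cancel queued allocations once the workflow has nothing left to run.

    Returns the job IDs that were passed to scancel, so the caller can update
    the matching Torc-side records.
    """
    if has_runnable_jobs:
        return []
    canceled = []
    for job_id in queued_job_ids(pending_job_ids, run):
        run(["scancel", job_id])
        canceled.append(job_id)
    return canceled
```

Two caveats for a real implementation: `squeue` can exit non-zero when a requested job ID is no longer known to Slurm, and an allocation can start between the `squeue` check and the `scancel` call, so both cases would need explicit handling.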

This seems especially important for "many small allocations" workflows where some allocations may
still be queued after the workflow's useful work has already been exhausted.
