Summary
When a workflow has multiple Slurm allocations open, Torc does not cancel queued allocations after
all remaining work has already finished elsewhere.
Current behavior
One concrete case:
- Torc submits 10 Slurm allocations for a workflow.
- Jobs get claimed and all workflow work finishes inside the first 9 allocations.
- The 10th allocation is still waiting in the Slurm queue.
- Torc continues to treat that queued allocation as valid instead of canceling it, even though
there are no jobs left for it to run.
From the current watch/orphan cleanup logic, queued Slurm allocations are considered valid as long
as Slurm still reports them as Queued/Running, and pending allocations only get cleaned up once
they are already gone from Slurm. That means a queued-but-now-unnecessary allocation can sit in the
queue until it starts or is canceled manually.
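The validity check described above could be sketched roughly as follows (hypothetical function names; this is an illustration of the behavior described in this issue, not Torc's actual code):

```python
# Sketch of the cleanup behavior described above: an allocation is
# treated as valid purely because Slurm still reports it, and pending
# allocations are only cleaned up once Slurm no longer reports them.

def is_allocation_valid(slurm_state):
    # Queued and running allocations are both considered valid, so a
    # queued-but-now-unnecessary allocation is never canceled here.
    return slurm_state in ("PENDING", "RUNNING")

def should_clean_up(slurm_state):
    # slurm_state is None when the job is gone from Slurm entirely;
    # that is currently the only condition that triggers cleanup.
    return slurm_state is None
```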
Expected behavior
If a workflow no longer has runnable work, Torc should detect any still-waiting Slurm allocations
and cancel them automatically.
Possible trigger conditions:
- No `ready`, `pending`, or `running` jobs remain for the workflow.
- There are still `scheduled_compute_nodes` in `pending` state backed by Slurm jobs that are
  still queued.
At that point, Torc could `scancel` those queued allocations and update the corresponding scheduled
compute node records so the workflow can finish cleanly without waiting on unnecessary queue slots.
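The trigger conditions above could be expressed as a check like the following (all type and field names here are hypothetical, chosen only to mirror the issue's terminology):

```python
from dataclasses import dataclass

@dataclass
class Job:
    status: str  # e.g. "ready", "pending", "running", "done"

@dataclass
class ScheduledComputeNode:
    slurm_job_id: str
    status: str  # e.g. "pending", "active"

def find_cancelable_allocations(jobs, nodes, slurm_queued_ids):
    """Return pending scheduled compute nodes whose Slurm jobs are still
    queued, but only when the workflow has no runnable work left."""
    runnable = {"ready", "pending", "running"}
    if any(j.status in runnable for j in jobs):
        return []  # work remains; leave queued allocations alone
    return [
        n for n in nodes
        if n.status == "pending" and n.slurm_job_id in slurm_queued_ids
    ]
```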
Suggested direction
See whether this should live in the watch loop and/or orphan cleanup path:
- detect that the workflow has no runnable jobs left
- enumerate pending Slurm allocations
- confirm they are still queued in Slurm
- cancel them and mark the Torc-side scheduled compute nodes accordingly
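The last two steps (confirm still queued, then cancel) might look like this, using standard `squeue`/`scancel` invocations; the injectable `run` parameter is just a testing convenience, not an assumption about Torc's design:

```python
import subprocess

def cancel_if_still_queued(slurm_job_id, run=subprocess.run):
    """Cancel a Slurm allocation only if it is still pending in the queue.

    Returns True if a cancel was issued, False if the job was already
    running or gone from Slurm.
    """
    # Query the job's state (-h: no header, -o %T: state only).
    out = run(
        ["squeue", "-h", "-j", slurm_job_id, "-o", "%T"],
        capture_output=True, text=True,
    )
    if out.stdout.strip() != "PENDING":
        return False  # started or disappeared; nothing to cancel
    run(["scancel", slurm_job_id], check=True)
    return True
```

After a successful cancel, the caller would update the corresponding `scheduled_compute_nodes` record so the workflow can complete.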
This seems especially important for "many small allocations" workflows where some allocations may
still be queued after the workflow's useful work has already been exhausted.