
Add Slurm worker MPI mode and claim paging#260

Open
nkeilbart wants to merge 3 commits into main from fix/slurm-mpi-none-and-resource-backfill

Conversation


@nkeilbart nkeilbart commented Apr 8, 2026

Summary

  • add execution_config.slurm_worker_mpi_mode so the outer srun that launches direct-mode workers can opt into --mpi=none
  • page resource-based job claiming across ready-job candidates so smaller jobs can backfill past an initial priority window
  • document direct-mode multi-GPU / nested srun usage with --overlap, --nodes, and --ntasks
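A sketch of what the documented direct-mode nested-srun usage might look like; the worker binary name and application command are illustrative placeholders, not the project's actual CLI:

```shell
# Outer launch: one worker per node, with MPI disabled on the worker step
# (srun selects this when slurm_worker_mpi_mode opts into --mpi=none).
# "my-worker" is a hypothetical binary name.
srun --nodes=2 --ntasks=2 --ntasks-per-node=1 --mpi=none my-worker &

# Inside the allocation, a nested srun shares resources with the worker
# step via --overlap instead of waiting for them to be released.
srun --overlap --nodes=1 --ntasks=4 ./my_gpu_app
```

The key point is that --mpi=none keeps the outer worker-launch step from initializing an MPI plugin, while --overlap lets the nested application srun run inside the same allocation.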

Validation

  • cargo fmt -- --check
  • cargo clippy --all --all-targets --all-features -- -D warnings
  • dprint check

Known Follow-up

  • resource backfill can let producer-style setup stages run further ahead of GPU/postprocess stages than before, which can increase the staged file count on HPC systems with strict file-count limits
  • we still need an explicit backpressure / work-in-progress control for that workflow shape instead of relying on the old first-page claim behavior as an accidental throttle

Record Slurm job completion before any sacct lookup so short
jobs do not inherit accounting latency. Queue completed steps by
allocation and persist slurm_stats from a background worker after
resources are freed.
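The queue-then-persist pattern described above can be sketched as a channel drained by a background worker; the type and function names here are illustrative stand-ins, not the crate's actual API:

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical stand-in for a completed Slurm step.
struct CompletedStep {
    job_id: u32,
}

// Stand-in for the sacct lookup plus slurm_stats persistence.
fn persist_slurm_stats(step: &CompletedStep) -> String {
    format!("persisted stats for job {}", step.job_id)
}

fn main() {
    let (tx, rx) = mpsc::channel::<CompletedStep>();

    // Background worker: drains the queue and persists accounting
    // stats after the job's resources have already been freed.
    let worker = thread::spawn(move || {
        let mut log = Vec::new();
        for step in rx {
            log.push(persist_slurm_stats(&step));
        }
        log
    });

    // Hot path: record completion and enqueue immediately; no sacct
    // call here, so short jobs do not inherit accounting latency.
    for job_id in [101, 102] {
        tx.send(CompletedStep { job_id }).unwrap();
    }
    drop(tx); // close the queue so the worker exits

    let log = worker.join().unwrap();
    println!("{:?}", log);
}
```

The design choice is that the completion record and the accounting write are decoupled: the hot path only sends on the channel, and the slower sacct-backed persistence happens off the critical path.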
Add an opt-in direct-mode setting for launching one worker per node with outer srun --mpi=none, and thread that through the workflow execution config and Slurm submission path.
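A hypothetical sketch of how the opt-in setting might appear in a workflow execution config; the table name, key placement, and accepted values are assumptions for illustration:

```toml
[execution_config]
# Opt-in: launch one worker per node with the outer srun using --mpi=none.
slurm_worker_mpi_mode = "none"
```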

Refactor resource-based job claiming to scan ready jobs in pages so lower-priority jobs can backfill leftover capacity when large higher-priority jobs do not fit inside the first SQL window.
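The paged backfill scan can be sketched as follows; the types, capacity model (CPUs only), and page size are illustrative assumptions, not the actual claiming implementation:

```rust
struct ReadyJob {
    id: u32,
    cpus: u32,
}

/// Scan ready jobs (assumed already ordered by priority) in pages of
/// `page_size`, claiming any job that still fits in `available_cpus`.
/// Unlike a single-window claim, smaller jobs on later pages can
/// backfill capacity left over by large jobs that did not fit.
fn claim_with_backfill(ready: &[ReadyJob], mut available_cpus: u32, page_size: usize) -> Vec<u32> {
    let mut claimed = Vec::new();
    for page in ready.chunks(page_size) {
        for job in page {
            if job.cpus <= available_cpus {
                available_cpus -= job.cpus;
                claimed.push(job.id);
            }
        }
        if available_cpus == 0 {
            break; // nothing left to backfill
        }
    }
    claimed
}

fn main() {
    // The large high-priority job (1) does not fit in 16 CPUs; the
    // smaller jobs on the second page (3, 4) backfill the remainder.
    let ready = vec![
        ReadyJob { id: 1, cpus: 64 },
        ReadyJob { id: 2, cpus: 8 },
        ReadyJob { id: 3, cpus: 4 },
        ReadyJob { id: 4, cpus: 4 },
    ];
    let claimed = claim_with_backfill(&ready, 16, 2);
    println!("{:?}", claimed); // prints [2, 3, 4]
}
```

With the old first-window behavior, only page one would be considered and jobs 3 and 4 would stay idle even though capacity remained.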
@nkeilbart nkeilbart requested a review from daniel-thom April 8, 2026 23:47