Effortlessly launch experiments on Slurm clusters from your local machine.
Slurmpilot is a Python library designed to simplify launching experiments on Slurm clusters directly from your local machine. It automates code synchronization, job submission, and status tracking, letting you focus on your research.
It is tailored to academic research environments: multi-cluster support, no Docker requirement, source files sent directly, and a CLI-first workflow for managing experiments.
- 💻 Local & Remote Submission: Submit jobs from your laptop via SSH, or run directly on a login node — no code changes needed.
- 🔁 Simplified Workflow: Automatically handles code synchronization, log management, and job status tracking.
- 🌐 Multi-Cluster Support: Manage jobs across multiple Slurm clusters from a single interface.
- 📝 Reproducibility: Keep track of your experiments with automatically generated metadata.
- ⌨️ Command-Line Interface (CLI): Launch, manage, and monitor jobs — view logs and check status with simple commands.
- 🔢 Job Arrays: Run parameter sweeps by passing a list of arguments — one Slurm array task per entry, no boilerplate.
Install from PyPI:
```bash
pip install slurmpilot
```

Or clone and install in editable mode:

```bash
git clone https://github.com/geoalgo/slurmpilot.git
cd slurmpilot
pip install -e .
```

The `cluster` parameter controls where and how the job is submitted. The entrypoint and resource settings are the same regardless of mode.
Use `cluster="local"` when running Slurmpilot directly from a Slurm login node. No SSH connection is opened — `sbatch` is called directly.
```python
from slurmpilot import SlurmPilot, JobCreationInfo, unify

slurm = SlurmPilot(clusters=["local"])
job_info = JobCreationInfo(
    cluster="local",
    partition="YOURPARTITION",
    jobname=unify("hellocluster", method="date"),
    entrypoint="hellocluster_script.sh",
    src_dir="example/hellocluster",
    n_cpus=4,
    n_gpus=1,
    max_runtime_minutes=60,
)
job_id = slurm.schedule_job(job_info)
print(f"Job {job_id} submitted")
```

Job files are written to `~/slurmpilot/jobs/` locally; see `example/hellocluster/` for a full working example.
Use a named cluster to submit from your laptop. Slurmpilot syncs the source files and calls `sbatch` over SSH.
slurm = SlurmPilot(clusters=["YOURCLUSTER"])
job_info = JobCreationInfo(
cluster="YOURCLUSTER",
partition="YOURPARTITION",
jobname=unify("hellocluster", method="coolname"),
entrypoint="hellocluster_script.sh",
src_dir="example/hellocluster",
n_cpus=1,
max_runtime_minutes=60,
)
job_id = slurm.schedule_job(job_info)YOURCLUSTER must be reachable via SSH — verify with sp test-ssh YOURCLUSTER.
To configure a cluster, create a config file at `~/slurmpilot/config/clusters/YOUR_CLUSTER.yaml`:
```yaml
host: your-cluster-hostname
user: your-username          # optional, defaults to current user
remote_path: ~/slurmpilot    # optional, where files are stored on the cluster
default_partition: gpu       # optional, used when partition is not specified
account: your-account        # optional, Slurm account to charge
```

Optionally create `~/slurmpilot/config/general.yaml` for global settings:

```yaml
local_path: ~/slurmpilot        # where job files are stored locally
default_cluster: YOUR_CLUSTER   # used when cluster is not specified
```

Use `cluster="mock"` to run jobs as plain local processes — no Slurm installation required.
Slurm is then simulated: the generated bash script is executed as a subprocess and its PID is used as the job ID. This mode is only recommended for testing.
slurm = SlurmPilot(clusters=["mock"])
job_info = JobCreationInfo(
cluster="mock",
jobname="test/hellocluster",
entrypoint="hellocluster_script.sh",
src_dir="example/hellocluster",
)
job_id = slurm.schedule_job(job_info)
slurm.wait_completion(job_info.jobname, max_seconds=30)
stdout, stderr = slurm.log(job_info.jobname)Status is emulated by Slurmpilot: RUNNING while the process is alive, then COMPLETED / FAILED / CANCELLED based on its exit code. No cluster config file is needed.
Set `python_binary` to run a Python script instead of a bare shell script. All options below work in any submission mode (local, remote, or mock).
```python
job_info = JobCreationInfo(
    cluster="YOURCLUSTER",
    partition="YOURPARTITION",
    jobname=unify("python-job", method="date"),
    entrypoint="main.py",
    python_binary="~/miniconda3/bin/python",  # use a full path to your venv's python binary
    python_args="--data /path/to/data --epochs 10",
    n_cpus=2,
    n_gpus=1,
    mem=16000,
    env={
        "API_TOKEN": "your-token",
        "PYTHONUNBUFFERED": "1",  # flush stdout immediately so `sp log` shows output while the job runs
    },
)
```

`bash_setup_command` runs a shell command before the entrypoint — useful for activating a conda environment or loading modules:
```python
job_info = JobCreationInfo(
    ...
    bash_setup_command="source ~/miniconda3/etc/profile.d/conda.sh && conda activate myenv",
)
```

Pass a list to `python_args` to submit a job array — one Slurm task per element:
```python
job_info = JobCreationInfo(
    ...
    python_args=[
        {"lr": 0.001, "batch": 32},
        {"lr": 0.01, "batch": 16},
    ],
    n_concurrent_jobs=4,  # optional: limit to 4 tasks running at once
)
```

Each dict is converted to CLI arguments (e.g. `--lr 0.001 --batch 32`) for the corresponding array task.
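On the receiving side, each array task gets these flags as a normal command line. A minimal sketch of a matching entrypoint — the script body is hypothetical, only the `--lr`/`--batch` flags come from the example above:

```python
# main.py — hypothetical entrypoint parsing the generated array arguments
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--lr", type=float, required=True)
parser.add_argument("--batch", type=int, required=True)
args = parser.parse_args()

print(f"training with lr={args.lr}, batch={args.batch}")
```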
Ship additional local packages alongside your code with `python_libraries`. Each directory is copied into the job folder and added to `PYTHONPATH`:
```python
job_info = JobCreationInfo(
    ...
    python_libraries=["./custom_library"],
)
```

A working example is available in `example/python_dependencies/`. You can run it with:

```bash
sp launch --config example/python_dependencies/job.yaml --cluster local --partition YOURPARTITION
# or
python example/python_dependencies/launch_python_dependencies.py
```
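Inside the entrypoint, the shipped directory then resolves like any installed package since it is on `PYTHONPATH` — a minimal sketch (the `my_function` call is a hypothetical example, not part of the repo):

```python
# entrypoint sketch: custom_library is found via the PYTHONPATH set by Slurmpilot
import custom_library

custom_library.my_function()  # hypothetical call into the shipped package
```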
```bash
sp log my-experiment   # print logs for a specific job
sp log                 # print logs for the last submitted job
```

| Command | Description |
|---|---|
| `sp log [JOBNAME]` | Print stdout/stderr of a job |
| `sp status [JOBNAME]` | Print current Slurm state of a job |
| `sp metadata [JOBNAME]` | Print job metadata (cluster, date, …) |
| `sp path [JOBNAME]` | Show local and remote paths for a job |
| `sp slurm-script [JOBNAME]` | Print the generated Slurm script |
| `sp download [JOBNAME]` | Download the job folder from the cluster |
| `sp stop [JOBNAME]` | Cancel a running job |
| `sp queue-status [JOBNAME]` | Show queue position and priority of a pending job |
| Command | Description |
|---|---|
| `sp list-jobs [N] [--clusters C …]` | Print a table of the N most recent jobs (default 10) |
| `sp test-ssh CLUSTER …` | Test SSH connection to one or more clusters |
| `sp stop-all [--clusters C …]` | Cancel all tracked jobs on cluster(s) |
`--collapse-job-array` on `list-jobs` shows one row per job array instead of one row per task.
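For example, to list the 20 most recent jobs with one row per array:

```bash
sp list-jobs 20 --collapse-job-array
```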
`sp queue-status` runs `squeue` and reports the job's priority score, its rank among all `PENDING` jobs in the same partition, and the top priority score in that partition. Note: this requires your account to have permission to query the full partition queue, which is not always the case on shared clusters.
```
job       : my-experiment (id: 17026264)
partition : small-g
priority  : 5000 (top is 9999)
position  : 3 / 42 pending jobs
```

`sp queue-status` prints a "not pending" message when the job has already started running or has completed.
`sp launch` builds and submits a job from a YAML config file and/or inline CLI flags. CLI flags always override YAML values.
```bash
# Inline flags only
sp launch --entrypoint main.py --cluster mycluster --partition YOURPARTITION --n-gpus 1

# From a YAML config (src_dir defaults to the YAML file's directory)
sp launch --config job.yaml

# YAML with a one-off override
sp launch --config job.yaml --cluster local --partition debug

# Preview the generated sbatch script without submitting
sp launch --config job.yaml --dry-run

# Submit and block until the job finishes, then print logs
sp launch --config job.yaml --wait
sp launch --config job.yaml --wait --max-wait-seconds 3600
```

A minimal `job.yaml`:
```yaml
cluster: mycluster
partition: YOURPARTITION
entrypoint: train.py     # relative to the YAML file's directory
python_binary: python3   # use a full path (e.g. ~/venv/bin/python) to target a specific venv
python_args: "--epochs 10"
n_cpus: 4
n_gpus: 1
max_runtime_minutes: 120
env:
  PYTHONUNBUFFERED: "1"  # flush stdout immediately so `sp log` shows output while the job runs
```

For a job array, set `python_args` to a list:
```yaml
python_args:
  - lr: 0.001
    batch: 32
  - lr: 0.01
    batch: 16
```

`jobname` is auto-generated from the entrypoint stem via coolname if not provided (e.g. `train-charming-swift-otter-of-justice`). You can control this with `jobname_method: date` (appends a timestamp) or `jobname_method: coolname` (appends a random adjective-noun suffix).
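For example, adding the following line to `job.yaml` appends a timestamp instead, producing names like `train-2026-01-01-12-00-00`:

```yaml
jobname_method: date
```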
Example output from `sp list-jobs 5`:

```
job                             jobid  cluster  creation             min   status        nodelist
------------------------------  -----  -------  -------------------  ----  ------------  --------
job-2026-01-01                  42     mock     2026-01-01 00:00:00  5.0   ✅ COMPLETED  node01
python-job-2026-01-01-12-00-00  43     gpu-c    2026-01-01 12:00:00  12.3  🏃 RUNNING    gpu01
```
```
~/slurmpilot/config/
    general.yaml             # optional global settings
    clusters/
        YOUR_CLUSTER.yaml    # one file per cluster
```

`general.yaml`:

```yaml
local_path: ~/slurmpilot        # where job files are stored locally
default_cluster: YOUR_CLUSTER   # used when cluster is not specified
```

`clusters/YOUR_CLUSTER.yaml`:

```yaml
host: hostname-or-ip
user: your-username         # optional
remote_path: ~/slurmpilot   # optional
default_partition: gpu      # optional
account: slurm-account      # optional
```

Contributions are welcome! If you have ideas for improvements or find a bug, please open an issue or submit a pull request.
To set up a development environment:

```bash
git clone https://github.com/geoalgo/slurmpilot.git
cd slurmpilot
pip install -e ".[dev]"   # the [dev] extra installs test and lint dependencies
```

Run tests:

```bash
pytest
```

Run linting:

```bash
ruff check slurmpilot tst
```

How does it work?
When scheduling a job, the files required to run it are copied to `~/slurmpilot/jobs/YOUR_JOB_NAME` locally, then synced to the remote cluster. The following files are generated:

- `slurm_script.sh` — sbatch script generated from your `JobCreationInfo`
- `metadata.json` — job metadata (cluster, date, config)
- `jobid.json` — Slurm job ID after successful submission
- `src/` — copy of your source files
- `logs/stdout`, `logs/stderr` — job output (populated after the job runs)

The working directory on the remote node is `~/slurmpilot/jobs/YOUR_JOB_NAME`.
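Laid out on disk, a job folder therefore looks like this:

```
~/slurmpilot/jobs/YOUR_JOB_NAME/
├── slurm_script.sh
├── metadata.json
├── jobid.json
├── src/
└── logs/
    ├── stdout
    └── stderr
```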
Why SSH and not a cluster login node?
A typical workflow involves SSHing to a login node and calling sbatch there. Slurmpilot automates this so you can manage multiple clusters without ever leaving your local machine.
You can also run directly on a login node by using `cluster="local"`.
Why not Docker?
Docker is great for cloud tools (SkyPilot, SageMaker…) but is often unavailable on Slurm clusters due to root-privilege requirements. Slurmpilot sends source files directly, which is simpler and more portable.
What are the dependencies?
Only `pyyaml` and `coolname` are required at runtime. No pandas, no numpy.
What about other tools?
SkyPilot excels for cloud providers and Submitit is great for single-cluster, Python-native workflows. Slurmpilot targets the multi-cluster academic use case, where Docker is often unavailable: it sends raw source files directly (avoiding serialization issues) and provides a CLI-first interface for managing experiments across clusters.