Problem
Compute nodes may not be able to run sbatch/srun to submit jobs back to the slurmctld (head node). This is a known class of SLURM configuration problems that can affect workflows requiring job-from-job submission (e.g., Snakemake, Nextflow, or manual sbatch calls from within a running job).
Key SLURM Config Areas to Investigate
Authentication
- Is the munge daemon running and healthy on all compute nodes?
- Consider adding auth/jwt as an AuthAltTypes fallback in slurm.conf; this can help when munge has transient issues
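As a sketch, the relevant slurm.conf lines might look like the following (the option names are real SLURM parameters, but the key path and the decision to enable JWT at all are illustrative assumptions):

```
# slurm.conf (must be identical on every node)
AuthType=auth/munge
# Optional fallback authentication; requires a JWT key set up on the controller
AuthAltTypes=auth/jwt
AuthAltParameters=jwt_key=/var/spool/slurm/statesave/jwt_hs256.key
```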
Network / Firewall
- Port 6817 (slurmctld default) must be reachable from compute nodes
- SrunPortRange may need explicit configuration so compute nodes can communicate back to the controller
- Check for any firewall rules or security groups blocking traffic between compute nodes and the head node
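A hedged sketch of the network-related slurm.conf settings (the port numbers are SLURM's defaults plus an example range; choose a range your firewall actually permits between nodes):

```
# slurm.conf
SlurmctldPort=6817
SlurmdPort=6818
# Fixed range for srun's callback ports; this range must be open
# between compute nodes and the head node
SrunPortRange=60001-63000
```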
Timeouts
- MessageTimeout defaults to 10 seconds, which may be too low under load
- Alpine's working config uses MessageTimeout=90; worth matching, or at least increasing the default
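If matching Alpine's value, the slurm.conf change is a single line (90 seconds is what Alpine uses, not a universal recommendation):

```
# slurm.conf
MessageTimeout=90
```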
Config Consistency
- slurm.conf must be identical across all nodes (head node + compute nodes)
- A mismatch can cause silent failures or authentication errors
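One way to verify consistency is to compare checksums of each node's slurm.conf. The snippet below simulates the comparison with two local copies; on a real cluster you would first gather each node's /etc/slurm/slurm.conf (e.g. via ssh). Paths and contents here are illustrative.

```shell
# Simulate slurm.conf copies collected from two nodes (on a real cluster,
# fetch each node's /etc/slurm/slurm.conf instead)
printf 'SlurmctldHost=head\nMessageTimeout=90\n' > /tmp/slurm.conf.head
printf 'SlurmctldHost=head\nMessageTimeout=90\n' > /tmp/slurm.conf.node01

# Count distinct checksums; exactly 1 means every copy is identical
md5sum /tmp/slurm.conf.head /tmp/slurm.conf.node01 \
  | awk '{print $1}' | sort -u | wc -l
```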
Diagnostic Commands (run from a compute node)
# Check munge status
systemctl status munge
# Test munge communication
munge -n | unmunge
# Test slurmctld connectivity
scontrol ping
# Try submitting a trivial job
sbatch --wrap="hostname"
# Check SLURM logs for errors
journalctl -u slurmd -n 50
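To exercise the exact failure mode described above (job-from-job submission), a further check is to submit a job whose payload is itself an sbatch call. This is a sketch that only works on a live SLURM cluster; the echo marker is an illustrative convention, not SLURM machinery:

```
# Outer job runs on a compute node; if the inner sbatch succeeds there,
# compute-node -> slurmctld submission is working
sbatch --wrap='sbatch --wrap="hostname" && echo inner-submit-ok'
# Then check the outer job's output file for "inner-submit-ok"
```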
Reference
Alpine cluster's SLURM config is a working example of multi-node job submission — their slurm.conf settings for MessageTimeout, SrunPortRange, and AuthAltTypes can serve as a baseline.
Acceptance Criteria
- slurm.conf consistency verified across all nodes
- sbatch from within a running job on a compute node succeeds