Problem
Compute nodes may not be able to run sbatch/srun to submit jobs back to the slurmctld (head node). This is a known class of SLURM configuration problems that can affect workflows requiring job-from-job submission (e.g., Snakemake, Nextflow, or manual sbatch calls from within a running job).
Key SLURM Config Areas to Investigate
Authentication
- Is the munge daemon running and healthy on all compute nodes?
- Consider adding auth/jwt as an AuthAltTypes fallback in slurm.conf; this can help when munge has transient issues
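As a sketch, the relevant slurm.conf lines might look like the following (the option names are real SLURM parameters, but the key path and the decision to enable JWT at all are illustrative assumptions):

```
# slurm.conf (must be identical on every node)
AuthType=auth/munge
# Optional fallback authentication; requires a JWT key set up on the controller
AuthAltTypes=auth/jwt
AuthAltParameters=jwt_key=/var/spool/slurm/statesave/jwt_hs256.key
```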
Network / Firewall
- Port 6817 (slurmctld default) must be reachable from compute nodes
- SrunPortRange may need explicit configuration so compute nodes can communicate back to the controller
- Check for any firewall rules or security groups blocking traffic between compute nodes and the head node
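A hedged sketch of the network-related slurm.conf settings (the port numbers are SLURM's defaults plus an example range; choose a range your firewall actually permits between nodes):

```
# slurm.conf
SlurmctldPort=6817
SlurmdPort=6818
# Fixed range for srun's callback ports; this range must be open
# between compute nodes and the head node
SrunPortRange=60001-63000
```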
Timeouts
- MessageTimeout defaults to 10 seconds, which may be too low under load
- Alpine's working config uses MessageTimeout=90; worth matching, or at least increasing the default
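If matching Alpine's value, the slurm.conf change is a single line (90 seconds is what Alpine uses, not a universal recommendation):

```
# slurm.conf
MessageTimeout=90
```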
Config Consistency
- slurm.conf must be identical across all nodes (head node + compute nodes)
- A mismatch can cause silent failures or authentication errors
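One way to verify consistency is to compare checksums of each node's slurm.conf. The snippet below simulates the comparison with two local copies; on a real cluster you would first gather each node's /etc/slurm/slurm.conf (e.g. via ssh). Paths and contents here are illustrative.

```shell
# Simulate slurm.conf copies collected from two nodes (on a real cluster,
# fetch each node's /etc/slurm/slurm.conf instead)
printf 'SlurmctldHost=head\nMessageTimeout=90\n' > /tmp/slurm.conf.head
printf 'SlurmctldHost=head\nMessageTimeout=90\n' > /tmp/slurm.conf.node01

# Count distinct checksums; exactly 1 means every copy is identical
md5sum /tmp/slurm.conf.head /tmp/slurm.conf.node01 \
  | awk '{print $1}' | sort -u | wc -l
```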
Diagnostic Commands (run from a compute node)
# Check munge status
systemctl status munge
# Test munge communication
munge -n | unmunge
# Test slurmctld connectivity
scontrol ping
# Try submitting a trivial job
sbatch --wrap="hostname"
# Check SLURM logs for errors
journalctl -u slurmd -n 50
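To exercise the exact failure mode described above (job-from-job submission), a further check is to submit a job whose payload is itself an sbatch call. This is a sketch that only works on a live SLURM cluster; the echo marker is an illustrative convention, not SLURM machinery:

```
# Outer job runs on a compute node; if the inner sbatch succeeds there,
# compute-node -> slurmctld submission is working
sbatch --wrap='sbatch --wrap="hostname" && echo inner-submit-ok'
# Then check the outer job's output file for "inner-submit-ok"
```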
Reference
Alpine cluster's SLURM config is a working example of multi-node job submission — their slurm.conf settings for MessageTimeout, SrunPortRange, and AuthAltTypes can serve as a baseline.
Acceptance Criteria
- slurm.conf consistency verified across all nodes
- sbatch from within a running job on a compute node succeeds