Summary
There appear to be multiple issues in Torc's direct-mode behavior for GPU workflows inside multi-node Slurm allocations.
I hit a panic in JobRunner when running a 2-node allocation with 4 GPUs per node:
thread 'main' panicked at src/client/job_runner.rs:1825:9:
assertion failed: self.resources.num_gpus >= 0
While debugging this, it also became clear that direct mode has additional limitations / correctness issues for GPU jobs in multi-node allocations.
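One low-cost mitigation for the panic itself would be to replace the assertion with checked accounting that surfaces an error instead of aborting. A minimal sketch (the function and field names here are illustrative, not Torc's actual internals):

```rust
/// Illustrative sketch: decrement a GPU pool without asserting `>= 0`.
/// On underflow the runner gets a recoverable error rather than a panic.
fn release_gpus(pool: &mut i64, count: i64) -> Result<(), String> {
    let next = *pool - count;
    if next < 0 {
        return Err(format!(
            "GPU accounting underflow: pool={} release={}",
            pool, count
        ));
    }
    *pool = next;
    Ok(())
}

fn main() {
    let mut pool = 4;
    release_gpus(&mut pool, 3).unwrap();
    // A second oversized release reports the inconsistency instead of aborting.
    if let Err(e) = release_gpus(&mut pool, 2) {
        eprintln!("{e}");
    }
}
```

This would not fix the underlying node-local vs. allocation-wide mismatch described below, but it turns an abort into a diagnosable error.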
Environment / Topology
- Slurm allocation: 2 nodes
- GPUs per node: 4
- Total GPUs in allocation: 8
- Execution mode: direct
- Workload type: single-node GPU jobs (num_gpus: 1)
- Also investigated behavior for true multi-node GPU jobs
What I observed
1. Multi-node GPU accounting panic
In a 2-node allocation, the runner process on the first node saw:
echo $SLURM_JOB_GPUS
0,1,2,3
echo $CUDA_VISIBLE_DEVICES
0,1,2,3
nvidia-smi showed the correct GPU hardware on the node.
From reading the code, Torc appears to:
- derive allocation resources from Slurm startup env
- then later override GPU count from visible-device env vars like CUDA_VISIBLE_DEVICES / SLURM_JOB_GPUS
In a multi-node allocation, those env vars appear to be node-local, not allocation-wide. So a single runner in a 2-node allocation can incorrectly collapse the total GPU pool from 8 to 4, which makes multi-node GPU accounting inconsistent and can trigger the num_gpus >= 0 assertion.
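If the node-local override is indeed the cause, one defensive shape for the fix would be to prefer allocation-wide Slurm variables and only trust visible-device lists for single-node allocations. A sketch under that assumption (SLURM_JOB_NUM_NODES and SLURM_GPUS_ON_NODE are standard Slurm variables, but the helper itself is hypothetical, not Torc's API):

```rust
/// Hypothetical helper: derive the allocation-wide GPU count.
/// For multi-node allocations, multiply nodes by per-node GPUs and
/// ignore node-local visible-device lists; for a single node, the
/// visible-device list is authoritative.
fn allocation_gpu_count(
    num_nodes: Option<usize>,      // parsed from SLURM_JOB_NUM_NODES
    gpus_on_node: Option<usize>,   // parsed from SLURM_GPUS_ON_NODE
    visible_devices: Option<&str>, // raw CUDA_VISIBLE_DEVICES
) -> usize {
    match (num_nodes, gpus_on_node) {
        // Multi-node: node-local env must not shrink the pool.
        (Some(n), Some(g)) if n > 1 => n * g,
        // Single node (or missing Slurm vars): count visible devices.
        _ => visible_devices
            .map(|v| v.split(',').filter(|s| !s.is_empty()).count())
            .or(gpus_on_node)
            .unwrap_or(0),
    }
}

fn main() {
    // The reported topology: 2 nodes x 4 GPUs, CUDA_VISIBLE_DEVICES=0,1,2,3
    let total = allocation_gpu_count(Some(2), Some(4), Some("0,1,2,3"));
    println!("total GPUs in allocation: {total}"); // 8, not 4
}
```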
2. Direct mode with one runner cannot use GPUs on other nodes
My understanding after debugging this is:
- In mode: direct, if start_one_worker_per_node is not enabled, there is one runner for the whole allocation.
- That runner executes jobs directly on its own host.
- It does not place jobs on the other nodes in the allocation.
If that understanding is correct, then a single direct-mode runner in a multi-node allocation cannot actually use remote-node GPUs, even if the resource accounting says they exist.
This means direct mode without start_one_worker_per_node: true underutilizes multi-node GPU allocations.
3. Direct mode may oversubscribe local GPUs when total allocation GPUs > visible local GPUs
Again, if my reading is correct:
- the scheduler/accounting can think there are 8 GPUs total in the 2-node allocation
- but the local runner only has 4 visible GPUs on its host
- after the first 4 local GPU assignments, Torc falls back to GPU reuse / round-robin
If so, direct mode with one runner can assign multiple jobs to the same visible GPU while still believing it is using the full allocation.
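The suspected failure mode is easy to see with a toy model (this is not Torc's code; it only illustrates the arithmetic once accounting believes in more GPUs than are locally visible):

```rust
use std::collections::HashMap;

/// Toy model of the suspected reuse: accounting hands out `total_gpus`
/// slots, but only `local_gpus` devices exist, so assignments past the
/// local pool wrap around onto already-used devices.
fn assigned_device(job_index: usize, local_gpus: usize) -> usize {
    job_index % local_gpus
}

fn main() {
    let (total_gpus, local_gpus) = (8usize, 4usize);
    let mut per_device: HashMap<usize, usize> = HashMap::new();
    for job in 0..total_gpus {
        *per_device
            .entry(assigned_device(job, local_gpus))
            .or_insert(0) += 1;
    }
    // Each of the 4 visible devices ends up running 2 jobs.
    for dev in 0..local_gpus {
        println!("GPU {dev}: {} job(s)", per_device[&dev]);
    }
}
```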
4. start_one_worker_per_node helps for single-node GPU jobs, but direct mode still leaks PMI/Slurm task env
When I enabled start_one_worker_per_node: true, I then hit errors like:
[PE_1]:inet_recv:inet_recv: recv error on nid001320 from nid001317 (fd=3) Connection reset by peer
[PE_1]:_pmi_network_barrier:_pmi_inet_recv from target 0 failed pmi errno -1
[PE_1]:control_nets_join:network_barrier failed
agent: ../src/mpid/common/cray/cray_pmi_utils.c:364: mpid_cray_pmi_init: Assertion `PMI2_Initialized()' failed.
and later:
[PE_0]:inet_listen_socket_setup:bind() failed [fd=3, port=63002 err='Address already in use']
[PE_0]:_pmi_inet_listen_socket_setup:socket setup failed
[PE_0]:control_nets_listen:_pmi_inet_listen_socket_setup (full) returned -1
agent: ../src/mpid/common/cray/cray_pmi_utils.c:364: mpid_cray_pmi_init: Assertion `PMI2_Initialized()' failed.
The job command being run by Torc was just:
./agent ./inputs > output
I did not add srun inside the job script.
This suggests that in direct mode, child jobs inherit the PMI / PMIx / Slurm task environment from the outer worker launch (for example, when per-node workers are launched under srun --ntasks-per-node=1), and MPI-linked binaries may then try to initialize against the wrong launcher context.
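If the inheritance theory is right, the fix shape would be to scrub the launcher-specific environment before spawning the child. A sketch; the exact variable set below is an assumption (allocation-level variables like SLURM_JOB_GPUS would be kept, only step/task/PMI context removed):

```rust
use std::process::Command;

/// Assumed scrub list: env prefixes and names that tie a child process
/// to the outer srun/PMI launch context. The precise set would need
/// verification against the target MPI stack.
const LAUNCHER_PREFIXES: &[&str] = &["PMI_", "PMIX_", "SLURM_STEP_"];
const LAUNCHER_VARS: &[&str] =
    &["SLURM_PROCID", "SLURM_LOCALID", "SLURM_NODEID", "SLURM_NTASKS"];

fn is_launcher_var(key: &str) -> bool {
    LAUNCHER_PREFIXES.iter().any(|p| key.starts_with(p))
        || LAUNCHER_VARS.contains(&key)
}

/// Build a child command with the launcher env removed, so an MPI-linked
/// binary re-initializes cleanly instead of joining the worker's PMI job.
fn scrubbed_command(program: &str) -> Command {
    let mut cmd = Command::new(program);
    for (key, _) in std::env::vars() {
        if is_launcher_var(&key) {
            cmd.env_remove(&key);
        }
    }
    cmd
}

fn main() {
    // e.g. the failing job from this report (shell redirection omitted):
    let _cmd = scrubbed_command("./agent"); // then .arg("./inputs").spawn()
}
```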
Questions / suspected design limitations
Based on the current behavior, I think the following may be true:
- Direct mode should preserve allocation-level GPU counts in multi-node allocations instead of replacing them with node-local visible-device env counts.
- Direct mode child jobs should scrub inherited PMI_, PMIX_, and Slurm step/task env before spawn.
- Direct mode without start_one_worker_per_node is not a good fit for multi-node GPU allocations, because one runner cannot actually execute on remote nodes.
- Direct mode likely cannot correctly support:
- true multi-node GPU jobs
- multi-GPU jobs that need GPUs across more than one node
If that understanding is correct, it would be helpful to document explicitly that:
- mode: direct + start_one_worker_per_node: true is for many single-node jobs spread across nodes
- mode: slurm is required for true multi-node GPU jobs / MPI-style jobs / jobs that need coordinated launch across nodes
Repro context
- 2-node Slurm allocation
- 4 GPUs per node
- mode: direct
- observed SLURM_JOB_GPUS=0,1,2,3 and CUDA_VISIBLE_DEVICES=0,1,2,3 on the node where the runner started
- nvidia-smi showed the expected hardware on the node
- panic triggered from GPU accounting in src/client/job_runner.rs
Requested outcome
At minimum, I think Torc should:
- avoid panicking in this scenario
- handle multi-node direct-mode GPU accounting consistently
- document direct-mode limitations for multi-node GPU workloads
- scrub PMI / PMIx / Slurm step env for direct-mode child job launch