
Direct-mode GPU accounting and inherited Slurm/PMI environment break multi-node GPU workflows #249

@nkeilbart

Description

Summary

There appear to be multiple issues in Torc's direct-mode behavior for GPU workflows inside multi-node Slurm allocations.

I hit a panic in JobRunner when running a 2-node allocation with 4 GPUs per node:

thread 'main' panicked at src/client/job_runner.rs:1825:9:
assertion failed: self.resources.num_gpus >= 0

While debugging this, it also became clear that direct mode has additional limitations / correctness issues for GPU jobs in multi-node allocations.

Environment / Topology

  • Slurm allocation: 2 nodes
  • GPUs per node: 4
  • Total GPUs in allocation: 8
  • Execution mode: direct
  • Workload type: single-node GPU jobs (num_gpus: 1)
  • Also investigated behavior for true multi-node GPU jobs

What I observed

1. Multi-node GPU accounting panic

In a 2-node allocation, the runner process on the first node saw:

echo $SLURM_JOB_GPUS
0,1,2,3

echo $CUDA_VISIBLE_DEVICES
0,1,2,3

nvidia-smi showed the correct GPU hardware on the node.

From reading the code, Torc appears to:

  1. derive allocation resources from Slurm startup env
  2. then later override GPU count from visible-device env vars like CUDA_VISIBLE_DEVICES / SLURM_JOB_GPUS

In a multi-node allocation, those env vars appear to be node-local, not allocation-wide. So a single runner in a 2-node allocation can incorrectly collapse the total GPU pool from
8 to 4, which then makes multi-node GPU accounting inconsistent and can trigger the num_gpus >= 0 assertion.
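The suspected collapse can be sketched as a guard on when the override is trusted. This is a hypothetical illustration, not Torc's actual code: effective_gpu_count and its arguments are invented names, and the fix shown (only honoring node-local visible-device counts in single-node allocations) is one possible resolution.

```rust
// Hypothetical sketch of the suspected bug and one possible fix.
// CUDA_VISIBLE_DEVICES / SLURM_JOB_GPUS are node-local, so a count
// derived from them should only override the allocation-wide total
// when the allocation spans a single node.
fn effective_gpu_count(alloc_gpus: u32, node_local_visible: Option<u32>, num_nodes: u32) -> u32 {
    match node_local_visible {
        // Single-node allocation: the node-local view is the whole pool.
        Some(v) if num_nodes == 1 => v,
        // Multi-node: keep the allocation-wide count derived at startup,
        // otherwise the pool collapses from (nodes * gpus_per_node) to
        // one node's worth and later accounting can underflow.
        _ => alloc_gpus,
    }
}

fn main() {
    // 2 nodes x 4 GPUs: runner sees CUDA_VISIBLE_DEVICES=0,1,2,3 (4 GPUs),
    // but the allocation-wide pool of 8 must be preserved.
    assert_eq!(effective_gpu_count(8, Some(4), 2), 8);
    // 1 node x 4 GPUs: trusting the node-local view is safe.
    assert_eq!(effective_gpu_count(4, Some(4), 1), 4);
    println!("ok");
}
```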

2. Direct mode with one runner cannot use GPUs on other nodes

My understanding after debugging this is:

  • In mode: direct, if start_one_worker_per_node is not enabled, there is one runner for the whole allocation.
  • That runner executes jobs directly on its own host.
  • It does not place jobs on the other nodes in the allocation.

If that understanding is correct, then a single direct-mode runner in a multi-node allocation cannot actually use remote-node GPUs, even if the resource accounting says they
exist.

This means direct mode without start_one_worker_per_node: true underutilizes multi-node GPU allocations.

3. Direct mode may oversubscribe local GPUs when total allocation GPUs > visible local GPUs

Again, if my reading is correct:

  • the scheduler/accounting can think there are 8 GPUs total in the 2-node allocation
  • but the local runner only has 4 visible GPUs on its host
  • after the first 4 local GPU assignments, Torc falls back to GPU reuse / round-robin

If so, direct mode with one runner can assign multiple jobs to the same visible GPU while still believing it is using the full allocation.
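The arithmetic of that oversubscription is easy to see in isolation. This is a toy model of the suspected round-robin fallback, not Torc's implementation:

```rust
// Toy model: accounting believes 8 GPUs exist, but only 4 are locally
// visible, so a modulo-style round-robin maps jobs 5-8 back onto the
// same 4 devices the first jobs already occupy.
fn assign_gpu(job_index: usize, local_gpu_count: usize) -> usize {
    job_index % local_gpu_count
}

fn main() {
    let assignments: Vec<usize> = (0..8).map(|j| assign_gpu(j, 4)).collect();
    // GPUs 0-3 each end up shared by two concurrent jobs.
    assert_eq!(assignments, vec![0, 1, 2, 3, 0, 1, 2, 3]);
    println!("ok");
}
```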

4. start_one_worker_per_node helps for single-node GPU jobs, but direct mode still leaks PMI/Slurm task env

When I enabled start_one_worker_per_node: true, I then hit errors like:

[PE_1]:inet_recv:inet_recv: recv error on nid001320 from nid001317 (fd=3) Connection reset by peer
[PE_1]:_pmi_network_barrier:_pmi_inet_recv from target 0 failed pmi errno -1
[PE_1]:control_nets_join:network_barrier failed
agent: ../src/mpid/common/cray/cray_pmi_utils.c:364: mpid_cray_pmi_init: Assertion `PMI2_Initialized()' failed.

and later:

[PE_0]:inet_listen_socket_setup:bind() failed [fd=3, port=63002 err='Address already in use']
[PE_0]:_pmi_inet_listen_socket_setup:socket setup failed
[PE_0]:control_nets_listen:_pmi_inet_listen_socket_setup (full) returned -1
agent: ../src/mpid/common/cray/cray_pmi_utils.c:364: mpid_cray_pmi_init: Assertion `PMI2_Initialized()' failed.

The job command being run by Torc was just:

./agent ./inputs > output

I did not add srun inside the job script.

This suggests that in direct mode, child jobs inherit the PMI / PMIx / Slurm task environment from the outer worker launch (for example, when per-node workers are launched under srun --ntasks-per-node=1), and MPI-linked binaries may then try to initialize against the wrong launcher context.
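One way to address this is to scrub the inherited step/task variables from the child's environment before spawn. The sketch below is an assumption about what such a scrub could look like, not Torc's code; the prefix list (which deliberately keeps job-level variables like SLURM_JOB_ID) is a guess and would need tuning:

```rust
use std::process::Command;

// Hypothetical predicate: variables belonging to the outer launcher's
// PMI/PMIx session or Slurm step/task context. Job-level variables
// (SLURM_JOB_ID, SLURM_JOB_GPUS, ...) are intentionally not matched.
fn is_launcher_var(key: &str) -> bool {
    const PREFIXES: &[&str] = &[
        "PMI_", "PMIX_", "SLURM_STEP", "SLURM_TASK", "SLURM_PROCID", "SLURM_LOCALID",
    ];
    PREFIXES.iter().any(|p| key.starts_with(p))
}

// Remove matching inherited variables from a child Command before spawn,
// so an MPI-linked binary falls back to its own launch context instead of
// attaching to the worker's srun step.
fn scrub_launcher_env(cmd: &mut Command) {
    for (key, _) in std::env::vars() {
        if is_launcher_var(&key) {
            cmd.env_remove(&key);
        }
    }
}

fn main() {
    assert!(is_launcher_var("PMI_RANK"));
    assert!(is_launcher_var("PMIX_SERVER_URI"));
    assert!(is_launcher_var("SLURM_STEP_ID"));
    assert!(!is_launcher_var("SLURM_JOB_ID")); // job-level env is kept

    // Example: prepare the failing job command with a scrubbed environment.
    let mut cmd = Command::new("./agent");
    cmd.arg("./inputs");
    scrub_launcher_env(&mut cmd);
    println!("ok");
}
```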

Questions / suspected design limitations

Based on the current behavior, I think the following may be true:

  1. Direct mode should preserve allocation-level GPU counts in multi-node allocations instead of replacing them with node-local visible-device env counts.
  2. Direct mode child jobs should scrub inherited PMI_, PMIX_, and Slurm step/task env before spawn.
  3. Direct mode without start_one_worker_per_node is not a good fit for multi-node GPU allocations, because one runner cannot actually execute on remote nodes.
  4. Direct mode likely cannot correctly support:
    • true multi-node GPU jobs
    • multi-GPU jobs that need GPUs across more than one node

If that understanding is correct, it would be helpful to document explicitly that:

  • mode: direct + start_one_worker_per_node: true is for many single-node jobs spread across nodes
  • mode: slurm is required for true multi-node GPU jobs / MPI-style jobs / jobs that need coordinated launch across nodes

Repro context

  • 2-node Slurm allocation
  • 4 GPUs per node
  • mode: direct
  • observed SLURM_JOB_GPUS=0,1,2,3 and CUDA_VISIBLE_DEVICES=0,1,2,3 on the node where the runner started
  • nvidia-smi showed the expected hardware on the node
  • panic triggered from GPU accounting in src/client/job_runner.rs

Requested outcome

At minimum, I think Torc should:

  • avoid panicking in this scenario
  • handle multi-node direct-mode GPU accounting consistently
  • document direct-mode limitations for multi-node GPU workloads
  • scrub PMI / PMIx / Slurm step env for direct-mode child job launch
