
Direct-mode GPU accounting and inherited Slurm/PMI environment break multi-node GPU workflows #249

@nkeilbart

Description

Summary

There appear to be multiple issues in Torc's direct-mode behavior for GPU workflows inside multi-node Slurm allocations.

I hit a panic in JobRunner when running a 2-node allocation with 4 GPUs per node:

thread 'main' panicked at src/client/job_runner.rs:1825:9:
assertion failed: self.resources.num_gpus >= 0

While debugging this, it also became clear that direct mode has additional limitations / correctness issues for GPU jobs in multi-node allocations.

Environment / Topology

  • Slurm allocation: 2 nodes
  • GPUs per node: 4
  • Total GPUs in allocation: 8
  • Execution mode: direct
  • Workload type: single-node GPU jobs (num_gpus: 1)
  • Also investigated behavior for true multi-node GPU jobs

What I observed

1. Multi-node GPU accounting panic

In a 2-node allocation, the runner process on the first node saw:

echo $SLURM_JOB_GPUS
0,1,2,3

echo $CUDA_VISIBLE_DEVICES
0,1,2,3

nvidia-smi showed the correct GPU hardware on the node.

From reading the code, Torc appears to:

  1. derive allocation resources from Slurm startup env
  2. then later override GPU count from visible-device env vars like CUDA_VISIBLE_DEVICES / SLURM_JOB_GPUS

In a multi-node allocation, those env vars appear to be node-local, not allocation-wide. So a single runner in a 2-node allocation can incorrectly collapse the total GPU pool from
8 to 4, which then makes multi-node GPU accounting inconsistent and can trigger the num_gpus >= 0 assertion.
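The suspected collapse can be sketched as a guard on when the override is trusted. This is a hypothetical illustration, not Torc's actual code: effective_gpu_count and its arguments are invented names, and the fix shown (only honoring node-local visible-device counts in single-node allocations) is one possible resolution.

```rust
// Hypothetical sketch of the suspected bug and one possible fix.
// CUDA_VISIBLE_DEVICES / SLURM_JOB_GPUS are node-local, so a count
// derived from them should only override the allocation-wide total
// when the allocation spans a single node.
fn effective_gpu_count(alloc_gpus: u32, node_local_visible: Option<u32>, num_nodes: u32) -> u32 {
    match node_local_visible {
        // Single-node allocation: the node-local view is the whole pool.
        Some(v) if num_nodes == 1 => v,
        // Multi-node: keep the allocation-wide count derived at startup,
        // otherwise the pool collapses from (nodes * gpus_per_node) to
        // one node's worth and later accounting can underflow.
        _ => alloc_gpus,
    }
}

fn main() {
    // 2 nodes x 4 GPUs: runner sees CUDA_VISIBLE_DEVICES=0,1,2,3 (4 GPUs),
    // but the allocation-wide pool of 8 must be preserved.
    assert_eq!(effective_gpu_count(8, Some(4), 2), 8);
    // 1 node x 4 GPUs: trusting the node-local view is safe.
    assert_eq!(effective_gpu_count(4, Some(4), 1), 4);
    println!("ok");
}
```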

2. Direct mode with one runner cannot use GPUs on other nodes

My understanding after debugging this is:

  • In mode: direct, if start_one_worker_per_node is not enabled, there is one runner for the whole allocation.
  • That runner executes jobs directly on its own host.
  • It does not place jobs on the other nodes in the allocation.

If that understanding is correct, then a single direct-mode runner in a multi-node allocation cannot actually use remote-node GPUs, even if the resource accounting says they
exist.

This means direct mode without start_one_worker_per_node: true underutilizes multi-node GPU allocations.

3. Direct mode may oversubscribe local GPUs when total allocation GPUs > visible local GPUs

Again, if my reading is correct:

  • the scheduler/accounting can think there are 8 GPUs total in the 2-node allocation
  • but the local runner only has 4 visible GPUs on its host
  • after the first 4 local GPU assignments, Torc falls back to GPU reuse / round-robin

If so, direct mode with one runner can assign multiple jobs to the same visible GPU while still believing it is using the full allocation.
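The arithmetic of that oversubscription is easy to see in isolation. This is a toy model of the suspected round-robin fallback, not Torc's implementation:

```rust
// Toy model: accounting believes 8 GPUs exist, but only 4 are locally
// visible, so a modulo-style round-robin maps jobs 5-8 back onto the
// same 4 devices the first jobs already occupy.
fn assign_gpu(job_index: usize, local_gpu_count: usize) -> usize {
    job_index % local_gpu_count
}

fn main() {
    let assignments: Vec<usize> = (0..8).map(|j| assign_gpu(j, 4)).collect();
    // GPUs 0-3 each end up shared by two concurrent jobs.
    assert_eq!(assignments, vec![0, 1, 2, 3, 0, 1, 2, 3]);
    println!("ok");
}
```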

4. start_one_worker_per_node helps for single-node GPU jobs, but direct mode still leaks PMI/Slurm task env

When I enabled start_one_worker_per_node: true, I then hit errors like:

[PE_1]:inet_recv:inet_recv: recv error on nid001320 from nid001317 (fd=3) Connection reset by peer
[PE_1]:_pmi_network_barrier:_pmi_inet_recv from target 0 failed pmi errno -1
[PE_1]:control_nets_join:network_barrier failed
agent: ../src/mpid/common/cray/cray_pmi_utils.c:364: mpid_cray_pmi_init: Assertion `PMI2_Initialized()' failed.

and later:

[PE_0]:inet_listen_socket_setup:bind() failed [fd=3, port=63002 err='Address already in use']
[PE_0]:_pmi_inet_listen_socket_setup:socket setup failed
[PE_0]:control_nets_listen:_pmi_inet_listen_socket_setup (full) returned -1
agent: ../src/mpid/common/cray/cray_pmi_utils.c:364: mpid_cray_pmi_init: Assertion `PMI2_Initialized()' failed.

The job command being run by Torc was just:

./agent ./inputs > output

I did not add srun inside the job script.

This suggests that in direct mode, child jobs inherit the PMI / PMIx / Slurm task environment from the outer worker launch (for example, when per-node workers are launched under srun --ntasks-per-node=1), and MPI-linked binaries may then try to initialize against the wrong launcher context.
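One way to address this is to scrub the inherited step/task variables from the child's environment before spawn. The sketch below is an assumption about what such a scrub could look like, not Torc's code; the prefix list (which deliberately keeps job-level variables like SLURM_JOB_ID) is a guess and would need tuning:

```rust
use std::process::Command;

// Hypothetical predicate: variables belonging to the outer launcher's
// PMI/PMIx session or Slurm step/task context. Job-level variables
// (SLURM_JOB_ID, SLURM_JOB_GPUS, ...) are intentionally not matched.
fn is_launcher_var(key: &str) -> bool {
    const PREFIXES: &[&str] = &[
        "PMI_", "PMIX_", "SLURM_STEP", "SLURM_TASK", "SLURM_PROCID", "SLURM_LOCALID",
    ];
    PREFIXES.iter().any(|p| key.starts_with(p))
}

// Remove matching inherited variables from a child Command before spawn,
// so an MPI-linked binary falls back to its own launch context instead of
// attaching to the worker's srun step.
fn scrub_launcher_env(cmd: &mut Command) {
    for (key, _) in std::env::vars() {
        if is_launcher_var(&key) {
            cmd.env_remove(&key);
        }
    }
}

fn main() {
    assert!(is_launcher_var("PMI_RANK"));
    assert!(is_launcher_var("PMIX_SERVER_URI"));
    assert!(is_launcher_var("SLURM_STEP_ID"));
    assert!(!is_launcher_var("SLURM_JOB_ID")); // job-level env is kept

    // Example: prepare the failing job command with a scrubbed environment.
    let mut cmd = Command::new("./agent");
    cmd.arg("./inputs");
    scrub_launcher_env(&mut cmd);
    println!("ok");
}
```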

Questions / suspected design limitations

Based on the current behavior, I think the following may be true:

  1. Direct mode should preserve allocation-level GPU counts in multi-node allocations instead of replacing them with node-local visible-device env counts.
  2. Direct mode child jobs should scrub inherited PMI_, PMIX_, and Slurm step/task env before spawn.
  3. Direct mode without start_one_worker_per_node is not a good fit for multi-node GPU allocations, because one runner cannot actually execute on remote nodes.
  4. Direct mode likely cannot correctly support:
    • true multi-node GPU jobs
    • multi-GPU jobs that need GPUs across more than one node

If that understanding is correct, it would be helpful to document explicitly that:

  • mode: direct + start_one_worker_per_node: true is for many single-node jobs spread across nodes
  • mode: slurm is required for true multi-node GPU jobs / MPI-style jobs / jobs that need coordinated launch across nodes

Repro context

  • 2-node Slurm allocation
  • 4 GPUs per node
  • mode: direct
  • observed SLURM_JOB_GPUS=0,1,2,3 and CUDA_VISIBLE_DEVICES=0,1,2,3 on the node where the runner started
  • nvidia-smi showed the expected hardware on the node
  • panic triggered from GPU accounting in src/client/job_runner.rs

Requested outcome

At minimum, I think Torc should:

  • avoid panicking in this scenario
  • handle multi-node direct-mode GPU accounting consistently
  • document direct-mode limitations for multi-node GPU workloads
  • scrub PMI / PMIx / Slurm step env for direct-mode child job launch
