
NODE_RANK causes DDP jobs to hang at initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/8 #5651


Description

@ajtao

Hello,

On my compute cluster, any PyTorch Lightning job hangs when training with more than one GPU.
It hangs right at "initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/8".

Some relevant stats (a minimal repro sketch follows this list):

  • 8-GPU DDP training
  • NODE_RANK=0
  • WORLD_SIZE=8
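
For context, here is a minimal sketch of the kind of script that hits this. The model and data below are placeholders rather than my real code, and the cluster launcher exports NODE_RANK=0 and WORLD_SIZE=8 before the script starts:

```python
import torch
from torch.utils.data import DataLoader, Dataset
import pytorch_lightning as pl


class RandomDataset(Dataset):
    # Placeholder data, not my real dataset.
    def __init__(self, size=32, length=256):
        self.data = torch.randn(length, size)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]


class BoringModel(pl.LightningModule):
    # Placeholder model, not my real network.
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        return self.layer(batch).sum()

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


if __name__ == "__main__":
    # With NODE_RANK and WORLD_SIZE exported by the launcher, fit() never gets
    # past "initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/8".
    trainer = pl.Trainer(gpus=8, accelerator="ddp", max_epochs=1)
    trainer.fit(BoringModel(), DataLoader(RandomDataset(), batch_size=8))
```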

I have found that training does work if I unset NODE_RANK.
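
Concretely, the only change needed to unblock training is removing that variable before the Trainer is built (equivalent to unset NODE_RANK in the shell):

```python
import os

# Workaround: with NODE_RANK removed from the environment, the same 8-GPU DDP
# job initializes and trains normally.
os.environ.pop("NODE_RANK", None)
```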

By instrumenting the pytorch-lightning code (a sketch of the instrumentation follows this list), I have observed that:

  1. In the case where things hang, i.e. NODE_RANK is set, select_accelerator() calls DDPHPCAccelerator every time, for all ranks. I suspect this is because setting NODE_RANK forces use_torchelastic_ddp for all ranks.
  2. In the case where NODE_RANK is not set, training works. Here, the first call to select_accelerator() creates a DDPAccelerator for the first rank, then a DDPHPCAccelerator for each subsequent rank.
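
The instrumentation is nothing fancier than print statements; roughly:

```python
import os

# Roughly what I added while debugging: dump the env vars that appear to steer
# accelerator selection so the NODE_RANK-set and NODE_RANK-unset runs can be
# compared, plus a plain print() inside select_accelerator() to log which
# accelerator class each rank ends up with.
for var in ("NODE_RANK", "WORLD_SIZE", "LOCAL_RANK", "MASTER_ADDR", "MASTER_PORT"):
    print(f"pid={os.getpid()} {var}={os.environ.get(var)}")
```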

My questions:

  • Why does setting NODE_RANK cause my code to hang?
  • Could the dependence on these environment variables be explicitly stated / documented somewhere?
  • Where does the process spawn for DDP actually happen? I have tried to read the code but cannot find where it happens.

What's your environment?

  • OS: Ubuntu 18.04
  • Install method: pip
  • pytorch-lightning: 1.0.8
  • torch: 1.6.0


Labels: question (Further information is requested)
