Hello,
On my compute cluster, all PyTorch Lightning code hangs when using more than 1 GPU.
It hangs right at "initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/8".
Some relevant stats:
- 8-gpu ddp training
- NODE_RANK=0
- WORLD_SIZE=8
I have found that training does work if I unset NODE_RANK.
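For reference, a stripped-down sketch of the kind of script that hangs for me looks roughly like this (the model and dataset are toy placeholders, not my actual code):

```python
import torch
from torch.utils.data import DataLoader, Dataset
import pytorch_lightning as pl


class RandomDataset(Dataset):
    # Toy dataset, just so the trainer has something to consume.
    def __len__(self):
        return 64

    def __getitem__(self, idx):
        return torch.randn(32)


class BoringModel(pl.LightningModule):
    # Minimal model; the real one is more involved.
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        return self.layer(batch).sum()

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


if __name__ == "__main__":
    # NODE_RANK=0 and WORLD_SIZE=8 are exported by the cluster environment.
    trainer = pl.Trainer(gpus=8, accelerator="ddp", max_epochs=1)
    trainer.fit(BoringModel(), DataLoader(RandomDataset(), batch_size=8))
```

With NODE_RANK=0 and WORLD_SIZE=8 exported by the cluster, this hangs at the "initializing ddp" message above; with NODE_RANK unset, it runs.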
By instrumenting the pytorch-lightning code, I have observed the following:
- In the case where things hang, i.e. when NODE_RANK is set, select_accelerator() calls DDPHPCAccelerator every time, for all ranks. I guess this is because setting NODE_RANK seems to force use_torchelastic_ddp for all ranks.
- In the case where NODE_RANK isn't set, training works. Here, the first call to select_accelerator() calls DDPAccelerator() for the first rank, but then calls DDPHPCAccelerator for subsequent ranks.
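To make my mental model explicit, the behaviour I'm seeing looks roughly like the sketch below. This is simplified pseudocode based on my instrumentation, not the actual Lightning source:

```python
import os


def observed_accelerator_choice(rank: int) -> str:
    # Simplified sketch of what I observe in 1.0.8, not the real
    # select_accelerator() implementation.
    if "NODE_RANK" in os.environ:
        # Setting NODE_RANK appears to force the torchelastic / HPC path
        # for every rank, and this is the configuration that hangs for me.
        return "DDPHPCAccelerator"
    # With NODE_RANK unset, rank 0 gets the plain DDP accelerator, the
    # subsequently launched ranks get the HPC variant, and training runs.
    return "DDPAccelerator" if rank == 0 else "DDPHPCAccelerator"
```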
My questions:
- Why does setting NODE_RANK cause my code to hang?
- Could the dependence on environment variables be explicitly documented somewhere?
- I would also love to know: where does the spawn for DDP happen? I have been trying to read the code, but can't find where it happens.
What's your environment?
- Ubuntu 18.04
- pip
- pytorch-lightning 1.0.8
- torch 1.6.0