Hello,
On my compute cluster, all PyTorch Lightning code hangs when using more than 1 GPU.
It hangs right at "initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/8".
Some relevant stats:
- 8-gpu ddp training
- NODE_RANK=0
- WORLD_SIZE=8
I have found that training does work if I unset NODE_RANK.
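For reference, a stripped-down sketch of the kind of script that hangs for me looks roughly like this (the model and dataset are toy placeholders, not my actual code):

```python
import torch
from torch.utils.data import DataLoader, Dataset
import pytorch_lightning as pl


class RandomDataset(Dataset):
    # Toy dataset, just so the trainer has something to consume.
    def __len__(self):
        return 64

    def __getitem__(self, idx):
        return torch.randn(32)


class BoringModel(pl.LightningModule):
    # Minimal model; the real one is more involved.
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        return self.layer(batch).sum()

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


if __name__ == "__main__":
    # NODE_RANK=0 and WORLD_SIZE=8 are exported by the cluster environment.
    trainer = pl.Trainer(gpus=8, accelerator="ddp", max_epochs=1)
    trainer.fit(BoringModel(), DataLoader(RandomDataset(), batch_size=8))
```

With NODE_RANK=0 and WORLD_SIZE=8 exported by the cluster, this hangs at the "initializing ddp" message above; with NODE_RANK unset, it runs.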
By instrumenting the pytorch-lightning code, I have observed the following:
- In the case where things hang, i.e. when NODE_RANK is set, select_accelerator() calls DDPHPCAccelerator every time, for all ranks. I guess this is because setting NODE_RANK seems to force use_torchelastic_ddp for all ranks.
- In the case where NODE_RANK isn't set, training works. Here, the first call to select_accelerator() calls DDPAccelerator() for the first rank, but then calls DDPHPCAccelerator for subsequent ranks.
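To make my mental model explicit, the behaviour I'm seeing looks roughly like the sketch below. This is simplified pseudocode based on my instrumentation, not the actual Lightning source:

```python
import os


def observed_accelerator_choice(rank: int) -> str:
    # Simplified sketch of what I observe in 1.0.8, not the real
    # select_accelerator() implementation.
    if "NODE_RANK" in os.environ:
        # Setting NODE_RANK appears to force the torchelastic / HPC path
        # for every rank, and this is the configuration that hangs for me.
        return "DDPHPCAccelerator"
    # With NODE_RANK unset, rank 0 gets the plain DDP accelerator, the
    # subsequently launched ranks get the HPC variant, and training runs.
    return "DDPAccelerator" if rank == 0 else "DDPHPCAccelerator"
```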
My questions:
- Why does setting NODE_RANK cause my code to hang?
- Could the dependence on environment variables be explicitly documented somewhere?
- I would also love to know: where does the spawn for DDP happen? I have been trying to read the code, but can't find where it happens.
What's your environment?
- Ubuntu 18.04
- pip
- pytorch-lightning 1.0.8
- torch 1.6.0