This repository was archived by the owner on Jan 6, 2023. It is now read-only.
🐛 Bug
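Component (check all that apply): state api, train_step api, train_loop, rendezvous, checkpoint, rollback, metrics, petctl, examples, docker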
To Reproduce
Steps to reproduce the behavior:
1. Launch a 2-node job on Kubernetes + Volcano.
2. Run: LOGLEVEL=INFO python -m torch.distributed.run --rdzv_backend c10d --rdzv_id 1 --rdzv_endpoint "$VC_SH_0_HOSTS" --nnodes 2 echo hello
3. The rendezvous times out because the rank 0 host does not realize it is the master: its hostname resolution is insufficient to match the endpoint against itself.
root@sh-db2kkt73p534vd-sh-0-0:/app# echo $VC_SH_0_HOSTS
sh-db2kkt73p534vd-sh-0-0.sh-db2kkt73p534vd
root@sh-db2kkt73p534vd-sh-0-0:/app# hostname
sh-db2kkt73p534vd-sh-0-0
root@sh-db2kkt73p534vd-sh-0-0:/app# cat /etc/resolv.conf
nameserver 10.100.0.10
search default.svc.cluster.local svc.cluster.local cluster.local us-west-2.compute.internal
options ndots:5
root@sh-db2kkt73p534vd-sh-0-0:/app# cat /etc/hosts
# Kubernetes-managed hosts file.
127.0.0.1 localhost
::1 localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
fe00::0 ip6-mcastprefix
fe00::1 ip6-allnodes
fe00::2 ip6-allrouters
192.168.15.246 sh-db2kkt73p534vd-sh-0-0.sh-db2kkt73p534vd.default.svc.cluster.local sh-db2kkt73p534vd-sh-0-0
The hostname is sh-db2kkt73p534vd-sh-0-0, but Volcano gives the address sh-db2kkt73p534vd-sh-0-0.sh-db2kkt73p534vd. Between /etc/hosts, /etc/resolv.conf, and the hostname, all the information required to recognize that these addresses are equivalent is available, but the current logic isn't sufficient to do so.
https://github.com/pytorch/pytorch/blob/1b745efbe8ee0ac3bae594ea88ff27e71a734c88/torch/distributed/elastic/rendezvous/utils.py#L110
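To illustrate why the information above is enough, here is a minimal sketch that combines the /etc/hosts entries and the resolv.conf search domains from this pod to show that the two names map to the same IP. The helper names (parse_hosts, candidate_names, same_host) are hypothetical and are not part of torch.distributed; the sample data is copied from the output above.

```python
from typing import Dict, Iterable, Set

# Sample data copied from the pod output in this issue.
HOSTNAME = "sh-db2kkt73p534vd-sh-0-0"
VOLCANO_ADDRESS = "sh-db2kkt73p534vd-sh-0-0.sh-db2kkt73p534vd"
SEARCH_DOMAINS = [
    "default.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
    "us-west-2.compute.internal",
]
HOSTS_TEXT = """\
# Kubernetes-managed hosts file.
127.0.0.1 localhost
::1 localhost ip6-localhost ip6-loopback
192.168.15.246 sh-db2kkt73p534vd-sh-0-0.sh-db2kkt73p534vd.default.svc.cluster.local sh-db2kkt73p534vd-sh-0-0
"""


def parse_hosts(hosts_text: str) -> Dict[str, str]:
    """Map every name in /etc/hosts-style text to its IP (first entry wins)."""
    table: Dict[str, str] = {}
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            table.setdefault(name.lower(), ip)
    return table


def candidate_names(name: str, search_domains: Iterable[str]) -> Set[str]:
    """The name itself plus the name with each search domain appended."""
    name = name.rstrip(".").lower()
    return {name} | {f"{name}.{domain.lower()}" for domain in search_domains}


def same_host(a: str, b: str, hosts_text: str, search_domains: Iterable[str]) -> bool:
    """True if some expansion of `a` and some expansion of `b` share an IP."""
    table = parse_hosts(hosts_text)
    domains = list(search_domains)
    ips_a = {table[n] for n in candidate_names(a, domains) if n in table}
    ips_b = {table[n] for n in candidate_names(b, domains) if n in table}
    return bool(ips_a & ips_b)


print(same_host(HOSTNAME, VOLCANO_ADDRESS, HOSTS_TEXT, SEARCH_DOMAINS))  # True
```

Appending the first search domain to the Volcano address yields the fully qualified name listed in /etc/hosts, which shares the IP 192.168.15.246 with the short hostname. In practice the system resolver already performs this expansion, so hand-parsing these files is only meant to demonstrate the equivalence, not to suggest a fix.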
We may want to do a full DNS resolution of the address and check whether it matches any of the local IP addresses.
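A rough sketch of that idea, using only the standard library, might look like the following. The function names (resolve_ips, local_ips, is_local_endpoint) are hypothetical and this is not the existing PyTorch implementation. It also assumes that the addresses the machine's own hostname resolves to, plus loopback, are a good enough approximation of "local IP addresses"; enumerating every network interface would need extra work.

```python
import socket
from typing import Set


def resolve_ips(host: str) -> Set[str]:
    """All addresses the system resolver returns for `host` (empty on failure).

    socket.getaddrinfo goes through the system resolver, so /etc/hosts and
    the resolv.conf search list are normally taken into account.
    """
    try:
        infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
    except OSError:  # socket.gaierror is a subclass of OSError
        return set()
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return {info[4][0] for info in infos}


def local_ips() -> Set[str]:
    """Approximate set of this machine's addresses (assumption: hostname + loopback)."""
    return {"127.0.0.1", "::1"} | resolve_ips(socket.gethostname())


def is_local_endpoint(host: str) -> bool:
    """True if `host` resolves to at least one address that belongs to this machine."""
    return bool(resolve_ips(host) & local_ips())
```

On the pod above, one would expect is_local_endpoint("sh-db2kkt73p534vd-sh-0-0.sh-db2kkt73p534vd") to return True on the rank 0 host, since both that name and the short hostname should resolve to 192.168.15.246, though this has not been verified on that cluster.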
Expected behavior
The launcher recognizes that the rendezvous endpoint refers to the current node and starts the c10d server on it.
Environment
torchelastic version (e.g. 0.1.0rc1):
OS (e.g., Linux): Linux sh-db2kkt73p534vd-sh-0-0 4.14.241-184.433.amzn2.x86_64 #1 SMP Wed Aug 4 14:35:15 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
How you installed torchelastic (conda, pip, source, docker): docker
Docker image and tag (if using docker): https://github.com/pytorch/torchx/pkgs/container/torchx/15644476?tag=0.1.2dev0
Build command you used (if compiling from source):
Git commit (if installed from source):
Python version: 3.7.11
CUDA/cuDNN version:
GPU models and configuration:
Execution environment (on-prem, aws, etc): EKS + Volcano
Any other relevant information:
Additional context