This repo is for study purposes: it explores the PyTorch internals in more depth.
The runs folder contains the Python scripts used for the experiments.
Run the command below in a terminal.
Use this command only for the files listed after it.
NCCL_DEBUG=ERROR torchrun --standalone --nproc-per-node=4 -m runs.torch_ring_all_reduce

- runs/torch_ring_all_reduce.py
- runs/torch_p2p_comm.py
- runs/torch_distributed_hook_check.py

Some environment variables to make use of:
- NCCL_DEBUG=ERROR
- TORCH_LOGS=+dynamo,guards,recompiles,+inductor,graph_breaks
- TORCH_DISTRIBUTED_DEBUG=DETAIL
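
The environment variables above are set as prefixes on the torchrun command line. The sketch below prints one launch command per script, combining two of them. The `launch_cmd` helper is hypothetical and not part of the repo, and it assumes every listed script can be started as a module with `-m`, the same way as `runs.torch_ring_all_reduce`.

```shell
#!/bin/sh
# Hypothetical helper (not part of the repo): build the launch command for one
# script under runs/, given its module name without the .py suffix.
launch_cmd() {
  # Assumption: each listed script can be started as a module with -m,
  # the same way as runs.torch_ring_all_reduce.
  echo "NCCL_DEBUG=ERROR TORCH_DISTRIBUTED_DEBUG=DETAIL torchrun --standalone --nproc-per-node=4 -m runs.$1"
}

# Print the command for each of the three scripts; nothing is executed here.
for module in torch_ring_all_reduce torch_p2p_comm torch_distributed_hook_check; do
  launch_cmd "$module"
done
```

Printing the commands instead of running them keeps the sketch safe on a machine without 4 GPUs. Copy a printed line into the terminal to launch that script.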