I've tried reproducing your results, which works well on a single GPU; it just takes a long time to train. So your recent DDP addition was an exciting new feature I had to try out.
Unfortunately, scaling from 1x H100 to 8x H100 (in the same DGX-H100 node) drops throughput from 8.71 it/sec to 0.61 it/sec. Even assuming each step now does 8x as much work, that's an effective 4.88 it/sec, still slower than the single-GPU baseline.
Did I miss some config options?
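For reference, here's how I'm computing the effective throughput and scaling efficiency above (the it/sec numbers are from my runs; the "8x work per step" assumption is mine, i.e. that DDP keeps the per-GPU batch size and scales the global batch):

```python
# Work-equivalent throughput under DDP, assuming each step processes
# world_size times the single-GPU batch (my assumption, see above).
single_gpu_it_per_sec = 8.71   # 1x H100 baseline
ddp_it_per_sec = 0.61          # 8x H100, per-step rate
world_size = 8

# Effective it/sec if we credit each DDP step with 8x the work.
effective = ddp_it_per_sec * world_size

# Fraction of ideal linear scaling (ideal = baseline rate on all 8 GPUs).
efficiency = effective / (single_gpu_it_per_sec * world_size)

print(f"effective throughput: {effective:.2f} it/sec")   # 4.88
print(f"scaling efficiency:   {efficiency:.1%}")         # 7.0%
```

So even under the most generous accounting, 8 GPUs deliver about 7% of linear scaling.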