add tp example by inkcherry · Pull Request #964 · deepspeedai/DeepSpeedExamples

inkcherry · 2025-04-07T08:12:59Z

FYI, @hwchen2017

Signed-off-by: inkcherry <mingzhi.liu@intel.com>

The release versions are now available. Update from the master branch to use the minimum required versions instead. Also link the example: deepspeedai/DeepSpeedExamples#964 --------- Signed-off-by: inkcherry <mingzhi.liu@intel.com>

ekg · 2025-04-17T21:24:52Z

I'm unable to get this to work.

First I run: bash run.sh zero2 (all of the options fail with the same error)

Time to load fused_adam op: 0.1688675880432129 seconds
[rank4]:[E417 20:43:19.177767953 ProcessGroupNCCL.cpp:616] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=131076096, NumelOut=131076096, Timeout(ms)=600000) ran for 600016 milliseconds before timing out.
[rank4]:[E417 20:43:19.178407423 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 4] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 291, last completed NCCL work: -1.
[rank4]:[E417 20:43:19.178477173 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 4] Timeout at NCCL work: 1, last enqueued NCCL work: 291, last completed NCCL work: -1.
[rank4]:[E417 20:43:19.178491283 ProcessGroupNCCL.cpp:630] [Rank 4] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank4]:[E417 20:43:19.178503043 ProcessGroupNCCL.cpp:636] [Rank 4] To avoid data inconsistency, we are taking the entire process down.
[rank4]:[E417 20:43:19.180988563 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 4] Process group watchdog thread terminated with exception: [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=131076096, NumelOut=131076096, Timeout(ms)=600000) ran for 600016 milliseconds before timing out.
Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1729647429097/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7a237f9c8446 in /home/erikg/micromamba/envs/deepseed-examples/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7a232e5f14d2 in /home/erikg/micromamba/envs/deepseed-examples/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7a232e5f8913 in /home/erikg/micromamba/envs/deepseed-examples/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7a232e5fa37d in /home/erikg/micromamba/envs/deepseed-examples/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7a2386b785c0 in /home/erikg/micromamba/envs/deepseed-examples/lib/python3.11/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x9caa4 (0x7a238769caa4 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x129c3c (0x7a2387729c3c in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG ID 1 PG GUID 1 Rank 4] Process group watchdog thread terminated with exception: [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=131076096, NumelOut=131076096, Timeout(ms)=600000) ran for 600016 milliseconds before timing out.

What am I doing wrong?
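
For reading a log like the one above, the useful fields are in the watchdog line: `SeqNum=1` together with "last completed NCCL work: -1" suggests that the very first collective on this process group never finished, and `Timeout(ms)=600000` is a ten-minute limit. The following is a minimal sketch that pulls those fields out of the line; the function name `parse_watchdog_timeout` and the regular expression are illustrative assumptions, not part of PyTorch or DeepSpeed, and the pattern is only matched against the message format shown in this thread.

```python
import re

# Watchdog line copied from the log above, joined onto a single line.
LOG_LINE = (
    "[rank4]:[E417 20:43:19.177767953 ProcessGroupNCCL.cpp:616] [Rank 4] "
    "Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, "
    "OpType=BROADCAST, NumelIn=131076096, NumelOut=131076096, "
    "Timeout(ms)=600000) ran for 600016 milliseconds before timing out."
)

# Hypothetical pattern for this specific message format (an assumption;
# other PyTorch versions may word the message differently).
PATTERN = re.compile(
    r"\[Rank (?P<rank>\d+)\] Watchdog caught collective operation timeout: "
    r"WorkNCCL\(SeqNum=(?P<seq>\d+), OpType=(?P<op>\w+), "
    r"NumelIn=(?P<numel_in>\d+), NumelOut=(?P<numel_out>\d+), "
    r"Timeout\(ms\)=(?P<timeout_ms>\d+)\) ran for (?P<elapsed_ms>\d+) milliseconds"
)


def parse_watchdog_timeout(line):
    """Return the timeout fields as a dict, or None if the line does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    # Every captured field is numeric except the operation type.
    return {
        key: (value if key == "op" else int(value))
        for key, value in match.groupdict().items()
    }


info = parse_watchdog_timeout(LOG_LINE)
print(info)
print(f"timeout = {info['timeout_ms'] / 60000:.0f} minutes")
```

A first collective that never completes often points at communication setup (rank-to-GPU mapping, interconnect, or library versions) rather than at the training script itself, although the log alone does not prove that.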

inkcherry · 2025-04-18T00:39:22Z

Hi, ekg. If standard ZeRO-1/2 still fails to run properly, it may be due to an incorrect configuration of your CUDA and NCCL versions.
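
One way to act on this suggestion is to print the CUDA and NCCL versions that the installed PyTorch build reports before rerunning the example. This is a minimal sketch, assuming PyTorch is the only source of truth needed; the helper name `collect_versions` is hypothetical, and the script does not check that the reported versions are mutually compatible.

```python
import os


def collect_versions():
    """Collect CUDA/NCCL details as reported by PyTorch, if it is installed."""
    # NCCL_DEBUG is NCCL's own logging switch; record whether it is set.
    info = {"NCCL_DEBUG": os.environ.get("NCCL_DEBUG")}
    try:
        import torch
        import torch.distributed as dist
    except ImportError:
        info["torch"] = None
        return info

    info["torch"] = torch.__version__
    info["cuda_build"] = torch.version.cuda  # None on CPU-only builds
    info["cuda_available"] = torch.cuda.is_available()
    info["device_count"] = torch.cuda.device_count()
    info["nccl_available"] = dist.is_available() and dist.is_nccl_available()
    if info["cuda_available"] and info["nccl_available"]:
        # Only query NCCL on builds that actually ship it.
        info["nccl"] = torch.cuda.nccl.version()
    return info


versions = collect_versions()
for key, value in versions.items():
    print(f"{key}: {value}")
```

Running the example again with the environment variable `NCCL_DEBUG=INFO` set makes NCCL log its initialization, which can help show where the first broadcast stalls.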

inkcherry · 2025-04-18T00:40:38Z

@hwchen2017 just a reminder in case you miss this~ thanks.

The release versions are now available. Update from the master branch to use the minimum required versions instead. Also link the example: deepspeedai/DeepSpeedExamples#964 --------- Signed-off-by: inkcherry <mingzhi.liu@intel.com> Signed-off-by: yisheng <yi.sheng@intel.com>

* update tp example Signed-off-by: inkcherry <mingzhi.liu@intel.com> * update Signed-off-by: inkcherry <mingzhi.liu@intel.com> * add length bench file Signed-off-by: inkcherry <mingzhi.liu@intel.com> --------- Signed-off-by: inkcherry <mingzhi.liu@intel.com> Co-authored-by: Hongwei Chen <33092912+hwchen2017@users.noreply.github.com>

The release versions are now available. Update from the master branch to use the minimum required versions instead. Also link the example: deepspeedai/DeepSpeedExamples#964 --------- Signed-off-by: inkcherry <mingzhi.liu@intel.com> Signed-off-by: Max Kovalenko <mkovalenko@habana.ai>

inkcherry added 2 commits April 7, 2025 08:07

update tp example

5d87971

Signed-off-by: inkcherry <mingzhi.liu@intel.com>

update

06a7fbe

Signed-off-by: inkcherry <mingzhi.liu@intel.com>

inkcherry requested a review from tjruwase as a code owner April 7, 2025 08:13

inkcherry mentioned this pull request Apr 7, 2025

update dependencies version info deepspeedai/DeepSpeed#7206

Merged

add length bench file

592d28f

Signed-off-by: inkcherry <mingzhi.liu@intel.com>

inkcherry mentioned this pull request Apr 7, 2025

[BUG]AutoTP train get AssertionError: Data inconsistency within the TP group. deepspeedai/DeepSpeed#7199

Closed

hwchen2017 added 2 commits April 17, 2025 19:18

Merge branch 'master' into master

0aa0913

Merge branch 'master' into master

24ae2cb

hwchen2017 approved these changes May 23, 2025

hwchen2017 merged commit bd47e5b into deepspeedai:master May 23, 2025
2 checks passed

hwchen2017 merged 5 commits into deepspeedai:master from inkcherry:master