add tp example #964

Merged
hwchen2017 merged 5 commits into deepspeedai:master from inkcherry:master on May 23, 2025

@inkcherry
Contributor

FYI, @hwchen2017

Signed-off-by: inkcherry <mingzhi.liu@intel.com>
github-merge-queue bot pushed a commit to deepspeedai/DeepSpeed that referenced this pull request Apr 7, 2025
The release versions are now available. Update from using the master branch to
using the minimum required versions instead.
Also link the example: deepspeedai/DeepSpeedExamples#964

---------

Signed-off-by: inkcherry <mingzhi.liu@intel.com>
github-merge-queue bot pushed a commit to deepspeedai/DeepSpeed that referenced this pull request Apr 8, 2025
@ekg

ekg commented Apr 17, 2025

I'm unable to get this to work.

First I run: bash run.sh zero2 (all of the options fail with the same error)

Time to load fused_adam op: 0.1688675880432129 seconds
[rank4]:[E417 20:43:19.177767953 ProcessGroupNCCL.cpp:616] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=131076096, NumelOut=131076096, Timeout(ms)=600000) ran for 600016 milliseconds before timing out.
[rank4]:[E417 20:43:19.178407423 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 4] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 291, last completed NCCL work: -1.
[rank4]:[E417 20:43:19.178477173 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 4] Timeout at NCCL work: 1, last enqueued NCCL work: 291, last completed NCCL work: -1.
[rank4]:[E417 20:43:19.178491283 ProcessGroupNCCL.cpp:630] [Rank 4] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank4]:[E417 20:43:19.178503043 ProcessGroupNCCL.cpp:636] [Rank 4] To avoid data inconsistency, we are taking the entire process down.
[rank4]:[E417 20:43:19.180988563 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 4] Process group watchdog thread terminated with exception: [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=131076096, NumelOut=131076096, Timeout(ms)=600000) ran for 600016 milliseconds before timing out.
Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1729647429097/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7a237f9c8446 in /home/erikg/micromamba/envs/deepseed-examples/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7a232e5f14d2 in /home/erikg/micromamba/envs/deepseed-examples/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7a232e5f8913 in /home/erikg/micromamba/envs/deepseed-examples/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7a232e5fa37d in /home/erikg/micromamba/envs/deepseed-examples/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7a2386b785c0 in /home/erikg/micromamba/envs/deepseed-examples/lib/python3.11/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x9caa4 (0x7a238769caa4 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x129c3c (0x7a2387729c3c in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG ID 1 PG GUID 1 Rank 4] Process group watchdog thread terminated with exception: [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=131076096, NumelOut=131076096, Timeout(ms)=600000) ran for 600016 milliseconds before timing out.

What am I doing wrong?

@inkcherry
Contributor Author

inkcherry commented Apr 18, 2025

Hi ekg, if standard ZeRO-1/2 also fails to run properly, the cause may be an incorrect configuration of your CUDA and NCCL versions.
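
As a starting point for checking that suggestion, the following is a minimal sketch (not part of the example in this PR) that prints the PyTorch, CUDA, and NCCL versions visible to the current Python environment. The helper name `collect_versions` is hypothetical; the `torch` attributes it reads are standard, but the values reported depend entirely on the local install.

```python
import importlib.util


def collect_versions():
    """Return the PyTorch, CUDA, and NCCL versions visible to this environment.

    Hypothetical helper for illustration; it only reads version information
    and does not start any distributed job.
    """
    info = {"torch": None, "cuda": None, "nccl": None, "gpu_count": 0}
    if importlib.util.find_spec("torch") is None:
        return info  # PyTorch is not installed in this environment

    import torch

    info["torch"] = torch.__version__
    info["cuda"] = torch.version.cuda  # None on CPU-only builds
    if torch.cuda.is_available():
        info["gpu_count"] = torch.cuda.device_count()
        try:
            import torch.cuda.nccl

            # NCCL version PyTorch was built against
            info["nccl"] = torch.cuda.nccl.version()
        except Exception:
            info["nccl"] = None  # NCCL not available in this build
    return info


if __name__ == "__main__":
    for name, value in collect_versions().items():
        print(f"{name}: {value}")
```

Running this on every node and comparing the output is one way to spot a mismatch. Setting the `NCCL_DEBUG=INFO` environment variable before launching the failing job also makes NCCL log its initialization details, which may help narrow down where the first broadcast stalls.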

@inkcherry
Contributor Author

@hwchen2017, just a reminder in case you missed this. Thanks.

ys950902 pushed a commit to ys950902/DeepSpeed that referenced this pull request May 21, 2025
Signed-off-by: inkcherry <mingzhi.liu@intel.com>
Signed-off-by: yisheng <yi.sheng@intel.com>
hwchen2017 merged commit bd47e5b into deepspeedai:master on May 23, 2025
2 checks passed
hwchen2017 added a commit that referenced this pull request Jun 8, 2025
* update tp example

Signed-off-by: inkcherry <mingzhi.liu@intel.com>

* update

Signed-off-by: inkcherry <mingzhi.liu@intel.com>

* add length bench file

Signed-off-by: inkcherry <mingzhi.liu@intel.com>

---------

Signed-off-by: inkcherry <mingzhi.liu@intel.com>
Co-authored-by: Hongwei Chen <33092912+hwchen2017@users.noreply.github.com>
deepcharm pushed a commit to deepcharm/DeepSpeed that referenced this pull request Jun 16, 2025
Signed-off-by: inkcherry <mingzhi.liu@intel.com>
Signed-off-by: Max Kovalenko <mkovalenko@habana.ai>