Merged
Signed-off-by: inkcherry <mingzhi.liu@intel.com>
github-merge-queue bot
pushed a commit
to deepspeedai/DeepSpeed
that referenced
this pull request
Apr 7, 2025
The release versions are now available. Update from the master branch to use the minimum required versions instead. Also link the example: deepspeedai/DeepSpeedExamples#964. Signed-off-by: inkcherry <mingzhi.liu@intel.com>
github-merge-queue bot
pushed a commit
to deepspeedai/DeepSpeed
that referenced
this pull request
Apr 8, 2025
The release versions are now available. Update from the master branch to use the minimum required versions instead. Also link the example: deepspeedai/DeepSpeedExamples#964. Signed-off-by: inkcherry <mingzhi.liu@intel.com>
I'm unable to get this to work. First I run:
Time to load fused_adam op: 0.1688675880432129 seconds
[rank4]:[E417 20:43:19.177767953 ProcessGroupNCCL.cpp:616] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=131076096, NumelOut=131076096, Timeout(ms)=600000) ran for 600016 milliseconds before timing out.
[rank4]:[E417 20:43:19.178407423 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 4] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 291, last completed NCCL work: -1.
[rank4]:[E417 20:43:19.178477173 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 4] Timeout at NCCL work: 1, last enqueued NCCL work: 291, last completed NCCL work: -1.
[rank4]:[E417 20:43:19.178491283 ProcessGroupNCCL.cpp:630] [Rank 4] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank4]:[E417 20:43:19.178503043 ProcessGroupNCCL.cpp:636] [Rank 4] To avoid data inconsistency, we are taking the entire process down.
[rank4]:[E417 20:43:19.180988563 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 4] Process group watchdog thread terminated with exception: [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=131076096, NumelOut=131076096, Timeout(ms)=600000) ran for 600016 milliseconds before timing out.
Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1729647429097/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7a237f9c8446 in /home/erikg/micromamba/envs/deepseed-examples/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7a232e5f14d2 in /home/erikg/micromamba/envs/deepseed-examples/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7a232e5f8913 in /home/erikg/micromamba/envs/deepseed-examples/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7a232e5fa37d in /home/erikg/micromamba/envs/deepseed-examples/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7a2386b785c0 in /home/erikg/micromamba/envs/deepseed-examples/lib/python3.11/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x9caa4 (0x7a238769caa4 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x129c3c (0x7a2387729c3c in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG ID 1 PG GUID 1 Rank 4] Process group watchdog thread terminated with exception: [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=131076096, NumelOut=131076096, Timeout(ms)=600000) ran for 600016 milliseconds before timing out.
What am I doing wrong?
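For readers triaging a similar log, here is a minimal sketch that pulls the key fields out of one watchdog line. The regular expression and the helper name `parse_watchdog_timeout` are illustrative assumptions written against the message format shown in this log. They are not part of PyTorch or DeepSpeed, and the format may differ in other PyTorch versions.

```python
import re

# One watchdog line copied from the log above.
LINE = (
    "[rank4]:[E417 20:43:19.177767953 ProcessGroupNCCL.cpp:616] [Rank 4] "
    "Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, "
    "OpType=BROADCAST, NumelIn=131076096, NumelOut=131076096, "
    "Timeout(ms)=600000) ran for 600016 milliseconds before timing out."
)

# Hypothetical pattern, written only against the message format seen above.
PATTERN = re.compile(
    r"\[Rank (?P<rank>\d+)\] Watchdog caught collective operation timeout: "
    r"WorkNCCL\(SeqNum=(?P<seq>\d+), OpType=(?P<op>\w+), "
    r"NumelIn=(?P<numel_in>\d+), NumelOut=(?P<numel_out>\d+), "
    r"Timeout\(ms\)=(?P<timeout_ms>\d+)\) ran for (?P<elapsed_ms>\d+) milliseconds"
)


def parse_watchdog_timeout(line):
    """Return the timeout fields as a dict, or None if the line does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    # Every captured field is numeric except the operation type.
    return {k: (v if k == "op" else int(v)) for k, v in match.groupdict().items()}


info = parse_watchdog_timeout(LINE)
print(info)
```

In this log the timed-out operation is a BROADCAST with SeqNum=1, and the watchdog also reports "last completed NCCL work: -1", so no collective on that process group had finished before the 600000 ms (10 minute) timeout. That pattern often points to a setup or connectivity problem rather than a slow training step, though the log alone does not prove it.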
Contributor
Author
Hi ekg, if standard ZeRO-1/2 still fails to run properly, the cause may be an incorrect configuration of your CUDA and NCCL versions.
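One way to follow up on the CUDA/NCCL suggestion is to print the versions PyTorch was built with and compare them across environments. The sketch below is a hedged example, not an official DeepSpeed diagnostic. The helper name `collect_versions` is hypothetical, and the code returns `None` for any value that is unavailable, for example when PyTorch is not installed or was built without NCCL.

```python
import os


def collect_versions():
    """Gather versions relevant to an NCCL hang; values stay None when unavailable."""
    report = {
        "torch": None,
        "cuda": None,
        "nccl": None,
        "gpus": None,
        # NCCL reads this variable; setting it to INFO makes NCCL log its setup.
        "NCCL_DEBUG": os.environ.get("NCCL_DEBUG"),
    }
    try:
        import torch
    except ImportError:
        return report  # PyTorch is not installed in this environment
    report["torch"] = torch.__version__
    report["cuda"] = torch.version.cuda  # None on CPU-only builds
    try:
        report["gpus"] = torch.cuda.device_count()
        report["nccl"] = torch.cuda.nccl.version()
    except Exception:
        pass  # build without NCCL support, or no usable CUDA runtime
    return report


report = collect_versions()
for key, value in report.items():
    print(f"{key}: {value}")
```

Running this in the same environment as the training job, and exporting `NCCL_DEBUG=INFO` before relaunching, gives more detail about where NCCL initialization stalls.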
Contributor
Author
@hwchen2017 just a reminder in case you missed this. Thanks.
ys950902
pushed a commit
to ys950902/DeepSpeed
that referenced
this pull request
May 21, 2025
The release versions are now available. Update from the master branch to use the minimum required versions instead. Also link the example: deepspeedai/DeepSpeedExamples#964. Signed-off-by: inkcherry <mingzhi.liu@intel.com> Signed-off-by: yisheng <yi.sheng@intel.com>
hwchen2017
approved these changes
May 23, 2025
hwchen2017
added a commit
that referenced
this pull request
Jun 8, 2025
* update tp example
* update
* add length bench file
Signed-off-by: inkcherry <mingzhi.liu@intel.com>
Co-authored-by: Hongwei Chen <33092912+hwchen2017@users.noreply.github.com>
deepcharm
pushed a commit
to deepcharm/DeepSpeed
that referenced
this pull request
Jun 16, 2025
The release versions are now available. Update from the master branch to use the minimum required versions instead. Also link the example: deepspeedai/DeepSpeedExamples#964. Signed-off-by: inkcherry <mingzhi.liu@intel.com> Signed-off-by: Max Kovalenko <mkovalenko@habana.ai>
FYI, @hwchen2017