perf: DeepEP interface in megatron backend #1794
Signed-off-by: Parth Mannan <pmannan@nvidia.com>
Signed-off-by: Parth Mannan <pmannan@nvidia.com>
…1640) Signed-off-by: Wenwen Gao <94138584+snowmanwwg@users.noreply.github.com> Signed-off-by: Parth Mannan <pmannan@nvidia.com>
Signed-off-by: ruit <ruit@nvidia.com> Signed-off-by: Parth Mannan <pmannan@nvidia.com>
…1605) Signed-off-by: Terry Kong <terryk@nvidia.com> Signed-off-by: Terry Kong <terrycurtiskong@gmail.com> Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Signed-off-by: Parth Mannan <pmannan@nvidia.com>
Signed-off-by: alexandery <alexandery@nvidia.com> Signed-off-by: Brian Yu <bxyu@nvidia.com> Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com> Signed-off-by: Sahil Modi <samodi@nvidia.com> Signed-off-by: ruit <ruit@nvidia.com> Signed-off-by: Jonas Yang <joyang@nvidia.com> Signed-off-by: ZeYi Lin <944270057@qq.com> Signed-off-by: Alexander Zhipa <azzhipa@amazon.com> Signed-off-by: Yuki Huang <yukih@nvidia.com> Signed-off-by: Lawrence Lane <llane@nvidia.com> Signed-off-by: Terry Kong <terryk@nvidia.com> Co-authored-by: alexandery-nvidia <alexandery@nvidia.com> Co-authored-by: Yi-Fu Wu <yifu.wu@gmail.com> Co-authored-by: Peter Jin <pjin@nvidia.com> Co-authored-by: samodi-nv <141948907+samodi-nv@users.noreply.github.com> Co-authored-by: ruit <ruit@nvidia.com> Co-authored-by: Jonas Yang <joyang@nvidia.com> Co-authored-by: Ze-Yi LIN <58305964+Zeyi-Lin@users.noreply.github.com> Co-authored-by: Alexander Zhipa <alex.zhipa@proton.me> Co-authored-by: Alexander Zhipa <azzhipa@amazon.com> Co-authored-by: Terry Kong <terrycurtiskong@gmail.com> Co-authored-by: Yuki Huang <yukih@nvidia.com> Co-authored-by: Manasa Manohara <mmanohara@nvidia.com> Co-authored-by: Lawrence Lane <llane@nvidia.com> Co-authored-by: Yu Yao <54727607+yaoyu-33@users.noreply.github.com> Co-authored-by: Terry Kong <terryk@nvidia.com> Signed-off-by: Parth Mannan <pmannan@nvidia.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com> Signed-off-by: Parth Mannan <pmannan@nvidia.com>
Signed-off-by: ashors1 <ashors@nvidia.com> Signed-off-by: Parth Mannan <pmannan@nvidia.com>
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com> Signed-off-by: Charlie Truong <chtruong@nvidia.com> Signed-off-by: Peter Jin <pjin@nvidia.com> Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com> Co-authored-by: Guyue Huang <guyueh@nvidia.com> Co-authored-by: Charlie Truong <chtruong@nvidia.com> Co-authored-by: Peter Jin <pjin@nvidia.com> Co-authored-by: Dong Hyuk Chang <donghyukc@nvidia.com> Signed-off-by: Parth Mannan <pmannan@nvidia.com>
Signed-off-by: Sahger Lad <lad.sahger@gmail.com> Signed-off-by: sahgerlad <36946563+sahgerlad@users.noreply.github.com> Co-authored-by: Yuki Huang <yukih@nvidia.com> Signed-off-by: Parth Mannan <pmannan@nvidia.com>
Signed-off-by: Parth Mannan <pmannan@nvidia.com>
Signed-off-by: Sahger Lad <lad.sahger@gmail.com> Signed-off-by: Parth Mannan <pmannan@nvidia.com>
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com> Signed-off-by: Parth Mannan <pmannan@nvidia.com>
Signed-off-by: root <root@pool0-00514.cm.cluster> Co-authored-by: root <root@pool0-00514.cm.cluster> Signed-off-by: Parth Mannan <pmannan@nvidia.com>
Signed-off-by: Guyue Huang <guyueh@nvidia.com> Signed-off-by: Guyue Huang <guyueh@login-lyris02.lyris.clusters.nvidia.com> Co-authored-by: Guyue Huang <guyueh@login-lyris02.lyris.clusters.nvidia.com> Co-authored-by: Seonjin <sna@nvidia.com> Signed-off-by: Parth Mannan <pmannan@nvidia.com>
Signed-off-by: Guyue Huang <guyueh@nvidia.com> Signed-off-by: Parth Mannan <pmannan@nvidia.com>
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com> Signed-off-by: Parth Mannan <pmannan@nvidia.com>
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com> Signed-off-by: Parth Mannan <pmannan@nvidia.com>
Signed-off-by: Hemil Desai <hemild@nvidia.com> Co-authored-by: Hemil Desai <hemild@nvidia.com>
Signed-off-by: Hemil Desai <hemild@nvidia.com> Signed-off-by: Parth Mannan <pmannan@nvidia.com>
…A) (#1648) Signed-off-by: ruit <ruit@nvidia.com> Signed-off-by: Parth Mannan <pmannan@nvidia.com>
Signed-off-by: Hemil Desai <hemild@nvidia.com> Signed-off-by: Parth Mannan <pmannan@nvidia.com>
Signed-off-by: Parth Mannan <pmannan@nvidia.com>
Signed-off-by: Zhaopeng Qiu <alexq@nvidia.com> Signed-off-by: Parth Mannan <pmannan@nvidia.com>
Signed-off-by: Hemil Desai <hemild@nvidia.com> Signed-off-by: Parth Mannan <pmannan@nvidia.com>
Signed-off-by: ruit <ruit@nvidia.com> Signed-off-by: Parth Mannan <pmannan@nvidia.com>
Signed-off-by: Guyue Huang <guyueh@nvidia.com> Signed-off-by: Parth Mannan <pmannan@nvidia.com>
#1715) Signed-off-by: Guyue Huang <guyueh@nvidia.com> Signed-off-by: Parth Mannan <pmannan@nvidia.com>
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com> Signed-off-by: Parth Mannan <pmannan@nvidia.com>
📝 Walkthrough
Adds three new MoE-related configuration fields.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~12 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 2 failed (2 warnings)
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@nemo_rl/models/policy/__init__.py`:
- Around lines 186-195: Update the TypedDict in nemo_rl/models/policy/__init__.py
to mark moe_enable_deepep, moe_token_dispatcher_type, and
moe_shared_expert_overlap as NotRequired and document their recommended defaults
(e.g., False, 'allgather', False). Then change the access in
megatron_policy_worker.py (around lines ~661–667) to use
config.get('moe_enable_deepep', False), config.get('moe_token_dispatcher_type',
'allgather'), and config.get('moe_shared_expert_overlap', False) so that missing
keys do not raise a KeyError. Finally, add those three keys with the recommended
default values to the exemplar YAMLs (grpo_math_70B_megatron.yaml,
grpo_math_8B_megatron.yaml, grpo_math_qwen30ba3b_megatron.yaml).
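The suggestion above can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: the class name `MegatronMoEOptions`, the `MOE_DEFAULTS` table, and the helper `read_moe_options` are hypothetical, and only the three key names and their defaults come from the review comment.

```python
from typing import Any, Mapping, TypedDict

try:
    from typing import NotRequired  # Python 3.11+
except ImportError:  # older interpreters need the typing_extensions backport
    from typing_extensions import NotRequired


class MegatronMoEOptions(TypedDict):
    """Hypothetical stand-in for the MoE part of the policy config TypedDict."""

    # All three keys are optional; defaults follow the review comment.
    moe_enable_deepep: NotRequired[bool]  # default: False
    moe_token_dispatcher_type: NotRequired[str]  # default: "allgather"
    moe_shared_expert_overlap: NotRequired[bool]  # default: False


# Defaults recommended in the review comment, kept in one place.
MOE_DEFAULTS: dict[str, Any] = {
    "moe_enable_deepep": False,
    "moe_token_dispatcher_type": "allgather",
    "moe_shared_expert_overlap": False,
}


def read_moe_options(config: Mapping[str, Any]) -> dict[str, Any]:
    """Read the three MoE keys with .get() so a missing key cannot raise KeyError."""
    return {key: config.get(key, default) for key, default in MOE_DEFAULTS.items()}


# A config that predates the new keys still resolves to the defaults.
legacy_config: MegatronMoEOptions = {}
resolved_legacy = read_moe_options(legacy_config)

# A config that sets only one key keeps the defaults for the others.
partial_config: MegatronMoEOptions = {"moe_enable_deepep": True}
resolved_partial = read_moe_options(partial_config)
```

Reading through `.get()` keeps existing YAML recipes working unchanged, while recipes that list the keys explicitly document the intended values.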
@terrykong could you review? This is urgently needed for a recent DeepSeek performance study.
terrykong
left a comment
Lgtm. Can you please resolve the two comments?
Fyi: @yuki-97
What does this PR do ?
Add interface to configure deep_ep usage in megatron backend
closes #1396
Dup of #1645
Issues
List issues that this PR closes (syntax):
Usage
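A possible usage sketch, since the PR template leaves this section empty. It assumes the three keys sit in a plain dict standing in for the Megatron section of the policy config, and it assumes (to my understanding of Megatron-Core, which should be verified against the version in use) that DeepEP is only honored with the `flex` token dispatcher. The helper `check_deepep_options` is hypothetical and not part of the PR.

```python
from typing import Any, Mapping


def check_deepep_options(megatron_cfg: Mapping[str, Any]) -> dict[str, Any]:
    """Resolve the three MoE keys and reject an inconsistent DeepEP setup.

    Assumption: DeepEP requires moe_token_dispatcher_type == "flex".
    """
    enable_deepep = megatron_cfg.get("moe_enable_deepep", False)
    dispatcher = megatron_cfg.get("moe_token_dispatcher_type", "allgather")
    shared_overlap = megatron_cfg.get("moe_shared_expert_overlap", False)

    if enable_deepep and dispatcher != "flex":
        raise ValueError(
            "moe_enable_deepep=True is assumed to require "
            f"moe_token_dispatcher_type='flex', got {dispatcher!r}"
        )

    return {
        "moe_enable_deepep": enable_deepep,
        "moe_token_dispatcher_type": dispatcher,
        "moe_shared_expert_overlap": shared_overlap,
    }


# Example: enabling DeepEP together with the flex dispatcher.
example_cfg = {
    "moe_enable_deepep": True,
    "moe_token_dispatcher_type": "flex",
    "moe_shared_expert_overlap": False,
}
resolved = check_deepep_options(example_cfg)
```

The same three keys would be set in a YAML recipe; the dict here only mirrors that structure for illustration.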
# Add a code snippet demonstrating how to use this
Before your PR is "Ready for review"
Pre checks:
Additional Information
Summary by CodeRabbit
Release Notes
New Features
Chores