[training] fix: use startswith for Blackwell GPU name validation in DeepEP#2731
[training] fix: use startswith for Blackwell GPU name validation in DeepEP#2731
Conversation
The exact-match check for GPU names like "NVIDIA B200" and "NVIDIA B300" doesn't account for naming variants like "NVIDIA B300 SXM6 AC". Use startswith() instead, consistent with HybridEP's compute capability check. Also include the GPU name in warning/error messages for debuggability. Fixes #2725 Signed-off-by: Yu Yao <yaoyu.094@gmail.com> Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> Made-with: Cursor
|
/ok to test 52cd0b8 |
📝 WalkthroughWalkthroughUpdated the Megatron-LM submodule reference and modified GPU validation logic in the flex dispatcher backend. The change replaces exact GPU name matching with prefix-based matching to handle device name variants for B200/B300 GPUs. Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~8 minutes 🚥 Pre-merge checks | ✅ 6✅ Passed checks (6 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
src/megatron/bridge/training/flex_dispatcher_backend.py (1)
88-92: Inconsistency: HybridEP validation error message should include GPU name.For consistency with the updated DeepEP validation (lines 84-85) and the HybridEP warning message (line 62), the HybridEP validation error should also include the current GPU name for debuggability.
♻️ Proposed fix for consistency
if model_config.moe_flex_dispatcher_backend == "hybridep": if not device_properties.major in [8, 9, 10]: raise ValueError( - "HybridEP is supported for GB200, GB300 with NVL72 and for Ampere, Hopper, B200 and B300 GPUs" + f"HybridEP is supported for GB200, GB300 with NVL72 and for Ampere, Hopper, B200 and B300 GPUs. " + f"Current GPU: {device_properties.name}" )🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/megatron/bridge/training/flex_dispatcher_backend.py` around lines 88 - 92, The HybridEP validation raises a ValueError without including the current GPU name; update the ValueError in the branch checking model_config.moe_flex_dispatcher_backend == "hybridep" to include device_properties.name (like the DeepEP validation and HybridEP warning do) so the message contains the GPU name for debuggability and consistency with the other checks.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@src/megatron/bridge/training/flex_dispatcher_backend.py`:
- Around line 88-92: The HybridEP validation raises a ValueError without
including the current GPU name; update the ValueError in the branch checking
model_config.moe_flex_dispatcher_backend == "hybridep" to include
device_properties.name (like the DeepEP validation and HybridEP warning do) so
the message contains the GPU name for debuggability and consistency with the
other checks.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: fc9e5398-ed07-4b86-966f-8b1b8a12290e
📒 Files selected for processing (2)
3rdparty/Megatron-LMsrc/megatron/bridge/training/flex_dispatcher_backend.py
Restore `3rdparty/Megatron-LM` to the `main` gitlink so this DeepEP GPU-name validation fix stays isolated and easier to review or backport. Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> Made-with: Cursor Signed-off-by: Yu Yao <yaoyu.094@gmail.com>
|
/ok to test 51c5bc3 |
|
/ok to test 51c5bc3 |
|
/ok to test 75af377 |
|
@yaoyu-33 Failing UTs |
|
/ok to test 77a00ec1c993ea021a22d06650933d4bad9bd087 |
@yaoyu-33, there was an error processing your request: See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/ |
|
/ok to test 3f5b248 |
|
looks like actual unit test errors: |
…test_deepep Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
|
/ok to test ca52bd5 |
…eepEP (#2731) Signed-off-by: Yu Yao <yaoyu.094@gmail.com> Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> Signed-off-by: Chen Cui <chcui@nvidia.com> Co-authored-by: Chen Cui <chcui@nvidia.com> Signed-off-by: oliver könig <okoenig@nvidia.com>
…eepEP (#2731) Signed-off-by: Yu Yao <yaoyu.094@gmail.com> Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> Signed-off-by: Chen Cui <chcui@nvidia.com> Co-authored-by: Chen Cui <chcui@nvidia.com> Signed-off-by: Li Ding <liding@nvidia.com>
Summary
"NVIDIA B200"and"NVIDIA B300"doesn't account for naming variants like"NVIDIA B300 SXM6 AC".startswith()instead, consistent with how HybridEP already accepts all Blackwell variants viadevice_properties.major.Fixes #2725
Made with Cursor
Summary by CodeRabbit
Chores
Bug Fixes