
[training] fix: use startswith for Blackwell GPU name validation in DeepEP#2731

Merged
yaoyu-33 merged 7 commits into main from yuya/fix-deepep-blackwell-validation
Mar 20, 2026
Conversation

@yaoyu-33
Contributor

@yaoyu-33 yaoyu-33 commented Mar 10, 2026

Summary

  • The exact-match check for GPU names like "NVIDIA B200" and "NVIDIA B300" doesn't account for naming variants such as "NVIDIA B300 SXM6 AC".
  • Use startswith() instead, consistent with how HybridEP already accepts all Blackwell variants via device_properties.major.
  • Include the GPU name in warning/error messages for debuggability.
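The core of the fix can be sketched as follows. This is an illustrative stand-in, not the exact code in flex_dispatcher_backend.py; the function name and message are hypothetical. It relies on str.startswith() accepting a tuple of prefixes, so one call covers both series and all marketing variants.

```python
# Illustrative sketch of the prefix-based check; names are hypothetical.
BLACKWELL_PREFIXES = ("NVIDIA B200", "NVIDIA B300")

def is_supported_blackwell(gpu_name: str) -> bool:
    # str.startswith accepts a tuple, so one call matches either series,
    # including variants like "NVIDIA B300 SXM6 AC".
    return gpu_name.startswith(BLACKWELL_PREFIXES)

# Exact match rejects the variant; the prefix check accepts it.
assert "NVIDIA B300 SXM6 AC" not in BLACKWELL_PREFIXES
assert is_supported_blackwell("NVIDIA B300 SXM6 AC")
assert is_supported_blackwell("NVIDIA B200")
assert not is_supported_blackwell("NVIDIA H100 80GB HBM3")
```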

Fixes #2725

Made with Cursor

Summary by CodeRabbit

  • Chores

    • Updated Megatron-LM to the latest version.
  • Bug Fixes

    • Improved GPU compatibility detection for NVIDIA B200 and B300 series, enabling support for additional device variants.

The exact-match check for GPU names like "NVIDIA B200" and "NVIDIA B300"
doesn't account for naming variants like "NVIDIA B300 SXM6 AC". Use
startswith() instead, consistent with HybridEP's compute capability check.
Also include the GPU name in warning/error messages for debuggability.

Fixes #2725

Signed-off-by: Yu Yao <yaoyu.094@gmail.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Made-with: Cursor
@copy-pr-bot

copy-pr-bot bot commented Mar 10, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@yaoyu-33
Contributor Author

/ok to test 52cd0b8

@coderabbitai
Contributor

coderabbitai bot commented Mar 10, 2026

📝 Walkthrough

Walkthrough

Updated the Megatron-LM submodule reference and modified GPU validation logic in the flex dispatcher backend. The change replaces exact GPU name matching with prefix-based matching to handle device name variants for B200/B300 GPUs.

Changes

  • Submodule Update (3rdparty/Megatron-LM): Advanced the submodule pointer from commit 8318b80 to commit 23dd63c, with no functional code changes.
  • GPU Validation Logic (src/megatron/bridge/training/flex_dispatcher_backend.py): Replaced exact GPU name matching with prefix-based checks using .startswith(("NVIDIA B200", "NVIDIA B300")) in both apply_flex_dispatcher_backend and validate_flex_dispatcher_backend. Updated error and warning messages to reflect the new condition and include the current GPU name.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

🚥 Pre-merge checks | ✅ 6
✅ Passed checks (6 passed)
  • Description Check: passed (check skipped because CodeRabbit's high-level summary is enabled).
  • Title Check: passed. The PR title accurately describes the main change: replacing exact GPU name checks with prefix-based validation for Blackwell GPUs in DeepEP.
  • Linked Issues Check: passed. The changes fully address issue #2725 by replacing exact-match GPU name checks with startswith() for B200/B300 variants and adding GPU names to error messages.
  • Out of Scope Changes Check: passed. All changes are directly scoped to fixing DeepEP GPU validation for Blackwell variants; the submodule update is unrelated but routine.
  • Docstring Coverage: passed. Docstring coverage is 100.00%, which exceeds the required threshold of 80.00%.
  • Test Results For Major Changes: passed. Changes are minor bug fixes to GPU name validation logic affecting only 12 lines in flex_dispatcher_backend.py, replacing exact matching with prefix checks for GPU variants.


Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
src/megatron/bridge/training/flex_dispatcher_backend.py (1)

88-92: Inconsistency: HybridEP validation error message should include GPU name.

For consistency with the updated DeepEP validation (lines 84-85) and the HybridEP warning message (line 62), the HybridEP validation error should also include the current GPU name for debuggability.

♻️ Proposed fix for consistency
         if model_config.moe_flex_dispatcher_backend == "hybridep":
             if not device_properties.major in [8, 9, 10]:
                 raise ValueError(
-                    "HybridEP is supported for GB200, GB300 with NVL72 and for Ampere, Hopper, B200 and B300 GPUs"
+                    f"HybridEP is supported for GB200, GB300 with NVL72 and for Ampere, Hopper, B200 and B300 GPUs. "
+                    f"Current GPU: {device_properties.name}"
                 )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/training/flex_dispatcher_backend.py` around lines 88 -
92, The HybridEP validation raises a ValueError without including the current
GPU name; update the ValueError in the branch checking
model_config.moe_flex_dispatcher_backend == "hybridep" to include
device_properties.name (like the DeepEP validation and HybridEP warning do) so
the message contains the GPU name for debuggability and consistency with the
other checks.


📥 Commits

Reviewing files that changed from the base of the PR and between de93536 and 52cd0b8.

📒 Files selected for processing (2)
  • 3rdparty/Megatron-LM
  • src/megatron/bridge/training/flex_dispatcher_backend.py

@yaoyu-33 yaoyu-33 added bug Something isn't working area:training Training loop, callbacks, and runtime integration needs-review PR is ready for code review and waiting on a reviewer and removed bug Something isn't working labels Mar 11, 2026
Restore `3rdparty/Megatron-LM` to the `main` gitlink so this DeepEP GPU-name validation fix stays isolated and easier to review or backport.

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Made-with: Cursor
Signed-off-by: Yu Yao <yaoyu.094@gmail.com>
@yaoyu-33
Contributor Author

/ok to test 51c5bc3

cuichenx
cuichenx previously approved these changes Mar 12, 2026
@cuichenx cuichenx added ready-to-merge PR is approved, current, and only waiting for CI to pass before merge and removed needs-review PR is ready for code review and waiting on a reviewer labels Mar 12, 2026
@cuichenx
Contributor

/ok to test 51c5bc3

@cuichenx
Contributor

/ok to test 75af377

@gautham-kollu
Contributor

@yaoyu-33 Failing unit tests:

        container, og_ws, cfg_mod = create_test_config_container(world_size_override=1, model_config=gpt_model_cfg)
    
        try:
            if expect_error:
>               with pytest.raises(ValueError, match="DeepEP is supported for Ampere"):
E               Failed: DID NOT RAISE <class 'ValueError'>

tests/unit_tests/training/test_config.py:1039: Failed
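The "DID NOT RAISE" failure is a predictable side effect of loosening the check: the test fixture mocked a GPU name that the old exact-match validation rejected, but the new prefix check accepts, so the expected ValueError never fires. A sketch of the fix, using unittest.mock from the stdlib to patch the device query; `get_gpu_name` and `validate_backend` are hypothetical reductions of the real code paths (the real validator also accepts Ampere and Hopper via compute capability):

```python
from unittest import mock

def get_gpu_name() -> str:
    # Stand-in for querying torch.cuda device properties.
    return "NVIDIA B200"

def validate_backend() -> None:
    # Reduced stand-in: only the Blackwell prefix branch is modeled here.
    name = get_gpu_name()
    if not name.startswith(("NVIDIA B200", "NVIDIA B300")):
        raise ValueError(f"DeepEP is supported for Ampere ... Current GPU: {name}")

# Fix the test by mocking a GPU that the *new* check still rejects,
# e.g. a Volta part, so the ValueError is raised again.
raised = False
with mock.patch(f"{__name__}.get_gpu_name", return_value="Tesla V100-SXM2-16GB"):
    try:
        validate_backend()
    except ValueError as e:
        raised = True
        assert "Tesla V100" in str(e)
assert raised
```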

@yaoyu-33
Contributor Author

/ok to test 77a00ec1c993ea021a22d06650933d4bad9bd087

@copy-pr-bot

copy-pr-bot bot commented Mar 19, 2026

/ok to test 77a00ec1c993ea021a22d06650933d4bad9bd087

@yaoyu-33, there was an error processing your request: E2

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/

@cuichenx
Contributor

/ok to test 3f5b248

@cuichenx
Contributor

looks like actual unit test errors:

FAILED tests/unit_tests/training/test_deepep.py::TestApplyDeepEP::test_apply_flex_dispatcher_backend_warns_for_unsupported_gpu_pascal - AssertionError: Expected 'warning' to have been called once. Called 0 times.
FAILED tests/unit_tests/training/test_deepep.py::TestValidateDeepEP::test_validate_flex_dispatcher_backend_volta_gpu_raises_error - Failed: DID NOT RAISE <class 'ValueError'>
FAILED tests/unit_tests/training/test_deepep.py::TestValidateDeepEP::test_validate_flex_dispatcher_backend_future_gpu_raises_error - Failed: DID NOT RAISE <class 'ValueError'>
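The Pascal test expects a warning rather than an exception, so its fixture needs the same treatment: mock a GPU name that still falls outside the new prefix check. A reduced, stdlib-only sketch of the test shape (`apply_backend` and the logger name are hypothetical stand-ins; the real apply path also accepts Ampere and Hopper):

```python
import logging
from unittest import mock

logger = logging.getLogger("flex_dispatcher_sketch")

def apply_backend(gpu_name: str) -> None:
    # Reduced stand-in: warn (don't raise) when the name is outside the
    # supported Blackwell prefixes.
    if not gpu_name.startswith(("NVIDIA B200", "NVIDIA B300")):
        logger.warning("DeepEP may be unsupported on this GPU: %s", gpu_name)

# Pascal-era name: the warning must be emitted exactly once.
with mock.patch.object(logger, "warning") as warn:
    apply_backend("NVIDIA GeForce GTX 1080")
warn.assert_called_once()

# Blackwell variant: no warning.
with mock.patch.object(logger, "warning") as no_warn:
    apply_backend("NVIDIA B300 SXM6 AC")
assert no_warn.call_count == 0
```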

…test_deepep

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
@yaoyu-33
Contributor Author

/ok to test ca52bd5

@yaoyu-33 yaoyu-33 merged commit f565adb into main Mar 20, 2026
37 checks passed
@yaoyu-33 yaoyu-33 deleted the yuya/fix-deepep-blackwell-validation branch March 20, 2026 03:38
ko3n1g pushed a commit that referenced this pull request Mar 20, 2026
…eepEP (#2731)

Signed-off-by: Yu Yao <yaoyu.094@gmail.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>
Co-authored-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
liding-nv pushed a commit that referenced this pull request Mar 22, 2026
…eepEP (#2731)

Signed-off-by: Yu Yao <yaoyu.094@gmail.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>
Co-authored-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Li Ding <liding@nvidia.com>

Labels

area:training Training loop, callbacks, and runtime integration ready-to-merge PR is approved, current, and only waiting for CI to pass before merge


Development

Successfully merging this pull request may close these issues.

DeepEP validation rejects B200/B300 when device_properties.name is a variant (e.g. "NVIDIA B300 SXM6 AC")

3 participants