
[training] fix: use startswith for Blackwell GPU name validation in DeepEP#2731

Merged
yaoyu-33 merged 7 commits into main from yuya/fix-deepep-blackwell-validation
Mar 20, 2026
Conversation

@yaoyu-33
Contributor

@yaoyu-33 yaoyu-33 commented Mar 10, 2026

Summary

  • The exact-match check for GPU names like "NVIDIA B200" and "NVIDIA B300" doesn't account for naming variants such as "NVIDIA B300 SXM6 AC".
  • Use startswith() instead, consistent with how HybridEP already accepts all Blackwell variants via device_properties.major.
  • Include the GPU name in warning/error messages for debuggability.
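The core of the fix can be sketched as follows. This is an illustrative stand-in, not the exact code in flex_dispatcher_backend.py; the function name and message are hypothetical. It relies on str.startswith() accepting a tuple of prefixes, so one call covers both series and all marketing variants.

```python
# Illustrative sketch of the prefix-based check; names are hypothetical.
BLACKWELL_PREFIXES = ("NVIDIA B200", "NVIDIA B300")

def is_supported_blackwell(gpu_name: str) -> bool:
    # str.startswith accepts a tuple, so one call matches either series,
    # including variants like "NVIDIA B300 SXM6 AC".
    return gpu_name.startswith(BLACKWELL_PREFIXES)

# Exact match rejects the variant; the prefix check accepts it.
assert "NVIDIA B300 SXM6 AC" not in BLACKWELL_PREFIXES
assert is_supported_blackwell("NVIDIA B300 SXM6 AC")
assert is_supported_blackwell("NVIDIA B200")
assert not is_supported_blackwell("NVIDIA H100 80GB HBM3")
```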

Fixes #2725

Made with Cursor

Summary by CodeRabbit

  • Chores

    • Updated Megatron-LM to the latest version.
  • Bug Fixes

    • Improved GPU compatibility detection for NVIDIA B200 and B300 series, enabling support for additional device variants.

The exact-match check for GPU names like "NVIDIA B200" and "NVIDIA B300"
doesn't account for naming variants like "NVIDIA B300 SXM6 AC". Use
startswith() instead, consistent with HybridEP's compute capability check.
Also include the GPU name in warning/error messages for debuggability.

Fixes #2725

Signed-off-by: Yu Yao <yaoyu.094@gmail.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Made-with: Cursor
@copy-pr-bot

copy-pr-bot bot commented Mar 10, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@yaoyu-33
Contributor Author

/ok to test 52cd0b8

@coderabbitai
Contributor

coderabbitai bot commented Mar 10, 2026

📝 Walkthrough

Walkthrough

Updated the Megatron-LM submodule reference and modified GPU validation logic in the flex dispatcher backend. The change replaces exact GPU name matching with prefix-based matching to handle device name variants for B200/B300 GPUs.

Changes

  • Submodule Update (3rdparty/Megatron-LM): Advanced the submodule pointer from commit 8318b80 to commit 23dd63c, with no functional code changes.
  • GPU Validation Logic (src/megatron/bridge/training/flex_dispatcher_backend.py): Replaced exact GPU name matching with prefix-based checks using .startswith(("NVIDIA B200", "NVIDIA B300")) in both apply_flex_dispatcher_backend and validate_flex_dispatcher_backend. Updated error and warning messages to reflect the new condition and include the current GPU name.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

🚥 Pre-merge checks | ✅ 6
✅ Passed checks (6 passed)
  • Description Check: passed (check skipped because CodeRabbit's high-level summary is enabled).
  • Title Check: passed. The PR title accurately describes the main change: replacing exact GPU name checks with prefix-based validation for Blackwell GPUs in DeepEP.
  • Linked Issues Check: passed. The changes fully address issue #2725 by replacing exact-match GPU name checks with startswith() for B200/B300 variants and adding GPU names to error messages.
  • Out of Scope Changes Check: passed. All changes are directly scoped to fixing DeepEP GPU validation for Blackwell variants; the submodule update is unrelated but routine.
  • Docstring Coverage: passed. Docstring coverage is 100.00%, which exceeds the required threshold of 80.00%.
  • Test Results For Major Changes: passed. Changes are minor bug fixes to GPU name validation logic affecting only 12 lines in flex_dispatcher_backend.py, replacing exact matching with prefix checks for GPU variants.


Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
src/megatron/bridge/training/flex_dispatcher_backend.py (1)

88-92: Inconsistency: HybridEP validation error message should include GPU name.

For consistency with the updated DeepEP validation (lines 84-85) and the HybridEP warning message (line 62), the HybridEP validation error should also include the current GPU name for debuggability.

♻️ Proposed fix for consistency
         if model_config.moe_flex_dispatcher_backend == "hybridep":
             if not device_properties.major in [8, 9, 10]:
                 raise ValueError(
-                    "HybridEP is supported for GB200, GB300 with NVL72 and for Ampere, Hopper, B200 and B300 GPUs"
+                    f"HybridEP is supported for GB200, GB300 with NVL72 and for Ampere, Hopper, B200 and B300 GPUs. "
+                    f"Current GPU: {device_properties.name}"
                 )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/training/flex_dispatcher_backend.py` around lines 88 -
92, The HybridEP validation raises a ValueError without including the current
GPU name; update the ValueError in the branch checking
model_config.moe_flex_dispatcher_backend == "hybridep" to include
device_properties.name (like the DeepEP validation and HybridEP warning do) so
the message contains the GPU name for debuggability and consistency with the
other checks.


📥 Commits

Reviewing files that changed from the base of the PR and between de93536 and 52cd0b8.

📒 Files selected for processing (2)
  • 3rdparty/Megatron-LM
  • src/megatron/bridge/training/flex_dispatcher_backend.py

@yaoyu-33 yaoyu-33 added bug Something isn't working area:training Training loop, callbacks, and runtime integration needs-review PR is ready for code review and waiting on a reviewer and removed bug Something isn't working labels Mar 11, 2026
Restore `3rdparty/Megatron-LM` to the `main` gitlink so this DeepEP GPU-name validation fix stays isolated and easier to review or backport.

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Made-with: Cursor
Signed-off-by: Yu Yao <yaoyu.094@gmail.com>
@yaoyu-33
Contributor Author

/ok to test 51c5bc3

cuichenx
cuichenx previously approved these changes Mar 12, 2026
@cuichenx cuichenx added ready-to-merge PR is approved, current, and only waiting for CI to pass before merge and removed needs-review PR is ready for code review and waiting on a reviewer labels Mar 12, 2026
@cuichenx
Contributor

/ok to test 51c5bc3

@cuichenx
Contributor

/ok to test 75af377

@gautham-kollu
Contributor

@yaoyu-33 Failing unit tests:

        container, og_ws, cfg_mod = create_test_config_container(world_size_override=1, model_config=gpt_model_cfg)
    
        try:
            if expect_error:
>               with pytest.raises(ValueError, match="DeepEP is supported for Ampere"):
E               Failed: DID NOT RAISE <class 'ValueError'>

tests/unit_tests/training/test_config.py:1039: Failed
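The "DID NOT RAISE" failure is a predictable side effect of loosening the check: the test fixture mocked a GPU name that the old exact-match validation rejected, but the new prefix check accepts, so the expected ValueError never fires. A sketch of the fix, using unittest.mock from the stdlib to patch the device query; `get_gpu_name` and `validate_backend` are hypothetical reductions of the real code paths (the real validator also accepts Ampere and Hopper via compute capability):

```python
from unittest import mock

def get_gpu_name() -> str:
    # Stand-in for querying torch.cuda device properties.
    return "NVIDIA B200"

def validate_backend() -> None:
    # Reduced stand-in: only the Blackwell prefix branch is modeled here.
    name = get_gpu_name()
    if not name.startswith(("NVIDIA B200", "NVIDIA B300")):
        raise ValueError(f"DeepEP is supported for Ampere ... Current GPU: {name}")

# Fix the test by mocking a GPU that the *new* check still rejects,
# e.g. a Volta part, so the ValueError is raised again.
raised = False
with mock.patch(f"{__name__}.get_gpu_name", return_value="Tesla V100-SXM2-16GB"):
    try:
        validate_backend()
    except ValueError as e:
        raised = True
        assert "Tesla V100" in str(e)
assert raised
```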

@yaoyu-33
Contributor Author

/ok to test 77a00ec1c993ea021a22d06650933d4bad9bd087

@copy-pr-bot

copy-pr-bot bot commented Mar 19, 2026

/ok to test 77a00ec1c993ea021a22d06650933d4bad9bd087

@yaoyu-33, there was an error processing your request: E2

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/

@cuichenx
Contributor

/ok to test 3f5b248

@cuichenx
Contributor

looks like actual unit test errors:

FAILED tests/unit_tests/training/test_deepep.py::TestApplyDeepEP::test_apply_flex_dispatcher_backend_warns_for_unsupported_gpu_pascal - AssertionError: Expected 'warning' to have been called once. Called 0 times.
FAILED tests/unit_tests/training/test_deepep.py::TestValidateDeepEP::test_validate_flex_dispatcher_backend_volta_gpu_raises_error - Failed: DID NOT RAISE <class 'ValueError'>
FAILED tests/unit_tests/training/test_deepep.py::TestValidateDeepEP::test_validate_flex_dispatcher_backend_future_gpu_raises_error - Failed: DID NOT RAISE <class 'ValueError'>
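The Pascal test expects a warning rather than an exception, so its fixture needs the same treatment: mock a GPU name that still falls outside the new prefix check. A reduced, stdlib-only sketch of the test shape (`apply_backend` and the logger name are hypothetical stand-ins; the real apply path also accepts Ampere and Hopper):

```python
import logging
from unittest import mock

logger = logging.getLogger("flex_dispatcher_sketch")

def apply_backend(gpu_name: str) -> None:
    # Reduced stand-in: warn (don't raise) when the name is outside the
    # supported Blackwell prefixes.
    if not gpu_name.startswith(("NVIDIA B200", "NVIDIA B300")):
        logger.warning("DeepEP may be unsupported on this GPU: %s", gpu_name)

# Pascal-era name: the warning must be emitted exactly once.
with mock.patch.object(logger, "warning") as warn:
    apply_backend("NVIDIA GeForce GTX 1080")
warn.assert_called_once()

# Blackwell variant: no warning.
with mock.patch.object(logger, "warning") as no_warn:
    apply_backend("NVIDIA B300 SXM6 AC")
assert no_warn.call_count == 0
```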

…test_deepep

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
@yaoyu-33
Contributor Author

/ok to test ca52bd5

@yaoyu-33 yaoyu-33 merged commit f565adb into main Mar 20, 2026
37 checks passed
@yaoyu-33 yaoyu-33 deleted the yuya/fix-deepep-blackwell-validation branch March 20, 2026 03:38
ko3n1g pushed a commit that referenced this pull request Mar 20, 2026
…eepEP (#2731)

Signed-off-by: Yu Yao <yaoyu.094@gmail.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>
Co-authored-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
liding-nv pushed a commit that referenced this pull request Mar 22, 2026
…eepEP (#2731)

Signed-off-by: Yu Yao <yaoyu.094@gmail.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>
Co-authored-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Li Ding <liding@nvidia.com>

Labels

area:training Training loop, callbacks, and runtime integration ready-to-merge PR is approved, current, and only waiting for CI to pass before merge


Development

Successfully merging this pull request may close these issues.

DeepEP validation rejects B200/B300 when device_properties.name is a variant (e.g. "NVIDIA B300 SXM6 AC")

3 participants