Add QK layernorm support for dot-product attention in MambaModel#4067

Open
Phlip79 wants to merge 9 commits into NVIDIA:main from Phlip79:philip/add-qk-norm

Conversation

@Phlip79
Member

@Phlip79 Phlip79 commented Mar 31, 2026

What does this PR do ?

Converts static mamba_stack_spec and mamba_inference_stack_spec into config-driven functions (get_mamba_stack_spec, get_mamba_inference_stack_spec) that read qk_layernorm and qk_l2_norm from TransformerConfig, matching GPTModel's approach. Backward-compatible constants are preserved.
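The pattern described above can be sketched as follows. Note this is a minimal illustration, not the actual Megatron-Core code: the function name, config fields, and backward-compatible constant come from the PR description, but the real functions build a `ModuleSpec` tree rather than the plain dict used here.

```python
from dataclasses import dataclass
from typing import Optional

# Minimal stand-in for Megatron's TransformerConfig; only the two flags
# this PR reads are modeled. The real class has many more fields.
@dataclass
class TransformerConfig:
    qk_layernorm: bool = False
    qk_l2_norm: bool = False

def get_mamba_stack_spec(config: Optional[TransformerConfig] = None) -> dict:
    """Sketch of the config-driven spec builder described in the PR.

    The real function returns a ModuleSpec tree; a plain dict stands in
    here to show how the config flags toggle the QK-norm submodules.
    """
    config = config or TransformerConfig()
    spec = {"core_attention": "TEDotProductAttention"}
    if config.qk_layernorm:
        spec["q_layernorm"] = "TENorm"
        spec["k_layernorm"] = "TENorm"
    elif config.qk_l2_norm:
        spec["q_layernorm"] = "L2Norm"
        spec["k_layernorm"] = "L2Norm"
    return spec

# Backward-compatible constant, built once with default flags, so
# existing imports of `mamba_stack_spec` keep working unchanged.
mamba_stack_spec = get_mamba_stack_spec()
```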

⚠️ For major changes (either in lines of code or in impact), please make sure to first share a design doc with the team. If you're unsure of the best way to do so, contact the @mcore-oncall.

Contribution process

Pre-checks

  • I have added relevant unit tests
  • I have added relevant functional tests
  • I have added proper typing to my code Typing guidelines
  • I have added relevant documentation
  • I have run the autoformatter.sh on my PR

Code review

Feel free to message or tag the @mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!

All PRs start as draft. If you open a non-draft PR, it will be automatically converted to draft.

Step 1: Mark PR as "Ready for Review"

  1. When your PR is ready, click Ready for Review.
  2. An oncall reviewer is auto-assigned and expert reviewers are notified based on your changes.
    • Some PRs may jump straight to step 2. This is determined by .github/CODEOWNERS.

⚠️ Only mark as ready once merge-conflicts are resolved and the CI is passing.
Final Review might get declined if these requirements are not fulfilled.

Step 2: Final Review

For PRs that change megatron/core, once all expert reviewers have approved, the Final Review label is applied automatically and final reviewers are assigned.

For PRs outside megatron/core, this step is skipped.

Step 3: Approved

Once all required reviewers have approved, the Approved label is applied automatically.

Merge

Any member of mcore-engineers will be able to merge your PR.

For MRs into the `dev` branch: the proposed review process for the `dev` branch is under active discussion.

MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@copy-pr-bot

copy-pr-bot bot commented Mar 31, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@Phlip79
Member Author

Phlip79 commented Mar 31, 2026

/claude review

Contributor

@claude claude bot left a comment


Light review — the refactor from static specs to config-driven functions looks correct and consistent with the GPT layer specs pattern. One gap: the new QK-norm code paths have no test coverage (see inline comment).

Tests cover: default (no config), qk_layernorm=True, qk_l2_norm=True,
inference spec, backward-compatible constant, and a full forward pass
with qk_layernorm enabled.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@Phlip79 Phlip79 marked this pull request as ready for review March 31, 2026 00:20
@Phlip79 Phlip79 requested review from a team as code owners March 31, 2026 00:20
@svcnvidia-nemo-ci svcnvidia-nemo-ci requested a review from a team March 31, 2026 00:20
@Phlip79
Member Author

Phlip79 commented Mar 31, 2026

/claude review

Contributor

@claude claude bot left a comment


LGTM

@Phlip79
Member Author

Phlip79 commented Mar 31, 2026

/ok to test 367b8a8

@svcnvidia-nemo-ci svcnvidia-nemo-ci added this to the Core 0.16 milestone Mar 31, 2026
@Phlip79
Member Author

Phlip79 commented Mar 31, 2026

/ok to test 1c46827

@Phlip79
Member Author

Phlip79 commented Mar 31, 2026

/ok to test b28a11f

@janEbert
Contributor

I think this is exactly what we want to avoid; as far as I understand, we do not want to start making the spec dynamic in code. :)
Isn't this solvable by dynamically passing the arguments in the MambaStack or MambaModel constructors?
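The alternative suggested here could look roughly like the following. This is a hypothetical illustration only: the class name echoes MambaModel from the PR, but the constructor signature, field names, and spec shape are invented for the sketch and are not Megatron's actual API.

```python
from copy import deepcopy
from types import SimpleNamespace

# A static spec, as before the PR's refactor (shape is illustrative).
MAMBA_STACK_SPEC = {"core_attention": "TEDotProductAttention"}

class MambaModel:
    """Toy model that patches QK-norm submodules in at construction time,
    leaving the module spec itself static."""

    def __init__(self, config, spec=MAMBA_STACK_SPEC):
        self.spec = deepcopy(spec)  # never mutate the shared static spec
        if getattr(config, "qk_layernorm", False):
            self.spec["q_layernorm"] = "TENorm"
            self.spec["k_layernorm"] = "TENorm"

model = MambaModel(SimpleNamespace(qk_layernorm=True))
```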

@Phlip79
Member Author

Phlip79 commented Mar 31, 2026

/claude review

Contributor

@claude claude bot left a comment


LGTM

@Phlip79
Member Author

Phlip79 commented Mar 31, 2026

/ok to test bdf2d12

@Phlip79
Member Author

Phlip79 commented Mar 31, 2026

/ok to test 488e448

@Phlip79
Member Author

Phlip79 commented Mar 31, 2026

/ok to test e9c20a4

TENorm uses __new__ to return a TE LayerNorm/RMSNorm instance, not a
TENorm instance, so isinstance(x, TENorm) always fails. Check for
not None instead.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@Phlip79
Member Author

Phlip79 commented Mar 31, 2026

/ok to test d16170f

Contributor

@yuzhongw-nvidia yuzhongw-nvidia left a comment


LGTM. Thanks.


4 participants