[model] feat: support Qwen 3.5 MTP conversion and training #2711
Conversation
Signed-off-by: Chen Cui <chcui@nvidia.com>
- Add `set -euo pipefail` to slurm_sft.sh and slurm_peft.sh for fail-fast behavior
- Remove `logs/` prefix from SBATCH output/error paths (directory doesn't exist at job start)
- Remove now-unnecessary `mkdir -p logs` calls
- Fix 122B-A10B parallelism comment in slurm_sft.sh to match recipe (TP=2, PP=6, EP=8)
- Add `pytestmark = pytest.mark.integration` to functional test module
- Reset microbatch calculator both before and after each test in fixture

Signed-off-by: Chen Cui <chcui@nvidia.com>
Made-with: Cursor
…n-Bridge into chcui/qwen35_recipes
Signed-off-by: Chen Cui <chcui@nvidia.com>
📝 Walkthrough

The changes introduce Multi-Token Prediction (MTP) support across the Qwen3 VL model architecture. This includes adding MTP configuration mapping entries, extending model signatures with MTP parameters, implementing MTP block specifications in providers, adding comprehensive MTP parameter translation mappings, incorporating embedding sequence-parallel handling, and configuring MTP settings in recipe variants. An inference helper function disables MTP during conversion.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks: ✅ 3 passed | ❌ 1 failed (1 warning)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/text_model.py (1)
178-215: ⚠️ Potential issue | 🟠 Major — Restore the shadowed embedding in a `finally` block.

Lines 191-215 shadow `self.embedding` before calling `_postprocess()`, but the cleanup only runs on the success path. If `_postprocess()` raises, the module stays shadowed by `_sp_scatter_embedding` and subsequent calls see corrupted state.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/text_model.py` around lines 178-215, wrap the call to self._postprocess(...) in a try/finally when _shadow_embedding is set: move the existing _postprocess invocation into a try block and in the finally block restore the original embedding by writing back self.__dict__["embedding"] = _original_embedding (or deleting the shadow if you prefer) so that _sp_scatter_embedding cannot remain in place if _postprocess raises; reference symbols: _shadow_embedding, _original_embedding, _sp_scatter_embedding, self.embedding, and _postprocess.

src/megatron/bridge/models/qwen_vl/qwen35_vl_provider.py (1)
202-221: ⚠️ Potential issue | 🟠 Major — Patch the MTP block spec with `Qwen3VLSelfAttention` too.

`block_spec` is patched locally before constructing the main decoder, but `mtp_block_spec(self, vp_stage=vp_stage)` rebuilds a fresh spec from `self.transformer_layer_spec`. That leaves the MTP path on the unpatched attention class, so MTP won't get the same Qwen VL mRoPE/self-attention behavior as the main decoder.

Also applies to: 399-424
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/megatron/bridge/models/qwen_vl/qwen35_vl_provider.py` around lines 202 - 221, The MTP path is being built from mtp_block_spec(self, vp_stage=vp_stage) which recreates a spec from self.transformer_layer_spec and therefore isn't patched with Qwen3VLSelfAttention; after creating mtp_spec = mtp_block_spec(self, vp_stage=vp_stage) (or inline) call _patch_standard_attention_specs(mtp_spec, Qwen3VLSelfAttention) so the MTP block spec uses the same patched attention class as block_spec; update both places noted (around the Qwen3VLModel construction and the other occurrence at lines ~399-424) to patch the mtp spec before passing it into Qwen3VLModel.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/megatron/bridge/models/conversion/model_bridge.py`:
- Around line 291-292: The shared mapping MegatronModelBridge.CONFIG_MAPPING
must not include the Qwen3.5-specific alias ("mtp_num_hidden_layers",
"mtp_num_layers"); remove that tuple from MegatronModelBridge.CONFIG_MAPPING and
instead implement the alias in the Qwen3.5-specific bridge (e.g.,
Qwen35ModelBridge) by adding a bridge-local input alias or an override of
hf_config_to_provider_kwargs()/megatron_to_hf_config() that maps
mtp_num_hidden_layers ↔ mtp_num_layers only for Qwen3.5 models, so other bridges
keep the original behavior and no global alias wins silently on import.
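The bridge-local alias suggested above can be sketched as follows; the class and method names here (`CONFIG_MAPPING`, `hf_config_to_provider_kwargs`) mirror the review comment but are simplified stand-ins for the real Megatron-Bridge API:

```python
# Sketch of a bridge-local config alias, under the assumption that the base
# bridge translates HF config keys via a class-level CONFIG_MAPPING dict.

class MegatronModelBridge:
    # Shared mapping: HF config key -> Megatron provider kwarg.
    CONFIG_MAPPING = {
        "hidden_size": "hidden_size",
        "num_hidden_layers": "num_layers",
    }

    def hf_config_to_provider_kwargs(self, hf_config: dict) -> dict:
        return {
            target: hf_config[source]
            for source, target in self.CONFIG_MAPPING.items()
            if source in hf_config
        }


class Qwen35ModelBridge(MegatronModelBridge):
    # The Qwen3.5-only alias lives here, so other bridges are unaffected
    # and no global alias wins silently on import.
    CONFIG_MAPPING = {
        **MegatronModelBridge.CONFIG_MAPPING,
        "mtp_num_hidden_layers": "mtp_num_layers",
    }


hf_config = {"hidden_size": 4096, "num_hidden_layers": 48, "mtp_num_hidden_layers": 1}
base = MegatronModelBridge().hf_config_to_provider_kwargs(hf_config)
qwen = Qwen35ModelBridge().hf_config_to_provider_kwargs(hf_config)
```

With this split, a generic bridge silently ignores `mtp_num_hidden_layers` while the Qwen3.5 bridge maps it to `mtp_num_layers`.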
In `@src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/model.py`:
- Around line 302-306: The code only sets lm_input_ids = input_ids_thd inside
the self.pre_process branch, leaving lm_input_ids as BxS for later stages;
change the logic so that whenever packed sequences are in use (input_ids_thd
exists or the packed-sequence flag is set) and pipeline_model_parallel_size > 1,
lm_input_ids is switched to input_ids_thd regardless of self.pre_process.
Concretely, adjust the assignment around lm_input_ids (the variable used by
MTP/_get_embeddings) so it prefers input_ids_thd when present or when
packing/THD layout is required for multi-stage pipeline inference, ensuring the
last decoder stage uses the same THD layout as position_ids and hidden states.
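The selection logic the comment asks for can be sketched as a small helper; the argument names follow the review text, but the surrounding model code is assumed:

```python
# Minimal sketch of the suggested lm_input_ids selection. Prefer the packed
# (THD) layout whenever it exists and either this is the pre_process stage or
# the pipeline spans multiple stages, so later stages see the same layout as
# position_ids and hidden states.

def select_lm_input_ids(input_ids, input_ids_thd, pre_process: bool, pp_size: int):
    if input_ids_thd is not None and (pre_process or pp_size > 1):
        return input_ids_thd
    return input_ids
```

The key change versus the code under review is that the THD branch is no longer gated solely on `pre_process`.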
In `@src/megatron/bridge/models/qwen_vl/qwen35_vl_bridge.py`:
- Around line 378-431: The MTP mappings only cover the "mtp_model_layer" prefix
so weights exposed as "transformer_layer" will be missed; update the mapping
registration (the mtp_param_mappings dict and the subsequent mapping_list.extend
entries) to register both prefixes by duplicating or looping over prefixes
["mtp_model_layer","transformer_layer"] when creating
AutoMapping/QKVMapping/GatedMLPMapping/ReplicatedMapping entries (refer to
mtp_param_mappings, QKVMapping(...linear_qkv...),
GatedMLPMapping(...mlp.experts...), AutoMapping(...linear_fc2...), and
ReplicatedMapping(...shared_experts.gate_weight)); apply the same dual-prefix
change to the other MTP block around the 656-691 region so both
"transformer_layer" and "mtp_model_layer" variants are produced.
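The dual-prefix registration can be sketched as a loop; the mapping factory here is a stand-in for the real `AutoMapping`/`QKVMapping`/`GatedMLPMapping` classes, and the parameter names are illustrative:

```python
# Sketch: register each MTP weight mapping under both prefixes so weights
# exposed as either "mtp_model_layer.*" or "transformer_layer.*" are covered.

def build_mtp_mappings(make_mapping):
    mappings = []
    for prefix in ("mtp_model_layer", "transformer_layer"):
        mappings.append(make_mapping(f"{prefix}.self_attention.linear_qkv"))
        mappings.append(make_mapping(f"{prefix}.mlp.linear_fc2"))
    return mappings

# Using the identity function as a placeholder mapping factory:
mappings = build_mtp_mappings(lambda name: name)
```

The same loop would wrap the `mapping_list.extend` calls in both MTP blocks of the bridge, so neither prefix variant is silently dropped.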
In `@src/megatron/bridge/recipes/qwen_vl/qwen35_vl.py`:
- Around line 68-71: The recipe is overwriting the HF-detected MTP depth; remove
any hardcoded assignment to cfg.model.mtp_num_layers (do not set it to 1) so the
value returned by AutoBridge.from_hf_pretrained(hf_path) is preserved, and only
assign cfg.model.mtp_loss_scaling_factor when cfg.model.mtp_num_layers is truthy
(i.e., MTP is present). Locate uses of cfg.model.mtp_num_layers and
cfg.model.mtp_loss_scaling_factor in the recipe functions that call
AutoBridge.from_hf_pretrained and delete the mtp_num_layers assignment and guard
the mtp_loss_scaling_factor assignment behind a check of
cfg.model.mtp_num_layers.
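The guarded recipe assignment can be sketched like this; the config object is a stand-in for the recipe's `cfg.model`, and 0.1 is the scaling factor named in the PR description:

```python
# Sketch of the guarded recipe assignment: never overwrite the MTP depth
# detected from the HF config; only set the loss scale when MTP is present.

from types import SimpleNamespace

def apply_mtp_settings(model_cfg):
    # Do NOT assign model_cfg.mtp_num_layers here — keep the value that
    # AutoBridge.from_hf_pretrained() derived from the HF checkpoint.
    if getattr(model_cfg, "mtp_num_layers", None):
        model_cfg.mtp_loss_scaling_factor = 0.1
    return model_cfg

with_mtp = apply_mtp_settings(SimpleNamespace(mtp_num_layers=1))
without_mtp = apply_mtp_settings(SimpleNamespace(mtp_num_layers=None))
```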
---
Outside diff comments:
In `@src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/text_model.py`:
- Around line 178-215: Wrap the call to self._postprocess(...) in a try/finally
when _shadow_embedding is set: move the existing _postprocess invocation into a
try block and in the finally block restore the original embedding by writing
back self.__dict__["embedding"] = _original_embedding (or deleting the shadow if
you prefer) so that _sp_scatter_embedding cannot remain in place if _postprocess
raises; reference symbols: _shadow_embedding, _original_embedding,
_sp_scatter_embedding, self.embedding, and _postprocess.
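The try/finally restore pattern the prompt describes can be sketched with a dummy module; the attribute names follow the review comment, but the real text model's internals are assumed:

```python
# Sketch of the try/finally embedding restore: even if _postprocess raises,
# the temporary shadow must not remain in self.__dict__.

class TextModel:
    def __init__(self):
        self.embedding = "original_embedding"

    def _postprocess(self, fail: bool = False):
        if fail:
            raise RuntimeError("postprocess failed")

    def forward(self, shadow: bool = False, fail: bool = False):
        original = self.embedding
        if shadow:
            # Temporarily shadow the embedding (e.g. with an SP-scatter wrapper).
            self.__dict__["embedding"] = "sp_scatter_embedding"
        try:
            self._postprocess(fail=fail)
        finally:
            if shadow:
                # Restore the original even on the failure path.
                self.__dict__["embedding"] = original

model = TextModel()
try:
    model.forward(shadow=True, fail=True)
except RuntimeError:
    pass
```

Without the `finally`, the raising call would leave `embedding` pointing at the shadow, which is exactly the corrupted state the review flags.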
In `@src/megatron/bridge/models/qwen_vl/qwen35_vl_provider.py`:
- Around line 202-221: The MTP path is being built from mtp_block_spec(self,
vp_stage=vp_stage) which recreates a spec from self.transformer_layer_spec and
therefore isn't patched with Qwen3VLSelfAttention; after creating mtp_spec =
mtp_block_spec(self, vp_stage=vp_stage) (or inline) call
_patch_standard_attention_specs(mtp_spec, Qwen3VLSelfAttention) so the MTP block
spec uses the same patched attention class as block_spec; update both places
noted (around the Qwen3VLModel construction and the other occurrence at lines
~399-424) to patch the mtp spec before passing it into Qwen3VLModel.
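The fix can be sketched with simplified spec objects; `LayerSpec` here stands in for Megatron's `ModuleSpec` tree, and `patch_standard_attention` for the provider's `_patch_standard_attention_specs` helper:

```python
# Sketch: apply the same attention patch to both the main decoder spec and
# the freshly rebuilt MTP block spec before constructing the model.

from dataclasses import dataclass

class StandardAttention: ...
class Qwen3VLSelfAttention(StandardAttention): ...

@dataclass
class LayerSpec:
    attention_cls: type = StandardAttention

def patch_standard_attention(spec: LayerSpec, attn_cls: type) -> LayerSpec:
    # Swap in the Qwen VL attention wherever the standard class is used.
    if spec.attention_cls is StandardAttention:
        spec.attention_cls = attn_cls
    return spec

def mtp_block_spec() -> LayerSpec:
    # Rebuilt from scratch, so it starts unpatched — the bug in the finding.
    return LayerSpec()

block_spec = patch_standard_attention(LayerSpec(), Qwen3VLSelfAttention)
mtp_spec = patch_standard_attention(mtp_block_spec(), Qwen3VLSelfAttention)
```

The point is simply that the patch must be applied a second time after `mtp_block_spec()` is built, in both provider code paths the finding mentions.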
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 4b764c39-1c37-4042-a4d1-acc9606d2476
📒 Files selected for processing (7)
- examples/conversion/hf_to_megatron_generate_vlm.py
- src/megatron/bridge/models/conversion/model_bridge.py
- src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/model.py
- src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/text_model.py
- src/megatron/bridge/models/qwen_vl/qwen35_vl_bridge.py
- src/megatron/bridge/models/qwen_vl/qwen35_vl_provider.py
- src/megatron/bridge/recipes/qwen_vl/qwen35_vl.py
What does this PR do?
Add Multi-Token Prediction (MTP) support for Qwen3.5 VL models, enabling MTP-aware checkpoint conversion and training for both dense and MoE variants.
Changelog
- Add `Qwen35VLMoEBridge` and `Qwen35VLBridge` for bidirectional HF ↔ Megatron conversion (QKV, MoE experts, shared experts, projection, layernorms)
- Add `mtp_num_hidden_layers` → `mtp_num_layers` config alias in `MegatronModelBridge.CONFIG_MAPPING`
- Thread `mtp_block_spec` and `vp_stage` through `Qwen3VLModel` to the language model in both dense and MoE providers
- Pass `input_ids` to the language model (instead of `None`) so MTP can roll input_ids for future-token embeddings; handle THD-format for packed sequences
- Add embedding sequence-parallel handling in `Qwen3VLGPTModel`
- Set `mtp_loss_scaling_factor = 0.1` in both bridges when MTP is enabled
- Refactor `qwen35_vl.py` recipes into shared helpers (`_qwen35_vl_apply_common`, `_qwen35_vl_apply_moe`, `_qwen35_vl_enable_recompute`, `_qwen35_vl_apply_peft_scheme`) and add MTP config to all recipe variants
- Update `_disable_mtp` in `hf_to_megatron_generate_vlm.py` to properly clear both `config.mtp_num_layers` and `language_model.mtp_process` during inference

GitHub Actions CI
See the CI section in the Contributing doc for how to trigger the CI.
A Nvidia developer will need to approve and trigger the CI for external contributors.
Before your PR is "Ready for review"
Pre checks:
If you haven't finished some of the above items you can still open "Draft" PR.
Additional Information
The supported model sets `mtp_num_hidden_layers=1` in its HF config, with a single MTP layer containing its own attention + MLP block plus embedding/hidden projection norms.
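The `_disable_mtp` behavior described in the changelog can be sketched as follows; the config and model objects here are stand-ins, and the attribute names (`mtp_num_layers`, `mtp_process`) are taken from the PR description:

```python
# Sketch of clearing MTP for inference: both the config-level layer count and
# the runtime mtp_process flag must be cleared, since either one alone can
# leave the MTP head active during generation.

from types import SimpleNamespace

def disable_mtp(model):
    model.config.mtp_num_layers = None
    model.language_model.mtp_process = False
    return model

model = disable_mtp(
    SimpleNamespace(
        config=SimpleNamespace(mtp_num_layers=1),
        language_model=SimpleNamespace(mtp_process=True),
    )
)
```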