
[model] feat: support Qwen 3.5 MTP conversion and training#2711

Merged
cuichenx merged 22 commits into main from chcui/qwen35_mtp
Mar 13, 2026

Conversation

@cuichenx
Contributor

@cuichenx cuichenx commented Mar 9, 2026

What does this PR do?

Add Multi-Token Prediction (MTP) support for Qwen3.5 VL models, enabling MTP-aware checkpoint conversion and training for both dense and MoE variants.

Changelog

  • Add MTP parameter mappings in Qwen35VLMoEBridge and Qwen35VLBridge for bidirectional HF ↔ Megatron conversion (QKV, MoE experts, shared experts, projection, layernorms)
  • Add mtp_num_hidden_layers → mtp_num_layers config alias in MegatronModelBridge.CONFIG_MAPPING
  • Pass mtp_block_spec and vp_stage through Qwen3VLModel to the language model in both dense and MoE providers
  • Forward input_ids to the language model (instead of None) so MTP can roll input_ids for future-token embeddings; handle THD-format for packed sequences
  • Wrap embedding with SP scatter when MTP is active with sequence parallelism in Qwen3VLGPTModel
  • Set mtp_loss_scaling_factor = 0.1 in both bridges when MTP is enabled
  • Refactor qwen35_vl.py recipes into shared helpers (_qwen35_vl_apply_common, _qwen35_vl_apply_moe, _qwen35_vl_enable_recompute, _qwen35_vl_apply_peft_scheme) and add MTP config to all recipe variants
  • Improve _disable_mtp in hf_to_megatron_generate_vlm.py to properly clear both config.mtp_num_layers and language_model.mtp_process during inference
  • Remove debug tiny-model override and outdated TODO comments from bridge/provider
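The _disable_mtp improvement in the last changelog items can be sketched roughly as follows. This is a hypothetical illustration, not the actual implementation: the attribute names (config.mtp_num_layers, language_model.mtp_process) come from the changelog, while the traversal and guards are assumptions.

```python
from types import SimpleNamespace

def _disable_mtp(model, config):
    """Disable Multi-Token Prediction for inference/generation (sketch)."""
    # Clear the MTP layer count so no MTP block is built or executed.
    if getattr(config, "mtp_num_layers", None):
        config.mtp_num_layers = None
    # Also clear the flag on the language model so the forward pass skips
    # the MTP roll of input_ids.
    language_model = getattr(model, "language_model", model)
    if hasattr(language_model, "mtp_process"):
        language_model.mtp_process = False

# Tiny demo with stand-in objects:
cfg = SimpleNamespace(mtp_num_layers=1)
model = SimpleNamespace(language_model=SimpleNamespace(mtp_process=True))
_disable_mtp(model, cfg)
```

Clearing both the config field and the module flag matters because either one alone can leave the MTP path partially active during generation.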

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI.
An NVIDIA developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (e.g. Numba, Pynini, Apex, etc.)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items, you can still open a "Draft" PR.

Additional Information

  • Builds on top of the Qwen3.5 VL recipes added in [model, recipe] Add Qwen 3.5 recipes #2654
  • MTP architecture: Qwen3.5 uses mtp_num_hidden_layers=1 in its HF config, with a single MTP layer containing its own attention + MLP block plus embedding/hidden projection norms
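The config alias mentioned in the changelog renames the HF field to its Megatron counterpart during conversion. A minimal sketch of that idea, assuming a simple key-rename dict (the real CONFIG_MAPPING in MegatronModelBridge may have a different shape):

```python
# Illustrative HF -> Megatron config key aliasing; the dict layout here
# is an assumption for demonstration, not the actual CONFIG_MAPPING.
CONFIG_ALIASES = {
    # HF config field        -> Megatron provider field
    "mtp_num_hidden_layers": "mtp_num_layers",
}

def translate_config(hf_config):
    """Rename aliased HF config keys to their Megatron equivalents."""
    return {CONFIG_ALIASES.get(k, k): v for k, v in hf_config.items()}

megatron_cfg = translate_config({"mtp_num_hidden_layers": 1, "hidden_size": 4096})
```

With Qwen3.5's mtp_num_hidden_layers=1, the translated config carries mtp_num_layers=1 into the Megatron provider.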

cuichenx added 17 commits March 2, 2026 10:06
Signed-off-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>
- Add `set -euo pipefail` to slurm_sft.sh and slurm_peft.sh for fail-fast behavior
- Remove `logs/` prefix from SBATCH output/error paths (directory doesn't exist at job start)
- Remove now-unnecessary `mkdir -p logs` calls
- Fix 122B-A10B parallelism comment in slurm_sft.sh to match recipe (TP=2, PP=6, EP=8)
- Add `pytestmark = pytest.mark.integration` to functional test module
- Reset microbatch calculator both before and after each test in fixture

Signed-off-by: Chen Cui <chcui@nvidia.com>
Made-with: Cursor
Signed-off-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>
@cuichenx cuichenx marked this pull request as draft March 9, 2026 23:06
@copy-pr-bot

copy-pr-bot bot commented Mar 9, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

Signed-off-by: Chen Cui <chcui@nvidia.com>
@coderabbitai
Contributor

coderabbitai bot commented Mar 10, 2026

📝 Walkthrough

Walkthrough

The changes introduce Multi-Token Prediction (MTP) support across the Qwen3 VL model architecture. This includes adding MTP configuration mapping entries, extending model signatures with MTP parameters, implementing MTP block specifications in providers, adding comprehensive MTP parameter translation mappings, incorporating embedding sequence-parallel handling, and configuring MTP settings in recipe variants. An inference helper function disables MTP during conversion.

Changes

Cohort / File(s) / Summary

  • MTP Configuration Mapping — src/megatron/bridge/models/conversion/model_bridge.py:
    Added config mapping entry linking HF parameter mtp_num_hidden_layers to Megatron parameter mtp_num_layers.
  • Qwen3VL Model Extensions — src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/model.py, src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/text_model.py:
    Extended Qwen3VLModel signature with optional mtp_block_spec and vp_stage parameters; added embedding sequence-parallel scatter wrapper when MTP is active; introduced lm_input_ids routing for MTP-compatible input handling across preprocessing and packed sequence paths.
  • MTP Provider Implementation — src/megatron/bridge/models/qwen_vl/qwen35_vl_provider.py:
    Added mtp_block_spec instantiation and passing to Qwen3VLModel constructor in both dense and MoE provider paths.
  • MTP Parameter Mappings & Loss Scaling — src/megatron/bridge/models/qwen_vl/qwen35_vl_bridge.py:
    Removed debug override block; added comprehensive MTP parameter mappings (QKVMapping, GatedMLPMapping, ReplicatedMapping, AutoMapping) translating Megatron MTP layers to HF equivalents; configured mtp_loss_scaling_factor = 0.1 when MTP is enabled for both dense and MoE variants.
  • Recipe Configuration — src/megatron/bridge/recipes/qwen_vl/qwen35_vl.py:
    Added MTP configuration settings (mtp_num_layers = 1, mtp_loss_scaling_factor = 0.1) to all Qwen3.5-VL SFT and PEFT recipe variants.
  • Conversion & Inference — examples/conversion/hf_to_megatron_generate_vlm.py:
    Introduced _disable_mtp helper function to disable MTP during inference by clearing mtp_num_layers and mtp_process; applied after model loading and CUDA setup.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

  • Test Results For Major Changes — ⚠️ Warning. No test files, test results, or testing information were found in the PR for the major changes to Qwen VL model conversion. Resolution: add test files, test cases, or testing documentation to validate the Qwen VL model conversion functionality.
✅ Passed checks (3 passed)
  • Title check — ✅ Passed. The pull request title clearly summarizes the main change: adding support for Qwen 3.5 MTP (Multi-Token Prediction) conversion and training, which is directly reflected across all modified files.
  • Docstring Coverage — ✅ Passed. Docstring coverage is 94.74%, which is sufficient. The required threshold is 80.00%.
  • Description Check — ✅ Passed. Check skipped - CodeRabbit's high-level summary is enabled.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/text_model.py (1)

178-215: ⚠️ Potential issue | 🟠 Major

Restore the shadowed embedding in a finally block.

Lines 191-215 shadow self.embedding before calling _postprocess(), but the cleanup only runs on the success path. If _postprocess() raises, the module stays shadowed by _sp_scatter_embedding and subsequent calls see corrupted state.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/text_model.py` around
lines 178 - 215, Wrap the call to self._postprocess(...) in a try/finally when
_shadow_embedding is set: move the existing _postprocess invocation into a try
block and in the finally block restore the original embedding by writing back
self.__dict__["embedding"] = _original_embedding (or deleting the shadow if you
prefer) so that _sp_scatter_embedding cannot remain in place if _postprocess
raises; reference symbols: _shadow_embedding, _original_embedding,
_sp_scatter_embedding, self.embedding, and _postprocess.
src/megatron/bridge/models/qwen_vl/qwen35_vl_provider.py (1)

202-221: ⚠️ Potential issue | 🟠 Major

Patch the MTP block spec with Qwen3VLSelfAttention too.

block_spec is patched locally before constructing the main decoder, but mtp_block_spec(self, vp_stage=vp_stage) rebuilds a fresh spec from self.transformer_layer_spec. That leaves the MTP path on the unpatched attention class, so MTP won't get the same Qwen VL mRoPE/self-attention behavior as the main decoder.

Also applies to: 399-424

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/models/qwen_vl/qwen35_vl_provider.py` around lines 202 -
221, The MTP path is being built from mtp_block_spec(self, vp_stage=vp_stage)
which recreates a spec from self.transformer_layer_spec and therefore isn't
patched with Qwen3VLSelfAttention; after creating mtp_spec =
mtp_block_spec(self, vp_stage=vp_stage) (or inline) call
_patch_standard_attention_specs(mtp_spec, Qwen3VLSelfAttention) so the MTP block
spec uses the same patched attention class as block_spec; update both places
noted (around the Qwen3VLModel construction and the other occurrence at lines
~399-424) to patch the mtp spec before passing it into Qwen3VLModel.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/megatron/bridge/models/conversion/model_bridge.py`:
- Around line 291-292: The shared mapping MegatronModelBridge.CONFIG_MAPPING
must not include the Qwen3.5-specific alias ("mtp_num_hidden_layers",
"mtp_num_layers"); remove that tuple from MegatronModelBridge.CONFIG_MAPPING and
instead implement the alias in the Qwen3.5-specific bridge (e.g.,
Qwen35ModelBridge) by adding a bridge-local input alias or an override of
hf_config_to_provider_kwargs()/megatron_to_hf_config() that maps
mtp_num_hidden_layers ↔ mtp_num_layers only for Qwen3.5 models, so other bridges
keep the original behavior and no global alias wins silently on import.

In `@src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/model.py`:
- Around line 302-306: The code only sets lm_input_ids = input_ids_thd inside
the self.pre_process branch, leaving lm_input_ids as BxS for later stages;
change the logic so that whenever packed sequences are in use (input_ids_thd
exists or the packed-sequence flag is set) and pipeline_model_parallel_size > 1,
lm_input_ids is switched to input_ids_thd regardless of self.pre_process.
Concretely, adjust the assignment around lm_input_ids (the variable used by
MTP/_get_embeddings) so it prefers input_ids_thd when present or when
packing/THD layout is required for multi-stage pipeline inference, ensuring the
last decoder stage uses the same THD layout as position_ids and hidden states.
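The input-routing change described in that comment amounts to preferring the packed (THD) input_ids whenever packed sequences are in use, independent of pipeline stage. A hedged sketch (variable names follow the review comment; the surrounding forward logic is assumed):

```python
def select_lm_input_ids(input_ids, input_ids_thd, packed_seq):
    """Pick the input_ids layout the language model should see (sketch)."""
    if packed_seq and input_ids_thd is not None:
        # All pipeline stages must see the same THD layout as the
        # position_ids and hidden states, not just the pre-process stage.
        return input_ids_thd
    return input_ids

chosen = select_lm_input_ids("bs_ids", "thd_ids", packed_seq=True)
```

The key point is that the choice no longer hinges on self.pre_process, so the last decoder stage rolls the same THD-shaped ids that MTP expects.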

In `@src/megatron/bridge/models/qwen_vl/qwen35_vl_bridge.py`:
- Around line 378-431: The MTP mappings only cover the "mtp_model_layer" prefix
so weights exposed as "transformer_layer" will be missed; update the mapping
registration (the mtp_param_mappings dict and the subsequent mapping_list.extend
entries) to register both prefixes by duplicating or looping over prefixes
["mtp_model_layer","transformer_layer"] when creating
AutoMapping/QKVMapping/GatedMLPMapping/ReplicatedMapping entries (refer to
mtp_param_mappings, QKVMapping(...linear_qkv...),
GatedMLPMapping(...mlp.experts...), AutoMapping(...linear_fc2...), and
ReplicatedMapping(...shared_experts.gate_weight)); apply the same dual-prefix
change to the other MTP block around the 656-691 region so both
"transformer_layer" and "mtp_model_layer" variants are produced.

In `@src/megatron/bridge/recipes/qwen_vl/qwen35_vl.py`:
- Around line 68-71: The recipe is overwriting the HF-detected MTP depth; remove
any hardcoded assignment to cfg.model.mtp_num_layers (do not set it to 1) so the
value returned by AutoBridge.from_hf_pretrained(hf_path) is preserved, and only
assign cfg.model.mtp_loss_scaling_factor when cfg.model.mtp_num_layers is truthy
(i.e., MTP is present). Locate uses of cfg.model.mtp_num_layers and
cfg.model.mtp_loss_scaling_factor in the recipe functions that call
AutoBridge.from_hf_pretrained and delete the mtp_num_layers assignment and guard
the mtp_loss_scaling_factor assignment behind a check of
cfg.model.mtp_num_layers.
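The recipe guard suggested above can be sketched as follows, with a SimpleNamespace standing in for the real config object (an assumption for illustration only):

```python
from types import SimpleNamespace

def apply_mtp_settings(cfg):
    """Guarded MTP recipe settings (sketch of the suggested fix)."""
    # Do NOT overwrite cfg.model.mtp_num_layers here; preserve the depth
    # detected from the HF checkpoint by AutoBridge.from_hf_pretrained.
    if getattr(cfg.model, "mtp_num_layers", None):
        cfg.model.mtp_loss_scaling_factor = 0.1

cfg_with_mtp = SimpleNamespace(model=SimpleNamespace(mtp_num_layers=1))
cfg_without_mtp = SimpleNamespace(model=SimpleNamespace(mtp_num_layers=None))
apply_mtp_settings(cfg_with_mtp)
apply_mtp_settings(cfg_without_mtp)
```

This keeps non-MTP checkpoints untouched while still applying the 0.1 loss scaling factor when MTP layers are present.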

---

Outside diff comments:
In `@src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/text_model.py`:
- Around line 178-215: Wrap the call to self._postprocess(...) in a try/finally
when _shadow_embedding is set: move the existing _postprocess invocation into a
try block and in the finally block restore the original embedding by writing
back self.__dict__["embedding"] = _original_embedding (or deleting the shadow if
you prefer) so that _sp_scatter_embedding cannot remain in place if _postprocess
raises; reference symbols: _shadow_embedding, _original_embedding,
_sp_scatter_embedding, self.embedding, and _postprocess.

In `@src/megatron/bridge/models/qwen_vl/qwen35_vl_provider.py`:
- Around line 202-221: The MTP path is being built from mtp_block_spec(self,
vp_stage=vp_stage) which recreates a spec from self.transformer_layer_spec and
therefore isn't patched with Qwen3VLSelfAttention; after creating mtp_spec =
mtp_block_spec(self, vp_stage=vp_stage) (or inline) call
_patch_standard_attention_specs(mtp_spec, Qwen3VLSelfAttention) so the MTP block
spec uses the same patched attention class as block_spec; update both places
noted (around the Qwen3VLModel construction and the other occurrence at lines
~399-424) to patch the mtp spec before passing it into Qwen3VLModel.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 4b764c39-1c37-4042-a4d1-acc9606d2476

📥 Commits

Reviewing files that changed from the base of the PR and between de93536 and ce2c681.

📒 Files selected for processing (7)
  • examples/conversion/hf_to_megatron_generate_vlm.py
  • src/megatron/bridge/models/conversion/model_bridge.py
  • src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/model.py
  • src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/text_model.py
  • src/megatron/bridge/models/qwen_vl/qwen35_vl_bridge.py
  • src/megatron/bridge/models/qwen_vl/qwen35_vl_provider.py
  • src/megatron/bridge/recipes/qwen_vl/qwen35_vl.py

Signed-off-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>
@HollowMan6
Contributor

@cuichenx I just found that enabling MTP with Qwen3.5 when PP>1 breaks the Megatron->HF bridge for word_embeddings.weight, so I opened #2799 to fix this; please take a look when you have time, thanks!


Labels

area:model Model implementations and HF bridge logic ready-to-merge PR is approved, current, and only waiting for CI to pass before merge


3 participants