[model] feat: support Qwen 3.5 MTP conversion and training #2711
Conversation
Signed-off-by: Chen Cui <chcui@nvidia.com>
- Add `set -euo pipefail` to slurm_sft.sh and slurm_peft.sh for fail-fast behavior
- Remove `logs/` prefix from SBATCH output/error paths (directory doesn't exist at job start)
- Remove now-unnecessary `mkdir -p logs` calls
- Fix 122B-A10B parallelism comment in slurm_sft.sh to match recipe (TP=2, PP=6, EP=8)
- Add `pytestmark = pytest.mark.integration` to functional test module
- Reset microbatch calculator both before and after each test in fixture

Signed-off-by: Chen Cui <chcui@nvidia.com>
Made-with: Cursor
…n-Bridge into chcui/qwen35_recipes
Signed-off-by: Chen Cui <chcui@nvidia.com>
📝 Walkthrough

The changes introduce Multi-Token Prediction (MTP) support across the Qwen3 VL model architecture. This includes adding MTP configuration mapping entries, extending model signatures with MTP parameters, implementing MTP block specifications in providers, adding comprehensive MTP parameter translation mappings, incorporating embedding sequence-parallel handling, and configuring MTP settings in recipe variants. An inference helper function disables MTP during conversion.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks: ✅ 3 passed | ❌ 1 failed (1 warning)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/text_model.py (1)
178-215: ⚠️ Potential issue | 🟠 Major — Restore the shadowed embedding in a `finally` block.

Lines 191-215 shadow `self.embedding` before calling `_postprocess()`, but the cleanup only runs on the success path. If `_postprocess()` raises, the module stays shadowed by `_sp_scatter_embedding` and subsequent calls see corrupted state.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/text_model.py` around lines 178-215, wrap the call to self._postprocess(...) in a try/finally when _shadow_embedding is set: move the existing _postprocess invocation into a try block and in the finally block restore the original embedding by writing back self.__dict__["embedding"] = _original_embedding (or deleting the shadow if you prefer) so that _sp_scatter_embedding cannot remain in place if _postprocess raises; reference symbols: _shadow_embedding, _original_embedding, _sp_scatter_embedding, self.embedding, and _postprocess.

src/megatron/bridge/models/qwen_vl/qwen35_vl_provider.py (1)
202-221: ⚠️ Potential issue | 🟠 Major — Patch the MTP block spec with `Qwen3VLSelfAttention` too.

`block_spec` is patched locally before constructing the main decoder, but `mtp_block_spec(self, vp_stage=vp_stage)` rebuilds a fresh spec from `self.transformer_layer_spec`. That leaves the MTP path on the unpatched attention class, so MTP won't get the same Qwen VL mRoPE/self-attention behavior as the main decoder.

Also applies to: 399-424
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/megatron/bridge/models/qwen_vl/qwen35_vl_provider.py` around lines 202 - 221, The MTP path is being built from mtp_block_spec(self, vp_stage=vp_stage) which recreates a spec from self.transformer_layer_spec and therefore isn't patched with Qwen3VLSelfAttention; after creating mtp_spec = mtp_block_spec(self, vp_stage=vp_stage) (or inline) call _patch_standard_attention_specs(mtp_spec, Qwen3VLSelfAttention) so the MTP block spec uses the same patched attention class as block_spec; update both places noted (around the Qwen3VLModel construction and the other occurrence at lines ~399-424) to patch the mtp spec before passing it into Qwen3VLModel.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/megatron/bridge/models/conversion/model_bridge.py`:
- Around line 291-292: The shared mapping MegatronModelBridge.CONFIG_MAPPING
must not include the Qwen3.5-specific alias ("mtp_num_hidden_layers",
"mtp_num_layers"); remove that tuple from MegatronModelBridge.CONFIG_MAPPING and
instead implement the alias in the Qwen3.5-specific bridge (e.g.,
Qwen35ModelBridge) by adding a bridge-local input alias or an override of
hf_config_to_provider_kwargs()/megatron_to_hf_config() that maps
mtp_num_hidden_layers ↔ mtp_num_layers only for Qwen3.5 models, so other bridges
keep the original behavior and no global alias wins silently on import.
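The bridge-local alias suggested above can be sketched as follows; the class and method names here (`CONFIG_MAPPING`, `hf_config_to_provider_kwargs`) mirror the review comment but are simplified stand-ins for the real Megatron-Bridge API:

```python
# Sketch of a bridge-local config alias, under the assumption that the base
# bridge translates HF config keys via a class-level CONFIG_MAPPING dict.

class MegatronModelBridge:
    # Shared mapping: HF config key -> Megatron provider kwarg.
    CONFIG_MAPPING = {
        "hidden_size": "hidden_size",
        "num_hidden_layers": "num_layers",
    }

    def hf_config_to_provider_kwargs(self, hf_config: dict) -> dict:
        return {
            target: hf_config[source]
            for source, target in self.CONFIG_MAPPING.items()
            if source in hf_config
        }


class Qwen35ModelBridge(MegatronModelBridge):
    # The Qwen3.5-only alias lives here, so other bridges are unaffected
    # and no global alias wins silently on import.
    CONFIG_MAPPING = {
        **MegatronModelBridge.CONFIG_MAPPING,
        "mtp_num_hidden_layers": "mtp_num_layers",
    }


hf_config = {"hidden_size": 4096, "num_hidden_layers": 48, "mtp_num_hidden_layers": 1}
base = MegatronModelBridge().hf_config_to_provider_kwargs(hf_config)
qwen = Qwen35ModelBridge().hf_config_to_provider_kwargs(hf_config)
```

With this split, a generic bridge silently ignores `mtp_num_hidden_layers` while the Qwen3.5 bridge maps it to `mtp_num_layers`.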
In `@src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/model.py`:
- Around line 302-306: The code only sets lm_input_ids = input_ids_thd inside
the self.pre_process branch, leaving lm_input_ids as BxS for later stages;
change the logic so that whenever packed sequences are in use (input_ids_thd
exists or the packed-sequence flag is set) and pipeline_model_parallel_size > 1,
lm_input_ids is switched to input_ids_thd regardless of self.pre_process.
Concretely, adjust the assignment around lm_input_ids (the variable used by
MTP/_get_embeddings) so it prefers input_ids_thd when present or when
packing/THD layout is required for multi-stage pipeline inference, ensuring the
last decoder stage uses the same THD layout as position_ids and hidden states.
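The selection logic the comment asks for can be sketched as a small helper; the argument names follow the review text, but the surrounding model code is assumed:

```python
# Minimal sketch of the suggested lm_input_ids selection. Prefer the packed
# (THD) layout whenever it exists and either this is the pre_process stage or
# the pipeline spans multiple stages, so later stages see the same layout as
# position_ids and hidden states.

def select_lm_input_ids(input_ids, input_ids_thd, pre_process: bool, pp_size: int):
    if input_ids_thd is not None and (pre_process or pp_size > 1):
        return input_ids_thd
    return input_ids
```

The key change versus the code under review is that the THD branch is no longer gated solely on `pre_process`.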
In `@src/megatron/bridge/models/qwen_vl/qwen35_vl_bridge.py`:
- Around line 378-431: The MTP mappings only cover the "mtp_model_layer" prefix
so weights exposed as "transformer_layer" will be missed; update the mapping
registration (the mtp_param_mappings dict and the subsequent mapping_list.extend
entries) to register both prefixes by duplicating or looping over prefixes
["mtp_model_layer","transformer_layer"] when creating
AutoMapping/QKVMapping/GatedMLPMapping/ReplicatedMapping entries (refer to
mtp_param_mappings, QKVMapping(...linear_qkv...),
GatedMLPMapping(...mlp.experts...), AutoMapping(...linear_fc2...), and
ReplicatedMapping(...shared_experts.gate_weight)); apply the same dual-prefix
change to the other MTP block around the 656-691 region so both
"transformer_layer" and "mtp_model_layer" variants are produced.
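The dual-prefix registration can be sketched as a loop; the mapping factory here is a stand-in for the real `AutoMapping`/`QKVMapping`/`GatedMLPMapping` classes, and the parameter names are illustrative:

```python
# Sketch: register each MTP weight mapping under both prefixes so weights
# exposed as either "mtp_model_layer.*" or "transformer_layer.*" are covered.

def build_mtp_mappings(make_mapping):
    mappings = []
    for prefix in ("mtp_model_layer", "transformer_layer"):
        mappings.append(make_mapping(f"{prefix}.self_attention.linear_qkv"))
        mappings.append(make_mapping(f"{prefix}.mlp.linear_fc2"))
    return mappings

# Using the identity function as a placeholder mapping factory:
mappings = build_mtp_mappings(lambda name: name)
```

The same loop would wrap the `mapping_list.extend` calls in both MTP blocks of the bridge, so neither prefix variant is silently dropped.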
In `@src/megatron/bridge/recipes/qwen_vl/qwen35_vl.py`:
- Around line 68-71: The recipe is overwriting the HF-detected MTP depth; remove
any hardcoded assignment to cfg.model.mtp_num_layers (do not set it to 1) so the
value returned by AutoBridge.from_hf_pretrained(hf_path) is preserved, and only
assign cfg.model.mtp_loss_scaling_factor when cfg.model.mtp_num_layers is truthy
(i.e., MTP is present). Locate uses of cfg.model.mtp_num_layers and
cfg.model.mtp_loss_scaling_factor in the recipe functions that call
AutoBridge.from_hf_pretrained and delete the mtp_num_layers assignment and guard
the mtp_loss_scaling_factor assignment behind a check of
cfg.model.mtp_num_layers.
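The guarded recipe assignment can be sketched like this; the config object is a stand-in for the recipe's `cfg.model`, and 0.1 is the scaling factor named in the PR description:

```python
# Sketch of the guarded recipe assignment: never overwrite the MTP depth
# detected from the HF config; only set the loss scale when MTP is present.

from types import SimpleNamespace

def apply_mtp_settings(model_cfg):
    # Do NOT assign model_cfg.mtp_num_layers here — keep the value that
    # AutoBridge.from_hf_pretrained() derived from the HF checkpoint.
    if getattr(model_cfg, "mtp_num_layers", None):
        model_cfg.mtp_loss_scaling_factor = 0.1
    return model_cfg

with_mtp = apply_mtp_settings(SimpleNamespace(mtp_num_layers=1))
without_mtp = apply_mtp_settings(SimpleNamespace(mtp_num_layers=None))
```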
---
Outside diff comments:
In `@src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/text_model.py`:
- Around line 178-215: Wrap the call to self._postprocess(...) in a try/finally
when _shadow_embedding is set: move the existing _postprocess invocation into a
try block and in the finally block restore the original embedding by writing
back self.__dict__["embedding"] = _original_embedding (or deleting the shadow if
you prefer) so that _sp_scatter_embedding cannot remain in place if _postprocess
raises; reference symbols: _shadow_embedding, _original_embedding,
_sp_scatter_embedding, self.embedding, and _postprocess.
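The try/finally restore pattern the prompt describes can be sketched with a dummy module; the attribute names follow the review comment, but the real text model's internals are assumed:

```python
# Sketch of the try/finally embedding restore: even if _postprocess raises,
# the temporary shadow must not remain in self.__dict__.

class TextModel:
    def __init__(self):
        self.embedding = "original_embedding"

    def _postprocess(self, fail: bool = False):
        if fail:
            raise RuntimeError("postprocess failed")

    def forward(self, shadow: bool = False, fail: bool = False):
        original = self.embedding
        if shadow:
            # Temporarily shadow the embedding (e.g. with an SP-scatter wrapper).
            self.__dict__["embedding"] = "sp_scatter_embedding"
        try:
            self._postprocess(fail=fail)
        finally:
            if shadow:
                # Restore the original even on the failure path.
                self.__dict__["embedding"] = original

model = TextModel()
try:
    model.forward(shadow=True, fail=True)
except RuntimeError:
    pass
```

Without the `finally`, the raising call would leave `embedding` pointing at the shadow, which is exactly the corrupted state the review flags.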
In `@src/megatron/bridge/models/qwen_vl/qwen35_vl_provider.py`:
- Around line 202-221: The MTP path is being built from mtp_block_spec(self,
vp_stage=vp_stage) which recreates a spec from self.transformer_layer_spec and
therefore isn't patched with Qwen3VLSelfAttention; after creating mtp_spec =
mtp_block_spec(self, vp_stage=vp_stage) (or inline) call
_patch_standard_attention_specs(mtp_spec, Qwen3VLSelfAttention) so the MTP block
spec uses the same patched attention class as block_spec; update both places
noted (around the Qwen3VLModel construction and the other occurrence at lines
~399-424) to patch the mtp spec before passing it into Qwen3VLModel.
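The fix can be sketched with simplified spec objects; `LayerSpec` here stands in for Megatron's `ModuleSpec` tree, and `patch_standard_attention` for the provider's `_patch_standard_attention_specs` helper:

```python
# Sketch: apply the same attention patch to both the main decoder spec and
# the freshly rebuilt MTP block spec before constructing the model.

from dataclasses import dataclass

class StandardAttention: ...
class Qwen3VLSelfAttention(StandardAttention): ...

@dataclass
class LayerSpec:
    attention_cls: type = StandardAttention

def patch_standard_attention(spec: LayerSpec, attn_cls: type) -> LayerSpec:
    # Swap in the Qwen VL attention wherever the standard class is used.
    if spec.attention_cls is StandardAttention:
        spec.attention_cls = attn_cls
    return spec

def mtp_block_spec() -> LayerSpec:
    # Rebuilt from scratch, so it starts unpatched — the bug in the finding.
    return LayerSpec()

block_spec = patch_standard_attention(LayerSpec(), Qwen3VLSelfAttention)
mtp_spec = patch_standard_attention(mtp_block_spec(), Qwen3VLSelfAttention)
```

The point is simply that the patch must be applied a second time after `mtp_block_spec()` is built, in both provider code paths the finding mentions.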
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 4b764c39-1c37-4042-a4d1-acc9606d2476
📒 Files selected for processing (7)
- examples/conversion/hf_to_megatron_generate_vlm.py
- src/megatron/bridge/models/conversion/model_bridge.py
- src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/model.py
- src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/text_model.py
- src/megatron/bridge/models/qwen_vl/qwen35_vl_bridge.py
- src/megatron/bridge/models/qwen_vl/qwen35_vl_provider.py
- src/megatron/bridge/recipes/qwen_vl/qwen35_vl.py
What does this PR do?
Add Multi-Token Prediction (MTP) support for Qwen3.5 VL models, enabling MTP-aware checkpoint conversion and training for both dense and MoE variants.
Changelog
- Add `Qwen35VLMoEBridge` and `Qwen35VLBridge` for bidirectional HF ↔ Megatron conversion (QKV, MoE experts, shared experts, projection, layernorms)
- Add `mtp_num_hidden_layers` → `mtp_num_layers` config alias in `MegatronModelBridge.CONFIG_MAPPING`
- Thread `mtp_block_spec` and `vp_stage` through `Qwen3VLModel` to the language model in both dense and MoE providers
- Pass `input_ids` to the language model (instead of `None`) so MTP can roll input_ids for future-token embeddings; handle THD-format for packed sequences
- Add embedding sequence-parallel handling in `Qwen3VLGPTModel`
- Set `mtp_loss_scaling_factor = 0.1` in both bridges when MTP is enabled
- Refactor `qwen35_vl.py` recipes into shared helpers (`_qwen35_vl_apply_common`, `_qwen35_vl_apply_moe`, `_qwen35_vl_enable_recompute`, `_qwen35_vl_apply_peft_scheme`) and add MTP config to all recipe variants
- Update `_disable_mtp` in `hf_to_megatron_generate_vlm.py` to properly clear both `config.mtp_num_layers` and `language_model.mtp_process` during inference

GitHub Actions CI
See the CI section in the Contributing doc for how to trigger the CI.
A Nvidia developer will need to approve and trigger the CI for external contributors.
Before your PR is "Ready for review"
Pre checks:
If you haven't finished some of the above items you can still open "Draft" PR.
Additional Information
The supported model sets `mtp_num_hidden_layers=1` in its HF config, with a single MTP layer containing its own attention + MLP block plus embedding/hidden projection norms.
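The `_disable_mtp` behavior described in the changelog can be sketched as follows; the config and model objects here are stand-ins, and the attribute names (`mtp_num_layers`, `mtp_process`) are taken from the PR description:

```python
# Sketch of clearing MTP for inference: both the config-level layer count and
# the runtime mtp_process flag must be cleared, since either one alone can
# leave the MTP head active during generation.

from types import SimpleNamespace

def disable_mtp(model):
    model.config.mtp_num_layers = None
    model.language_model.mtp_process = False
    return model

model = disable_mtp(
    SimpleNamespace(
        config=SimpleNamespace(mtp_num_layers=1),
        language_model=SimpleNamespace(mtp_process=True),
    )
)
```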