Conversation
Code Review
This pull request introduces support for Megatron-based parallel training (TP, PP, EP) on NPU by integrating with MindSpeed. Key changes include new example scripts for LoRA and MoE LoRA training, comprehensive documentation updates for NPU setup, and several compatibility fixes such as using torch.where for masking and Gloo groups for object collectives on NPU. The review feedback identifies redundant environment variable assignments in shell scripts, incorrect script paths and PYTHONPATH inconsistencies in the documentation, and suggests clarifying the specific constraints for MoE LoRA support.
cookbook/megatron/npu/tp_lora_npu.sh
Outdated
```shell
ASCEND_RT_VISIBLE_DEVICES="${ASCEND_RT_VISIBLE_DEVICES}" \
torchrun --nproc_per_node=8 cookbook/megatron/npu/tp_lora_npu.py
```
The ASCEND_RT_VISIBLE_DEVICES variable is already set on line 4. Setting it again here for the torchrun command is redundant and can be removed to improve script clarity.
Suggested change:
```diff
-ASCEND_RT_VISIBLE_DEVICES="${ASCEND_RT_VISIBLE_DEVICES}" \
-torchrun --nproc_per_node=8 cookbook/megatron/npu/tp_lora_npu.py
+torchrun --nproc_per_node=8 cookbook/megatron/npu/tp_lora_npu.py
```
```shell
ASCEND_RT_VISIBLE_DEVICES="${ASCEND_RT_VISIBLE_DEVICES}" \
torchrun --nproc_per_node=8 cookbook/megatron/npu/tp_moe_lora_npu.py
```
The ASCEND_RT_VISIBLE_DEVICES variable is already set on line 4. Setting it again here for the torchrun command is redundant and can be removed to improve script clarity.
Suggested change:
```diff
-ASCEND_RT_VISIBLE_DEVICES="${ASCEND_RT_VISIBLE_DEVICES}" \
-torchrun --nproc_per_node=8 cookbook/megatron/npu/tp_moe_lora_npu.py
+torchrun --nproc_per_node=8 cookbook/megatron/npu/tp_moe_lora_npu.py
```
```shell
export TWINKLE_SRC_PATH=/path/to/twinkle/src
export MEGATRON_LM_PATH=/path/to/Megatron-LM
export PYTHONPATH=${TWINKLE_SRC_PATH}:${MEGATRON_LM_PATH}:${PYTHONPATH}
```
The documentation regarding PYTHONPATH setup is inconsistent with the provided example scripts. Here, and in other places (lines 223-225, 332-337), TWINKLE_SRC_PATH is included. However, the example shell scripts (tp_lora_npu.sh, tp_moe_lora_npu.sh) do not use TWINKLE_SRC_PATH, assuming twinkle is installed in an editable mode. To avoid confusion, the documentation should be consistent with the example scripts.
Suggested change:
```diff
-export TWINKLE_SRC_PATH=/path/to/twinkle/src
-export MEGATRON_LM_PATH=/path/to/Megatron-LM
-export PYTHONPATH=${TWINKLE_SRC_PATH}:${MEGATRON_LM_PATH}:${PYTHONPATH}
+export MEGATRON_LM_PATH=/path/to/Megatron-LM
+export PYTHONPATH=${MEGATRON_LM_PATH}:${PYTHONPATH}
```
```shell
export PYTHONPATH=${TWINKLE_SRC_PATH}:${MEGATRON_LM_PATH}:${PYTHONPATH}
```

> **Tip**: `cookbook/megatron/tp.sh` and `cookbook/megatron/tp_moe.sh` already include automatic PYTHONPATH configuration. You can use these scripts directly to launch training without manual setup. Default paths can be overridden via the `TWINKLE_SRC_PATH` and `MEGATRON_LM_PATH` environment variables.
The script paths mentioned here (cookbook/megatron/tp.sh and cookbook/megatron/tp_moe.sh) are incorrect. The scripts added in this pull request are located in cookbook/megatron/npu/ and are named tp_lora_npu.sh and tp_moe_lora_npu.sh. Please update the paths to refer to the correct example scripts.
Suggested change:
```diff
-> **Tip**: `cookbook/megatron/tp.sh` and `cookbook/megatron/tp_moe.sh` already include automatic PYTHONPATH configuration. You can use these scripts directly to launch training without manual setup. Default paths can be overridden via the `TWINKLE_SRC_PATH` and `MEGATRON_LM_PATH` environment variables.
+> **Tip**: The example scripts in `cookbook/megatron/npu/` (e.g., `tp_lora_npu.sh`) already include automatic PYTHONPATH configuration for Megatron-LM. You can use these scripts directly to launch training without manual setup. The default path can be overridden via the `MEGATRON_LM_PATH` environment variable.
```
```markdown
- ✅ Megatron backend (DP=2, TP=2, PP=2)
- ✅ 10-step continuous metric printing + checkpoint saving

**Note**: MoE models do not currently support LoRA fine-tuning (Expert LoRA is not available when ETP>1).
```
This note is confusing as it seems to contradict the preceding section "Megatron MoE LoRA Fine-tuning". It should be rephrased to clarify that the limitation applies specifically when ETP > 1.
Suggested change:
```diff
-**Note**: MoE models do not currently support LoRA fine-tuning (Expert LoRA is not available when ETP>1).
+**Note**: When using LoRA with MoE models, Expert LoRA is only supported when ETP=1. The provided MoE LoRA example uses a verified topology that respects this constraint.
```
```python
# Avoid bool advanced indexing here. On NPU this lowers to
# aclnnNonzeroV2 inside AdvancedIndex and can crash during
# end-to-end training; torch.where preserves the same masking
# semantics without going through that path.
masked_labels = torch.where(loss_mask, labels, torch.zeros_like(labels))
```
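The equivalence of the two masking styles is easy to check on CPU. The sketch below uses made-up toy shapes, not the real Megatron batch layout, and only illustrates why `torch.where` sidesteps the nonzero-based indexing path.

```python
import torch

# Toy stand-ins for the labels/loss_mask produced by the data pipeline.
labels = torch.tensor([[5, 7, 9], [2, 4, 6]])
loss_mask = torch.tensor([[True, False, True], [False, True, True]])

# torch.where keeps the full tensor shape, so no nonzero/AdvancedIndex
# lowering is involved (the path that can crash on NPU).
masked_labels = torch.where(loss_mask, labels, torch.zeros_like(labels))
print(masked_labels)
# tensor([[5, 0, 9],
#         [0, 4, 6]])
```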
```python
# Some Megatron-LM versions (e.g. 0.12.1) only accept
# overlap_param_gather_with_optimizer_step here.
# overlap_param_gather still exists on ddp_config / distributed
# optimizer paths, but passing it directly into OptimizerConfig
# raises TypeError on this branch.
config_sig = inspect.signature(OptimizerConfig).parameters
config_kwargs = {
    'optimizer': 'adam',
    'lr': lr,
    'min_lr': kwargs.get('min_lr', 0.0),
    'weight_decay': kwargs.get('weight_decay', 0.01),
    'adam_beta1': kwargs.get('adam_beta1', 0.9),
    'adam_beta2': kwargs.get('adam_beta2', 0.999),
    'adam_eps': kwargs.get('adam_eps', 1e-8),
    'clip_grad': kwargs.get('clip_grad', 1.0),
    'bf16': kwargs.get('bf16', True),
    'use_distributed_optimizer': use_distributed_optimizer,
    'log_num_zeros_in_grad': kwargs.get('log_num_zeros_in_grad', False),
}
# Keep the old knob only if this Megatron version still exposes it.
# Some branches wire it through ddp_config instead of OptimizerConfig.
if 'overlap_param_gather' in config_sig:
    config_kwargs['overlap_param_gather'] = kwargs.get('overlap_param_gather', False)
if 'overlap_param_gather_with_optimizer_step' in config_sig:
    config_kwargs['overlap_param_gather_with_optimizer_step'] = kwargs.get(
        'overlap_param_gather_with_optimizer_step', kwargs.get('overlap_param_gather', False))
for key, value in kwargs.items():
    if key in config_sig and key not in config_kwargs:
        config_kwargs[key] = value

opt_config = OptimizerConfig(**config_kwargs)
```
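The version-tolerant pattern above — filter keyword arguments through `inspect.signature` before calling the constructor — can be demonstrated standalone. `ExampleConfig` below is an illustrative stand-in, not a Megatron class.

```python
import inspect
from dataclasses import dataclass

@dataclass
class ExampleConfig:
    # Stand-in for OptimizerConfig; only these two fields are accepted.
    lr: float = 1e-3
    clip_grad: float = 1.0

params = inspect.signature(ExampleConfig).parameters
candidates = {'lr': 1e-4, 'clip_grad': 0.5, 'overlap_param_gather': True}
# Drop anything this "version" of the config does not accept.
accepted = {k: v for k, v in candidates.items() if k in params}
config = ExampleConfig(**accepted)
print(config)  # ExampleConfig(lr=0.0001, clip_grad=0.5)
```

The same idiom works for any API whose signature drifts across releases, at the cost of silently dropping misspelled kwargs.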
Pull request overview
Adapts Twinkle’s Megatron backend to run on Ascend NPU via MindSpeed, fixes a MoE sharding correctness issue for older megatron-core, and adds NPU Megatron cookbook + documentation.
Changes:
- Bootstrap MindSpeed early on NPU and propagate a MindSpeed-compatible runtime `Namespace`.
- Fix MoE expert tensor-parallel shape mismatch for `megatron-core<0.13` by tying ETP to TP.
- Add NPU-specific distributed/ops workarounds (Gloo object gather, unfused softmax) plus verified cookbook scripts and docs.
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 8 comments.
Summary per file:
| File | Description |
|---|---|
| src/twinkle/utils/framework.py | Switch metric/object gather to Megatron’s DP Gloo group on NPU to avoid HCCL hangs. |
| src/twinkle/model/megatron/model/register.py | Lazily import/register Megatron model metadata to avoid early megatron.core imports. |
| src/twinkle/model/megatron/model/__init__.py | Stop eager registry imports; re-export lazy registry helper. |
| src/twinkle/model/megatron/mindspeed_bootstrap.py | Build/install MindSpeed runtime args and manage (re)patching on config signature changes. |
| src/twinkle/model/megatron/megatron.py | Invoke MindSpeed bootstrap before Megatron init; add NPU-safe label masking; enable Gloo groups; adjust optimizer wiring. |
| src/twinkle/model/megatron/args.py | Add megatron-core<0.13 ETP fix; NPU softmax fusion fallback; LoRA finalize monkey-patch for None grads. |
| src/twinkle/model/megatron/_mindspeed_args.py | Construct merged MindSpeed Namespace from defaults + HF config + Twinkle topology. |
| src/twinkle/model/megatron/__init__.py | Import MindSpeed adaptor early on NPU so patches apply before megatron.core binds symbols. |
| docs/source_en/Usage Guide/NPU-Support.md | Document MindSpeed/Megatron setup and add verified Megatron NPU examples. |
| cookbook/megatron/npu/tp_moe_lora_npu.sh | Add torchrun launcher for NPU Megatron MoE LoRA smoke. |
| cookbook/megatron/npu/tp_moe_lora_npu.py | Add verified 8-card NPU Megatron MoE LoRA smoke script. |
| cookbook/megatron/npu/tp_lora_npu.sh | Add torchrun launcher for NPU Megatron LoRA smoke. |
| cookbook/megatron/npu/tp_lora_npu.py | Add verified 8-card NPU Megatron LoRA smoke script. |
```python
if _LAST_REPATCH_SIGNATURE is not None:
    adaptor.repatch(vars(runtime_args).copy())
```
The current repatch logic skips calling adaptor.repatch(...) on the first bootstrap (_LAST_REPATCH_SIGNATURE is None). This becomes a correctness issue because src/twinkle/model/megatron/__init__.py imports mindspeed.megatron_adaptor before Twinkle installs runtime args, so MindSpeed can patch using defaults and never get updated on the first real runtime config. Fix by ensuring repatch() is invoked at least once after _set_runtime_args(...) when the adaptor may already be imported (e.g., call repatch() whenever signature != _LAST_REPATCH_SIGNATURE, including the None case, or explicitly detect pre-import via sys.modules and repatch once).
Suggested change:
```diff
-if _LAST_REPATCH_SIGNATURE is not None:
-    adaptor.repatch(vars(runtime_args).copy())
+adaptor.repatch(vars(runtime_args).copy())
```
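One way to implement the reviewer's suggestion is to key the repatch on the computed signature alone, so the first bootstrap (cache still `None`) also triggers it. This is a sketch with a fake adaptor; `maybe_repatch` and the module-level cache name are illustrative, not the repository's actual API.

```python
_LAST_REPATCH_SIGNATURE = None

def maybe_repatch(adaptor, runtime_args_dict):
    """Repatch whenever the runtime-arg signature changes, including the
    very first call, since the adaptor may already be imported with defaults."""
    global _LAST_REPATCH_SIGNATURE
    signature = tuple(sorted(runtime_args_dict.items()))
    if signature != _LAST_REPATCH_SIGNATURE:
        adaptor.repatch(dict(runtime_args_dict))
        _LAST_REPATCH_SIGNATURE = signature

class FakeAdaptor:
    def __init__(self):
        self.calls = []
    def repatch(self, args):
        self.calls.append(args)

adaptor = FakeAdaptor()
maybe_repatch(adaptor, {'tp': 2})  # first bootstrap: repatches
maybe_repatch(adaptor, {'tp': 2})  # unchanged signature: skipped
maybe_repatch(adaptor, {'tp': 4})  # changed signature: repatches again
print(len(adaptor.calls))  # 2
```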
```python
optimizer = get_megatron_optimizer(
    config=opt_config,
    model_chunks=model_chunks,
    use_gloo_process_groups=use_gloo_process_groups,
)
```
use_gloo_process_groups is passed unconditionally into get_megatron_optimizer(...). On older/newer Megatron-Core versions where this kwarg is absent or renamed, this will raise TypeError at runtime. Consider guarding this with a signature check (similar to the OptimizerConfig handling above) and only passing the kwarg when supported, otherwise fall back to the default call.
Suggested change:
```diff
-optimizer = get_megatron_optimizer(
-    config=opt_config,
-    model_chunks=model_chunks,
-    use_gloo_process_groups=use_gloo_process_groups,
-)
+optimizer_sig = inspect.signature(get_megatron_optimizer).parameters
+if 'use_gloo_process_groups' in optimizer_sig:
+    optimizer = get_megatron_optimizer(
+        config=opt_config,
+        model_chunks=model_chunks,
+        use_gloo_process_groups=use_gloo_process_groups,
+    )
+else:
+    optimizer = get_megatron_optimizer(
+        config=opt_config,
+        model_chunks=model_chunks,
+    )
```
```python
# The registry import chain can pull in megatron.core, which must stay
# behind the MindSpeed bootstrap on NPU.
from .model.constant import MLLMModelType
is_multimodal = model_type in {value for key, value in vars(MLLMModelType).items() if not key.startswith('_')}
```
This membership test is likely incorrect if MLLMModelType is an Enum (common for *ModelType constants): vars(MLLMModelType) yields enum members as values (e.g., MLLMModelType.X), so comparing a str (model_type) to enum instances will always be false. Prefer iterating members and comparing against their .value (or, if it’s a plain constants container, explicitly compare against string constants only). This impacts is_multimodal and can misconfigure multimodal vs. text-only model construction.
Suggested change:
```diff
-is_multimodal = model_type in {value for key, value in vars(MLLMModelType).items() if not key.startswith('_')}
+# Support both Enum-based and constant-based definitions of MLLMModelType.
+if hasattr(MLLMModelType, "__members__"):
+    # Enum: compare against the .value of each enum member.
+    _multimodal_model_types = {member.value for member in MLLMModelType.__members__.values()}
+else:
+    # Fallback: compare against string constants defined on MLLMModelType.
+    _multimodal_model_types = {
+        value
+        for key, value in vars(MLLMModelType).items()
+        if not key.startswith('_') and isinstance(value, str)
+    }
+is_multimodal = model_type in _multimodal_model_types
```
```python
from megatron.core import parallel_state
from megatron.core.distributed import DistributedDataParallel as MegatronDDP
from megatron.core.distributed.finalize_model_grads import (_get_main_grad_attr, _reshard_if_dtensor,
                                                            _unshard_if_dtensor, get_attr_wrapped_model)
```
parallel_state and the imports from megatron.core.distributed.finalize_model_grads are not used in finalize_model_grads_for_lora (the function later imports the module via importlib and only swaps _allreduce_word_embedding_grads). Removing unused imports reduces import-time side effects and makes the intent of the monkey-patch clearer.
Suggested change:
```diff
-from megatron.core import parallel_state
-from megatron.core.distributed import DistributedDataParallel as MegatronDDP
-from megatron.core.distributed.finalize_model_grads import (_get_main_grad_attr, _reshard_if_dtensor,
-                                                            _unshard_if_dtensor, get_attr_wrapped_model)
+from megatron.core.distributed import DistributedDataParallel as MegatronDDP
```
```python
if Platform.device_prefix() == 'npu':
    try:
        from megatron.core import parallel_state as mpu

        process_group = mpu.get_data_parallel_group_gloo(
            with_context_parallel=getattr(device_mesh, 'cp_world_size', 1) > 1)
    except Exception:
        pass
```
The broad except Exception: pass will silently fall back to the default process group on NPU, which (per the comment) can hang in dist.all_gather_object(...). To avoid hard-to-debug metric hangs, narrow the exception types (e.g., ImportError, AttributeError) and emit a warning when the fallback happens so operators can see why the safer Gloo path was not used.
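A sketch of what the narrowed fallback could look like. The helper name and exact exception list are assumptions; `AssertionError` is included because megatron-core asserts when the Gloo data-parallel group has not been initialized.

```python
import logging

logger = logging.getLogger(__name__)

def gloo_dp_group_or_default(device_mesh, default_group=None):
    """Return Megatron's Gloo DP group on NPU, falling back loudly."""
    try:
        from megatron.core import parallel_state as mpu
        return mpu.get_data_parallel_group_gloo(
            with_context_parallel=getattr(device_mesh, 'cp_world_size', 1) > 1)
    except (ImportError, AttributeError, AssertionError) as exc:
        # Make the fallback visible so a later all_gather_object hang on the
        # default HCCL group is easier to diagnose.
        logger.warning('Gloo DP group unavailable, using default process '
                       'group for object collectives: %s', exc)
        return default_group
```

On a machine without megatron-core, or before `initialize_model_parallel()` has run, the helper logs the reason and returns the supplied default instead of silently swallowing the error.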
cookbook/megatron/npu/tp_lora_npu.sh
Outdated
|
|
```shell
export PYTHONPATH="${MEGATRON_LM_PATH}:${PYTHONPATH:-}"
```
The launcher only prepends MEGATRON_LM_PATH to PYTHONPATH, but the docs also instruct adding Twinkle’s src path when running from source. Either update the script to optionally include TWINKLE_SRC_PATH (when provided) or adjust the docs around these scripts to clarify that Twinkle must already be installed in the environment.
Suggested change:
```diff
-export PYTHONPATH="${MEGATRON_LM_PATH}:${PYTHONPATH:-}"
+TWINKLE_SRC_PATH=${TWINKLE_SRC_PATH:-}
+if [ -n "${TWINKLE_SRC_PATH}" ]; then
+  export PYTHONPATH="${TWINKLE_SRC_PATH}:${MEGATRON_LM_PATH}:${PYTHONPATH:-}"
+else
+  export PYTHONPATH="${MEGATRON_LM_PATH}:${PYTHONPATH:-}"
+fi
```
```python
model = MegatronModel(model_id=MODEL_ID)
lora_config = LoraConfig(r=8, lora_alpha=32, target_modules='all-linear')
model.add_adapter_to_model('default', lora_config)
model.set_optimizer(optimizer_cls='default', lr=1e-4)
```
This hardcodes lr=1e-4 while the MoE smoke script makes LR configurable via env var. For consistency across cookbook examples (and easier reproduction/tuning), consider reading LR from an environment variable (similar to the other script) or defining a module-level LR constant used here.
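A minimal way to align the two scripts, assuming an environment-variable knob; `TWINKLE_LORA_LR` is a hypothetical name, not one the repository defines.

```python
import os

# Hypothetical knob: fall back to the current hardcoded value when unset.
LR = float(os.environ.get('TWINKLE_LORA_LR', '1e-4'))
# model.set_optimizer(optimizer_cls='default', lr=LR)
print(LR)
```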
PR type
PR information
Summary
- `megatron-core<0.13` (including 0.12.1): when `etp_size` defaulted to 1 while `tp_size > 1`, `GPTBridge` sharded expert weights by `etp=1` while the model had already built expert parameters sharded by `tp_size`, causing a `reshape` crash on every MoE layer weight load.

Changes
Bug fix: ETP shape mismatch on `megatron-core<0.13`

In `megatron-core<0.13`, there is no independent ETP group — expert `ColumnParallelLinear` internally reuses the dense TP group, so expert weights are always sharded by `tp_size`. `TwinkleMegatronArgs.expert_tensor_parallel_size` previously returned `etp_size` from `DeviceMesh` directly. The fix detects the megatron-core version and forces `expert_tensor_parallel_size` to return `tp_size` on affected versions, with a warning logged when the values diverge.

NPU Megatron adaptation
- `megatron/__init__.py`: On NPU, `mindspeed.megatron_adaptor` is imported at package init time, before any `megatron.core` symbol gets bound by value. MindSpeed is an import-time patch system — it replaces `torch.compile`, TE ops, and Megatron internals and must run before those modules are loaded.
- `_mindspeed_args.py`, `mindspeed_bootstrap.py`: Build the `argparse.Namespace` MindSpeed expects at runtime by merging MindSpeed defaults, HF config–derived values, and Twinkle's TP/PP/CP/EP settings, then install it into MindSpeed's internal state. `repatch()` is re-run when the config signature changes.
- `framework.py`: HCCL `all_gather_object` hangs on 8-card NPU. Metric gather on NPU now uses the Megatron-created Gloo DP group instead of the default HCCL group.
- `masked_softmax_fusion` on NPU (`args.py`): MindSpeed's TBE-backed fused softmax fails when the TBE kernel is not compiled. NPU runs fall back to unfused softmax while keeping Flash Attention.
- `args.py`: LoRA freezes the base embedding weight, so `shared_embedding_or_output_weight().grad` is `None` in PP+LoRA runs. Native `_allreduce_word_embedding_grads` would call `all_reduce(None)` and crash. A None-safe wrapper is temporarily monkey-patched around the native `finalize_model_grads` call and restored in a `finally` block. The wrapper also handles `_get_main_grad_attr` signature drift between megatron-core 0.12.x and 0.16.x.
- `bound_device_id` on NPU (`megatron.py`): Gloo subgroups inherit `bound_device_id` from the default process group, which breaks on NPU. The default PG binding is cleared before `initialize_model_parallel()`. Single-card runs also get a `rank=0`/`world_size=1` PG so the same init path works without `torchrun`.
- `model/register.py`: Model type detection no longer imports `megatron.core` before MindSpeed bootstrap.
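The temporary monkey-patch described above follows a plain patch/restore pattern that can be sketched independently of Megatron. Here `types.SimpleNamespace` stands in for the patched module and the grad function is illustrative; the real wrapper also handles the `_get_main_grad_attr` signature drift.

```python
import types

def none_safe(original):
    """Skip the wrapped call when the grad is None (frozen LoRA base weight)
    instead of letting a collective crash on a None tensor."""
    def wrapper(grad):
        if grad is None:
            return None
        return original(grad)
    return wrapper

# Illustrative stand-in for the megatron module being patched.
mod = types.SimpleNamespace(_allreduce_word_embedding_grads=lambda g: g * 2)

original = mod._allreduce_word_embedding_grads
mod._allreduce_word_embedding_grads = none_safe(original)
try:
    assert mod._allreduce_word_embedding_grads(None) is None  # no crash
    assert mod._allreduce_word_embedding_grads(3) == 6
finally:
    # Restore the native function so the patch never leaks past this scope.
    mod._allreduce_word_embedding_grads = original
```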
All changes verified with real 8-card NPU smoke runs (10 steps, continuous loss output, checkpoint saved):
- `cookbook/megatron/npu/tp_full_npu.py`
- `cookbook/megatron/npu/tp_lora_npu.py`

Experiment results
Paste your experiment result here (if needed).