Npu megatron adapt #129

Open

addsubmuldiv wants to merge 6 commits into modelscope:main from addsubmuldiv:npu_adapt_doc

Conversation


@addsubmuldiv addsubmuldiv commented Mar 26, 2026

PR type

  • Bug Fix
  • New Feature
  • Document Updates
  • More Models or Datasets Support

PR information

Summary

  • Adapt Twinkle's Megatron backend to run on Ascend NPU via MindSpeed
  • Fix a correctness bug in MoE training that affects megatron-core<0.13 (including 0.12.1): when etp_size defaulted to 1 while tp_size > 1, GPTBridge sharded expert weights by etp=1 while the model had already built expert parameters sharded by tp_size, causing a reshape crash on every MoE layer weight load
  • Add verified NPU cookbook examples for Dense full-parameter, LoRA, and MoE LoRA training

Changes

Bug fix: ETP shape mismatch on megatron-core<0.13

In megatron-core<0.13, there is no independent ETP group — expert ColumnParallelLinear internally reuses the dense TP group, so expert weights are always sharded by tp_size. TwinkleMegatronArgs.expert_tensor_parallel_size previously returned etp_size from DeviceMesh directly. The fix detects the megatron-core version and forces expert_tensor_parallel_size to return tp_size on affected versions, with a warning logged when the values diverge.
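The version gate described above can be sketched as a standalone helper (function and helper names here are hypothetical; the actual fix lives in `TwinkleMegatronArgs.expert_tensor_parallel_size`):

```python
def _mcore_minor(v: str) -> tuple:
    # Minimal parse: "0.12.1" -> (0, 12). Enough to gate on the 0.13 boundary.
    major, minor = v.split(".")[:2]
    return (int(major), int(minor))


def effective_expert_tp(mcore_version: str, tp_size: int, etp_size: int) -> int:
    """Sketch of the fix: on megatron-core<0.13 there is no independent ETP
    group, so expert weights are always sharded by tp_size and the bridge
    must shard the same way."""
    if _mcore_minor(mcore_version) < (0, 13):
        if etp_size != tp_size:
            # The real fix logs a warning when the values diverge.
            print(f"etp_size={etp_size} ignored on megatron-core "
                  f"{mcore_version}; using tp_size={tp_size}")
        return tp_size
    return etp_size
```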

NPU Megatron adaptation

  • MindSpeed early import (megatron/__init__.py): On NPU, mindspeed.megatron_adaptor is imported at package init time, before any megatron.core symbol gets bound by value. MindSpeed is an import-time patch system — it replaces torch.compile, TE ops, and Megatron internals and must run before those modules are loaded
  • MindSpeed runtime args (_mindspeed_args.py, mindspeed_bootstrap.py): Builds the argparse.Namespace MindSpeed expects at runtime by merging MindSpeed defaults, HF config–derived values, and Twinkle's TP/PP/CP/EP settings, then installs it into MindSpeed's internal state. Re-runs repatch() when the config signature changes
  • Gloo group for NPU object gather (framework.py): HCCL all_gather_object hangs on 8-card NPU. Metric gather on NPU now uses the Megatron-created Gloo DP group instead of the default HCCL group
  • Disabled masked_softmax_fusion on NPU (args.py): MindSpeed's TBE-backed fused softmax fails when the TBE kernel is not compiled. NPU runs fall back to unfused softmax while keeping Flash Attention
  • None-safe embedding all-reduce for LoRA (args.py): LoRA freezes the base embedding weight, so shared_embedding_or_output_weight().grad is None in PP+LoRA runs. Native _allreduce_word_embedding_grads would call all_reduce(None) and crash. A None-safe wrapper is temporarily monkey-patched around the native finalize_model_grads call and restored in a finally block. The wrapper also handles _get_main_grad_attr signature drift between megatron-core 0.12.x and 0.16.x
  • Default PG bound_device_id on NPU (megatron.py): Gloo subgroups inherit bound_device_id from the default process group, which breaks on NPU. The default PG binding is cleared before initialize_model_parallel(). Single-card runs also get a rank=0/world_size=1 PG so the same init path works without torchrun
  • Lazy Megatron model registry (model/register.py): Model type detection no longer imports megatron.core before MindSpeed bootstrap
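Several of these workarounds share the same temporary monkey-patch shape: swap in a wrapper, run the native call path, restore in a `finally` block. A minimal generic sketch of that pattern (all names hypothetical; the real `_allreduce_word_embedding_grads` wrapper takes Megatron's model/config arguments rather than a bare gradient):

```python
from contextlib import contextmanager


@contextmanager
def temporarily_patched(owner, attr, make_wrapper):
    """Replace owner.attr with make_wrapper(native) for the duration of the
    block, restoring the native attribute even if the body raises."""
    native = getattr(owner, attr)
    setattr(owner, attr, make_wrapper(native))
    try:
        yield
    finally:
        setattr(owner, attr, native)


def none_safe(native_fn):
    """Wrapper factory: skip the call when the gradient is None (LoRA froze
    the base weight) instead of crashing inside all_reduce(None)."""
    def wrapped(grad):
        if grad is None:
            return None
        return native_fn(grad)
    return wrapped
```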

Verification

All changes verified with real 8-card NPU smoke runs (10 steps, continuous loss output, checkpoint saved):

| Script | Topology |
| --- | --- |
| cookbook/megatron/npu/tp_full_npu.py | Dense full-parameter, dp=2 tp=2 pp=2 |
| cookbook/megatron/npu/tp_lora_npu.py | Dense LoRA, dp=2 tp=2 pp=2 |



@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for Megatron-based parallel training (TP, PP, EP) on NPU by integrating with MindSpeed. Key changes include new example scripts for LoRA and MoE LoRA training, comprehensive documentation updates for NPU setup, and several compatibility fixes such as using torch.where for masking and Gloo groups for object collectives on NPU. The review feedback identifies redundant environment variable assignments in shell scripts, incorrect script paths and PYTHONPATH inconsistencies in the documentation, and suggests clarifying the specific constraints for MoE LoRA support.

Comment on lines +8 to +9

```shell
ASCEND_RT_VISIBLE_DEVICES="${ASCEND_RT_VISIBLE_DEVICES}" \
torchrun --nproc_per_node=8 cookbook/megatron/npu/tp_lora_npu.py
```

medium

The ASCEND_RT_VISIBLE_DEVICES variable is already set on line 4. Setting it again here for the torchrun command is redundant and can be removed to improve script clarity.

Suggested change:

```diff
-ASCEND_RT_VISIBLE_DEVICES="${ASCEND_RT_VISIBLE_DEVICES}" \
-torchrun --nproc_per_node=8 cookbook/megatron/npu/tp_lora_npu.py
+torchrun --nproc_per_node=8 cookbook/megatron/npu/tp_lora_npu.py
```

Comment on lines +8 to +9

```shell
ASCEND_RT_VISIBLE_DEVICES="${ASCEND_RT_VISIBLE_DEVICES}" \
torchrun --nproc_per_node=8 cookbook/megatron/npu/tp_moe_lora_npu.py
```

medium

The ASCEND_RT_VISIBLE_DEVICES variable is already set on line 4. Setting it again here for the torchrun command is redundant and can be removed to improve script clarity.

Suggested change:

```diff
-ASCEND_RT_VISIBLE_DEVICES="${ASCEND_RT_VISIBLE_DEVICES}" \
-torchrun --nproc_per_node=8 cookbook/megatron/npu/tp_moe_lora_npu.py
+torchrun --nproc_per_node=8 cookbook/megatron/npu/tp_moe_lora_npu.py
```

Comment on lines +102 to +104

```shell
export TWINKLE_SRC_PATH=/path/to/twinkle/src
export MEGATRON_LM_PATH=/path/to/Megatron-LM
export PYTHONPATH=${TWINKLE_SRC_PATH}:${MEGATRON_LM_PATH}:${PYTHONPATH}
```

medium

The documentation regarding PYTHONPATH setup is inconsistent with the provided example scripts. Here, and in other places (lines 223-225, 332-337), TWINKLE_SRC_PATH is included; however, the example shell scripts (tp_lora_npu.sh, tp_moe_lora_npu.sh) do not use TWINKLE_SRC_PATH, assuming twinkle is installed in editable mode. To avoid confusion, the documentation should be consistent with the example scripts.

Suggested change:

```diff
-export TWINKLE_SRC_PATH=/path/to/twinkle/src
 export MEGATRON_LM_PATH=/path/to/Megatron-LM
-export PYTHONPATH=${TWINKLE_SRC_PATH}:${MEGATRON_LM_PATH}:${PYTHONPATH}
+export PYTHONPATH=${MEGATRON_LM_PATH}:${PYTHONPATH}
```

```shell
export PYTHONPATH=${TWINKLE_SRC_PATH}:${MEGATRON_LM_PATH}:${PYTHONPATH}
```

> **Tip**: `cookbook/megatron/tp.sh` and `cookbook/megatron/tp_moe.sh` already include automatic PYTHONPATH configuration. You can use these scripts directly to launch training without manual setup. Default paths can be overridden via the `TWINKLE_SRC_PATH` and `MEGATRON_LM_PATH` environment variables.
medium

The script paths mentioned here (cookbook/megatron/tp.sh and cookbook/megatron/tp_moe.sh) are incorrect. The scripts added in this pull request live in cookbook/megatron/npu/ and are named tp_lora_npu.sh and tp_moe_lora_npu.sh. Please update the paths to refer to the correct example scripts.

Suggested change:

```diff
-> **Tip**: `cookbook/megatron/tp.sh` and `cookbook/megatron/tp_moe.sh` already include automatic PYTHONPATH configuration. You can use these scripts directly to launch training without manual setup. Default paths can be overridden via the `TWINKLE_SRC_PATH` and `MEGATRON_LM_PATH` environment variables.
+> **Tip**: The example scripts in `cookbook/megatron/npu/` (e.g., `tp_lora_npu.sh`) already include automatic PYTHONPATH configuration for Megatron-LM. You can use these scripts directly to launch training without manual setup. The default path can be overridden via the `MEGATRON_LM_PATH` environment variable.
```

- ✅ Megatron backend (DP=2, TP=2, PP=2)
- ✅ 10-step continuous metric printing + checkpoint saving

**Note**: MoE models do not currently support LoRA fine-tuning (Expert LoRA is not available when ETP>1).
medium

This note is confusing, as it seems to contradict the preceding section "Megatron MoE LoRA Fine-tuning". It should be rephrased to clarify that the limitation applies specifically when ETP > 1.

Suggested change:

```diff
-**Note**: MoE models do not currently support LoRA fine-tuning (Expert LoRA is not available when ETP>1).
+**Note**: When using LoRA with MoE models, Expert LoRA is only supported when ETP=1. The provided MoE LoRA example uses a verified topology that respects this constraint.
```

Comment on lines +489 to +493

```python
# Avoid bool advanced indexing here. On NPU this lowers to
# aclnnNonzeroV2 inside AdvancedIndex and can crash during
# end-to-end training; torch.where preserves the same masking
# semantics without going through that path.
masked_labels = torch.where(loss_mask, labels, torch.zeros_like(labels))
```

medium

Using boolean indexing with ~loss_mask can lead to issues on NPU backends, as it may lower to unsupported operations. Replacing it with torch.where is a safer and more compatible way to achieve the same masking behavior, as noted in the comment.

Comment on lines +785 to +815

```python
# Some Megatron-LM versions (e.g. 0.12.1) only accept
# overlap_param_gather_with_optimizer_step here.
# overlap_param_gather still exists on ddp_config / distributed
# optimizer paths, but passing it directly into OptimizerConfig
# raises TypeError on this branch.
config_sig = inspect.signature(OptimizerConfig).parameters
config_kwargs = {
    'optimizer': 'adam',
    'lr': lr,
    'min_lr': kwargs.get('min_lr', 0.0),
    'weight_decay': kwargs.get('weight_decay', 0.01),
    'adam_beta1': kwargs.get('adam_beta1', 0.9),
    'adam_beta2': kwargs.get('adam_beta2', 0.999),
    'adam_eps': kwargs.get('adam_eps', 1e-8),
    'clip_grad': kwargs.get('clip_grad', 1.0),
    'bf16': kwargs.get('bf16', True),
    'use_distributed_optimizer': use_distributed_optimizer,
    'log_num_zeros_in_grad': kwargs.get('log_num_zeros_in_grad', False),
}
# Keep the old knob only if this Megatron version still exposes it.
# Some branches wire it through ddp_config instead of OptimizerConfig.
if 'overlap_param_gather' in config_sig:
    config_kwargs['overlap_param_gather'] = kwargs.get('overlap_param_gather', False)
if 'overlap_param_gather_with_optimizer_step' in config_sig:
    config_kwargs['overlap_param_gather_with_optimizer_step'] = kwargs.get(
        'overlap_param_gather_with_optimizer_step', kwargs.get('overlap_param_gather', False))
for key, value in kwargs.items():
    if key in config_sig and key not in config_kwargs:
        config_kwargs[key] = value

opt_config = OptimizerConfig(**config_kwargs)
```

medium

The logic for constructing OptimizerConfig has been refactored to dynamically build the keyword arguments based on the signature of OptimizerConfig. This is a great improvement for ensuring compatibility across different versions of megatron-core, which may have different constructor arguments.

@addsubmuldiv addsubmuldiv changed the title Npu adapt doc Npu megatron adapt Mar 26, 2026
@addsubmuldiv addsubmuldiv marked this pull request as ready for review March 27, 2026 02:22
Copilot AI review requested due to automatic review settings March 27, 2026 02:22
Copilot AI left a comment


Pull request overview

Adapts Twinkle’s Megatron backend to run on Ascend NPU via MindSpeed, fixes a MoE sharding correctness issue for older megatron-core, and adds NPU Megatron cookbook + documentation.

Changes:

  • Bootstrap MindSpeed early on NPU and propagate a MindSpeed-compatible runtime Namespace.
  • Fix MoE expert tensor-parallel shape mismatch for megatron-core<0.13 by tying ETP to TP.
  • Add NPU-specific distributed/ops workarounds (Gloo object gather, unfused softmax) plus verified cookbook scripts and docs.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 8 comments.

Summary per file:

| File | Description |
| --- | --- |
| src/twinkle/utils/framework.py | Switch metric/object gather to Megatron's DP Gloo group on NPU to avoid HCCL hangs. |
| src/twinkle/model/megatron/model/register.py | Lazily import/register Megatron model metadata to avoid early megatron.core imports. |
| src/twinkle/model/megatron/model/__init__.py | Stop eager registry imports; re-export lazy registry helper. |
| src/twinkle/model/megatron/mindspeed_bootstrap.py | Build/install MindSpeed runtime args and manage (re)patching on config signature changes. |
| src/twinkle/model/megatron/megatron.py | Invoke MindSpeed bootstrap before Megatron init; add NPU-safe label masking; enable Gloo groups; adjust optimizer wiring. |
| src/twinkle/model/megatron/args.py | Add megatron-core<0.13 ETP fix; NPU softmax fusion fallback; LoRA finalize monkey-patch for None grads. |
| src/twinkle/model/megatron/_mindspeed_args.py | Construct merged MindSpeed Namespace from defaults + HF config + Twinkle topology. |
| src/twinkle/model/megatron/__init__.py | Import MindSpeed adaptor early on NPU so patches apply before megatron.core binds symbols. |
| docs/source_en/Usage Guide/NPU-Support.md | Document MindSpeed/Megatron setup and add verified Megatron NPU examples. |
| cookbook/megatron/npu/tp_moe_lora_npu.sh | Add torchrun launcher for the NPU Megatron MoE LoRA smoke run. |
| cookbook/megatron/npu/tp_moe_lora_npu.py | Add verified 8-card NPU Megatron MoE LoRA smoke script. |
| cookbook/megatron/npu/tp_lora_npu.sh | Add torchrun launcher for the NPU Megatron LoRA smoke run. |
| cookbook/megatron/npu/tp_lora_npu.py | Add verified 8-card NPU Megatron LoRA smoke script. |

Comment on lines +77 to +78

```python
if _LAST_REPATCH_SIGNATURE is not None:
    adaptor.repatch(vars(runtime_args).copy())
```

Copilot AI Mar 27, 2026

The current repatch logic skips calling adaptor.repatch(...) on the first bootstrap (_LAST_REPATCH_SIGNATURE is None). This becomes a correctness issue because src/twinkle/model/megatron/__init__.py imports mindspeed.megatron_adaptor before Twinkle installs runtime args, so MindSpeed can patch using defaults and never get updated on the first real runtime config. Fix by ensuring repatch() is invoked at least once after _set_runtime_args(...) when the adaptor may already be imported (e.g., call repatch() whenever signature != _LAST_REPATCH_SIGNATURE, including the None case, or explicitly detect pre-import via sys.modules and repatch once).

Suggested change:

```diff
-if _LAST_REPATCH_SIGNATURE is not None:
-    adaptor.repatch(vars(runtime_args).copy())
+adaptor.repatch(vars(runtime_args).copy())
```
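The suggestion amounts to keying repatch on the signature change alone, treating the first bootstrap (signature still None) like any other change. A minimal sketch under that reading (function and attribute names hypothetical):

```python
_LAST_REPATCH_SIGNATURE = None


def maybe_repatch(adaptor, runtime_args: dict):
    """Repatch whenever the config signature differs from the last one
    applied, including the very first call where the stored signature
    is still None."""
    global _LAST_REPATCH_SIGNATURE
    signature = tuple(sorted(runtime_args.items()))
    if signature != _LAST_REPATCH_SIGNATURE:
        adaptor.repatch(dict(runtime_args))
        _LAST_REPATCH_SIGNATURE = signature
```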
Comment on lines 838 to 842

```python
optimizer = get_megatron_optimizer(
    config=opt_config,
    model_chunks=model_chunks,
    use_gloo_process_groups=use_gloo_process_groups,
)
```

Copilot AI Mar 27, 2026

use_gloo_process_groups is passed unconditionally into get_megatron_optimizer(...). On older/newer Megatron-Core versions where this kwarg is absent or renamed, this will raise TypeError at runtime. Consider guarding this with a signature check (similar to the OptimizerConfig handling above) and only passing the kwarg when supported, otherwise fall back to the default call.

Suggested change:

```diff
-optimizer = get_megatron_optimizer(
-    config=opt_config,
-    model_chunks=model_chunks,
-    use_gloo_process_groups=use_gloo_process_groups,
-)
+optimizer_sig = inspect.signature(get_megatron_optimizer).parameters
+if 'use_gloo_process_groups' in optimizer_sig:
+    optimizer = get_megatron_optimizer(
+        config=opt_config,
+        model_chunks=model_chunks,
+        use_gloo_process_groups=use_gloo_process_groups,
+    )
+else:
+    optimizer = get_megatron_optimizer(
+        config=opt_config,
+        model_chunks=model_chunks,
+    )
```
```python
# The registry import chain can pull in megatron.core, which must stay
# behind the MindSpeed bootstrap on NPU.
from .model.constant import MLLMModelType
is_multimodal = model_type in {value for key, value in vars(MLLMModelType).items() if not key.startswith('_')}
```

Copilot AI Mar 27, 2026

This membership test is likely incorrect if MLLMModelType is an Enum (common for *ModelType constants): vars(MLLMModelType) yields enum members as values (e.g., MLLMModelType.X), so comparing a str (model_type) to enum instances will always be false. Prefer iterating members and comparing against their .value (or, if it's a plain constants container, explicitly compare against string constants only). This impacts is_multimodal and can misconfigure multimodal vs. text-only model construction.

Suggested change:

```diff
-is_multimodal = model_type in {value for key, value in vars(MLLMModelType).items() if not key.startswith('_')}
+# Support both Enum-based and constant-based definitions of MLLMModelType.
+if hasattr(MLLMModelType, "__members__"):
+    # Enum: compare against the .value of each enum member.
+    _multimodal_model_types = {member.value for member in MLLMModelType.__members__.values()}
+else:
+    # Fallback: compare against string constants defined on MLLMModelType.
+    _multimodal_model_types = {
+        value
+        for key, value in vars(MLLMModelType).items()
+        if not key.startswith('_') and isinstance(value, str)
+    }
+is_multimodal = model_type in _multimodal_model_types
```
Comment on lines +585 to +588

```python
from megatron.core import parallel_state
from megatron.core.distributed import DistributedDataParallel as MegatronDDP
from megatron.core.distributed.finalize_model_grads import (_get_main_grad_attr, _reshard_if_dtensor,
                                                            _unshard_if_dtensor, get_attr_wrapped_model)
```

Copilot AI Mar 27, 2026

parallel_state and the imports from megatron.core.distributed.finalize_model_grads are not used in finalize_model_grads_for_lora (the function later imports the module via importlib and only swaps _allreduce_word_embedding_grads). Removing unused imports reduces import-time side effects and makes the intent of the monkey-patch clearer.

Suggested change:

```diff
-from megatron.core import parallel_state
 from megatron.core.distributed import DistributedDataParallel as MegatronDDP
-from megatron.core.distributed.finalize_model_grads import (_get_main_grad_attr, _reshard_if_dtensor,
-                                                            _unshard_if_dtensor, get_attr_wrapped_model)
```
Comment on lines +56 to +63

```python
if Platform.device_prefix() == 'npu':
    try:
        from megatron.core import parallel_state as mpu

        process_group = mpu.get_data_parallel_group_gloo(
            with_context_parallel=getattr(device_mesh, 'cp_world_size', 1) > 1)
    except Exception:
        pass
```

Copilot AI Mar 27, 2026

The broad except Exception: pass will silently fall back to the default process group on NPU, which (per the comment) can hang in dist.all_gather_object(...). To avoid hard-to-debug metric hangs, narrow the exception types (e.g., ImportError, AttributeError) and emit a warning when the fallback happens so operators can see why the safer Gloo path was not used.
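One way to act on this review point, with the import injected as a callable so the fallback path is explicit and testable (function name and structure are hypothetical, not the PR's actual code):

```python
import logging

logger = logging.getLogger(__name__)


def resolve_gather_group(import_mpu):
    """Return Megatron's DP Gloo group, or None with a logged warning when
    megatron is unavailable. Only ImportError/AttributeError trigger the
    fallback, so real HCCL/Gloo errors still surface instead of silently
    degrading into a hang-prone default-group gather."""
    try:
        mpu = import_mpu()
        return mpu.get_data_parallel_group_gloo(with_context_parallel=False)
    except (ImportError, AttributeError) as exc:
        logger.warning("Gloo DP group unavailable; falling back to the "
                       "default process group for object gather: %s", exc)
        return None
```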
Comment on lines +5 to +6

```shell
export PYTHONPATH="${MEGATRON_LM_PATH}:${PYTHONPATH:-}"
```

Copilot AI Mar 27, 2026

The launcher only prepends MEGATRON_LM_PATH to PYTHONPATH, but the docs also instruct adding Twinkle's src path when running from source. Either update the script to optionally include TWINKLE_SRC_PATH (when provided) or adjust the docs around these scripts to clarify that Twinkle must already be installed in the environment.

Suggested change:

```diff
-export PYTHONPATH="${MEGATRON_LM_PATH}:${PYTHONPATH:-}"
+TWINKLE_SRC_PATH=${TWINKLE_SRC_PATH:-}
+if [ -n "${TWINKLE_SRC_PATH}" ]; then
+    export PYTHONPATH="${TWINKLE_SRC_PATH}:${MEGATRON_LM_PATH}:${PYTHONPATH:-}"
+else
+    export PYTHONPATH="${MEGATRON_LM_PATH}:${PYTHONPATH:-}"
+fi
```
```python
model = MegatronModel(model_id=MODEL_ID)
lora_config = LoraConfig(r=8, lora_alpha=32, target_modules='all-linear')
model.add_adapter_to_model('default', lora_config)
model.set_optimizer(optimizer_cls='default', lr=1e-4)
```

Copilot AI Mar 27, 2026

This hardcodes lr=1e-4 while the MoE smoke script makes LR configurable via env var. For consistency across cookbook examples (and easier reproduction/tuning), consider reading LR from an environment variable (similar to the other script) or defining a module-level LR constant used here.
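A minimal way to apply the suggestion, with the env var name assumed (the MoE script's actual variable name is not shown here):

```python
import os


def resolve_lr(default: str = "1e-4") -> float:
    # Hypothetical env knob mirroring the MoE smoke script's pattern;
    # falls back to the currently hardcoded value when unset.
    return float(os.environ.get("TWINKLE_LR", default))
```

The script would then call `model.set_optimizer(optimizer_cls='default', lr=resolve_lr())`.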