Authors: Da Chang, Ganzhao Yuan
Our paper has been accepted as a Spotlight at NeurIPS 2025; it can be found in the NeurIPS 2025 proceedings.
Our central contribution is the MGUP strategy: a safeguard mechanism that controls an alignment threshold. By ranking the element-wise products of the proposed update and the gradient, the top-K best-aligned coordinates take an enlarged step, while the steps of the remaining coordinates are scaled down.
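A minimal sketch of this safeguard for a single parameter tensor is shown below. It assumes the alignment score is the element-wise product of the proposed update and the gradient; the function name `mgup_scale` and its exact interface are illustrative, not the repository's API.

```python
import torch

def mgup_scale(update: torch.Tensor, grad: torch.Tensor,
               mask_ratio: float = 0.5, alpha: float = 2.0,
               gamma: float = 0.1) -> torch.Tensor:
    """Hypothetical sketch of the MGUP safeguard (not the repository's code)."""
    score = update * grad                                # assumed alignment score per coordinate
    k = max(1, int(mask_ratio * score.numel()))          # size of the top-K set
    kth_largest = score.flatten().kthvalue(score.numel() - k + 1).values
    mask = score >= kth_largest                          # top-K best-aligned coordinates
    scale = torch.where(mask,
                        torch.full_like(update, alpha),  # enlarge the aligned steps
                        torch.full_like(update, gamma))  # shrink the remaining steps
    return update * scale                                # the optimizer then applies param -= lr * result
```

The `kthvalue` selection above is the Top-K ranking step that the Cautious variant described next avoids.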
In practice you can use Cautious-MGUP to avoid the expensive Top-K sort on very large models:
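A minimal usage sketch, assuming `CMGUP_AdamW` accepts the usual AdamW arguments and acts as a drop-in replacement (the model and loss below are placeholders):

```python
import torch
import torch.nn as nn

from MGUP.MGUP_AdamW import CMGUP_AdamW as cmg_adamw

model = nn.Linear(128, 128)                          # placeholder for a real model
optimizer = cmg_adamw(model.parameters(), lr=1e-3)   # interface assumed to mirror mg_adamw

x = torch.randn(32, 128)
loss = model(x).pow(2).mean()                        # dummy loss
loss.backward()
optimizer.step()                                     # Cautious-MGUP update, no Top-K sort
optimizer.zero_grad()
```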
Note 1: Our theory is confined to Adam; whether Lion, Muon, etc. can safely adopt the Cautious trick without losing convergence remains open.
Note 2: The stepsize increase factor corresponds to the `alpha` parameter (default 2.0), which is applied to the top-K coordinates.
Set the alignment threshold via `mask_ratio`, and scale the steps of coordinates not in the top-K via `gamma`.
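For instance, assuming these keyword arguments are passed directly to the constructor shown below (`model` is defined elsewhere):

```python
from MGUP.MGUP_AdamW import AdamW as mg_adamw

optimizer = mg_adamw(
    model.parameters(),
    lr=1e-3,
    mask_ratio=0.5,  # alignment threshold: fraction of coordinates kept in the top-K
    alpha=2.0,       # stepsize increase factor for the top-K coordinates
    gamma=0.1,       # stepsize scale for coordinates outside the top-K
)
```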
```python
from typing import Iterable, Tuple

import torch.nn as nn
from torch.optim import Optimizer


class AdamW(Optimizer):
    def __init__(
        self,
        params: Iterable[nn.parameter.Parameter],
        lr: float = 1e-3,
        betas: Tuple[float, float] = (0.9, 0.999),
        eps: float = 1e-6,
        weight_decay: float = 0.0,
        correct_bias: bool = True,
        ### MGUP parameters (mask_ratio: top-K fraction, alpha: top-K step factor, gamma: scale for the rest)
        mask_ratio=0.5,
        alpha=2.0,
        gamma=0.1,
        ###############
        no_deprecation_warning: bool = False,
    ):
        ...
```
Both optimizer variants are imported from `MGUP.MGUP_AdamW`:

```python
from MGUP.MGUP_AdamW import AdamW as mg_adamw
from MGUP.MGUP_AdamW import CMGUP_AdamW as cmg_adamw
```

The following experiments detail the configurations and results of the training runs:
Experiment 1: Single RTX-4090 GPU
- Model Architecture: Qwen2.5-150M
- Training Dataset: Wikitext-103
- Number of Training Epochs: 5
- Batch Size: 160
The learning rate schedule and the training and validation loss curves are presented below: Figure 1 shows the learning rate schedule, Figure 2 the training loss curve, and Figure 3 the validation loss curve.
| Learning rate schedule | Training loss | Validation loss |
| --- | --- | --- |
| ![]() | ![]() | ![]() |
Experiment 2: Single ASCEND-910C NPU
- Model Architecture: LLaMA2-130M
- Training Dataset: C4
- Number of Training Steps: 10,000
- Batch Size: 512
The corresponding learning rate schedule, training loss, and validation loss curves for this run are shown below.
| Learning rate schedule | Training loss | Validation loss |
| --- | --- | --- |
| ![]() | ![]() | ![]() |





