Hyperparameter Transfer beyond MuP #4088

@plugyawn

Description

Is your feature request related to a problem? Please describe.
#3058 and #3715 introduce MuP into Megatron-LM with support for Muon.
MuP allows efficient and reliable hyperparameter (especially LR) transfer from narrow to wide networks at a fixed depth.
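For context, the width-transfer rule above can be sketched in a few lines. This is an illustrative, standalone sketch of MuP-style LR scaling for Adam-like optimizers (not Megatron-LM's actual API; `mup_lr` and `param_kind` are hypothetical names): hidden weight matrices get their learning rate scaled by `base_width / width`, while vector-like parameters (embeddings, biases, norm gains) keep the base LR, so an LR tuned on a narrow proxy transfers to the wide model.

```python
def mup_lr(base_lr: float, base_width: int, width: int, param_kind: str) -> float:
    """Illustrative MuP learning-rate scaling for Adam-style optimizers.

    Hidden (matrix-like) parameters have their LR scaled by base_width/width;
    vector-like parameters (embeddings, biases, norm gains) keep the base LR.
    """
    if param_kind == "hidden":  # weight matrices with fan-in and fan-out
        return base_lr * base_width / width
    return base_lr  # vector-like parameters are left unscaled

# An LR tuned on a width-256 proxy transfers to a width-4096 model:
lr_wide = mup_lr(3e-3, base_width=256, width=4096, param_kind="hidden")
```

The point of the sketch is that only a width ratio is needed at scale-up time; the expensive LR sweep happens once, on the narrow proxy.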

There is a series of follow-up papers in this area that extend transfer from shallow to deeper networks, among other axes, which is essential for a good pretraining scaling recipe.

A new paper from Microsoft, HyperP, claims to be "the first framework for transferring optimal learning rates across model width, depth, training tokens, and Mixture-of-Experts (MoE) granularity under the Frobenius-sphere constraint with the Muon optimizer."

I suggest we integrate HyperP into Megatron-LM.
Link: Rethinking Language Model Scaling under Transferable Hypersphere Optimization, Ren et al., 2026.

Tagging @mcore-oncall to get the oncall's attention to this issue.

