Hyperparameter Transfer beyond MuP #4088

@plugyawn

Description

Is your feature request related to a problem? Please describe.
#3058 and #3715 introduce MuP into Megatron-LM with support for Muon.
MuP allows efficient and reliable hyperparameter (especially LR) transfer from narrow to wide networks at a fixed depth.
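For context, the width-transfer rule above can be sketched in a few lines. This is an illustrative, standalone sketch of MuP-style LR scaling for Adam-like optimizers (not Megatron-LM's actual API; `mup_lr` and `param_kind` are hypothetical names): hidden weight matrices get their learning rate scaled by `base_width / width`, while vector-like parameters (embeddings, biases, norm gains) keep the base LR, so an LR tuned on a narrow proxy transfers to the wide model.

```python
def mup_lr(base_lr: float, base_width: int, width: int, param_kind: str) -> float:
    """Illustrative MuP learning-rate scaling for Adam-style optimizers.

    Hidden (matrix-like) parameters have their LR scaled by base_width/width;
    vector-like parameters (embeddings, biases, norm gains) keep the base LR.
    """
    if param_kind == "hidden":  # weight matrices with fan-in and fan-out
        return base_lr * base_width / width
    return base_lr  # vector-like parameters are left unscaled

# An LR tuned on a width-256 proxy transfers to a width-4096 model:
lr_wide = mup_lr(3e-3, base_width=256, width=4096, param_kind="hidden")
```

The point of the sketch is that only a width ratio is needed at scale-up time; the expensive LR sweep happens once, on the narrow proxy.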

There is a series of follow-up papers in this area that extend transfer from shallow to deeper networks, among other axes, which is essential for a good pretraining scaling recipe.

A new paper from Microsoft, HyperP, claims to be "the first framework for transferring optimal learning rates across model width, depth, training tokens, and Mixture-of-Experts (MoE) granularity under the Frobenius-sphere constraint with the Muon optimizer."

I suggest we integrate HyperP into Megatron-LM.
Link: Rethinking Language Model Scaling under Transferable Hypersphere Optimization, Ren et al., 2026.

Tagging @mcore-oncall to get the oncall's attention to this issue.

