Hi Moonshot AI team,
First, thank you for your excellent paper and for open-sourcing your work on the Muon optimizer. It's a fascinating contribution to the field.
I've been studying the paper and the Megatron-LM implementation in detail, and I had a small suggestion to improve the clarity of Algorithm 1 ("Distributed Muon") for future readers.
I was initially very confused by the use of the variable G and the term "gradient matrix" in the "DP Gather" step (lines 5-6). The algorithm begins by requiring "Full Gradients G," but the object gathered in line 6 is actually the gradient-updated momentum buffer (g'), not the raw gradient.
This was confusing for two reasons:
- It seems to reuse the variable G for two different things (initial raw gradient vs. final momentum buffer).
- In a ZeRO-1 context, the raw gradients are replicated, so the idea of "gathering" them seemed paradoxical.
My confusion was resolved when I realized the gathered object is an optimizer state (the momentum buffer), which is sharded under ZeRO-1 and therefore does need to be gathered.
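To make that distinction concrete, here is a minimal single-process sketch (my own toy code, not the paper's algorithm or the Megatron-LM implementation; names like dp_world_size and momentum_shards are purely illustrative) simulating why the momentum, and not the gradient, is what needs gathering under ZeRO-1:

```python
import torch

dp_world_size = 4
rows, cols = 8, 8          # toy parameter matrix; rows divisible by dp_world_size
mu = 0.9                   # momentum coefficient

# After backward + gradient all-reduce, every DP rank holds the SAME full
# gradient, so there is nothing to "gather" here.
full_grad = torch.randn(rows, cols)

# ZeRO-1 shards the optimizer state: each rank owns rows // dp_world_size
# rows of the momentum buffer.
shard_rows = rows // dp_world_size
momentum_shards = [torch.zeros(shard_rows, cols) for _ in range(dp_world_size)]

# Each rank updates only its own momentum shard, reading its slice of the
# replicated gradient (ranks simulated here with a loop).
for rank in range(dp_world_size):
    lo = rank * shard_rows
    momentum_shards[rank] = mu * momentum_shards[rank] + full_grad[lo:lo + shard_rows]

# The "DP Gather" step: reconstruct the FULL momentum matrix from the shards.
# This gathered matrix, not the raw gradient, is what the orthogonalization
# step then operates on.
gathered_momentum = torch.cat(momentum_shards, dim=0)
```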
Suggestion:
To improve clarity, perhaps the pseudocode could use a different variable (e.g., M_full or G_momentum) in line 6 to distinguish the gathered momentum buffer from the initial raw gradient G.
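For concreteness, the change could be as small as the line-6 assignment. A rough PyTorch-flavored rendering (the helper dp_all_gather and the shard shapes below are stand-ins of mine, not the paper's notation):

```python
import torch

def dp_all_gather(shards):
    # Stand-in for the real collective (e.g. an all-gather over the DP
    # group), shown here as a simple concatenation of the rank shards.
    return torch.cat(shards, dim=0)

# Each DP rank's gradient-updated momentum shard (toy shapes).
g_prime_shards = [torch.randn(2, 8) for _ in range(4)]

# Before: G = dp_all_gather(g_prime_shards)   # "G" reads like the raw gradient
M_full = dp_all_gather(g_prime_shards)        # renamed: clearly the gathered momentum
```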
This is a minor terminological point, but I believe it would make the excellent algorithm even easier to understand for people trying to learn from your work.
Thanks again for the great research!