Preconditioned optimizers for MoE training at scale, with out-of-the-box MuP support and FSDP support for Muon, built on top of Megatron-LM and TransformerEngine.
amd optimization cuda transformers pytorch parallelism transformer muon gpt adam preconditioning kfac megatron adamw optimizers foof large-language-models psgd newton-muon locoprop
Updated Jun 17, 2026