Closed
Conversation
Add test_skip_step_muon_test.py with tests for config building, basic stepping, skip-on-outlier behavior, and state preservation during skipped steps. Fix SkipStepMuonConfig.create_optimizer() passing rolling_interval_length and sigma_factor twice (once from as_dict kwargs, once explicitly). Add comment explaining the host-device sync in SkipStepMuon.step(). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
SkipStepMuon inherits from dion.Muon at class definition time, so importing the module fails when dion is not installed. Remove the top-level re-export so that importing olmo_core.optim does not require dion. The class is still accessible via the lazy import in muon.py at optimizer creation time. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
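The lazy-import pattern this commit relies on can be sketched as a generic helper (the real code uses an `_import_muon()` function; the helper name and error message below are illustrative assumptions):

```python
import importlib


def lazy_class(module_name: str, class_name: str):
    """Return a zero-argument resolver for ``module_name.class_name``.

    Importing the module that *defines* the resolver never touches the
    optional dependency; the actual import only happens when the resolver
    is called (e.g. at optimizer creation time).
    """
    def resolve():
        try:
            module = importlib.import_module(module_name)
        except ImportError as e:
            raise ImportError(
                f"{class_name} requires the optional '{module_name}' package"
            ) from e
        return getattr(module, class_name)
    return resolve
```

With this shape, `class SkipStepMuon(_import_muon()):` at module top level would still trigger the import, which is why the class itself must also be constructed lazily rather than re-exported at import time.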
dirkgr
suggested changes
Mar 11, 2026
Contributor
dirkgr
left a comment
Did you ever look at spikiness with Muon?
Changes are just about docs.
Comment on lines +259 to +260
    sigma_factor: int = 6
    """Number of standard deviations above the mean to trigger a skip."""
Contributor
I'm beginning to think this number should be 7, but this is not the time or the place to do that experiment.
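For context, the kind of rolling sigma check that `sigma_factor` controls can be sketched as follows. The parameter names `rolling_interval_length` and `sigma_factor` follow the config fields in the diff; the detector class itself is hypothetical, and whether spike values enter the rolling window is an assumption:

```python
from collections import deque
import math


class SpikeDetector:
    """Flag a value as an outlier when it exceeds the rolling mean
    by more than ``sigma_factor`` standard deviations."""

    def __init__(self, rolling_interval_length: int = 128, sigma_factor: int = 6):
        self.history = deque(maxlen=rolling_interval_length)
        self.sigma_factor = sigma_factor

    def should_skip(self, value: float) -> bool:
        skip = False
        if len(self.history) >= 2:
            mean = sum(self.history) / len(self.history)
            var = sum((x - mean) ** 2 for x in self.history) / len(self.history)
            skip = value > mean + self.sigma_factor * math.sqrt(var)
        # Assumption: spike values still enter the window, so a sustained
        # shift in the loss eventually stops being treated as an outlier.
        self.history.append(value)
        return skip
```

Raising `sigma_factor` from 6 to 7 would simply widen the tolerance band before a step is skipped.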
    class SkipStepMuon(_import_muon()):
        """
        A "skip step" version of :class:`Muon` that skips the entire optimizer step
        when a loss spike is detected.
Contributor
Suggested change
-        when a loss spike is detected.
+        when a loss or grad norm spike is detected.
    - All ranks compute the same ``step_factor`` (loss is pre-synchronized).
    - Muon's ``step()`` is not ``torch.compile``'d, so branching is safe.
    - On skip: momentum, weights, and step counters are all untouched.
Contributor
I confirmed that this matches SkipStepAdamW.
    Unlike :class:`SkipStepAdamW` and :class:`SkipStepLion`, which thread a
    ``step_factor`` through the update computation, this class skips the entire
    ``step()`` call. This avoids all distributed communication and Newton-Schulz
Contributor
Just a thought: since Newton-Schulz takes a little while, is it possible to overlap the host-device sync with it, and bail out early if it turns out we want to skip?
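The skip-the-whole-step design quoted above can be sketched in pure Python. The base class here is a stand-in for ``dion.Muon`` (which the real code resolves lazily), and the ``should_skip`` callable stands in for the rolling outlier check; both are illustrative assumptions:

```python
class _MuonBase:
    """Stand-in for dion.Muon (hypothetical; the real base is imported lazily)."""

    def __init__(self):
        self.step_count = 0

    def step(self):
        # The real Muon would run Newton-Schulz iterations and distributed
        # communication here; we only record that a step happened.
        self.step_count += 1


class SkipStepMuon(_MuonBase):
    def __init__(self, should_skip):
        super().__init__()
        self.should_skip = should_skip  # e.g. a rolling sigma check on the loss
        self.latest_loss = None  # must be the all-reduced loss before step()

    def step(self):
        # NOTE: in the real implementation, reading the skip decision off a
        # GPU tensor as a Python bool is what incurs the host-device sync.
        if self.should_skip(self.latest_loss):
            # Entire step skipped: no Newton-Schulz, no communication, and
            # momentum, weights, and step counters are all untouched.
            return
        super().step()
```

Because every rank sees the same all-reduced loss, every rank takes the same branch, so skipping the collective communication inside ``step()`` cannot deadlock.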
    - On skip: momentum, weights, and step counters are all untouched.

    .. important::
        ``latest_loss`` must be set to the **all-reduced** loss before calling
Quick-and-dirty SkipStepMuon implementation. This simple implementation incurs a host-device sync. If that has a noticeable effect on throughput, I will revisit and try to implement a version similar to SkipStepAdamW that pipes the skip-step value through the update computation in order to avoid this control flow.
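The SkipStepAdamW-style alternative mentioned here, threading a 0/1 ``step_factor`` through the update instead of branching, can be sketched with scalar arithmetic (the function and variable names below are illustrative, not the library's API):

```python
def skip_step_update(param, grad, exp_avg, lr, beta, step_factor):
    """One momentum-SGD-style update gated by step_factor.

    step_factor is 1.0 on a normal step and 0.0 on a detected spike.
    Multiplying it through the update, rather than branching on it,
    lets the value stay on-device as a tensor, avoiding the
    host-device sync that a Python-level `if` forces.
    """
    # With step_factor == 0, both lines are no-ops: the momentum buffer
    # and the weights are left exactly as they were.
    new_exp_avg = exp_avg + step_factor * ((1.0 - beta) * (grad - exp_avg))
    new_param = param - step_factor * lr * new_exp_avg
    return new_param, new_exp_avg
```

With ``step_factor == 1`` the momentum line reduces to the usual EMA, ``beta * exp_avg + (1 - beta) * grad``; with ``step_factor == 0`` all state passes through untouched, matching the "everything untouched on skip" guarantee without any control flow.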