
SkipStepMuon implementation #637

Closed
tyler-romero wants to merge 4 commits into main from tyler/skipstepmuon

Conversation


@tyler-romero tyler-romero commented Mar 11, 2026

Quick-and-dirty SkipStepMuon implementation. This simple implementation incurs a host-device sync. If that has a noticeable effect on throughput, I will revisit and try to implement a version similar to SkipStepAdamW that pipes the skip-step value through the update computation in order to avoid this control flow.
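For context, the rolling-statistics skip test that the SkipStep optimizers rely on can be sketched as follows. This is an illustrative stand-in, not the actual olmo_core code; only the field names `rolling_interval_length` and `sigma_factor` are taken from the config discussed in this PR.

```python
import math
from collections import deque


class SkipStepDetector:
    """Illustrative rolling-window outlier detector. A step is skipped when
    the incoming loss exceeds the rolling mean by more than
    ``sigma_factor`` standard deviations."""

    def __init__(self, rolling_interval_length: int = 128, sigma_factor: float = 6.0):
        self.window = deque(maxlen=rolling_interval_length)
        self.sigma_factor = sigma_factor

    def should_skip(self, loss: float) -> bool:
        if len(self.window) >= 2:
            mean = sum(self.window) / len(self.window)
            var = sum((x - mean) ** 2 for x in self.window) / len(self.window)
            if loss > mean + self.sigma_factor * math.sqrt(var):
                # Outlier: skip the step and keep it out of the window so a
                # single spike cannot inflate the rolling statistics.
                return True
        self.window.append(loss)
        return False
```

Whether a skipped loss should still enter the rolling window is a design choice; excluding it (as above) keeps one spike from widening the threshold for subsequent steps.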

tyler-romero and others added 4 commits March 11, 2026 14:12
Add test_skip_step_muon_test.py with tests for config building,
basic stepping, skip-on-outlier behavior, and state preservation
during skipped steps.

Fix SkipStepMuonConfig.create_optimizer() passing rolling_interval_length
and sigma_factor twice (once from as_dict kwargs, once explicitly).

Add comment explaining the host-device sync in SkipStepMuon.step().

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
SkipStepMuon inherits from dion.Muon at class definition time, so
importing the module fails when dion is not installed. Remove the
top-level re-export so that importing olmo_core.optim does not
require dion. The class is still accessible via the lazy import in
muon.py at optimizer creation time.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
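The deferred-import pattern described in that commit can be sketched generically. This is a hypothetical helper, not the actual code in muon.py; the idea is simply that the optional dependency is imported only when the class is actually requested, so importing the parent package never requires it.

```python
import importlib


def lazy_class(module_name: str, class_name: str):
    """Return a zero-argument loader for ``module_name.class_name`` that
    imports the module only when the loader is called."""

    def _load():
        try:
            mod = importlib.import_module(module_name)
        except ImportError as e:
            # Surface a clear message instead of failing at package import time.
            raise ImportError(
                f"{class_name} requires the optional `{module_name}` package"
            ) from e
        return getattr(mod, class_name)

    return _load
```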
@tyler-romero tyler-romero marked this pull request as ready for review March 11, 2026 21:31
@tyler-romero tyler-romero requested review from dirkgr and epwalsh March 11, 2026 21:32
@tyler-romero tyler-romero marked this pull request as draft March 11, 2026 21:34

@dirkgr dirkgr left a comment


Did you ever look at spikiness with Muon?

Changes are just about docs.

Comment on lines +259 to +260
sigma_factor: int = 6
"""Number of standard deviations above the mean to trigger a skip."""

I'm beginning to think this number should be 7, but this is not the time or the place to do that experiment.

class SkipStepMuon(_import_muon()):
"""
A "skip step" version of :class:`Muon` that skips the entire optimizer step
when a loss spike is detected.

Suggested change
when a loss spike is detected.
when a loss or grad norm spike is detected.


- All ranks compute the same ``step_factor`` (loss is pre-synchronized).
- Muon's ``step()`` is not ``torch.compile``'d, so branching is safe.
- On skip: momentum, weights, and step counters are all untouched.

I confirmed that this matches SkipStepAdamW.


Unlike :class:`SkipStepAdamW` and :class:`SkipStepLion` which thread a
``step_factor`` through the update computation, this class skips the entire
``step()`` call. This avoids all distributed communication and Newton-Schulz

Just a thought: since Newton-Schulz takes a little while, is it possible to overlap the host-device sync with it, and bail early if it turns out we want to skip?
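The overlap suggested here could look roughly like the following. This is a hypothetical sketch, not the PR's code: `newton_schulz` is a toy cubic iteration standing in for Muon's orthogonalization, and a real implementation would manage CUDA streams and events more carefully (and would need per-step polling to truly "bail early" mid-iteration).

```python
import torch


def newton_schulz(g: torch.Tensor, steps: int = 5) -> torch.Tensor:
    # Toy Newton-Schulz orthogonalization; only here to give the
    # in-flight copy something to overlap with.
    x = g / (g.norm() + 1e-7)
    for _ in range(steps):
        x = 1.5 * x - 0.5 * (x @ x.T) @ x
    return x


def step_with_overlapped_sync(grad: torch.Tensor, skip_flag: torch.Tensor):
    # Start an async device-to-host copy of the skip flag into pinned
    # memory; on CUDA this can proceed concurrently with the kernels below.
    host_flag = torch.empty(
        skip_flag.shape,
        dtype=skip_flag.dtype,
        pin_memory=torch.cuda.is_available(),
    )
    host_flag.copy_(skip_flag, non_blocking=True)

    # Do the expensive orthogonalization while the copy is in flight.
    update = newton_schulz(grad)

    # Only now pay for the sync and branch on the host-side value.
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    if bool(host_flag):
        return None  # skip: leave weights and momentum untouched
    return update
```

On a CPU-only build the copy is synchronous and nothing overlaps, so this only helps when the flag lives on an accelerator.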

- On skip: momentum, weights, and step counters are all untouched.

.. important::
``latest_loss`` must be set to the **all-reduced** loss before calling

And gnorm?


2 participants