Closed
Conversation
Add test_skip_step_muon_test.py with tests for config building, basic stepping, skip-on-outlier behavior, and state preservation during skipped steps. Fix SkipStepMuonConfig.create_optimizer() passing rolling_interval_length and sigma_factor twice (once from as_dict kwargs, once explicitly). Add comment explaining the host-device sync in SkipStepMuon.step(). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
SkipStepMuon inherits from dion.Muon at class definition time, so importing the module fails when dion is not installed. Remove the top-level re-export so that importing olmo_core.optim does not require dion. The class is still accessible via the lazy import in muon.py at optimizer creation time. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
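The lazy-import pattern this commit relies on can be sketched as a generic helper (the real code uses an `_import_muon()` function; the helper name and error message below are illustrative assumptions):

```python
import importlib


def lazy_class(module_name: str, class_name: str):
    """Return a zero-argument resolver for ``module_name.class_name``.

    Importing the module that *defines* the resolver never touches the
    optional dependency; the actual import only happens when the resolver
    is called (e.g. at optimizer creation time).
    """
    def resolve():
        try:
            module = importlib.import_module(module_name)
        except ImportError as e:
            raise ImportError(
                f"{class_name} requires the optional '{module_name}' package"
            ) from e
        return getattr(module, class_name)
    return resolve
```

With this shape, `class SkipStepMuon(_import_muon()):` at module top level would still trigger the import, which is why the class itself must also be constructed lazily rather than re-exported at import time.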
dirkgr
suggested changes
Mar 11, 2026
Contributor
dirkgr
left a comment
Did you ever look at spikiness with Muon?
Changes are just about docs.
Comment on lines +259 to +260
    sigma_factor: int = 6
    """Number of standard deviations above the mean to trigger a skip."""
Contributor
I'm beginning to think this number should be 7, but this is not the time or the place to do that experiment.
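For context, the kind of rolling sigma check that `sigma_factor` controls can be sketched as follows. The parameter names `rolling_interval_length` and `sigma_factor` follow the config fields in the diff; the detector class itself is hypothetical, and whether spike values enter the rolling window is an assumption:

```python
from collections import deque
import math


class SpikeDetector:
    """Flag a value as an outlier when it exceeds the rolling mean
    by more than ``sigma_factor`` standard deviations."""

    def __init__(self, rolling_interval_length: int = 128, sigma_factor: int = 6):
        self.history = deque(maxlen=rolling_interval_length)
        self.sigma_factor = sigma_factor

    def should_skip(self, value: float) -> bool:
        skip = False
        if len(self.history) >= 2:
            mean = sum(self.history) / len(self.history)
            var = sum((x - mean) ** 2 for x in self.history) / len(self.history)
            skip = value > mean + self.sigma_factor * math.sqrt(var)
        # Assumption: spike values still enter the window, so a sustained
        # shift in the loss eventually stops being treated as an outlier.
        self.history.append(value)
        return skip
```

Raising `sigma_factor` from 6 to 7 would simply widen the tolerance band before a step is skipped.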
    class SkipStepMuon(_import_muon()):
        """
        A "skip step" version of :class:`Muon` that skips the entire optimizer step
        when a loss spike is detected.
Contributor
Suggested change
-        when a loss spike is detected.
+        when a loss or grad norm spike is detected.
    - All ranks compute the same ``step_factor`` (loss is pre-synchronized).
    - Muon's ``step()`` is not ``torch.compile``'d, so branching is safe.
    - On skip: momentum, weights, and step counters are all untouched.
Contributor
I confirmed that this matches SkipStepAdamW.
    Unlike :class:`SkipStepAdamW` and :class:`SkipStepLion`, which thread a
    ``step_factor`` through the update computation, this class skips the entire
    ``step()`` call. This avoids all distributed communication and Newton-Schulz
Contributor
Just a thought: since Newton-Schulz takes a little while, is it possible to overlap the host-device sync with it, and bail out early if it turns out we want to skip?
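The skip-the-whole-step design quoted above can be sketched in pure Python. The base class here is a stand-in for ``dion.Muon`` (which the real code resolves lazily), and the ``should_skip`` callable stands in for the rolling outlier check; both are illustrative assumptions:

```python
class _MuonBase:
    """Stand-in for dion.Muon (hypothetical; the real base is imported lazily)."""

    def __init__(self):
        self.step_count = 0

    def step(self):
        # The real Muon would run Newton-Schulz iterations and distributed
        # communication here; we only record that a step happened.
        self.step_count += 1


class SkipStepMuon(_MuonBase):
    def __init__(self, should_skip):
        super().__init__()
        self.should_skip = should_skip  # e.g. a rolling sigma check on the loss
        self.latest_loss = None  # must be the all-reduced loss before step()

    def step(self):
        # NOTE: in the real implementation, reading the skip decision off a
        # GPU tensor as a Python bool is what incurs the host-device sync.
        if self.should_skip(self.latest_loss):
            # Entire step skipped: no Newton-Schulz, no communication, and
            # momentum, weights, and step counters are all untouched.
            return
        super().step()
```

Because every rank sees the same all-reduced loss, every rank takes the same branch, so skipping the collective communication inside ``step()`` cannot deadlock.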
    - On skip: momentum, weights, and step counters are all untouched.

    .. important::
        ``latest_loss`` must be set to the **all-reduced** loss before calling
Quick-and-dirty SkipStepMuon implementation. This simple implementation incurs a host-device sync. If that has a noticeable effect on throughput, I will revisit and try to implement a version similar to SkipStepAdamW that pipes the skip-step value through the update computation in order to avoid this control flow.
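The SkipStepAdamW-style alternative mentioned here, threading a 0/1 ``step_factor`` through the update instead of branching, can be sketched with scalar arithmetic (the function and variable names below are illustrative, not the library's API):

```python
def skip_step_update(param, grad, exp_avg, lr, beta, step_factor):
    """One momentum-SGD-style update gated by step_factor.

    step_factor is 1.0 on a normal step and 0.0 on a detected spike.
    Multiplying it through the update, rather than branching on it,
    lets the value stay on-device as a tensor, avoiding the
    host-device sync that a Python-level `if` forces.
    """
    # With step_factor == 0, both lines are no-ops: the momentum buffer
    # and the weights are left exactly as they were.
    new_exp_avg = exp_avg + step_factor * ((1.0 - beta) * (grad - exp_avg))
    new_param = param - step_factor * lr * new_exp_avg
    return new_param, new_exp_avg
```

With ``step_factor == 1`` the momentum line reduces to the usual EMA, ``beta * exp_avg + (1 - beta) * grad``; with ``step_factor == 0`` all state passes through untouched, matching the "everything untouched on skip" guarantee without any control flow.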