Sliding Window Eval + Muon6 (val_bpb 1.1973) #169
beee003 wants to merge 2 commits into openai:main
Conversation
Mean val_bpb: 1.1973 across 3 seeds (1337, 42, 7). Key techniques: sliding window eval (stride=256), Muon 6-step Newton-Schulz, extended momentum warmup (1000 steps), longer warmdown (1500 iters).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
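The sliding-window eval idea can be sketched as follows. This is a minimal illustration, not the PR's actual `eval_val_sliding` code; the helper name `score_windows` and its signature are hypothetical. Each window slides by `stride` tokens, and only the last `stride` tokens of each window are scored, so every token is evaluated exactly once with as much left context as the window allows.

```python
# Hedged sketch of sliding-window evaluation. The function name and
# signature are illustrative, not the actual eval_val_sliding code.
def score_windows(n_tokens, win_len=1024, stride=256):
    """Yield (win_start, score_start, score_end) triples such that every
    token index in [0, n_tokens) is scored exactly once. score_start and
    score_end are offsets inside the window starting at win_start."""
    windows = []
    pos = 0  # index of the first token not yet scored
    while pos < n_tokens:
        # Place the window so that scored tokens sit at its tail,
        # giving them the maximum left context the window permits.
        win_start = max(0, pos + stride - win_len)
        actual_len = min(win_len, n_tokens - win_start)
        # Only score tokens in the scoring window [score_start, score_end)
        score_start = pos - win_start
        score_end = min(score_start + stride, actual_len)
        windows.append((win_start, score_start, score_end))
        pos = win_start + score_end
    return windows
```

Summing per-token losses over exactly these scored ranges and dividing by the total token count gives the val_bpb-style average without re-scoring any token.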
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ccf183ec4e
```python
# Only score tokens in the scoring window [score_start, score_end)
score_start = pos - win_start
score_end = min(score_start + stride, actual_len)
```
Prevent double-counting boundary tokens in sliding eval
On multi-GPU runs, eval_val_sliding partitions the validation set with my_start/my_end, but this scoring window still extends each rank by a full stride because score_end is only clamped to actual_len. In the 8xH100 path used for leaderboard submissions, that means the last window on rank r can score past my_end while rank r+1 starts again at its own my_start, so the boundary tokens are counted twice in the all-reduced val_loss/val_bpb. Please clamp the scored range to the local shard (or distribute work by windows instead of raw tokens) so the final metric matches the true dataset average.
Clamp score_end to the local shard boundary (my_end) so tokens at rank boundaries aren't counted by both adjacent ranks during all_reduce. Score may change slightly on multi-GPU runs.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
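The boundary fix can be sketched as below. This is an illustration that reuses the variable names from the review comment (`my_end`, `win_start`, etc.), not the exact implementation: the scored range is clamped to the window length *and* to the rank's own shard, so a window that straddles a shard boundary never scores tokens that the next rank will score again.

```python
# Hedged sketch of the shard-boundary clamp. Each rank owns tokens
# [my_start, my_end); without the extra min() term, the last window on a
# rank can score up to a full stride past my_end, double-counting those
# tokens with the neighboring rank in the all-reduced val_loss/val_bpb.
def scored_range(pos, win_start, stride, actual_len, my_end):
    """Return (score_start, score_end) offsets within the window at
    win_start, clamped to both the window and the local shard."""
    score_start = pos - win_start
    # Clamp to the window length AND to the local shard boundary.
    score_end = min(score_start + stride, actual_len, my_end - win_start)
    return score_start, max(score_start, score_end)
```

With this clamp, concatenating the scored ranges across ranks partitions the validation tokens exactly once, so the all-reduced metric matches the true dataset average.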
Summary
Key Techniques
- `forward_logits()` method for efficient inference.
- Muon 6-step Newton-Schulz (`MUON_BACKEND_STEPS=6`).
- `MUON_MOMENTUM_WARMUP_STEPS=1000` (up from 500) stabilizes early training.
- `WARMDOWN_ITERS=1500` (up from 1200) for smoother LR decay.

Reproduction
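The 6-step Newton-Schulz setting can be illustrated with a short numpy sketch. This uses the quintic coefficients from the public Muon reference implementation; whether this PR keeps those exact coefficients is an assumption, and the function below is an illustration rather than the PR's code.

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=6, eps=1e-7):
    """Approximately orthogonalize G (push its singular values toward 1)
    via the quintic Newton-Schulz iteration used by Muon. Coefficients
    are from the public Muon reference code (assumed, not confirmed for
    this PR). More steps -> a tighter band of singular values around 1,
    which is what MUON_BACKEND_STEPS=6 buys over the default."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + eps)  # Frobenius norm 1 => spectral norm <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:  # iterate on the wide orientation so X @ X.T is small
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X
```

In Muon the orthogonalized momentum, not the raw gradient, is applied as the update direction; the iteration only needs matmuls, so it stays cheap on GPU even at 6 steps.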
Results