
Sliding Window Eval + Muon6 (val_bpb 1.1973) #169

Open
beee003 wants to merge 2 commits into openai:main from beee003:submission-sliding-window-muon6

Conversation

@beee003 commented Mar 20, 2026

Summary

  • Mean val_bpb: 1.1973 (3 seeds: 1337→1.1968, 42→1.1974, 7→1.1978, p<0.001)
  • Artifact: ~15.9 MB (under 16 MB)
  • Training: ~13,688 steps in 600s on 8xH100 SXM
  • Eval: ~126s sliding window (within 10 min budget)

Key Techniques

  1. Sliding Window Evaluation (stride=256): Each token is scored with at least 768 tokens of preceding context, instead of the 0-1023 tokens it gets under non-overlapping chunked eval. Added a forward_logits() method for efficient inference.
  2. Muon 6-step Newton-Schulz: More accurate gradient orthogonalization (MUON_BACKEND_STEPS=6).
  3. Extended Momentum Warmup: MUON_MOMENTUM_WARMUP_STEPS=1000 (up from 500) stabilizes early training.
  4. Longer Warmdown: WARMDOWN_ITERS=1500 (up from 1200) for smoother LR decay.
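To make the stride-256 bookkeeping concrete, here is a minimal sketch of how the scoring windows tile the token stream. The function name and the 1024-token window are illustrative assumptions (the window size is inferred from the "0-1023" range above); every token is scored exactly once, and any token past the first window sees at least 1024 - 256 = 768 tokens of context.

```python
def sliding_windows(n_tokens, window=1024, stride=256):
    """Enumerate (win_start, score_start, score_end) triples.

    Each window covers [win_start, win_start + window); only the last
    `stride` positions of it, [score_start, score_end), are scored, so
    every token is scored exactly once with long left context.
    Illustrative sketch, not code from the PR.
    """
    spans = []
    pos = 0
    while pos < n_tokens:
        # Slide the window so the scored region sits at its right edge.
        win_start = max(0, pos + stride - window)
        score_end = min(pos + stride, n_tokens)
        spans.append((win_start, pos, score_end))
        pos = score_end
    return spans
```

Note that each full forward pass of 1024 tokens yields only 256 scored positions, which is why the sliding eval costs roughly 4x a chunked eval and needs the ~126 s budget noted above.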

Reproduction

MUON_BACKEND_STEPS=6 MUON_MOMENTUM_WARMUP_STEPS=1000 WARMDOWN_ITERS=1500 EVAL_STRIDE=256 \
SEED=1337 RUN_ID=submission VAL_LOSS_EVERY=0 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
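MUON_BACKEND_STEPS=6 controls how many Newton-Schulz iterations Muon runs to orthogonalize each gradient (up from the usual 5). A sketch of the quintic iteration, using the coefficients from the public Muon reference implementation, written in NumPy for illustration (the actual optimizer runs this in low precision on GPU):

```python
import numpy as np

def newton_schulz(G, steps=6):
    """Approximately orthogonalize G via the quintic Newton-Schulz
    iteration used by Muon. Coefficients follow the public reference
    implementation; this NumPy version is an illustrative sketch.
    More steps pushes the singular values closer to 1."""
    a, b, c = 3.4445, -4.7750, 2.0315
    # Frobenius normalization bounds the spectral norm by 1.
    X = G / (np.linalg.norm(G) + 1e-7)
    transposed = X.shape[0] > X.shape[1]
    if transposed:  # keep the Gram matrix X @ X.T small
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X
```

The coefficients are tuned for speed rather than exact orthogonality, so the singular values converge to a band around 1 rather than to 1 exactly; the 6th step tightens that band, which is presumably the accuracy gain claimed above.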

Results

| Seed | val_loss | val_bpb | Steps  |
|------|----------|---------|--------|
| 1337 | 2.0208   | 1.1968  | 13,688 |
| 42   | 2.0217   | 1.1974  | 13,688 |
| 7    | 2.0225   | 1.1978  | 13,688 |
| Mean | 2.0217   | 1.1973  |        |

Mean val_bpb: 1.1973 across 3 seeds (1337, 42, 7).
Key techniques: sliding window eval (stride=256), Muon 6-step Newton-Schulz,
extended momentum warmup (1000 steps), longer warmdown (1500 iters).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
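For readers checking the numbers: val_bpb is the mean cross-entropy loss (nats per token) converted to bits per byte. Back-solving from the loss/bpb pairs in the table gives roughly 2.436 bytes per token on this validation set; that ratio is inferred from the table, not stated in the PR.

```python
import math

def loss_to_bpb(val_loss_nats, bytes_per_token):
    """Convert mean per-token cross-entropy (nats) to bits per byte:
    divide by ln(2) to get bits/token, then by bytes/token."""
    return val_loss_nats / (math.log(2) * bytes_per_token)

# bytes_per_token ~= 2.436 is back-solved from the table above,
# not a documented constant of the benchmark.
```

This is why bpb is the headline metric: it is tokenizer-independent, so a change to tokenization cannot game it the way raw loss could.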

@chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ccf183ec4e


Comment on lines +903 to +905
# Only score tokens in the scoring window [score_start, score_end)
score_start = pos - win_start
score_end = min(score_start + stride, actual_len)


P1: Prevent double-counting boundary tokens in sliding eval

On multi-GPU runs, eval_val_sliding partitions the validation set with my_start/my_end, but this scoring window still extends each rank by a full stride because score_end is only clamped to actual_len. In the 8xH100 path used for leaderboard submissions, that means the last window on rank r can score past my_end while rank r+1 starts again at its own my_start, so the boundary tokens are counted twice in the all-reduced val_loss/val_bpb. Please clamp the scored range to the local shard (or distribute work by windows instead of raw tokens) so the final metric matches the true dataset average.


Clamp score_end to the local shard boundary (my_end) so tokens at
rank boundaries aren't counted by both adjacent ranks during
all_reduce. Score may change slightly on multi-GPU runs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
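The clamp described in the fix commit can be sketched as follows. `my_start`/`my_end` follow the names used in the review; the function name and the even token split across ranks are illustrative assumptions, not code from the PR.

```python
def shard_score_ranges(n_tokens, world_size, stride=256):
    """Partition validation tokens across ranks, then clamp every
    scored span to the local shard [my_start, my_end) so that tokens
    at rank boundaries are never scored by two adjacent ranks before
    the all_reduce. Illustrative sketch of the fix, not PR code."""
    all_spans = []
    per_rank = n_tokens // world_size
    for rank in range(world_size):
        my_start = rank * per_rank
        my_end = n_tokens if rank == world_size - 1 else my_start + per_rank
        pos = my_start
        while pos < my_end:
            # The fix: clamp to my_end, not just to the dataset end.
            score_end = min(pos + stride, my_end)
            all_spans.append((rank, pos, score_end))
            pos = score_end
    return all_spans
```

Without the clamp, the last window on rank r would run a full stride past my_end into rank r+1's shard, inflating the token count in the all-reduced sum, which is exactly the double-counting the review flags.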