
Record: Long Context + All Optimizations submission #166

Open
chinesepowered wants to merge 1 commit into openai:main from chinesepowered:claude/complete-coding-challenge-fGEp1
Conversation

@chinesepowered

Submission: records/track_10min_16mb/2026-03-20_LongCtx_SlidingWindow_FP16Emb_10L_AllOpts/

Strategy: Merge the two strongest approaches that haven't been combined yet:

| Technique Source | What it contributes |
| -- | -- |
| SOTA (1.1748 BPB) | Sliding window eval, FP16 embed export, 10 layers, Overtone init, Muon WD, phase-transition resid_mix |
| Seq4096 (1.2014 BPB) | Long context training, high Muon momentum (0.99), conservative LRs (0.02), smaller batch, longer warmdown |

Key changes from current SOTA:

  • train_seq_len: 1024 → 2048 (2x longer context per training sequence)
  • train_batch_tokens: 524K → 393K (more optimizer steps per wallclock hour)
  • matrix_lr/scalar_lr: 0.04 → 0.02 (reduces quantization gap)
  • muon_momentum: 0.95 → 0.99 (stronger gradient smoothing)
  • warmdown_iters: 2500 → 3600 (smoother weights for quantization)
  • tied_embed_lr: 0.10 → 0.06 (more stable embedding training)
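The changes above can be sketched as a config delta. This is a minimal illustration, assuming the training config is a flat key/value mapping; the key names mirror the bullet list, not necessarily the repo's actual config schema, and the exact token counts (524,288 and 393,216) are assumed expansions of "524K" and "393K".

```python
# Baseline hyperparameters from the current SOTA run (names illustrative).
sota = {
    "train_seq_len": 1024,
    "train_batch_tokens": 524_288,   # "524K", assumed to be 512 * 1024
    "matrix_lr": 0.04,
    "scalar_lr": 0.04,
    "muon_momentum": 0.95,
    "warmdown_iters": 2500,
    "tied_embed_lr": 0.10,
}

# Overrides borrowed from the Seq4096 training recipe.
seq4096_overrides = {
    "train_seq_len": 2048,
    "train_batch_tokens": 393_216,   # "393K", assumed to be 384 * 1024
    "matrix_lr": 0.02,
    "scalar_lr": 0.02,
    "muon_momentum": 0.99,
    "warmdown_iters": 3600,
    "tied_embed_lr": 0.06,
}

# The submission keeps all SOTA eval/quantization tricks and only
# swaps in the Seq4096 training hyperparameters.
combined = {**sota, **seq4096_overrides}

for key in sota:
    if combined[key] != sota[key]:
        print(f"{key}: {sota[key]} -> {combined[key]}")
```

Every key changes here because the two recipes disagree on all seven training knobs; the eval-time techniques (sliding window, FP16 embed export) are untouched by this delta.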

The Seq4096 submission showed a ~0.02 BPB improvement from training changes alone (without any eval tricks), and the SOTA's sliding window eval contributes a further ~0.03 BPB. If the two gains compose, this submission could target ~1.155-1.165 BPB, pending validation on 8xH100 hardware.
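The projection above follows from simple arithmetic, assuming the training gain stacks on top of the SOTA baseline (which already includes the sliding window eval). A quick sanity check, with the caveat that independent gains rarely compose perfectly:

```python
sota_bpb = 1.1748        # current SOTA, already includes sliding-window eval
training_gain = 0.02     # ~improvement Seq4096 showed from training alone

# Best case: the training gain transfers fully onto the SOTA baseline.
best_case = sota_bpb - training_gain

# Conservative case: only about half the gain survives the combination
# (a hedge, not a number from either submission).
conservative = sota_bpb - 0.5 * training_gain

print(f"best case:    ~{best_case:.4f} BPB")
print(f"conservative: ~{conservative:.4f} BPB")
```

The full-transfer estimate lands at ~1.155 BPB and the half-transfer hedge at ~1.165 BPB, which brackets the 1.155-1.165 target range quoted in the submission.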

Combines best training techniques (seq_len=2048, Muon momentum 0.99,
conservative LRs, longer warmdown) with all SOTA eval/quantization
tricks (sliding window, FP16 embed, Overtone init, Muon WD, 10 layers).

https://claude.ai/code/session_01D1CQCz3TCExUmWTVDivu3R
