
SwiGLU dim=576 + Sliding Window + Muon WD (1.2091 BPB)#163

Open
Focus2321 wants to merge 1 commit into openai:main from Focus2321:submission/swiglu-576-sliding-window

Conversation


@Focus2321 Focus2321 commented Mar 20, 2026

Summary

  • val_bpb: 1.2091 (mean across 3 seeds, sliding window eval stride=64)
  • Artifact: 13.2MB (under 16MB limit)
  • Hardware: 4xH100 SXM (8xH100 was unavailable at time of submission — see note below)

Seeds (4xH100 SXM, RunPod secure cloud)

| Seed | Steps  | ms/step | val_bpb (standard) | val_bpb (sliding window) |
|------|--------|---------|--------------------|--------------------------|
| 1337 | 10,169 | 59.0    | 1.2441             | 1.2093                   |
| 42   | 10,293 | 58.4    | 1.2432             | 1.2086                   |
| 7    | 10,310 | 58.2    | 1.2439             | 1.2092                   |

Note on hardware

These runs were done on 4xH100 SXM because 8xH100 nodes were sold out on RunPod at the time of submission. The script is fully 8xH100-compatible (grad_accum_steps = 8 // world_size), and a prior 8xH100 run of the same architecture (without weight decay or sliding window) completed 22,196 steps at 27 ms/step with a standard eval of 1.2270, confirming it runs well within the 10-minute budget on 8x.

On 4xH100 we get ~10.2K steps (grad_accum=2). On 8xH100 we'd get ~20K+ steps (grad_accum=1), which means the 8x results would be better than what's reported here. We'll update with 8xH100 logs when availability returns.
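The grad-accum scaling described above can be sketched as follows. Only the `grad_accum_steps = 8 // world_size` rule comes from the PR; the function name and the divisibility check are illustrative assumptions:

```python
def grad_accum_for(world_size: int, global_microbatches: int = 8) -> int:
    """Hold the global batch fixed at 8 microbatch-equivalents, as in the PR:
    8 GPUs -> grad_accum_steps = 1, 4 GPUs -> grad_accum_steps = 2."""
    assert global_microbatches % world_size == 0, "world size must divide global batch"
    return global_microbatches // world_size

for ws in (1, 2, 4, 8):
    print(ws, "GPUs ->", grad_accum_for(ws), "accumulation steps")
```

This is why moving to 8xH100 roughly doubles the step count in the same wallclock: each optimizer step needs half as many sequential microbatches per GPU.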

Key Changes from Baseline

  1. Wider model (dim=576, 7 layers, SwiGLU mult=2) over the deeper baseline (dim=512, 9 layers) — width scaling beat depth scaling
  2. Muon weight decay 0.02 — decoupled WD improves training + shrinks artifact by 0.5MB
  3. FP16 embedding passthrough — eliminates tied-embedding quantization degradation
  4. Sliding window eval (stride=64) — ~0.035 BPB improvement at eval time
  5. Wallclock warmdown 60% — optimal with weight decay (tested 0.35–0.7)
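A hedged sketch of the decoupled weight decay in item 2. The Muon orthogonalized update itself is elided; only the 0.02 decay coefficient comes from the PR, and the function name, learning rate, and parameter names are assumptions:

```python
import torch

def decoupled_wd_step(param: torch.Tensor, update: torch.Tensor,
                      lr: float = 0.01, wd: float = 0.02) -> None:
    """Apply an AdamW-style decoupled weight-decay step: the decay is a
    separate multiplicative shrink, not folded into the gradient/update."""
    with torch.no_grad():
        param.mul_(1 - lr * wd)        # decoupled decay: shrink toward zero
        param.add_(update, alpha=-lr)  # then apply the (e.g. Muon) update
```

Shrinking weights toward zero also explains the artifact-size win mentioned above: smaller-magnitude weights compress better after quantization.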

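A minimal sketch of the SwiGLU MLP in item 1. The module and parameter names are assumptions, not taken from train_gpt.py; only dim=576 and mult=2 come from the PR:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """SwiGLU feedforward: (silu(x @ W_gate) * (x @ W_up)) @ W_down."""
    def __init__(self, dim: int = 576, mult: int = 2):
        super().__init__()
        hidden = mult * dim  # mult=2 -> 1152 hidden units per the PR
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))
```

Compared to a plain GELU MLP at the same hidden width, the extra gate projection adds parameters, which is part of the width-vs-depth tradeoff the PR is exploiting.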
Architecture discovered through 111 automated experiments on a single RTX 3090 before scaling to H100.
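The sliding-window eval in item 4 can be sketched like this. Only the stride of 64 comes from the PR; the context length, model interface, and loss accounting are assumptions, and for simplicity this returns bits per token rather than bits per byte (BPB additionally folds in the bytes-per-token ratio):

```python
import math
import torch

@torch.no_grad()
def sliding_window_bits(model, tokens, ctx_len=1024, stride=64):
    """Score overlapping windows of ctx_len tokens, but count only the final
    `stride` targets of each window toward the loss, so every token is
    predicted with (near-)maximal left context."""
    nll_sum, n_scored = 0.0, 0
    for start in range(0, len(tokens) - ctx_len, stride):
        window = tokens[start : start + ctx_len + 1]
        x, y = window[:-1], window[1:].clone()
        # Mask out all but the last `stride` targets, except in the first
        # window, where every position is scored (no earlier coverage).
        if start > 0:
            y[:-stride] = -1
        logits = model(x.unsqueeze(0))
        loss = torch.nn.functional.cross_entropy(
            logits.view(-1, logits.size(-1)), y.view(-1),
            ignore_index=-1, reduction="sum")
        nll_sum += loss.item()
        n_scored += (y >= 0).sum().item()
    return nll_sum / n_scored / math.log(2)
```

The ~0.035 BPB gap between the standard and sliding-window columns in the table above is exactly this effect: standard chunked eval predicts many tokens with very short context.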

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@MatoTeziTanka

PR #163 Review

Title: SwiGLU dim=576 + Sliding Window + Muon WD (1.2091 BPB)
State: open
Date Reviewed: 2026-04-11

Code Analysis

train_gpt.py Checks

  • target-in-key pattern: not found
  • TTT (Temporal Token Tagging): not found
  • SLOT (Slot MoE): not found
  • Custom Tokenizer: not found

Verdict

Classification: PURE_NEURAL_CLEAN

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: ✓ MERGE


Draft by @MatoTeziTanka for parameter-golf review sweep (2026-04-11)


Reviewed by @MatoTeziTanka (The Agora). Classification via sibling-session agent (Haiku-backed). This review was drafted from a template and spot-checked before posting; if the template misread your code, please call it out so I can iterate on the classifier.

