
SwiGLU dim=576 + Sliding Window + Muon WD (1.2091 BPB)#163

Open
Focus2321 wants to merge 1 commit into openai:main from Focus2321:submission/swiglu-576-sliding-window

Conversation


@Focus2321 Focus2321 commented Mar 20, 2026

Summary

  • val_bpb: 1.2091 (mean across 3 seeds, sliding window eval stride=64)
  • Artifact: 13.2MB (under 16MB limit)
  • Hardware: 4xH100 SXM (8xH100 was unavailable at time of submission — see note below)

Seeds (4xH100 SXM, RunPod secure cloud)

| Seed | Steps  | ms/step | val_bpb (standard) | val_bpb (sliding window) |
|------|--------|---------|--------------------|--------------------------|
| 1337 | 10,169 | 59.0    | 1.2441             | 1.2093                   |
| 42   | 10,293 | 58.4    | 1.2432             | 1.2086                   |
| 7    | 10,310 | 58.2    | 1.2439             | 1.2092                   |

Note on hardware

These runs were done on 4xH100 SXM because 8xH100 nodes were sold out on RunPod at the time of submission. The script is fully 8xH100-compatible (grad_accum_steps = 8 // world_size), and a prior 8xH100 run of the same architecture (without weight decay or sliding window) completed 22,196 steps at 27 ms/step with a standard eval of 1.2270, confirming it runs well within the 10-minute budget on 8x.

On 4xH100 we get ~10.2K steps (grad_accum=2). On 8xH100 we'd get ~20K+ steps (grad_accum=1), which means the 8x results would be better than what's reported here. We'll update with 8xH100 logs when availability returns.
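The grad-accum scaling described above can be sketched as follows. Only the `grad_accum_steps = 8 // world_size` rule comes from the PR; the function name and the divisibility check are illustrative assumptions:

```python
def grad_accum_for(world_size: int, global_microbatches: int = 8) -> int:
    """Hold the global batch fixed at 8 microbatch-equivalents, as in the PR:
    8 GPUs -> grad_accum_steps = 1, 4 GPUs -> grad_accum_steps = 2."""
    assert global_microbatches % world_size == 0, "world size must divide global batch"
    return global_microbatches // world_size

for ws in (1, 2, 4, 8):
    print(ws, "GPUs ->", grad_accum_for(ws), "accumulation steps")
```

This is why moving to 8xH100 roughly doubles the step count in the same wallclock: each optimizer step needs half as many sequential microbatches per GPU.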

Key Changes from Baseline

  1. Wider model (dim=576, 7 layers, SwiGLU mult=2) over the deeper baseline (dim=512, 9 layers) — width scaling beat depth scaling
  2. Muon weight decay 0.02 — decoupled WD improves training + shrinks artifact by 0.5MB
  3. FP16 embedding passthrough — eliminates tied-embedding quantization degradation
  4. Sliding window eval (stride=64) — ~0.035 BPB improvement at eval time
  5. Wallclock warmdown 60% — optimal with weight decay (tested 0.35–0.7)
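A hedged sketch of the decoupled weight decay in item 2. The Muon orthogonalized update itself is elided; only the 0.02 decay coefficient comes from the PR, and the function name, learning rate, and parameter names are assumptions:

```python
import torch

def decoupled_wd_step(param: torch.Tensor, update: torch.Tensor,
                      lr: float = 0.01, wd: float = 0.02) -> None:
    """Apply an AdamW-style decoupled weight-decay step: the decay is a
    separate multiplicative shrink, not folded into the gradient/update."""
    with torch.no_grad():
        param.mul_(1 - lr * wd)        # decoupled decay: shrink toward zero
        param.add_(update, alpha=-lr)  # then apply the (e.g. Muon) update
```

Shrinking weights toward zero also explains the artifact-size win mentioned above: smaller-magnitude weights compress better after quantization.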

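A minimal sketch of the SwiGLU MLP in item 1. The module and parameter names are assumptions, not taken from train_gpt.py; only dim=576 and mult=2 come from the PR:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """SwiGLU feedforward: (silu(x @ W_gate) * (x @ W_up)) @ W_down."""
    def __init__(self, dim: int = 576, mult: int = 2):
        super().__init__()
        hidden = mult * dim  # mult=2 -> 1152 hidden units per the PR
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))
```

Compared to a plain GELU MLP at the same hidden width, the extra gate projection adds parameters, which is part of the width-vs-depth tradeoff the PR is exploiting.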
Architecture discovered through 111 automated experiments on a single RTX 3090 before scaling to H100.
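The sliding-window eval in item 4 can be sketched like this. Only the stride of 64 comes from the PR; the context length, model interface, and loss accounting are assumptions, and for simplicity this returns bits per token rather than bits per byte (BPB additionally folds in the bytes-per-token ratio):

```python
import math
import torch

@torch.no_grad()
def sliding_window_bits(model, tokens, ctx_len=1024, stride=64):
    """Score overlapping windows of ctx_len tokens, but count only the final
    `stride` targets of each window toward the loss, so every token is
    predicted with (near-)maximal left context."""
    nll_sum, n_scored = 0.0, 0
    for start in range(0, len(tokens) - ctx_len, stride):
        window = tokens[start : start + ctx_len + 1]
        x, y = window[:-1], window[1:].clone()
        # Mask out all but the last `stride` targets, except in the first
        # window, where every position is scored (no earlier coverage).
        if start > 0:
            y[:-stride] = -1
        logits = model(x.unsqueeze(0))
        loss = torch.nn.functional.cross_entropy(
            logits.view(-1, logits.size(-1)), y.view(-1),
            ignore_index=-1, reduction="sum")
        nll_sum += loss.item()
        n_scored += (y >= 0).sum().item()
    return nll_sum / n_scored / math.log(2)
```

The ~0.035 BPB gap between the standard and sliding-window columns in the table above is exactly this effect: standard chunked eval predicts many tokens with very short context.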

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@MatoTeziTanka

PR #163 Review

Title: SwiGLU dim=576 + Sliding Window + Muon WD (1.2091 BPB)
State: open
Date Reviewed: 2026-04-11

Code Analysis

train_gpt.py Checks

  • target-in-key pattern: not found
  • TTT (Temporal Token Tagging): not found
  • SLOT (Slot MoE): not found
  • Custom Tokenizer: not found

Verdict

Classification: PURE_NEURAL_CLEAN

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: ✓ MERGE


Draft by @MatoTeziTanka for parameter-golf review sweep (2026-04-11)


Reviewed by @MatoTeziTanka (The Agora). Classification via sibling-session agent (Haiku-backed). This review was drafted from a template and spot-checked before posting; if the template misread your code, please call it out so I can iterate on the classifier.

