SwiGLU dim=576 + Sliding Window + Muon WD (1.2091 BPB) #163
Open
Focus2321 wants to merge 1 commit into openai:main
Mean val_bpb: 1.2091 (sliding window, stride=64)
Seeds: 1337 (1.2093), 42 (1.2086), 7 (1.2092)
Artifact: 13.2MB (well under 16MB limit)
Hardware: 4xH100 SXM, ~10K steps in 10 min

Key innovations:
- Wider model (dim=576, 7L) over deeper (dim=512, 9L)
- Muon decoupled weight decay 0.02
- FP16 embedding passthrough
- Sliding window eval (stride=64)
- Architecture found via 111 automated experiments on RTX 3090

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
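To make the "sliding window, stride=64" evaluation concrete, here is a minimal sketch of how such windows are commonly laid out so that every token is scored exactly once, with as much left context as the window allows. It is not taken from train_gpt.py: the function names, the `context_len` parameter, and the example sizes are hypothetical, and the PR's actual implementation may differ.

```python
import math


def sliding_windows(n_tokens, context_len, stride):
    """Return (begin, end, score_from) triples.

    The model would see tokens [begin, end); only tokens in
    [score_from, end) contribute to the loss, so nothing is counted twice.
    All names here are illustrative assumptions, not the PR's code.
    """
    if stride <= 0 or context_len < stride:
        raise ValueError("need 0 < stride <= context_len")
    windows = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + context_len, n_tokens)
        windows.append((begin, end, prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return windows


def bits_per_byte(total_nll_nats, n_bytes):
    """Convert a summed negative log-likelihood in nats to bits per byte."""
    return total_nll_nats / (math.log(2) * n_bytes)


if __name__ == "__main__":
    # Hypothetical sizes: 300 tokens, a 128-token context, stride 64.
    for begin, end, score_from in sliding_windows(300, 128, 64):
        print(f"context [{begin}, {end})  scored [{score_from}, {end})")
```

A smaller stride gives each scored token more preceding context at the cost of more forward passes, which is the usual reason sliding-window numbers come out lower than standard fixed-chunk evaluation.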
PR #163 Review
Title: SwiGLU dim=576 + Sliding Window + Muon WD (1.2091 BPB)
Code Analysis
train_gpt.py Checks
Verdict
Classification: PURE_NEURAL_CLEAN
Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: ✓ MERGE
Draft by @MatoTeziTanka for parameter-golf review sweep (2026-04-11)
Reviewed by @MatoTeziTanka, The Agora. Classification via sibling-session agent (Haiku-backed). This review was drafted from a template and spot-checked before posting. If the template misread your code, please call it out so I can iterate the classifier.
Summary
Seeds (4xH100 SXM, RunPod secure cloud)
Note on hardware
These runs were done on 4xH100 SXM because 8xH100 nodes were sold out on RunPod at the time of submission. The script is fully 8xH100-compatible (grad_accum_steps = 8 // world_size), and we have a prior 8xH100 run of the same architecture (without weight decay or sliding window eval) that completed 22,196 steps at 27ms/step with a standard-eval val_bpb of 1.2270, confirming that it runs well within the 10-minute budget on 8x.

On 4xH100 we get ~10.2K steps (grad_accum=2). On 8xH100 we would get ~20K+ steps (grad_accum=1), so we expect the 8x results to be better than what is reported here. We'll update with 8xH100 logs when availability returns.
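The arithmetic behind this note can be sketched as follows. Only the expression `8 // world_size`, the 27ms/step figure, and the 10-minute budget come from the text above. The helper names and the divisibility check are illustrative assumptions, not part of the script.

```python
def grad_accum_steps(world_size, micro_batches_per_step=8):
    """Micro-batches each GPU accumulates before an optimizer step.

    Mirrors `grad_accum_steps = 8 // world_size` from the note above; the
    divisibility check is an added assumption for safety.
    """
    if world_size <= 0 or micro_batches_per_step % world_size != 0:
        raise ValueError("world_size must evenly divide the micro-batch count")
    return micro_batches_per_step // world_size


def steps_in_budget(budget_seconds, seconds_per_step):
    """Rough upper bound on optimizer steps that fit in a wall-clock budget."""
    return int(budget_seconds / seconds_per_step)


if __name__ == "__main__":
    for gpus in (4, 8):
        accum = grad_accum_steps(gpus)
        # gpus * accum stays at 8, so the effective batch per step is unchanged.
        print(f"{gpus} GPUs -> grad_accum={accum}, micro-batches/step={gpus * accum}")
    # At the reported 27 ms/step, a 600 s budget allows roughly 22K steps,
    # in line with the 22,196 steps of the earlier 8xH100 run.
    print(steps_in_budget(600, 0.027))
```

Because the product of GPU count and accumulation steps is constant, moving from 4 to 8 GPUs keeps the per-step batch the same and roughly halves the wall-clock time per step, which is where the ~20K+ step estimate comes from.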
Key Changes from Baseline
Architecture discovered through 111 automated experiments on a single RTX 3090 before scaling to H100.
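One of the changes named in this PR is decoupled weight decay of 0.02 on the Muon-optimized parameters. As a rough illustration of what "decoupled" usually means, the sketch below shrinks each weight directly instead of adding the decay term to the gradient. This is a generic AdamW-style formulation on plain Python floats. Whether the PR scales the decay by the learning rate, and how it interacts with Muon's orthogonalized update, are assumptions here, and the function name is hypothetical.

```python
def decoupled_weight_decay_step(params, updates, lr, weight_decay=0.02):
    """Apply one optimizer step with decoupled weight decay.

    params  : list of floats (current weights)
    updates : list of floats (the optimizer's update direction, e.g. the
              post-processed momentum; computed WITHOUT any decay term)
    The decay multiplies the weights directly, so it does not pass through
    the optimizer's own update transformation.
    """
    shrink = 1.0 - lr * weight_decay  # assumed lr-scaled decay, AdamW style
    return [p * shrink - lr * u for p, u in zip(params, updates)]


if __name__ == "__main__":
    weights = [1.0, -2.0]
    update = [0.5, 0.0]
    print(decoupled_weight_decay_step(weights, update, lr=0.1))
```

With a zero update, a weight simply decays toward zero by a factor of `1 - lr * weight_decay` per step, which is the behavior that distinguishes this from L2 regularization folded into the gradient.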