
Record: Int6 MLP3x + SmearGate + BigramHash + MuonWD + SWA (mean val_bpb=1.1483)#162

Open

raahilshah wants to merge 2 commits into openai:main from raahilshah:submission/2026-03-20_Int6_MLP3x_SmearGate_BigramHash_MuonWD_SWA

Conversation

@raahilshah

Int6 MLP3x + SmearGate + BigramHash + OrthoInit + Muon WD + SWA

Mean val_bpb: 1.1483 (3 seeds: 1.1488, 1.1485, 1.1476)

Trained on 8×H100 SXM in 600 seconds. 15.92MB artifact (int6+zstd-22).

Key Techniques

  1. Per-row int6 quantization ([-32,31]) on MLP + attention weights, fp16 passthrough for tied embeddings and last-layer key projection. zstd level 22 compression.
  2. 3× MLP expansion (hidden=1536) — enabled by int6 byte savings. Single largest improvement source.
  3. SmearGate — learned gate blending each token embedding with the previous token's (~512 params).
  4. BigramHash embedding — 4096-bucket hash table (dim=128→512) for token-pair context (~524K params).
  5. Orthogonal init + muP scaling — orthogonal weight init, output projections scaled by 1/√(2·num_layers).
  6. Muon WD=0.02 with momentum warmup 0.92→0.99 over 1500 steps. AdamW WD=0.01 for embeddings/scalars.
  7. SWA over last 50% of training (every 200 steps) — smoother weights, better quantization.
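The per-row int6 scheme (item 1 above) can be sketched in a few lines. This is an illustrative numpy version, not the PR's actual code, and the function names are assumptions:

```python
import numpy as np

def quantize_int6_per_row(w: np.ndarray):
    """Symmetric per-row quantization to the int6 range [-32, 31].
    A separate scale per row keeps outlier rows from inflating the error."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 31.0
    scale = np.where(scale == 0, 1.0, scale)  # guard all-zero rows
    q = np.clip(np.round(w / scale), -32, 31).astype(np.int8)
    return q, scale

def dequantize_int6(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale
```

In the artifact pipeline described above, the 6-bit values would then be bit-packed and zstd-compressed at level 22; the fp16 passthrough tensors (tied embeddings, last-layer key projection) skip this path entirely.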

Hyperparameters

9 layers, 512 dim, MLP 3×, seq=2048, batch=786K tokens, warmdown=3000, matrix_lr=0.02, grad_clip=0.3, muon_momentum=0.99.

Metrics

Seed   val_loss   val_bpb
1337   1.93978    1.14885
42     1.93923    1.14852
7      1.93762    1.14757
Mean   1.93888    1.14831
  • Pre-quant val_bpb: 1.1640
  • Steps: 7,373 in 600s (81.4 ms/step)
  • Artifact: 15.92MB (int6+zstd-22)
  • Improvement over current SOTA (1.1748): -0.0265 bpb / -0.046 nats


@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 14cdf6f7a4


    ).reshape(bsz, seq_len)
    for i, ws in enumerate(batch_ws):
        wlen = wlens[i]
        s = 0 if ws == 0 else max(wlen - stride, 0)


P2: Avoid double-counting tail tokens in sliding eval

The sliding-window scorer can count the same validation tokens more than once near the end of the corpus. With s = 0 if ws == 0 else max(wlen - stride, 0), any non-first window where wlen < stride scores the entire short window, including tokens that were already scored by the previous window (e.g., seq_len=8, stride=4 double-scores the last two tokens). This biases the reported val_loss/val_bpb, which can skew experiment comparisons.
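One minimal fix is to track the absolute corpus position already scored and start each window there. A sketch under assumed variable names (not the PR's code):

```python
def window_score_starts(window_starts, window_lens):
    """For each sliding eval window, return the first in-window index to
    score so that every corpus token is scored exactly once, even when
    the final window is shorter than the stride."""
    starts = []
    scored_upto = 0  # absolute corpus position scored so far
    for ws, wlen in zip(window_starts, window_lens):
        starts.append(max(scored_upto - ws, 0))  # skip already-scored overlap
        scored_upto = max(scored_upto, ws + wlen)
    return starts
```

For the reviewer's example (stride=4, a tail window of length 2 starting at position 8 after the previous window already covered through position 10), the tail window's start index equals its length, so it scores nothing instead of re-scoring the last two tokens.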


…SWA — improved config (Muon WD=0.04, SWA every 50), mean val_bpb=1.1458
@raahilshah
Author

Updated submission — improved configuration after systematic hyperparameter sweeps:

  • Muon weight decay: 0.02 → 0.04 (swept 0.01–0.05, optimal at 0.04)
  • SWA frequency: every 200 steps → every 50 steps (swept 25–200, optimal at 50; ~30 checkpoint average)

New results (3 seeds):

Seed   val_loss   val_bpb
1337   1.93492    1.14597
42     1.93591    1.14656
7      1.93314    1.14492
Mean   1.93466    1.14582

Previous submission mean was 1.1483 → now 1.1458 (improvement of 0.0025 bpb from tuning WD and SWA frequency alone).
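The SWA schedule being swept here amounts to a running average of checkpoints; a sketch over a plain parameter dict (names assumed, not the repo's code):

```python
def swa_update(avg, current, n_averaged):
    """Incrementally average checkpoints: after the k-th call, `avg` holds
    the mean of all k checkpoints seen so far. Snapshotting every 50 steps
    over the averaging window yields the ~30-checkpoint average described."""
    for k in avg:
        avg[k] += (current[k] - avg[k]) / (n_averaged + 1)
    return n_averaged + 1
```

The incremental form avoids keeping all checkpoints in memory, which matters when averaging dozens of snapshots of the full model.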

kellyvv added a commit to kellyvv/parameter-golf that referenced this pull request Mar 20, 2026
- SmearGate: learned per-dim gate blending x[t] with x[t-1] (~512 params)
  USE_SMEAR_GATE=1 to enable
- BigramHash: hash(tok[t-1],tok[t]) -> 4096-bucket embed(128) -> proj(512)
  USE_BIGRAM_HASH=1 to enable (~524K params)
- Both disabled by default for backward compatibility
- forward_with_adapter refactored to reuse _forward_body
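The SmearGate described in this commit can be sketched as follows (numpy, illustrative; the exact blend form and names are assumptions, not the repo's code):

```python
import numpy as np

def smear_gate(x: np.ndarray, gate_logits: np.ndarray) -> np.ndarray:
    """Blend each token embedding with the previous token's embedding
    through a learned per-dimension sigmoid gate (~dim params, 512 here).
    x: (batch, seq, dim); gate_logits: (dim,)."""
    # shift right by one position; position 0 has no previous token
    prev = np.concatenate([np.zeros_like(x[:, :1]), x[:, :-1]], axis=1)
    g = 1.0 / (1.0 + np.exp(-gate_logits))  # sigmoid
    return x + g * prev
```

Initializing `gate_logits` strongly negative would make the gate start near zero, recovering the baseline model at the start of training.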
kellyvv added a commit to kellyvv/parameter-golf that referenced this pull request Mar 20, 2026
- STE QAT: fake quantize->dequantize in CastedLinear forward pass
  Gradients pass through via STE (w + (w_hat - w).detach())
  Activates after STE_QAT_START_FRAC of training (default 25%)
  USE_STE_QAT=1 to enable
- forward_with_adapter refactored to reuse _forward_body
- All Tier 2 features are env-var controlled, disabled by default
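The STE fake-quantization trick in this commit can be sketched as follows (PyTorch, illustrative; the per-row int6 scheme mirrors the PR's description, the function name is an assumption):

```python
import torch

def ste_fake_quant_int6(w: torch.Tensor) -> torch.Tensor:
    """Quantize-dequantize in the forward pass. The straight-through
    estimator w + (w_hat - w).detach() makes the forward value equal the
    quantized weights while the backward pass is identity, so gradients
    flow to the full-precision weights."""
    scale = w.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / 31.0
    w_hat = (w / scale).round().clamp(-32, 31) * scale
    return w + (w_hat - w).detach()
```

Gating this on a training-progress fraction (the commit's `STE_QAT_START_FRAC`) lets early training proceed in full precision before the model adapts to the quantization grid.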
