
Record: Int6 MLP3x + SmearGate + BigramHash + MuonWD + SWA (mean val_bpb=1.1483)#162

Open

raahilshah wants to merge 2 commits into openai:main from raahilshah:submission/2026-03-20_Int6_MLP3x_SmearGate_BigramHash_MuonWD_SWA

Conversation

@raahilshah

Int6 MLP3x + SmearGate + BigramHash + OrthoInit + Muon WD + SWA

Mean val_bpb: 1.1483 (3 seeds: 1.1488, 1.1485, 1.1476)

Trained on 8×H100 SXM in 600 seconds. 15.92MB artifact (int6+zstd-22).

Key Techniques

  1. Per-row int6 quantization ([-32,31]) on MLP + attention weights, fp16 passthrough for tied embeddings and last-layer key projection. zstd level 22 compression.
  2. 3× MLP expansion (hidden=1536) — enabled by int6 byte savings. Single largest improvement source.
  3. SmearGate — learned gate blending each token embedding with the previous token's (~512 params).
  4. BigramHash embedding — 4096-bucket hash table (dim=128→512) for token-pair context (~524K params).
  5. Orthogonal init + muP scaling — orthogonal weight init, output projections scaled by 1/√(2·num_layers).
  6. Muon WD=0.02 with momentum warmup 0.92→0.99 over 1500 steps. AdamW WD=0.01 for embeddings/scalars.
  7. SWA over last 50% of training (every 200 steps) — smoother weights, better quantization.
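The per-row int6 scheme (item 1 above) can be sketched in a few lines. This is an illustrative numpy version, not the PR's actual code, and the function names are assumptions:

```python
import numpy as np

def quantize_int6_per_row(w: np.ndarray):
    """Symmetric per-row quantization to the int6 range [-32, 31].
    A separate scale per row keeps outlier rows from inflating the error."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 31.0
    scale = np.where(scale == 0, 1.0, scale)  # guard all-zero rows
    q = np.clip(np.round(w / scale), -32, 31).astype(np.int8)
    return q, scale

def dequantize_int6(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale
```

In the artifact pipeline described above, the 6-bit values would then be bit-packed and zstd-compressed at level 22; the fp16 passthrough tensors (tied embeddings, last-layer key projection) skip this path entirely.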

Hyperparameters

9 layers, 512 dim, MLP 3×, seq=2048, batch=786K tokens, warmdown=3000, matrix_lr=0.02, grad_clip=0.3, muon_momentum=0.99.

Metrics

Seed   val_loss   val_bpb
1337   1.93978    1.14885
42     1.93923    1.14852
7      1.93762    1.14757
Mean   1.93888    1.14831
  • Pre-quant val_bpb: 1.1640
  • Steps: 7,373 in 600s (81.4 ms/step)
  • Artifact: 15.92MB (int6+zstd-22)
  • Improvement over current SOTA (1.1748): -0.0265 bpb / -0.046 nats


@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 14cdf6f7a4


    ).reshape(bsz, seq_len)
    for i, ws in enumerate(batch_ws):
        wlen = wlens[i]
        s = 0 if ws == 0 else max(wlen - stride, 0)


P2: Avoid double-counting tail tokens in sliding eval

The sliding-window scorer can count the same validation tokens more than once near the end of the corpus. With s = 0 if ws == 0 else max(wlen - stride, 0), any non-first window where wlen < stride scores the entire short window, including tokens that were already scored by the previous window (e.g., seq_len=8, stride=4 double-scores the last two tokens). This biases the reported val_loss/val_bpb, which can skew experiment comparisons.
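One minimal fix is to track the absolute corpus position already scored and start each window there. A sketch under assumed variable names (not the PR's code):

```python
def window_score_starts(window_starts, window_lens):
    """For each sliding eval window, return the first in-window index to
    score so that every corpus token is scored exactly once, even when
    the final window is shorter than the stride."""
    starts = []
    scored_upto = 0  # absolute corpus position scored so far
    for ws, wlen in zip(window_starts, window_lens):
        starts.append(max(scored_upto - ws, 0))  # skip already-scored overlap
        scored_upto = max(scored_upto, ws + wlen)
    return starts
```

For the reviewer's example (stride=4, a tail window of length 2 starting at position 8 after the previous window already covered through position 10), the tail window's start index equals its length, so it scores nothing instead of re-scoring the last two tokens.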


…SWA — improved config (Muon WD=0.04, SWA every 50), mean val_bpb=1.1458
@raahilshah
Author

Updated submission — improved configuration after systematic hyperparameter sweeps:

  • Muon weight decay: 0.02 → 0.04 (swept 0.01–0.05, optimal at 0.04)
  • SWA frequency: every 200 steps → every 50 steps (swept 25–200, optimal at 50; ~30 checkpoint average)

New results (3 seeds):

Seed   val_loss   val_bpb
1337   1.93492    1.14597
42     1.93591    1.14656
7      1.93314    1.14492
Mean   1.93466    1.14582

Previous submission mean was 1.1483 → now 1.1458 (improvement of 0.0025 bpb from tuning WD and SWA frequency alone).
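The SWA schedule being swept here amounts to a running average of checkpoints; a sketch over a plain parameter dict (names assumed, not the repo's code):

```python
def swa_update(avg, current, n_averaged):
    """Incrementally average checkpoints: after the k-th call, `avg` holds
    the mean of all k checkpoints seen so far. Snapshotting every 50 steps
    over the averaging window yields the ~30-checkpoint average described."""
    for k in avg:
        avg[k] += (current[k] - avg[k]) / (n_averaged + 1)
    return n_averaged + 1
```

The incremental form avoids keeping all checkpoints in memory, which matters when averaging dozens of snapshots of the full model.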

kellyvv added a commit to kellyvv/parameter-golf that referenced this pull request Mar 20, 2026
- SmearGate: learned per-dim gate blending x[t] with x[t-1] (~512 params)
  USE_SMEAR_GATE=1 to enable
- BigramHash: hash(tok[t-1],tok[t]) -> 4096-bucket embed(128) -> proj(512)
  USE_BIGRAM_HASH=1 to enable (~524K params)
- Both disabled by default for backward compatibility
- forward_with_adapter refactored to reuse _forward_body
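The SmearGate described in this commit can be sketched as follows (numpy, illustrative; the exact blend form and names are assumptions, not the repo's code):

```python
import numpy as np

def smear_gate(x: np.ndarray, gate_logits: np.ndarray) -> np.ndarray:
    """Blend each token embedding with the previous token's embedding
    through a learned per-dimension sigmoid gate (~dim params, 512 here).
    x: (batch, seq, dim); gate_logits: (dim,)."""
    # shift right by one position; position 0 has no previous token
    prev = np.concatenate([np.zeros_like(x[:, :1]), x[:, :-1]], axis=1)
    g = 1.0 / (1.0 + np.exp(-gate_logits))  # sigmoid
    return x + g * prev
```

Initializing `gate_logits` strongly negative would make the gate start near zero, recovering the baseline model at the start of training.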
kellyvv added a commit to kellyvv/parameter-golf that referenced this pull request Mar 20, 2026
- STE QAT: fake quantize->dequantize in CastedLinear forward pass
  Gradients pass through via STE (w + (w_hat - w).detach())
  Activates after STE_QAT_START_FRAC of training (default 25%)
  USE_STE_QAT=1 to enable
- forward_with_adapter refactored to reuse _forward_body
- All Tier 2 features are env-var controlled, disabled by default
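The STE fake-quantization trick in this commit can be sketched as follows (PyTorch, illustrative; the per-row int6 scheme mirrors the PR's description, the function name is an assumption):

```python
import torch

def ste_fake_quant_int6(w: torch.Tensor) -> torch.Tensor:
    """Quantize-dequantize in the forward pass. The straight-through
    estimator w + (w_hat - w).detach() makes the forward value equal the
    quantized weights while the backward pass is identity, so gradients
    flow to the full-precision weights."""
    scale = w.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / 31.0
    w_hat = (w / scale).round().clamp(-32, 31) * scale
    return w + (w_hat - w).detach()
```

Gating this on a training-progress fraction (the commit's `STE_QAT_START_FRAC`) lets early training proceed in full precision before the model adapts to the quantization grid.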
