
feat(record): Int6 STE + NorMuon + SWA + Sliding Window (val_bpb=1.16019)#156

Open
dexhunter wants to merge 2 commits into openai:main from dexhunter:weco-int6-ste-normuon

Conversation

@dexhunter
Contributor

Summary

New SOTA submission: mean val_bpb = 1.16019 across 3 seeds, beating current merged SOTA (1.17475) by 0.01456.

| Seed | val_bpb | Steps | ms/step | Artifact (bytes) |
|------|---------|-------|---------|------------------|
| 1337 | 1.16146 | 12357 | 48.55 | 15,045,740 |
| 42 | 1.15935 | 12351 | 48.58 | 15,053,489 |
| 7 | 1.15976 | 12336 | 48.69 | 15,157,415 |
| **Mean** | **1.16019** | | | |

Key Techniques

  1. Int6 STE — Fake int6 per-row quantization ([-31, 31]) on every forward pass with a straight-through gradient bypass. The model learns to cope with quantization noise, yielding only a ~+0.002 bpb gap.
  2. NorMuon optimizer — Row-normalized Newton-Schulz updates on top of Muon, with adaptive step sizes.
  3. 3x MLP width (1536) — Enabled by int6 compression savings within the 16 MB budget.
  4. FP16 tied embedding — The embedding tensor is stored in fp16 and never quantized.
  5. Sliding-window eval (stride=64) — Every token gets 960 tokens of context (~0.033 bpb improvement).
  6. SWA — 7-checkpoint average during the warmdown phase.
  7. Zstd-22 compression — Better ratio than zlib for quantized weights.
  8. U-Net skip connections — Encoder-decoder structure with learnable skip weights.
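The per-row fake quantization in technique 1 can be sketched as follows. This is a minimal NumPy illustration, not the submission's actual code; the function name and shapes are my own, and the straight-through estimator (which in PyTorch would be the `w + (q - w).detach()` trick) is only described in comments since NumPy has no autograd.

```python
import numpy as np

def fake_quant_int6_per_row(w: np.ndarray) -> np.ndarray:
    """Fake int6 per-row quantization to the symmetric grid [-31, 31].

    Forward pass only. In training, the straight-through estimator (STE)
    passes gradients through the rounding unchanged; in PyTorch that is
    typically written as:  w + (fake_quant(w) - w).detach()
    """
    # One scale per row, so a single outlier row does not crush the rest.
    scale = np.abs(w).max(axis=1, keepdims=True) / 31.0
    scale = np.where(scale == 0, 1.0, scale)      # guard all-zero rows
    q = np.clip(np.round(w / scale), -31, 31)     # snap to the int6 grid
    return q * scale                              # dequantize for the forward pass
```

Because the same quantizer runs on every forward pass, the weights that ship in the artifact are exactly the int6 codes plus one fp scale per row, which is where the compression budget for the 3x-wide MLP comes from.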

Architecture

  • 9 layers, 512 dim, 8 heads, 4 KV heads (GQA)
  • Vocab 1024 (SentencePiece BPE), seq len 1024
  • relu² activation, RoPE, logit softcapping (30.0)
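Two of the architecture choices above are one-liners; a quick NumPy sketch (illustrative only, not the submission's code) of the logit softcap at 30.0 and the relu² activation:

```python
import numpy as np

def softcap(logits: np.ndarray, cap: float = 30.0) -> np.ndarray:
    # Smoothly squashes logits into (-cap, cap); gradients shrink as
    # |logit| grows, which keeps training stable.
    return cap * np.tanh(logits / cap)

def relu2(x: np.ndarray) -> np.ndarray:
    # relu^2 MLP activation: zero for x < 0, x squared otherwise.
    return np.square(np.maximum(x, 0.0))
```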

Submission checklist

  • 3-seed verification with mean val_bpb
  • All artifacts < 16MB
  • Wallclock < 600s on 8xH100
  • Train logs included
  • Reproducible train_gpt.py included
  • submission.json with metadata
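The stride-64 sliding-window eval mentioned in the techniques list works by scoring only the last 64 tokens of each 1024-token window (all of the first window), so every later token sees at least 1024 - 64 = 960 tokens of left context. A minimal sketch of the window bookkeeping, under my own naming (the real eval loop also runs the model and sums log-probs):

```python
def sliding_window_chunks(n_tokens: int, window: int = 1024, stride: int = 64):
    """Return (start, end, n_scored) spans for stride-based eval.

    Each span covers `window` tokens, but only the trailing `stride`
    are scored (the entire first window is scored), so each scored
    token after the first window has >= window - stride context.
    """
    chunks = []
    start = 0
    while start + window <= n_tokens:
        scored = window if start == 0 else stride
        chunks.append((start, start + window, scored))
        start += stride
    return chunks
```

The cost is one forward pass per 64 scored tokens instead of per 1024, i.e. roughly 16x more eval compute, traded for the ~0.033 bpb improvement claimed above.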

…019)

3-seed verified results:
- Seed 1337: val_bpb=1.16146
- Seed 42: val_bpb=1.15935
- Seed 7: val_bpb=1.15976
- Mean: 1.16019

Key techniques: int6 STE quantization-aware training, NorMuon optimizer,
3x MLP width (1536), FP16 tied embedding, sliding window eval (stride=64),
SWA with 7 checkpoints, zstd-22 compression, U-Net skip connections.
Added val_bpb, bytes_total, bytes_code, github_id fields expected
by the parameter-golf-leaderboard collector script.
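The 7-checkpoint SWA during warmdown amounts to a uniform average of parameter snapshots. A plain-Python stand-in (the real code would average torch state_dicts; names here are mine):

```python
import numpy as np

def average_checkpoints(checkpoints):
    """Uniformly average a list of state dicts (param name -> ndarray).

    Stochastic weight averaging (SWA): snapshots taken late in training
    are averaged into a single set of weights, which typically lands in
    a flatter, better-generalizing region than any single snapshot.
    """
    n = len(checkpoints)
    return {name: sum(ckpt[name] for ckpt in checkpoints) / n
            for name in checkpoints[0]}
```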
@MatoTeziTanka

Community Review — feat(record): Int6 STE + NorMuon + SWA + Sliding Window (val_bpb=1.16019)

BPB: 1.16019 | Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

What I found in the code (head SHA 2c12ff54420b, file records/track_10min_16mb/2026-03-20_Int6STE_NorMuon_SWA_SlidingWindow/train_gpt.py):

Static code review found no TTT adaptation function, no SLOT optimization loop, no n-gram-cache class, and no pre-quant val-token fine-tune. The eval path uses the standard sliding-window stride-64 pattern. The submission is a pure-neural architecture iteration on the standard SP1024/SP4096/SP8192 baseline.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.03s, dim=512, layers=9, vocab=1024, code=54630 B, SMOKE_TEST_PASS

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the classification pass — this looks like a clean pure-neural iteration on the standard baseline.

Auto-classification caveat: this review was drafted by the AST-based classifier. If there's a non-standard eval mechanism (logit postprocessing, hedge mixing, etc.) that I missed because it's factored into a helper file or a non-standard function name, please flag it and I'll re-run the audit manually.


Reviewed by @MatoTeziTanka (The Agora). Classification via the deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting — if the template misread your code, please call it out so I can iterate on the classifier.

