Record: MLP3x + Int8 Tok Emb + Grouped LZMA + Sliding Window (val_bpb=1.1623)#160
Conversation
Community Review — Record: MLP3x + Int8 Tok Emb + Grouped LZMA + Sliding Window (val_bpb=1.1623)

BPB: 1.1623 | Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

What I found in the code (head SHA …): static code review found no TTT adaptation function, no SLOT optimization loop, no n-gram-cache class, and no pre-quant val-token fine-tune. The eval path uses the standard sliding-window stride-64 pattern. The submission is a pure-neural architecture iteration on the standard SP1024/SP4096/SP8192 baseline.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.03s, dim=512, layers=9, vocab=1024, code=64924 B, SMOKE_TEST_PASS

Verdict: LOOKS CLEAN. Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the classification pass — this looks like a clean pure-neural iteration on the standard baseline.

Auto-classification caveat: this review was drafted by the deterministic AST-based classifier. If there's a non-standard eval mechanism (logit postprocessing, hedge mixing, etc.) that I missed because it's factored into a helper file or a non-standard function name, please flag it and I'll re-run the audit manually.

Reviewed by @MatoTeziTanka — The Agora.
MLP3x + Int8 Tok Emb + Grouped LZMA + Sliding Window
This submission starts from the public 10-minute baseline family and makes three main changes:
- the MLP expansion factor is increased from 2x to 3x,
- the token embedding is stored in int8 and the artifact is compressed with grouped LZMA, and
- evaluation uses a sliding window with seq_len=2048.

The model was trained for the official 600 s wallclock limit on 8x H100 SXM, then repacked into a submission-valid artifact under the 16,000,000-byte limit.

Model
```
VOCAB_SIZE=1024
NUM_LAYERS=9
MODEL_DIM=512
NUM_HEADS=8
NUM_KV_HEADS=4
MLP_MULT=3
TIE_EMBEDDINGS=1
```

This keeps the backbone close to the baseline while spending more of the parameter budget on the MLP.
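To see roughly where that budget goes, here is a back-of-envelope parameter count. The layer structure assumed below (separate q/k/v/o projections with GQA-sized k/v, a plain 2-matrix MLP, no biases, tied embedding/head) is my assumption, not taken from `train_gpt.py`:

```python
# Rough parameter count under assumed layer structure (not the exact
# train_gpt.py block layout): separate q/k/v/o, 2-matrix MLP, no biases.
def approx_params(dim=512, layers=9, vocab=1024,
                  heads=8, kv_heads=4, mlp_mult=3):
    head_dim = dim // heads
    attn = dim * dim                          # q projection
    attn += 2 * dim * (kv_heads * head_dim)   # k and v (GQA-sized)
    attn += dim * dim                         # output projection
    mlp = 2 * dim * (mlp_mult * dim)          # up + down projections
    embed = vocab * dim                       # tied with the output head
    return layers * (attn + mlp) + embed

# MLP 2x baseline vs the 3x variant used here
print(approx_params(mlp_mult=2), approx_params(mlp_mult=3))
```

Under these assumptions the 2x-to-3x change adds roughly 4.7M parameters, all in the MLPs, while attention and the small 1024-token embedding stay fixed.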
Training Setup
The timed run used:
```
TRAIN_BATCH_TOKENS=786432
TRAIN_SEQ_LEN=2048
ITERATIONS=20000
MAX_WALLCLOCK_SECONDS=600
WARMUP_STEPS=20
WARMDOWN_ITERS=3000
TIED_EMBED_LR=0.03
MATRIX_LR=0.02
SCALAR_LR=0.02
MUON_MOMENTUM=0.99
```

Logged optimizer summary:
```
tie_embeddings:True embed_lr:0.03 head_lr:0.0 matrix_lr:0.02 scalar_lr:0.02
```

Logged attention summary:
```
attention_mode:gqa num_heads:8 num_kv_heads:4
```

The script includes QAT support, but this specific timed run stopped before QAT activation:
```
qat_enabled:True qat_start_frac:0.500 qat_start_step:10000
```

The run hit the wallclock cap at step 7534, before the QAT start step of 10000, so the final reported result is from post-training repacking of the timed checkpoint rather than from a checkpoint that had entered the QAT phase.
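The logged schedule is consistent with the QAT start step being derived from the start fraction and the iteration budget (an assumption that matches the logged values):

```python
# Reconstructing the logged QAT schedule from the run config.
# Assumption: qat_start_step = qat_start_frac * ITERATIONS.
iterations = 20000
qat_start_frac = 0.500
qat_start_step = int(qat_start_frac * iterations)

stop_step = 7534  # step at which the 600 s wallclock cap ended the run
qat_entered = stop_step >= qat_start_step
print(qat_start_step, qat_entered)  # QAT was never entered
```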
Timed Training Result
The official training run stopped at the wallclock cap:
```
step:7534/20000 train_time:600120ms step_avg:79.65ms
```

Validation at stop:
```
val_loss=1.9844 val_bpb=1.1753
```

Other logged details:
- GPU memory: 16738 MiB / 16944 MiB
- raw checkpoint size: 86099351 bytes

Export / Compression
The first export evaluated from the timed run used:
- fp16 token embedding passthrough
- fp16 passthrough for the last two c_k weights
- zlib compression

That version evaluated very well but was not submission-valid because it was over the size cap:
- size: 16639274 bytes
- standard eval: val_bpb=1.18101095
- sliding-window eval: val_bpb=1.16018011

The final submission-valid repack uses:
- QGv3 serialization
- lzma compression
- int8 tok_emb.weight
- fp16 passthrough tensors

This change was enough to get back under the limit while preserving almost all of the sliding-window gain.
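The core of the repack idea can be sketched with stock NumPy and the standard-library `lzma` module. The helper names (`pack_int8_lzma`, `unpack_int8_lzma`) and the per-row scale grouping are mine; the submission's actual QGv3 serialization differs in detail:

```python
import lzma
import numpy as np

# Hypothetical sketch of the int8-embedding + LZMA idea; the real QGv3
# format in the submission may group and serialize tensors differently.
def pack_int8_lzma(w: np.ndarray):
    # Per-row symmetric int8 quantization (one scale per embedding row).
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    blob = lzma.compress(q.tobytes(), preset=9)
    return blob, scale.astype(np.float32), q.shape

def unpack_int8_lzma(blob, scale, shape):
    q = np.frombuffer(lzma.decompress(blob), dtype=np.int8).reshape(shape)
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((1024, 512)).astype(np.float32) * 0.02
blob, scale, shape = pack_int8_lzma(w)
w_hat = unpack_int8_lzma(blob, scale, shape)
print(len(blob) < w.nbytes)   # int8 payload beats fp32 even pre-LZMA
print(float(np.abs(w - w_hat).max()) < 1e-3)
```

With a 1024x512 embedding the int8 payload is already a quarter of the fp32 bytes, and LZMA then works on the byte stream; the fp16 passthrough tensors would bypass this path entirely.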
Final Submission Artifact
- artifact: final_model.mixed_tok8_lzma.ptz
- compression: lzma
- model bytes: 15845980
- code bytes: 64924
- total bytes: 15910904

This is submission-valid under the 16,000,000-byte cap.

Final Scores
Exact post-pack scores for the final under-cap artifact:
standard eval:

```
val_loss=1.99867543 val_bpb=1.18372817
```

sliding-window eval with seq_len=2048, stride=256:

```
val_loss=1.96250243 val_bpb=1.16230441
```

The submission score is therefore:

1.16230441 val_bpb

Notes
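As a note on the eval path: the sliding-window pattern with seq_len=2048 and stride=256 can be sketched with a hypothetical indexing helper. The name `sliding_windows` and the exact bookkeeping are mine, assuming the common strided-perplexity pattern where each window scores only the tokens not covered by the previous one; the real loop in train_gpt.py may differ:

```python
# Hypothetical sketch of sliding-window eval indexing (assumed pattern,
# not the literal train_gpt.py loop).
def sliding_windows(n_tokens, seq_len=2048, stride=256):
    """Yield (begin, end, score_from): the model sees tokens[begin:end]
    and the loss is computed only on positions >= score_from."""
    windows = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + seq_len, n_tokens)
        windows.append((begin, end, prev_end))  # score only the new tokens
        prev_end = end
        if end == n_tokens:
            break
    return windows

ws = sliding_windows(5000)
# Every token is scored exactly once, most with up to
# seq_len - stride = 1792 tokens of extra left context.
assert sum(end - sf for _, end, sf in ws) == 5000
print(len(ws), ws[0], ws[-1])
```

This extra context per scored token is what buys the gap between the standard eval (1.1837 bpb) and the sliding-window eval (1.1623 bpb), at the cost of seq_len/stride times more forward-pass tokens.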
- The token embedding is kept at 8-bit while quantizing the rest of the model to 6-bit.
- The submission is packed as a single `torch.save` artifact.
- Tensors are restored to fp32 to match the trained checkpoint behavior after reload.

Included Files
- README.md - this writeup
- submission.json - submission metadata
- train.log - exact log from the timed 8x H100 SXM run
- train_gpt.py - code snapshot for the submission artifact and evaluation path