Record: MLP3x + Int8 Tok Emb + Grouped LZMA + Sliding Window (val_bpb=1.1623)#160
Conversation
Community Review — Record: MLP3x + Int8 Tok Emb + Grouped LZMA + Sliding Window (val_bpb=1.1623)

BPB: 1.1623 | Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

What I found in the code (head SHA …): static code review found no TTT adaptation function, no SLOT optimization loop, no n-gram-cache class, and no pre-quant val-token fine-tune. The eval path uses the standard sliding-window stride-64 pattern. The submission is a pure-neural architecture iteration on the standard SP1024/SP4096/SP8192 baseline.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.03s, dim=512, layers=9, vocab=1024, code=64924 B, SMOKE_TEST_PASS

Verdict: LOOKS CLEAN. Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the classification pass — this looks like a clean pure-neural iteration on the standard baseline.

Auto-classification caveat: this review was drafted by the deterministic AST-based classifier. If there's a non-standard eval mechanism (logit postprocessing, hedge mixing, etc.) that I missed because it's factored into a helper file or a non-standard function name, please flag it and I'll re-run the audit manually.

Reviewed by @MatoTeziTanka — The Agora.
MLP3x + Int8 Tok Emb + Grouped LZMA + Sliding Window
This submission starts from the public 10-minute baseline family and makes three main changes:
- the MLP expansion factor is increased from 2x to 3x,
- the token embedding is stored in int8 and the artifact is compressed with grouped LZMA, and
- evaluation uses a sliding window with seq_len=2048.

The model was trained for the official 600 s wallclock limit on 8x H100 SXM, then repacked into a submission-valid artifact under the 16,000,000-byte limit.

Model
```
VOCAB_SIZE=1024
NUM_LAYERS=9
MODEL_DIM=512
NUM_HEADS=8
NUM_KV_HEADS=4
MLP_MULT=3
TIE_EMBEDDINGS=1
```

This keeps the backbone close to the baseline while spending more of the parameter budget on the MLP.
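To see roughly where that budget goes, here is a back-of-envelope parameter count. The layer structure assumed below (separate q/k/v/o projections with GQA-sized k/v, a plain 2-matrix MLP, no biases, tied embedding/head) is my assumption, not taken from `train_gpt.py`:

```python
# Rough parameter count under assumed layer structure (not the exact
# train_gpt.py block layout): separate q/k/v/o, 2-matrix MLP, no biases.
def approx_params(dim=512, layers=9, vocab=1024,
                  heads=8, kv_heads=4, mlp_mult=3):
    head_dim = dim // heads
    attn = dim * dim                          # q projection
    attn += 2 * dim * (kv_heads * head_dim)   # k and v (GQA-sized)
    attn += dim * dim                         # output projection
    mlp = 2 * dim * (mlp_mult * dim)          # up + down projections
    embed = vocab * dim                       # tied with the output head
    return layers * (attn + mlp) + embed

# MLP 2x baseline vs the 3x variant used here
print(approx_params(mlp_mult=2), approx_params(mlp_mult=3))
```

Under these assumptions the 2x-to-3x change adds roughly 4.7M parameters, all in the MLPs, while attention and the small 1024-token embedding stay fixed.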
Training Setup
The timed run used:
```
TRAIN_BATCH_TOKENS=786432
TRAIN_SEQ_LEN=2048
ITERATIONS=20000
MAX_WALLCLOCK_SECONDS=600
WARMUP_STEPS=20
WARMDOWN_ITERS=3000
TIED_EMBED_LR=0.03
MATRIX_LR=0.02
SCALAR_LR=0.02
MUON_MOMENTUM=0.99
```

Logged optimizer summary:
```
tie_embeddings:True embed_lr:0.03 head_lr:0.0 matrix_lr:0.02 scalar_lr:0.02
```

Logged attention summary:
```
attention_mode:gqa num_heads:8 num_kv_heads:4
```

The script includes QAT support, but this specific timed run stopped before QAT activation:
```
qat_enabled:True qat_start_frac:0.500 qat_start_step:10000
```

The run hit the wallclock cap at step 7534, before the QAT start step of 10000, so the final reported result is from post-training repacking of the timed checkpoint rather than from a checkpoint that had entered the QAT phase.
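The logged schedule is consistent with the QAT start step being derived from the start fraction and the iteration budget (an assumption that matches the logged values):

```python
# Reconstructing the logged QAT schedule from the run config.
# Assumption: qat_start_step = qat_start_frac * ITERATIONS.
iterations = 20000
qat_start_frac = 0.500
qat_start_step = int(qat_start_frac * iterations)

stop_step = 7534  # step at which the 600 s wallclock cap ended the run
qat_entered = stop_step >= qat_start_step
print(qat_start_step, qat_entered)  # QAT was never entered
```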
Timed Training Result
The official training run stopped at the wallclock cap:
```
step:7534/20000 train_time:600120ms step_avg:79.65ms
```

Validation at stop:
```
val_loss=1.9844 val_bpb=1.1753
```

Other logged details:
- GPU memory: 16738 MiB / 16944 MiB
- raw checkpoint size: 86099351 bytes

Export / Compression
The first export evaluated from the timed run used:
- fp16 token embedding passthrough
- fp16 passthrough for the last two c_k weights
- zlib compression

That version evaluated very well but was not submission-valid because it was over the size cap:
- size: 16639274 bytes
- standard eval: val_bpb=1.18101095
- sliding-window eval: val_bpb=1.16018011

The final submission-valid repack uses:
- QGv3 serialization
- lzma compression
- int8 tok_emb.weight
- fp16 passthrough tensors

This change was enough to get back under the limit while preserving almost all of the sliding-window gain.
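The core of the repack idea can be sketched with stock NumPy and the standard-library `lzma` module. The helper names (`pack_int8_lzma`, `unpack_int8_lzma`) and the per-row scale grouping are mine; the submission's actual QGv3 serialization differs in detail:

```python
import lzma
import numpy as np

# Hypothetical sketch of the int8-embedding + LZMA idea; the real QGv3
# format in the submission may group and serialize tensors differently.
def pack_int8_lzma(w: np.ndarray):
    # Per-row symmetric int8 quantization (one scale per embedding row).
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    blob = lzma.compress(q.tobytes(), preset=9)
    return blob, scale.astype(np.float32), q.shape

def unpack_int8_lzma(blob, scale, shape):
    q = np.frombuffer(lzma.decompress(blob), dtype=np.int8).reshape(shape)
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((1024, 512)).astype(np.float32) * 0.02
blob, scale, shape = pack_int8_lzma(w)
w_hat = unpack_int8_lzma(blob, scale, shape)
print(len(blob) < w.nbytes)   # int8 payload beats fp32 even pre-LZMA
print(float(np.abs(w - w_hat).max()) < 1e-3)
```

With a 1024x512 embedding the int8 payload is already a quarter of the fp32 bytes, and LZMA then works on the byte stream; the fp16 passthrough tensors would bypass this path entirely.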
Final Submission Artifact
- artifact: final_model.mixed_tok8_lzma.ptz
- compression: lzma
- model bytes: 15845980
- code bytes: 64924
- total bytes: 15910904

This is submission-valid under the 16,000,000-byte cap.

Final Scores
Exact post-pack scores for the final under-cap artifact:
standard eval:

```
val_loss=1.99867543 val_bpb=1.18372817
```

sliding-window eval with seq_len=2048, stride=256:

```
val_loss=1.96250243 val_bpb=1.16230441
```

The submission score is therefore:

1.16230441 val_bpb

Notes
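As a note on the eval path: the sliding-window pattern with seq_len=2048 and stride=256 can be sketched with a hypothetical indexing helper. The name `sliding_windows` and the exact bookkeeping are mine, assuming the common strided-perplexity pattern where each window scores only the tokens not covered by the previous one; the real loop in train_gpt.py may differ:

```python
# Hypothetical sketch of sliding-window eval indexing (assumed pattern,
# not the literal train_gpt.py loop).
def sliding_windows(n_tokens, seq_len=2048, stride=256):
    """Yield (begin, end, score_from): the model sees tokens[begin:end]
    and the loss is computed only on positions >= score_from."""
    windows = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + seq_len, n_tokens)
        windows.append((begin, end, prev_end))  # score only the new tokens
        prev_end = end
        if end == n_tokens:
            break
    return windows

ws = sliding_windows(5000)
# Every token is scored exactly once, most with up to
# seq_len - stride = 1792 tokens of extra left context.
assert sum(end - sf for _, end, sf in ws) == 5000
print(len(ws), ws[0], ws[-1])
```

This extra context per scored token is what buys the gap between the standard eval (1.1837 bpb) and the sliding-window eval (1.1623 bpb), at the cost of seq_len/stride times more forward-pass tokens.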
- The token embedding is kept at 8-bit while quantizing the rest of the model to 6-bit.
- The submission is packed as a single `torch.save` artifact.
- Tensors are restored to fp32 to match the trained checkpoint behavior after reload.

Included Files
- README.md - this writeup
- submission.json - submission metadata
- train.log - exact log from the timed 8x H100 SXM run
- train_gpt.py - code snapshot for the submission artifact and evaluation path