# V5: TTT LoRA + Dual Bigram + Label Smoothing + Z-Loss

## Strategy
thwu1's winning 1.1428 bpb architecture + 4 novel innovations that no competitor uses.

## Architecture (from thwu1)
- **10 transformer layers**, GQA (8 heads, 4 KV heads), model_dim=512
- **3x MLP** (hidden=1536), ReLU^2 activation
- **Learned BigramHashEmbedding** (10240 entries, 128-dim, XOR hash)
- **SmearGate**: learned gate blending each token with previous
- **U-Net skip connections** with learned blending weights
- **Tied embeddings** (FP16 passthrough for tok_emb + last layer c_k)
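The XOR-hash bigram embedding above can be sketched as follows. This is a minimal NumPy illustration, not the actual implementation: the multiplier constant and the zero-padding of the first position are assumptions; only the table shape (10240 entries, 128-dim) and the XOR-hash idea come from the description.

```python
import numpy as np

def bigram_hash(prev_tok, cur_tok, table_size=10240, mult=0x9E3779B1):
    # Mix the two token ids (previous id multiplied by an odd constant
    # to break symmetry, then XORed with the current id) and reduce
    # modulo the table size -- a hashing trick so the full bigram space
    # fits in a small learned table.
    return ((prev_tok * mult) ^ cur_tok) % table_size

rng = np.random.default_rng(0)
table = rng.normal(0.0, 0.02, size=(10240, 128))  # learned in practice

tokens = np.array([17, 4, 999, 4])
prev = np.concatenate([[0], tokens[:-1]])   # assumed: pad position 0 with id 0
idx = bigram_hash(prev, tokens)
bigram_emb = table[idx]                     # [seq_len, 128], added to tok_emb
```

Hash collisions are tolerated by design; the table simply learns a shared vector for colliding bigrams.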

## Training (from thwu1)
- **Muon optimizer**: momentum=0.99, weight_decay=0.04, Newton-Schulz
- **Linear warmdown** (wallclock-aware, 3000 iters)
- **SWA** (last 40%, every 50 steps, accumulated on CPU)
- **Low LRs**: matrix=0.02, scalar=0.02, tied_embed=0.03
- **Train shape**: seq_len=2048, batch=786,432 tokens
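The linear warmdown can be sketched as a schedule multiplier. The actual run triggers warmdown from wallclock time rather than a fixed step count, and the 10,000-step total below is a hypothetical; only the 3000-iteration warmdown window and the 0.02 matrix LR come from the description.

```python
def lr_scale(step, total_steps, warmdown_iters=3000):
    # Hold LR constant until the final warmdown window, then decay
    # linearly to zero over the last `warmdown_iters` steps. (The real
    # run triggers warmdown from wallclock time, not a step count.)
    steps_left = total_steps - step
    if steps_left >= warmdown_iters:
        return 1.0
    return steps_left / warmdown_iters

# e.g. with a hypothetical 10,000-step run, 500 steps from the end:
matrix_lr = 0.02 * lr_scale(step=9500, total_steps=10000)
```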

## Quantization & Compression (from thwu1)
- **Mixed Int5/Int6**: Int5 (clip=15) for MLP, Int6 (clip=31) for attention
- **3% magnitude pruning** (zeros compress well)
- **zstd level 22** compression
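A minimal sketch of the prune-then-quantize step, assuming symmetric round-to-nearest quantization with a per-tensor scale (the actual bit-packing and zstd stage are not shown; `quantize` and its threshold choice are illustrative, only the clip values 15/31 and the 3% prune fraction come from the description):

```python
import numpy as np

def quantize(w, clip, prune_frac=0.03):
    # Magnitude-prune the smallest `prune_frac` of weights (zeros
    # compress very well under zstd), then symmetric round-to-nearest
    # quantization onto integers in [-clip, clip].
    w = w.copy()
    thresh = np.quantile(np.abs(w), prune_frac)
    w[np.abs(w) < thresh] = 0.0
    scale = np.abs(w).max() / clip
    q = np.clip(np.round(w / scale), -clip, clip).astype(np.int8)
    return q, scale                 # dequantize as q * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(1536, 512)).astype(np.float32)
q5, s5 = quantize(w, clip=15)       # "Int5" range for MLP weights
q6, s6 = quantize(w, clip=31)       # "Int6" range for attention weights
```

Using the wider Int6 range for attention and the narrower Int5 range for the (larger) MLP matrices trades a little MLP precision for a smaller compressed artifact.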

## Novel Innovations (V5)

### 1. TTT LoRA at Eval (biggest gain: -0.010 to -0.025 bpb)
Per-document LoRA adapters (rank=8) on Q, V projections and LM head.
Each document gets its own adapter trained chunk-by-chunk (256 tokens).
Adapters reset between documents. Uses Adam (lr=0.01).
**No competitor in the top 5 uses TTT.**
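One way the TTT loop could be structured, as a NumPy sketch: a toy LoRA on the LM head only, with plain SGD standing in for Adam. The real adapters also cover the Q and V projections and use Adam (lr=0.01); only the rank (8), chunk size (256), and per-document reset come from the description.

```python
import numpy as np

def ttt_eval(docs, W, rank=8, lr=0.01, chunk=256):
    # Per-document test-time training: a rank-`rank` LoRA (A @ Bm) on a
    # toy LM head W [dim, vocab], adapted chunk-by-chunk. Each chunk is
    # scored BEFORE the update, and the adapter is discarded (reset)
    # when the document ends.
    d, v = W.shape
    rng = np.random.default_rng(0)
    losses = []
    for hiddens, targets in docs:
        A = np.zeros((d, rank))                # fresh adapter per document
        Bm = rng.normal(size=(rank, v)) * 0.01
        for s in range(0, len(targets), chunk):
            h, y = hiddens[s:s+chunk], targets[s:s+chunk]
            logits = h @ (W + A @ Bm)
            p = np.exp(logits - logits.max(axis=1, keepdims=True))
            p /= p.sum(axis=1, keepdims=True)
            losses.append(-np.log(p[np.arange(len(y)), y]).mean())
            g = (p - np.eye(v)[y]) / len(y)    # softmax-CE gradient
            dH = h.T @ g                       # grad wrt the adapted head
            A -= lr * dH @ Bm.T                # plain SGD (real run: Adam)
            Bm -= lr * A.T @ dH
    return float(np.mean(losses))

rng = np.random.default_rng(1)
doc = (rng.normal(size=(512, 16)), rng.integers(0, 32, size=512))
avg_nll = ttt_eval([doc], rng.normal(size=(16, 32)) * 0.1)
```

Scoring before each update keeps the evaluation causal: the adapter only ever sees tokens it has already been scored on.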

### 2. Dual Bigram inside TTT (-0.001 to -0.003 bpb)
Post-hoc statistical [1024, 1024] bigram residual table added to the logits
**inside** the TTT eval loop, so the LoRA adapters adapt on top of the
bigram-corrected predictions rather than fighting them.
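A hypothetical sketch of how such a table could be built and applied. The mod-1024 vocab bucketing, Laplace smoothing, and log-probability residual form are all assumptions; only the [1024, 1024] shape and the idea of adding a scaled statistical residual to the logits come from the description (the 0.3 scale matches EVAL_BIGRAM_SCALE in the config).

```python
import numpy as np

B = 1024                                  # bigram table is [1024, 1024]
counts = np.ones((B, B))                  # Laplace-smoothed counts
# ... accumulate counts[prev % B, nxt % B] over the training stream ...
log_bigram = np.log(counts / counts.sum(axis=1, keepdims=True))

def blended_logits(model_logits, prev_token, scale=0.3):
    # Add the scaled bigram log-prob residual for the previous token's
    # row to the model's logits (vocab ids bucketed mod 1024 -- an
    # assumed bucketing scheme).
    vocab = model_logits.shape[0]
    residual = log_bigram[prev_token % B][np.arange(vocab) % B]
    return model_logits + scale * residual

out = blended_logits(np.zeros(2048), prev_token=7)
```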

### 3. Label Smoothing (0.05) (-0.001 to -0.003 bpb)
Prevents overconfident predictions, improves calibration and
quantization robustness.
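Label smoothing with eps=0.05 amounts to the following loss, sketched here in NumPy for a single position (the example logit vector is illustrative):

```python
import numpy as np

def smoothed_ce(logits, target, eps=0.05):
    # Log-softmax, then cross-entropy against a softened target:
    # (1 - eps) mass on the true class, eps spread uniformly over the
    # vocab. Overconfident predictions stop paying off, which improves
    # calibration and leaves headroom for quantization error.
    m = logits.max()
    logp = logits - m - np.log(np.exp(logits - m).sum())
    return float(-(1 - eps) * logp[target] - eps * logp.mean())

confident = np.array([12.0, 0.0, 0.0, 0.0])
plain  = smoothed_ce(confident, target=0, eps=0.0)  # standard NLL, near 0
smooth = smoothed_ce(confident, target=0)           # smoothing adds a floor
```

The non-zero floor on confident predictions is exactly what discourages the extreme logit magnitudes that quantization handles worst.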

### 4. Z-Loss Regularization (1e-4) (-0.001 to -0.002 bpb)
Penalizes large logit magnitudes: `z_loss = 1e-4 * mean(logsumexp(logits)^2)`.
Directly improves post-quantization performance.
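The z-loss formula above can be written out directly (a NumPy sketch of the stated formula; the zero-logit example is illustrative):

```python
import numpy as np

def z_loss(logits, coeff=1e-4):
    # Penalize the log-partition Z = logsumexp(logits, axis=-1):
    # coeff * mean(Z^2) pulls overall logit magnitudes toward zero,
    # shrinking the dynamic range the Int5/Int6 quantizer must cover.
    m = logits.max(axis=-1)
    z = m + np.log(np.exp(logits - m[..., None]).sum(axis=-1))
    return coeff * float(np.mean(z ** 2))

penalty = z_loss(np.zeros((2, 4)))   # each row has Z = log(4)
```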

## Reproduction
```bash
pip install zstandard
cd /workspace/parameter-golf
python data/cached_challenge_fineweb.py
torchrun --standalone --nproc_per_node=8 \
records/track_10min_16mb/2026-03-20_Combined_Int6_QAT_SlidingWindow/train_gpt.py
```

## Fallback
If TTT eval exceeds time budget: `TTT_ENABLED=0` falls back to sliding window eval.

## Expected Results
Target: ~1.115-1.135 bpb, beating thwu1's 1.1428 by 0.008-0.028 bpb.
## Submission Metadata

```json
{
  "name": "V5_TTT_LoRA_DualBigram_LabelSmoothing",
  "author": "DizSi",
  "date": "2026-03-21",
  "track": "10min_16mb",
  "description": "V5: thwu1's 1.1428 architecture + 4 novel innovations. TTT LoRA eval (per-document rank-8 Q/V/LM-head adapters), dual bigram (learned BigramHash + post-hoc statistical table inside TTT), label smoothing (0.05), z-loss (1e-4). Core: 10L GQA (8h/4kv), 3xMLP, SmearGate, Mixed Int5/Int6, SWA, magnitude pruning, zstd-22, linear warmdown, Muon 0.99.",
  "base_submissions": [
    "thwu1_10L_Int5MLP_BigramHash_SWA"
  ],
  "env": {
    "WARMDOWN_ITERS": 3000,
    "MATRIX_LR": 0.02,
    "TIED_EMBED_LR": 0.03,
    "SCALAR_LR": 0.02,
    "GRAD_CLIP_NORM": 0.3,
    "WEIGHT_DECAY": 0.04,
    "MUON_MOMENTUM": 0.99,
    "MUON_MOMENTUM_WARMUP_STEPS": 1500,
    "NUM_LAYERS": 10,
    "MODEL_DIM": 512,
    "MLP_MULT": 3.0,
    "NUM_HEADS": 8,
    "NUM_KV_HEADS": 4,
    "BIGRAM_VOCAB_SIZE": 10240,
    "BIGRAM_DIM": 128,
    "TRAIN_SEQ_LEN": 2048,
    "EVAL_STRIDE": 64,
    "SWA_START_FRAC": 0.4,
    "PRUNE_FRAC": 0.03,
    "TTT_ENABLED": true,
    "TTT_LORA_RANK": 8,
    "TTT_LORA_LR": 0.01,
    "TTT_CHUNK_SIZE": 256,
    "TTT_EVAL_SEQ_LEN": 1024,
    "TTT_BATCH_SIZE": 64,
    "LABEL_SMOOTHING": 0.05,
    "Z_LOSS_COEFF": 0.0001,
    "EVAL_BIGRAM_TABLE": true,
    "EVAL_BIGRAM_SCALE": 0.3
  }
}
```