openai · evnkm · Mar 20, 2026 · Mar 20, 2026 · Mar 21, 2026
diff --git a/LOGS_RESEARCH_THINKING/2026-03-18_Parameter-Golf-Plan-v0.md b/LOGS_RESEARCH_THINKING/2026-03-18_Parameter-Golf-Plan-v0.md
diff --git a/records/track_non_record_16mb/2026-03-19_CrossLayerSharing_4bitQAT/README.md b/records/track_non_record_16mb/2026-03-19_CrossLayerSharing_4bitQAT/README.md
@@ -0,0 +1,57 @@
+# Cross-Layer Parameter Sharing + 4-bit QAT (RecurrentGPT)
+
+## Approach
+
+This submission introduces two information-theoretically motivated techniques to maximize effective model capacity within the 16MB artifact budget:
+
+### 1. Cross-Layer Parameter Sharing (ALBERT-style Depth Recurrence)
+
+Instead of N unique transformer blocks, we use a **prelude → recurrent → coda** architecture:
+- **2 unique prelude blocks** for input specialization
+- **1 shared recurrent block** iterated 10 times with per-iteration learned gates
+- **2 unique coda blocks** for output specialization
+- **Effective depth: 14 layers** from only 5 unique block parameter sets
+
+Per-iteration `iter_gate` parameters (10 × d_model) allow the shared block to differentiate its behavior across depths. U-Net skip connections are preserved from the baseline.
+
+### 2. 4-bit Quantization-Aware Training (QAT)
+
+Straight-through estimator (STE) based fake quantization simulates 4-bit precision (16 levels) during training. This lets the model learn robustness to aggressive post-training quantization:
+- 4-bit weights stored as int8 with values in [-8, 7]
+- Only 16 distinct values per row → excellent zlib compressibility
+- QAT disabled for first 500 steps to let weights settle
+
+### Combined Effect
+
+The parameter sharing reduces unique parameters from ~28M (SOTA) to ~13.8M, while 4-bit quantization compresses these to a **7.25MB artifact** (vs 15.4MB SOTA). This leaves ~8.75MB of headroom for wider models (d=640 vs d=512) or more recurrent iterations.
+
+## Architecture Details
+
+| Parameter | Value |
+|-----------|-------|
+| model_dim | 640 |
+| num_heads | 10 (head_dim=64) |
+| num_kv_heads | 2 (aggressive GQA) |
+| num_prelude | 2 |
+| num_recurrent_iters | 10 |
+| num_coda | 2 |
+| effective_depth | 14 |
+| unique_params | 13,786,290 |
+| artifact_bytes | 7,193,920 |
+
+## Results
+
+| Seed | val_loss | val_bpb | Steps | ms/step |
+|------|----------|---------|-------|---------|
+| 1337 | 3.6447 | 2.1586 | 300 | 2001 |
+
+**Note:** This run was on 2× GPUs in eager mode (no `torch.compile`) due to Triton SMEM constraints with the recurrent architecture. Only 300 steps were completed in the 10-minute wallclock — far too few for convergence. The approach needs 8×H100 with `torch.compile` (compiling individual blocks rather than the full model) to achieve competitive BPB.
+
+## Key Innovations Over Baseline
+
+1. **RecurrentGPT architecture** with configurable prelude/recurrent/coda structure
+2. **Per-iteration gating** for depth-dependent behavior without per-layer parameters
+3. **4-bit QAT with STE** integrated into `CastedLinear` for seamless training
+4. **Unified quantization pipeline** supporting both int8 and 4-bit modes
+5. **Phase-transition residual mixing** applied across effective depth (including recurrent iterations)
+6. All SOTA innovations preserved: FP16 tied embedding, Muon WD, overtone init, sliding window eval
diff --git a/records/track_non_record_16mb/2026-03-19_CrossLayerSharing_4bitQAT/submission.json b/records/track_non_record_16mb/2026-03-19_CrossLayerSharing_4bitQAT/submission.json
@@ -0,0 +1,14 @@
+{
+  "track": "non_record_16mb",
+  "date": "2026-03-19",
+  "name": "Cross-Layer Parameter Sharing + 4-bit QAT (RecurrentGPT)",
+  "author": "evan_kim",
+  "seed_results": {
+    "1337": {"val_loss": 3.6447, "val_bpb": 2.1586, "steps": 300, "ms_per_step": 2001.47}
+  },
+  "mean_val_loss": 3.6447,
+  "mean_val_bpb": 2.1586,
+  "artifact_bytes": 7193920,
+  "code_bytes": 57943,
+  "notes": "Non-record: tested on 2xGPU in eager mode (no torch.compile). Only 300 steps in 10min wallclock. Needs 8xH100 + torch.compile for competitive results. Architecture uses depth recurrence (ALBERT-style) + 4-bit QAT to fit 14 effective layers in 7.25MB artifact."
+}