# V5: TTT LoRA + Dual Bigram + Label Smoothing + Z-Loss

## Strategy
thwu1's winning 1.1428 bpb architecture + 4 novel innovations that no competitor uses.

## Architecture (from thwu1)
- **10 transformer layers**, GQA (8 heads, 4 KV heads), model_dim=512
- **3x MLP** (hidden=1536), ReLU^2 activation
- **Learned BigramHashEmbedding** (10240 entries, 128-dim, XOR hash)
- **SmearGate**: learned gate blending each token with previous
- **U-Net skip connections** with learned blending weights
- **Tied embeddings** (FP16 passthrough for tok_emb + last layer c_k)
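The XOR-hash bigram embedding above can be sketched as follows. This is a minimal NumPy illustration, not the actual implementation: the multiplier constant and the zero-padding of the first position are assumptions; only the table shape (10240 entries, 128-dim) and the XOR-hash idea come from the description.

```python
import numpy as np

def bigram_hash(prev_tok, cur_tok, table_size=10240, mult=0x9E3779B1):
    # Mix the two token ids (previous id multiplied by an odd constant
    # to break symmetry, then XORed with the current id) and reduce
    # modulo the table size -- a hashing trick so the full bigram space
    # fits in a small learned table.
    return ((prev_tok * mult) ^ cur_tok) % table_size

rng = np.random.default_rng(0)
table = rng.normal(0.0, 0.02, size=(10240, 128))  # learned in practice

tokens = np.array([17, 4, 999, 4])
prev = np.concatenate([[0], tokens[:-1]])   # assumed: pad position 0 with id 0
idx = bigram_hash(prev, tokens)
bigram_emb = table[idx]                     # [seq_len, 128], added to tok_emb
```

Hash collisions are tolerated by design; the table simply learns a shared vector for colliding bigrams.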

## Training (from thwu1)
- **Muon optimizer**: momentum=0.99, weight_decay=0.04, Newton-Schulz
- **Linear warmdown** (wallclock-aware, 3000 iters)
- **SWA** (last 40%, every 50 steps, accumulated on CPU)
- **Low LRs**: matrix=0.02, scalar=0.02, tied_embed=0.03
- **Train shape**: seq_len=2048, batch=786,432 tokens
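The linear warmdown can be sketched as a schedule multiplier. The actual run triggers warmdown from wallclock time rather than a fixed step count, and the 10,000-step total below is a hypothetical; only the 3000-iteration warmdown window and the 0.02 matrix LR come from the description.

```python
def lr_scale(step, total_steps, warmdown_iters=3000):
    # Hold LR constant until the final warmdown window, then decay
    # linearly to zero over the last `warmdown_iters` steps. (The real
    # run triggers warmdown from wallclock time, not a step count.)
    steps_left = total_steps - step
    if steps_left >= warmdown_iters:
        return 1.0
    return steps_left / warmdown_iters

# e.g. with a hypothetical 10,000-step run, 500 steps from the end:
matrix_lr = 0.02 * lr_scale(step=9500, total_steps=10000)
```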

## Quantization & Compression (from thwu1)
- **Mixed Int5/Int6**: Int5 (clip=15) for MLP, Int6 (clip=31) for attention
- **3% magnitude pruning** (zeros compress well)
- **zstd level 22** compression
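A minimal sketch of the prune-then-quantize step, assuming symmetric round-to-nearest quantization with a per-tensor scale (the actual bit-packing and zstd stage are not shown; `quantize` and its threshold choice are illustrative, only the clip values 15/31 and the 3% prune fraction come from the description):

```python
import numpy as np

def quantize(w, clip, prune_frac=0.03):
    # Magnitude-prune the smallest `prune_frac` of weights (zeros
    # compress very well under zstd), then symmetric round-to-nearest
    # quantization onto integers in [-clip, clip].
    w = w.copy()
    thresh = np.quantile(np.abs(w), prune_frac)
    w[np.abs(w) < thresh] = 0.0
    scale = np.abs(w).max() / clip
    q = np.clip(np.round(w / scale), -clip, clip).astype(np.int8)
    return q, scale                 # dequantize as q * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(1536, 512)).astype(np.float32)
q5, s5 = quantize(w, clip=15)       # "Int5" range for MLP weights
q6, s6 = quantize(w, clip=31)       # "Int6" range for attention weights
```

Using the wider Int6 range for attention and the narrower Int5 range for the (larger) MLP matrices trades a little MLP precision for a smaller compressed artifact.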

## Novel Innovations (V5)

### 1. TTT LoRA at Eval (biggest gain: -0.010 to -0.025 bpb)
Per-document LoRA adapters (rank=8) on Q, V projections and LM head.
Each document gets its own adapter trained chunk-by-chunk (256 tokens).
Adapters reset between documents. Uses Adam (lr=0.01).
**No competitor in the top 5 uses TTT.**
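One way the TTT loop could be structured, as a NumPy sketch: a toy LoRA on the LM head only, with plain SGD standing in for Adam. The real adapters also cover the Q and V projections and use Adam (lr=0.01); only the rank (8), chunk size (256), and per-document reset come from the description.

```python
import numpy as np

def ttt_eval(docs, W, rank=8, lr=0.01, chunk=256):
    # Per-document test-time training: a rank-`rank` LoRA (A @ Bm) on a
    # toy LM head W [dim, vocab], adapted chunk-by-chunk. Each chunk is
    # scored BEFORE the update, and the adapter is discarded (reset)
    # when the document ends.
    d, v = W.shape
    rng = np.random.default_rng(0)
    losses = []
    for hiddens, targets in docs:
        A = np.zeros((d, rank))                # fresh adapter per document
        Bm = rng.normal(size=(rank, v)) * 0.01
        for s in range(0, len(targets), chunk):
            h, y = hiddens[s:s+chunk], targets[s:s+chunk]
            logits = h @ (W + A @ Bm)
            p = np.exp(logits - logits.max(axis=1, keepdims=True))
            p /= p.sum(axis=1, keepdims=True)
            losses.append(-np.log(p[np.arange(len(y)), y]).mean())
            g = (p - np.eye(v)[y]) / len(y)    # softmax-CE gradient
            dH = h.T @ g                       # grad wrt the adapted head
            A -= lr * dH @ Bm.T                # plain SGD (real run: Adam)
            Bm -= lr * A.T @ dH
    return float(np.mean(losses))

rng = np.random.default_rng(1)
doc = (rng.normal(size=(512, 16)), rng.integers(0, 32, size=512))
avg_nll = ttt_eval([doc], rng.normal(size=(16, 32)) * 0.1)
```

Scoring before each update keeps the evaluation causal: the adapter only ever sees tokens it has already been scored on.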

### 2. Dual Bigram inside TTT (-0.001 to -0.003 bpb)
Post-hoc statistical [1024, 1024] bigram residual table added to the logits
**inside** the TTT eval loop, so the LoRA adapters adapt on top of the
bigram-corrected predictions rather than fighting them.
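A hypothetical sketch of how such a table could be built and applied. The mod-1024 vocab bucketing, Laplace smoothing, and log-probability residual form are all assumptions; only the [1024, 1024] shape and the idea of adding a scaled statistical residual to the logits come from the description (the 0.3 scale matches EVAL_BIGRAM_SCALE in the config).

```python
import numpy as np

B = 1024                                  # bigram table is [1024, 1024]
counts = np.ones((B, B))                  # Laplace-smoothed counts
# ... accumulate counts[prev % B, nxt % B] over the training stream ...
log_bigram = np.log(counts / counts.sum(axis=1, keepdims=True))

def blended_logits(model_logits, prev_token, scale=0.3):
    # Add the scaled bigram log-prob residual for the previous token's
    # row to the model's logits (vocab ids bucketed mod 1024 -- an
    # assumed bucketing scheme).
    vocab = model_logits.shape[0]
    residual = log_bigram[prev_token % B][np.arange(vocab) % B]
    return model_logits + scale * residual

out = blended_logits(np.zeros(2048), prev_token=7)
```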

### 3. Label Smoothing (0.05) (-0.001 to -0.003 bpb)
Prevents overconfident predictions, improves calibration and
quantization robustness.
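Label smoothing with eps=0.05 amounts to the following loss, sketched here in NumPy for a single position (the example logit vector is illustrative):

```python
import numpy as np

def smoothed_ce(logits, target, eps=0.05):
    # Log-softmax, then cross-entropy against a softened target:
    # (1 - eps) mass on the true class, eps spread uniformly over the
    # vocab. Overconfident predictions stop paying off, which improves
    # calibration and leaves headroom for quantization error.
    m = logits.max()
    logp = logits - m - np.log(np.exp(logits - m).sum())
    return float(-(1 - eps) * logp[target] - eps * logp.mean())

confident = np.array([12.0, 0.0, 0.0, 0.0])
plain  = smoothed_ce(confident, target=0, eps=0.0)  # standard NLL, near 0
smooth = smoothed_ce(confident, target=0)           # smoothing adds a floor
```

The non-zero floor on confident predictions is exactly what discourages the extreme logit magnitudes that quantization handles worst.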

### 4. Z-Loss Regularization (1e-4) (-0.001 to -0.002 bpb)
Penalizes large logit magnitudes: `z_loss = 1e-4 * mean(logsumexp(logits)^2)`.
Directly improves post-quantization performance.
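The z-loss formula above can be written out directly (a NumPy sketch of the stated formula; the zero-logit example is illustrative):

```python
import numpy as np

def z_loss(logits, coeff=1e-4):
    # Penalize the log-partition Z = logsumexp(logits, axis=-1):
    # coeff * mean(Z^2) pulls overall logit magnitudes toward zero,
    # shrinking the dynamic range the Int5/Int6 quantizer must cover.
    m = logits.max(axis=-1)
    z = m + np.log(np.exp(logits - m[..., None]).sum(axis=-1))
    return coeff * float(np.mean(z ** 2))

penalty = z_loss(np.zeros((2, 4)))   # each row has Z = log(4)
```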

## Reproduction
```bash
pip install zstandard
cd /workspace/parameter-golf
python data/cached_challenge_fineweb.py
torchrun --standalone --nproc_per_node=8 \
records/track_10min_16mb/2026-03-20_Combined_Int6_QAT_SlidingWindow/train_gpt.py
```

## Fallback
If TTT eval exceeds time budget: `TTT_ENABLED=0` falls back to sliding window eval.

## Expected Results
Target: ~1.115-1.135 bpb, beating thwu1's 1.1428 by 0.008-0.028 bpb.
## Submission Metadata

```json
{
  "name": "V5_TTT_LoRA_DualBigram_LabelSmoothing",
  "author": "DizSi",
  "date": "2026-03-21",
  "track": "10min_16mb",
  "description": "V5: thwu1's 1.1428 architecture + 4 novel innovations. TTT LoRA eval (per-document rank-8 Q/V/LM-head adapters), dual bigram (learned BigramHash + post-hoc statistical table inside TTT), label smoothing (0.05), z-loss (1e-4). Core: 10L GQA (8h/4kv), 3xMLP, SmearGate, Mixed Int5/Int6, SWA, magnitude pruning, zstd-22, linear warmdown, Muon 0.99.",
  "base_submissions": [
    "thwu1_10L_Int5MLP_BigramHash_SWA"
  ],
  "env": {
    "WARMDOWN_ITERS": 3000,
    "MATRIX_LR": 0.02,
    "TIED_EMBED_LR": 0.03,
    "SCALAR_LR": 0.02,
    "GRAD_CLIP_NORM": 0.3,
    "WEIGHT_DECAY": 0.04,
    "MUON_MOMENTUM": 0.99,
    "MUON_MOMENTUM_WARMUP_STEPS": 1500,
    "NUM_LAYERS": 10,
    "MODEL_DIM": 512,
    "MLP_MULT": 3.0,
    "NUM_HEADS": 8,
    "NUM_KV_HEADS": 4,
    "BIGRAM_VOCAB_SIZE": 10240,
    "BIGRAM_DIM": 128,
    "TRAIN_SEQ_LEN": 2048,
    "EVAL_STRIDE": 64,
    "SWA_START_FRAC": 0.4,
    "PRUNE_FRAC": 0.03,
    "TTT_ENABLED": true,
    "TTT_LORA_RANK": 8,
    "TTT_LORA_LR": 0.01,
    "TTT_CHUNK_SIZE": 256,
    "TTT_EVAL_SEQ_LEN": 1024,
    "TTT_BATCH_SIZE": 64,
    "LABEL_SMOOTHING": 0.05,
    "Z_LOSS_COEFF": 0.0001,
    "EVAL_BIGRAM_TABLE": true,
    "EVAL_BIGRAM_SCALE": 0.3
  }
}
```