Record: Int6 MLP3x + SmearGate + BigramHash + MuonWD + SWA (mean val_bpb=1.1483) #162
Conversation
… (mean val_bpb=1.1483, 3 seeds)
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 14cdf6f7a4
```python
).reshape(bsz, seq_len)
for i, ws in enumerate(batch_ws):
    wlen = wlens[i]
    s = 0 if ws == 0 else max(wlen - stride, 0)
```
Avoid double-counting tail tokens in sliding eval
The sliding-window scorer can count the same validation tokens more than once near the end of the corpus. With s = 0 if ws == 0 else max(wlen - stride, 0), any non-first window where wlen < seq_len scores from offset max(wlen - stride, 0), which re-scores tokens already covered by the previous window (e.g., with seq_len=8, stride=4, and a 10-token corpus, the final 6-token window re-scores two tokens). This biases the reported val_loss/val_bpb, which can skew experiment comparisons.
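One way to avoid the double counting is to derive each window's score-start offset from how many tokens have already been scored, rather than from wlen alone. A minimal sketch (the helper name and the wlens/stride conventions mirror the snippet under review, but this is an illustration, not the repo's implementation):

```python
def window_score_starts(wlens, stride):
    """For each sliding window, return the offset of the first *new* token,
    so tokens already scored by the previous window are skipped.

    wlens[i] is the length of window i; windows advance by `stride` tokens.
    (Hypothetical helper; not taken from the repo under review.)
    """
    starts = []
    scored = 0  # number of corpus tokens scored so far
    pos = 0     # absolute start position of the current window
    for wlen in wlens:
        # Score only tokens beyond what previous windows already covered.
        s = max(scored - pos, 0)
        starts.append(s)
        scored = max(scored, pos + wlen)
        pos += stride
    return starts
```

On the example from the comment (10-token corpus, seq_len=8, stride=4, so wlens=[8, 6]), this yields starts [0, 4]: the second window scores only its last two tokens, and every corpus token is scored exactly once.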
…SWA — improved config (Muon WD=0.04, SWA every 50), mean val_bpb=1.1458
Updated submission — improved configuration after systematic hyperparameter sweeps:
New results (3 seeds):
Previous submission mean was 1.1483 → now 1.1458, an improvement of 0.0025 bpb from tuning weight decay and SWA frequency alone.
- SmearGate: learned per-dim gate blending x[t] with x[t-1] (~512 params); USE_SMEAR_GATE=1 to enable
- BigramHash: hash(tok[t-1], tok[t]) -> 4096-bucket embed(128) -> proj(512) (~524K params); USE_BIGRAM_HASH=1 to enable
- Both disabled by default for backward compatibility
- forward_with_adapter refactored to reuse _forward_body
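The SmearGate blend above can be sketched in a few lines of numpy (the sigmoid squashing and zero-padding of the first position are assumptions; the record's implementation may differ):

```python
import numpy as np

def smear_gate(x, g_logits):
    """Blend each position with its predecessor, per dimension.

    x:        (seq_len, dim) activations
    g_logits: (dim,) learned gate logits (~512 params at dim=512)
    Returns (1 - g) * x[t] + g * x[t-1], with x[-1] taken as zeros.
    (Illustrative sketch, not the submission's code.)
    """
    g = 1.0 / (1.0 + np.exp(-g_logits))  # sigmoid -> gate in (0, 1)
    x_prev = np.roll(x, 1, axis=0)
    x_prev[0] = 0.0                      # position 0 has no predecessor
    return (1.0 - g) * x + g * x_prev
```

At zero-initialized logits the gate is 0.5 everywhere, so each position starts as an even mix of itself and its predecessor; training moves each dimension's gate independently.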
- STE QAT: fake quantize->dequantize in CastedLinear forward pass; gradients pass through via STE (w + (w_hat - w).detach())
- Activates after STE_QAT_START_FRAC of training (default 25%); USE_STE_QAT=1 to enable
- forward_with_adapter refactored to reuse _forward_body
- All Tier 2 features are env-var controlled, disabled by default
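The fake quantize->dequantize forward pass can be sketched as follows; the per-tensor symmetric scale is an assumption (the record may use per-channel scales), and the straight-through trick itself lives in autograd, noted in the docstring:

```python
import numpy as np

def fake_quant_int6(w, eps=1e-8):
    """Quantize weights to int6 levels and dequantize back (QAT forward).

    Symmetric per-tensor scheme (an assumption): 6 bits -> levels in
    [-31, 31]. In the actual training graph, the straight-through
    estimator keeps gradients flowing:
        w_ste = w + (w_hat - w).detach()
    so the backward pass sees d(w_ste)/d(w) = 1 while the forward pass
    uses the quantized w_hat. (Illustrative sketch, not the repo's code.)
    """
    qmax = 31                                   # 2**(6-1) - 1
    scale = np.max(np.abs(w)) / qmax + eps
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale                            # dequantized "fake" weights
```

Running this on unquantized weights shows the rounding error the model learns to absorb once STE QAT activates partway through training.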
Int6 MLP3x + SmearGate + BigramHash + OrthoInit + Muon WD + SWA
Mean val_bpb: 1.1483 (3 seeds: 1.1488, 1.1485, 1.1476)
Trained on 8×H100 SXM in 600 seconds. 15.92MB artifact (int6+zstd-22).
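The int6 artifact size comes from packing 6-bit values densely before zstd. A sketch of one such packing (4 values per 3 bytes; the packing order and two's-complement layout are assumptions, not the submission's format):

```python
import numpy as np

def pack_int6(q):
    """Pack signed int6 values (range [-32, 31]) into bytes, 4 values per
    3 bytes, so the raw artifact is ~6/8 the size of int8 before zstd.
    (Hypothetical layout for illustration.)
    """
    assert q.size % 4 == 0
    u = (q.astype(np.int64) & 0x3F).reshape(-1, 4)   # two's-complement 6-bit
    b = np.empty((u.shape[0], 3), dtype=np.uint8)
    b[:, 0] = (u[:, 0] << 2) | (u[:, 1] >> 4)
    b[:, 1] = ((u[:, 1] & 0xF) << 4) | (u[:, 2] >> 2)
    b[:, 2] = ((u[:, 2] & 0x3) << 6) | u[:, 3]
    return b.reshape(-1)

def unpack_int6(b):
    """Inverse of pack_int6; restores the signed int6 values."""
    b = b.reshape(-1, 3).astype(np.int64)
    u = np.empty((b.shape[0], 4), dtype=np.int64)
    u[:, 0] = b[:, 0] >> 2
    u[:, 1] = ((b[:, 0] & 0x3) << 4) | (b[:, 1] >> 4)
    u[:, 2] = ((b[:, 1] & 0xF) << 2) | (b[:, 2] >> 6)
    u[:, 3] = b[:, 2] & 0x3F
    u = u.reshape(-1)
    return np.where(u >= 32, u - 64, u)              # sign-extend bit 5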
Key Techniques
Hyperparameters
9 layers, 512 dim, MLP 3×, seq_len=2048, batch=786K tokens, warmdown=3000 steps, matrix_lr=0.02, grad_clip=0.3, muon_momentum=0.99.
Metrics