Record: Int6 QAT + SmearGate + Muon WD (val_bpb=1.1669) #170

Open

baudrillardsgh0st wants to merge 2 commits into openai:main from baudrillardsgh0st:submit/int6-qat-smeargate-muonwd

Conversation

@baudrillardsgh0st

Summary

  • val_bpb: 1.1669 — beats current SOTA (1.1748) by 0.0079
  • Artifact: 14.77 MB — well under 16MB cap
  • 9L/512dim/8H/4KV, 21.8M params, trained 9706 steps @ 61.8ms/step on 8×H100 SXM

Key Techniques

  1. Int6 QAT (Quantization-Aware Training): STE fake int6 quantization during forward pass with per-row symmetric scaling. Eliminates post-quant degradation without needing fp16 late-K layer passthrough.

  2. Int6-in-Int8 zstd22 compression: Store int6 values (-32 to 31) in int8 containers — zstd-22 compresses the restricted value range ~35%. Achieves 14.77MB from 21.8M params. (Bit-packing int6 values destroys byte alignment and defeats compressors.)

  3. SmearGate: ~513-param learned gate blending current + previous token embedding. Zero-initialized, very low LR. Provides cheap bigram context at the embedding layer.

  4. Decoupled Muon weight decay (0.01): Applied in the Muon optimizer step for improved generalization and quantization robustness.

  5. Sliding window evaluation (stride=64, batch=32 seqs): Full-context scoring at every token position.

  6. FP16 tied embedding passthrough: Avoids compounding int6 errors through both input/output paths.
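A minimal sketch of technique 1, the per-row symmetric int6 fake quantization (my own illustration, not the PR's code). Weights are rounded to the int6 grid and dequantized in the forward pass; during QAT the straight-through estimator treats the rounding as identity in the backward pass (in PyTorch, typically `w + (fake_quant(w) - w).detach()`):

```python
import numpy as np

def fake_quant_int6(w: np.ndarray) -> np.ndarray:
    """Per-row symmetric int6 fake quantization.

    Each row is scaled so its largest-magnitude entry maps to +/-31,
    rounded onto the int6 grid [-32, 31], then dequantized back to
    float. The round-trip error is at most half a quantization step.
    """
    scale = np.abs(w).max(axis=1, keepdims=True) / 31.0
    scale = np.maximum(scale, 1e-12)           # guard all-zero rows
    q = np.clip(np.round(w / scale), -32, 31)  # int6 grid
    return q * scale                           # dequantize

w = np.array([[0.50, -1.00, 0.25, 0.31]])
wq = fake_quant_int6(w)  # row max (-1.00) survives exactly
```

Note that the per-row maximum is reconstructed exactly, and every other entry lands within half a step (`scale / 2`) of its original value.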
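Technique 2 works because only 64 of 256 byte values occur when int6 values sit one-per-byte in int8 containers, so a general-purpose entropy coder needs about 6 bits per byte even on incompressible data, while 6-bit bit-packing produces near-random bytes the compressor cannot touch. A sketch using stdlib `zlib` as a stand-in for zstd level 22 (zstd is not in the Python standard library):

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)
# int6 values (-32..31) stored one-per-byte in int8 containers
vals = rng.integers(-32, 32, size=200_000).astype(np.int8)
raw = vals.tobytes()

packed = zlib.compress(raw, 9)  # stand-in for `zstd -22`
ratio = len(packed) / len(raw)
# Even on uniform random int6 data the coder reaches ~6/8 = 0.75;
# real weight distributions are peakier and compress further
# (the PR reports ~35% savings).
```

The decompressed bytes are bit-identical, so the int8 container round-trips losslessly.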
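A sketch of technique 3, SmearGate. The exact parameterization is my assumption: a per-dimension gate vector plus one scalar gives dim + 1 = 513 parameters at dim 512, and zero initialization makes the layer an identity at step 0, matching the "zero-initialized, very low LR" description:

```python
import numpy as np

def smear_gate(x: np.ndarray, g: np.ndarray, b: float = 0.0) -> np.ndarray:
    """Blend each token embedding with the previous token's embedding.

    x: (seq, dim) embeddings; g: (dim,) learned gate; b: learned scalar
    (together ~dim + 1 = 513 params at dim=512, both zero-initialized).
    Zero init => output == x, so training starts from the ungated model;
    the gate then learns how much cheap bigram context to mix in.
    """
    prev = np.vstack([np.zeros((1, x.shape[1])), x[:-1]])  # shift right
    return x + (g + b) * prev

x = np.random.default_rng(1).standard_normal((5, 8))
out = smear_gate(x, np.zeros(8))  # identity at init
```

With a nonzero gate, the first position is unchanged (it has no predecessor) while later positions pick up a scaled copy of the previous embedding.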
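Technique 4, decoupled weight decay, shrinks the parameter directly rather than folding `wd * p` into the gradient, so the decay is independent of the optimizer's update scaling. A minimal sketch, with `update` standing in for Muon's orthogonalized momentum update (the Muon internals are abstracted away here):

```python
import numpy as np

def step_with_decoupled_wd(p, update, lr, wd=0.01):
    """One optimizer step with AdamW-style decoupled weight decay.

    The decay multiplies the parameter in place (p *= 1 - lr*wd) before
    the update is applied, instead of being added to the gradient.
    """
    p = p * (1.0 - lr * wd)  # decoupled decay
    return p - lr * update

p = np.array([1.0, -2.0])
# with a zero update, only the decay acts: p * (1 - 0.1 * 0.01)
p_new = step_with_decoupled_wd(p, update=np.zeros(2), lr=0.1)
```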
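Technique 5 can be sketched as index bookkeeping: slide a full-context window forward by `stride` tokens and count loss only on the tokens not yet scored, so every token is evaluated exactly once with (up to) the full left context. This is my illustration of the scheme, not the PR's evaluation code; batching of 32 sequences is omitted:

```python
def sliding_eval_windows(n_tokens: int, ctx: int, stride: int):
    """Yield (begin, end, n_scored) triples.

    The model evaluates tokens [begin, end) but loss is counted only on
    the last n_scored positions of each window, so positions score once
    each while still conditioning on the preceding context.
    """
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + ctx, n_tokens)
        yield begin, end, end - prev_end
        prev_end = end
        if end == n_tokens:
            break

windows = list(sliding_eval_windows(n_tokens=300, ctx=128, stride=64))
```

Summing `n_scored` over all windows recovers the total token count, confirming each position is scored exactly once.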

Results

| Seed | val_loss | val_bpb | Steps | ms/step |
|------|----------|---------|-------|---------|
| 1337 | 1.9703   | 1.1669  | 9706  | 61.80   |

Artifact Size

| Component           | Bytes      |
|---------------------|------------|
| Model (int6+zstd22) | 14,696,046 |
| Code                | 71,909     |
| Total               | 14,767,955 |

Test plan

  • Trained on 8×H100 SXM in under 10 minutes (600s wallclock)
  • Artifact under 16MB (14.77MB)
  • val_bpb improvement over SOTA exceeds 0.005 threshold (0.0079)
  • train.log included with full training output
  • train_gpt.py runs standalone within the records folder

9L 512dim int6 QAT with STE, SmearGate, Muon weight decay 0.01,
int6-in-int8 zstd22 compression. 14.77MB artifact, 9706 steps @ 61.8ms/step.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
11-layer GPT with int6 QAT, SmearGate, and decoupled Muon weight decay 0.038.
Artifact: 15.50MB (int6+zstd-22). Single seed, 7723 steps at 77ms/step.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
