Depth Recurrence via Layer Sharing (3 shared blocks → 1/3 params, matched BPB)#167
Open
SkywardSyntax wants to merge 6 commits into openai:main from
Conversation
3 unique transformer blocks cycled over 9 virtual layers achieves the same BPB as the baseline with 1/3 the parameters and 30% faster training. Frees ~10MB of artifact budget for wider models, MLP 3x, or larger vocabularies. Validated locally, awaiting H100 compute.
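The scheme in the summary can be sketched as follows. This is a minimal sketch: the block here is MLP-only for brevity (a real block also has attention), and the `Block`/`SharedDepthModel` names are illustrative, not the PR's actual classes.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Stand-in for a transformer block (MLP-only here for brevity)."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 3 * dim),  # MLP 3x expansion
            nn.GELU(),
            nn.Linear(3 * dim, dim),
        )

    def forward(self, x):
        return x + self.mlp(self.norm(x))

class SharedDepthModel(nn.Module):
    """3 unique blocks cycled over 9 virtual layers: cycle position i is
    reused at virtual depths i, i+3, i+6, so the stack stores 1/3 the
    parameters of an equivalent 9-unique-layer model."""
    def __init__(self, dim, num_unique=3, virtual_depth=9):
        super().__init__()
        self.blocks = nn.ModuleList(Block(dim) for _ in range(num_unique))
        self.virtual_depth = virtual_depth

    def forward(self, x):
        for d in range(self.virtual_depth):
            x = self.blocks[d % len(self.blocks)](x)  # cycle shared weights
        return x

model = SharedDepthModel(dim=64)
n_params = sum(p.numel() for p in model.parameters())
```

Parameter count depends only on `num_unique`, not `virtual_depth`, which is where the 1/3-params claim comes from.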
Force-pushed from c1bdde9 to 4e760ac
New files:
- train_gpt_mlx_exp.py: MLX script with depth recurrence, per-layer scales, repeat embeddings, sliding window eval, DEQ convergence eval, FTLE sensitivity tracking, QAT, nuclear norm, SwiGLU, bounded recurrence
- train_gpt_submission.py: CUDA/H100 script with layer sharing + Muon WD + label smoothing + eval knobs
- make_mini_shards.py: data subset creator for local testing
- EXPERIMENTS.md: full strategy doc with competition analysis and all results
- NOTES.md: dev notes for resuming work

Best local result: 1.783 BPB (2 shared 256d MLP3x depth 6, 1.45M params)
…bandoned

New files:
- train_exp.py: Full experimental script combining WarmdownQuantization base with BigramHash, SmearGate, SWA, layer sharing, zstd, Muon WD, OrthoInit
- EXPERIMENT_LOG.md: Detailed results from 7 A100 experiments

Key findings:
- Layer sharing (3 shared x 9 virtual) loses 0.09 BPB vs 9 unique at 512d
- BigramHash + SmearGate: +0.003 BPB over baseline (1.4384 vs 1.4417)
- Aggressive warmdown (WARMDOWN_ITERS=20000) needs tuning per hardware
- 9L faster than 10L on 1xA100 short runs; 10L expected better on 8xH100

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Force-pushed from 5625dd9 to 4e760ac
…1.3260 BPB

Key improvements to train_exp.py:
- BigramHash: XOR hash with coprime multipliers, 128-dim, zero-init, learned scale (matching PR openai#162)
- SmearGate: single gate after embed+RMSNorm (not per-block), fixed gate direction
- SWA early-start bug fix (minimum 100 steps before activation)
- FTLE-lite sensitivity-aware mixed-precision quantization (experimental)
- Eval-time extra recurrence support (not useful for non-shared models)
- Sliding window eval safety: skips if estimated time > 600s

Best A100 result: 1.3260 BPB (9L, sliding window stride=1024, zstd-22). Previous best was 1.4384 BPB, a 0.112 BPB improvement from bug fixes + eval strategy.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
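The BigramHash described above (XOR hash, coprime multipliers, 128-dim, zero-init table, learned scale) can be sketched roughly as below. This is a guess at the structure, not the PR's code: the multiplier constants and the 10240 table size (a number taken from the competition-intel note elsewhere in this thread) are illustrative.

```python
import torch
import torch.nn as nn

class BigramHash(nn.Module):
    """Hashed bigram embedding sketch: each (prev, cur) token pair is
    mixed with two coprime multipliers and XOR-folded into a fixed-size
    table. The table is zero-initialized and gated by a learned scalar
    scale, so the module starts as an exact no-op."""
    def __init__(self, table_size=10240, dim=128):
        super().__init__()
        self.table = nn.Embedding(table_size, dim)
        nn.init.zeros_(self.table.weight)           # zero-init: no-op at start
        self.scale = nn.Parameter(torch.zeros(()))  # learned scale
        self.table_size = table_size

    def forward(self, tokens):  # tokens: (batch, seq) int64
        prev = torch.roll(tokens, shifts=1, dims=1)
        prev[:, 0] = 0  # no real bigram at the first position
        # 1000003 and 999983 are distinct primes, hence coprime (illustrative)
        h = (prev * 1000003) ^ (tokens * 999983)
        idx = h % self.table_size
        return self.scale * self.table(idx)
```

Because both the table and the scale start at zero, adding this to an existing embedding path cannot hurt the initial loss; the model has to learn to use it.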
Round 3: Sliding window eval gives 1.3260 BPB (best legal result). FTLE-lite helps quality but busts the 16MB limit. Eval-time recurrence is useless on non-shared models.

Competition intel: SOTA is 1.1428 (merged). Key gaps identified: WD=0.04, 11L, BigramHash(10240), RoPE base 50000, batched stride-64 sliding window, periodic SWA not continuous.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PR openai#215 found Q projection matrices have condition numbers >100M, meaning Q naturally operates in a low-rank subspace. Factoring Q as W_down(dim→r) @ W_up(r→dim) with r=192 saves 2.6% params and ~22% step time on H100, yielding ~28% more training steps. Enable with Q_RANK=192 (default 0 = full rank, no change). K, V, O projections remain full rank (they ARE full rank). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
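The factorization described above can be sketched as follows. This is a minimal illustration, not the submission's code: the `LowRankQ` name and the dim=768 used in the test are assumptions, and the ~2.6% savings quoted in the PR is relative to total model parameters (for the Q matrix alone, r=192 at dim=768 cuts its parameters in half).

```python
import torch
import torch.nn as nn

class LowRankQ(nn.Module):
    """Low-rank Q projection sketch: replaces the dim x dim Q matrix with
    W_up(r -> dim) @ W_down(dim -> r), i.e. 2*dim*r params instead of
    dim*dim. Q_RANK=0 (the default) keeps the full-rank projection."""
    def __init__(self, dim, q_rank=0):
        super().__init__()
        if q_rank > 0:
            self.proj = nn.Sequential(
                nn.Linear(dim, q_rank, bias=False),  # W_down: dim -> r
                nn.Linear(q_rank, dim, bias=False),  # W_up:   r -> dim
            )
        else:
            self.proj = nn.Linear(dim, dim, bias=False)  # full rank

    def forward(self, x):
        return self.proj(x)
```

The rationale from PR openai#215 is that condition numbers >100M mean the trained Q matrix already lives in a low-rank subspace, so constraining it to rank r mostly removes redundant parameters; K, V, and O do not show this and stay full rank.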
Summary
Implemented in train_gpt.py via the NUM_UNIQUE_LAYERS env var.

Why It Matters
The freed budget can be spent on:
- wider models
- MLP 3x
- larger vocabularies
Unique eval-time trick: since shared blocks are trained to be applied iteratively, running extra recurrence cycles at eval time gives free BPB improvement — not possible with unique-layer architectures.
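The eval-time trick can be sketched as a forward pass that appends extra passes of the shared cycle after the trained virtual depth. The function name and signature are illustrative, assuming the shared blocks are held in a list-like container:

```python
import torch
import torch.nn as nn

def forward_with_extra_cycles(blocks, x, virtual_depth=9, extra_cycles=1):
    """Run the shared-block cycle for the trained virtual depth, then
    `extra_cycles` additional full cycles. Because the same weights are
    trained to be applied iteratively, the extra cycles cost no extra
    parameters; a unique-layer stack has no weights to reuse this way."""
    n = len(blocks)
    total_depth = virtual_depth + extra_cycles * n
    for d in range(total_depth):
        x = blocks[d % n](x)  # same cycling rule as training, just longer
    return x
```

Note this trades eval compute for quality; the parameter budget (and hence the artifact size) is unchanged.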
Local Validation (Apple Silicon, 500K-token FineWeb subset, 100 steps)
Relative differences are consistent across configs. Absolute numbers are not comparable to H100 runs.
Composability
Stacks with the dominant meta (int6, MLP 3x, sliding window eval, FP16 embed, zstd-22). No top-10 submission currently combines depth recurrence with the full int6 stack — this is an open opportunity.
How to Run
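A hypothetical invocation, assuming the defaults described in the Summary (`train_gpt.py` and the `NUM_UNIQUE_LAYERS` env var are from this PR; the exact default behavior when the variable is unset is an assumption):

```shell
# 3 unique blocks cycled over the full virtual depth (1/3 params)
NUM_UNIQUE_LAYERS=3 python train_gpt.py

# Baseline: leave NUM_UNIQUE_LAYERS unset for unique layers throughout
python train_gpt.py
```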
Test Plan