Skip to content

Depth Recurrence via Layer Sharing (3 shared blocks → 1/3 params, matched BPB)#167

Open
SkywardSyntax wants to merge 6 commits intoopenai:mainfrom
SkywardSyntax:submission/depth-recurrence-layer-sharing
Open

Depth Recurrence via Layer Sharing (3 shared blocks → 1/3 params, matched BPB)#167
SkywardSyntax wants to merge 6 commits intoopenai:mainfrom
SkywardSyntax:submission/depth-recurrence-layer-sharing

Conversation

@SkywardSyntax
Copy link

Summary

  • 3 unique transformer blocks cycled over 9 virtual layers (ALBERT-style weight sharing)
  • Matches baseline BPB with 1/3 the unique parameters (6M vs 17M) and 30% faster training
  • Compressed model size drops from ~5MB to ~1.6MB (int8+zlib), freeing ~10MB of artifact budget
  • Minimal code change: ~20 lines added to train_gpt.py via NUM_UNIQUE_LAYERS env var

Why It Matters

The freed budget can be spent on:

  • Wider model (768d+)
  • MLP 3x expansion
  • Larger vocabulary (2048+)
  • All of the above, combined with int6 + zstd

Unique eval-time trick: since shared blocks are trained to be applied iteratively, running extra recurrence cycles at eval time gives free BPB improvement — not possible with unique-layer architectures.

Local Validation (Apple Silicon, 500K-token FineWeb subset, 100 steps)

Config Unique Params Depth Post-quant BPB int8+zlib
Baseline (9 unique, 512d) 17.1M 9 3.157 ~5.0MB
3 shared, 512d 6.0M 9 3.151 ~1.6MB
3 shared, 640d 8.5M 12 3.174 2.4MB

Relative differences are consistent across configs. Absolute numbers not comparable to H100.

Composability

Stacks with the dominant meta (int6, MLP 3x, sliding window eval, FP16 embed, zstd-22). No top-10 submission currently combines depth recurrence with the full int6 stack — this is an open opportunity.

How to Run

# Drop-in: same speed as baseline, 1/3 params
NUM_UNIQUE_LAYERS=3 torchrun --nproc_per_node=8 train_gpt.py

# Wider + deeper recurrence
NUM_UNIQUE_LAYERS=3 MODEL_DIM=640 NUM_HEADS=10 NUM_KV_HEADS=2 NUM_LAYERS=12 \
  torchrun --nproc_per_node=8 train_gpt.py

Test Plan

  • Local validation on Apple Silicon (mini FineWeb shards)
  • Verified int8+zlib quantization roundtrip with shared layers
  • Confirmed optimizer (Muon + Adam) handles shared params correctly
  • Full 10-minute run on 8xH100 (awaiting compute)
  • Ablation: layer sharing + int6 + MLP 3x combined

3 unique transformer blocks cycled over 9 virtual layers achieves
the same BPB as the baseline with 1/3 the parameters and 30% faster
training. Frees ~10MB of artifact budget for wider models, MLP 3x,
or larger vocabularies. Validated locally, awaiting H100 compute.
@SkywardSyntax SkywardSyntax reopened this Mar 20, 2026
@SkywardSyntax SkywardSyntax force-pushed the submission/depth-recurrence-layer-sharing branch from c1bdde9 to 4e760ac Compare March 20, 2026 04:56
SkywardSyntax and others added 2 commits March 20, 2026 11:45
- train_gpt_mlx_exp.py: MLX script with depth recurrence, per-layer scales,
  repeat embeddings, sliding window eval, DEQ convergence eval, FTLE
  sensitivity tracking, QAT, nuclear norm, SwiGLU, bounded recurrence
- train_gpt_submission.py: CUDA/H100 script with layer sharing + Muon WD +
  label smoothing + eval knobs
- make_mini_shards.py: data subset creator for local testing
- EXPERIMENTS.md: full strategy doc with competition analysis and all results
- NOTES.md: dev notes for resuming work

Best local result: 1.783 BPB (2 shared 256d MLP3x depth 6, 1.45M params)
…bandoned

New files:
- train_exp.py: Full experimental script combining WarmdownQuantization base
  with BigramHash, SmearGate, SWA, layer sharing, zstd, Muon WD, OrthoInit
- EXPERIMENT_LOG.md: Detailed results from 7 A100 experiments

Key findings:
- Layer sharing (3 shared x 9 virtual) loses 0.09 BPB vs 9 unique at 512d
- BigramHash + SmearGate: +0.003 BPB over baseline (1.4384 vs 1.4417)
- Aggressive warmdown (WARMDOWN_ITERS=20000) needs tuning per hardware
- 9L faster than 10L on 1xA100 short runs; 10L expected better on 8xH100

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@SkywardSyntax SkywardSyntax force-pushed the submission/depth-recurrence-layer-sharing branch from 5625dd9 to 4e760ac Compare March 20, 2026 20:04
Nikhil Reddy Battapati and others added 3 commits March 20, 2026 16:37
…1.3260 BPB

Key improvements to train_exp.py:
- BigramHash: XOR hash with coprime multipliers, 128-dim, zero-init, learned scale (matching PR openai#162)
- SmearGate: single gate after embed+RMSNorm (not per-block), fixed gate direction
- SWA early-start bug fix (minimum 100 steps before activation)
- FTLE-lite sensitivity-aware mixed-precision quantization (experimental)
- Eval-time extra recurrence support (not useful for non-shared models)
- Sliding window eval safety: skips if estimated time > 600s

Best A100 results: 1.3260 BPB (9L, sliding window stride=1024, zstd-22)
Previous best was 1.4384 BPB — a 0.112 BPB improvement from bug fixes + eval strategy.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Round 3: Sliding window eval gives 1.3260 BPB (best legal result).
FTLE-lite helps quality but busts 16MB limit. Eval-time recurrence
useless on non-shared models.

Competition intel: SOTA is 1.1428 (merged). Key gaps identified:
WD=0.04, 11L, BigramHash(10240), RoPE base 50000, batched stride-64
sliding window, periodic SWA not continuous.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PR openai#215 found Q projection matrices have condition numbers >100M,
meaning Q naturally operates in a low-rank subspace. Factoring
Q as W_down(dim→r) @ W_up(r→dim) with r=192 saves 2.6% params
and ~22% step time on H100, yielding ~28% more training steps.

Enable with Q_RANK=192 (default 0 = full rank, no change).
K, V, O projections remain full rank (they ARE full rank).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant