Depth Recurrence via Layer Sharing (3 shared blocks → 1/3 params, matched BPB) by SkywardSyntax · Pull Request #167 · openai/parameter-golf

SkywardSyntax · 2026-03-20T04:15:43Z

Summary

3 unique transformer blocks cycled over 9 virtual layers (ALBERT-style weight sharing)
Matches baseline BPB with 1/3 the unique parameters (6M vs 17M) and 30% faster training
Compressed model size drops from ~5MB to ~1.6MB (int8+zlib), freeing ~10MB of artifact budget
Minimal code change: ~20 lines added to train_gpt.py via NUM_UNIQUE_LAYERS env var

Why It Matters

The freed budget can be spent on:

Wider model (768d+)
MLP 3x expansion
Larger vocabulary (2048+)
All of the above, combined with int6 + zstd

Unique eval-time trick: since shared blocks are trained to be applied iteratively, running extra recurrence cycles at eval time gives free BPB improvement — not possible with unique-layer architectures.

Local Validation (Apple Silicon, 500K-token FineWeb subset, 100 steps)

Config	Unique Params	Depth	Post-quant BPB	int8+zlib
Baseline (9 unique, 512d)	17.1M	9	3.157	~5.0MB
3 shared, 512d	6.0M	9	3.151	~1.6MB
3 shared, 640d	8.5M	12	3.174	2.4MB

Relative differences are consistent across configs. Absolute numbers not comparable to H100.

Composability

Stacks with the dominant meta (int6, MLP 3x, sliding window eval, FP16 embed, zstd-22). No top-10 submission currently combines depth recurrence with the full int6 stack — this is an open opportunity.

How to Run

# Drop-in: same speed as baseline, 1/3 params
NUM_UNIQUE_LAYERS=3 torchrun --nproc_per_node=8 train_gpt.py

# Wider + deeper recurrence
NUM_UNIQUE_LAYERS=3 MODEL_DIM=640 NUM_HEADS=10 NUM_KV_HEADS=2 NUM_LAYERS=12 \
  torchrun --nproc_per_node=8 train_gpt.py

Test Plan

Local validation on Apple Silicon (mini FineWeb shards)
Verified int8+zlib quantization roundtrip with shared layers
Confirmed optimizer (Muon + Adam) handles shared params correctly
Full 10-minute run on 8xH100 (awaiting compute)
Ablation: layer sharing + int6 + MLP 3x combined

3 unique transformer blocks cycled over 9 virtual layers achieves the same BPB as the baseline with 1/3 the parameters and 30% faster training. Frees ~10MB of artifact budget for wider models, MLP 3x, or larger vocabularies. Validated locally, awaiting H100 compute.

- train_gpt_mlx_exp.py: MLX script with depth recurrence, per-layer scales, repeat embeddings, sliding window eval, DEQ convergence eval, FTLE sensitivity tracking, QAT, nuclear norm, SwiGLU, bounded recurrence - train_gpt_submission.py: CUDA/H100 script with layer sharing + Muon WD + label smoothing + eval knobs - make_mini_shards.py: data subset creator for local testing - EXPERIMENTS.md: full strategy doc with competition analysis and all results - NOTES.md: dev notes for resuming work Best local result: 1.783 BPB (2 shared 256d MLP3x depth 6, 1.45M params)

…bandoned New files: - train_exp.py: Full experimental script combining WarmdownQuantization base with BigramHash, SmearGate, SWA, layer sharing, zstd, Muon WD, OrthoInit - EXPERIMENT_LOG.md: Detailed results from 7 A100 experiments Key findings: - Layer sharing (3 shared x 9 virtual) loses 0.09 BPB vs 9 unique at 512d - BigramHash + SmearGate: +0.003 BPB over baseline (1.4384 vs 1.4417) - Aggressive warmdown (WARMDOWN_ITERS=20000) needs tuning per hardware - 9L faster than 10L on 1xA100 short runs; 10L expected better on 8xH100 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…1.3260 BPB Key improvements to train_exp.py: - BigramHash: XOR hash with coprime multipliers, 128-dim, zero-init, learned scale (matching PR openai#162) - SmearGate: single gate after embed+RMSNorm (not per-block), fixed gate direction - SWA early-start bug fix (minimum 100 steps before activation) - FTLE-lite sensitivity-aware mixed-precision quantization (experimental) - Eval-time extra recurrence support (not useful for non-shared models) - Sliding window eval safety: skips if estimated time > 600s Best A100 results: 1.3260 BPB (9L, sliding window stride=1024, zstd-22) Previous best was 1.4384 BPB — a 0.112 BPB improvement from bug fixes + eval strategy. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Round 3: Sliding window eval gives 1.3260 BPB (best legal result). FTLE-lite helps quality but busts 16MB limit. Eval-time recurrence useless on non-shared models. Competition intel: SOTA is 1.1428 (merged). Key gaps identified: WD=0.04, 11L, BigramHash(10240), RoPE base 50000, batched stride-64 sliding window, periodic SWA not continuous. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

PR openai#215 found Q projection matrices have condition numbers >100M, meaning Q naturally operates in a low-rank subspace. Factoring Q as W_down(dim→r) @ W_up(r→dim) with r=192 saves 2.6% params and ~22% step time on H100, yielding ~28% more training steps. Enable with Q_RANK=192 (default 0 = full rank, no change). K, V, O projections remain full rank (they ARE full rank). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

SkywardSyntax closed this Mar 20, 2026

SkywardSyntax reopened this Mar 20, 2026

SkywardSyntax force-pushed the submission/depth-recurrence-layer-sharing branch from c1bdde9 to 4e760ac Compare March 20, 2026 04:56

SkywardSyntax and others added 2 commits March 20, 2026 11:45

SkywardSyntax force-pushed the submission/depth-recurrence-layer-sharing branch from 5625dd9 to 4e760ac Compare March 20, 2026 20:04

Nikhil Reddy Battapati and others added 3 commits March 20, 2026 16:37

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Depth Recurrence via Layer Sharing (3 shared blocks → 1/3 params, matched BPB)#167

Depth Recurrence via Layer Sharing (3 shared blocks → 1/3 params, matched BPB)#167
SkywardSyntax wants to merge 6 commits intoopenai:mainfrom
SkywardSyntax:submission/depth-recurrence-layer-sharing

SkywardSyntax commented Mar 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

SkywardSyntax commented Mar 20, 2026

Summary

Why It Matters

Local Validation (Apple Silicon, 500K-token FineWeb subset, 100 steps)

Composability

How to Run

Test Plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant