
Depth Recurrence + Cross-Repeat Skip + Sliding Window Eval #148

Open

iverbovoy wants to merge 4 commits into openai:main from iverbovoy:depth-recurrence
Conversation

@iverbovoy

Summary

Beats the naive baseline (1.2244 bpb) by 0.005 bpb with 3.1x fewer training steps, via stateful depth recurrence.

  • val_bpb = 1.2196 (sliding window eval on int8+zlib roundtrip model, stride=256)
  • val_bpb = 1.2533 (standard int8+zlib roundtrip)
  • 4494 steps in 600s on 8xH100, 133ms/step, 12.83MB artifact

Architecture: 3 shared blocks x 4 repeats (12 effective layers), dim=832, with Cross-Repeat Skip (original contribution: stateful recurrence via per-repeat learned residuals), 2 value embedding tables, LR x0.3, and sliding window eval.
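A minimal sketch of what such a stack could look like in PyTorch. The class and parameter names here (`RecurrentStack`, `skip_scale`, `block_fn`) are hypothetical, not taken from the diff; the sketch only illustrates the shared-blocks-with-repeats structure, the per-effective-layer `loop_embed`, and the cross-repeat skip with per-repeat learned scales:

```python
import torch
import torch.nn as nn

class RecurrentStack(nn.Module):
    """Depth recurrence: n_blocks shared blocks applied n_repeats times
    (n_blocks * n_repeats effective layers). Each effective layer gets a
    learned loop embedding (timestep encoding), and each block mixes in
    its own output from the previous repeat via a per-repeat learned
    scale (the cross-repeat skip / stateful recurrence)."""

    def __init__(self, block_fn, n_blocks=3, n_repeats=4, dim=832):
        super().__init__()
        self.blocks = nn.ModuleList([block_fn(dim) for _ in range(n_blocks)])
        # one embedding vector per effective layer
        self.loop_embed = nn.Parameter(torch.zeros(n_blocks * n_repeats, dim))
        # per-block, per-repeat scale for the cross-repeat residual
        self.skip_scale = nn.Parameter(torch.zeros(n_blocks, n_repeats))
        self.n_repeats = n_repeats

    def forward(self, x):
        prev = [None] * len(self.blocks)  # block outputs from previous repeat
        for r in range(self.n_repeats):
            for b, block in enumerate(self.blocks):
                h = x + self.loop_embed[r * len(self.blocks) + b]
                if prev[b] is not None:
                    # stateful recurrence across repeats
                    h = h + self.skip_scale[b, r] * prev[b]
                x = block(h)
                prev[b] = x
        return x
```

Because `skip_scale` is initialized to zero, the cross-repeat path starts as a no-op and the model can learn how much state to carry between repeats.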

All experiments were run on an RTX 3060 12GB; cloud GPUs were used only for validation and the final run.

See README for full details and ablations.
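The sliding window eval behind the headline number (window=1024, stride=256) can be sketched as follows. `forward_logits` mirrors the method this PR adds, but the function name and the inner loop are assumptions, and for simplicity this returns bits per token (true bpb divides total bits by total bytes):

```python
import math
import torch
import torch.nn.functional as F

def sliding_window_eval(model, tokens, window=1024, stride=256):
    """Sliding window evaluation: score each chunk of `stride` tokens
    with up to `window - stride` tokens of left context, instead of
    resetting the context at fixed block boundaries.
    Assumes model.forward_logits(ids) returns (B, T, vocab) logits."""
    nll, n_scored = 0.0, 0
    for end in range(stride, len(tokens), stride):
        start = max(0, end + 1 - window)
        ids = torch.tensor(tokens[start:end + 1]).unsqueeze(0)
        logits = model.forward_logits(ids[:, :-1])
        targets = ids[:, 1:]
        losses = F.cross_entropy(
            logits.transpose(1, 2), targets, reduction="none")[0]
        k = min(stride, losses.shape[0])  # score only the newest tokens
        nll += losses[-k:].sum().item()
        n_scored += k
    return nll / n_scored / math.log(2)  # nats -> bits per token
```

Only the last `stride` tokens of each window contribute to the score, so every scored token (past the first window) sees at least `window - stride` tokens of context; that extra context is where the roughly 0.03 bpb gain over block-wise eval comes from.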

- Replace 9 unique blocks with 3 blocks x 4 repeats (12 effective layers)
- Increase dim from 512 to 832, remove U-Net skips
- Add loop_embed for timestep encoding per effective layer
- Add cross-repeat skip: each block mixes in its output from previous repeat
  with per-repeat learned scales (stateful recurrence)
- Add 2 value embedding tables mixed into each layer with learned scales
- 17.14M params; best result: 1.6780 bpb (int8+zlib) at 2000 steps, batch size 8K
- Add eval_val_ttt: adapts model on each val batch before evaluating
- For each batch: save weights → K gradient steps → evaluate → restore
- Controlled by TTT_STEPS (default 0 = disabled) and TTT_LR (default 1e-4)
- Result: 0.010 bpb improvement on a 200-step test (2.4124 → 2.4027)
- TTT eval runs after normal roundtrip eval, reports both scores
- Sliding window eval: window=1024, stride=256, ~-0.034 bpb
- forward_logits() method for sliding window support
- LR x0.3: matrix=0.012, embed=0.015, scalar=0.012 (sweep winner)
- GRAD_CLIP_NORM=0.3 for recurrence stability
- WARMDOWN_ITERS=3000
- train@1024 (not 2048): better for recurrence (160ms vs 253ms/step)
- Fix grad_accum for non-power-of-2 GPU counts
- Best result: 1.2308 bpb sliding window on 6xH100 (3726 steps)
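The TTT eval loop described above (for each batch: save weights → K gradient steps → evaluate → restore) could look roughly like this. The function signature and the plain-SGD inner optimizer are assumptions; the PR gates the feature behind TTT_STEPS and TTT_LR:

```python
import copy
import torch

def eval_val_ttt(model, val_batches, loss_fn, ttt_steps=4, ttt_lr=1e-4):
    """Test-time training eval: for each validation batch, snapshot the
    weights, take `ttt_steps` gradient steps on that batch, measure the
    loss of the adapted model, then restore the snapshot so that batches
    are evaluated independently. Returns the mean adapted loss."""
    total, n = 0.0, 0
    for batch in val_batches:
        snapshot = copy.deepcopy(model.state_dict())   # save weights
        opt = torch.optim.SGD(model.parameters(), lr=ttt_lr)
        for _ in range(ttt_steps):                     # K adaptation steps
            opt.zero_grad()
            loss_fn(model, batch).backward()
            opt.step()
        with torch.no_grad():                          # evaluate adapted model
            total += loss_fn(model, batch).item()
        model.load_state_dict(snapshot)                # restore weights
        n += 1
    return total / n
```

Restoring from the snapshot after every batch is what keeps this an evaluation rather than training: the model leaves the loop with exactly the weights it entered with, matching the PR's save → adapt → evaluate → restore description.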
