Regime E: EMA from ep25 + decay=0.997 + T_max=72 (longer EMA window)#1247
Review: Closed (Schedule Incompatible with Budget). T_max=72 shifts productive learning too late: at epoch 55 the cosine schedule was only 63% complete, leaving the run at a higher LR than the baseline at the same epoch, so the EMA model averaged over a noisier trajectory. mean3 = 24.9 vs 23.2 (+7.3%, worse). Like Regime A, this approach needs more epochs than the 30-min cap allows.
Hypothesis
Earlier EMA start + slightly faster decay + longer LR schedule = more epochs of productive averaging.
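Back-of-envelope for the "slightly faster decay" part: with an exponential moving average w_ema ← d·w_ema + (1−d)·w, the effective averaging window is roughly 1/(1−d) updates. A minimal pure-Python sketch (whether decay is applied per step or per epoch in the actual run is an assumption):

```python
import math

def ema_update(ema, w, decay=0.997):
    """One EMA step: ema <- decay*ema + (1 - decay)*w."""
    return decay * ema + (1.0 - decay) * w

decay = 0.997
window = 1.0 / (1.0 - decay)                 # effective window, ~333 updates
half_life = math.log(2) / -math.log(decay)   # updates for a weight to lose half its influence, ~231
print(round(window), round(half_life))       # 333 231
```

So decay=0.997 averages over a window on the order of a few hundred updates, which is why starting the EMA earlier (epoch 25) matters: it needs many epochs of trajectory to fill that window.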
Instructions
Change: ema_start_epoch=25, ema_decay=0.997, T_max=72. Run with --wandb_group regime-e.
Baseline: verified frontier, 4 consecutive plateau rounds.
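The three changed knobs interact in one loop, which might be wired up roughly like this (hypothetical names; the real run presumably uses a framework scheduler such as PyTorch's CosineAnnealingLR and a per-step EMA, not this pure-Python stand-in):

```python
import math

def cosine_lr(epoch, base_lr=3e-4, t_max=72):
    """Cosine-annealed LR, reaching eta_min=0 at t_max (eta_min assumed)."""
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * min(epoch, t_max) / t_max))

def train(epochs=100, ema_start_epoch=25, ema_decay=0.997):
    w, ema = 0.0, None
    for epoch in range(epochs):
        lr = cosine_lr(epoch)
        w += lr * 1.0  # stand-in for a real gradient step
        if epoch >= ema_start_epoch:  # EMA only tracks weights from epoch 25 on
            ema = w if ema is None else ema_decay * ema + (1 - ema_decay) * w
    return w, ema
```

The key coupling: moving T_max from 62 to 72 stretches the high-LR portion of the schedule rightward, while ema_start_epoch=25 means the EMA begins accumulating during that still-hot phase.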
Results
W&B run: inhta48l
Status: Timed out at epoch 55/100 (30-min cap)
Metrics at epoch 55 (EMA model, mid-run)
mean3 (surf p, in+ood+tan / 3): 24.9 vs baseline 23.2, worse by +7.3%
in=19.6 (+2.1), ood=14.6 (+0.3), re=28.3 (+0.6), tan=40.5 (+2.8)
Peak memory: ~87.8 GB (of 96 GB)
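Sanity check on the headline number: mean3 is the mean of the in, ood, and tan splits reported above (re is excluded, per the "in+ood+tan / 3" definition):

```python
in_p, ood, tan = 19.6, 14.6, 40.5
mean3 = (in_p + ood + tan) / 3
print(round(mean3, 1))  # 24.9, matching the reported value
```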
What happened
The run hit the 30-minute timeout at epoch 55/100, so these results are mid-training, not final, which makes a fair comparison to the baseline impossible. The core problem: with T_max=72 (vs 62 baseline), the cosine LR schedule at epoch 55 is further from its minimum (45/72 schedule steps complete vs ~45/52 for the baseline at the same epoch; both counts imply the schedule starts around epoch 10). The model is therefore operating in a higher-LR regime with more active weight updates, and it hasn't converged to its best checkpoint. The EMA model (active since epoch 25, i.e. 30 epochs of averaging) is averaging over a noisier trajectory than the baseline's EMA model did.
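The higher-LR claim can be checked directly from the quoted step counts: at 45/72 schedule steps the cosine multiplier is roughly 7x larger than at 45/52 (base LR and eta_min=0 are assumptions, but they cancel in the ratio):

```python
import math

def cosine_mult(step, t_max):
    """Fraction of base LR under cosine annealing down to eta_min=0."""
    return 0.5 * (1.0 + math.cos(math.pi * step / t_max))

m72 = cosine_mult(45, 72)  # regime E at epoch 55
m52 = cosine_mult(45, 52)  # baseline at epoch 55
print(f"{m72:.3f} vs {m52:.3f}")  # 0.309 vs 0.044
```

Note 45/72 = 62.5%, matching the "63% complete" figure in the review summary.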
All 4 splits show worse performance at epoch 55 vs baseline. But since T_max=72 is designed to shift productive learning later in training, the final-epoch comparison would be more meaningful — which we can't see due to the timeout.
Verdict: Inconclusive. The T_max=72 change is structurally incompatible with the 30-minute training cap — the schedule only completes at epoch ~82, well past the timeout.
Suggested follow-ups