Regime I: warmup=5, cosine restart at ep40 (fast start + second phase) #1251
Review: Closed (Cosine Restart Failed)
Shorter warmup (5 epochs) + cosine restart at ep40 was the worst performer: mean3=25.8 vs 23.2 (+11.2%). The restart disrupted training momentum mid-run. The model effectively lost progress from the first phase when LR spiked back up at ep40. Single-phase cosine annealing remains more effective.
Hypothesis
Shorter warmup (5 instead of 10) + cosine warm restart at ep40 gives a second optimization phase that can escape the basin found in phase 1.
Instructions
Change: warmup `total_iters=5`, `milestones=[5]`. Replace cosine with `CosineAnnealingWarmRestarts(T_0=35, T_mult=1, eta_min=5e-5)`. Run with `--wandb_group regime-i`.
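To make the requested schedule concrete, here is a minimal pure-Python sketch of the per-epoch learning rate it should produce. It follows the closed-form expressions for linear warmup and cosine annealing with warm restarts rather than calling the PyTorch schedulers. `BASE_LR` is a hypothetical value (the peak LR is not stated in this PR). The 10% warmup start factor is taken from the observations further down. Epochs are 0-based here, so index 40 corresponds to the 41st epoch.

```python
import math

BASE_LR = 1e-3             # hypothetical peak LR; not given in this PR
WARMUP_EPOCHS = 5          # warmup total_iters=5, milestones=[5]
WARMUP_START_FACTOR = 0.1  # assumed from "warmup at 10% LR" in the report
T_0 = 35                   # cosine cycle length; T_mult=1 keeps it fixed
ETA_MIN = 5e-5             # floor of each cosine cycle


def lr_at(epoch: int) -> float:
    """Learning rate at a 0-based epoch index for warmup + cosine restarts."""
    if epoch < WARMUP_EPOCHS:
        # Linear ramp of the LR multiplier from the start factor towards 1.0.
        factor = WARMUP_START_FACTOR + (1.0 - WARMUP_START_FACTOR) * epoch / WARMUP_EPOCHS
        return BASE_LR * factor
    # Position inside the current cosine cycle; wraps to 0 at every restart.
    t_cur = (epoch - WARMUP_EPOCHS) % T_0
    return ETA_MIN + 0.5 * (BASE_LR - ETA_MIN) * (1.0 + math.cos(math.pi * t_cur / T_0))


if __name__ == "__main__":
    for e in (0, 4, 5, 39, 40, 59):
        print(f"epoch index {e:2d}: lr = {lr_at(e):.6f}")
```

Under these assumptions the LR falls almost to `eta_min` at index 39 and jumps back to the full peak at index 40, which is the spike the review refers to.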
Baseline (verified frontier, 4 consecutive plateau rounds)
Results
W&B run: `qv1no8ii`
Epochs: 60 (hit 30-min wall clock)
Peak memory: 14.7 GB
Surface MAE (mae_surf_p) — primary metric
Full surface MAE (Ux / Uy / p)
Volume MAE (p)
val/loss (combined)
What happened
The hypothesis did not work. All 4 splits are uniformly worse than baseline by 1.5–3.4 points on surface pressure MAE. The warm restart disrupted a converging trajectory rather than escaping a basin.
Key observations from the epoch-by-epoch loss curve:
Epochs 1–10 showed zero training loss (`vol=0.0000 surf=0.0000`). The model produced near-zero predictions during the warmup phase, suggesting 5 epochs of warmup at 10% LR is too brief to establish stable gradients before the cosine phase kicks in.
Phase 1 converged steadily: epochs 11–40 showed smooth improvement, reaching `val_in_dist=0.650, val_ood_re=0.618` by epoch 40 (end of first T_0=35 cycle).
The restart at epoch 41 caused clear regression: `val_in_dist` jumped from 0.650 → 0.732, `val_ood_re` from 0.618 → 0.704. The second phase had to re-learn rather than explore new basins.
Phase 2 ran out of time: with T_0=35, the second cycle ends at epoch 75, but the wall-clock limit cut us off at epoch 60 (only 20/35 epochs into phase 2). Even accounting for this, by epoch 60 the model had not recovered to phase-1 best on in_dist (0.686 vs 0.650 at epoch 40).
The core issue: the restart resets the LR to its maximum value, which effectively undoes converged weights. The "escape from basin" mechanism requires the optimizer to find a better basin and re-converge — with only 20 epochs remaining after the restart, there wasn't enough time.
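The timing argument above can be checked before launching a run. The sketch below uses a hypothetical helper, `restart_cycles`, which is not part of the training code. It lists each cosine cycle that starts within a given epoch budget and the fraction of that cycle that completes. It assumes 0-based epoch boundaries and an integer `t_mult`.

```python
def restart_cycles(warmup: int, t_0: int, t_mult: int, budget: int):
    """Return (start, end, fraction_completed) for each cosine cycle
    that begins before the epoch budget runs out."""
    cycles = []
    start, length = warmup, t_0
    while start < budget:
        end = start + length
        completed = min(end, budget) - start  # epochs of this cycle actually run
        cycles.append((start, end, completed / length))
        start, length = end, length * t_mult  # next cycle begins at the restart
    return cycles


if __name__ == "__main__":
    # Settings from this experiment: warmup=5, T_0=35, T_mult=1, 60 epochs reached.
    for start, end, frac in restart_cycles(5, 35, 1, 60):
        print(f"cycle {start}-{end}: {frac:.0%} completed")
```

With this run's settings the second cycle spans epochs 40 to 75 and only 20 of its 35 epochs fit inside the 60-epoch budget, matching the truncation described above.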
Suggested follow-ups