
Regime I: warmup=5, cosine restart at ep40 (fast start + second phase)#1251

Closed
tcapelle wants to merge 2 commits into noam from exp-noam/regime-i

Conversation

@tcapelle
Contributor

@tcapelle tcapelle commented Mar 19, 2026

Hypothesis

Shorter warmup (5 instead of 10) + cosine warm restart at ep40 gives a second optimization phase that can escape the basin found in phase 1.

Instructions

Change: warmup total_iters=5, milestones=[5]. Replace cosine with CosineAnnealingWarmRestarts(T_0=35, T_mult=1, eta_min=5e-5). Run with `--wandb_group regime-i`.
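As a sanity check of this schedule, here is a minimal pure-Python sketch of the resulting LR curve. It mirrors the closed form PyTorch uses for `CosineAnnealingWarmRestarts` with `T_mult=1`; `BASE_LR` is an assumption (the peak LR is not stated in this PR), and the warmup is modeled as a linear ramp from 10% of the peak, per the observations below.

```python
import math

# Sketch of the Regime I LR schedule. BASE_LR is hypothetical (the PR
# does not state the peak learning rate); ETA_MIN, T_0 and the 5-epoch
# warmup come from the PR description.
BASE_LR = 1e-3        # assumed peak LR
ETA_MIN = 5e-5
WARMUP_EPOCHS = 5
T_0 = 35              # cycle length in epochs (T_mult=1)

def lr_at(epoch: int) -> float:
    """LR at a given epoch: linear warmup from 10% of peak, then cosine
    annealing with warm restarts (same closed form as torch's
    CosineAnnealingWarmRestarts for T_mult=1)."""
    if epoch < WARMUP_EPOCHS:
        return BASE_LR * (0.1 + 0.9 * epoch / WARMUP_EPOCHS)
    t_cur = (epoch - WARMUP_EPOCHS) % T_0   # position within current cycle
    return ETA_MIN + 0.5 * (BASE_LR - ETA_MIN) * (1 + math.cos(math.pi * t_cur / T_0))

# End of the first cycle: LR has decayed to just above ETA_MIN ...
print(f"epoch 39: {lr_at(39):.2e}")
# ... and the restart snaps it back to the peak.
print(f"epoch 40: {lr_at(40):.2e}")
```

With this parameterization the first cosine cycle spans epochs 5–39 and the restart lands at epoch 40, matching the regime described above (up to an off-by-one depending on whether the scheduler steps before or after each epoch).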

Baseline (verified frontier, 4 consecutive plateau rounds)

  • mean3=23.2 (in=17.5, ood=14.3, re=27.7, tan=37.7)
  • 50 single-variable experiments failed to improve. This round tests MULTI-VARIABLE regime changes.

Results

W&B run: `qv1no8ii`
Epochs: 60 (stopped at the 30-minute wall-clock limit)
Peak memory: 14.7 GB

Surface MAE (mae_surf_p) — primary metric

| Split | Baseline | This run | Delta |
| --- | --- | --- | --- |
| val_in_dist | 17.5 | 20.85 | +3.35 (worse) |
| val_ood_cond | 14.3 | 16.46 | +2.16 (worse) |
| val_ood_re | 27.7 | 29.25 | +1.55 (worse) |
| val_tandem_transfer | 37.7 | 40.06 | +2.36 (worse) |
| mean3 = (in + ood + tan) / 3 | 23.2 | 25.8 | +2.6 (worse) |
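For reference, the mean3 aggregate can be reproduced from the per-split numbers above (a quick check, assuming mean3 averages the in_dist, ood_cond and tandem splits, with val_ood_re excluded, as its definition indicates):

```python
def mean3(in_dist: float, ood_cond: float, tandem: float) -> float:
    """mean3 as used in these reports: average of three splits
    (val_ood_re is excluded from the aggregate)."""
    return (in_dist + ood_cond + tandem) / 3

print(round(mean3(17.5, 14.3, 37.7), 1))     # baseline → 23.2
print(round(mean3(20.85, 16.46, 40.06), 1))  # this run → 25.8
```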

Full surface MAE (Ux / Uy / p)

| Split | Ux | Uy | p |
| --- | --- | --- | --- |
| val_in_dist | 8.20 | 2.14 | 20.85 |
| val_ood_cond | 4.94 | 1.27 | 16.46 |
| val_ood_re | 4.42 | 1.09 | 29.25 |
| val_tandem_transfer | 7.32 | 2.52 | 40.06 |

Volume MAE (p)

| Split | vol_p |
| --- | --- |
| val_in_dist | 21.65 |
| val_ood_cond | 13.83 |
| val_ood_re | 47.96 |
| val_tandem_transfer | 39.80 |

val/loss (combined)

  • Best epoch 60: 0.9576

What happened

The hypothesis did not work. All 4 splits are uniformly worse than baseline by 1.5–3.4 points on surface pressure MAE. The warm restart disrupted a converging trajectory rather than escaping a basin.

Key observations from the epoch-by-epoch loss curve:

  1. Epochs 1–10 showed zero training loss (`vol=0.0000 surf=0.0000`). The model produced near-zero predictions during the warmup phase, suggesting 5 epochs of warmup at 10% LR is too brief to establish stable gradients before the cosine phase kicks in.

  2. Phase 1 converged steadily: epochs 11–40 showed smooth improvement, reaching `val_in_dist=0.650, val_ood_re=0.618` by epoch 40 (end of first T_0=35 cycle).

  3. The restart at epoch 41 caused clear regression: `val_in_dist` jumped from 0.650 → 0.732, `val_ood_re` from 0.618 → 0.704. The second phase had to re-learn rather than explore new basins.

  4. Phase 2 ran out of time: with T_0=35, the second cycle ends at epoch 75, but the wall-clock limit cut us off at epoch 60 (only 20/35 epochs into phase 2). Even accounting for this, by epoch 60 the model had not recovered to phase-1 best on in_dist (0.686 vs 0.650 at epoch 40).

The core issue: the restart resets the LR to its maximum value, which knocks already-converged weights out of their minimum. The "escape from basin" mechanism requires the optimizer to both find a better basin and re-converge to it, and with only 20 epochs remaining after the restart there was not enough time for either.


Suggested follow-ups

  • Shorter T_0 (e.g., T_0=20): first restart at epoch 25, second at 45, allows multiple restarts within 60 epochs and more chances to find a better basin.
  • Smaller restart amplitude: use `eta_min=1e-4` (higher floor) to reduce disruption from restarts while still providing a brief LR boost.
  • No restart, just shorter warmup: test warmup=5 with standard `CosineAnnealingLR(T_max=55)` to isolate whether the shorter warmup itself changes results.
  • Snapshot ensemble: collect checkpoints at each local minimum (end of each cosine cycle) and ensemble at inference.
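The snapshot-ensemble follow-up can be sketched as below; everything here is hypothetical scaffolding (`preds` stands in for per-checkpoint model outputs collected at each cosine cycle's LR minimum):

```python
def ensemble(snapshot_preds):
    """Elementwise mean of predictions from several snapshot checkpoints."""
    n = len(snapshot_preds)
    return [sum(vals) / n for vals in zip(*snapshot_preds)]

# Hypothetical outputs from three cycle-end checkpoints
# (e.g. epochs 25, 45 and 60 under the shorter T_0=20 schedule).
preds = [
    [1.0, 2.0, 3.0],
    [3.0, 2.0, 1.0],
    [2.0, 2.0, 2.0],
]
print(ensemble(preds))  # → [2.0, 2.0, 2.0]
```

Because each cycle ends near a (possibly different) local minimum, averaging the snapshots' predictions can recover some of the diversity the restarts buy, without requiring any single phase to fully re-converge.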

@tcapelle added labels `status:wip` (Student is working on it), `student:violet` (Assigned to violet), and `noam` (Noam advisor branch experiments) on Mar 19, 2026
@github-actions

github-actions Bot commented Mar 19, 2026


Thank you for your submission, we really appreciate it. Like many open-source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution. You can sign the CLA by posting a Pull Request comment in the format below.


I have read the CLA Document and I hereby sign the CLA


0 out of 2 committers have signed the CLA.
❌ @senpai-advisor
❌ @senpai-violet
senpai-advisor and senpai-violet do not appear to be GitHub users. You need a GitHub account to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to it.
You can retrigger this bot by commenting `recheck` on this Pull Request. Posted by the CLA Assistant Lite bot.

@tcapelle added the `status:review` (Ready for advisor review) label and removed `status:wip` (Student is working on it) on Mar 19, 2026
@tcapelle
Contributor Author

Review: Closed — Cosine Restart Failed

Shorter warmup (5 epochs) + cosine restart at ep40 was the worst performer: mean3=25.8 vs 23.2 (+11.2%). The restart disrupted training momentum mid-run. The model effectively lost progress from the first phase when LR spiked back up at ep40. Single-phase cosine annealing remains more effective.

@tcapelle tcapelle closed this Mar 19, 2026
@tcapelle tcapelle deleted the exp-noam/regime-i branch March 19, 2026 10:10
@github-actions github-actions Bot locked and limited conversation to collaborators Mar 19, 2026