Regime I: warmup=5, cosine restart at ep40 (fast start + second phase) by tcapelle · Pull Request #1251 · wandb/senpai

tcapelle · 2026-03-19T08:46:32Z

Hypothesis

Shorter warmup (5 instead of 10) + cosine warm restart at ep40 gives a second optimization phase that can escape the basin found in phase 1.

Instructions

Change: warmup total_iters=5, milestones=[5]. Replace cosine with CosineAnnealingWarmRestarts(T_0=35, T_mult=1, eta_min=5e-5). Run with `--wandb_group regime-i`.
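
The resulting schedule can be sketched in closed form without a training framework. This is a minimal sketch, not the PR's training code: it assumes a linear warmup followed by the cosine formula documented for PyTorch's CosineAnnealingWarmRestarts, with 0-indexed epochs. `base_lr=1e-3` and `start_factor=0.1` are placeholder assumptions (the PR does not state the base LR); `warmup=5`, `t_0=35`, and `eta_min=5e-5` come from the instructions above.

```python
import math

def lr_at(epoch, base_lr=1e-3, warmup=5, start_factor=0.1, t_0=35, eta_min=5e-5):
    """Learning rate at a 0-indexed epoch: linear warmup, then cosine warm restarts."""
    if epoch < warmup:
        # Linear ramp from start_factor * base_lr toward base_lr (assumed warmup shape).
        factor = start_factor + (1.0 - start_factor) * epoch / warmup
        return base_lr * factor
    # Position inside the current cosine cycle (T_mult=1, so every cycle has length t_0).
    t_cur = (epoch - warmup) % t_0
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + math.cos(math.pi * t_cur / t_0))

schedule = [lr_at(e) for e in range(60)]
# Epochs at which the LR sits at its peak: the end of warmup and the restart.
peaks = [e for e, lr in enumerate(schedule) if math.isclose(lr, 1e-3)]
```

Under these assumptions `peaks` contains epochs 5 and 40, which matches the "restart at ep40" in the title.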

Baseline (verified frontier, 4 consecutive plateau rounds)

mean3=23.2 (in=17.5, ood=14.3, re=27.7, tan=37.7)
50 single-variable experiments failed to improve. This round tests MULTI-VARIABLE regime changes.

Results

W&B run: `qv1no8ii`
Epochs: 60 (hit 30-min wall clock)
Peak memory: 14.7 GB

Surface MAE (mae_surf_p) — primary metric

Split	Baseline	This run	Delta
val_in_dist	17.5	20.85	+3.35 worse
val_ood_cond	14.3	16.46	+2.16 worse
val_ood_re	27.7	29.25	+1.55 worse
val_tandem_transfer	37.7	40.06	+2.36 worse
mean3 (in+ood+tan)/3	23.2	25.8	+2.6 worse
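
The deltas and the mean3 row can be re-derived from the per-split numbers. A small sketch using only the values in the table; the dictionary and function names are illustrative, and mean3 is taken to exclude val_ood_re, as the row label "(in+ood+tan)/3" indicates.

```python
baseline = {"val_in_dist": 17.5, "val_ood_cond": 14.3,
            "val_ood_re": 27.7, "val_tandem_transfer": 37.7}
this_run = {"val_in_dist": 20.85, "val_ood_cond": 16.46,
            "val_ood_re": 29.25, "val_tandem_transfer": 40.06}

# mean3 averages three of the four splits; val_ood_re is left out.
MEAN3_SPLITS = ("val_in_dist", "val_ood_cond", "val_tandem_transfer")

def mean3(scores):
    return sum(scores[s] for s in MEAN3_SPLITS) / len(MEAN3_SPLITS)

# Positive delta means this run is worse (MAE is lower-is-better).
deltas = {s: round(this_run[s] - baseline[s], 2) for s in baseline}
summary = (round(mean3(baseline), 1), round(mean3(this_run), 1))
```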

Full surface MAE (Ux / Uy / p)

Split	Ux	Uy	p
val_in_dist	8.20	2.14	20.85
val_ood_cond	4.94	1.27	16.46
val_ood_re	4.42	1.09	29.25
val_tandem_transfer	7.32	2.52	40.06

Volume MAE (p)

Split	vol_p
val_in_dist	21.65
val_ood_cond	13.83
val_ood_re	47.96
val_tandem_transfer	39.80

val/loss (combined)

Best epoch 60: 0.9576

What happened

The hypothesis did not work. All 4 splits are worse than baseline, by 1.55 to 3.35 points on surface pressure MAE. The warm restart disrupted a converging trajectory rather than escaping a basin.

Key observations from the epoch-by-epoch loss curve:

Epochs 1–10 showed zero training loss (`vol=0.0000 surf=0.0000`). The model produced near-zero predictions during the warmup phase, suggesting 5 epochs of warmup at 10% LR is too brief to establish stable gradients before the cosine phase kicks in.
Phase 1 converged steadily: epochs 11–40 showed smooth improvement, reaching `val_in_dist=0.650, val_ood_re=0.618` by epoch 40 (end of first T_0=35 cycle).
The restart at epoch 41 caused clear regression: `val_in_dist` jumped from 0.650 → 0.732, `val_ood_re` from 0.618 → 0.704. The second phase had to re-learn rather than explore new basins.
Phase 2 ran out of time: with T_0=35, the second cycle ends at epoch 75, but the wall-clock limit cut us off at epoch 60 (only 20/35 epochs into phase 2). Even accounting for this, by epoch 60 the model had not recovered to phase-1 best on in_dist (0.686 vs 0.650 at epoch 40).
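
The cutoff arithmetic in the last observation can be made concrete. A minimal sketch, assuming the cosine formula documented for PyTorch's CosineAnnealingWarmRestarts and that 20 scheduler steps had elapsed in the second cycle when the run stopped. It computes how far the LR had decayed from its peak toward eta_min; the variable names are illustrative.

```python
import math

T_0 = 35                # cycle length from the instructions
epochs_into_cycle = 20  # epochs completed in phase 2 at the wall-clock cutoff

cycle_fraction = epochs_into_cycle / T_0
# Cosine multiplier on (base_lr - eta_min): 1.0 at a restart, 0.0 at the end of a cycle.
cosine_multiplier = 0.5 * (1.0 + math.cos(math.pi * cycle_fraction))
```

Under these assumptions the multiplier is about 0.39, so the run stopped while the LR was still well above its floor, which is consistent with phase 2 not having re-converged.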

The core issue: the restart resets the LR to its maximum value, which pushes the weights away from the solution they had converged to. The "escape from basin" mechanism requires the optimizer to find a better basin and re-converge. With only 20 epochs remaining after the restart, there wasn't enough time.

Suggested follow-ups

Shorter T_0 (e.g., T_0=20): first restart at epoch 25, second at 45, allows multiple restarts within 60 epochs and more chances to find a better basin.
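Smaller restart amplitude: use `eta_min=1e-4` (higher floor) to shrink the LR jump at each restart while still providing a brief LR boost. The restart still returns the LR to its peak, so lowering the peak itself would require a smaller base LR.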
No restart, just shorter warmup: test warmup=5 with standard `CosineAnnealingLR(T_max=55)` to isolate whether the shorter warmup itself changes results.
Snapshot ensemble: collect checkpoints at each local minimum (end of each cosine cycle) and ensemble at inference.
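
The restart epochs quoted in the first follow-up can be checked with a few lines. This is a sketch that assumes T_mult=1 (fixed cycle length) and counts epochs the same way as the text above; `restart_epochs` is a hypothetical helper, not part of the training code.

```python
def restart_epochs(warmup, t_0, total_epochs):
    """Epochs at which the LR jumps back to its peak within the epoch budget."""
    epochs = []
    e = warmup + t_0
    while e < total_epochs:
        epochs.append(e)
        e += t_0  # T_mult=1: every cycle has the same length
    return epochs

current = restart_epochs(warmup=5, t_0=35, total_epochs=60)   # the run in this PR
proposed = restart_epochs(warmup=5, t_0=20, total_epochs=60)  # first follow-up
```

With T_0=20 the third cycle would end at epoch 65, so under a 60-epoch budget it would also be cut short; only the cycles ending at epochs 25 and 45 would reach their LR floor and yield snapshot checkpoints.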

github-actions · 2026-03-19T08:46:42Z

Thank you for your submission, we really appreciate it. Like many open-source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution. You can sign the CLA by just posting a Pull Request Comment same as the below format.

I have read the CLA Document and I hereby sign the CLA

0 out of 2 committers have signed the CLA.
❌ @senpai-advisor
❌ @senpai-violet
senpai-advisor, senpai-violet seem not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
You can retrigger this bot by commenting recheck in this Pull Request. Posted by the CLA Assistant Lite bot.

tcapelle · 2026-03-19T10:10:37Z

Review: Closed — Cosine Restart Failed

Shorter warmup (5 epochs) + cosine restart at ep40 was the worst performer: mean3=25.8 vs 23.2 (+11.2%). The restart disrupted training momentum mid-run. The model effectively lost progress from the first phase when LR spiked back up at ep40. Single-phase cosine annealing remains more effective.

Experiment placeholder

6a53b19

tcapelle added the status:wip (Student is working on it), student:violet (Assigned to violet), and noam (Noam advisor branch experiments) labels Mar 19, 2026

Regime I: warmup=5, CosineAnnealingWarmRestarts(T_0=35, T_mult=1)

bafce9f

tcapelle added the status:review (Ready for advisor review) label and removed the status:wip (Student is working on it) label Mar 19, 2026

tcapelle closed this Mar 19, 2026

tcapelle deleted the exp-noam/regime-i branch March 19, 2026 10:10

github-actions Bot locked and limited conversation to collaborators Mar 19, 2026

tcapelle wants to merge 2 commits into noam from exp-noam/regime-i