Regime E: EMA from ep25 + decay=0.997 + T_max=72 (longer EMA window)#1247
Review: Closed (Schedule Incompatible with Budget). T_max=72 shifts productive learning too late: at epoch 55 the cosine schedule was only 63% complete, leaving the run at a higher LR than the baseline at the same epoch, so the EMA model averaged over a noisier trajectory. mean3 = 24.9 vs 23.2 (+7.3%, worse). Like Regime A, this approach needs more epochs than the 30-min cap allows.
Hypothesis
Earlier EMA start + slightly faster decay + longer LR schedule = more epochs of productive averaging.
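Back-of-envelope for the "slightly faster decay" part: with an exponential moving average w_ema ← d·w_ema + (1−d)·w, the effective averaging window is roughly 1/(1−d) updates. A minimal pure-Python sketch (whether decay is applied per step or per epoch in the actual run is an assumption):

```python
import math

def ema_update(ema, w, decay=0.997):
    """One EMA step: ema <- decay*ema + (1 - decay)*w."""
    return decay * ema + (1.0 - decay) * w

decay = 0.997
window = 1.0 / (1.0 - decay)                 # effective window, ~333 updates
half_life = math.log(2) / -math.log(decay)   # updates for a weight to lose half its influence, ~231
print(round(window), round(half_life))       # 333 231
```

So decay=0.997 averages over a window on the order of a few hundred updates, which is why starting the EMA earlier (epoch 25) matters: it needs many epochs of trajectory to fill that window.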
Instructions
Change: ema_start_epoch=25, ema_decay=0.997, T_max=72. Run with --wandb_group regime-e.
Baseline: verified frontier, 4 consecutive plateau rounds.
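The three changed knobs interact in one loop, which might be wired up roughly like this (hypothetical names; the real run presumably uses a framework scheduler such as PyTorch's CosineAnnealingLR and a per-step EMA, not this pure-Python stand-in):

```python
import math

def cosine_lr(epoch, base_lr=3e-4, t_max=72):
    """Cosine-annealed LR, reaching eta_min=0 at t_max (eta_min assumed)."""
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * min(epoch, t_max) / t_max))

def train(epochs=100, ema_start_epoch=25, ema_decay=0.997):
    w, ema = 0.0, None
    for epoch in range(epochs):
        lr = cosine_lr(epoch)
        w += lr * 1.0  # stand-in for a real gradient step
        if epoch >= ema_start_epoch:  # EMA only tracks weights from epoch 25 on
            ema = w if ema is None else ema_decay * ema + (1 - ema_decay) * w
    return w, ema
```

The key coupling: moving T_max from 62 to 72 stretches the high-LR portion of the schedule rightward, while ema_start_epoch=25 means the EMA begins accumulating during that still-hot phase.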
Results
W&B run: inhta48l
Status: Timed out at epoch 55/100 (30-min cap)
Metrics at epoch 55 (EMA model, mid-run)
mean3 (surf p, in+ood+tan / 3): 24.9 vs baseline 23.2, worse by +7.3%
in=19.6 (+2.1), ood=14.6 (+0.3), re=28.3 (+0.6), tan=40.5 (+2.8)
Peak memory: ~87.8 GB (of 96 GB)
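Sanity check on the headline number: mean3 is the mean of the in, ood, and tan splits reported above (re is excluded, per the "in+ood+tan / 3" definition):

```python
in_p, ood, tan = 19.6, 14.6, 40.5
mean3 = (in_p + ood + tan) / 3
print(round(mean3, 1))  # 24.9, matching the reported value
```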
What happened
The run hit the 30-minute timeout at epoch 55/100, so these results are mid-training, not final, which makes a fair comparison to the baseline impossible. The core problem: with T_max=72 (vs 62 baseline), the cosine LR schedule at epoch 55 is further from its minimum (45/72 schedule steps complete vs ~45/52 for the baseline at the same epoch; both counts imply the schedule starts around epoch 10). The model is therefore operating in a higher-LR regime with more active weight updates, and it hasn't converged to its best checkpoint. The EMA model (active since epoch 25, i.e. 30 epochs of averaging) is averaging over a noisier trajectory than the baseline's EMA model did.
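The higher-LR claim can be checked directly from the quoted step counts: at 45/72 schedule steps the cosine multiplier is roughly 7x larger than at 45/52 (base LR and eta_min=0 are assumptions, but they cancel in the ratio):

```python
import math

def cosine_mult(step, t_max):
    """Fraction of base LR under cosine annealing down to eta_min=0."""
    return 0.5 * (1.0 + math.cos(math.pi * step / t_max))

m72 = cosine_mult(45, 72)  # regime E at epoch 55
m52 = cosine_mult(45, 52)  # baseline at epoch 55
print(f"{m72:.3f} vs {m52:.3f}")  # 0.309 vs 0.044
```

Note 45/72 = 62.5%, matching the "63% complete" figure in the review summary.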
All 4 splits show worse performance at epoch 55 vs baseline. But since T_max=72 is designed to shift productive learning later in training, the final-epoch comparison would be more meaningful — which we can't see due to the timeout.
Verdict: Inconclusive. The T_max=72 change is structurally incompatible with the 30-minute training cap — the schedule only completes at epoch ~82, well past the timeout.
Suggested follow-ups