Pure AdamW at lr=4e-3 did not improve over Lookahead+AdamW at lr=3e-3: mean3 = 25.0 vs 23.2 (+7.8% worse). The hypothesis that Lookahead constrains late-training convergence is not supported at this budget — Lookahead's slow-weight stabilization appears genuinely beneficial.
Hypothesis
Lookahead may be constraining late-training convergence. Test pure AdamW with a slightly higher learning rate.
Instructions
Replace Lookahead with raw AdamW at lr=4e-3. Keep everything else. Run with `--wandb_group regime-c`.
Baseline: verified frontier, 4 consecutive plateau rounds.
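For reference, the mechanism being removed can be sketched in plain Python. This is a hypothetical scalar-weight toy, not the repo's implementation: the actual run wraps AdamW, and `sgd_step`, `k=5`, and `alpha=0.5` here are illustrative stand-ins.

```python
def lookahead_run(w0, grads, inner_step, k=5, alpha=0.5):
    """Lookahead on top of an inner update rule: fast weights take k
    ordinary optimizer steps, then the slow weights pull toward them by
    a factor alpha and the fast weights are reset to the slow weights."""
    fast = slow = w0
    for i, g in enumerate(grads, start=1):
        fast = inner_step(fast, g)               # ordinary (e.g. AdamW) step
        if i % k == 0:                           # every k steps: sync
            slow = slow + alpha * (fast - slow)
            fast = slow
    return slow

def raw_run(w0, grads, inner_step):
    """This run's variant: the inner optimizer alone, no slow weights."""
    w = w0
    for g in grads:
        w = inner_step(w, g)
    return w

def sgd_step(w, g, lr=0.1):
    """Toy inner update: plain SGD standing in for AdamW."""
    return w - lr * g
```

With a constant gradient, the Lookahead variant ends partway between the start point and the raw-optimizer endpoint — the slow weights damp how far each k-step burst moves, which is the stabilization effect this experiment removes.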
Results
W&B run: 2t5k2nku
Epochs completed: 57/100 (hit 30-min wall-clock limit; killed mid-epoch 58)
Peak memory: 14.7 GB
Val loss @ epoch 57
Surface MAE @ epoch 57
Volume MAE @ epoch 57
mean3 @ epoch 57: (19.31 + 15.33 + 40.43) / 3 = 25.0 vs baseline 23.2 (worse)
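As a sanity check on the arithmetic (the helper name `mean3` and the split labels are taken from this report; nothing else is assumed):

```python
def mean3(in_dist, ood_cond, tandem):
    """Average surface-pressure MAE over the three scored splits,
    per the report's formula above (ood_re is not included)."""
    return (in_dist + ood_cond + tandem) / 3

run_score = mean3(19.31, 15.33, 40.43)   # ~25.02, rounds to the reported 25.0
baseline = 23.2                          # verified-frontier baseline
pct_worse = 100 * (run_score - baseline) / baseline
# ~7.9% unrounded; the summary's +7.8% comes from the rounded 25.0
```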
What happened
The run was cut off at epoch 57/100 due to the 30-minute timeout, making this an incomplete evaluation. At epoch 57, surface pressure MAE is worse than baseline across all splits — tandem (40.43 vs 37.7), in_dist (19.31 vs 17.5), ood_cond (15.33 vs 14.3), ood_re (28.48 vs 27.7). All val losses were still declining at epoch 57, so the model had not converged.
Two interpretations are possible: (1) pure AdamW at lr=4e-3 converges more slowly in early-to-mid training and would close the gap given all 100 epochs; (2) Lookahead's slow-weight stabilization genuinely helps and removing it leads to a worse optimum. The incomplete run can't distinguish these. What is clear: at the 30-min mark, removing Lookahead is not beneficial.
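One way to make such truncated runs comparable is to score each at the same epoch index. A generic sketch (the function name is hypothetical, and the metric histories in any usage are placeholders, not data from these runs):

```python
def compare_at_budget(history_a, history_b, budget=None):
    """Score two per-epoch metric histories at a matched epoch budget:
    truncate both to the shorter history (or an explicit cap) so neither
    run is credited with training time the other never got."""
    n = min(len(history_a), len(history_b))
    if budget is not None:
        n = min(n, budget)
    if n == 0:
        raise ValueError("need at least one common epoch")
    return history_a[n - 1], history_b[n - 1]
```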
Suggested follow-ups