
Regime J: mlp_ratio=4 (wider FFN at same hidden dim) #1252

Closed
tcapelle wants to merge 3 commits into noam from exp-noam/regime-j

Conversation

Contributor

@tcapelle tcapelle commented Mar 19, 2026

Hypothesis

The baseline FFN hidden layer is 2× n_hidden, i.e. 384. At 4× it becomes 768, giving significantly more nonlinear capacity. Epoch time increases, but the model has more expressive power per layer.

Instructions

Change: mlp_ratio=4 in model_config. Run with --wandb_group regime-j.
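For reference, a minimal sketch of how mlp_ratio scales a standard two-layer FFN's width and parameter count. This is an illustration, not the repo's actual module code; n_hidden=192 is inferred from the stated widths (2× → 384, 4× → 768) and the model_config key names may differ.

```python
def ffn_width(n_hidden: int, mlp_ratio: int) -> int:
    """Hidden width of a standard two-layer FFN: n_hidden -> ratio*n_hidden -> n_hidden."""
    return mlp_ratio * n_hidden

def ffn_params(n_hidden: int, mlp_ratio: int) -> int:
    """Weights + biases of the two Linear layers in that FFN."""
    h = ffn_width(n_hidden, mlp_ratio)
    return (n_hidden * h + h) + (h * n_hidden + n_hidden)

n_hidden = 192  # inferred from the description above; not confirmed against the repo
print(ffn_width(n_hidden, 2))  # 384 (baseline)
print(ffn_width(n_hidden, 4))  # 768 (this experiment)
print(ffn_params(n_hidden, 4) / ffn_params(n_hidden, 2))  # ~2x the FFN parameters
```

Doubling the ratio roughly doubles FFN parameters and per-step FLOPs, which is why epoch time rises below.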

Baseline (verified frontier, 4 consecutive plateau rounds)

  • mean3=23.2 (in=17.5, ood=14.3, re=27.7, tan=37.7)
  • 50 single-variable experiments failed to improve. This round tests MULTI-VARIABLE regime changes.

Results

Iteration 1: mlp_ratio=4

W&B run: ta02xdo7 | Epochs: 54 | Memory: 16.2 GB

| Split | mae_surf_p | Baseline | Δ |
| --- | ---: | ---: | ---: |
| val_in_dist | 18.8 | 17.5 | +1.3 ↑ |
| val_ood_cond | 15.3 | 14.3 | +1.0 ↑ |
| val_ood_re | 28.0 | 27.7 | +0.3 ↑ |
| val_tandem_transfer | 38.5 | 37.7 | +0.8 ↑ |
| mean3 | 24.2 | 23.2 | +1.0 ↑ |
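As a sanity check on the headline metric: the reported numbers are consistent with mean3 being the mean of the in_dist, ood_cond, and tandem_transfer surface-pressure MAEs, with ood_re excluded. This is inferred from the figures in this thread, not from the repo's metric code.

```python
def mean3(in_dist: float, ood_cond: float, tandem: float) -> float:
    """Assumed mean3 definition: average of three of the four splits (ood_re excluded)."""
    return round((in_dist + ood_cond + tandem) / 3, 1)

print(mean3(18.8, 15.3, 38.5))  # 24.2 (iteration 1, matches the table)
print(mean3(17.5, 14.3, 37.7))  # 23.2 (baseline, matches the table)
```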

Iteration 2: mlp_ratio=3 (revised per review)

W&B run: e6d3cbls | Epochs: 57 | Memory: 15.5 GB

| Split | mae_surf_p | Baseline | Δ |
| --- | ---: | ---: | ---: |
| val_in_dist | 18.9 | 17.5 | +1.4 ↑ |
| val_ood_cond | 14.3 | 14.3 | ~0 (tied) |
| val_ood_re | 27.9 | 27.7 | +0.2 ↑ |
| val_tandem_transfer | 40.7 | 37.7 | +3.0 ↑ |
| mean3 | 24.6 | 23.2 | +1.4 ↑ |

Full Surface MAE (mlp_ratio=3)

| Split | Ux | Uy | p |
| --- | ---: | ---: | ---: |
| val_in_dist | 7.24 | 1.93 | 18.92 |
| val_ood_cond | 4.15 | 1.28 | 14.26 |
| val_ood_re | 3.66 | 1.14 | 27.87 |
| val_tandem_transfer | 6.77 | 2.38 | 40.71 |

Volume MAE (mlp_ratio=3)

| Split | Ux | Uy | p |
| --- | ---: | ---: | ---: |
| val_in_dist | 1.12 | 0.37 | 20.17 |
| val_ood_cond | 0.73 | 0.28 | 12.22 |
| val_ood_re | 0.82 | 0.36 | 46.86 |
| val_tandem_transfer | 1.95 | 0.89 | 39.13 |

Val loss (mlp_ratio=3, epoch 57)

val/loss: 0.893 | val_in_dist: 0.628, val_ood_cond: 0.718, val_ood_re: 0.543, val_tandem_transfer: 1.685


What happened

Negative result for both ratios. Neither mlp_ratio=3 nor mlp_ratio=4 beats the baseline (ratio=2) within the 30-minute budget.

mlp_ratio=3 ran 57 epochs vs 54 for ratio=4 (as expected — ~30s/epoch vs ~32s). However, quality was actually slightly worse: mean3=24.6 vs 24.2 for ratio=4, and both are above the baseline 23.2. The tandem transfer split is particularly hurt (40.7 vs 37.7 baseline), suggesting the wider FFN doesn't generalize better to the hardest OOD case.

One positive signal: val_ood_cond mae_surf_p with ratio=3 ties the baseline exactly (14.3). But this single-split match doesn't offset the tandem regression.

The fundamental problem: a wider FFN adds capacity (and compute) per step, but the quality gained per gradient step isn't enough to compensate for fitting the same or fewer training epochs into the 30-minute budget. The baseline ratio=2 remains the most epoch-efficient configuration.
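The throughput side of that trade-off is simple arithmetic, using the approximate per-epoch times reported in this thread (~28s baseline, ~30s at ratio=3, ~33s at ratio=4):

```python
# Epochs that fit into the fixed 30-minute training budget at each
# reported per-epoch time (values are the approximate figures from this
# thread, used here only for illustration).
BUDGET_S = 30 * 60

def epochs_in_budget(sec_per_epoch: float) -> int:
    return int(BUDGET_S // sec_per_epoch)

for ratio, sec in [(2, 28), (3, 30), (4, 33)]:
    print(f"mlp_ratio={ratio}: ~{epochs_in_budget(sec)} epochs")
# mlp_ratio=2: ~64 epochs; mlp_ratio=3: ~60; mlp_ratio=4: ~54
```

So ratio=3 buys back roughly 6 epochs over ratio=4 but still gives up ~4-7 epochs against the baseline, and the per-step gains above never closed that gap.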

Suggested follow-ups

  • Give up on wider FFN: Both ratio=3 and ratio=4 fail within the time budget. The capacity bottleneck appears to be elsewhere.
  • Wider model (n_hidden increase): If more capacity is the goal, increasing n_hidden might be more uniformly beneficial than wider FFN at fixed n_hidden.
  • n_layers=2 + reduced slice_num: More depth (currently n_layers=1) could add capacity with different per-step compute profile than width.
  • Investigate tandem transfer regression: The tandem split consistently gets worse with wider FFN. This might hint that wider FFN overfits to single-foil patterns in the training set.

@tcapelle tcapelle added status:wip Student is working on it student:gilbert Assigned to gilbert noam Noam advisor branch experiments labels Mar 19, 2026

github-actions Bot commented Mar 19, 2026


Thank you for your submission, we really appreciate it. Like many open-source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution. You can sign the CLA by posting a Pull Request comment in the format below.


I have read the CLA Document and I hereby sign the CLA


0 out of 2 committers have signed the CLA.
❌ @senpai-advisor
❌ @senpai-gilbert
senpai-advisor and senpai-gilbert do not appear to be GitHub users. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You can retrigger this bot by commenting recheck in this Pull Request. Posted by the CLA Assistant Lite bot.

@tcapelle tcapelle marked this pull request as ready for review March 19, 2026 10:00
@tcapelle tcapelle added status:review Ready for advisor review and removed status:wip Student is working on it labels Mar 19, 2026
@tcapelle
Contributor Author

Review: Request Changes — Try mlp_ratio=3

Good execution, gilbert. The wider FFN (mlp_ratio=4) shows an interesting signal: re_p (28.0 vs 27.7) and tan_p (38.5 vs 37.7) are the closest to baseline among all regime experiments. val/loss=0.885 is also the second-closest to baseline (0.865).

However, the model only completed 54 epochs due to the slower epoch time (~33s vs ~28s baseline). With fewer optimization steps, it's hard to know if the wider FFN would have converged to a better minimum given the same number of epochs.

Next step: try mlp_ratio=3

  • This should give ~30s/epoch, fitting ~60 epochs in 30 min (vs 54 at ratio=4)
  • Still provides 50% more FFN capacity than baseline (ratio=2)
  • Keep everything else identical
  • Run with --wandb_group regime-j

The goal is to find the sweet spot between FFN capacity and epoch throughput within the 30-min budget.

@tcapelle tcapelle marked this pull request as draft March 19, 2026 10:10
@tcapelle tcapelle added status:wip Student is working on it and removed status:review Ready for advisor review labels Mar 19, 2026
@tcapelle tcapelle marked this pull request as ready for review March 19, 2026 10:44
@tcapelle tcapelle added status:review Ready for advisor review and removed status:wip Student is working on it labels Mar 19, 2026
@tcapelle
Contributor Author

Review: Closed — Both mlp_ratio iterations failed

Neither mlp_ratio=4 nor mlp_ratio=3 beat the baseline. The wider FFN adds compute but not enough quality within the 30-min budget. Both iterations ran on old code (pre-Regime H); the new baseline (val_loss=0.8648) leaves these results even further behind. Two iterations is enough — closing. I'll assign you a fresh experiment on the updated codebase.

@tcapelle tcapelle closed this Mar 19, 2026
@tcapelle tcapelle deleted the exp-noam/regime-j branch March 19, 2026 10:52
@github-actions github-actions Bot locked and limited conversation to collaborators Mar 19, 2026