Regime J: mlp_ratio=4 (wider FFN at same hidden dim)#1252
Review: Request Changes — Try mlp_ratio=3
Good execution, gilbert. The wider FFN (mlp_ratio=4) shows an interesting signal: re_p (28.0 vs 27.7) and tan_p (38.5 vs 37.7) are the closest to baseline among all regime experiments, and val/loss=0.885 is the second-closest to baseline (0.865). However, the model completed only 54 epochs because of the slower epoch time (~33s vs ~28s baseline). With fewer optimization steps, it is hard to know whether the wider FFN would have converged to a better minimum given the same number of epochs. Next step: try mlp_ratio=3.
The goal is to find the sweet spot between FFN capacity and epoch throughput within the 30-min budget.
Review: Closed — Both mlp_ratio iterations failed
Neither mlp_ratio=4 nor mlp_ratio=3 beat baseline. The wider FFN adds compute but not enough quality within the 30-min budget. Both iterations were on old code (pre-Regime H); the new baseline (val_loss=0.8648) puts these results even further behind. Two iterations is enough — closing. I'll assign you a fresh experiment on the updated codebase.
Hypothesis
The FFN hidden layer is 2x n_hidden (n_hidden=384), i.e. 768 at baseline. At mlp_ratio=4 it becomes 1536, giving significantly more nonlinear capacity. Epoch time increases, but the model has more expressive power per layer.
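The widening can be made concrete with a quick parameter-count sketch (plain Python; the function name and the bias-free weight count are illustrative, not taken from the repo):

```python
def ffn_weight_params(n_hidden: int, mlp_ratio: int) -> int:
    """Weight count of a two-layer FFN: in->hidden plus hidden->out (biases ignored)."""
    hidden = mlp_ratio * n_hidden
    return n_hidden * hidden + hidden * n_hidden

# Hidden width and weight count per FFN block at each ratio tried here.
for ratio in (2, 3, 4):
    print(ratio, ratio * 384, ffn_weight_params(384, ratio))
```

At n_hidden=384 this doubles the FFN weights per block going from ratio=2 to ratio=4 (589,824 to 1,179,648), which is where the extra per-epoch compute comes from.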
Instructions
Change mlp_ratio=4 in model_config. Run with --wandb_group regime-j.

Baseline (verified frontier, 4 consecutive plateau rounds)
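For concreteness, the intended change might look like the following sketch; the actual structure and field names of model_config are not shown in this thread, so only n_hidden=384 and mlp_ratio are grounded in it:

```python
# Hypothetical shape of model_config -- the real repo's config may differ.
model_config = dict(
    n_hidden=384,
    mlp_ratio=4,  # baseline is 2; Regime J widens the FFN hidden layer to 1536
)
```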
Results
Iteration 1: mlp_ratio=4
W&B run: ta02xdo7 | Epochs: 54 | Memory: 16.2 GB

Iteration 2: mlp_ratio=3 (revised per review)
W&B run: e6d3cbls | Epochs: 57 | Memory: 15.5 GB

Full Surface MAE (mlp_ratio=3)
Volume MAE (mlp_ratio=3)
Val loss (mlp_ratio=3, epoch 57)
val/loss: 0.893 | val_in_dist: 0.628, val_ood_cond: 0.718, val_ood_re: 0.543, val_tandem_transfer: 1.685
What happened
Negative result for both ratios. Neither mlp_ratio=3 nor mlp_ratio=4 beats the baseline (ratio=2) within the 30-minute budget.
mlp_ratio=3 ran 57 epochs vs 54 for ratio=4 (as expected — ~30s/epoch vs ~32s). However, quality was actually slightly worse: mean3=24.6 vs 24.2 for ratio=4, and both are above the baseline 23.2. The tandem transfer split is particularly hurt (40.7 vs 37.7 baseline), suggesting the wider FFN doesn't generalize better to the hardest OOD case.
One positive signal: val_ood_cond mae_surf_p with ratio=3 ties the baseline exactly (14.3). But this single split matching doesn't offset the tandem regression.
The fundamental problem: a wider FFN adds capacity (and compute) per layer, but the quality gain per gradient step is not enough to compensate for the same or fewer training epochs within the 30-minute budget. The baseline ratio=2 remains the most epoch-efficient configuration.
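The epoch-count gap follows directly from the per-epoch wall times quoted in this thread; a back-of-envelope check (times are the approximate reported values, so the predictions are rough):

```python
BUDGET_S = 30 * 60  # 30-minute training budget

# Approximate per-epoch wall times reported above (seconds).
for name, sec_per_epoch in [("ratio=2 (baseline)", 28),
                            ("ratio=3", 30),
                            ("ratio=4", 33)]:
    epochs = BUDGET_S // sec_per_epoch
    print(f"{name}: ~{epochs} epochs in budget")
```

This predicts ~54 epochs for ratio=4 (matching the observed 54) and ~60 for ratio=3 (observed 57, consistent with a true epoch time slightly above 30s), so the baseline's ~10 extra epochs per run are real headroom the wider ratios give up.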
Suggested follow-ups