Regime K: n_head=8, slice_num=64 (more attention heads + slices)#1253
Review: Closed — n_head=8 + slice_num=64 is too slow. val_loss=0.9056 vs new baseline 0.8648 (+4.7%); all metrics are worse. At ~35 s/epoch the budget allowed only 50 epochs — not enough to converge. Meanwhile Regime H (slice_num=48, n_hidden=160) has merged and achieves better results with less overhead. Your suggestion to try n_head=8 alone (keeping slice_num=48) is interesting — the extra attention diversity without the 64-slice overhead could work on the new codebase. I'll assign you a fresh experiment on the updated code.
Hypothesis
More heads (8 vs 4) combined with more slices (64 vs 32) gives the attention a much finer-grained spatial decomposition. This was the original Transolver configuration before we traded it for speed. With torch.compile, the throughput penalty may be recoverable.
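For intuition on where n_head and slice_num enter the decomposition, here is a minimal numpy sketch of Transolver-style slice attention. This is illustrative only — random projections instead of learned weights, and function/variable names are mine, not the repo's:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def physics_attention(x, n_head=8, slice_num=64, rng=np.random.default_rng(0)):
    """Illustrative slice attention. x: (N, C) point features."""
    N, C = x.shape
    d = C // n_head
    # Per-head slice-assignment logits: (n_head, N, slice_num)
    W_slice = rng.standard_normal((n_head, C, slice_num)) / np.sqrt(C)
    w = softmax(np.einsum('nc,hcm->hnm', x, W_slice), axis=-1)
    # Per-head value projection: (n_head, N, d)
    W_v = rng.standard_normal((n_head, C, d)) / np.sqrt(C)
    v = np.einsum('nc,hcd->hnd', x, W_v)
    # Aggregate N points into slice_num slice tokens: (n_head, slice_num, d)
    tokens = np.einsum('hnm,hnd->hmd', w, v) / (w.sum(axis=1)[:, :, None] + 1e-6)
    # Self-attention among slice tokens: cost ~ slice_num^2, not N^2
    att = softmax(np.einsum('hmd,hkd->hmk', tokens, tokens) / np.sqrt(d), axis=-1)
    tokens = np.einsum('hmk,hkd->hmd', att, tokens)
    # Broadcast slice tokens back to points and merge heads: (N, n_head*d)
    out = np.einsum('hnm,hmd->hnd', w, tokens)
    return out.transpose(1, 0, 2).reshape(N, n_head * d)

x = np.random.default_rng(1).standard_normal((1000, 128)).astype(np.float32)
y = physics_attention(x)  # shape (1000, 128)
```

The point is that the (n_head, N, slice_num) assignment tensor and the slice_num^2 token attention both grow with this change, which is where the throughput and memory costs below come from.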
Instructions
Change: n_head=8, slice_num=64. Run with --wandb_group regime-k.
Baseline (verified frontier, 4 consecutive plateau rounds)
Results
W&B run: 1pv5ks15 | Epochs: 50 (killed at timeout mid-epoch 51) | Group: regime-k
Validation losses (last/best checkpoint, epoch 50)
mean3 surf_p (in+ood+re)/3 = 21.0 (baseline ~19.8) — worse than baseline
val/loss = 0.9056 (baseline 0.8648) — worse than baseline
Volume MAE
Peak memory: ~16.3 GB (vs ~13 GB baseline — +25% overhead)
What happened
Negative result. n_head=8 + slice_num=64 was worse than baseline on every metric and every split.
The primary issue is throughput: each epoch took ~35s (vs ~32s at baseline settings), yielding only 50 epochs in 30 minutes instead of the ~57 the baseline gets. The additional heads and slices add meaningful overhead that torch.compile does not fully recover.
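The epoch-budget arithmetic can be checked directly (a quick sketch assuming a fixed 30-minute wall clock; the per-epoch timings are approximate, so the exact counts can shift by an epoch or so):

```python
BUDGET_S = 30 * 60  # 30-minute training budget in seconds

def epochs_in_budget(sec_per_epoch):
    """Number of full epochs that fit in the budget."""
    return BUDGET_S // sec_per_epoch

baseline = epochs_in_budget(32)  # ~56 full epochs at baseline speed
regime_k = epochs_in_budget(35)  # ~51 at regime-K speed (run was killed mid-epoch 51)
```

A ~9% slowdown per epoch compounds into roughly 5-6 fewer epochs, which matters when the model is still improving at the cutoff.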
The hypothesis that finer-grained spatial decomposition helps was not validated in the 30-min budget. At epoch 50 the model was still improving (every epoch saved a new best checkpoint), suggesting it may eventually converge to similar or better performance given more epochs — but that exceeds the 30-min budget constraint.
Memory increase (+25%, 16.3 GB vs ~13 GB) is a secondary cost: more slices and heads mean larger intermediate slice-weight tensors.
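A back-of-envelope for that scaling, using the per-point slice-assignment tensor as an example (fp32 assumed; batch size and point count are illustrative, and the observed +25% is smaller than this one tensor's ratio because most other activations do not scale with heads × slices):

```python
def slice_weight_bytes(batch, n_points, n_head, slice_num, bytes_per_el=4):
    # (B, H, N, M) fp32 tensor: each point's distribution over slices, per head
    return batch * n_head * n_points * slice_num * bytes_per_el

base = slice_weight_bytes(1, 100_000, n_head=4, slice_num=32)  # baseline config
new = slice_weight_bytes(1, 100_000, n_head=8, slice_num=64)   # regime-K config
ratio = new / base  # 4.0x for this tensor alone
```

Doubling both n_head and slice_num quadruples this particular intermediate, so the overall +25% implies it is a modest fraction of total peak memory.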
Suggested follow-ups