
Regime K: n_head=8, slice_num=64 (more attention heads + slices)#1253

Closed
tcapelle wants to merge 2 commits into noam from exp-noam/regime-k

Conversation

Contributor

@tcapelle tcapelle commented Mar 19, 2026

Hypothesis

More heads (8 vs 4) with more slices (64 vs 32) gives the attention much finer-grained spatial decomposition. This was the original Transolver configuration before we traded it for speed. With torch.compile, the throughput penalty may be recoverable.

Instructions

Change: n_head=8, slice_num=64. Run with --wandb_group regime-k.
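
A minimal sketch of the run invocation. Only `--wandb_group regime-k` and the two hyperparameter values are confirmed by this PR; the entry-point script name and the flag spellings for the hyperparameters are assumptions:

```shell
# Hypothetical invocation; flag names other than --wandb_group are assumed.
python train.py \
    --n_head 8 \
    --slice_num 64 \
    --wandb_group regime-k
```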

Baseline (verified frontier, 4 consecutive plateau rounds)

  • mean3=23.2 (in=17.5, ood=14.3, re=27.7, tan=37.7)
  • 50 single-variable experiments failed to improve. This round tests MULTI-VARIABLE regime changes.

Results

W&B run: 1pv5ks15 | Epochs: 50 (killed at timeout mid-epoch 51) | Group: regime-k

Validation losses (last/best checkpoint, epoch 50)

| Split | loss | mae_surf_Ux | mae_surf_Uy | mae_surf_p |
| --- | --- | --- | --- | --- |
| val_in_dist | 0.6380 | 8.27 | 2.39 | 19.18 |
| val_ood_cond | 0.7515 | 4.94 | 1.58 | 15.11 |
| val_ood_re | 0.5802 | 4.65 | 1.39 | 28.66 |
| val_tandem_transfer | 1.6526 | 7.45 | 2.89 | 38.92 |
| mean (val/loss) | 0.9056 | | | |

  • mean3 surf_p = (in+ood+re)/3 = 21.0 (baseline ~19.8): worse than baseline
  • val/loss = 0.9056 (baseline 0.87): worse than baseline
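
Both summary numbers can be reproduced from the table above with plain arithmetic (no repo code assumed):

```python
# "mean3 surf_p": surface-pressure MAE averaged over the three non-tandem splits.
mae_surf_p = {"in": 19.18, "ood": 15.11, "re": 28.66}
mean3 = sum(mae_surf_p.values()) / 3
print(round(mean3, 1))  # 21.0

# "val/loss": mean validation loss over all four splits.
losses = [0.6380, 0.7515, 0.5802, 1.6526]
print(round(sum(losses) / 4, 4))  # 0.9056
```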

Volume MAE

| Split | mae_vol_Ux | mae_vol_Uy | mae_vol_p |
| --- | --- | --- | --- |
| val_in_dist | 1.17 | 0.37 | 20.11 |
| val_ood_cond | 0.74 | 0.29 | 12.71 |
| val_ood_re | 0.84 | 0.37 | 47.32 |
| val_tandem_transfer | 1.95 | 0.88 | 37.71 |

Peak memory: ~16.3 GB (vs ~13 GB baseline — +25% overhead)

What happened

Negative result. n_head=8 + slice_num=64 was worse than baseline on every metric and every split.

The primary issue is throughput: each epoch took ~35 s (vs ~32 s at baseline settings), yielding only 50 epochs in the 30-minute budget instead of the ~57 the baseline gets. The extra heads and slices add meaningful overhead that torch.compile does not fully recover.
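
The epoch budget follows from simple division (assuming an exactly 30-minute wall clock, which the report implies but does not state precisely):

```python
budget_s = 30 * 60                 # 30-minute training budget in seconds
epochs_this_run = budget_s // 35   # ~35 s/epoch with n_head=8, slice_num=64
epochs_baseline = budget_s // 32   # ~32 s/epoch at baseline settings
print(epochs_this_run, epochs_baseline)  # 51 56
```

This matches the run being killed mid-epoch 51 (50 completed), and 1800/32 = 56.25 rounds to the ~57 quoted above.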

The hypothesis that finer-grained spatial decomposition helps was not validated in the 30-min budget. At epoch 50 the model was still improving (every epoch saved a new best checkpoint), suggesting it may eventually converge to similar or better performance given more epochs — but that exceeds the 30-min budget constraint.

Memory increase (+25%, 16.3 GB vs ~13 GB) is a secondary cost: more slices = larger intermediate tensors.
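
As a quick check of the overhead figure (numbers taken from this report; nothing repo-specific assumed):

```python
peak_gb, baseline_gb = 16.3, 13.0
overhead_pct = (peak_gb / baseline_gb - 1) * 100
print(round(overhead_pct))  # 25
```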

Suggested follow-ups

  • Try n_head=8 alone (keep slice_num=32) — doubles attention diversity without the memory/speed cost of 64 slices
  • Try slice_num=64 alone (keep n_head=4) — test whether slice count matters more than head count
  • If longer training budget is ever available, this config would be worth re-testing since it was still converging at epoch 50

@tcapelle tcapelle added the status:wip (Student is working on it), student:senku (Assigned to senku), and noam (Noam advisor branch experiments) labels Mar 19, 2026

@tcapelle tcapelle marked this pull request as ready for review March 19, 2026 10:37
@tcapelle tcapelle added the status:review (Ready for advisor review) label and removed status:wip Mar 19, 2026
@tcapelle
Contributor Author

Review: Closed — n_head=8 + slice_num=64 too slow

val_loss=0.9056 vs new baseline 0.8648 (+4.7%). All metrics worse. The ~35s/epoch only allowed 50 epochs — not enough to converge. Meanwhile Regime H merged (slice_num=48, n_hidden=160) and achieves better results with less overhead.

Your suggestion to try n_head=8 alone (keep slice_num=48) is interesting — the extra attention diversity without 64-slice overhead could work on the new codebase. I'll assign you a fresh experiment on the updated code.

@tcapelle tcapelle closed this Mar 19, 2026
@tcapelle tcapelle deleted the exp-noam/regime-k branch March 19, 2026 10:44
@github-actions github-actions Bot locked and limited conversation to collaborators Mar 19, 2026
