
Regime D: batch_size=8, lr=2e-3 (larger batch, adjusted LR)#1246

Closed
tcapelle wants to merge 2 commits into noam from exp-noam/regime-d

Conversation

@tcapelle
Contributor

@tcapelle tcapelle commented Mar 19, 2026

Hypothesis

batch_size=8 halves the number of gradient steps per epoch but yields lower-variance gradients; the adjusted lr=2e-3 compensates for the reduced step count.

Instructions

Change: batch_size=8, lr=2e-3 (for both groups proportionally). Run with --wandb_group regime-d.

Baseline (verified frontier, 4 consecutive plateau rounds)

  • mean3=23.2 (in=17.5, ood=14.3, re=27.7, tan=37.7)
  • 50 single-variable experiments failed to improve. This round tests MULTI-VARIABLE regime changes.

Results

Status: CRASHED — CUDA OOM

| Metric         | Result       | Baseline |
|----------------|--------------|----------|
| val/loss       | N/A (OOM)    | N/A      |
| Surface MAE Ux | N/A          | N/A      |
| Surface MAE Uy | N/A          | N/A      |
| Surface MAE p  | N/A          | N/A      |
| Volume MAE     | N/A          | N/A      |
| Peak VRAM      | ~95 GB (OOM) | ~91 GB   |

W&B run: 90xgmxdx (crashed epoch 1, no validation metrics)

What happened:

batch_size=8 is infeasible on these 96 GB GPUs. The model already uses ~91 GB of VRAM at batch_size=4. Doubling to 8 requires ~2x the activation memory, which exhausted the 94.97 GB available. The crash occurred during the first training batch — never reached validation.

Error: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.32 GiB. GPU 0 has 277.88 MiB free of 94.97 GiB total.
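The arithmetic behind the failure can be sketched as a back-of-envelope projection. It assumes only activation memory scales with batch size; the actual static/activation split at batch_size=4 is not reported, so a few plausible values are swept:

```python
# Back-of-envelope VRAM projection. Assumption: parameter/optimizer memory
# is batch-independent ("static") and the remainder (activations) scales
# linearly with batch size. The true static share is unknown, so we sweep it.
CAPACITY_GB = 94.97   # reported GPU capacity
USAGE_BS4_GB = 91.0   # observed usage at batch_size=4

def projected_usage_gb(static_gb: float, batch_scale: float) -> float:
    """Project total VRAM when the batch size is scaled by `batch_scale`."""
    activations_gb = USAGE_BS4_GB - static_gb
    return static_gb + batch_scale * activations_gb

for static_gb in (50.0, 70.0, 85.0):
    bs8 = projected_usage_gb(static_gb, 2.0)   # batch_size=8
    bs6 = projected_usage_gb(static_gb, 1.5)   # batch_size=6
    print(f"static={static_gb:.0f} GB -> bs8~{bs8:.1f} GB, bs6~{bs6:.1f} GB")
```

batch_size=8 exceeds the 94.97 GB capacity whenever activations account for more than ~4 GB of the 91 GB in use, i.e. for any remotely realistic static share, which is consistent with the OOM on the first batch.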

Suggested follow-ups:

  • Gradient accumulation (accum_steps=2 with batch_size=4): simulates effective batch of 8 without the memory cost — same cleaner-gradient hypothesis but feasible
  • batch_size=6: might just barely fit; would give some gradient smoothing with less memory overhead than 8
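The first follow-up can be sketched in PyTorch as below. This is a minimal illustration, not the repo's actual training loop; the function name, the accum_steps argument, and the toy linear model are all hypothetical:

```python
# Sketch of gradient accumulation: simulate an effective batch of 8 with
# micro-batches of 4, so peak activation memory stays at the batch_size=4
# level. Names here (train_with_accumulation, accum_steps) are illustrative.
import torch
import torch.nn as nn

def train_with_accumulation(model, data, accum_steps=2):
    """data: list of (x, y) micro-batches. Returns the optimizer step count."""
    opt = torch.optim.AdamW(model.parameters(), lr=2e-3)
    loss_fn = nn.MSELoss()
    steps = 0
    opt.zero_grad()
    for i, (x, y) in enumerate(data):
        loss = loss_fn(model(x), y) / accum_steps  # scale so summed grads match the big batch
        loss.backward()                            # gradients accumulate in .grad across micro-batches
        if (i + 1) % accum_steps == 0:
            opt.step()                             # one update per effective batch of 8
            opt.zero_grad()
            steps += 1
    return steps

model = nn.Linear(16, 1)
# 4 micro-batches of size 4 -> 2 optimizer steps, each over an effective batch of 8
batches = [(torch.randn(4, 16), torch.randn(4, 1)) for _ in range(4)]
print(train_with_accumulation(model, batches, accum_steps=2))  # 2
```

Dividing the loss by accum_steps keeps the accumulated gradient equal in scale to a single large-batch gradient, so the same lr=2e-3 reasoning from the hypothesis would apply unchanged.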

@tcapelle tcapelle added status:wip Student is working on it student:nezuko Assigned to nezuko noam Noam advisor branch experiments labels Mar 19, 2026
@github-actions

github-actions Bot commented Mar 19, 2026


Thank you for your submission; we really appreciate it. Like many open-source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution. You can sign the CLA by posting a pull request comment in the format below.


I have read the CLA Document and I hereby sign the CLA


0 out of 2 committers have signed the CLA.
❌ @senpai-advisor
❌ @senpai-nezuko
senpai-advisor and senpai-nezuko do not appear to be GitHub users. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You can retrigger this bot by commenting recheck in this Pull Request. Posted by the CLA Assistant Lite bot.

@tcapelle tcapelle marked this pull request as ready for review March 19, 2026 09:14
@tcapelle tcapelle added status:review Ready for advisor review and removed status:wip Student is working on it labels Mar 19, 2026
@tcapelle
Contributor Author

Review: Closed — CUDA OOM

batch_size=8 is infeasible on 96 GB GPUs. The model already uses ~91 GB at batch_size=4; doubling to 8 causes OOM on the first training batch. This is a hard memory constraint, not something tunable.

Good suggestions from the student:

  • Gradient accumulation (accum_steps=2, batch_size=4) would simulate effective batch=8 without the memory cost. This is worth trying as a separate hypothesis.
  • batch_size=6 is unlikely to fit either given the ~91 GB baseline usage.

Closing this PR. The gradient accumulation idea will be assigned as a fresh experiment.

@tcapelle tcapelle closed this Mar 19, 2026
@tcapelle tcapelle deleted the exp-noam/regime-d branch March 19, 2026 09:30
@github-actions github-actions Bot locked and limited conversation to collaborators Mar 19, 2026