Regime D: batch_size=8, lr=2e-3 (larger batch, adjusted LR) #1246
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Review: Closed — CUDA OOM

batch_size=8 is infeasible on 96 GB GPUs. The model already uses ~91 GB at batch_size=4; doubling to 8 causes OOM on the first training batch. This is a hard memory constraint, not something tunable. Good suggestions from the student:

Closing this PR. The gradient accumulation idea will be assigned as a fresh experiment.
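A minimal sketch of the gradient accumulation idea mentioned above, assuming a standard PyTorch training loop (the model, data, and loss below are placeholders, not from the actual training script): accumulating two micro-batches of 4 with the loss scaled by 1/2 reproduces the mean gradient of a single batch of 8, without ever holding 8 samples' activations in memory at once.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=2e-3)
x = torch.randn(8, 4)
y = torch.randn(8, 1)

# Reference: gradients from one full batch of 8.
opt.zero_grad(set_to_none=True)
torch.nn.functional.mse_loss(model(x), y).backward()
full_grad = model.weight.grad.clone()

# Accumulation: two micro-batches of 4, each loss scaled by 1/accum_steps
# so the summed gradients match the full-batch mean gradient.
accum_steps = 2
opt.zero_grad(set_to_none=True)
for chunk_x, chunk_y in zip(x.split(4), y.split(4)):
    loss = torch.nn.functional.mse_loss(model(chunk_x), chunk_y) / accum_steps
    loss.backward()  # grads accumulate into .grad across micro-batches

assert torch.allclose(full_grad, model.weight.grad, atol=1e-6)
```

In the real loop, `opt.step()` and `opt.zero_grad()` would run once per `accum_steps` micro-batches, keeping peak activation memory at the batch_size=4 level.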
Hypothesis
Doubling to batch_size=8 halves the number of gradient steps per epoch but yields lower-variance (cleaner) gradient estimates. The adjusted LR compensates for the change in step count.
Instructions
Change: batch_size=8, lr=2e-3 (for both groups proportionally). Run with --wandb_group regime-d.

Baseline: verified frontier, 4 consecutive plateau rounds.
Results
Status: CRASHED — CUDA OOM
W&B run: 90xgmxdx (crashed epoch 1, no validation metrics)
What happened:
batch_size=8 is infeasible on these 96 GB GPUs. The model already uses ~91 GB of VRAM at batch_size=4. Doubling to 8 requires roughly 2x the activation memory, which exhausted the 94.97 GiB available. The crash occurred during the first training batch, before validation was ever reached.
Error:

```
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.32 GiB. GPU 0 has 277.88 MiB free of 94.97 GiB total.
```

Suggested follow-ups: