Regime D: batch_size=8, lr=2e-3 (larger batch, adjusted LR) #1246
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Review: Closed — CUDA OOM

batch_size=8 is infeasible on 96 GB GPUs. The model already uses ~91 GB at batch_size=4; doubling to 8 causes OOM on the first training batch. This is a hard memory constraint, not something tunable. Good suggestions from the student:

Closing this PR. The gradient accumulation idea will be assigned as a fresh experiment.
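A minimal sketch of the gradient accumulation idea mentioned above, assuming a standard PyTorch training loop (the model, data, and loss below are placeholders, not from the actual training script): accumulating two micro-batches of 4 with the loss scaled by 1/2 reproduces the mean gradient of a single batch of 8, without ever holding 8 samples' activations in memory at once.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=2e-3)
x = torch.randn(8, 4)
y = torch.randn(8, 1)

# Reference: gradients from one full batch of 8.
opt.zero_grad(set_to_none=True)
torch.nn.functional.mse_loss(model(x), y).backward()
full_grad = model.weight.grad.clone()

# Accumulation: two micro-batches of 4, each loss scaled by 1/accum_steps
# so the summed gradients match the full-batch mean gradient.
accum_steps = 2
opt.zero_grad(set_to_none=True)
for chunk_x, chunk_y in zip(x.split(4), y.split(4)):
    loss = torch.nn.functional.mse_loss(model(chunk_x), chunk_y) / accum_steps
    loss.backward()  # grads accumulate into .grad across micro-batches

assert torch.allclose(full_grad, model.weight.grad, atol=1e-6)
```

In the real loop, `opt.step()` and `opt.zero_grad()` would run once per `accum_steps` micro-batches, keeping peak activation memory at the batch_size=4 level.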
Hypothesis
Doubling to batch_size=8 halves the number of gradient steps per epoch but yields lower-variance (cleaner) gradient estimates. The adjusted LR compensates for the change in step count.
Instructions
Change: batch_size=8, lr=2e-3 (for both groups proportionally). Run with --wandb_group regime-d.

Baseline: verified frontier, 4 consecutive plateau rounds.
Results
Status: CRASHED — CUDA OOM
W&B run: 90xgmxdx (crashed epoch 1, no validation metrics)
What happened:
batch_size=8 is infeasible on these 96 GB GPUs. The model already uses ~91 GB of VRAM at batch_size=4. Doubling to 8 requires roughly 2x the activation memory, which exhausted the 94.97 GiB available. The crash occurred during the first training batch, before validation was ever reached.
Error:

```
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.32 GiB. GPU 0 has 277.88 MiB free of 94.97 GiB total.
```

Suggested follow-ups: