OOM Issues — Unexpected Single-GPU Batch Size Memory Usage & Multi-GPU OOM Errors #43

@yiping-tks

Description

Hello, I encountered memory-related issues while training with the train_lotus_g_depth.sh script and would appreciate your guidance:

  1. Single-GPU Training:
  • When setting BATCH_SIZE=4 in train_lotus_g_depth.sh, GPU memory usage is about 22 GB and training runs normally.
  • However, with BATCH_SIZE=1, memory usage increases to around 23 GB, which seems counterintuitive.
  • The accelerate config used is as follows:
```yaml
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: 'NO'
downcast_bf16: 'no'
gpu_ids: '0'
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```
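One thing worth checking before treating this as an anomaly: `nvidia-smi` reports the memory *reserved* by PyTorch's caching allocator (plus fixed costs such as model weights and optimizer states, which don't scale with batch size), not the memory held by live tensors. A small sketch for distinguishing the two, assuming PyTorch is installed and a CUDA device may or may not be present:

```python
import torch


def gpu_memory_stats() -> dict:
    """Return live vs cached GPU memory in GiB (empty dict if no CUDA device)."""
    if not torch.cuda.is_available():
        return {}
    return {
        # Memory occupied by live tensors right now.
        "allocated_gib": torch.cuda.memory_allocated() / 2**30,
        # Memory held by the caching allocator -- this is roughly
        # what nvidia-smi shows, and it does not shrink between steps.
        "reserved_gib": torch.cuda.memory_reserved() / 2**30,
        # High-water mark of live-tensor memory since the last reset.
        "peak_allocated_gib": torch.cuda.max_memory_allocated() / 2**30,
    }
```

Logging these values at BATCH_SIZE=1 and BATCH_SIZE=4 would show whether the 22-23 GB figure is dominated by weights/optimizer state and allocator cache rather than batch-dependent activations.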
  2. Multi-GPU Training:
  • When configuring multi-GPU training, setting BATCH_SIZE to either 1 or 4 leads to out-of-memory (OOM) errors.
  • The accelerate config used is as follows:
```yaml
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: '0,1,2,3'
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```
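For context on the multi-GPU OOM: under DDP each process loads the full model, its optimizer states, and gradient buckets for all-reduce, so per-GPU memory is somewhat higher than in the single-GPU run even at the same per-device batch size. With `mixed_precision: 'no'` everything also runs in fp32. A common first mitigation is bf16 mixed precision; a sketch of the relevant fields (same config shape as above, values are a suggestion, not a tested Lotus config):

```yaml
distributed_type: MULTI_GPU
mixed_precision: 'bf16'   # roughly halves activation/gradient memory vs fp32
num_processes: 4
gpu_ids: '0,1,2,3'
```

The same can be tried without editing the config file via `accelerate launch --mixed_precision bf16 ...`.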

What could explain the counterintuitive memory usage in single-GPU training? And could you recommend a working multi-GPU accelerate config, or advise on settings to avoid the OOM errors?
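One more detail that may matter when comparing the two setups: in accelerate, the script's batch size is typically *per process*, so launching 4 processes multiplies the effective (global) batch. A trivial illustration (the variable names are hypothetical, for arithmetic only):

```python
# Hypothetical illustration: BATCH_SIZE in the script is per device/process,
# so 4 processes yield a 4x larger effective (global) batch.
per_device_batch = 4
num_processes = 4
gradient_accumulation_steps = 1

effective_batch = per_device_batch * num_processes * gradient_accumulation_steps
print(effective_batch)  # 16
```

If matching the single-GPU effective batch is the goal, BATCH_SIZE=1 with 4 processes already equals BATCH_SIZE=4 on one GPU.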

Thank you very much!
