Hello, I ran into memory-related issues while training with the `train_lotus_g_depth.sh` script and would appreciate your guidance:
- Single-GPU Training:
  - With `BATCH_SIZE=4` in `train_lotus_g_depth.sh`, GPU memory usage is about 22 GB and training runs normally.
  - However, with `BATCH_SIZE=1`, memory usage *increases* to around 23 GB, which seems counterintuitive.
  - The `accelerate` config used is as follows:
```yaml
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: 'NO'
downcast_bf16: 'no'
gpu_ids: '0'
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```
- Multi-GPU Training:
  - With `BATCH_SIZE` set to either 1 or 4, training fails with out-of-memory (OOM) errors.
  - The `accelerate` config used is as follows:
```yaml
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: '0,1,2,3'
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```
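For what it's worth, I measured the numbers above with `nvidia-smi`, which reports the allocator's *reserved* (cached) memory rather than what is actually allocated; here is a small helper I can add to the training loop to log the peak *allocated* memory instead (a sketch using standard `torch.cuda` calls; it returns 0 on CPU-only machines):

```python
import torch


def log_peak_memory(tag: str) -> float:
    """Print and return peak allocated GPU memory in GB, then reset the counter."""
    if not torch.cuda.is_available():
        return 0.0  # CPU-only machine: nothing to measure
    peak_gb = torch.cuda.max_memory_allocated() / 1024 ** 3
    print(f"[{tag}] peak allocated: {peak_gb:.2f} GB")
    torch.cuda.reset_peak_memory_stats()
    return peak_gb


# e.g. call log_peak_memory(f"step {step}") once per training step
```

If the allocated peak with `BATCH_SIZE=1` is actually lower than with `BATCH_SIZE=4`, that would suggest the extra ~1 GB I'm seeing is allocator caching/fragmentation rather than a real increase in model memory.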
What could cause this memory-usage anomaly in single-GPU training? And could you recommend an `accelerate` config (or other settings) that would make multi-GPU training fit in memory?
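In case it helps to anchor suggestions: one variant I plan to try is enabling mixed precision in the multi-GPU config, which should reduce activation memory (assuming my GPUs support bf16; all other keys unchanged from the config above):

```yaml
distributed_type: MULTI_GPU
mixed_precision: 'bf16'
num_processes: 4
gpu_ids: '0,1,2,3'
```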
Thank you very much!