OOM Issues — Unexpected Single-GPU Batch Size Memory Usage & Multi-GPU OOM Errors #43

@yiping-tks

Description

Hello, I encountered memory-related issues while training with the train_lotus_g_depth.sh script and would appreciate your guidance:

  1. Single-GPU Training:
  • When setting BATCH_SIZE=4 in train_lotus_g_depth.sh, GPU memory usage is about 22 GB and training runs normally.
  • However, with BATCH_SIZE=1, memory usage increases to around 23 GB, which seems counterintuitive.
  • The accelerate config used is as follows:
```yaml
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: 'NO'
downcast_bf16: 'no'
gpu_ids: '0'
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```
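One thing worth checking before treating this as an anomaly: `nvidia-smi` reports the memory *reserved* by PyTorch's caching allocator (plus fixed costs such as model weights and optimizer states, which don't scale with batch size), not the memory held by live tensors. A small sketch for distinguishing the two, assuming PyTorch is installed and a CUDA device may or may not be present:

```python
import torch


def gpu_memory_stats() -> dict:
    """Return live vs cached GPU memory in GiB (empty dict if no CUDA device)."""
    if not torch.cuda.is_available():
        return {}
    return {
        # Memory occupied by live tensors right now.
        "allocated_gib": torch.cuda.memory_allocated() / 2**30,
        # Memory held by the caching allocator -- this is roughly
        # what nvidia-smi shows, and it does not shrink between steps.
        "reserved_gib": torch.cuda.memory_reserved() / 2**30,
        # High-water mark of live-tensor memory since the last reset.
        "peak_allocated_gib": torch.cuda.max_memory_allocated() / 2**30,
    }
```

Logging these values at BATCH_SIZE=1 and BATCH_SIZE=4 would show whether the 22-23 GB figure is dominated by weights/optimizer state and allocator cache rather than batch-dependent activations.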
  2. Multi-GPU Training:
  • When configuring multi-GPU training, setting BATCH_SIZE to either 1 or 4 leads to out-of-memory (OOM) errors.
  • The accelerate config used is as follows:
```yaml
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: '0,1,2,3'
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```
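For context on the multi-GPU OOM: under DDP each process loads the full model, its optimizer states, and gradient buckets for all-reduce, so per-GPU memory is somewhat higher than in the single-GPU run even at the same per-device batch size. With `mixed_precision: 'no'` everything also runs in fp32. A common first mitigation is bf16 mixed precision; a sketch of the relevant fields (same config shape as above, values are a suggestion, not a tested Lotus config):

```yaml
distributed_type: MULTI_GPU
mixed_precision: 'bf16'   # roughly halves activation/gradient memory vs fp32
num_processes: 4
gpu_ids: '0,1,2,3'
```

The same can be tried without editing the config file via `accelerate launch --mixed_precision bf16 ...`.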

What could explain the counterintuitive memory usage in single-GPU training? And could you recommend a working multi-GPU accelerate config, or advise on settings to avoid the OOM errors?
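One more detail that may matter when comparing the two setups: in accelerate, the script's batch size is typically *per process*, so launching 4 processes multiplies the effective (global) batch. A trivial illustration (the variable names are hypothetical, for arithmetic only):

```python
# Hypothetical illustration: BATCH_SIZE in the script is per device/process,
# so 4 processes yield a 4x larger effective (global) batch.
per_device_batch = 4
num_processes = 4
gradient_accumulation_steps = 1

effective_batch = per_device_batch * num_processes * gradient_accumulation_steps
print(effective_batch)  # 16
```

If matching the single-GPU effective batch is the goal, BATCH_SIZE=1 with 4 processes already equals BATCH_SIZE=4 on one GPU.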

Thank you very much!
