Hello,
I'm currently trying to reproduce your ResNet-50 baseline on a single RTX 4090 GPU.
To match the effective batch size of 48 (6 × 8) that you used,
I'm using a micro-batch size of 6 with gradient accumulation over 8 steps.
Accordingly, I've also multiplied the iteration count by 8, so the total number of optimizer updates stays the same as in your setup.
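For reference, this is roughly how my accumulation loop is structured (a minimal PyTorch sketch; `train_loader` stands in for my actual ImageNet loader, and the SGD settings are the standard ResNet-50 recipe, not necessarily the ones from your repo):

```python
import torch
import torch.nn as nn
import torchvision

ACCUM_STEPS = 8   # micro-batches accumulated per optimizer step
MICRO_BATCH = 6   # per-micro-batch size, so effective batch = 6 * 8 = 48

model = torchvision.models.resnet50()
criterion = nn.CrossEntropyLoss()
# Hyperparameters below are the common ResNet-50 baseline values, used here
# only as placeholders for whatever the repo actually specifies.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)

optimizer.zero_grad()
for step, (images, targets) in enumerate(train_loader):  # train_loader: my loader (batch size 6)
    # Scale the loss so the accumulated gradient matches a true batch of 48
    loss = criterion(model(images), targets) / ACCUM_STEPS
    loss.backward()
    if (step + 1) % ACCUM_STEPS == 0:
        optimizer.step()
        optimizer.zero_grad()
```

(I'm dividing the loss by the number of accumulation steps so the accumulated gradient has the same scale as a single large batch.)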
Gradient accumulation got rid of the NaN/Inf values I was initially seeing in grad_norm and the loss,
but I'm still hitting gradient explosion around epoch 10 of 100.
I've tried tuning hyperparameters such as
the gradient-clipping threshold (max grad_norm), the learning rate, and weight_decay,
but the gradients still explode.
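In case the clipping placement matters: I clip once per optimizer step, after the full accumulation window closes, so the norm is measured on the effective-batch gradient. Inside the loop above it looks roughly like this (max_norm=1.0 is just one of the values I tried):

```python
if (step + 1) % ACCUM_STEPS == 0:
    # Clip the accumulated (effective-batch) gradient before the update;
    # clip_grad_norm_ also returns the pre-clipping total norm for logging
    total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    optimizer.zero_grad()
```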
Has anyone else encountered this problem?
I would greatly appreciate any advice on suitable hyperparameter settings for a single RTX 4090 with gradient accumulation.
Thank you.