[Reproducing baseline performance] - 1x RTX4090, Gradient Accumulation #121

@sonjt00

Description

Hello,

I'm currently trying to reproduce your baseline performance (ResNet-50) using a single RTX 4090 GPU.

To match the effective batch size of 48 (6 × 8) that you used, I'm running a batch size of 6 with gradient accumulation over 8 steps, and I've increased the number of iterations by 8× accordingly.
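To be explicit about the accumulation arithmetic: with a mean-reduced loss, dividing each micro-batch loss by the number of accumulation steps makes 8 micro-batches of 6 equal one batch of 48. A minimal sketch (hypothetical, not the repo's actual training loop) using a toy linear-regression gradient:

```python
ACCUM_STEPS = 8
MICRO_BATCH = 6

def grad_mse(w, xs, ys):
    """Mean gradient of 0.5 * (w*x - y)^2 with respect to w over a batch."""
    return sum((w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

w = 0.3
xs = [0.1 * i for i in range(ACCUM_STEPS * MICRO_BATCH)]
ys = [2.0 * x + 0.05 for x in xs]

# Full-batch gradient (effective batch size 48).
full_grad = grad_mse(w, xs, ys)

# Accumulated gradient: each micro-batch's mean gradient is divided by
# ACCUM_STEPS before summing (the analogue of `loss / accum_steps` before
# backward), so the sum equals the full-batch mean gradient.
accum_grad = 0.0
for s in range(ACCUM_STEPS):
    mb_x = xs[s * MICRO_BATCH:(s + 1) * MICRO_BATCH]
    mb_y = ys[s * MICRO_BATCH:(s + 1) * MICRO_BATCH]
    accum_grad += grad_mse(w, mb_x, mb_y) / ACCUM_STEPS

assert abs(full_grad - accum_grad) < 1e-12
```

This equivalence only holds exactly when the loss is mean-reduced and every micro-batch has the same size; batch-statistics layers like BatchNorm still see batches of 6, not 48, which is one known source of divergence when reproducing results via accumulation.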

While gradient accumulation has helped me avoid "NaN" and "Infinity" issues in the grad_norm and loss values, I'm still hitting gradient explosion around epoch 10 of 100.

I've attempted to tune hyperparameters such as grad_norm, learning rate, and weight_decay, but the gradient explosion issue persists.
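For reference, the grad_norm setting I've been tuning corresponds to clipping by global L2 norm. A stdlib-only sketch of that operation (hypothetical helper name; in PyTorch this is what `torch.nn.utils.clip_grad_norm_` does over all parameters):

```python
import math

def clip_grad_norm(grads, max_norm):
    """Return grads rescaled so their global L2 norm is at most max_norm,
    plus the pre-clipping norm (useful for logging explosion onset)."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads, total_norm

# A gradient vector of norm 5.0 clipped down to norm 1.0.
clipped, norm_before = clip_grad_norm([3.0, 4.0], max_norm=1.0)
assert abs(norm_before - 5.0) < 1e-12
assert abs(math.sqrt(sum(g * g for g in clipped)) - 1.0) < 1e-12
```

When using accumulation, clipping should be applied once per optimizer step, after all 8 backward passes have accumulated, not per micro-batch; clipping micro-batch gradients individually changes the effective update.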

Has anyone else encountered this problem?

I would greatly appreciate any advice on suitable hyperparameter settings for a single RTX 4090 with gradient accumulation.

Thank you.
