Hello,
I'm currently trying to reproduce your ResNet-50 baseline on a single RTX 4090 GPU.
To match the effective batch size of 48 (6 × 8) that you used,
I'm using a micro-batch size of 6 with gradient accumulation over 8 steps.
Accordingly, I've also multiplied the iteration count by 8, so the total number of optimizer updates stays the same as in your setup.
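For reference, this is roughly how my accumulation loop is structured (a minimal PyTorch sketch; `train_loader` stands in for my actual ImageNet loader, and the SGD settings are the standard ResNet-50 recipe, not necessarily the ones from your repo):

```python
import torch
import torch.nn as nn
import torchvision

ACCUM_STEPS = 8   # micro-batches accumulated per optimizer step
MICRO_BATCH = 6   # per-micro-batch size, so effective batch = 6 * 8 = 48

model = torchvision.models.resnet50()
criterion = nn.CrossEntropyLoss()
# Hyperparameters below are the common ResNet-50 baseline values, used here
# only as placeholders for whatever the repo actually specifies.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)

optimizer.zero_grad()
for step, (images, targets) in enumerate(train_loader):  # train_loader: my loader (batch size 6)
    # Scale the loss so the accumulated gradient matches a true batch of 48
    loss = criterion(model(images), targets) / ACCUM_STEPS
    loss.backward()
    if (step + 1) % ACCUM_STEPS == 0:
        optimizer.step()
        optimizer.zero_grad()
```

(I'm dividing the loss by the number of accumulation steps so the accumulated gradient has the same scale as a single large batch.)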
Gradient accumulation got rid of the NaN/Inf values I was initially seeing in grad_norm and the loss,
but I'm still hitting gradient explosion around epoch 10 of 100.
I've tried tuning hyperparameters such as
the gradient-clipping threshold (max grad_norm), the learning rate, and weight_decay,
but the gradients still explode.
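In case the clipping placement matters: I clip once per optimizer step, after the full accumulation window closes, so the norm is measured on the effective-batch gradient. Inside the loop above it looks roughly like this (max_norm=1.0 is just one of the values I tried):

```python
if (step + 1) % ACCUM_STEPS == 0:
    # Clip the accumulated (effective-batch) gradient before the update;
    # clip_grad_norm_ also returns the pre-clipping total norm for logging
    total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    optimizer.zero_grad()
```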
Has anyone else encountered this problem?
I would greatly appreciate any advice on suitable hyperparameter settings for a single RTX 4090 with gradient accumulation.
Thank you.