CUDA out of memory during training

I've modified your training code to obtain the optimal rotation matrix, changing only the groupsize for weights and activations to 32. However, I consistently hit a CUDA out of memory error during the forward pass. My guess is that this is caused by the large amount of activation memory the forward pass requires. I've also included your FSDP config file. Could you explain why this happens?
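To make my guess about activation memory more concrete, here is a rough back-of-the-envelope sketch. It assumes activations are stored in 16-bit precision and that each transformer layer keeps a fixed number of hidden-sized tensors alive for the backward pass. The function name, the `tensors_per_layer` count, and the example shapes are all hypothetical and are not taken from your code or from my actual run.

```python
def activation_bytes(batch, seq_len, hidden, n_layers,
                     tensors_per_layer=16, bytes_per_elem=2):
    """Rough estimate of the activation memory kept for the backward pass.

    tensors_per_layer is a hypothetical count of hidden-sized tensors that
    each transformer layer saves; the real number depends on the model,
    the attention implementation, and whether activation checkpointing
    is enabled.
    """
    # Size of one (batch, seq_len, hidden) tensor in bytes.
    per_tensor = batch * seq_len * hidden * bytes_per_elem
    return per_tensor * tensors_per_layer * n_layers


# Illustrative values only, not my actual configuration.
est = activation_bytes(batch=1, seq_len=2048, hidden=4096, n_layers=32)
print(f"estimated activations: {est / 2**30:.1f} GiB")
```

As far as I understand, FSDP shards parameters, gradients, and optimizer state across ranks, but each rank still holds the full activations for its own micro-batch, so an estimate like this would apply per GPU.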