With quantized_model_init (DelayedScaling), FusedAdam, and FSDP2, we allocate a transpose tensor in _create_transpose (transformer_engine/pytorch/tensor/storage/float8_tensor_storage.py:204) during the backward pass; these transposes accumulate across all model layers and are only freed at the subsequent forward pass.
Slack thread: https://nvidia.slack.com/archives/C038G319G6R/p1771868674747129?thread_ts=1771868398.818749&cid=C038G319G6R
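A minimal repro sketch of the setup described above, not a verified reproducer: the quantized_model_init signature, the TE FusedAdam import path, and the fully_shard import location (torch >= 2.6) are assumptions; model shape, layer count, and the memory print are illustrative only.

```python
import os
import torch
import torch.distributed as dist
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling
from transformer_engine.pytorch.optimizers import FusedAdam  # assumed import path
from torch.distributed.fsdp import fully_shard  # FSDP2; torch >= 2.6 assumed

# Launch with torchrun; one GPU per rank.
dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

recipe = DelayedScaling()
num_layers, hidden = 12, 4096

# FP8 primary weights; exact quantized_model_init signature is an assumption.
with te.quantized_model_init(recipe=recipe):
    model = torch.nn.Sequential(
        *[te.Linear(hidden, hidden) for _ in range(num_layers)]
    ).cuda()

# FSDP2: shard each layer, then the root module.
for layer in model:
    fully_shard(layer)
fully_shard(model)

opt = FusedAdam(model.parameters(), lr=1e-4)

x = torch.randn(8, hidden, device="cuda")
for step in range(3):
    with te.fp8_autocast(enabled=True, fp8_recipe=recipe):
        y = model(x)
    # During this backward, each layer's Float8 weight allocates a transpose
    # (_create_transpose); the transposes pile up across all layers and are
    # only released around the next forward.
    y.float().sum().backward()
    opt.step()
    opt.zero_grad()
    torch.cuda.synchronize()
    print(step, torch.cuda.memory_allocated() // 2**20, "MiB")

dist.destroy_process_group()
```

The per-step memory_allocated print is just a way to watch the peak grow with num_layers; a memory snapshot (torch.cuda.memory._record_memory_history) taken across backward would show the individual _create_transpose allocations.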