This is likely the optimal way to run FSDP2 with TE layers, but it would be great to have an example that combines FusedAdam with master_weights=True (https://github.com/NVIDIA/TransformerEngine/blob/main/transformer_engine/pytorch/optimizers/fused_adam.py#L75-L82), quantized_model_init, and fully_shard.
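A rough, untested sketch of what such an example might look like. It assumes a recent TransformerEngine that exposes `quantized_model_init` and a PyTorch version where FSDP2's `fully_shard` is importable from `torch.distributed.fsdp`, plus an already-initialized process group and CUDA device; the layer sizes are placeholders, and whether `FusedAdam` handles the resulting DTensor parameters cleanly is exactly what an official example would need to confirm:

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.pytorch import quantized_model_init
from transformer_engine.pytorch.optimizers import FusedAdam
from torch.distributed.fsdp import fully_shard

# 1) Build the model with low-precision weights from the start, so
#    full-precision copies are never materialized on device.
with quantized_model_init():
    model = te.TransformerLayer(          # sizes are illustrative only
        hidden_size=4096,
        ffn_hidden_size=16384,
        num_attention_heads=32,
    )

# 2) Shard the parameters with FSDP2. The optimizer must be created
#    afterwards so it sees the sharded (DTensor) parameters.
fully_shard(model)

# 3) FusedAdam with master_weights=True keeps FP32 master copies
#    inside the optimizer, compensating for the reduced-precision
#    model parameters created above.
optimizer = FusedAdam(
    model.parameters(),
    lr=1e-4,
    master_weights=True,
)
```

The interesting question an example would settle is the interaction order: quantized init first, then sharding, then optimizer construction, and how the optimizer's master weights line up with each rank's local shards.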