This is likely the optimal way to run FSDP2 with TE layers, but it would be great to have an example that combines FusedAdam with master_weights=True (https://github.com/NVIDIA/TransformerEngine/blob/main/transformer_engine/pytorch/optimizers/fused_adam.py#L75-L82), quantized_model_init, and fully_shard.
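A rough, untested sketch of what such an example might look like. It assumes a recent TransformerEngine that exposes `quantized_model_init` and a PyTorch version where FSDP2's `fully_shard` is importable from `torch.distributed.fsdp`, plus an already-initialized process group and CUDA device; the layer sizes are placeholders, and whether `FusedAdam` handles the resulting DTensor parameters cleanly is exactly what an official example would need to confirm:

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.pytorch import quantized_model_init
from transformer_engine.pytorch.optimizers import FusedAdam
from torch.distributed.fsdp import fully_shard

# 1) Build the model with low-precision weights from the start, so
#    full-precision copies are never materialized on device.
with quantized_model_init():
    model = te.TransformerLayer(          # sizes are illustrative only
        hidden_size=4096,
        ffn_hidden_size=16384,
        num_attention_heads=32,
    )

# 2) Shard the parameters with FSDP2. The optimizer must be created
#    afterwards so it sees the sharded (DTensor) parameters.
fully_shard(model)

# 3) FusedAdam with master_weights=True keeps FP32 master copies
#    inside the optimizer, compensating for the reduced-precision
#    model parameters created above.
optimizer = FusedAdam(
    model.parameters(),
    lr=1e-4,
    master_weights=True,
)
```

The interesting question an example would settle is the interaction order: quantized init first, then sharding, then optimizer construction, and how the optimizer's master weights line up with each rank's local shards.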