te.autocast() and fully_shard don't deallocate quantized weights until the backward pass #2681

@pstjohn

Description

When running a model with FSDP2 and te.autocast(), quantized weights are created during the forward pass but are not deallocated until the backward pass. This effectively undoes the memory savings of FSDP2: each rank ends up accumulating the entire model's worth of quantized weights.

https://github.com/NVIDIA/bionemo-framework/blob/main/bionemo-recipes/recipes/llama3_native_te/train_fsdp2_cp.py
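The reported pattern can be illustrated with a toy, CPU-only sketch. This is not TE's actual implementation: `QuantCachingLinear` is a hypothetical module that caches a "quantized" weight copy in forward and only drops it when backward reaches that layer, so after a full forward pass every layer still holds its copy at once.

```python
import torch


class QuantCachingLinear(torch.nn.Module):
    """Toy stand-in for a quantizing layer (hypothetical; not TE code).

    Caches a quantized weight copy during forward and frees it only when
    the backward pass reaches this layer's output.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.randn(dim, dim))
        self.quantized_weight = None  # lives from forward until backward

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # "Quantize" via a crude int8 round-trip (values only illustrative).
        self.quantized_weight = self.weight.detach().to(torch.int8).to(x.dtype)
        out = x @ self.quantized_weight.t()
        if out.requires_grad:
            # Drop the cached copy only once grad flows back through `out`.
            out.register_hook(lambda grad: (self._drop_cache(), grad)[1])
        return out

    def _drop_cache(self):
        self.quantized_weight = None


layers = [QuantCachingLinear(8) for _ in range(4)]
x = torch.randn(2, 8, requires_grad=True)
h = x
for layer in layers:
    h = layer(h)

# After the full forward, every layer still holds its quantized copy:
cached_after_fwd = sum(l.quantized_weight is not None for l in layers)
h.sum().backward()
# After backward, the cached copies have been released:
cached_after_bwd = sum(l.quantized_weight is not None for l in layers)
print(cached_after_fwd, cached_after_bwd)  # → 4 0
```

In the FSDP2 case this matters because fully_shard reshards the full-precision parameters after each layer's forward, but per-layer quantized copies held like this persist until backward, so the peak memory grows with model depth rather than staying at one layer's worth.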

https://nvidia.slack.com/archives/C03V462SAMS/p1771004012335309

Labels

bug (Something isn't working)
