Conversation
- …g and refined test case generation for various configurations. Cleaned up unused variables and improved code readability in the FSDPAGTensor class by removing unnecessary parameters.
- … FusedAdam. Added a debug print for DTensor in MultiTensorApply.
- … tolerances for tensor comparisons. Updated test logic to accommodate new tolerance parameters for improved accuracy in floating-point comparisons.
- …l differences in gradient calculations. Cleaned up unused debug print statements in MultiTensorApply and ensured a proper newline at the end of the FSDPAGTensor serialization method.
```python
if not isinstance(quantizer, MXFP8Quantizer) and not self._keep_fp8_weight_transpose_cache:
    quantizer = module.quantizers["scaling_fwd"][self._fp8_meta_index]
    if not isinstance(quantizer, MXFP8Quantizer):
        quantizer.set_usage(columnwise=False)
```
For FSDP2 with FP8, keep_fp8_weight_transpose_cache should be False. Caching the transposed weight would imply an all-gather of the transposed tensor as well, increasing memory and communication and negating the advantages of FSDP2’s sharded parameter layout.
```diff
     data = torch.zeros_like(param, dtype=torch.int16)
 else:
-    data = torch.empty(param.shape, dtype=dtype, device=param.device)
+    data = torch.empty_like(param, dtype=dtype)
```
When using FSDP2, parameters are DTensors, but `torch.zeros()` and `torch.empty()` create regular PyTorch Tensors. This was causing:

```
[rank1]: RuntimeError: aten.copy_.default: got mixed torch.Tensor and DTensor, need to convert all torch.Tensor to DTensor before calling distributed operators!
[rank7]:   File "/workspace/TransformerEngine/transformer_engine/pytorch/optimizers/fused_adam.py", line 422, in initialize_state
[rank7]:     self.set_scaled_state(param, "master_param", param.clone().detach().float())
[rank7]:   File "/workspace/TransformerEngine/transformer_engine/pytorch/optimizers/fused_adam.py", line 363, in set_scaled_state
[rank7]:     state[state_name].copy_(unscaled_state)
```

Fix:
Keep the optimizer state consistent with the parameter type: when parameters are DTensors, the state should be DTensors as well. Using `torch.empty_like(param, ...)` (and the same idea for the other state buffers) creates the state as a DTensor with the same placement as `param`, so both sides of `copy_` are DTensors and the error is avoided.
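The principle behind the fix can be sketched in plain Python, without torch or a distributed setup. Here a hypothetical `Wrapped` class stands in for DTensor and a plain list stands in for a regular Tensor; the names `empty`/`empty_like` only mimic the shape of the torch constructors:

```python
class Wrapped:
    """Toy stand-in for a distributed tensor wrapper such as DTensor."""
    def __init__(self, data):
        self.data = list(data)

    def copy_(self, other):
        # Like aten.copy_: refuse to mix wrapped and plain operands.
        if not isinstance(other, Wrapped):
            raise RuntimeError("got mixed plain and Wrapped operands")
        self.data = list(other.data)
        return self

def empty(n):
    # Analogue of torch.empty(shape, ...): always returns a plain container,
    # regardless of what type the parameter is.
    return [0] * n

def empty_like(ref):
    # Analogue of torch.empty_like(param, ...): preserves the wrapper type.
    if isinstance(ref, Wrapped):
        return Wrapped([0] * len(ref.data))
    return [0] * len(ref)

param = Wrapped([1.0, 2.0, 3.0])

# Buggy pattern: state built with empty() is plain, so copy_ raises.
try:
    param.copy_(empty(3))
except RuntimeError as e:
    print(e)  # got mixed plain and Wrapped operands

# Fixed pattern: empty_like(param) keeps the wrapper type, so copy_ succeeds.
state = empty_like(param)
state.copy_(param)
print(state.data)
```

This is why `empty_like`-style constructors are the safe default whenever the parameter may be a tensor subclass.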
Is this a cherry-pick of the upstream fix?
Upstream fixes this in TE v2.12, along with a few other fixes:
NVIDIA/TransformerEngine@fe8fad5#diff-0801a8d92a56d458946da1439b62e0add1613b7da83d31bc218a852b6b9e42b1
This wasn't cherry-picked.
…by adding a newline character after the pass statement in the test_dummy function.
```python
# Zero the parameter gradients
optimizer.zero_grad()
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
```
Does `with te.fp8_autocast(enabled=args.fp8_autocast, ...)` do the same?
It does, but since `te.fp8_autocast` was replaced by `te.autocast` in TE v2.10, I've made the change to stay consistent.
So will `with te.autocast(enabled=args.fp8_autocast, recipe=...)` do the same as the if/else?
Yes, it should. I'll make the changes.
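The refactor being discussed — collapsing an if/else around the forward pass into a single context manager that takes an `enabled` flag — can be sketched generically in plain Python. The `autocast` and `forward` names here are hypothetical stand-ins, not the TransformerEngine API:

```python
from contextlib import contextmanager

# Hypothetical global mode flag, mimicking an autocast-style context manager
# that accepts enabled=... (as te.autocast does in the discussion above).
ACTIVE = {"autocast": False}

@contextmanager
def autocast(enabled=True):
    prev = ACTIVE["autocast"]
    ACTIVE["autocast"] = enabled
    try:
        yield
    finally:
        ACTIVE["autocast"] = prev

def forward():
    return "fp8" if ACTIVE["autocast"] else "fp32"

# Before: branch on the flag at every call site.
def run_with_branch(use_fp8):
    if use_fp8:
        with autocast(enabled=True):
            return forward()
    else:
        return forward()

# After: one call site; the flag is passed straight through to the context.
def run_with_flag(use_fp8):
    with autocast(enabled=use_fp8):
        return forward()

for flag in (True, False):
    assert run_with_branch(flag) == run_with_flag(flag)
print("equivalent")
```

Because `enabled=False` makes the context a no-op, the single `with` statement is behaviorally equivalent to the if/else while keeping one code path.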
```diff
 assert len(l1) == len(l2), "Unequal number of outputs."
 for i, (t1, t2) in enumerate(zip(l1, l2)):
-    result = torch.allclose(t1, t2, atol=0, rtol=0)
+    tols = dict(atol=atol)
```
Move the `tols` calculation out of the loop.
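The suggestion amounts to hoisting a loop-invariant dict construction. A minimal sketch of the resulting shape, using plain floats and `math.isclose` as a stand-in for `torch.allclose` (the helper name is hypothetical):

```python
import math

def compare_outputs(l1, l2, atol):
    """Compare two lists of scalar outputs element-wise."""
    assert len(l1) == len(l2), "Unequal number of outputs."
    tols = dict(abs_tol=atol)  # built once, outside the loop
    for i, (t1, t2) in enumerate(zip(l1, l2)):
        if not math.isclose(t1, t2, **tols):
            raise AssertionError(f"Output {i} mismatched: {t1} vs {t2}")

compare_outputs([1.0, 2.0], [1.0 + 1e-9, 2.0], atol=1e-6)
print("all outputs match")
```

The hoist is a readability/micro-efficiency nit rather than a correctness fix: the dict contents do not depend on the loop variable, so rebuilding it per iteration is pure waste.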
…s for improved clarity and consistency.
Manually ported the fix from upstream commit 139c863. The full commit was not cherry-picked due to unrelated changes across many files. Addressed PR comments.
```python
# scales and produce bit-identical FP8 GEMMs — strict tolerance (0) is used.
if quantized_init or (not quantized_init and not autocast):
    atol = 1e-6
    rtol = 5e-5
```
If our reference is DDP with the same FP8 primary weights, then the same cast from the FP32 master weight to FP8 happens in both the target and reference flows. Shouldn't we then get an exact match?
Description
This PR adds unit tests covering different configurations, such as:
All the unit tests compare FSDP2 vs DDP grads/outputs.
This PR also cleans up `fsdp2_all_gather_tensor` to match upstream's methods.
This PR also fixes an issue with `fused_adam` when used with FSDP2.
Fixes # (https://github.com/ROCm/frameworks-internal/issues/15291)
Type of change
Checklist: