[TE] Enable deterministic mode for fused attention #508
AllenFarcas wants to merge 3 commits into dev from
Conversation
Pull request overview
Enables deterministic mode propagation for ROCm fused-attention backward (CK backend) and adds JAX coverage to validate bitwise reproducibility and gradient correctness when non-deterministic algorithms are disallowed.
Changes:
- Forward the `deterministic` flag from the NVTE ROCm fused-attn backward entrypoints into the CK backend calls.
- Add JAX tests that (on HIP/AMD) verify backward gradients are bitwise reproducible across runs and match an unfused JAX reference.
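The reproducibility property these tests target can be illustrated with a minimal JAX sketch (not the PR's actual test code): compute the same gradient twice from identical inputs and require exact equality, not just `allclose`.

```python
import jax
import jax.numpy as jnp

# Minimal sketch of a bitwise-reproducibility check; `loss` is a toy
# attention-like function, not the PR's fused kernel.
def loss(q):
    return jnp.sum(jax.nn.softmax(q @ q.T) @ q)

grad_fn = jax.jit(jax.grad(loss))
q = jnp.arange(32, dtype=jnp.float32).reshape(4, 8) / 32.0
g1 = grad_fn(q)
g2 = grad_fn(q)
# Deterministic mode demands bit-identical results, not tolerance-based equality.
assert bool((g1 == g2).all())
```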
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| transformer_engine/common/fused_attn_rocm/fused_attn.cpp | Passes the deterministic argument into CK fused-attn backward implementations (qkvpacked/kvpacked/separate). |
| tests/jax/test_fused_attn.py | Adds HIP-only deterministic-backward tests and imports global_shard_guard to ensure mesh resource context is set. |
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
```python
fused_val_grad = jit(jax.value_and_grad(fused_fn, argnums=(0, 1, 2)))
```

```python
with global_shard_guard(MeshResource()):
```
MeshResource() is likely an invalid constructor call in this codebase. In this test suite, MeshResource is commonly instantiated with explicit axis/resource names (e.g. MeshResource('dp', 'cp', 'tp')), and calling it with no arguments may raise TypeError at runtime, preventing the test from running. Construct MeshResource with the expected arguments (or reuse the same default resource configuration used elsewhere in this file).
```diff
-with global_shard_guard(MeshResource()):
+with global_shard_guard(MeshResource('dp', 'cp', 'tp')):
```
```python
assert backend == NVTE_Fused_Attn_Backend.NVTE_CK, (
    f"Expected CK backend but got {backend}."
)
```
This hard assert makes the test brittle if another valid HIP backend is selected in the future (or on some configurations), even though the deterministic behavior under test could still be correct. Consider replacing the assert with pytest.skip(...) when backend != NVTE_CK, or explicitly constraining the test inputs to only configurations that can select CK deterministically.
```diff
-assert backend == NVTE_Fused_Attn_Backend.NVTE_CK, (
-    f"Expected CK backend but got {backend}."
-)
+if backend != NVTE_Fused_Attn_Backend.NVTE_CK:
+    pytest.skip(f"Deterministic CK test requires CK backend, got {backend}.")
```
This will indeed report it as a fail, so let's skip instead.
```python
_orig_nondeterministic = os.environ.get("NVTE_ALLOW_NONDETERMINISTIC_ALGO")
os.environ["NVTE_ALLOW_NONDETERMINISTIC_ALGO"] = "0"
```
This test mutates a process-global environment variable. Even though it’s restored in finally, using pytest’s monkeypatch fixture (e.g., monkeypatch.setenv / monkeypatch.delenv) would be more robust and idiomatic, and reduces the risk of state leaking if this helper evolves (e.g., added early returns) or is reused elsewhere.
Yeah, Copilot gave a good comment. You will need to cache the outside env values and reset them after we finish this pytest.
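A minimal sketch of the cache-and-restore pattern using pytest's `MonkeyPatch` as a context manager (illustrative, not the PR's code): the original value is restored automatically on exit, even if the body raises or returns early.

```python
import os
import pytest

os.environ.pop("NVTE_ALLOW_NONDETERMINISTIC_ALGO", None)  # start unset
with pytest.MonkeyPatch.context() as mp:
    # Force deterministic algorithms for the duration of the block.
    mp.setenv("NVTE_ALLOW_NONDETERMINISTIC_ALGO", "0")
    assert os.environ["NVTE_ALLOW_NONDETERMINISTIC_ALGO"] == "0"
# On exit, the previous (unset) state is restored automatically.
assert "NVTE_ALLOW_NONDETERMINISTIC_ALGO" not in os.environ
```

Inside a test function, the same behavior comes for free from the `monkeypatch` fixture with no explicit context manager.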
```python
if _orig_nondeterministic is None:
    os.environ.pop("NVTE_ALLOW_NONDETERMINISTIC_ALGO", None)
else:
    os.environ["NVTE_ALLOW_NONDETERMINISTIC_ALGO"] = _orig_nondeterministic
```
This test mutates a process-global environment variable. Even though it’s restored in finally, using pytest’s monkeypatch fixture (e.g., monkeypatch.setenv / monkeypatch.delenv) would be more robust and idiomatic, and reduces the risk of state leaking if this helper evolves (e.g., added early returns) or is reused elsewhere.
```diff
-if _orig_nondeterministic is None:
-    os.environ.pop("NVTE_ALLOW_NONDETERMINISTIC_ALGO", None)
-else:
-    os.environ["NVTE_ALLOW_NONDETERMINISTIC_ALGO"] = _orig_nondeterministic
+monkeypatch = pytest.MonkeyPatch()
+if _orig_nondeterministic is None:
+    monkeypatch.delenv("NVTE_ALLOW_NONDETERMINISTIC_ALGO", raising=False)
+else:
+    monkeypatch.setenv("NVTE_ALLOW_NONDETERMINISTIC_ALGO", _orig_nondeterministic)
```
Unless we want to support non-deterministic CK only for the JAX integration, we should probably also add some tests on the PyTorch integration side, since it'll be enabled there too. Also, I think you still need to adjust TransformerEngine/transformer_engine/pytorch/attention/dot_product_attention/utils.py, lines 1070 to 1078 (at 82617fe).
wangye805 left a comment
BTW, add some deterministic test cases on the PyTorch side as well.
```python
if check_numerical is None:
    check_numerical = seq_len <= 256
```
Why do we only check the numerics when seq_len <= 256, skipping the numerical check for longer sequences?
```diff
 from transformer_engine.jax.cpp_extensions.misc import is_hip_extension
 from transformer_engine.jax import autocast
-from transformer_engine.jax.sharding import MeshResource
+from transformer_engine.jax.sharding import MeshResource, global_shard_guard
```
```python
if check_numerical is None:
    check_numerical = seq_len <= 256
s = seq_len
dtype = jnp.bfloat16
```
Let's check for both bf16 and fp16
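A hedged sketch of the requested dtype coverage via parametrization (the test name is hypothetical; string stand-ins for `jnp.bfloat16`/`jnp.float16` keep the sketch self-contained):

```python
import pytest

# Illustrative: the real test would pass the jnp dtypes, not strings.
DTYPE_PARAMS = [
    pytest.param("bfloat16", id="BF16"),
    pytest.param("float16", id="FP16"),
]

@pytest.mark.parametrize("dtype", DTYPE_PARAMS)
def test_deterministic_bwd_dtype(dtype):  # hypothetical test name
    assert dtype in ("bfloat16", "float16")

# The explicit ids make failures readable, e.g. test_deterministic_bwd_dtype[BF16].
assert [p.id for p in DTYPE_PARAMS] == ["BF16", "FP16"]
```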
```python
backend = FusedAttnHelper(
    True, dtype, dtype, qkv_layout, AttnBiasType.NO_BIAS, attn_mask_type,
    0.0, h_q, h_kv, s, s, d, d, (-1, -1),
).get_fused_attn_backend()
if backend == NVTE_Fused_Attn_Backend.NVTE_No_Backend:
    pytest.skip("No fused attention backend available for this config")
assert backend == NVTE_Fused_Attn_Backend.NVTE_CK, (
    f"Expected CK backend but got {backend}."
)
```
Technically, if we specify NVTE_ALLOW_NONDETERMINISTIC_ALGO=0, the backend selection should honor this env var and choose a deterministic backend for us, not restrict the test to CK. As I recall, aotriton is deterministic by nature @xinyazhang
```python
"attn_mask_type",
[
    pytest.param(AttnMaskType.NO_MASK, id="NO_MASK"),
    pytest.param(AttnMaskType.CAUSAL_MASK, id="CAUSAL"),
```
Let's not restrict this to NO_MASK and CAUSAL; let's add padding, padding-causal, and padding-causal-bottom-right as well.
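One way to sketch the broader mask coverage (string stand-ins for the `AttnMaskType` enum members keep this self-contained; the exact member names in the codebase may differ):

```python
import pytest

# Hypothetical expanded parametrization; the real test would use
# AttnMaskType enum members instead of these string stand-ins.
MASK_PARAMS = [
    pytest.param("NO_MASK", id="NO_MASK"),
    pytest.param("CAUSAL_MASK", id="CAUSAL"),
    pytest.param("PADDING_MASK", id="PADDING"),
    pytest.param("PADDING_CAUSAL_MASK", id="PADDING_CAUSAL"),
    pytest.param("PADDING_CAUSAL_BOTTOM_RIGHT_MASK", id="PADDING_CAUSAL_BOTTOM_RIGHT"),
]
assert len(MASK_PARAMS) == 5
```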
```python
    ],
)
def test_deterministic_bwd_gqa(attn_mask_type):
    """GQA variant: BSHD_BSHD_BSHD with h_q != h_kv."""
```
Also, extend this to non-GQA (h_q == h_kv) cases as well.
```python
_run_deterministic_bwd_case(
    qkv_layout=QKVLayout.BSHD_BSHD_BSHD,
    attn_mask_type=attn_mask_type,
    b=2, seq_len=2048, h_q=12, h_kv=4, d=128,
```
Also, check some sequence packing cases
Description
Please include a brief summary of the changes, relevant motivation and context.
Fixes https://github.com/ROCm/frameworks-internal/issues/15875
Type of change
Changes
Checklist: