
arrdel commented Dec 3, 2025

What does this PR do?

Fixes #12783

This PR temporarily disables the Sage Attention sm90 backend, which is causing confetti/noisy output on SM 9.0+ (Hopper) GPUs.

The Problem

The _SAGE_QK_INT8_PV_FP8_CUDA_SM90 backend was being selected automatically on SM 9.0+ (Hopper) GPUs because it is registered with the constraint:

constraints=[_check_device_cuda_atleast_smXY(9, 0), _check_shape]

However, this backend is producing incorrect output (described as "confetti" or "noisy" output), indicating a bug in the underlying sageattention library's sm90 implementation.
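
For context, such a constraint is simply a predicate the backend registry evaluates before dispatch. A minimal sketch of what _check_device_cuda_atleast_smXY might look like (the name and call shape follow the PR; the body is an assumption for illustration):

import torch

def _check_device_cuda_atleast_smXY(major: int, minor: int):
    # Builds a constraint predicate that passes only on CUDA devices whose
    # compute capability is at least (major, minor), e.g. (9, 0) for Hopper.
    def _check(query: torch.Tensor, *args, **kwargs) -> bool:
        if not query.is_cuda:
            return False
        capability = torch.cuda.get_device_capability(query.device)
        return capability >= (major, minor)  # tuple comparison: H100 is (9, 0)
    return _check

Hopper GPUs satisfy this check, so the sm90 backend was chosen even though its output is wrong.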

The Solution

Temporarily disabled the sm90 backend by commenting out its registration:

  • Users on SM 9.0+ GPUs will now fall back to the standard Sage Attention backends (see the usage sketch after this list)
  • Added a comment referencing issue #12783 (Sage Attention sm90 causes confetti/noisy output) for future reference
  • This is a temporary workaround until the upstream sageattention library fixes the sm90 implementation
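
With the sm90 variant unregistered, an SM 9.0+ user can still opt into Sage Attention explicitly. A hedged sketch of that fallback (the pipeline checkpoint, the set_attention_backend helper, and the "sage" backend name are assumptions that may vary across diffusers versions):

import torch
from diffusers import FluxPipeline  # illustrative pipeline choice

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

# Request the generic Sage Attention backend rather than the
# (now unregistered) sm90-specific variant.
pipe.transformer.set_attention_backend("sage")

image = pipe("a photo of an astronaut riding a horse").images[0]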

Impact

  • ✅ Fixes the confetti/noisy output issue on Hopper GPUs
  • ✅ Users can still use other Sage Attention backends
  • ✅ No breaking changes for users not on SM 9.0+ devices
  • ⚠️ SM 9.0+ users won't get sm90-specific optimizations until upstream fixes the bug

Future Work

This backend can be re-enabled once the sageattention library fixes the sm90 implementation bug.

The _SAGE_QK_INT8_PV_FP8_CUDA_SM90 backend is causing confetti/noisy output on SM 9.0+ GPUs. This change temporarily disables the backend by commenting out its registration until the upstream sageattention library fixes the issue.

Fixes huggingface#12783
Copilot AI review requested due to automatic review settings December 3, 2025 17:16
Copilot finished reviewing on behalf of arrdel December 3, 2025 17:18

Copilot AI left a comment


Pull request overview

This PR temporarily disables the Sage Attention sm90 backend to fix confetti/noisy output issues on SM 9.0+ (Hopper) GPUs by commenting out the backend registration and implementation.

Key Changes:

  • Commented out the _sage_qk_int8_pv_fp8_cuda_sm90_attention function registration to prevent automatic selection on Hopper GPUs
  • Added a comment referencing issue #12783 to track re-enabling when upstream fixes the bug


Comment on lines +2220 to +2240
#     AttentionBackendName._SAGE_QK_INT8_PV_FP8_CUDA_SM90,
#     constraints=[_check_device_cuda_atleast_smXY(9, 0), _check_shape],
# )
# def _sage_qk_int8_pv_fp8_cuda_sm90_attention(
#     query: torch.Tensor,
#     key: torch.Tensor,
#     value: torch.Tensor,
#     is_causal: bool = False,
#     scale: Optional[float] = None,
#     return_lse: bool = False,
#     _parallel_config: Optional["ParallelConfig"] = None,
# ) -> torch.Tensor:
#     return sageattn_qk_int8_pv_fp8_cuda_sm90(
#         q=query,
#         k=key,
#         v=value,
#         tensor_layout="NHD",
#         is_causal=is_causal,
#         sm_scale=scale,
#         return_lse=return_lse,
#     )

Copilot AI Dec 3, 2025


Commenting out code is not the recommended approach for temporarily disabling functionality. This creates several issues:

  1. The AttentionBackendName._SAGE_QK_INT8_PV_FP8_CUDA_SM90 enum member still exists (line 195) but now has no registered implementation
  2. The backend is still referenced in _check_attention_backend_requirements() (line 459) which could cause confusion
  3. Commented code creates maintenance burden and can become stale

Recommended approach:
Instead of commenting out the function, keep it registered but add a runtime check that raises a clear error or logs a warning and falls back to another backend. For example:

@_AttentionBackendRegistry.register(
    AttentionBackendName._SAGE_QK_INT8_PV_FP8_CUDA_SM90,
    constraints=[_check_device_cuda_atleast_smXY(9, 0), _check_shape],
)
def _sage_qk_int8_pv_fp8_cuda_sm90_attention(
    query: torch.Tensor,
    key: torch.Tensor,
    value: torch.Tensor,
    is_causal: bool = False,
    scale: Optional[float] = None,
    return_lse: bool = False,
    _parallel_config: Optional["ParallelConfig"] = None,
) -> torch.Tensor:
    # Temporarily disabled due to issue #12783 - sm90 backend causes confetti/noisy output
    raise NotImplementedError(
        "The sm90 backend for Sage Attention is temporarily disabled due to producing "
        "incorrect output (see issue #12783). Please use a different attention backend."
    )

Alternatively, if you want automatic fallback, modify the constraints to never match, or add a feature flag to control this behavior.
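
For the "modify the constraints to never match" route, a sketch (the environment variable and helper below are hypothetical, not part of diffusers):

import os

def _check_sage_sm90_enabled(*args, **kwargs) -> bool:
    # Hypothetical kill switch: the constraint fails unless the user opts
    # back in, so the dispatcher never picks the sm90 backend by default.
    return os.getenv("DIFFUSERS_ENABLE_SAGE_SM90", "0") == "1"

# The registration would then stay in place, e.g.:
# constraints=[_check_sage_sm90_enabled, _check_device_cuda_atleast_smXY(9, 0), _check_shape]

This keeps the enum member, the registration, and the requirement checks consistent while still steering dispatch away from the broken kernel.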

Suggested change
#     AttentionBackendName._SAGE_QK_INT8_PV_FP8_CUDA_SM90,
#     constraints=[_check_device_cuda_atleast_smXY(9, 0), _check_shape],
# )
# def _sage_qk_int8_pv_fp8_cuda_sm90_attention(
#     query: torch.Tensor,
#     key: torch.Tensor,
#     value: torch.Tensor,
#     is_causal: bool = False,
#     scale: Optional[float] = None,
#     return_lse: bool = False,
#     _parallel_config: Optional["ParallelConfig"] = None,
# ) -> torch.Tensor:
#     return sageattn_qk_int8_pv_fp8_cuda_sm90(
#         q=query,
#         k=key,
#         v=value,
#         tensor_layout="NHD",
#         is_causal=is_causal,
#         sm_scale=scale,
#         return_lse=return_lse,
#     )

@_AttentionBackendRegistry.register(
    AttentionBackendName._SAGE_QK_INT8_PV_FP8_CUDA_SM90,
    constraints=[_check_device_cuda_atleast_smXY(9, 0), _check_shape],
)
def _sage_qk_int8_pv_fp8_cuda_sm90_attention(
    query: torch.Tensor,
    key: torch.Tensor,
    value: torch.Tensor,
    is_causal: bool = False,
    scale: Optional[float] = None,
    return_lse: bool = False,
    _parallel_config: Optional["ParallelConfig"] = None,
) -> torch.Tensor:
    # Temporarily disabled due to issue #12783 - sm90 backend causes confetti/noisy output
    raise NotImplementedError(
        "The sm90 backend for Sage Attention is temporarily disabled due to producing "
        "incorrect output (see issue #12783). Please use a different attention backend."
    )
