Disable Sage Attention sm90 backend due to confetti/noisy output #12785
Conversation
The _SAGE_QK_INT8_PV_FP8_CUDA_SM90 backend is causing confetti/noisy output on SM 9.0+ GPUs. Temporarily disabling this backend by commenting out its registration until the upstream sageattention library fixes the issue. Fixes huggingface#12783
Pull request overview
This PR temporarily disables the Sage Attention sm90 backend to fix confetti/noisy output issues on SM 9.0+ (Hopper) GPUs by commenting out the backend registration and implementation.
Key Changes:
- Commented out the `_sage_qk_int8_pv_fp8_cuda_sm90_attention` function registration to prevent automatic selection on Hopper GPUs
- Added a comment referencing issue #12783 to track re-enabling when upstream fixes the bug
```python
#     AttentionBackendName._SAGE_QK_INT8_PV_FP8_CUDA_SM90,
#     constraints=[_check_device_cuda_atleast_smXY(9, 0), _check_shape],
# )
# def _sage_qk_int8_pv_fp8_cuda_sm90_attention(
#     query: torch.Tensor,
#     key: torch.Tensor,
#     value: torch.Tensor,
#     is_causal: bool = False,
#     scale: Optional[float] = None,
#     return_lse: bool = False,
#     _parallel_config: Optional["ParallelConfig"] = None,
# ) -> torch.Tensor:
#     return sageattn_qk_int8_pv_fp8_cuda_sm90(
#         q=query,
#         k=key,
#         v=value,
#         tensor_layout="NHD",
#         is_causal=is_causal,
#         sm_scale=scale,
#         return_lse=return_lse,
#     )
```
Copilot AI commented on Dec 3, 2025
Commenting out code is not the recommended approach for temporarily disabling functionality. This creates several issues:
- The `AttentionBackendName._SAGE_QK_INT8_PV_FP8_CUDA_SM90` enum member still exists (line 195) but now has no registered implementation
- The backend is still referenced in `_check_attention_backend_requirements()` (line 459), which could cause confusion
- Commented-out code creates maintenance burden and can become stale
Recommended approach:
Instead of commenting out the function, keep it registered but add a runtime check that raises a clear error or logs a warning and falls back to another backend. For example:
```python
@_AttentionBackendRegistry.register(
    AttentionBackendName._SAGE_QK_INT8_PV_FP8_CUDA_SM90,
    constraints=[_check_device_cuda_atleast_smXY(9, 0), _check_shape],
)
def _sage_qk_int8_pv_fp8_cuda_sm90_attention(
    query: torch.Tensor,
    key: torch.Tensor,
    value: torch.Tensor,
    is_causal: bool = False,
    scale: Optional[float] = None,
    return_lse: bool = False,
    _parallel_config: Optional["ParallelConfig"] = None,
) -> torch.Tensor:
    # Temporarily disabled due to issue #12783 - sm90 backend causes confetti/noisy output
    raise NotImplementedError(
        "The sm90 backend for Sage Attention is temporarily disabled due to producing "
        "incorrect output (see issue #12783). Please use a different attention backend."
    )
```
Alternatively, if you want automatic fallback, modify the constraints to never match, or add a feature flag to control this behavior.
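A minimal sketch of that alternative, assuming an environment-variable feature flag; the variable name and the exact signature the registry expects for constraint callables are illustrative assumptions, not existing diffusers options:

```python
import os

# Hypothetical flag name; DIFFUSERS_ENABLE_SAGE_SM90 does not exist in diffusers today.
_ENABLE_SAGE_SM90 = os.environ.get("DIFFUSERS_ENABLE_SAGE_SM90", "0") == "1"


def _check_sage_sm90_enabled(*args, **kwargs) -> bool:
    # Extra constraint that only passes when the flag is set, so by default the
    # dispatcher never selects this backend and falls back to the other
    # registered attention backends.
    return _ENABLE_SAGE_SM90


# The registration would then keep the function but append the flag check:
# constraints=[_check_device_cuda_atleast_smXY(9, 0), _check_shape, _check_sage_sm90_enabled]
```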
What does this PR do?
Fixes #12783
This PR temporarily disables the Sage Attention sm90 backend, which is causing confetti/noisy output on SM 9.0+ (Hopper) GPUs.
The Problem
The `_SAGE_QK_INT8_PV_FP8_CUDA_SM90` backend was automatically being selected on SM 9.0+ GPUs (Hopper architecture) due to its registration constraint. However, this backend is producing incorrect output (described as "confetti" or "noisy" output), indicating a bug in the underlying sageattention library's sm90 implementation.
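For reference, this is the registration that makes the backend eligible on compute capability 9.0+ devices (reproduced from the diff above):

```python
@_AttentionBackendRegistry.register(
    AttentionBackendName._SAGE_QK_INT8_PV_FP8_CUDA_SM90,
    constraints=[_check_device_cuda_atleast_smXY(9, 0), _check_shape],
)
```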
The Solution
Temporarily disabled the sm90 backend by commenting out its registration:
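```python
# Abbreviated; the full commented-out block is shown in the diff above.
# @_AttentionBackendRegistry.register(
#     AttentionBackendName._SAGE_QK_INT8_PV_FP8_CUDA_SM90,
#     constraints=[_check_device_cuda_atleast_smXY(9, 0), _check_shape],
# )
# def _sage_qk_int8_pv_fp8_cuda_sm90_attention(...):
#     ...
```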
Impact
Future Work
This backend can be re-enabled once the sageattention library fixes the sm90 implementation bug.
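In the meantime, users on Hopper GPUs can explicitly select a different attention backend. A hedged sketch, assuming the `set_attention_backend` helper on diffusers model components; the model id and backend string below are illustrative placeholders:

```python
import torch
from diffusers import DiffusionPipeline

# "some/model-id" and the "sage" backend string are placeholders for illustration;
# the point is only to pick a backend other than the disabled sm90 variant.
pipe = DiffusionPipeline.from_pretrained("some/model-id", torch_dtype=torch.bfloat16).to("cuda")
pipe.transformer.set_attention_backend("sage")
```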