huggingface accuracy inference Error in op: torch.ops.aten._scaled_dot_product_fused_attention_overrideable.default #3258

@bjarzemb

Description

🐛 Describe the bug

Almost all Hugging Face accuracy inference models fail with a stride mismatch in PyTorch's XPU backend for the Scaled Dot Product Attention (SDPA) operation:

Traceback (most recent call last):
  File "C:\Users\gta\repositories\pytorch\pytorch\benchmarks\dynamo\common.py", line 2379, in check_accuracy
    new_result = self.run_n_iterations(
                 ^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\gta\repositories\pytorch\pytorch\benchmarks\dynamo\common.py", line 2077, in run_n_iterations
    model_iter_fn(mod, inputs, collect_outputs=False)
  File "C:\Users\gta\miniforge3\envs\202604010405_4e67aacf_32.0.101.8626_2025.3.2.61\Lib\site-packages\torch\_dynamo\eval_frame.py", line 1036, in compile_wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\gta\repositories\pytorch\pytorch\benchmarks\dynamo\huggingface.py", line 554, in forward_pass
    def forward_pass(self, mod, inputs, collect_outputs=True):
  File "C:\Users\gta\miniforge3\envs\202604010405_4e67aacf_32.0.101.8626_2025.3.2.61\Lib\site-packages\torch\_dynamo\eval_frame.py", line 1272, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\gta\miniforge3\envs\202604010405_4e67aacf_32.0.101.8626_2025.3.2.61\Lib\site-packages\torch\_functorch\aot_autograd.py", line 1186, in forward
    return compiled_fn(full_args)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\gta\miniforge3\envs\202604010405_4e67aacf_32.0.101.8626_2025.3.2.61\Lib\site-packages\torch\_functorch\_aot_autograd\runtime_wrappers.py", line 767, in runtime_wrapper
    all_outs = compiled_invoker.run(args, on_before_call=exit_prologue)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\gta\miniforge3\envs\202604010405_4e67aacf_32.0.101.8626_2025.3.2.61\Lib\site-packages\torch\_functorch\_aot_autograd\runtime_wrappers.py", line 513, in run
    return call_func_at_runtime_with_args(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\gta\miniforge3\envs\202604010405_4e67aacf_32.0.101.8626_2025.3.2.61\Lib\site-packages\torch\_functorch\_aot_autograd\utils.py", line 126, in call_func_at_runtime_with_args
    out = normalize_as_list(f(args))
                            ^^^^^^^
  File "C:\Users\gta\miniforge3\envs\202604010405_4e67aacf_32.0.101.8626_2025.3.2.61\Lib\site-packages\torch\_functorch\_aot_autograd\runtime_wrappers.py", line 840, in wrapper
    return compiled_fn(runtime_args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\gta\miniforge3\envs\202604010405_4e67aacf_32.0.101.8626_2025.3.2.61\Lib\site-packages\torch\_functorch\_aot_autograd\runtime_wrappers.py", line 1044, in inner_fn
    outs = compiled_fn(args)
           ^^^^^^^^^^^^^^^^^
  File "C:\Users\gta\miniforge3\envs\202604010405_4e67aacf_32.0.101.8626_2025.3.2.61\Lib\site-packages\torch\_inductor\output_code.py", line 682, in __call__
    return self.current_callable(inputs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\gta\miniforge3\envs\202604010405_4e67aacf_32.0.101.8626_2025.3.2.61\Lib\site-packages\torch\_inductor\utils.py", line 3459, in run
    out = model(new_inputs)
          ^^^^^^^^^^^^^^^^^
  File "C:\Users\gta\AppData\Local\Temp\torchinductor_gta\2i\c2ijxofv4zv6scovbk6xqyo7xb5lx4g3ofjg5yuax4p24dh4smau.py", line 2567, in call
    (buf464, buf467, buf471) = self.partitions[0](partition0_args)
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\gta\AppData\Local\Temp\torchinductor_gta\2i\c2ijxofv4zv6scovbk6xqyo7xb5lx4g3ofjg5yuax4p24dh4smau.py", line 1559, in partition_0
    assert_size_stride(buf19, (1, 64, 512, 64), (2097152, 32768, 64, 1), 'torch.ops.aten._scaled_dot_product_fused_attention_overrideable.default')
AssertionError: expected size 64==64, stride 64==32768 at dim=1; expected size 512==512, stride 4096==64 at dim=2
Error in op: torch.ops.aten._scaled_dot_product_fused_attention_overrideable.default
This error most often comes from a incorrect fake (aka meta) kernel for a custom op.
Use torch.library.opcheck to test your custom op.
See https://pytorch.org/docs/stable/library.html#torch.library.opcheck
TorchDynamo optimized model failed to run because of following error
fail_to_run
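For reference, the strides reported by the assertion are consistent with the kernel allocating its output in (batch, seq, heads, head_dim) contiguous order and then viewing it as (batch, heads, seq, head_dim) without a copy, while the meta kernel promises a (batch, heads, seq, head_dim)-contiguous buffer. A small pure-Python sketch (my interpretation of the numbers, not taken from the repro) showing how both sets of stride values arise:

```python
def contiguous_strides(shape):
    """Row-major (contiguous) strides for a given shape."""
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return tuple(strides)

# Expected: output contiguous in (batch, heads, seq, head_dim) order,
# as asserted by the compiled code for buf19.
expected = contiguous_strides((1, 64, 512, 64))
# expected == (2097152, 32768, 64, 1)

# Observed: strides of a (batch, seq, heads, head_dim)-contiguous buffer
# permuted to (batch, heads, seq, head_dim), i.e. dims 1 and 2 swapped.
bshd = contiguous_strides((1, 512, 64, 64))       # (2097152, 4096, 64, 1)
observed = (bshd[0], bshd[2], bshd[1], bshd[3])
# observed == (2097152, 64, 4096, 1)
# -> matches the assertion: stride 64 (not 32768) at dim=1,
#    stride 4096 (not 64) at dim=2.
```

This would point at a mismatch between the real XPU SDPA kernel's output layout and the fake/meta kernel's declared layout, which is exactly what the `torch.library.opcheck` hint in the error message is about.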

Versions

PyTorch version: 2.12.0a0+git4e67aac
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: Microsoft Windows 11 Pro (10.0.26100 64-bit)
GCC version: Could not collect
Clang version: Could not collect
CMake version: version 3.31.6
Libc version: N/A

Python version: 3.12.13 | packaged by conda-forge | (main, Mar 5 2026, 16:36:12) [MSC v.1944 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-11-10.0.26100-SP0
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
Is XPU available: True
XPU used to build PyTorch: 20250302
Intel GPU driver version:
* 32.0.101.8626 (20260311000000.***+)
Intel GPU models onboard:
* Intel(R) Arc(TM) B580 Graphics
Intel GPU models detected:
* [0] _XpuDeviceProperties(name='Intel(R) Arc(TM) B580 Graphics', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero V2', type='gpu', device_id=0xE20B, uuid=86800be2-0000-0000-0300-000000000000, driver_version='1.14.37111', total_memory=11875MB, local_mem_size=128KB, max_compute_units=160, memory_clock_rate=0MHz, memory_bus_width=64-bit, gpu_eu_count=160, gpu_subslice_count=20, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Caching allocator config: N/A

CPU:
Name: 13th Gen Intel(R) Core(TM) i5-13400
Manufacturer: GenuineIntel
Family: 205
Architecture: 9
ProcessorType: 3
DeviceID: CPU0
CurrentClockSpeed: 2500
MaxClockSpeed: 2500
L2CacheSize: 9728
L2CacheSpeed: None
Revision: None

Versions of relevant libraries:
[pip3] intel-openmp==2025.3.3
[pip3] mkl-include==2025.3.1
[pip3] mkl-static==2025.3.1
[pip3] mypy==1.20.0
[pip3] mypy_extensions==1.1.0
[pip3] numpy==1.26.2
[pip3] onemkl-license==2025.3.1
[pip3] onnx==1.20.0
[pip3] onnx-ir==0.1.16
[pip3] onnxscript==0.6.2
[pip3] optree==0.13.0
[pip3] pytorch-labs-segment-anything-fast==0.2
[pip3] tbb==2022.3.1
[pip3] tbb-devel==2022.3.1
[pip3] tcmlib==1.4.1
[pip3] torch==2.12.0a0+git4e67aac
[pip3] torch_geometric==2.4.0
[pip3] torchao==0.17.0
[pip3] torchaudio==2.11.0a0+c0cbdb9
[pip3] torchbench==0.1
[pip3] torchmetrics==1.9.0
[pip3] torchmultimodal==0.1.0b0
[pip3] torchrec-nightly==2022.4.26
[pip3] torchtext==0.17.0a0+a5e6106
[pip3] torchvision==0.27.0a0+9bf794d
[pip3] torchx-nightly==2026.3.31
[pip3] triton-xpu==3.7.0+git33f782ef
[conda] intel-openmp 2025.3.3 pypi_0 pypi
[conda] mkl-include 2025.3.1 pypi_0 pypi
[conda] mkl-static 2025.3.1 pypi_0 pypi
[conda] numpy 1.26.2 pypi_0 pypi
[conda] onemkl-license 2025.3.1 pypi_0 pypi
[conda] optree 0.13.0 pypi_0 pypi
[conda] pytorch-labs-segment-anything-fast 0.2 pypi_0 pypi
[conda] tbb 2022.3.1 pypi_0 pypi
[conda] tbb-devel 2022.3.1 pypi_0 pypi
[conda] tcmlib 1.4.1 pypi_0 pypi
[conda] torch 2.12.0a0+git4e67aac pypi_0 pypi
[conda] torch-geometric 2.4.0 pypi_0 pypi
[conda] torchao 0.17.0 pypi_0 pypi
[conda] torchaudio 2.11.0a0+c0cbdb9 pypi_0 pypi
[conda] torchbench 0.1 pypi_0 pypi
[conda] torchmetrics 1.9.0 pypi_0 pypi
[conda] torchmultimodal 0.1.0b0 pypi_0 pypi
[conda] torchrec-nightly 2022.4.26 pypi_0 pypi
[conda] torchtext 0.17.0a0+a5e6106 pypi_0 pypi
[conda] torchvision 0.27.0a0+9bf794d pypi_0 pypi
[conda] torchx-nightly 2026.3.31 pypi_0 pypi
[conda] triton-xpu 3.7.0+git33f782ef pypi_0 pypi
