Decompose dot_xpu_mkl into mul and sum in non oneMKL path #3265

Closed
Silv3S wants to merge 4 commits into intel:main from Silv3S:dot_mkl
Conversation

Silv3S (Contributor) commented Apr 3, 2026

If oneMKL is not available, replace CPU fallback with dot to mul+sum decomposition to avoid unnecessary data copies between devices.

Copilot AI left a comment

Pull request overview

This PR addresses XPU torch.dot failing for torch.int64 by decomposing the operation into elementwise mul followed by sum, and it removes the CPU fallback when oneMKL is unavailable to avoid device↔host copies.

Changes:

  • Add an explicit Long guard in the oneMKL path to run mul + sum instead of dot_xpu_mkl.
  • Replace the non-oneMKL CPU fallback with mul + sum directly on XPU.
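The decomposition the PR relies on can be sketched in plain Python (an illustrative sketch only, not the actual ATen implementation; `dot_decomposed` is a hypothetical name mirroring what `at::mul(self, other).sum()` computes):

```python
def dot_decomposed(a, b):
    """Dot product via elementwise multiply followed by a sum reduction."""
    if len(a) != len(b):
        raise ValueError("1-D tensors must have the same number of elements")
    # Elementwise mul (the at::mul step), then reduce (the .sum() step).
    return sum(x * y for x, y in zip(a, b))

# For exact (integer) arithmetic, the decomposition matches a direct dot product.
print(dot_decomposed([1, 2, 3], [4, 5, 6]))  # 1*4 + 2*5 + 3*6 = 32
```

Because both `mul` and `sum` already run on XPU for all the relevant dtypes, this keeps the whole computation on the device instead of round-tripping through the CPU.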


Comment thread src/ATen/native/xpu/Blas.cpp Outdated
}

#if defined(USE_ONEMKL_XPU)
if (self.scalar_type() == at::ScalarType::Long) {
Copilot AI commented Apr 3, 2026

dot_xpu_mkl only dispatches floating/complex (see AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES... in src/ATen/native/xpu/mkl/BlasImpl.cpp), so in USE_ONEMKL_XPU builds this function will still throw for other integer dtypes (e.g., Int/Short/Byte) even after the Long guard. If torch.dot is expected to work for all integral types, consider expanding this guard to cover all integral (non-bool) dtypes (or adding an explicit error for the unsupported ones).

Suggested change:
- if (self.scalar_type() == at::ScalarType::Long) {
+ if (c10::isIntegralType(self.scalar_type(), /*includeBool=*/false)) {

Contributor commented:
@Silv3S Since oneMKL dot doesn't support all integral types, similar issues can also occur when the input tensor has another integer dtype. Suggestions for two purposes:

  • To fix the UT only => skip the failing cases, as CUDA does
  • To support comprehensive functionality => add support for integral dtypes beyond Long

Silv3S (Contributor, Author) commented:

Thanks for the review. I think the first suggestion is better than extending functionality for ints. Users can invoke mul and sum, which do support int inputs, if needed.

Comment thread src/ATen/native/xpu/Blas.cpp Outdated
guangyey (Contributor) commented Apr 7, 2026

I notice that CUDA's implementation, dot_cuda, doesn't support the long data type yet. Why do we need this?

Silv3S (Contributor, Author) commented Apr 7, 2026

Thanks for the review, @guangyey. You're right: we don't need it. I assumed it should be implemented based on the failing UT from the open issue. Then I checked that dot for int64 is also not covered by CUDA; they just skip this UT.

But as a general improvement, I'd consider replacing the existing CPU fallback with mul+sum for the non-oneMKL path, to keep the calculation on the device:

#if defined(USE_ONEMKL_XPU)
  return at::native::xpu::dot_xpu_mkl(self, other);
#else
  // return at::native::dot(self.cpu(), other.cpu()).to(self.device());
  return at::mul(self, other).sum();
#endif

@Silv3S Silv3S linked an issue Apr 7, 2026 that may be closed by this pull request
@guangyey guangyey requested a review from CuiYifeng April 13, 2026 05:59

Comment thread src/ATen/native/xpu/Blas.cpp Outdated

#if defined(USE_ONEMKL_XPU)
if (self.scalar_type() == at::ScalarType::Long) {
return at::mul(self, other).sum();
Contributor commented:

Just a reminder: I'm not sure whether this mathematically equivalent approach will overflow for large inputs.
With functionality as the first priority, this approach is acceptable.
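The overflow concern can be made concrete with a small plain-Python sketch (illustrative only; Python ints don't overflow, so 64-bit two's-complement wraparound is simulated explicitly, the way C++ int64_t arithmetic behaves):

```python
def wrap_int64(x):
    """Reduce x to a signed 64-bit value, as int64_t arithmetic would."""
    return (x + 2**63) % 2**64 - 2**63

# Both operands fit comfortably in int64, but their elementwise product
# (the intermediate of the mul+sum decomposition) does not.
a, b = 2**32, 2**32
exact = a * b                 # 2**64, out of int64 range
print(wrap_int64(exact))      # wraps around to 0 in int64 arithmetic
```

A fused dot kernel could in principle accumulate in a wider type, whereas the decomposition materializes each product at the input dtype, which is where the wraparound would occur.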

Silv3S (Contributor, Author) commented:

Agreed. They are numerically close, but not an exact match. I checked multiple cases and they pass with a 1e-5 tolerance (in fp32), but to avoid introducing any instabilities, maybe it's better to leave the CPU fallback as is.
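Why the two approaches are only numerically close, not identical, comes down to summation order in floating point. A minimal sketch (plain Python doubles, illustrative values only):

```python
# The same three products summed in different orders give different results,
# because rounding happens at each partial sum.
terms = [1e16, 1.0, -1e16]

left_to_right = (terms[0] + terms[1]) + terms[2]  # 1.0 is absorbed by 1e16
reordered = (terms[0] + terms[2]) + terms[1]      # cancellation happens first

print(left_to_right)  # 0.0
print(reordered)      # 1.0
```

A reduction kernel and a fused dot kernel generally accumulate in different orders (and possibly different precisions), so bit-exact agreement should not be expected even though both are mathematically the same sum.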

Copilot AI review requested due to automatic review settings April 14, 2026 11:46
@Silv3S Silv3S changed the title Decompose dot_xpu_mkl into mul and sum for long dtype Decompose dot_xpu_mkl into mul and sum in non oneMKL path Apr 14, 2026
Copilot AI left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.



Comment thread src/ATen/native/xpu/Blas.cpp
@Silv3S Silv3S closed this Apr 14, 2026
@Silv3S Silv3S deleted the dot_mkl branch April 14, 2026 12:25

Development

Successfully merging this pull request may close these issues.

NotImplementedError: "dot_xpu_mkl" not implemented for 'Long'

4 participants