Decompose dot_xpu_mkl into mul and sum in non oneMKL path #3265
Silv3S wants to merge 4 commits into intel:main from
Conversation
Pull request overview
This PR addresses XPU torch.dot failing for torch.int64 by decomposing the operation into elementwise mul followed by sum, and it removes the CPU fallback when oneMKL is unavailable to avoid device↔host copies.
Changes:
- Add an explicit `Long` guard in the oneMKL path to run `mul`+`sum` instead of `dot_xpu_mkl`.
- Replace the non-oneMKL CPU fallback with `mul`+`sum` directly on XPU.
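The change rests on the identity `dot(a, b) == sum(a * b)`. A minimal torch-free sketch in plain Python (illustrative only; the actual change calls `at::mul(self, other).sum()` in C++):

```python
# Plain-Python sketch of the dot -> mul + sum decomposition used by this PR.
# Helper names here are hypothetical, for illustration only.

def elementwise_mul(a, b):
    return [x * y for x, y in zip(a, b)]

def dot_via_mul_sum(a, b):
    # dot(a, b) == sum(a_i * b_i); this route works for int64 inputs
    # that oneMKL's dot kernel does not dispatch for.
    return sum(elementwise_mul(a, b))

a = [2**40, 3, -5]
b = [2, 7, 11]
print(dot_via_mul_sum(a, b))  # 2199023255518
```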
```cpp
}

#if defined(USE_ONEMKL_XPU)
  if (self.scalar_type() == at::ScalarType::Long) {
```
dot_xpu_mkl only dispatches floating/complex (see AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES... in src/ATen/native/xpu/mkl/BlasImpl.cpp), so in USE_ONEMKL_XPU builds this function will still throw for other integer dtypes (e.g., Int/Short/Byte) even after the Long guard. If torch.dot is expected to work for all integral types, consider expanding this guard to cover all integral (non-bool) dtypes (or adding an explicit error for the unsupported ones).
Suggested change:
```diff
- if (self.scalar_type() == at::ScalarType::Long) {
+ if (c10::isIntegralType(self.scalar_type(), /*includeBool=*/false)) {
```
@Silv3S Since oneMKL dot doesn't support all integral types, similar issues can also occur when the input tensor is another integer dtype. Two suggestions, depending on the goal:
- To fix the UT only => skip the failing cases, just like CUDA does
- To support comprehensive functionality => add support not only for `Long` but for all integral types
Thanks for the review. I think the first suggestion is better than extending functionality for ints. If needed, the user can invoke `mul` and `sum`, which support int inputs.
I notice that CUDA's impl
Thanks for the review @guangyey. You're right - we don't need it. I assumed it should be implemented based on the failing UT from the open issue. Then I checked that

But as a general improvement I'd consider replacing the existing CPU fallback with:

```cpp
#if defined(USE_ONEMKL_XPU)
  return at::native::xpu::dot_xpu_mkl(self, other);
#else
  // return at::native::dot(self.cpu(), other.cpu()).to(self.device());
  return at::mul(self, other).sum();
#endif
```
```cpp
#if defined(USE_ONEMKL_XPU)
  if (self.scalar_type() == at::ScalarType::Long) {
    return at::mul(self, other).sum();
```
Just a reminder: not sure whether this mathematically equivalent approach will overflow for large inputs.
For functionality first, this approach is acceptable.
Agree. They are numerically close, but not an exact match. I checked multiple cases and they pass at 1e-5 tolerance (in fp32), but to avoid introducing any instabilities it may be better to leave the CPU fallback as is.
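Why `mul`+`sum` and a fused dot need not match bit-for-bit: floating-point addition is not associative, so a different reduction order can change the rounded result. A small torch-free illustration in plain Python (real kernels differ by reduction order, not by `fsum`; the example just makes the order-dependence visible):

```python
import math

# Left-to-right summation loses the 1.0: 1e16 + 1.0 rounds back to 1e16
# in float64, so the subsequent -1e16 leaves 0.0.
vals = [1e16, 1.0, -1e16]

naive = sum(vals)        # order-dependent accumulation
exact = math.fsum(vals)  # error-free summation recovers the 1.0

print(naive)  # 0.0
print(exact)  # 1.0
```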
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.
If oneMKL is not available, replace the CPU fallback with a `dot` to `mul`+`sum` decomposition to avoid unnecessary data copies between devices.