Use std::fma in addcmul and foreach pointwise ops for FMA parity with CUDA #3275
AKloniecki wants to merge 10 commits into main from
Conversation
… CUDA

The CUDA pointwise_op_impl (DeviceAddCmulCdiv.cuh) uses std::fma to guarantee consistent fused multiply-add behavior, particularly when alpha=1, where it emits std::fma(tensor1, tensor2, input). The XPU kernels were instead computing input + alpha * op(tensor1, tensor2) without any FMA guarantee, causing bitwise differences between addcmul and add-with-alpha operations.

Port the same FMA logic to XPU: add a pointwise_op_impl helper in ForeachFunctors.h that mirrors CUDA's implementation, and update AddcmulFunctor, PointwiseOpScalarFunctor, and PointwiseOpScalarListFunctor to use it. This fixes test_addcmul_alpha_one_fma_parity on XPU.

Signed-off-by: Artur Kłoniecki <arturx.kloniecki@intel.com>
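The bitwise difference described above can be reproduced in isolation. This standalone sketch (values and function names are illustrative, not from the PR) contrasts the unfused two-rounding computation with the single-rounding `std::fma` path:

```cpp
#include <cassert>
#include <cmath>

// Unfused: the product is rounded to double before the add (the volatile
// keeps the compiler from contracting the expression into an fma itself),
// then the sum is rounded a second time.
inline double addcmul_unfused(double input, double t1, double t2) {
  volatile double prod = t1 * t2;
  return input + prod;
}

// Fused: t1 * t2 + input with a single rounding, which is what the CUDA
// kernel emits for alpha == 1.
inline double addcmul_fused(double input, double t1, double t2) {
  return std::fma(t1, t2, input);
}
```

With `input = -1.0`, `t1 = 0.1`, `t2 = 10.0`, the unfused version yields exactly `0.0` (the double product `0.1 * 10.0` rounds to `1.0`), while the fused version keeps the tiny residue of the exact product, so the two results differ bitwise.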
Pull request overview
This PR aligns XPU addcmul and foreach pointwise operations with CUDA’s fused multiply-add (FMA) behavior by using std::fma for real floating-point math, addressing bitwise parity failures when alpha == 1 (issue #2759).
Changes:
- Updated the XPU `addcmul` kernel to use `std::fma` (with an `alpha == 1` fast-path) for floating-point accumulator types.
- Introduced a `pointwise_op_impl` helper in `ForeachFunctors.h` to centralize FMA behavior for `input + alpha * op(tensor1, tensor2)`.
- Switched foreach pointwise scalar and scalar-list functors to call `pointwise_op_impl`.
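A minimal standalone sketch of what such a `pointwise_op_impl` helper could look like, assuming the shape of the CUDA version described above (the actual XPU helper lives in `ForeachFunctors.h` and works with ATen opmath types; this version is simplified for illustration):

```cpp
#include <cmath>
#include <functional>
#include <type_traits>

// Sketch: compute input + alpha * op(tensor1, tensor2), with an FMA
// fast-path when alpha == 1 and op is a floating-point multiply, so a
// single rounding is performed just like the CUDA kernel.
template <typename opmath_t, typename Op>
inline opmath_t pointwise_op_impl(
    opmath_t input, opmath_t tensor1, opmath_t tensor2, opmath_t alpha, Op op) {
  if (alpha == opmath_t(1)) {
    if constexpr (
        std::is_same_v<Op, std::multiplies<opmath_t>> &&
        std::is_floating_point_v<opmath_t>) {
      // alpha == 1 fast-path: fuse the multiply-add.
      return std::fma(tensor1, tensor2, input);
    } else {
      return input + op(tensor1, tensor2);
    }
  }
  return input + alpha * op(tensor1, tensor2);
}
```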
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/ATen/native/xpu/sycl/PointwiseOpsKernels.cpp | Adds std::fma usage in AddcmulFunctor to match CUDA fused behavior. |
| src/ATen/native/xpu/sycl/ForeachFunctors.h | Adds pointwise_op_impl and routes foreach pointwise ops through it for FMA parity. |
…ication in AddcmulFunctor Agent-Logs-Url: https://github.com/intel/torch-xpu-ops/sessions/93af4d41-5f18-47eb-9154-be673b0da4a7 Co-authored-by: AKloniecki <188310598+AKloniecki@users.noreply.github.com>
What's the issue this PR is fixing?

please fix the linter

@AKloniecki could you please fix the lint issue.
Signed-off-by: Artur Kłoniecki <arturx.kloniecki@intel.com>
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
@AKloniecki please fix the linter issues + address all comments. Then we can auto-merge this PR.
Signed-off-by: Artur Kłoniecki <arturx.kloniecki@intel.com>
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.
Performance outliers, please check!
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
```cpp
if (alpha == opmath_t(1)) {
  if constexpr (
      std::is_same_v<Op, std::multiplies<opmath_t>> &&
      std::is_floating_point_v<opmath_t>) {
    return std::fma(tensor1, tensor2, input);
  } else {
    return input + op(tensor1, tensor2);
  }
```
The FMA fast-path is guarded by an exact type match (std::is_same_v<Op, std::multiplies<opmath_t>>). This is fragile: if callers pass an equivalent multiply op with a different type (e.g., std::multiplies<void>, custom functor/wrapper, or a different instantiation), the FMA path won’t trigger and you’ll lose the intended CUDA parity. Consider making the multiply/FMA intent explicit (e.g., separate overload/tag for multiply, or a small trait like is_multiply_op<Op, opmath_t> that can recognize expected multiply functors).
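One way to make the intent explicit along the lines this comment suggests: a small `is_multiply_op` trait (the name and shape are hypothetical, not from the PR) that recognizes both the concrete and the transparent `std::multiplies` instantiations, so the fast-path is not tied to one exact type:

```cpp
#include <cassert>
#include <functional>
#include <type_traits>

// Hypothetical trait: true for multiply functors we expect callers to pass.
// The primary template rejects everything else.
template <typename Op, typename opmath_t>
struct is_multiply_op : std::false_type {};

// Matching instantiation, e.g. std::multiplies<float> with opmath_t = float.
template <typename opmath_t>
struct is_multiply_op<std::multiplies<opmath_t>, opmath_t> : std::true_type {};

// Transparent functor std::multiplies<void>, usable with any opmath_t.
template <typename opmath_t>
struct is_multiply_op<std::multiplies<void>, opmath_t> : std::true_type {};

template <typename Op, typename opmath_t>
inline constexpr bool is_multiply_op_v = is_multiply_op<Op, opmath_t>::value;
```

The FMA guard could then read `is_multiply_op_v<Op, opmath_t> && std::is_floating_point_v<opmath_t>`, and a custom multiply wrapper could opt in by specializing the trait.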
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
```cpp
#include <functional>
#include <ATen/native/xpu/sycl/DeviceAddCmulCdiv.h>
```
The PR description says PointwiseOpsKernels.cpp removes unused #include <functional>, but the diff adds it back. Since DeviceAddCmulCdiv.h already includes <functional>, this include is likely redundant here; either drop #include <functional> from this file or update the PR description to reflect the new requirement.
- Moved the `pointwise_op_impl` helper into a shared `DeviceAddCmulCdiv.h` header
- Updated `ForeachFunctors.h` to include the shared header instead of defining `pointwise_op_impl` directly
- Updated `PointwiseOpsKernels.cpp` to use the shared helper in `AddcmulFunctor`, removing duplicated FMA logic and unused `#include <functional>` / `#include <type_traits>`

Fixes: #2759