
Use std::fma in addcmul and foreach pointwise ops for FMA parity with CUDA #3275

Open
AKloniecki wants to merge 10 commits into main from
aklonieckix/use-std-fma-in=addcmul

Conversation

@AKloniecki (Contributor) commented Apr 7, 2026

  • Extract pointwise_op_impl helper into a shared DeviceAddCmulCdiv.h header
  • Update ForeachFunctors.h to include the shared header instead of defining pointwise_op_impl directly
  • Update PointwiseOpsKernels.cpp to use the shared helper in AddcmulFunctor, removing duplicated FMA logic and unused #include <functional> / #include <type_traits>

Fixes: #2759

… CUDA

The CUDA pointwise_op_impl (DeviceAddCmulCdiv.cuh) uses std::fma to
guarantee consistent fused multiply-add behavior, particularly when
alpha=1 where it emits std::fma(tensor1, tensor2, input). The XPU kernels
were instead computing input + alpha * op(tensor1, tensor2) without
any FMA guarantee, causing bitwise differences between addcmul and
add-with-alpha operations.

Port the same FMA logic to XPU: add a pointwise_op_impl helper in
ForeachFunctors.h that mirrors CUDA's implementation, and update
AddcmulFunctor, PointwiseOpScalarFunctor, and
PointwiseOpScalarListFunctor to use it. This fixes
test_addcmul_alpha_one_fma_parity on XPU.

Signed-off-by: Artur Kłoniecki <arturx.kloniecki@intel.com>
Copilot AI left a comment

Pull request overview

This PR aligns XPU addcmul and foreach pointwise operations with CUDA’s fused multiply-add (FMA) behavior by using std::fma for real floating-point math, addressing bitwise parity failures when alpha == 1 (issue #2759).

Changes:

  • Updated XPU addcmul kernel to use std::fma (with an alpha == 1 fast-path) for floating-point accumulator types.
  • Introduced a pointwise_op_impl helper in ForeachFunctors.h to centralize FMA behavior for input + alpha * op(tensor1, tensor2).
  • Switched foreach pointwise scalar and scalar-list functors to call pointwise_op_impl.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
src/ATen/native/xpu/sycl/PointwiseOpsKernels.cpp Adds std::fma usage in AddcmulFunctor to match CUDA fused behavior.
src/ATen/native/xpu/sycl/ForeachFunctors.h Adds pointwise_op_impl and routes foreach pointwise ops through it for FMA parity.


Comment thread src/ATen/native/xpu/sycl/PointwiseOpsKernels.cpp Outdated
Comment thread src/ATen/native/xpu/sycl/PointwiseOpsKernels.cpp Outdated
…ication in AddcmulFunctor

Agent-Logs-Url: https://github.com/intel/torch-xpu-ops/sessions/93af4d41-5f18-47eb-9154-be673b0da4a7

Co-authored-by: AKloniecki <188310598+AKloniecki@users.noreply.github.com>
@astachowiczhabana (Contributor)

What's the issue this PR is fixing?

@astachowiczhabana (Contributor)

Please fix the linter.

Comment thread src/ATen/native/xpu/sycl/DeviceAddCmulCdiv.h
@guangyey (Contributor) left a comment

Overall LGTM.

@guangyey (Contributor)

@AKloniecki could you please fix the lint issue?

Signed-off-by: Artur Kłoniecki <arturx.kloniecki@intel.com>
Copilot AI review requested due to automatic review settings April 13, 2026 08:55
Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.



Comment thread src/ATen/native/xpu/sycl/DeviceAddCmulCdiv.h
@astachowiczhabana (Contributor)

@AKloniecki please fix the linter issues + address all comments. Then we can auto-merge this PR

Signed-off-by: Artur Kłoniecki <arturx.kloniecki@intel.com>
Copilot AI review requested due to automatic review settings April 14, 2026 09:07
Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.



@github-actions

Performance outliers, please check!

  • 🟡 [80%, 90%), may be fluctuations
Category Model Target vs. Baseline [Eager] Target vs. Baseline [Inductor]
torchbench_bfloat16_training mnasnet1_0 1.038304 0.848556

@github-actions

Performance outliers, please check!

  • 🔴 [-1, 80%), should be regression
Category Model Target vs. Baseline [Eager] Target vs. Baseline [Inductor]
torchbench_bfloat16_training Background_Matting 0.745273 0.674022
torchbench_bfloat16_training pytorch_unet 0.771577 0.687898
huggingface_bfloat16_training AllenaiLongformerBase 0.655497 0.713858
huggingface_float16_training BartForCausalLM 0.728444 0.718180
huggingface_float16_training DistilBertForMaskedLM 0.718782 0.722377
huggingface_bfloat16_training DistilBertForMaskedLM 0.764382 0.731362
huggingface_bfloat16_training BartForCausalLM 0.766717 0.732476
huggingface_bfloat16_training RobertaForCausalLM 0.767930 0.734936
huggingface_float16_training MBartForCausalLM 0.764234 0.742384
timm_models_bfloat16_training mobilenetv3_large_100 0.785150 0.745316
huggingface_bfloat16_training MBartForCausalLM 0.772614 0.746304
huggingface_float16_training DistillGPT2 0.716538 0.747344
huggingface_bfloat16_training BertForMaskedLM 0.769190 0.748490
huggingface_float16_training RobertaForCausalLM 0.739283 0.748929
huggingface_float16_training TrOCRForCausalLM 0.709275 0.749845
huggingface_bfloat16_training XLNetLMHeadModel 0.739952 0.751047
huggingface_float16_training PLBartForCausalLM 0.750769 0.751723
huggingface_bfloat16_training DistillGPT2 0.764054 0.755594
huggingface_bfloat16_training TrOCRForCausalLM 0.766857 0.756291
timm_models_bfloat16_training dm_nfnet_f0 0.650628 0.758461
huggingface_float16_training YituTechConvBert 0.705677 0.761141
torchbench_bfloat16_training alexnet 0.773019 0.762087
huggingface_float16_training BertForMaskedLM 0.732425 0.763958
huggingface_bfloat16_training PLBartForCausalLM 0.782743 0.766780
huggingface_bfloat16_training PegasusForCausalLM 0.764571 0.768837
torchbench_bfloat16_training nvidia_deeprecommender 0.751497 0.769826
timm_models_bfloat16_training vit_base_patch16_siglip_256 0.747196 0.770614
huggingface_bfloat16_training LayoutLMForMaskedLM 0.762446 0.775298
timm_models_bfloat16_training mobilenetv2_100 0.831272 0.775558
timm_models_bfloat16_training deit_base_distilled_patch16_224 0.754666 0.775697
huggingface_float16_training XLNetLMHeadModel 0.700432 0.776440
huggingface_float16_training AllenaiLongformerBase 0.642540 0.776901
huggingface_bfloat16_training YituTechConvBert 0.738639 0.777282
huggingface_float16_training OPTForCausalLM 0.800536 0.778238
timm_models_bfloat16_training nfnet_l0 0.756283 0.779352
huggingface_bfloat16_training OPTForCausalLM 0.809883 0.780987
torchbench_bfloat16_training resnet50 0.756308 0.782585
timm_models_bfloat16_training mobilevit_s 0.722870 0.789869
timm_models_bfloat16_training ghostnet_100 0.853412 0.791252
huggingface_float16_training LayoutLMForMaskedLM 0.729038 0.792587
huggingface_bfloat16_training ElectraForCausalLM 0.791876 0.794815
huggingface_bfloat16_training MegatronBertForCausalLM 0.835333 0.796350
huggingface_float16_training ElectraForCausalLM 0.752805 0.803783
huggingface_bfloat16_training GPT2ForSequenceClassification 0.786668 0.804084
huggingface_float16_training T5Small 0.775605 0.804086
huggingface_float16_training T5ForConditionalGeneration 0.776841 0.806441
huggingface_float16_training XGLMForCausalLM 0.792231 0.809044
huggingface_float16_training GPT2ForSequenceClassification 0.736074 0.820409
huggingface_bfloat16_training T5ForConditionalGeneration 0.797144 0.822241
timm_models_bfloat16_training visformer_small 0.748556 0.825865
huggingface_float16_training AlbertForMaskedLM 0.766528 0.827724
torchbench_bfloat16_training vgg16 0.755118 0.858791
timm_models_bfloat16_training beit_base_patch16_224 0.778738 0.860687
torchbench_bfloat16_training LearningToPaint 0.696167 0.876601
torchbench_bfloat16_training shufflenet_v2_x1_0 0.771365 0.878733
torchbench_bfloat16_training squeezenet1_1 0.795947 0.944616
  • 🟡 [80%, 90%), may be fluctuations
Category Model Target vs. Baseline [Eager] Target vs. Baseline [Inductor]
huggingface_float16_training PegasusForCausalLM 0.810822 0.800966
huggingface_float16_training MegatronBertForCausalLM 0.802270 0.803674
timm_models_bfloat16_training tf_efficientnet_b0 0.810177 0.804410
timm_models_bfloat16_training inception_v3 0.806139 0.810939
huggingface_bfloat16_training T5Small 0.800906 0.819592
huggingface_bfloat16_training DebertaV2ForMaskedLM 0.871673 0.823401
huggingface_bfloat16_training XGLMForCausalLM 0.813335 0.824673
huggingface_float16_training DebertaV2ForMaskedLM 0.859265 0.828173
huggingface_bfloat16_training M2M100ForConditionalGeneration 0.863729 0.834758
huggingface_float16_training M2M100ForConditionalGeneration 0.857934 0.836822
timm_models_bfloat16_training convnextv2_nano.fcmae_ft_in22k_in1k 0.847051 0.840457
timm_models_bfloat16_training adv_inception_v3 0.800922 0.844142
torchbench_bfloat16_training resnet18 0.846731 0.845751
huggingface_bfloat16_training AlbertForMaskedLM 0.802799 0.846479
timm_models_bfloat16_training repvgg_a2 0.847237 0.850430
huggingface_float16_training BlenderbotForCausalLM 0.842684 0.861516
torchbench_bfloat16_training mobilenet_v2 0.861276 0.866330
huggingface_bfloat16_training BlenderbotForCausalLM 0.864689 0.887394
timm_models_bfloat16_training deit_tiny_patch16_224.fb_in1k 0.923123 0.892760
timm_models_bfloat16_training swin_base_patch4_window7_224 0.876122 0.893900
huggingface_bfloat16_training MobileBertForMaskedLM 0.983390 0.899263
torchbench_bfloat16_training BERT_pytorch 0.989852 0.899879
huggingface_float16_training GoogleFnet 0.842863 0.909044
huggingface_bfloat16_training GoogleFnet 0.843420 0.911409

Copilot AI review requested due to automatic review settings April 16, 2026 09:10
Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.



Comment thread src/ATen/native/xpu/sycl/PointwiseOpsKernels.cpp
Comment thread src/ATen/native/xpu/sycl/PointwiseOpsKernels.cpp
Comment on lines +29 to +36
if (alpha == opmath_t(1)) {
  if constexpr (
      std::is_same_v<Op, std::multiplies<opmath_t>> &&
      std::is_floating_point_v<opmath_t>) {
    return std::fma(tensor1, tensor2, input);
  } else {
    return input + op(tensor1, tensor2);
  }
Copilot AI Apr 16, 2026

The FMA fast-path is guarded by an exact type match (std::is_same_v<Op, std::multiplies<opmath_t>>). This is fragile: if callers pass an equivalent multiply op with a different type (e.g., std::multiplies<void>, custom functor/wrapper, or a different instantiation), the FMA path won’t trigger and you’ll lose the intended CUDA parity. Consider making the multiply/FMA intent explicit (e.g., separate overload/tag for multiply, or a small trait like is_multiply_op<Op, opmath_t> that can recognize expected multiply functors).

@github-actions

Performance outliers, please check!

  • 🟡 [80%, 90%), may be fluctuations
Category Model Target vs. Baseline [Eager] Target vs. Baseline [Inductor]
torchbench_bfloat16_training resnext50_32x4d 0.902251 0.838351

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 17, 2026 12:42
Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

Comment on lines +16 to +17
#include <functional>
#include <ATen/native/xpu/sycl/DeviceAddCmulCdiv.h>
Copilot AI Apr 17, 2026

The PR description says PointwiseOpsKernels.cpp removes unused #include <functional>, but the diff adds it back. Since DeviceAddCmulCdiv.h already includes <functional>, this include is likely redundant here—either drop #include <functional> from this file or update the PR description to reflect the new requirement.


Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug Skip]: New failed cases 2026-1-22

6 participants