
Use std::fma in addcmul and foreach pointwise ops for FMA parity with CUDA #3275

Open
AKloniecki wants to merge 10 commits into main from
aklonieckix/use-std-fma-in=addcmul

Conversation

@AKloniecki (Contributor) commented Apr 7, 2026

  • Extract pointwise_op_impl helper into a shared DeviceAddCmulCdiv.h header
  • Update ForeachFunctors.h to include the shared header instead of defining pointwise_op_impl directly
  • Update PointwiseOpsKernels.cpp to use the shared helper in AddcmulFunctor, removing duplicated FMA logic and unused #include <functional> / #include <type_traits>

Fixes: #2759

… CUDA

The CUDA pointwise_op_impl (DeviceAddCmulCdiv.cuh) uses std::fma to
guarantee consistent fused multiply-add behavior, particularly when
alpha=1 where it emits std::fma(tensor1, tensor2, input). The XPU kernels
were instead computing input + alpha * op(tensor1, tensor2) without
any FMA guarantee, causing bitwise differences between addcmul and
add-with-alpha operations.

Port the same FMA logic to XPU: add a pointwise_op_impl helper in
ForeachFunctors.h that mirrors CUDA's implementation, and update
AddcmulFunctor, PointwiseOpScalarFunctor, and
PointwiseOpScalarListFunctor to use it. This fixes
test_addcmul_alpha_one_fma_parity on XPU.

Signed-off-by: Artur Kłoniecki <arturx.kloniecki@intel.com>
Copilot AI left a comment

Pull request overview

This PR aligns XPU addcmul and foreach pointwise operations with CUDA’s fused multiply-add (FMA) behavior by using std::fma for real floating-point math, addressing bitwise parity failures when alpha == 1 (issue #2759).

Changes:

  • Updated XPU addcmul kernel to use std::fma (with an alpha == 1 fast-path) for floating-point accumulator types.
  • Introduced a pointwise_op_impl helper in ForeachFunctors.h to centralize FMA behavior for input + alpha * op(tensor1, tensor2).
  • Switched foreach pointwise scalar and scalar-list functors to call pointwise_op_impl.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
src/ATen/native/xpu/sycl/PointwiseOpsKernels.cpp Adds std::fma usage in AddcmulFunctor to match CUDA fused behavior.
src/ATen/native/xpu/sycl/ForeachFunctors.h Adds pointwise_op_impl and routes foreach pointwise ops through it for FMA parity.


Comment thread src/ATen/native/xpu/sycl/PointwiseOpsKernels.cpp Outdated
Comment thread src/ATen/native/xpu/sycl/PointwiseOpsKernels.cpp Outdated
…ication in AddcmulFunctor

Agent-Logs-Url: https://github.com/intel/torch-xpu-ops/sessions/93af4d41-5f18-47eb-9154-be673b0da4a7

Co-authored-by: AKloniecki <188310598+AKloniecki@users.noreply.github.com>
@astachowiczhabana (Contributor)

What's the issue this PR is fixing?

@astachowiczhabana (Contributor)

Please fix the linter.

Comment thread src/ATen/native/xpu/sycl/DeviceAddCmulCdiv.h
@guangyey (Contributor) left a comment

Overall LGTM.

@guangyey (Contributor)

@AKloniecki could you please fix the lint issue?

Signed-off-by: Artur Kłoniecki <arturx.kloniecki@intel.com>
Copilot AI review requested due to automatic review settings April 13, 2026 08:55
Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.



Comment thread src/ATen/native/xpu/sycl/DeviceAddCmulCdiv.h
@astachowiczhabana (Contributor)

@AKloniecki please fix the linter issues + address all comments. Then we can auto-merge this PR

Signed-off-by: Artur Kłoniecki <arturx.kloniecki@intel.com>
Copilot AI review requested due to automatic review settings April 14, 2026 09:07
Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.



@github-actions

Performance outliers, please check!

  • 🟡 [80%, 90%), may be fluctuations
Category Model Target vs. Baseline [Eager] Target vs. Baseline [Inductor]
torchbench_bfloat16_training mnasnet1_0 1.038304 0.848556

@github-actions

Performance outliers, please check!

  • 🔴 [-1, 80%), should be regression
Category Model Target vs. Baseline [Eager] Target vs. Baseline [Inductor]
torchbench_bfloat16_training Background_Matting 0.745273 0.674022
torchbench_bfloat16_training pytorch_unet 0.771577 0.687898
huggingface_bfloat16_training AllenaiLongformerBase 0.655497 0.713858
huggingface_float16_training BartForCausalLM 0.728444 0.718180
huggingface_float16_training DistilBertForMaskedLM 0.718782 0.722377
huggingface_bfloat16_training DistilBertForMaskedLM 0.764382 0.731362
huggingface_bfloat16_training BartForCausalLM 0.766717 0.732476
huggingface_bfloat16_training RobertaForCausalLM 0.767930 0.734936
huggingface_float16_training MBartForCausalLM 0.764234 0.742384
timm_models_bfloat16_training mobilenetv3_large_100 0.785150 0.745316
huggingface_bfloat16_training MBartForCausalLM 0.772614 0.746304
huggingface_float16_training DistillGPT2 0.716538 0.747344
huggingface_bfloat16_training BertForMaskedLM 0.769190 0.748490
huggingface_float16_training RobertaForCausalLM 0.739283 0.748929
huggingface_float16_training TrOCRForCausalLM 0.709275 0.749845
huggingface_bfloat16_training XLNetLMHeadModel 0.739952 0.751047
huggingface_float16_training PLBartForCausalLM 0.750769 0.751723
huggingface_bfloat16_training DistillGPT2 0.764054 0.755594
huggingface_bfloat16_training TrOCRForCausalLM 0.766857 0.756291
timm_models_bfloat16_training dm_nfnet_f0 0.650628 0.758461
huggingface_float16_training YituTechConvBert 0.705677 0.761141
torchbench_bfloat16_training alexnet 0.773019 0.762087
huggingface_float16_training BertForMaskedLM 0.732425 0.763958
huggingface_bfloat16_training PLBartForCausalLM 0.782743 0.766780
huggingface_bfloat16_training PegasusForCausalLM 0.764571 0.768837
torchbench_bfloat16_training nvidia_deeprecommender 0.751497 0.769826
timm_models_bfloat16_training vit_base_patch16_siglip_256 0.747196 0.770614
huggingface_bfloat16_training LayoutLMForMaskedLM 0.762446 0.775298
timm_models_bfloat16_training mobilenetv2_100 0.831272 0.775558
timm_models_bfloat16_training deit_base_distilled_patch16_224 0.754666 0.775697
huggingface_float16_training XLNetLMHeadModel 0.700432 0.776440
huggingface_float16_training AllenaiLongformerBase 0.642540 0.776901
huggingface_bfloat16_training YituTechConvBert 0.738639 0.777282
huggingface_float16_training OPTForCausalLM 0.800536 0.778238
timm_models_bfloat16_training nfnet_l0 0.756283 0.779352
huggingface_bfloat16_training OPTForCausalLM 0.809883 0.780987
torchbench_bfloat16_training resnet50 0.756308 0.782585
timm_models_bfloat16_training mobilevit_s 0.722870 0.789869
timm_models_bfloat16_training ghostnet_100 0.853412 0.791252
huggingface_float16_training LayoutLMForMaskedLM 0.729038 0.792587
huggingface_bfloat16_training ElectraForCausalLM 0.791876 0.794815
huggingface_bfloat16_training MegatronBertForCausalLM 0.835333 0.796350
huggingface_float16_training ElectraForCausalLM 0.752805 0.803783
huggingface_bfloat16_training GPT2ForSequenceClassification 0.786668 0.804084
huggingface_float16_training T5Small 0.775605 0.804086
huggingface_float16_training T5ForConditionalGeneration 0.776841 0.806441
huggingface_float16_training XGLMForCausalLM 0.792231 0.809044
huggingface_float16_training GPT2ForSequenceClassification 0.736074 0.820409
huggingface_bfloat16_training T5ForConditionalGeneration 0.797144 0.822241
timm_models_bfloat16_training visformer_small 0.748556 0.825865
huggingface_float16_training AlbertForMaskedLM 0.766528 0.827724
torchbench_bfloat16_training vgg16 0.755118 0.858791
timm_models_bfloat16_training beit_base_patch16_224 0.778738 0.860687
torchbench_bfloat16_training LearningToPaint 0.696167 0.876601
torchbench_bfloat16_training shufflenet_v2_x1_0 0.771365 0.878733
torchbench_bfloat16_training squeezenet1_1 0.795947 0.944616
  • 🟡 [80%, 90%), may be fluctuations
Category Model Target vs. Baseline [Eager] Target vs. Baseline [Inductor]
huggingface_float16_training PegasusForCausalLM 0.810822 0.800966
huggingface_float16_training MegatronBertForCausalLM 0.802270 0.803674
timm_models_bfloat16_training tf_efficientnet_b0 0.810177 0.804410
timm_models_bfloat16_training inception_v3 0.806139 0.810939
huggingface_bfloat16_training T5Small 0.800906 0.819592
huggingface_bfloat16_training DebertaV2ForMaskedLM 0.871673 0.823401
huggingface_bfloat16_training XGLMForCausalLM 0.813335 0.824673
huggingface_float16_training DebertaV2ForMaskedLM 0.859265 0.828173
huggingface_bfloat16_training M2M100ForConditionalGeneration 0.863729 0.834758
huggingface_float16_training M2M100ForConditionalGeneration 0.857934 0.836822
timm_models_bfloat16_training convnextv2_nano.fcmae_ft_in22k_in1k 0.847051 0.840457
timm_models_bfloat16_training adv_inception_v3 0.800922 0.844142
torchbench_bfloat16_training resnet18 0.846731 0.845751
huggingface_bfloat16_training AlbertForMaskedLM 0.802799 0.846479
timm_models_bfloat16_training repvgg_a2 0.847237 0.850430
huggingface_float16_training BlenderbotForCausalLM 0.842684 0.861516
torchbench_bfloat16_training mobilenet_v2 0.861276 0.866330
huggingface_bfloat16_training BlenderbotForCausalLM 0.864689 0.887394
timm_models_bfloat16_training deit_tiny_patch16_224.fb_in1k 0.923123 0.892760
timm_models_bfloat16_training swin_base_patch4_window7_224 0.876122 0.893900
huggingface_bfloat16_training MobileBertForMaskedLM 0.983390 0.899263
torchbench_bfloat16_training BERT_pytorch 0.989852 0.899879
huggingface_float16_training GoogleFnet 0.842863 0.909044
huggingface_bfloat16_training GoogleFnet 0.843420 0.911409

Copilot AI review requested due to automatic review settings April 16, 2026 09:10
Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.



Comment thread src/ATen/native/xpu/sycl/PointwiseOpsKernels.cpp
Comment thread src/ATen/native/xpu/sycl/PointwiseOpsKernels.cpp
Comment on lines +29 to +36
if (alpha == opmath_t(1)) {
  if constexpr (
      std::is_same_v<Op, std::multiplies<opmath_t>> &&
      std::is_floating_point_v<opmath_t>) {
    return std::fma(tensor1, tensor2, input);
  } else {
    return input + op(tensor1, tensor2);
  }
Copilot AI Apr 16, 2026

The FMA fast-path is guarded by an exact type match (std::is_same_v<Op, std::multiplies<opmath_t>>). This is fragile: if callers pass an equivalent multiply op with a different type (e.g., std::multiplies<void>, custom functor/wrapper, or a different instantiation), the FMA path won’t trigger and you’ll lose the intended CUDA parity. Consider making the multiply/FMA intent explicit (e.g., separate overload/tag for multiply, or a small trait like is_multiply_op<Op, opmath_t> that can recognize expected multiply functors).

@github-actions

Performance outliers, please check!

  • 🟡 [80%, 90%), may be fluctuations
Category Model Target vs. Baseline [Eager] Target vs. Baseline [Inductor]
torchbench_bfloat16_training resnext50_32x4d 0.902251 0.838351

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 17, 2026 12:42
Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

Comment on lines +16 to +17
#include <functional>
#include <ATen/native/xpu/sycl/DeviceAddCmulCdiv.h>
Copilot AI Apr 17, 2026

The PR description says PointwiseOpsKernels.cpp removes unused #include <functional>, but the diff adds it back. Since DeviceAddCmulCdiv.h already includes <functional>, this include is likely redundant here—either drop #include <functional> from this file or update the PR description to reflect the new requirement.


Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug Skip]: New failed cases 2026-1-22

6 participants