Add an fma intrinsic #8900
base: main
Conversation
This is equivalent to std::fma. The use case is when you're in a strict_float context and you want an actual fma instruction, e.g. for bit-exact transcendentals.
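A minimal usage sketch of what this is meant to enable (the `fma(Expr, Expr, Expr)` declaration is the one added in this PR; the surrounding pipeline is just an illustration):

```cpp
#include "Halide.h"
using namespace Halide;

int main() {
    Var x;
    Func f;
    // Inside strict_float, a * b + c must not be contracted, so request a
    // single-rounding multiply-add explicitly via the new intrinsic.
    f(x) = strict_float(fma(cast<float>(x), 0.5f, 1.0f));
    f.realize({16});
    return 0;
}
```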
👀 Is this supported on platforms that don't have an FMA instruction? Maybe I should revive my PR. I graduated from the PhD, so I'm back! I should have time 😄

It is, currently via a slow libm call. Not sure how much I care because every modern platform has an fma, but I have also been tinkering with faster ways to emulate fma without it. I was thinking this PR would mean you could use explicit fmas for the transcendentals if you decide to come back to that.

I was thinking about the C-based codegen backends. Should those now report an error for an unsupported intrinsic? Many shader languages support the fma intrinsic as a library function.
Also a drive-by fix for fmod
if (sizeof(T) == sizeof(float)) {
    return fmaf(a, b, c);
} else {
    return (T)fma((double)a, (double)b, (double)c);
I'm curious: what's the point of casting? It looks like this would make it accept long double, but actually not respect the required precision (which is hard on SSE fp either way).
This was for float16 support. Doing it in a wider type isn't quite right, though: the rounding on the wider fma might result in a tie when casting back to the narrow type, and that tie may break in a different direction than directly rounding the fma result to the narrow type. Not sure how to handle this. A static assert that T is a double or a float? What should the C backend do if you use a float16 fma call?
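A sketch of the static_assert option (the helper name is made up, and this is just one way the guard could look, not what the runtime header does):

```cpp
#include <math.h>
#include <type_traits>

// Restrict the generic helper to the types where forwarding to libm is
// unambiguously correct; float16 would need a separate path.
template<typename T>
T fma_helper(T a, T b, T c) {
    static_assert(std::is_same<T, float>::value || std::is_same<T, double>::value,
                  "fma_helper only handles float and double");
    if (sizeof(T) == sizeof(float)) {
        return fmaf(a, b, c);
    } else {
        return fma(a, b, c);
    }
}
```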
src/IROperator.h (Outdated)
/** Fused multiply-add. fma(a, b, c) is equivalent to a * b + c, but only
 * rounded once at the end. Halide will turn a * b + c into an fma
 * automatically, except in strict_float contexts. This intrinsic only exists in
Will it? LLVM does, but Halide doesn't AFAIK. So this means that this statement would be incorrect for C-based backends.
Yes, "automatically" is the wrong word here as outside of strict float contexts, Halide leaves the semantics undefined as to whether it is fused or not and it may or may not be optimized depending on all sorts of details.
test/correctness/strict_fma.cpp (Outdated)

}

saw_error = true;
// The rounding error, if any, ought to be 1 ULP
Because of b being small? In general the ULP error can be bigger, right?
True, I was thinking of a different case: if you do the non-fma version in a wider type and then narrow the result, it should be off by at most 1 ULP relative to the fma. I'll fix the comment.
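For reference, a standalone sketch of that comparison (not the test's actual code): compute the reference in double, narrow it once, and accept at most a 1 ULP difference from the single-rounded fma result.

```cpp
#include <cfloat>
#include <cmath>
#include <cstdio>

// Accept the exact value or its 1-ULP neighbour in the direction of the
// observed result (fine for finite, non-NaN values).
static bool within_one_ulp(float observed, float expected) {
    return observed == expected ||
           observed == std::nextafter(expected, observed);
}

int main() {
    float a = 1.0f + FLT_EPSILON, b = 1.0f - FLT_EPSILON, c = -1.0f;
    float fused = std::fma(a, b, c);                             // rounded once
    float widened = (float)((double)a * (double)b + (double)c);  // rounded to double, then to float
    std::printf("fused=%a widened=%a within_1ulp=%d\n",
                fused, widened, (int)within_one_ulp(widened, fused));
    return 0;
}
```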
mcourteaux left a comment:
A few questions, but looks good in general!
 * context, Halide will already generate fma instructions from a * b + c. This
 * intrinsic's main purpose is to request a true fma inside a strict_float
 * context. A true fma will be emulated on targets without one. */
Expr fma(const Expr &, const Expr &, const Expr &);
Probably should mention that emulation is not guaranteed on every last platform yet, as that's a pretty heavy lift. If not supported or emulated, it should be a compiler error. Also maybe document that it supports 16-, 32-, and 64-bit float. (I don't think we support 128-bit float anywhere. Uh, yet... :-)) Maybe mention IEEE 754-2008 and Wikipedia?
I'm trying to guarantee it on every last platform. I may give up depending on how the PR goes. LLVM offers it as an intrinsic, so for LLVM backends emulating it becomes LLVM's problem. For the C-like backends, most shading languages have it as a builtin. C backend output compiled for CPU has fma and fmaf from libm. float16 is a problem, though; not sure what to do about that yet. I may just emulate it, since it's not so bad.
It turns out doing a (b)float16 fma as a double fma is exact. PR #8906 is necessary for getting the final narrowing cast right, though.
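A sketch of that path, where the compiler supports `_Float16` (the type and helper name are illustrative; the PR itself deals with Halide's float16_t and the narrowing cast from #8906):

```cpp
#include <math.h>

// The float16 significand is narrow enough that rounding the double fma
// result down to float16 matches a directly rounded float16 fma, per the
// observation above.
static inline _Float16 fma_f16(_Float16 a, _Float16 b, _Float16 c) {
    return (_Float16)fma((double)a, (double)b, (double)c);
}
```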
Also add and fix some comments
…ns' into abadams/strict_fma
Hopefully this means we can re-enable win-32 testing, because we should no longer trigger the need for a lib call to convert double to float16.
Not ideal. In fact, performance is known to be terrible (https://gitlab.com/libeigen/eigen/-/issues/2959), but wasm only has a relaxed fma, not a strict one.
Also did a minor drive-by cleanup of some things in simd_op_check_x86.