Add an fma intrinsic #8900
base: main
Conversation
This is equivalent to std::fma. The use case is when you're in a strict_float context and you want an actual fma instruction, e.g. for bit-exact transcendentals.
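A minimal usage sketch of what this is meant to enable (the `fma(Expr, Expr, Expr)` declaration is the one added in this PR; the surrounding pipeline is just an illustration):

```cpp
#include "Halide.h"
using namespace Halide;

int main() {
    Var x;
    Func f;
    // Inside strict_float, a * b + c must not be contracted, so request a
    // single-rounding multiply-add explicitly via the new intrinsic.
    f(x) = strict_float(fma(cast<float>(x), 0.5f, 1.0f));
    f.realize({16});
    return 0;
}
```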
👀 Is this supported on platforms that don't have an FMA instruction? Maybe I should revive my PR. I graduated from the PhD, so I'm back! I should have time 😄

It is, currently via a slow libm call. Not sure how much I care because every modern platform has an fma, but I have also been tinkering with faster ways to emulate fma without it. I was thinking this PR would mean you could use explicit fmas for the transcendentals if you decide to come back to that.

I was thinking about the C-based codegen backends. Should those now report an error for an unsupported intrinsic? Many shader languages support the fma intrinsic as a library function.
Also a drive-by fix for fmod
if (sizeof(T) == sizeof(float)) {
    return fmaf(a, b, c);
} else {
    return (T)fma((double)a, (double)b, (double)c);
I'm curious: what's the point of casting? It looks like this would make it accept long double, but actually not respect the required precision (which is hard on SSE fp either way).
This was for float16 support. Doing it in a wider type isn't quite right, though: the rounding on the wider fma might result in a tie when casting back to the narrow type, and that tie may break in a different direction than directly rounding the fma result to the narrow type. Not sure how to handle this. A static assert that T is a double or a float? What should the C backend do if you use a float16 fma call?
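A sketch of the static_assert option (the helper name is made up, and this is just one way the guard could look, not what the runtime header does):

```cpp
#include <math.h>
#include <type_traits>

// Restrict the generic helper to the types where forwarding to libm is
// unambiguously correct; float16 would need a separate path.
template<typename T>
T fma_helper(T a, T b, T c) {
    static_assert(std::is_same<T, float>::value || std::is_same<T, double>::value,
                  "fma_helper only handles float and double");
    if (sizeof(T) == sizeof(float)) {
        return fmaf(a, b, c);
    } else {
        return fma(a, b, c);
    }
}
```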
src/IROperator.h (Outdated)
/** Fused multiply-add. fma(a, b, c) is equivalent to a * b + c, but only
 * rounded once at the end. Halide will turn a * b + c into an fma
 * automatically, except in strict_float contexts. This intrinsic only exists in
Will it? LLVM does, but Halide doesn't AFAIK. So this means that this statement would be incorrect for C-based backends.
Yes, "automatically" is the wrong word here as outside of strict float contexts, Halide leaves the semantics undefined as to whether it is fused or not and it may or may not be optimized depending on all sorts of details.
test/correctness/strict_fma.cpp (Outdated)

}

saw_error = true;
// The rounding error, if any, ought to be 1 ULP
Because of b being small? In general the ULP error can be bigger, right?
True, I was thinking of a different case: if you do the non-fma version in a wider type and then narrow the result, it should be off by at most 1 ULP relative to the fma. I'll fix the comment.
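For reference, a standalone sketch of that comparison (not the test's actual code): compute the reference in double, narrow it once, and accept at most a 1 ULP difference from the single-rounded fma result.

```cpp
#include <cfloat>
#include <cmath>
#include <cstdio>

// Accept the exact value or its 1-ULP neighbour in the direction of the
// observed result (fine for finite, non-NaN values).
static bool within_one_ulp(float observed, float expected) {
    return observed == expected ||
           observed == std::nextafter(expected, observed);
}

int main() {
    float a = 1.0f + FLT_EPSILON, b = 1.0f - FLT_EPSILON, c = -1.0f;
    float fused = std::fma(a, b, c);                             // rounded once
    float widened = (float)((double)a * (double)b + (double)c);  // rounded to double, then to float
    std::printf("fused=%a widened=%a within_1ulp=%d\n",
                fused, widened, (int)within_one_ulp(widened, fused));
    return 0;
}
```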
mcourteaux left a comment:
A few questions, but looks good in general!
 * context, Halide will already generate fma instructions from a * b + c. This
 * intrinsic's main purpose is to request a true fma inside a strict_float
 * context. A true fma will be emulated on targets without one. */
Expr fma(const Expr &, const Expr &, const Expr &);
Probably should mention that emulation is not guaranteed on every last platform yet, as that's a pretty heavy lift. If not supported or emulated, it should be a compiler error. Also maybe document that it supports 16-, 32-, and 64-bit float. (I don't think we support 128-bit float anywhere. Uh, yet... :-)) Maybe mention IEEE 754-2008 and Wikipedia?
I'm trying to guarantee it on every last platform. I may give up depending on how the PR goes. LLVM offers it as an intrinsic, so for LLVM backends emulating it becomes LLVM's problem. For the C-like backends, most shading languages have it as a builtin. C backend output compiled for CPU has fma and fmaf from libm. float16 is a problem, though; not sure what to do about that yet. I may just emulate it, since it's not so bad.
It turns out doing a (b)float16 fma as a double fma is exact. PR #8906 is necessary for getting the final narrowing cast right, though.
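A sketch of that path, where the compiler supports `_Float16` (the type and helper name are illustrative; the PR itself deals with Halide's float16_t and the narrowing cast from #8906):

```cpp
#include <math.h>

// The float16 significand is narrow enough that rounding the double fma
// result down to float16 matches a directly rounded float16 fma, per the
// observation above.
static inline _Float16 fma_f16(_Float16 a, _Float16 b, _Float16 c) {
    return (_Float16)fma((double)a, (double)b, (double)c);
}
```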
Also add and fix some comments
…ns' into abadams/strict_fma
Hopefully this means we can re-enable win-32 testing, because we should no longer trigger the need for a lib call to convert double to float16.
Not ideal. In fact, performance is known to be terrible (https://gitlab.com/libeigen/eigen/-/issues/2959), but wasm only has a relaxed fma, not a strict one.
Also did a minor drive-by cleanup of some things in simd_op_check_x86.