
Conversation

@abadams (Member) commented Dec 12, 2025

This is equivalent to std::fma. The use-case is when you're in a strict_float context and you want an actual fma instruction. E.g. for bit-exact transcendentals.

Also did a minor drive-by clean-up of some things in simd_op_check_x86.

@mcourteaux (Contributor) commented:

👀 Is this supported on platforms that don't have an FMA instruction? Maybe I should revive my PR. I graduated from my PhD, so I'm back! I should have time 😄

@abadams (Member, Author) commented Dec 12, 2025

It is, currently via a slow libm call. Not sure how much I care because every modern platform has an fma, but I have also been tinkering with faster ways to emulate fma without it. I was thinking this PR would mean you could use explicit fmas for the transcendentals if you decide to come back to that.

@mcourteaux (Contributor) commented Dec 13, 2025

I was thinking about the C-based codegen backends. Should those now report an error about an unsupported intrinsic? Many shader languages support the fma intrinsic as a library function.

if (sizeof(T) == sizeof(float)) {
    return fmaf(a, b, c);
} else {
    return (T)fma((double)a, (double)b, (double)c);
}
@mcourteaux (Contributor) commented:

I'm curious: what's the point of casting? It looks like this would make it accept long double, but actually not respect the required precision (which is hard on SSE fp either way).

@abadams (Member, Author) replied:

This was for float16 support. It's not quite right to do it in a wider type, though: the wider fma's rounding can land exactly halfway between two values of the narrow type, and that tie may break in a different direction than rounding the exact result directly to the narrow type would. Not sure how to handle this. A static assert that T is a double or a float? What should the C backend do if you use a float16 fma call?

src/IROperator.h Outdated

/** Fused multiply-add. fma(a, b, c) is equivalent to a * b + c, but only
* rounded once at the end. Halide will turn a * b + c into an fma
* automatically, except in strict_float contexts. This intrinsic only exists in
@mcourteaux (Contributor) commented:

Will it? LLVM does, but Halide doesn't AFAIK. So this means that this statement would be incorrect for C-based backends.

A Member replied:

Yes, "automatically" is the wrong word here. Outside of strict_float contexts, Halide leaves it undefined whether the operation is fused or not, and it may or may not be optimized depending on all sorts of details.

}

saw_error = true;
// The rounding error, if any, ought to be 1 ULP
@mcourteaux (Contributor) commented:

Because of b being small? In general the ULP error can be bigger, right?

@abadams (Member, Author) replied:

True, I was thinking of a different case - if you do the non-fma using a wider type and then narrow the result, it should be at most off by 1 ULP relative to the fma. I'll fix the comment.

@mcourteaux (Contributor) left a review:

A few questions, but looks good in general!

* context, Halide will already generate fma instructions from a * b + c. This
* intrinsic's main purpose is to request a true fma inside a strict_float
* context. A true fma will be emulated on targets without one. */
Expr fma(const Expr &, const Expr &, const Expr &);
A Member commented:

Probably should mention that emulation is not yet guaranteed on every last platform, as that's a pretty heavy lift. If not supported or emulated, it should be a compiler error. Also maybe document that it supports 16-, 32-, and 64-bit floats. (I don't think we support 128-bit float anywhere. Uh, yet... :-)) Maybe mention IEEE 754-2008 and Wikipedia?

@abadams (Member, Author) replied:

I'm trying to guarantee it on every last platform. I may give up depending on how the PR goes. llvm offers it as an intrinsic, so for llvm backends emulating it becomes llvm's problem. For the C-like backends most shading languages have it as a builtin. C backend output compiled for CPU has fma and fmaf from libm. float16 is a problem though. Not sure what to do about that yet. I may just emulate it - it's not so bad.

@abadams (Member, Author) commented:

Turns out doing a (b)float16 fma as a double fma is exact. PR #8906 is necessary for getting the final narrowing cast right though.

Hopefully this means we can reenable win-32 testing, because we should no longer trigger the need for a lib call to convert double to float16.

Not ideal. In fact, performance is known to be terrible (https://gitlab.com/libeigen/eigen/-/issues/2959), but wasm only has a relaxed fma, not a strict one.
