Add fexpa learning path #2632
Conversation
Added draft status and cascade settings to the index file.
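A minimal sketch of what such settings might look like in the learning path's `_index.md` front matter (illustrative only — the key names follow Hugo's standard `draft` and `cascade` conventions; the actual values in this PR may differ):

```yaml
# Illustrative Hugo front matter; the PR's actual index file may differ.
draft: true          # hide the learning path until it is ready to publish
cascade:
  draft: true        # propagate draft status to all child pages
```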
blapie left a comment:
Looking good and reads very well.
> FEXPA can be used to rapidly perform the table lookup. With this instruction a degree-2 polynomial is sufficient to obtain the same accuracy of the implementation we have seen before:
>
> ```C
> svfloat32_t lane_consts = svld1rq(pg, ln2_lo); // Load only ln2_lo
> ```
Do you need ld1rq here? It's confusing because the comment says that only ln2_lo is loaded. If ld1rq is used, why is it not loading ln2_hi too?
> ---
>
> ## Conclusion
> The SVE2 FEXPA instruction can speed-up the computation of the exponential function by implementing Look-Up and bit manipulation.
SVE rather than SVE2?
> Arm introduced in SVE an instruction called FEXPA: the Floating Point Exponential Accelerator.
>
> Let’s segment the IEEE754 floating-point representation fraction part into several sub-fields (Index, Exp and Remaining bits) with respective length of Idxb, Expb and Remb bits.
IEEE 754
> Let’s segment the IEEE754 floating-point representation fraction part into several sub-fields (Index, Exp and Remaining bits) with respective length of Idxb, Expb and Remb bits.
>
> | IEEE754 precision | Idxb | Expb | Remb |
IEEE 754
> Given what we said in the previous chapters, the exponential function can be implemented with SVE intrinsics in the following way:
>
> ```C
> svfloat32_t lane_consts = svld1rq(pg, constants); // Load ln2_lo, c0, c2, c4 in register
> ```
Is it worth using ld1rq in the example? It is not the most approachable for this audience. I think it would save a few lines and help understanding to use duplication instead. Then you can make a note that further memory-access optimisation can be performed, and maybe link to the AOR versions. Besides, using pg is wrong here; you need to use an all-true predicate.
> ---
>
> ## Conclusion
> The SVE2 FEXPA instruction can speed-up the computation of the exponential function by implementing Look-Up and bit manipulation.
I would generalise to "exponential functionS" (e^x, 2^x, 10^x, x^y...) by virtue of accelerating the computation of 2^n/N. Up to you
> - Fewer instructions (no back-and-forth to scalar/SVE code)
> - Potentially higher aggregate throughput (more exponentials per cycle)
> - Lower power & bandwidth (data being kept in SME engine)
> - Cleaner fusion with GEMM/GEMV workloads
Are you intentionally not mentioning SoftMax and AI applications? It could help understanding the use for such fusion.