
Conversation

@ClaudioMartino-arm
Contributor

Before submitting a pull request for a new Learning Path, please review Create a Learning Path

  • I have reviewed Create a Learning Path

Please do not include any confidential information in your contribution. This includes confidential microarchitecture details and unannounced product information.

  • I have checked my contribution for confidential information

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of the Creative Commons Attribution 4.0 International License.

Added draft status and cascade settings to the index file.
@pareenaverma pareenaverma merged commit 9a0026d into ArmDeveloperEcosystem:main Dec 16, 2025
1 check failed

@blapie left a comment


Looking good and reads very well.

FEXPA can be used to rapidly perform the table lookup. With this instruction, a degree-2 polynomial is sufficient to obtain the same accuracy as the implementation we have seen before:

```C
svfloat32_t lane_consts = svld1rq(pg, ln2_lo); // Load only ln2_lo
```
Do you need ld1rq here? It's confusing because the comment says that only ln2_lo is loaded.
If ld1rq is used, why is it not loading ln2_hi too?
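
For context, a minimal sketch of the two load styles under discussion. The four-element `constants` array and its contents are placeholders assumed from the quoted snippets, not the Learning Path's actual values: `svld1rq` replicates a full 128-bit quadword (four floats) across the vector, whereas `svdup` broadcasts a single scalar, which is all that loading `ln2_lo` on its own requires.

```C
#include <arm_sve.h>

// Placeholder values standing in for ln2_lo, c0, c2, c4 (not the real coefficients).
static const float constants[4] = { 0.0f, 0.0f, 0.0f, 0.0f };

void load_styles(void) {
    // svld1rq loads one 128-bit quadword (four floats) and replicates it across
    // the whole vector, making all four constants reachable via *_lane intrinsics.
    svfloat32_t lane_consts = svld1rq_f32(svptrue_b32(), constants);

    // svdup broadcasts a single scalar - the natural choice if only ln2_lo is needed.
    svfloat32_t ln2_lo_vec = svdup_f32(constants[0]);

    (void)lane_consts;
    (void)ln2_lo_vec;
}
```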

---

## Conclusion
The SVE2 FEXPA instruction can speed up the computation of the exponential function by implementing a table lookup and bit manipulation.

SVE rather than SVE2?


Arm introduced an instruction in SVE called FEXPA: the Floating-Point Exponential Accelerator.

Let’s segment the fraction part of the IEEE754 floating-point representation into several sub-fields (Index, Exp and Remaining bits), with respective lengths of Idxb, Expb and Remb bits.

IEEE 754


Let’s segment the fraction part of the IEEE754 floating-point representation into several sub-fields (Index, Exp and Remaining bits), with respective lengths of Idxb, Expb and Remb bits.

| IEEE754 precision | Idxb | Expb | Remb |

IEEE 754
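
To make the field split concrete, here is a minimal sketch using the ACLE intrinsic `svexpa_f32` (single precision assumed, where the table index is 6 bits and the exponent field 8 bits; the bias handling in the comments is an assumption about typical usage, not the Learning Path's exact algorithm).

```C
#include <arm_sve.h>

// FEXPA on 32-bit elements reads bits [5:0] of each input as an index into a
// 64-entry table of 2^(i/64) fraction values and bits [13:6] as the biased
// exponent of the result, so a suitably scaled integer yields ~2^(n/64) per lane.
svfloat32_t fexpa_scale(svuint32_t n_with_bias) {
    // The caller is expected to have folded the IEEE exponent bias into the
    // input already (for single precision, 127 placed in bits [13:6]).
    return svexpa_f32(n_with_bias); // one instruction replaces the table lookup
}
```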

Given what we said in the previous chapters, the exponential function can be implemented with SVE intrinsics in the following way:

```C
svfloat32_t lane_consts = svld1rq(pg, constants); // Load ln2_lo, c0, c2, c4 in register
```

Is it worth using ld1rq in the example? It is not the most approachable choice for this audience.
I think it would save a few lines and help understanding to use duplication instead.
Then you can make a note that further memory-access optimisation can be performed, and maybe link to the AOR (Arm Optimized Routines) versions.

Besides, using pg is wrong here; you need to use an all-true predicate.
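
For illustration, a minimal sketch of the suggestion above, assuming the coefficient names from the quoted snippet (the values are placeholders): each constant is broadcast with `svdup`, and any remaining quadword load uses an all-true predicate rather than the loop predicate `pg`.

```C
#include <arm_sve.h>

// Placeholder values; the real coefficients come from the Learning Path.
static const float ln2_lo = 0.0f, c0 = 0.0f, c2 = 0.0f, c4 = 0.0f;

void broadcast_constants(void) {
    // svdup broadcasts one scalar per call - easier to follow than packing the
    // constants into a quadword and selecting them with *_lane intrinsics.
    svfloat32_t v_ln2_lo = svdup_f32(ln2_lo);
    svfloat32_t v_c0     = svdup_f32(c0);
    svfloat32_t v_c2     = svdup_f32(c2);
    svfloat32_t v_c4     = svdup_f32(c4);

    // If the quadword load is kept as a memory-access optimisation, it should
    // use an all-true predicate, not the loop predicate pg:
    // svfloat32_t lane_consts = svld1rq_f32(svptrue_b32(), constants);

    (void)v_ln2_lo; (void)v_c0; (void)v_c2; (void)v_c4;
}
```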

---

## Conclusion
The SVE2 FEXPA instruction can speed up the computation of the exponential function by implementing a table lookup and bit manipulation.

I would generalise to "exponential functionS" (e^x, 2^x, 10^x, x^y, ...) by virtue of accelerating the computation of 2^(n/N). Up to you.

- Fewer instructions (no back-and-forth to scalar/SVE code)
- Potentially higher aggregate throughput (more exponentials per cycle)
- Lower power & bandwidth (data being kept in SME engine)
- Cleaner fusion with GEMM/GEMV workloads

Are you intentionally not mentioning SoftMax and AI applications? Mentioning them could help readers understand the use case for such fusion.
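
As a hypothetical illustration (not taken from the Learning Path), a softmax pass over a GEMV/GEMM output row shows why that fusion matters: the exponentials stay in vector registers between the subtraction of the row maximum and the normalisation, so only the finished values travel back through memory. `vector_expf` below is a stand-in declaration for an SVE exponential such as a FEXPA-based one.

```C
#include <arm_sve.h>

// Stand-in for an SVE exponential (e.g. a FEXPA-based implementation).
svfloat32_t vector_expf(svbool_t pg, svfloat32_t x);

// Sketch of softmax over a row produced by GEMV/GEMM: exponentials are
// computed in-register and only the normalised results are stored.
void softmax_row(float *row, int n, float row_max) {
    float sum = 0.0f;
    for (int i = 0; i < n; i += (int)svcntw()) {
        svbool_t pg = svwhilelt_b32(i, n);
        svfloat32_t x = svld1_f32(pg, row + i);
        svfloat32_t e = vector_expf(pg, svsub_f32_x(pg, x, svdup_f32(row_max)));
        svst1_f32(pg, row + i, e);          // keep e^x for the second pass
        sum += svaddv_f32(pg, e);           // accumulate the denominator
    }
    svfloat32_t inv = svdup_f32(1.0f / sum);
    for (int i = 0; i < n; i += (int)svcntw()) {
        svbool_t pg = svwhilelt_b32(i, n);
        svst1_f32(pg, row + i, svmul_f32_x(pg, svld1_f32(pg, row + i), inv));
    }
}
```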

