50 changes: 37 additions & 13 deletions vfabia64/vfabia64.rst
@@ -942,6 +942,13 @@ undefined.
Zn.b [msb] ... 0x??????03 0x??????02 0x??????01 0x??????00 [lsb]
Zn.s [msb] ... 0x00000003 0x00000002 0x00000001 0x00000000 [lsb]

Streaming compatibility

Would the alternative be for streaming and streaming-compatible regions to call the scalar functions, or to make all vector functions streaming-compatible? That would avoid the name mangling, but lose performance.

Ideally a streaming-compatible target would be universally compatible so that we wouldn't need a non-streaming and streaming-compatible implementation, although I don't think we can represent that with name-mangling. I guess a synonym or wrapper could be used in that case?

Calling scalar routines in a streaming section would be highly detrimental to performance (we have evidence to back that up). And yes, making all routines streaming-compatible would indeed cost performance in rather important routines.

> Ideally a streaming-compatible target would be universally compatible

Do you mean that no calls to streaming-compatible vector math routines would be emitted unless the target supports all instructions (FEAT_SME_FA64)? In that case, yes, a single optimized version can most likely be provided.

> although I don't think we can represent that with name-mangling.

Also, we cannot suddenly add a new attribute to an existing symbol without breaking the ABI, right?

> I guess a synonym or wrapper could be used in that case?

Could you elaborate on "synonym" and "wrapper" a little? Is this something the VFABI should specify?

You'll have to forgive me: that comment was four months ago, so I've forgotten a lot of the context.

What I think I was getting at was that if you had a streaming-compatible implementation, then that would be sufficient (if not optimal) for SVE, streaming and streaming-compatible callers. If an implementation wanted to provide only one streaming-compatible implementation (for whatever reason), and the functions were name mangled, then it would need to define a symbol for each of the name manglings. These could be wrappers that just call the streaming-compatible implementation, or they could be aliases (symbols defined at the same address as the streaming-compatible implementation).

Whether this is at all useful or not, I don't know. It may be that the people using the VFABI are only using it for maximum performance, so there would always be an optimal implementation for each case.

OK, got you! Well remembered, and good point.

Agreed that wrappers or aliases could be used to provide both manglings when only a single implementation is available (SVE or SSVE), which is a cheap way to get decent (not optimal) SVE and SSVE performance.

> if you had a streaming-compatible implementation, then that would be sufficient (if not optimal) for SVE, streaming and streaming-compatible callers.

Yes, this is the case for a significant number of implementations (e.g. SLEEF, AOR, libmvec).

> It may be that the people using the VFABI are only using it for maximum performance, so there would always be an optimal implementation for each case.

Also agreed that the VFABI may often be associated with best performance (e.g. AAVPCS), so we think the new mangling makes sense in that respect. It also offers a way to avoid performance drops due to mode switches (in math.h-heavy workloads).

^^^^^^^^^^^^^^^^^^^^^^^

If targeting SVE from a streaming or streaming-compatible region,
calls should be emitted to the streaming-compatible SVE rather than
the plain SVE variant (differentiated by mangling, as below).
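
As an illustration of the rule above, the following C sketch picks the ``<isa>`` mangling letter for an SVE call from the caller's mode. The ``caller_mode`` enum and ``sve_isa_letter`` function are hypothetical names introduced only for this example; a real compiler would presumably derive the mode from the ``__arm_streaming`` and ``__arm_streaming_compatible`` attributes mentioned in the review thread.

```c
#include <assert.h>

/* Hypothetical caller modes; these names are illustrative only. */
enum caller_mode {
    CALLER_NON_STREAMING,
    CALLER_STREAMING,
    CALLER_STREAMING_COMPATIBLE
};

/* Return the <isa> mangling letter to use for an SVE call made from
 * a caller in the given mode: "s" for plain SVE, "c" for
 * streaming-compatible SVE. */
static char sve_isa_letter(enum caller_mode mode)
{
    switch (mode) {
    case CALLER_NON_STREAMING:
        return 's';
    case CALLER_STREAMING:
    case CALLER_STREAMING_COMPATIBLE:
        return 'c';
    }
    assert(0 && "unknown caller mode");
    return '\0';
}
```

Under this sketch, only a non-streaming caller ever selects the plain SVE variant.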

Vector function name mangling
-----------------------------

@@ -983,6 +990,7 @@ Name mangling grammar for vector functions.

<isa> := "n" (Advanced SIMD)
| "s" (SVE)
| "c" (Streaming-compatible SVE)

<mask> := "N" (No Mask)
| "M" (Mask)
@@ -1195,6 +1203,19 @@ Note that the ``svbool_t`` parameter is described in `SVE masking`_.
svfloat32_t _ZGVsM8vv_bar(svfloat64_t vx, svfloat64_t vy,
svbool_t vmask);

Streaming-compatible SVE Examples
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The use of ``#pragma omp declare simd`` with ``f``, ``g`` and ``foo``

How would the compiler know that we are in a streaming or streaming-compatible region?

From what I can find in the ACLE, the user adds __arm_streaming and __arm_streaming_compatible to functions. I can see how this would work for generating calls to streaming-compatible vector functions, but I'm struggling to see how it would work for generating functions with #pragma omp declare simd, as these pragmas appear to be placed in front of function declarations, which typically won't be within the scope of a function body.

I'm not at all familiar with OMP though, so there could be a way of declaring a streaming-compatible region. I've checked https://clang.llvm.org/docs/AttributeReference.html#pragma-omp-declare-simd

Would be good to give an example if you can?

If this part is somewhat speculative it could be worth separating it from the name mangling part, as I assume someone could manually write the functions rather than have the compiler auto-generate them.

in a streaming or streaming-compatible region will also generate:

* ``svfloat32_t _ZGVcMxv_f(svfloat64_t, svbool_t) __arm_streaming_compatible``
streaming-compatible VLA signature for the vector version of ``f``;
* ``svfloat64_t _ZGVcMxv_g(svfloat32_t, svbool_t) __arm_streaming_compatible``
streaming-compatible VLA signature for the vector version of ``g``;
* ``svint16_t _ZGVcMxvvv_foo(svint64_t, svint32_t, svint8_t, svbool_t) __arm_streaming_compatible``
streaming-compatible VLA signature for the vector version of ``foo``.
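
The mangled names in the list above can be reproduced with a small C helper. This is only a sketch: the layout ``_ZGV<isa><mask><len><parameters>_<name>`` is read off the examples in this section, and ``mangle_vector_name`` is a hypothetical function, not part of any library or of the specification.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build "_ZGV<isa><mask><len><parameters>_<name>" into out.
 * isa:    'n', 's' or 'c'
 * mask:   'N' or 'M'
 * len:    vector length token, e.g. "8", or "x" for VLA
 * params: one token per parameter, e.g. "vvv"
 * Returns 0 on success, -1 if the result did not fit in out. */
static int mangle_vector_name(char *out, size_t out_size,
                              char isa, char mask, const char *len,
                              const char *params, const char *name)
{
    assert(out != NULL && len != NULL && params != NULL && name != NULL);
    int n = snprintf(out, out_size, "_ZGV%c%c%s%s_%s",
                     isa, mask, len, params, name);
    return (n >= 0 && (size_t)n < out_size) ? 0 : -1;
}
```

Calling it with ``'c'``, ``'M'``, ``"x"``, ``"vvv"`` and ``"foo"`` yields the streaming-compatible name shown for ``foo``; replacing ``'c'`` with ``'s'`` gives the corresponding plain SVE name.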

Linear parameters examples
--------------------------

@@ -1364,17 +1385,19 @@ AArch64 Variant Traits

.. table:: AArch64 traits for OpenMP contexts.

- +------------------+-----------------------+-------------------------+
- |Trait set         |Trait value            |Notes                    |
- +==================+=======================+=========================+
- |``device``        |``isa("simd")``        |Advanced SIMD call.      |
- +------------------+-----------------------+-------------------------+
- |``device``        |``isa("sve")``         |SVE call.                |
- +------------------+-----------------------+-------------------------+
- |``device``        |``arch("march-list")`` |Used to match            |
- |                  |                       |``-march=march-list``    |
- |                  |                       |from the compiler.       |
- +------------------+-----------------------+-------------------------+
+ +------------------+-----------------------+-------------------------------+
+ |Trait set         |Trait value            |Notes                          |
+ +==================+=======================+===============================+
+ |``device``        |``isa("simd")``        |Advanced SIMD call.            |
+ +------------------+-----------------------+-------------------------------+
+ |``device``        |``isa("sve")``         |SVE call.                      |
+ +------------------+-----------------------+-------------------------------+
+ |``device``        |``isa("sc_sve")``      |Streaming-compatible SVE call. |
+ +------------------+-----------------------+-------------------------------+
+ |``device``        |``arch("march-list")`` |Used to match                  |
+ |                  |                       |``-march=march-list``          |
+ |                  |                       |from the compiler.             |
+ +------------------+-----------------------+-------------------------------+

The scalar function ``f`` that is decorated with a ``declare
variant`` directive with a ``simd`` trait in the ``construct`` set is
@@ -1391,8 +1414,9 @@ mapped to the vector function ``F`` according to the following rules:

1. ``isa("simd")`` targets Advanced SIMD function signatures.
2. ``isa("sve")`` targets SVE function signatures.
- 3. Either ``isa("simd")`` or ``isa("sve")`` must be specified.
- 4. The ``arch`` traits of the ``device`` set is optional, and it
+ 3. ``isa("sc_sve")`` targets streaming-compatible SVE function signatures.
+ 4. One of ``isa("simd")``, ``isa("sve")`` or ``isa("sc_sve")`` must be specified.
+ 5. The ``arch`` traits of the ``device`` set is optional, and it
accepts any value that can be passed to the compiler via the
command line option ``-march``.
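
The correspondence between the ``isa`` trait values in these rules and the ``<isa>`` letters of the name mangling grammar can be sketched in C as follows. The pairing is taken from the trait table and the grammar shown in this diff; the function name ``isa_trait_to_letter`` is hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Map the string inside isa("...") to the <isa> mangling letter.
 * Returns '\0' for any value other than the three listed in the
 * trait table. */
static char isa_trait_to_letter(const char *isa)
{
    assert(isa != NULL);
    if (strcmp(isa, "simd") == 0)
        return 'n';   /* Advanced SIMD */
    if (strcmp(isa, "sve") == 0)
        return 's';   /* SVE */
    if (strcmp(isa, "sc_sve") == 0)
        return 'c';   /* Streaming-compatible SVE */
    return '\0';
}
```

An unrecognised value returns ``'\0'``, which a caller could treat as a violation of the rule that one of the three ``isa`` traits must be specified.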
