Use AVX512F and AVX512IFMA to speed up integer multiplication

There was a paper "Accelerating Big Integer Arithmetic Using Intel IFMA Extensions" (see below) that uses AVX512F and AVX512IFMA to multiply small integers. I (asked AI to) benchmark it against flint and get:

```
  flint_mpn_mul_n_1024      95.94 ns  +/-   0.44 ns
  mul1024_avx512           100.64 ns  +/-   0.82 ns
  mul1024_vpmadd            45.60 ns  +/-   0.43 ns

  flint_mpn_mul_n_2048     294.76 ns  +/-   4.59 ns
  mul2048_avx512           275.32 ns  +/-   3.53 ns
  mul2048_vpmadd           116.30 ns  +/-   1.96 ns

  flint_mpn_mul_n_3072     649.84 ns  +/-  11.00 ns
  mul3072_avx512           595.43 ns  +/-   7.82 ns
  mul3072_vpmadd           192.20 ns  +/-   2.27 ns

  flint_mpn_mul_n_4096     962.09 ns  +/-  14.97 ns
  mul4096_avx512           899.81 ns  +/-  11.22 ns
  mul4096_vpmadd           607.16 ns  +/-  10.44 ns
```

Looks quite competitive. Note that the suffix is the number of bits, so for example `1024` means 16 limbs input, 32 limbs output.

The paper only provide some fixed sizes, however. More work is needed to adapt it to all sizes.

Note that the repo use GPL v3, but flint is LGPL v3. Might have issue here.

There's a sharp spike at `mul4096_vpmadd`. No doubt that can be improved.

References:


- https://github.com/vkrasnov/vpmadd
- https://www.semanticscholar.org/paper/Accelerating-Big-Integer-Arithmetic-Using-Intel-Gueron-Krasnov/fd2f45cb7e2631d34874dde307f317a859d59a68
- http://acsel-lab.com/arithmetic/arith23/data/1616a032.pdf

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use AVX512F and AVX512IFMA to speed up integer multiplication #2696

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Use AVX512F and AVX512IFMA to speed up integer multiplication #2696

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions