There was a paper "Accelerating Big Integer Arithmetic Using Intel IFMA Extensions" (see below) that uses AVX512F and AVX512IFMA to multiply small integers. I (asked AI to) benchmark it against flint and get:
flint_mpn_mul_n_1024 95.94 ns +/- 0.44 ns
mul1024_avx512 100.64 ns +/- 0.82 ns
mul1024_vpmadd 45.60 ns +/- 0.43 ns
flint_mpn_mul_n_2048 294.76 ns +/- 4.59 ns
mul2048_avx512 275.32 ns +/- 3.53 ns
mul2048_vpmadd 116.30 ns +/- 1.96 ns
flint_mpn_mul_n_3072 649.84 ns +/- 11.00 ns
mul3072_avx512 595.43 ns +/- 7.82 ns
mul3072_vpmadd 192.20 ns +/- 2.27 ns
flint_mpn_mul_n_4096 962.09 ns +/- 14.97 ns
mul4096_avx512 899.81 ns +/- 11.22 ns
mul4096_vpmadd 607.16 ns +/- 10.44 ns
Looks quite competitive. Note that the suffix is the number of bits, so for example 1024 means 16 limbs input, 32 limbs output.
The paper only provide some fixed sizes, however. More work is needed to adapt it to all sizes.
Note that the repo use GPL v3, but flint is LGPL v3. Might have issue here.
There's a sharp spike at mul4096_vpmadd. No doubt that can be improved.
References:
There was a paper "Accelerating Big Integer Arithmetic Using Intel IFMA Extensions" (see below) that uses AVX512F and AVX512IFMA to multiply small integers. I (asked AI to) benchmark it against flint and get:
Looks quite competitive. Note that the suffix is the number of bits, so for example
1024means 16 limbs input, 32 limbs output.The paper only provide some fixed sizes, however. More work is needed to adapt it to all sizes.
Note that the repo use GPL v3, but flint is LGPL v3. Might have issue here.
There's a sharp spike at
mul4096_vpmadd. No doubt that can be improved.References: