Recently I was looking at the fft_small code, and notice that in nmod_poly_mul, at most 4 primes are needed.
This is because the modulus is M < 2^64, so for sequences of length, say, n > 2^22, then n ⋅ M^2 may exceed 2^150. Each CRT prime gives you 50 bits of precision, so 3 primes may be insufficient in this case.
One idea to do slightly better is the following. If it works, it may be 25% faster than the current algorithm in the largest cases where 4 primes are used.
Suppose we want to compute the convolution of (a_0, ..., a_{n-1}) and (b_0, ..., b_{m-1}) modulo M. For each i, let a'_i be randomly either a_i or a_i - M, the probability is such that the expected value of a'_i is zero. Similar for b'_i. Compute the convolution of a' and b' using CRT as before. Then, each resulting coefficient c'_k is the sum of at most n independent random variables, each has expected value 0 and constant variance, so you can apply Chernoff bound (need to work through the detail) to show that ℙ[|c'_k| ≥ C ⋅ √n ⋅ M^2] ∼ exp(-C ⋅ constant). Pick large enough C so that the failure probability is vanishingly small, then take the union bound over all k.
We expect to do better in this case if C ≪ √n. Would be nice if there is a similar deterministic algorithm though.
With a similar idea, one can gain one bit in the CRT product. Note that [-M/2, M/2] × [-M/2, M/2] = [-M^2/4, M^2/4] which has width M^2/2, while [0, M) × [0, M) = [0, (M-1)^2] which has width (M-1)^2.
Recently I was looking at the
fft_smallcode, and notice that innmod_poly_mul, at most 4 primes are needed.This is because the modulus is
M < 2^64, so for sequences of length, say,n > 2^22, thenn ⋅ M^2may exceed2^150. Each CRT prime gives you 50 bits of precision, so 3 primes may be insufficient in this case.One idea to do slightly better is the following. If it works, it may be 25% faster than the current algorithm in the largest cases where 4 primes are used.
Suppose we want to compute the convolution of
(a_0, ..., a_{n-1})and(b_0, ..., b_{m-1})moduloM. For eachi, leta'_ibe randomly eithera_iora_i - M, the probability is such that the expected value ofa'_iis zero. Similar forb'_i. Compute the convolution ofa'andb'using CRT as before. Then, each resulting coefficientc'_kis the sum of at mostnindependent random variables, each has expected value 0 and constant variance, so you can apply Chernoff bound (need to work through the detail) to show thatℙ[|c'_k| ≥ C ⋅ √n ⋅ M^2] ∼ exp(-C ⋅ constant). Pick large enoughCso that the failure probability is vanishingly small, then take the union bound over allk.We expect to do better in this case if
C ≪ √n. Would be nice if there is a similar deterministic algorithm though.With a similar idea, one can gain one bit in the CRT product. Note that
[-M/2, M/2] × [-M/2, M/2] = [-M^2/4, M^2/4]which has widthM^2/2, while[0, M) × [0, M) = [0, (M-1)^2]which has width(M-1)^2.