Idea for slightly faster `fft_small`

Recently I was looking at the `fft_small` code, and notice that in `nmod_poly_mul`, at most 4 primes are needed.

This is because the modulus is `M < 2^64`, so for sequences of length, say, `n > 2^22`, then `n ⋅ M^2` may exceed `2^150`. Each CRT prime gives you 50 bits of precision, so 3 primes may be insufficient in this case.

One idea to do slightly better is the following. If it works, it may be 25% faster than the current algorithm in the largest cases where 4 primes are used.

Suppose we want to compute the convolution of `(a_0, ..., a_{n-1})` and `(b_0, ..., b_{m-1})` modulo `M`. For each `i`, let `a'_i` be randomly either `a_i` or `a_i - M`, the probability is such that the expected value of `a'_i` is zero. Similar for `b'_i`. Compute the convolution of `a'` and `b'` using CRT as before. Then, each resulting coefficient `c'_k` is the sum of at most `n` independent random variables, each has expected value 0 and constant variance, so you can apply Chernoff bound (need to work through the detail) to show that `ℙ[|c'_k| ≥ C ⋅ √n ⋅ M^2] ∼ exp(-C ⋅ constant)`. Pick large enough `C` so that the failure probability is vanishingly small, then take the union bound over all `k`.

We expect to do better in this case if `C ≪ √n`. Would be nice if there is a similar deterministic algorithm though.

With a similar idea, one can gain one bit in the CRT product. Note that `[-M/2, M/2] × [-M/2, M/2] = [-M^2/4, M^2/4]` which has width `M^2/2`, while `[0, M) × [0, M) = [0, (M-1)^2]` which has width `(M-1)^2`.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Idea for slightly faster `fft_small` #2687

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Idea for slightly faster fft_small #2687

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

Idea for slightly faster `fft_small` #2687