### pajenegod's blog

By pajenegod, history, 5 months ago,

Hi Codeforces!

I have something exciting to tell you guys about today! I have recently come up with a really neat and simple recursive algorithm for multiplying polynomials in $O(n \log n)$ time. It is so neat and simple that I think it might possibly revolutionize the way that fast polynomial multiplication is taught and coded. You don't need to know anything about FFT to understand and implement this algorithm.

Big thanks to nor, c1729 and Spheniscine for discussing the contents of the blog with me and comming up with ideas for how to improve the blog =).

I've split this blog up into two parts. The first part is intended for anyone to be able to read and understand. The second part is advanced and goes into a ton of interesting ideas and concepts related to this algorithm.

Prerequisite: Polynomial quotient and remainder, see Wiki article and this Stackexchange example.

Given two polynomials $P$ and $Q$, an integer $n$ and a non-zero complex number $c$, where degree $P < n$ and degree $Q < n$. Your task is to calculate the polynomial $P(x) \, Q(x) \% (x^n - c)$ in $O(n \log n)$ time. You may assume that $n$ is a power of two.

### Solution:

We can create a divide and conquer algorithm for $P(x) \, Q(x) \% (x^n - c)$ based on the difference of squares formula. Assuming $n$ is even, then $(x^n - c) = (x^{n/2} - \sqrt{c}) (x^{n/2} + \sqrt{c})$. The idea behind the algorithm is to calculate $P(x) \, Q(x) \% (x^{n/2} - \sqrt{c})$ and $P(x) \, Q(x) \% (x^{n/2} + \sqrt{c})$ using 2 recursive calls, and then use that result to calculate $P(x) \, Q(x) \% (x^n - c)$.

So how do we actually calculate $P(x) \, Q(x) \% (x^n - c)$ using $P(x) \, Q(x) \% (x^{n/2} - \sqrt{c})$ and $P(x) \, Q(x) \% (x^{n/2} + \sqrt{c})$?

Well, we can use the following formula:

\begin{aligned} A(x) \% (x^n - c) = &\frac{1}{2} \left(1 + \frac{x^{n/2}}{\sqrt{c}}\right) \left(A(x) \% (x^{n/2} - \sqrt{c})\right) \, + \\ &\frac{1}{2} \left(1 - \frac{x^{n/2}}{\sqrt{c}}\right) \left(A(x) \% (x^{n/2} + \sqrt{c})\right). \end{aligned}
Proof of the formula

This formula is very useful. If we substitute $A(x)$ by $P(x) Q(x)$, then the formula tells us how to calculate $P(x) \, Q(x) \% (x^n - c)$ using $P(x) \, Q(x) \% (x^{n/2} - \sqrt{c})$ and $P(x) \, Q(x) \% (x^{n/2} + \sqrt{c})$ in linear time. With this we have the recipie for implementing a $O(n \log n)$ divide and conquer algorithm:

Input:

• Integer $n$ (power of 2),
• Non-zero complex number $c$,
• Two polynomials $P(x) \% (x^n - c)$ and $Q(x) \% (x^n - c)$.

Output:

• The polynomial $P(x) \, Q(x) \% (x^n - c)$.

Algorithm:

Step 1. (Base case) If $n = 1$, then return $P(0) \cdot Q(0)$. Otherwise:

Step 2. Starting from $P(x) \% (x^n - c)$ and $Q(x) \% (x^n - c)$, in $O(n)$ time calculate

\begin{align} & P(x) \% (x^{n/2} - \sqrt{c}), \\ & Q(x) \% (x^{n/2} - \sqrt{c}), \\ & P(x) \% (x^{n/2} + \sqrt{c}) \text{ and} \\ & Q(x) \% (x^{n/2} + \sqrt{c}). \end{align}

Step 3. Make two recursive calls to calculate $P(x) \, Q(x) \% (x^{n/2} - \sqrt{c})$ and $P(x) \, Q(x) \% (x^{n/2} + \sqrt{c})$.

Step 4. Using the formula, calculate $P(x) \, Q(x) \% (x^n - c)$ in $O(n)$ time. Return the result.

Here is a Python implementation following this recipie:

One final thing that I want to mention before going into the advanced section is that this algorithm can also be used to do fast unmodded polynomial multiplication, i.e. given polynomials $P(x)$ and $Q(x)$ calculate $P(x) \, Q(x)$. The trick is simply to pick $n$ large enough such that $P(x) \, Q(x) = P(x) \, Q(x) \% (x^n - c)$, and then use the exact same algorithm as before. $c$ can be arbitrarily picked (any non-zero complex number works).

Python implementation for general Fast polynomial multiplication

If you want to try out implementing this algorithm yourself, then here is a very simple problem to test out your implementation on: SPOJ:POLYMUL.

### (Advanced) Speeding up the algorithm

This section will be about tricks that can be used to speed up the algorithm. The first two tricks will speed up the algorithm by a factor of 2 each. The last trick is advanced, and it has the potential to both speed up the algorithm and also make it more numerically stable.

$n$ doesn't actually need to be a power of 2
Imaginary-cyclic convolution
Calculating fast_polymult_mod(P, Q, n, c) using using fast_polymult_mod(P, Q, n, 1) (reweight technique)

This algorithm is actually FFT in disguise. But it is also different compared to any other FFT algorithm that I've seen in the past (for example the Cooley–Tukey FFT algorithm).

Using this algorithm to calculate FFT
This algorithm is not the same algorithm as Cooley–Tukey
FFT implementation in Python based on this algorithm
FFT implementation in C++ based on this algorithm

### (Advanced) Connection between this algorithm and NTT

Just like how there is FFT and NTT, there are two variants of this algorithm too. One using complex floating point numbers, and the other using modulo a prime (or more generally modulo an odd composite number).

Using modulo integers instead of complex numbers
Calculating fast_polymult_mod(P, Q, n, c) using fast_polymult_mod(P, Q, 2*n, 1)
This algorithm works to some degree even for bad NTT primes
NTT implementation in Python based on this algorithm
NTT implementation in C++ based on this algorithm
Blazingly fast NTT C++ implementation

### (Advanced) Shorter implementations ("Codegolfed version")

It is possible to make really short but slightly less natural implementations of this algorithm. Originally I was thinking of using this shorter version in the blog, but in the end I didn't do it. So here they are. If you want a short implemention of this algorithm to use in practice, then I would recommend taking one of these implementations and porting it to C++.

Short Python implementation without any speedup tricks
Short Python implementation supporting odd and even $n$ (making it up to 2 times faster)
Short Python implementation supporting odd and even $n$ and imaginary cyclic convolution (making it up to 4 times faster)
• +349

 » 5 months ago, # |   +8 Auto comment: topic has been updated by pajenegod (previous revision, new revision, compare).
 » 5 months ago, # |   +8 Auto comment: topic has been updated by pajenegod (previous revision, new revision, compare).
 » 5 months ago, # | ← Rev. 2 →   +36 Somewhat direct and compact way to obtain the key formula is as follows.Note that $x^n-c \equiv -2c \pmod{x^n+c}$ and $x^n+c \equiv 2c \pmod{x^n-c}$. Thus, the expression $A(x) = Q(x) \left(\frac{x^n+c}{2c}\right) + P(x) \left(\frac{x^n-c}{-2c}\right)$has remainder $Q(x)$ modulo $(x^n-c)$ and $P(x)$ modulo $(x^n+c)$. This is very similar to Chinese remainder theorem.But since FFT is commonly used as a black box, it's very hard to estimate how useful this recursion really is, until it's checked on problems designed to compare various FFT implementations with one another, such as Convolution or Convolution (large)...I tested out the suggested NTT implementation vs current kactl implementation. The implementation from this blog post takes 302ms and 48.46MiB (submission), The current kactl implementation takes 284ms and 52.6MiB (submission). So, seems roughly the same, but maybe the approach from this blog would be better if it's used for convolution directly, rather than through NTT? Unfortunately, the blog doesn't provide C++ implementation for this case...
•  » » 4 months ago, # ^ | ← Rev. 2 →   +3 The two implementations take about about the same time because computationally they are almost the same algorithm.However, it is possible to make something that runs faster than standard NTT. Here are two things to consider: You need almost no extra padding using the method from this blog compared to standard NTT. You can make the base case be something like if (n <= 64), and then switch to a fast $O(n^2)$ polynomial multiplication algorithm. This can be significantly faster than going down all the way to $n = 1$. Unfortunately, the blog doesn't provide C++ implementation for this case... Yes I agree. Because of that I have been working on making efficient C++ implementations.Here is a simple recursive implementation that makes use of 1. and 2: (239 ms), and here is a slightly more complicated iterative version: (205 ms). Adding on Montgomery and FastIO to the iterative version brings the time down to 63 ms. This same implementation also holds the current record for the fastest solution to Convolution (Large).
 » 5 months ago, # |   0 Auto comment: topic has been updated by pajenegod (previous revision, new revision, compare).
 » 5 months ago, # | ← Rev. 2 →   +5 Great blog!Below I'll try to compare this blog algorithm and standard FFT with splitting to even and odd indices (probably it's Cooley-Tukey algorithm, or only its special case, don't know).Let $n = n_1 n_2$. I was told at my university algorithm course that every discrete Fourier transform of size $n$ (that is, calculating $\sum\limits_{i = 0}^{n - 1}a_i \omega^{ji}$ for $0 \le j < n$ where $\omega^n = 1$) could be done via $n_2$ discrete Forier transforms of sizes $n_1$ and $n_1$ transforms on the result — of sizes $n_2$ (and some $\mathcal{O}(n)$ stuff). It decomposes indices and powers to the form of $in_2 + j$ or $k + n_1l$ and uses cyclicity of $\omega$. In terms of matricesLet $a \in \mathbb{C}^n, DFT_{n,\omega} := \left[\omega^{ij}\right]_{i=0..n}^{j=0..n}, \quad \omega_1 := \omega^{n_1}, \omega_2 := \omega^{n_2}$. Then $DFT_{n_2,\omega_1} \cdot \left(\left[\omega^{ij}\right]_{i=0..n_1}^{j=0..n_2}\circ \left(DFT_{n_1,\omega_2}\cdot \left[a_{n_2 i + j}\right]_{i=0..n_1}^{j=0..n_2}\right)\right)^t$is a $n_2 \times n_1$ matrix with values of $DFT_{n,\omega} \cdot a$ written row by row.Here $\left[x_{i,j}\right]_{i=0..m}^{j=0..k}$ denotes $m\times k$ matrix for $0 \le i < m$ and $0 \le j < k$; $\circ$ is element-wise product.Of course, the equality holds only if $\omega^n = 1$.Curiously enough, all ways to factorize $n$ (to small enough factors) result in the same complexity. For example, $T(n) = 2\cdot T\left(\frac{n}{2}\right) + \frac{n}{2}\cdot T(2) + \mathcal{O}(n)$ and $T(n) = 2\sqrt{n}\cdot T(\sqrt{n}) + \mathcal{O}(n)$ both resolve to $T(n) = \mathcal{O}(n\log n)$. WhyIntroduce $D(n) = \frac{T(n)}{n}$ and rewrite recurrence with it.So, one have some freedom while calculating discrete Fourier transform of size $n$. The most common scheme is to split the elements into two parts — odd and even, go into recursion and then merge results. It corresponds to $n_1 = \frac{n}{2}, n_2 = 2$: firstly you calculate $2$ independent DFT's of sizes $n_1$ and then merge them — in fact by $n_1$ DFT's of sizes $2$.The algorithm from this blog is different. For $c = 1$ it's the same as $n_1 = 2$ and $n_2 = \frac{n}{2}$: firstly you process data (take polynomials modulo $x^{n/2} \pm \sqrt{c}$ (in fact, doing $n_2$ DFT's of size 2), then go into recursion (and no need to process results afterwards). In the recursion $c$ alters, but the reweight technique, mentioned in the blog, shows us that it doesn't matter.Nevertheless, it was really unexpected to me that this interpretation somehow is able to deal with arbitrary $c$ in such a natural way. I don't know (putting aside general calculations for arbitrary $n_1, n_2$), which algorithm for a beginner is easier to understand. The main fundamental difference, as far as I can see, is that this interpretation is more "polynomial": you don't calculate values at some points and then interpolate, but take polynomials modulo. This allows more general tricks such that "bad NTT primes" and "avoiding padding too many zeroes to make n a power of 2". This tricks are impossible with even-odd FFT, aren't they?The last thing to notice is that I implemented in C++ even-odd FFT (iterative (723 ms), recursive (879 ms)) and $x^n - c$ version (iterative (537 ms), recursive (699 ms)). Putting aside that they don't show high performance, in my implementation this blog algorithm is a bit faster than a standard one.
•  » » 4 months ago, # ^ | ← Rev. 6 →   0 This allows more general tricks such that "bad NTT primes" and "avoiding padding too many zeroes to make n a power of 2". This tricks are impossible with even-odd FFT, aren't they? Actually there is a way to do this with even-odd FFT (Cooley-Tukey) too. ExplanationSuppose $n = a \, 2^b$. Start by splitting up the coefficients of your polynomials depending on their index mod a. This will split up $P$ into $a$ polynomials, and $Q$ into $a$ polynomials. Note that it is possible to construct $P \, Q$ by stitching together the products between all pairs of $P$ polynomials and $Q$ polynomials. If implemented efficiently, then this will run in $O((a + b) \, n)$. In comparision, using the algorithm from this blog results in $O((a + b) \, n)$.So the conclusion is that there are some analogues to bad NTT primes and avoiding padding too many zeroes to make n a power of 2 with even-odd FFT. But I think that using even-odd FFT will result in a more complicated and slower running algorithm. Putting aside that they don't show high performance About performance. I think there are two things that you are missing which will make this algorithm a lot faster than even-odd FFT (Cooley-Tukey).Firstly, pick $n = a \, 2^b$ in order to minimize padding. This wont matter for the Yosupo problem you benchmarked on since the input constraints are "nice" (Yosupo uses power of 2). But in general this will save a factor of two in time and memory over even-odd FFT.Secondly, try using a more advanced base case than  if (n == 1) { a[0] *= b[0]; return; } If you instead make the base case something like if (n <= 64), then you will have a much faster running algorithm. The reason for this is that if you have low degree polynomials, then the fastest thing to do is to just multiply them in $O(n^2)$ time. This turns out to give a fairly significant speed boost.My implementations take recursive (239 ms), iterative (205 ms). Adding on Montgomery and FastIO to the iterative version brings down the time to 63 ms. This same implementation also holds the current record for the fastest solution to Convolution (Large).
•  » » » 4 months ago, # ^ |   0 Hm, indeed, it's possible to implement even-odd FFT in $\mathcal{O}((a + b)n)$ using DFT instead of multiplications as subroutine.Regarding performance: yes, I see now (and could have easily guessed before) that there is much room to optimize. I tried to push it a little more (based on your codes), and it indeed becomes a bit faster, but not much. So, I just wanted to show that in my particular implementation this blog algorithm runs ~20% faster than even-odd FFT, but provided inefficiency of it, the result doesn't really matter.
 » 5 months ago, # |   +1 I ain’t reading allat
 » 5 months ago, # | ← Rev. 3 →   +8 I was not exactly sure about how it worked, but it seems that I have had a very similar (if not exactly the same) implementation of this NTT transform in our ICPC notebook from two years ago. I think I stole it from yosupo judge when looking for a fast but easy to code implementation of NTT.Also, this implementation seems to not require bit reversal at all. Codevoid DFT(vector &a, bool rev) { int n = a.size(); auto b = a; // assert(!(n & (n - 1))); ModInt g = 1; while (g.pow((MOD - 1) / 2) == 1) g = g + 1; if (rev) g = g.inv(); for (int step = n / 2; step; step /= 2) { ModInt w = g.pow((MOD - 1) / (n / step)), wn = 1; for (int i = 0; i < n / 2; i += step) { for (int j = 0; j < step; ++j) { auto u = a[2 * i + j], v = wn * a[2 * i + j + step]; b[i + j] = u + v; b[i + n / 2 + j] = u - v; } wn = wn * w; } swap(a, b); } if (rev) { auto n1 = ModInt(n).inv(); for (auto& x : a) x = x * n1; } } I think my reasoning at that time was that the Hadamard linear transform that appears when doing polynomial evaluation can be written as a product of elementary "butterfly" matrices, but they can be evaluated in any order, and this implementation just chooses to compute it in reverse (large to small instead of small to large). I might be wrong, though, as my memory is really foggy nowadays.
•  » » 4 months ago, # ^ |   0 Oh cool. Thats a really nice and short implementation! Also, this implementation seems to not require bit reversal at all. The way I understand it is that FFT algorithms require bit reversal if they work in-place in memory. Your implementation gets around this by not working in-place. So in a sense, swap(a, b); is your bit reversal.
•  » » » 4 months ago, # ^ |   0 Yes, that’s true. It’s nice that it makes the code shorter, and it also seemed to be a bit faster than bit reversal solutions in my experiments (probably due to cache friendliness).
•  » » » » 4 months ago, # ^ | ← Rev. 2 →   0 I tried implementing this style of NTT combined with the algorithm from this blog. The code becomes really neat and short, but it also seems about 30 % slower compared to the in-place version.Even if it turns out to be slower, I think the added simplicity is likely worth it for CP. It is also possible to codegolf it down a ton, for example you can switch (MOD - 1) / 2 to MOD / 2 and (MOD - 1) / (n / step) to MOD / n * step. You also don't need to explicitly invert $g$, instead you could just do ModInt w = g.pow(rev ? MOD - 1 - MOD / n * step : MOD / n * step).
 » 4 months ago, # |   0 Auto comment: topic has been updated by pajenegod (previous revision, new revision, compare).
 » 4 months ago, # |   +81 Discussed it with some polynomial-pros and found out that this algorithm is well known in china:
•  » » 4 months ago, # ^ |   +34 yet another "well known in China" time
•  » » 4 months ago, # ^ |   0 Is there some funny name for it, like for other things that are well known in China?
•  » » » 4 months ago, # ^ |   0 it's called DIF-DIT, not funny enough for a name though