Notes on FFT / NTT, and the "ultimate" NTT with modulus > 9 * 10^18

Prerequisites: you need to be familiar with both modular arithmetic and Fast Fourier Transform / number theoretic transform. The latter can be a rather advanced topic, but I personally found this article helpful. Note that there are some relatively easy optimizations/improvements that can be done, like precalculating the modular inverse of $$$n$$$, the "bit-reversed-indices" (can be done in $$$O(n)$$$ with DP), as well as the powers of $$$\omega_n$$$. Also useful is modifying it so that the input array length can be any power of two $$$\leq n$$$; some problems require multiplying polynomials of many different lengths, and you'd rather the runtime be $$$O(n \log n)$$$ over the sums of lengths, rather than (number of polynomials * maximum capacity).

Also helpful is knowing how to use NTT to solve problems modulo $$$998244353$$$, like 1096G - Lucky Tickets and 1251F - Red-White Fence. Note that for some problems it's easier to think of FFT not as multiplying polynomials, but of finding multiset $$$C$$$ as the pairwise sums of multisets $$$A$$$ and $$$B$$$, in the form of arrays $$$A[i] =$$$ number of instances of $$$i$$$ in multiset $$$A$$$. This is equivalent to multiplying polynomials of the form $$$\sum _{i=0} ^n A[i]x^i$$$.

Note that $$$\omega_n$$$ can be easily found via the formula $$$g ^ {(m-1) / n} \ \text{ mod } m$$$, provided that:

$$$m$$$ is prime
$$$g$$$ is any primitive root modulo $$$m$$$. It is easiest to find this before hand and then hardcode it in the submission. You can either implement the algorithm yourself or use Wolfram Alpha to find it via the query PrimitiveRoot[m]. (Spoiler alert, $$$g = 3$$$ works well for $$$998244353$$$)
$$$n$$$ divides $$$m-1$$$ evenly. As $$$n$$$ is typically rounded up to the nearest power of 2 for basic FFT implementations, this is easiest when $$$m$$$ is of the form $$${a \cdot 2^{k} + 1}$$$ where $$$2^k \geq n$$$. This is why $$$998244353$$$ is a commonly-appearing modulus; it's $$$119 \cdot 2^{23} + 1$$$. Note that this modulus also appears in many problems that don't require FFT/NTT; this is a deliberate "crying-wolf" strategy employed by puzzle writers, so that you can't recognize immediately that a problem requires FFT/NTT via the given modulus.

Now onto the main topic, the "ultimate" NTT.

Rationale: There are a few problems like 993E - Nikita and Order Statistics that require FFT, however the results aren't output with any modulus, and indeed may exceed the range of a 32-bit integer type. There are several usual solutions for these types of problems:

Do NTT with two moduli and restore the result via Chinese Remainder Theorem. This has several prominent disadvantages:
1. Slow, as the NTT routine has to be done twice.
2. Complicated setup, as several suitable moduli must be found, and their primitive roots calculated
3. Restoring the result with CRT requires either brute force or multiplications modulo $$$pq$$$, which may overflow even 64-bit integer types.
Do FFT with complex numbers and floating point types. Disadvantages are:
1. Could be slow due to heavy floating-point arithmetic. Additionally, JVM-based languages (Java, Kotlin, Scala) suffer complications here, as representing complex numbers with object-based tuples adds a significant overhead.
2. Limited precision due to rounding error. Typically the problems are constructed such that it won't be a problem if care is taken in the implementation, but won't it be nice to just not to have to worry about it?

To solve these problems, I propose the "ultimate" NTT solution — just use one huge modulus. The one I use is $$$m = 9223372036737335297 = 549755813881 \cdot 2^{24} + 1, g = 3$$$. This is just over a hundred million less than $$$2^{63} - 1$$$, the maximum value of a signed 64-bit integer.

However, this obviously raises the issue of how to safely do modular arithmetic with such huge integers. Addition is complicated by possible overflow into the sign bit, thus the usual if(x >= m) x -= m won't work. Instead, first normalize $$$x$$$ into the range $$$[-m, m)$$$; this is easily done with subtracting $$$m$$$ before any addition operation. Then do x += (x >> 63) & m. This has the effect of adding $$$m$$$ to $$$x$$$ if and only if $$$x$$$ is negative.

The elephant in the room however is multiplication. The usual method requires computing a 126-bit product, then doing a modulo operation over a 63-bit integer; this could be slow and error-prone to implement. C++ users could rejoice, as Codeforces recently implemented support for 128-bit integers via this update. However, before you celebrate too early, there are still issues: this may not be available on other platforms, and I can't imagine that straight 128-bit modulo 64-bit is exactly the fastest operation in the world, so the following might still be helpful even to C++ users.

A rather simple-to-code general-case method is to "double-and-add" similar to modular exponentiation, however it is too slow for the purposes of this article; FFT implementation speed is heavily bound by the speed of multiplications.

My favored technique for this problem is called Barrett reduction. The explanation is (adapted from this website)

Choose integer $$$k$$$ such that $$$2^k > m $$$. $$$k = 63$$$ works for our modulus.
Precompute $$$\displaystyle r = \left\lfloor \frac {4^k} m \right\rfloor$$$. For our modulus this is $$$9223372036972216319$$$, which just overflows a signed 64-bit integer. You can either store it as unsigned, or the as the signed 64-bit representation $$$-9223372036737335297$$$ (yes, this just happens to be $$$-m$$$, which is quite a neat coincidence)
Multiply the two integers. Note that we need all 126 bits of this product. The low bits can be easily obtained via straight multiplication, however the high bits need some bit-shifting arithmetic tricks to obtain. Adapt the following Java code, taken from here for your preferred language:

multiplyHighUnsigned code

public static long multiplyHighUnsigned(long x, long y) {
    long x_high = x >>> 32;
    long y_high = y >>> 32;
    long x_low = x & 0xFFFFFFFFL;
    long y_low = y & 0xFFFFFFFFL;
        
    long z2 = x_low * y_low;
    long t = x_high * y_low + (z2 >>> 32);
    long z1 = t & 0xFFFFFFFFL;
    long z0 = t >>> 32;
    z1 += x_low * y_high;
    return x_high * y_high + z0 + (z1 >>> 32);
}

Let the product be $$$x$$$. Then calculate $$$\displaystyle t = x - \left\lfloor \frac{xr}{4^k} \right\rfloor m - m$$$. This requires some adaptation of grade-school arithmetic; pseudocode in spoiler below. $$$t$$$ is guaranteed to be in the range $$$[-m, m)$$$, so the t += (t >> 63) & m trick should work to normalize it. In the following pseudocode, BARR_R $$$= r$$$ and MOD $$$= m$$$. Also, ^ represents bitwise xor, not exponentiation.

mulMod pseudocode

function mulMod(a: Long, b: Long): Long {
    xh = multiplyHighUnsigned(a, b) // high word of product
    xl = a * b // low word of product

    xrh = multiplyHighUnsigned(xh, BARR_R) // high word of xr
    xrm = multiplyHighUnsigned(xl, BARR_R) // middle word of xr, first part
    add = xh * BARR_R // second part of middle word
    xrm += add // add them
    if(add ^ (1L << 63) > xrm ^ (1L << 63)) xrh++ // carry, see note 1

    t = xl - ((xrh << 2) | (xrm >>> 62)) * MOD - MOD // see note 2
    t += (t >> 63) & MOD
    return t
}

Note 1: The carry step does a little trick for unsigned comparison (flip high bits and compare as signed). If the addend is greater than the result, an overflow must have occurred, so we perform the carry.

Note 2: $$$t$$$ fits in 64 bits, so we only need the low word of $$$x$$$. We reconstruct $$$\left\lfloor \frac{xr}{4^{63}} \right\rfloor$$$ with the bits from xrh and xrm. Note we don't need the low word of $$$xr$$$ as it's discarded by the floor operation. When we multiply this factor with $$$m$$$, we ignore the overflow as the final result is guaranteed to fit in 64 bits.

Update: I have further optimized the Barrett reduction step, however it requires a new proof of correctness, so I'm keeping the old variant on this page. For info on the optimized version, read this article.

And that's it for this blogpost. This should be sufficient to solve most integer-based FFT problems in competitive programming. Note that if you need negative numbers in your polynomials, you can input them modulo $$$m$$$, then assume any integer in the output that exceeds $$$m/2$$$ is negative; this works as long as no number in the output has an absolute value greater than $$$m/2$$$, yet another advantage of using such a huge modulus.

Note that for arbitrary polynomials of length $$$n$$$ with input values not exceeding $$$a$$$, the maximum possible value in the product polynomial is around $$$a^2n$$$.

Samples

74644979 for 993E - Nikita and Order Statistics

74641269 for 986D - Perfect Encoding — this is particularly interesting, because this huge modulus allows 6 digits per word in the custom big-integer implementation instead of the recommended 3, cutting down the FFT size (as well as the size of other arithmetic operations) by half. With this tight a time limit, I'm not sure this problem is solvable in Kotlin otherwise.

Addendum

There are a small minority of problems that may ask to perform FFT with an inconvenient modulus like $$$10^9 + 7$$$, or an arbitrarily given one in the input (do any exist on Codeforces?). $$$a^2n$$$ in this case could overflow even this large modulus, but it can be handled with the "multiplication with arbitrary modulus" method listed in the CP-algorithms article. (I heard this is called MTT; also known as "FFT-mod") Even then, the large modulus this article covers might be helpful, as it reduces the number of decompositions you must make.

More addendum (December 2021)

It seems this method is significantly faster when run on a 64-bit system:

130750094 (1808 ms, Rust ran on 32-bit before an update circa November, IIRC)

140255354 (576 ms, after 64-bit update)

It even beats out my implementations using floats (140229918, 717 ms) and CRT/Garner's algorithm with two 32-bit moduli (140230689, 842 ms)

Unfortunately Java and Kotlin, at this time of writing, still seems to run on a VM that's effectively 32-bit.

There is also another interesting prime modulus, as noted in this offsite blogpost: $$$18446744069414584321$$$ (hex: 0xffff_ffff_0000_0001, $$$g = 7$$$). Due to its bitpattern, you can use some adds, subtracts, and shifts to do the reduction under this modulus, instead of the multiplies needed by Barrett reduction.

However my implementation of it ended up slightly slower (140308499, 639 ms). It seems multiplies are just that good on 64-bit systems. It might still be worth considering for its slightly higher range, huge capacity (can theoretically support FFTs up to $$$2^{32}$$$ in length, though you'd likely MLE first). It might also be friendlier to 32-bit systems (needs testing). This modulus, as noted in the blog post, also has other interesting properties; chiefly that $$$2$$$ is a $$$192^{\text{nd}}$$$ root of unity; this allows some roots of unity to be expressed in powers of two, which allows using shifts instead of multiplication in some cases. However I've not figured out how to efficiently use this fact (it might be more helpful for mixed-radix variants of FFT, which I've not learned how to implement).

Either modulus can also be used to emulate convolutions in other arbitrary moduli, such as $$$10^9 + 7$$$ (https://judge.yosupo.jp/submission/70563) and $$$2^{64}$$$ (https://judge.yosupo.jp/submission/70564), by splitting each coefficient into $$$k$$$ "place values" for a particular base (I use $$$2^{16}$$$ in these examples), then assigning $$$2k - 1$$$ slots for each coefficient (the extra slots allow the convolution to "spill over"), essentially emulating multivariate convolution with a single FFT. The speed overhead of this method, however, is quite significant; my implementation using CRT/Garner's (https://judge.yosupo.jp/submission/70678) is much faster in this case.

Comments (4)

Write comment?

jef

4 years ago, # |

"Instead, first normalize x into the range [−m,m); this is easily done with subtracting m after any addition operation." — You'll want to subtract before any addition that can potentially overflow since signed integer overflow is undefined behavior.

→ Reply

Spheniscine

4 years ago, # ^ |

← Rev. 2 →

It's defined in JVM and acts the same way in C++ on pretty much all modern systems, but sure, I'll amend it

You really don't want to rely on undefined behavior in C++ since the compiler can optimize based on the assumption that UB never happens. If you handle overflow after it happens it is entirely possible for the compiler to completely discard your checks. See https://godbolt.org/z/mDmuID for an example. What you're doing seems likely to be fine since the UB doesn't present any opportunities for optimization but it's just not a good idea to rely on signed overflow in general.

The multiplication with arbitrary 32-bit modulus method is known as FFTMod. It's explained here: http://neerc.ifmo.ru/trains/toulouse/2017/fft2.pdf

#	User	Rating
1	ecnerwala	3648
2	Benq	3580
3	orzdevinwang	3570
4	cnnfls_csy	3569
5	Geothermal	3568
6	tourist	3565
7	maroonrk	3530
8	Radewoosh	3520
9	Um_nik	3481
10	jiangly	3467

#	User	Contrib.
1	maomao90	174
2	awoo	164
2	adamant	164
4	TheScrasse	159
4	nor	159
6	maroonrk	156
7	-is-this-fft-	150
8	SecondThread	147
9	orz	146
10	pajenegod	145

Spheniscine's blog

Samples

Addendum

More addendum (December 2021)