Montgomery/Barret reduction and NTT [Performance optimization]

Revision en3, by alexvim, 2024-05-19 03:08:30

$$$\DeclareMathOperator{\ord}{ord}$$$ $$$\DeclareMathOperator{\mod}{mod}$$$

Montgomery/Barret reduction and NTT

In this article it will be shown how to apply Montgomery/Barret reduction of modular arithmetic to Number Theoretic Transform and combinatorial tasks. We will implement NTT with minimum of modular operations that beats classic FFT both in precision and speed. We will compare the performance of various optimizations of NTT and FFT in complex field. Also it will be shown how to add support of negative coefficients to NTT and replace FFT.

1. Why use NTT

DFT (Discrete Fourier Transform) is widely used for polynomial multiplication. DFT can be calculated in every field such that there is nth root (element with order n). For instance, we can use complex numbers or numbers modulo some $$$p$$$ (finite field). Widely known realization — Fast Fourier Transform — calculates DFT in $$$O(n\log{n})$$$ both in complex numbers and integers (only arithmacy varies). FFT in $$$\mathbb{Z}/p\mathbb{Z}$$$ called NTT (number theoretic transform). The basic algorithm is reviewed in many articles, e.g. CP-algorithms. NTT is barely used because standard implementation with big modules is very slow without hardware acceleration. There it will be shown how to implement fast NTT with implicit modular operations.

Of course, we have Fast Fourier transform on complex doubles that widely used in polynomial multiplication (and not only that), including solving tasks on Codeforces. But FFT has very bad precision and quite slow (even with non-recursive implementation). Therefore, if we are given polynomials with non-negative integer coefficients to multiply, we are able to use NTT: calculate coefficients some modulo $$$p$$$ and recover actual answer up to $$$10^{18}$$$ (either Chinese Remainder theorem can be used for recovering answer by two modules or single module $$$~10^{18}$$$ with 128-bit cast).

The main advantage over FFT is precision — we can get actual product with coefficients in $$$[0; 10^{18}]$$$ where with FFT we will get error 10-100 with doubles and 1-10 with long doubles (for instance, 259759436 gets WA43 in polynomial multiplication task with FFT with doubles, but 259759880 with long doubles gets OK).

NTT is not faster out of the box. Majority implementations of NTT uses a lot of modular arithmetic so at many hardware (especiallly without SIMD) it runs a slower that FFT. But if we will be able to get rid of modular operations then NTT will run much faster than FFT with doubles. And there we can use Montgomery or Barret reduction to multiply integers in special form without % operation.

2. Montgomery reduction

Montgomery reduction is approach to operate with unsigned integers on some module without % division. The main idea is that we can reduce multiplication modulo $$$p$$$ to multiplication modulo $$$2^k$$$ that can be implemented using bit manipulations. Recommended to see the guide. Montgomery reduction requires numbers to operate be in special form (Montgomery space). Conversion into this form and vice versa requires using actual modular division but NTT has special case: all base numbers that we will multiply modulo are source coefficients, root and some inverses (n, root). We can do it in $$$O(n)$$$ before using NTT and after it (to convert into normal form). Hence, we got rid of all modular division inside NTT. Single nuance there is that if we calculating NTT over big modulo ($$$~10^{18}$$$) we have to use __int128. Therefore, this algorithm has big potential to be used with AVX. There's even a Intel guide for optimization for Montgomery multiplication via AVX.

P.S. But if we want to reduce 128 bit integers without loss of size of coefficients, we can calculate NTT over two modules using only 64-bit integers (can be calculated in single call in parallel) and then solve system of congruences over two prime modules via Chinese Remainder Theorem. Then __int128 will be used only in CRT.

3. Barret reduction

Another way to reduce modular arithmetic is Barret reduction. Barrett is based on approximating the real reciprocal with bounded precision. Certainly, let's transform expression for remainder:

$$$ a \mod p = a - \lfloor {a \over p} \rfloor \cdot p $$$

Choose $$$m$$$, such that:

$$$ {1 \over p} = {m \over 2^k} \longleftrightarrow m = {2^k \over p} $$$

If we take big enough $$$2^k$$$ (like $$$p^2$$$) answer for numbers in our range (<$$$10^{18}$$$) will be correct. See full proof here. Barret reduction gives us the same functionality as Montgomery but doesn't require converting to special form. Main nuance that we have to use bit shift to more than $$$p^2$$$ so if we use modulo $$$~10^{18}$$$ 256-bit integers is required so operations is more exprensive.

4. Good modules

Firstly, we have to choose module $$$p$$$ such that $$$n | (p-1): n=2^k$$$ in order for existence nth root (order of element divides order of cyclic group from Lagrange theorem). Since we have $$$g$$$ — primitive root, consider $$$g^{p-1 \over n}$$$ as nth root. For concrete module $$$p$$$ such roots can be easily found via binary exponentiation and primitive root searching algorithm in $$$O(Ans \cdot \log^2(p))$$$, $$$Ans$$$ is always small by theory of roots in finite field. See proof of correctness. For example, for work with coefficients up to $$$2 \cdot 10^{18}$$$ and size of polynomial up to $$$2^{24}$$$ we can use module $$$p=2524775926340780033$$$. Primitive root is $$$3$$$.

Primitive root searching algorithm

5. Implementations

We will use non-recursive realization of FFT (NTT) with external structure for Montgomery/Barret arithmetic. Convenient to use unsigned integers while dealing with reductions because of trick with overflow: it doesn't lead to undefined behavior but returns answer modulo $$$2^{64}$$$.

NTT with Montgomery multiplication

Implementation with Barret reduction can only be used with modules $$$<2^{32}$$$ without 256-bit integers. Because Barret reduction a priori has more expensive operations, there isn't implementation with uint256. There is ability to use uint32 instead of uint64 in code but for equality of testing conditions there is same formats both for Montgomery and Barret.

NTT with Barret multiplication

Also there is provided FFT implementation with custom struct for complex numbers. It can be used with doubles / long doubles.

FFT with custom cmpls

NTT without optimizations:

NTT without opt

6. Performance

We will test 5 algorithms: FFT with doubles and long doubles, NTT with Montgomery and Barret reductions and NTT without optimizations.

Firstly, let's test it on 993E - Nikita and Order Statistics. It is important to notice that NTT with Barret reduction and FFT with doubles got WA43 (but passed another tests with big input) because FFT with doubles has very bad precision as it was noticed before. NTT with Barret has not passed because module $$$998244353$$$ is quite small to store answer in this problem (NTT with Barret and without uint256 can be used only for small coefficients).

FFT, doubles FFT, long doubles NTT, Montgomery NTT, Barret NTT, no opt
261258380, 311ms, BAD PRECISION 261256931, 702ms, OK 261256487, 281ms, OK 261257532, 312ms, BAD LIMITS 261257532, 859ms, OK

Now let's see results on multiplication random polynomials with $$$10^6$$$ size (tested on Codeforces custom test):

FFT, doubles FFT, long doubles NTT, Montgomery NTT, Barret NTT, no opt
1407ms 2434ms 1374ms 1437ms 4122ms
1286ms 2426ms 1256ms 1675ms 3683ms
1052ms 2286ms 1153s 1320ms 3518ms
1079ms 2359ms 1309ms 1380ms 3961ms
1217ms 2791ms 1476ms 1780ms 3400ms
1241ms 2191ms 1125ms 1337ms 3512ms

Next, local tests on Ryzen 5 5650u (AVX), 8Gb RAM 4266MHz, Debian 12, GCC 12.2.0. Using g++ -Wall -Wextra -Wconversion -static -DONLINE_JUDODGE -O2 -std=c++20 -o fftest compilation parameters (like on Codeforces).

FFT, doubles FFT, long doubles NTT, Montgomery NTT, Barret NTT, no opt
691ms 2339ms 352ms 401ms 398ms
700ms 2328ms 411ms 500ms 487ms
729ms 2036ms 359ms 446ms 396ms
698ms 2164ms 380ms 456ms 472ms
734ms 2284ms 407ms 462ms 440ms
737ms 2148ms 396ms 459ms 434ms

On Codeforces server Montgomery + NTT and FFT + doubles have similar performance, then NTT + Barret, then FFT + long doubles and then vanilla NTT. On local with modern CPU with SIMD leader is Mongomery + NTT. Actually, on local system NTT runs much faster even wihout optimizations. I can't certainly determine what technology are responsible for fast modular arithmetic, but most likely it is some SIMD. In general, NTT has good potential to hardware acceleration. For example here shown how to accelerate NTT via FPGA.

As wee see, performance of NTT with big module rests on platform because of many operations with high-bit integers, exactly, 128-bit modular division. This operation depends on bare metal implementation so on Codeforces NTT without optimization is the slowest but on CPU with SIMD acceleration NTT wins FFT. However, diffirence can be neutralized via reduction which uses only multiplication of 128-bit integers. This article shows problems of 128-bit division.

So, NTT + Montgomery reduction is fairly universal choice on all platforms. Single nuance is that NTT can't operate with negative coefficients because answer is modulo $$$p$$$.

7. Negative coefficients

Consider $$$A(x)$$$ and $$$B(x)$$$ have any integer coefficients and we want to get coefficients of the product via NTT. Let $$$C(x)$$$ is polynomial such that every coefficient of $$$A(x)+C(x)$$$ $$$B(x)+C(x)$$$ and $$$A(x)+B(x)+C(x)$$$ is non-negative. Notice that

$$$ A \cdot B = (A+C) \cdot (B+C) - C \cdot (A+B+C) $$$

Because every term in sum is polynomial with non-negative coefficients, $$$A(x) \cdot B(x)$$$ can be found via 2 calls of NTT multiplication. $$$C(x)$$$ can be found in $$$O(n)$$$ greedily.

8. Class for modular arithmetic

It is convenient to make a class for modular arithmetic in $$$\mathbb{Z}/p\mathbb{Z}$$$ using some reduction. Below showed class for modular arithmetic with Barret reduction (it has methods for gcd, pow, inverses, roots). Class can be extended to arithmetic in every finite field. P.S. If we want get maximum performance while working with specific set of integers, we can replace Barret reduction with Montgomery, because Montgomery is faster.

Modular arithmetic class

9. Conclusion

As a result we compared various reductions for modular arithmetic and found out that Montgomery reduction has the best performance if we know set of integers in advance. Hence, NTT with this reduction can be used instead of FFT in tasks with polynomial multiplication (even for polynomials with negative coefficients). So there shown reusable realization of NTT with Montgomery reduction which shows better performance even than FFT, especially with hardware optimizations (SIMD?). Open question here what hardware optimizations makes NTT run faster on quite modern CPUs "out of the box".

Tags fft, ntt, modular arithmetic, performance, hardware, montgomery, barret


  Rev. Lang. By When Δ Comment
en3 English alexvim 2024-05-19 03:08:30 19
en2 English alexvim 2024-05-19 03:00:48 2 Tiny change: '2^k \over n} $$\nIf w' -> '2^k \over p} $$\nIf w'
ru2 Russian alexvim 2024-05-19 03:00:41 61
en1 English alexvim 2024-05-19 02:28:27 29844 Initial revision for English translation
ru1 Russian alexvim 2024-05-19 02:27:22 31455 Первая редакция (опубликовано)