A modulo multiplication method that is 2x faster than compiler implementation

#	User	Rating
1	tourist	3843
2	jiangly	3705
3	Benq	3628
4	orzdevinwang	3571
5	Geothermal	3569
5	cnnfls_csy	3569
7	jqdai0815	3530
8	ecnerwala	3499
9	gyh20	3447
10	Rebelz	3409

#	User	Contrib.
1	maomao90	171
2	awoo	164
3	adamant	162
4	TheScrasse	159
5	nor	153
5	maroonrk	153
7	-is-this-fft-	152
8	Petr	146
9	orz	145
10	pajenegod	144

Given $$$k,m$$$ and $$$n$$$ numbers $$$a_i$$$, compute each $$$a_i\times k\bmod m$$$

I came up with a method that is almost twice as fast as the compiler implementation (when $$$m$$$ is a constant), which can effectively speed up the NTT and some DPs.

First of all

$$$ a_i\times k\bmod m=\{a_i\times \frac km\}\times m\tag1 $$$

where $$$\{x\}=x-\lfloor x\rfloor$$$ is the fractional part function. The principle of equation $$$(1)$$$ is that $$$a_i\times k\bmod m$$$ is only related to the fractional part of $$$\frac{a_i\times k}m$$$.

Let $$$\frac km\approx\frac p{2^{64}}$$$, $$$p$$$ is either equal to $$$\lfloor\frac km\times2^{64}\rfloor$$$ or $$$\lceil\frac km\times2^{64}\rceil$$$, which one to choose, we will decide later, then

$$$ \begin{aligned} a_i\times k\bmod m&\approx\{a_i\times \frac p{2^{64}}\}\times m\\ &=\frac{a_ip\bmod 2^{64}}{2^{64}}\times m\\ &=\frac{a_ip\bmod 2^{64}\times m}{2^{64}} \end{aligned} \tag2 $$$

There are two place we need to choose whether round up or down:

$$$p=\lfloor\frac km\times2^{64}\rfloor$$$ or $$$p=\lceil\frac km\times2^{64}\rceil$$$.
The final formula in (2) isn't always an integer, so we need to consider whether we round it up or down.

We choose $$$p=\lceil\frac km\times2^{64}\rceil$$$, which will slightly enlarge the answer, while the final formula in (2) rounded down, which will slightly lessen the answer. We'll prove that this set of choice can give us correct answer just if $$$a_i\le\frac{2^{64}}m$$$.

Proof

const int P = 998244353;

void calc(int n, int k, int a[]) {
    unsigned long long p = ((unsigned __int128)k << 64) + P - 1) / P;
    for(int i = 0; i < n; i++)
    	a[i] = (unsigned)a[i] * p * (unsigned __int128)P >> 64;
}

A few notes.

The code uses unsigned __int128 so it can only be used in a 64-bit architecture.
(unsigned)a[i] * p will automatically be modulo $$$2^{64}$$$.
* (unsigned __int128)P >> 64 is 64-bit multiplication retaining only the high 64 bits (in the rdx register), the same speed as multiplying two unsigned long longs together.
Converting a[i] to unsigned is because int to unsigned and then to unsigned long long is faster than going directly to unsigned long long, which requires sign extension.

Speed test.

code

#include <bits/stdc++.h>

using namespace std;

const int N = 5e4, P = 998244353;

int a[N];

void ThroughputTest() {
    int checkSum1 = 0, checkSum2 = 0, checkSum3 = 0;

    auto start = chrono::steady_clock::now();
    for(int i = 0; i < N; i += 2)
        for(int j = 0; j < N; j++) {
            checkSum1 ^= (int64_t)a[i] * a[j] % P;
            checkSum1 ^= (int64_t)a[i + 1] * a[j] % P;
        }
    auto end = std::chrono::steady_clock::now();
    cout << "Compiler's signed modulo:   " << (end - start).count() * 1e-6 << " ms" << endl;

    start = chrono::steady_clock::now();
    for(int i = 0; i < N; i += 2)
        for(int j = 0; j < N; j++) {
            checkSum2 ^= (uint64_t)(uint32_t)a[i] * (uint32_t)a[j] % P;
            checkSum2 ^= (uint64_t)(uint32_t)a[i + 1] * (uint32_t)a[j] % P;
        }
    end = std::chrono::steady_clock::now();
    cout << "Compiler's unsigned modulo: " << (end - start).count() * 1e-6 << " ms" << endl;

    start = chrono::steady_clock::now();
    for(int i = 0; i < N; i += 2) {
        uint64_t x = (((__uint128_t)a[i] << 64) + P - 1) / P;
        uint64_t y = (((__uint128_t)a[i + 1] << 64) + P - 1) / P;
        for(int j = 0; j < N; j++) {
            checkSum3 ^= (uint32_t)a[j] * x * (__uint128_t)P >> 64;
            checkSum3 ^= (uint32_t)a[j] * y * (__uint128_t)P >> 64;
        }
    }
    end = std::chrono::steady_clock::now();
    cout << "My modulo:                  " << (end - start).count() * 1e-6 << " ms" << endl;

    assert(checkSum1 == checkSum2 && checkSum2 == checkSum3);
}
void LatencyTest() {
    int checkSum1 = 0, checkSum2 = 0, checkSum3 = 0;

    auto start = chrono::steady_clock::now();
    for(int i = 0; i < N; i += 2)
        for(int j = 0; j < N / 2; j++) {
            checkSum1 = (int64_t)a[i] * (a[j] ^ checkSum1) % P;
            checkSum1 = (int64_t)a[i + 1] * (a[j] ^ checkSum1) % P;
        }
    auto end = std::chrono::steady_clock::now();
    cout << "Compiler's signed modulo:   " << (end - start).count() * 1e-6 << " ms" << endl;

    start = chrono::steady_clock::now();
    for(int i = 0; i < N; i += 2)
        for(int j = 0; j < N / 2; j++) {
            checkSum2 = (uint64_t)(uint32_t)a[i] * (uint32_t)(a[j] ^ checkSum2) % P;
            checkSum2 = (uint64_t)(uint32_t)a[i + 1] * (uint32_t)(a[j] ^ checkSum2) % P;
        }
    end = std::chrono::steady_clock::now();
    cout << "Compiler's unsigned modulo: " << (end - start).count() * 1e-6 << " ms" << endl;

    start = chrono::steady_clock::now();
    for(int i = 0; i < N; i += 2) {
        uint64_t x = (((__uint128_t)a[i] << 64) + P - 1) / P;
        uint64_t y = (((__uint128_t)a[i + 1] << 64) + P - 1) / P;
        for(int j = 0; j < N / 2; j++) {
            checkSum3 = (uint32_t)(a[j] ^ checkSum3) * x * (__uint128_t)P >> 64;
            checkSum3 = (uint32_t)(a[j] ^ checkSum3) * y * (__uint128_t)P >> 64;
        }
    }
    end = std::chrono::steady_clock::now();
    cout << "My modulo:                  " << (end - start).count() * 1e-6 << " ms" << endl;

    assert(checkSum1 == checkSum2 && checkSum2 == checkSum3);
}
int main() {
    mt19937 gen;
    for(int i = 0; i < N; i++) a[i] = gen() % P;
    cout << "Throughput test (50000 * 50000):" << endl;
    ThroughputTest();
    cout << endl;
    cout << "Latency test (50000 * 25000):" << endl;
    LatencyTest();
}

It contains two parts.

The first part is the reciprocal throughput, the time taken by the CPU to be highly parallel (modern CPUs can be parallelized at instruction level on a single core), containing a total of $$$50000\times50000$$$ modulo multiplications.
The second part is the Latency, which is the time taken for each modulo multiplication to be performed sequentially without parallelism, containing a total of $$$50000\times25000$$$ modulo multiplications.

Possible output:

Throughput test(50000 * 50000):
Compiler's signed modulo:   1954.83 ms
Compiler's unsigned modulo: 1746.73 ms
My modulo:                  1160.47 ms

Latency test(50000 * 25000):
Compiler's signed modulo:   4329.33 ms
Compiler's unsigned modulo: 3945.29 ms
My modulo:                  2397.97 ms

By the way, a few general facts.

Constant modulo multiplication is almost 4 times faster in parallel than serial (as is modulo multiplication of my invention).
int to unsigned then to long long is faster than long long, but negative numbers will be wrong.
unsigned long long modulo constants is faster than long long.

Comparison with other methods

Comparison with Barrett reduction and Montgomery multiplication:

The purpose of my method is to compute $$$a\times b\bmod m$$$ for fixed $$$m$$$ and $$$b$$$, while Barrett reduction and Montgomery multiplication compute $$$a\times b\bmod m$$$ for fixed m. But my method is faster than the other two methods.
The derivation of my method is similar to Barrett reduction. So They both work when $$$m < 2^{32}$$$, while Montgomery multiplication works when $$$m < 2^{64}$$$ and $$$m$$$ is an odd number.

Extensions

It is also possible to compute $$$(a_1b_1+a_2b_2+\cdots+a_nb_n)\bmod m$$$, but $$$\sum a_i$$$ cannot exceed $$$\frac{2^{64}}m$$$.

Let $$$p_i=\lceil\frac{b_i}m\times2^{64}\rceil$$$.

$$$ (\sum a_ib_i)\bmod m=\lfloor\frac{(\sum a_ip_i)\bmod 2^{64}\times m}{2^{64}}\rfloor $$$

Comments (6)

Write comment?

adamant

17 months ago, # |

+92

Hey, thanks for your blog post! May I suggest that you place your proof and code in the blog post, perhaps under the spoiler tag, rather than making them separate blog posts? It is more convenient for the reader to see them without moving to another page, and also this way it consumes less space in "Recent actions" section.

Have you compared your idea with other existing ones, such as Montgomery multiplication or Barrett reduction? pajenegod has a very comprehensive article about them here.

→ Reply

platelet

17 months ago, # ^ |

+41

Thanks for your suggestion. I've changed the external links to spoiler tags and compared my method to the others.

sslotin

← Rev. 3 →

+32

Daniel Lemire came up with this method about four years ago, and I describe it in my book. I heard there were some efforts to replace the algorithms used in GCC and LLVM with it, but I don't know if they went anywhere.

You can also use it for faster division-by-constant and divisibility checks, but it has the drawback that it needs 4x the machine word width for intermediates (Barrett reduction-like methods typically need 2x). It is also a less efficient than Montgomery multiplication if you need to execute a chain of modular operations and not just one modular reduction. Many optimized number theory algorithms, including NTT, largely rely on (SIMDized versions of) Montgomery multiplication.

When k=1 my method is the same as Lemire Reduction, but there are some differences:

The purpose of my method is to compute $$$a_i\times k \bmod m$$$ multiple times for fixed $$$m,k$$$, while Lemire Reduction is modulo a single number.
When computing $$$a_i\times k \bmod m$$$, my method works for $$$m<2^{32}$$$ and with two 64-bit multiplications. If you calculate $$$a_i\times k$$$ first and then use Lemire Reduction, it only works for $$$m<\sqrt[3]{2^{64}}$$$ with three 64-bit multiplications.

Maybe my method is an improvement of Lemire Reduction, which works for $$$m<2^{32}$$$ and is useful in competitive programming (common moduli are $$$10^9+7$$$ and $$$998244353$$$)

+31

Auto comment: topic has been updated by platelet (previous revision, new revision, compare).

fast_photon

6 months ago, # |

wow it is so strong. stO \platelet/ Orz

platelet's blog

Speed test.

Comparison with other methods

Extensions