A modulo multiplication method that is 2x faster than compiler implementation

#	User	Rating
1	tourist	3880
2	jiangly	3669
3	ecnerwala	3654
4	Benq	3627
5	orzdevinwang	3612
6	Geothermal	3569
6	cnnfls_csy	3569
8	jqdai0815	3532
9	Radewoosh	3522
10	gyh20	3447

#	User	Contrib.
1	awoo	161
2	maomao90	160
3	adamant	156
4	maroonrk	153
5	-is-this-fft-	148
5	atcoder_official	148
5	SecondThread	148
8	Petr	147
9	nor	144
9	TheScrasse	144

> Given $k,m$ and $n$ numbers $a_i$, compute each $a_i\times k\bmod m$↵
↵
I came up with a algorithm that is almost twice as fast as the compiler implementation (when $m$ is a constant), which can effectively speed up the NTT and some DPs.↵
↵
First of all↵
↵
$$↵
a_i\times k\bmod m=\{a_i\times \frac km\}\times m↵
$$↵
↵
where $\\{x\\}=x-\lfloor x\rfloor$ is the fractional part function. The principle of the above equation is that $a_i\times k\bmod m$ is only related to the fractional part of $\frac{a_i\times k}m$.↵
↵
By analogy with barrett reduction, let $\frac km\approx\frac p{2^{64}}$, then↵
↵
$$↵
\begin{aligned}↵
a_i\times k\bmod m&\approx\{a_i\times \frac p{2^{64}}\}\times m\\↵
&=\frac{a_ip\bmod 2^{64}}{2^{64}}\times m\\↵
&=\frac{a_ip\bmod 2^{64}\times m}{2^{64}}↵
\end{aligned}↵
$$↵
↵
Here are two multiple choice questions.↵
↵
- $\frac p{2^{64}}\le\frac km$ or $\frac p{2^{64}}\ge\frac km$.↵
- The last formula does not necessarily work out as an integer, to be rounded down or up.↵
↵
We choose $\frac p{2^{64}}\ge\frac km$, with the final formula rounded down. The former will make the result biased large, the latter will make the result biased small, a perceptual understanding so that the result will be more accurate.↵
↵
> **Theorem**: Let $p=\lceil\frac km\times2^{64}\rceil$, when $a_i\le \frac{2^{64}}m$, the calculation result of the lower rounding is exact.↵
>↵
> **Proof**: [written outside](https://codeforces.com/blog/entry/111564)↵
↵
```cpp↵
const int P = 998244353;↵
↵
void calc(int n, int k, int a[]) {↵
    unsigned long long p = ((unsigned __int128)k << 64) + P - 1) / P;↵
    for(int i = 0; i < n; i++)↵
     a[i] = (unsigned)a[i] * p * (unsigned __int128)P >> 64;↵
}↵
```↵
↵
A few notes.↵
↵
- The code uses `unsigned __int128` so it can only be used in a 64-bit architecture.↵
- `(unsigned)a[i] * p` will automatically be modulo 2^64.↵
- `* (unsigned __int128)P >> 64` is 64-bit multiplication retaining only the high 64 bits (in the rdx register), the same speed as multiplying two `unsigned long long`s together.↵
- Converting `a[i]` to `unsigned` is because `int` to `unsigned` and then to `unsigned long long` is faster than going directly to `unsigned long long`, which requires **sign extension**.↵
↵
### Speed test.↵
↵
[code](https://codeforces.com/blog/entry/111565),method that is almost twice as fast as the compiler implementation (when $m$ is a constant), which can effectively speed up the NTT and some DPs.↵
↵
First of all↵
↵
$$↵
a_i\times k\bmod m=\{a_i\times \frac km\}\times m\tag1↵
$$↵
↵
where $\\{x\\}=x-\lfloor x\rfloor$ is the fractional part function. The principle of equation $(1)$ is that $a_i\times k\bmod m$ is only related to the fractional part of $\frac{a_i\times k}m$.↵
↵
Let $\frac km\approx\frac p{2^{64}}$, $p$ is either equal to $\lfloor\frac km\times2^{64}\rfloor$ or $\lceil\frac km\times2^{64}\rceil$, which one to choose, we will decide later, then↵
↵
$$↵
\begin{aligned}↵
a_i\times k\bmod m&\approx\{a_i\times \frac p{2^{64}}\}\times m\\↵
&=\frac{a_ip\bmod 2^{64}}{2^{64}}\times m\\↵
&=\frac{a_ip\bmod 2^{64}\times m}{2^{64}}↵
\end{aligned}↵
\tag2↵
$$↵
↵
There are two place we need to choose whether round up or down:↵
↵
- $p=\lfloor\frac km\times2^{64}\rfloor$ or $p=\lceil\frac km\times2^{64}\rceil$.↵
- The final formula in (2) isn't always an integer, so we need to consider whether we round it up or down.↵
↵
We choose $p=\lceil\frac km\times2^{64}\rceil$, which will slightly enlarge the answer, while the final formula in (2) rounded down, which will slightly lessen the answer. We'll prove that this set of choice can give us correct answer just if $a_i\le\frac{2^{64}}m$.↵
↵
<spoiler summary="Proof">↵
Let $\frac p{2^{64}}=\frac km+\epsilon$, where $\epsilon\in[0,\frac1{2^{64}})$.↵
$$↵
\begin{aligned}↵
\lfloor\{a_i\times \frac p{2^{64}}\}\times m\rfloor&=\lfloor\{a_i\times (\frac km+\epsilon)\}\times m\rfloor\\↵
&=\lfloor\{a_i\times \frac km+a_i\epsilon\}\times m\rfloor↵
\end{aligned}↵
$$↵
↵
Here if $\lfloor a_i\times\frac km+a_i\epsilon\rfloor>\lfloor a_i\times \frac km\rfloor$ must be wrong, $a_i\times \frac km$ is at least $\frac1m$ away from $\lfloor a_i\times \frac km\rfloor+1$, so as long as $a_i\epsilon<\frac1m$ it's OK, let's continue the derivation.↵
↵
$$↵
\begin{aligned}↵
&=\lfloor\{a_i\times \frac km+a_i\epsilon\}\times m\rfloor\\↵
&=\big\lfloor\left(\{a_i\times \frac km\}+a_i\epsilon\right)m\big\rfloor\\↵
&=\lfloor\{a_i\times \frac km\}\times m+a_im\epsilon\rfloor↵
\end{aligned}↵
$$↵
↵
Since $\\{a_i\times\frac km\\}\times m$ is an integer, the result is exact as long as $a_im\epsilon<1$.↵
↵
The $a_i\epsilon<\frac1m$ and $a_im\epsilon<1$ are the same, and then the condition can be rewritten as $a_i\le\frac{2^{64}}m$ according to $\epsilon<\frac1{2^{64}}$.↵
</spoiler>↵
↵
```cpp↵
const int P = 998244353;↵
↵
void calc(int n, int k, int a[]) {↵
    unsigned long long p = ((unsigned __int128)k << 64) + P - 1) / P;↵
    for(int i = 0; i < n; i++)↵
     a[i] = (unsigned)a[i] * p * (unsigned __int128)P >> 64;↵
}↵
```↵
↵
A few notes.↵
↵
- The code uses `unsigned __int128` so it can only be used in a 64-bit architecture.↵
- `(unsigned)a[i] * p` will automatically be modulo $2^{64}$.↵
- `* (unsigned __int128)P >> 64` is 64-bit multiplication retaining only the high 64 bits (in the rdx register), the same speed as multiplying two `unsigned long long`s together.↵
- Converting `a[i]` to `unsigned` is because `int` to `unsigned` and then to `unsigned long long` is faster than going directly to `unsigned long long`, which requires **sign extension**.↵
↵
### Speed test.↵
↵
↵
<spoiler summary="code">↵
```cpp↵
#include <bits/stdc++.h>↵
↵
using namespace std;↵
↵
const int N = 5e4, P = 998244353;↵
↵
int a[N];↵
↵
void ThroughputTest() {↵
    int checkSum1 = 0, checkSum2 = 0, checkSum3 = 0;↵
↵
    auto start = chrono::steady_clock::now();↵
    for(int i = 0; i < N; i += 2)↵
        for(int j = 0; j < N; j++) {↵
            checkSum1 ^= (int64_t)a[i] * a[j] % P;↵
            checkSum1 ^= (int64_t)a[i + 1] * a[j] % P;↵
        }↵
    auto end = std::chrono::steady_clock::now();↵
    cout << "Compiler's signed modulo:   " << (end - start).count() * 1e-6 << " ms" << endl;↵
↵
    start = chrono::steady_clock::now();↵
    for(int i = 0; i < N; i += 2)↵
        for(int j = 0; j < N; j++) {↵
            checkSum2 ^= (uint64_t)(uint32_t)a[i] * (uint32_t)a[j] % P;↵
            checkSum2 ^= (uint64_t)(uint32_t)a[i + 1] * (uint32_t)a[j] % P;↵
        }↵
    end = std::chrono::steady_clock::now();↵
    cout << "Compiler's unsigned modulo: " << (end - start).count() * 1e-6 << " ms" << endl;↵
↵
    start = chrono::steady_clock::now();↵
    for(int i = 0; i < N; i += 2) {↵
        uint64_t x = (((__uint128_t)a[i] << 64) + P - 1) / P;↵
        uint64_t y = (((__uint128_t)a[i + 1] << 64) + P - 1) / P;↵
        for(int j = 0; j < N; j++) {↵
            checkSum3 ^= (uint32_t)a[j] * x * (__uint128_t)P >> 64;↵
            checkSum3 ^= (uint32_t)a[j] * y * (__uint128_t)P >> 64;↵
        }↵
    }↵
    end = std::chrono::steady_clock::now();↵
    cout << "My modulo:                  " << (end - start).count() * 1e-6 << " ms" << endl;↵
↵
    assert(checkSum1 == checkSum2 && checkSum2 == checkSum3);↵
}↵
void LatencyTest() {↵
    int checkSum1 = 0, checkSum2 = 0, checkSum3 = 0;↵
↵
    auto start = chrono::steady_clock::now();↵
    for(int i = 0; i < N; i += 2)↵
        for(int j = 0; j < N / 2; j++) {↵
            checkSum1 = (int64_t)a[i] * (a[j] ^ checkSum1) % P;↵
            checkSum1 = (int64_t)a[i + 1] * (a[j] ^ checkSum1) % P;↵
        }↵
    auto end = std::chrono::steady_clock::now();↵
    cout << "Compiler's signed modulo:   " << (end - start).count() * 1e-6 << " ms" << endl;↵
↵
    start = chrono::steady_clock::now();↵
    for(int i = 0; i < N; i += 2)↵
        for(int j = 0; j < N / 2; j++) {↵
            checkSum2 = (uint64_t)(uint32_t)a[i] * (uint32_t)(a[j] ^ checkSum2) % P;↵
            checkSum2 = (uint64_t)(uint32_t)a[i + 1] * (uint32_t)(a[j] ^ checkSum2) % P;↵
        }↵
    end = std::chrono::steady_clock::now();↵
    cout << "Compiler's unsigned modulo: " << (end - start).count() * 1e-6 << " ms" << endl;↵
↵
    start = chrono::steady_clock::now();↵
    for(int i = 0; i < N; i += 2) {↵
        uint64_t x = (((__uint128_t)a[i] << 64) + P - 1) / P;↵
        uint64_t y = (((__uint128_t)a[i + 1] << 64) + P - 1) / P;↵
        for(int j = 0; j < N / 2; j++) {↵
            checkSum3 = (uint32_t)(a[j] ^ checkSum3) * x * (__uint128_t)P >> 64;↵
            checkSum3 = (uint32_t)(a[j] ^ checkSum3) * y * (__uint128_t)P >> 64;↵
        }↵
    }↵
    end = std::chrono::steady_clock::now();↵
    cout << "My modulo:                  " << (end - start).count() * 1e-6 << " ms" << endl;↵
↵
    assert(checkSum1 == checkSum2 && checkSum2 == checkSum3);↵
}↵
int main() {↵
    mt19937 gen;↵
    for(int i = 0; i < N; i++) a[i] = gen() % P;↵
    cout << "Throughput test (50000 * 50000):" << endl;↵
    ThroughputTest();↵
    cout << endl;↵
    cout << "Latency test (50000 * 25000):" << endl;↵
    LatencyTest();↵
}↵
```↵
</spoiler>↵
↵
It contains two parts.↵
↵
- The first part is the reciprocal throughput, the time taken by the CPU to be highly parallel (modern CPUs can be parallelized at instruction level on a single core), containing a total of $50000\times50000$ modulo multiplications.↵
- The second part is the Latency, which is the time taken for each modulo multiplication to be performed sequentially without parallelism, containing a total of $50000\times25000$ modulo multiplications.↵
↵
Possible output:↵
↵
```plain↵
Throughput test(50000 * 50000):↵
Compiler's signed modulo:   1954.83 ms↵
Compiler's unsigned modulo: 1746.73 ms↵
My modulo:                  1160.47 ms↵
↵
Latency test(50000 * 25000):↵
Compiler's signed modulo:   4329.33 ms↵
Compiler's unsigned modulo: 3945.29 ms↵
My modulo:                  2397.97 ms↵
```↵
↵
By the way, a few general facts.↵
↵
- Constant modulo multiplication is almost 4 times faster in parallel than serial (as is modulo multiplication of my invention).↵
- `int` to `unsigned` then to `long long` is faster than `long long`, but negative numbers will be wrong.↵
- `unsigned long long` modulo **constants** is faster than `long long`.↵
~~- Constant modulo multiplication is almost 4 times faster in parallel than serial (as is modulo multiplication of my invention)~~↵
### Comparison with other methods↵
↵
Comparison with Barrett reduction and Montgomery multiplication:↵
↵
- The purpose of my method is to compute $a\times b\bmod m$ for fixed $m$ and $b$, while Barrett reduction and Montgomery multiplication compute $a\times b\bmod m$ for fixed m. But my method is faster than the other two methods.↵
↵
- The derivation of my method is similar to Barrett reduction, so it has something in common: it fits modulo $2^{32}$ and requires 64-bit multiplication of integers. Montgomery multiplication is suitable for odd numbers up to $2^{64}$.↵
↵
### Extensions↵
↵
It is also possible to compute $(a_1b_1+a_2b_2+\cdots+a_nb_n)\bmod m$, but $\sum a_i$ cannot exceed $\frac{2^{64}}m$.↵
↵
Let $p_i=\lceil\frac{b_i}m\times2^{64}\rceil$.↵
↵
$$↵
(\sum a_ib_i)\bmod m=\lfloor\frac{(\sum a_ip_i)\bmod 2^{64}\times m}{2^{64}}\rfloor↵
$$

Rev.	By	When	Δ	Comment
en4	platelet	2023-01-18 16:20:20	158
en3	platelet	2023-01-18 04:55:17	9829
en2	platelet	2023-01-17 18:21:08	65
en1	platelet	2023-01-17 18:14:25	3760	Initial revision (published)

History