Hi CF! During this past weekend I was reading up on Montgomery transformation, which is a really interesting and useful technique to do fast modular multiplication. However, all of the explanations I could find online felt very unintuitive for me, so I decided to write my own blog on the subject. A big thanks to kostia244, nor, nskybytskyi and -is-this-fft- for reading this blog and giving me some feedback =).

### Fast modular multiplication

Let $$$P=10^9+7$$$ and let $$$a$$$ and $$$b$$$ be two numbers in $$$[0,P)$$$. Our goal is to calculate $$$a \cdot b \, \% \, P$$$ without ever actually calling $$$\% \, P$$$. This is because calling $$$\% \, P$$$ is very costly.

*If you haven't noticed that calling $$$\% \, P$$$ is really slow, then the reason you haven't noticed it is likely because the compiler automatically optimizes away the $$$\% \, P$$$ call if $$$P$$$ is known at compile time. But if $$$P$$$ is not known at compile time, then the compiler will have to call $$$\% \, P$$$, which is really really slow.*

### Montgomery reduction of $$$a \cdot b$$$

It turns out that the trick to calculate $$$a \cdot b \, \% \, P$$$ efficently is to calculate $$$a \cdot b \cdot 2^{-32} \, \% \, P$$$ efficiently. So the goal for this section will be to figure out how to calculate $$$a \cdot b \cdot 2^{-32} \, \% \, P$$$ efficently. $$$a \cdot b \cdot 2^{-32} \, \% \, P$$$ is called the Montgomery reduction of $$$a \cdot b$$$, denoted by $$$\text{m_reduce}(a \cdot b)$$$.

#### Idea (easy case)

Suppose that $$$a \cdot b$$$ just happens to be divisible by $$$2^{32}$$$. Then $$$(a \cdot b \cdot 2^{-32}) \, \% \, P = (a \cdot b) \gg 32$$$, which runs super fast!

#### Idea (general case)

Can we do something similar if $$$a \cdot b$$$ is not divisible by $$$2^{32}$$$? The answer is yes! The trick is to find some integer $$$m$$$ such that $$$(a \cdot b + m \cdot P)$$$ is divisible by $$$2^{32}$$$. Then $$$a \cdot b \cdot 2^{-32} \, \% \, P = (a \cdot b + m \cdot P) \cdot 2^{-32} \, \% \, P = (a \cdot b + m \cdot P) \gg 32$$$.

So how do we find such an integer $$$m$$$? We want $$$ (a \cdot b + m \cdot P) \,\%\, 2^{32} = 0$$$ so $$$m = (-a \cdot b \cdot P^{-1}) \,\%\, 2^{32}$$$. So if we precalculate $$$(-P^{-1}) \,\%\, 2^{32}$$$ then calculating $$$m$$$ can be done blazingly fast.

### Montgomery transformation

Since the Montgomery reduction divides $$$a \cdot b$$$ by $$$2^{32}$$$, we would like some some way of multiplying by $$$2^{32}$$$ modulo $$$P$$$. The operation $$$x \cdot 2^{32} \, \% \, P$$$ is called the Montgomery transform of $$$x$$$, denoted by $$$\text{m_transform}(x)$$$.

The trick to implement $$$\text{m_transform}$$$ efficently is to make use of the Montgomery reduction. Note that $$$\text{m_transform}(x) = \text{m_reduce}(x \cdot (2^{64} \, \% \, P))$$$, so if we precalculate $$$2^{64} \, \% \, P$$$, then $$$\text{m_transform}$$$ also runs blazingly fast.

### Montgomery multiplication

Using $$$\text{m_reduce}$$$ and $$$\text{m_transform}$$$ there are multiple different ways of calculating $$$a \cdot b \, \% \, P$$$ effectively. One way is to run $$$\text{m_transform}(\text{m_reduce}(a \cdot b))$$$. This results in two calls to $$$\text{m_reduce}$$$ per multiplication.

Another common way to do it is to always keep all integers transformed in the so called Montgomery space. If $$$a' = \text{m_transform}(a)$$$ and $$$b' = \text{m_transform}(b)$$$ then $$$\text{m_transform}(a \cdot b \, \% \, P) = \text{m_reduce}(a' \cdot b')$$$. This effectively results in one call to $$$\text{m_reduce}$$$ per multiplication, however you now have to pay to move integers in to and out of the Montgomery space.

### Example implementation

Here is a Python 3.8 implementation of Montgomery multiplication. This implementation is just meant to serve as a basic example. Implement it in C++ if you want it to run fast.

```
P = 10**9 + 7
r = 2**32
r2 = r * r % P
Pinv = pow(-P, -1, r) # (-P^-1) % r
def m_reduce(ab):
m = ab * Pinv % r
return (ab + m * P) // r
def m_transform(a):
return m_reduce(a * r2)
# Example of how to use it
a = 123456789
b = 35
a_prim = m_transform(a) # mult a by 2^32
b_prim = m_transform(b) # mult b by 2^32
prod_prim = m_reduce(a_prim * b_prim) # divide a' * b' by 2^32
prod = m_reduce(prod_prim) # divide prod' by 2^32
print('%d * %d %% %d = %d' % (a, b, P, prod)) # prints 123456789 * 35 % 1000000007 = 320987587
```

### Final remarks

One important issue that I've so far swept under the rug is that the output of `m_reduce`

is actually in $$$[0, 2 P)$$$ and not $$$[0, P)$$$. I just want end by discussing this issue. I can see two ways of handling this:

- Alternative 1. You can force $$$\text{m_reduce}(a \cdot b)$$$ to be in $$$[0, P)$$$ for $$$a$$$ and $$$b$$$ in $$$[0, P)$$$ by adding an if-stament to the output of
`m_reduce`

. This will work for any odd integer $$$P < 2^{31}$$$.

**Fixed implementation of m_reduce**

- Alternative 2. Assuming $$$P$$$ is an odd integer $$$< 2^{30}$$$ then if $$$a$$$ and $$$b$$$ $$$\in [0, 2 P)$$$ you can show that the output of $$$\text{m_reduce}(a \cdot b)$$$ is also in $$$[0,2 P)$$$. So if you are fine working with $$$[0, 2 P) \vphantom]$$$ everywhere then you don't need any if-statements. Nyaan's github has a nice C++ implementation of Montgomery multiplication using this style of implementation.

Auto comment: topic has been updated by pajenegod (previous revision, new revision, compare).I could read and understand all of what you wrote in 10 minutes. Learning something new so quickly is rare, at least for me. Your explanation is very clear and easy to follow, congratulations.

Do you happen to have a comparison of the performances of Montgomery multiplication and Barrett reduction for computing $$$a\cdot b\ \%\ P$$$?

At the very beginning (in "Idea (easy case)"), when you mention $$$a\cdot b\cdot 2^{-32}\ \%\ P$$$ for the first time, I had to stop and think about what you meant with $$$2^{-32}$$$. You may want to write explicitly that it represent the inverse of $$$2^{32}$$$ modulo $$$P$$$.

The "mod-switching" or "integer division under mod" trick is useful in more mathematical ways too. One way it's useful is computing $$$a^{-1} \mod n$$$ for composite $$$n$$$.

This works because you can just divide by $$$a$$$ normally, even in the space $$$\mod n$$$ given that the numerator is divisible by $$$a$$$

This works because

$$$(m - m \% i) \times \text{inv}[m \% i] \equiv -1 \mod m$$$

$$$\frac{m - m\%i}{i} \times \text{inv}[m\%i] = -i^{-1} \mod m$$$

$$$\lfloor\frac{m}{i}\rfloor\text{inv}[m\%i] = -i^{-1} \mod m$$$

$$$(m - \lfloor\frac{m}{i}\rfloor) \times \text{inv}[m\%i] = i^{-1} \mod m$$$

Essentially, we needed an integer $$$x = 1 \mod m$$$ and $$$x = 0 \mod i$$$. We know $$$(m - m \% i)$$$ is $$$0 \mod i$$$, and we also know it's inverse already. So then $$$(m - m \% i) \times (m - m \% i)^{-1}$$$ is $$$1 \mod m$$$ and $$$0 \mod i$$$ as we needed.

We can use extended euclidean algorithm right? Is this faster compared to ExtendedGcd?

We can. This is both faster and slower.

when it is slower?

The given algorithm precomputes first n modular inverses in $$$O(n)$$$, while gcd takes about $$$O(\log{n})$$$ iterations for a specific value.

Also see https://cp-algorithms.com/algebra/module-inverse.html#mod-inv-all-num