Hi everyone!

There is an article on cp-algorithms about how to compute $$$n!$$$ modulo prime number $$$p$$$. Today I was wondering, how to do it if we need to compute modulo $$$p^k$$$ rather than just $$$p$$$. Turns out one can do it efficiently if $$$p$$$ is small.

#### Motivational example

**Xiaoxu Guo Contest 3 — Binomial Coefficient**. Given $$$1 \leq n,k \leq 10^{18}$$$, find $$$\binom{n}{k}$$$ modulo $$$2^{32}$$$.

There are also some projecteuler problems that require it, including with several queries of distinct $$$n$$$ and $$$k$$$.

#### Task formulation and outline of result

To clarify the task a bit, our ultimate goal here is to be able to compute e.g. binomial coefficients modulo $$$p^k$$$. Thus, what we really need is to represent $$$n! = p^t a$$$, where $$$\gcd(a, p)=1$$$, and then report $$$t$$$ and $$$a$$$ modulo $$$p^k$$$. This is sufficient to also compute $$$\binom{n}{r}$$$ modulo $$$p^k$$$.

We will show that, assuming that polynomial multiplication of size $$$n$$$ requires $$$O(n \log n)$$$ operations, and assuming that arithmetic operations modulo $$$p^k$$$ take $$$O(1)$$$, we can find $$$t$$$ and $$$a$$$ in $$$O(d^2 + dk\log k)$$$, where $$$d=\log_p n$$$. It requires $$$O(pk\log^2 k)$$$ pre-computation that takes $$$O(pk \log k)$$$ memory.

#### Algorithm when $$$p^k$$$ is also small

First, let's solve a simpler version of this problem.

**Library Judge — Binomial Coefficient**. Given $$$n$$$, $$$k \leq 10^{18}$$$ and $$$m \leq 10^6$$$, print $$$\binom{n}{k} \bmod m$$$ for $$$T \leq 2 \cdot 10^5$$$ test cases.

Let's denote $$$n!_p$$$ as the product of all numbers from $$$1$$$ to $$$n$$$ that are co-prime with $$$p$$$. That is,

In this notion, assuming $$$k=\lfloor \frac{n}{p} \rfloor$$$, we can represent $$$n!$$$ as $$$n! = p^k k! n!_p$$$, which expands recursively into

On practice, we only care about $$$i \leq \log_p n$$$, and we may pre-compute all possible values of $$$a!_p$$$ modulo $$$p^k$$$ in advance in $$$O(p^k)$$$ for all $$$a < p^k$$$, which then allows us to compute the full formula in $$$O(d)$$$, where $$$d = \log_p n$$$. To do this, we note that

and $$$(p^k-1)!_p \equiv -1 \pmod{p^k}$$$ for $$$p > 2$$$ and $$$p^k = 4$$$. For other values with $$$p=2$$$ it is $$$1$$$ instead of $$$-1$$$.

Please find the implementation in Python below for details:

**Code**

It works 5.5s on the aforementioned Library Judge problem.

#### Algorithm with polynomials

Great thanks to Endagorion for sharing this algorithm with me!

Let's look further into the formulas from the section above. There is a part contributed by numbers divisible by $$$p$$$, which can be reduced to computing $$$k!$$$, and a part contributed by numbers not divisible by $$$p$$$. Let $$$n = a_0 + a_1 p + \dots + a_d p^d$$$, then

To compute it quickly, we can define a family of polynomial $$$P_{i,a}(x)$$$ for $$$0 \leq a \leq p$$$ such that

so the value of the factorial would be represented as

and expanding $$$k!$$$ it then rewrites into

which simplifies as

Now, what would it take us to use this setup? First of all, notice that $$$P_{i,a}(x)$$$ can be computed from one another:

Note that for shorter and more consistent implementation, this recurrent formula also mostly works for $$$i=1$$$ if we put $$$P_{0, p}(x) = x+1$$$, but for $$$P_{1,p}$$$ we should go up to $$$p-2$$$ instead of $$$p-1$$$. We should also note that for $$$P_{i,a}(x)$$$, we only care for coefficients up to $$$x^{\lfloor x/i \rfloor}$$$, as the larger ones are divisible by $$$p^k$$$.

This allows to compute $$$P_{i,a}(x)$$$ in $$$O(\frac{pk \log k}{i})$$$ for all $$$a$$$ for a given $$$i$$$. Over all $$$i$$$ from $$$1$$$ to $$$k$$$ it sums up to $$$O(pk \log^2 k)$$$ time and $$$O(p k \log k)$$$ memory. Then, evaluating all the polynomials for any specific $$$a$$$ requires $$$O(d^2 + dk \log k)$$$ operations, where $$$d = \log_p n$$$.

As it requires some manipulations with polynomials, I implemented it in Python with sympy just as a proof of concept:

**Code**

#### Algorithm with $$$p$$$-adic logarithms and exponents

The algorithm above requires some heavy machinery on polynomials. I also found out an alternative approach that allows to compute $$$t$$$ and $$$a$$$ for any given $$$n$$$ divisible by $$$p$$$ in $$$O(dk^2)$$$ with $$$O(pk)$$$ precomputation, where $$$d = \log_p n$$$. It doesn't rely on polynomial operations, except for Lagrange interpolation.

Let $$$n=pt+b$$$, where $$$0 \leq b < p$$$, then we can represent its factorial as

We can further rewrite it as

Let's learn how to compute the product

as with it we can represent the factorial as

We can take a $$$p$$$-adic logarithm (please refer to this article for definitions and properties) from the product to get

We can precompute sums of $$$i^{-z}$$$ for each $$$z$$$ up to $$$k$$$ and $$$b$$$ up to $$$p-1$$$ in $$$O(pk)$$$, and we can find the sum of $$$j^z$$$ for $$$j$$$ from $$$0$$$ to $$$t$$$ in $$$O(z)$$$ via Lagrange interpolation. Therefore, with $$$O(pk)$$$ precomputation, we can compute $$$\log A(b, t)$$$ in $$$O(k^2)$$$, which then allows to find $$$n!$$$ in $$$O(dk^2)$$$, where $$$d = \log_p n$$$. I implemented it in Python, as a proof of concept (incl. fast sums over $$$i$$$ and $$$j$$$):

**Code**

**Note**: Due to inherent properties of $$$p$$$-adic logarithms and exponents, the algorithm only works with $$$p>2$$$, and it might return $$$-n!$$$ instead with $$$p=2$$$. This is because $$$\exp \log z = -z$$$ with $$$p=2$$$ when $$$z$$$ has remainder $$$3$$$ modulo $$$4$$$. It is accounted for in the code by tracking the number of integers below $$$n$$$ that have remainder $$$3$$$ modulo $$$4$$$.

I also tested the solution on the motivational problem and got AC in Python: submission.

#### Some other approaches

There is also a blog by prabowo, based apparently on another blog by Min_25, and a scholarly article by Andrew Granville. I'm not sure whether they talk about the same algorithms as described here.

While this approach is efficient in the general case, I must mention that there is a much more straightforward solution for the $$$\bmod 2^{32}$$$ case. Simply said, you can decompose the factorials into odd double-factorials and powers of two. After this idea, you can optimize the solution with autovectorization. (Or use SIMD by hand) you can read the detailed article here.