By adamant, history, 4 months ago, Hi everyone!

There is an article on cp-algorithms about how to compute $n!$ modulo prime number $p$. Today I was wondering, how to do it if we need to compute modulo $p^k$ rather than just $p$. Turns out one can do it efficiently if $p$ is small.

#### Motivational example

Xiaoxu Guo Contest 3 — Binomial Coefficient. Given $1 \leq n,k \leq 10^{18}$, find $\binom{n}{k}$ modulo $2^{32}$.

There are also some projecteuler problems that require it, including with several queries of distinct $n$ and $k$.

#### Task formulation and outline of result

To clarify the task a bit, our ultimate goal here is to be able to compute e.g. binomial coefficients modulo $p^k$. Thus, what we really need is to represent $n! = p^t a$, where $\gcd(a, p)=1$, and then report $t$ and $a$ modulo $p^k$. This is sufficient to also compute $\binom{n}{r}$ modulo $p^k$.

We will show that, assuming that polynomial multiplication of size $n$ requires $O(n \log n)$ operations, and assuming that arithmetic operations modulo $p^k$ take $O(1)$, we can find $t$ and $a$ in $O(d^2 + dk\log k)$, where $d=\log_p n$. It requires $O(pk\log^2 k)$ pre-computation that takes $O(pk \log k)$ memory.

#### Algorithm when $p^k$ is also small

First, let's solve a simpler version of this problem.

Library Judge — Binomial Coefficient. Given $n$, $k \leq 10^{18}$ and $m \leq 10^6$, print $\binom{n}{k} \bmod m$ for $T \leq 2 \cdot 10^5$ test cases.

Let's denote $n!_p$ as the product of all numbers from $1$ to $n$ that are co-prime with $p$. That is,

$n!_p = \prod\limits_{\substack{1 \leq j \leq n \\ \gcd(j, p)=1}}j.$

In this notion, assuming $k=\lfloor \frac{n}{p} \rfloor$, we can represent $n!$ as $n! = p^k k! n!_p$, which expands recursively into

$n! = \prod\limits_{i=0}^\infty p^{\left\lfloor \frac{n}{p^{i+1}}\right\rfloor} \left\lfloor n/p^i \right\rfloor !_p$

On practice, we only care about $i \leq \log_p n$, and we may pre-compute all possible values of $a!_p$ modulo $p^k$ in advance in $O(p^k)$ for all $a < p^k$, which then allows us to compute the full formula in $O(d)$, where $d = \log_p n$. To do this, we note that

$a!_p \equiv (p^k-1)!_p^{\lfloor a/p^k \rfloor} (a \bmod p^k)!_p \pmod{p^k},$

and $(p^k-1)!_p \equiv -1 \pmod{p^k}$ for $p > 2$ and $p^k = 4$. For other values with $p=2$ it is $1$ instead of $-1$.

Please find the implementation in Python below for details:

Code

It works 5.5s on the aforementioned Library Judge problem.

#### Algorithm with polynomials

Great thanks to Endagorion for sharing this algorithm with me!

Let's look further into the formulas from the section above. There is a part contributed by numbers divisible by $p$, which can be reduced to computing $k!$, and a part contributed by numbers not divisible by $p$. Let $n = a_0 + a_1 p + \dots + a_d p^d$, then

$\prod\limits_{\substack{1 \leq j \leq n \\ \gcd(j, p)=1}}j = \prod\limits_{i=0}^d \prod\limits_{\substack{1 \leq j \leq a_i p^i \\\gcd(j, p)=1}} \left(\left\lfloor\frac{n}{p^{i+1}} \right\rfloor p^{i+1}+j\right).$

To compute it quickly, we can define a family of polynomial $P_{i,a}(x)$ for $0 \leq a \leq p$ such that

$P_{i, a}(x) = \prod\limits_{\substack{1 \leq j\leq a p^{i-1} \\ \gcd(j, p)=1}}(xp^{i}+j),$

so the value of the factorial would be represented as

$n! = p^k k! \prod\limits_{i=0}^d P_{i+1,a_i}\left(\left\lfloor\frac{n}{p^{i+1}} \right\rfloor\right),$

and expanding $k!$ it then rewrites into

$n! = \prod\limits_{i=0}^{d} p^{\left\lfloor \frac{n}{p^{i+1}} \right\rfloor} \prod\limits_{j=0}^{d-i} P_{j+1,a_{i+j}}\left(\left\lfloor\frac{n}{p^{i+j+1}} \right\rfloor\right),$

which simplifies as

$\boxed{n! = \prod\limits_{i=0}^{d} p^{\left\lfloor \frac{n}{p^{i+1}} \right\rfloor} \prod\limits_{j=0}^{i} P_{j+1,a_{i}}\left(\left\lfloor\frac{n}{p^{i+1}} \right\rfloor\right)}$

Now, what would it take us to use this setup? First of all, notice that $P_{i,a}(x)$ can be computed from one another:

$P_{i,a}(x) = \prod\limits_{\substack{1 \leq j\leq a p^{i-1} \\ \gcd(j, p)=1}}(xp^{i}+j) = \prod\limits_{t=0}^{a-1}\prod\limits_{\substack{1 \leq j\leq p^{i-1} \\ \gcd(j, p)=1}}(xp^{i}+tp^{i-1}+j)=\prod\limits_{t=0}^{a-1} P_{i-1,p}(px+t).$

Note that for shorter and more consistent implementation, this recurrent formula also mostly works for $i=1$ if we put $P_{0, p}(x) = x+1$, but for $P_{1,p}$ we should go up to $p-2$ instead of $p-1$. We should also note that for $P_{i,a}(x)$, we only care for coefficients up to $x^{\lfloor x/i \rfloor}$, as the larger ones are divisible by $p^k$.

This allows to compute $P_{i,a}(x)$ in $O(\frac{pk \log k}{i})$ for all $a$ for a given $i$. Over all $i$ from $1$ to $k$ it sums up to $O(pk \log^2 k)$ time and $O(p k \log k)$ memory. Then, evaluating all the polynomials for any specific $a$ requires $O(d^2 + dk \log k)$ operations, where $d = \log_p n$.

As it requires some manipulations with polynomials, I implemented it in Python with sympy just as a proof of concept:

Code

#### Algorithm with $p$-adic logarithms and exponents

The algorithm above requires some heavy machinery on polynomials. I also found out an alternative approach that allows to compute $t$ and $a$ for any given $n$ divisible by $p$ in $O(dk^2)$ with $O(pk)$ precomputation, where $d = \log_p n$. It doesn't rely on polynomial operations, except for Lagrange interpolation.

Let $n=pt+b$, where $0 \leq b < p$, then we can represent its factorial as

$n! = p^t t! \prod\limits_{i=1}^{b} \prod\limits_{j=0}^{t} \left(pj+i\right)\prod\limits_{i=b+1}^{p-1} \prod\limits_{j=0}^{t-1} \left(pj+i\right).$

We can further rewrite it as

$n! = p^t t! (p-1)!^t b! \prod\limits_{i=1}^{b} \prod\limits_{j=0}^{t} \left(1+\frac{j}{i}p\right)\prod\limits_{i=b+1}^{p-1} \prod\limits_{j=0}^{t-1} \left(1+\frac{j}{i}p\right)$

Let's learn how to compute the product

$A(b, t) = \prod\limits_{i=1}^b \prod\limits_{j=0}^t \left(1+\frac{j}{i}p\right),$

as with it we can represent the factorial as

$n! = p^t t! (p-1)!^t b! A(b, t)\frac{A(p-1,t-1)}{A(b,t-1)}.$

We can take a $p$-adic logarithm (please refer to this article for definitions and properties) from the product to get

$\boxed{\log A(b, t) = \sum\limits_{i=1}^{b} \sum\limits_{j=0}^{t} \log\left(1+\frac{j}{i}p\right)=\sum\limits_{z=1}^\infty \frac{(-1)^{z+1}p^z}{z} \sum\limits_{i=1}^{b} i^{-z} \sum\limits_{j=0}^{t} j^z}$

We can precompute sums of $i^{-z}$ for each $z$ up to $k$ and $b$ up to $p-1$ in $O(pk)$, and we can find the sum of $j^z$ for $j$ from $0$ to $t$ in $O(z)$ via Lagrange interpolation. Therefore, with $O(pk)$ precomputation, we can compute $\log A(b, t)$ in $O(k^2)$, which then allows to find $n!$ in $O(dk^2)$, where $d = \log_p n$. I implemented it in Python, as a proof of concept (incl. fast sums over $i$ and $j$):

Code

Note: Due to inherent properties of $p$-adic logarithms and exponents, the algorithm only works with $p>2$, and it might return $-n!$ instead with $p=2$. This is because $\exp \log z = -z$ with $p=2$ when $z$ has remainder $3$ modulo $4$. It is accounted for in the code by tracking the number of integers below $n$ that have remainder $3$ modulo $4$.

I also tested the solution on the motivational problem and got AC in Python: submission.

#### Some other approaches

There is also a blog by prabowo, based apparently on another blog by Min_25, and a scholarly article by Andrew Granville. I'm not sure whether they talk about the same algorithms as described here.  Comments (1)
 » While this approach is efficient in the general case, I must mention that there is a much more straightforward solution for the $\bmod 2^{32}$ case. Simply said, you can decompose the factorials into odd double-factorials and powers of two. After this idea, you can optimize the solution with autovectorization. (Or use SIMD by hand) you can read the detailed article here.