Hi everyone!

Today I want to describe an efficient solution of the following problem:

**Composition of Formal Power Series**. Given $$$A(x)$$$ and $$$B(x)$$$ of degree $$$n-1$$$ such that $$$B(0) = 0$$$, compute $$$A(B(x)) \pmod{x^n}$$$.

The condition $$$B(0)=0$$$ doesn't decrease the generality of the problem, as $$$A(B(x)) = P(Q(x))$$$, where $$$P(x) = A(x+B(0))$$$ and $$$Q(x) = B(x) - B(0)$$$. Hence you could replace $$$A(x)$$$ and $$$B(x)$$$ with $$$P(x)$$$ and $$$Q(x)$$$ when the condition is not satisfied.

Solutions that I'm going to describe were published in Joris van der Hoeven's article about operations on formal power series. The article also describes a lot of other common algorithms on polynomials. It is worth noting that Joris van der Hoeven and David Harvey are the inventors of the breakthrough $$$O(n \log n)$$$ integer multiplication algorithm in the multitape Turing machine model.

#### Naive methods

An obvious solution would be to evaluate $$$B(x)$$$ in $$$n^2$$$ points, then evaluate $$$A(B(x_i))$$$ in these points and interpolate $$$(A \circ B)(x)$$$. This solution would require $$$O(n^2 \log^2 n)$$$ operations with an enormous constant. Not the best idea for $$$n \leq 8000$$$.

Another possible approach is to go from the definition of the polynomial composition:

You could compute $$$B^2(x), B^3(x), \dots, B^{n-1}(x)$$$ modulo $$$x^n$$$ one by one and sum them up, which would take $$$O(n^2 \log n)$$$ operations.

#### Divide and conquer

If we write $$$A(x) = A_0(x) + x^t A_1(x)$$$, then

With this decomposition, you can reduce the computation of $$$A(B(x))$$$ to $$$2$$$ recursive calls to $$$A_0(B(x))$$$ and $$$A_1(B(x))$$$ with $$$O(n \log n)$$$ overhead for the multiplication. If you use $$$t = \lfloor \frac{n}{2} \rfloor$$$, the degree of $$$A_0$$$ will be at most $$$\lfloor \frac{n}{2} \rfloor$$$ and the degree of $$$A_1$$$ will be $$$\lceil \frac{n}{2} \rceil$$$.

Considering possible values of $$$t$$$ over all recursion levels they will be at most $$$O(\log n)$$$ different values, hence you can compute necessary powers of $$$B(x)$$$ in advance, which would require $$$O(n \log^2 n)$$$ cumulative time for the pre-processing.

The overall time complexity for this algorithm is still $$$O(n^2 \log n)$$$, but it allows significant constant-time optimizations, which is enough to pass the Library Checker problem. For this one need to carefully consider how many terms do you really need on each level of recursion.

In a more generic situation when $$$\deg A = p$$$ and $$$\deg B = q$$$, it is possible to prove the alternative complexity estimation of $$$O(pq \log^2 n)$$$, which is a bit better when $$$p$$$ or $$$q$$$ are significantly smaller than $$$n$$$.

#### Square root trick

In 1978, Brent and Kung suggested a faster algorithm to compute $$$A \circ B$$$ assuming that formal power series are taken over the division ring. Instead of decomposing $$$A(x)$$$, it is possible to decompose $$$B(x) = B_0(x) + x^t B_1(x)$$$ and use the Taylor expansion:

For this expression, it generally holds that

which allows to compute $$$A^{(1)}(B_0(x)), A^{(2)}(B_0(x)),\dots$$$ in $$$O(\frac{n^2}{t} \log n)$$$ by successively multiplying them with $$$B_0'(x)$$$ and integrating, starting with $$$A^{(\lfloor n/t \rfloor)}(B_0(x))$$$, which is computed with the divide and conquer algorithm above with the complexity $$$O(nt \log^2 n)$$$.

Optimal value of $$$t$$$ to minimize both complexities is $$$t = \sqrt{\frac{n}{\log n}}$$$, with which the final complexity is

In a more generic setting, when you need to compute $$$A(B(x))$$$ modulo $$$x^n$$$ and $$$\deg A = p$$$, $$$\deg B = q$$$, the corresponding complexities are $$$O(\frac{n^2}{t}\log n)$$$ and $$$O(pt \log^2 n)$$$, hence the optimal value of $$$t$$$ is $$$t = \sqrt{\frac{n^2}{p \log n}}$$$ and the total complexity is

I've implemented the algorithm in CP-Algorithm polynomial library, on the actual problem it works in around 1280 ms for $$$n \approx 8000$$$.

*Thanks to Benq for pointing out a mistake in my initial calculation with arbitrary $$$p$$$ and $$$q$$$!*

##### Another square root trick

Alternatively, we can represent $$$A(B(x))$$$ in the following manner:

If you compute all $$$B, B^2,\dots, B^{q-1}$$$, which is doable in $$$O(n q \log n)$$$, you can compute the right-hand side of each summand in $$$O(nq)$$$ per block for $$$\frac{n}{q}$$$ blocks, making a total of $$$O(n^2)$$$ for the summation. Then you compute $$$B^q, B^{2q},\dots$$$ in $$$O(\frac{n^2}{q} \log n)$$$ and multiply them with the values computed on the previous step. Then the answer is the sum of these values, which is computed in $$$O(\frac{n^2}{q} \log n)$$$.

If you choose $$$q = \sqrt n$$$, overall complexity would be $$$O(n^2 + n \sqrt n \log n) = O(n^2)$$$.

Although this gives a worse complexity, it is likely to perform better on practice for $$$\deg A = \deg B = n$$$ due to better constant. However, it loses advantage over the algorithm above when $$$n$$$ is significantly larger than $$$\deg A$$$ and $$$\deg B$$$ (for example, when you want to compute the full composition of small polynomials).

I've also implemented this algorithm, it runs for around 800 ms on the Library Checker problem.

Thanks to box for telling me about this simpler approach!

##### A bit more on square roots with polynomials

This algorithm is not the only application of square root techniques to polynomials.

**Large factorial**. You're given two numbers $$$n$$$ and $$$m$$$. Compute $$$n! \mod m$$$ for $$$n \leq 10^{9}$$$.

For a fixed $$$m$$$, the solution would be to precompute $$$q!, (2q)!, \dots$$$ for $$$q \approx \sqrt m$$$. It's not possible when $$$m$$$ is a part of input. Instead, you can evaluate the polynomial $$$x(x-1)\dots(x-q+1)$$$ in $$$q,2q,\dots$$$ and multiply the values to get the answer.