I wanted to write down a very short blog on this technique, since I've encountered it a few times but have not seen it formally discussed anywhere. *Theoretically* this technique shaves off a log factor from the complexity, but in practice the technique often only yields a constant factor speed-up, so it's more of a micro-optimization than anything. I think it's interesting enough to warrant discussion anyway, but that's probably the reason why this technique is niche.

Let's assume that all lowercase variables here are integers, all uppercase variables are matrices, and that computations are done modulo some large prime. We use $$$I$$$ to refer to the identity matrix and $$$0$$$ to refer to the zero matrix.

**Warmup Problem:** Given a square matrix $$$A$$$ and a positive integer $$$n$$$, evaluate $$$A^n$$$.

Matrices are the classic example of the fast exponentiation algorithm working not just with standard multiplication of integers, but with any object equipped with an associative binary operation.

This is usually one of the first divide-and-conquer algorithms people are taught. One way to approach this is to note that, if $$$n$$$ is even, then

$$$$$$A^n = A^{\frac{n}{2}} \times A^{\frac{n}{2}}.$$$$$$
So, we only need to compute $$$A^{\frac{n}{2}}$$$ once, save that value, and then square it. If $$$n$$$ is odd, then we can compute it as $$$A^n = A^{n-1}\times A$$$, since $$$n-1$$$ will be even.

Here is some pseudocode for fast exponentiation.

```
def fast_pow(A, n):
    if n == 0:
        return I
    if n & 1:
        return A * fast_pow(A, n - 1)
    else:
        X = fast_pow(A, n // 2)
        return X * X
```

This algorithm needs only $$$\mathcal{O}(\lg n)$$$ matrix multiplications.

**Problem:** Given a square matrix $$$A$$$ and a positive integer $$$n$$$, evaluate $$$I+A+A^2+\dots+A^{n-1}$$$.

We can't use the geometric series formula $$$(A^n - I)(A - I)^{-1}$$$ here, because $$$A-I$$$ is not necessarily invertible.

However, we can also use divide-and-conquer to solve this problem. If $$$n$$$ is even, we can split the sum into two halves---the "small" half, with terms whose exponents are $$$< \frac{n}{2}$$$, and the "big" half, with terms whose exponents are $$$\geq \frac{n}{2}$$$. Then, we can factor out $$$A^{\frac{n}{2}}$$$ from the "big" terms, and notice a similarity...

To simplify things, let $$$f(n) = I+A+A^2+\dots+A^{n-1}$$$.

$$$$$$\begin{align} f(n) &= f\left(\frac{n}{2}\right) + A^{\frac{n}{2}}f\left(\frac{n}{2}\right) \\ f(n) &= \left(I+A^{\frac{n}{2}}\right)f\left(\frac{n}{2}\right) \end{align}$$$$$$

If $$$n$$$ is odd, then we can compute $$$f(n) = I+A(I+A+\dots+A^{n-2}) = I+A\times f(n-1)$$$, since $$$n-1$$$ will be even.

Here is some pseudocode for the fast summing.

```
def fast_sum(A, n):
    if n == 0:
        return 0
    if n & 1:
        return I + A * fast_sum(A, n - 1)
    else:
        return (I + fast_pow(A, n // 2)) * fast_sum(A, n // 2)
```

We just learned that we can compute $$$A^{\frac{n}{2}}$$$ in $$$\mathcal{O}(\lg n)$$$ matrix multiplications, so by the Master Theorem (or a similar analysis) we conclude that the above algorithm takes $$$\mathcal{O}(\lg^2 n)$$$ matrix multiplications.
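Here is a runnable version of this $$$\mathcal{O}(\lg^2 n)$$$ algorithm (again, the modulus and the list-based helpers are my own scaffolding, not part of the original pseudocode):

```python
MOD = 10**9 + 7  # assumed prime modulus

def mat_mul(X, Y):
    k = len(X)
    return [[sum(X[i][t] * Y[t][j] for t in range(k)) % MOD
             for j in range(k)] for i in range(k)]

def mat_add(X, Y):
    return [[(a + b) % MOD for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

def identity(k):
    return [[int(i == j) for j in range(k)] for i in range(k)]

def zero(k):
    return [[0] * k for _ in range(k)]

def fast_pow(A, n):
    if n == 0:
        return identity(len(A))
    if n & 1:
        return mat_mul(A, fast_pow(A, n - 1))
    X = fast_pow(A, n // 2)
    return mat_mul(X, X)

def fast_sum(A, n):
    """Compute f(n) = I + A + ... + A^(n-1) with O(lg^2 n) multiplications."""
    k = len(A)
    if n == 0:
        return zero(k)
    if n & 1:
        # f(n) = I + A * f(n-1)
        return mat_add(identity(k), mat_mul(A, fast_sum(A, n - 1)))
    # f(n) = (I + A^(n/2)) * f(n/2)
    return mat_mul(mat_add(identity(k), fast_pow(A, n // 2)),
                   fast_sum(A, n // 2))
```

One can sanity-check this against a brute-force accumulation of $$$I + A + \dots + A^{n-1}$$$ for small $$$n$$$.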

Can we do better than $$$\mathcal{O}(\lg^2 n)$$$?

**DP on Function Calls**

Notice that `fast_sum` is $$$\mathcal{O}(\lg^2 n)$$$ because there are $$$\mathcal{O}(\lg n)$$$ function calls, and within each function call we need to perform an $$$\mathcal{O}(\lg n)$$$ operation (the fast exponentiation). Is it possible for us to perhaps get the value of $$$A^{\frac{n}{2}}$$$ in just $$$\mathcal{O}(1)$$$?

Try adding `print(n)` to the start of each function and observe which values of $$$n$$$ each function visits in each of its recursive calls. Do you notice something? If we start with the same initial $$$n$$$, then `fast_pow(A,n)` and `fast_sum(A,n)` will visit the exact same states in the exact same order! The reason why is simple---if you look at the recurrence relations derived, the transitions between states are exactly the same in both functions ($$$n \to n-1$$$ if $$$n$$$ is odd, and $$$n \to \frac{n}{2}$$$ if $$$n$$$ is even).

What does this mean? Well, in `fast_sum(A,n)`, we need $$$A^{\frac{n}{2}}$$$ for various values of $$$n$$$. However, in `fast_pow(A,n)`, we *also* need $$$A^{\frac{n}{2}}$$$ for those exact same values of $$$n$$$---in fact, we *already* compute all the values that we need when we call `fast_pow(A,n)` once.

So, instead of just discarding each $$$X = A^{\frac{n}{2}}$$$, we store those values into a lookup table. Before calling `fast_sum(A,n)`, we first call `fast_pow(A,n)` in order to populate the lookup table. Then, `fast_sum(A,n)` magically computes each $$$A^{\frac{n}{2}}$$$ that it needs in just $$$\mathcal{O}(1)$$$, since it is just an array lookup.

Since both functions visit the same states in the same order, it is sufficient for us to store $$$X$$$ into a simple array `state`, where `state[s]` equals $$$A^{\frac{n}{2}}$$$ in the `s`th recursive call.

Here is some pseudocode for this technique.

```
state = [None for i in range(60)]

def fast_pow(A, n, s=0):
    if n == 0:
        return I
    if n & 1:
        return A * fast_pow(A, n - 1, s + 1)
    else:
        X = fast_pow(A, n // 2, s + 1)
        state[s] = X  # save A^(n/2) so that fast_sum can reuse it
        return X * X

def fast_sum(A, n, s=0):
    if n == 0:
        return 0
    if n & 1:
        return I + A * fast_sum(A, n - 1, s + 1)
    else:
        return (I + state[s]) * fast_sum(A, n // 2, s + 1)

fast_pow(A, n)  # make sure to call fast_pow first to populate the table
print(fast_sum(A, n))
```

Here, `fast_sum(A,n)` is evaluated using only $$$\mathcal{O}(\lg n)$$$ matrix multiplications.
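To make the technique fully concrete, here is a self-contained runnable version (as before, `MOD` and the list-based helpers are my own assumptions); as in the pseudocode, `fast_pow` must be called first to populate `state`:

```python
MOD = 10**9 + 7  # assumed prime modulus

def mat_mul(X, Y):
    k = len(X)
    return [[sum(X[i][t] * Y[t][j] for t in range(k)) % MOD
             for j in range(k)] for i in range(k)]

def mat_add(X, Y):
    return [[(a + b) % MOD for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

def identity(k):
    return [[int(i == j) for j in range(k)] for i in range(k)]

def zero(k):
    return [[0] * k for _ in range(k)]

state = [None] * 128  # state[s] holds A^(n/2) from the s-th recursive call

def fast_pow(A, n, s=0):
    if n == 0:
        return identity(len(A))
    if n & 1:
        return mat_mul(A, fast_pow(A, n - 1, s + 1))
    X = fast_pow(A, n // 2, s + 1)
    state[s] = X  # save A^(n/2) so that fast_sum can reuse it
    return mat_mul(X, X)

def fast_sum(A, n, s=0):
    k = len(A)
    if n == 0:
        return zero(k)
    if n & 1:
        return mat_add(identity(k), mat_mul(A, fast_sum(A, n - 1, s + 1)))
    # state[s] is A^(n/2), already computed by fast_pow at the same state s
    return mat_mul(mat_add(identity(k), state[s]),
                   fast_sum(A, n // 2, s + 1))
```

Both functions take the same odd/even transitions on $$$n$$$, so the `s`th call of `fast_sum` finds exactly the power it needs in `state[s]`.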

I am going to call this technique "DP on Function Calls", since I can't seem to find a name for it anywhere else online, though I am sure that I am not the first one to have invented it. I also toyed with the name "Functional Cascading", to mirror the technique called "Fractional Cascading", which works on similar principles.

**Sample Problems that use DP on Function Calls**

I originally managed to come up with DP on Function Calls when making the model solution for this problem, which I wrote for the 2020 NOI.PH Eliminations Round. The problem can still be solved even without shaving off the extra log factor, but the time limit is tighter and you have to be much more careful with your constant factors.

For this problem, my solution without DP on Function Calls takes 0.54 s, whereas *with* DP on Function Calls it takes 0.27 s. That seems to be a `x2` constant factor speedup, which I think makes it interesting despite the technique not being strictly necessary to solve this problem :P

If you know any, please link more examples of problems that use DP on Function Calls in the comments! I am quite fond of this cute technique and I'd like to see more problems for which it is applicable.

P.S. The given "pseudocode" is (generally) valid Python, but make sure to use `n>>1` or `n//2` instead of `n/2`, so that we perform integer division. When I type `n//2` in the block code, everything after the `//` gets highlighted red since it's being interpreted as a comment. If someone can help me fix that, it would be appreciated as well!

Nice blog!

I suspect that there is another solution to compute $$$I + A + A^2 + \ldots + A^n$$$ in $$$O(\log n)$$$ by expressing the sum of first $$$i$$$ matrices as $$$S_i$$$ and transforming equation $$$S_i = S_{i-1} \cdot A + I$$$ into $$$S'_i = S'_{i-1} \cdot A'$$$ by expanding matrices $$$S$$$ and $$$A$$$ by one more row and column. I'm not sure about details, maybe we need to make much bigger matrices. I can only claim that this works for computing $$$v \cdot (1 + A + A^2 + \ldots + A^n)$$$ where vector $$$v$$$ was also given, which seems equivalent... I think.

My idea came from the fact that we can do the non-matrix version with matrices:

Given integers $$$a$$$ and $$$n$$$, compute $$$1 + a + a^2 + \ldots + a^n$$$ with a non-prime modulus. Let $$$s_i = 1 + a + a^2 + \ldots + a^i$$$. Note that $$$s_i = a \cdot s_{i-1} + 1$$$, so we can do matrix exponentiation to get $$$s_n$$$.

Thank you so much!

Your solution is pretty interesting. If I understand correctly, you can compute $$$s_n = 1 + a + \dots + a^{n-1}$$$ for the non-matrix case with the matrix equation

$$$$$$\begin{pmatrix} s_n \\ 1 \end{pmatrix} = \begin{pmatrix} a & 1 \\ 0 & 1 \end{pmatrix}^n \begin{pmatrix} s_0 \\ 1 \end{pmatrix},$$$$$$

where $$$s_0 = 0$$$.
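A small sketch of this scalar version (the function name and interface are mine; it also works with a non-prime modulus, since no division is performed):

```python
def geometric_sum(a, n, mod):
    """Compute 1 + a + ... + a^(n-1) mod `mod` via s_i = a*s_{i-1} + 1."""
    # State vector (s_i, 1); transition matrix [[a, 1], [0, 1]].
    def mul(X, Y):
        return [[(X[0][0]*Y[0][0] + X[0][1]*Y[1][0]) % mod,
                 (X[0][0]*Y[0][1] + X[0][1]*Y[1][1]) % mod],
                [(X[1][0]*Y[0][0] + X[1][1]*Y[1][0]) % mod,
                 (X[1][0]*Y[0][1] + X[1][1]*Y[1][1]) % mod]]
    M = [[a % mod, 1], [0, 1]]
    R = [[1, 0], [0, 1]]  # identity; becomes M^n below
    while n:
        if n & 1:
            R = mul(R, M)
        M = mul(M, M)
        n >>= 1
    # (s_n, 1) = M^n (s_0, 1) with s_0 = 0, so s_n is the top-right entry
    return R[0][1]
```

For instance, `geometric_sum(2, 4, 1000)` is $$$1+2+4+8 = 15$$$.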

Then, I think the matrix case with $$$S_n = I+A+\dots+A^{n-1}$$$ is almost exactly the same. You can express it with block matrix notation,

$$$$$$\begin{pmatrix} S_n \\ I \end{pmatrix} = \begin{pmatrix} A & I \\ 0 & I \end{pmatrix}^n \begin{pmatrix} S_0 \\ I \end{pmatrix},$$$$$$

where $$$S_0 = 0$$$.

So, the side length of the matrix doubles, which isn't insignificant (roughly 8x more operations) but it also isn't so bad either. I think the constant factor is quite possibly worse, but it's still $$$\mathcal{O}(\lg n)$$$ so that's pretty cool!
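A sketch of this block-matrix approach (helper names and modulus are assumed, and it uses a naive $$$2N \times 2N$$$ power without exploiting the block structure): the $$$n$$$-th power of $$$\begin{pmatrix} A & I \\ 0 & I \end{pmatrix}$$$ has $$$S_n = I + A + \dots + A^{n-1}$$$ as its top-right block.

```python
MOD = 10**9 + 7  # assumed prime modulus

def mat_mul(X, Y):
    k = len(X)
    return [[sum(X[i][t] * Y[t][j] for t in range(k)) % MOD
             for j in range(k)] for i in range(k)]

def identity(k):
    return [[int(i == j) for j in range(k)] for i in range(k)]

def mat_pow(M, n):
    R = identity(len(M))
    while n:
        if n & 1:
            R = mat_mul(R, M)
        M = mat_mul(M, M)
        n >>= 1
    return R

def sum_of_powers(A, n):
    """Compute S_n = I + A + ... + A^(n-1) via one 2N x 2N matrix power."""
    k = len(A)
    # Build the block matrix [[A, I], [0, I]].
    M = [[0] * (2 * k) for _ in range(2 * k)]
    for i in range(k):
        for j in range(k):
            M[i][j] = A[i][j] % MOD  # top-left block: A
        M[i][i + k] = 1              # top-right block: I
        M[k + i][k + i] = 1          # bottom-right block: I
    P = mat_pow(M, n)
    # Extract the top-right N x N block, which equals S_n.
    return [[P[i][j + k] for j in range(k)] for i in range(k)]
```

For example, with `A = [[2, 1], [0, 3]]`, `sum_of_powers(A, 3)` gives $$$I + A + A^2 = $$$ `[[7, 6], [0, 13]]`.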

The constant factor should be identical with this block-matrix approach to the problem, as every power of the block matrix keeps the same simple structure (the bottom blocks stay $$$0$$$ and $$$I$$$), allowing the overall block multiplication to be performed with two (nontrivial) $$$N \times N$$$ matrix-matrix multiplications. So it ends up taking roughly $$$4 \log_2{n}$$$ matrix-matrix multiplications (since it might perform two block matrix multiplications per halving), which is the same as your proposal, which uses two functions that might each perform two (normal) matrix multiplications per halving.