pajenegod's blog

Slowdown bug affecting C++ (64 bit) on Codeforces

By pajenegod, history, 5 months ago, In English

Yesterday, while investigating Strange TLE by cin using GNU C++20 (64), I found an easy and reproducible way to trigger a slowdown bug that I believe has been plaguing Codeforces for some time now. So I'm making this blog to raise awareness of it. MikeMirzayanov, please take a look at this!

Here is how to trigger the slowness bug:

Take any problem with a relatively large input on Codeforces ($$$2 \cdot 10^5$$$ ints is enough).
Take a random AC C++ submission that uses std::cin.
Add the line vector<vector<int>> TLE(40000, vector<int>(7)); somewhere in global space.
Submit using either C++20(64 bit) or C++17(64 bit).
???
TLE

For example, take tourist's solution to problem 1936-D - Bitwise Paradox. With the vector of death added to the code, it gets TLE on TC5 (taking $$$> 5$$$ s), while without the deadly vector, the submission takes 155 ms on TC5.

Here is a standalone example with the slowdown (credit to kostia244). It runs 100 times slower with the vector of death.

#include<bits/stdc++.h>
using namespace std;

vector<vector<int>> TLE(40000, vector<int>(7));

int main() {
    string s;
    for(int i = 0; i < 17; i++) s += to_string(i) + " ";
    
    for (int j = 0; j < 60000; ++j) {
        istringstream ss(s);
        int x;
        while (ss >> x);
    }
}

What is causing this?

Other ways to trigger the slowdown bug

Other blogs on this topic

tle, bug, slowdown, 64 bit, c++

Shallowest Decomposition Tree

By pajenegod, history, 6 months ago, In English

Hi Codeforces!

Today I want to tell you about a really cool and relatively unknown technique that is reminiscent of Centroid decomposition, but at the same time is also completely different.

The most common and well known decomposition tree algorithm out there is the centroid decomposition algorithm (running in $$$O(n \log n)$$$). It is a standard algorithm that is commonly used to solve tons of different kinds of divide and conquer problems on trees. However, it turns out that there exists another closely related decomposition tree algorithm, that is in a sense optimal, and can be implemented to run in $$$O(n)$$$ time in around $$$30$$$ lines of code. I have chosen to call it the Shallowest Decomposition Tree. This blog will be all about the shallowest decomposition tree. Something I want to remark on before we start is that I did not invent this. However, very few people know about it. So I decided to make this blog in order to teach people about this super cool and relatively unknown technique!

I believe that part of the reason for why the shallowest decomposition tree is almost never used in practice is because no one has published a simple to use implementation of it. My contribution here is that I've come up with a slick and efficient implementation that constructs the decomposition in linear time. I've implemented it both in Python and C++.

Shallowest Decomposition Tree:

C++ Implementation

#include <algorithm> // std::__lg (GCC / libstdc++ specific helper)
#include <vector>

#define log std::__lg
#define ctz __builtin_ctz
 
// Rooted tree
struct arborescence {
    std::vector<std::vector<int>> children;
    int root;
};
 
arborescence shallowest_decomposition_tree(std::vector<std::vector<int>> &graph, int root=0) {
    int n = (int)graph.size();
    std::vector<std::vector<int>> decomposition_tree(n), stacks(log(n) + 1);
    
    auto extract_chain = [&](int labels, int u) {
        while (labels) {
            int label = log(labels);
            labels ^= 1 << label;
            int v = stacks[label].back(); stacks[label].pop_back();
            decomposition_tree[u].push_back(v);
            u = v;
        }
    };
    std::vector<int> forbidden(n, -1);
    auto dfs = [&](int u, int p, auto&& self) -> void {
        int forbidden_once = 0, forbidden_twice = 0;
        for (auto v : graph[u]) {
            if (v != p) {
                self(v, u, self);
                forbidden_twice |= forbidden_once & (forbidden[v] + 1);
                forbidden_once |= forbidden[v] + 1;
            }
        }
        forbidden[u] = forbidden_once | ((1 << log(2*forbidden_twice  + 1)) - 1);
        int label_u = ctz(forbidden[u] + 1);
        stacks[label_u].push_back(u);
        for (int i = (int)graph[u].size() - 1; i >= 0; --i) {
            int v = graph[u][i];
            extract_chain((forbidden[v] + 1) & ((1 << label_u) - 1), u);
        }
    };
    dfs(root, -1, dfs);
    int max_label = log(forbidden[root] + 1);
    int decomposition_root = stacks[max_label].back(); stacks[max_label].pop_back();
    extract_chain((forbidden[root] + 1) & ((1 << max_label) - 1), decomposition_root);
    return {std::move(decomposition_tree), decomposition_root};
}

Python implementation (with recursion)

def ctz(x): # Count trailing zeroes
    return (x & -x).bit_length() - 1
 
def shallowest_decomposition_tree(graph, root=0):
    n = len(graph)
    forbidden = [-1] * n
    decomposition_tree = [[] for _ in range(n)]
    stacks = [[] for _ in range(n.bit_length())]
    def extract_chain(labels, u):
        while labels:
            label = labels.bit_length() - 1
            labels ^= 2**label
            v = stacks[label].pop()
            decomposition_tree[u].append(v)
            u = v
    def dfs(u, p):
        forbidden_once = forbidden_twice = 0
        for v in graph[u]:
            if v != p:
                dfs(v, u)
                forbidden_twice |= forbidden_once & (forbidden[v] + 1)
                forbidden_once |= forbidden[v] + 1
        forbidden[u] = forbidden_once | (2**forbidden_twice.bit_length() - 1)
        label_u = ctz(forbidden[u] + 1)
        stacks[label_u].append(u)
        for v in reversed(graph[u]): 
            extract_chain((forbidden[v] + 1) & (2**label_u - 1), u)
    dfs(root, -1)
    max_label = (forbidden[root] + 1).bit_length() - 1
    decomposition_root = stacks[max_label].pop()
    extract_chain((forbidden[root] + 1) & (2**max_label - 1), decomposition_root)
    return decomposition_tree, decomposition_root

Python implementation (without recursion)

Since deep recursion in Python is generally a really bad idea, I've also implemented a slightly less nice looking version without recursion. This is what you should actually use in practice.

def ctz(x): # Count trailing zeroes
    return (x & -x).bit_length() - 1
 
def shallowest_decomposition_tree(graph, root=0):
    n = len(graph)
    forbidden = [0] * n
    decomposition_tree = [[] for _ in range(n)]
    stacks = [[] for _ in range(n.bit_length())]
    def extract_chain(labels, u):
        while labels:
            label = labels.bit_length() - 1
            labels ^= 2**label
            v = stacks[label].pop()
            decomposition_tree[u].append(v)
            u = v
    dfs = [root]
    while dfs:
        u = dfs.pop()
        if u >= 0:
            forbidden[u] = -1
            dfs.append(~u)
            for v in graph[u]:
                if not forbidden[v]:
                    dfs.append(v)
        else:
            u = ~u
            forbidden_once = forbidden_twice = 0
            for v in graph[u]:
                forbidden_twice  |= forbidden_once & (forbidden[v] + 1)
                forbidden_once  |= forbidden[v] + 1
            forbidden[u] = forbidden_once | (2**forbidden_twice.bit_length() - 1)    
            label_u = ctz(forbidden[u] + 1)
            stacks[label_u].append(u)
            for v in graph[u]: 
                extract_chain((forbidden[v] + 1) & (2**label_u - 1), u)
    max_label = (forbidden[root] + 1).bit_length() - 1
    decomposition_root = stacks[max_label].pop()
    extract_chain((forbidden[root] + 1) & (2**max_label - 1), decomposition_root)
    return decomposition_tree, decomposition_root

And for comparison, here are also a couple of centroid decomposition tree implementations using the same interface.

Centroid Decomposition Tree:

C++ implementation

#include <algorithm> // std::find
#include <vector>

#define remover(vec, v) vec.erase(std::find(vec.begin(), vec.end(), v))
 
// Rooted tree
struct arborescence {
    std::vector<std::vector<int>> children;
    int root;
};
 
arborescence centroid_decomposition_tree(std::vector<std::vector<int>> graph) {
    int n = graph.size();
    std::vector<int> subtree_size(n);
    auto dfs = [&](int u, auto&& self) -> int {
        for (auto v : graph[u]) {
            remover(graph[v], u);
            subtree_size[u] += self(v, self);
        }
        return ++subtree_size[u];
    };
    dfs(0, dfs);
    
    std::vector<std::vector<int>> decomposition_tree(n);
    auto centroid_reroot = [&](int u, auto&& self) -> int {
        int N = subtree_size[u];
        for (bool found = 0; !found;) {
            found = 1;
            for (auto v : graph[u]) {
                if (subtree_size[v] > N / 2) {
                    found = 0;
                    subtree_size[u] = N - subtree_size[v];
                    remover(graph[u], v);
                    graph[v].push_back(u);
                    u = v;
                    break;
                }
            }
        }
        for (auto v : graph[u]) {
            decomposition_tree[u].push_back(self(v, self));
        }
        return u;
    };
    int decomposition_root = centroid_reroot(0, centroid_reroot);
    return {std::move(decomposition_tree), decomposition_root};
}

Python implementation

def centroid_decomposition_tree(graph):
    n = len(graph)
    graph = [c[:] for c in graph] # copy
    bfs = [0]
    for node in bfs:
        bfs += graph[node]
        for nei in graph[node]:
            graph[nei].remove(node)
    size = [0] * n
    for node in reversed(bfs):
        size[node] = 1 + sum(size[child] for child in graph[node])
    decomposition_tree = [[] for _ in range(n)]
    def centroid_reroot(u):
        N = size[u]
        while True:
            for v in graph[u]:
                if size[v] > N // 2:
                    size[u] = N - size[v]
                    graph[u].remove(v)
                    graph[v].append(u)
                    u = v
                    break
            else: # u is the centroid
                decomposition_tree[u] = [centroid_reroot(v) for v in graph[u]]
                return u
    decomposition_root = centroid_reroot(0)
    return decomposition_tree, decomposition_root

1. Motivation / Background

Take a look at the following problem

Treasure Hunt on a Tree (interactive)

1.1. What exactly is a "decomposition tree"?

Think about how one could visualize a deterministic strategy in the treasure hunt game. One natural way to do it is to create a new tree where the root of this new tree is the first node you guess. The children of this root are all the possible 2nd guesses (which depend on the result of the first guess). Then do the same for 3rd guesses, 4th guesses, etc. I am going to refer to this resulting tree as a decomposition tree of the original tree. Note that the goal of minimizing the number of guesses is equivalent to constructing the shallowest possible decomposition tree.

1.2. The centroid guessing strategy

A natural strategy for the treasure hunt game is to guess the centroid of the tree. A centroid is a node whose removal splits the tree into subtrees that all have size $$$\leq n/2$$$. Note that every tree has at least one centroid, and at most 2 centroids.

In practice, a common way to find a centroid is to start at some arbitrary node $$$u$$$, try splitting the tree at $$$u$$$, and find the largest subtree from the split. If the size of the largest subtree is $$$> n/2$$$, then you move $$$u$$$ in the direction of that subtree. If the size of the largest subtree $$$\leq n/2$$$, then $$$u$$$ is a centroid, and so you have found a centroid.
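
The walk described above can be sketched in a few lines of Python. This is only an illustration: the function name `find_centroid` and the adjacency-list input format are my own assumptions here, and it recomputes subtree sizes from scratch rather than reusing them like the centroid decomposition implementations at the top of the blog do.

```python
def find_centroid(graph, start=0):
    """Return a centroid of a tree given as adjacency lists (illustrative sketch)."""
    n = len(graph)
    # Root the tree at `start` and record a BFS order.
    parent = [-1] * n
    order = [start]
    for u in order:
        for v in graph[u]:
            if v != parent[u]:
                parent[v] = u
                order.append(v)
    # Subtree sizes, computed bottom-up.
    size = [1] * n
    for u in reversed(order[1:]):
        size[parent[u]] += size[u]

    u = start
    while True:
        # Only child subtrees need to be checked: the walk only ever moves
        # into a child of size > n/2, so the part of the tree on the parent
        # side of u always has size n - size[u] < n/2.
        for v in graph[u]:
            if v != parent[u] and 2 * size[v] > n:
                u = v  # move towards the too-large subtree
                break
        else:
            return u  # every subtree has size <= n/2


path = [[1], [0, 2], [1, 3], [2, 4], [3]]
print(find_centroid(path))  # 2
```

Each step of the walk moves strictly further away from `start`, so the walk takes at most $$$n$$$ steps in total.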

By repeatedly guessing centroids in the treasure hunt problem, the set of nodes the treasure could be hidden at is guaranteed to decrease by at least a factor of $$$2$$$ for each guess. This will lead to having to do at most $$$\log_2(n)$$$ guesses in the worst case. However, it turns out there are examples of trees where this centroid guessing strategy is sub-optimal.

1.2.1. Examples of trees where the centroid guessing strategy is not optimal

Small example where centroid decomposition is a factor 4/3 from optimal

Construction where centroid decomposition is a factor O(log n) from optimal

1.3. The center guessing strategy

Another natural strategy for the treasure hunt game is to guess the center of the tree, i.e. the least eccentric vertex (the "middle node" of the diameter). This turns out to be a fairly bad strategy, as can be seen in the following counterexample (credit to dorijanlendvaj for this example).

Construction where center decomposition is a factor O(sqrt(n)/log(n)) from optimal

2. Shallowest decomposition tree

Now we are finally at the point where we can talk about how to find the optimal (the shallowest) decomposition tree. The key to finding the shallowest decomposition tree turns out to be a greedy solution of a certain "Labeling Problem" 321C - Ciel the Commander.

The Labeling Problem

It turns out that this labeling problem can be solved optimally by labeling greedily.

Greedy labeling

In the next section, we will prove that greedy labeling is in fact optimal, and we will also show how to construct it in linear time.

Something that is very important to note is that it is possible to extract a decomposition tree from a labeling. The highest labeled node must have a unique label (because of constraint 2). So start by picking the highest labeled node as the root for the decomposition. Then remove that node and recurse. This will lead to a decomposition tree of depth $$$\leq$$$ largest label.
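
To make the "pick the highest label, remove it, recurse" procedure concrete, here is a naive sketch of it. The function name `decomposition_from_labeling` is hypothetical, and the code assumes the labeling it is given is valid, so that the highest label in every component is unique. It runs in roughly $$$O(n \cdot \text{depth})$$$ time, so it is only meant to illustrate the idea, not to replace the linear time construction.

```python
def decomposition_from_labeling(graph, labels):
    """Naive extraction of a decomposition tree from a valid labeling (sketch)."""
    n = len(graph)
    removed = [False] * n
    children = [[] for _ in range(n)]

    def component(start):
        # All not-yet-removed nodes reachable from `start`.
        seen = {start}
        comp = [start]
        for u in comp:
            for v in graph[u]:
                if not removed[v] and v not in seen:
                    seen.add(v)
                    comp.append(v)
        return comp

    # The highest labeled node of the whole tree becomes the root.
    root = max(range(n), key=labels.__getitem__)
    removed[root] = True
    todo = [root]
    while todo:
        u = todo.pop()
        # Removing u splits its component; each piece hangs off a neighbour of u.
        for v in graph[u]:
            if not removed[v]:
                top = max(component(v), key=labels.__getitem__)
                removed[top] = True
                children[u].append(top)
                todo.append(top)
    return children, root


# Path 0 - 1 - 2 - 3 with the labels 2, 0, 1, 0
print(decomposition_from_labeling([[1], [0, 2], [1, 3], [2]], [2, 0, 1, 0]))
# ([[2], [], [1, 3], []], 0)
```

In the example the largest label is $$$2$$$ and the resulting decomposition tree has depth $$$2$$$.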

Also note that given a decomposition tree it is possible to create a labeling. One way to do this is to label the nodes by their height in the decomposition tree. This will make it so the largest label used $$$=$$$ depth of the decomposition tree.
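
A small sketch of this direction, labeling every node by its height in a decomposition tree. The function name `labels_from_decomposition` is made up for this example, and the decomposition tree is assumed to be given in the same `(children, root)` form that the implementations in this blog return.

```python
def labels_from_decomposition(children, root):
    """Label each node by its height in the decomposition tree (sketch)."""
    n = len(children)
    # BFS order from the root, so reversed(order) visits children before parents.
    order = [root]
    for u in order:
        order += children[u]
    height = [0] * n
    for u in reversed(order):
        for v in children[u]:
            height[u] = max(height[u], height[v] + 1)
    return height


# Decomposition tree with root 1 and children 0 and 2
print(labels_from_decomposition([[], [0, 2], []], 1))  # [0, 1, 0]
```

The root gets the largest label, and that label equals the depth of the decomposition tree.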

The take away from this discussion is that optimally solving the labeling problem is equivalent to finding an optimal deterministic strategy in the treasure hunt game, since a solution to the labeling problem can be made into a deterministic strategy for the treasure hunt game, and vice versa. So our (optimal) greedy labeling corresponds to an (optimal) deterministic guessing strategy in the treasure hunt game.

3. Analysis of greedy labeling

Let us first define the notion of forbidden labels. Given a rooted tree, a labeling of the tree, and a node $$$u$$$ in the tree, define forbidden(u) as the bitmask describing all labels that cannot be put on $$$u$$$ considering the descendants of $$$u$$$. I.e. bit $$$i$$$ of forbidden(u) is $$$1$$$ if and only if labeling node $$$u$$$ with label $$$i$$$ would cause a contradiction with the labels of the descendants of $$$u$$$.

Note that in the case of the greedy labeling, the label of $$$u$$$ corresponds to the least significant set bit of forbidden(u) + 1, or equivalently it is the number of trailing zeroes of forbidden(u) + 1.

3.1. An O(n) DP-algorithm for greedy labeling

In the case of the greedy labeling, it is possible to make a DP-formula for forbidden(u). There are 3 cases:

Case 1. Node $$$u$$$ has no children. In this case forbidden(u) = 0.

Case 2. Node $$$u$$$ has exactly one child $$$v$$$. In this case forbidden(u) = forbidden(v) + 1.

Case 3. Node $$$u$$$ has multiple children, $$$v_1$$$, ..., $$$v_m$$$. In this case bit $$$i$$$ is set in forbidden(u) if either

bit $$$i$$$ is set in at least one of forbidden(v_1) + 1, ..., forbidden(v_m) + 1.
or there exists $$$j > i$$$ such that bit $$$j$$$ is set in at least two of forbidden(v_1) + 1, ..., forbidden(v_m) + 1.
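
Here is a tiny worked example of the three cases. The helper names `forbidden_from_children` and `greedy_label` are hypothetical and only exist for this illustration; they take the forbidden masks of the children as plain integers.

```python
def forbidden_from_children(children_forbidden):
    """Combine the forbidden masks of the children into forbidden(u)."""
    once = twice = 0
    for f in children_forbidden:
        x = f + 1              # labels that child v makes visible to u
        twice |= once & x      # bits seen in at least two children
        once |= x              # bits seen in at least one child
    # Every bit up to (and including) the highest doubly seen bit is forbidden.
    return once | (2 ** twice.bit_length() - 1)


def greedy_label(forbidden_u):
    """Smallest label not in the mask = number of trailing zeros of mask + 1."""
    x = forbidden_u + 1
    return (x & -x).bit_length() - 1


print(forbidden_from_children([]))      # Case 1: 0, so a leaf gets label 0
print(forbidden_from_children([0]))     # Case 2: 0 + 1 = 1, so label 1
print(forbidden_from_children([0, 0]))  # Case 3: two leaf children give 1, so label 1
print(forbidden_from_children([2, 2]))  # Case 3: bits 0 and 1 both seen twice, gives 3
print(greedy_label(3))                  # 2
```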

This gives us a simple O(n) implementation of the greedy labeling algorithm.

# Count trailing zeros
# Or equivalently, index of lowest set bit
def ctz(x): 
    return (x & -x).bit_length() - 1

forbidden = [0] * n
def dfs(u, p):
    forbidden_once = forbidden_twice = 0
    for v in graph[u]:
        if v != p:
            dfs(v, u)
            forbidden_by_v = forbidden[v] + 1
            forbidden_twice |= forbidden_once & forbidden_by_v
            forbidden_once |= forbidden_by_v
    forbidden[u] = forbidden_once | (2**forbidden_twice.bit_length() - 1)
dfs(root, -1)
labels = [ctz(forbidden[u] + 1) for u in range(n)]

Remark: It is not actually obvious why this algorithm runs in linear time since this algorithm could in theory be using big integers. But as we will see in the next section, Section 3.2, greedy labeling is optimal. So by comparing the greedy labeling to centroid labeling, we know that the largest forbidden label for the greedy labeling is upper bounded by $$$\log_2(n)$$$. So this DP-algorithm does not use any big integers, and is therefore just a standard dfs algorithm which runs in linear time.
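
As a sanity check of both the DP and the $$$\log_2(n)$$$ bound, one can brute force small random trees. The sketch below makes one assumption that should be stated loudly: it takes the labeling constraint to be the one from 321C - Ciel the Commander, i.e. any two distinct nodes with the same label must have a node with a strictly larger label on the path between them. The names `greedy_labels`, `is_valid_labeling` and `random_tree` are made up for this example.

```python
import random


def ctz(x):  # Count trailing zeros
    return (x & -x).bit_length() - 1


def greedy_labels(graph, root=0):
    """Iterative version of the forbidden-mask DP, returning the labels."""
    n = len(graph)
    parent = [-1] * n
    order = [root]
    for u in order:
        for v in graph[u]:
            if v != parent[u]:
                parent[v] = u
                order.append(v)
    forbidden = [0] * n
    for u in reversed(order):  # children are processed before their parent
        once = twice = 0
        for v in graph[u]:
            if v != parent[u]:
                x = forbidden[v] + 1
                twice |= once & x
                once |= x
        forbidden[u] = once | (2 ** twice.bit_length() - 1)
    return [ctz(f + 1) for f in forbidden]


def is_valid_labeling(graph, labels):
    """O(n^2) check of the (assumed) 321C constraint."""
    for s in range(len(graph)):
        # Walk from s through strictly smaller labels only. Reaching an equal
        # label means there is no larger label on the path in between.
        seen = {s}
        stack = [s]
        while stack:
            u = stack.pop()
            for v in graph[u]:
                if v in seen:
                    continue
                if labels[v] == labels[s]:
                    return False
                if labels[v] < labels[s]:
                    seen.add(v)
                    stack.append(v)
    return True


def random_tree(n, rng):
    graph = [[] for _ in range(n)]
    for v in range(1, n):
        u = rng.randrange(v)
        graph[u].append(v)
        graph[v].append(u)
    return graph


rng = random.Random(0)
for _ in range(100):
    n = rng.randint(1, 30)
    graph = random_tree(n, rng)
    labels = greedy_labels(graph)
    assert is_valid_labeling(graph, labels)
    assert max(labels) <= n.bit_length() - 1  # i.e. at most floor(log2(n))
```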

3.2. Greedy labeling is optimal

Lemma: Given a rooted tree (rooted at some node $$$r$$$). Out of all possible labelings of this rooted tree, the greedy labeling (with root $$$r$$$) minimizes forbidden(r).

Note that minimizing forbidden(r) is effectively the same thing as minimizing the largest label used in the labeling. So it follows from this claim that the greedy algorithm uses the fewest number of distinct labels out of any valid labeling.

Proof of Lemma

Recall the DP-algorithm used to compute forbidden(u) in the case of greedy labeling. It computes which labels are forbidden for $$$u$$$ based on the values of forbidden(v) + 1 over all children $$$v$$$ of $$$u$$$. Note that this DP-algorithm can be modified to compute forbidden(u) for any labeling (and not just the greedy labeling). To do this, instead of always adding + 1, we need to add some other number $$$b_v \geq 1$$$. The value of $$$b_v$$$ depends only on the label of $$$v$$$ and the value of forbidden(v).

Here is a generalized DP-algorithm that given a root and a labeling computes forbidden(u) for all nodes $$$u$$$:

def compute_b(label_v, forbidden_v):
    assert 2**label_v & forbidden_v == 0
    b_v = ((forbidden_v | 2**label_v) & ~(2**label_v - 1)) - forbidden_v
    assert b_v >= 1
    # Note: if v was labeled greedily, then b_v will be 1
    # A non-greedy choice of labeling v will make b_v > 1.
    return b_v

forbidden = [0] * n
def dfs(u, p):
    forbidden_once = forbidden_twice = 0
    for v in graph[u]:
        if v != p:
            dfs(v, u)
            forbidden_by_v = forbidden[v] + compute_b(label[v], forbidden[v])
            forbidden_twice |= forbidden_once & forbidden_by_v
            forbidden_once |= forbidden_by_v
    forbidden[u] = forbidden_once | (2**forbidden_twice.bit_length() - 1)
dfs(root, -1)

Now all that remains to prove the Lemma is to show that forbidden(u) is a monotone multivariate function as a function of forbidden(v_1) + b_{v_1}, ..., forbidden(v_m) + b_{v_m} over the children $$$v_1$$$, ..., $$$v_m$$$ of $$$u$$$, as monotonicity would imply that the greedy labeling (picking all $$$b_v$$$'s to be $$$1$$$) minimizes forbidden(u). So we want to prove that the following multivariate function $$$f(x_1, ..., x_m)$$$ is monotone

def f(X):
    forbidden_once = forbidden_twice = 0
    for x in X:
        forbidden_twice |= forbidden_once & x
        forbidden_once |= x
    return forbidden_once | (2**forbidden_twice.bit_length() - 1)

I don't know of a simple argument to show this. However, one way to see that $$$f$$$ is monotone is to think of the input variables x_1, ..., x_m as being 0/1 vectors v_1, ..., v_m in a real vector space. Additionally, also think of $$$f(X) - 1$$$ as a 0/1 vector. In this view, $$$f(X) - 1$$$ is the lexicographically smallest 0/1 vector $$$\geq v_1 + ... + v_m$$$. From this it is possible to draw the conclusion that $$$f$$$ must be monotone. If anyone has a cleaner argument for this, then please post it below.
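
This is of course not a proof, but the monotonicity of $$$f$$$ is easy to sanity check by brute force on small inputs. In the sketch below, "monotone" is taken to mean that $$$x_i \leq y_i$$$ for all $$$i$$$ implies $$$f(X) \leq f(Y)$$$, and `check_monotone` is a hypothetical helper that returns a counterexample if it finds one.

```python
from itertools import product


def f(X):
    forbidden_once = forbidden_twice = 0
    for x in X:
        forbidden_twice |= forbidden_once & x
        forbidden_once |= x
    return forbidden_once | (2**forbidden_twice.bit_length() - 1)


def check_monotone(m, limit):
    """Try all pairs of m-tuples with entries in range(limit).

    Returns a pair (X, Y) with X <= Y componentwise and f(X) > f(Y),
    or None if no such pair exists.
    """
    inputs = list(product(range(limit), repeat=m))
    values = {X: f(X) for X in inputs}
    for X in inputs:
        for Y in inputs:
            if all(x <= y for x, y in zip(X, Y)) and values[X] > values[Y]:
                return X, Y
    return None


print(check_monotone(2, 16))  # None
print(check_monotone(3, 6))   # None
```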

3.3. Adv. Constructing the shallowest decomposition tree in linear time

As seen in section 3.1, it is easy to construct the greedy labeling in linear time. However, it is far more tricky to construct the shallowest decomposition tree in $$$O(n)$$$ time. The way we will do it is by making use of chains. Before we formally define what a chain is, take a look at the following examples.

Basic example

More general example

So formally,

Definition of the chains of a rooted labeled tree

Using the forbidden variable, it is possible to identify these chains. If $$$v$$$ is a child of $$$u$$$, then the set of labels in forbidden(v) + 1 which are smaller than $$$u$$$'s label corresponds to a (Case 1) chain. Furthermore, the set of labels in forbidden(root) + 1 corresponds to the (Case 2) chain. So we can easily identify the sets of labels making up the chains. However, what we need in order to build the decomposition tree is to find the set of nodes that make up the chains.

The last trick we need is to use $$$O(\log n)$$$ stacks (one stack for each label). To extract the shallowest decomposition tree, do a DFS over the greedily labeled tree. When we first visit a node, append that node into its corresponding stack. Furthermore, during the dfs, whenever we identify the labels of a chain, we can pop the corresponding stacks in order to find the nodes making up that chain. Then add the chain edges to the decomposition tree. I've called this popping procedure extract_chain in the code below.

Extraction of decomposition tree using chains implemented in Python

def ctz(x): # Count trailing zeros
    return (x & -x).bit_length() - 1

def shallowest_decomposition_tree(graph, root=0):
    n = len(graph)
  
    # Create a greedy labeling
    forbidden = [0] * n
    def dfs(u, p):
        forbidden_once = forbidden_twice = 0
        for v in graph[u]:
            if v != p:
                dfs(v, u)
                forbidden_by_v = forbidden[v] + 1
                forbidden_twice |= forbidden_once & forbidden_by_v
                forbidden_once |= forbidden_by_v
        forbidden[u] = forbidden_once | (2**forbidden_twice.bit_length() - 1)
    dfs(root, -1)
    labels = [ctz(forbidden[u] + 1) for u in range(n)]  
  
    # Extract the shallowest decomposition tree in O(n) time
    decomposition_tree = [[] for _ in range(n)]
    stacks = [[] for _ in range(max(labels) + 1)]
    def extract_chain(labels, u):
        # Extract a chain starting at node u
        while labels:
            label = labels.bit_length() - 1
            labels ^= 2**label
            v = stacks[label].pop()
            decomposition_tree[u].append(v)
            u = v
    def dfs(u, p):
        stacks[labels[u]].append(u)
        for v in graph[u]:
            if v != p: 
                dfs(v, u)
                extract_chain((forbidden[v] + 1) & (2**labels[u] - 1), u)
    dfs(root, -1)
    decomposition_root = stacks[-1].pop()
    extract_chain((forbidden[root] + 1) & (2**labels[decomposition_root] - 1), decomposition_root)
    return decomposition_tree, decomposition_root

With this, the linear time algorithm for the shallowest decomposition tree is finally complete! However, it is still possible to make some slight improvements. The main improvement would be to greedy label and build the decomposition tree at the same time in a single dfs, instead of using two dfs's. This is what I've chosen to do in my Python and C++ implementations found at the top of this blog. Another possible improvement is to have a variable for forbidden[u] + 1 instead of forbidden[u] itself. For the sake of comprehensibility, I've chosen not to do this. But it would definitely help if you'd want to codegolf it. The final possible improvement is to switch from using a recursive dfs to manually doing the dfs using a stack. This improvement is important for languages that don't handle recursion well, like Python.

4. Benchmarks

To my knowledge, every problem that can be solved with centroid decomposition can be solved with shallowest decomposition tree too, and you can freely switch between them. So here are a couple of comparisons between the two decompositions.

321C - Ciel the Commander: Centroid (Python) TLE 1.34 s / 1 s | Shallowest (Python) 0.72 s | Centroid (C++) 0.28 s | Shallowest (C++) 0.16 s
914E - Palindromes in a Tree: Centroid (Python) TLE 4.2 s / 4 s | Shallowest (Python) 3.75 s | Centroid (C++) 1.67 s | Shallowest (C++) 1.53 s

321C - Ciel the Commander is a good example of a problem where most of the time is spent building the decomposition tree. Here we can see a fairly significant boost from using the Shallowest Decomposition Tree compared to using Centroid Decomposition, especially if we take away the time spent on IO. 914E - Palindromes in a Tree is a good example of a problem where building the decomposition tree only takes up a small portion of the total time. For this reason, we only see a small performance gain from using the Shallowest Decomposition Tree.

Remark: There are definitely faster solutions to 321C - Ciel the Commander out there. For example, you could solve the problem just by outputting a greedy labeling without ever constructing any kind of decomposition tree. But the reason I'm using this problem as a benchmark is to compare the time used to construct the shallowest decomposition tree vs the centroid decomposition tree.

5. Mentions and final remarks

A big thanks to Devil for introducing the shallowest decomposition tree to me. Also a big thanks to everyone that has discussed the shallowest decomposition tree with me over at the AC server: qmk, magnus.hegdahl, nor, gamegame, Savior-of-Cross, meooow, brunovsky, PurpleCrayon.

One final thing I want to mention is that I know of two competitive programming problems that are intended to be solved specifically using the shallowest decomposition tree. Chronologically, the first problem is Cavern from POI. From what I understand, they were the first to come up with the idea. Many years later, atcoder independently came up with Uninity.

There has also been a much more recent problem 1444E - Finding the Vertex, which is a treasure hunt game on a tree, where you are allowed to guess edges instead of nodes. The solution isn't exactly the shallowest decomposition tree, but the method used to solve it is closely related to the shallowest decomposition tree. I challenge anyone who thinks they've mastered the shallowest decomposition tree to solve it. If you need help on it, then take a look at my submission 242905032.

centroid decomposition, centroid, tree dp, tree, decomposition, tutorial, labeling, binary search, greedy, optimal strategy

The Ultimate Reroot Template

By pajenegod, history, 7 months ago, In English

Hi Codeforces!

Have you ever had this issue before?

If yes, then you have come to the right place! This is a blog about my super easy to use template for (reroot) DP on trees. I really believe that this template is kind of revolutionary for solving reroot DP problems. I've implemented it both in Python and in C++ (the template supports Python2, Python3 and >= C++14). Using this template, you will be able to easily solve > 2000 rated reroot problems in a couple of minutes, with a couple of lines of code.

A big thanks goes out to everyone that has helped me by giving feedback on the blog and/or discussing reroot with me: nor, meooow, qmk, demoralizer, jeroenodb, ffao. And an especially huge thanks goes to nor for helping out with making the C++ version of the template.

1. Introduction / Motivation

As an example, consider this problem: 1324F - Maximum White Subtree. The single thing that makes this problem difficult is that you need to find, for every node $$$u$$$, the maximum white subtree containing $$$u$$$. Had this problem only asked to find the answer for a specific node $$$u$$$, then a simple dfs solution would have worked.

Simple dfs solution

def dfs(node, parent=-1):
  nodeDP = 0
  for nei in graph[node]:
    if nei != parent:
      nodeDP = nodeDP + max(dfs(nei, node), 0)
  return nodeDP + 2 * color[node] - 1
print(dfs(u))

But 1324F - Maximum White Subtree requires you to find the answer for every node. This forces you to use a technique called rerooting. Long story short, it is a mess to code. Maybe you could argue that for this specific problem it isn't all that bad. But it is definitely not as easy to code as the dfs solution above.

What if I told you that it is possible to take the logic from the dfs function above, put it inside of a "black box", and get the answer for all $$$u$$$ in $$$O(n \log n)$$$ time? Well it is, and that is what this blog is all about =)

In order to extract the logic from the simple dfs solution, let us first create a generic template for DP on trees and implement the simple dfs solution using its interface. Note that the following code contains the exact same logic as the simple dfs solution above. It solves the problem for a specific node $$$u$$$.

Simple dfs solution (using treeDP template)

def treeDP(root, graph, default, combine, finalize = lambda nodeDP,node,eind: nodeDP):
  DP = [0] * len(graph)
  def dfs(node, parent=-1):
    nodeDP = default[node]
    for eind, nei in enumerate(graph[node]):
      if nei != parent:
        neiDP = dfs(nei, node)
        nodeDP = combine(nodeDP, neiDP, node, eind)
    parent_eind = -1 if parent == -1 else graph[node].index(parent)
    DP[node] = finalize(nodeDP, node, parent_eind)
    return DP[node]
  
  dfs(root)
  return DP

default = [0] * n

def combine(nodeDP, neiDP, node, eind):
  return nodeDP + max(neiDP, 0)

def finalize(nodeDP, node, eind):
  return nodeDP + 2 * color[node] - 1

print(treeDP(u, graph, default, combine, finalize)[u])

Now, all that remains to solve the full problem is to switch out the treeDP function with the ultimate reroot template. The template returns the output of treeDP for every node $$$u$$$, in $$$O(n \log n)$$$ time! It is just that easy. 240150867

Solution to problem F using the ultimate reroot template

default = [0] * n

def combine(nodeDP, neiDP, node, eind):
  return nodeDP + max(neiDP, 0)

def finalize(nodeDP, node, eind):
  return nodeDP + 2 * color[node] - 1

rootDP, _, _ = rerooter(graph, default, combine, finalize)
print(*rootDP)

The takeaway from this example is that the reroot template makes it almost trivial to solve complicated reroot problems. For example, suppose we modify 1324F - Maximum White Subtree such that both nodes and edges have colors. Normally this modification would be complicated and would require an entire overhaul of the solution. However, with the ultimate reroot template, the solution is simply:

Solution to problem F if both edges and nodes are colored

default = [0] * n

def combine(nodeDP, neiDP, node, eind):
  return nodeDP + max(neiDP + 2 * edge_color[node][eind] - 1, 0)

def finalize(nodeDP, node, eind):
  return nodeDP + 2 * color[node] - 1

rootDP, _, _ = rerooter(graph, default, combine, finalize)
print(*rootDP)
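
If you want to double check what `combine` and `finalize` compute on a small input, one option is to brute force it by running the treeDP logic once from every root, in $$$O(n^2)$$$ total. The sketch below does that for the edge colored variant. The helper name `brute_force_rootDP` is hypothetical, it only returns the per-root answers (not the other two return values of `rerooter`), and it assumes that `edge_color[node][eind]` stores the color of the edge from `node` to `graph[node][eind]`, with 1 meaning white and 0 meaning black.

```python
def treeDP(root, graph, default, combine, finalize):
    # Same interface as the treeDP template: DP value of every node
    # when the tree is rooted at `root`.
    DP = [0] * len(graph)
    def dfs(node, parent=-1):
        nodeDP = default[node]
        for eind, nei in enumerate(graph[node]):
            if nei != parent:
                nodeDP = combine(nodeDP, dfs(nei, node), node, eind)
        parent_eind = -1 if parent == -1 else graph[node].index(parent)
        DP[node] = finalize(nodeDP, node, parent_eind)
        return DP[node]
    dfs(root)
    return DP

def brute_force_rootDP(graph, default, combine, finalize):
    # O(n^2): one full tree DP per choice of root.
    return [treeDP(r, graph, default, combine, finalize)[r]
            for r in range(len(graph))]

# Path 0 - 1 - 2. Nodes 0 and 1 are white, node 2 is black.
# Edge 0-1 is white, edge 1-2 is black.
graph = [[1], [0, 2], [1]]
color = [1, 1, 0]
edge_color = [[1], [1, 0], [0]]
n = len(graph)

default = [0] * n

def combine(nodeDP, neiDP, node, eind):
    return nodeDP + max(neiDP + 2 * edge_color[node][eind] - 1, 0)

def finalize(nodeDP, node, eind):
    return nodeDP + 2 * color[node] - 1

answers = brute_force_rootDP(graph, default, combine, finalize)
print(*answers)  # 3 3 1
```

For nodes 0 and 1 the best subtree is {0, 1} together with the white edge between them, and for node 2 the best subtree is the whole path.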

2. Collection of reroot problems and solutions

Here is a collection of reroot problems on Codeforces, together with some short and simple solutions in both Python and C++ using the rerooter template. These are nice problems to practice on if you want to try out using this template. The difficulty rating ranges between 1700 and 2600. I've also put together a GYM contest with all of the problems: Collection of Reroot DP problems (difficulty rating 1700 to 2600).

(1700 rating) 219D - Choosing Capital for Treeland: Python solution, C++ solution
(2300 rating) 543D - Road Improvement: Python solution, C++ solution
(2200 rating) 592D - Super M: Python solution, C++ solution
(2600 rating) 627D - Preorder Test: Python solution, C++ solution
(2100 rating) 852E - Casinos and travel: Python solution, C++ solution
(2300 rating) 960E - Alternating Tree: Python solution, C++ solution; or alternatively using "edge DP": Python solution, C++ solution
(1900 rating) 1092F - Tree with Maximum Cost: Python solution, C++ solution
(2200 rating) 1156D - 0-1-Tree: Python solution, C++ solution
(2400 rating) 1182D - Complete Mirror: Python solution, C++ solution
(2100 rating) 1187E - Tree Painting: Python solution, C++ solution
(2000 rating) 1294F - Three Paths on a Tree: Python solution, C++ solution
(1800 rating) 1324F - Maximum White Subtree: Python solution, C++ solution
(2500 rating) 1498F - Christmas Game: Python solution, C++ solution
(2400 rating) 1626E - Black and White Tree: Python solution, C++ solution
(2500 rating) 1691F - K-Set Tree: Python solution, C++ solution
(2400 rating) 1794E - Labeling the Tree with Distances: Python solution, C++ solution
(2500 rating) 1796E - Colored Subgraphs: Python solution, C++ solution
(1700 rating) 1881F - Minimum Maximum Distance: Python solution, C++ solution
104008G - Group Homework: Python solution, C++ solution
104665H - Alice Learns Eertree!: Python solution, C++ solution

3. Understanding the `rerooter` black box

The following is the black box rerooter implemented naively:

Template (naive O(n^2) version)

def treeDP(root, graph, default, combine, finalize = lambda nodeDP,node,eind: nodeDP):
  DP = [0] * len(graph)
  def dfs(node, parent=-1):
    nodeDP = default[node]
    for eind, nei in enumerate(graph[node]):
      if nei != parent:
        neiDP = dfs(nei, node)
        nodeDP = combine(nodeDP, neiDP, node, eind)
    parent_eind = -1 if parent == -1 else graph[node].index(parent)
    DP[node] = finalize(nodeDP, node, parent_eind)
    return DP[node]
  
  dfs(root)
  return DP

def rerooter(graph, default, combine, finalize = lambda nodeDP,node,eind: nodeDP):
  n = len(graph)
  rootDP = []
  edgeDP = {}
  for root in range(n):
    DP = treeDP(root, graph, default, combine, finalize)
    rootDP.append(DP[root])
    for nei in graph[root]:
      edgeDP[root, nei] = DP[nei]
    
  forwardDP = [[edgeDP[node, nei] for nei in graph[node]] for node in range(n)]
  reverseDP = [[edgeDP[nei, node] for nei in graph[node]] for node in range(n)]

  return rootDP, forwardDP, reverseDP

rerooter outputs three variables.

rootDP is a list, where rootDP[node] = dfs(node).
forwardDP is a list of lists, where forwardDP[node][eind] = dfs(nei, node), where nei = graph[node][eind].
reverseDP is a list of lists, where reverseDP[node][eind] = dfs(node, nei), where nei = graph[node][eind].

If you don't understand the definitions of rootDP/forwardDP/reverseDP, then I recommend reading the naive $$$O(n^2)$$$ implementation of rerooter. It should be fairly self explanatory.
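To make the three outputs concrete, here is a small sketch on the path 0 - 1 - 2, using "number of nodes in the subtree" as the DP. The `dfs` helper below is a compact brute-force stand-in written only for this illustration (it is not the template itself); it roughly corresponds to `default = [1] * n` and a `combine` that adds.

```python
# Path graph 0 - 1 - 2, stored as adjacency lists
graph = [[1], [0, 2], [1]]

def dfs(node, parent=-1):
    # DP value = number of nodes in the subtree of `node`
    # when the tree is hung so that `parent` sits above `node`
    return 1 + sum(dfs(nei, node) for nei in graph[node] if nei != parent)

n = len(graph)
rootDP = [dfs(node) for node in range(n)]
forwardDP = [[dfs(nei, node) for nei in graph[node]] for node in range(n)]
reverseDP = [[dfs(node, nei) for nei in graph[node]] for node in range(n)]

print(rootDP)     # [3, 3, 3]
print(forwardDP)  # [[2], [1, 1], [2]]
print(reverseDP)  # [[1], [2, 2], [1]]
```

For instance, forwardDP[0][0] = 2 is the size of the subtree hanging below node 1 when node 0 is its parent, while reverseDP[0][0] = 1 is what remains of node 0 when node 1 is its parent.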

The rest of this blog is about the techniques of how to make rerooter run in $$$O(n \log n)$$$. So if you just want to use rerooter as a black box, then you don't have to read or understand the rest of this blog.

One last remark. If you've ever done rerooting before, you might recall that rerooting usually runs in $$$O(n)$$$ time. So why does this template run in $$$O(n \log n)$$$? The reason for this is that I restrict myself to use the combine function in a left folding procedure, e.g. combine(combine(combine(nodeDP, neiDP1), neiDP2), neiDP3). My template is not allowed to do for example combine(nodeDP, combine(neiDP1, combine(neiDP2, neiDP3))). While this limitation makes the template run slower, $$$O(n \log n)$$$ instead of $$$O(n)$$$, it also makes it a lot easier to use the template. If you still think that the $$$O(n)$$$ version is superior, then I don't think you've understood how nice and general the $$$O(n \log n)$$$ version truly is.
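A small sketch of why the left-fold restriction is convenient: nodeDP and neiDP are then allowed to have different types. The `combine` below is a hypothetical example (simplified to two arguments, unlike the template's four) that tracks the two deepest child subtrees of a node.

```python
from functools import reduce

def combine(nodeDP, neiDP):
    # nodeDP: pair (largest, second largest) depth collected so far
    # neiDP:  depth of one more child subtree (a plain int)
    a, b = nodeDP
    d = neiDP + 1
    if d > a:
        return (d, a)
    return (a, max(b, d))

child_depths = [3, 1, 4]
# Left fold: combine(combine(combine((0, 0), 3), 1), 4)
print(reduce(combine, child_depths, (0, 0)))  # (5, 4)
```

Here combine(neiDP2, neiDP3) would not even make sense, since both arguments would be plain ints. An $$$O(n)$$$ rerooter typically needs that kind of merge, which is why it asks more of the user.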

4. Rerooting and exclusivity

The general idea behind rerooting is that we first compute the DP as normal for some arbitrary node as the root (I use node = 0 for this). After we have done this we can "move" the root of the tree by updating the DP value of the old root and the DP value of a neighbour to the old root. That neighbour then becomes the new root.

Let $$$u$$$ denote the current root, and let $$$v$$$ denote the neighbour of $$$u$$$ that we want to move the root to. At this point, we already know the value of dfs(v, u) since $$$u$$$ is the current root. But in order to be able to move the root from $$$u$$$ to $$$v$$$, we need to find the new DP value of $$$u$$$, i.e. dfs(u, v).

If we think about this in terms of forwardDP and reverseDP, then we currently know forwardDP[u], and our goal is to compute reverseDP[u]. This can be done naively in $$$O(\text{deg}(u)^2)$$$ time with a couple of for loops by calling combine $$$O(\text{deg}(u)^2)$$$ times, and then calling finalize $$$O(\text{deg}(u))$$$ times.

The bottleneck here is the $$$O(\text{deg}(u)^2)$$$ calls to combine. So for now, let us separate out the part of the code that calls combine from the rest of the code into a function called exclusive. The goal of the next section will then be to speed up the naively implemented exclusive function to run in $$$O(\text{deg}(u) \log (\text{deg}(u)))$$$ time.

Rerooting using exclusivity (O(sum deg^2) version)

def exclusive(A, zero, combine, node):
    n = len(A)

    exclusiveA = [zero] * n
    for i in range(n):
        for j in range(n):
            if i != j:
                exclusiveA[i] = combine(exclusiveA[i], A[j], node, j)
    return exclusiveA

def treeDP(root, graph, default, combine, finalize = lambda nodeDP,node,eind: nodeDP):
  DP = [0] * len(graph)
  def dfs(node, parent=-1):
    nodeDP = default[node]
    for eind, nei in enumerate(graph[node]):
      if nei != parent:
        neiDP = dfs(nei, node)
        nodeDP = combine(nodeDP, neiDP, node, eind)
    parent_eind = -1 if parent == -1 else graph[node].index(parent)
    DP[node] = finalize(nodeDP, node, parent_eind)
    return DP[node]
  
  dfs(root)
  return DP

def rerooter(graph, default, combine, finalize = lambda nodeDP,node,eind: nodeDP):
    n = len(graph)
    rootDP = [0] * n
    forwardDP = [None] * n
    reverseDP = [None] * n
 
    # Compute DP for root=0
    DP = treeDP(0, graph, default, combine, finalize)
     
    # Do rerooting using the exclusive function
    def dfs(node, parent=-1, parent_eind=-1):
        forwardDP[node] = [DP[nei] for nei in graph[node]]
        rerootDP = exclusive(forwardDP[node], default[node], combine, node)
        reverseDP[node] = [finalize(nodeDP, node, eind) for eind, nodeDP in enumerate(rerootDP)]
        nodeDP = combine(rerootDP[0], forwardDP[node][0], node, 0) if n > 1 else default[node]
        rootDP[node] = finalize(nodeDP, node, -1)
        for eind, nei in enumerate(graph[node]):
            if nei != parent:
                DP[node] = reverseDP[node][eind]
                dfs(nei, node, eind)
    dfs(0)
    return rootDP, forwardDP, reverseDP

5. The exclusive segment tree

We are almost done implementing the fast reroot template. The only operation left to speed up is the function exclusive. Currently it runs in $$$O(\sum \text{deg}^2)$$$ time. The trick to make exclusive run in $$$O(\sum \text{deg} \log{(\text{deg})})$$$ time is to create something similar to a segment tree.

Suppose you have a segment tree where each node in the segment tree accumulates all of the values outside of its interval. The leaves of such a segment tree can then be used as the output of exclusive. I call this data structure the exclusive segment tree.
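One way to picture this is as a recursive, top-down sketch. This is only an illustration under simplifying assumptions: `exclusive_recursive` is a hypothetical name, `combine` takes just two arguments here, and the values are not folded in index order (so the result of `combine` should not depend on the order of the neighbours).

```python
def exclusive_recursive(A, zero, combine):
    out = [zero] * len(A)

    def build(lo, hi, acc):
        # Invariant: acc is the fold of every A[j] with j outside [lo, hi)
        if hi - lo == 1:
            out[lo] = acc
            return
        mid = (lo + hi) // 2
        left_acc = acc   # the left child excludes [lo, mid), so fold in [mid, hi)
        for j in range(mid, hi):
            left_acc = combine(left_acc, A[j])
        right_acc = acc  # the right child excludes [mid, hi), so fold in [lo, mid)
        for j in range(lo, mid):
            right_acc = combine(right_acc, A[j])
        build(lo, mid, left_acc)
        build(mid, hi, right_acc)

    if A:
        build(0, len(A), zero)
    return out

A = [1, 2, 3, 4, 5]
print(exclusive_recursive(A, 0, lambda acc, x: acc + x))  # [14, 13, 12, 11, 10]
```

Each level of the recursion calls combine $$$O(n)$$$ times in total and there are $$$O(\log n)$$$ levels, which is where the $$$O(n \log n)$$$ bound comes from.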

Example: Exclusive segment tree of size n = 8

The exclusive segment tree is naturally built from top to bottom, taking $$$O(n \log n)$$$ time. Here is an implementation of rerooter using the exclusive segment tree:

Rerooting using exclusivity (O(sum deg log(deg)) version)

def exclusive(A, zero, combine, node):
    n = len(A)
    exclusiveA = [zero] * n # Exclusive segment tree
 
    # Build exclusive segment tree
    for bit in range(n.bit_length())[::-1]:
        for i in range(n)[::-1]:
            # Propagate values down the segment tree    
            exclusiveA[i] = exclusiveA[i // 2]
        for i in range(n & ~(bit == 0)):
            # Fold A[i] into exclusive segment tree
            ind = (i >> bit) ^ 1
            exclusiveA[ind] = combine(exclusiveA[ind], A[i], node, i)
    return exclusiveA

def treeDP(root, graph, default, combine, finalize = lambda nodeDP,node,eind: nodeDP):
  DP = [0] * len(graph)
  def dfs(node, parent=-1):
    nodeDP = default[node]
    for eind, nei in enumerate(graph[node]):
      if nei != parent:
        neiDP = dfs(nei, node)
        nodeDP = combine(nodeDP, neiDP, node, eind)
    parent_eind = -1 if parent == -1 else graph[node].index(parent)
    DP[node] = finalize(nodeDP, node, parent_eind)
    return DP[node]
  
  dfs(root)
  return DP

def rerooter(graph, default, combine, finalize = lambda nodeDP,node,eind: nodeDP):
    n = len(graph)
    rootDP = [0] * n
    forwardDP = [None] * n
    reverseDP = [None] * n
 
    # Compute DP for root=0
    DP = treeDP(0, graph, default, combine, finalize)
     
    # Do rerooting using the exclusive function
    def dfs(node, parent=-1, parent_eind=-1):
        forwardDP[node] = [DP[nei] for nei in graph[node]]
        rerootDP = exclusive(forwardDP[node], default[node], combine, node)
        reverseDP[node] = [finalize(nodeDP, node, eind) for eind, nodeDP in enumerate(rerootDP)]
        nodeDP = combine(rerootDP[0], forwardDP[node][0], node, 0) if n > 1 else default[node]
        rootDP[node] = finalize(nodeDP, node, -1)
        for eind, nei in enumerate(graph[node]):
            if nei != parent:
                DP[node] = reverseDP[node][eind]
                dfs(nei, node, eind)
    dfs(0)
    return rootDP, forwardDP, reverseDP

This algorithm runs in $$$O(\sum \text{deg} \log{(\text{deg})})$$$, so we are essentially done. However, this implementation uses recursive DFS which, especially in Python, is a huge drawback. Recursion in Python is both relatively slow and incredibly memory hungry. So for a far more practical version, I've also implemented this same algorithm using a BFS instead of a DFS. This gives us the final version of the ultimate rerooter template!

Rerooting using exclusivity (O(sum deg log(deg)) version with BFS)

def exclusive(A, zero, combine, node):
    n = len(A)
    exclusiveA = [zero] * n # Exclusive segment tree
 
    # Build exclusive segment tree
    for bit in range(n.bit_length())[::-1]:
        for i in range(n)[::-1]:
            # Propagate values down the segment tree    
            exclusiveA[i] = exclusiveA[i // 2]
        for i in range(n & ~(bit == 0)):
            # Fold A[i] into exclusive segment tree
            ind = (i >> bit) ^ 1
            exclusiveA[ind] = combine(exclusiveA[ind], A[i], node, i)
    return exclusiveA
 
def rerooter(graph, default, combine, finalize = lambda nodeDP,node,eind: nodeDP):
    n = len(graph)
    rootDP = [0] * n
    forwardDP = [None] * n
    reverseDP = [None] * n
 
    # Compute DP for root=0
    DP = [0] * n
    bfs = [0]
    P = [0] * n
    for node in bfs:
        for nei in graph[node]:
            if P[node] != nei:
                P[nei] = node
                bfs.append(nei)
 
    for node in reversed(bfs):
        nodeDP = default[node]
        for eind, nei in enumerate(graph[node]):
            if P[node] != nei:
                nodeDP = combine(nodeDP, DP[nei], node, eind)
        DP[node] = finalize(nodeDP, node, graph[node].index(P[node]) if node else -1)
    # DP for root=0 done
    
    # Use the exclusive function to reroot 
    for node in bfs:
        DP[P[node]] = DP[node]
        forwardDP[node] = [DP[nei] for nei in graph[node]]
        rerootDP = exclusive(forwardDP[node], default[node], combine, node)
        reverseDP[node] = [finalize(nodeDP, node, eind) for eind, nodeDP in enumerate(rerootDP)]
        rootDP[node] = finalize((combine(rerootDP[0], forwardDP[node][0], node, 0) if n > 1 else default[node]), node, -1)
        for nei, dp in zip(graph[node], reverseDP[node]):
            DP[nei] = dp
    return rootDP, forwardDP, reverseDP

rerooting technique, tree dp, dp, tree, segment tree, template, tutorial, o(nlogn)

+489

pajenegod
7 months ago
61

I don't want this to get normalized

By pajenegod, history, 12 months ago, In English

I don't think blogs like this one should be normal on Codeforces.

I know that sometimes comments can be frustrating. Sometimes it is the commenter's fault, and sometimes it is the community's fault for upvoting mean-spirited comments. I understand that in some cases, the criticism the commenter receives is fair and well-deserved.

But there is a fine line between criticism, saying that you didn't like the comment, and hanging out a comment on the front page and letting everyone attack it. The former two are acceptable (and sometimes even needed, because commenters have to improve and learn from their mistakes); the latter one, in my opinion, should not be acceptable.

I think that blogs discussing toxicity on codeforces are a good thing, but singling out a comment made by some kid onto the front page of Codeforces is a very inconsiderate and irresponsible thing to do.

+391

pajenegod
12 months ago
21

Tutorial: A simple O(n log n) polynomial multiplication algorithm

By pajenegod, history, 13 months ago, In English

Hi Codeforces!

I have something exciting to tell you guys about today! I have recently come up with a really neat and simple recursive algorithm for multiplying polynomials in $$$O(n \log n)$$$ time. It is so neat and simple that I think it might possibly revolutionize the way that fast polynomial multiplication is taught and coded. You don't need to know anything about FFT to understand and implement this algorithm.

Big thanks to nor, c1729 and Spheniscine for discussing the contents of the blog with me and coming up with ideas for how to improve the blog =).

I've split this blog up into two parts. The first part is intended for anyone to be able to read and understand. The second part is advanced and goes into a ton of interesting ideas and concepts related to this algorithm.

Prerequisite: Polynomial quotient and remainder, see Wiki article and this Stackexchange example.

Task:

Given two polynomials $$$P$$$ and $$$Q$$$, an integer $$$n$$$ and a non-zero complex number $$$c$$$, where degree $$$P < n$$$ and degree $$$Q < n$$$. Your task is to calculate the polynomial $$$P(x) \, Q(x) \% (x^n - c)$$$ in $$$O(n \log n)$$$ time. You may assume that $$$n$$$ is a power of two.
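For reference, the task can of course be solved directly in $$$O(n^2)$$$ time. The sketch below is a brute-force baseline that is handy for testing a fast implementation against; the name `naive_polymult_mod` is made up for this illustration.

```python
def naive_polymult_mod(P, Q, n, c):
    # Schoolbook product of P and Q, reduced modulo x^n - c.
    # Any term x^(n + k) is folded back down using x^n = c.
    res = [0] * n
    for i in range(n):
        for j in range(n):
            if i + j < n:
                res[i + j] += P[i] * Q[j]
            else:
                # i + j <= 2n - 2, so a single wrap-around is enough
                res[i + j - n] += c * P[i] * Q[j]
    return res

# (1 + 2x)(3 + 4x) = 3 + 10x + 8x^2, and x^2 = 5 turns 8x^2 into 40
print(naive_polymult_mod([1, 2], [3, 4], 2, 5))  # [43, 10]
```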

Solution:

We can create a divide and conquer algorithm for $$$P(x) \, Q(x) \% (x^n - c)$$$ based on the difference of squares formula. Assuming $$$n$$$ is even, then $$$(x^n - c) = (x^{n/2} - \sqrt{c}) (x^{n/2} + \sqrt{c})$$$. The idea behind the algorithm is to calculate $$$P(x) \, Q(x) \% (x^{n/2} - \sqrt{c})$$$ and $$$P(x) \, Q(x) \% (x^{n/2} + \sqrt{c})$$$ using 2 recursive calls, and then use that result to calculate $$$P(x) \, Q(x) \% (x^n - c)$$$.

So how do we actually calculate $$$P(x) \, Q(x) \% (x^n - c)$$$ using $$$P(x) \, Q(x) \% (x^{n/2} - \sqrt{c})$$$ and $$$P(x) \, Q(x) \% (x^{n/2} + \sqrt{c})$$$?

Well, we can use the following formula:

$$$ \begin{aligned} A(x) \% (x^n - c) = &\frac{1}{2} \left(1 + \frac{x^{n/2}}{\sqrt{c}}\right) \left(A(x) \% (x^{n/2} - \sqrt{c})\right) \, + \\ &\frac{1}{2} \left(1 - \frac{x^{n/2}}{\sqrt{c}}\right) \left(A(x) \% (x^{n/2} + \sqrt{c})\right). \end{aligned} $$$
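As a quick sanity check (a hand-worked example), take $$$n = 2$$$, $$$c = 4$$$ and $$$A(x) = x^3$$$, so that $$$\sqrt{c} = 2$$$. Then $$$A(x) \% (x - 2) = A(2) = 8$$$ and $$$A(x) \% (x + 2) = A(-2) = -8$$$, and the formula gives

$$$ \frac{1}{2} \left(1 + \frac{x}{2}\right) \cdot 8 + \frac{1}{2} \left(1 - \frac{x}{2}\right) \cdot (-8) = (4 + 2x) + (-4 + 2x) = 4x, $$$

which agrees with the direct computation $$$x^3 = x \cdot x^2$$$, where $$$x^2$$$ is replaced by $$$4$$$ modulo $$$x^2 - 4$$$.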

Proof of the formula

This formula is very useful. If we substitute $$$A(x)$$$ by $$$P(x) Q(x)$$$, then the formula tells us how to calculate $$$P(x) \, Q(x) \% (x^n - c)$$$ using $$$P(x) \, Q(x) \% (x^{n/2} - \sqrt{c})$$$ and $$$P(x) \, Q(x) \% (x^{n/2} + \sqrt{c})$$$ in linear time. With this we have the recipe for implementing an $$$O(n \log n)$$$ divide and conquer algorithm:

Input:

Integer $$$n$$$ (power of 2),
Non-zero complex number $$$c$$$,
Two polynomials $$$P(x) \% (x^n - c)$$$ and $$$Q(x) \% (x^n - c)$$$.

Output:

The polynomial $$$P(x) \, Q(x) \% (x^n - c)$$$.

Algorithm:

Step 1. (Base case) If $$$n = 1$$$, then return $$$P(0) \cdot Q(0)$$$. Otherwise:

Step 2. Starting from $$$P(x) \% (x^n - c)$$$ and $$$Q(x) \% (x^n - c)$$$, in $$$O(n)$$$ time calculate

$$$ \begin{align} & P(x) \% (x^{n/2} - \sqrt{c}), \\ & Q(x) \% (x^{n/2} - \sqrt{c}), \\ & P(x) \% (x^{n/2} + \sqrt{c}) \text{ and} \\ & Q(x) \% (x^{n/2} + \sqrt{c}). \end{align} $$$

Step 3. Make two recursive calls to calculate $$$P(x) \, Q(x) \% (x^{n/2} - \sqrt{c})$$$ and $$$P(x) \, Q(x) \% (x^{n/2} + \sqrt{c})$$$.

Step 4. Using the formula, calculate $$$P(x) \, Q(x) \% (x^n - c)$$$ in $$$O(n)$$$ time. Return the result.

Here is a Python implementation following this recipe:

Python solution to the task

"""
Calculates P(x) * Q(x) % (x^n - c) in O(n log n) time

Input:
  n: Integer, needs to be power of 2
  c: Non-zero complex floating point number
  P: A list of length n representing a polynomial P(x) % (x^n - c)
  Q: A list of length n representing a polynomial Q(x) % (x^n - c)
Output:
  A list of length n representing the polynomial P(x) * Q(x) % (x^n - c)
"""
def fast_polymult_mod(P, Q, n, c):
    assert len(P) == n and len(Q) == n
    
    # Base case
    if n == 1:
        return [P[0] * Q[0]]

    assert n % 2 == 0
    import cmath
    sqrtc = cmath.sqrt(c)

    # Calculate P_minus := P mod (x^(n/2) - sqrt(c))
    #           Q_minus := Q mod (x^(n/2) - sqrt(c))

    P_minus = [p1 + sqrtc * p2 for p1,p2 in zip(P[:n//2], P[n//2:])]
    Q_minus = [q1 + sqrtc * q2 for q1,q2 in zip(Q[:n//2], Q[n//2:])]

    # Calculate P_plus := P mod (x^(n/2) + sqrt(c))
    #           Q_plus := Q mod (x^(n/2) + sqrt(c))

    P_plus = [p1 - sqrtc * p2 for p1,p2 in zip(P[:n//2], P[n//2:])]
    Q_plus = [q1 - sqrtc * q2 for q1,q2 in zip(Q[:n//2], Q[n//2:])]

    # Recursively calculate PQ_minus := P * Q % (x^(n/2) - sqrt(c))
    #                       PQ_plus  := P * Q % (x^(n/2) + sqrt(c))
    
    PQ_minus = fast_polymult_mod(P_minus, Q_minus, n//2, sqrtc)
    PQ_plus  = fast_polymult_mod(P_plus,  Q_plus,  n//2, -sqrtc)

    # Calculate PQ mod (x^n - c) using PQ_minus and PQ_plus
    PQ = [(m + p)/2         for m,p in zip(PQ_minus, PQ_plus)] +\
         [(m - p)/(2*sqrtc) for m,p in zip(PQ_minus, PQ_plus)]
    
    return PQ

One final thing that I want to mention before going into the advanced section is that this algorithm can also be used to do fast unmodded polynomial multiplication, i.e. given polynomials $$$P(x)$$$ and $$$Q(x)$$$ calculate $$$P(x) \, Q(x)$$$. The trick is simply to pick $$$n$$$ large enough such that $$$P(x) \, Q(x) = P(x) \, Q(x) \% (x^n - c)$$$, and then use the exact same algorithm as before. $$$c$$$ can be arbitrarily picked (any non-zero complex number works).

Python implementation for general Fast polynomial multiplication

"""
Calculates P(x) * Q(x)

Input:
  P: A list representing a polynomial P(x)
  Q: A list representing a polynomial Q(x)
Output:
  A list representing the polynomial P(x) * Q(x)
"""
def fast_polymult(P, Q):
    # Calculate length of the list representing P*Q
    n1 = len(P)
    n2 = len(Q)
    res_len = n1 + n2 - 1
    
    # Pick n sufficiently big
    n = 1
    while n < res_len:
        n *= 2

    # Pad with extra 0s to reach length n
    P = P + [0] * (n - n1)
    Q = Q + [0] * (n - n2)
    
    # Pick non-zero c arbitrarily =)
    c = 123.24

    # Calculate P*Q mod x^n - c
    PQ = fast_polymult_mod(P, Q, n, c)

    # Remove extra 0 padding and return
    return PQ[:res_len]

If you want to try out implementing this algorithm yourself, then here is a very simple problem to test out your implementation on: SPOJ:POLYMUL.

(Advanced) Speeding up the algorithm

This section will be about tricks that can be used to speed up the algorithm. The first two tricks will speed up the algorithm by a factor of 2 each. The last trick is advanced, and it has the potential to both speed up the algorithm and also make it more numerically stable.

$n$ doesn't actually need to be a power of 2

We don't actually need the assumption that $$$n$$$ is a power of 2. If $$$n$$$ ever becomes odd during the recursion, then we have two choices: Either fall back to a $$$O(n^2)$$$ algorithm or fall back to the unmodded $$$O(n \log{n})$$$ Polynomial multiplication algorithm.

Let us discuss the run time of falling back to the $$$O(n^2)$$$ algorithm when $$$n$$$ becomes odd. Assume that $$$n = a \cdot 2^b$$$, where $$$a$$$ is an odd integer and $$$b$$$ is an integer. Think of the recursive algorithm as having layers, one layer for each possible value of $$$n$$$. The first $$$b$$$ layers will all take $$$O(n)$$$ time each. In the $$$(b+1)$$$-th layer the value of $$$n$$$ is $$$a$$$. Using the $$$O(n^2)$$$ polynomial multiplication algorithm leads to this layer taking $$$O(n/a \cdot a^2) = O(n \cdot a)$$$ time. The final time complexity comes out to be $$$O((a + b) \, n)$$$.
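To get a feeling for the $$$O((a + b) \, n)$$$ bound, the following sketch splits a few values of $$$n$$$ into $$$a \cdot 2^b$$$ and prints the rough cost $$$(a + b) \cdot n$$$, ignoring constant factors. The helper name `odd_part_and_exponent` is made up for this illustration.

```python
def odd_part_and_exponent(n):
    # Write n = a * 2**b with a odd, and return (a, b)
    b = 0
    while n % 2 == 0:
        n //= 2
        b += 1
    return n, b

for n in (1024, 1152, 1000):
    a, b = odd_part_and_exponent(n)
    # Rough cost estimate (a + b) * n, constants ignored
    print(n, a, b, (a + b) * n)
```

A value like $$$n = 1152 = 9 \cdot 2^7$$$ stays close to the power-of-two case, while $$$n = 1000 = 125 \cdot 2^3$$$ is dominated by the quadratic base case.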

Python implementation that works for both odd and even $n$

"""
Calculates P(x) * Q(x) % (x^n - c) in O((a + b) * n) time, where n = a*2^b.

Input:
  n: Integer
  c: Non-zero complex floating point number
  P: A list of length n representing a polynomial P(x) % (x^n - c)
  Q: A list of length n representing a polynomial Q(x) % (x^n - c)
Output:
  A list of length n representing the polynomial P(x) * Q(x) % (x^n - c)
"""
def fast_polymult_mod2(P, Q, n, c):
    assert len(P) == n and len(Q) == n
    
    # Base case (n is odd)
    if n & 1:
        # Calculate the answer in O(n^2) time
        res = [0] * (2*n)
        for i in range(n):
            for j in range(n):
                res[i + j] += P[i] * Q[j]
        return [r1 + c * r2 for r1,r2 in zip(res[:n], res[n:])]

    assert n % 2 == 0
    import cmath
    sqrtc = cmath.sqrt(c)

    # Calculate P_minus := P mod (x^(n/2) - sqrt(c))
    #           Q_minus := Q mod (x^(n/2) - sqrt(c))

    P_minus = [p1 + sqrtc * p2 for p1,p2 in zip(P[:n//2], P[n//2:])]
    Q_minus = [q1 + sqrtc * q2 for q1,q2 in zip(Q[:n//2], Q[n//2:])]

    # Calculate P_plus := P mod (x^(n/2) + sqrt(c))
    #           Q_plus := Q mod (x^(n/2) + sqrt(c))

    P_plus = [p1 - sqrtc * p2 for p1,p2 in zip(P[:n//2], P[n//2:])]
    Q_plus = [q1 - sqrtc * q2 for q1,q2 in zip(Q[:n//2], Q[n//2:])]

    # Recursively calculate PQ_minus := P * Q % (x^(n/2) - sqrt(c))
    #                       PQ_plus  := P * Q % (x^(n/2) + sqrt(c))
    
    PQ_minus = fast_polymult_mod2(P_minus, Q_minus, n//2, sqrtc)
    PQ_plus  = fast_polymult_mod2(P_plus,  Q_plus,  n//2, -sqrtc)

    # Calculate PQ mod (x^n - c) using PQ_minus and PQ_plus
    PQ = [(m + p)/2         for m,p in zip(PQ_minus, PQ_plus)] +\
         [(m - p)/(2*sqrtc) for m,p in zip(PQ_minus, PQ_plus)]
    
    return PQ

The reason why this is super useful is that it allows us to speed up the fast unmodded polynomial multiplication algorithm. As long as we are fine with $$$a$$$ being less than say $$$10$$$, then we might be able to choose a significantly smaller $$$n$$$ compared to what would be possible if we were allowed to only choose powers of two. This trick has the potential of making the fast unmodded polynomial multiplication algorithm run twice as fast.
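A small numeric illustration of the saving, assuming the limit $$$a \le 10$$$ mentioned above. The two helper names are made up for this sketch; it only compares padded lengths and does not multiply anything.

```python
def pow2_length(res_len):
    # Smallest power of two that is >= res_len
    n = 1
    while n < res_len:
        n *= 2
    return n

def mixed_length(res_len, alim=10):
    # Round res_len up to a multiple of 2**b, where b is the
    # smallest exponent with alim * 2**b >= res_len
    b = 0
    while alim * 2**b < res_len:
        b += 1
    a = (res_len - 1) // 2**b + 1
    return a * 2**b

for res_len in (1024, 1025, 3000):
    print(res_len, pow2_length(res_len), mixed_length(res_len))
```

For res_len = 1025 the power-of-two rule pads all the way up to 2048, while the mixed rule only pads to $$$1152 = 9 \cdot 2^7$$$. When res_len is already a power of two, both rules agree.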

Python implementation for more efficient fast unmodded polynomial multiplication

"""
Calculates P(x) * Q(x)

Input:
  P: A list representing a polynomial P(x)
  Q: A list representing a polynomial Q(x)
Output:
  A list representing the polynomial P(x) * Q(x)
"""
def fast_polymult2(P, Q):
    # Calculate length of the list representing P*Q
    n1 = len(P)
    n2 = len(Q)
    res_len = n1 + n2 - 1
    
    # Pick n sufficiently big
    b = 0
    alim = 10
    while alim * 2**b < res_len:
        b += 1
    a = (res_len - 1) // 2**b + 1
    n = a * 2**b

    # Pad with extra 0s to reach length n
    P = P + [0] * (n - n1)
    Q = Q + [0] * (n - n2)
    
    # Pick non-zero c arbitrarily =)
    c = 123.24

    # Calculate P*Q mod x^n - c
    PQ = fast_polymult_mod2(P, Q, n, c)

    # Remove extra 0 padding and return
    return PQ[:res_len]

Imaginary-cyclic convolution

Calculating fast_polymult_mod(P, Q, n, c) using fast_polymult_mod(P, Q, n, 1) (reweight technique)

(Advanced) -is-this-fft-?

This algorithm is actually FFT in disguise. But it is also different compared to any other FFT algorithm that I've seen in the past (for example the Cooley–Tukey FFT algorithm).

Using this algorithm to calculate FFT

This algorithm is not the same algorithm as Cooley–Tukey

FFT implementation in Python based on this algorithm

Here is an FFT implementation. A codegolfed version of the same code can be found on Pyrival.

"""
Calculates FFT(P) in O(n log n) time.

This implementation is based on the 
polynomial modulo multiplication algorithm.

Input:
  P: A list of length n representing a polynomial P(x).
     n needs to be a power of 2.
Output:
  A list of length n representing the FFT of the polynomial P,
     i.e. the list [P(exp(2j * pi / n * i)) for i in range(n)]
"""
rt = [1] # List used to store roots of unity
def FFT(P):
    n = len(P)
    # Assert n is a power of 2
    assert n and (n - 1) & n == 0
    # Make a copy of P to not modify original P
    P = P[:] 
    
    # Precalculate the roots
    while 2 * len(rt) < n:
        # 4*len(rt)-th root of unity
        import cmath
        root = cmath.exp(2j * cmath.pi / (4 * len(rt)))
        rt.extend([r * root for r in rt])

    # Transform P
    k = n
    while k > 1:
        for i in range(n//k):
            r = rt[i]
            for j1 in range(i*k, i*k + k//2):
                j2 = j1 + k//2
                z = r * P[j2]
                P[j2] = P[j1] - z
                P[j1] += z
        k //= 2
    
    # Bit reverse P before returning
    rev = [0] * n
    for i in range(1, n):
        rev[i] = rev[i // 2] // 2 + (i & 1) * n // 2

    return [P[r] for r in rev]

# Inverse of FFT(P) using a standard trick
def inverse_FFT(fft_P):
    n = len(fft_P)
    return FFT([fft_P[-i]/n for i in range(n)])

"""
Calculates P(x) * Q(x)

Input:
  P: A list representing a polynomial P(x)
  Q: A list representing a polynomial Q(x)
Output:
  A list representing the polynomial P(x) * Q(x)
"""
def fast_polymult_using_FFT(P, Q):
    # Calculate length of the list representing P*Q
    n1 = len(P)
    n2 = len(Q)
    res_len = n1 + n2 - 1
    
    # Pick n sufficiently big
    n = 1
    while n < res_len:
        n *= 2

    # Pad with extra 0s to reach length n
    P = P + [0] * (n - n1)
    Q = Q + [0] * (n - n2)
    
    # Transform P and Q
    fft_P = FFT(P)
    fft_Q = FFT(Q)

    # Calculate FFT of P*Q
    fft_PQ = [p*q for p,q in zip(fft_P,fft_Q)]

    # Inverse FFT
    PQ = inverse_FFT(fft_PQ)
    
    # Remove padding and return
    return PQ[:res_len]

FFT implementation in C++ based on this algorithm

Here is an FFT implementation. It is coded in the same style as in KACTL.

typedef complex<double> C;
typedef vector<double> vd;
void fft(vector<C>& a) {
	int n = sz(a);
	static vector R{1.L + 0il};
	static vector rt{1. + 0i};
	for (static int k = 2; k < n; k *= 2) {
		R.resize(n/2); rt.resize(n/2);
		auto x = pow(1il, 2./k);
		rep(i,k/2,k) rt[i] = R[i] = R[i-k/2] * x;
	}
	for (int k = n; k > 1; k /= 2) rep(i,0,n/k) rep(j,i*k,i*k + k/2) {
		C &u = a[j], &v = a[j+k/2], &r = rt[i];
		C z(v.real()*r.real() - v.imag()*r.imag(), 
		    v.real()*r.imag() + v.imag()*r.real());
		v = u - z;
		u = u + z;
	}
	vi rev(n);
	rep(i,0,n) rev[i] = rev[i / 2] / 2 + (i & 1) * n / 2;
	rep(i,0,n) if (i < rev[i]) swap(a[i], a[rev[i]]);
}

vd conv(const vd& a, const vd& b) {
	if (a.empty() || b.empty()) return {};
	vd res(sz(a) + sz(b) - 1);
	int L = 32 - __builtin_clz(sz(res)), n = 1 << L;
	vector<C> in(n), out(n);
	copy(all(a), begin(in));
	rep(i,0,sz(b)) in[i].imag(b[i]);
	fft(in);
	for (C& x : in) x *= x;
	rep(i,0,n) out[i] = in[-i & (n - 1)] - conj(in[i]);
	fft(out);
	rep(i,0,sz(res)) res[i] = imag(out[i]) / (4 * n);
	return res;
}

(Advanced) Connection between this algorithm and NTT

Just like how there is FFT and NTT, there are two variants of this algorithm too. One using complex floating point numbers, and the other using modulo a prime (or more generally modulo an odd composite number).
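As a small illustration of the modular variant, a power of a quadratic non-residue can play the role that $$$e^{2 \pi i / n}$$$ plays for complex numbers. The sketch below assumes the prime $$$998244353$$$ and checks the two defining properties of such a root for $$$n = 8$$$; the variable names are chosen for this example only.

```python
MOD = (119 << 23) + 1  # 998244353, a prime with MOD - 1 divisible by 2**23

# Find a quadratic non-residue g, i.e. g^((MOD-1)/2) = -1 (mod MOD)
g = 2
while pow(g, (MOD - 1) // 2, MOD) != MOD - 1:
    g += 1

n = 8
root = pow(g, (MOD - 1) // n, MOD)  # modular stand-in for exp(2j * pi / n)

print(pow(root, n, MOD) == 1)             # root^n = 1
print(pow(root, n // 2, MOD) == MOD - 1)  # root^(n/2) = -1, so the order is exactly n
```

Since root^(n/2) equals g^((MOD-1)/2), which is -1 by the choice of g, the order of root is exactly n.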

Using modulo integers instead of complex numbers

Calculating fast_polymult_mod(P, Q, n, c) using fast_polymult_mod(P, Q, 2*n, 1)

This algorithm works to some degree even for bad NTT primes

NTT implementation in Python based on this algorithm

Here is an NTT implementation. A codegolfed version of the same code can be found on Pyrival.

# Mod used for NTT
# Requirement: Any odd integer > 2
# It is important that MOD - 1 is
# divisible by lots of 2s
MOD = (119 << 23) + 1
assert MOD & 1

# Precalc non-quadratic_residue (used by the NTT)
non_quad_res = 2
while pow(non_quad_res, MOD//2, MOD) != MOD - 1:
    non_quad_res += 1
rt = [1]

"""
Calculates NTT(P) in O(n log n) time.

This implementation is based on the 
polynomial modulo multiplication algorithm.

Input:
  P: A list of length n representing a polynomial P(x).
     n needs to be a power of 2.
Output:
  A list of length n representing the NTT of the polynomial P,
     i.e. the list [P(root**i) % MOD for i in range(n)]
     where root is an n-th root of unity mod MOD
"""
def NTT(P):
    n = len(P)
    # Assert n is a power of 2
    assert n and (n - 1) & n == 0

    # Check that NTT is defined for this n
    assert (MOD - 1) % n == 0

    # Make a copy of P to not modify original P
    P = P[:] 
    
    # Precalculate the roots
    while 2 * len(rt) < n:
        # 4*len(rt)-th root of unity
        root = pow(non_quad_res, MOD//(4 * len(rt)), MOD)
        rt.extend([r * root % MOD for r in rt])

    # Transform P
    k = n
    while k > 1:
        for i in range(n//k):
            r = rt[i]
            for j1 in range(i*k, i*k + k//2):
                j2 = j1 + k//2
                z = r * P[j2]
                P[j2] = (P[j1] - z) % MOD
                P[j1] = (P[j1] + z) % MOD
        k //= 2
    
    # Bit reverse P before returning
    rev = [0] * n
    for i in range(1, n):
        rev[i] = rev[i // 2] // 2 + (i & 1) * n // 2

    return [P[r] for r in rev]

# Inverse of NTT(P) using a standard trick
def inverse_NTT(ntt_P):
    n = len(ntt_P)
    n_inv = pow(n, -1, MOD) # Requires Python 3.8+
    # The following works in any Python version, but requires MOD to be prime
    # n_inv = pow(n, MOD - 2, MOD)
    assert n * n_inv % MOD == 1
    return NTT([ntt_P[-i] * n_inv % MOD for i in range(n)])

"""
Calculates P(x) * Q(x) (where the coefficients are returned % MOD)

Input:
  P: A list representing a polynomial P(x)
  Q: A list representing a polynomial Q(x)
Output:
  A list representing the polynomial P(x) * Q(x) (with coefficients % MOD)
"""
def fast_polymult_using_NTT(P, Q):
    # Calculate length of the list representing P*Q
    n1 = len(P)
    n2 = len(Q)
    res_len = n1 + n2 - 1
    
    # Pick n sufficiently big
    n = 1
    while n < res_len:
        n *= 2

    # Pad with extra 0s to reach length n
    P = P + [0] * (n - n1)
    Q = Q + [0] * (n - n2)
    
    # Transform P and Q
    ntt_P = NTT(P)
    ntt_Q = NTT(Q)

    # Calculate NTT of P*Q
    ntt_PQ = [p * q % MOD for p,q in zip(ntt_P,ntt_Q)]

    # Inverse NTT
    PQ = inverse_NTT(ntt_PQ)
    
    # Remove padding and return
    return PQ[:res_len]
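
The "standard trick" used by inverse_NTT above (run the forward transform on the index-reversed list and scale by $$$n^{-1}$$$) can be checked in isolation. Here is a minimal sketch that uses a naive $$$O(n^2)$$$ transform instead of the NTT above. The names naive_dft and naive_inverse_dft are made up for this illustration, and I'm assuming MOD = 998244353 with primitive root 3.

```python
MOD = 998244353  # assumed prime, MOD - 1 is divisible by 2^23
G = 3            # a primitive root mod 998244353

def naive_dft(P, w):
    # Evaluate P at w^0, w^1, ..., w^(n-1) in O(n^2) time
    n = len(P)
    return [sum(P[j] * pow(w, i * j, MOD) for j in range(n)) % MOD
            for i in range(n)]

def naive_inverse_dft(dft_P, w):
    # Same trick as inverse_NTT: forward transform of the
    # index-reversed list, scaled by n^-1
    n = len(dft_P)
    n_inv = pow(n, MOD - 2, MOD)
    return naive_dft([dft_P[-i] * n_inv % MOD for i in range(n)], w)

P = [5, 0, 7, 1, 2, 9, 9, 4]
w = pow(G, (MOD - 1) // len(P), MOD)  # primitive 8th root of unity
print(naive_inverse_dft(naive_dft(P, w), w) == P)
```

The reason this works is that reversing the indices replaces $$$w$$$ by $$$w^{-1}$$$ in the transform, which is exactly what the inverse needs.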

NTT implementation in C++ based on this algorithm

Here is an NTT implementation. It is coded in the same style as in KACTL.

const ll mod = (119 << 23) + 1;// = 998244353
// For p < 2^30 there is also e.g. 5 << 25, 7 << 26, 479 << 21
// and 483 << 21 The last two are > 10^9.
typedef vector<ll> vl;

#include "../number-theory/ModPow.h"

void ntt(vl &a) {
	int n = sz(a);
	static ll r = 3;
	while(modpow(r, mod/2) + 1 < mod) ++r;
	static vl rt{1};
	for (static int k = 2; k < n; k *= 2) {
		rt.resize(n/2);
		ll x = modpow(r, mod/2/k);
		rep(i,k/2,k) rt[i] = rt[i-k/2] * x % mod;
	}
	for (int k = n; k > 1; k /= 2) rep(i,0,n/k) rep(j,i*k,i*k + k/2) {
		ll &u = a[j], &v = a[j+k/2], z = rt[i] * v % mod;
		v = u - z + (u < z ? mod : 0);
		u = u + z - (u + z >= mod ? mod : 0);
	}
	vi rev(n);
	rep(i,0,n) rev[i] = rev[i / 2] / 2 + (i & 1) * n / 2;
	rep(i,0,n) if (i < rev[i]) swap(a[i], a[rev[i]]);
}
vl conv(vl a, vl b) {
	
	if (a.empty() || b.empty()) return {};
	int s = sz(a) + sz(b) - 1, B = 32 - __builtin_clz(s), n = 1 << B;
	int inv = modpow(n, mod - 2);
	vl out(n);
	a.resize(n); b.resize(n);
	ntt(a), ntt(b);
	rep(i,0,n) out[-i & (n - 1)] = (ll)a[i] * b[i] % mod * inv % mod;
	ntt(out);
	return {out.begin(), out.begin() + s};
}

Blazingly fast NTT C++ implementation

(Advanced) Shorter implementations ("Codegolfed version")

It is possible to make really short but slightly less natural implementations of this algorithm. Originally I was thinking of using this shorter version in the blog, but in the end I didn't do it. So here they are. If you want a short implementation of this algorithm to use in practice, then I would recommend taking one of these implementations and porting it to C++.

Short Python implementation without any speedup tricks

"""
Calculates P(x) * Q(x) % (x^n - c) in O(n log n) time

Input:
  n: Integer, needs to be power of 2
  c: Non-zero complex floating point number
  P: A list of length 2*n representing a polynomial P(x) % (x^2n - c^2)
  Q: A list of length 2*n representing a polynomial Q(x) % (x^2n - c^2)
Output:
  A list of length n representing the polynomial P(x) * Q(x) % (x^n - c)
"""
def fast_polymult_mod3(P, Q, n, c):
    assert len(P) == 2*n and len(Q) == 2*n

    # Mod P and Q by (x^n - c)
    P = [p1 + c * p2 for p1,p2 in zip(P[:n], P[n:])]
    Q = [q1 + c * q2 for q1,q2 in zip(Q[:n], Q[n:])]
    
    # Base case
    if n == 1:
        return [P[0] * Q[0]]

    assert n % 2 == 0
    import cmath
    sqrtc = cmath.sqrt(c)

    # Recursively calculate PQ_minus := P * Q % (x^n/2 - sqrt(c)) 
    #                       PQ_plus  := P * Q % (x^n/2 + sqrt(c))
    
    PQ_minus = fast_polymult_mod3(P, Q, n//2, sqrtc)
    PQ_plus  = fast_polymult_mod3(P, Q, n//2, -sqrtc)

    # Calculate PQ mod (x^n - c) using PQ_minus and PQ_plus
    PQ = [(m + p)/2         for m,p in zip(PQ_minus, PQ_plus)] +\
         [(m - p)/(2*sqrtc) for m,p in zip(PQ_minus, PQ_plus)]
    
    return PQ

"""
Calculates P(x) * Q(x)

Input:
  P: A list representing a polynomial P(x)
  Q: A list representing a polynomial Q(x)
Output:
  A list representing the polynomial P(x) * Q(x)
"""
def fast_polymult3(P, Q):
    # Calculate length of the list representing P*Q
    n1 = len(P)
    n2 = len(Q)
    res_len = n1 + n2 - 1
    
    # Pick n sufficiently big
    n = 1
    while n < res_len:
        n *= 2

    # Pad with extra 0s to reach length 2*n
    P = P + [0] * (2*n - n1)
    Q = Q + [0] * (2*n - n2)
    
    # Pick non-zero c arbitrarily =)
    c = 123.24

    # Calculate P*Q mod x^n - c
    PQ = fast_polymult_mod3(P, Q, n, c)

    # Remove extra 0 padding and return
    return PQ[:res_len]

Short Python implementation supporting odd and even $n$ (making it up to 2 times faster)

"""
Calculates P(x) * Q(x) % (x^n - c) in O(n log n) time

Input:
  n: Positive integer (odd n is handled by the O(n^2) base case)
  c: Non-zero complex floating point number
  P: A list of length 2*n representing a polynomial P(x) % (x^2n - c^2)
  Q: A list of length 2*n representing a polynomial Q(x) % (x^2n - c^2)
Output:
  A list of length n representing the polynomial P(x) * Q(x) % (x^n - c)
"""
def fast_polymult_mod4(P, Q, n, c):
    assert len(P) == 2*n and len(Q) == 2*n

    # Mod P and Q by (x^n - c)
    P = [p1 + c * p2 for p1,p2 in zip(P[:n], P[n:])]
    Q = [q1 + c * q2 for q1,q2 in zip(Q[:n], Q[n:])]
    
    # Base case (n is odd)
    if n & 1:
        # Calculate the answer in O(n^2) time
        res = [0] * (2*n)
        for i in range(n):
            for j in range(n):
                res[i + j] += P[i] * Q[j]
        return [r1 + c * r2 for r1,r2 in zip(res[:n], res[n:])]
    
    assert n % 2 == 0
    import cmath
    sqrtc = cmath.sqrt(c)

    # Recursively calculate PQ_minus := P * Q % (x^n/2 - sqrt(c)) 
    #                       PQ_plus  := P * Q % (x^n/2 + sqrt(c))
    
    PQ_minus = fast_polymult_mod4(P, Q, n//2, sqrtc)
    PQ_plus  = fast_polymult_mod4(P, Q, n//2, -sqrtc)

    # Calculate PQ mod (x^n - c) using PQ_minus and PQ_plus
    PQ = [(m + p)/2         for m,p in zip(PQ_minus, PQ_plus)] +\
         [(m - p)/(2*sqrtc) for m,p in zip(PQ_minus, PQ_plus)]
    
    return PQ

"""
Calculates P(x) * Q(x)

Input:
  P: A list representing a polynomial P(x)
  Q: A list representing a polynomial Q(x)
Output:
  A list representing the polynomial P(x) * Q(x)
"""
def fast_polymult4(P, Q):
    # Calculate length of the list representing P*Q
    n1 = len(P)
    n2 = len(Q)
    res_len = n1 + n2 - 1
    
    # Pick n sufficiently big
    b = 0
    alim = 10
    while alim * 2**b < res_len:
        b += 1
    a = (res_len - 1) // 2**b + 1
    n = a * 2**b

    # Pad with extra 0s to reach length 2*n
    P = P + [0] * (2*n - n1)
    Q = Q + [0] * (2*n - n2)
    
    # Pick non-zero c arbitrarily =)
    c = 123.24

    # Calculate P*Q mod x^n - c
    PQ = fast_polymult_mod4(P, Q, n, c)

    # Remove extra 0 padding and return
    return PQ[:res_len]
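
To see what the size selection in fast_polymult4 buys, here is a small sketch that isolates it and compares it with rounding up to a power of 2. The helper names pick_n and next_power_of_two are made up for this illustration.

```python
def next_power_of_two(res_len):
    # Size picked by fast_polymult3
    n = 1
    while n < res_len:
        n *= 2
    return n

def pick_n(res_len, alim=10):
    # Same loop as in fast_polymult4: n = a * 2^b with a <= alim
    b = 0
    while alim * 2**b < res_len:
        b += 1
    a = (res_len - 1) // 2**b + 1
    return a, b, a * 2**b

for res_len in [100, 1025, 5000]:
    a, b, n = pick_n(res_len)
    print(res_len, next_power_of_two(res_len), n)
```

The recursion halves $$$n$$$ exactly $$$b$$$ times and then hits the $$$O(a^2)$$$ base case, so keeping $$$a$$$ small (at most alim) keeps the base case cheap while $$$n$$$ stays close to res_len.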

Short Python implementation supporting odd and even $n$ and imaginary cyclic convolution (making it up to 4 times faster)

"""
Calculates P(x) * Q(x) % (x^n - c) in O(n log n) time

Input:
  n: Positive integer (odd n is handled by the O(n^2) base case)
  c: Non-zero complex floating point number
  P: A list of length 2*n representing a polynomial P(x) % (x^2n - c^2)
  Q: A list of length 2*n representing a polynomial Q(x) % (x^2n - c^2)
Output:
  A list of length n representing the polynomial P(x) * Q(x) % (x^n - c)
"""
def fast_polymult_mod4(P, Q, n, c):
    assert len(P) == 2*n and len(Q) == 2*n

    # Mod P and Q by (x^n - c)
    P = [p1 + c * p2 for p1,p2 in zip(P[:n], P[n:])]
    Q = [q1 + c * q2 for q1,q2 in zip(Q[:n], Q[n:])]
    
    # Base case (n is odd)
    if n & 1:
        # Calculate the answer in O(n^2) time
        res = [0] * (2*n)
        for i in range(n):
            for j in range(n):
                res[i + j] += P[i] * Q[j]
        return [r1 + c * r2 for r1,r2 in zip(res[:n], res[n:])]
    
    assert n % 2 == 0
    import cmath
    sqrtc = cmath.sqrt(c)

    # Recursively calculate PQ_minus := P * Q % (x^n/2 - sqrt(c)) 
    #                       PQ_plus  := P * Q % (x^n/2 + sqrt(c))
    
    PQ_minus = fast_polymult_mod4(P, Q, n//2, sqrtc)
    PQ_plus  = fast_polymult_mod4(P, Q, n//2, -sqrtc)

    # Calculate PQ mod (x^n - c) using PQ_minus and PQ_plus
    PQ = [(m + p)/2         for m,p in zip(PQ_minus, PQ_plus)] +\
         [(m - p)/(2*sqrtc) for m,p in zip(PQ_minus, PQ_plus)]
    
    return PQ

"""
Calculates P(x) * Q(x) of two real polynomials

Input:
  P: A list representing a real polynomial P(x)
  Q: A list representing a real polynomial Q(x)
Output:
  A list representing the real polynomial P(x) * Q(x)
"""
def fast_polymult5(P, Q):
    # Calculate length of the list representing P*Q
    n1 = len(P)
    n2 = len(Q)
    res_len = n1 + n2 - 1
    
    # Pick n sufficiently big
    b = 1
    alim = 10
    while alim * 2**b < res_len:
        b += 1
    a = (res_len - 1) // 2**b + 1
    n = a * 2**b
    
    # Pick c = i (imaginary unit)
    c = 1j
    # and decrease the size of n by a factor of 2
    n //= 2

    # Pad with extra 0s to reach length 2*n
    P = P + [0] * (2*n - n1)
    Q = Q + [0] * (2*n - n2)
    
    # Calculate P*Q mod x^n - i
    PQ = fast_polymult_mod4(P, Q, n, c)

    # The imaginary part contains the "overflow"
    PQ = [pq.real for pq in PQ] + [pq.imag for pq in PQ]

    # Remove extra 0 padding and return
    return PQ[:res_len]


fft, convolution, polynomials, recursion

+349

pajenegod
13 months ago
18

I don't know how I've never seen this C++ feature before

By pajenegod, history, 16 months ago, In English

Take a look at this C++ submission 199864568:

#import <bits/stdc++.h>
using namespace std;
int main()
{
    int a; 
    cin >> a; 
    cout << ((a%2==0 && a>2) ? "YES" : "NO");
}

Don't see it?

Spoilers


import, include, gcc, c++

+312

pajenegod
16 months ago
9

Introducing the imaginary cyclic convolution, speeding up convolution by a factor of 2

By pajenegod, history, 22 months ago, In English

I recently had a very interesting idea for how to greatly speed up convolution (a.k.a. polynomial multiplication).

def convolution(A, B):
  C = [0] * (len(A) + len(B) - 1)
  for i in range(len(A)):
    for j in range(len(B)):
      C[i + j] += A[i] * B[j]
  return C

The standard technique for how to do convolution fast is to make use of cyclic convolution (polynomial mult mod $$$x^n - 1$$$).

def cyclic_convolution(A, B):
  n = len(A) # A and B need to have the same size
  C = [0] * n
  for i in range(n):
    for j in range(n):
      C[(i + j) % n] += A[i] * B[j]
  return C

Cyclic convolution can be calculated in $$$O(n \log n)$$$ using FFT, which is really fast. The issue here is that in order to do convolution using cyclic convolution, we need to pad with a lot of 0s to not be affected by the wrap around. All this 0-padding feels very inefficient.

So here is my idea. What if we do polynomial multiplication mod $$$x^n - i$$$ instead of mod $$$x^n - 1$$$? Then when we get wrap around, it will be multiplied by the imaginary unit, so it won't interfere with the real part! I call this the imaginary cyclic convolution.

def imaginary_cyclic_convolution(A, B):
  n = len(A) # A and B need to have the same size
  C = [0] * n
  for i in range(n):
    for j in range(n):
      C[(i + j) % n] += (1 if i + j < n else 1j) * A[i] * B[j]
  return C

Imaginary cyclic convolution is the perfect algorithm to use for implementing convolution. Using it, we no longer need to do copious amounts of 0-padding, since the imaginary unit will take care of the wrap around. In fact, the size (the value of $$$n$$$) required is exactly half of what we would need if we had used cyclic convolution.
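
To make the "half the size" claim concrete, here is a small self-contained sketch. It assumes A and B are real and have the same length $$$n$$$, and it uses the slow $$$O(n^2)$$$ definitions from above. The real parts give the first $$$n$$$ coefficients of the product and the imaginary parts give the wrapped around ones.

```python
def convolution(A, B):
    C = [0] * (len(A) + len(B) - 1)
    for i in range(len(A)):
        for j in range(len(B)):
            C[i + j] += A[i] * B[j]
    return C

def imaginary_cyclic_convolution(A, B):
    n = len(A)  # A and B need to have the same size
    C = [0] * n
    for i in range(n):
        for j in range(n):
            C[(i + j) % n] += (1 if i + j < n else 1j) * A[i] * B[j]
    return C

A = [1, 2, 3, 4]
B = [5, 6, 7, 8]
C = imaginary_cyclic_convolution(A, B)

# Real parts = coefficients 0..n-1, imaginary parts = coefficients n..2n-1
recovered = [c.real for c in C] + [c.imag for c in C]
print(recovered[:len(A) + len(B) - 1])
print(convolution(A, B))
```

Both printed lists contain the same coefficients, even though the imaginary cyclic convolution only used size 4 instead of size 8.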

One question still remains, how do we implement imaginary cyclic convolution efficiently?

The trick is rather simple. Let $$$\omega = i^{\frac{1}{n}}$$$. Now note that if $$$C(\omega x) = A(\omega x) B(\omega x) \mod x^n - 1$$$ then $$$C(x) = A(x) B(x) \mod x^n - i$$$. So here is the algorithm:

def imaginary_cyclic_convolution(A, B):
  n = len(A) # A and B need to have the same size
  w = (1j)**(1/n) # n-th root of imaginary unit
  
  # Transform the polynomials A(x) -> A(wx) and B(x) -> B(wx)
  A = [A[i] * w**i for i in range(n)]
  B = [B[i] * w**i for i in range(n)]

  C = cyclic_convolution(A, B)
  
  # Transform the polynomial C(wx) -> C(x)
  C = [C[i] / w**i for i in range(n)]
  return C
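
Putting the pieces together, here is a rough end-to-end sketch of convolution of two real, non-empty lists using the trick above. The textbook recursive FFT and the names fft, cyclic_convolution_fft and real_convolution are my own choices for this illustration. A serious implementation would use an iterative FFT and precomputed roots.

```python
import cmath

def fft(a, invert=False):
    # Recursive radix-2 FFT, len(a) must be a power of 2
    n = len(a)
    if n == 1:
        return a[:]
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = -1 if invert else 1
    out = [0] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def cyclic_convolution_fft(A, B):
    n = len(A)
    fc = fft([x * y for x, y in zip(fft(A), fft(B))], invert=True)
    return [c / n for c in fc]

def imaginary_cyclic_convolution_fft(A, B):
    n = len(A)
    w = 1j ** (1 / n)  # n-th root of the imaginary unit
    A = [A[i] * w**i for i in range(n)]
    B = [B[i] * w**i for i in range(n)]
    C = cyclic_convolution_fft(A, B)
    return [C[i] / w**i for i in range(n)]

def real_convolution(A, B):
    res_len = len(A) + len(B) - 1
    # Smallest power of 2 with 2*n >= res_len
    n = 1
    while 2 * n < res_len:
        n *= 2
    # Pad to length 2*n and reduce mod x^n - i
    A = A + [0] * (2 * n - len(A))
    B = B + [0] * (2 * n - len(B))
    A = [A[k] + 1j * A[k + n] for k in range(n)]
    B = [B[k] + 1j * B[k + n] for k in range(n)]
    C = imaginary_cyclic_convolution_fft(A, B)
    # The imaginary part contains the wrap around
    return ([c.real for c in C] + [c.imag for c in C])[:res_len]

print(real_convolution([1, 2, 3, 4], [5, 6, 7, 8]))
```

The output is only accurate up to floating point error, so round it if the inputs are integers.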


fft, convolution, constant factor, tricks

+157

pajenegod
22 months ago
3

Warning, Huge Security Vulnerability in Python was just found!

By pajenegod, history, 23 months ago, In English

Spoiler

Link to issue on github


python, complexity, security

+162

pajenegod
23 months ago
76

Montgomery Multiplication Explained (Fast Modular Multiplication)

By pajenegod, history, 2 years ago, In English

Hi CF! During this past weekend I was reading up on Montgomery transformation, which is a really interesting and useful technique to do fast modular multiplication. However, all of the explanations I could find online felt very unintuitive for me, so I decided to write my own blog on the subject. A big thanks to kostia244, nor, nskybytskyi and -is-this-fft- for reading this blog and giving me some feedback =).

Fast modular multiplication

Let $$$P=10^9+7$$$ and let $$$a$$$ and $$$b$$$ be two numbers in $$$[0,P)$$$. Our goal is to calculate $$$a \cdot b \, \% \, P$$$ without ever actually calling $$$\% \, P$$$. This is because calling $$$\% \, P$$$ is very costly.

If you haven't noticed that calling $$$\% \, P$$$ is really slow, then the reason you haven't noticed it is likely because the compiler automatically optimizes away the $$$\% \, P$$$ call if $$$P$$$ is known at compile time. But if $$$P$$$ is not known at compile time, then the compiler will have to call $$$\% \, P$$$, which is really really slow.

Montgomery reduction of $$$a \cdot b$$$

It turns out that the trick to calculate $$$a \cdot b \, \% \, P$$$ efficiently is to calculate $$$a \cdot b \cdot 2^{-32} \, \% \, P$$$ efficiently. So the goal for this section will be to figure out how to calculate $$$a \cdot b \cdot 2^{-32} \, \% \, P$$$ efficiently. $$$a \cdot b \cdot 2^{-32} \, \% \, P$$$ is called the Montgomery reduction of $$$a \cdot b$$$, denoted by $$$\text{m_reduce}(a \cdot b)$$$.

Idea (easy case)

Suppose that $$$a \cdot b$$$ just happens to be divisible by $$$2^{32}$$$. Then $$$(a \cdot b \cdot 2^{-32}) \, \% \, P = (a \cdot b) \gg 32$$$, which runs super fast!

Idea (general case)

Can we do something similar if $$$a \cdot b$$$ is not divisible by $$$2^{32}$$$? The answer is yes! The trick is to find some integer $$$m$$$ such that $$$(a \cdot b + m \cdot P)$$$ is divisible by $$$2^{32}$$$. Then $$$a \cdot b \cdot 2^{-32} \, \% \, P = (a \cdot b + m \cdot P) \cdot 2^{-32} \, \% \, P = (a \cdot b + m \cdot P) \gg 32$$$.

So how do we find such an integer $$$m$$$? We want $$$ (a \cdot b + m \cdot P) \,\%\, 2^{32} = 0$$$ so $$$m = (-a \cdot b \cdot P^{-1}) \,\%\, 2^{32}$$$. So if we precalculate $$$(-P^{-1}) \,\%\, 2^{32}$$$ then calculating $$$m$$$ can be done blazingly fast.
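
Here is a quick numeric sanity check of this step, as a sketch in Python 3.8+ (needed for pow with a negative exponent). The concrete values of a and b are arbitrary, and neg_P_inv is just a name I picked for the precalculated $$$(-P^{-1}) \,\%\, 2^{32}$$$.

```python
P = 10**9 + 7
r = 2**32
neg_P_inv = pow(-P, -1, r)  # (-P^-1) % 2^32, precalculated once

a = 123456789
b = 987654321
ab = a * b

# m is chosen so that ab + m * P is divisible by 2^32
m = ab * neg_P_inv % r
print((ab + m * P) % r)  # remainder of the division by 2^32

# So dividing by 2^32 is just a shift
reduced = (ab + m * P) >> 32

# reduced is congruent to a * b * 2^-32 modulo P
print(reduced % P == ab * pow(r, -1, P) % P)
```

Note that the only modulo operations involving $$$P$$$ are in the final check. Computing reduced itself only needs % 2^32 and >> 32, which are both cheap bit operations.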

Montgomery transformation

Since the Montgomery reduction divides $$$a \cdot b$$$ by $$$2^{32}$$$, we would like some way of multiplying by $$$2^{32}$$$ modulo $$$P$$$. The operation $$$x \cdot 2^{32} \, \% \, P$$$ is called the Montgomery transform of $$$x$$$, denoted by $$$\text{m_transform}(x)$$$.

The trick to implement $$$\text{m_transform}$$$ efficiently is to make use of the Montgomery reduction. Note that $$$\text{m_transform}(x) = \text{m_reduce}(x \cdot (2^{64} \, \% \, P))$$$, so if we precalculate $$$2^{64} \, \% \, P$$$, then $$$\text{m_transform}$$$ also runs blazingly fast.

Montgomery multiplication

Using $$$\text{m_reduce}$$$ and $$$\text{m_transform}$$$ there are multiple different ways of calculating $$$a \cdot b \, \% \, P$$$ efficiently. One way is to run $$$\text{m_transform}(\text{m_reduce}(a \cdot b))$$$. This results in two calls to $$$\text{m_reduce}$$$ per multiplication.

Another common way to do it is to always keep all integers transformed in the so-called Montgomery space. If $$$a' = \text{m_transform}(a)$$$ and $$$b' = \text{m_transform}(b)$$$ then $$$\text{m_transform}(a \cdot b \, \% \, P) = \text{m_reduce}(a' \cdot b')$$$. This effectively results in one call to $$$\text{m_reduce}$$$ per multiplication, however you now have to pay to move integers into and out of the Montgomery space.
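
Staying in Montgomery space pays off when many multiplications are chained. As a sketch, here is modular exponentiation where the base is transformed once, every multiplication in the loop costs a single reduction, and the result is moved out of Montgomery space once at the end. The helper mont_pow is a name made up for this illustration, and m_reduce here includes a final conditional subtraction so that its output stays in $$$[0, P)$$$.

```python
P = 10**9 + 7
r = 2**32
r2 = r * r % P
Pinv = pow(-P, -1, r)  # (-P^-1) % r, requires Python 3.8

def m_reduce(x):
    # Returns x * 2^-32 % P for 0 <= x < P * 2^32
    m = x * Pinv % r
    y = (x + m * P) // r
    return y if y < P else y - P

def m_transform(x):
    return m_reduce(x * r2)

def mont_pow(base, exp):
    # Hypothetical helper: base^exp % P with one m_reduce per multiplication
    result = m_transform(1)
    cur = m_transform(base % P)
    while exp:
        if exp & 1:
            result = m_reduce(result * cur)
        cur = m_reduce(cur * cur)
        exp >>= 1
    # Move the result out of Montgomery space
    return m_reduce(result)

print(mont_pow(3, 10**18) == pow(3, 10**18, P))
```

In Python this is of course slower than the built-in pow. The point is only to show where the transforms and reductions go.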

Example implementation

Here is a Python 3.8 implementation of Montgomery multiplication. This implementation is just meant to serve as a basic example. Implement it in C++ if you want it to run fast.

P = 10**9 + 7
r = 2**32
r2 = r * r % P
Pinv = pow(-P, -1, r) # (-P^-1) % r

def m_reduce(ab):
  m = ab * Pinv % r
  return (ab + m * P) // r

def m_transform(a):
  return m_reduce(a * r2)

# Example of how to use it
a = 123456789
b = 35
a_prim = m_transform(a) # mult a by 2^32
b_prim = m_transform(b) # mult b by 2^32
prod_prim = m_reduce(a_prim * b_prim) # divide a' * b' by 2^32
prod = m_reduce(prod_prim) # divide prod' by 2^32
print('%d * %d %% %d = %d' % (a, b, P, prod)) # prints 123456789 * 35 % 1000000007 = 320987587

Final remarks

One important issue that I've so far swept under the rug is that the output of m_reduce is actually in $$$[0, 2 P)$$$ and not $$$[0, P)$$$. I just want to end by discussing this issue. I can see two ways of handling this:

Alternative 1. You can force $$$\text{m_reduce}(a \cdot b)$$$ to be in $$$[0, P)$$$ for $$$a$$$ and $$$b$$$ in $$$[0, P)$$$ by adding an if-statement to the output of m_reduce. This will work for any odd integer $$$P < 2^{31}$$$.

Fixed implementation of m_reduce

def m_reduce(ab):
  m = ab * Pinv % r
  y = (ab + m * P) // r
  return y if y < P else y - P

Alternative 2. Assuming $$$P$$$ is an odd integer $$$< 2^{30}$$$, if $$$a$$$ and $$$b$$$ are in $$$[0, 2 P)$$$ then you can show that the output of $$$\text{m_reduce}(a \cdot b)$$$ is also in $$$[0, 2 P)$$$. So if you are fine working with $$$[0, 2 P)$$$ everywhere then you don't need any if-statements. Nyaan's github has a nice C++ implementation of Montgomery multiplication using this style of implementation.
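
The bound in Alternative 2 follows from $$$(a \cdot b + m \cdot P) / 2^{32} < (4 P^2 + 2^{32} P) / 2^{32} < 2 P$$$ when $$$4 P < 2^{32}$$$. Here is a small randomized check of it, as a sketch. I'm assuming $$$P = 998244353$$$ as an example of an odd modulus below $$$2^{30}$$$, and the names m_reduce_lazy and check_bound are made up for this illustration.

```python
import random

P = 998244353  # assumed example modulus, odd and < 2^30
r = 2**32
Pinv = pow(-P, -1, r)  # (-P^-1) % r, requires Python 3.8

def m_reduce_lazy(x):
    # No if-statement: for 0 <= x < 4 * P^2 the output is in [0, 2P)
    m = x * Pinv % r
    return (x + m * P) // r

def check_bound(trials=1000, seed=0):
    rng = random.Random(seed)
    for _ in range(trials):
        a = rng.randrange(2 * P)
        b = rng.randrange(2 * P)
        y = m_reduce_lazy(a * b)
        # Output must stay in [0, 2P) and be congruent to a * b * 2^-32
        if not (0 <= y < 2 * P) or y * r % P != a * b % P:
            return False
    return True

print(check_bound())
```

With this style a single conditional subtraction is still needed at the very end, when a value finally has to be brought into $$$[0, P)$$$.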


montgomery, modular, fast multiplication, primes, constant factor

pajenegod
2 years ago
12

(Request) Update PyPy's Version on CF

By pajenegod, history, 3 years ago, In English

I've always liked using Python (PyPy) for solving problems in competitive programming. And most problems are very doable, even in Python. What I've found is that the most difficult problems to solve in Python are those requiring 64 bit integers.

The reason 64 bit integers are problematic is that CF runs Windows, and PyPy only supports 32 bit on Windows. So whenever a problem involves integers that cannot fit inside of a signed 32 bit int, PyPy switches to big integers (which run insanely slow, sometimes a factor of 20 times slower).

What I currently have to do to get around big integers

However with the latest PyPy version (version 7.3.4) PyPy has finally switched to 64 bit on Windows! So upgrading PyPy would mean no more problems with big integers. This would make PyPy far more usable and more beginner friendly. So if possible please update PyPy's version on CF to 7.3.4! MikeMirzayanov

Edit: Reading Results of 2020 [list some changes and improvements] blog I realized that I should probably be tagging geranazavr555, kuviman and cannor147 too.


python, pypy, tle, mle

+642

pajenegod
3 years ago
22

CP can be profitable!

By pajenegod, history, 4 years ago, In English

Let me tell you the story of how I made $2200 from doing competitive programming.

Spoiler

Once many many fortnights ago Hackerrank held one of its regular competitions, "World CodeSprint 9". This was back when Hackerrank actually sent out its prizes. The competition was very unusual in that one of its hardest problems was a score-based approximation problem. This competition was also the first time that I would get placed in the top 100! Using my beloved Python :)

As I recall the prize for getting placed 4 to 100 was a t-shirt and $75. More precisely these $75 were sent either in Bitcoins or Amazon giftcards depending on where the prize winners lived, and in my case I got Bitcoins. I received the $75 in Bitcoins on 21st of March 2017.

Prices 2017

When I got them, I didn't really know how to do anything with them, so I kind of just forgot about them. Turns out that the value of Bitcoin has increased a bit since then:

Prices 2017-2021

30 times more to be precise! So today I just sold them for a bit over $2200 (Sold when the price hit 26640€/btc). Not too shabby for a 34th place finish in a regular competition! =D


money, bitcoin, hackerrank

+488

pajenegod
4 years ago
16

Explanation to weird/strange floating point behaviour in C++

By pajenegod, history, 4 years ago, In English

Introduction

I'm writing this blog because of the large number of blogs asking why they get strange floating point arithmetic behaviour in C++. For example:

"WA using GNU C++17 (64) and AC using GNU C++17" https://codeforces.com/blog/entry/78094

"The curious case of the pow function" https://codeforces.com/blog/entry/21844

"Why does this happen?" https://codeforces.com/blog/entry/51884

"Why can this code work strangely?" https://codeforces.com/blog/entry/18005

and many many more.

Example

Here is a simple example of the kind of weird behaviour I'm talking about

Example showing the issue

#include <iostream>
using namespace std;
 
double f(double a, double b) {
    return a * a - b;
}

int main() {
  cout.precision(60);
  
  // Calculate 10*10 - 1e-15
  double ans;
  ans = f(atof("10"), atof("1e-15"));
  cout << (double)      ans << '\n';
  cout << (int)         ans << "\n\n";
  
  ans = f(atof("10"), atof("1e-15"));
  cout << (int)         ans << '\n';
  cout << (double)      ans << "\n\n";
  
  ans = f(atof("10"), atof("1e-15"));
  cout << (double)      ans << '\n';
  cout << (long double) ans << "\n\n";
  
  ans = f(atof("10"), atof("1e-15"));
  cout << (long double) ans << '\n';
  cout << (double)      ans << "\n\n";
  return 0;
}

Output for 32 bit g++

100
100

99
100

100
100

99.99999999999999900079927783735911361873149871826171875
100

Output for 64 bit g++

Looking at this example, the output that one would expect from $$$10 * 10 - 10^{-15}$$$ is exactly $$$100$$$ since $$$100$$$ is the closest representable value of a double. This is exactly what happens in 64 bit g++. However, in 32 bit g++ there seems to be some kind of hidden excess precision causing the output to only sometimes(???) be $$$100$$$.

Explanation

In C and C++ there are different modes (referred to as methods) of how floating point arithmetic is done, see https://en.wikipedia.org/wiki/C99#IEEE_754_floating-point_support. You can detect which one is being used by the value of FLT_EVAL_METHOD found in cfloat. In mode 2 (which is what 32 bit g++ uses by default) all floating point arithmetic is done using long double. Note that in this mode numbers are temporarily stored as long doubles while being operated on, which can cause a kind of excess precision. In mode 0 (which is what 64 bit g++ uses by default) the arithmetic is done using each corresponding type, so there is no excess precision.
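
The effect of the two modes on the example above can be emulated with exact rational arithmetic. This is only a sketch of the rounding, not of what the hardware actually does. It assumes a double has a 53 bit significand and an x87 long double has a 64 bit significand, and round_to_bits is a helper made up for this illustration.

```python
from fractions import Fraction

def round_to_bits(x, bits):
    # Round a positive Fraction to `bits` significant bits (ties to even)
    e = x.numerator.bit_length() - x.denominator.bit_length()
    if Fraction(2) ** e > x:
        e -= 1
    # Now 2^e <= x < 2^(e+1), so the spacing between neighbours is:
    ulp = Fraction(2) ** (e - bits + 1)
    return round(x / ulp) * ulp

# Exact value of 10*10 - 1e-15, using the double closest to 1e-15
exact = Fraction(10.0) * Fraction(10.0) - Fraction(1e-15)

as_double = round_to_bits(exact, 53)    # mode 0: result rounded to double
as_extended = round_to_bits(exact, 64)  # mode 2: result kept as long double

print(as_double == 100)   # rounding to double gives exactly 100
print(as_extended < 100)  # the extended value is slightly below 100
print(int(as_extended))   # so converting it to int truncates
```

So whether the cast to int sees $$$100$$$ or something slightly below $$$100$$$ depends on whether the value has been rounded to a double before the cast.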

Detecting and turning on/off excess precision

Here is a simple example of how to detect excess precision (partly taken from https://stackoverflow.com/a/20870774)

Test for detecting excess precision

// #pragma GCC target("fpmath=sse,sse2") // Turns off excess precision
// #pragma GCC target("fpmath=387") // Turns on excess precision

#include <iostream>
#include <cstdlib>
#include <cfloat>
using namespace std;

int main() {
  cout << "This is compiled in mode "<< FLT_EVAL_METHOD << '\n';
  cout << "0 means no excess precision.\n";
  cout << "2 means there is excess precision.\n\n";
  
  cout << "The following test detects excess precision\n";
  cout << "0 if no excess precision, or 8e-17 if there is excess precision.\n";
  double a = atof("1.2345678");
  double b = a*a;
  cout << b - 1.52415765279683990130 << '\n';
  return 0;
}

If b is rounded (as one would "expect" since it is a double), then the result is zero. Otherwise it is something like 8e-17 because of excess precision. I tried running this in custom invocation. MSVC(C++17), Clang and g++17(64bit) all use mode 0 and print 0, while g++11, g++14 and g++17 (all 32 bit) use mode 2, as expected, and print 8e-17.

The culprit behind all of this misery is the old x87 instruction set, which only supports (80 bit) long double arithmetic. The modern solution is to use the SSE instruction set (version 2 or later) on top of this, since it supports both float and double arithmetic. On GCC you can turn this on with the flags -mfpmath=sse -msse2. This will not change the value of FLT_EVAL_METHOD, but it will effectively turn off excess precision, see 81993714.

It is also possible to effectively turn on excess precision with -mfpmath=387, see 81993724.

Fun exercise

Using your newfound knowledge of excess precision, try to find a compiler + input to "hack" this

Try to hack this

#include <iostream>
#include <cmath>
using namespace std;

bool f() {
    double x;
    cin >> x;
    
    double y = x + 1.0;
    if (y >= 1.0)
        return false;
    
    int w = pow(y, 2);
    
    if (y < 1.0)
        return false;
    return y == 1.0;
}

int main() {
    if (f())
        cout << "HACKED\n";
    else
        cout << "Not hacked\n";
}

Conclusion / TLDR

32 bit g++ by default does all of its floating point arithmetic with (80 bit) long double. This causes a ton of frustrating and weird behaviours. 64 bit g++ does not have this issue.


c/c++, floating-point, weird, strange, obscure, unusual behaviour, excess precision

+187

pajenegod
4 years ago
16
