### quasisphere's blog

# Fast convolution for 64-bit integers

By quasisphere, 7 years ago

Warning: Math ahead. You can find my reference C++ implementation at https://github.com/quasisphere/conv64

## Introduction

Let (a_0, ..., a_{n-1}) and (b_0, ..., b_{n-1}), n ≥ 1, be two tuples of elements of some ring R. By their convolution we mean the tuple (c_0, ..., c_{2n-2}) whose elements are given by

c_k = a_0 b_k + a_1 b_{k-1} + ... + a_{n-1} b_{k-n+1}.

(Here we understand that a_i = b_i = 0 for i < 0 or i ≥ n.)

A succinct way of encoding this is via polynomials: If A(x) = a_0 + a_1 x + ... + a_{n-1} x^{n-1} and B(x) = b_0 + b_1 x + ... + b_{n-1} x^{n-1}, then

A(x)B(x) = C(x) = c_0 + c_1 x + ... + c_{2n-2} x^{2n-2}.

Convolution can thus be understood via multiplication of polynomials and vice versa. From this point on we will exclusively work with polynomials instead of tuples since that is much more natural.
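
For concreteness, here is a minimal sketch of the naive O(n^2) convolution over R = Z/2^64 Z (the function name `conv_naive` is mine, not from the reference implementation); in C++, unsigned overflow is exactly arithmetic modulo 2^64, so no explicit reduction is needed:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Naive O(n^2) convolution of two coefficient tuples modulo 2^64.
// uint64_t arithmetic wraps around, which is exactly reduction mod 2^64.
std::vector<uint64_t> conv_naive(const std::vector<uint64_t>& a,
                                 const std::vector<uint64_t>& b) {
    std::vector<uint64_t> c(a.size() + b.size() - 1, 0);
    for (std::size_t i = 0; i < a.size(); ++i)
        for (std::size_t j = 0; j < b.size(); ++j)
            c[i + j] += a[i] * b[j];  // contributes a_i * b_j to c_{i+j}
    return c;
}
```

This quadratic baseline is what the rest of the post speeds up.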

Multiplying polynomials with real or complex coefficients can be done efficiently via the Fast Fourier Transform. The same method can also be applied to polynomials with integer coefficients, but then one has to worry about possible loss of precision and therefore getting a wrong answer. In this blog we will present a way of multiplying polynomials whose coefficients are 64-bit integers, or in mathematical terms, integers modulo 2^64. The underlying ideas of our method are not new, but I have not seen this exact combination used before. A nice survey of fast multiplication methods is given in https://cr.yp.to/papers/m3.pdf

We note that FFT computes cyclic convolutions, that is, products in the ring R[x] / (x^n - 1). This is not a problem, since to compute an acyclic convolution one can simply pick n large enough and zero-pad the polynomial (so that no cyclic overflow occurs).

## A short recap on the Fast Fourier Transform

Assume that R is a commutative ring that contains an element ξ such that ξ^{2^m} = 1 and 2^m is the multiplicative order of ξ. Then the polynomial x^{2^m} - 1 = x^{2^m} - ξ^{2^m} splits recursively into factors as follows:

x^{2^m} - ξ^{2^m} = (x^{2^{m-1}} - ξ^{2^{m-1}})(x^{2^{m-1}} + ξ^{2^{m-1}}) = (x^{2^{m-1}} - ξ^{2^{m-1}})(x^{2^{m-1}} - (ξ^2)^{2^{m-1}})

Now, assume that we want to compute the product of two polynomials in the quotient ring R[x] / (x^{2^m} - 1). Then by the Chinese remainder theorem it is enough to compute their product in R[x] / (x^{2^{m-1}} - ξ^{2^{m-1}}) and R[x] / (x^{2^{m-1}} - (ξ^2)^{2^{m-1}}). The multiplication by FFT works by doing this reduction repeatedly until we are left only with first degree polynomials of the form x - ξ^j for j = 0, ..., 2^m - 1.

In practice FFT algorithms give you the evaluations A(ξ^j), which correspond exactly to reducing modulo x - ξ^j. By the preceding discussion, to multiply the polynomials A and B we must form the products A(ξ^j)B(ξ^j) and then apply the Chinese remainder theorem all the way up to the original ring. This second stage is done by the inverse FFT.
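
To make the recap concrete, here is a textbook recursive radix-2 FFT over the complex numbers and a cyclic convolution built on top of it; this illustrates the general evaluate/multiply/interpolate scheme over C, not the 64-bit method developed below (function names are mine):

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <vector>

typedef std::complex<double> cd;

// Recursive radix-2 FFT: evaluates the polynomial at the n-th roots of
// unity (invert = false) or performs the inverse transform (invert = true).
void fft(std::vector<cd>& a, bool invert) {
    int n = (int)a.size();  // n must be a power of 2
    if (n == 1) return;
    std::vector<cd> even(n / 2), odd(n / 2);
    for (int i = 0; i < n / 2; ++i) { even[i] = a[2 * i]; odd[i] = a[2 * i + 1]; }
    fft(even, invert);
    fft(odd, invert);
    double ang = 2 * std::acos(-1.0) / n * (invert ? -1 : 1);
    for (int i = 0; i < n / 2; ++i) {
        cd w = std::polar(1.0, ang * i);
        a[i] = even[i] + w * odd[i];
        a[i + n / 2] = even[i] - w * odd[i];
        // dividing by 2 at every level divides by n overall
        if (invert) { a[i] /= 2; a[i + n / 2] /= 2; }
    }
}

// Cyclic convolution in C[x]/(x^n - 1): transform, multiply pointwise, invert.
std::vector<long long> cyclic_convolution(std::vector<cd> a, std::vector<cd> b) {
    fft(a, false);
    fft(b, false);
    for (std::size_t i = 0; i < a.size(); ++i) a[i] *= b[i];
    fft(a, true);
    std::vector<long long> c(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) c[i] = std::llround(a[i].real());
    return c;
}
```

Note how the inputs below are zero-padded to length 4 so that the cyclic product agrees with the acyclic one, exactly as remarked in the introduction.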

## Outline of the method

Let R = Z / 2^64 Z denote the ring of integers modulo 2^64. There are two main obstacles to multiplying polynomials in R[x] using the FFT method.

Obstacle 1: 2 is not invertible. This means that we cannot perform the standard (radix-2) FFT, since in the inverse transform we must divide by 2. The solution we will use is to instead perform a radix-3 FFT on an array whose length is a power of 3.

Obstacle 2: There are no roots of unity of order 3^m. Our solution is to first adjoin a 3rd root of unity to the ring R and then manufacture higher order roots via a Schönhage-type trick.

Having identified what we must overcome, we will now outline the method. First of all, we extend our ring of scalars by looking instead at the ring T = R[ω] / (ω^2 + ω + 1). Its elements can be represented as a + bω, where a and b are 64-bit integers, and the ring of 64-bit integers is embedded in it as the elements with b = 0. The product of two such elements is given by

(a + bω)(c + dω) = ac + (ad + bc)ω + bdω^2 = (ac - bd) + (ad + bc - bd)ω,

where we have used the fact that ω^2 = -ω - 1, and it is also easy to check that ω^3 = 1. Thus ω is a 3rd root of unity in this ring. We say that the conjugate of an element a + bω is the element we get by mapping ω → ω^2, that is a + bω^2 = (a - b) - bω.
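
A minimal sketch of this arithmetic (the names `T64`, `mul` and `conj` are mine, not taken from the reference implementation); unsigned wraparound supplies the reduction modulo 2^64:

```cpp
#include <cassert>
#include <cstdint>

// An element a + b*omega of T = R[omega]/(omega^2 + omega + 1), R = Z/2^64 Z.
struct T64 {
    uint64_t a, b;
    bool operator==(const T64& o) const { return a == o.a && b == o.b; }
};

// (a + b*omega)(c + d*omega) = (ac - bd) + (ad + bc - bd)*omega,
// using omega^2 = -omega - 1.
T64 mul(T64 x, T64 y) {
    uint64_t bd = x.b * y.b;
    return { x.a * y.a - bd, x.a * y.b + x.b * y.a - bd };
}

// Conjugation omega -> omega^2, i.e. a + b*omega -> (a - b) - b*omega.
T64 conj(T64 x) { return { x.a - x.b, 0 - x.b }; }
```

One can check directly that ω^3 = 1 and that conjugation sends ω to ω^2.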

In what follows we will devise an algorithm for multiplication in T[x] / (x^n - 1) where n is a power of 3. Assume that we are given a polynomial a_0 + a_1 x + ... + a_{n-1} x^{n-1} with a_i from the extended ring T. We will first let y = x^m, where m and r are powers of 3 such that mr = n and 3m ≥ r. Then we can write our polynomial in the form

(a_0 + ... + a_{m-1} x^{m-1}) + (a_m + ... + a_{2m-1} x^{m-1}) y + ... + (a_{(r-1)m} + ... + a_{rm-1} x^{m-1}) y^{r-1}

Notice that in a program this does not require us to do anything since the order of the coefficients stays the same.

Now since the coefficients of the y^i are polynomials of degree at most m - 1 and y^r = x^n = 1, we can think of this as an element of the ring T[x] / (x^{2m} + x^m + 1)[y] / (y^r - 1). Clearly if we multiply two such polynomials the products of the coefficients in T[x] / (x^{2m} + x^m + 1) will not overflow, and we can deduce the result of the original multiplication.

But we can reduce the problem even further! We have x^{2m} + x^m + 1 = (x^m - ω)(x^m - ω^2). Thus, if we can deduce the products in the rings T[x] / (x^m - ω)[y] / (y^r - 1) and T[x] / (x^m - ω^2)[y] / (y^r - 1), we will be done by the Chinese remainder theorem. Now in either case x itself is a 3m-th root of unity, so we can use FFT to compute the products, assuming that we have an efficient method of calculating products in T[x] / (x^m - ω) and T[x] / (x^m - ω^2).

But this we can solve recursively! Indeed, the method of computing products in T[x] / (x^n - 1) we have outlined goes through almost identically also in the case of T[x] / (x^n - ω) and T[x] / (x^n - ω^2). The only difference is that we will end up with a ring T[x] / (x^m - ω)[y] / (y^r - ω) (in the first case; the second case is similar), which we can map to the ring T[x] / (x^m - ω)[y] / (y^r - 1) by mapping y → x^{m/r} y. This requires that m ≥ r, which is a bit (but not significantly) worse than the condition 3m ≥ r we had earlier.

## Details and optimizations

When implementing the above strategy, it is useful to notice a couple of things.

1. On the first round of the recursion, when we are working in the ring T[x] / (x^m - ω^2)[y] / (y^r - 1), we note that we can map this to the ring T[x] / (x^m - ω)[y] / (y^r - 1) by conjugation. Since our original data came from ordinary 64-bit integers, conjugation acts as the identity on them. Hence the product in T[x] / (x^m - ω^2)[y] / (y^r - 1) is just the conjugate of the product in T[x] / (x^m - ω)[y] / (y^r - 1), and it is enough to compute only the latter.

2. On further rounds of the recursion we can assume that we are always trying to multiply in T[x] / (x^m - ω). Indeed, when we would need a result in T[x] / (x^m - ω^2)[y] / (y^r - ω), we can map this by conjugation to T[x] / (x^m - ω)[y] / (y^r - ω^2) and then perform the change of variables y → x^{2m/r} y to get to T[x] / (x^m - ω)[y] / (y^r - 1) and proceed using FFT as usual.

3. The element of T[x] / (x^{2m} + x^m + 1) that is f modulo x^m - ω and g modulo x^m - ω^2 is given quite concretely by 3^{-1}(1 + 2ω)(g·(x^m - ω) - f·(x^m - ω^2)).
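
Two facts make this formula work: (1 + 2ω)^2 = -3 in T, and 3 is odd, hence invertible modulo 2^64. A small sketch checking the inverse (the Newton iteration is a standard trick for inverses of odd numbers modulo 2^k, not something specific to the post; the function name is mine):

```cpp
#include <cassert>
#include <cstdint>

// (1 + 2w)^2 expanded with w^2 = -w - 1:
// 1 + 4w + 4w^2 = 1 + 4w - 4 - 4w = -3,
// so the CRT formula really is "divide by (1 + 2w)" in disguise.

// 3^{-1} mod 2^64 via Newton iteration x -> x(2 - 3x): each step doubles
// the number of correct low bits, so 6 steps suffice for 64 bits.
uint64_t inverse_of_3() {
    uint64_t inv = 1;            // correct modulo 2
    for (int k = 0; k < 6; ++k)
        inv *= 2 - 3 * inv;      // correct modulo 2^(2^(k+1))
    return inv;
}
```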

4. The radix-3 FFT is a rather standard modification of the regular radix-2 FFT, so we will not go through it here. The twiddle factors will be powers of x, and therefore one just needs a simple function that computes the product x^t·p for p in T[x] / (x^m - ω). This same function can also be used for the linear changes of variables in the earlier steps.

## Conclusion

We have outlined an algorithm for computing exact convolutions modulo 2^64. All the computations are conveniently done with the hardware's native arithmetic modulo 2^64. Other methods for exact convolutions include rounding of floating point convolutions, convolutions modulo suitable primes, and the Nussbaumer/Schönhage tricks. The algorithm here is close in spirit to the latter two, but the implementation is simplified because of the extension of scalars. In particular, no zero-padding of the polynomials is needed.

Regarding speed: I haven't tested too much, but it seems reasonably fast, even compared to regular FFT algorithms (not optimized competitive programming snippets).

I'm happy to hear comments or about other interesting approaches! As it is, the code is borderline competition ready. It certainly works if you can copy-paste it, but writing it down by hand takes some time. However, my implementation contains a lot of long comments, so it's not quite as bloated as it might seem at first.

# A short guide to suffix automata

By quasisphere, 8 years ago

This is supposed to be a short intro to suffix automata — what they are, how they are constructed, and what they can do. It seems that people have had trouble finding an explanation in English (the most commonly mentioned source is e-maxx.ru). I don't speak Russian, so this will be based mainly on the paper Automata for Matching Patterns by Crochemore and Hancart.

Disclaimer: I am not an expert in the subject, and I am yet to use the automaton in an actual competition. Comments and corrections are welcome!

## What is a suffix automaton?

A suffix automaton A for a string s is a minimal finite automaton that recognizes the suffixes of s. This means that A is a directed acyclic graph with an initial node i, a set of terminal nodes, and arrows between nodes labeled by letters of s. To test whether w is a suffix of s it is enough to start from the initial node i and follow the arrows corresponding to the letters of w. If we are able to do this for every letter of w and end up in a terminal node, w is a suffix of s. The above picture shows an automaton for the string "abcbc". The initial node is marked with 'i', and the terminal nodes are marked with asterisks.

To better understand suffix automata, let us consider the following game: A string s is given. Your friend has chosen a substring w of s and starts reciting the letters of w one by one from the beginning. You have a serious case of amnesia and can only remember the last letter that your friend has said. When your friend has reached the end, can you tell whether the word w is a suffix of the string s?

To solve the problem let's do the following: At each stage we remember all the locations in s where we could be based on the previous letters. When the next letter comes in, we just check which of the locations have the correct next letter and form our next set of possible locations. The final set of locations are the endpoints of all the occurrences of w in s. Of course if the end of the string s is among these, w is a suffix of s.

For example if our string is "abcbc" and our friend has picked the string "bcbc", the possible lists of locations develop as follows (marked with upper case letters): aBcBc, abCbC, abcBc, abcbC.
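
The game can be simulated directly by keeping the set of possible end positions and filtering it on each new letter. A small sketch (the function name `play_game` is mine; positions are 0-indexed):

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// After each letter of w, record the set of positions in s where the
// word read so far could currently end.
std::vector<std::set<int>> play_game(const std::string& s, const std::string& w) {
    std::vector<std::set<int>> states;
    std::set<int> cur;
    for (int k = 0; k < (int)w.size(); ++k) {
        std::set<int> next;
        for (int i = 0; i < (int)s.size(); ++i)
            // position i survives if its letter matches and, except on the
            // first move, the previous position i-1 was possible
            if (s[i] == w[k] && (k == 0 || cur.count(i - 1)))
                next.insert(i);
        cur = next;
        states.push_back(cur);
    }
    return states;
}
```

For s = "abcbc" and w = "bcbc" this reproduces the sets aBcBc, abCbC, abcBc, abcbC; w is a suffix exactly when the final set contains the last position of s.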

A bit of thinking will convince you that remembering these locations is not only enough but also necessary. Thus the nodes of the suffix automaton should correspond exactly to the possible sets of locations that can occur during the game (when played for all possible substrings w). This also gives a (slow) way of constructing the suffix automaton: Go through the game for every suffix of s and at every step add nodes and edges to the automaton if they don't yet exist.

We will say that two substrings u and v of s are equivalent if running u and v through the suffix automaton brings them to the same node, which by the above analysis is the same as saying that the endpoints of the occurrences of u and v in s agree. Thus one could also say that the nodes of the suffix automaton correspond to the equivalence classes of substrings of s under this relation.

## The suffix tree and suffix links

Assume that u and v are the shortest and longest words in their common equivalence class. Then deleting the first letter of u will result in a larger set of possible endpoints, and adding a letter to the front of v will result in a smaller set. The words in the equivalence class are exactly those that are between u and v. The choice of the letter one can add in front of v induces a tree structure on the nodes of the suffix automaton. The removal of the first letter of u then means moving to the parent in this tree. These links to the parent are called suffix links, and they are useful both in constructing the suffix automaton and in applications. The above picture shows in each node the longest and shortest strings in the equivalence class, as well as the suffix links (in blue).

Notice that one can find the terminal nodes of a suffix automaton by starting from the node corresponding to the whole string s and following the suffix links up to the initial node.

One may note that the tree that appears is actually the Suffix tree of the reversed string s. Risking potential confusion, we will refer to our tree also as the 'suffix tree'.

## Constructing the suffix automaton

The suffix automaton can be constructed in linear time via an online algorithm that adds characters of s one by one and updates the automaton.

To understand the algorithm, it is necessary to consider how the equivalence classes change when adding a letter x to the end of the string. Consider moving in the suffix tree from the root along the path obtained by adding letters from the back of the new string one by one. Each such move either

1. stays in the current equivalence class,
2. moves down the tree to the next equivalence class, or
3. is impossible.

When the case 3 happens (it will happen because the whole new string is not in the old tree), there are two possibilities:

1. we are in the longest string of some equivalence class q but there is no link forward to the next node
2. the next longest string in the class q has the wrong first letter

In the first case we may simply add a new equivalence class r containing the remaining new suffixes that could not be found in the tree. It will have a suffix link to q. In the second case we will have to add two new equivalence classes:

1. r containing the remaining new suffixes
2. q' containing the strings in q that are suffixes of the new string (i.e. were found before the search ended)

The strings in q' are of course removed from q, splitting the original class in two pieces.

For instance consider adding 'c' to the end of "abcb". In the original tree when we start following the letters from the back of the new string "abcbc", we will come to the equivalence class q with longest string "abc". The search will terminate at "bc" since the next letter we would try is 'c', which is different from 'a'. Thus we add the equivalence classes q' and r with longest strings "bc" and "abcbc" respectively.

Now the class q' becomes the parent of q and r in the suffix tree, and the parent of q' is the old parent of q.

Finally we have to consider the edges in the automaton. First of all we should add edges to the new class r from every class of every suffix of the original string that does not have an edge labeled with the added letter x. These can be found by starting from the class of the whole original string and following the suffix links until a class p with edge labeled x appears. Following the edge from p also gives us the class q.

If we have to split q, there will be additional changes to the edges. The edges from q will stay the same, and these will also be copied to be the edges from q'. The nodes with edges to q that now should point to the shorter strings in q' instead can be found among p and its parents by following suffix links.

## Implementation

Below is an implementation of the construction procedure in C++. It might appear on the surface that this is an O(n^2) algorithm, but in fact the inner loops run only O(n) times in total. I will skip the proof.

struct SuffixAutomaton {
    vector<map<char,int>> edges; // edges[i]  : the labeled edges from node i
    vector<int> link;            // link[i]   : the parent of i in the suffix tree
    vector<int> length;          // length[i] : the length of the longest string in the ith class
    int last;                    // the index of the equivalence class of the whole string

    SuffixAutomaton(string s) {
        // add the initial node
        edges.push_back(map<char,int>());
        link.push_back(-1);
        length.push_back(0);
        last = 0;

        for(int i=0;i<s.size();i++) {
            // construct r
            edges.push_back(map<char,int>());
            length.push_back(i+1);
            link.push_back(-1);
            int r = edges.size() - 1;

            // add edges to r and find p with link to q
            int p = last;
            while(p >= 0 && edges[p].find(s[i]) == edges[p].end()) {
                edges[p][s[i]] = r;
                p = link[p];
            }
            if(p != -1) {
                int q = edges[p][s[i]];
                if(length[p] + 1 == length[q]) {
                    // we do not have to split q, just set the correct suffix link
                    link[r] = q;
                } else {
                    // we have to split, add q'
                    edges.push_back(edges[q]); // copy edges of q
                    length.push_back(length[p] + 1);
                    link.push_back(link[q]);   // q' inherits the parent of q
                    int qq = edges.size()-1;
                    // add qq as the new parent of q and r
                    link[q] = qq;
                    link[r] = qq;
                    // move short classes pointing to q to point to q'
                    while(p >= 0 && edges[p][s[i]] == q) {
                        edges[p][s[i]] = qq;
                        p = link[p];
                    }
                }
            } else {
                // no q was found, the suffix link goes to the initial node
                link[r] = 0;
            }
            last = r;
        }
    }
};


Note that this does not calculate the terminating nodes of the suffix automaton. This can however be done simply by starting from the last node and following the suffix links:

vector<int> terminals;
int p = last;
while(p > 0) {
    terminals.push_back(p);
    p = link[p];
}


## Applications

Typical applications of the suffix automaton are based on some sort of dynamic programming scheme on top of the automaton, but let's start with the trivial ones that can be implemented on the plain automaton.

Problem: Find whether a given string w is a substring of s.

Solution: Simply run the automaton.

SuffixAutomaton a(s);
bool fail = false;
int n = 0;
for(int i=0;i<w.size();i++) {
    if(a.edges[n].find(w[i]) == a.edges[n].end()) {
        fail = true;
        break;
    }
    n = a.edges[n][w[i]];
}
if(!fail) cout << w << " is a substring of " << s << "\n";


Problem: Find whether a given string w is a suffix of s.

Solution: Construct the list of terminal states, run the automaton as above, and check at the end whether n is among the terminal states.
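
As a self-contained sketch of this solution, here is a condensed copy of the construction from the Implementation section together with the check (the struct name `SA` and the helper `is_suffix` are mine):

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>
using namespace std;

// Condensed version of the construction from the Implementation section.
struct SA {
    vector<map<char,int>> edges;
    vector<int> link, length;
    int last = 0;
    SA(const string& s) {
        edges.push_back({}); link.push_back(-1); length.push_back(0);
        for (char c : s) {
            edges.push_back({}); link.push_back(0); length.push_back(length[last] + 1);
            int r = (int)edges.size() - 1, p = last;
            while (p >= 0 && !edges[p].count(c)) { edges[p][c] = r; p = link[p]; }
            if (p != -1) {
                int q = edges[p][c];
                if (length[p] + 1 == length[q]) {
                    link[r] = q;
                } else {  // split q, adding the class q'
                    edges.push_back(edges[q]);
                    link.push_back(link[q]);
                    length.push_back(length[p] + 1);
                    int qq = (int)edges.size() - 1;
                    link[q] = link[r] = qq;
                    while (p >= 0 && edges[p][c] == q) { edges[p][c] = qq; p = link[p]; }
                }
            }
            last = r;
        }
    }
};

// w is a suffix of s iff the automaton never falls off while reading w and
// ends in a terminal state (a state on the suffix-link path from `last`).
bool is_suffix(const SA& a, const string& w) {
    int n = 0;
    for (char c : w) {
        auto it = a.edges[n].find(c);
        if (it == a.edges[n].end()) return false;
        n = it->second;
    }
    for (int p = a.last; p > 0; p = a.link[p])
        if (p == n) return true;
    return n == 0;  // the empty word is always a suffix
}
```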

Let's now look at the dp problems.

Problem: Count the number of distinct substrings in s.

Solution: The number of distinct substrings is the number of different paths in the automaton. These can be calculated recursively by calculating for each node the number of different paths starting from that node. The number of different paths starting from a node is the sum of the corresponding numbers of its direct successors, plus 1 corresponding to the path that does not leave the node.
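
A sketch of this dp; a condensed copy of the construction from the Implementation section is bundled in so that the snippet stands alone (`SA`, `count_paths` and `distinct_substrings` are my names):

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>
using namespace std;

// Condensed version of the construction from the Implementation section.
struct SA {
    vector<map<char,int>> edges;
    vector<int> link, length;
    int last = 0;
    SA(const string& s) {
        edges.push_back({}); link.push_back(-1); length.push_back(0);
        for (char c : s) {
            edges.push_back({}); link.push_back(0); length.push_back(length[last] + 1);
            int r = (int)edges.size() - 1, p = last;
            while (p >= 0 && !edges[p].count(c)) { edges[p][c] = r; p = link[p]; }
            if (p != -1) {
                int q = edges[p][c];
                if (length[p] + 1 == length[q]) {
                    link[r] = q;
                } else {  // split q, adding the class q'
                    edges.push_back(edges[q]);
                    link.push_back(link[q]);
                    length.push_back(length[p] + 1);
                    int qq = (int)edges.size() - 1;
                    link[q] = link[r] = qq;
                    while (p >= 0 && edges[p][c] == q) { edges[p][c] = qq; p = link[p]; }
                }
            }
            last = r;
        }
    }
};

// Number of paths in the DAG starting at v, including the empty path that
// stays at v, computed by memoized recursion over the automaton edges.
long long count_paths(const SA& a, int v, vector<long long>& memo) {
    if (memo[v] != -1) return memo[v];
    long long res = 1;
    for (auto& e : a.edges[v]) res += count_paths(a, e.second, memo);
    return memo[v] = res;
}

long long distinct_substrings(const string& s) {
    SA a(s);
    vector<long long> memo(a.edges.size(), -1);
    return count_paths(a, 0, memo) - 1;  // drop the empty path
}
```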

Problem: Count the number of times a given word w occurs in s.

Solution: Similar to the previous problem. Let p be the node in the automaton that we end up in when running it for w. This time the number of times a given word occurs is the number of paths starting from p and ending in a terminal node, so one can calculate recursively the number of paths from each node ending in a terminal node.

Problem: Find where a given word w occurs for the first time in s.

Solution: This is equivalent to calculating the longest path in the automaton after reaching the node p (defined as in the previous solution).

Finally let's consider the following problem where the suffix links come handy.

Problem: Find all the positions where a given word w occurs in s.

Solution: Prepend the string with some symbol '$' that does not occur in the string and construct the suffix automaton. Let's then add to each node of the suffix automaton its children in the suffix tree:

children = vector<vector<int>>(link.size());

Now find the node p corresponding to the word w as has been done in the previous problems. We can then dfs through the subtree of the suffix tree rooted at p by using the children vector. Once we reach a leaf, we know that we have found a prefix of s that ends in w, and the length of the leaf can be used to calculate the position of w. All of the dfs branches correspond to different prefixes, so no unnecessary work is done and the complexity is proportional to the number of occurrences (on top of the O(|w|) walk that finds p).