Hi everyone! I was always wondered how cleverly interwoven so-called string algorithms. Six months ago I wrote here article about the possibility of a rapid transition from Z-function to prefix-function and backward. Some experienced users already know that such transitions are possible between the more complex string structures — suffix tree and suffix automaton. This transition is described at e -maxx.ru. Now I would like to tell you about such data structure as a suffix tree, and share the simple enough (theoretically) way of its fast building — obtain a suffix tree from suffix array .
I remind you that a suffix tree is a prefix tree, that contains all suffixes of specified string. In the simplest implementation it will require O(n2) time and memory — we simply add in the trie all suffixes one by one until we get what we want. In most cases it's too large. We will try to do something with it.
At first we will reduce the memory complexity to O(n). In order to do this we need to get the following idea: if we have a group of edges that are connected in series and do not have branches, we can combine them into one, which will be presented with some substring, but not a single character. Thus, we obtain a compact trie (also known as radix tree or patricia tree). But that's not all. Finally, we note that we have no need to store the entire substring on the edge, so we can store only the indices of its beginning and its ending in the source string. That is what will give us the desired linear complexity. Really, vertices in our tree will now appear only in places separating lexicographically consecutive suffixes, and there will be no more than n - 1 of such places.
Finally, let's reduce time complexity to O(n). In order to do this, we will approach the following strategy:
1) Add to the trie lexicographically minimal suffix.
2) For each of the next suffixes we will rise to lcp[i] and continue building trie from here.
Surprisingly, this will be enough. In fact, the actions we will do are identical to the depth-first traversal of tree, which obviously runs in O(n).
Wait a minute, but in the title is written "building in O(nlogn)", and you got O(n), wtf?
Indeed, in fact, if we have a lcp array, suffix tree can already be built in O(n). However, there still remains one problem — we need to get in some way lcp array. And here we come to the aid of suffix array, which can be used in order to get lcp. Relatively simple method for this is described on the site e-maxx.ru. We also can get it in O(n) using Kasai algorithm. Finally, we can combine it with some linear suffix array algo in order to get suffix tree with O(n) complexity.
Advantages of this method:
- Simple to understand.
- Acceptable time and memory complexity.
- Algorithm only works offline.
- The amount of code. It took me almost 300 lines (100 of which — for a suffix array), and then the whole evening to do something that will work. It was the first time for me building the suffix tree, so I can not say for sure whether it is possible to implement this algorithm easily.
Here, you can also see an example of code that does all these atrocities to create a suffix tree. Good luck to everyone and see you soon, I hope the article will be interesting :)