First part, I guess. Even if you think that you are familiar with suffix tree, please, take a look at the code below. It may be interesting to you.
Hi everyone! Finally I learnt this one :)
In this entry I would like to avoid long and complex theory which scared me from suffix tree for a long time. So straight to the point. I will not prove algorithm if you want some proofs, you may check stackoverflow or Dan Gusfield's book... Or somewhere else, I don't know.
Suffix tree is a compressed suffix trie, so all vertices which are not corresponding to suffixes and which have only one descendant are omitted.
Now about the algorithm. At each iteration, it makes implicit suffix tree. In implicit suffix tree all vertices which have only one descendant are omitted. Usually edges are stored as a pair of [L i, R i]. Personally I am not very convenient to work with them in this way, so I suggest to store in each node some data corresponding to the edge from its ancestor to it — fpos i & which is the left position of first edge occurence in the string and len i which is the length of the edge. In this case, the length of the edges, leading to leaves will by default be considered equal to inf. So we can be sure that at any time the edges to the leaves are correct. Root of the tree will be the vertex numbered 0.
Let's at each step of the algorithm keep the longest non-unique suffix of the string. To do this, let's keep a pair of numbers (node, pos) & mdash; vertex in the suffix tree and the number of characters that you need to pass down from it to have this suffix. By default node = pos = 0. When you append a new symbol, let's increase pos by 1 and add all of new unique suffix of the string that appear after adding a new character.
Also, we need the concept of suffix links. It is defined for internal nodes of the tree. Following suffix link will lead to the vertex corresponding to the same substring, but without first character. For the root vertex suffix link is not defined.
Appending of new character consists of the following stages:
- If pos = 0, then all suffixes are added. Return. Otherwise let's find the vertex after which new suffix will be added (it is not neccessarily node because edge from node may be too short). So while pos greater then edge from node let's follow this edge and substract its length from pos.
- Now let's try to add new suffix. We will have three options here:
- If we do not have needed outgoing edges at this node, we simply create a new vertex and hung it to the current one.
- If there is an edge and a suffix that we want to add lies entirely on it then this and further suffixes are not unique. Return.
- If there is an edge and suffix doesn't lie entirely on it then it differs in only one character, this means that we need to create a new vertex in the middle of the edge and then create another one new vertex (which will be new suffix) and hung it to the vertex in the middle of splitted edge.
- If you have not returned on the previous step, go to the next suffix. If node is root, then we reduce the pos with 1, otherwise we just follow the suffix link node = link(node) without changing pos. After that, we go to step 1.
And about siffix links. On the i-th step we will set suffix link of internal vertex created on i - 1-th step. If we create a new internal vertex (i.e. split some edge), then the suffix link will lead into it. In two other cases, the suffix link will lead to the node (I am too lazy to write truly marvellous proof of this, so it is left to the curious reader as an exercise).
And, finally the implementation.
I tried to make the code as simple and clear as possible :) I do not know if I managed to do so and hope for your feedback and questions.