Suffix tree. Ukkonen's algorithm

#	User	Rating
1	ecnerwala	3649
2	Benq	3581
3	orzdevinwang	3570
4	Geothermal	3569
4	cnnfls_csy	3569
6	tourist	3565
7	maroonrk	3531
8	Radewoosh	3521
9	Um_nik	3482
10	jiangly	3468

#	User	Contrib.
1	maomao90	174
2	awoo	164
3	adamant	163
4	TheScrasse	159
5	nor	158
6	maroonrk	156
7	-is-this-fft-	152
8	SecondThread	147
9	orz	146
10	pajenegod	145

First part, I guess. Even if you think that you are familiar with suffix tree, please, take a look at the code below. It may be interesting to you.

Hi everyone! Finally I learnt this one :)

In this entry I would like to avoid long and complex theory which scared me from suffix tree for a long time. So straight to the point. I will not prove algorithm if you want some proofs, you may check stackoverflow or Dan Gusfield's book... Or somewhere else, I don't know.

Suffix tree is a compressed suffix trie, so all vertices which are not corresponding to suffixes and which have only one descendant are omitted.

Now about the algorithm. At each iteration, it makes implicit suffix tree. In implicit suffix tree all vertices which have only one descendant are omitted. Usually edges are stored as a pair of [L_i, R_i]. Personally I am not very convenient to work with them in this way, so I suggest to store in each node some data corresponding to the edge from its ancestor to it — fpos_i & which is the left position of first edge occurence in the string and len_i which is the length of the edge. In this case, the length of the edges, leading to leaves will by default be considered equal to inf. So we can be sure that at any time the edges to the leaves are correct. Root of the tree will be the vertex numbered 0.

Let's at each step of the algorithm keep the longest non-unique suffix of the string. To do this, let's keep a pair of numbers (node, pos) & mdash; vertex in the suffix tree and the number of characters that you need to pass down from it to have this suffix. By default node = pos = 0. When you append a new symbol, let's increase pos by 1 and add all of new unique suffix of the string that appear after adding a new character.

Also, we need the concept of suffix links. It is defined for internal nodes of the tree. Following suffix link will lead to the vertex corresponding to the same substring, but without first character. For the root vertex suffix link is not defined.

Appending of new character consists of the following stages:

If pos = 0, then all suffixes are added. Return. Otherwise let's find the vertex after which new suffix will be added (it is not neccessarily node because edge from node may be too short). So while pos greater then edge from node let's follow this edge and substract its length from pos.
Now let's try to add new suffix. We will have three options here:
1. If we do not have needed outgoing edges at this node, we simply create a new vertex and hung it to the current one.
2. If there is an edge and a suffix that we want to add lies entirely on it then this and further suffixes are not unique. Return.
3. If there is an edge and suffix doesn't lie entirely on it then it differs in only one character, this means that we need to create a new vertex in the middle of the edge and then create another one new vertex (which will be new suffix) and hung it to the vertex in the middle of splitted edge.
If you have not returned on the previous step, go to the next suffix. If node is root, then we reduce the pos with 1, otherwise we just follow the suffix link node = link(node) without changing pos. After that, we go to step 1.

And about siffix links. On the i-th step we will set suffix link of internal vertex created on i - 1-th step. If we create a new internal vertex (i.e. split some edge), then the suffix link will lead into it. In two other cases, the suffix link will lead to the node (I am too lazy to write truly marvellous proof of this, so it is left to the curious reader as an exercise).

And, finally the implementation.

Code: #sT8Vd1

I tried to make the code as simple and clear as possible :) I do not know if I managed to do so and hope for your feedback and questions.

Comments (11)

Show archived | Write comment?

s17k

9 years ago, # |

+11

Wow :) That's very short code. Thanks.

→ Reply

catlak_profesor_mfb

If you use unordered_map instead of map the complexity will be O(N), which is one of the crucial parts of the algorithm :)

adamant

9 years ago, # ^ |

From my experience if you need huge amount of associative containers, usual map is preferred. Also usual map is more useful when you need to traverse tree in lexicographical order.

Yes, you are right. And because the alphabet size is constant, using map is also constant. But you could use only one unordered_map<pair<int,int>,int> like this and have a guaranteed constant time data access. As you said though traversing the tree wouldn't be as good as it is with map.

_roCkY_

8 years ago, # |

-16

Could you please also write a small function to search if a string exists as a substring in the original string ?

zetaz

8 years ago, # ^ |

-17

This is related to an on-going contest :|

+47

I'm so tired of things like that. Someone asks for some really well-known stuff and then someone else tells us that OMG NOOO THIS IS RELATED TO ON-GOING CONTEST.

No, really. It is the problem of contest-maker to make some non-trivial problems so such well-known tricks won't be helpful.