adamant's blog

By adamant, 10 years ago, translation, In English

Hi everyone! I was always wondered how cleverly interwoven so-called string algorithms. Six months ago I wrote here article about the possibility of a rapid transition from Z-function to prefix-function and backward. Some experienced users already know that such transitions are possible between the more complex string structures — suffix tree and suffix automaton. This transition is described at e -maxx.ru. Now I would like to tell you about such data structure as a suffix tree, and share the simple enough (theoretically) way of its fast building — obtain a suffix tree from suffix array .

I remind you that a suffix tree is a prefix tree, that contains all suffixes of specified string. In the simplest implementation it will require O(n2) time and memory — we simply add in the trie all suffixes one by one until we get what we want. In most cases it's too large. We will try to do something with it.

  1. At first we will reduce the memory complexity to O(n). In order to do this we need to get the following idea: if we have a group of edges that are connected in series and do not have branches, we can combine them into one, which will be presented with some substring, but not a single character. Thus, we obtain a compact trie (also known as radix tree or patricia tree). But that's not all. Finally, we note that we have no need to store the entire substring on the edge, so we can store only the indices of its beginning and its ending in the source string. That is what will give us the desired linear complexity. Really, vertices in our tree will now appear only in places separating lexicographically consecutive suffixes, and there will be no more than n - 1 of such places.

  2. Finally, let's reduce time complexity to O(n). In order to do this, we will approach the following strategy:
    1) Add to the trie lexicographically minimal suffix.
    2) For each of the next suffixes we will rise to lcp[i] and continue building trie from here.

    Surprisingly, this will be enough. In fact, the actions we will do are identical to the depth-first traversal of tree, which obviously runs in O(n).

Wait a minute, but in the title is written "building in O(nlogn)", and you got O(n), wtf?

Indeed, in fact, if we have a lcp array, suffix tree can already be built in O(n). However, there still remains one problem — we need to get in some way lcp array. And here we come to the aid of suffix array, which can be used in order to get lcp. Relatively simple method for this is described on the site e-maxx.ru. We also can get it in O(n) using Kasai algorithm. Finally, we can combine it with some linear suffix array algo in order to get suffix tree with O(n) complexity.

Advantages of this method:

  • Simple to understand.
  • Acceptable time and memory complexity.

disadvantages:  

  • Algorithm only works offline.  
  • The amount of code. It took me almost 300 lines (100 of which — for a suffix array), and then the whole evening to do something that will work. It was the first time for me building the suffix tree, so I can not say for sure whether it is possible to implement this algorithm easily.

Here, you can also see an example of code that does all these atrocities to create a suffix tree. Good luck to everyone and see you soon, I hope the article will be interesting :)

To check the correctness of the code were used the following tasks:
1393 — validation of building lcp.
1590 — validation of building suffix tree.

  • Vote: I like it
  • +82
  • Vote: I do not like it

»
10 years ago, # |
Rev. 2   Vote: I like it 0 Vote: I do not like it

@adamant:Very Nice Tutorial.If possible can you write the comments in English so that other users are able to understand

»
10 years ago, # |
Rev. 8   Vote: I like it +10 Vote: I do not like it

There's also the possibility to build the suffix array in and then construct the suffix tree from that.

EDIT Nevermind, I see you are using that same idea already :) Although I don't see why you would need 100 LoC to implement suffix array + LCP in . Our implementation is a mere 24 lines, but it's since it does not contain a linear time sort.

  • »
    »
    10 years ago, # ^ |
    Rev. 2   Vote: I like it +5 Vote: I do not like it

    Your implementation has memory complexity in spite of my . But anyway, your way to calculate suffix array is kinda interesting, thanks a lot!

  • »
    »
    9 years ago, # ^ |
      Vote: I like it 0 Vote: I do not like it

    Isn't your suffix array build nlog2(n)? log(n) iteration in line 8 and nlog(n) for sorting in line 10. Correct me if i'm wrong.

    • »
      »
      »
      9 years ago, # ^ |
        Vote: I like it 0 Vote: I do not like it

      Yes I changed the code in the meantime. It used to use LSB radix sort, but that seems slower in practice.

      • »
        »
        »
        »
        9 years ago, # ^ |
          Vote: I like it 0 Vote: I do not like it

        If you really need better performance than this, which sometimes becomes necessity, you can use the one that implements bucket sort. Otherwise yours is way better, both to understand and code.

»
10 years ago, # |
  Vote: I like it 0 Vote: I do not like it

I have a question.

I build Suffix array (SA) and longest common prefix array (LCP) of a string ending in #, now I proceed to create suffix tree.

I can separate SA in buckets for first character, then, for each bucket I can insert first suffix in the tree but when I try to insert second item, can I get the node where I have to put new item in O(1)? I can only think in binary search saving each instance to inserted nodes for each item.

Example:

str = abaabbaaa#
idx:  0123456789
idx  SA      suffix     LCP
 0   9       #          N/A
 1   8       a#          0
 2   7       aa#         1
 3   6       aaa#        2
 4   2       aabbaaa#    2
 5   0       abaabbaaa#  1
 6   3       abbaaa#     1
 7   5       baaa#       0
 8   1       baabbaaa#   3
 9   4       bbaaa#      1

assume we inserted suffixes 0, 1, 2, 3 now, when I have to insert sa[4], what is the best way for getting node x (see picture)?

I hope my question is clear, thanks.

  • »
    »
    10 years ago, # ^ |
      Vote: I like it +4 Vote: I do not like it

    You're such an artist X_X

  • »
    »
    10 years ago, # ^ |
      Vote: I like it 0 Vote: I do not like it

    can I get the node where I have to put new item in O(1)?

    I don't know exactly but it looks like you can't. But as I said in the entry, if you will just move up on tree every time, it will be amortized because of its identity to dfs on tree. If you want to get moving to the next node non-amortized, you can keep for each node its ancestors that are pows of 2, i.e. 1, 2, 4, 8, 16, etc. Then you can get to the needed node in using this information.

    • »
      »
      »
      10 years ago, # ^ |
        Vote: I like it 0 Vote: I do not like it

      I'm writing my algorithm but I get struggle in that part.

      What did you use in your implementation?

      • »
        »
        »
        »
        10 years ago, # ^ |
          Vote: I like it 0 Vote: I do not like it

        After adding some node to the tree I just move upward to the lcp[i] position and add next node there.

        • »
          »
          »
          »
          »
          10 years ago, # ^ |
            Vote: I like it 0 Vote: I do not like it

          After lots of debugging, I finally get an implementation (java) that seems to work (at least with my tests).

          But I can't pass TLE in this

          I wonder if is possible to use this trick in java.

          My suffix array is O(n log n)

          • »
            »
            »
            »
            »
            »
            10 years ago, # ^ |
              Vote: I like it 0 Vote: I do not like it

            Can you show your code?

            • »
              »
              »
              »
              »
              »
              »
              10 years ago, # ^ |
                Vote: I like it 0 Vote: I do not like it

              link

              It have comments.

              • »
                »
                »
                »
                »
                »
                »
                »
                10 years ago, # ^ |
                  Vote: I like it 0 Vote: I do not like it

                Well, firstly I see that you always duplicate the node when inserting it. Maybe, you should do it only when you need (i.e. you shouldn't duplicate the node if it will be exact son). Also maybe you should try map<char,int> to get O(nlogk) instead of O(nk)

  • »
    »
    10 years ago, # ^ |
    Rev. 2   Vote: I like it 0 Vote: I do not like it

    I just understood that my entry has no useful pictures. Well, if somebody is interested in them, here is picture, showing suffix tree of k790alex's string made via GraphViz. * after the string means that this node can be the last in some suffix.