tenshi_kanade's blog

By tenshi_kanade, history, 8 years ago, In English

Prologue:

OK, so I finally decided to learn Suffix Tree. It's a monster I'd never dared to look in the eyes, I had always gotten away with using Suffix Array with the tail between my legs until now. The thing is that some problems require a more efficient data structure, so I decided to leave my fears behind and learn it once and for all.

First of all, I read about it in a few books and some papers online. When I understood its final structure and how to work with it, I searched an implementation online (no, I didn't plan on coding everything from scratch). I found a good and efficient implementation on Geeks for Geeks, but unfortunately it was written in C and it worked with native types.

The tweaking and testing:

I worked with the code and tweaked it here and there. I removed unnecessary comments and upgraded the code to C++ to make it work with strings instead of char pointers. I also added some extra functionality that didn't originally come with the implementation (Longest Common Substring, Longest Repeated Substring, Occurence Count, etc.). I finally wrapped everything into a class and went on to testing.

Everything worked fine and very fast too. I solved a few problems, but then I came across some other ones that couldn't be solved 'cause I realized that the class I had in my hands worked only with one single string. A lot of problems that are solved with Suffix Trees require a different kind of data structure, one that supports working with multiple strings. That data structure, I learned, is called Generalized Suffix Tree. This is where the problems began.

Generalized Suffix Tree:

I searched online about it, but all the papers and articles I found built it by adding a different delimiter character for each word. The problem with this implementation is that you can work with a very small number of words. A lot of problems give you 500, 100 or even 104 strings. There aren't that many characters.

My idea was to use a single delimiter character for every word, but then do a DFS and count how many delimiters I pass along the way. When I reach a leaf, I know what word this suffix belongs to by the amount of delimiters I passed.

So the problem boils down to this: We have a tree with numbers written on the leaves. How do we efficiently compute for every inner node how many different numbers are written on the leaves in its subtree?

My workaround (the best idea that crossed my mind) was to do a DFS and maintain a set for each node with the numbers I found on its subtree. By merging smaller sets into bigger ones and upgrading pointers instead of moving everything from one place to another, I was able to achieve something in the lines of O(Nlog2N), but the problem is that Suffix Array + LCP + Range Query ends up being faster than this Generalized Suffix Tree. That's not very nice after all the effort I put into it. But I still have hopes, I think there has to be a better way out there.

Final question:

How do you guys implement a Generalized Suffix Tree and how do you efficiently compute for every inner node, how many words this suffix belongs to?

Thanks very much in advance.

  • Vote: I like it
  • +25
  • Vote: I do not like it

»
8 years ago, # |
  Vote: I like it +33 Vote: I do not like it

I confess I know nothing about suffix trees, so I'd love to see the end result of this if you wanted to show it. But for now I can answer the tree question:

We have a tree with numbers written on the leaves. How do we efficiently compute for every inner node how many different numbers are written on the leaves in its subtree?

You can do this in linear time. For each node, you can compute the number of different numbers by adding the number of different numbers in each of the children, minus the count of numbers in more than one child.

To count the repeating numbers, a number will only be repeating when it meets the same number for the first time. So for each leaf compute the LCA of this leaf and the last leaf with the same number, and subtract one from the count of the LCA. This should be very easy to implement using something like Tarjan offline LCA.

As a side note, the only problems I know that can only be solved by suffix trees are online problems for a single string (you can add letters to the back or remove from the front), so you might want to make sure your class can do that as well.

  • »
    »
    8 years ago, # ^ |
      Vote: I like it +5 Vote: I do not like it

    That's a wonderful idea! Thank you very much!

    I'll implement it and see how the runtime is improved. My class still needs some testing, but if I see that it ends up working as it should, I'll surely post it here for the community.

»
8 years ago, # |
  Vote: I like it 0 Vote: I do not like it

Why is big alphabet a problem? Use int instead of char.

  • »
    »
    8 years ago, # ^ |
      Vote: I like it 0 Vote: I do not like it

    I thought about that alternative too, but I feared it would be much slower and more cumbersome to use. I'll try with the approach mentioned above and then try to do it with integers and compare runtimes.

»
8 years ago, # |
Rev. 2   Vote: I like it +15 Vote: I do not like it

I think the following trick should work. Append a unique delimiter to the end of each string and after adding the string to the tree, set our current position in the tree to the root.

With suffix automata the same magic even without delimiters worked for me. But in case of suffix trees the delimiters are necessary as we have to create all terminal nodes.

UPD: Instead of just setting the position to the root you should go up by suffix links until the root and mark all nodes on this path as terminal for the current string.

»
8 years ago, # |
  Vote: I like it +10 Vote: I do not like it

In Dan Gusfield's book Algorithms on Strings, Trees and Sequences he explains how to do that in section 6.4 without adding extra characters between strings, I'm sorry but I can't explain it to you because I don't know how Ukkonen's algorithm works, and the explanation depends on it. I prefer Suffix Automaton for string problem, but I have to recognize that some problems look easier to see from the Suffix Tree 'angle'. Maybe you should read the section and then post your impressions here, or maybe someone else can do it.

  • »
    »
    8 years ago, # ^ |
      Vote: I like it 0 Vote: I do not like it

    My implementation uses Ukkonen's algorithm. In Wikipedia, it says that there's no need for extra characters because we can identify the word the suffix comes from by knowing the indices each word starts and ends. I understand that, but for queries like "How many words contain t as a substring?", we need to know how many different words are at each inner node's subtree. If I'm not wrong, what we use as delimiter characters does not change that.

    • »
      »
      »
      8 years ago, # ^ |
        Vote: I like it 0 Vote: I do not like it

      Well, in that case I think that ffao above give you the answer, other possible solution if the total of words is small, is to have a bitset in every node with the index of words that are descendants.

      • »
        »
        »
        »
        8 years ago, # ^ |
          Vote: I like it +5 Vote: I do not like it

        That was one of my first attempts :)

        But yes, I already implemented the solution proposed by ffao and it works really nice. I'll proceed to testing now, and when it's finished, I'll share the class.

        Thanks very much.