Generalized Suffix Tree

Revision en3, by tenshi_kanade, 2016-09-28 06:38:59

Prologue:

OK, so I finally decided to learn Suffix Tree. It's a monster I'd never dared to look in the eyes, I had always gotten away with using Suffix Array with the tail between my legs until now. The thing is that some problems require a more efficient data structure, so I decided to leave my fears behind and learn it once and for all.

First of all, I read about it in a few books and some papers online. When I understood its final structure and how to work with it, I searched an implementation online (no, I didn't plan on coding everything from scratch). I found a good and efficient implementation on Geeks for Geeks, but unfortunately it was written in C and it worked with native types.

The tweaking and testing:

I worked with the code and tweaked it here and there. I removed unnecessary comments and upgraded the code to C++ to make it work with strings instead of char pointers. I also added some extra functionality that didn't originally come with the implementation (Longest Common Substring, Longest Repeated Substring, Occurence Count, etc.). I finally wrapped everything into a class and went on to testing.

Everything worked fine and very fast too. I solved a few problems, but then I came across some other ones that couldn't be solved 'cause I realized that the class I had in my hands worked only with one single string. A lot of problems that are solved with Suffix Trees require a different kind of data structure, one that supports working with multiple strings. That data structure, I learned, is called Generalized Suffix Tree. This is where the problems began.

Generalized Suffix Tree:

I searched online about it, but all the papers and articles I found built it by adding a different delimiter character for each word. The problem with this implementation is that you can work with a very small number of words. A lot of problems give you 500, 100 or even 104 strings. There aren't that many characters.

My idea was to use a single delimiter character for every word, but then do a DFS and count how many delimiters I pass along the way. When I reach a leaf, I know what word this suffix belongs to by the amount of delimiters I passed.

So the problem boils down to this: We have a tree with numbers written on the leaves. How do we efficiently compute for every inner node how many different numbers are written on the leaves in its subtree?

My workaround (the best idea that crossed my mind) was to do a DFS and maintain a set for each node with the numbers I found on its subtree. By merging smaller sets into bigger ones and upgrading pointers instead of moving everything from one place to another, I was able to achieve something in the lines of O(Nlog2N), but the problem is that Suffix Array + LCP + Range Query ends up being faster than this Generalized Suffix Tree. That's not very nice after all the effort I put into it. But I still have hopes, I think there has to be a better way out there.

Final question:

How do you guys implement a Generalized Suffix Tree and how do you efficiently compute for every inner node, how many words this suffix belongs to?

Thanks very much in advance.

Tags #trees, string suffix structures, suffix tree, data structures

History

 
 
 
 
Revisions
 
 
  Rev. Lang. By When Δ Comment
en3 English tenshi_kanade 2016-09-28 06:38:59 8
en2 English tenshi_kanade 2016-09-28 06:37:23 6 Tiny change: 'structure works and how t' -> 'structure and how t'
en1 English tenshi_kanade 2016-09-28 06:36:30 3324 Initial revision (published)