sdssudhu's blog

By sdssudhu, history, 5 years ago, In English

I was recently reading about aho corasick and suffix trees. I felt that suffix tree operations are a superset of aho corasick operations. Is my assumption correct or can aho corasick perform some kind of query/operation that cannot be performed by suffix trees.

  • Vote: I like it
  • +4
  • Vote: I do not like it

»
5 years ago, # |
  Vote: I like it +9 Vote: I do not like it

Suffix tree contains the suffixes of 1 string, Aho–Corasick trie contains multiple (perhaps different) strings and all suffix links and other techniques allow to solve different kind of problems that seems not to be solved easily with suffix tree only, for example, find the number of integers in [A...B], 1<=A<=B<=1e1000 such as they don't contain S1,S2,S3,...,SN as substrings (S1...SN consist of decimal digits).

  • »
    »
    5 years ago, # ^ |
      Vote: I like it +7 Vote: I do not like it

    Ok. I get it. So aho-corasick is useful when there are plenty of input strings.

    • »
      »
      »
      5 years ago, # ^ |
      Rev. 2   Vote: I like it +16 Vote: I do not like it

      A suffix tree and Aho-Corasick are based on different techniques. They can do similar stuff, like finding all appearances of a set of patterns in a given text.

      The main difference is, that you create a suffix tree over the text, and then find appearances of patterns in the tree. In contrary you create the Aho-Corasick for the patterns, and then iterate over the chars in the text over the trie and you find all appearances that end at the current position. Depending on your need one of the structures can have an advantage over the other.

      The structures can also be used to compute other tasks. However since one structure encodes just one string, and the other one a set of strings, each of the structures can compute tasks efficiently that the other structures cannot. E.g. the suffix tree can be used to compute the longest palindrome in the text, or the longest common substring. Or you can use Aho-Corasick to find the shortest string that contains all patterns in it.

      As a final note, if you create an Aho-Corasick trie with just one pattern, the structure of the trie is a path with some additional suffix links. It is exactly the same as the graph induced by the Prefix algorithm. And you can see the Aho-Corasick as an extension to the Prefix function to work with multiple strings. If you know the Prefix function, you will understand Aho-Corasick very easily, it will almost feel trivial. Suffix trees however (especially their linear construction) are much harder to understand.

      • »
        »
        »
        »
        5 years ago, # ^ |
          Vote: I like it 0 Vote: I do not like it

        Ok. I sort of get the differences between the 2 structures.

        One more thing is that I read about aho-corasick from cp-algorithms. In that for finding all strings from a given set in a text they have mentioned something about a concept called exit link. I couldn't understand that part properly. Can you explain it or even if you have a code for it that would help me.

        Thanks

        • »
          »
          »
          »
          »
          5 years ago, # ^ |
          Rev. 2   Vote: I like it +3 Vote: I do not like it

          Actually I wrote that article on cp-algorithms. (Well, I translated it from e-maxx.ru). I'll try to add a few nice pictures and an extended explanation in the article tomorrow.

          For now only something very short: During the construction of the trie you have marked all vertices that correspond to the end of a pattern and stored which patterns end in each vertex. Currently you are at vertex v (e.g. after processing the first x characters of your text). And you want to know all patterns end exactly at this position. You can find them by doing the following:

          while (v != root) {
              print pattern ending in v
              v = suffix_link(v)
          }

          It should be obvious that this prints all patterns. A pattern that appears at that location is a suffix of the text and by following suffix links we check exactly all possible occurences.

          Take a look at the following Aho-Corasick trie for the pattern aaa, aab, and a (red are positions where pattern end, blue are suffix links):

          asdf

          You currently processed the text xyzaaa and are at vertex 3. You print the pattern aaa, follow the link to 2, follow the link to 1 and print a and follow the link to 0. Notice that we visited 2 but actually accomplished nothing at that location, since no pattern ends at that vertex. Exit links are just suffix links that skip such redundant vertices. The exit link from 3 would point directly to 1. And therefore you are guaranteed to find all ans pattern that end at that position in the text in O(ans) time.

        • »
          »
          »
          »
          »
          5 years ago, # ^ |
          Rev. 2   Vote: I like it +8 Vote: I do not like it

          My implementation of Aho-Corasick is here: Github Notice, I don't compute the array go on the fly, but instead do a BFS after adding all strings, during which I generate all suffix and exit links (here called next_terminal).

          For reference, I learned the BFS approach from here.

          • »
            »
            »
            »
            »
            »
            5 years ago, # ^ |
              Vote: I like it 0 Vote: I do not like it

            Thanks for taking the time for the explanation. I will go through these.