
By Anthony_Smith, history, 2 years ago, In English

An Aho-Corasick Automaton can solve the problem, "Given some text T and a set of patterns P, find every occurrence (every index) of every pattern in T", in O(total number of occurrences + |T| + total length of all patterns) time. My question is, why is the Aho-Corasick Automaton necessary when we can solve this problem with Suffix Arrays too? Are suffix arrays too slow in competition problems? Or am I simply misunderstanding the inner workings of these concepts?
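For concreteness, here is a minimal Aho-Corasick sketch (Python; the class and method names are my own, not from any library). It builds the trie with suffix links and reports every `(start_index, pattern_index)` occurrence in one left-to-right scan:

```python
from collections import deque

class AhoCorasick:
    """Multi-pattern matcher: O(total pattern length) to build,
    O(|T| + number of occurrences) to scan a text T."""

    def __init__(self, patterns):
        self.patterns = patterns
        self.goto = [{}]    # trie edges: goto[node][char] -> child node
        self.link = [0]     # suffix links
        self.out = [[]]     # indices of patterns recognized at this node
        for idx, p in enumerate(patterns):
            node = 0
            for ch in p:
                if ch not in self.goto[node]:
                    self.goto.append({})
                    self.link.append(0)
                    self.out.append([])
                    self.goto[node][ch] = len(self.goto) - 1
                node = self.goto[node][ch]
            self.out[node].append(idx)
        # BFS over the trie to fill in suffix links, shallow nodes first
        queue = deque(self.goto[0].values())
        while queue:
            v = queue.popleft()
            for ch, u in self.goto[v].items():
                f = self.link[v]
                while f and ch not in self.goto[f]:
                    f = self.link[f]
                if ch in self.goto[f] and self.goto[f][ch] != u:
                    self.link[u] = self.goto[f][ch]
                # inherit every output reachable through the suffix link
                self.out[u] += self.out[self.link[u]]
                queue.append(u)

    def find_all(self, text):
        """Return (start_index, pattern_index) for every occurrence."""
        hits, node = [], 0
        for i, ch in enumerate(text):
            while node and ch not in self.goto[node]:
                node = self.link[node]
            node = self.goto[node].get(ch, 0)
            for idx in self.out[node]:
                hits.append((i - len(self.patterns[idx]) + 1, idx))
        return hits
```

For example, `AhoCorasick(["he", "she", "his", "hers"]).find_all("ushers")` reports "she" at index 1 and "he" and "hers" at index 2.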

Valid (I think) method using Suffix Arrays: Construct a suffix array over T in O(|T|log|T|) time. Then, for each pattern, we can find all of its occurrences in O(pattern length * log|T| + number of occurrences) time by binary searching through the suffix array (see CSES Finding Patterns), then looping through all matching suffixes and collecting their starting indices (the indices where the pattern occurs). So, the final time complexity would be O(|T|log|T| + total pattern length * log|T| + total number of occurrences). This is pretty much the Aho-Corasick Automaton's time complexity multiplied by log|T|, which doesn't seem like all that big a difference to me.


»
2 years ago, # |

Your suffix array solution requires $$$O(|T|)$$$ memory, whereas the Aho-Corasick automaton uses $$$O(|P|)$$$. Memory footprint matters for running time as well, since working with smaller, co-located regions of memory is usually faster thanks to caching.

»
2 years ago, # |

Because the Aho-Corasick automaton has some interesting properties of its own: there are problems that can only be solved with it and not with suffix structures, and vice versa.

»
2 years ago, # |

Your logic is sound, but there are certain counterpoints to your conclusion.

"This is pretty much the Aho-Corasick Automaton's time complexity multiplied by log|T|, which doesn't seem all that much of a big difference to me."

This does, actually, make a difference. We sometimes neglect the logarithmic factor of sorting because sorting is fast, but it's only fast on relatively small arrays and when cache locality is high. This is the case when sorting an array of $$$10^5$$$ integers, which is a popular setting for us competitive programmers, but it rarely holds in practice.

You can, perhaps, get rid of the logarithmic factor by employing a suffix automaton instead, but that's still going to be slower for reasons described below.

An adjacent reason is that both algorithms are online, after a fashion:

  • In the case of the Aho-Corasick automaton, the patterns are fixed and the searched string is handled character by character;
  • In the case of the suffix structures, the searched string is fixed beforehand and the patterns are handled in order.

The former is useful when you have a large file and you want to find a small substring in it, or one of several substrings. If you have a one-gigabyte file, you can't afford to build a suffix array or an automaton over it, but you can read its characters from left to right and perform a simple automaton step for each character. As far as I am aware, this is how grep -F worked in the early days, before the Aho-Corasick automaton was replaced with a more generic regex engine.
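To make that per-character step concrete, here is the single-pattern analogue of the same idea, a KMP failure-function automaton (Python sketch; grep -F's multi-pattern case uses Aho-Corasick, but the streaming structure is identical):

```python
def kmp_failure(p):
    # failure[i] = length of the longest proper border of p[:i+1]
    f = [0] * len(p)
    k = 0
    for i in range(1, len(p)):
        while k and p[i] != p[k]:
            k = f[k - 1]
        if p[i] == p[k]:
            k += 1
        f[i] = k
    return f

def stream_find(chars, p):
    # One automaton step per character, O(|p|) memory in total,
    # so `chars` can be any iterable, e.g. bytes of a huge file.
    f = kmp_failure(p)
    state = 0  # length of the longest prefix of p matched so far
    for i, ch in enumerate(chars):
        while state and ch != p[state]:
            state = f[state - 1]
        if ch == p[state]:
            state += 1
        if state == len(p):
            yield i - len(p) + 1  # a match ends here; report its start
            state = f[state - 1]
```

Nothing here ever looks at more than one input character and O(|p|) state at a time, which is exactly why the searched text can be arbitrarily large.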

The latter is useful when... huh, I can't even think of a practical use case. You can probably make up something, but an automaton would be a more robust solution either way.