1k_trash's blog

By 1k_trash, 10 years ago, In English

Hi everyone!

A while ago I've decided to learn Aho-Corasick algorithm and found the following problem at spoj.com:

  • Input to the program consists of a series of lines. The first line contains the string M (no more than 100000 characters long). The next line contains an integer N (<1000) the number of query strings. Each of the next N lines contain a string S (each of which is no more than 2000 characters long). Output should consist of N lines each with a character 'Y'/'N' indicating whether the string S is a substring of String M or not.

I've implemented Aho-Corasick algo and received CRASH. I think, it's a normal situation for such input size.

Now I'd like to know, is this problem solvable using Aho-Corasick or only suffix tree fits in given constraints?

Thanks.

  • Vote: I like it
  • 0
  • Vote: I do not like it

»
10 years ago, # |
  Vote: I like it +1 Vote: I do not like it

easy with suffix automaton/array

  • »
    »
    10 years ago, # ^ |
      Vote: I like it 0 Vote: I do not like it

    Yup, thanks, but i want to train my Aho-Corasick skills!

»
10 years ago, # |
  Vote: I like it +1 Vote: I do not like it

I haven't looked through all of your code, but it seems that maximal size of Aho-Corasick automaton is
2000·1000 = 2·106 but you allocate memory only for 1.1·106 states which shouldn't be enough.

»
10 years ago, # |
  Vote: I like it +9 Vote: I do not like it

I find it hard to believe that you could just declare a structure with 2·106 nodes, each containing an array with 260 elements, a map<> and a bitset with 103 bits, and would not end up receiving MEMLIMIT. There should be a separate signal for this. MEMWOOT or something :D

By my calculations, it should need gigabytes of memory.

You should think about some improvements of your algorithm. You can get rid of the bitset completely by adding one post-AhoCorasick traversal. Also, why 260? I believe most ASCII characters won't appear in the input... which also removes any need for map<>. You can also try to store sons differently in this case.

Still, fitting into time and memory might be tough with test data this large. SPOJ is not exactly known for loose constraints. In both time and memory.