BumbleBee's blog

By BumbleBee, 5 years ago, In English

Given a string $$$S$$$ , I want to find the number of different substrings of each length in the range $$$[1,|S|]$$$.

I can do it using LCP array which can be constructed using the Suffix Array of the given string. But I am trying to perform this task using Suffix Automaton.

Is it possible to do this using Suffix Automaton? If yes, then how?

  • Vote: I like it
  • +5
  • Vote: I do not like it

| Write comment?
»
5 years ago, # |
Rev. 2   Vote: I like it 0 Vote: I do not like it

Here's my go at this. If anyone finds there to be something wrong, please reply to this comment. The terminology I'm using is w.r.t the SA article on cp-algorithms.com. A state in SA corresponds to a no. of substrings (prefixes of some suffix). It contains all substrings in a contiguous range of length $$$(len(link(v)), len(v)]$$$. So, you can perform a dfs on the SA and for every vertex $$$v$$$, you can add $$$1$$$ to the count of all lengths of substrings from $$$[len(link(v))+1, len(v)]$$$.

  • »
    »
    5 years ago, # ^ |
    Rev. 2   Vote: I like it 0 Vote: I do not like it

    At first, I tried this way too. But it turns out that it won't work in all cases. Suppose, you have reached a state that has a child which is already visited. If you skip this child state you will miss some paths/substrings. If you don't skip visited nodes, the complexity won't be $$$O(N)$$$ anymore. In my case, the length of the string can be $$$10^6$$$ and I am trying to perform this task in $$$O(N)$$$.

    UPD: Your idea is correct. I thought I will miss some substrings by skipping visited states. But that won't happen if I count for all lengths in the range $$$[len(link(v))+1,len(v)]$$$ for all states $$$v$$$.

    • »
      »
      »
      5 years ago, # ^ |
        Vote: I like it 0 Vote: I do not like it

      Just to ensure that the problem is not from an ongoing contest. Can you please post the problem link?

    • »
      »
      »
      5 years ago, # ^ |
        Vote: I like it +2 Vote: I do not like it

      Here is my accepted solution using the idea I described above. There must be a bug in your implementation. I spent an hour finding out in mine. I suggest you to thoroughly read the article that I linked in the answer above. A node may correspond to more than one substring, but in that case, a smaller substring will always be a suffix of the larger one. So, in performing the step of adding $$$1$$$ for all substrings having length in the range $$$[len(link(v)) + 1, len(v)]$$$, we are actually considering all substrings that the current state corresponds to and no substring ending at the current state is missed.