deepak1527's blog

By deepak1527, history, 5 years ago, In English

How to solve the following problem. Find the sum of the number of distinct characters in all the distinct substrings of S. 1<=|S|<=100000. S="aabb"
Set of distinct sub-strings of "aabb" = {a,b,aa,ab,bb,aab,abb,aabb} sum = 1 + 1 + 1 + 2 + 1 + 2 + 2 + 2 = 12.

Thanks!

  • Vote: I like it
  • 0
  • Vote: I do not like it

»
5 years ago, # |
  Vote: I like it +28 Vote: I do not like it

Can you give the link to the problem so that I can verify that it's not a part of some ongoing contest?

  • »
    »
    5 years ago, # ^ |
      Vote: I like it 0 Vote: I do not like it

    No it's not a part of any ongoing contest, this problem was asked in coding round of a company and that round has ended.

»
5 years ago, # |
  Vote: I like it +19 Vote: I do not like it

I would like to propose a solution. It would be nice if someone can verify it.

The main idea is that for each character, we have to figure out its contribution to the final sum i.e. for each character $$$S_{i}$$$ in our string, we have to find how many distinct substrings exist such that $$$S_{i}$$$ is the first occurrence of its kind in that substring? This can be answered using a Suffix Automaton. (If we didn't have to count distinct substrings, the problem would have an easier solution.)

So we will build a Suffix Automaton on the given string. Lets name the starting node of the automaton as $$$t_{0}$$$. Consider an edge from node $$$u$$$ to node $$$v$$$ using the character $$$c$$$. The contribution of $$$c$$$ to the final sum = $$$dp[u][c] * dp[v]$$$ where,

$$$dp[u][c]$$$ = The number of paths from $$$t_{0}$$$ to $$$u$$$ such that we never use an edge with character $$$c$$$

$$$dp[v]$$$ = The number of paths that begin from node $$$v$$$ and end at any other node.

The above two dp's can be calculated fairly easily. The final answer should be the sum of the contribution of each edge in the Suffix Automaton.

Since you mentioned that this question was asked in the coding round of a company, it should have an easier solution.

»
5 years ago, # |
  Vote: I like it 0 Vote: I do not like it

We can solve this with a suffix array by recognizing that each unique substring can be represented as a prefix of a suffix.

Let's construct a suffix array. Now we'll walk forward, starting with the first string in the suffix array. For the first string in the suffix array all of the prefixes of this suffix are valid. For the next string only the prefixes longer than the longest common prefix between it and the last string are unique. All the others were counted as part of some earlier suffix. This will take nlogn time to construct the suffix array, nlogn time to find the longest common prefix between every two suffixes, and finally n time to sum all the values for each suffix.