RedNextCentury's blog

By RedNextCentury, history, 4 years ago, In English

Hello,

Recently my friend encountered the following problem during a coding interview:

You are given a string $$$S$$$ and a set $$$T$$$ of $$$n$$$ strings. You need to choose a multiset from $$$T$$$ (you can choose the same string any number of times) and then concatenate the strings from the multiset in any order to obtain the string $$$S$$$. What is the maximum possible size of such a multiset?

There is a simple $$$O(|S|^2)$$$ solution using DP + trie. First add all strings from $$$T$$$ to a trie. Let $$$DP[i]$$$ be the answer for $$$S_{i...|S|-1}$$$. To calculate $$$DP[i]$$$, we traverse the trie according to the substring $$$S_{i...|S|-1}$$$, and try to update $$$DP[i]$$$ for each string we encounter in the trie.

Assuming the sum of the lengths of strings from the set $$$T$$$ is $$$X$$$, there is also a solution with complexity $$$O(|S| * sqrt(X))$$$:

Spoiler

Is there a faster solution to this problem? I've tried for a long time without any success.

Any help would be appreciated, thanks!

  • Vote: I like it
  • +62
  • Vote: I do not like it

»
4 years ago, # |
  Vote: I like it +38 Vote: I do not like it

You can use Suffix Automaton to somewhat speedup the squared DP solution:

Build a suffix automaton for string $$$S$$$ in $$$O(|S|)$$$ complexity, and for each string $$$T_j$$$ iterate over it and follow the transitions of each letter starting from the root (while you can, or else the current string is not a substring of $$$S$$$) and store $$$|T_j|$$$ in its corresponding state, then we can follow suffix links from each prefix of $$$S$$$ and iterate over all lengths of $$$T_j$$$ on the way (which would be suffixes of the current prefix of $$$S$$$) and update $$$dp_i$$$, this way we'll iterate over all occurrences of strings $$$T$$$ in $$$S$$$. However, we need to calculate the nearest state (reachable through suffix links) which has at least one string from $$$T$$$ for each state, because we don't want to iterate over states which don't represent any string from $$$T$$$, we can do that running a simple $$$DFS$$$ from the root using suffix links.

Now assuming $$$|S| <= 10^5$$$, $$$\sum_{i=0}^{|T|-1} |T_i| <= 10^5$$$ sum of all occurrences of $$$T_i$$$ would be around $$$4 * 10^7$$$ which is slightly worse than $$$O(|S| * sqrt(\sum_{i=0}^{|T|-1} |T_i|))$$$, I just thought this solution was nice :)

  • »
    »
    4 years ago, # ^ |
      Vote: I like it +25 Vote: I do not like it

    Thanks, that's a really nice solution! Suffix Automaton is amazing :D

»
4 years ago, # |
  Vote: I like it +25 Vote: I do not like it

I don't think there is a faster solution. But there is a simpler one just use aho-corasick to find the occurrences and at each index get the array of lengths $$$L$$$ where $$$S[i-L[i] + 1 \dots i]$$$ exists in the set and just find the minimum $$$DP[i]$$$.

»
4 years ago, # |
Rev. 2   Vote: I like it 0 Vote: I do not like it

Somebody should really post this somewhere. Somewhere where you can submit code for it.