its_aks_ulure's blog

By its_aks_ulure, 4 years ago, In English

I have been preparing some problems for a college contest over the past few weeks and came up with this problem idea.

You are given $$$N$$$ words, each consisting of lowercase English letters and/or the special character '#'.
Every '#' in the words is replaced, independently and uniformly at random, with a letter from 'a' to 'z', and a trie is then built from the resulting words. Find the expected number of nodes in the trie.
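
A hedged brute-force reference may make the quantity concrete (the function name `expected_trie_nodes` is mine, not part of the problem): it enumerates every replacement of the '#' characters, so it runs in time exponential in their count and is only meant for sanity-checking small cases.

```python
from itertools import product
from string import ascii_lowercase

def expected_trie_nodes(words):
    # Positions of all '#' characters across the words.
    slots = [(i, j) for i, w in enumerate(words)
                    for j, c in enumerate(w) if c == '#']
    total = cases = 0
    # Enumerate every completion, build its trie, and average the node counts.
    for letters in product(ascii_lowercase, repeat=len(slots)):
        filled = [list(w) for w in words]
        for (i, j), c in zip(slots, letters):
            filled[i][j] = c
        root, nodes = {}, 0  # trie as nested dicts; the root is not counted
        for w in filled:
            node = root
            for ch in w:
                if ch not in node:
                    node[ch] = {}
                    nodes += 1
                node = node[ch]
        total += nodes
        cases += 1
    return total / cases

print(expected_trie_nodes(["ab", "ac", "#"]))  # 3 + 25/26 = 3.9615...
```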

I want to know whether there exists a polynomial-time solution for this problem.

Thanks


»
4 years ago, # |

Bump! Can anyone help me?

»
4 years ago, # |

Suppose we are given an ordered list of strings $$$S$$$. Assume the words are inserted in the order in which they appear in $$$S$$$. By linearity of expectation, the final answer is

$$$\sum\limits_{i=1}^{n} \sum\limits_{j=1}^{|S_i|} P(i, j)$$$

where $$$P(i, j)$$$ is the probability that character $$$j$$$ of string $$$S_i$$$ is not already in the trie when it is added. Without loss of generality, assume that $$$S_{ij}$$$ is not #, since if it is, we can just add up the probabilities for all characters a-z and divide by $$$26$$$. (edit: this isn't necessary; the algorithm below works perfectly fine if $$$S_{ij}$$$ is #)

The probability that $$$S_{ij}$$$ is not in the trie is simply the product, over all $$$k<i$$$, of the probabilities that $$$S_i$$$ does not share a prefix of length $$$j$$$ with $$$S_k$$$. For fixed $$$i,j,k$$$, we just do casework on the two prefixes. If they disagree at some position where both strings have fixed letters, the probability of disagreement is automatically $$$1$$$. Otherwise, the probability that they end up agreeing is $$$\frac{1}{26}$$$ to the power of the number of indices where at least one string has a #. Subtract from $$$1$$$ to get the probability that they disagree.
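
A minimal sketch of this pairwise casework (the helper name is mine): it returns the probability that the length-$$$j$$$ prefixes of two strings coincide after the random replacement, so the claimed $$$P(i, j)$$$ would be the product of one minus this quantity over all $$$k<i$$$.

```python
def prob_prefixes_agree(s, t, j, a=26):
    # Probability that the length-j prefixes of s and t end up equal after
    # every '#' is replaced independently and uniformly by one of a letters.
    if len(s) < j or len(t) < j:
        return 0.0  # a shorter string can never share a prefix of length j
    wildcards = 0
    for x, y in zip(s[:j], t[:j]):
        if x == '#' or y == '#':
            wildcards += 1  # at least one '#': agreeing here costs a factor 1/a
        elif x != y:
            return 0.0      # two different fixed letters: agreement is impossible
    return (1 / a) ** wildcards
```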

Let $$$m$$$ be the maximum string length, and let $$$a=26$$$. The total runtime is $$$O(an^2m)$$$, which I'm sure can be optimized further, but it is indeed a polynomial-time solution. This is also just a rough sketch of a solution and may have errors; I welcome corrections in the replies.

  • »
    »
    4 years ago, # ^ |

    I don't believe the claim "The probability that $$$S_{ij}$$$ is not in the trie, is simply the product of the probabilities that $$$S_i$$$ does not share a prefix of length $$$j$$$ with $$$S_k$$$, for all $$$k<i$$$", because the events could be dependent.

    Consider the case where $$$S_1=\text{ab}$$$, $$$S_2=\text{ac}$$$, $$$S_3=\text{#}$$$ and $$$j=1$$$. Both earlier strings begin with 'a', so the two events coincide: the true probability is $$$\frac{25}{26}$$$, while the product formula gives $$$\left(\frac{25}{26}\right)^2$$$.
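
    To make the dependence concrete, a quick enumeration over the 26 replacements (illustrative only):

```python
from string import ascii_lowercase

# S1 = "ab" and S2 = "ac" both put the child 'a' under the root, so
# "S3 avoids S1's prefix" and "S3 avoids S2's prefix" are the same event.
true_p = sum(c != 'a' for c in ascii_lowercase) / 26
print(true_p)          # 25/26 ~ 0.96154, the true P(3, 1)
print((25 / 26) ** 2)  # ~ 0.92456, what the independence assumption predicts
```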

    • »
      »
      »
      4 years ago, # ^ |

      Yes, I see. So we would actually have to iterate over all possible configurations of $$$S_i$$$ achievable by replacing # with a-z, which is $$$O(a^m)$$$ in the worst case.

      If we ignore the case of having a # in the strings before $$$S_i$$$, we can permute the indices up to $$$j$$$ to float all of the # characters of $$$S_i$$$ to the beginning, then delete every string before index $$$i$$$ that disagrees with a fixed character of $$$S_i$$$, since such strings contribute nothing to $$$P(i, j)$$$. So, without loss of generality, we can assume $$$S_i$$$ contains nothing but # characters, and the problem reduces to: given a bunch of strings of length $$$j$$$, what is the probability that a uniformly random string of length $$$j$$$ is not equal to any of them? This is easy to answer: it is $$$1 - \frac{\text{number of distinct strings}}{a^j}$$$.
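
      A small sketch of this reduction (the function name is mine, and it assumes the strings before $$$S_i$$$ contain no '#'): rather than physically permuting indices, it drops earlier strings that clash with a fixed character of $$$S_i$$$ and counts the distinct letter tuples the survivors induce at the # positions.

```python
def prob_new_node(prev, s_i, j, a=26):
    # P(i, j): probability that the length-j prefix of s_i differs from the
    # length-j prefix of every string in prev (none of which contains '#').
    wild = [p for p in range(j) if s_i[p] == '#']
    survivors = set()
    for t in prev:
        if len(t) < j:
            continue  # too short to share a length-j prefix
        if all(s_i[p] == '#' or s_i[p] == t[p] for p in range(j)):
            survivors.add(tuple(t[p] for p in wild))
    # s_i's wildcards are uniform over a**len(wild) tuples; the prefix collides
    # with the trie exactly when it hits one of the survivor tuples.
    return 1 - len(survivors) / a ** len(wild)

print(prob_new_node(["ab", "ac"], "#", 1))  # 25/26, matching the counterexample above
```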

      So, if we ignore the case of # occurring in other strings, then we can solve it in polynomial time. We have a solution for when $$$S_i$$$ contains no # character and all other strings contain # characters, and a solution for when $$$S_i$$$ contains # characters and all other strings contain none. Is it possible to combine these two solutions somehow to get an answer for the general case?

      I think the answer is no: if we try to perform the same trick of separating $$$S_i$$$ into # components and non-# components, we end up with potentially $$$2^n$$$ duplicate-deletion scenarios, as opposed to only one in the previous case. However, this is still an improvement over the $$$O(a^m)$$$ time we had before. So, I now suspect that the problem is NP-hard.

»
4 years ago, # |

I would love to see someone add this problem to a judge (or maybe it already exists?).

»
4 years ago, # |

I would suspect this problem is NP-hard, just like most counting problems that boil down to fixing some order of elements. I have no idea how to (dis)prove that, though, so I might be terribly wrong here.