Noam527's blog

By Noam527, history, 6 years ago, In English

I have 2 questions in mind, just out of curiousity :). They both consider tries (not compressed). Also note — by a trie's size I'm not considering its root (even though, it does not matter for the purpose of this problem).

1) For some 2 natural numbers N and K (suppose K ≤ N), what is the maximal size of a trie that can be achieved if you take some string S of length N, with the size of its language (number of distinct characters it can contain) being K, and insert into the trie all its suffixes (so, a suffix trie)?

For instance, if N = 2, K = 2, we can form the string "ab", which makes the trie of size 3. On the other hand, if N = 10, K = 1, the only string we can form is "aaaaaaaaaa" which makes the trie size 10.

2) How do you form such a string with maximal trie size? An algorithm or an idea, both are acceptable.

Thanks.

  • Vote: I like it
  • +11
  • Vote: I do not like it

»
6 years ago, # |
  Vote: I like it 0 Vote: I do not like it

Auto comment: topic has been updated by Noam527 (previous revision, new revision, compare).

»
6 years ago, # |
  Vote: I like it 0 Vote: I do not like it

For K = 1 the problem is trivial. For K > 1 the string anbn will produce an asymptotically largest trie of size at least n2 (you basically get a chain from bn for every possible prefix of an). Were you interested in the optimal answer? From Wiki it seems like Fibonacci word gives you the worst case for K = 2 and I wonder if some generalization of it leads to the answer for other cases.

»
6 years ago, # |
Rev. 2   Vote: I like it +7 Vote: I do not like it

I believe this size is .

Let's show it if N = Km for some integer (I think it extends in the general case but it's more cumbersome).

  • First we show that we can reach that number. Let S be a chain such that the substrings Si... Si + m - 1 of length m (potentially looping around the string) are all distinct. This can be reduced to a Eulerian cycle problem if we take words of length m - 1 as vertices, which proves its existence and gives an efficient algorithm for construction. Then we know that the suffixes will be separate in the trie from their m-th character onwards, so we have at least nodes as desired.
  • Then we show that we can't do any better. Clearly there are at most 1 + K + ... + Km = O(N) nodes at the levels  ≤ m, and because of the length of the suffixes we have nodes at most at the levels lower than that. So indeed we're bounded by nodes.

Maybe it's possible to refine this approach and eliminate the O(N) imprecision, but I'm already happy with this.