### 1k_trash's blog

By 1k_trash, 6 years ago, translation

Sup, guys!

Let's consider the following problem: we have a set of strings (initially empty) and must handle two types of queries:

1. Add a new string to the set.
2. Given an input string, check whether any string from the set occurs in it as a substring.

That's all. Of course, time/memory consumption should be as low as possible. Right now I don't have a better idea than "Okay, let's use the Aho-Corasick algorithm, and each time we face a query of type 1, destroy the old automaton and build a new one from scratch".
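A minimal sketch of this rebuild-from-scratch baseline (class and method names are mine, not from the post; a real implementation would use array-based transitions for speed):

```python
from collections import deque

class AhoCorasick:
    """Aho-Corasick automaton built once from a fixed pattern set."""
    def __init__(self, patterns):
        self.goto = [{}]       # per-node transition maps
        self.fail = [0]        # suffix (failure) links
        self.out = [False]     # True if a pattern ends at or "through" this node
        for p in patterns:
            self._insert(p)
        self._build()

    def _insert(self, p):
        node = 0
        for ch in p:
            if ch not in self.goto[node]:
                self.goto.append({})
                self.fail.append(0)
                self.out.append(False)
                self.goto[node][ch] = len(self.goto) - 1
            node = self.goto[node][ch]
        self.out[node] = True

    def _build(self):
        # BFS so that fail links of shallower nodes are ready first
        q = deque(self.goto[0].values())
        while q:
            u = q.popleft()
            self.out[u] = self.out[u] or self.out[self.fail[u]]
            for ch, v in self.goto[u].items():
                f = self.fail[u]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[v] = self.goto[f].get(ch, 0)
                q.append(v)

    def contains_any(self, text):
        """Does any pattern occur in text as a substring?"""
        node = 0
        for ch in text:
            while node and ch not in self.goto[node]:
                node = self.fail[node]
            node = self.goto[node].get(ch, 0)
            if self.out[node]:
                return True
        return False

class NaiveDynamicMatcher:
    """The approach from the post: full rebuild on every insertion."""
    def __init__(self):
        self.patterns = []
        self.ac = AhoCorasick([])

    def add(self, p):
        self.patterns.append(p)
        self.ac = AhoCorasick(self.patterns)  # O(total pattern length) rebuild

    def query(self, text):
        return self.ac.contains_any(text)
```

Each `add` costs time proportional to the total length of all patterns so far, which is exactly the pain point the comments below try to address.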


 » 6 years ago, # |   +1 A straightforward extension of your algorithm would be to partition the pattern set into groups of size O(√p), where p is the number of patterns. Then on each update you only need to rebuild the automaton for O(√p) pattern strings. Queries will also be slower by a factor of O(√p).
•  » » 6 years ago, # ^ | ← Rev. 2 →   0 Thanks, that's a good idea, but let me ask a question: why sqrt? I know about the general sqrt-decomposition approach, but I've always wondered why it's exactly P^(1/2), not P^(1/3) or log(P) or something like that. Is there some kind of proof that "sqrt is better"?
•  » » » 6 years ago, # ^ | ← Rev. 4 →   +1 The thing is, you can choose any factor k. Then your query time is multiplied by k and your update time is divided by k, so your query time becomes O(k) and your update time becomes O(p / k). We want to optimize the worst case, so we choose k such that max(k, p / k) is minimal. It just so happens that k = √p is the optimal choice, because then k = p / k. I ignored some factors here that depend on the string lengths, so if there are certain restrictions on how long your query strings and pattern strings can get, then √p might not be the optimal choice. You could, for example, group the pattern strings by length.
 » 6 years ago, # | ← Rev. 6 →   0 Forget the algorithm below, since it's not working. Here's a variant that uses hashing to achieve poly-logarithmic time: for every string you ever see, precompute a rolling hash, so that you can compare any two substrings, even from different strings, in O(log n), where n is the minimum of their lengths. Keep a balanced binary search tree of all the patterns. Given a query string s, for every suffix s[i..|s|], locate its neighbors in the BBST and check whether one of them is a prefix of that suffix. The total runtime is quadratically logarithmic in the input size. It would be interesting to see a variant of this that doesn't rely on hashing. It would also be interesting to see whether we can shave off one of the log factors. Is this a real problem from somewhere? I'd like to try an implementation of this.
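The O(log n) comparison primitive this comment relies on is sound on its own, even though the BBST search step is questioned below. A sketch of it (names, the modulus, and the random base are my choices): prefix hashes give the hash of any substring in O(1), and a binary search for the longest common prefix turns that into a lexicographic comparison.

```python
import random

MOD = (1 << 61) - 1
BASE = random.randrange(256, MOD)  # random base guards against hash attacks

class HashedString:
    """Polynomial prefix hashes: O(1) hash of any substring s[l:r]."""
    def __init__(self, s):
        self.s = s
        self.h = [0] * (len(s) + 1)   # h[i] = hash of s[:i]
        self.pw = [1] * (len(s) + 1)  # pw[i] = BASE**i mod MOD
        for i, ch in enumerate(s):
            self.h[i + 1] = (self.h[i] * BASE + ord(ch)) % MOD
            self.pw[i + 1] = self.pw[i] * BASE % MOD

    def get(self, l, r):
        """Hash of s[l:r]."""
        return (self.h[r] - self.h[l] * self.pw[r - l]) % MOD

def compare(a, al, ar, b, bl, br):
    """Compare a.s[al:ar] with b.s[bl:br] lexicographically in O(log n):
    binary-search the longest common prefix via hash equality, then look
    at the first differing characters.  Returns -1, 0 or 1."""
    lo, hi = 0, min(ar - al, br - bl)
    while lo < hi:                       # invariant: LCP is in [lo, hi]
        mid = (lo + hi + 1) // 2
        if a.get(al, al + mid) == b.get(bl, bl + mid):
            lo = mid
        else:
            hi = mid - 1
    lcp = lo
    if lcp == ar - al and lcp == br - bl:
        return 0
    if lcp == ar - al:
        return -1                        # first string is a proper prefix
    if lcp == br - bl:
        return 1
    return -1 if a.s[al + lcp] < b.s[bl + lcp] else 1
```

With this primitive, keeping the patterns sorted (in a BBST or a sorted list) only needs `compare` as the ordering; the flaw discussed below is in the neighbor-lookup step, not in the comparison itself.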
•  » » 6 years ago, # ^ |   +5 The nearest neighbor could be some string in between, no? Like, if you have strings {abc,abca,abcd}, then the nearest neighbors of abcc could be abca and abcd, no?
•  » » » 6 years ago, # ^ |   0 You're correct. Let me try to revise the algorithm to fix that (although I'm not sure it's possible with this approach).
•  » » 6 years ago, # ^ |   0 Nope, it's not a problem from an online judge/contest/etc., just something I've faced during my university "research".
 » 6 years ago, # |   +15 If you are not looking for an "online" solution, just look at problem 163E - e-Government and its editorial.