Find strings containing a certain substring dynamically with queries

→ Pay attention

Before contest
Codeforces Round 941 (Div. 1)
13:10:06
Register now »

*has extra registration

Before contest
Codeforces Round 941 (Div. 2)
13:10:06
Register now »

*has extra registration

→ Streams

Atcoder ABC #351 Short Solution Discussion

By aryanc403

Before stream 12:15:05

View all →

→ Top rated

#	User	Rating
1	ecnerwala	3649
2	Benq	3581
3	orzdevinwang	3570
4	Geothermal	3569
4	cnnfls_csy	3569
6	tourist	3565
7	maroonrk	3531
8	Radewoosh	3521
9	Um_nik	3482
10	jiangly	3468

Countries | Cities | Organizations

View all →

→ Top contributors

#	User	Contrib.
1	maomao90	174
2	awoo	164
3	adamant	161
4	TheScrasse	159
5	nor	158
6	maroonrk	156
7	-is-this-fft-	152
8	SecondThread	147
9	orz	146
10	pajenegod	145

View all →

→ Find user

→ Recent actions

Detailed →

Robbinb1993's blog

Find strings containing a certain substring dynamically with queries

By Robbinb1993, history, 3 years ago, In English

Hi,

For this problem I have a set S with initially 0 strings.

There are three types of queries:

Add a string to set S with id = queryId.
Remove the string from set S with id = removeId.
Read a string M and print all strings in set S that have M as substring.

Let N be the size of S.

How would I solve this efficiently? I hope to reach a complexity of O((length of search string) + N) for query 3 and O(length of string) for queries 1 and 2. Is this possible, with for example a suffix tree?

I would like to use this for a fast search engine for a (big) data set that is changing dynamically.

strings

Robbinb1993
3 years ago
13

Comments (13)

Write comment?

Robbinb1993

3 years ago, # |

Auto comment: topic has been updated by Robbinb1993 (previous revision, new revision, compare).

→ Reply

Robbinb1993

3 years ago, # |

Auto comment: topic has been updated by Robbinb1993 (previous revision, new revision, compare).

→ Reply

Lezendary_sandwich

3 years ago, # |

You can do it with an extra log overhead using suffix automaton.

→ Reply

Robbinb1993

3 years ago, # ^ |

Cool, do you have any sources how to support the removal of a string from the suffix automaton?

→ Reply

Lezendary_sandwich

3 years ago, # ^ |

https://codeforces.com/edu/course/2/lesson/2

→ Reply

Robbinb1993

3 years ago, # ^ |

Thanks, I see this is a suffix array instead of suffix automaton though. But both should be possible. I don't see the mentioning of deleting an added string from the suffix array, I am not sure if it is possible to do this after creation.

→ Reply

Robbinb1993

3 years ago, # ^ |

For suffix automaton and suffix array it is built from only one string, unlike suffix tree. So we should use a delimiter to support multiple strings to be searched. I am not sure if this would support a deletion operation of one of the substrings contained in the concatenated string from which the data structure was built.

→ Reply

Lezendary_sandwich

3 years ago, # ^ |

Create suffix arrays for strings being added $$$O(S\log(s))$$$. When deleting remove their suffix arrays.

→ Reply

Robbinb1993

3 years ago, # ^ |

← Rev. 2 →

I see. But I think the complexity for a search will be more efficient when using a Suffix tree. It will only be a traversal of M characters (the search pattern string length) and the vertices in the suffix tree will store all ID's of the strings that are below it in the tree (which are added when the string was added to the suffix tree, on its path). If we are at our final character of the search pattern, we will return the set of strings with endpoints below this vertex in the tree and this is our result. Which is only O(M + N) for a search.

→ Reply

radoslav11

3 years ago, # ^ |

← Rev. 3 →

You can actually generalise the suffix automaton for multiple strings too (without adding the annoying delimiter) with a very small modification to the construction algorithm. Basically, whenever you're adding a new string, you reset the last node to 0/the root. You then need to also handle the case of already having a transition with the letter you're adding, but that's pretty similar to the normal SA cases. Here's the implementation I use. When you add a new string, you simply set last=0 and add the letter as normal.

Note that for the specific problem you have, this approach needs some tweaking as if the queries come online, you'll might change the structure of the SA, meaning that you need to be careful if you use the internal suffix links that form a tree (which is probably the easiest way to perform what you described). The easiest solution is probably to solve everything offline, by pre-building the generalised SA, and maintaining the suffix tree with a flag on every node. Now every addition/removal of a string is equivalent to adding/removing 1 on a path to the root of a tree, and the query is equivalent to counting the number of nonzero elements. All of those can be handled easily in $$$O(\log)$$$.

→ Reply

Robbinb1993

3 years ago, # ^ |

Thank you, that is a nice modification. In my case it has to be used in practice where you don't know in front what queries and what input will be given, similar to the interactive problems. So unfortunately an offline solution is not possible. The use case is a big product table of a retailer in which data can be modified, added and deleted at any time. And users should be able to search quickly for search pattern matches in this list of data.

→ Reply

radoslav11

3 years ago, # ^ |

If you need it for production, maybe a relatively good idea would be to use bucketing + the idea I mentioned, say by rebuilding the SA on every $$$B=500$$$ updates. Then you'll be able to query the latest version to get the matches and after that you'll only cross check with at most $$$B$$$ strings (which should be fast enough).

→ Reply

Robbinb1993

3 years ago, # ^ |

Yes I think this is the way to go. Seems like a deletion operation is not so easy to implement for suffix data structures. Not a lot can be found about this.

→ Reply