k790alex's blog

By k790alex, 10 years ago, In English

Problem:

Given a string S, find the longest substring that occurs at least twice with no two occurrences overlapping.

1 <= |S| <= 50,000

Input:
ABBBB

Output:
BB

I'm trying to solve a problem like this. After thinking about it for some time, I came up with the following solution using a suffix array:

pos[i] -> the suffix array (start positions of the suffixes in sorted order)
lcp[i] -> longest common prefix of the i-th and (i-1)-th suffixes in sorted order
lcp2[i] -> like lcp but ignoring overlaps -> lcp2[i] = min(lcp[i], abs(pos[i] - pos[i - 1]))

ans = max( lcp2[i] ) for 1 <= i < |S|

My approach works for some cases, but it fails on the test above.
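
Here is a minimal, self-contained sketch of this heuristic (with a naive O(n^2 log n) suffix-array build, used only for illustration). On ABBBB it prints 1 instead of 2, because the two suffixes that give the best non-overlapping match (the ones starting at positions 1 and 3) are not adjacent in the sorted order:

#include <bits/stdc++.h>
using namespace std;

int main() {
    string s = "ABBBB";
    int n = s.size();

    // pos[i] = start index of the i-th suffix in sorted order (naive build)
    vector<int> pos(n);
    iota(pos.begin(), pos.end(), 0);
    sort(pos.begin(), pos.end(), [&](int a, int b) {
        return s.compare(a, string::npos, s, b, string::npos) < 0;
    });

    // lcp[i] = longest common prefix of suffixes pos[i] and pos[i-1]
    vector<int> lcp(n, 0);
    for (int i = 1; i < n; i++) {
        int a = pos[i], b = pos[i - 1], k = 0;
        while (a + k < n && b + k < n && s[a + k] == s[b + k]) k++;
        lcp[i] = k;
    }

    // lcp2[i] = min(lcp[i], |pos[i] - pos[i-1]|), answer = max lcp2[i]
    int ans = 0;
    for (int i = 1; i < n; i++)
        ans = max(ans, min(lcp[i], abs(pos[i] - pos[i - 1])));

    cout << ans << endl;   // prints 1, but the expected answer is 2
}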

I can think of a brute-force solution, which I think is fairly obvious:

For every repeated substring SUB, check all pairs of its occurrences and test whether some pair does not overlap.

Do you have a better way?

Thanks.


»
10 years ago, # |

Easy, if you know the suffix automaton: just calculate the first and last position in the string for every state.
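
A sketch of that idea, under my reading of it: build a standard suffix automaton, and for every state keep the smallest and largest end position of its occurrences (called firstpos / lastpos below; the names are mine). A substring of length L belonging to state v has two non-overlapping occurrences iff lastpos[v] - firstpos[v] >= L, so the answer is the maximum over all states of min(len[v], lastpos[v] - firstpos[v]).

#include <bits/stdc++.h>
using namespace std;

struct SAM {
    struct State { int len = 0, link = -1; map<char, int> next; };
    vector<State> st;
    vector<int> firstpos, lastpos;   // min / max end position over the state's occurrences
    int last = 0;

    void addState(int len, int fp, int lp) {
        State v; v.len = len;
        st.push_back(v);
        firstpos.push_back(fp);
        lastpos.push_back(lp);
    }

    SAM(const string& s) {
        addState(0, INT_MAX, -1);                         // root
        for (int i = 0; i < (int)s.size(); i++) extend(s[i], i);
        // endpos(parent) is the union of the children's endpos in the
        // suffix-link tree, so propagate min/max in decreasing order of len.
        vector<int> order(st.size());
        iota(order.begin(), order.end(), 0);
        sort(order.begin(), order.end(),
             [&](int a, int b) { return st[a].len > st[b].len; });
        for (int v : order) {
            int p = st[v].link;
            if (p < 0) continue;
            firstpos[p] = min(firstpos[p], firstpos[v]);
            lastpos[p]  = max(lastpos[p],  lastpos[v]);
        }
    }

    void extend(char c, int i) {
        int cur = (int)st.size();
        addState(st[last].len + 1, i, i);                 // this state ends at position i
        int p = last;
        while (p != -1 && !st[p].next.count(c)) { st[p].next[c] = cur; p = st[p].link; }
        if (p == -1) st[cur].link = 0;
        else {
            int q = st[p].next[c];
            if (st[p].len + 1 == st[q].len) st[cur].link = q;
            else {
                int clone = (int)st.size();
                addState(st[p].len + 1, INT_MAX, -1);     // clone gets its positions from children
                st[clone].next = st[q].next;
                st[clone].link = st[q].link;
                while (p != -1 && st[p].next[c] == q) { st[p].next[c] = clone; p = st[p].link; }
                st[q].link = st[cur].link = clone;
            }
        }
        last = cur;
    }
};

int main() {
    string s = "ABBBB";
    SAM sam(s);
    int ans = 0;
    for (size_t v = 1; v < sam.st.size(); v++)
        ans = max(ans, min(sam.st[v].len, sam.lastpos[v] - sam.firstpos[v]));
    cout << ans << endl;                                  // 2 ("BB")
}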

  • »
    »
    10 years ago, # ^ |

    Thanks, I'm going to read about the suffix automaton.

    If the problem is to find the longest substring repeated (at least / exactly) k times without overlaps, does the suffix automaton still work?

    • »
      »
      »
      10 years ago, # ^ |

      Yes, the same way.

      UPD: I forgot about the non-overlapping condition. With it, I only know a solution in O(N log^2 N) time instead of O(N).

      One more UPD: it is O(K * N log^2 N). But there was a discussion here about a similar problem with k = 3, maybe it helps.

      • »
        »
        »
        »
        10 years ago, # ^ |

        Can you explain the solution for k non-overlapping substrings?

        • »
          »
          »
          »
          »
          10 years ago, # ^ |

          Big UPD: let's try again with HandIeNeeded's solution. After fixing LEN (in the binary search) and splitting the array into segments, build two sets for every segment: a set of the indexes and a set of the differences of consecutive indexes; for example, for a segment with indexes {1, 3, 4} it is {1, 2}. Then, while the minimum of the second set is < LEN, pop it out and update both sets. After that, check whether at least K indexes remain. It looks like O(N log^2 N).

          + Even easier: you only need the sorted array of indexes of every segment, and a greedy works: take the first index i, then lower_bound(i + LEN), and so on, K times.
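
          A sketch of this greedy for one segment, assuming the segment's suffix start positions are already collected into a sorted array (the function name is mine):

          #include <bits/stdc++.h>
          using namespace std;

          // Given the sorted start positions of one segment, can we pick K
          // occurrences of length LEN that pairwise don't overlap?
          // Greedy: take the first start, then the first start >= previous + LEN, etc.
          bool canPickK(const vector<int>& starts, int LEN, int K) {
              int picked = 0;
              auto it = starts.begin();
              while (it != starts.end()) {
                  picked++;
                  if (picked >= K) return true;
                  it = lower_bound(it, starts.end(), *it + LEN);   // next non-overlapping start
              }
              return false;
          }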

          • »
            »
            »
            »
            »
            »
            10 years ago, # ^ |

            It looks like a greedy solution. We don't know which number we need to pop out of the first set (it can be 3 or 4), which is why I am not sure that it will work. But probably, after building the suffix array, we can easily solve the problem for each segment using a Fenwick tree.

        • »
          »
          »
          »
          »
          10 years ago, # ^ |

          I explained it, please check for mistakes :)

»
10 years ago, # |

Hashing might help.

  • »
    »
    10 years ago, # ^ |

    how?

  • »
    »
    10 years ago, # ^ |

    Why is it downvoted?

    First, binary search over the length. For a fixed length L, check whether there is an answer of this length: for each substring of length L compute its hash, and for each hash store the first and last occurrence. If for some hash the first and last occurrences are at least L positions apart, the answer for this length is true.
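
    A sketch of that check for one fixed length L, plus the binary search around it (a single 64-bit rolling hash is used here just for illustration; collisions are ignored, in practice you may want a double hash or to verify matches):

    #include <bits/stdc++.h>
    using namespace std;

    // True if some substring of length L occurs twice with the two
    // occurrences starting at least L positions apart.
    bool existsNonOverlapping(const string& s, int L) {
        int n = s.size();
        if (L <= 0) return true;
        if (L > n) return false;
        const uint64_t B = 1315423911ULL;                 // arbitrary odd base, arithmetic mod 2^64
        uint64_t h = 0, powB = 1;
        for (int i = 0; i < L; i++) { h = h * B + s[i]; powB *= B; }   // hash of s[0..L-1]
        unordered_map<uint64_t, pair<int,int>> occ;       // hash -> (first start, last start)
        occ[h] = {0, 0};
        for (int i = L; i < n; i++) {
            h = h * B + s[i] - powB * s[i - L];           // slide the window by one position
            int start = i - L + 1;
            auto it = occ.find(h);
            if (it == occ.end()) occ[h] = {start, start};
            else it->second.second = start;               // keep the last occurrence
        }
        for (auto& kv : occ)
            if (kv.second.second - kv.second.first >= L) return true;   // far enough apart
        return false;
    }

    // Binary search over the answer length (the predicate is monotone:
    // if length x works, any shorter length works too).
    int longestNonOverlappingRepeat(const string& s) {
        int lo = 0, hi = (int)s.size() / 2;               // answer is at most |s| / 2
        while (lo < hi) {
            int mid = (lo + hi + 1) / 2;
            if (existsNonOverlapping(s, mid)) lo = mid; else hi = mid - 1;
        }
        return lo;                                        // e.g. 2 for "ABBBB"
    }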

    • »
      »
      »
      9 years ago, # ^ |

      This approach takes O(n^2 lg n) time:

      • O(n) time to iterate over each substring of length k
      • O(n) time to hash each substring
      • O(lg n) time to binary search

      If you use a rolling hash, however, you can solve it in O(n lg n) time (in the best case).

»
10 years ago, # |

Use binary search. When we check a length K, could it be the answer? Split the array lcp[] into groups, each containing consecutive elements of lcp[] with lcp[i] >= K, and in each group check the maximum and minimum suffix start index to see whether their difference is at least K. And it's done. :D
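
A sketch of that check, assuming the suffix array sa[] and the lcp[] array (lcp[i] = LCP of suffixes sa[i] and sa[i-1]) are already built; the binary search over K calls this:

#include <bits/stdc++.h>
using namespace std;

// Can some substring of length K be found at two starts at least K apart?
// Consecutive positions with lcp[i] >= K form one group of suffixes that
// share a prefix of length >= K; a group is good if max start - min start >= K.
bool feasible(int K, const vector<int>& sa, const vector<int>& lcp) {
    if (sa.empty()) return false;
    int mn = sa[0], mx = sa[0];
    for (size_t i = 1; i < sa.size(); i++) {
        if (lcp[i] >= K) {                      // sa[i] joins the current group
            mn = min(mn, sa[i]);
            mx = max(mx, sa[i]);
            if (mx - mn >= K) return true;      // two starts far enough apart
        } else {                                // group ends, start a new one
            mn = mx = sa[i];
        }
    }
    return false;
}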

»
9 years ago, # |

Here is my solution to this problem using binary search and string hashing.

Observation: if there exists such a string of length x, then there must be at least one such string for every length < x.

  1. Use this observation to binary search on the length. Fix a length (say z) and compute the hashes of all substrings of length z. There will be exactly n - z + 1 such hashes, where n is the length of the given string.

  2. Along with each hash, keep one extra piece of information: the starting index of that substring in the given string.

  3. Now sort this vector of pairs and iterate over it to find the maximum distance between the starting indexes of two substrings with equal hash values. This tells you whether two equal substrings exist that do not overlap. Depending on the result, proceed to the corresponding half of the binary search (see the sketch after the complexity line below).

Complexity: O(N * log(N) * log(N))
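
A sketch of the check in step 3, assuming the (hash, start index) pairs for all substrings of length z have already been computed with a rolling hash:

#include <bits/stdc++.h>
using namespace std;

// Sort the pairs; inside every run of equal hashes the starts are then
// sorted too, so it is enough to compare the first and the last start.
bool checkLength(vector<pair<uint64_t,int>> v, int z) {    // v: (hash, start index)
    sort(v.begin(), v.end());
    for (size_t i = 0; i < v.size(); ) {
        size_t j = i;
        while (j < v.size() && v[j].first == v[i].first) j++;   // run of one hash value
        if (v[j - 1].second - v[i].second >= z) return true;    // non-overlapping pair found
        i = j;
    }
    return false;
}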

  • »
    »
    9 years ago, # ^ |

    UPDATE: this solution also works if you want to find the longest substring that occurs k times in the original string with all k occurrences pairwise non-overlapping.

  • »
    »
    9 years ago, # ^ |

    This works, but I'm not sure where you got the complexity of O(n * log^2(n)) from. If you're hashing a string, that will take O(n) time, so the complexity would be O(n^2 lg n). That being said, you could hash in constant time by using a rolling hash. The problem with that, however, is that you could run into collisions. The best case would be O(n lg n) though.

    • »
      »
      »
      9 years ago, # ^ |

      Of course it uses a rolling hash. Its time complexity is O(n lg^2 n) because of the O(n lg n) sorting step inside the binary search.

      • »
        »
        »
        »
        9 years ago, # ^ |

        What sorting step?

        • »
          »
          »
          »
          »
          9 years ago, # ^ |

          He sorts the vector of pairs in step 3.

          • »
            »
            »
            »
            »
            »
            9 years ago, # ^ |

            Oh, weird. You don't have to do that sorting step, though. To find a repeated substring of size k, you can just iterate over the n - k + 1 substrings of size k. When you add the current substring to the hash table (which maps "hash" to "start index of first occurrence"), check whether there is already an entry. If an entry already exists, determine whether the current substring overlaps with the first occurrence of that substring. If it doesn't, then there exist two non-overlapping substrings of size k.

            See some quick code I wrote here: http://pastebin.com/A3deJSZ0

            • »
              »
              »
              »
              »
              »
              »
              9 years ago, # ^ |

              I think sorting vs. hash tables is just a matter of personal preference. I would use a C++ set if I had to implement it; that gives O(n lg^2 n) time complexity.