Please help me with a string search problem

→ Обратите внимание

До соревнования
Codeforces Round 950 (Div. 3)
23:52:45
Зарегистрироваться »

→ Лидеры (рейтинг)

№	Пользователь	Рейтинг
1	tourist	3757
2	jiangly	3647
3	Benq	3581
4	orzdevinwang	3570
5	Geothermal	3569
5	cnnfls_csy	3569
7	Radewoosh	3509
8	ecnerwala	3486
9	jqdai0815	3474
10	gyh20	3447

Страны | Города | Организации

Всё →

→ Лидеры (вклад)

№	Пользователь	Вклад
1	maomao90	171
2	awoo	165
3	adamant	164
4	TheScrasse	159
5	nor	155
5	maroonrk	155
7	-is-this-fft-	152
8	Petr	147
9	orz	146
10	pajenegod	145

Всё →

→ Найти пользователя

→ Прямой эфир

Детальнее →

Блог пользователя Koyote

Please help me with a string search problem

Автор Koyote, история, 12 месяцев назад, По-английски

I encountered this problem at a past local competition. Here is the brief statement:

Given two string $$$s$$$ of length $$$n$$$ and t of length $$$m$$$ ($$$n$$$, $$$m$$$ $$$\le$$$ $$$10^{5}$$$) find the longest, lexicographically smallest substring of $$$s$$$ that appears in string $$$t$$$ at least $$$k$$$ times. The input is given in a way that there is at least one substring satisfies the constraints.

I have tried to find answers on Google and the best I could find is this problem from HackerEarth . The suggested solutions were either hashing with binary search, or suffix tree. And I am not knowledgeable enough to transform the problem into the original problem. So far the best I could have done is to iterate every substring of $$$s$$$ in $$$O(n^2)$$$, then use string search algorithm (KMP/Z-function) to count number of occurrences, yielding total complexity of $$$O(n^2*(n+m))$$$. Please give hints to an approach for this problem.

string, hard, suffix tree, hashing

Koyote
12 месяцев назад
9

Комментарии (9)

Написать комментарий?

Koyote

12 месяцев назад, # |

I forgot. The first line of input contains 3 positive integers $$$n$$$, $$$m$$$, and $$$k$$$. The second line contains the string $$$s$$$ and the third line contains the string $$$t$$$. Here is an example test case:

Input:

7 19 3
GATTACA
TACATTACGCATTACACAT

Output:

ACA

Explanation: TAC and ACA are the two longest substring of $$$s$$$ that satisfy, and ACA is lexicographically smaller than TAC.

→ Ответить

tminh_hk20

7 месяцев назад, # ^ |

this is my code about this problem:

#include<bits/stdc++.h>
using namespace std;
#define int long long
#define ii pair<int,int>
#define fi first
#define se second
const int N=1e5+10,MOD=1e9+7,base=311;
int n,m,k,has1[N],has2[N],p[N];
int l,r,mi;
string s,t;
ii res;

int get1(int i, int j){
    return (has1[j]-has1[i-1]*p[j-i+1]+MOD*MOD)%MOD;
}

int get2(int i, int j){
    return (has2[j]-has2[i-1]*p[j-i+1]+MOD*MOD)%MOD;
}

bool compound(int a, int b, int n){
    if (b==-1) return true;
    int l =0,r=n-1,res=-1;

    while(l<=r){

        int m = (l+r)/2;

        if (get2(a,a+m)==get2(b,b+m)){
            l = m+1;
            res=m;
        }else{
            r =m-1;
        }

    }

    if (res==n-1) return true;
    return (t[a+res+1]<t[b+res+1]);
}

signed main(){

    ios_base::sync_with_stdio(0);
    cin.tie(0); cout.tie(0);

    freopen("GATTACA.INP","r",stdin);
    freopen("GATTACA.OUT","w",stdout);

    cin>>n>>m>>k;
    cin>>s>>t;

    s =" "+s;
    t =" "+t;

    p[0]=1;
    for (int i=1;i<=max(n,m)+1;i++){
        p[i]=(p[i-1]*base)%MOD;
    }

    has1[0]=0;
    for (int i=1;i<=n;i++){
        has1[i]=(has1[i-1]*base+s[i])%MOD;
    }

    has2[0]=0;
    for (int i=1;i<=m;i++){
        has2[i]=(has2[i-1]*base+t[i])%MOD;
    }

    l=1;
    r=1;
    bool div=0;
    res={-1,-1};

    while(l<=r){
        //cout <<l<<" "<<r<<endl;
        if (div)
            mi = (l+r)/2;
        else
            mi = r;

        bool check=0;
        unordered_map<int,int> mp,ap;

        for (int i=1;i<=n-mi+1;i++){
            ap[get1(i,i+mi-1)]++;
        }

        for (int i=1;i<=m-mi+1;i++){
            int tmp=get2(i,i+mi-1);
            mp[tmp]++;

            if (ap[tmp] && mp[tmp]>=k){
                check=1;
                if (mi<res.se-res.fi+1) continue;

                if (mi>res.se-res.fi+1){
                    res={i,i+mi-1};
                    continue;
                }

                if (compound(i,res.fi,mi)){
                    res = {i,i+mi-1};
                }
            }

        }

        if (div){
            if (check)
                l = mi+1;
            else
                r = mi-1;
        }else{
            if (check){
                l =r;
            }else{
                div=1;
            }
            r*=2;
        }


    }

    for (int i=res.fi;i<=res.se;i++)
        cout <<t[i];



}

→ Ответить

ayhan23

12 месяцев назад, # |

← Rev. 2 →

I have found O((M+N)*LOG_M*LOG_N) solution using binary search on the length of substring + suffix array for sorting substrings of length X. We also need to use hashing for counting occurences.

I don't know if there is any better solution than this one.

→ Ответить

Koyote

12 месяцев назад, # ^ |

That is good enough, please explain more.

→ Ответить

ayhan23

12 месяцев назад, # ^ |

We can use binary search on the answer technique. We fix the length of substring. Let's call the mid value X. We can use binary search because suppose that the answer is A(length of the substring). Then, an answer with length (A-1) must exist because you can pop the last character of the valid substring.

We use hashing to calculate the number of occurences (on string t with substring length X). This part is O(MlogM) (due to std::map). Finally , to find the lexicographically minimum substring we need to use suffix array to sort substring of length X. You can learn suffix array and it's applications from CP-ALGO. Also sorry for my poor english.

→ Ответить

Koyote

12 месяцев назад, # ^ |

No problem. Thank you for your idea, I'll try to implement it.

→ Ответить

ayhan23

12 месяцев назад, # ^ |

Your welcome :)

→ Ответить

bicsi

12 месяцев назад, # |

← Rev. 2 →

Translation: find a string of maximal length that has at least one occurrence in $$$s$$$ and at least $$$k$$$ occurrences in $$$t$$$. One way to do that is:

Build a suffix automaton of $$$s\$t$$$
For each node in automaton compute the number of occurrences in $$$s$$$ and in $$$t$$$
Out of all the valid nodes, choose the min lex one

A similar solution can be done with suffix arrays and two pointers technique:

Sort the suffixes of $$$s\$t$$$
For each $$$i \leq n + m$$$ find the smallest $$$j$$$ such that there is at least one index from $$$s$$$ and $$$k$$$ indices from $$$t$$$ in the range $$$sa[i..j]$$$ (via two pointers)
Update the solution if $$$lcp[i..j]$$$ is strictly better than the current best. This makes sure to pick the lex min solution.

→ Ответить

Koyote

12 месяцев назад, # ^ |

Thank you so much. That is so clever.

→ Ответить

Соревнования по программированию 2.0

Время на сервере: 02.06.2024 17:42:16 (j2).

Десктопная версия, переключиться на мобильную.

При поддержке