cpp11's blog

By cpp11, 3 years ago, In English,

I am learning suffix arrays. I understood the O(nlogn) implementation of suffix array. But I am not being able to understand LCP calculation. Could someone explain how to calculate LCP from suffix arrays? Thanks in advance.

 
 
 
 

»
3 years ago, # |
Rev. 2   Vote: I like it +10 Vote: I do not like it

Kasai's algorithm is pretty easy and works in O(n).

Let's look at the two continuous suffixes in the suffix array. Let their indexes in suffix array be i1 and i1 + 1. If their lcp > 0, then if we delete first letter from both of them. We can easily see that new strings will have the same relative order. Also we can see that lcp of new strings will be exactly lcp - 1.

Let's now look at the string wich we have got from the i suffix by deleting its first character. Obviously it is some suffix of the string too. Let its index be i2. Let's look at the lcp of suffixes i2 and i2 + 1. We can see that it's lcp will be at least already mentioned lcp - 1. This is associated with certain properties of lcp array, in particular, that lcp(i, j) = min(lcpi, lcpi + 1, ..., lcpj - 1).

And finally let's make the algorithm based on the mentioned above. We will need an additional array rank[n], wich will contain the index in the suffix array of the suffix starting in index i. Firstly we should calculate the lcp of the suffix with index rank[0]. Then let's iterate through all suffixes in order in which we meet them in the string and calculate lcp[rank[i]] in naive way, BUT starting it from lcp[rank[i - 1]] - 1. Easy to see that now we have O(n) algorithm because on the each step our lcp decreasing not more than by 1 (except the case when rank[i] = n - 1).

Implementation:

vector<int> kasai(string s, vector<int> sa)
{
    int n=s.size(),k=0;
    vector<int> lcp(n,0);
    vector<int> rank(n,0);

    for(int i=0; i<n; i++) rank[sa[i]]=i;

    for(int i=0; i<n; i++, k?k--:0)
    {
        if(rank[i]==n-1) {k=0; continue;}
        int j=sa[rank[i]+1];
        while(i+k<n && j+k<n && s[i+k]==s[j+k]) k++;
        lcp[rank[i]]=k;
    }
    return lcp;
}

  • »
    »
    3 years ago, # ^ |
      Vote: I like it +3 Vote: I do not like it

    There is also a way to build it in O(nlogn) with a segment tree described on the e-maxx, but in my opinion it is much harder and slower.

    • »
      »
      »
      3 years ago, # ^ |
        Vote: I like it +3 Vote: I do not like it

      And there's another O(NlogN) algorithm which is much more intuitive: find LCP of each pair of consecutive suffixes using binary search and hashes. However, I'm not really sure what's easier and faster to implement: this method or Kasai's (and several other guys')

      • »
        »
        »
        »
        3 years ago, # ^ |
          Vote: I like it +9 Vote: I do not like it

        Yes, but hashes are evil, we don't want use them :)

        • »
          »
          »
          »
          »
          3 years ago, # ^ |
            Vote: I like it +8 Vote: I do not like it

          Why exactly? Due to anti-hash tests? Try hashing mod 2^64 and a randomly chosen reasonably small prime.

          • »
            »
            »
            »
            »
            »
            3 years ago, # ^ |
              Vote: I like it 0 Vote: I do not like it

            I just dislike hashes and trying to avoid them almost always when I have such opportunity. Also, double hashing is quite slow :(

            • »
              »
              »
              »
              »
              »
              »
              3 years ago, # ^ |
                Vote: I like it +3 Vote: I do not like it

              That's why 2^64 — what makes double hashing slow is especially the modulo operation, if you use just long long, then it's fast, but it's easy to make anti-hash tests, which is what the other part (mod smaller prime) takes care of while retaining decent runtime.

              • »
                »
                »
                »
                »
                »
                »
                »
                3 years ago, # ^ |
                  Vote: I like it 0 Vote: I do not like it

                Interesting trick. But I still dislike hashes :)

  • »
    »
    3 years ago, # ^ |
      Vote: I like it -8 Vote: I do not like it

    That was helpful.I got the idea. Thanks a lot.But Could you explain why lcp(i, j) = min(lcpi, lcpi + 1, ..., lcpj-1). this property is true?

    • »
      »
      »
      3 years ago, # ^ |
        Vote: I like it +5 Vote: I do not like it

      write a suffix array + lcp in a paper, you will notice that property.

    • »
      »
      »
      3 years ago, # ^ |
        Vote: I like it +1 Vote: I do not like it

      For example, we know lcp(i, j - 1). Obviously if lcp[j - 1] < lcp(i, j - 1) then lcp(i, j) = lcp[j - 1], otherwise lcp(i, j) = lcp(i, j - 1), i.e. lcp(i, j) = min(lcp(i, j - 1), lcp[j - 1]). Now we could rewrite lcp(i, j - 1) in this formula in the same way and get what we get.

      • »
        »
        »
        »
        3 years ago, # ^ |
          Vote: I like it 0 Vote: I do not like it

        It is clear now. Thanks :)

      • »
        »
        »
        »
        2 years ago, # ^ |
          Vote: I like it 0 Vote: I do not like it

        Do you mean lcp(1,4) in abcabcd = min(lcp(1,2),lcp(2,3),lcp(3,4)) = min(0,0,0) = 0 ?

        • »
          »
          »
          »
          »
          20 months ago, # ^ |
          Rev. 2   Vote: I like it 0 Vote: I do not like it

          here lcp(i,j) means lcp(suffix from sa[i],suffix from sa[j]) ,where sa=>suffix array

  • »
    »
    3 months ago, # ^ |
      Vote: I like it 0 Vote: I do not like it

    what is rank array....please explain a little more !!!!!

    • »
      »
      »
      3 months ago, # ^ |
        Vote: I like it 0 Vote: I do not like it

      rank array is just a reverse function for suffix array

      • »
        »
        »
        »
        3 months ago, # ^ |
          Vote: I like it 0 Vote: I do not like it

        what if we used j=sa[i]+1;

        ??/

        • »
          »
          »
          »
          »
          3 months ago, # ^ |
            Vote: I like it 0 Vote: I do not like it

          The idea is following thing. Let s='abcdefghi' Then LCP(0)=lcp(abcdefghi, bcdefghi) =|bcdefghi|=8

          And then we cut one character from left from each string and move one position forward and calculate LCP(1). It would be LCP(1)=|cdefghi|=7=LCP(0)-1

          So if j=sa[i] +1 we can't say that LCP(i) =k-1, we should check prefix fully.

          • »
            »
            »
            »
            »
            »
            3 months ago, # ^ |
              Vote: I like it 0 Vote: I do not like it

            lcp(abcdefghi, bcdefghi) =|bcdefghi|=8
            wat

            • »
              »
              »
              »
              »
              »
              »
              3 months ago, # ^ |
                Vote: I like it 0 Vote: I do not like it

              lcp(abcdefghi, bcdefghi) =|bcdefghi|=0

              right??

              • »
                »
                »
                »
                »
                »
                »
                »
                3 months ago, # ^ |
                  Vote: I like it 0 Vote: I do not like it

                Sorry, I did a huge mistake.

                I thought actually about the following.
                We have string suff[i] ='abcdefghi'. Let LCP(i) =X.

                Then for suff[j] ='bcdefghi' LCP(j) >=X-1.

            • »
              »
              »
              »
              »
              »
              »
              3 months ago, # ^ |
                Vote: I like it 0 Vote: I do not like it

              I thought about something like example of cutting It should be something like LPC(0)=lcp(abcdhhh, abcdiii) = |abcd|=4 => lcp(bcdhhh,bcdiii)=|bcd|=LCP(0)-1

              Maybe that is some sort if mild brain damage

        • »
          »
          »
          »
          »
          3 months ago, # ^ |
            Vote: I like it 0 Vote: I do not like it

          Adamant mentioned that lcp(i, j) =min(LCP(k)) {k=[i:j)} That is because we can take lcp between cur and next suffixes in the string, not in the suffix array, as a lower bound.

        • »
          »
          »
          »
          »
          3 months ago, # ^ |
            Vote: I like it 0 Vote: I do not like it

          I spent some time trying to run algorithm in my imagination.

          Despite algorithm is very confusing and it seems that it can't work it works actually. I tested it on my machine today morning.

          We consider i-th suffix of original string.
          Find where in suffix array i-th suffix lies and then find where lexicographically next suffix lies in original string. Then compare characters one by one and count of characters successfully matched is LCP.

          And for some reason it is legal to start comparing not from first character but from k-th.

          Why? What is k? k is LCP of rank[i-1]. And i-th suffix is also a suffix of (i-1)-suffix.

          Moreover, suff(i-1)=c+suff(i).

          So if k >1 then x=y+z, where |y|=k-1.

  • »
    »
    3 months ago, # ^ |
      Vote: I like it 0 Vote: I do not like it

    Postfix increment is bad!

    • »
      »
      »
      3 months ago, # ^ |
        Vote: I like it 0 Vote: I do not like it

      You're bad!

      • »
        »
        »
        »
        3 months ago, # ^ |
        Rev. 5   Vote: I like it 0 Vote: I do not like it

        I saw postfix increment implementation in GCC C++.
        It was like this. [] [int &x] {int y=++x; return y;} So, postfix uses prefix form as a subroutine and therefore is slower.

        UPD: Postfix increment does unnecessarily job. So it consumes additional energy without a real purpose. Energy is not infinite you know. You are bringing us closer to the
        heat death of the Universe.

        UPD2: I understood. You are talking about compiler optimization. It is almost certain that compiler will remove postfix and put prefix form.
        Nonetheless, it requires additional time, additional energy and thus additional heating.

        • »
          »
          »
          »
          »
          3 months ago, # ^ |
            Vote: I like it +10 Vote: I do not like it

          Kill yourself, then you can slow down heat death of universe.

          • »
            »
            »
            »
            »
            »
            3 months ago, # ^ |
            Rev. 3   Vote: I like it 0 Vote: I do not like it

            It might help or might not. Now if kill myself then everybody would think I have killed myself because of codeforces.com user advised me to do so. It may trigger a huge response. Or may not. The question is pretty complex. We should discuss that question very hard.

  • »
    »
    4 weeks ago, # ^ |
      Vote: I like it 0 Vote: I do not like it

    Can you please explain the notation used above. I am having doubt in lcp(i, j) and lcp_i

»
2 years ago, # |
  Vote: I like it 0 Vote: I do not like it

Very well explained. Thanks!