I am learning suffix arrays. I understood the O(nlogn) implementation of suffix array. But I am not being able to understand LCP calculation. Could someone explain how to calculate LCP from suffix arrays? Thanks in advance.

By Shayan

Before stream
12:09:37

# | User | Rating |
---|---|---|

1 | jiangly | 3677 |

2 | Benq | 3601 |

3 | ecnerwala | 3542 |

4 | maroonrk | 3540 |

5 | cnnfls_csy | 3539 |

6 | orzdevinwang | 3493 |

7 | inaFSTream | 3478 |

8 | Um_nik | 3430 |

9 | Rebelz | 3409 |

10 | Geothermal | 3408 |

# | User | Contrib. |
---|---|---|

1 | maomao90 | 174 |

2 | adamant | 166 |

3 | awoo | 162 |

4 | TheScrasse | 161 |

5 | nor | 160 |

6 | maroonrk | 158 |

7 | SecondThread | 156 |

8 | Um_nik | 151 |

9 | Geothermal | 145 |

9 | BledDest | 145 |

I am learning suffix arrays. I understood the O(nlogn) implementation of suffix array. But I am not being able to understand LCP calculation. Could someone explain how to calculate LCP from suffix arrays? Thanks in advance.

↑

↓

Codeforces (c) Copyright 2010-2024 Mike Mirzayanov

The only programming contests Web 2.0 platform

Server time: Feb/29/2024 07:30:23 (k2).

Desktop version, switch to mobile version.

Supported by

User lists

Name |
---|

Kasai's algorithm is pretty easy and works in

O(n).Let's look at the two continuous suffixes in the suffix array. Let their indexes in suffix array be

i_{1}andi_{1}+ 1. If theirlcp> 0, then if we delete first letter from both of them. We can easily see that new strings will have the same relative order. Also we can see that lcp of new strings will be exactlylcp- 1.Let's now look at the string wich we have got from the

isuffix by deleting its first character. Obviously it is some suffix of the string too. Let its index bei_{2}. Let's look at the lcp of suffixesi_{2}andi_{2}+ 1. We can see that it'slcpwill be at least already mentionedlcp- 1. This is associated with certain properties of lcp array, in particular, thatlcp(i,j) =min(lcp_{i},lcp_{i + 1}, ...,lcp_{j - 1}).And finally let's make the algorithm based on the mentioned above. We will need an additional array

rank[n], wich will contain the index in the suffix array of the suffix starting in indexi. Firstly we should calculate the lcp of the suffix with indexrank[0]. Then let's iterate through all suffixes in order in which we meet them in the string and calculatelcp[rank[i]] in naive way, BUT starting it fromlcp[rank[i- 1]] - 1. Easy to see that now we haveO(n) algorithm because on the each step ourlcpdecreasing not more than by 1 (except the case whenrank[i] =n- 1).Implementation:

There is also a way to build it in

O(nlogn) with a segment tree described on the e-maxx, but in my opinion it is much harder and slower.And there's another

O(NlogN) algorithm which is much more intuitive: find LCP of each pair of consecutive suffixes using binary search and hashes. However, I'm not really sure what's easier and faster to implement: this method or Kasai's (and several other guys')Yes, but hashes are evil, we don't want use them :)

Why exactly? Due to anti-hash tests? Try hashing mod 2^64 and a randomly chosen reasonably small prime.

I just dislike hashes and trying to avoid them almost always when I have such opportunity. Also, double hashing is quite slow :(

That's why 2^64 — what makes double hashing slow is especially the modulo operation, if you use just long long, then it's fast, but it's easy to make anti-hash tests, which is what the other part (mod smaller prime) takes care of while retaining decent runtime.

Interesting trick. But I still dislike hashes :)

That was helpful.I got the idea. Thanks a lot.But Could you explain why lcp(i, j) = min(lcpi, lcpi + 1, ..., lcpj-1). this property is true?

write a suffix array + lcp in a paper, you will notice that property.

For example, we know

lcp(i,j- 1). Obviously iflcp[j- 1] <lcp(i,j- 1) thenlcp(i,j) =lcp[j- 1], otherwiselcp(i,j) =lcp(i,j- 1), i.e.lcp(i,j) =min(lcp(i,j- 1),lcp[j- 1]). Now we could rewritelcp(i,j- 1) in this formula in the same way and get what we get.It is clear now. Thanks :)

Do you mean lcp(1,4) in abcabcd = min(lcp(1,2),lcp(2,3),lcp(3,4)) = min(0,0,0) = 0 ?

here lcp(i,j) means lcp(suffix from sa[i],suffix from sa[j]) ,where sa=>suffix array

what is rank array....please explain a little more !!!!!

rank array is just a reverse function for suffix array

what if we used j=sa[i]+1;

??/

The idea is following thing. Let s='abcdefghi' Then LCP(0)=lcp(abcdefghi, bcdefghi) =|bcdefghi|=8

And then we cut one character from left from each string and move one position forward and calculate LCP(1). It would be LCP(1)=|cdefghi|=7=LCP(0)-1

So if j=sa[i] +1 we can't say that LCP(i) =k-1, we should check prefix fully.

lcp(abcdefghi, bcdefghi) =|bcdefghi|=8

wat

Postfix increment is bad!

You're bad!

I saw postfix increment implementation in GCC C++.

It was like this. [] [int &x] {int y=++x; return y;} So, postfix uses prefix form as a subroutine and therefore is slower.

UPD: Postfix increment does unnecessarily job. So it consumes additional energy without a real purpose. Energy is not infinite you know. You are bringing us closer to the

heat death of the Universe.

UPD2: I understood. You are talking about compiler optimization. It is almost certain that compiler will remove postfix and put prefix form.

Nonetheless, it requires additional time, additional energy and thus additional heating.

Kill yourself, then you can slow down heat death of universe.

Can you please explain the notation used above. I am having doubt in lcp(i, j) and lcp_i

https://www.hackerrank.com/challenges/find-strings/topics

http://www.mi.fu-berlin.de/wiki/pub/ABI/RnaSeqP4/suffix-array.pdf