LLI_E_P_JI_O_K's blog

By LLI_E_P_JI_O_K, 15 months ago (translated into English)

There is a well-known algorithm for finding the longest common subsequence (LCS) of two sequences of lengths N and M in O(N·M) time: Link to the algorithm description

But I've recently heard there is a technique that reduces the running time when one of the sequences is short enough, working in O(min(N, M)² + max(N, M)). Does anybody know how to do it?

 
 
 
 

»
15 months ago, # |

Seems pretty obvious: dp[i][j] stores the least k such that the LCS of the i-th prefix of the small sequence and the k-th prefix of the large sequence is j.
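A minimal sketch of this dp (names are mine, not from the comment), using binary search over precomputed occurrence lists for the "first occurrence at or after position k" query. Note this naive version runs in O(m² log n); the replies below discuss removing the log factor.

```python
from bisect import bisect_left
from collections import defaultdict

def lcs_length(short, long):
    """dp[i][j] = least k such that LCS(short[:i], long[:k]) equals j."""
    m, n = len(short), len(long)
    INF = n + 1  # marks unreachable states
    # occurrence lists: positions of each symbol in the longer sequence
    pos = defaultdict(list)
    for k, c in enumerate(long):
        pos[c].append(k)
    dp = [[INF] * (m + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = 0  # an empty LCS needs no characters of `long`
    best = 0
    for i in range(m):
        for j in range(i + 1):
            if dp[i][j] >= INF:
                continue
            # skip short[i]: LCS value j stays reachable with the same prefix of long
            dp[i + 1][j] = min(dp[i + 1][j], dp[i][j])
            # match short[i] with its first occurrence in long at position >= dp[i][j]
            lst = pos[short[i]]
            t = bisect_left(lst, dp[i][j])
            if t < len(lst):
                dp[i + 1][j + 1] = min(dp[i + 1][j + 1], lst[t] + 1)
                best = max(best, j + 1)
    return best
```

The dp table has O(m²) states, so only the "first occurrence" query keeps this from the target bound.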

  • »
    »
    15 months ago, # ^ |

    Wow, definitely, it's quite easy, thanks a lot! :)

  • »
    »
    15 months ago, # ^ |

To move to the state dp[i + 1][j + 1] we need to find the first occurrence of the element s[i + 1] in the big sequence starting from position dp[i][j] + 1. How can this be done in O(1) if the alphabet is big? Won't there be an additional log₂(max(N, M)) factor in the time estimate?

    • »
      »
      »
      15 months ago, # ^ |

      (Kind of overkill probably.)

Let n be the length of the longer sequence and m the length of the shorter one.

The first thing we do is delete from the longer sequence, in O(n + m) time using a hashmap, all characters that don't appear in the shorter sequence. Now there are at most m symbols left in the "effective alphabet". This construction also lets us assume that the alphabet is just [0, m - 1] (this part seems unnecessary if the alphabet is already something like [0, n - 1], but it should reduce the hidden constant if the alphabet is big enough).
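This filtering and renumbering step might look like the following (a sketch; the names are mine):

```python
def compress(short, long):
    """Renumber the symbols of `short` as 0..m-1 and drop from `long`
    every symbol that never appears in `short` (the "effective alphabet")."""
    ids = {}
    for c in short:
        if c not in ids:
            ids[c] = len(ids)  # first-seen order gives ids 0, 1, 2, ...
    new_short = [ids[c] for c in short]
    new_long = [ids[c] for c in long if c in ids]
    return new_short, new_long
```

Dropped symbols of the longer sequence can never participate in a common subsequence, so the LCS is unchanged.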

One thing we may notice is that iterating over (i, j) in any kind of lexicographical order is unnecessary: we may iterate over them in order of increasing dp[i][j], which turns out to be more convenient. All dp recalculations still happen in a valid order this way.

To achieve this, we will use n + 1 queues: queue number k holds all (i, j) such that dp[i][j] = k. Then we just iterate k from 0 to n - 1 and consider all pairs in the k-th queue. There are two types of transitions: (i, j, k) -> (i + 1, j, k) simply adds a new pair to the current queue for later consideration, while (i, j, k) -> (i + 1, j + 1, nk > k) is the interesting one, but it affects only later queues.

The difference between this and the naive update order is that now the queries "find the first symbol equal to c in the longer string at position p or later" arrive in increasing order of p. Suppose we have an array of answers to all such queries for the current position p. The transition to p + 1 is simple, because the answer changes for only one symbol, namely the symbol at the p-th position of the longer string, and for that symbol the answer changes to the next position of the same symbol in the longer string. The next position of each symbol can be precalculated in O(n + m), so after this precalculation we can answer these queries in O(1), because p only increases.

TL;DR: process states in order of increasing dp[i][j]; this way the starts of the query segments only move right, which makes recalculating the answers in O(1) possible.
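Putting the pieces together, the whole O(n + m²) procedure might be sketched as follows (my own names; a sketch of the approach described above, not a definitive implementation):

```python
def lcs_fast(short, long):
    """LCS in O(n + m^2): states (i, j) are processed in increasing dp[i][j]."""
    m, n = len(short), len(long)
    # nxt_same[k] = next position > k holding the same symbol as long[k];
    # first[c]    = smallest position >= p holding symbol c (pointer p starts at 0)
    nxt_same = [n] * n
    first = {}
    for k in range(n - 1, -1, -1):
        nxt_same[k] = first.get(long[k], n)
        first[long[k]] = k
    # queues[k] holds all states (i, j) with dp[i][j] == k
    queues = [[] for _ in range(n + 1)]
    seen = [[False] * (m + 1) for _ in range(m + 1)]
    queues[0].append((0, 0))
    seen[0][0] = True
    best, p = 0, 0
    for k in range(n + 1):
        while p < k:  # advance the pointer: one symbol's answer changes per step
            first[long[p]] = nxt_same[p]
            p += 1
        for i, j in queues[k]:  # pairs appended during iteration are visited too
            best = max(best, j)
            if i == m:
                continue
            # transition 1: skip short[i]; dp value stays k (same queue)
            if not seen[i + 1][j]:
                seen[i + 1][j] = True
                queues[k].append((i + 1, j))
            # transition 2: match short[i] at its first occurrence >= k
            q = first.get(short[i], n)
            if q < n and not seen[i + 1][j + 1]:
                seen[i + 1][j + 1] = True
                queues[q + 1].append((i + 1, j + 1))
    return best
```

Each state is enqueued at most once, so the work over states is O(m²), while the pointer p sweeps the longer string exactly once, giving the O(n) term.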

      • »
        »
        »
        »
        15 months ago, # ^ |

        Ok, thanks a lot, interesting approach :)

      • »
        »
        »
        »
        15 months ago, # ^ |

It doesn't seem to be "overkill": an additional log₂(max(N, M)) factor can seriously affect the running time.