-synx-'s blog

By -synx-, history, 7 years ago, In English

We are given 2 lines (DNA Sequences) of equal length of upto 105 characters consisting of alphabets A, T, G, C only.
We have to find the ith cyclic shift of one string for which maximum characters are matched between the two strings.
We have to do it in better than O(n2) obviously. Any hints?

  • Vote: I like it
  • +22
  • Vote: I do not like it

»
7 years ago, # |
  Vote: I like it +15 Vote: I do not like it

Suppose that strings are composed only by 0 and 1.. If the text is s with length(s) = n and the pattern is t with length(t) = m, reverse the pattern first, and then for each string build the polynomial P(w) = w[0] + w[1] × x + ... + w[length(w) - 1] × xlength(w) - 1. The coefficient of the term i + m - 1 in P(s + s) × P(rev(t)) give you the matches of the i - th cyclic shift of s with t.. For solving the original problem calculate the answer for each different letters and add up the result for obtain the answer for the original.

»
7 years ago, # |
  Vote: I like it +8 Vote: I do not like it
  • Consider masks for each letter (e.g. for 'A': AATGCA -> 110001).
  • Treat mask as a polynomial: x5 + x4 + 1.
  • Multiply poly(S1) and poly(reversed(S2)) modulo xn - 1. Modulo means setting the relation xn = 1 which is saying that rotating left by n changes nothing. To implement: the result of multiplication has size 2n, just add the high half to the low and remove the high half). Reverse is needed because normal convolution will pair first letters "asymmetrically".
  • The result will contain number of matches for each position.
  • Sum such results for all letters and choose the maximum.