Interesting problem involving convolutions

#	User	Rating
1	tourist	3880
2	jiangly	3669
3	ecnerwala	3654
4	Benq	3627
5	orzdevinwang	3612
6	Geothermal	3569
6	cnnfls_csy	3569
8	jqdai0815	3532
9	Radewoosh	3522
10	gyh20	3447

#	User	Contrib.
1	awoo	161
1	maomao90	161
3	adamant	156
4	maroonrk	153
5	-is-this-fft-	148
5	atcoder_official	148
5	SecondThread	148
8	Petr	147
9	nor	144
10	TheScrasse	142

The objective of this post is NOT understanding FFT or Karatsuba, or any algorithm that performs fast convolutions, it's about understanding one of its uses

Hi Codeforces community! I wanted to write this blog entry to help coders who would like to see a nice example of convolutions used in a problem which, at first sight, has little to no relation with polynomial multiplication (of course it has, it's a convolution after all)

The problem is as follows: You are given two strings S and P, with lengths n and m respectively, where m ≤ n

For each substring T of S with length m, we have to find the one that maximizes the number of positions with the same character, i. e T[i] = P[i]

For the sake of simplicity, let us assume that our strings consists only of letters a and b. For example, if S = baabab and P = bba, let's calculate the "similarity" between P and each substring of S of length 3.

s(baa, bba) = 2

s(aab, bba) = 0

s(aba, bba) = 2

s(bab, bba) = 1

We can see that there are two substrings which maximize the similarity (any answer will be fine)

A naive approach has a O(n·m) time complexity and uses O(n + m) memory

This should be OK if 1 ≤ n, m ≤ 1000. What about 1 ≤ n, m ≤ 10⁵? We have to find a faster way to apply $\text{[math]}$ for every substring T of S

First, let's try to change the formula above with a more mathematical approach. If we map each character {a,b}={0,1}, we can apply the next formula:

$\text{[math]}$ . This will give us the number of 1's (originally, b's) that match.

Let S^c be the complement of S (0 becomes 1, and viceversa). To compute the number of 0's that match, we apply

$\text{[math]}$

Now what? This doesn't improve the complexity at all

Let's remember how the convolution f of two discrete functions g and h is defined:

$\text{[math]}$

Now we're talking! This looks pretty similar to the formulas previously explained.

Next we will try to change our formulas to a simple convolution. First we will say that our array S and P are two functions g and h whose domain is [0, n - 1] and range is {0,1}.

$\text{[math]}$

But the formula is incorrect, this convolution, for each x applies the similarity formula, but reversing P!

This is our first issue, which we can easily solve by reversing P ourselves. Let's call P_r the reversed version of P. We can see that

$\text{[math]}$ is equivalent to $\text{[math]}$

Our second issue is, m ≤ n. We can easily solve this issue (again) by adding some trailing zerores to P_r, which is equivalent to add leading zeroes to P until m = n

But, what is f(x) exactly, and how can this help us to improve the time complexity?

We saw that $\text{[math]}$ , which is the same as taking a preffix of S and (this is the tricky part) a suffix of P, each of length x + 1

If we take x = m - 1, we can see that this applies the similarity function to the whole pattern P and the first substring of S which starts at 0. If we take x > m - 1 then we apply the function to P and the (x - m + 1) - th substring of S. And if we take x < m - 1 we apply the function to a substring which goes out of the bounds of S.

Let's check our example S = baabab and P = bba. First, let's map each character S = 100101 and P = 110.

Then,P_r = 011 and let's add some trailing zeroes: P_r = 011000.

Let's calculate the convolution f

f(0) = 0

f(1) = 1

f(2) = 1

f(3) = 0

f(4) = 1

f(5) = 1

Great! f computes the number of 1's that match. What about the number of 0's? Let's apply the same procedure to S^c and P^c

S^c = 011010 and P^c = 001

P_r^c = 100000

f^c(0) = 0

f^c(1) = 1

f^c(2) = 1

f^c(3) = 0

f^c(4) = 1

f^c(5) = 0

Fine! Now, the answer should be max f(i) + f^c(i) where i is in [m - 1, n - 1]

With this approach, we can solve the problem with a O(nlogn) time complexity using an algorithm like FFT to perform the convolution, or maybe Karatsuba algorithm

And what if the alphabet size is not 2?

For each character on the alphabet, we apply the convolution. Suppose f_{a_i} is the convolution between S and P where, if S[i] = = a_i then S[i] = 1, else S[i] = 0 (the same applies for P), for every character a_i on the alphabet a

The answer should be max $\text{[math]}$ with j in [m - 1, n - 1]

This solution has a O(|a|·nlogn) time complexity, where |a| is the size of the alphabet

And what now?

Here are some problems which can be solved using this approach

ADAMATCH

MAXMATCH

K inversions

Hope you like this post, and helps you solving some convolution problems!

Comments (5)

Show archived | Write comment?

snacache

6 years ago, # |

Auto comment: topic has been updated by snacache (previous revision, new revision, compare).

→ Reply

Walid_Amin

Thank you for this nice article, i really like convolution problems :)

Anyway, the paragraph that explains when the alphabet is not 2, i think it should be with j [m-1,n-1] without the complement part, since we are summing over all the alphabet.

6 years ago, # ^ |

You're totally right! Thanks, fixed!

rubmary

Great!

I_love_Harpae

+11

Thank you for this article! I've seen a lot of string matching problems for which the solution was to utilize some sort of convolution, and I had no idea where that came in and how it applied. This cleared a lot of things for me. Thanks again!

snacache's blog