Convolution and String Matching (small alphabet)

#	User	Rating
1	tourist	3845
2	jiangly	3707
3	Benq	3630
4	orzdevinwang	3573
5	Geothermal	3569
5	cnnfls_csy	3569
7	jqdai0815	3532
8	ecnerwala	3501
9	gyh20	3447
10	Rebelz	3409

#	User	Contrib.
1	maomao90	171
2	adamant	163
3	awoo	161
4	nor	152
5	maroonrk	151
5	-is-this-fft-	151
7	TheScrasse	148
8	atcoder_official	146
9	Petr	145
10	pajenegod	144

The objective of this post is NOT understanding FFT or Karatsuba, or any algorithm that performs fast convolutions, it's about understanding one of its uses

Hi Codeforces community! I wanted to write this blog entry to help coders who would like to see a nice example of convolutions used in a problem which, at first sight, has little to no relation with polynomial multiplication (of course it has, it's a convolution after all)

The problem is as follows: You are given two strings S and P, with lengths n and m respectively, where m ≤ n

For each substring T of S with length m, we have to find the one that maximizes the number of positions with the same character, i. e T[i] = P[i]

For the sake of simplicity, let us assume that our strings consists only of letters a and b. For example, if S = baabab and P = bba, let's calculate the "similarity" between P and each substring of S of length 3.

s(baa, bba) = 2

s(aab, bba) = a

s(aba, bba) = 2

s(bab, bba) = b

We can see that there are two substrings which maximize the similarity (any answer will be fine)

A naive approach has a O(n·m) time complexity and uses O(n + m) memory

This should be OK if 1 ≤ n, m ≤ 1000. What about 1 ≤ n, m ≤ 10⁵? We have to find a faster way to apply $\text{[math]}$ for every substring T of S

First, let's try to change the formula above with a more mathematical approach. If we map each character {a,b}={0,1}, we can apply the next formula:

$\text{[math]}$ . This will give us the number of 1's (originally, b's) that match.

Let S^c be the complement of S (0 becomes 1, and viceversa). To compute the number of 0's that match, we apply

$\text{[math]}$

Now what? This doesn't improve the complexity at all

Let's remember how the convolution f of two discrete functions g and h is defined:

$\text{[math]}$

Now we're talking! This looks pretty similar to the formulas previously explained.

Next we will try to change our formulas to a simple convolution. First we will say that our array S and P are two functions g and h whose domain is [0, n - 1] and range is {0,1}.

$\text{[math]}$

But the formula is incorrect, this convolution, for each x applies the similarity formula, but reversing P!

This is our first issue, which we can easily solve by reversing P ourselves. Let's call P_r the reversed version of P. We can see that

$\text{[math]}$ is equivalent to $\text{[math]}$

Our second issue is, m ≤ n. We can easily solve this issue (again) by adding some trailing zerores to P_r, which is equivalent to add leading zeroes to P until m = n

But, what is f(x) exactly, and how can this help us to improve the time complexity?

We saw that $\text{[math]}$ , which is the same as taking a preffix of S and (this is the tricky part) a suffix of P, each of length x + 1

If we take x = m - 1, we can see that this applies the similarity function to the whole pattern P and the first substring of S which starts at 0. If we take x > m - 1 then we apply the function to P and the (x - m + 1) - th substring of S. And if we take x < m - 1 we apply the function to a substring which goes out of the bounds of S.

Let's check our example S = baabab and P = bba. First, let's map each character S = 100101 and P = 110.

Then,P_r = 011 and let's add some trailing zeroes: P_r = 011000.

Let's calculate the convolution f

f(0) = 0

f(1) = 1

f(2) = 1

f(3) = 0

f(4) = 1

f(5) = 1

Great! f computes the number of 1's that match. What about the number of 0's? Let's apply the same procedure to S^c and P^c

S^c = 011010 and P^c = 001

P_r^c = 100000

f^c(0) = 0

f^c(1) = 1

f^c(2) = 1

f^c(3) = 0

f^c(4) = 1

f^c(5) = 0

Fine! Now, the answer should be max f(i) + f^c(i) where i is in [m - 1, n - 1]

With this approach, we can solve the problem with a O(nlogn) time complexity using an algorithm like FFT to perform the convolution, or maybe Karatsuba algorithm

And what if the alphabet size is not 2?

For each character on the alphabet, we apply the convolution. Suppose f_{a_i} is the convolution between S and P where, if S[i] = = a_i then S[i] = 1, else S[i] = 0 (the same applies for P), for every character a_i on the alphabet a

The answer should be max $\text{[math]}$ with j in [m - 1, n - 1]

This solution has a O(|a|·nlogn) time complexity, where |a| is the size of the alphabet

And what now?

Here are some problems which can be solved using this approach

ADAMATCH

MAXMATCH

K inversions

Hope you like this post, and helps you solving some convolution problems!

Rev.	By	When	Δ	Comment
en6	snacache	2018-05-11 05:08:43	4
en5	snacache	2018-05-11 00:23:31	17	Tiny change: 'f_{a_i}(j) + f^{c}_{a_i}(j)}' -> 'f_{a_i}(j)}'
en4	snacache	2018-05-10 23:54:16	62
en3	snacache	2018-05-10 23:49:52	2787	Tiny change: '$ memory.<\b>\n\nThis' -> '$ memory.</b>\n\nThis' (published)
en2	snacache	2018-05-10 21:17:51	2109	Tiny change: 'y, where $m\leqn$' -> 'y, where $n\beqm$'
en1	snacache	2018-05-10 19:07:47	444	Initial revision (saved to drafts)

History