Problem: Link

Submission: Link

**Part 1: Calculate the Prefix function (next array) in $$$O(n)$$$**

We can calculate the prefix function (next array) in linear time using the following code:

```
//next is a 0-indexed array that is initialized to zero.
//s is a 0-indexed string
for(int i = 1, j = 0; i < n; ++i){
while(j && s[i] != s[j]) j = next[j-1];
if(s[i] == s[j]) j++;
next[i] = j;
}
```

Why is it $$$O(n)$$$? The key idea is the amortized analysis. $$$j$$$ increases $$$n$$$ times, each times increase at most $$$1$$$ (in fact $$$0$$$ or $$$1$$$, $$$0$$$ if `s[i] != s[j]`

and $$$1$$$ if `s[i] == s[j]`

). Therefore, $$$j$$$ increases at most $$$n$$$. Since $$$j$$$ is never a negative number (you can prove it by induction on the `next`

array), the amount $$$j$$$ decreases $$$\leq$$$ the amount $$$j$$$ increases $$$\leq n$$$. Each times $$$j$$$ decreases at least $$$1$$$ (as $$$next[j-1] \leq j-1$$$), so $$$j$$$ decreases at most $$$n$$$ times. To sum all, $$$j$$$ increases and decreases at most $$$2n$$$ times, which is $$$O(n)$$$.

But there is a trap! We can calculate the whole `next`

array in $$$O(n)$$$, but for a suffix of length $$$k$$$, calculating the items of the `next`

array that corresponds with that suffix is not $$$O(k)$$$!!! In fact, the worst complexity for every suffix, even a single character, could also be $$$O(n)$$$! In fact the example is very easy to construct: `aaaaa....az`

, the next array is:

$$$[0, 1, ..., num(a)-1, 0]$$$.

For $$$0 \leq i \leq num(a)$$$, the line `j = next[j-1]`

executes $$$0$$$ times, and when the index comes to $$$z$$$, `j = next[j-1]`

would execute $$$num(a)$$$ times! That means, the computational burden on the suffix could also be $$$O(n)$$$!

The GrandLisboa earns $$$100$$$ million from $$$10$$$ gamblers, it may earn $$$90$$$ million from you and $$$10$$$ million from another $$$9$$$ gamblers, that is amortization! Let's come back to this problem. When you append each suffix $$$t$$$ to the original string $$$s$$$, the complexity to handle the suffix is not $$$O(|t|)$$$, in fact the worst case is $$$O(|s|)$$$, and according to the analysis above, such suffix is very easy to construct!

**Part 2: A fix**

We note that most `next`

transition is unnecessary. For example,the $$$0$$$-indexed string $$$s$$$ is `ababcababa`

and we are now checking the last character, which is `a`

. it is not necessary to scan `abab`

(length=$4$) at all, because the prefix `abab`

(`s[0:4]`

) is followed by `c`

while the prefix `abab`

(`s[5:9]`

) is followed by `a`

. So, scan the first two characters `ab`

is enough! Inspired by this, we define the last array:

```
//last[i][c]: Among 0 <= k <= i such that [0...k] is a prefix of [0,...,i] and [i-k,...,i] is a suffix of [0,...,i], the max k such that s[k+1]==c
//For example, s = "abcabaabd", last[7]['d'] = 7, last[7][c] = 1.
//s[0:1] = "ab" is a prefix of s[0:7]("abcabaab").
//s[0:1] is also a suffix of s[0:7]
//and s[1+1] = s[2] = 'c'
//The transition function: last[i][c] = (c == s[i+1] ? (next[i] >= 1 ? last[next[i]-1][c] : -1) : i);
```

We can easily transfer to the proper place in $$$O(1)$$$ time. Deducing `last[i][*]`

from `last[i-1][*]`

costs $$$O(|\alpha|)$$$ time ($$$\alpha$$$ is the alphabet), and such transition only happens $$$O(n)$$$ times, so we can fix this issue in $$$O(n |\alpha|)$$$ time.

As a newbie, what do I learn from this problem?

(1) First, carefully thinks about why the naive algorithm fails. This has been carefully analyzed in the Part $$$1$$$.

(2) Second, sometimes adding one dimension is really helpful.

(3) We should pay attention to the small constraints, they might be the breakthroughs. The "small constraints" are mostly explicit, but sometimes they are implicit and tricky. In this problem, the "lowercase Latin letters" is an implicit constraint that restricts the size of alphabet to $$$26$$$. When we work on some problems we may ignore this "implicit small constraints", which are the key factors. We may take the "lowercase Latin letters" for granted, but please note that the KMP algorithm never requires this! Here is another problem having "implicit small constraints": Diverse Substrings.

**Part 3: Why don't us generalize this idea to the original KMP algorithm?**

(1)The original $$$KMP$$$ algorithm is already linear and achieves the best theoretical complexity.

(2)The character set in the original $$$KMP$$$ algorithm is not restricted to the lower Latin letters. It can handle arbitrary character sets in linear time, not depending on the size of the character sets.