ko_osaga's blog

By ko_osaga, history, 15 months ago, In English

I recently solved some problems that involved the concept of Lyndon decomposition. Honestly, most of them were too hard to understand for me. I'm just trying to think out loud about things I've read, so I can learn ideas or better takes from smarter people?

Note that I will omit almost all proofs as I can't do that. I believe all unproven claims below are facts, but it is always great to have doubts about anything.

1. Lyndon decomposition, definition, and algorithms

Partly copy-pasted from this link.

A string is called simple (or a Lyndon word), if it is strictly smaller than any of its own nontrivial suffixes. Examples of simple strings are $$$a, b, ab, aab, abb, abcd, abac$$$.

It can be shown that a string is simple, if and only if it is strictly smaller than all its nontrivial cyclic shifts. As a corollary, it can be observed that simple words are never periodic (it is not a repetition of some words for $$$2$$$ or more times).

The Lyndon decomposition of string $$$s$$$ is a factorization $$$s = w_1 w_2 \ldots w_k$$$, where all strings $$$w_i$$$ are simple, and are in non-increasing order $$$w_1 \geq w_2 \geq \ldots \geq w_k$$$.

Alternatively, the Lyndon decomposition of string $$$s$$$ can be represented as $$$s = w_1^{p_1} w_2^{p_2} \ldots w_k^{p_k}$$$. Here, $$$p_i$$$ are positive integers, and $$$w^p_i$$$ denotes the string $$$w$$$ repeated for $$$p_i$$$ times. All strings $$$w_i$$$ are simple, and are in decreasing order $$$w_1 > w_2 > \ldots > w_k$$$. The only difference is that the group of identical factors is grouped as a chunk such as $$$w^p_i$$$.

It is claimed that for any string such a factorization exists and it is unique. However, I can't prove it.

1.1 Algorithm

There are two algorithms that compute the Lyndon decomposition in linear time. The first algorithm is the well-known Duval algorithm. E-maxx has a good explanation on this, so I won't discuss it here.

Another algorithm is conceptually much simpler. Given a string $$$S$$$, consider the greedy algorithm that repeatedly removes the smallest suffix from $$$S$$$. By definition, the greedy algorithm always removes a simple word, so the algorithm will return a decomposition consisting of simple words. We believe that the Lyndon decomposition is unique, thus algorithm returns a Lyndon decomposition.

Let's compute the time complexity, the algorithm will iterate at most $$$O(N)$$$ times, and it can find the smallest suffix naively in $$$O(N^2)$$$ time, so the naive implementation will take $$$O(N^3)$$$ time. However, the smallest suffix is just the first entry of the suffix array, so using the fastest suffix array algorithm can optimize each phase to $$$O(N)$$$, giving an $$$O(N^2)$$$ algorithm.

Should we compute the suffix array from scratch in each phase? The removal of a suffix does change the ordering in the suffix array. For example, $$$abac < ac$$$, but $$$aba > a$$$.

However, this issue doesn't apply to our application, where we remove the smallest suffix. Therefore, given a suffix array $$$SA_0, \ldots, SA_{N - 1}$$$ for the string $$$S$$$, one can simply iterate from $$$SA_0$$$ to $$$SA_{N - 1}$$$, and cut the string as long as it is the leftmost position we encountered. As the suffix array can be solved in $$$O(N)$$$, this gives an $$$O(N)$$$ solution to the Lyndon decomposition. I can't prove why this is true. But this looks like a folklore algorithm, so I believe it's true.

2. Computing Lyndon decomposition for each substring

For a string of size $$$N$$$, the Lyndon decomposition may have at most $$$O(N)$$$ size, in which case the above algorithms are already optimal. Hence, in this section, we only discuss finding the smallest suffix for each substring in near-constant time, since it may

  • lead to an algorithm for computing Lyndon decomposition in near-linear time on output size, by the above greedy algorithm.
  • yield some small implicit structure (tree) that captures the Lyndon decomposition for all interesting substrings

2.1. Lyndon decomposition for all suffixes

The removal of a prefix does not change the ordering in the suffix array. To find the smallest suffix in $$$S[x ...]$$$, just find the first entry in the suffix array such that $$$SA_i \geq x$$$.

2.2. Lyndon decomposition for all prefixes

Duval's algorithm is basically incremental since it repeatedly adds a letter $$$s[j]$$$ to the existing structure. This hints that the Lyndon decomposition can be computed for all prefixes, although it's not entirely straightforward.

I came up with the algorithm to compute all min suffixes for all prefixes. There are other algorithms to compute the min suffixes, such as the one ecnerwala described in this comment.

Duval algorithm maintains a pre-simple string in each iteration. Consider a pre-simple string $$$t = ww\ldots w\overline{w}$$$ for the current prefix. Except for the last string $$$\overline{w}$$$, every other string are simple. And if we take the Lyndon decomposition of $$$\overline{w}$$$, the first element of it is the prefix of $$$\overline{w}$$$, which is obviously less than $$$w$$$. As we know that Lyndon decomposition is unique, we can see that the last element of Lyndon decomposition of $$$\overline{w}$$$ is exactly the smallest suffix of the current prefix.

Thus, the naive algorithm is the following:

  • If $$$\overline{w}$$$ is empty, $$$w$$$ is the smallest suffix of the given prefix.
  • Otherwise, the smallest suffix of $$$\overline{w}$$$ is the smallest suffix for the given prefix.

However, we don't have to recompute the smallest suffix of $$$\overline{w}$$$ every time. In the decomposition algorithm, we fix the string $$$s_1 = s[0 : i)$$$ and compute the decomposition for the suffix $$$s[i \ldots]$$$. For each relevant $$$i$$$, we use dynamic programming. Let $$$MinSuf[j]$$$ be the length of smallest suffix of $$$S[i \ldots j)$$$ for $$$j > i$$$. If $$$\overline{w}$$$ is empty the smallest suffix is $$$w$$$. Otherwise, since $$$\overline{w}$$$ is exactly the string $$$S[i \ldots i + |\overline{w}|)$$$, $$$MinSuf[j] = MinSuf[i + |\overline{w}|]$$$. Therefore we can obtain a simple recursive formula.

2.3 Lyndon decomposition for all substrings?

This paper contains some ideas, so if you are interested, give it a try :)

3. The Runs Theorem

Run is a concept that is useful for solving problems related to repeats. Even if you never heard of the name, anyone who solved some challenging suffix array problems will be familiar with it.

Given a string $$$S$$$, the tuple $$$(l, r, p)$$$ is a run of string $$$S$$$ if

  • $$$0 \le l < r \le |S|$$$
  • $$$1 \le p \le |S|$$$
  • $$$r - l \geq 2p$$$
  • $$$p$$$ is the smallest positive integer where $$$S[i] = S[i + p]$$$ holds for all $$$l \le i < r - p$$$
  • The above four properties doesn't hold for tuple $$$(l - 1, r, p)$$$ and $$$(l, r + 1, p)$$$

Let $$$-S$$$ be the string where all elements are inverted: Specifically, we assign s[i] = 'a' + 'z' - s[i] for all elements of $$$S$$$, so that the usual comparison order is reverted, except the empty character which has the lowest priority.

Given a string $$$S$$$, a Lyndon prefix is the longest prefix that is a Lyndon word. Given a suffix array of $$$S$$$, this Lyndon prefix can be easily computed. Recall an algorithm that computes the Lyndon decomposition given a suffix array. Let $$$Rank_i$$$ be the inverse of the suffix array. Then, we can see that the length of the Lyndon prefix is the smallest $$$i$$$ such that $$$Rank_i < Rank_0$$$ (or $$$|S|$$$ if such does not exist). Similarly, we can also compute this for all suffixes $$$S[i \ldots]$$$: find the smallest $$$j > 0$$$ such that $$$Rank_{i + j} < Rank_i$$$.

For each suffix of $$$S$$$ and $$$-S$$$, we compute the Lyndon prefix $$$[i, j)$$$ and take them as a "seed". Start from the tuple $$$(i, j, j - i)$$$, and extend the tuple in both direction as long as $$$S[i] = S[i + p]$$$ holds. Specifically, Let $$$k$$$ be the maximum number such that $$$S[i, i + k) = S[j, j + k)$$$ and $$$l$$$ be the maximum number such that $$$S[i - l, i) = S[j - l, j)$$$. Then we obtain a run $$$(i - l, j + k, j - i)$$$. Both $$$k, l$$$ can be computed in $$$O(\log N)$$$ time with suffix arrays.

It's easy to verify that those elements are actually the run of the string. If we remove all duplicated runs, the following fact holds:

Fact 1. Those we computed are exactly the set of all Runs.

Fact 2. There are at most $$$n$$$ runs.

Fact 3. The sum of $$$(j - i) / p$$$ for all runs are at most $$$3n$$$.

Fact 4. The sum of 2-repeats ($$$j - i - 2p + 1$$$) obtained from runs are at most $$$n \log n$$$.

Fact 3 is useful when we want to enumerate all repeats. Suppose that we have to enumerate all possible repeats. A string "aaaa" can be considered as a repeat of "a" 4 times, but it is also a repeat of "aa" 2 times. In this case, we have to enumerate all multiples of $$$p$$$ — but by Fact 3, that does not affect the overall complexity.

Fact 1, 2, 3 can be found on this paper. I think Fact 4 is not hard to prove, but that doesn't mean I've done it, nor do I have a reference that states this fact.

4. Lexicographically minimum substring reverse

Given a string $$$S$$$, you can select $$$0$$$ or more non-overlapping substrings, and reverse them. What is the lexicographically minimum result you can obtain from the single iteration of this operation?

Let $$$S^R$$$ be the reverse of $$$S$$$. The answer is to take the Lyndon decomposition for $$$S^R$$$, and reverse each substring from that respective position.

I don't know why this works.

Intuitively, we are replacing each prefix of $$$S$$$ to the minimum suffix of $$$S^R$$$. Replacing each prefix to the minimum possible suffix seems like a good trade. Do you agree or disagree? XD

5. Minimal Rotation from Lyndon decomposition

Given a string $$$S$$$, what is the lexicographically minimum result you can obtain by taking a cyclic shift of $$$S$$$?

The answer can be found by finding the smallest suffix of length $$$> |S|$$$ for string $$$S + S$$$, and rotating at the respective position. This suffix can be found with Lyndon decomposition. Therefore we can solve this in $$$O(n)$$$ time, which is great.

What about just reversing a minimum suffix of $$$S$$$? Unfortunately, cases like "acabab", "dacaba" are the countercase. If we can reduce this problem into a minimum suffix instance, we can solve this problem for all prefixes, suffixes, and possibly substrings, so that's really unfortunate...

.. or maybe not. For a string $$$S$$$, consider it's Lyndon factorization $$$S = w_1^{p_1} w_2^{p_2} w_3^{p_3} \ldots w_k^{p_k}$$$. Clearly, taking the middle of periods is a bad idea. And taking only $$$w_k^{p_k}$$$ as a candidate is wrong.

Then what about trying to crack the tests? Let $$$SFX_j = w_j^{p_j} w_{j+1}^{p_{j + 1}} \ldots w_k^{p_k}$$$. Then, we can try all $$$SFX_j$$$ in range $$$k - 69 \le j \le k + 1$$$ as a candidate. It looks really hard to create an anti-test for this approach.

Lemma. Minimum rotation exists in the last $$$\log_2 |S|$$$ candidates of $$$SFX_j$$$. (Observation 6)

This provides an algorithm for computing the minimum rotation in $$$O(Q(n) \log n)$$$ time, where $$$Q(n)$$$ is time to compute the minimum suffix.

Practice problems

Minimum suffix for each prefix

Run Enumeration

Lexicographically minimum substring reverse

Minimum rotation for each substring

  • Vote: I like it
  • +105
  • Vote: I do not like it

15 months ago, # |
Rev. 2   Vote: I like it +29 Vote: I do not like it

It is claimed that for any string such a factorization exists and it is unique. However, I can't prove it.

Existence: repeatedly take the longest prefix that is a Lyndon word, and add it to the decomposition. We need to prove $$$w_{i} \geq w_{i + 1}$$$. Assume the contrary. Then, the concatenation of $$$w_{i}$$$ and $$$w_{i + 1}$$$ is a Lyndon word, thus $$$w_{i}$$$ wasn't maximal.

Uniqueness: induction on the length of the string, trivial for $$$n = 1$$$. Now, take two arbitrary decompositions of the same string $$$S$$$ of length $$$n$$$, the first of which has words of length $$$x_1[i]$$$ and the second of length $$$x_2[i]$$$ (1-indexed). WLOG assume $$$x_1[1] < x_2[1]$$$, as if $$$x_1[1] = x_2[1]$$$, by induction the decompositions are equal. Now, let $$$j$$$ be the maximum index such that $$$y = \sum_{i \leq j} x_1[j] < x_2[1]$$$. Then, the string $$$S[y + 1, x_2[1]]$$$ is lexicographically at most $$$S[y + 1, y + x_1[j + 1]]$$$, which is lexicographically at most $$$S[1, x_1[1]]$$$, which is lexicographically at most $$$S[1, x_2[1]]$$$. But $$$S[1, x_2[1]]$$$ is a Lyndon word, a contradiction.

An easier proof might exist, this is just the first thing I came up with.