[Tutorial] Dictionary of Basic Factors — O(1) String Matching

Revision en6, by Leonardo16, 2021-06-23 02:20:42

Hello, codeforces.
I think this technique is important, and easy to code, although I haven't found much about it, so I've decided to make a blog explaining it.This technique allows to compare substrings lexicographically in $$$O (1)$$$ with a preprocessing in $$$O (NlogN)$$$.
As this is my first post on codeforces, please let me know if there are any mistakes.

Prerequisites:
  • Basic strings knowledge.
  • Sparse Table(Optional).

What is a Dictionary of Basic Factors?

As in the sparse table we will have a matrix of size $$$N\log{N}$$$ called $$$DFB$$$, where $$$DFB[i][j]$$$ tells us the position of the string $$$S[i ... i + 2^j-1]$$$ (the first time it appears) among all the strings of size $$$2^j$$$ ordered lexicographically.

So, how do we build the matrix? Let's start with the lowest level, for powers of size $$$2^0$$$, we simply put the position of the character $$$S[i]$$$ in the dictionary (1 for 'A', 2 for 'B' ... and so on).

For the following $$$ DFB[i][j] $$$ we can save a pair of numbers that are $$$(DFB[i][j-1], DFB[i+2^{j-1}][j-1]) $$$

Now we replace the pairs by their position (the first time it appears) on the ordered array of those pairs, this can be done with Counting Sort in $$$O(N)$$$ (I'll explain it more deeply later), otherwise with a normal sort in $$$O(NlogN)$$$ (But this will slow down the algorithm).

We do that for all powers of two and we will have the complete matrix.

Handling queries:

Suppose we want to make a query that is "Say which of the two substrings $$$S[l_{1}...r_{1}]$$$ and $$$S[l_{2}...r_{2}]$$$ is smaller lexicographically?".
First of all, let's analyze that if the substrings have different sizes, we can compare only the prefixes of size $$$M$$$ where $$$M$$$ is the minimum size between the two substrings, if both prefixes are equal then the smallest lexicographically is the substring of smaller size.
As in the Sparse Table, suppose $$$ k = \ log {(r-l)} $$$, then we can represent the substring $$$ S [l ... r] $$$ as the pair $$$ ( DBF [l] [k], DFB [r-2^k+1] [k] ) $$$. Now to compare two substrings of equal size we simply lexicographically compare their pairs in $$$O(1)$$$.
For example, suppose we have the string "ABACABA" and we want to know which string is lexicographically smaller between $$$ S [0 ... 5] $$$ and $$$ S [1 ... 6] $$$ ("ABACAB" and "BACABA").Since $$$\log{(5-0)}=2$$$, the pairs to compare would be: $$$(DBF [0] [2], DBF [2] [2])$$$ and $$$(DBF [1] [2], DBF [2] [2])$$$, which would be $$$(1,2)$$$ and $$$(2,4)$$$ therefore the first substring is lexicographically smaller.

Proof:

Tags #strings, #string matching

History

 
 
 
 
Revisions
 
 
  Rev. Lang. By When Δ Comment
en13 English Leonardo16 2021-06-29 00:24:36 22
en12 English Leonardo16 2021-06-28 21:05:23 111
en11 English Leonardo16 2021-06-24 20:08:52 237 Tiny change: 'strings. \nThis t' -> 'strings. \nThis t'
en10 English Leonardo16 2021-06-23 02:53:32 4 Tiny change: '[2], DBF [2] [2])$, w' -> '[2], DBF [3] [2])$, w' (published)
en9 English Leonardo16 2021-06-23 02:39:03 663 Tiny change: ' in $O(N)$ (I'll explain it more deeply later), otherwis' -> ' in $O(N)$, otherwis'
en8 English Leonardo16 2021-06-23 02:23:48 4 Tiny change: 'hose pairs, this can be' -> 'hose pairs. This can be'
en7 English Leonardo16 2021-06-23 02:22:01 8 Tiny change: 'irst time it appears) ' -> 'irst time the pair appears) '
en6 English Leonardo16 2021-06-23 02:20:42 44
en5 English Leonardo16 2021-06-23 01:41:15 810 Tiny change: 'he string ABACABA and we wa' -> 'he string "ABACABA" and we wa'
en4 English Leonardo16 2021-06-23 01:03:39 602 Tiny change: 'l sort in $N\log{N}$(But this ' -> 'l sort in O(NlogN) (But this '
en3 English Leonardo16 2021-06-23 00:33:42 73 Tiny change: 'ize $N\logN$ called ' -> 'ize $N\log N$ called '
en2 English Leonardo16 2021-06-23 00:30:48 153
en1 English Leonardo16 2021-06-23 00:27:13 1289 Initial revision (saved to drafts)