Regular Expression problem -- better solution?

I've recently come across the following problem in a real life scenario:

You are given a regular expression with two special characters: * matches any sequence of characters (including the empty sequence) and ? matches any one character.

For example, a?b matches acb, but it does not match abc, accb or ab. a*?b however does match accb. abc*f?h matches abcdefgh.

You have to write a program that checks if a pattern and a string match.

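Obviously, we can write a simple recursive algorithm, which runs in O(n^2) once the result for each (pat, s) pair is memoized (without memoization, patterns with many * can make it exponential). It looks like this: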

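bool matches(const char *pat, const char *s) {
  // Pattern exhausted: match only if the string is exhausted too.
  if (*pat == 0)
    return (*s == 0);
  // Handle '*' before the literal comparison, so that a literal '*'
  // in s is not consumed as an ordinary character match.
  if (*pat == '*')
    return ((*s != 0) && matches(pat, s + 1)) || matches(pat + 1, s);
  // '?' or an equal character consumes exactly one character of s.
  if ((*s != 0) && (*pat == '?' || *pat == *s))
    return matches(pat + 1, s + 1);
  return false;
}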

For the scenario I have encountered, O(n^2) is more than enough. However, I've been wondering for a while whether a linear-time algorithm (or at least something better than quadratic) exists. I've given it some thought and can't come up with anything. The problem looks simple enough that some linear algorithm should exist, right?
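
For reference, here is a minimal sketch of the quadratic approach written as a bottom-up table, so the O(n^2) bound (more precisely O(nm)) holds regardless of how many * the pattern contains. It assumes the pattern and string are passed as std::string; the name matchesDp is only illustrative.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// ok[i][j] is true when pat[i..] matches s[j..].
bool matchesDp(const std::string& pat, const std::string& s) {
  const std::size_t m = pat.size(), n = s.size();
  std::vector<std::vector<char>> ok(m + 1, std::vector<char>(n + 1, 0));
  ok[m][n] = 1;  // empty pattern matches empty string
  for (std::size_t i = m; i-- > 0;) {
    for (std::size_t j = n + 1; j-- > 0;) {
      if (pat[i] == '*') {
        // '*' either absorbs s[j] (stay on i) or matches nothing (move to i + 1).
        ok[i][j] = (j < n && ok[i][j + 1]) || ok[i + 1][j];
      } else if (j < n && (pat[i] == '?' || pat[i] == s[j])) {
        // '?' or an equal character consumes exactly one character.
        ok[i][j] = ok[i + 1][j + 1];
      }
    }
  }
  return ok[0][0];
}
```

Each table cell is filled once in constant time, which is where the O(nm) bound comes from. The table could be reduced to two rows if memory matters.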

Comments (2)

Rafbill

7 years ago

+10

According to http://publications.csail.mit.edu/lcs/pubs/pdf/MIT-LCS-TM-041.pdf, it is possible to solve the problem involving only characters and ? in time $O(n \log m \log \Sigma)$ (where n is the length of the text, m is the length of the pattern, and Σ is the alphabet size).

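A way to solve the full problem is to split the pattern into blocks separated by *. Then, we can loop over the blocks, search for the first occurrence of each block in our text, and remove all characters that are before or inside that occurrence. Taking the first occurrence is safe because it leaves the longest possible remainder for the later blocks. The first block has to match at the very start of the text and the last block at the very end, unless the pattern starts or ends with *.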
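
A minimal sketch of this block-by-block greedy, assuming std::string inputs. The names blockMatchesAt and matchesByBlocks are hypothetical, and the block search here is a plain quadratic scan standing in for the faster ?-matcher from the paper, so this only illustrates the structure, not the final complexity.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// True when block (letters and '?') matches text starting at position pos.
static bool blockMatchesAt(const std::string& text, std::size_t pos, const std::string& block) {
  if (pos + block.size() > text.size()) return false;
  for (std::size_t i = 0; i < block.size(); ++i)
    if (block[i] != '?' && block[i] != text[pos + i]) return false;
  return true;
}

bool matchesByBlocks(const std::string& pat, const std::string& s) {
  // Split the pattern at every '*'; empty blocks are kept and match trivially.
  std::vector<std::string> blocks(1);
  for (char c : pat) {
    if (c == '*') blocks.emplace_back();
    else blocks.back() += c;
  }
  // No '*' at all: the single block must cover the whole string.
  if (blocks.size() == 1) return s.size() == pat.size() && blockMatchesAt(s, 0, pat);

  const std::string& first = blocks.front();
  const std::string& last = blocks.back();
  // The first block is anchored at the start, the last block at the end.
  if (first.size() + last.size() > s.size()) return false;
  if (!blockMatchesAt(s, 0, first)) return false;
  if (!blockMatchesAt(s, s.size() - last.size(), last)) return false;

  // Middle blocks: take the leftmost occurrence, then skip past it.
  std::size_t pos = first.size();
  const std::size_t limit = s.size() - last.size();  // middle blocks must end by here
  for (std::size_t b = 1; b + 1 < blocks.size(); ++b) {
    const std::string& blk = blocks[b];
    bool found = false;
    for (; pos + blk.size() <= limit; ++pos) {
      if (blockMatchesAt(s, pos, blk)) { found = true; break; }
    }
    if (!found) return false;
    pos += blk.size();
  }
  return true;
}
```

For example, abc*f?h splits into the blocks abc and f?h, which are checked against the start and the end of abcdefgh respectively.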

In order to obtain a subquadratic algorithm, we need to be able to find the first occurrence of a pattern with ? in time $O(k \log m \log \Sigma)$, when the pattern is matched in the first k characters of the text. To do so, it is enough to use the offline algorithm for the first 1, 2, 4, ... characters of the text, until the pattern is matched or we know that the pattern is not matched in the whole string.
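
The doubling step can be sketched as follows. The OfflineMatcher type and both function names are assumptions made for illustration, and naiveOffline is just a slow stand-in so the example runs; in the real algorithm it would be replaced by the fast matcher from the paper. The sketch also starts from the block length rather than 1, since shorter prefixes cannot contain a match.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <string>

// Leftmost match of block inside text[0..len), or npos if there is none.
using OfflineMatcher =
    std::function<std::size_t(const std::string& text, std::size_t len, const std::string& block)>;

// Naive stand-in for the offline matcher, used only to keep the sketch runnable.
std::size_t naiveOffline(const std::string& text, std::size_t len, const std::string& block) {
  for (std::size_t pos = 0; pos + block.size() <= len; ++pos) {
    bool ok = true;
    for (std::size_t i = 0; i < block.size() && ok; ++i)
      ok = (block[i] == '?' || block[i] == text[pos + i]);
    if (ok) return pos;
  }
  return std::string::npos;
}

// Run the offline matcher on prefixes of doubling length until a match appears.
std::size_t firstOccurrence(const std::string& text, const std::string& block,
                            const OfflineMatcher& offline) {
  std::size_t len = std::max<std::size_t>(1, block.size());
  while (true) {
    len = std::min(len, text.size());
    const std::size_t pos = offline(text, len, block);
    if (pos != std::string::npos) return pos;
    if (len == text.size()) return std::string::npos;  // whole text searched
    len *= 2;
  }
}
```

If the first match ends within the first k characters, the prefixes tried have total length proportional to k, so the cost is that of one offline run on O(k) characters up to a constant factor.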

The resulting algorithm for ? and * should work in time $O(n \log m \log \Sigma)$.

mouse_wireless

7 years ago

-10

This seems like it should work.

I don't really understand the part about using "the offline algorithm": since our text changes after every token (characters are removed from the start), I don't see how we can do offline computations. At any rate, you can do a binary search instead, which I think would bring the complexity to $O(n \log n \log m \log \Sigma)$, which is still subquadratic.

Please note that I haven't read the linked paper (only the abstract and a few lines).

mouse_wireless's blog