Regular Expression problem -- better solution?

I've recently come across the following problem in a real life scenario:

You are given a regular expression with two special characters: * matches any sequence of characters (including the empty sequence) and ? matches any one character.

For example, a?b matches acb, but it does not match abc, accb or ab. a*?b however does match accb. abc*f?h matches abcdefgh.

You have to write a program that checks if a pattern and a string match.

Obviously, we can write an O(n^2) algorithm which looks like this:

bool matches(const char *pat, const char *s) {
  if (*pat == 0)
    return (*s == 0);
  if ((*pat == '?' || *pat == *s) && (*s != 0))
    return matches(pat + 1, s + 1);
  if (*pat == '*')
    return ((*s != 0) && matches(pat, s + 1)) || matches(pat + 1, s);
  return false;
}

For the scenario I have encountered, O(n^2) is more than enough, however I've been wondering for a while if a linear time algorithm (or anyway something better than quadratic) exists. I've been giving it some thought and can't seem to come up with anything. It looks like so it seems like some linear algorithm should exist, right?

Comments (2)

Write comment?

Rafbill

6 years ago, # |

+10

According to http://publications.csail.mit.edu/lcs/pubs/pdf/MIT-LCS-TM-041.pdf, it is possible to solve the problem involving only characters and ? in time $\text{[math]}$ (where n is the length of the text, m is the length of the pattern, and Σ is the alphabet size).

A way to solve the full problem is to split the pattern in blocks separated by *. Then, we can loop over blocks, search for the first occurrence of each block in our text, and remove all characters that are before or inside that occurrence.

In order to obtain a subquadratic algorithm, we need to be able to find the first occurrence of a pattern with ? in time $\text{[math]}$ , when the pattern is matched in the first k characters of the text. To do so, it is enough to use the offline algorithm for the first 1, 2, 4, ... characters of the text, until the pattern is matched or we know that the pattern is not matched in the whole string.

The resulting algorithm for ? and * should work in time $\text{[math]}$

→ Reply

mouse_wireless

6 years ago, # ^ |

-10

This seems like it should work.

I don't really understand the part about using "the offline algorithm" (since our text changes (by removing characters from the start) after every token I don't see how we can do offline computations), but at any rate you can do binary search which I think would bring the complexity to O(n logn logm log(sigma)), which is still sub-quadratic.

Please note that I haven't read the linked paper (only the abstract and a few lines).

#	User	Rating
1	ecnerwala	3648
2	Benq	3580
3	orzdevinwang	3570
4	cnnfls_csy	3569
5	Geothermal	3568
6	tourist	3565
7	maroonrk	3530
8	Radewoosh	3520
9	Um_nik	3481
10	jiangly	3467

#	User	Contrib.
1	maomao90	174
2	awoo	164
2	adamant	164
4	TheScrasse	159
4	nor	159
6	maroonrk	156
7	-is-this-fft-	150
8	SecondThread	147
9	orz	146
10	pajenegod	145

mouse_wireless's blog