Help with Suffix Automaton Solution

#	User	Rating
1	ecnerwala	3649
2	Benq	3581
3	orzdevinwang	3570
4	Geothermal	3569
4	cnnfls_csy	3569
6	tourist	3565
7	maroonrk	3531
8	Radewoosh	3521
9	Um_nik	3482
10	jiangly	3468

#	User	Contrib.
1	maomao90	174
2	awoo	164
3	adamant	163
4	TheScrasse	159
5	nor	158
6	maroonrk	156
7	-is-this-fft-	151
8	SecondThread	147
9	orz	146
10	pajenegod	145

Hello all,

I'm trying to solve this problem on UVa: http://uva.onlinejudge.org/index.php?option=com_onlinejudge&Itemid=8&page=show_problem&category=24&problem=3783 (from ICPC South America Regional Finals 2011).

I've seen how to solve it using Suffix Arrays/LCP Arrays. However, I'm trying to solve it using a Suffix Automaton, mainly to learn more/get more used to it.

So, my approach is this:

Concatenate each file using a different separator $ (one that is not a-z).
Build the Suffix Automaton for the resulting string
Now, instead of finding the end-sets of each node (the indexes on where the substrings represented by that node end), I use a bitmask on each node the following way: If index i would appear in the end-set of node n and i belongs to file f, turn f on in node n (n.mask |= (1<<f)).
To see the how many searchable subsets are there, I do a DFS from the root (ignoring any '$/!/..." transitions) and count how many different bitmasks I have. (Also, I ignore the source node that represents the empty string).

The rationale behind this is that if the substring appears on indexes belonging to the files f0, f1..., that substring (and all substrings represented by that node) can be used to search that subset (and only that one). Ignoring the '$'-transitions should guarantee that we don't end up checking any invalid substring.

I'm getting WA with this approach. (It gives the correct answer for the mojority of the test cases, though :( ).

Could anyone please help me find out what's wrong with this idea?

currentState = root fullString = "word1$word2#word3%..." currentWord = 0 for each currentChar in fullString { currentState = the state reached by currentState using transition currentChar; if (currentChar is an end-of-word symbol [$, #, %, ...]) { //means we've reached the end of the current word and are going to the next ++currentWord; } else { auxState = currentState; while (auxState != root) { // propagate while we don't reach the root auxState.mask = (1LL<<currentWord); // Turn on the bit referent to the current word in that state auxState = auxState.suffixParent; } } }

Comments (11)

Write comment?

AlexanderBolshakov

10 years ago, # |

You should have some minor bug in your code, because your idea is correct. Just try to perform a stress-test of your solution: generate lots of small tests and find one which makes your solution give a wrong answer. Small tests are very useful since you can simulate your algorithm by hands, and this should be helpful in finding the bug.

→ Reply

alv-r-

10 years ago, # ^ |

← Rev. 2 →

Thanks for the reply. It took me a while to reply back because I had to do an insane amount of debugging lol. Every small test I did was giving me the right answer and the smallest one that failed on the official test cases was big.

Turns out I was doing mask |= (1<<fileNum), but using unsigned long long, changing to (1ULL<<fileNum) fixed it.

I don't get why though. It was putting a lot of other 'on' bits on some states, but shouldn't the signed bit affect only the 64th one? (I never use the 64th, cause F<=60).

Moreover, as far as I understand, any shift should shift the 64th away (except for fileNum 0, which didn't give me any problems at all...). What was the problem in this case?

My solution ended up being:

Build automaton for A$B!C#...
Run each string A, B, C, ... separately and mark the masks (instead of running the full A$B!C#... string and marking according to the index, like I was doing before)
Do a dfs to mark which states are reachable (don't go on with it if there is a '$' transition)
For reachable states, count the number of different masks

Dunno if any of this made a difference or if it was the 1ULL problem all this time, will revert to the previous solution later and see what happens.

rlac

When you say 1 << x, the result will be stored in an integer value. So, if x is larger than 31, this will cause an overflow. Changing it to 1ll << x will fix the problem, it isn't necessary to use an unsigned long long.

Thanks. I would expect (1<<32 or more) to give all zeroes. I was getting really weird values (and for some test cases only... a lot of big ones 'worked well').

Seems like shifting more than the size is undefined behaviour, though.

http://www.iso-9899.info/n1256.html#6.5.7p4 for reference

sn23581

in another approach, dfs is unnecessary.. calculate mask for EACH state of the automata and keep them in a set. while inserting mask in set, if the state is affected by your dummy characters [which are appended after each of the strings' insertion], then you have to compare the effective length (length since last dummy character) of the state to it's link's length; if they differ then you take this state's mask in the set, otherwise not. size of the set is your answer.

But wouldn't this be the same (timewise)?

Let's say the constant amount of work you do (inserting the masks, comparing lengths, etc) = T and the other constant work = K.

Doing it the way you're saying is N*(T+K). But doing the dfs later would be N*K + N*T, which is essentially the same (plz correct me if I'm wrong).

It may make the code shorter though (mine is already too big haha), so I'll do it to see what happens :D

PlayLikeNeverB4

9 years ago, # |

1) Can you explain how to compute the masks after you build the suffix automaton?

2) How do you solve this problem using Suffix arrays? I tried something, but it's hard to explain. I looked at ranges of suffixes (at most F) around index i and when the LCP is different then you can add the current mask to the set.

Thanks!

9 years ago, # ^ |

← Rev. 4 →

Suppose you have the automaton for the full string "word1$word2#word3%..." with each sate also pointing to its suffix parent (also called suffix pointer, parent pointer, etc), you can add a new property "mask" to each state (a bitmsk representing the set of words searchable by the substrings represented by that state).

To calculate the masks, you can run the full string in the automaton and for each state you reach you add that word to the set (mask) and propagate it to its suffix parent until you reach the root. (Skipping the end of word symbols, since they aren't a part of any actual word).

Something like this:

For solving with a Suffix Array, I think you can use something like this:

If you have the LCP array, you can derive the "searchable suffix sets" in a way that for each index i, there's a set from [beg+1, ..., i, ..., end-1] where beg it's the first index to the left where lcp[idx] < lcp[i] and end is the first index to the right where that happens.

You can kinda consider the LCP a histogram where each set is a contiguous part on the x axis in that histogram. Then you need to find which words appear in that suffix set to find the actual set.

Example:

lcp: 1 2 3 4 3 1 2 0 1
idx: 0 1 2 3 4 5 6 7 8

You can search the suffixes represented by indexes 0 to 6, 1 to 4, 2 to 4, 3, 6, 8. Then for each of these you need to see the set of words represented by it, that will be a searchable subset.

Thanks, I got both of the solutions working.

I was building the masks correctly, but I had a stupid bug. By the way, your approach seems slow (length^2). Did you get Accepted with that algorithm?

← Rev. 3 →

Yes, I've got AC using it. I think it's actually linear (amortized), I can't really explain why though. :P

I studied/implemented it from a paper and on my implementation I build the end-sets the same way I build the masks here (my implementation for reference, se the 'buildEndSets' function).

All I know is that when thinking about how to build the end-sets, I realized something like "oh, this will be amortized linear because [something in the same veins as why the automaton has at most 2*w-1 states and as why the construction algorithm I use is linear even though it has also 2 fors]. I'll reread the paper later and try to explain why.

I may be wrong, but I've used this code on a good amount of problems (where a linear solution was expected) and got AC (with that or with something similar, like the masks), so I think it's correct.

But anyway, do you have an approach that seems faster? Would you mind sharing it? I don't know any other way to do it.

Wouldn't it be O(N^2) for a string like "aaa...aaa". I tested it and it was very slow. My approach was to first mark only the "currentState"s. Then, visit the nodes in the decreasing order of their "length", and just send the mask to the suffix.

alv-r-'s blog