Understanding Suffix Automaton in depth

№	Пользователь	Рейтинг
1	tourist	3690
2	jiangly	3647
3	Benq	3581
4	orzdevinwang	3570
5	Geothermal	3569
5	cnnfls_csy	3569
7	Radewoosh	3509
8	ecnerwala	3486
9	jqdai0815	3474
10	gyh20	3447

№	Пользователь	Вклад
1	maomao90	174
2	awoo	164
3	adamant	162
4	TheScrasse	160
5	nor	158
6	maroonrk	156
7	-is-this-fft-	152
8	orz	146
9	pajenegod	145
9	SecondThread	145

I am having a lot of trouble with understanding suffix automaton with all it's details.

I have solved about 5 problems that contained very basic applications for it and I am stuck at some points.

I can't really understand how suffix links works, and what the congruence classes are. I know that the best resource is e-maxx site but actually I don't understand Russian and the translators (Google, Yandex , ... etc) sucks. Proofs are translated grammatically wrong and I really can't understand it in depth. The other resources like Crochemore's books are nice but they are actually long and they go beyond what an ACMer needs to understand a data structure.

So, it would be very nice if some one can explain suffix links and congruence classes to me.

Also, there are lots of people do calculate something called the right array in their codes and their comments are either Russian or Chinese and I can't really understand what and why is those 3 loops there.

I hope that someone can really explain all that.

for (int i = 0,p = S; i < n; ++i) ++r[p = ch[p][s[i] — 'a']]; static int b[N],t[N]; for (int i = 1; i <= cnt; ++i) ++b[l[i]]; for (int i = 1; i <= n; ++i) b[i] += b[i - 1]; for (int i = 1; i <= cnt; ++i) t[b[l[i]]--] = i; for (int i = cnt; i; --i) r[fa[t[i]]] += r[t[i]];

int occur[max_size]; // occur[state] = 1 if state is prefix of string vector<int> state_with_len[max_len]; for(int state = 0; state < state_count; state++) state_with_len[len[state]].push_back(state); // If you traverse states from bigger length to smallest you will have // Reversed topological order. The same which you use in dfs actually. for(int length = max_len - 1; length > 0; length--) for(auto state: state_with_len[length]) occur[link[state]] += occur[state];

Комментарии (17)

Написать комментарий?

Safrout

9 лет назад, # |

Am I asking for something stupid so that people are ignoring to answer or am I asking for an extremely difficult stuff that no one can answer that?!

I really want to know if I do have problems.

→ Ответить

adamant

9 лет назад, # ^ |

← Rev. 2 →

+14

Well, I suppose that suffix structures are extremely not interesting for most of people here :)

Well, ok. I can help you, but you need to come up with some certain questions for it.

By the way, what is "right array"? Can you give me some russian comment where it is mentioned?

← Rev. 9 →

Well, what I need is the following :

1- A good explanation for suffix links (some examples that clarifies what an end position class is). I know that the suffix link points to the node with the largest suffix of my node that is not included in my end position class. but why is that and what is that used for ?

2- I saw these loops in lots of people solutions who use Suffix Automaton and I really couldn't understand what does they do (They look useful).

for (int i = 0,p = S; i < n; ++i)
	++r[p = ch[p][s[i] &mdash; 'a']];
static int b[N],t[N];
for (int i = 1; i <= cnt; ++i) ++b[l[i]];
for (int i = 1; i <= n; ++i) b[i] += b[i - 1];
for (int i = 1; i <= cnt; ++i) t[b[l[i]]--] = i;
for (int i = cnt; i; --i) r[fa[t[i]]] += r[t[i]];

3- An explanation for generating the longest common substring of a set of strings (if getting the solution is easy after understanding the last 2 points then don't answer this point).

Thank you very much

Omg, I don't know for sure but second code looks like counting sort from suffix array. Can you refer to some comments or submissions with it, please?

They are written in these 2 code :

1- A solution for SPOJ NSUBSTR problem :

https://gist.github.com/lazycal/11244808

2- A solution for SPOJ LCS2 problem :

https://gist.github.com/foreseeable/6266824

Ok, I think, I understood what this code states for. This is very weird way to do this thing but anyway.

==1==

for (int i = 0,p = S; i < n; ++i)
	++r[p = ch[p][s[i] - 'a']];

You want to count for every state of suffix automata how many times does it appear in string. It will be array r. You initialize it with 1 in every state which corresponds to some prefix of string.

==2==

static int b[N],t[N];
for (int i = 1; i <= cnt; ++i) ++b[l[i]];
for (int i = 1; i <= n; ++i) b[i] += b[i - 1];
for (int i = 1; i <= cnt; ++i) t[b[l[i]]--] = i;

Actually in this part you obtain lexicographical order of states. It is equivalent to counting sort by lengths of states. At the end array t contains topological sort of states.

==3==

for (int i = cnt; i; --i) r[fa[t[i]]] += r[t[i]];

Finally you finish calculating your array r by pushing initial values through the suffix links of automaton in reversed topological sort order. This procedure is described on e-maxx, I suppose.

==4==

Here is equivalent simplificated code for such thing.

P.S. I suppose this is Chinese code? Such crazy one :)

Yes it is a Chinese code indeed. Thank you very much for your time.

I am really happy to talk to you again. BTW you are first guy to point to me that Suffix Automaton actually exists. So, thank you for that too.

Yeah, I remember. You're welcome :)

+13

I wrote about endpos concept here. Please write if more clarification is needed.

Now I understand the definition clearly. Thank you for that. I will spend some time understanding why is that useful. If you have any nice references on understanding why is it useful that would nice. Thank you again.

Well, I know how to use them, but it is hard to me to clearly explain things it is based on. It is kind of intuitive for me. And I still have references except e-maxx. Maybe I'll write entry about suffix automaton but very next time...

The main thing why it is very very useful is the fact that suffix links of automaton of string s is simply suffix tree of reversed s. This makes suffix automaton incredibly powerful structure.

It somehow became more clear after pointing to the suffix tree thing.

I will be waiting for your blog entry.

Thank you very much.

I will bother you with one more question. From where did you get that experience and intuition. What have you gone through to get it ?!

Also please include the other references if you can.

For the first time suffix automaton was explained to me by I_love_natalia. After that I solved a lot of problem, sometimes using some information from e-maxx and/or my friends... That's all, I suppose.

Okay, so can you give me the problems that made a difference to your understanding (if you remember them for sure) ?

This and this were useful. BUt actually I solved very big amount of problems...

elita15

8 лет назад, # ^ |

[user:adamant]Hi,I understood the suffix automata algorithm given on e-maxx however I am not able to understand how to find longest common substring of multiple string. (The translation of that part isn't very clear :( ). Could you please give me a brief idea on how suffix automata is used to solve this problem? (SPOJ-LCS2) Thanks in advance.

Блог пользователя Safrout