Suffix Automaton Implementation

#	User	Rating
1	ecnerwala	3648
2	Benq	3580
3	orzdevinwang	3570
4	cnnfls_csy	3569
5	Geothermal	3568
6	tourist	3565
7	maroonrk	3530
8	Radewoosh	3520
9	Um_nik	3481
10	jiangly	3467

#	User	Contrib.
1	maomao90	174
2	adamant	164
2	awoo	164
4	TheScrasse	160
5	nor	159
6	maroonrk	156
7	-is-this-fft-	150
7	SecondThread	150
9	orz	146
10	pajenegod	145

I've recently been working on suffix automaton code and I have an implementation that I believe to be more intuitive and easier to type (certainly shorter) than the implementation I've seen circulating around (mainly adapted from cp-algorithms). It also appears to benchmark a little faster than the cp-algorithms implementation as well, which is a nice perk. Please, let me know if you can find some improvements to make, either making the code shorter, faster, etc.

Edit: The primary purpose of this code is for ICPC-style contests which limit printed material and have no digital materials (unlike codeforces which allows copy/paste during contests). Thus, my primary focus is brevity and speed, not necessarily readability. There's always tradeoffs; for example, if 20 characters would be added to make the code much more readable, than that's a valid consideration, but I want the number of lines and number of characters to be as small as reasonable without making the code completely arcane.

// len is the longest-length substring ending here
// pos is the first index in the string matching here
// term is whether this node is a terminal (aka a suffix)
struct st { int len, pos, term; st *link; map<char, st*> next; };
st *suffixAutomaton(string &str) {
    st *last = new st(), *root = last;
    for(auto c : str) {
        st *p = last, *cur = last = new st{last->len + 1, last->len};
        while(p && !p->next.count(c))
            p->next[c] = cur, p = p->link;
        if (!p) cur->link = root;
        else {
            st *q = p->next[c];
            if (p->len + 1 == q->len) cur->link = q;
            else {
                st *clone = new st{p->len+1, q->pos, 0, q->link, q->next};
                for (; p && p->next[c] == q; p = p->link)
                    p->next[c] = clone;
                q->link = cur->link = clone;
            }
        }
    }
    while(last) last->term = 1, last = last->link;
    return root;
}

template <const int sigma, const int char_offset> class SuffixAutomaton { public: class Node { public: Node() : parent_(-1), suffix_link_(-1) { memset(&next_, -1, sigma * sizeof(int)); } explicit Node(int parent) : parent_(parent), suffix_link_(-1) { memset(&next_, -1, sigma * sizeof(int)); } int parent_; int suffix_link_; int length_{}; int occurrences_{}; std::array <int, sigma> next_{}; bool is_terminal_{}; }; [[nodiscard]] std::optional <const Node&> GetNode(size_t id) const { if (id <= nodes_count_) { return aut_[id]; } return std::nullopt; } explicit SuffixAutomaton(const std::string::const_iterator& first, const std::string::const_iterator& last) { aut_.reserve(std::distance(first, last) * 2); aut_.emplace_back(); int last_state = 0; for (auto it = first; it != last; ++it) { last_state = Extend(last_state, *it - char_offset); } while (last_state > 0) { aut_[last_state].occurrences_ = 1; aut_[last_state].is_terminal_ = true; last_state = aut_[last_state].suffix_link_; } TopologicalSort(); for (int i = nodes_count_ - 1; i >= 0; --i) { const int& node = sorted_order_[i]; for (const int& to : aut_[node].next_) { if (to != -1) { aut_[node].occurrences_ += aut_[to].occurrences_; } } } } int GetOccurrences(const std::string::const_iterator& first, const std::string::const_iterator& last) { int state = 0; for (auto it = first; it != last; ++it) { if (aut_[state].next_[*it - char_offset] == -1) { return 0; } state = aut_[state].next_[*it - char_offset]; } return aut_[state].occurrences_; } private: void TopologicalSort() { std::vector <int> deque; /* reserve for efficiency*/ deque.reserve(nodes_count_ + 1); sorted_order_.reserve(nodes_count_ + 1); std::vector <int> in_degree(nodes_count_ + 1); for (int i = 0; i <= nodes_count_; ++i) { for (const int& to : aut_[i].next_) { if (to != -1) { in_degree[to] += 1; } } } for (int i = 0; i <= nodes_count_; ++i) { if (in_degree[i] == 0) { deque.push_back(i); } } auto deque_it = begin(deque); while (deque_it != deque.end()) { int node = *deque_it; std::advance(deque_it, 1); sorted_order_.push_back(node); for (const int& to : aut_[node].next_) { if (to != -1 && --in_degree[to] == 0) { deque.push_back(to); } } } if (size(sorted_order_) != nodes_count_ + 1) { throw std::logic_error("Wrong automaton construction"); } } int Clone(int parent, const int& current_transition, const int& c) { aut_.push_back(aut_[current_transition]); aut_.back().parent_ = parent; aut_.back().length_ = aut_[parent].length_ + 1; int clone = ++nodes_count_; aut_[current_transition].suffix_link_ = clone; while (parent != -1 && aut_[parent].next_[c] == current_transition) { aut_[parent].next_[c] = clone; parent = aut_[parent].suffix_link_; } return clone; } int Choose(int node, const int& c) { int q = aut_[node].next_[c]; if (aut_[q].parent_ == node) { return q; } else { return Clone(node, q, c); } } int Extend(int last_state, const int& c) { if (aut_[last_state].next_[c] != -1) { return Choose(last_state, c); } int new_state = ++nodes_count_; aut_.emplace_back(last_state); aut_[new_state].length_ = aut_[last_state].length_ + 1; while (last_state != -1 && aut_[last_state].next_[c] == -1) { aut_[last_state].next_[c] = new_state; last_state = aut_[last_state].suffix_link_; } if (last_state != -1) { aut_[new_state].suffix_link_ = Choose(last_state, c); } else { aut_[new_state].suffix_link_ = 0; } return new_state; } int nodes_count_{}; std::vector <Node> aut_; std::vector <int> sorted_order_; };

void add_letter(char c) { int p = last; last = sz++; len[last] = fpos[last] = len[p] + 1; for(; to[p][c] == 0; p = link[p]) { to[p][c] = last; } int q = to[p][c]; if(q == last) { return; } if(len[q] == len[p] + 1) { link[last] = q; return; } int cl = sz++; to[cl] = to[q]; link[cl] = link[q]; fpos[cl] = fpos[q]; len[cl] = len[p] + 1; link[q] = link[last] = cl; for(; to[p][c] == q; p = link[p]) { to[p][c] = cl; } }

Comments (22)

Write comment?

NTTilted

13 months ago, # |

← Rev. 2 →

-27

I personally prefer not to use pointers until you don't have to. You can take a look on my implementation: i think, it should be clear (anyway, I've spent hours to make it such clean and fast as i think(lol))

Spoiler

→ Reply

bicsi

13 months ago, # ^ |

+32

I think you might be missing the point of the post.

Also

Also 2

Thanks for pointing these things out! Suprisingly, I've never thought about just sorting states by their length. And sorry for inconvenience with reading my comment, I've already edited it.

Java

+24

What doing?

Lior

+30

Relevant profile pic.

I really like this implementation meaf!

adamant

My typical implementation with fpos looks as follows:

I think, in terms of simplicity, it's generally better to have shorter lines (e.g. don't go longer than 80 characters per line), and avoid writing several operations on the same line, unless you're absolutely certain it will make the code better... Besides that, structuring code in a way that earlier conditions allow preliminary termination (vs having few nested levels of conditions) is often considered simpler to comprehend.

P.S. In your code, you use while loop in the first part and for loop in the second, which is inconsistent and probably should be avoided...

Meaf

Ah yes, I forgot to mention that number of lines is important since this is for ICPC book code primarily rather than codeforces, so line count and character count are quite important. I'll take a look at re-replacing the while with the for, though, I see that it might be a little confusing to read during contest in a hackpack

nikgaevoy

If you are concerned about the possibility to implement it from scratch during a contest, then the main improvement point I see is surprisingly replacing pointers with std::vector + indices. Usually, I advocate for using pointers (especially, smart pointers) in tree-like data structures, but here I believe that indices are clearly superior. The idea of the pointer-based approach in general is that you can have the best of both worlds: the speed overhead becomes negligible if you are using a proper allocator, you can make your data structure persistent just by changing two lines of code, and even if on top of that you need to do a lot of deletions and the memory becomes a concern, it is also solvable without rewriting the whole data structure. However, in this particular case pointers don't do anything really useful. The suffix automaton can't really be made persistent (because of the amortization) and the deletion is also not a thing, unless you remove the last character only and this is basically a pop_back or two. On the other hand, indices are easier to debug, which may be important for some variations of suffix automata that you may want to implement.

The problem I am referring to is that there is no single best way to implement things like trie (Aho-Corasick included). The problem is in the variable that you call next. The naive solution is to use map or even unordered_map in each node and hope for the best, but it is not the fastest one. The fastest solution is to use std::array, but it works only if memory is not a problem and $$$\lvert \Sigma \rvert$$$ is small. Another approach that is somewhere in the middle is to make unordered_map that maps pairs (node, char) into nodes and in that case, using pointers makes this part nondeterministic, which is potentially problematic for debugging.

Just for the reference, here is my minimalistic implementation. It is slightly longer, takes 69 lines of code (nice!), but most of the difference comes from the different formatting, the algorithm is essentially the same.

It kind of can be made persistent, if you absolutely need it at all cost, but you'd need a data structure like link-cut tree on top of it to do so, unfortunately. Like, you would need to maintain the suffix link tree as a dynamic tree, and then the changes you do to the automaton are expressed as some updates on a path from a vertex to one of its ancestors. Probably for this specific purpose a dynamic suffix array or suffix tree would make a bit more sense.

But even then you store nodes in a link-cut tree and still can survive accessing them through the indices. I don't see it as any kind of problem, especially, having in mind that you usually have this interface in a regular link-cut tree.

Accessing vertices of the inner tree by index in logarithmic time instead of constant may look as a problem, but, first of all, I think it is unavoidable because of the persistency and also we have bigger problems with the fact that we can't use splay trees anymore (due to similar amortized problems), so our link-cut tree would work in time $$$\Theta(\log^2)$$$.

But maybe there is a bit more clever solution than simply putting everything in a link-cut tree. At least, our paths have some additional structure that we potentially may use. I think, it should be either known and written somewhere in a literature or a promising topic to research.

lightseba

If it’s possible to reverse the string, it’s easier to deal with adding/removing from the front of the string, as opposed to the end (or equivalently, working with prefixes instead of suffixes). A nice offline solution for a persistent prefix array where you add and remove from the end of the string works by building a trie on the queries, and running the suffix array algorithm on the prefixes of the trie (similar to binary lifting). Then you can build the full LCS (as opposed to LCP) array of everything as normal, and maintaining the current LCS array in a set as you DFS the trie. Alternatively, you can do it fully online with hashing to compare suffixes and compute LCS.

There’s a funny problem May I Add A Letter? where the judge solution was persistent suffix automaton, but it ended up getting hacked.

ffao

To correct the record: the real author of that problem is arthur.nascimento and the official solution is some weird dynamic suffix array structure that can't get hacked because it's not amortized (I think it's similar to what you described in your first paragraph).

However, some contestants used a persistent suffix automaton to get AC because the cases weren't strong enough. At least in Brazil this is partially my fault, since a few weeks before the contest I mentioned using persistent suffix automaton for a different problem (where the string was highly constrained so I don't think it was hackable -- but I didn't prove it).

Oh my bad, I thought that user was a judge from this blog. That is one of my favorite problems, thanks to you both for working on it!

nor

Regarding pointers, I think the current implementation is pretty nice, apart from the use of new everywhere. Using a simple function to allocate nodes from a deque will make it faster (about as fast as using a vector and indices) and also can be debugged better (if your point about debugging was about having all nodes in one place). Using pointers for trees is also quite idiomatic (and short) and that's also a reason why many ICPC team codebooks (including KACTL) use them despite being slow.

Maybe there is a pre-allocation and overload of new happening behind the scenes..

With deques, allocation automatically happens in blocks, so it's better in general.

Ironically, I found this implementation to be faster than the standard CP-Algorithms implementation using a vector and indices! And yes, this code is for ICPC hackpack/book code, so I prioritized code length over speed. To that point, is there a way you can think of removing the new keyword?

Hi, thanks for the comment! I implemented both and did my best to minimize the size, and I found that using pointers was significantly shorter than using indices. It's also just easier to use, in my opinion :p.

I'll take a look at your implementation and see if there are any tricks I can incorporate into my code. But yes, the primary concern is the ability to type it from scratch in a contest; I don't really much care about my codeforces rating (look at my profile, lol); instead, I'm focusing on ICPC contests (since I've qualified for the NAC and am World Finals hopeful), so I want my book code to be as short and implementable as possible

In that case, I think that indices are the better option. Unlike things like fft, you would never implement the suffix automaton in its shortest form, instead, it will always contain a dp of some sort. In these cases, debuggability is a major concern (and in ICPC it always is a concern in general) and indices allow you to simply print all vertices and check the correctness of your dp on paper without spending any computer time.

Also, speaking from experience, I would say that the reliability of your library both in team reference and inside your head is the #1 priority, especially for the WF. So, I don't think that reducing the code size by, let's say, 10% should be a priority of any kind when you are building your team reference document, and all reasonable implementations of the suffix automaton vary in length on that 10% at most (if we ignore the formatting).

I found that for many applications, it suffices to add a single value to the node and then do the dfs / dp, but maybe I need more SAM problems. Do you have any that come to mind?

MR.JO

+11

nice implementation

Meaf's blog