# Aho-Corasick algorithm

Let there be a set of strings with the total length $m$ (sum of all lengths). The Aho-Corasick algorithm constructs a data structure similar to a trie with some additional links, and then constructs a finite state machine (automaton) in $O(m k)$ time, where $k$ is the size of the used alphabet.

The algorithm was proposed by Alfred Aho and Margaret Corasick in 1975.

## 1. Construction of the trie

Formally a trie is a rooted tree, where each edge of the tree is labeled by some letter. All outgoing edge from one vertex must have different labels.

Consider any path in the trie from the root to any vertex. If we write out the labels of all edges on the path, we get a string that corresponds to this path. For any vertex in the trie we will associate the string from the root to the vertex.

Each vertex will also have a flag $\text{leaf}$ which will be true, if any string from the given set corresponds to this vertex.

Accordingly to build a trie for a set of strings means to build a trie such that each $\text{leaf}$ vertex will correspond to one string from the set, and conversely that each string of the set corresponds to one $\text{leaf}$ vertex.

We now describe how to construct a trie for a given set of strings in linear time with respect to their total length.

We introduce a structure for the vertices of the tree.

const int K = 26;

struct Vertex {
int next[K];
bool leaf = false;

Vertex() {
fill(begin(next), end(next), -1);
}
};

vector<Vertex> trie(1);


Here we store the trie as an array of $\text{Vertex}$. Each $\text{Vertex}$ contains the flag $\text{leaf}$, and the edges in the form of ans array $\text{next}[]$, where $\text{next}[i]$ is the index to the vertex that we reach by following the character $i$, or $-1$, if there is no such edge. Initially the trie consists of only one vertex – the root – with the index $0$.

Now we implement a function that will add a string $s$ to the trie. The implementation is extremely simple: we start at the root node, and as long as there are edges corresponding to the characters of $s$ we follow them. If there is no edge for one character, we simply generate a new vertex and connect it via an edge. At the end of the process we mark the last vertex with flag $\text{leaf}$.

void add_string(string const& s) {
int v = 0;
for (char ch : s) {
int c = ch - 'a';
if (trie[v].next == -1) {
trie[v].next = trie.size();
trie.emplace_back();
}
v = trie[v].next;
}
trie[v].leaf = true;
}


The implementation obviously runs in linear time. And since every vertex store $k$ links, it will use $O(m k)$ memory.

It is possible to decrease the memory consumption to $O(m)$ by using a map instead of an array in each vertex. However this will increase the complexity to $O(n \log k)$.

## 2. Construction of an automaton

Suppose we have built a trie for the given set of strings. Now let’s look at it from a different side. If we look at any vertex. The string that corresponds to it is a prefix of one or more strings in the set, thus each vertex of the trie can be interpreted as a position in one or more strings from the set.

In fact the trie vertices can be interpreted as states in a finite deterministic automaton. From any state we can transition – using some input letter – to other states, i.e. to another position in the set of strings. For example, if there is only one string in the trie $abc$, and we are standing at vertex $2$ (which corresponds to the string $ab$), then using the letter $c$ we can transition to the state $3$.

Thus we can understand the edges of the trie as transitions in an automaton according to the corresponding letter. However for an automaton we cannot restrict the possible transitions for each state. If we try to perform a transition using a letter, and there is no corresponding edge in the trie, then we nevertheless must go into some state.

More strictly, let us be in a state $p$ corresponding to the string $t$, and we want to transition to a different state with the character $c$. If there is an edge labeled with this letter $c$, then we can simply go over this edge, and get the vertex corresponding to $t + c$. If there is no such edge, then we must find the state corresponding to the longest proper suffix of the string $t$ (the longest available in the trie), and try to perform a transition via $c$ from there.

For example let the trie be constructed by the strings $ab$ and $bc$, and we are currently at the vertex corresponding to $ab$, which is a $\text{leaf}$. For a transition with the letter $c$, we are forced to go to the state corresponding to the string $b$, and from there follow the edge with the letter $c$.

suffix link for a vertex $p$ is a edge that points to the longest proper suffix of the string corresponding to the vertex $p$. The only special case is the root of the trie, the suffix link will point to itself. Now we can reformulate the statement about the transitions in the automaton like this: while from the current vertex of the trie there is no transition using the current letter (or until we reach the root), we follow the suffix link.

Thus we reduced the problem of constructing an automaton to the problem of finding suffix links for all vertices of the trie. However we will build these suffix links, oddly enough, using the transitions constructed in the automaton.

Note that if we want to find a suffix link for some vertex $v$, then we firstly have the base case that the root vertex has its suffix link as itself, and all nodes that are immediate children of the root vertex (i.e the vertices associated with prefixes of length one) also have their suffix links as the root vertex. Moreover, the suffix links of all deeper vertices can be evaluated as follows: we can go to the ancestor $p$ of this current vertex (let $c$ be the letter of the edge from $p$ to $v$), then follow its suffix link, and perform the transition with the letter $c$ from there.

Thus the problem of finding the transitions has been reduced to the problem of finding suffix links, and the problem of finding suffix links has been reduced to the problem of finding a suffix link and a transition, except for vertices closer to the root. So we have a recursive dependence that we can resolve in linear time.

Let’s move to the implementation. Note that we now will store the ancestor $p$ and the character $pch$ of the edge from $p$ to $v$ for each vertex $v$. Also at each vertex we will store the suffix link $\text{link}$ (or $-1$ if it hasn’t been calculated yet), and in the array $\text{go}[k]$ the transitions in the machine for each symbol (again $-1$ if it hasn’t been calculated yet).

const int K = 26;

struct Vertex {
int next[K];
bool leaf = false;
int p = -1;
char pch;