Explain Boyer-Moore Algorithm with Example

Technology CommunityCategory: StringsExplain Boyer-Moore Algorithm with Example
VietMX Staff asked 3 years ago

The insight behind Boyer-Moore is that if you start searching for a pattern in a string starting with the last character in the pattern, you can jump your search forward multiple characters when you hit a mismatch.

Let’s say our pattern p is the sequence of characters p1p2, …, pn and we are searching a string s, currently with p aligned so that pn is at index i in s.

E.g.:

s = WHICH FINALLY HALTS.  AT THAT POINT...
p = AT THAT
i =       ^

The B-M paper makes the following observations:

(1) if we try matching a character that is not in p then we can jump forward n characters:

‘F’ is not in p, hence we advance n characters:

s = WHICH FINALLY HALTS.  AT THAT POINT...
p =        AT THAT
i =              ^

(2) if we try matching a character whose last position is k from the end of p then we can jump forward k characters:

‘ ‘s last position in p is 4 from the end, hence we advance 4 characters:

s = WHICH FINALLY HALTS.  AT THAT POINT...
p =            AT THAT
i =                  ^

Now we scan backwards from i until we either succeed or we hit a mismatch. (3a) if the mismatch occurs k characters from the start of p and the mismatched character is not in p, then we can advance (at least) k characters.

‘L’ is not in p and the mismatch occurred against p6, hence we can advance (at least) 6 characters:

s = WHICH FINALLY HALTS.  AT THAT POINT...
p =                  AT THAT
i =                        ^

However, we can actually do better than this. (3b) since we know that at the old i we’d already matched some characters (1 in this case). If the matched characters don’t match the start of p, then we can actually jump forward a little more (this extra distance is called ‘delta2’ in the paper):

s = WHICH FINALLY HALTS.  AT THAT POINT...
p =                   AT THAT
i =                         ^

At this point, observation (2) applies again, giving

s = WHICH FINALLY HALTS.  AT THAT POINT...
p =                       AT THAT
i =                             ^

and bingo! We’re done.

Implementation
public static int[] makeD1(char[] pat) {
    int[] table = new int[255];
    for (int i = 0; i < 255; i++)
        table[i] = pat.length;
    for (int i = 0; i < pat.length - 1; i++)
        table[pat[i]] = pat.length - 1 - i;
    return table;
}

public static boolean isPrefix(char[] word, int pos) {
    int suffixlen = word.length - pos;
    for (int i = 0; i < suffixlen; i++)
        if (word[i] != word[pos + i])
            return false;
    return true;
}

public static int suffix_length(char[] word, int pos) {
    int i;
    for (i = 0;
        ((word[pos - i] == word[word.length - 1 - i]) &&
            (i < pos)); i++) {}
    return i;
}

public static int[] makeD2(char[] pat) {
    int[] delta2 = new int[pat.length];
    int p;
    int last_prefix_index = pat.length - 1;
    for (p = pat.length - 1; p >= 0; p--) {
        if (isPrefix(pat, p + 1))
            last_prefix_index = p + 1;
        delta2[p] = last_prefix_index + (pat.length - 1 - p);
    }
    for (p = 0; p < pat.length - 1; p++) {
        int slen = suffix_length(pat, p);
        if (pat[p - slen] != pat[pat.length - 1 - slen])
            delta2[pat.length - 1 - slen] = pat.length - 1 - p + slen;
    }
    return delta2;
}

public static int bm(char[] string, char[] pat) {
    int[] d1 = makeD1(pat);
    int[] d2 = makeD2(pat);
    int i = pat.length - 1;
    while (i < string.length) {
        int j = pat.length - 1;
        while (j >= 0 && (string[i] == pat[j])) {
            i--;
            j--;
        }
        if (j < 0)
            return (i + 1);
        i += Math.max(d1[string[i]], d2[j]);
    }
    return -1;
}