The insight behind Boyer-Moore is that if you start searching for a pattern in a string starting with the last character in the pattern, you can jump your search forward multiple characters when you hit a mismatch.
Let’s say our pattern p
is the sequence of characters p1
, p2
, …, pn
and we are searching a string s
, currently with p
aligned so that pn
is at index i
in s
.
E.g.:
s = WHICH FINALLY HALTS. AT THAT POINT...
p = AT THAT
i = ^
The B-M paper makes the following observations:
(1) if we try matching a character that is not in p
then we can jump forward n
characters:
‘F’ is not in p
, hence we advance n
characters:
s = WHICH FINALLY HALTS. AT THAT POINT...
p = AT THAT
i = ^
(2) if we try matching a character whose last position is k
from the end of p
then we can jump forward k
characters:
‘ ‘s last position in p
is 4 from the end, hence we advance 4 characters:
s = WHICH FINALLY HALTS. AT THAT POINT...
p = AT THAT
i = ^
Now we scan backwards from i
until we either succeed or we hit a mismatch. (3a) if the mismatch occurs k
characters from the start of p
and the mismatched character is not in p
, then we can advance (at least) k
characters.
‘L’ is not in p
and the mismatch occurred against p6
, hence we can advance (at least) 6 characters:
s = WHICH FINALLY HALTS. AT THAT POINT...
p = AT THAT
i = ^
However, we can actually do better than this. (3b) since we know that at the old i
we’d already matched some characters (1 in this case). If the matched characters don’t match the start of p
, then we can actually jump forward a little more (this extra distance is called ‘delta2’ in the paper):
s = WHICH FINALLY HALTS. AT THAT POINT...
p = AT THAT
i = ^
At this point, observation (2) applies again, giving
s = WHICH FINALLY HALTS. AT THAT POINT...
p = AT THAT
i = ^
and bingo! We’re done.
public static int[] makeD1(char[] pat) {
int[] table = new int[255];
for (int i = 0; i < 255; i++)
table[i] = pat.length;
for (int i = 0; i < pat.length - 1; i++)
table[pat[i]] = pat.length - 1 - i;
return table;
}
public static boolean isPrefix(char[] word, int pos) {
int suffixlen = word.length - pos;
for (int i = 0; i < suffixlen; i++)
if (word[i] != word[pos + i])
return false;
return true;
}
public static int suffix_length(char[] word, int pos) {
int i;
for (i = 0;
((word[pos - i] == word[word.length - 1 - i]) &&
(i < pos)); i++) {}
return i;
}
public static int[] makeD2(char[] pat) {
int[] delta2 = new int[pat.length];
int p;
int last_prefix_index = pat.length - 1;
for (p = pat.length - 1; p >= 0; p--) {
if (isPrefix(pat, p + 1))
last_prefix_index = p + 1;
delta2[p] = last_prefix_index + (pat.length - 1 - p);
}
for (p = 0; p < pat.length - 1; p++) {
int slen = suffix_length(pat, p);
if (pat[p - slen] != pat[pat.length - 1 - slen])
delta2[pat.length - 1 - slen] = pat.length - 1 - p + slen;
}
return delta2;
}
public static int bm(char[] string, char[] pat) {
int[] d1 = makeD1(pat);
int[] d2 = makeD2(pat);
int i = pat.length - 1;
while (i < string.length) {
int j = pat.length - 1;
while (j >= 0 && (string[i] == pat[j])) {
i--;
j--;
}
if (j < 0)
return (i + 1);
i += Math.max(d1[string[i]], d2[j]);
}
return -1;
}