Explain Knuth-Morris-Pratt (KMP) Algorithm in Plain English

Technology Community › Category: Strings › Explain Knuth-Morris-Pratt (KMP) Algorithm in Plain English

VietMX Staff asked 4 years ago

The KMP algorithm is an efficient string matching algorithm due to Donald Knuth, Vaughan Pratt, and James H. Morris.

For the string matching problem (find p in S) the naive approach would be to slide a window of size p all over S and check if it matches. That works but actually is inefficient. It takes O(p*S) time.

Instead of naive approach KMP is a linear time (O(n)) algorithm that exploits the observation that every time a match (or a mismatch) happens, the pattern itself contains enough information to dictate where the new examination should begin from. Every time the naive function fails (or succeeds), it starts matching over from the next character. This might not be necessary. We could use our knowledge of the point of last matching and the “structure” of the test string to know where the next matching should begin from.

How To Understand Prefix Function and Partial Match Table?

The “structure” of the string is encapsulated in the Prefix (or Failure) Function that we use to create Partial Match Table. Now, in order to talk about the meaning, we need to know about proper prefixes and proper suffixes.

Proper prefix: All the characters in a string, with one or more cut off the end. “S”, “Sn”, “Sna”, and “Snap” are all the proper prefixes of “Snape”.
Proper suffix: All the characters in a string, with one or more cut off the beginning. “agrid”, “grid”, “rid”, “id”, and “d” are all proper suffixes of “Hagrid”.

Saying that here’s the partial match table for the pattern abababca where each cell value is “The length of the longest proper prefix in the (sub)pattern that matches a proper suffix in the same (sub)pattern”.

char:  | a | b | a | b | a | b | c | a |
index: | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
value: | 0 | 0 | 1 | 2 | 3 | 4 | 0 | 1 |

Let’s also try it for cell five, which concerns “ababa”. We have four proper prefixes (“a”, “ab”, “aba”, and “abab”) and four proper suffixes (“a”, “ba”, “aba”, and “baba”). Now, we have two matches: “a” and “aba” are both proper prefixes and proper suffixes. Since “aba” is longer than “a”, it wins, and cell five gets value 3.

How To Use Partial Match Table?

We can use the values in the partial match table to skip ahead (rather than redoing unnecessary old comparisons) when we find partial matches. The formula works like this:

If a partial match of length partial_match_length is found and table[partial_match_length] > 1, we may skip ahead partial_match_length - table[partial_match_length - 1] characters.*

Let’s say we’re matching the pattern “abababca” against the text “bacbababaabcbab”. Here’s our partial match table again for easy reference:

char:  | a | b | a | b | a | b | c | a |
index: | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
value: | 0 | 0 | 1 | 2 | 3 | 4 | 0 | 1 |

The first time we get a partial match is here:

bacbababaabcbab
 |
 abababca

This is a partial_match_length of 1. The value at table[partial_match_length - 1] (or table[0]) is 0, so we don’t get to skip ahead any. The next partial match we get is here:

bacbababaabcbab
    |||||
    abababca

This is a partial_match_length of 5. The value at table[partial_match_length - 1] (or table[4]) is 3. That means we get to skip ahead partial_match_length - table[partial_match_length - 1] (or 5 - table[4] or 5 - 3 or 2) characters:

// x denotes a skip

bacbababaabcbab
    xx|||
      abababca

This is a partial_match_length of 3. The value at table[partial_match_length - 1] (or table[2]) is 1. That means we get to skip ahead partial_match_length - table[partial_match_length - 1] (or 3 - table[2] or 3 - 1 or 2) characters:

// x denotes a skip

bacbababaabcbab
      xx|
        abababca

At this point, our pattern is longer than the remaining characters in the text, so we know there’s no match.

Things to remember:

Once table is built, this algorithm has complexity O(n) even for binary alphabet (compare with O(m*n) for brute force).
Table is built very quickly at start since pattern is usually very small.
Table-building part is O(m). Search part is O(n). Total complexity O(m+n) even on binary alphabet.
Compare with O(m*n) for brute force on binary alphabet.

Complexity Analysis

Because the prefix-function already tells us what needs to be done about the already matched parts, we do not need to check those parts again and hence, only one pass over the string is sufficient. So, the time complexity of the matching is O(n) .
What about the pre-computation? It turns out that, if correctly implemented, it can be done in O(m) time where m is a length of a pattern.
Overall, it takes O(m+ n) time. But since m≤ n, the time is bounded by O(n). This is certainly better than the usual O(n²)
We only require as much space as the pattern. So, the space requirement is O(m).

Implementation

public static int KMP(String search, String target) {
    int[] failureTable = failureTable(target);

    int targetPointer = 0; // current char in target string
    int searchPointer = 0; // current char in search string

    while (searchPointer < search.length()) { // while there is more to search with, keep searching
        if (search.charAt(searchPointer) == target.charAt(targetPointer)) { // case 1
            // found current char in targetPointer in search string
            targetPointer++;
            if (targetPointer == target.length()) { // found all characters
                int x = target.length() + 1;
                return searchPointer - x; // return starting index of found target inside searched string
            }
            searchPointer++; // move forward if not found target string
        } else if (targetPointer > 0) { // case 2
            // use failureTable to use pointer pointed at nearest location of usable string prefix
            targetPointer = failureTable[targetPointer];
        } else { // case 3
            // targetPointer is pointing at state 0, so restart search with current searchPointer index
            searchPointer++;
        }
    }
    return -1;
}

/**
 * Returns an int[] that points to last valid string prefix, given target string
 */
public static int[] failureTable(String target) {
    int[] table = new int[target.length() + 1];
    // state 0 and 1 are guarenteed be the prior
    table[0] = -1;
    table[1] = 0;

    // the pointers pointing at last failure and current satte
    int left = 0;
    int right = 2;

    while (right < table.length) { // RIGHT NEVER MOVES RIGHT UNTIL ASSIGNED A VALID POINTER
        if (target.charAt(right - 1) == target.charAt(left)) { // when both chars before left and right are equal, link both and move both forward
            left++;
            table[right] = left;
            right++;
        } else if (left > 0) { // if left isn't at the very beginning, then send left backward
            // by following the already set pointer to where it is pointing to
            left = table[left];
        } else { // left has fallen all the way back to the beginning
            table[right] = left;
            right++;
        }
    }
    return table;
}

Question Tags: Senior Level, Strings