Knuth-Morris-Pratt

Another linear alternative for string pattern matching. Ideal for cases for unnatural languages such as those with few alphabets. DNA for instance. With smaller alphabets you are increasing the odds of a longest prefix suffix match, so you will tend to skip across the the text for frequently with KMP.

The Partial Match Table

Given the pattern abababca

char:   a  b  a  b  a  b  c  a 
index:  0  1  2  3  4  5  6  7  
value:  0  0  1  2  3  4  0  1

How This Works

What we are interested is the longest proper prefix-suffix matching.

Proper Suffixes:

bababca
ababca
babca
abca
bca
ca
a

Proper Prefixes:

abababc
ababab
ababa
abab
aba
ab
a

Back to the table:

char:   a  b  a  b  a  b  c  a 
index:  0  1  2  3  4  5  6  7  
value:  0  0  1  2  3  4  0  1

The first thing to understand is that at index i, we are representing [0,i] of the string. So that means that the first character will not have any prefix-suffix match.

@index1      *
char:     a  b  a  b  a  b  c  a 
index:    0  1  2  3  4  5  6  7  
value:    0  0  1  2  3  4  0  1 
matches:  (none)

@index2         *
char:     a  b  a  b  a  b  c  a 
index:    0  1  2  3  4  5  6  7  
value:    0  0  1  2  3  4  0  1 
matches:  *     * (a)

@index3            *
char:     a  b  a  b  a  b  c  a 
index:    0  1  2  3  4  5  6  7  
value:    0  0  1  2  3  4  0  1 
matches:  <-->  <--> (ab) 

@index4               *
char:     a  b  a  b  a  b  c  a 
index:    0  1  2  3  4  5  6  7  
value:    0  0  1  2  3  4  0  1 
matches:  <----> <----> (aba) 

@index5                  *
char:     a  b  a  b  a  b  c  a 
index:    0  1  2  3  4  5  6  7  
value:    0  0  1  2  3  4  0  1 
matches:  <-----<-->-----> (abab) 

@index6                     *
char:     a  b  a  b  a  b  c  a 
index:    0  1  2  3  4  5  6  7  
value:    0  0  1  2  3  4  0  1 
matches:  (none) 

@index7                        *
char:     a  b  a  b  a  b  c  a 
index:    0  1  2  3  4  5  6  7  
value:    0  0  1  2  3  4  0  1 
matches:  *                    * (a)

Using the Prefix Table

We use the prefix table to skip ahead, and avoid unnessessary comparisons in the process. The formula works like this: If a we find a partial match of the string and pattern L and if prefix_table[L] > 1, then we may skip ahead L - prefix_table[L - 1] characters.

For example:

Given

Text    : a b a b a b c a 
Pattern : b a c b a b a b a a b c b a b

Preprocessed prefix table:
char:     a  b  a  b  a  b  c  a 
index:    0  1  2  3  4  5  6  7  
value:    0  0  1  2  3  4  0  1

Lets begin:

Our begin partial match begins here.

Text    : b a c b a b a b a a b c b a b
Pattern :   a b a b a b c a 

L = 1;
table[L - 1] = 0;
skip = 0;
Text    : b a c b a b a b a a b c b a b
Pattern :         a b a b a b c a 

L = 5;
table[L - 1] = 3; 
skip = 5 - 3 = 2;


Text    : b a c b a b a b a a b c b a b
Pattern :                 a b a b a b c a 
Conclusion: no match found;

Proof

Goal: How can we show that performing these skips is guaranteed to be safe? In other words, how can be show that skipping any number of shifts that is < skip will guarantee to mismatch?

Lets take a look at this example:

T: a b c d a b c d ...
P: a b c d a b c y
M: <--->   <--->

According to the formula, the allignment should move 7-3 = 4 places.

T: a b c d a b c d ...
P:         a b c d a b c y

Intuitively, what is happening is that we are aligning abc to the next occurrence of abc in T. These are guaranteed to match, since that is what is validly defined within the prefix table. We can show that otherwise shifting by alignments fewer than 3 would guarantee to conflict. This is analogous to for example moving to the next valid occurrence of the first character in P. For example, when we mismatch, we know that at least the first character must match with the next occurrence of a, so lets jump to this next location. (Base condition)

The idea is the same with the longest prefix-suffix match. Rather than limiting ourselves to only the first character, why not extend this idea to match the characters we already know to exist within the string (Inductive case)? And we know where this is. Because we have guarantee from the prefix table that the suffix is equivalent to prefix, then moving the said alignment with the prefix would to the very least align properly with the suffix. What happens after, we are not interested in.

We can perform this jumps immediately because our analysis is only depend on P and not T. That means we are using information only from P, to decided on where to jump next.

Just like if we were to search for the next valid occurrence of the first character in P, and moving to this location, the longest prefix suffix match does each of the characters in the match. Because we have a guarantee that the prefix must at least align with the suffix in order to have a complete matching of P in T, moving to this alignment is the most optimal way to guarantee that at least the prefix of P will match in the search for the next alignment in T.

Last updated