Boyer-Moore

Is another efficient string pattern matching algorithm that is the standard benchmark from practical string search literature. Here is some insight.

Consider the naive approach that compares fix pattern P with every substring of T of length $|P|$

P: word
T: There would have been a time
         word

As we can see w and o have been matched, but u and r have failed. Instead of moving one more alignment as the naive algorithm does, we can show that skipping 2 alignments can guarantee no fault in the general algorithm. This is because u does not exist anywhere in P, so it should be safe just moving past P after u.

Boyer-Moore uses this principles as one of its heuristics for good character matching. The algorithm is able to determine from its character comparisons whether or not it can skip other pointless comparisons.

Rules

The search string T is read from left to right, but the character comparisons are made from right to left.

P: word
T: There would have been a time
    <---- word

The Bad Character Rule: Upon mismatch of a character (1) if the character is not in the string P move the alignment pasted the mismatched character, or (2) if the character is otherwise in the string, move the alignment to the first character that aligns with the mismatched character (iterating right to left).

// assuming the bad character rule alone

step 1: initial alignment between T and P
       .
T: GCTTCTGCTACCTTTTGCGCGCGCGCGGAA
P: CCTTTTGC

step 2: C mismatched with T, so we move P to align with 
        the first C
            .
T: GCTTCTGCTACCTTTTGCGCGCGCGCGGAA
P:    CCTTTTGC

step 3: the comparisons begin again, and we find a 
        mismatch with the A and G, where A does not
        exist in P
T: GCTTCTGCTACCTTTTGCGCGCGCGCGGAA
P:           CCTTTTGC

The Good Suffix Rule: let t=the substring that is currently matched with between T and P. (1) Skip to there is a match with between t and T, or (2) P moves past t.

// assuming the bad character rule alone

step 1: initial alignment between T and P
        .<->
T: CGTGCCTACTTACTTACTTACTTACGCGAA
P: CTTACTTAC

step 2: there was a mismatch between C and T, and
        we were able to move 3 alignments, to the 
        next occurrence of `TAC`.
        .<----->
T: CGTGCCTACTTACTTACTTACTTACGCGAA
P:     CTTACTTAC


step 3: Next we find that suffix of P CTTAC matched
        with suffix of T CCTAC, so that is where the
        next alignment got shifted.
           <------->
T: CGTGCCTACTTACTTACTTACTTACGCGAA
P:         CTTACTTAC

Putting it Togeher

Both the bad character rule and good suffix rule are used to skip alignments in the search text, that the naive algorithm would have otherwise ignored. Using both rules, Boyer Moore optimizes the amount of skips by taking maximum of both rules.

// assuming both rules into play

step 1: initial alignment between T and P
           .
T: GTTATAGCTGATCGCCGGCGTAGCGGCGAA
P: GTAGCGGCG

step 2: The good suffix rule did not apply because
        the mismatch length was 1. So we move to 
        the first occurrence of T.
        BC = 6
        GS = 0
               .<-> 
T: GTTATAGCTGATCGCCGGCGTAGCGGCGAA
P:        GTAGCGGCG

step 3: Now the GCR @ (.) can move 1 space (0 alignments),
        while the GSR can move to the next GCG with 2 alignments.
        BC = 0
        GS = 2
               .<---->   
T: GTTATAGCTGATCGCGGCGTAGCGGCGAA
P:           GTAGCGGCG


step 4: BCR says to move P past C, as C does not exist in 
        current P (GTA). GSR says that we can move the entire
        string past first G, since the substring does not match
        any of P.
        BC = 2
        GS = 7
                     <-------> 
T: GTTATAGCTGATCGCGGCGTAGCGGCGAA
P:                   GTAGCGGCG

Time Complexity: $O(n)$ Space Complexity: $O(\sum *|P|)$ , where $\sum$ defines the alphabet of T.

Implementation Details

In practice we need to be able to immediately know the length to skip in T. To do this, a preprocessing step must be done to produce a 2 dimensional lookup table given P and T, for both rule BC and GS. For example:

Lookup Table for the Bad Character Rule.

P: TCGC
T: AATCAATAGC

\sum() = ACGT
P = TCGC

Look up Table:
  T  C  G  C
A 0  1  0  3
C 0  -  2  -
G 0  1  -  0
T -  0  1  2

PreviousZ Algorithm NextKnuth-Morris-Pratt

Last updated 4 years ago