Boyer Moore Algorithm | Good Suffix Heuristic

Last modified: 2021-04-23 07:47:50 | Author: Mango

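We have already discussed the bad character heuristic variant of the Boyer Moore algorithm. This article covers the good suffix heuristic for pattern searching. As with the bad character heuristic, a preprocessing table is built for the good suffix heuristic.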

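Good Suffix Heuristic

Let t be a substring of the text T that matches a substring of the pattern P (specifically, the suffix of P that has been compared so far). The pattern is then shifted until one of the following holds:
1) Another occurrence of t in P is aligned with t in T.
2) A prefix of P is aligned with a suffix of t.
3) P has moved past t.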

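Case 1: Another occurrence of t in P matches t in T
The pattern P may contain more occurrences of t. In that case we shift the pattern so that such an occurrence lines up with t in the text T. For example:

Explanation: In the example above, the substring t of the text T matched the pattern P (shown in green) before a mismatch occurred at index 2. We then search P for another occurrence of t ("AB"). One is found starting at position 1 (yellow background), so the pattern is shifted right by 2 positions to align that occurrence of t in P with t in T. This is the weak rule of the original Boyer Moore algorithm and is not very effective; the strong good suffix rule is discussed further down.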
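As a rough illustration of case 1, the sketch below finds the closest other occurrence of t by brute force. The helper name `case1_shift` and the sample pattern "CABAB" are assumptions made for this example (the pattern was chosen to be consistent with the positions quoted in the explanation); a real implementation would use a preprocessed table rather than a scan.

```python
def case1_shift(pat, j):
    """Weak good suffix rule, case 1, by brute force (illustrative helper).

    j is the pattern index where the mismatch occurred, so the matched
    suffix is t = pat[j+1:]. Returns the smallest shift that aligns another
    occurrence of t in pat with the matched text, or None if there is none.
    """
    t = pat[j + 1:]
    if not t:
        return None  # nothing matched yet, so case 1 does not apply
    # Try candidate start positions from right to left, skipping t itself.
    for start in range(j, -1, -1):
        if pat[start:start + len(t)] == t:
            return (j + 1) - start
    return None


# Assumed example: t = "AB" matched, mismatch at index 2 of "CABAB".
print(case1_shift("CABAB", 2))  # 2
```

Scanning from right to left returns the smallest shift first, which is what keeps the rule from skipping a possible match.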

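Case 2: A prefix of P matches a suffix of t in T
It is not always possible to find another occurrence of t in P. Sometimes there is none at all. In such cases we can look for a suffix of t that matches some prefix of P and try to align the two by shifting P. For example:

Explanation: In the example above, t ("BAB") matched P (shown in green) at indices 2-4 before the mismatch. Because t does not occur anywhere else in P, we search for a prefix of P that matches a suffix of t. We find the prefix "AB" (yellow background) starting at index 0. It does not match the whole of t, but it does match the suffix "AB" of t that starts at index 3. So the pattern is shifted by 3 positions to align the prefix with the suffix.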
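Case 2 can be sketched in the same brute-force style. The function name `case2_shift` and the pattern "ABBAB" are assumptions for illustration; the pattern was picked so that t = "BAB" sits at indices 2-4 and the prefix "AB" matches the suffix of t, as in the explanation.

```python
def case2_shift(pat, j):
    """Case 2 by brute force (illustrative helper).

    Finds the longest prefix of pat that equals a proper suffix of the
    matched part t = pat[j+1:], and returns the shift that aligns them.
    Returns None when no such prefix exists.
    """
    m = len(pat)
    t = pat[j + 1:]
    # Try proper suffixes of t from longest to shortest.
    for k in range(len(t) - 1, 0, -1):
        if pat[:k] == t[-k:]:
            return m - k
    return None


# Assumed example: t = "BAB" matched, mismatch at index 1 of "ABBAB".
print(case2_shift("ABBAB", 1))  # 3
```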

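Case 3: P moves past t
If neither of the two cases above applies, we shift the pattern past t. For example:

Explanation: In the example above, t ("AB") does not occur elsewhere in P, and no prefix of P matches a suffix of t. In that situation no full match can begin before index 4, so P is shifted past t, i.e. to index 5.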
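One way to see how the three cases fit together is to treat them as a single search over candidate shift distances, smallest first. The sketch below does that by brute force; `weak_good_suffix_shift` and the three sample patterns are assumptions chosen to agree with the numbers in the explanations, not part of the original article.

```python
def weak_good_suffix_shift(pat, j):
    """Weak good suffix shift for a mismatch at pattern index j (brute force).

    A shift d is acceptable when every position of the matched suffix
    t = pat[j+1:] that is still covered by the shifted pattern sees the same
    character as before. Small d with full coverage is case 1, larger d with
    partial coverage is case 2, and d = len(pat) is case 3.
    """
    m = len(pat)
    for d in range(1, m):
        if all(pat[p - d] == pat[p] for p in range(j + 1, m) if p - d >= 0):
            return d
    return m  # case 3: nothing lines up, move the pattern past t


print(weak_good_suffix_shift("CABAB", 2))  # 2 (case 1)
print(weak_good_suffix_shift("ABBAB", 1))  # 3 (case 2)
print(weak_good_suffix_shift("CBAAB", 2))  # 5 (case 3)
```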

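Strong Good Suffix Heuristic

Suppose the substring q = P[i to n] matched t in T, and c = P[i-1] is the mismatching character. Unlike case 1, we now search P for an occurrence of t that is not preceded by the character c. The closest such occurrence is then aligned with t in T by shifting the pattern P. For example:

Explanation: In the example above, q = P[7 to 8] matched t in T. The mismatching character c is "C" at position P[6]. If we search P for t, the first occurrence we find starts at position 4. That occurrence is preceded by "C", which equals c, so we skip it and keep searching. At position 1 we find another occurrence of t (yellow background). It is preceded by "A" (blue), which differs from c, so we shift the pattern P by 6 positions to align this occurrence with t in T. We do this because we already know that the character c = "C" causes a mismatch. Any occurrence of t preceded by c would mismatch again once aligned with t, so it is better to skip it.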
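The strong rule can be sketched by adding one extra test to a brute-force search over shift distances. The name `strong_good_suffix_shift` and the pattern "AACCACCAC" are assumptions for this illustration; the pattern was chosen so that t = "AC" occupies positions 7-8, c = "C" sits at position 6, and the other occurrences of t start at positions 4 and 1, matching the positions in the explanation.

```python
def strong_good_suffix_shift(pat, j):
    """Strong good suffix shift for a mismatch at pattern index j (brute force)."""
    m = len(pat)
    for d in range(1, m):
        # The part of t = pat[j+1:] still covered by the shifted pattern must agree.
        covered = all(pat[p - d] == pat[p]
                      for p in range(j + 1, m) if p - d >= 0)
        # Strong rule: the character that lands on the mismatch position
        # must differ from pat[j], otherwise it would mismatch again.
        differs = j - d < 0 or pat[j - d] != pat[j]
        if covered and differs:
            return d
    return m


# Assumed example: mismatch at index 6 ('C'), t = "AC".
print(strong_good_suffix_shift("AACCACCAC", 6))  # 6
```

Without the `differs` test the search would stop at the occurrence starting at position 4 and shift by only 3, which is the behaviour of the weak rule.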

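Preprocessing for the Good Suffix Heuristic

As part of preprocessing, an array shift is created. Each entry shift[i] holds the distance by which the pattern is shifted when a mismatch occurs at position i-1, that is, when the suffix of the pattern starting at position i has matched and the mismatch occurs at position i-1. Preprocessing is done separately for the strong good suffix rule and for case 2 discussed above.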

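1) Preprocessing for the strong good suffix rule
Before discussing the preprocessing, let us first introduce the notion of a border. A border is a substring that is both a proper suffix and a proper prefix of a string. For example, in the string "ccacc", "c" is a border and "cc" is a border, because each appears at both ends of the string, but "cca" is not a border.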
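The definition can be checked directly with a few lines of Python. The helper name `borders` is an assumption for this sketch; it simply compares every proper prefix with the suffix of the same length.

```python
def borders(s):
    """Return all non-empty proper borders of s, shortest first."""
    return [s[:k] for k in range(1, len(s)) if s[:k] == s[-k:]]


print(borders("ccacc"))  # ['c', 'cc']
```

"cca" does not appear in the result because the last three characters of "ccacc" are "acc".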

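As part of preprocessing, an array bpos (border position) is computed. Each entry bpos[i] contains the starting index of the border of the suffix that starts at index i in the given pattern P.
The suffix φ beginning at position m has no border, so bpos[m] is set to m+1, where m is the length of the pattern.
The shift distances are obtained from the borders that cannot be extended to the left. Following is the preprocessing code: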

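/* shift[] and bpos[] each have m+1 entries;
   shift[] must be zero-initialised by the caller */
void preprocess_strong_suffix(int *shift, int *bpos,
                  char *pat, int m)
{
    int i = m, j = m+1;
    bpos[i] = j;              /* the empty suffix has no border */
    while(i > 0)
    {
        /* mismatch: this border cannot be extended to the left */
        while(j <= m && pat[i-1] != pat[j-1])
        {
            if (shift[j] == 0)
                shift[j] = j-i;
            j = bpos[j];      /* try the next narrower border */
        }
        i--; j--;
        bpos[i] = j;
    }
}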

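Explanation: Consider the pattern P = "ABBABAB", m = 7.

  • The widest border of the suffix "AB" beginning at position i = 5 is φ (nothing), which starts at position 7, so bpos[5] = 7.
  • At position i = 2 the suffix is "BABAB". The widest border of this suffix is "BAB", starting at position 4, so j = bpos[2] = 4.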
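The two values above can be reproduced with a short Python sketch. `border_positions` is a port of the bpos part of the preprocessing loop (the shift updates are left out), and `widest_border_start` is a hypothetical brute-force helper added here only to cross-check the result.

```python
def border_positions(pat):
    """Compute bpos for pat using the same loop as the preprocessing code."""
    m = len(pat)
    bpos = [0] * (m + 1)
    i, j = m, m + 1
    bpos[i] = j
    while i > 0:
        while j <= m and pat[i - 1] != pat[j - 1]:
            j = bpos[j]
        i -= 1
        j -= 1
        bpos[i] = j
    return bpos


def widest_border_start(pat, i):
    """Brute force: start index of the widest proper border of pat[i:]."""
    s = pat[i:]
    for k in range(len(s) - 1, 0, -1):
        if s[:k] == s[-k:]:
            return len(pat) - k
    return len(pat)  # empty border, which "starts" at position m


bpos = border_positions("ABBABAB")
print(bpos)              # [5, 6, 4, 5, 6, 7, 7, 8]
print(bpos[5], bpos[2])  # 7 4
```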

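    We can understand bpos[i] = j using the following example:

    If the character # at position i-1 is equal to the character ? at position j-1, then the border of the suffix starting at i-1 is ? followed by the border of the suffix starting at position i (which begins at position j). This is equivalent to saying that the border of the suffix at i-1 begins at j-1, that is bpos[i-1] = j-1, or in code:

    i--; 
    j--; 
    bpos[i] = j;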
    

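    But if the character # at position i-1 does not match the character ? at position j-1, we continue the search to the right. We now know two things:

    1. The border will be narrower than the border starting at position j, i.e. narrower than x…φ.
    2. The border has to begin with # and end with φ, or it is empty (no border exists).

    Based on these two facts, we continue the search inside the substring x…φ, from position j to m. The next border should be at j = bpos[j]. After updating j, we again compare the character at position j-1 (?) with #. If they are equal we have found the border; otherwise we keep searching to the right until j > m. This process is shown by the code: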

    while(j <= m && pat[i-1] != pat[j-1])
    {
        j = bpos[j];
    }
    i--; j--;
    bpos[i]=j;
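To make the step j = bpos[j] more concrete, the sketch below follows that chain for one suffix and prints every border it visits, widest first. The function name `border_chain` is an assumption, and the `BPOS` list is the bpos table for "ABBABAB" worked out by hand from the preprocessing loop.

```python
PAT = "ABBABAB"
# bpos table for PAT, worked out by hand from the preprocessing loop.
BPOS = [5, 6, 4, 5, 6, 7, 7, 8]


def border_chain(pat, bpos, i):
    """List the borders of pat[i:], widest first, by following j = bpos[j]."""
    m = len(pat)
    chain = []
    j = bpos[i]
    while j <= m:
        chain.append(pat[j:])  # the border is the suffix starting at j
        j = bpos[j]            # move on to the next narrower border
    return chain


print(border_chain(PAT, BPOS, 2))  # ['BAB', 'B', '']
```

The last entry is the empty border φ; the chain stops once j exceeds m.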
    

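    Look at this condition in the code above:

    pat[i-1] != pat[j-1] 

    This is the condition discussed under the strong good suffix rule. When the character preceding the occurrence of t in the pattern P differs from the mismatching character in P, we stop skipping occurrences and shift the pattern. Here the suffixes starting at i and at j agree on the border, but P[i-1] != P[j-1], so the pattern is shifted from i to j and shift[j] = j-i is recorded for j. In general, whenever a mismatch occurs at position j, the pattern is shifted right by shift[j+1] positions.
    In the code above, the following condition is also very important:

    if (shift[j] == 0)

    This condition prevents shift[j] from being overwritten by a later suffix that has the same border. For example, consider the pattern P = "addbddcdd". When we compute bpos[i-1] for i = 4, we have j = 7 and end up setting shift[7] = 3. When we later compute bpos[i-1] for i = 1, j is 7 again, and without the test shift[j] == 0 we would overwrite shift[7] with 6. Because i only decreases, the first value recorded for a given j is the smallest one, and the smaller shift must be kept so that no occurrence is skipped. With the test in place, a mismatch at position 6 shifts the pattern P by 3 positions to the right, not 6.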
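The effect of the guard can be observed by running the preprocessing loop with and without it. In the sketch below, `strong_shift_table` is a Python port of the loop and the `keep_first` flag is a hypothetical switch added only for this experiment.

```python
def strong_shift_table(pat, keep_first=True):
    """Strong-suffix part of the preprocessing.

    With keep_first=True the guard `shift[j] == 0` is applied as in the
    article; with keep_first=False every later value overwrites the entry.
    """
    m = len(pat)
    shift = [0] * (m + 1)
    bpos = [0] * (m + 1)
    i, j = m, m + 1
    bpos[i] = j
    while i > 0:
        while j <= m and pat[i - 1] != pat[j - 1]:
            if not keep_first or shift[j] == 0:
                shift[j] = j - i
            j = bpos[j]
        i -= 1
        j -= 1
        bpos[i] = j
    return shift


print(strong_shift_table("addbddcdd")[7])                    # 3
print(strong_shift_table("addbddcdd", keep_first=False)[7])  # 6
```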

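    2) Preprocessing for case 2
    In the preprocessing for case 2, for each suffix we determine the widest border of the whole pattern that is contained in that suffix.
    The starting position of the widest border of the whole pattern is stored in bpos[0].
    In the preprocessing algorithm below, this value bpos[0] is first stored in all free entries of the array shift. Once the suffix of the pattern becomes shorter than the border at bpos[0], the algorithm continues with the next widest border of the pattern, i.e. bpos[j].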
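As a small worked example, the sketch below applies only the case 2 fill to the pattern "ABBABAB". The function name `fill_case2` is an assumption, and the two input lists are the bpos table and the partially filled shift table that the strong-suffix step leaves behind for this pattern, worked out by hand.

```python
def fill_case2(shift, bpos):
    """Fill the entries of shift that are still 0 using the case 2 rule."""
    m = len(shift) - 1
    j = bpos[0]              # start of the widest border of the whole pattern
    for i in range(m + 1):
        if shift[i] == 0:
            shift[i] = j
        if i == j:           # suffix is now shorter than that border
            j = bpos[j]      # continue with the next narrower border
    return shift


# Tables for "ABBABAB" after the strong-suffix step (worked out by hand).
bpos = [5, 6, 4, 5, 6, 7, 7, 8]
shift = [0, 0, 0, 0, 2, 0, 4, 1]
print(fill_case2(shift, bpos))  # [5, 5, 5, 5, 2, 5, 4, 1]
```

The entry shift[0] = 5 is the distance used after a full match: it lines up the prefix "AB" of the pattern with the "AB" at its end.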

    Following is the implementation of the search algorithm:

    C++
    /* C program for Boyer Moore Algorithm with 
       Good Suffix heuristic to find pattern in
       given text string */
      
    #include <stdio.h>
    #include <string.h>
      
    // preprocessing for strong good suffix rule
    void preprocess_strong_suffix(int *shift, int *bpos,
                                    char *pat, int m)
    {
        // m is the length of pattern 
        int i=m, j=m+1;
        bpos[i]=j;
      
        while(i>0)
        {
            /*if character at position i-1 is not equivalent to
              character at j-1, then continue searching to right
              of the pattern for border */
            while(j<=m && pat[i-1] != pat[j-1])
            {
                /* the character preceding the occurrence of t in 
                   pattern P is different than the mismatching character in P, 
                   we stop skipping the occurrences and shift the pattern
                   from i to j */
                if (shift[j]==0)
                    shift[j] = j-i;
      
                //Update the position of next border 
                j = bpos[j];
            }
            /* p[i-1] matched with p[j-1], border is found.
               store the  beginning position of border */
            i--;j--;
            bpos[i] = j; 
        }
    }
      
    //Preprocessing for case 2
    void preprocess_case2(int *shift, int *bpos,
                          char *pat, int m)
    {
        int i, j;
        j = bpos[0];
        for(i=0; i<=m; i++)
        {
            /* set the border position of the first character of the pattern
               to all indices in array shift having shift[i] = 0 */ 
            if(shift[i]==0)
                shift[i] = j;
      
            /* suffix becomes shorter than bpos[0], use the position of 
               next widest border as value of j */
            if (i==j)
                j = bpos[j];
        }
    }
      
    /*Search for a pattern in given text using
      Boyer Moore algorithm with Good suffix rule */
    void search(char *text, char *pat)
    {
        // s is shift of the pattern with respect to text
        int s=0, j;
        int m = strlen(pat);
        int n = strlen(text);
      
        int bpos[m+1], shift[m+1];
      
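        //initialize all occurrence of shift to 0
        for(int i=0;i<m+1;i++) shift[i]=0;
      
        //do preprocessing
        preprocess_strong_suffix(shift, bpos, pat, m);
        preprocess_case2(shift, bpos, pat, m);
      
        while(s <= n-m)
        {
            j = m-1;
      
            /* Keep reducing index j of pattern while characters of
               pattern and text are matching at this shift s*/
            while(j >= 0 && pat[j] == text[s+j])
                j--;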
      
            /* If the pattern is present at the current shift, then index j
                 will become -1 after the above loop */
            if (j<0)
            {
                printf("pattern occurs at shift = %d\n", s);
                s += shift[0];
            }
            else
                /*pat[j] != text[s+j] so shift the pattern
                  by shift[j+1] positions  */
                s += shift[j+1];
        }
      
    }
      
    //Driver 
    int main()
    {
        char text[] = "ABAAAABAACD";
        char pat[] = "ABA";
        search(text, pat);
        return 0;
    }


    Java
    /* Java program for Boyer Moore Algorithm with 
    Good Suffix heuristic to find pattern in
    given text string */
    class GFG 
    {
      
    // preprocessing for strong good suffix rule
    static void preprocess_strong_suffix(int []shift, int []bpos,
                                          char []pat, int m)
    {
        // m is the length of pattern 
        int i = m, j = m + 1;
        bpos[i] = j;
      
        while(i > 0)
        {
            /*if character at position i-1 is not 
            equivalent to character at j-1, then 
            continue searching to right of the
            pattern for border */
            while(j <= m && pat[i - 1] != pat[j - 1])
            {
                /* the character preceding the occurrence of t 
                in pattern P is different than the mismatching 
                character in P, we stop skipping the occurrences 
                and shift the pattern from i to j */
                if (shift[j] == 0)
                    shift[j] = j - i;
      
                //Update the position of next border 
                j = bpos[j];
            }
            /* p[i-1] matched with p[j-1], border is found.
            store the beginning position of border */
            i--; j--;
            bpos[i] = j; 
        }
    }
      
    //Preprocessing for case 2
    static void preprocess_case2(int []shift, int []bpos,
                                  char []pat, int m)
    {
        int i, j;
        j = bpos[0];
        for(i = 0; i <= m; i++)
        {
            /* set the border position of the first character 
            of the pattern to all indices in array shift
            having shift[i] = 0 */
            if(shift[i] == 0)
                shift[i] = j;
      
            /* suffix becomes shorter than bpos[0], 
            use the position of next widest border
            as value of j */
            if (i == j)
                j = bpos[j];
        }
    }
      
    /*Search for a pattern in given text using
    Boyer Moore algorithm with Good suffix rule */
    static void search(char []text, char []pat)
    {
        // s is shift of the pattern 
        // with respect to text
        int s = 0, j;
        int m = pat.length;
        int n = text.length;
      
        int []bpos = new int[m + 1];
        int []shift = new int[m + 1];
      
        //initialize all occurrence of shift to 0
        for(int i = 0; i < m + 1; i++) 
            shift[i] = 0;
      
        //do preprocessing
        preprocess_strong_suffix(shift, bpos, pat, m);
        preprocess_case2(shift, bpos, pat, m);
      
        while(s <= n - m)
        {
            j = m - 1;
      
            /* Keep reducing index j of pattern while 
            characters of pattern and text are matching 
            at this shift s*/
            while(j >= 0 && pat[j] == text[s+j])
                j--;
      
            /* If the pattern is present at the current shift, 
            then index j will become -1 after the above loop */
            if (j < 0)
            {
                System.out.printf("pattern occurs at shift = %d\n", s);
                s += shift[0];
            }
            else
              
                /*pat[j] != text[s+j] so shift the pattern
                by shift[j+1] positions */
                s += shift[j + 1];
        }
      
    }
      
    // Driver Code
    public static void main(String[] args) 
    {
        char []text = "ABAAAABAACD".toCharArray();
        char []pat = "ABA".toCharArray();
        search(text, pat);
    }
    } 
      
    // This code is contributed by 29AjayKumar


    Python3
    # Python3 program for Boyer Moore Algorithm with 
    # Good Suffix heuristic to find pattern in 
    # given text string
      
    # preprocessing for strong good suffix rule
    def preprocess_strong_suffix(shift, bpos, pat, m):
      
        # m is the length of pattern
        i = m
        j = m + 1
        bpos[i] = j
      
        while i > 0:
              
            '''if character at position i-1 is 
            not equivalent to character at j-1, 
            then continue searching to right 
            of the pattern for border '''
            while j <= m and pat[i - 1] != pat[j - 1]:
                  
                ''' the character preceding the occurrence 
                of t in pattern P is different than the 
                mismatching character in P, we stop skipping
                the occurrences and shift the pattern 
                from i to j '''
                if shift[j] == 0:
                    shift[j] = j - i
      
                # Update the position of next border
                j = bpos[j]
                  
            ''' p[i-1] matched with p[j-1], border is found. 
            store the beginning position of border '''
            i -= 1
            j -= 1
            bpos[i] = j
      
    # Preprocessing for case 2
    def preprocess_case2(shift, bpos, pat, m):
        j = bpos[0]
        for i in range(m + 1):
              
            ''' set the border position of the first character 
            of the pattern to all indices in array shift
            having shift[i] = 0 '''
            if shift[i] == 0:
                shift[i] = j
                  
            ''' suffix becomes shorter than bpos[0], 
            use the position of next widest border
            as value of j '''
            if i == j:
                j = bpos[j]
      
    '''Search for a pattern in given text using 
    Boyer Moore algorithm with Good suffix rule '''
    def search(text, pat):
      
        # s is shift of the pattern with respect to text
        s = 0
        m = len(pat)
        n = len(text)
      
        bpos = [0] * (m + 1)
      
        # initialize all occurrence of shift to 0
        shift = [0] * (m + 1)
      
        # do preprocessing
        preprocess_strong_suffix(shift, bpos, pat, m)
        preprocess_case2(shift, bpos, pat, m)
      
        while s <= n - m:
            j = m - 1
              
            ''' Keep reducing index j of pattern while characters of 
                pattern and text are matching at this shift s'''
            while j >= 0 and pat[j] == text[s + j]:
                j -= 1
                  
            ''' If the pattern is present at the current shift, 
                then index j will become -1 after the above loop '''
            if j < 0:
                print("pattern occurs at shift = %d" % s)
                s += shift[0]
            else:
                  
                '''pat[j] != text[s+j] so shift the pattern 
                by shift[j+1] positions '''
                s += shift[j + 1]
      
    # Driver Code
    if __name__ == "__main__":
        text = "ABAAAABAACD"
        pat = "ABA"
        search(text, pat)
      
    # This code is contributed by
    # sanjeev2552


    C#
    /* C# program for Boyer Moore Algorithm with 
    Good Suffix heuristic to find pattern in
    given text string */
    using System;
      
    class GFG 
    {
      
    // preprocessing for strong good suffix rule
    static void preprocess_strong_suffix(int []shift, 
                                         int []bpos,
                                         char []pat, int m)
    {
        // m is the length of pattern 
        int i = m, j = m + 1;
        bpos[i] = j;
      
        while(i > 0)
        {
            /*if character at position i-1 is not 
            equivalent to character at j-1, then 
            continue searching to right of the
            pattern for border */
            while(j <= m && pat[i - 1] != pat[j - 1])
            {
                /* the character preceding the occurrence of t 
                in pattern P is different than the mismatching 
                character in P, we stop skipping the occurrences 
                and shift the pattern from i to j */
                if (shift[j] == 0)
                    shift[j] = j - i;
      
                //Update the position of next border 
                j = bpos[j];
            }
            /* p[i-1] matched with p[j-1], border is found.
            store the beginning position of border */
            i--; j--;
            bpos[i] = j; 
        }
    }
      
    //Preprocessing for case 2
    static void preprocess_case2(int []shift, int []bpos,
                                 char []pat, int m)
    {
        int i, j;
        j = bpos[0];
        for(i = 0; i <= m; i++)
        {
            /* set the border position of the first character 
            of the pattern to all indices in array shift
            having shift[i] = 0 */
            if(shift[i] == 0)
                shift[i] = j;
      
            /* suffix becomes shorter than bpos[0], 
            use the position of next widest border
            as value of j */
            if (i == j)
                j = bpos[j];
        }
    }
      
    /*Search for a pattern in given text using
    Boyer Moore algorithm with Good suffix rule */
    static void search(char []text, char []pat)
    {
        // s is shift of the pattern 
        // with respect to text
        int s = 0, j;
        int m = pat.Length;
        int n = text.Length;
      
        int []bpos = new int[m + 1];
        int []shift = new int[m + 1];
      
        // initialize all occurrence of shift to 0
        for(int i = 0; i < m + 1; i++) 
            shift[i] = 0;
      
        // do preprocessing
        preprocess_strong_suffix(shift, bpos, pat, m);
        preprocess_case2(shift, bpos, pat, m);
      
        while(s <= n - m)
        {
            j = m - 1;
      
            /* Keep reducing index j of pattern while 
            characters of pattern and text are matching 
            at this shift s*/
            while(j >= 0 && pat[j] == text[s + j])
                j--;
      
            /* If the pattern is present at the current shift, 
            then index j will become -1 after the above loop */
            if (j < 0)
            {
                Console.Write("pattern occurs at shift = {0}\n", s);
                s += shift[0];
            }
            else
              
                /*pat[j] != text[s+j] so shift the pattern
                by shift[j+1] positions */
                s += shift[j + 1];
        }
    }
      
    // Driver Code
    public static void Main(String[] args) 
    {
        char []text = "ABAAAABAACD".ToCharArray();
        char []pat = "ABA".ToCharArray();
        search(text, pat);
    }
    } 
      
    // This code is contributed by PrinciRaj1992



    Output:

    pattern occurs at shift = 0
    pattern occurs at shift = 5
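A simple way to gain confidence in the shift table is to compare the search against a naive scan. The sketch below is a compact Python version of the algorithm that returns the match positions as a list instead of printing them; the names `good_suffix_table`, `bm_search` and `naive_search` are assumptions introduced for this check.

```python
def good_suffix_table(pat):
    """Build the shift table (strong good suffix rule plus case 2)."""
    m = len(pat)
    shift = [0] * (m + 1)
    bpos = [0] * (m + 1)
    i, j = m, m + 1
    bpos[i] = j
    while i > 0:
        while j <= m and pat[i - 1] != pat[j - 1]:
            if shift[j] == 0:
                shift[j] = j - i
            j = bpos[j]
        i -= 1
        j -= 1
        bpos[i] = j
    j = bpos[0]
    for i in range(m + 1):
        if shift[i] == 0:
            shift[i] = j
        if i == j:
            j = bpos[j]
    return shift


def bm_search(text, pat):
    """Return every shift at which pat occurs in text."""
    m, n = len(pat), len(text)
    shift = good_suffix_table(pat)
    hits, s = [], 0
    while s <= n - m:
        j = m - 1
        while j >= 0 and pat[j] == text[s + j]:
            j -= 1
        if j < 0:
            hits.append(s)
        s += shift[j + 1]  # shift[0] after a full match
    return hits


def naive_search(text, pat):
    """Reference implementation: test every alignment."""
    m = len(pat)
    return [s for s in range(len(text) - m + 1) if text[s:s + m] == pat]


print(bm_search("ABAAAABAACD", "ABA"))  # [0, 5]
```

Comparing `bm_search` with `naive_search` on many short strings over a small alphabet is a cheap regression test, because small alphabets produce the repeated borders that exercise the table.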
    

    References

    • http://www.iti.fh-flensburg.de/lang/algorithmen/pattern/bmen.htm