📜  模式搜索的KMP算法

📅  最后修改于: 2021-05-06 09:25:07             🧑  作者: Mango

给定文本txt [0..n-1]和模式pat [0..m-1] ,编写一个函数search(char pat [],char txt []) ,将所有出现的pat []都打印在txt中[] 。您可以假设n> m

例子:

输入:txt [] =“这是一个测试文本” pat [] =“ TEST”输出:在索引10处找到的模式输入:txt [] =“ AABAACAADAABAABA” pat [] =“ AABA”输出:在索引0处找到的模式在索引9处找到的模式在索引12处找到的模式模式搜索

模式搜索是计算机科学中的一个重要问题。当我们在记事本/单词文件或浏览器或数据库中搜索字符串时,将使用模式搜索算法来显示搜索结果。

我们在上一篇文章中讨论了朴素的模式搜索算法。朴素算法的最坏情况复杂度是O(m(n-m + 1))。在最坏的情况下,KMP算法的时间复杂度为O(n)。

KMP(Knuth Morris Pratt)模式搜索
朴素的图案搜索算法并不在我们看到许多匹配字符,随后是不匹配的字符的情况下很好地工作。以下是一些示例。

txt[] = "AAAAAAAAAAAAAAAAAB"
   pat[] = "AAAAB"

   txt[] = "ABABABCABABABCABABABC"
   pat[] =  "ABABAC" (not a worst case, but a bad case for Naive)

KMP匹配算法使用模式的退化属性(具有相同子模式的模式在模式中出现多次),并将最坏情况下的复杂度提高到O(n)。 KMP算法背后的基本思想是:每当检测到不匹配时(在某些匹配之后),我们就已经知道下一个窗口文本中的某些字符。我们利用此信息来避免匹配我们知道将始终匹配的字符。让我们考虑下面的例子来理解这一点。

Matching Overview
txt = "AAAAABAAABA" 
pat = "AAAA"

We compare first window of txt with pat
txt = "AAAAABAAABA" 
pat = "AAAA"  [Initial position]
We find a match. This is same as Naive String Matching.

In the next step, we compare next window of txt with pat.
txt = "AAAAABAAABA" 
pat =  "AAAA" [Pattern shifted one position]
This is where KMP does optimization over Naive. In this 
second window, we only compare fourth A of pattern
with fourth character of current window of text to decide 
whether current window matches or not. Since we know 
first three characters will anyway match, we skipped 
matching first three characters. 

Need of Preprocessing?
An important question arises from the above explanation, 
how to know how many characters to be skipped. To know this, 
we pre-process pattern and prepare an integer array 
lps[] that tells us the count of characters to be skipped. 

预处理概述:

  • KMP算法对pat []进行预处理,并构造一个大小为m(与模式大小相同)的辅助lps [] ,用于在匹配时跳过字符。
  • 名称lps表示最长的适当前缀,该前缀也是后缀。 。一个适当的前缀是前缀整字符串不允许的。例如,“ ABC”的前缀是“”,“ A”,“ AB”和“ ABC”。适当的前缀为“”,“ A”和“ AB”。字符串的后缀是“”,“ C”,“ BC”和“ ABC”。
  • 我们在子模式中搜索lps。更清楚地,我们集中于模式的子字符串,即前缀和后缀。
  • 对于i = 0到m-1的每个子模式pat [0..i],lps [i]存储最大匹配正确前缀的长度,该前缀也是子模式pat [0..i]的后缀。
    lps[i] = the longest proper prefix of pat[0..i] 
                  which is also a suffix of pat[0..i]. 

    注意: lps [i]也可以定义为最长前缀,这也是适当的后缀。我们需要在一个地方正确使用以确保不考虑整个子字符串。

    Examples of lps[] construction:
    For the pattern “AAAA”, 
    lps[] is [0, 1, 2, 3]
    
    For the pattern “ABCDE”, 
    lps[] is [0, 0, 0, 0, 0]
    
    For the pattern “AABAACAABAA”, 
    lps[] is [0, 1, 0, 1, 2, 0, 1, 2, 3, 4, 5]
    
    For the pattern “AAACAAAAAC”, 
    lps[] is [0, 1, 2, 0, 1, 2, 3, 3, 3, 4] 
    
    For the pattern “AAABAAA”, 
    lps[] is [0, 1, 2, 0, 1, 2, 3]
    

    搜索算法:
    与Naive算法不同,在Naive算法中,我们将模式滑动一个并在每次移位时比较所有字符,我们使用lps []中的值来确定要匹配的下一个字符。这个想法是不匹配我们知道会匹配的字符。

    如何使用lps []确定下一个位置(或知道要跳过的字符数)?

    • 我们先从与文本的当前窗口的字符J = 0轻拍[J]的比较。
    • 我们保持匹配字符txt [i]和pat [j],并保持i和j递增,而pat [j]和txt [i]保持匹配
    • 当我们看到不匹配时
      • 我们知道字符pat [0..j-1]与txt [ij…i-1]相匹配(请注意,j以0开头,仅在存在匹配项时递增)。
      • 从上面的定义中我们也知道lps [j-1]是pat [0…j-1]的字符计数,它们都是正确的前缀和后缀。
      • 从以上两点可以得出结论,我们不需要将这些lps [j-1]字符与txt [ij…i-1]匹配,因为我们知道这些字符仍然可以匹配。让我们考虑上面的例子来理解这一点。
    txt[] = "AAAAABAAABA" 
    pat[] = "AAAA"
    lps[] = {0, 1, 2, 3} 
    
    i = 0, j = 0
    txt[] = "AAAAABAAABA" 
    pat[] = "AAAA"
    txt[i] and pat[j] match, do i++, j++
    
    i = 1, j = 1
    txt[] = "AAAAABAAABA" 
    pat[] = "AAAA"
    txt[i] and pat[j] match, do i++, j++
    
    i = 2, j = 2
    txt[] = "AAAAABAAABA" 
    pat[] = "AAAA"
    pat[i] and pat[j] match, do i++, j++
    
    i = 3, j = 3
    txt[] = "AAAAABAAABA" 
    pat[] = "AAAA"
    txt[i] and pat[j] match, do i++, j++
    
    i = 4, j = 4
    Since j == M, print pattern found and reset j,
    j = lps[j-1] = lps[3] = 3
    
    Here unlike Naive algorithm, we do not match first three 
    characters of this window. Value of lps[j-1] (in above 
    step) gave us index of next character to match.
    i = 4, j = 3
    txt[] = "AAAAABAAABA" 
    pat[] =  "AAAA"
    txt[i] and pat[j] match, do i++, j++
    
    i = 5, j = 4
    Since j == M, print pattern found and reset j,
    j = lps[j-1] = lps[3] = 3
    
    Again unlike Naive algorithm, we do not match first three 
    characters of this window. Value of lps[j-1] (in above 
    step) gave us index of next character to match.
    i = 5, j = 3
    txt[] = "AAAAABAAABA" 
    pat[] =   "AAAA"
    txt[i] and pat[j] do NOT match and j > 0, change only j
    j = lps[j-1] = lps[2] = 2
    
    i = 5, j = 2
    txt[] = "AAAAABAAABA" 
    pat[] =    "AAAA"
    txt[i] and pat[j] do NOT match and j > 0, change only j
    j = lps[j-1] = lps[1] = 1 
    
    i = 5, j = 1
    txt[] = "AAAAABAAABA" 
    pat[] =     "AAAA"
    txt[i] and pat[j] do NOT match and j > 0, change only j
    j = lps[j-1] = lps[0] = 0
    
    i = 5, j = 0
    txt[] = "AAAAABAAABA" 
    pat[] =      "AAAA"
    txt[i] and pat[j] do NOT match and j is 0, we do i++.
    
    i = 6, j = 0
    txt[] = "AAAAABAAABA" 
    pat[] =       "AAAA"
    txt[i] and pat[j] match, do i++ and j++
    
    i = 7, j = 1
    txt[] = "AAAAABAAABA" 
    pat[] =       "AAAA"
    txt[i] and pat[j] match, do i++ and j++
    
    We continue this way...
    

    C++
    // C++ program for implementation of KMP pattern searching
    // algorithm
    #include 
      
    void computeLPSArray(char* pat, int M, int* lps);
      
    // Prints occurrences of txt[] in pat[]
    void KMPSearch(char* pat, char* txt)
    {
        int M = strlen(pat);
        int N = strlen(txt);
      
        // create lps[] that will hold the longest prefix suffix
        // values for pattern
        int lps[M];
      
        // Preprocess the pattern (calculate lps[] array)
        computeLPSArray(pat, M, lps);
      
        int i = 0; // index for txt[]
        int j = 0; // index for pat[]
        while (i < N) {
            if (pat[j] == txt[i]) {
                j++;
                i++;
            }
      
            if (j == M) {
                printf("Found pattern at index %d ", i - j);
                j = lps[j - 1];
            }
      
            // mismatch after j matches
            else if (i < N && pat[j] != txt[i]) {
                // Do not match lps[0..lps[j-1]] characters,
                // they will match anyway
                if (j != 0)
                    j = lps[j - 1];
                else
                    i = i + 1;
            }
        }
    }
      
    // Fills lps[] for given patttern pat[0..M-1]
    void computeLPSArray(char* pat, int M, int* lps)
    {
        // length of the previous longest prefix suffix
        int len = 0;
      
        lps[0] = 0; // lps[0] is always 0
      
        // the loop calculates lps[i] for i = 1 to M-1
        int i = 1;
        while (i < M) {
            if (pat[i] == pat[len]) {
                len++;
                lps[i] = len;
                i++;
            }
            else // (pat[i] != pat[len])
            {
                // This is tricky. Consider the example.
                // AAACAAAA and i = 7. The idea is similar
                // to search step.
                if (len != 0) {
                    len = lps[len - 1];
      
                    // Also, note that we do not increment
                    // i here
                }
                else // if (len == 0)
                {
                    lps[i] = 0;
                    i++;
                }
            }
        }
    }
      
    // Driver program to test above function
    int main()
    {
        char txt[] = "ABABDABACDABABCABAB";
        char pat[] = "ABABCABAB";
        KMPSearch(pat, txt);
        return 0;
    }


    Java
    // JAVA program for implementation of KMP pattern
    // searching algorithm
      
    class KMP_String_Matching {
        void KMPSearch(String pat, String txt)
        {
            int M = pat.length();
            int N = txt.length();
      
            // create lps[] that will hold the longest
            // prefix suffix values for pattern
            int lps[] = new int[M];
            int j = 0; // index for pat[]
      
            // Preprocess the pattern (calculate lps[]
            // array)
            computeLPSArray(pat, M, lps);
      
            int i = 0; // index for txt[]
            while (i < N) {
                if (pat.charAt(j) == txt.charAt(i)) {
                    j++;
                    i++;
                }
                if (j == M) {
                    System.out.println("Found pattern "
                                       + "at index " + (i - j));
                    j = lps[j - 1];
                }
      
                // mismatch after j matches
                else if (i < N && pat.charAt(j) != txt.charAt(i)) {
                    // Do not match lps[0..lps[j-1]] characters,
                    // they will match anyway
                    if (j != 0)
                        j = lps[j - 1];
                    else
                        i = i + 1;
                }
            }
        }
      
        void computeLPSArray(String pat, int M, int lps[])
        {
            // length of the previous longest prefix suffix
            int len = 0;
            int i = 1;
            lps[0] = 0; // lps[0] is always 0
      
            // the loop calculates lps[i] for i = 1 to M-1
            while (i < M) {
                if (pat.charAt(i) == pat.charAt(len)) {
                    len++;
                    lps[i] = len;
                    i++;
                }
                else // (pat[i] != pat[len])
                {
                    // This is tricky. Consider the example.
                    // AAACAAAA and i = 7. The idea is similar
                    // to search step.
                    if (len != 0) {
                        len = lps[len - 1];
      
                        // Also, note that we do not increment
                        // i here
                    }
                    else // if (len == 0)
                    {
                        lps[i] = len;
                        i++;
                    }
                }
            }
        }
      
        // Driver program to test above function
        public static void main(String args[])
        {
            String txt = "ABABDABACDABABCABAB";
            String pat = "ABABCABAB";
            new KMP_String_Matching().KMPSearch(pat, txt);
        }
    }
    // This code has been contributed by Amit Khandelwal.


    Python
    # Python program for KMP Algorithm
    def KMPSearch(pat, txt):
        M = len(pat)
        N = len(txt)
      
        # create lps[] that will hold the longest prefix suffix 
        # values for pattern
        lps = [0]*M
        j = 0 # index for pat[]
      
        # Preprocess the pattern (calculate lps[] array)
        computeLPSArray(pat, M, lps)
      
        i = 0 # index for txt[]
        while i < N:
            if pat[j] == txt[i]:
                i += 1
                j += 1
      
            if j == M:
                print ("Found pattern at index " + str(i-j))
                j = lps[j-1]
      
            # mismatch after j matches
            elif i < N and pat[j] != txt[i]:
                # Do not match lps[0..lps[j-1]] characters,
                # they will match anyway
                if j != 0:
                    j = lps[j-1]
                else:
                    i += 1
      
    def computeLPSArray(pat, M, lps):
        len = 0 # length of the previous longest prefix suffix
      
        lps[0] # lps[0] is always 0
        i = 1
      
        # the loop calculates lps[i] for i = 1 to M-1
        while i < M:
            if pat[i]== pat[len]:
                len += 1
                lps[i] = len
                i += 1
            else:
                # This is tricky. Consider the example.
                # AAACAAAA and i = 7. The idea is similar 
                # to search step.
                if len != 0:
                    len = lps[len-1]
      
                    # Also, note that we do not increment i here
                else:
                    lps[i] = 0
                    i += 1
      
    txt = "ABABDABACDABABCABAB"
    pat = "ABABCABAB"
    KMPSearch(pat, txt)
      
    # This code is contributed by Bhavya Jain


    C#
    // C# program for implementation of KMP pattern
    // searching algorithm
    using System;
      
    class GFG {
      
        void KMPSearch(string pat, string txt)
        {
            int M = pat.Length;
            int N = txt.Length;
      
            // create lps[] that will hold the longest
            // prefix suffix values for pattern
            int[] lps = new int[M];
            int j = 0; // index for pat[]
      
            // Preprocess the pattern (calculate lps[]
            // array)
            computeLPSArray(pat, M, lps);
      
            int i = 0; // index for txt[]
            while (i < N) {
                if (pat[j] == txt[i]) {
                    j++;
                    i++;
                }
                if (j == M) {
                    Console.Write("Found pattern "
                                  + "at index " + (i - j));
                    j = lps[j - 1];
                }
      
                // mismatch after j matches
                else if (i < N && pat[j] != txt[i]) {
                    // Do not match lps[0..lps[j-1]] characters,
                    // they will match anyway
                    if (j != 0)
                        j = lps[j - 1];
                    else
                        i = i + 1;
                }
            }
        }
      
        void computeLPSArray(string pat, int M, int[] lps)
        {
            // length of the previous longest prefix suffix
            int len = 0;
            int i = 1;
            lps[0] = 0; // lps[0] is always 0
      
            // the loop calculates lps[i] for i = 1 to M-1
            while (i < M) {
                if (pat[i] == pat[len]) {
                    len++;
                    lps[i] = len;
                    i++;
                }
                else // (pat[i] != pat[len])
                {
                    // This is tricky. Consider the example.
                    // AAACAAAA and i = 7. The idea is similar
                    // to search step.
                    if (len != 0) {
                        len = lps[len - 1];
      
                        // Also, note that we do not increment
                        // i here
                    }
                    else // if (len == 0)
                    {
                        lps[i] = len;
                        i++;
                    }
                }
            }
        }
      
        // Driver program to test above function
        public static void Main()
        {
            string txt = "ABABDABACDABABCABAB";
            string pat = "ABABCABAB";
            new GFG().KMPSearch(pat, txt);
        }
    }
      
    // This code has been contributed by Amit Khandelwal.


    PHP


    输出:

    Found pattern at index 10

    预处理算法:
    在预处理部分,我们以lps []的形式计算值。为此,我们跟踪上一个索引的最长前缀后缀值的长度(为此目的,我们使用len变量)。我们将lps [0]和len初始化为0。如果pat [len]和pat [i]匹配,我们将len加1,并将增加的值分配给lps [i]。如果pat [i]和pat [len]不匹配且len不为0,则将len更新为lps [len-1]。有关详细信息,请参见下面的代码中的computeLPSArray()。

    预处理的图解(或lps []的构造)

    pat[] = "AAACAAAA"
    
    len = 0, i  = 0.
    lps[0] is always 0, we move 
    to i = 1
    
    len = 0, i  = 1.
    Since pat[len] and pat[i] match, do len++, 
    store it in lps[i] and do i++.
    len = 1, lps[1] = 1, i = 2
    
    len = 1, i  = 2.
    Since pat[len] and pat[i] match, do len++, 
    store it in lps[i] and do i++.
    len = 2, lps[2] = 2, i = 3
    
    len = 2, i  = 3.
    Since pat[len] and pat[i] do not match, and len > 0, 
    set len = lps[len-1] = lps[1] = 1
    
    len = 1, i  = 3.
    Since pat[len] and pat[i] do not match and len > 0, 
    len = lps[len-1] = lps[0] = 0
    
    len = 0, i  = 3.
    Since pat[len] and pat[i] do not match and len = 0, 
    Set lps[3] = 0 and i = 4.
    We know that characters pat
    len = 0, i  = 4.
    Since pat[len] and pat[i] match, do len++, 
    store it in lps[i] and do i++.
    len = 1, lps[4] = 1, i = 5
    
    len = 1, i  = 5.
    Since pat[len] and pat[i] match, do len++, 
    store it in lps[i] and do i++.
    len = 2, lps[5] = 2, i = 6
    
    len = 2, i  = 6.
    Since pat[len] and pat[i] match, do len++, 
    store it in lps[i] and do i++.
    len = 3, lps[6] = 3, i = 7
    
    len = 3, i  = 7.
    Since pat[len] and pat[i] do not match and len > 0,
    set len = lps[len-1] = lps[2] = 2
    
    len = 2, i  = 7.
    Since pat[len] and pat[i] match, do len++, 
    store it in lps[i] and do i++.
    len = 3, lps[7] = 3, i = 8
    
    We stop here as we have constructed the whole lps[].