Suffix Array | Set 2 (nLogn Algorithm)

Last modified: 2021-04-24 03:32:13 | Author: Mango

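A suffix array is a sorted array of all suffixes of a given string. The definition is similar to that of a suffix tree, which is a compressed trie of all suffixes of the given text.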

Let the given string be "banana".

0 banana                          5 a
1 anana     Sort the Suffixes     3 ana
2 nana      ---------------->     1 anana  
3 ana        alphabetically       0 banana  
4 na                              4 na   
5 a                               2 nana

The suffix array for "banana" is {5, 3, 1, 0, 4, 2}

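We have discussed the naive algorithm for constructing a suffix array. The naive algorithm considers all suffixes, sorts them using an O(nLogn) sorting algorithm, and keeps the original indexes while sorting. Since a single comparison of two suffixes can take O(n) time, the time complexity of the naive algorithm is O(n^2 Logn), where n is the number of characters in the input string.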
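As a point of reference, the naive approach can be sketched in a few lines of Python. The function name naive_suffix_array is chosen here for illustration only; the sketch simply sorts the start indexes by the suffix text they point to.

```python
def naive_suffix_array(text):
    """Return the suffix array of text by sorting the suffix start indexes."""
    # Each sort key is a full suffix, so one comparison may cost O(n) time,
    # which gives O(n^2 Logn) overall for O(nLogn) comparisons.
    return sorted(range(len(text)), key=lambda i: text[i:])


print(naive_suffix_array("banana"))  # [5, 3, 1, 0, 4, 2]
```

The result matches the suffix array of "banana" shown above.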

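In this post, an O(nLogn) algorithm for suffix array construction is discussed. For simplicity, let us first discuss an O(n * Logn * Logn) algorithm. The idea is to use the fact that the strings to be sorted are all suffixes of a single string.
We first sort all suffixes according to their first character, then according to the first 2 characters, then the first 4 characters, and so on, while the number of characters to be considered is smaller than 2n. The important point is that if we have sorted the suffixes according to the first 2^i characters, then we can sort them according to the first 2^(i+1) characters in O(nLogn) time using an nLogn sorting algorithm such as Merge Sort. This is possible because two suffixes can be compared in O(1) time (we need to compare only two values, see the example and code below).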

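The sort function is called O(Logn) times (note that we increase the number of characters to be considered in powers of 2). So the overall time complexity becomes O(nLognLogn). See http://www.stanford.edu/class/cs97si/suffix-array.pdf for more details.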
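To make the O(Logn) count concrete, the following small sketch lists the prefix lengths that the suffixes are sorted by, one entry per call of the sort function. The helper name prefix_lengths is an assumption made for this illustration; the loop bound (stop once the length reaches 2n) follows the description above.

```python
def prefix_lengths(n):
    """Prefix lengths used for sorting, one entry per sort call."""
    lengths = [2]        # initial sort by the first 2 characters
    k = 4
    while k < 2 * n:     # keep doubling while fewer than 2n characters are considered
        lengths.append(k)
        k *= 2
    return lengths


print(prefix_lengths(6))  # [2, 4, 8]
for n in (1000, 1000000):
    print(n, len(prefix_lengths(n)))
```

For "banana" (n = 6) only three sorts are needed, and the number of sorts grows with Logn rather than with n.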

Let us build the suffix array for the example string "banana" using the above algorithm.

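Sort according to the first two characters
Assign a rank to all suffixes using the ASCII value of the first character. A simple way to assign a rank is to compute "str[i] - 'a'" for the i-th suffix of str[].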

Index     Suffix            Rank
 0        banana             1   
 1        anana              0 
 2        nana               13 
 3        ana                0
 4        na                 13
 5        a                  0

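For every character, we also store the rank of the next adjacent character, i.e., the rank of the character at str[i + 1] (this is needed to sort the suffixes according to the first 2 characters). If a character is the last character, we store the next rank as -1.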

Index    Suffix            Rank          Next Rank 
 0       banana             1              0
 1       anana              0              13    
 2       nana               13             0
 3       ana                0              13
 4       na                 13             0 
 5       a                  0             -1
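The table above can be reproduced with a short sketch. The function name initial_rank_pairs is hypothetical, and the sketch assumes lowercase input so that subtracting the code of 'a' gives a non-negative rank.

```python
def initial_rank_pairs(text):
    """Return one (rank, next_rank) pair per suffix start index."""
    n = len(text)
    pairs = []
    for i in range(n):
        rank = ord(text[i]) - ord("a")
        # The last character has no neighbour, so its next rank is -1.
        next_rank = ord(text[i + 1]) - ord("a") if i + 1 < n else -1
        pairs.append((rank, next_rank))
    return pairs


text = "banana"
for i, (rank, next_rank) in enumerate(initial_rank_pairs(text)):
    print(i, text[i:], rank, next_rank)
```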

Sort all suffixes according to rank and adjacent rank. The rank is treated as the first digit (the MSD) and the adjacent rank as the second digit.

Index    Suffix            Rank          Next Rank 
 5        a                  0              -1
 1        anana              0               13    
 3        ana                0               13
 0        banana             1               0
 2        nana               13              0
 4        na                 13              0
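This sorting step can be sketched as follows, with the rows copied from the rank table for "banana". The function name sort_by_rank_pair is an assumption for this illustration. Python compares tuples from left to right, so the pair (rank, next rank) behaves like a two-digit key with the rank as the most significant digit.

```python
def sort_by_rank_pair(rows):
    """rows: list of (index, rank, next_rank); returns indexes in sorted order."""
    ordered = sorted(rows, key=lambda row: (row[1], row[2]))
    return [index for index, _, _ in ordered]


# (index, rank, next rank) for every suffix of "banana"
rows = [(0, 1, 0), (1, 0, 13), (2, 13, 0), (3, 0, 13), (4, 13, 0), (5, 0, -1)]
print(sort_by_rank_pair(rows))  # [5, 1, 3, 0, 2, 4]
```

Suffixes with equal pairs (1 and 3, 2 and 4) keep their relative order here because Python's sort is stable; their final order is decided in the next round.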

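Sort according to the first four characters
Assign new ranks to all suffixes. To assign new ranks, we consider the sorted suffixes one by one. Assign 0 as the new rank of the first suffix. For each remaining suffix, we compare its old rank pair with the old rank pair of the suffix just before it in sorted order. If the two pairs are the same, the suffix gets the same new rank as the previous suffix. Otherwise, it gets the new rank of the previous suffix plus one.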

Index       Suffix          Rank       
  5          a               0     [Assign 0 to first]        
  1          anana           1     (0, 13) is different from previous
  3          ana             1     (0, 13) is same as previous     
  0          banana          2     (1, 0) is different from previous      
  2          nana            3     (13, 0) is different from previous      
  4          na              3     (13, 0) is same as previous
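The re-ranking rule can be written as a small sketch. The function name assign_new_ranks is hypothetical; the input is the list of old (rank, next rank) pairs in sorted order.

```python
def assign_new_ranks(sorted_pairs):
    """Return new ranks for (rank, next_rank) pairs given in sorted order."""
    if not sorted_pairs:
        return []
    new_ranks = [0]  # the first suffix in sorted order always gets rank 0
    for prev, cur in zip(sorted_pairs, sorted_pairs[1:]):
        if cur == prev:
            new_ranks.append(new_ranks[-1])      # same pair, same rank
        else:
            new_ranks.append(new_ranks[-1] + 1)  # different pair, next rank
    return new_ranks


# Old pairs of suffixes 5, 1, 3, 0, 2, 4 of "banana" in sorted order
pairs = [(0, -1), (0, 13), (0, 13), (1, 0), (13, 0), (13, 0)]
print(assign_new_ranks(pairs))  # [0, 1, 1, 2, 3, 3]
```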

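For every suffix starting at str[i], we also store the rank of the suffix starting at str[i + 2] as its next rank. If there is no suffix at i + 2, we store the next rank as -1.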

Index       Suffix          Rank        Next Rank
  5          a               0             -1
  1          anana           1              1      
  3          ana             1              0 
  0          banana          2              3
  2          nana            3              3 
  4          na              3              -1
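A possible sketch of this lookup is shown below. The names next_ranks and rank_of are assumptions for illustration; rank_of plays the same role as the index mapping used to find a suffix by its original start position.

```python
def next_ranks(order, ranks, offset):
    """order[j]: start index of the j-th suffix in sorted order.
    ranks[j]: new rank of that suffix.
    Returns the next rank of each suffix, in the same order."""
    n = len(order)
    rank_of = [0] * n  # rank_of[start] = new rank of the suffix starting at start
    for position, start in enumerate(order):
        rank_of[start] = ranks[position]
    return [rank_of[start + offset] if start + offset < n else -1
            for start in order]


order = [5, 1, 3, 0, 2, 4]   # suffixes of "banana" sorted by first 2 characters
ranks = [0, 1, 1, 2, 3, 3]   # their new ranks
print(next_ranks(order, ranks, 2))  # [-1, 1, 0, 3, 3, -1]
```

The offset is 2 here because the ranks already describe the first 2 characters, so the rank of the suffix 2 positions later describes characters 3 and 4.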

Sort all suffixes according to rank and next rank.

Index       Suffix          Rank        Next Rank
  5          a               0             -1
  3          ana             1              0 
  1          anana           1              1      
  0          banana          2              3
  4          na              3             -1
  2          nana            3              3
C++
// C++ program for building suffix array of a given text
#include <iostream>
#include <cstring>
#include <algorithm>
using namespace std;
 
// Structure to store information of a suffix
struct suffix
{
    int index; // To store original index
    int rank[2]; // To store ranks and next rank pair
};
 
// A comparison function used by sort() to compare two suffixes
// Compares two pairs, returns 1 if first pair is smaller
int cmp(struct suffix a, struct suffix b)
{
    return (a.rank[0] == b.rank[0])? (a.rank[1] < b.rank[1] ?1: 0):
               (a.rank[0] < b.rank[0] ?1: 0);
}
 
// This is the main function that takes a string 'txt' of size n as an
// argument, builds and return the suffix array for the given string
int *buildSuffixArray(char *txt, int n)
{
    // A structure to store suffixes and their indexes
    struct suffix suffixes[n];
 
    // Store suffixes and their indexes in an array of structures.
    // The structure is needed to sort the suffixes alphabetically
    // and maintain their old indexes while sorting
    for (int i = 0; i < n; i++)
    {
        suffixes[i].index = i;
        suffixes[i].rank[0] = txt[i] - 'a';
        suffixes[i].rank[1] = ((i+1) < n)? (txt[i + 1] - 'a'): -1;
    }
 
    // Sort the suffixes using the comparison function
    // defined above.
    sort(suffixes, suffixes+n, cmp);
 
    // At this point, all suffixes are sorted according to first
    // 2 characters.  Let us sort suffixes according to first 4
    // characters, then first 8 and so on
    int ind[n];  // This array is needed to get the index in suffixes[]
                 // from original index.  This mapping is needed to get
                 // next suffix.
    for (int k = 4; k < 2*n; k = k*2)
    {
        // Assigning rank and index values to first suffix
        int rank = 0;
        int prev_rank = suffixes[0].rank[0];
        suffixes[0].rank[0] = rank;
        ind[suffixes[0].index] = 0;
 
        // Assigning rank to suffixes
        for (int i = 1; i < n; i++)
        {
            // If first rank and next ranks are same as that of previous
            // suffix in array, assign the same new rank to this suffix
            if (suffixes[i].rank[0] == prev_rank &&
                    suffixes[i].rank[1] == suffixes[i-1].rank[1])
            {
                prev_rank = suffixes[i].rank[0];
                suffixes[i].rank[0] = rank;
            }
            else // Otherwise increment rank and assign
            {
                prev_rank = suffixes[i].rank[0];
                suffixes[i].rank[0] = ++rank;
            }
            ind[suffixes[i].index] = i;
        }
 
        // Assign next rank to every suffix
        for (int i = 0; i < n; i++)
        {
            int nextindex = suffixes[i].index + k/2;
            suffixes[i].rank[1] = (nextindex < n)?
                                  suffixes[ind[nextindex]].rank[0]: -1;
        }
 
        // Sort the suffixes according to first k characters
        sort(suffixes, suffixes+n, cmp);
    }
 
    // Store indexes of all sorted suffixes in the suffix array
    int *suffixArr = new int[n];
    for (int i = 0; i < n; i++)
        suffixArr[i] = suffixes[i].index;
 
    // Return the suffix array
    return  suffixArr;
}
 
// A utility function to print an array of given size
void printArr(int arr[], int n)
{
    for (int i = 0; i < n; i++)
        cout << arr[i] << " ";
    cout << endl;
}
 
// Driver program to test above functions
int main()
{
    char txt[] = "banana";
    int n = strlen(txt);
    int *suffixArr = buildSuffixArray(txt,  n);
    cout << "Following is suffix array for " << txt << endl;
    printArr(suffixArr, n);
    delete[] suffixArr; // Release the array allocated by buildSuffixArray
    return 0;
}


Java
// Java program for building suffix array of a given text
import java.util.*;
class GFG
{
    // Class to store information of a suffix
    public static class Suffix implements Comparable<Suffix>
    {
        int index;
        int rank;
        int next;
 
        public Suffix(int ind, int r, int nr)
        {
            index = ind;
            rank = r;
            next = nr;
        }
         
        // A comparison function used by sort()
        // to compare two suffixes.
        // Compares two pairs, returns a negative value
        // if this pair is smaller
        public int compareTo(Suffix s)
        {
            if (rank != s.rank) return Integer.compare(rank, s.rank);
            return Integer.compare(next, s.next);
        }
    }
     
    // This is the main function that takes a string 'txt'
    // of size n as an argument, builds and return the
    // suffix array for the given string
    public static int[] suffixArray(String s)
    {
        int n = s.length();
        Suffix[] su = new Suffix[n];
         
        // Store suffixes and their indexes in
        // an array of classes. The class is needed
        // to sort the suffixes alphabetically and
        // maintain their old indexes while sorting
        for (int i = 0; i < n; i++)
        {
            su[i] = new Suffix(i, s.charAt(i) - '$', 0);
        }
        for (int i = 0; i < n; i++)
            su[i].next = (i + 1 < n ? su[i + 1].rank : -1);
 
        // Sort the suffixes using the comparison function
        // defined above.
        Arrays.sort(su);
 
        // At this point, all suffixes are sorted
        // according to first 2 characters.
        // Let us sort suffixes according to first 4
        // characters, then first 8 and so on
        // This array is needed to get the index in su[]
        // from original index. This mapping is needed to get
        // next suffix.
        int[] ind = new int[n];
        for (int length = 4; length < 2 * n; length <<= 1)
        {
             
            // Assigning rank and index values to first suffix
            int rank = 0, prev = su[0].rank;
            su[0].rank = rank;
            ind[su[0].index] = 0;
            for (int i = 1; i < n; i++)
            {
                // If first rank and next ranks are same as
                // that of previous suffix in array,
                // assign the same new rank to this suffix
                if (su[i].rank == prev &&
                    su[i].next == su[i - 1].next)
                {
                    prev = su[i].rank;
                    su[i].rank = rank;
                }
                else
                {
                    // Otherwise increment rank and assign
                    prev = su[i].rank;
                    su[i].rank = ++rank;
                }
                ind[su[i].index] = i;
            }
             
            // Assign next rank to every suffix
            for (int i = 0; i < n; i++)
            {
                int nextP = su[i].index + length / 2;
                su[i].next = nextP < n ?
                   su[ind[nextP]].rank : -1;
            }
             
            // Sort the suffixes according
            // to first 'length' characters
            Arrays.sort(su);
        }
 
        // Store indexes of all sorted
        // suffixes in the suffix array
        int[] suf = new int[n];
         
        for (int i = 0; i < n; i++)
            suf[i] = su[i].index;
 
        // Return the suffix array
        return suf;
    }   
     
    static void printArr(int arr[], int n)
    {
        for (int i = 0; i < n; i++)
            System.out.print(arr[i] + " ");
        System.out.println();
    }
     
    // Driver Code
    public static void main(String[] args)
    {
        String txt = "banana";
        int n = txt.length();
        int[] suff_arr = suffixArray(txt);
        System.out.println("Following is suffix array for banana:");
        printArr(suff_arr, n);
    }
}
 
// This code is contributed by AmanKumarSingh


Python3
# Python3 program for building suffix
# array of a given text
 
# Class to store information of a suffix
class suffix:
     
    def __init__(self):
         
        self.index = 0
        self.rank = [0, 0]
 
# This is the main function that takes a
# string 'txt' of size n as an argument,
# builds and return the suffix array for
# the given string
def buildSuffixArray(txt, n):
     
    # A structure to store suffixes
    # and their indexes
    suffixes = [suffix() for _ in range(n)]
 
    # Store suffixes and their indexes in
    # an array of structures. The structure
    # is needed to sort the suffixes alphabetically
    # and maintain their old indexes while sorting
    for i in range(n):
        suffixes[i].index = i
        suffixes[i].rank[0] = (ord(txt[i]) -
                               ord("a"))
        suffixes[i].rank[1] = (ord(txt[i + 1]) -
                        ord("a")) if ((i + 1) < n) else -1
 
    # Sort the suffixes according to the rank
    # and next rank
    suffixes = sorted(
        suffixes, key = lambda x: (
            x.rank[0], x.rank[1]))
 
    # At this point, all suffixes are sorted
    # according to first 2 characters.  Let
    # us sort suffixes according to first 4
    # characters, then first 8 and so on
    ind = [0] * n  # This array is needed to get the
                   # index in suffixes[] from original
                   # index. This mapping is needed to get
                   # next suffix.
    k = 4
    while (k < 2 * n):
         
        # Assigning rank and index
        # values to first suffix
        rank = 0
        prev_rank = suffixes[0].rank[0]
        suffixes[0].rank[0] = rank
        ind[suffixes[0].index] = 0
 
        # Assigning rank to suffixes
        for i in range(1, n):
             
            # If first rank and next ranks are
            # same as that of previous suffix in
            # array, assign the same new rank to
            # this suffix
            if (suffixes[i].rank[0] == prev_rank and
                suffixes[i].rank[1] == suffixes[i - 1].rank[1]):
                prev_rank = suffixes[i].rank[0]
                suffixes[i].rank[0] = rank
                 
            # Otherwise increment rank and assign   
            else: 
                prev_rank = suffixes[i].rank[0]
                rank += 1
                suffixes[i].rank[0] = rank
            ind[suffixes[i].index] = i
 
        # Assign next rank to every suffix
        for i in range(n):
            nextindex = suffixes[i].index + k // 2
            suffixes[i].rank[1] = suffixes[ind[nextindex]].rank[0] \
                if (nextindex < n) else -1
 
        # Sort the suffixes according to
        # first k characters
        suffixes = sorted(
            suffixes, key = lambda x: (
                x.rank[0], x.rank[1]))
 
        k *= 2
 
    # Store indexes of all sorted
    # suffixes in the suffix array
    suffixArr = [0] * n
     
    for i in range(n):
        suffixArr[i] = suffixes[i].index
 
    # Return the suffix array
    return suffixArr
 
# A utility function to print an array
# of given size
def printArr(arr, n):
     
    for i in range(n):
        print(arr[i], end = " ")
         
    print()
 
# Driver code
if __name__ == "__main__":
     
    txt = "banana"
    n = len(txt)
     
    suffixArr = buildSuffixArray(txt, n)
     
    print("Following is suffix array for", txt)
     
    printArr(suffixArr, n)
 
# This code is contributed by debrc


Output:

Following is suffix array for banana
5 3 1 0 4 2

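Note that the above algorithm uses the standard sort function, so its time complexity is O(nLognLogn). We can use Radix Sort here to reduce the time complexity to O(nLogn).
Note also that suffix arrays can be constructed in O(n) time. We will soon discuss O(n) algorithms.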
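The radix sort remark can be illustrated with a sketch in which each round sorts the rank pairs with two stable counting sort passes instead of a comparison sort. This is one possible way to reach O(nLogn), not necessarily the one the referenced notes use. The function names counting_sort and suffix_array_radix are assumptions, ranks start at 1 so that 0 can stand for "no next suffix" (the -1 used earlier), and the loop stops as soon as all ranks are distinct.

```python
def counting_sort(order, keys, max_key):
    """Stable sort of the indexes in order by keys[i], with keys in 0..max_key."""
    count = [0] * (max_key + 2)
    for i in order:
        count[keys[i] + 1] += 1
    for k in range(1, len(count)):
        count[k] += count[k - 1]      # count[k] = first output slot for key k
    result = [0] * len(order)
    for i in order:
        result[count[keys[i]]] = i
        count[keys[i]] += 1
    return result


def suffix_array_radix(text):
    n = len(text)
    if n == 0:
        return []
    # Compress characters to ranks 1..sigma; 0 is reserved for "past the end".
    alphabet = {c: r + 1 for r, c in enumerate(sorted(set(text)))}
    rank = [alphabet[c] for c in text]
    h = 1  # ranks currently describe the first h characters of each suffix
    while True:
        second = [rank[i + h] if i + h < n else 0 for i in range(n)]
        max_key = max(rank)
        # LSD radix sort: least significant key first, then the rank itself.
        order = counting_sort(list(range(n)), second, max_key)
        order = counting_sort(order, rank, max_key)
        # Re-rank: equal (rank, second) pairs share a rank.
        new_rank = [0] * n
        new_rank[order[0]] = 1
        for prev, cur in zip(order, order[1:]):
            same = rank[prev] == rank[cur] and second[prev] == second[cur]
            new_rank[cur] = new_rank[prev] + (0 if same else 1)
        rank = new_rank
        if rank[order[-1]] == n:      # all ranks distinct, order is final
            return order
        h *= 2


print(suffix_array_radix("banana"))  # [5, 3, 1, 0, 4, 2]
```

Each round costs O(n) because the ranks never exceed n, and there are O(Logn) rounds, which gives O(nLogn) in total.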

References:
http://www.stanford.edu/class/cs97si/suffix-array.pdf
http://www.cbcb.umd.edu/confcour/Fall2012/lec14b.pdf