Longest Common Extension / LCE | Set 3 (Segment Tree Method)

Last modified: 2022-05-13 01:57:08.020000             Author: Mango


Prerequisites: LCE (Set 1), LCE (Set 2), Suffix Array (n Log Log n), Kasai's algorithm and Segment Tree

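The Longest Common Extension (LCE) problem considers a string s and computes, for each pair (L, R), the longest substring of s that starts at both L and R. In other words, each LCE query asks for the length of the longest common prefix of the suffixes starting at indexes L and R.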

Example:
String: "abbababba"
Queries: LCE(1, 2), LCE(1, 6) and LCE(0, 5)

For each query we have to find the length of the longest common prefix starting at the given pair of indexes: (1, 2), (1, 6) and (0, 5).

In the figure, the substrings highlighted in green are the longest common prefixes starting at indexes L and R of the respective queries.

[Figure: Longest Common Extension]
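As a reference point for the example, the definition can be checked by comparing characters directly. The sketch below assumes the example string is "abbababba" (which is consistent with the outputs listed later in this article); `naiveLCE` is a hypothetical helper name and is not part of the article's program.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: length of the longest common prefix of the
// suffixes of s that start at indexes L and R, by direct comparison.
int naiveLCE(const std::string &s, int L, int R)
{
    int n = static_cast<int>(s.size());
    assert(L >= 0 && L < n && R >= 0 && R < n);

    int len = 0;
    // Extend while both positions are inside s and the characters match
    while (L + len < n && R + len < n && s[L + len] == s[R + len])
        len++;
    return len;
}
```

Under that assumption, `naiveLCE("abbababba", 1, 2)`, `naiveLCE("abbababba", 1, 6)` and `naiveLCE("abbababba", 0, 5)` evaluate to 1, 3 and 4.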

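In this set, we discuss the segment tree approach to the LCE problem.

In Set 2, we saw that the LCE problem can be converted into an RMQ (range minimum query) problem.

To answer the RMQs efficiently, we build a segment tree on the lcp array and use it to answer the LCE queries.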

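To find low and high (the smaller and the larger of invSuff[L] and invSuff[R]), we must first compute the suffix array and then derive the inverse suffix array from it.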
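The inversion step is a single pass over the suffix array. The following is a minimal sketch; `inverseSuffixArray` is a hypothetical name, and the suffix array used in the usage note was worked out by hand for the assumed example string "abbababba".

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: invSuff[i] is the position of suffix i in the
// suffix array, i.e. the inverse permutation of suffixArr.
std::vector<int> inverseSuffixArray(const std::vector<int> &suffixArr)
{
    int n = static_cast<int>(suffixArr.size());
    std::vector<int> invSuff(n);
    for (int i = 0; i < n; i++)
    {
        // Every entry must be a valid starting index of a suffix
        assert(suffixArr[i] >= 0 && suffixArr[i] < n);
        invSuff[suffixArr[i]] = i;
    }
    return invSuff;
}
```

For "abbababba" the suffix array is {8, 3, 5, 0, 7, 2, 4, 6, 1}, and the sketch returns {3, 8, 5, 1, 6, 2, 7, 4, 0}. For the query (1, 2) this gives low = 5 and high = 8.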

We also need the lcp array, so we use Kasai's algorithm to build it from the suffix array.
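To make the contents of the lcp array concrete, the sketch below computes it straight from its definition. It is a quadratic-time illustration, not Kasai's linear-time algorithm, and it assumes the convention that lcp[i] compares the suffixes at positions i and i + 1 of the suffix array, with the last entry set to 0. The name `lcpByDefinition` is hypothetical.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: lcp[i] = length of the longest common prefix of
// the suffixes starting at suffixArr[i] and suffixArr[i + 1].
// The last entry has no successor and stays 0.
std::vector<int> lcpByDefinition(const std::string &txt,
                                 const std::vector<int> &suffixArr)
{
    int n = static_cast<int>(txt.size());
    assert(static_cast<int>(suffixArr.size()) == n);

    std::vector<int> lcp(n, 0);
    for (int i = 0; i + 1 < n; i++)
    {
        int a = suffixArr[i], b = suffixArr[i + 1], k = 0;
        // Compare the two neighbouring suffixes character by character
        while (a + k < n && b + k < n && txt[a + k] == txt[b + k])
            k++;
        lcp[i] = k;
    }
    return lcp;
}
```

For "abbababba" with suffix array {8, 3, 5, 0, 7, 2, 4, 6, 1}, this yields {1, 2, 4, 0, 2, 3, 1, 3, 0}.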

Once this is done, each query reduces to finding the minimum value in the lcp array from index low to index high - 1, as described in Set 2.

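We use the following result directly, without reproducing its proof:

LCE(L, R) = RMQ_lcp(invSuff[R], invSuff[L] - 1)

The subscript lcp means that the RMQ is performed on the lcp array, so we build a segment tree on the lcp array. As written, the formula assumes invSuff[R] < invSuff[L]; in general the query range is [low, high - 1].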
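The formula can be tried out on small inputs before any segment tree is involved. In the sketch below a linear scan stands in for the RMQ, and the arguments are ordered so that either of L and R may have the smaller rank. The name `lceByRangeMin` is hypothetical; the arrays in the usage note are the ones for the assumed example string "abbababba".

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper: LCE(L, R) as the minimum of lcp[low .. high-1],
// where low and high are the smaller and larger of invSuff[L] and
// invSuff[R]. std::min_element replaces the segment tree query here.
int lceByRangeMin(const std::vector<int> &lcp,
                  const std::vector<int> &invSuff, int L, int R)
{
    int n = static_cast<int>(lcp.size());
    assert(static_cast<int>(invSuff.size()) == n);
    assert(L >= 0 && L < n && R >= 0 && R < n);

    // A suffix shares its whole length with itself
    if (L == R)
        return n - L;

    int low = std::min(invSuff[L], invSuff[R]);
    int high = std::max(invSuff[L], invSuff[R]);

    // Half-open range [low, high) covers indexes low .. high-1
    return *std::min_element(lcp.begin() + low, lcp.begin() + high);
}
```

With lcp = {1, 2, 4, 0, 2, 3, 1, 3, 0} and invSuff = {3, 8, 5, 1, 6, 2, 7, 4, 0}, the queries (1, 2), (1, 6) and (0, 5) give 1, 3 and 4.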

// A C++ Program to find the length of longest common
// extension using Segment Tree
#include <bits/stdc++.h>
using namespace std;
  
// Structure to represent a query of form (L,R)
struct Query
{
    int L, R;
};
  
// Structure to store information of a suffix
struct suffix
{
    int index;  // To store original index
    int rank[2]; // To store ranks and next rank pair
};
  
// A utility function to get minimum of two numbers
int minVal(int x, int y)
{
    return (x < y)? x: y;
}
  
// A utility function to get the middle index from
// corner indexes.
int getMid(int s, int e)
{
    return s + (e -s)/2;
}
  
/*  A recursive function to get the minimum value
    in a given range of array indexes. The following
    are parameters for this function.
  
    st    --> Pointer to segment tree
    index --> Index of current node in the segment
              tree. Initially 0 is passed as root
              is always at index 0
    ss & se  --> Starting and ending indexes of the
                  segment represented by current
                  node, i.e., st[index]
    qs & qe  --> Starting and ending indexes of query
                 range */
int RMQUtil(int *st, int ss, int se, int qs, int qe,
                                           int index)
{
    // If segment of this node is a part of given range,
    // then return the min of the segment
    if (qs <= ss && qe >= se)
        return st[index];
  
    // If segment of this node is outside the given range
    if (se < qs || ss > qe)
        return INT_MAX;
  
    // If a part of this segment overlaps with the given
    // range
    int mid = getMid(ss, se);
    return minVal(RMQUtil(st, ss, mid, qs, qe, 2*index+1),
                  RMQUtil(st, mid+1, se, qs, qe, 2*index+2));
}
  
// Return minimum of elements in range from index qs
// (query start) to qe (query end).  It mainly uses RMQUtil()
int RMQ(int *st, int n, int qs, int qe)
{
    // Check for erroneous input values
    if (qs < 0 || qe > n-1 || qs > qe)
    {
        printf("Invalid Input");
        return -1;
    }
  
    return RMQUtil(st, 0, n-1, qs, qe, 0);
}
  
// A recursive function that constructs Segment Tree
// for array[ss..se]. si is index of current node in
// segment tree st
int constructSTUtil(int arr[], int ss, int se, int *st,
                                               int si)
{
    // If there is one element in array, store it in
    // current node of segment tree and return
    if (ss == se)
    {
        st[si] = arr[ss];
        return arr[ss];
    }
  
    // If there are more than one elements, then recur
    // for left and right subtrees and store the minimum
    // of two values in this node
    int mid = getMid(ss, se);
    st[si] =  minVal(constructSTUtil(arr, ss, mid, st, si*2+1),
                  constructSTUtil(arr, mid+1, se, st, si*2+2));
    return st[si];
}
  
/* Function to construct segment tree from given array.
   This function allocates memory for segment tree and
   calls constructSTUtil() to fill the allocated memory */
int *constructST(int arr[], int n)
{
    // Allocate memory for segment tree
  
    //Height of segment tree
    int x = (int)(ceil(log2(n)));
  
    // Maximum size of segment tree
    int max_size = 2*(int)pow(2, x) - 1;
  
    int *st = new int[max_size];
  
    // Fill the allocated memory st
    constructSTUtil(arr, 0, n-1, st, 0);
  
    // Return the constructed segment tree
    return st;
}
  
// A comparison function used by sort() to compare
// two suffixes Compares two pairs, returns 1 if
// first pair is smaller
int cmp(struct suffix a, struct suffix b)
{
    return (a.rank[0] == b.rank[0])?
           (a.rank[1] < b.rank[1] ?1: 0):
           (a.rank[0] < b.rank[0] ?1: 0);
}
  
// This is the main function that takes a string
// 'txt' of size n as an argument, builds and return
// the suffix array for the given string
vector<int> buildSuffixArray(string txt, int n)
{
    // A structure to store suffixes and their indexes
    struct suffix suffixes[n];
  
    // Store suffixes and their indexes in an array
    // of structures. The structure is needed to sort
    // the suffixes alphabetically and maintain their
    // old indexes while sorting
    for (int i = 0; i < n; i++)
    {
        suffixes[i].index = i;
        suffixes[i].rank[0] = txt[i] - 'a';
        suffixes[i].rank[1] = ((i+1) < n)?
                            (txt[i + 1] - 'a'): -1;
    }
  
    // Sort the suffixes using the comparison function
    // defined above.
    sort(suffixes, suffixes+n, cmp);
  
    // At this point, all suffixes are sorted according to first
    // 2 characters.  Let us sort suffixes according to first 4
    // characters, then first 8 and so on
    int ind[n];  // This array is needed to get the index
                 // in suffixes[]
    // from original index.  This mapping is needed to get
    // next suffix.
    for (int k = 4; k < 2*n; k = k*2)
    {
        // Assigning rank and index values to first suffix
        int rank = 0;
        int prev_rank = suffixes[0].rank[0];
        suffixes[0].rank[0] = rank;
        ind[suffixes[0].index] = 0;
  
        // Assigning rank to suffixes
        for (int i = 1; i < n; i++)
        {
            // If first rank and next ranks are same as
            // that of previous suffix in array, assign
            // the same new rank to this suffix
            if (suffixes[i].rank[0] == prev_rank &&
              suffixes[i].rank[1] == suffixes[i-1].rank[1])
            {
                prev_rank = suffixes[i].rank[0];
                suffixes[i].rank[0] = rank;
            }
            else // Otherwise increment rank and assign
            {
                prev_rank = suffixes[i].rank[0];
                suffixes[i].rank[0] = ++rank;
            }
            ind[suffixes[i].index] = i;
        }
  
        // Assign next rank to every suffix
        for (int i = 0; i < n; i++)
        {
            int nextindex = suffixes[i].index + k/2;
            suffixes[i].rank[1] = (nextindex < n)?
                  suffixes[ind[nextindex]].rank[0]: -1;
        }
  
        // Sort the suffixes according to first k characters
        sort(suffixes, suffixes+n, cmp);
    }
  
    // Store indexes of all sorted suffixes in the suffix array
    vector<int> suffixArr;
    for (int i = 0; i < n; i++)
        suffixArr.push_back(suffixes[i].index);
  
    // Return the suffix array
    return  suffixArr;
}
  
/* To construct and return LCP */
vector<int> kasai(string txt, vector<int> suffixArr,
                              vector<int> &invSuff)
{
    int n = suffixArr.size();
  
    // To store LCP array
    vector<int> lcp(n, 0);
  
    // Fill values in invSuff[]
    for (int i=0; i < n; i++)
        invSuff[suffixArr[i]] = i;
  
    // Initialize length of previous LCP
    int k = 0;
  
    // Process all suffixes one by one starting from
    // first suffix in txt[]
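    for (int i=0; i<n; i++)
    {
        // If the current suffix is last in the suffix array,
        // it has no next suffix to compare with, so its lcp
        // value stays 0
        if (invSuff[i] == n-1)
        {
            k = 0;
            continue;
        }
  
        // j is the start of the next suffix in the suffix array
        int j = suffixArr[invSuff[i]+1];
  
        // Start matching from the k'th character, since at
        // least k-1 characters are already known to match
        while (i+k<n && j+k<n && txt[i+k]==txt[j+k])
            k++;
  
        lcp[invSuff[i]] = k; // lcp for the present suffix
  
        // The next suffix drops the first character
        if (k>0)
            k--;
    }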
  
    // return the constructed lcp array
    return lcp;
}
  
// A utility function to find longest common extension
// from index - L and index - R
int LCE(int *st, const vector<int> &lcp,
        const vector<int> &invSuff, int n, int L, int R)
{
    // Handle the corner case
    if (L == R)
        return (n-L);
  
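    // Use the formula  -
    // LCE (L, R) = RMQ lcp (low, high-1), where low and high
    // are the smaller and larger of invSuff[L] and invSuff[R]
    int low = invSuff[L], high = invSuff[R];
    if (low > high)
        swap(low, high);
    return (RMQ(st, n, low, high-1));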
}
  
// A function to answer queries of longest common extension
void LCEQueries(string str, int n, Query q[],
                int m)
{
    // Build a suffix array
    vector<int> suffixArr = buildSuffixArray(str, str.length());
  
    // An auxiliary array to store inverse of suffix array
    // elements. For example if suffixArr[0] is 5, the
    // invSuff[5] would store 0.  This is used to get next
    // suffix string from suffix array.
    vector<int> invSuff(n, 0);
  
    // Build a lcp vector
    vector<int> lcp = kasai(str, suffixArr, invSuff);
  
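    int lcpArr[n];
    // Convert to lcp array
    for (int i=0; i<n; i++)
        lcpArr[i] = lcp[i];
  
    // Build segment tree from lcp array
    int *st = constructST(lcpArr, n);
  
    // Answer each query and print the result
    for (int i=0; i<m; i++)
    {
        int L = q[i].L;
        int R = q[i].R;
  
        printf("LCE (%d, %d) = %d\n", L, R,
               LCE(st, lcp, invSuff, n, L, R));
    }
  
    // Release the segment tree
    delete[] st;
}
  
// Driver program (reconstructed to reproduce the output
// shown below; the string and queries are those of the example)
int main()
{
    string str = "abbababba";
    int n = str.length();
  
    Query q[] = {{1, 2}, {1, 6}, {0, 5}};
    int m = sizeof(q)/sizeof(q[0]);
  
    LCEQueries(str, n, q, m);
  
    return 0;
}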

Output:

LCE (1, 2) = 1
LCE (1, 6) = 3
LCE (0, 5) = 4

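Time complexity: Building the lcp and suffix arrays takes O(N log N) time, and each query takes O(log N), so the overall time complexity is O(N log N + Q log N). Strictly, the suffix array construction in the code above calls sort() in each of O(log N) doubling rounds, which costs O(N log² N); the O(N log N) bound needs a radix sort in each round. The lcp and suffix arrays can also be constructed in O(N) time with other algorithms.
where,
Q = number of LCE queries.
N = length of the input string.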

Auxiliary space:
We use O(N) auxiliary space to store the lcp, suffix and inverse suffix arrays and the segment tree.

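Performance comparison: We have now seen three algorithms for computing the LCE length.

Set 1: Naive method [O(N·Q)]
Set 2: RMQ direct minimum method [O(N log N + Q · |invSuff[R] - invSuff[L]|)]
Set 3: Segment tree method [O(N log N + Q log N)]

invSuff[] = inverse suffix array of the input string.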

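Judging by asymptotic time complexity, the segment tree method appears to be the most efficient and the other two appear very inefficient.

In practice, however, this is not the case. If we plot running time against log(|invSuff[R] - invSuff[L]|) for typical files of random strings over various runs, the result is as shown below.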

[Figure: LCE running time plotted against log(|invSuff[R] - invSuff[L]|)]
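The graph above is taken from the reference listed at the end. The tests were run on 25 files of random strings ranging from 0.7 MB to 2 GB. The exact string sizes are not known, but a 2 GB file clearly holds a great many characters, since 1 character = 1 byte. About 1,000 characters make up 1 KB, so a page of 2,000 characters (a reasonable average for a double-spaced page) takes about 2 KB, and roughly 500 pages of text make up 1 MB. Hence 2 GB = 2,000 MB = 2,000 × 500 = 1,000,000 pages of text.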

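The graph makes it clear that the naive method (discussed in Set 1) performs best, better than the segment tree method.

This is surprising, because the asymptotic time complexity of the segment tree method is much lower than that of the naive method.

In fact, on typical files of random strings the naive method is generally 5-6 times faster than the segment tree method. The naive method is also an in-place algorithm, which makes it the most desirable algorithm for computing LCE.

The bottom line is that, for average-case performance, the naive method is the best choice for answering LCE queries.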
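A likely explanation, not stated in the source, is that two suffixes of a random string usually differ within the first few characters, so the naive scan stops almost immediately. The sketch below measures this on a synthetic input; the function name `averageNaiveLCE`, the 26-letter alphabet, the seed and the sizes are all illustrative assumptions, and the experiment does not reproduce the benchmark from the reference.

```cpp
#include <cassert>
#include <random>
#include <string>

// Hypothetical experiment: average number of matching characters the
// naive scan finds over `queries` random (L, R) pairs on a random
// lowercase string of length n.
double averageNaiveLCE(int n, int queries, unsigned seed)
{
    assert(n > 0 && queries > 0);
    std::mt19937 rng(seed);

    // Random text over the 26 lowercase letters
    std::string s(n, 'a');
    for (char &c : s)
        c = static_cast<char>('a' + rng() % 26);

    long long total = 0;
    for (int q = 0; q < queries; q++)
    {
        int L = static_cast<int>(rng() % n);
        int R = static_cast<int>(rng() % n);
        if (L == R)
            continue; // LCE(L, L) is trivially n - L, so skip it

        int k = 0;
        while (L + k < n && R + k < n && s[L + k] == s[R + k])
            k++;
        total += k;
    }
    return static_cast<double>(total) / queries;
}
```

A call such as `averageNaiveLCE(100000, 10000, 42u)` should return a value well below 1 on inputs of this kind, meaning a typical naive query does about one character comparison. Highly repetitive inputs would behave very differently.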

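It is rare in computer science for an algorithm that looks faster on paper to be beaten in practical tests by a less efficient one.

The lesson is that, although asymptotic analysis is one of the most effective ways to compare two algorithms on paper, in practical use things can sometimes turn out the other way around.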


References:

http://www.sciencedirect.com/science/article/pii/S1570866710000377