📜  Ukkonen's Suffix Tree Construction – Part 6

📅  Last modified: 2021-05-04 13:50:14             🧑  Author: Mango

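This article is a continuation of the following five articles:
Ukkonen's Suffix Tree Construction – Part 1
Ukkonen's Suffix Tree Construction – Part 2
Ukkonen's Suffix Tree Construction – Part 3
Ukkonen's Suffix Tree Construction – Part 4
Ukkonen's Suffix Tree Construction – Part 5
Please go through Part 1, Part 2, Part 3, Part 4 and Part 5 before reading this article. Those parts covered some basics of suffix trees, the high-level Ukkonen algorithm, suffix links, the three implementation tricks and the activePoint, and walked through every phase of building the suffix tree for the example string "abcabxabcd".
Here we will see the data structure used to represent the suffix tree and the code implementation.
At the end of Part 5, we discussed some of the operations that will be performed while building the suffix tree and later while using it in different applications.
Several data structures could meet these requirements, and each is slower on some operations and faster on others. Our implementation uses the following:
A SuffixTreeNode structure represents each node in the tree. It has the following members: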

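  • children: an array of alphabet size. It stores all the child nodes of the current node, one per outgoing edge, where each edge starts with a different character.
  • suffixLink: points to the node that the current node should point to via its suffix link.
  • start, end: these two store the label of the edge from the parent node to the current node. The (start, end) interval specifies the edge by which the node is connected to its parent. Each edge connects two nodes, one parent and one child, and the (start, end) interval of an edge is stored in the child node. For example, if two nodes A (parent) and B (child) are connected by an edge with indices (5, 8), then the indices (5, 8) are stored in node B.
  • suffixIndex: non-negative for leaves, where it gives the index of the suffix spelled by the path from the root to that leaf. For non-leaf nodes it is -1.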

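This data structure answers the required queries quickly, as follows:

  • How to check if a node is the root? The root is a special node with no parent, so its start and end are -1; for all other nodes, the start and end indices are non-negative.
  • How to check if a node is an internal node or a leaf? suffixIndex helps here: it is -1 for internal nodes and non-negative for leaves.
  • What is the length of the path label on some edge? Each edge has start and end indices, and the length of its label is end - start + 1.
  • What is the path label on some edge? If the string is S, the path label is the substring of S from the start index to the end index, inclusive: [start, end].
  • How to check if there is an outgoing edge for a given character c from a node A? If A->children[c] is not NULL, there is a path; if it is NULL, there is none.
  • What is the character at a given distance d along the edge stored in a node A? The character is S[A->start + d], where S is the string and d = 0 is the first character of the edge that enters A from its parent.
  • Where does an internal node point via its suffix link? Node A points to A->suffixLink.
  • What is the suffix index on a path from the root to a leaf? If the leaf on the path is A, the suffix index of that path is A->suffixIndex.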

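Following is the C implementation of Ukkonen's suffix tree construction. The code may look a bit lengthy, probably because of the large number of comments.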

C
// A C program to implement Ukkonen's Suffix Tree Construction
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#define MAX_CHAR 256
 
struct SuffixTreeNode {
    struct SuffixTreeNode *children[MAX_CHAR];
 
    //pointer to other node via suffix link
    struct SuffixTreeNode *suffixLink;
 
    /*(start, end) interval specifies the edge, by which the
    node is connected to its parent node. Each edge will
    connect two nodes, one parent and one child, and
    (start, end) interval of a given edge will be stored
    in the child node. Let's say there are two nodes A and B
    connected by an edge with indices (5, 8) then these
    indices (5, 8) will be stored in node B. */
    int start;
    int *end;
 
    /*for leaf nodes, it stores the index of suffix for
    the path from root to leaf*/
    int suffixIndex;
};
 
typedef struct SuffixTreeNode Node;
 
char text[100]; //Input string
Node *root = NULL; //Pointer to root node
 
/*lastNewNode will point to newly created internal node,
waiting for its suffix link to be set, which might get
a new suffix link (other than root) in next extension of
same phase. lastNewNode will be set to NULL when last
newly created internal node (if there is any) got its
suffix link reset to new internal node created in next
extension of same phase. */
Node *lastNewNode = NULL;
Node *activeNode = NULL;
int count=0;
 
/*activeEdge is represented as input string character
index (not the character itself)*/
int activeEdge = -1;
int activeLength = 0;
 
// remainingSuffixCount tells how many suffixes yet to
// be added in tree
int remainingSuffixCount = 0;
int leafEnd = -1;
int *rootEnd = NULL;
int *splitEnd = NULL;
int size = -1; //Length of input string
 
Node *newNode(int start, int *end)
{
    count++;
    Node *node =(Node*) malloc(sizeof(Node));
    int i;
    for (i = 0; i < MAX_CHAR; i++)
        node->children[i] = NULL;
 
    /*For root node, suffixLink will be set to NULL
    For internal nodes, suffixLink will be set to root
    by default in current extension and may change in
    next extension*/
    node->suffixLink = root;
    node->start = start;
    node->end = end;
 
    /*suffixIndex will be set to -1 by default and
    actual suffix index will be set later for leaves
    at the end of all phases*/
    node->suffixIndex = -1;
    return node;
}
 
int edgeLength(Node *n) {
    return *(n->end) - (n->start) + 1;
}
 
int walkDown(Node *currNode)
{
    /*activePoint change for walk down (APCFWD) using
    Skip/Count Trick (Trick 1). If activeLength is greater
    than or equal to current edge length, set next internal
    node as activeNode and adjust activeEdge and activeLength
    accordingly to represent same activePoint*/
    if (activeLength >= edgeLength(currNode))
    {
        //activeEdge is an index into text, so skipping the
        //whole edge simply advances it by the edge length
        activeEdge += edgeLength(currNode);
        activeLength -= edgeLength(currNode);
        activeNode = currNode;
        return 1;
    }
    return 0;
}
 
void extendSuffixTree(int pos)
{
    /*Extension Rule 1, this takes care of extending all
    leaves created so far in tree*/
    leafEnd = pos;
 
    /*Increment remainingSuffixCount indicating that a
    new suffix added to the list of suffixes yet to be
    added in tree*/
    remainingSuffixCount++;
 
    /*set lastNewNode to NULL while starting a new phase,
    indicating there is no internal node waiting for
    its suffix link reset in current phase*/
    lastNewNode = NULL;
 
    //Add all suffixes (yet to be added) one by one in tree
    while(remainingSuffixCount > 0) {
 
        if (activeLength == 0)
            activeEdge = pos; //APCFALZ
 
        //Child slot selected by the first character of the
        //active edge; the cast keeps the array index non-negative
        int edgeChar = (unsigned char)text[activeEdge];
 
        // There is no outgoing edge starting with
        // activeEdge from activeNode
        if (activeNode->children[edgeChar] == NULL)
        {
            //Extension Rule 2 (A new leaf edge gets created)
            activeNode->children[edgeChar] =
                                  newNode(pos, &leafEnd);
 
            /*A new leaf edge is created in above line starting
            from an existing node (the current activeNode), and
            if there is any internal node waiting for its suffix
            link get reset, point the suffix link from that last
            internal node to current activeNode. Then set lastNewNode
            to NULL indicating no more node waiting for suffix link
            reset.*/
            if (lastNewNode != NULL)
            {
                lastNewNode->suffixLink = activeNode;
                lastNewNode = NULL;
            }
        }
        // There is an outgoing edge starting with activeEdge
        // from activeNode
        else
        {
            // Get the next node at the end of edge starting
            // with activeEdge
            Node *next = activeNode->children[edgeChar];
            if (walkDown(next))//Do walkdown
            {
                //Start from next node (the new activeNode)
                continue;
            }
            /*Extension Rule 3 (current character being processed
            is already on the edge)*/
            if (text[next->start + activeLength] == text[pos])
            {
                //If a newly created node waiting for its
                //suffix link to be set, then set suffix link
                //of that waiting node to current active node
                if(lastNewNode != NULL && activeNode != root)
                {
                    lastNewNode->suffixLink = activeNode;
                    lastNewNode = NULL;
                }
 
                //APCFER3
                activeLength++;
                /*STOP all further processing in this phase
                and move on to next phase*/
                break;
            }
 
            /*We will be here when activePoint is in middle of
            the edge being traversed and current character
            being processed is not on the edge (we fall off
            the tree). In this case, we add a new internal node
            and a new leaf edge going out of that new node. This
            is Extension Rule 2, where a new leaf edge and a new
            internal node get created*/
            splitEnd = (int*) malloc(sizeof(int));
            *splitEnd = next->start + activeLength - 1;
 
            //New internal node
            Node *split = newNode(next->start, splitEnd);
            activeNode->children[edgeChar] = split;
 
            //New leaf coming out of new internal node
            split->children[(unsigned char)text[pos]] =
                                      newNode(pos, &leafEnd);
            next->start += activeLength;
            //The old child hangs below split, keyed by the first
            //character of its shortened edge label
            split->children[(unsigned char)text[next->start]] = next;
 
            /*We got a new internal node here. If there is any
            internal node created in last extensions of same
            phase which is still waiting for its suffix link
            reset, do it now.*/
            if (lastNewNode != NULL)
            {
            /*suffixLink of lastNewNode points to current newly
            created internal node*/
                lastNewNode->suffixLink = split;
            }
 
            /*Make the current newly created internal node waiting
            for its suffix link reset (which is pointing to root
            at present). If we come across any other internal node
            (existing or newly created) in next extension of same
            phase, when a new leaf edge gets added (i.e. when
            Extension Rule 2 applies in any of the next extension
            of same phase) at that point, suffixLink of this node
            will point to that internal node.*/
            lastNewNode = split;
        }
 
        /* One suffix got added in tree, decrement the count of
        suffixes yet to be added.*/
        remainingSuffixCount--;
        if (activeNode == root && activeLength > 0) //APCFER2C1
        {
            activeLength--;
            activeEdge = pos - remainingSuffixCount + 1;
        }
 
        //APCFER2C2
        else if (activeNode != root)
        {
            activeNode = activeNode->suffixLink;
        }
    }
}
 
void print(int i, int j)
{
    int k;
    for (k=i; k<=j; k++)
        printf("%c", text[k]);
}
 
//Print the suffix tree while also setting the suffix indices,
//so the tree is printed in DFS manner.
//Each edge is printed along with its suffix index
void setSuffixIndexByDFS(Node *n, int labelHeight)
{
    if (n == NULL) return;
 
    if (n->start != -1) //A non-root node
    {
        //Print the label on edge from parent to current node
        print(n->start, *(n->end));
    }
    int leaf = 1;
    int i;
    for (i = 0; i < MAX_CHAR; i++)
    {
        if (n->children[i] != NULL)
        {
            if (leaf == 1 && n->start != -1)
                printf(" [%d]\n", n->suffixIndex);
 
            //Current node is not a leaf as it has outgoing
            //edges from it.
            leaf = 0;
            setSuffixIndexByDFS(n->children[i],
                  labelHeight + edgeLength(n->children[i]));
        }
    }
    if (leaf == 1)
    {
        n->suffixIndex = size - labelHeight;
        printf(" [%d]\n", n->suffixIndex);
    }
}
 
void freeSuffixTreeByPostOrder(Node *n)
{
    if (n == NULL)
        return;
    int i;
    for (i = 0; i < MAX_CHAR; i++)
    {
        if (n->children[i] != NULL)
        {
            freeSuffixTreeByPostOrder(n->children[i]);
        }
    }
    if (n->suffixIndex == -1)
        free(n->end);
    free(n);
}
 
/*Build the suffix tree and print the edge labels along with
suffixIndex. suffixIndex for leaf edges will be >= 0 and
for non-leaf edges will be -1*/
void buildSuffixTree()
{
    size = strlen(text);
    int i;
    rootEnd = (int*) malloc(sizeof(int));
    *rootEnd = - 1;
 
    /*Root is a special node with start and end indices as -1,
    as it has no parent from where an edge comes to root*/
    root = newNode(-1, rootEnd);
 
    activeNode = root; //First activeNode will be root
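    for (i=0; i<size; i++)
        extendSuffixTree(i);
    int labelHeight = 0;
    setSuffixIndexByDFS(root, labelHeight);
 
    //Free the dynamically allocated memory
    freeSuffixTreeByPostOrder(root);
}
 
// driver program to test above functions
int main(int argc, char *argv[])
{
    //Input "abbc" corresponds to the sample output shown below
    strcpy(text, "abbc");
    buildSuffixTree();
    printf("Number of nodes in suffix tree are %d\n", count);
    return 0;
}


Output (each edge of the tree is printed in DFS order, along with the suffix index of the child node on that edge. To understand the output format better, compare it with the last figure, no. 43, in the Part 5 article):

abbc [0]
b [-1]
bc [1]
c [2]
c [3]
Number of nodes in suffix tree are 6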

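Now that we can build a suffix tree in linear time, we can solve many string problems efficiently:

  • Check whether a given pattern P is a substring of text T (useful when the text is fixed and the pattern changes; use KMP otherwise)
  • Find all occurrences of a given pattern P in text T
  • Find the longest repeated substring
  • Linear-time suffix array creation

The basic problems above can be solved by a DFS traversal of the suffix tree.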
We will soon post articles on the above problems and on others such as:

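  • Building a generalized suffix tree
  • Linear-time longest common substring problem
  • Linear-time longest palindromic substring

And more.
Test your understanding?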

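  1. Draw the suffix tree (with proper suffix links and suffix indices) for the string "AABAACAADAABAAABAA$" on paper and see if it matches the code output.
  2. Every extension must follow one of three rules: Rule 1, Rule 2 or Rule 3.
    Following are the rules applied on five consecutive extensions in some Phase i (i > 5). Which sequences are valid?
    A) Rule 1, Rule 2, Rule 2, Rule 3, Rule 3
    B) Rule 1, Rule 2, Rule 2, Rule 3, Rule 2
    C) Rule 2, Rule 1, Rule 1, Rule 3, Rule 3
    D) Rule 1, Rule 1, Rule 1, Rule 1, Rule 1
    E) Rule 2, Rule 2, Rule 2, Rule 2, Rule 2
    F) Rule 3, Rule 3, Rule 3, Rule 3, Rule 3
  3. Which of the sequences above are valid for Phase 5?
  4. Every internal node must have its suffix link set to another node (an internal node or the root). Can a newly created node point to an already existing internal node? Can it happen that a new node created in extension j does not get its correct suffix link in the next extension j+1, but gets it in a later extension such as j+2 or j+3?
  5. Try solving the basic problems discussed above.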

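We have published the following articles on suffix tree applications:

  • Suffix Tree Application 1 – Substring Check
  • Suffix Tree Application 2 – Searching All Patterns
  • Suffix Tree Application 3 – Longest Repeated Substring
  • Suffix Tree Application 4 – Build Linear Time Suffix Array
  • Generalized Suffix Tree 1
  • Suffix Tree Application 5 – Longest Common Substring
  • Suffix Tree Application 6 – Longest Palindromic Substring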