📜  Longest Repeated and Non-Overlapping Substring (1)

📅  Last modified: 2023-12-03 15:26:29.078000             🧑  Author: Mango

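In string-processing algorithms, a common task is to find the longest repeated substring and the longest non-overlapping substring. The longest repeated substring is the longest substring that occurs at least twice in a string; its occurrences may overlap. The longest non-overlapping substring is the longest substring that occurs at least twice such that the two occurrences share no character position. For example, in "aaaa" the longest repeated substring is "aaa" (starting at indices 0 and 1), while the longest non-overlapping substring is "aa" (starting at indices 0 and 2).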
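
To make the distinction between the two definitions concrete, here is a minimal sketch that lists every occurrence of a candidate substring and checks whether two of them are disjoint. The helper names `occurrences` and `has_non_overlapping_pair` are hypothetical and chosen only for this illustration.

```python
def occurrences(s, sub):
    """Return every start index of sub in s, overlapping matches included."""
    starts = []
    pos = s.find(sub)
    while pos != -1:
        starts.append(pos)
        pos = s.find(sub, pos + 1)  # advance by one so overlaps are found too
    return starts


def has_non_overlapping_pair(s, sub):
    """True if two occurrences of sub share no character position."""
    starts = occurrences(s, sub)
    # The first and last occurrences are the farthest apart,
    # so checking that pair is sufficient.
    return len(starts) >= 2 and starts[-1] - starts[0] >= len(sub)


print(occurrences("aaaa", "aaa"))               # [0, 1]
print(has_non_overlapping_pair("aaaa", "aaa"))  # False
print(has_non_overlapping_pair("aaaa", "aa"))   # True
```

So "aaa" qualifies as a repeated substring of "aaaa" but not as a non-overlapping one, while "aa" qualifies as both.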

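Longest repeated substring
Brute-force algorithm

The most naive approach tries every pair of starting positions and counts how many characters the two substrings beginning there have in common. The time complexity is $O(n^3)$, where $n$ is the length of the string: the first loop enumerates the start of the first occurrence, the second loop enumerates the start of the second occurrence, and the third loop compares the two substrings character by character.

A Python implementation follows: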

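def longest_repeated_substring(s):
    n = len(s)
    max_len = 0
    max_sub = ""
    for i in range(n):               # start of the first occurrence
        for j in range(i + 1, n):    # start of the second occurrence
            k = 0
            # Extend the match one character at a time without running
            # past the end of the string; the occurrences may overlap.
            while j + k < n and s[i + k] == s[j + k]:
                k += 1
            if k > max_len:
                max_len = k
                max_sub = s[i:i + k]

    return max_sub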
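KMP algorithm

KMP is a string-search algorithm that runs in $O(n)$ time. Its core is the prefix table (also called the failure function): for each position $i$ it records the longest proper prefix of $s[0..i]$ that is also a suffix of $s[0..i]$. On a mismatch the matcher falls back to that shorter match instead of starting over, so a single scan of $s$ is enough.

Each nonempty entry of the table is therefore a second occurrence of a prefix of $s$. A single table only covers repeats that start at index 0, so to find the longest repeated substring the table is built for every suffix of $s$ and the largest entry is kept, which takes $O(n^2)$ time in total.

Alternatively, once the suffix array of $s$ has been built, the longest repeated substring is the longest common prefix of some pair of adjacent suffixes in that array.

def build_prefix_table(s):
    # t[i] + 1 is the length of the longest proper prefix of s[:i+1]
    # that is also a suffix of s[:i+1]; t[i] == -1 means there is none.
    n = len(s)
    t = [-1] * n
    j = -1
    for i in range(1, n):
        # On a mismatch, fall back to the next shorter border.
        while j >= 0 and s[i] != s[j + 1]:
            j = t[j]
        if s[i] == s[j + 1]:
            j += 1
        t[i] = j

    return t

def longest_repeated_substring(s):
    n = len(s)
    max_len = 0
    max_sub = ""
    for start in range(n):
        # A border of length L in s[start:] means that s[start:start+L]
        # occurs again later in s (the two copies may overlap).
        for j in build_prefix_table(s[start:]):
            if j + 1 > max_len:
                max_len = j + 1
                max_sub = s[start:start + max_len]

    return max_sub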
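
The suffix-array remark in this section can be sketched as follows. This is a minimal illustration, not an optimized implementation: it builds the suffix array by sorting slices, which is simple but can cost $O(n^2 \log n)$ in the worst case, and the function name `longest_repeated_substring_sa` is a hypothetical one introduced here.

```python
def longest_repeated_substring_sa(s):
    """Longest substring of s that occurs at least twice (overlaps allowed)."""
    # Suffix array: start positions ordered by the suffix that begins there.
    suffix_array = sorted(range(len(s)), key=lambda i: s[i:])

    best_start, best_len = 0, 0
    for a, b in zip(suffix_array, suffix_array[1:]):
        # Longest common prefix of two neighbouring suffixes.
        k = 0
        while a + k < len(s) and b + k < len(s) and s[a + k] == s[b + k]:
            k += 1
        if k > best_len:
            best_start, best_len = a, k
    return s[best_start:best_start + best_len]


print(longest_repeated_substring_sa("banana"))  # ana
```

Only neighbouring suffixes need to be compared because sorting places the suffixes that share the longest prefix next to each other.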
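Longest non-overlapping substring
Brute-force algorithm

Here the two occurrences must not overlap, so a substring of length $L$ only counts if it occurs again at least $L$ positions after its first occurrence. Enumerate the candidate substrings from longest to shortest and return the first one that has such a second occurrence. The time complexity is about $O(n^3)$.

A Python implementation follows: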

def longest_non_overlap_substring(s):
    n = len(s)
    # Two disjoint copies need 2 * length <= n, so start from n // 2.
    for length in range(n // 2, 0, -1):
        for i in range(n - 2 * length + 1):
            sub = s[i:i + length]
            # Search for a second occurrence that starts at or after
            # the end of the first one, so the two cannot overlap.
            if s.find(sub, i + length) != -1:
                return sub

    return ""
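Suffix tree algorithm

The suffix tree is an important tool in string processing. It answers whether a pattern occurs as a substring of the indexed string in $O(m)$ time, where $m$ is the length of the pattern. Besides the longest repeated substring, it can also be used to find the longest non-overlapping substring. With a linear-time construction such as Ukkonen's algorithm the whole procedure takes $O(n)$ time; the implementation below inserts the suffixes one at a time for simplicity, so building the tree takes $O(n^2)$ time in the worst case.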

class SuffixTreeNode:
    def __init__(self, start=-1, end=-1, index=-1):
        self.children = {}       # first character of an edge label -> child
        self.start_idx = start   # edge label is text[start_idx:end_idx + 1]
        self.end_idx = end
        self.index = index       # suffix start for leaves, -1 for internal nodes

def build_suffix_tree(text):
    # Naive construction: insert the suffixes one by one.
    # text must end with a sentinel that occurs nowhere else, so that
    # every suffix ends at its own leaf.
    n = len(text)
    root = SuffixTreeNode()
    for i in range(n):
        node = root
        j = i
        while j < n:
            child = node.children.get(text[j])
            if child is None:
                # No edge starts with text[j]: add the rest as a leaf.
                node.children[text[j]] = SuffixTreeNode(j, n - 1, i)
                break
            k = child.start_idx
            while k <= child.end_idx and text[k] == text[j]:
                k += 1
                j += 1
            if k > child.end_idx:
                node = child     # the whole edge matched; keep walking
                continue
            # Mismatch inside the edge: split the edge at position k.
            mid = SuffixTreeNode(child.start_idx, k - 1)
            node.children[text[child.start_idx]] = mid
            child.start_idx = k
            mid.children[text[k]] = child
            mid.children[text[j]] = SuffixTreeNode(j, n - 1, i)
            break

    return root

def find_longest_non_overlap_substring(node, depth=0):
    # depth is the length of the string spelled from the root to node.
    # Returns (min_leaf, max_leaf, best_len, best_start) for the subtree.
    if not node.children:
        return node.index, node.index, 0, 0
    lo, hi, best_len, best_start = float("inf"), -1, 0, 0
    for c in node.children.values():
        c_depth = depth + c.end_idx - c.start_idx + 1
        c_lo, c_hi, c_len, c_start = find_longest_non_overlap_substring(c, c_depth)
        lo, hi = min(lo, c_lo), max(hi, c_hi)
        if c_len > best_len:
            best_len, best_start = c_len, c_start
    # The string spelled by this internal node occurs at lo and at hi;
    # only its first hi - lo characters are guaranteed not to overlap.
    length = min(depth, hi - lo)
    if length > best_len:
        best_len, best_start = length, lo

    return lo, hi, best_len, best_start

def longest_non_overlap_substring(s):
    # Assumes the sentinel "\0" does not occur in s. The search is recursive,
    # so very long repetitive inputs can hit Python's recursion limit.
    root = build_suffix_tree(s + "\0")
    _, _, best_len, best_start = find_longest_non_overlap_substring(root)
    return s[best_start:best_start + best_len]
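
The $O(m)$ substring query mentioned at the start of this section can be illustrated with a much simpler structure. The sketch below uses an uncompressed suffix trie built from nested dictionaries as a stand-in for a real suffix tree: the query walks one edge per pattern character in the same way, but the trie can need $O(n^2)$ nodes. The names `build_suffix_trie` and `contains` are hypothetical.

```python
def build_suffix_trie(s):
    """Nested-dict trie holding every suffix of s."""
    root = {}
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            # Reuse the existing branch or create a new one for ch.
            node = node.setdefault(ch, {})
    return root


def contains(trie, pattern):
    """Follow one edge per pattern character: m dictionary lookups."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


trie = build_suffix_trie("banana")
print(contains(trie, "nan"))  # True
print(contains(trie, "nab"))  # False
```

Every substring of a string is a prefix of one of its suffixes, which is why a successful walk from the root is enough to confirm that the pattern occurs.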