📜  Natural Language Toolkit - Transforming Trees

📅  Last modified: 2020-10-14 09:28:05             🧑  Author: Mango


There are two reasons to transform trees:

  • Modifying a deep parse tree
  • Flattening a deep parse tree

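Converting a Tree or Subtree to a Sentence

The first recipe converts a Tree or subtree back into a sentence or chunk string. It is very simple, as the following example shows: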

from nltk.corpus import treebank_chunk
tree = treebank_chunk.chunked_sents()[2]
' '.join([w for w, t in tree.leaves()])

Output

'Rudolph Agnew , 55 years old and former chairman of Consolidated Gold Fields
PLC , was named a nonexecutive director of this British industrial
conglomerate .'
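
The same join works on any subtree, not only on a whole sentence. The following is a minimal sketch that does not need the corpus: the small chunk tree and the helper name tree_to_sentence() are made up for illustration and are not part of the original example.

```python
from nltk.tree import Tree

# Hand-built chunk tree (illustrative data, not taken from the corpus).
# As in treebank_chunk, the leaves are (word, tag) tuples.
chunked = Tree('S', [
    Tree('NP', [('Rudolph', 'NNP'), ('Agnew', 'NNP')]),
    ('was', 'VBD'),
    ('named', 'VBN'),
    Tree('NP', [('a', 'DT'), ('director', 'NN')]),
    ('.', '.'),
])

def tree_to_sentence(tree):
    """Join the words of all (word, tag) leaves with single spaces."""
    return ' '.join(word for word, tag in tree.leaves())

sentence = tree_to_sentence(chunked)        # the whole sentence
first_chunk = tree_to_sentence(chunked[0])  # only the first NP chunk

print(sentence)
print(first_chunk)
```

For a parse tree whose leaves are plain strings rather than (word, tag) tuples, ' '.join(tree.leaves()) would be enough.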

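Flattening a Deep Tree

Deep trees of nested phrases cannot be used to train a chunker, so we must flatten them before use. In the following example we use the third parsed sentence from the treebank corpus, which is a deep tree of nested phrases.

To do this, we define a function named deeptree_flat() that takes a single Tree and returns a new Tree that keeps only the lowest-level trees. Most of the work is done by a helper function that we name childtree_flat().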

from nltk.tree import Tree

def childtree_flat(trees):
   children = []
   for t in trees:
      if t.height() < 3:
         # A part-of-speech node: keep its (word, tag) pairs directly
         children.extend(t.pos())
      elif t.height() == 3:
         # A lowest-level phrase: keep it as a chunk of (word, tag) pairs
         children.append(Tree(t.label(), t.pos()))
      else:
         # A deeper subtree: recurse into its children
         children.extend(childtree_flat([c for c in t]))
   return children

def deeptree_flat(tree):
   return Tree(tree.label(), childtree_flat([c for c in tree]))
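
The height thresholds used above can be checked on a tiny hand-built tree. This is only a sketch with made-up data: a part-of-speech node over a single word has height 2, a phrase made only of such nodes has height 3, and anything taller contains nested phrases.

```python
from nltk.tree import Tree

# Illustrative trees, not taken from the corpus.
pos_node = Tree('NNP', ['Agnew'])                      # tag over a word
phrase = Tree('NP', [Tree('NNP', ['Rudolph']),
                     Tree('NNP', ['Agnew'])])          # phrase of tag nodes
nested = Tree('NP-SBJ', [phrase, Tree(',', [','])])    # phrase inside a phrase

heights = (pos_node.height(), phrase.height(), nested.height())
print(heights)

# pos() collapses any subtree into a flat list of (word, tag) pairs.
phrase_pos = phrase.pos()
print(phrase_pos)
```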

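Now let us call deeptree_flat() on the third parsed sentence of the treebank corpus, which is a deep tree of nested phrases. We saved the functions above in a file named deeptree.py.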

from deeptree import deeptree_flat
from nltk.corpus import treebank
deeptree_flat(treebank.parsed_sents()[2])

Output

Tree('S', [Tree('NP', [('Rudolph', 'NNP'), ('Agnew', 'NNP')]),
(',', ','), Tree('NP', [('55', 'CD'), 
('years', 'NNS')]), ('old', 'JJ'), ('and', 'CC'),
Tree('NP', [('former', 'JJ'), 
('chairman', 'NN')]), ('of', 'IN'), Tree('NP', [('Consolidated', 'NNP'), 
('Gold', 'NNP'), ('Fields', 'NNP'), ('PLC', 
'NNP')]), (',', ','), ('was', 'VBD'), 
('named', 'VBN'), Tree('NP-SBJ', [('*-1', '-NONE-')]), 
Tree('NP', [('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN')]),
('of', 'IN'), Tree('NP', 
[('this', 'DT'), ('British', 'JJ'), 
('industrial', 'JJ'), ('conglomerate', 'NN')]), ('.', '.')])
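
To try the flattening without downloading the corpus, the sketch below repeats the two functions and applies them to a small tree built with Tree.fromstring(). The sample sentence is invented for illustration.

```python
from nltk.tree import Tree

def childtree_flat(trees):
    children = []
    for t in trees:
        if t.height() < 3:
            children.extend(t.pos())                    # bare tag node
        elif t.height() == 3:
            children.append(Tree(t.label(), t.pos()))   # lowest-level phrase
        else:
            children.extend(childtree_flat(list(t)))    # recurse deeper
    return children

def deeptree_flat(tree):
    return Tree(tree.label(), childtree_flat(list(tree)))

# Invented example with one level of phrase nesting under NP-SBJ.
deep = Tree.fromstring(
    "(S (NP-SBJ (NP (DT the) (NN cat)) (PP (IN on) (NP (DT the) (NN mat))))"
    " (VP (VBD slept)))"
)
flat = deeptree_flat(deep)
print(flat)
print(deep.height(), flat.height())
```

Only the innermost NP and VP phrases survive as chunks; the NP-SBJ and PP wrappers disappear, and the preposition becomes a bare (word, tag) pair at the top level.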

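Building a Shallow Tree

In the previous section we flattened a deep tree of nested phrases by keeping only the lowest-level subtrees. In this section we keep only the highest-level subtrees, that is, we build a shallow tree. In the following example we again use the third parsed sentence from the treebank corpus, which is a deep tree of nested phrases.

To do this, we define a function named tree_shallow() that keeps only the top-level subtree labels and eliminates all nested subtrees.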

from nltk.tree import Tree
def tree_shallow(tree):
   children = []
   for t in tree:
      if t.height() < 3:
         children.extend(t.pos())
      else:
         children.append(Tree(t.label(), t.pos()))
   return Tree(tree.label(), children)

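Now let us call tree_shallow() on the third parsed sentence of the treebank corpus, which is a deep tree of nested phrases. We saved the function in a file named shallowtree.py.

from shallowtree import tree_shallow
from nltk.corpus import treebank
tree_shallow(treebank.parsed_sents()[2])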

Output

Tree('S', [Tree('NP-SBJ-1', [('Rudolph', 'NNP'), ('Agnew', 'NNP'), (',', ','), 
('55', 'CD'), ('years', 'NNS'), ('old', 'JJ'), ('and', 'CC'), 
('former', 'JJ'), ('chairman', 'NN'), ('of', 'IN'), ('Consolidated', 'NNP'), 
('Gold', 'NNP'), ('Fields', 'NNP'), ('PLC', 'NNP'), (',', ',')]), 
Tree('VP', [('was', 'VBD'), ('named', 'VBN'), ('*-1', '-NONE-'), ('a', 'DT'), 
('nonexecutive', 'JJ'), ('director', 'NN'), ('of', 'IN'), ('this', 'DT'), 
('British', 'JJ'), ('industrial', 'JJ'), ('conglomerate', 'NN')]), ('.', '.')])

We can see the difference by comparing the heights of the two trees:

from nltk.corpus import treebank
tree_shallow(treebank.parsed_sents()[2]).height()

Output

3

from nltk.corpus import treebank
treebank.parsed_sents()[2].height()

Output

9
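
The same comparison can be reproduced on a small invented tree, without the corpus. The sketch below repeats tree_shallow() so that it runs on its own; the sample sentence is an assumption made for illustration.

```python
from nltk.tree import Tree

def tree_shallow(tree):
    children = []
    for t in tree:
        if t.height() < 3:
            children.extend(t.pos())                    # bare tag node
        else:
            children.append(Tree(t.label(), t.pos()))   # keep top-level label
    return Tree(tree.label(), children)

# Invented example with nested phrases under NP-SBJ.
deep = Tree.fromstring(
    "(S (NP-SBJ (NP (DT the) (NN cat)) (PP (IN on) (NP (DT the) (NN mat))))"
    " (VP (VBD slept)))"
)
shallow = tree_shallow(deep)
print(shallow)
print(deep.height(), shallow.height())
```

Here the top-level NP-SBJ and VP labels are kept, and everything below them is collapsed into (word, tag) pairs.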

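Converting Tree Labels

Parse trees contain a variety of label types that do not occur in chunk trees. When using parse trees to train a chunker, we want to reduce this variety by converting some Tree labels to more common label types. For example, there are two alternative NP subtree labels, NP-SBJ and NP-TMP, and both can be converted to NP. The following example shows how to do this.

To do this, we define a function named tree_convert() that takes the following two arguments:

  • The tree to convert
  • A label conversion mapping

The function returns a new Tree in which every label that matches a key in the mapping is replaced by the corresponding value.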

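from nltk.tree import Tree

def tree_convert(tree, mapping):
   children = []
   for t in tree:
      if isinstance(t, Tree):
         # Recurse into subtrees so nested labels are converted too
         children.append(tree_convert(t, mapping))
      else:
         # Leaves (words) are copied unchanged
         children.append(t)
   # Fall back to the original label when it is not in the mapping
   label = mapping.get(tree.label(), tree.label())
   return Tree(label, children)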

Now let us call tree_convert() on the third parsed sentence of the treebank corpus, which is a deep tree of nested phrases. We saved the function in a file named converttree.py.

from converttree import tree_convert
from nltk.corpus import treebank
mapping = {'NP-SBJ': 'NP', 'NP-TMP': 'NP'}
tree_convert(treebank.parsed_sents()[2], mapping)

Output

Tree('S', [Tree('NP-SBJ-1', [Tree('NP', [Tree('NNP', ['Rudolph']), 
Tree('NNP', ['Agnew'])]), Tree(',', [',']), 
Tree('UCP', [Tree('ADJP', [Tree('NP', [Tree('CD', ['55']), 
Tree('NNS', ['years'])]), 
Tree('JJ', ['old'])]), Tree('CC', ['and']), 
Tree('NP', [Tree('NP', [Tree('JJ', ['former']), 
Tree('NN', ['chairman'])]), Tree('PP', [Tree('IN', ['of']), 
Tree('NP', [Tree('NNP', ['Consolidated']), 
Tree('NNP', ['Gold']), Tree('NNP', ['Fields']), 
Tree('NNP', ['PLC'])])])])]), Tree(',', [','])]), 
Tree('VP', [Tree('VBD', ['was']),Tree('VP', [Tree('VBN', ['named']), 
Tree('S', [Tree('NP', [Tree('-NONE-', ['*-1'])]), 
Tree('NP-PRD', [Tree('NP', [Tree('DT', ['a']), 
Tree('JJ', ['nonexecutive']), Tree('NN', ['director'])]), 
Tree('PP', [Tree('IN', ['of']), Tree('NP', 
[Tree('DT', ['this']), Tree('JJ', ['British']), Tree('JJ', ['industrial']), 
Tree('NN', ['conglomerate'])])])])])])]), Tree('.', ['.'])])
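
Note that in the output above the top-level NP-SBJ-1 label is unchanged, because the mapping is looked up by exact label and contains only NP-SBJ and NP-TMP. The sketch below illustrates this on a small invented tree; adding an NP-SBJ-1 entry to the mapping is one possible way to cover such indexed labels, shown here as an assumption rather than as part of the original recipe.

```python
from nltk.tree import Tree

def tree_convert(tree, mapping):
    children = []
    for t in tree:
        if isinstance(t, Tree):
            children.append(tree_convert(t, mapping))
        else:
            children.append(t)
    label = mapping.get(tree.label(), tree.label())
    return Tree(label, children)

# Invented example containing an indexed subject label.
sample = Tree.fromstring(
    "(S (NP-SBJ-1 (NNP Agnew)) (VP (VBD arrived) (NP-TMP (NN today))))"
)

mapping = {'NP-SBJ': 'NP', 'NP-TMP': 'NP'}
converted = tree_convert(sample, mapping)
labels = [st.label() for st in converted.subtrees()]
print(labels)   # NP-SBJ-1 is still present, NP-TMP has become NP

# Extended mapping (an assumption): list the indexed label explicitly.
extended = dict(mapping, **{'NP-SBJ-1': 'NP'})
labels_extended = [st.label() for st in tree_convert(sample, extended).subtrees()]
print(labels_extended)
```

Because tree_convert() builds new Tree objects, the input tree keeps its original labels.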