📜  NLP | Chunk Tree to Text and Chaining Chunk Transformation

📅  Last modified: 2022-05-13 01:55:30.986000             🧑  Author: Mango

A tree or subtree can be converted back into a sentence or a chunk string. To show how, the code below uses the first tree of the treebank_chunk corpus.

Code #1: Joining the words in a tree with spaces.

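# Load the chunked Penn Treebank sample corpus
from nltk.corpus import treebank_chunk

# First chunked sentence, as an nltk Tree
tree = treebank_chunk.chunked_sents()[0]

print("Tree : \n", tree)

print("\nTree leaves : \n", tree.leaves())

# Keep only the words (dropping the tags) and join them with single spaces
print("\nSentence from tree : \n", ' '.join(
        [w for w, t in tree.leaves()]))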

Output:

Tree : 
 (S
  (NP Pierre/NNP Vinken/NNP)
  ,/,
  (NP 61/CD years/NNS)
  old/JJ
  ,/,
  will/MD
  join/VB
  (NP the/DT board/NN)
  as/IN
  (NP a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD)
  ./.)

Tree leaves : 
 [('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'), 
 ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'),
 ('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', 'JJ'),
 ('director', 'NN'), ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.')]

Sentence from tree : 
 Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .

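As the output above shows, the punctuation is wrong: the period and the commas are treated as separate words, so they get surrounding spaces like every other word. The code below fixes this with a regular-expression substitution.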

Code #2: A chunk_tree_to_sent() function that improves on Code #1

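import re

# Whitespace followed by a comma, period, semicolon, or question mark;
# the punctuation mark is captured so it can be kept
punct_re = re.compile(r'\s([,\.;\?])')

def chunk_tree_to_sent(tree, concat=' '):
    # Join the words (without their tags) using the concat string
    s = concat.join([w for w, t in tree.leaves()])
    # Drop the whitespace in front of each captured punctuation mark
    return re.sub(punct_re, r'\g<1>', s)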
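
To see the substitution in isolation, the following sketch applies a pattern like the one in Code #2 to a short, hand-typed list of (word, tag) pairs, so it runs without NLTK or the corpus. The list leaves and the names raw and fixed are illustrative assumptions, not part of the original code.

```python
import re

# One whitespace character followed by a comma, period, semicolon,
# or question mark; the punctuation mark is captured as group 1.
punct_re = re.compile(r'\s([,\.;\?])')

# Hand-typed (word, tag) pairs standing in for tree.leaves() (assumption).
leaves = [('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','),
          ('61', 'CD'), ('years', 'NNS'), ('old', 'JJ'), (',', ','),
          ('will', 'MD'), ('join', 'VB'), ('the', 'DT'),
          ('board', 'NN'), ('.', '.')]

# Plain space join: punctuation ends up with a space in front of it.
raw = ' '.join(w for w, t in leaves)

# Replace "space + punctuation" with the punctuation alone.
fixed = punct_re.sub(r'\g<1>', raw)

print(raw)    # Pierre Vinken , 61 years old , will join the board .
print(fixed)  # Pierre Vinken, 61 years old, will join the board.
```

Only the whitespace directly before the punctuation is removed; the space after each comma stays in place because it is not part of the match.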


Code #3: Evaluating chunk_tree_to_sent()

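# Load the chunked Penn Treebank sample corpus
from nltk.corpus import treebank_chunk
# chunk_tree_to_sent() from Code #2, assumed to be saved in a local transforms.py
from transforms import chunk_tree_to_sent

# First chunked sentence, as an nltk Tree
tree = treebank_chunk.chunked_sents()[0]

print("Tree : \n", tree)

print("\nTree leaves : \n", tree.leaves())

print("Tree to sentence : ", chunk_tree_to_sent(tree))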

Output:

Tree : 
 (S
  (NP Pierre/NNP Vinken/NNP)
  ,/,
  (NP 61/CD years/NNS)
  old/JJ
  ,/,
  will/MD
  join/VB
  (NP the/DT board/NN)
  as/IN
  (NP a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD)
  ./.)

Tree leaves : 
 [('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'), 
 ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'),
 ('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', 'JJ'),
 ('director', 'NN'), ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.')]

Tree to sentence :  Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.

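Chaining Chunk Transformations
Transformation functions can be chained together to normalize a chunk. The resulting chunk is usually shorter while keeping the same meaning.

In the code below, a single chunk and an optional list of transformation functions are passed to the function. It calls each transformation function on the chunk in turn and returns the final chunk.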

Code #4: The transform_chunk() function

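# The four default transforms are assumed to be defined earlier in the
# same module (transforms.py); they are not shown in this article.
def transform_chunk(
        chunk, chain=[filter_insignificant,
                      swap_verb_phrase, swap_infinitive_phrase,
                      singularize_plural_noun], trace=0):
    # Apply each transform in order, feeding its result to the next one
    for f in chain:
        chunk = f(chunk)

        # With trace enabled, show the chunk after every step
        if trace:
            print(f.__name__, ':', chunk)

    return chunk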


Code #5: Evaluating transform_chunk()

from transforms import transform_chunk

chunk = [('the', 'DT'), ('book', 'NN'), ('of', 'IN'),
         ('recipes', 'NNS'), ('is', 'VBZ'), ('delicious', 'JJ')]

print("Chunk : \n", chunk)

print("\nTransformed Chunk : \n", transform_chunk(chunk))

Output:

Chunk :  
[('the', 'DT'), ('book', 'NN'), ('of', 'IN'), ('recipes', 'NNS'), 
('is', 'VBZ'), ('delicious', 'JJ')]

Transformed Chunk : 
[('delicious', 'JJ'), ('recipe', 'NN'), ('book', 'NN')]
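
The default transforms used above live in transforms.py and are not shown here, so the chaining pattern itself can be tried with stand-in functions. In the sketch below, drop_determiners and lowercase_words are hypothetical placeholders, not the real transforms; they only share the same call shape, taking a list of (word, tag) pairs and returning a new list. The default chain is written as a tuple here, which is a small stylistic assumption rather than the original signature.

```python
# Hypothetical stand-in transforms (NOT the functions from transforms.py).
# Each takes a list of (word, tag) pairs and returns a new list.
def drop_determiners(chunk):
    # Remove every word tagged as a determiner (DT)
    return [(w, t) for w, t in chunk if t != 'DT']

def lowercase_words(chunk):
    # Lowercase each word, leaving the tags untouched
    return [(w.lower(), t) for w, t in chunk]

def transform_chunk(chunk, chain=(drop_determiners, lowercase_words), trace=0):
    # Apply each transform in order, feeding its result to the next one
    for f in chain:
        chunk = f(chunk)
        # With trace enabled, print the chunk after every step
        if trace:
            print(f.__name__, ':', chunk)
    return chunk

chunk = [('The', 'DT'), ('Book', 'NN'), ('of', 'IN'), ('Recipes', 'NNS')]

# trace=1 prints the intermediate chunk after each transform
result = transform_chunk(chunk, trace=1)
print(result)  # [('book', 'NN'), ('of', 'IN'), ('recipes', 'NNS')]
```

Passing trace=1 makes it easy to see which step in the chain changed the chunk, and a different chain can be supplied to run only some transforms or to change their order.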