NLP | Backoff Tagging to Combine Taggers

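What is part-of-speech (POS) tagging? It is the process of converting a sentence, in the form of a list of words, into a list of tuples, where each tuple has the form (word, tag). The tag here is a part-of-speech tag that indicates whether the word is a noun, adjective, verb, and so on.

What is backoff tagging? It is one of the most important features of SequentialBackoffTagger, because it allows taggers to be combined. The advantage is that if a tagger does not know the tag for a word, it can pass the tagging task to the next backoff tagger. If that one cannot tag the word either, it passes the word on to the next backoff tagger, and so on, until there are no backoff taggers left to check.

Code #1: Tagging with a backoff tagger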

Python3

# Loading Libraries
from nltk.tag import SequentialBackoffTagger
from nltk.tag import DefaultTagger
from nltk.tag import UnigramTagger 
 
from nltk.corpus import treebank
 
# initializing training and testing set   
train_data = treebank.tagged_sents()[:3000]
test_data = treebank.tagged_sents()[3000:]
 
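# Default tagger: assigns 'NN' to every word
tag1 = DefaultTagger('NN')
 
# Unigram tagger that backs off to tag1 for words it cannot tag
tag2 = UnigramTagger(train_data, backoff = tag1)
 
# Evaluation (newer NLTK releases provide accuracy() in place of evaluate())
print(tag2.evaluate(test_data))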

Output:

0.8752428232246924
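
To see the backoff step on a single sentence, here is a minimal sketch that uses a tiny made-up training set instead of the treebank corpus. The names toy_train, no_backoff, and with_backoff are assumptions chosen for illustration; only DefaultTagger and UnigramTagger come from NLTK.

```python
from nltk.tag import DefaultTagger, UnigramTagger

# Toy training data, invented for this illustration (not from treebank)
toy_train = [
    [("the", "DT"), ("cat", "NN"), ("sat", "VBD")],
    [("the", "DT"), ("dog", "NN"), ("ran", "VBD")],
]

default_tagger = DefaultTagger("NN")

# Same training data, once without and once with a backoff tagger
no_backoff = UnigramTagger(toy_train)
with_backoff = UnigramTagger(toy_train, backoff=default_tagger)

# "zebra" never appears in the training data
sentence = ["the", "zebra", "ran"]
tags_no_backoff = no_backoff.tag(sentence)
tags_with_backoff = with_backoff.tag(sentence)

print(tags_no_backoff)
print(tags_with_backoff)
```

Without a backoff, the unseen word "zebra" gets the tag None. With the default tagger as backoff, it is tagged 'NN' instead.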

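How does this work? The SequentialBackoffTagger class accepts a backoff keyword argument whose value is another SequentialBackoffTagger instance. In the code above, the unigram POS tagger is trained on the first 3000 sentences of treebank.tagged_sents() and uses the default tagger as its backoff.

Code #2: Inspecting the internal list of backoff taggers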

Python3

from nltk.tag import SequentialBackoffTagger
 
print (tag1._taggers == [tag1])
 
print ("\n", tag2._taggers == [tag2, tag1])

Output:

True

True
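
The same idea extends to longer chains. As a rough sketch, assuming a toy training set (toy_train and the variable names are made up for illustration), a BigramTagger can back off to a UnigramTagger, which in turn backs off to a DefaultTagger. The loop follows the public backoff property rather than the private _taggers list.

```python
from nltk.tag import BigramTagger, DefaultTagger, UnigramTagger

# Toy training data, invented for this illustration
toy_train = [[("the", "DT"), ("cat", "NN"), ("sat", "VBD")]]

default_tagger = DefaultTagger("NN")
unigram_tagger = UnigramTagger(toy_train, backoff=default_tagger)
bigram_tagger = BigramTagger(toy_train, backoff=unigram_tagger)

# Walk the chain from the primary tagger down to the last backoff
chain = []
current = bigram_tagger
while current is not None:
    chain.append(type(current).__name__)
    current = current.backoff

print(chain)
```

Each tagger only knows its immediate backoff, but because every tagger carries the full list of those behind it, the primary tagger can reach the whole chain.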

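How does this work?

When a SequentialBackoffTagger is initialized, it creates an internal list of taggers whose first element is the tagger itself.
If a backoff tagger is given, that tagger's internal list of taggers is appended.
This _taggers list is the internal list of backoff taggers that is used when the tag() method is called.
For each word, the tagger walks through this list and calls choose_tag() on each tagger in turn.
As soon as a tag is found, it stops and returns that tag.
So if the primary tagger can tag the word, its tag is returned.
Otherwise choose_tag() returns None and the next tagger is tried, and so on, until a tag is found or the list is exhausted, in which case None is returned.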
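
The steps above can be sketched in plain Python. This is a simplified stand-in, not NLTK's actual implementation: SketchTagger and SketchDefaultTagger are hypothetical classes that use a dictionary lookup in place of a trained model, and their choose_tag() takes only the word.

```python
class SketchTagger:
    """Simplified stand-in for a backoff tagger (not NLTK's real class)."""

    def __init__(self, word_to_tag, backoff=None):
        self.word_to_tag = word_to_tag
        # The tagger itself comes first, followed by the backoff's own list
        self._taggers = [self]
        if backoff is not None:
            self._taggers += backoff._taggers

    def choose_tag(self, word):
        # Return a tag if this tagger knows the word, otherwise None
        return self.word_to_tag.get(word)

    def tag_one(self, word):
        # Try each tagger in order and stop at the first non-None tag
        for tagger in self._taggers:
            tag = tagger.choose_tag(word)
            if tag is not None:
                return tag
        return None

    def tag(self, words):
        return [(word, self.tag_one(word)) for word in words]


class SketchDefaultTagger(SketchTagger):
    """Always answers with the same tag, so it ends the chain."""

    def __init__(self, default_tag):
        super().__init__({})
        self.default_tag = default_tag

    def choose_tag(self, word):
        return self.default_tag


fallback = SketchDefaultTagger("NN")
primary = SketchTagger({"the": "DT", "runs": "VBZ"}, backoff=fallback)

result = primary.tag(["the", "dog", "runs"])
print(result)
```

Here "dog" is missing from the primary tagger's dictionary, so the loop in tag_one() moves on to the fallback, which supplies 'NN'.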

Code #3: Saving and loading a trained tagger with pickle.

Python3

# Loading Libraries
import pickle
 
# Opening file and writing (tag2 is the trained tagger from Code #1)
with open('tagger.pickle', 'wb') as file:
    pickle.dump(tag2, file)
 
# Reading file and loading the tagger back
with open('tagger.pickle', 'rb') as file:
    tagger = pickle.load(file)

Note: the saved file can also be loaded with nltk.data.load('tagger.pickle') if it is placed in a directory on NLTK's data path.
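
To check that a pickled tagger behaves the same after loading, a small round-trip test can help. The sketch below assumes a toy training set and writes to a temporary directory; toy_train, path, and restored are illustrative names, not part of the example above.

```python
import os
import pickle
import tempfile

from nltk.tag import DefaultTagger, UnigramTagger

# Toy training data, invented for this illustration
toy_train = [[("the", "DT"), ("cat", "NN"), ("sat", "VBD")]]
tagger = UnigramTagger(toy_train, backoff=DefaultTagger("NN"))

# Save to a temporary location so no existing file is overwritten
path = os.path.join(tempfile.mkdtemp(), "toy_tagger.pickle")
with open(path, "wb") as f:
    pickle.dump(tagger, f)

# Load the tagger back from disk
with open(path, "rb") as f:
    restored = pickle.load(f)

# The restored tagger, including its backoff, should tag identically
sentence = ["the", "bird", "sat"]
print(restored.tag(sentence) == tagger.tag(sentence))
```

The backoff tagger is stored along with the primary tagger, so the whole chain is restored in one step. As with any pickle file, only load files from a source you trust.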