📜  Part-of-Speech Tagging with Stop Words Using NLTK in Python

📅  Last modified: 2020-04-27 14:22:06             🧑  Author: Mango

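The Natural Language Toolkit (NLTK) is a platform for building text analysis programs. Part-of-speech (POS) tagging is one of the more powerful features of the NLTK module.
To run the Python programs below, you must have NLTK installed. Follow these installation steps: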

  • Open a terminal and run pip install nltk
  • Type python at the command prompt to start the Python interactive shell, which is then ready to execute your code.
  • Type import nltk
  • Type nltk.download()

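A GUI will pop up. Choose "all" to download every package, then click "Download". This gives you all the tokenizers, chunkers, other algorithms, and all of the corpora, which is why the installation takes quite some time.
Example: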

import nltk
nltk.download()
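
Downloading every package is not strictly required for the examples in this article. As a hedged sketch, the helper below fetches only the resources the later code appears to need. The names `download_required` and `REQUIRED_RESOURCES` are assumptions made for illustration, and the resource identifiers depend on the NLTK version: recent releases may ask for `punkt_tab` and `averaged_perceptron_tagger_eng` instead.

```python
# Resource ids assumed for older NLTK releases; adjust them if a LookupError
# names a different resource (for example "punkt_tab").
REQUIRED_RESOURCES = ("punkt", "stopwords", "averaged_perceptron_tagger")


def download_required(resources=REQUIRED_RESOURCES):
    """Download each named NLTK resource and report which ones succeeded."""
    import nltk  # imported here so this sketch loads even before NLTK is installed

    # nltk.download returns True on success and False on failure.
    return {name: nltk.download(name, quiet=True) for name in resources}


# Usage (needs network access):
#     status = download_required()
#     missing = [name for name, ok in status.items() if not ok]
```

If a later call raises a LookupError, the error message names the missing resource, which can then be added to the list.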

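Let's go over some quick vocabulary:
Corpus: a body of text.
Lexicon: words and their meanings.
Token: each "entity" that is part of whatever was split up according to a set of rules.
In corpus linguistics, part-of-speech tagging (POS tagging, PoS tagging, or POST), also called grammatical tagging or word-category disambiguation, is the process of marking each word in a text with its part of speech, based on both its definition and its context.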

Input: Everything is all about money.
Output: [('Everything', 'NN'), ('is', 'VBZ'),
          ('all', 'DT'), ('about', 'IN'),
          ('money', 'NN'), ('.', '.')]
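
The tagger returns a plain list of (word, tag) tuples, so ordinary list operations are enough to work with it. The sketch below reuses the output shown above as literal data and pulls out words by tag prefix. The helper name `words_with_tag_prefix` is an assumption for illustration and is not part of NLTK.

```python
# Tagged output copied from the example above.
tagged = [('Everything', 'NN'), ('is', 'VBZ'),
          ('all', 'DT'), ('about', 'IN'),
          ('money', 'NN'), ('.', '.')]


def words_with_tag_prefix(tagged_tokens, prefix):
    """Return the words whose tag starts with the given prefix (e.g. 'NN', 'VB')."""
    return [word for word, tag in tagged_tokens if tag.startswith(prefix)]


nouns = words_with_tag_prefix(tagged, 'NN')
verbs = words_with_tag_prefix(tagged, 'VB')
print(nouns)  # ['Everything', 'money']
print(verbs)  # ['is']
```

Matching on a prefix rather than an exact tag groups related tags together, so 'NN' also catches NNS, NNP, and NNPS.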

The following is a list of the tags, their meanings, and some examples:
CC    coordinating conjunction
CD    cardinal number
DT    determiner
EX    existential there (as in "there is"; think of it as "there exists")
FW    foreign word
IN    preposition/subordinating conjunction
JJ    adjective 'big'
JJR   adjective, comparative 'bigger'
JJS   adjective, superlative 'biggest'
LS    list marker '1)'
MD    modal 'could', 'will'
NN    noun, singular 'desk'
NNS   noun, plural 'desks'
NNP   proper noun, singular 'Harrison'
NNPS  proper noun, plural 'Americans'
PDT   predeterminer 'all the kids'
POS   possessive ending "parent's"
PRP   personal pronoun 'I', 'he', 'she'
PRP$  possessive pronoun 'my', 'his', 'hers'
RB    adverb 'very', 'silently'
RBR   adverb, comparative 'better'
RBS   adverb, superlative 'best'
RP    particle 'give up'
TO    to: go 'to' the store
UH    interjection 'errrrrrrrm'
VB    verb, base form 'take'
VBD   verb, past tense 'took'
VBG   verb, gerund/present participle 'taking'
VBN   verb, past participle 'taken'
VBP   verb, singular present, non-3rd person 'take'
VBZ   verb, 3rd person singular present 'takes'
WDT   wh-determiner 'which'
WP    wh-pronoun 'who', 'what'
WP$   possessive wh-pronoun 'whose'
WRB   wh-adverb 'where', 'when'
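
When reading tagger output it can help to attach these descriptions to each token. The following is a minimal sketch that uses a hand-written subset of the list above. The `TAG_DESCRIPTIONS` dictionary and the `describe` function are assumptions for illustration, not NLTK APIs.

```python
# A small subset of the tag list above; extend it as needed.
TAG_DESCRIPTIONS = {
    'DT': 'determiner',
    'IN': 'preposition/subordinating conjunction',
    'JJ': 'adjective',
    'NN': 'noun, singular',
    'NNS': 'noun, plural',
    'NNP': 'proper noun, singular',
    'VBG': 'verb, gerund/present participle',
    'VBN': 'verb, past participle',
    'VBZ': 'verb, 3rd person singular present',
}


def describe(tagged_tokens, descriptions=TAG_DESCRIPTIONS):
    """Attach a readable description to each (word, tag) pair."""
    return [(word, tag, descriptions.get(tag, 'unknown tag'))
            for word, tag in tagged_tokens]


for word, tag, meaning in describe([('desks', 'NNS'), ('big', 'JJ'), ('.', '.')]):
    print(f"{word:<6} {tag:<4} {meaning}")
```

NLTK also provides `nltk.help.upenn_tagset`, which prints the documentation for a tag, provided the tag-set documentation resource has been downloaded.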

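Text may contain stop words such as "the", "is", and "are". Stop words can be filtered out of the text before it is processed. There is no universal stop word list in NLP research, but the nltk module ships with one.
You can add your own stop words. Go to your NLTK download directory -> corpora -> stopwords, and update the stop word file for the language you are using. Here we use English (stopwords.words('english')).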
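
As an alternative to editing the file on disk, extra stop words can be merged into the set in memory. The sketch below is hedged: `extend_stop_words` is a hypothetical helper, and the short `base` list is a stand-in for `stopwords.words('english')` so that the example runs without the NLTK data.

```python
def extend_stop_words(base_words, extra_words):
    """Return a new set with the base stop words plus the lowercased extras."""
    return set(base_words) | {word.lower() for word in extra_words}


# Stand-in for stopwords.words('english'); this is not the real NLTK list.
base = ['the', 'is', 'are']
stop_words = extend_stop_words(base, ['Please', 'etc'])
print(sorted(stop_words))  # ['are', 'etc', 'is', 'please', 'the']
```

Keeping the additions in code leaves the installed corpus untouched, so the change does not affect other projects that share the same NLTK data directory.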

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

# The NLTK English stop word list
stop_words = set(stopwords.words('english'))

# Dummy text
txt = "Sukanya, Rajib and Naba are my good friends. " \
    "Sukanya is getting married next year. " \
    "Marriage is a big step in one’s life. " \
    "It is both exciting and frightening. " \
    "But friendship is a sacred bond between people. " \
    "It is a special kind of love between us. " \
    "Many of you must have tried searching for a friend " \
    "but never found the right one."

# sent_tokenize uses an instance of PunktSentenceTokenizer
# from the nltk.tokenize.punkt module
tokenized = sent_tokenize(txt)
for sentence in tokenized:
    # The word tokenizer finds the words and punctuation in a string
    words_list = word_tokenize(sentence)
    # Remove stop words from words_list (the comparison is case-sensitive)
    words_list = [w for w in words_list if w not in stop_words]
    # Use a tagger, i.e. a part-of-speech tagger or POS tagger
    tagged = nltk.pos_tag(words_list)
    print(tagged)

Output:

[('Sukanya', 'NNP'), ('Rajib', 'NNP'), ('Naba', 'NNP'), ('good', 'JJ'), ('friends', 'NNS')]
[('Sukanya', 'NNP'), ('getting', 'VBG'), ('married', 'VBN'), ('next', 'JJ'), ('year', 'NN')]
[('Marriage', 'NN'), ('big', 'JJ'), ('step', 'NN'), ('one', 'CD'), ('’', 'NN'), ('life', 'NN')]
[('It', 'PRP'), ('exciting', 'VBG'), ('frightening', 'VBG')]
[('But', 'CC'), ('friendship', 'NN'), ('sacred', 'VBD'), ('bond', 'NN'), ('people', 'NNS')]
[('It', 'PRP'), ('special', 'JJ'), ('kind', 'NN'), ('love', 'VB'), ('us', 'PRP')]
[('Many', 'JJ'), ('must', 'MD'), ('tried', 'VB'), ('searching', 'VBG'), ('friend', 'NN'),
('never', 'RB'), ('found', 'VBD'), ('right', 'RB'), ('one', 'CD')]
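
In the output above, capitalized words such as 'It', 'But', and 'Many' are still present because the comparison against the stop word set is case-sensitive and the set holds lowercase entries. A minimal sketch of a case-insensitive filter follows. The function name `remove_stop_words` and the small stand-in stop word set are assumptions for illustration.

```python
def remove_stop_words(tokens, stop_words):
    """Drop tokens whose lowercase form appears in the stop word set."""
    return [token for token in tokens if token.lower() not in stop_words]


# Stand-in set for illustration; this is not the full NLTK list.
stop_words = {'it', 'is', 'a', 'of', 'between'}
tokens = ['It', 'is', 'a', 'special', 'kind', 'of', 'love', 'between', 'us']
print(remove_stop_words(tokens, stop_words))  # ['special', 'kind', 'love', 'us']
```

Only the comparison is lowercased. The surviving tokens keep their original casing, which matters because the tagger uses capitalization as a clue for proper nouns.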

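Basically, the goal of a POS tagger is to assign linguistic (mostly grammatical) information to sub-sentential units. Such units are called tokens and, in most cases, correspond to words and symbols (e.g. punctuation).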
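
Because a tagger assigns each tag using the neighbouring tokens as context, removing stop words before tagging can change the tags of the words that remain. One possible variation, sketched here under stated assumptions, is to tag the full sentence first and filter afterwards. `tag_then_filter` is a hypothetical helper, and `toy_tagger` is a hand-written stand-in for `nltk.pos_tag` whose tags are illustrative, not real NLTK output.

```python
def tag_then_filter(tokens, stop_words, tagger):
    """Tag the full token list first, then drop stop words from the tagged pairs."""
    tagged = tagger(tokens)
    return [(word, tag) for word, tag in tagged if word.lower() not in stop_words]


def toy_tagger(tokens):
    """Stand-in for nltk.pos_tag: looks tags up in a tiny hand-written table."""
    lookup = {'friendship': 'NN', 'is': 'VBZ', 'a': 'DT',
              'sacred': 'JJ', 'bond': 'NN'}
    return [(token, lookup.get(token, 'NN')) for token in tokens]


tokens = ['friendship', 'is', 'a', 'sacred', 'bond']
print(tag_then_filter(tokens, {'is', 'a'}, toy_tagger))
# [('friendship', 'NN'), ('sacred', 'JJ'), ('bond', 'NN')]
```

With NLTK installed, passing `nltk.pos_tag` as the `tagger` argument should work the same way, since it also takes a list of tokens and returns a list of (word, tag) tuples.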