📜  Tokenizing Text with NLTK in Python

📅  Last modified: 2020-04-26 10:20:02             🧑  Author: Mango

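To run the Python program below, the Natural Language Toolkit (NLTK) must be installed on your system.
NLTK is a large suite of tools intended to support the full range of natural language processing (NLP) methods.
To install NLTK, run the following command in a terminal.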

  • sudo pip install nltk
  • Then type python in the terminal to open the Python shell
  • Enter: import nltk
  • Enter: nltk.download('all')

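The installation above takes some time, because it downloads a large number of tokenizers, chunkers, other algorithms, and all of the corpora.
Some frequently used terms are: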

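    • Corpus: a body of text.
    • Lexicon: words and their meanings.
    • Token: each "entity" that results from splitting text according to a rule. For example, when a sentence is tokenized into words, each word is a token. If you tokenize a paragraph into sentences, each sentence can also be a token.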
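To illustrate that what counts as a token depends on the splitting rule, here is a minimal sketch that uses only the standard library. The function names `naive_sentences` and `naive_words` are hypothetical, and the regular expressions are deliberately simple stand-ins, not the rules NLTK applies.

```python
import re

def naive_sentences(paragraph):
    """Split after '.', '!' or '?' when followed by whitespace (illustrative rule only)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]

def naive_words(sentence):
    """Return runs of word characters, or single punctuation marks, as tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)

paragraph = "Tokens depend on the rule. Split by sentence, or split by word!"

# Rule 1: each sentence is a token
sentences = naive_sentences(paragraph)
print(sentences)

# Rule 2: each word or punctuation mark is a token
words = naive_words(sentences[0])
print(words)
```

A rule this simple would likely split incorrectly after abbreviations such as "Dr.", which is one reason to prefer a tokenizer from a library like NLTK over hand-written patterns.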

In short, tokenization means splitting a body of text into sentences and words.

# Import NLTK's ready-made sentence and word tokenizers
from nltk.tokenize import sent_tokenize, word_tokenize

# Adjacent string literals inside parentheses are joined into one string
text = ("Natural language processing (NLP) is a field "
        "of computer science, artificial intelligence "
        "and computational linguistics concerned with "
        "the interactions between computers and human "
        "(natural) languages, and, in particular, "
        "concerned with programming computers to "
        "fruitfully process large natural language "
        "corpora. Challenges in natural language "
        "processing frequently involve natural "
        "language understanding, natural language "
        "generation (frequently from formal, machine"
        "-readable logical forms), connecting language "
        "and machine perception, managing human-"
        "computer dialog systems, or some combination "
        "thereof.")

# Split the text into sentences, then into words and punctuation marks
print(sent_tokenize(text))
print(word_tokenize(text))

Output

['Natural language processing (NLP) is a field of computer science, artificial intelligence and computational linguistics concerned with the interactions between computers and human (natural) languages, and, in particular, concerned with programming computers to fruitfully process large natural language corpora.', 'Challenges in natural language processing frequently involve natural language understanding, natural language generation (frequently from formal, machine-readable logical forms), connecting language and machine perception, managing human-computer dialog systems, or some combination thereof.']
['Natural', 'language', 'processing', '(', 'NLP', ')', 'is', 'a', 'field', 'of', 'computer', 'science', ',', 'artificial', 'intelligence', 'and', 'computational', 'linguistics', 'concerned', 'with', 'the', 'interactions', 'between', 'computers', 'and', 'human', '(', 'natural', ')', 'languages', ',', 'and', ',', 'in', 'particular', ',', 'concerned', 'with', 'programming', 'computers', 'to', 'fruitfully', 'process', 'large', 'natural', 'language', 'corpora', '.', 'Challenges', 'in', 'natural', 'language', 'processing', 'frequently', 'involve', 'natural', 'language', 'understanding', ',', 'natural', 'language', 'generation', '(', 'frequently', 'from', 'formal', ',', 'machine-readable', 'logical', 'forms', ')', ',', 'connecting', 'language', 'and', 'machine', 'perception', ',', 'managing', 'human-computer', 'dialog', 'systems', ',', 'or', 'some', 'combination', 'thereof', '.']

So here we created tokens, first as sentences and then as words.
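The two levels of tokens can also be combined, so that each sentence keeps its own list of word tokens. The sketch below assumes a hypothetical helper, `tokenize_nested`, that accepts any sentence splitter and any word splitter. The `demo_sentences` and `demo_words` functions are simple stand-ins written so that the example runs without NLTK; they are not how NLTK tokenizes.

```python
def tokenize_nested(text, split_sentences, split_words):
    """Return one list of word tokens per sentence token."""
    return [split_words(sentence) for sentence in split_sentences(text)]

# Stand-in splitters (assumptions for illustration, not NLTK's rules)
def demo_sentences(text):
    # Break on full stops and put the full stop back on each sentence
    return [part.strip() + "." for part in text.split(".") if part.strip()]

def demo_words(sentence):
    # Separate the full stop from the last word, then split on whitespace
    return sentence.replace(".", " .").split()

nested = tokenize_nested("Cats purr. Dogs bark.", demo_sentences, demo_words)
print(nested)
```

With NLTK installed, passing `sent_tokenize` and `word_tokenize` as the two splitters, as in `tokenize_nested(text, sent_tokenize, word_tokenize)`, should give one word list for each sentence of the text above.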