Python – Tokenize text using Enchant

Last modified: 2022-05-13 01:55:34.682000             Author: Mango

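  • Corpus: a body of text (singular). Corpora is the plural form.
  • Lexicon: words and their meanings.
  • Token: each "entity" that is a part of whatever was split up according to rules. For example, when a sentence is tokenized into words, each word is a token.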

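The items produced by the tokenizer are tuples of the form (WORD, POS), where WORD is the tokenized word and POS is the position in the string at which that word starts.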
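To make the (WORD, POS) format concrete, here is a minimal sketch that produces tuples of the same shape using only the standard library. The function name simple_tokenize and the regular expression are assumptions chosen for illustration; this is a stand-in, not how Enchant splits text internally, and its word boundaries may differ from Enchant's on harder input.

```python
import re

def simple_tokenize(text):
    """Yield (WORD, POS) tuples; a regex stand-in, not Enchant's own code."""
    # A "word" here is a run of ASCII letters, optionally with inner apostrophes.
    for match in re.finditer(r"[A-Za-z]+(?:'[A-Za-z]+)*", text):
        # match.start() is the offset of the word in the original string
        yield (match.group(), match.start())

sample = "this is some simple text"
tokens = list(simple_tokenize(sample))
print(tokens)
# [('this', 0), ('is', 5), ('some', 8), ('simple', 13), ('text', 20)]
```

Each position is a zero-based character offset, so sample[13:19] gives back "simple".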

# import the module
from enchant.tokenize import get_tokenizer

# the text to be tokenized
text = ("Natural language processing (NLP) is a field " +
        "of computer science, artificial intelligence " +
        "and computational linguistics concerned with " +
        "the interactions between computers and human " +
        "(natural) languages, and, in particular, " +
        "concerned with programming computers to " +
        "fruitfully process large natural language " +
        "corpora. Challenges in natural language " +
        "processing frequently involve natural " +
        "language understanding, natural language " +
        "generation (frequently from formal, machine" +
        "-readable logical forms), connecting language " +
        "and machine perception, managing human-" +
        "computer dialog systems, or some combination")

# get a tokenizer for US English
tokenizer = get_tokenizer("en_US")

# collect every (WORD, POS) tuple the tokenizer produces
token_list = []
for token in tokenizer(text):
    token_list.append(token)

# print the words with POS
print(token_list)

Output: a list of (WORD, POS) tuples, one for each word found in the text.

To print only the words, without the POS:

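# print only the words
word_list = []
for tokens in token_list:
    # each item is a (WORD, POS) tuple; index 0 is the word
    word_list.append(tokens[0])
print(word_list)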

Output: a list containing only the words, in the order they appear in the text.
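As a small follow-up sketch, the same separation of words from positions can be written with list comprehensions, and each POS can be checked against the original string. The names sample_text and sample_tokens are hypothetical, and the token list below is written by hand in the (WORD, POS) format rather than captured from Enchant.

```python
# Hand-written sample in the (WORD, POS) format; not captured Enchant output.
sample_text = "Natural language processing (NLP) is a field"
sample_tokens = [("Natural", 0), ("language", 8), ("processing", 17),
                 ("NLP", 29), ("is", 34), ("a", 37), ("field", 39)]

# Unpack each tuple to keep only the part that is needed.
words = [word for word, pos in sample_tokens]
positions = [pos for word, pos in sample_tokens]

# Each POS is the offset where the word starts, so slicing the
# original text at that offset must give the word back.
for word, pos in sample_tokens:
    assert sample_text[pos:pos + len(word)] == word

print(words)
# ['Natural', 'language', 'processing', 'NLP', 'is', 'a', 'field']
```

Keeping the positions around is useful when a word has to be highlighted or replaced in the original text, which is why the tokenizer reports them at all.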