Discussing the Natural Language Toolkit (1)

📅 Last modified: 2023-12-03 15:41:43.380000 🧑 Author: Mango

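Natural Language Toolkit (NLTK)

The Natural Language Toolkit is an open-source Python library for natural language processing (NLP) tasks such as text classification, tagging, chunking, parsing, language identification, and semantic analysis.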

Installation

nltk can be installed with pip:

pip install nltk
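Installing the package does not install the models and corpora that the examples in this article rely on; those are fetched separately with nltk.download(). The sketch below is one possible way to check what is missing. The helper names (RESOURCES, missing_resources, ensure_resources) are made up for illustration, and the resource names are the ones used by NLTK 3.8.x; newer releases may ask for different names such as punkt_tab or averaged_perceptron_tagger_eng.

```python
import nltk

# Download name -> path that nltk.data.find() looks up.
# Assumption: names as used by NLTK 3.8.x; newer releases may differ.
RESOURCES = {
    "punkt": "tokenizers/punkt",
    "averaged_perceptron_tagger": "taggers/averaged_perceptron_tagger",
    "stopwords": "corpora/stopwords",
    "brown": "corpora/brown",
}


def missing_resources(resources=RESOURCES):
    """Return the download names of resources that are not available locally."""
    missing = []
    for name, path in resources.items():
        try:
            nltk.data.find(path)
        except LookupError:
            missing.append(name)
    return missing


def ensure_resources(resources=RESOURCES):
    """Download every missing resource (requires network access)."""
    for name in missing_resources(resources):
        nltk.download(name)


print("Missing resources:", missing_resources())
```

Calling ensure_resources() once before running the later examples should be enough on a machine with network access.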

Usage

First, import the nltk package:

import nltk

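Tokenization

nltk provides several tokenizers. The commonly used word_tokenize function first splits the text into sentences with the Punkt model and then applies a regular-expression-based, Treebank-style word tokenizer, so it needs the punkt data package (nltk.download('punkt')):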

from nltk.tokenize import word_tokenize

sentence = "Natural Language Processing (NLP) is a subfield of computer science and artificial intelligence."
tokens = word_tokenize(sentence)
print(tokens)

Output:

['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'is', 'a', 'subfield', 'of', 'computer', 'science', 'and', 'artificial', 'intelligence', '.']
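A regular-expression tokenizer can also be used directly through RegexpTokenizer, which needs no downloaded data. The following is a minimal sketch; the pattern \w+ is only one possible choice and, unlike word_tokenize, it drops punctuation entirely.

```python
from nltk.tokenize import RegexpTokenizer

sentence = "Natural Language Processing (NLP) is a subfield of computer science and artificial intelligence."

# Keep runs of word characters; parentheses and the final period are discarded.
tokenizer = RegexpTokenizer(r"\w+")
word_tokens = tokenizer.tokenize(sentence)
print(word_tokens)
# ['Natural', 'Language', 'Processing', 'NLP', 'is', 'a', 'subfield', 'of',
#  'computer', 'science', 'and', 'artificial', 'intelligence']
```

Whether punctuation should survive tokenization depends on the downstream task: tagging and parsing usually want it, while bag-of-words features usually do not.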

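Part-of-speech tagging

nltk also provides several part-of-speech taggers, including hidden Markov model taggers. The pos_tag function used here applies NLTK's pretrained default tagger, an averaged perceptron model, and needs the averaged_perceptron_tagger data package: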

from nltk.tag import pos_tag

tags = pos_tag(tokens)
print(tags)

Output:

[('Natural', 'JJ'), ('Language', 'NN'), ('Processing', 'NNP'), ('(', '('), ('NLP', 'NNP'), (')', ')'), ('is', 'VBZ'), ('a', 'DT'), ('subfield', 'NN'), ('of', 'IN'), ('computer', 'NN'), ('science', 'NN'), ('and', 'CC'), ('artificial', 'JJ'), ('intelligence', 'NN'), ('.', '.')]

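Here NN marks a singular or mass noun, VBZ a third-person singular present-tense verb, JJ an adjective, IN a preposition or subordinating conjunction, CC a coordinating conjunction, and NNP a singular proper noun, among others. The tags follow the Penn Treebank tagset.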
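Because related tags share a prefix (NN, NNS, NNP, and NNPS are all nouns), tagged output is often filtered by prefix. The sketch below reuses the tagged tokens printed above as a literal list, so it runs without the tagger model; the variable names are illustrative.

```python
# Tagged tokens copied from the pos_tag output shown above.
tags = [('Natural', 'JJ'), ('Language', 'NN'), ('Processing', 'NNP'), ('(', '('),
        ('NLP', 'NNP'), (')', ')'), ('is', 'VBZ'), ('a', 'DT'), ('subfield', 'NN'),
        ('of', 'IN'), ('computer', 'NN'), ('science', 'NN'), ('and', 'CC'),
        ('artificial', 'JJ'), ('intelligence', 'NN'), ('.', '.')]

# All noun tags start with "NN"; all adjective tags start with "JJ".
nouns = [word for word, tag in tags if tag.startswith("NN")]
adjectives = [word for word, tag in tags if tag.startswith("JJ")]

print(nouns)
# ['Language', 'Processing', 'NLP', 'subfield', 'computer', 'science', 'intelligence']
print(adjectives)
# ['Natural', 'artificial']
```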

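Stop word filtering

Text processing usually removes stop words (for example "a", "an", "the", "is", "of"), which occur very often but carry little content, to reduce noise. The example below needs the stopwords corpus (nltk.download('stopwords')).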

from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
filtered_tokens = [token for token in tokens if token.lower() not in stop_words]

print(filtered_tokens)

Output:

['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'subfield', 'computer', 'science', 'artificial', 'intelligence', '.']
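The filtered list above still contains punctuation tokens such as "(" and ".". A possible extension is to drop tokens with no letters or digits in the same pass. In this sketch, filter_tokens is a hypothetical helper and demo_stop_words is a tiny stand-in set so the block runs without the corpus; in practice one would pass set(stopwords.words('english')) instead.

```python
def filter_tokens(tokens, stop_words):
    """Drop stop words (case-insensitively) and tokens without any letter or digit."""
    return [
        token for token in tokens
        if token.lower() not in stop_words
        and any(ch.isalnum() for ch in token)
    ]


tokens = ['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'is', 'a',
          'subfield', 'of', 'computer', 'science', 'and', 'artificial',
          'intelligence', '.']

# Stand-in for set(stopwords.words('english')); illustrative only.
demo_stop_words = {"a", "an", "the", "is", "of", "and"}

content_tokens = filter_tokens(tokens, demo_stop_words)
print(content_tokens)
# ['Natural', 'Language', 'Processing', 'NLP', 'subfield', 'computer',
#  'science', 'artificial', 'intelligence']
```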

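Corpora

nltk gives access to many commonly used corpora, such as the Brown Corpus, the Reuters Corpus of news text, and the English stop word list used above. Each corpus is downloaded separately, for example with nltk.download('brown').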

from nltk.corpus import brown

words = brown.words()[:10]
print(words)

Output:

['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of']
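A common first step with a corpus is counting word frequencies, which nltk supports through FreqDist. The sketch below defines a hypothetical helper, top_words, and runs it on a small hand-written sample so that it works without any corpus download; the commented lines show how the same helper might be applied to one Brown category.

```python
from nltk import FreqDist


def top_words(words, n=5):
    """Return the n most frequent alphabetic tokens, compared case-insensitively."""
    freq = FreqDist(w.lower() for w in words if w.isalpha())
    return freq.most_common(n)


# Hand-written sample, not taken from any corpus.
sample = ["The", "jury", "said", "the", "jury", ",", "the", "end", "."]
print(top_words(sample, 2))
# [('the', 3), ('jury', 2)]

# With the Brown corpus downloaded, the helper can be applied to a category:
# from nltk.corpus import brown
# print(top_words(brown.words(categories="news"), 10))
```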

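Summary

The Natural Language Toolkit (nltk) is a Python library for natural language processing that offers a broad set of text processing features, including tokenization, part-of-speech tagging, stop word filtering, and feature extraction. It also gives access to several commonly used corpora for training and testing models.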