📜  Natural Language Toolkit - Tagging Text (1)

📅  Last modified: 2023-12-03 14:57:08.625000             🧑  Author: Mango

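The Natural Language Toolkit (NLTK) is a widely used Python library for text processing and natural language processing, with applications in text classification, semantic analysis, sentiment analysis, information extraction, and related areas. Among its many modules, the tagging module (nltk.tag) is one of the most important.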

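What tagging text means

Tagging text means assigning each word in a given text to a category, typically its part of speech: noun, verb, adjective, and so on. Because later steps such as parsing and information extraction build on these labels, tagging is an important stage in natural language processing.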
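To make the idea concrete, here is a minimal sketch of tagging as "look up a category for each word". It is a toy only: the `TOY_LEXICON` dictionary and the `toy_tag` function are hypothetical names invented for this illustration, and a fixed lookup table is not how NLTK's tagger works internally.

```python
# Toy illustration only: a hand-written lexicon, NOT NLTK's tagging method.
# TOY_LEXICON and toy_tag are hypothetical names made up for this sketch.
TOY_LEXICON = {
    "the": "DT",     # determiner
    "dog": "NN",     # singular noun
    "barks": "VBZ",  # verb, 3rd person singular present
    "loudly": "RB",  # adverb
}


def toy_tag(tokens, default="NN"):
    """Return (token, tag) pairs; unknown words fall back to `default`."""
    return [(tok, TOY_LEXICON.get(tok.lower(), default)) for tok in tokens]


tagged = toy_tag("The dog barks loudly".split())
print(tagged)
# [('The', 'DT'), ('dog', 'NN'), ('barks', 'VBZ'), ('loudly', 'RB')]
```

A real tagger also has to use context, since the same word can take different tags in different sentences. A plain lookup like this one cannot do that.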

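The tagging module in NLTK

In NLTK, the main entry point for tagging is the nltk.pos_tag() function, which takes a list of tokens and assigns a part-of-speech tag to each one.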

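import nltk

# One-time setup: the tokenizer and tagger models must be downloaded first.
# Resource names depend on the NLTK version, e.g. 'punkt' and
# 'averaged_perceptron_tagger' (newer releases use 'punkt_tab' and
# 'averaged_perceptron_tagger_eng').
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

text = 'John likes to watch movies. He likes action movies the most.'
tokens = nltk.word_tokenize(text)    # split the text into word and punctuation tokens
tagged_words = nltk.pos_tag(tokens)  # list of (token, tag) tuples
print(tagged_words)

# The result is a list of (token, tag) tuples; exact tags may vary with the model version:
# [('John', 'NNP'), ('likes', 'VBZ'), ('to', 'TO'), ('watch', 'VB'), ('movies', 'NNS'), ('.', '.'),
#  ('He', 'PRP'), ('likes', 'VBZ'), ('action', 'NN'), ('movies', 'NNS'), ('the', 'DT'), ('most', 'RBS'), ('.', '.')]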

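In this example, nltk.word_tokenize() first splits the text into individual tokens (words and punctuation marks), and nltk.pos_tag() then tags each token. The output shows that NLTK assigns a part-of-speech tag to every token, punctuation included.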
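Since nltk.pos_tag() returns a list of (token, tag) tuples, the result can be processed with ordinary Python. The sketch below assumes a hard-coded sample in that shape, so that it runs without NLTK installed. The tags in it are illustrative rather than captured from a real run.

```python
from collections import Counter

# Hard-coded sample in the same (token, tag) shape that nltk.pos_tag() returns.
# Assumption: these tags are illustrative, not the output of a real tagger run.
tagged_words = [
    ("John", "NNP"), ("likes", "VBZ"), ("to", "TO"),
    ("watch", "VB"), ("movies", "NNS"), (".", "."),
]

# Keep only the noun tokens: Penn Treebank noun tags all start with "NN".
nouns = [word for word, tag in tagged_words if tag.startswith("NN")]

# Count how often each tag occurs in the sample.
tag_counts = Counter(tag for _, tag in tagged_words)

print(nouns)              # ['John', 'movies']
print(tag_counts["VBZ"])  # 1
```

Filtering on the tag prefix rather than on an exact tag keeps NN, NNS, NNP, and NNPS together, which is usually what "all nouns" means in practice.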

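Common tag abbreviations

Each part of speech is represented by a short abbreviation. NLTK's default English tagger uses the Penn Treebank tag set, and the most common tags are listed below.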

| Tag | Part of speech | Examples |
| --- | --- | --- |
| CC | Coordinating conjunction | and, but, or |
| CD | Cardinal number | one, two, three |
| DT | Determiner | a, an, the |
| EX | Existential there | there |
| FW | Foreign word | bon appetit |
| IN | Preposition or subordinating conjunction | in, of, with |
| JJ | Adjective | big, tall, new |
| JJR | Adjective, comparative | bigger, taller, newer |
| JJS | Adjective, superlative | biggest, tallest, newest |
| LS | List item marker | A, B, C |
| MD | Modal | can, should, may |
| NN | Noun, singular or mass | dog, house, love |
| NNS | Noun, plural | dogs, houses, loves |
| NNP | Proper noun, singular | Barack, Maria |
| NNPS | Proper noun, plural | Obamas, Santas |
| PDT | Predeterminer | all, both, half |
| POS | Possessive ending | 's |
| PRP | Personal pronoun | I, you, he |
| RB | Adverb | quickly, loudly, often |
| RBR | Adverb, comparative | faster, louder, better |
| RBS | Adverb, superlative | fastest, loudest, best |
| RP | Particle | up, out, over |
| SYM | Symbol | +, =, % |
| TO | to | to |
| UH | Interjection | hi, oh, oops |
| VB | Verb, base form | walk, talk, eat |
| VBD | Verb, past tense | walked, talked, ate |
| VBG | Verb, gerund or present participle | walking, talking, eating |
| VBN | Verb, past participle | walked, talked, eaten |
| VBP | Verb, non-3rd person singular present | walk, talk, eat |
| VBZ | Verb, 3rd person singular present | walks, talks, eats |
| WDT | Wh-determiner | which, what |
| WP | Wh-pronoun | who, what |
| WP$ | Possessive wh-pronoun | whose |
| WRB | Wh-adverb | why, how |
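Related tags in this set share a prefix (NN, NNS, NNP, and NNPS are all nouns), so fine-grained tags can be collapsed into coarse classes by prefix. The sketch below is one way to do that. The `COARSE_PREFIXES` table, the `coarse_class` function, and the class names are hypothetical choices made for this illustration and are not part of NLTK.

```python
# Hypothetical helper, not an NLTK API: collapse fine-grained Penn Treebank
# tags into coarse classes by prefix. The class names are our own choice.
COARSE_PREFIXES = [
    ("NN", "noun"),
    ("VB", "verb"),
    ("JJ", "adjective"),
    ("RB", "adverb"),
]


def coarse_class(tag):
    """Return the coarse class for a tag, or 'other' if no prefix matches."""
    for prefix, name in COARSE_PREFIXES:
        if tag.startswith(prefix):
            return name
    return "other"


for tag in ["NNS", "VBD", "JJR", "RBS", "DT"]:
    print(tag, "->", coarse_class(tag))
# NNS -> noun
# VBD -> verb
# JJR -> adjective
# RBS -> adverb
# DT -> other
```

To look up what a tag means without consulting the table, NLTK also provides nltk.help.upenn_tagset(), which prints the description of a given tag once the corresponding tag-set data has been downloaded.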

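Conclusion

Tagging text is an important step in natural language processing, and NLTK makes it convenient: a single call to nltk.pos_tag() tags every token in a text. Knowing the tag abbreviations makes the tagger's output easier to read and lays the groundwork for later natural language processing tasks.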