OpenNLP - Referenced API

Last modified: 2020-11-23 03:53:11             Author: Mango


This chapter describes the classes and methods used in the subsequent chapters of this tutorial.

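Sentence Detection

SentenceModel class

This class represents the predefined model used to detect sentences in the given raw text. It belongs to the package opennlp.tools.sentdetect.

The constructor of this class accepts an InputStream object of the sentence detector model file (en-sent.bin).

SentenceDetectorME class

This class belongs to the package opennlp.tools.sentdetect and contains methods to split raw text into sentences. It uses a maximum entropy model to evaluate the end-of-sentence characters in a string and determine whether they mark the end of a sentence.

The important methods of this class are listed below.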

S.No  Methods and Description

1  sentDetect()
   Detects the sentences in the raw text passed to it. It accepts a String as a parameter and returns a String array holding the sentences found in that text.

2  sentPosDetect()
   Detects the positions of the sentences in the given text. It accepts a String representing the raw text and returns an array of Span objects.
   The Span class of the opennlp.tools.util package stores the start and end offsets of a span of text.

3  getSentenceProbabilities()
   Returns the probabilities associated with the most recent call to sentDetect().
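To make the offsets returned by sentPosDetect() concrete, the sketch below shows how a start and end pair maps back to a piece of the original text. It does not call OpenNLP: Offsets is a hypothetical stand-in for Span (start inclusive, end exclusive), and naiveSentenceSpans() is an illustrative punctuation rule, not the maximum entropy model that SentenceDetectorME uses.

```java
import java.util.ArrayList;
import java.util.List;

/** Standalone illustration of sentence offsets; it does not use OpenNLP. */
class SentenceSpanSketch {

    /** Hypothetical stand-in for a span: start is inclusive, end is exclusive. */
    static final class Offsets {
        final int start;
        final int end;

        Offsets(int start, int end) {
            this.start = start;
            this.end = end;
        }

        /** Returns the part of the text that these offsets cover. */
        String coveredText(String text) {
            return text.substring(start, end);
        }
    }

    /**
     * Naive rule: a sentence ends at '.', '!' or '?' when that character is
     * followed by whitespace or by the end of the text.
     */
    static List<Offsets> naiveSentenceSpans(String text) {
        List<Offsets> spans = new ArrayList<Offsets>();
        int start = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean endChar = c == '.' || c == '!' || c == '?';
            boolean atBoundary = i + 1 == text.length()
                    || Character.isWhitespace(text.charAt(i + 1));
            if (endChar && atBoundary) {
                spans.add(new Offsets(start, i + 1));
                // The next sentence starts at the first non-whitespace character.
                start = i + 1;
                while (start < text.length()
                        && Character.isWhitespace(text.charAt(start))) {
                    start++;
                }
            }
        }
        // Keep trailing text that has no closing punctuation.
        if (start < text.length()) {
            spans.add(new Offsets(start, text.length()));
        }
        return spans;
    }
}
```

For the text "Hi. How are you? Fine." the second pair of offsets is (4, 16), and the substring between them is "How are you?".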

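Tokenization

TokenizerModel class

This class represents the predefined model used to tokenize the given sentence. It belongs to the package opennlp.tools.tokenize.

The constructor of this class accepts an InputStream object of the tokenizer model file (en-token.bin).

Tokenizer classes

To perform tokenization, the OpenNLP library provides three main classes. All three implement the interface named Tokenizer.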

S.No  Classes and Description

1  SimpleTokenizer
   Tokenizes the given raw text using character classes.

2  WhitespaceTokenizer
   Uses whitespace to tokenize the given text.

3  TokenizerME
   Converts raw text into separate tokens. It uses maximum entropy to make its decisions.

These classes contain the following methods.

S.No  Methods and Description

1  tokenize()
   Tokenizes the raw text. It accepts a String as a parameter and returns an array of Strings (tokens).

2  tokenizePos()
   Returns the positions (spans) of the tokens. It accepts the sentence or raw text as a String and returns an array of Span objects.

In addition to the two methods above, the TokenizerME class has the getTokenProbabilities() method.

S.No  Methods and Description

1  getTokenProbabilities()
   Returns the probabilities associated with the most recent call to tokenize() or tokenizePos().
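The difference between tokenizing on whitespace and tokenizing by character class is easiest to see on a small input. The sketch below is a standalone illustration that does not call OpenNLP. It assumes a simplified reading of "character classes" (runs of letters, runs of digits, and every other non-whitespace character as its own token), so the library's SimpleTokenizer may differ in details. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Standalone comparison of two tokenization strategies; it does not use OpenNLP. */
class TokenizerSketch {

    /** Splits on runs of whitespace, so punctuation stays attached to words. */
    static String[] whitespaceTokens(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    /**
     * Simplified character-class rule: a token is a run of letters, a run of
     * digits, or a single character of any other non-whitespace kind.
     */
    static String[] characterClassTokens(String text) {
        List<String> tokens = new ArrayList<String>();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
                continue;
            }
            int start = i;
            if (Character.isLetter(c)) {
                while (i < text.length() && Character.isLetter(text.charAt(i))) {
                    i++;
                }
            } else if (Character.isDigit(c)) {
                while (i < text.length() && Character.isDigit(text.charAt(i))) {
                    i++;
                }
            } else {
                i++; // punctuation and symbols become one-character tokens
            }
            tokens.add(text.substring(start, i));
        }
        return tokens.toArray(new String[0]);
    }
}
```

For the input "Hi, it's 2020." the whitespace rule yields three tokens ("Hi,", "it's", "2020."), while the character-class rule yields seven ("Hi", ",", "it", "'", "s", "2020", ".").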

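Named Entity Recognition

TokenNameFinderModel class

This class represents the predefined model used to find named entities in the given sentence. It belongs to the package opennlp.tools.namefind.

The constructor of this class accepts an InputStream object of the name finder model file (en-ner-person.bin).

NameFinderME class

This class belongs to the package opennlp.tools.namefind and contains methods to perform NER tasks. It uses a maximum entropy model to find named entities in the given text.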

S.No  Methods and Description

1  find()
   Detects the names in the text. It accepts a String array holding the tokens of a sentence and returns an array of Span objects.

2  probs()
   Returns the probabilities of the last decoded sequence.
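Because find() works on tokens, the spans it returns refer to token positions rather than character offsets. The sketch below illustrates how such a start and end pair (end exclusive) can be turned back into the name it covers. It is a standalone helper with a hypothetical name, it does not call OpenNLP, and the token indices in the usage note are made up for illustration.

```java
import java.util.Arrays;

/** Standalone helper for token-based spans; it does not use OpenNLP. */
class NameSpanSketch {

    /**
     * Joins the tokens in the range [start, end) with single spaces.
     * start and end are token indices, not character offsets.
     */
    static String coveredTokens(String[] tokens, int start, int end) {
        return String.join(" ", Arrays.copyOfRange(tokens, start, end));
    }
}
```

With the tokens {"Mike", "and", "John", "Smith", "are", "friends"}, the range (2, 4) covers "John Smith" and the range (0, 1) covers "Mike".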

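Finding Parts of Speech

POSModel class

This class represents the predefined model used to tag the parts of speech of the given sentence. It belongs to the package opennlp.tools.postag.

The constructor of this class accepts an InputStream object of the POS tagger model file (en-pos-maxent.bin).

POSTaggerME class

This class belongs to the package opennlp.tools.postag and is used to predict the parts of speech of the given raw text. It uses maximum entropy to make its decisions.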

S.No  Methods and Description

1  tag()
   Assigns POS tags to the tokens of a sentence. It accepts an array of tokens (String) as a parameter and returns an array of tags.

2  probs()
   Returns the probabilities for each tag of the most recently tagged sentence.
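The array returned by tag() is parallel to the token array: the tag at index i belongs to the token at index i. A minimal sketch of pairing the two, assuming a "token_TAG" output format chosen here for illustration, could look as follows. The helper does not call OpenNLP, and the tags in the usage note are example values rather than real tagger output.

```java
/** Standalone helper for parallel token and tag arrays; it does not use OpenNLP. */
class TaggedSentenceSketch {

    /** Pairs each token with the tag at the same index, as "token_TAG". */
    static String pairTokensWithTags(String[] tokens, String[] tags) {
        if (tokens.length != tags.length) {
            throw new IllegalArgumentException(
                    "tokens and tags must have the same length");
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(tokens[i]).append('_').append(tags[i]);
        }
        return sb.toString();
    }
}
```

For the tokens {"Hi", "there"} and the example tags {"UH", "RB"}, the helper returns "Hi_UH there_RB".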

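Parsing the Sentence

ParserModel class

This class represents the predefined model used to parse the given sentence. It belongs to the package opennlp.tools.parser.

The constructor of this class accepts an InputStream object of the parser model file (en-parser-chunking.bin).

ParserFactory class

This class belongs to the package opennlp.tools.parser and is used to create parsers.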

S.No  Methods and Description

1  create()
   This static method creates a parser object. It accepts the ParserModel object loaded from the parser model file.

ParserTool class

This class belongs to the opennlp.tools.cmdline.parser package and is used to parse content.

S.No  Methods and Description

1  parseLine()
   This method of the ParserTool class parses raw text in OpenNLP. It accepts:

  • A String variable representing the text to be parsed.
  • A parser object.
  • An integer representing the number of parses to be carried out.

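Chunking

ChunkerModel class

This class represents the predefined model used to divide a sentence into smaller chunks. It belongs to the package opennlp.tools.chunker.

The constructor of this class accepts an InputStream object of the chunker model file (en-chunker.bin).

ChunkerME class

This class belongs to the package opennlp.tools.chunker and is used to divide the given sentence into smaller chunks.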

S.No  Methods and Description

1  chunk()
   Divides the given sentence into smaller chunks. It accepts the tokens of a sentence and their part-of-speech tags as parameters.

2  probs()
   Returns the probabilities of the last decoded sequence.
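Chunk tags are typically produced one per token, so a second step is needed to see the chunks themselves. The sketch below groups tokens into bracketed phrases, assuming the common B-/I-/O tagging convention (B-NP opens a noun phrase, I-NP continues it, O marks a token outside any chunk). It is a standalone illustration that does not call OpenNLP; the class name, method name, and sample tags are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Standalone grouping of tokens by B-/I-/O chunk tags; it does not use OpenNLP. */
class ChunkGroupingSketch {

    /** Returns chunks such as "[NP the deficit]"; tokens outside a chunk are kept bare. */
    static List<String> groupChunks(String[] tokens, String[] chunkTags) {
        List<String> chunks = new ArrayList<String>();
        StringBuilder current = null; // the chunk being built, if any
        for (int i = 0; i < tokens.length; i++) {
            String tag = chunkTags[i];
            if (tag.startsWith("B-")) {
                // A B- tag closes any open chunk and opens a new one.
                if (current != null) {
                    chunks.add(current.append(']').toString());
                }
                current = new StringBuilder("[" + tag.substring(2) + " " + tokens[i]);
            } else if (tag.startsWith("I-") && current != null) {
                // An I- tag extends the open chunk.
                current.append(' ').append(tokens[i]);
            } else {
                // "O" (or an I- tag with no open chunk) ends the open chunk.
                if (current != null) {
                    chunks.add(current.append(']').toString());
                    current = null;
                }
                chunks.add(tokens[i]);
            }
        }
        if (current != null) {
            chunks.add(current.append(']').toString());
        }
        return chunks;
    }
}
```

For the tokens {"He", "reckons", "the", "deficit", "."} with the sample tags {"B-NP", "B-VP", "B-NP", "I-NP", "O"}, the result is "[NP He]", "[VP reckons]", "[NP the deficit]" and ".".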