OpenNLP Tutorial (1)

Last modified: 2023-12-03 14:44:54.713000    Author: Mango

OpenNLP Tutorial

OpenNLP refers to Apache OpenNLP, a natural language processing (NLP) toolkit that provides a set of Java libraries for processing natural-language text.

Installation

OpenNLP can be installed in two ways: through Maven or manually.

Maven installation

Add the following dependency to the pom.xml file:

<dependency>
  <groupId>org.apache.opennlp</groupId>
  <artifactId>opennlp-tools</artifactId>
  <version>1.9.3</version>
</dependency>
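
Manual installation

Download the opennlp-tools distribution, extract it, and add the opennlp-tools JAR to the classpath.

Usage

The following examples show how to use the OpenNLP library. Each example loads a pretrained model file (for example en-sent.bin) from the working directory; the model files are downloaded separately from the library.

Sentence detection

Sentence detection uses the SentenceDetectorME class of the OpenNLP library: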

import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import java.io.FileInputStream;
import java.io.InputStream;

public class SentenceDetectionExample {

  public static void main(String[] args) {

    // Load the pretrained English sentence model from the working directory.
    try (InputStream modelIn = new FileInputStream("en-sent.bin")) {
      SentenceModel model = new SentenceModel(modelIn);
      SentenceDetectorME sentenceDetector = new SentenceDetectorME(model);

      String text = "This is a sentence. This is another sentence. Now is the time for all good men to come to the aid of their country.";
      // Split the text into one string per detected sentence.
      String[] sentences = sentenceDetector.sentDetect(text);

      for(String s: sentences) {
        System.out.println(s);
      }
    }
    catch (Exception e) {
      e.printStackTrace();
    }
  }
}

Output:

This is a sentence.
This is another sentence.
Now is the time for all good men to come to the aid of their country.

Tokenization

Tokenization uses the TokenizerME class of the OpenNLP library:

import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import java.io.FileInputStream;
import java.io.InputStream;

public class TokenizerExample {

  public static void main(String[] args) {

    // Load the pretrained English tokenizer model from the working directory.
    try (InputStream modelIn = new FileInputStream("en-token.bin")) {
      TokenizerModel model = new TokenizerModel(modelIn);

      TokenizerME tokenizer = new TokenizerME(model);

      String sentence = "This is a sentence.";

      // Split the sentence into tokens; punctuation becomes its own token.
      String[] tokens = tokenizer.tokenize(sentence);

      for (String token : tokens) {
        System.out.println(token);
      }
    }
    catch (Exception e) {
      e.printStackTrace();
    }
  }
}

Output:

This
is
a
sentence
.
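
For contrast, here is a minimal plain-Java sketch of naive whitespace splitting. The names WhitespaceSplitDemo and splitOnWhitespace are hypothetical and are not part of OpenNLP. This approach keeps the final period attached to the last word, whereas the TokenizerME output above separates it into its own token.

```java
import java.util.Arrays;

// Baseline for comparison only; plain Java, no OpenNLP classes involved.
class WhitespaceSplitDemo {

  // Splits on runs of whitespace, the simplest possible tokenization.
  static String[] splitOnWhitespace(String text) {
    return text.trim().split("\\s+");
  }

  public static void main(String[] args) {
    String[] tokens = splitOnWhitespace("This is a sentence.");
    System.out.println(Arrays.toString(tokens)); // prints [This, is, a, sentence.]
  }
}
```

A model-based tokenizer is useful precisely because punctuation, abbreviations, and contractions cannot be handled reliably by whitespace alone.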

Named entity recognition

Named entity recognition uses the NameFinderME class of the OpenNLP library:

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.Span;
import java.io.FileInputStream;
import java.io.InputStream;

public class NerExample {

  public static void main(String[] args) {

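    // Load the pretrained English person-name model from the working directory.
    try (InputStream modelIn = new FileInputStream("en-ner-person.bin")) {
      TokenNameFinderModel model = new TokenNameFinderModel(modelIn);

      NameFinderME nameFinder = new NameFinderME(model);

      // The name finder expects a sentence that has already been tokenized.
      String[] sentence = new String[]{"Pierre", "Vinken", "is", "61", "years", "old"};

      // Each span covers the token indices of one detected name.
      Span[] nameSpans = nameFinder.find(sentence);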

      for(Span s: nameSpans) {
        System.out.println(s.toString());
      }
    }
    catch (Exception e) {
      e.printStackTrace();
    }
  }
}

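Output (Span.toString() prints a half-open token range followed by the entity type; with the pretrained person model the result should look like this):

[0..2) person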
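
The spans returned by NameFinderME.find refer to positions in the token array, with an inclusive start index and an exclusive end index. The following plain-Java sketch shows how such a range maps back to text. SpanTextDemo and coveredText are hypothetical names used for illustration; in real code the indices would come from Span.getStart() and Span.getEnd(), and OpenNLP's Span.spansToStrings(Span[], String[]) performs this conversion directly.

```java
import java.util.Arrays;

// Hypothetical helper class for illustration; not part of OpenNLP.
class SpanTextDemo {

  // Joins the tokens covered by the half-open range [start, end).
  static String coveredText(String[] tokens, int start, int end) {
    return String.join(" ", Arrays.copyOfRange(tokens, start, end));
  }

  public static void main(String[] args) {
    String[] tokens = {"Pierre", "Vinken", "is", "61", "years", "old"};
    // A span with start 0 and end 2 covers tokens 0 and 1.
    System.out.println(coveredText(tokens, 0, 2)); // prints "Pierre Vinken"
  }
}
```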

Chunking

Phrase chunking uses the ChunkerME class of the OpenNLP library:

import opennlp.tools.chunker.ChunkerME;
import opennlp.tools.chunker.ChunkerModel;
import opennlp.tools.util.Span;
import java.io.FileInputStream;
import java.io.InputStream;

public class ChunkingExample {

  public static void main(String[] args) {

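    // Load the pretrained English chunker model from the working directory.
    try (InputStream modelIn = new FileInputStream("en-chunker.bin")) {
      ChunkerModel model = new ChunkerModel(modelIn);

      ChunkerME chunker = new ChunkerME(model);

      String[] sentence = new String[]{"Pierre", "Vinken", "is", "61", "years", "old"};

      // One part-of-speech tag per token, hard-coded here; normally produced by a POS tagger.
      String[] tags = new String[]{"NNP", "NNP", "VBZ", "CD", "NNS", "JJ"};

      // Group the tokens into phrase chunks, returned as token-index spans.
      Span[] spans = chunker.chunkAsSpans(sentence, tags);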

      for(Span s: spans) {
        System.out.println(s.toString());
      }
    }
    catch (Exception e) {
      e.printStackTrace();
    }
  }
}

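Output (each span prints as a half-open token range followed by the chunk type, such as NP or VP; the exact chunks depend on the model, but the result should look similar to this):

[0..2) NP
[2..3) VP
[3..5) NP
[5..6) ADJP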
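
ChunkerME also offers a chunk(String[], String[]) method that returns one tag per token in BIO style (for example B-NP, I-NP, O) instead of spans. The sketch below shows, under stated assumptions, how such tags can be grouped into spans. BioChunkDemo and toSpans are hypothetical names, the input tags are assumed for illustration rather than taken from a real model run, and this is not the library's own implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class for illustration; not part of OpenNLP.
class BioChunkDemo {

  // Groups BIO tags such as "B-NP", "I-NP", "O" into "[start..end) TYPE" strings.
  // A stray "I-X" with no open chunk of type X is leniently treated as a chunk start.
  static List<String> toSpans(String[] bioTags) {
    List<String> spans = new ArrayList<>();
    int start = -1;      // start index of the open chunk, -1 if none
    String type = null;  // type of the open chunk, null if none
    for (int i = 0; i <= bioTags.length; i++) {
      // A sentinel "O" after the last tag closes any chunk that is still open.
      String tag = i < bioTags.length ? bioTags[i] : "O";
      boolean continues = type != null && tag.equals("I-" + type);
      if (!continues) {
        if (type != null) {
          spans.add("[" + start + ".." + i + ") " + type);
          type = null;
        }
        if (tag.startsWith("B-") || tag.startsWith("I-")) {
          start = i;
          type = tag.substring(2);
        }
      }
    }
    return spans;
  }

  public static void main(String[] args) {
    // Tags assumed for illustration, not produced by a real model run.
    String[] tags = {"B-NP", "I-NP", "B-VP", "B-NP", "I-NP", "B-ADJP"};
    toSpans(tags).forEach(System.out::println);
  }
}
```

The per-token form is convenient when chunk labels are needed as features for a later step, while the span form is easier to read and to map back to text.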
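
Summary

This tutorial covers some of the key use cases of OpenNLP rather than a complete list of its features. With this knowledge, programmers can quickly integrate natural language processing into their applications and process text data effectively.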