📜  Lucene - Analysis

📅  Last modified: 2020-11-12 04:47:49    🧑  Author: Mango


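In the previous chapters, we saw that Lucene's IndexWriter uses an Analyzer to analyze documents and then creates, opens, or edits indexes as required. In this chapter, we discuss the various types of Analyzer objects and the other related objects used during the analysis process. Understanding the analysis process and how analyzers work will give you deeper insight into how Lucene indexes documents.

The following is the list of objects that we will discuss in due course.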

| S.No. | Class | Description |
|---|---|---|
| 1 | Token | Represents a piece of text (typically a word) in a document, together with its metadata: position, start offset, end offset, token type, and position increment. |
| 2 | TokenStream | The output of the analysis process; it consists of a series of tokens. It is an abstract class. |
| 3 | Analyzer | The abstract base class for every type of analyzer. |
| 4 | WhitespaceAnalyzer | Splits the text of a document on whitespace. |
| 5 | SimpleAnalyzer | Splits the text of a document on non-letter characters and converts the text to lowercase. |
| 6 | StopAnalyzer | Works like SimpleAnalyzer and additionally removes common words such as 'a', 'an', and 'the'. |
| 7 | StandardAnalyzer | The most sophisticated analyzer in this list; it is designed to handle names, email addresses, and similar text. It lowercases each token and removes punctuation and, depending on the Lucene version and the configured stop-word set, common words. |
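To make the differences between the first three analyzers concrete, here is a minimal sketch in plain Java. It does not use the Lucene API; the class `AnalyzerSketch`, its methods, and the three-word stop list are hypothetical names invented for this illustration, and they only approximate the splitting rules described in the table. Real Lucene analyzers produce a TokenStream with offsets and position increments, which this sketch leaves out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Simplified stand-ins for three analyzers. This is NOT Lucene code;
// every name here is an assumption made for illustration only.
class AnalyzerSketch {

    // Hypothetical stop-word list, taken from the examples in the table.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the"));

    // WhitespaceAnalyzer-like: split on runs of whitespace,
    // keeping case and punctuation untouched.
    static List<String> whitespace(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // SimpleAnalyzer-like: split on runs of non-letter characters,
    // then lowercase each token.
    static List<String> simple(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("\\P{L}+")) {
            if (!t.isEmpty()) {
                tokens.add(t.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // StopAnalyzer-like: same as simple(), then drop the stop words.
    static List<String> stop(String text) {
        List<String> tokens = simple(text);
        tokens.removeIf(STOP_WORDS::contains);
        return tokens;
    }

    public static void main(String[] args) {
        String text = "The quick-brown Fox sent an e-mail.";
        System.out.println(whitespace(text)); // [The, quick-brown, Fox, sent, an, e-mail.]
        System.out.println(simple(text));     // [the, quick, brown, fox, sent, an, e, mail]
        System.out.println(stop(text));       // [quick, brown, fox, sent, e, mail]
    }
}
```

Note how the choice of analyzer changes what ends up in the index: the whitespace-style split keeps `e-mail.` as one token including the period, while the letter-based split breaks it into `e` and `mail`. The same analyzer should therefore be used for indexing and for querying, otherwise query terms may not match the indexed tokens.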