📜  Python | Positional Index

📅  Last Modified: 2021-04-17 03:59:33             🧑  Author: Mango

This article builds on constructing an inverted index for an information retrieval (IR) system. However, in a real-life IR system we not only encounter single-word queries (such as "dog", "computer", or "alex") but also phrase queries (such as "winter is coming", "new york", or "where is kevin"). To handle such queries, an inverted index alone is not sufficient.

To better understand the motivation, consider the user query "saint mary school". An inverted index gives us the lists of documents that contain the terms "saint", "mary", and "school" independently. What we actually need, however, are the documents in which the entire phrase "saint mary school" occurs verbatim. To answer such queries successfully, we need an index that also stores the positions at which the terms occur.

Postings List
For an inverted index, a postings list is the list of documents in which a term occurs. It is typically sorted by docID and stored in the form of a linked list.

As an example, consider the postings list for the term "hello": it indicates that "hello" occurs in the documents with docIDs 3, 5, 10, 23, and 27, and it also records the document frequency, 5. Shown below is a sample Python data format for the same, using a dictionary and a list to store the postings list.

{"hello" : [5, [3, 5, 10, 23, 27] ] }

In the case of a positional index, the positions at which the term occurs in a particular document are stored along with the docID.

The same postings list, implemented for a positional index, now also records the positions of the term "hello" in each document. For example, "hello" occurs at three positions in document 3: 120, 125, and 278. In addition, the frequency of the term within each document is stored. Shown below is the same sample Python data format.

{"hello" : [5, [ {3 : [3, [120, 125, 278]]}, {5 : [1, [28] ] }, {10 : [2, [132, 182]]}, {23 : [3, [0, 12, 28]]}, {27 : [1, [2]]} ] }

For simplicity, the per-document term frequency can also be omitted (as is done in the sample code below). The data format then looks as follows.

{"hello" : [5, {3 : [120, 125, 278]}, {5 : [28]}, {10 : [132, 182]}, {23 : [0, 12, 28]}, {27 : [2]} ] }

Steps to Build a Positional Index

  • Fetch the document.
  • Remove stop words and stem the resulting words.
  • If the word already exists in the dictionary, add the document and the corresponding positions at which it occurs. Otherwise, create a new entry.
  • Also update the frequency of the word in each document, as well as the number of documents it occurs in.

Code
To implement the positional index, we use a sample dataset called "20 Newsgroups".

# importing libraries
import os
import string

from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer
from natsort import natsorted
  
def read_file(filename):
    # The 'with' statement closes the file automatically.
    with open(filename, 'r', encoding="ascii", errors="surrogateescape") as f:
        stuff = f.read()

    # Remove header and footer.
    stuff = remove_header_footer(stuff)

    return stuff
  
def remove_header_footer(final_string):
    new_final_string = ""
    tokens = final_string.split('\n\n')

    # Drop the first block (header) and the last block (footer).
    for token in tokens[1:-1]:
        new_final_string += token + " "
    return new_final_string
  
def preprocessing(final_string):
    # Tokenize.
    tokenizer = TweetTokenizer()
    token_list = tokenizer.tokenize(final_string)

    # Remove tabs.
    table = str.maketrans('', '', '\t')
    token_list = [word.translate(table) for word in token_list]

    # Remove punctuation, keeping apostrophes so contractions survive.
    punctuations = (string.punctuation).replace("'", "")
    trans_table = str.maketrans('', '', punctuations)
    stripped_words = [word.translate(trans_table) for word in token_list]

    # Drop tokens that became empty strings.
    token_list = [word for word in stripped_words if word]

    # Change to lowercase.
    token_list = [word.lower() for word in token_list]
    return token_list
  
# In this example, we create the positional index for only 1 folder.
folder_names = ["comp.graphics"]
  
# Initialize the stemmer.
stemmer = PorterStemmer()
  
# Initialize the file no.
fileno = 0
  
# Initialize the dictionary.
pos_index = {}
  
# Initialize the file mapping (fileno -> file name).
file_map = {}
  
for folder_name in folder_names:
  
    # Open files.
    file_names = natsorted(os.listdir("20_newsgroups/" + folder_name))
  
    # For every file.
    for file_name in file_names:
  
        # Read file contents.
        stuff = read_file("20_newsgroups/" + folder_name + "/" + file_name)
          
        # This is the list of words in order of the text.
        # We need to preserve the order because we require positions.
        # The 'preprocessing' function does tokenization, basic
        # punctuation removal and lowercasing.
        final_token_list = preprocessing(stuff)
  
        # For each position and term in the token list.
        for pos, term in enumerate(final_token_list):

            # First stem the term.
            term = stemmer.stem(term)

            # If the term already exists in the positional index dictionary.
            if term in pos_index:

                # Increment the total frequency by 1.
                pos_index[term][0] = pos_index[term][0] + 1

                # Check if the term has appeared in this docID before.
                if fileno in pos_index[term][1]:
                    pos_index[term][1][fileno].append(pos)

                else:
                    pos_index[term][1][fileno] = [pos]

            # If the term does not exist in the positional index dictionary
            # (first encounter).
            else:

                # Initialize the list.
                pos_index[term] = []
                # The total frequency is 1.
                pos_index[term].append(1)
                # The postings list is initially empty.
                pos_index[term].append({})
                # Add the doc ID to the postings list.
                pos_index[term][1][fileno] = [pos]
  
        # Map the file no. to the file name.
        file_map[fileno] = "20_newsgroups/" + folder_name + "/" + file_name
  
        # Increment the file no. counter for document ID mapping              
        fileno += 1
  
# Sample positional index to test the code.
sample_pos_idx = pos_index["andrew"]
print("Positional Index")
print(sample_pos_idx)
  
file_list = sample_pos_idx[1]
print("Filename, [Positions]")
for fileno, positions in file_list.items():
    print(file_map[fileno], positions)

Output:

Positional Index
[10, {215: [2081], 539: [66], 591: [879], 616: [462, 473], 680: [135], 691: [2081], 714: [4], 809: [333], 979: [0]}]
Filename, [Positions]
20_newsgroups/comp.graphics/38376 [2081]
20_newsgroups/comp.graphics/38701 [66]
20_newsgroups/comp.graphics/38753 [879]
20_newsgroups/comp.graphics/38778 [462, 473]
20_newsgroups/comp.graphics/38842 [135]
20_newsgroups/comp.graphics/38853 [2081]
20_newsgroups/comp.graphics/38876 [4]
20_newsgroups/comp.graphics/38971 [333]
20_newsgroups/comp.graphics/39663 [0]
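
As a possible next step, the index built above can be used to answer the phrase queries that motivated it. Below is a minimal sketch, assuming pos_index, file_map, stemmer, and preprocessing from the code above; the helper name phrase_query and the sample query are our own choices for illustration.

# A sketch of answering a phrase query with the index built above.
def phrase_query(phrase, pos_index):
    # Preprocess the query exactly like the documents were processed.
    terms = [stemmer.stem(term) for term in preprocessing(phrase)]
    if not terms or not all(term in pos_index for term in terms):
        return []
    # Candidate documents start as the postings of the first term.
    result = {doc: set(positions)
              for doc, positions in pos_index[terms[0]][1].items()}
    # For every following term, keep only the documents where it occurs
    # exactly one position after the phrase matched so far.
    for term in terms[1:]:
        postings = pos_index[term][1]
        result = {doc: {p + 1 for p in positions if p + 1 in postings[doc]}
                  for doc, positions in result.items() if doc in postings}
        result = {doc: pos for doc, pos in result.items() if pos}
    return sorted(result)

# Example usage (query terms are stemmed before lookup).
for doc in phrase_query("computer graphics", pos_index):
    print(file_map[doc])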