Correcting Words using NLTK in Python

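NLTK, short for Natural Language Toolkit, is a suite of libraries and programs for statistical natural language processing. Its libraries support tokenization, classification, parsing, stemming, tagging, semantic reasoning, and more, giving programs the tools they need to work with human language.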

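We will correct spelling with two methods. Each method takes a list of misspelled words and suggests a correct word for each one. For every misspelled word, it searches a list of correctly spelled words for the word that has the smallest distance to it and starts with the same letter, then returns that word. The two methods differ only in the distance measure they use to find the closest word. The 'words' corpus from nltk serves as the dictionary of correct words.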

Method 1: Using Jaccard distance

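Jaccard distance is the complement of the Jaccard coefficient and measures how dissimilar two sample sets are. We obtain it by subtracting the Jaccard coefficient from 1, or equivalently by dividing the difference between the size of the union and the size of the intersection of the two sets by the size of the union. Here the sets are built from Q-grams, which are sequences of characters rather than tokens (the character-level counterpart of N-grams). The Jaccard distance is given by the following formula.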

$D_J(A,B) = 1 - J(A,B) = \frac{|A \cup B| - |A \cap B|}{|A \cup B|}$
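To make the formula concrete, here is a minimal pure-Python sketch that evaluates it on character bigrams. The helpers `char_bigrams` and `jaccard_distance` are hand-written for illustration and are not the nltk functions; they only assume that both words have at least two characters.

```python
def char_bigrams(word):
    """Return the set of adjacent character pairs in word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def jaccard_distance(a, b):
    """Jaccard distance of two non-empty sets: (|A u B| - |A n B|) / |A u B|."""
    union = a | b
    return (len(union) - len(a & b)) / len(union)


# 'azmaing' and 'amazing' share 4 of the 8 distinct bigrams in their union.
print(jaccard_distance(char_bigrams("azmaing"), char_bigrams("amazing")))  # 0.5

# The repeated 'pp' collapses in a set, so both words give the same bigrams.
print(jaccard_distance(char_bigrams("happpy"), char_bigrams("happy")))     # 0.0
```

The second result shows a property of the set-based measure: a doubled letter that only repeats an existing bigram does not change the distance at all.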

Stepwise implementation

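Step 1: First, we install and import the nltk suite and the Jaccard distance metric discussed above. 'ngrams', imported from nltk.util, returns the sequences of n consecutive items in its input; applied to a string, it yields character n-grams.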

Python3

# importing the nltk suite 
import nltk
  
# importing jaccard distance
# and ngrams from nltk.util
from nltk.metrics.distance import jaccard_distance
from nltk.util import ngrams

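Step 2: Now, we download the 'words' resource (which contains a list of correctly spelled words) with the nltk downloader, import it through nltk.corpus, and assign the word list to correct_words.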

Python3

# Downloading and importing
# package 'words' from nltk corpus
nltk.download('words')
from nltk.corpus import words
  
  
correct_words = words.words()

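Step 3: We define the list of misspelled words to correct. Then we loop over each misspelled word and compute its Jaccard distance, on character bigrams, to every correctly spelled word that starts with the same letter. We sort the results in ascending order so that the smallest distance comes first, then extract the corresponding word and print it.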

Python3

# list of incorrect spellings
# that need to be corrected 
incorrect_words=['happpy', 'azmaing', 'intelliengt']
  
# loop for finding correct spellings
# based on jaccard distance
# and printing the correct word
for word in incorrect_words:
    temp = [(jaccard_distance(set(ngrams(word, 2)),
                              set(ngrams(w, 2))),w)
            for w in correct_words if w[0]==word[0]]
    print(sorted(temp, key = lambda val:val[0])[0][1])

Output:

[Screenshot: output after running the Jaccard distance method to find the correctly spelled words]
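The loop in Step 3 can also be packaged as a function. The sketch below is one possible arrangement, not the article's code: `suggest` is a hypothetical helper, the distance is a hand-written stand-in for nltk's `jaccard_distance`, and the six-word `vocabulary` is a made-up substitute for `words.words()` so the example runs without downloading the corpus. It also returns `None` when no dictionary word shares the first letter, a case in which indexing `[0]` on an empty sorted list would raise an `IndexError`.

```python
def char_bigrams(word):
    """Return the set of adjacent character pairs in word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def jaccard_distance(a, b):
    """Set-based Jaccard distance; two empty sets are treated as identical."""
    union = a | b
    if not union:
        return 0.0
    return (len(union) - len(a & b)) / len(union)


def suggest(word, vocabulary):
    """Return the closest word with the same first letter, or None.

    Assumes word is a non-empty string.
    """
    candidates = [w for w in vocabulary if w and w[0] == word[0]]
    if not candidates:
        return None
    target = char_bigrams(word)
    return min(candidates,
               key=lambda w: jaccard_distance(target, char_bigrams(w)))


# Hypothetical stand-in for the nltk 'words' corpus.
vocabulary = ["happy", "harpy", "amazing", "amazon",
              "intelligent", "intelligence"]

for word in ["happpy", "azmaing", "intelliengt", "zzz"]:
    print(word, "->", suggest(word, vocabulary))
```

Using `min` instead of sorting the whole candidate list picks the same first-smallest entry while avoiding a full sort.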

Method 2: Using edit distance

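Edit distance measures the dissimilarity between two strings as the minimum number of operations needed to transform one string into the other. The allowed transformations are:

Inserting a new character:

bat -> bats (insertion of 's')

Deleting an existing character:

care -> car (deletion of 'e')

Substituting an existing character:

bin -> bit (substitution of n with t)

Transposing two adjacent existing characters (nltk's edit_distance counts this as a single operation only when called with transpositions=True):

sing -> sign (transposition of ng to gn)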
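The four operations above can be traced with a small dynamic-programming sketch. This is a hand-written illustration (the optimal string alignment variant), not nltk's implementation; the function name `levenshtein` and the parameter `allow_transposition` are assumptions chosen for this example.

```python
def levenshtein(s1, s2, allow_transposition=False):
    """Minimum number of single-character edits that turn s1 into s2."""
    rows, cols = len(s1) + 1, len(s2) + 1
    # dist[i][j] holds the cost of turning s1[:i] into s2[:j].
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete all i characters
    for j in range(cols):
        dist[0][j] = j  # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = dist[i - 1][j - 1] + (s1[i - 1] != s2[j - 1])
            dist[i][j] = min(dist[i - 1][j] + 1,  # deletion
                             dist[i][j - 1] + 1,  # insertion
                             substitution)
            # Optionally treat a swap of two adjacent characters as one edit.
            if (allow_transposition and i > 1 and j > 1
                    and s1[i - 1] == s2[j - 2] and s1[i - 2] == s2[j - 1]):
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)
    return dist[-1][-1]


print(levenshtein("bat", "bats"))    # 1 (insertion)
print(levenshtein("care", "car"))    # 1 (deletion)
print(levenshtein("bin", "bit"))     # 1 (substitution)
print(levenshtein("sing", "sign"))   # 2 (two substitutions)
print(levenshtein("sing", "sign", allow_transposition=True))  # 1
```

Without the transposition rule, swapping two neighbouring letters costs two edits, which matters for typos such as 'intelliengt'.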

Stepwise implementation

Step 1: First, we install and import the nltk suite, along with its edit distance function.

Python3

# importing the nltk suite 
import nltk
  
# importing edit distance
from nltk.metrics.distance import edit_distance

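Step 2: Now, we download the 'words' resource (which contains the correctly spelled words) with the nltk downloader, import it through nltk.corpus, and assign the word list to correct_words.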

Python3

# Downloading and importing package 'words'
nltk.download('words')
from nltk.corpus import words
correct_words = words.words()

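Step 3: We define the list of misspelled words to correct. Then we loop over each misspelled word and compute its edit distance to every correctly spelled word that starts with the same letter. We sort the results in ascending order so that the smallest distance comes first, then extract the corresponding word and print it.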

Python3

# list of incorrect spellings
# that need to be corrected 
incorrect_words=['happpy', 'azmaing', 'intelliengt']
  
# loop for finding correct spellings
# based on edit distance and
# printing the correct words
for word in incorrect_words:
    temp = [(edit_distance(word, w),w) for w in correct_words if w[0]==word[0]]
    print(sorted(temp, key = lambda val:val[0])[0][1])

Output:

[Screenshot: output after running the edit distance method to find the correctly spelled words]
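Because edit distance is a whole number, several dictionary words often tie for the smallest distance, and the loop above prints only the first of them. The sketch below returns the closest few candidates instead. It is an illustration under stated assumptions: `levenshtein` is a compact hand-written distance (insertions, deletions, and substitutions only), `top_suggestions` is a hypothetical helper, and the short `vocabulary` list is a made-up substitute for `words.words()`.

```python
import heapq


def levenshtein(s1, s2):
    """Edit distance using insertions, deletions and substitutions only."""
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            current.append(min(previous[j] + 1,              # deletion
                               current[j - 1] + 1,           # insertion
                               previous[j - 1] + (c1 != c2)))  # substitution
        previous = current
    return previous[-1]


def top_suggestions(word, vocabulary, n=3):
    """Return up to n (distance, candidate) pairs, closest first.

    Ties are broken alphabetically because the pairs are compared as tuples.
    """
    scored = ((levenshtein(word, w), w)
              for w in vocabulary if w and w[0] == word[0])
    return heapq.nsmallest(n, scored)


# Hypothetical stand-in for the nltk 'words' corpus.
vocabulary = ["bat", "bit", "bet", "bats", "car", "care"]

print(top_suggestions("bzt", vocabulary))
# [(1, 'bat'), (1, 'bet'), (1, 'bit')]
```

Showing the tied candidates makes it easier to see when the single printed suggestion was an arbitrary choice among equally close words.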