Python – Efficient Text Data Cleaning

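Gone are the days when most of our data came in rows and columns, that is, as structured data. Today more unstructured data is collected than structured data. We have data in the form of text, images, audio and so on, and the ratio of structured to unstructured data has been falling for years. Unstructured data is growing at 55-65% per year.

We therefore need to learn how to work with unstructured data so that we can extract relevant information from it and make it useful. When working with text data, it is important to preprocess it before using it for prediction or analysis.
In this article, we will learn various text data cleaning techniques using Python.

Let's take a tweet as an example: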

I enjoyd the event which took place yesteday & I luvd it ! The link to the show is 
http://t.co/4ftYom0i It's awesome you'll luv it #HadFun #Enjoyed BFN GN

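We will clean this tweet step by step.

Data cleaning steps

1) Removing HTML characters: Many HTML entities such as &#39;, &amp; and &lt; can be found in most of the data available on the web. We need to remove these from our data. You can do this in two ways:

By using specific regular expressions, or
By using an available module or package (Python's html package)

We will use the module that already ships with Python.

Code: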

python3

# Unescape HTML entities with the standard library
import html

# the sample tweet contains the HTML entity &amp;
tweet = "I enjoyd the event which took place yesteday &amp; I lovdddd itttt ! The link to the show is http://t.co/4ftYom0i It's awesome you'll luv it #HadFun #Enjoyed BFN GN"
# html.unescape replaces HTMLParser().unescape, which was removed in Python 3.9
tweet = html.unescape(tweet)
print("After removing HTML characters the tweet is:-\n{}".format(tweet))
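
The first option listed above, regular expressions, could look like the following minimal sketch. ENTITY_MAP and unescape_with_regex are hypothetical names introduced for illustration, and the table covers only a handful of entities, so this is not a general replacement for the standard library's html.unescape, which the sketch uses as a cross-check.

```python
import html
import re

# Hypothetical, deliberately tiny entity table; real web data has many more.
ENTITY_MAP = {"&amp;": "&", "&lt;": "<", "&gt;": ">", "&quot;": '"', "&#39;": "'"}
ENTITY_RE = re.compile("|".join(re.escape(entity) for entity in ENTITY_MAP))


def unescape_with_regex(text):
    """Replace each known entity with its character in a single pass."""
    return ENTITY_RE.sub(lambda match: ENTITY_MAP[match.group(0)], text)


sample = "yesteday &amp; I luvd it ! It&#39;s awesome &lt;3"
regex_result = unescape_with_regex(sample)
module_result = html.unescape(sample)
print(regex_result)
print(regex_result == module_result)
```

The module route is usually the safer choice, because it knows every named and numeric entity without a hand-maintained table.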

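2) Encoding and decoding data: Encoding converts text from human-readable characters into a byte representation, and decoding converts it back. Text data comes in different encodings, such as "UTF-8" and "ASCII". We should keep our data in one standard encoding; the most common is UTF-8.

The given tweet is already in UTF-8, so to illustrate the process we encode it to ASCII and then decode it back.

Code:

python3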

# Encode the str to ASCII bytes; characters outside ASCII are dropped ('ignore')
encode_tweet = tweet.encode('ascii', 'ignore')
print("encode_tweet = \n{}".format(encode_tweet))

# Decode the bytes back to a str (ASCII bytes are also valid UTF-8)
decode_tweet = encode_tweet.decode(encoding='UTF-8')
print("decode_tweet = \n{}".format(decode_tweet))
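
Because the sample tweet contains only ASCII characters, the round trip above leaves it unchanged. The following small sketch, using a made-up string that is not part of the tweet, shows what the 'ignore' error handler does when non-ASCII characters are present, and how 'replace' differs.

```python
# Made-up sample with two non-ASCII characters: e-acute and a coffee-cup symbol.
sample = "caf\u00e9 \u2615 time"

# 'ignore' silently drops every character that ASCII cannot represent.
ignored = sample.encode("ascii", "ignore")

# 'replace' substitutes a question mark for each such character instead.
replaced = sample.encode("ascii", "replace")

# Decoding the ASCII bytes gives back a str, minus the dropped characters.
round_trip = ignored.decode("utf-8")

print(ignored)
print(replaced)
print(round_trip)
```

Note that 'ignore' loses information without any warning, so it is worth checking whether the dropped characters matter for the analysis at hand.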

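3) Removing URLs, hashtags and styles: A text dataset can contain hyperlinks, hashtags, or styles such as the retweet marker in Twitter data. These carry no relevant information and can be removed. For hashtags, only the hash sign "#" is removed. We will use the re library for the regular expression operations.

Code:

python3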

# library for regular expressions
import re

# remove hyperlinks
tweet = re.sub(r'https?://\S+', '', tweet)

# remove hashtags
# only the hash sign # is removed; the word itself is kept
tweet = re.sub(r'#', '', tweet)

# remove old-style retweet text "RT" at the start of the tweet
tweet = re.sub(r'^RT\s+', '', tweet)

print("After removing hashtags, URLs and styles the tweet is:-\n{}".format(tweet))
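
As a sketch of how these three substitutions might be packaged for reuse, the function below applies them in the same order to a made-up string. The name strip_twitter_markup is hypothetical, and the final whitespace normalization is an extra step that the code above does not perform; it is added here only because removing a URL leaves a double space behind.

```python
import re


def strip_twitter_markup(text):
    """Remove URLs, hash signs and a leading retweet marker from a tweet."""
    text = re.sub(r"https?://\S+", "", text)  # drop hyperlinks
    text = re.sub(r"#", "", text)             # keep the hashtag word, drop '#'
    text = re.sub(r"^RT\s+", "", text)        # drop a leading old-style "RT"
    # Extra step (not in the article's code): collapse runs of whitespace.
    return " ".join(text.split())


sample = "RT The show is http://t.co/4ftYom0i awesome #HadFun"
cleaned = strip_twitter_markup(sample)
print(cleaned)
```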

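4) Contraction replacement: Text data may contain apostrophes used for contractions, for example "didn't" for "did not". Contractions can change how a word or sentence is interpreted, so we replace them with their standard expanded forms. To do this, we can keep a dictionary that maps each contraction to its replacement and apply it.

A few of the contractions used are:
n't --> not        'll --> will
's  --> is         'd  --> would
'm  --> am         've --> have
're --> are

Code:

python3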

#dictionary consisting of the contraction and the actual value
Apos_dict={"'s":" is","n't":" not","'m":" am","'ll":" will",
           "'d":" would","'ve":" have","'re":" are"}
 
#replace the contractions
for key,value in Apos_dict.items():
    if key in tweet:
        tweet=tweet.replace(key,value)
 
print("After Contraction replacement the tweet is:-\n{}".format(tweet))
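
Plain substring replacement also rewrites text that merely looks like a contraction, such as an opening quotation mark followed by "s". One possible refinement, sketched below under the assumption that a contraction suffix always ends a word, is to compile the same dictionary into a single regular expression anchored with a word boundary. APOS_DICT mirrors the dictionary above; CONTRACTION_RE and expand_contractions are hypothetical names.

```python
import re

APOS_DICT = {"'s": " is", "n't": " not", "'m": " am", "'ll": " will",
             "'d": " would", "'ve": " have", "'re": " are"}

# Match a contraction suffix only when it is followed by a word boundary.
CONTRACTION_RE = re.compile(
    "(" + "|".join(re.escape(key) for key in APOS_DICT) + r")\b"
)


def expand_contractions(text):
    """Expand the suffixes listed in APOS_DICT in one pass over the text."""
    return CONTRACTION_RE.sub(lambda match: APOS_DICT[match.group(1)], text)


expanded = expand_contractions("It's awesome you'll luv it, I didn't know")
print(expanded)
```

This sketch shares the limits of the dictionary itself: possessives such as "John's" still become "John is", and irregular forms such as "can't" or "won't" would need entries of their own.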

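5) Splitting attached words: Some words are joined together, for example "ForTheWin". They need to be separated before we can extract meaning from them. After splitting, this becomes "For The Win".

Code:

python3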

import re
#separate the words
tweet = " ".join([s for s in re.split("([A-Z][a-z]+[^A-Z]*)",tweet) if s])
print("After splitting attached words the tweet is:-\n{}".format(tweet))
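
A different way to get the same effect, shown here as a sketch rather than as the article's method, is to insert a space at every position where a lowercase letter is directly followed by an uppercase letter. The function name split_attached_words is hypothetical. Unlike the split-and-join approach, this variant leaves existing spacing and all-caps tokens such as "BFN" untouched.

```python
import re


def split_attached_words(text):
    """Insert a space between a lowercase letter and a following capital."""
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", text)


print(split_attached_words("ForTheWin"))
print(split_attached_words("HadFun Enjoyed BFN GN"))
```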

6) Converting to lower case: Convert the text to lower case to avoid case-sensitivity issues.

Code:

python3

#convert to lower case
tweet=tweet.lower()
print("After converting to lower case the tweet is:-\n{}".format(tweet))

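7) Slang lookup: Slang is widely used today and shows up in text data, so we need to replace slang words with their meanings. We can use a dictionary of slang words, as we did for contraction replacement, or we can create a file that lists them. Examples of slang words are:

asap --> as soon as possible
b4   --> before
lol  --> laugh out loud
luv  --> love
wtg  --> way to go

We use a file of such words, slang.txt, which the code reads as one slang=meaning pair per line.

Code:

python3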

# read slang.txt; each non-empty line is expected to look like slang=meaning
slang_dict = {}
with open("slang.txt", "r") as file:
    for line in file:
        line = line.strip()
        if "=" in line:
            slang_word, meaning = line.split("=", 1)
            # keep the first meaning if a slang word is listed twice
            slang_dict.setdefault(slang_word, meaning)

# replace each slang word in the tweet with its meaning
tweet_tokens = tweet.split()
for i, word in enumerate(tweet_tokens):
    if word in slang_dict:
        tweet_tokens[i] = slang_dict[word]

tweet = " ".join(tweet_tokens)
print("After slang replacement the tweet is:-\n{}".format(tweet))
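
The dictionary option mentioned in this step needs no file at all. The sketch below hard-codes the five example entries listed earlier; SLANG and replace_slang are hypothetical names, and a real slang list would be far longer.

```python
# The five example entries from the list above, as an in-memory dictionary.
SLANG = {
    "asap": "as soon as possible",
    "b4": "before",
    "lol": "laugh out loud",
    "luv": "love",
    "wtg": "way to go",
}


def replace_slang(text, slang=SLANG):
    """Replace each whitespace-separated token that is a known slang word."""
    return " ".join(slang.get(token, token) for token in text.split())


replaced = replace_slang("you will luv it asap lol")
print(replaced)
```

The lookup is exact and case-sensitive, which is why it fits after the lower-casing step.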

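8) Standardizing and spell check: The text may contain misspelled or badly formatted words, for example "drivng" for "driving" or "I misssss this" for "I miss this". We can correct these with Python's autocorrect library; other libraries are available as well. First, install the library with the following command:

#install autocorrect library
 pip install autocorrect

Code:

python3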

import itertools
#One letter in a word should not be present more than twice in continuation
tweet = ''.join(''.join(s)[:2] for _, s in itertools.groupby(tweet))
print("After standardizing the tweet is:-\n{}".format(tweet))
 
from autocorrect import Speller
spell = Speller(lang='en')
#spell check
tweet=spell(tweet)
print("After Spell check the tweet is:-\n{}".format(tweet))
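
To see what the standardizing line does on its own, the sketch below wraps the same itertools.groupby expression in a function (the name limit_repeats is hypothetical) and runs it without the spell checker, so it needs only the standard library.

```python
import itertools


def limit_repeats(text, max_run=2):
    """Cut every run of the same character down to at most max_run."""
    return "".join(
        "".join(run)[:max_run] for _, run in itertools.groupby(text)
    )


print(limit_repeats("I misssss this"))
print(limit_repeats("lovdddd itttt"))
print(limit_repeats("good book"))
```

Runs of two are kept because many correct English words contain double letters; what remains, such as "lovdd", is left for the spell checker to resolve.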

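9) Removing stopwords: Stop words are words that occur frequently in text but add little meaning. For this we will use the nltk library, which provides modules for preprocessing data, including a list of stop words. You can also create your own stop word list to suit your use case.

First, make sure the nltk library is installed. If it is not, install it with the command:

#install nltk library
 pip install nltk

Code:

python3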

import nltk
#download the stopwords from nltk using
nltk.download('stopwords')
#import stopwords
from nltk.corpus import stopwords
 
#import english stopwords list from nltk
stopwords_eng = stopwords.words('english')
 
tweet_tokens=tweet.split()
tweet_list=[]
#remove stopwords
for word in tweet_tokens:
    if word not in stopwords_eng:
        tweet_list.append(word)
 
print("tweet_list = {}".format(tweet_list))
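
The option of building your own stop word list could look like the sketch below. CUSTOM_STOPWORDS is a made-up, very small set chosen for this example only; it is not NLTK's list. A set is used because membership tests on a set do not slow down as the list grows.

```python
# Hypothetical hand-picked stop words; a real list would be tuned to the use case.
CUSTOM_STOPWORDS = {"i", "the", "and", "it", "was", "is", "to"}


def remove_stopwords(tokens, stopwords=CUSTOM_STOPWORDS):
    """Keep only the tokens that are not in the stop word set."""
    return [token for token in tokens if token not in stopwords]


tokens = "i enjoyed the event and it was awesome".split()
kept = remove_stopwords(tokens)
print(kept)
```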

10) Removing punctuation: Punctuation consists of characters such as !, <, @, #, & and $.

Code:

python3

#for string operations
import string         
clean_tweet=[]
#remove punctuations
for word in tweet_list:
    if word not in string.punctuation:
        clean_tweet.append(word)
 
print("clean_tweet = {}".format(clean_tweet))
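
The loop above drops tokens that consist of punctuation, but punctuation attached to a word, as in "awesome!", stays in place. If that matters, one possible variant, sketched here with the hypothetical name strip_punctuation, removes punctuation characters inside each token with str.translate and then discards tokens that end up empty.

```python
import string

# Translation table that deletes every character in string.punctuation.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def strip_punctuation(tokens):
    """Delete punctuation inside tokens and drop tokens left empty."""
    stripped = (token.translate(PUNCT_TABLE) for token in tokens)
    return [token for token in stripped if token]


cleaned_tokens = strip_punctuation(["awesome!", "!", "love", "had-fun"])
print(cleaned_tokens)
```

Whether a hyphenated token such as "had-fun" should be merged like this or split in two depends on the downstream task.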

These are some of the data cleaning techniques we commonly apply to text data. You can also perform more advanced cleaning, such as grammar checking.