📜  Gensim-创建LDA主题模型

📅  最后修改于: 2020-10-16 02:28:47             🧑  作者: Mango


本章将帮助您学习如何在Gensim中创建潜在Dirichlet分配(LDA)主题模型。

在NLP的主要应用之一(自然语言处理)中,自动从大量文本中提取有关主题的信息。大量文本可能来自酒店评论,tweet,Facebook帖子,来自任何其他社交媒体频道的评论,电影评论,新闻报道,用户反馈,电子邮件等。

在这个数字时代,了解人们/客户在谈论什么,了解他们的观点和问题,对于企业,政治运动和管理人员来说都是非常有价值的。但是,是否可以手动阅读大量文本,然后从主题中提取信息?

不,这不对。它需要一种自动算法,该算法可以读取大量文本文档并自动从中提取所需的信息/主题。

LDA的作用

LDA进行主题建模的方法是将文档中的文本分类为特定主题。建模为Dirichlet分布,LDA建立-

  • 每个文档模型的主题和
  • 每个主题模型的单词

提供LDA主题模型算法后,为了获得良好的主题-关键字分布组合,请重新安排-

  • 文档中的主题分布以及
  • 主题内的关键字分布

在处理过程中,LDA做出的一些假设是-

  • 每个文档都被建模为主题的多名义分布。
  • 每个主题都被建模为单词的多名词分布。
  • 我们应该选择正确的数据语料库,因为LDA假定每个文本块都包含相关的单词。
  • LDA还假定这些文档是由多个主题组成的。

用Gensim实施

在这里,我们将使用LDA(潜在Dirichlet分配)从数据集中提取自然讨论的主题。

加载数据集

我们将使用的数据集是“ 20个新闻组”的数据集,其中包含来自新闻报道各个部分的数千篇新闻文章。在Sklearn数据集下可用。我们可以在以下Python脚本的帮助下轻松下载-

from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train')

让我们在以下脚本的帮助下查看一些示例新闻-

newsgroups_train.data[:4]
["From: lerxst@wam.umd.edu (where's my thing)\nSubject: 
WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization: 
University of Maryland, College Park\nLines: 
15\n\n I was wondering if anyone out there could enlighten me on this car 
I saw\nthe other day. It was a 2-door sports car, looked to be from the 
late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. 
In addition,\nthe front bumper was separate from the rest of the body. 
This is \nall I know. If anyone can tellme a model name, 
engine specs, years\nof production, where this car is made, history, or 
whatever info you\nhave on this funky looking car, please e-mail.\n\nThanks,
\n- IL\n ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n",

"From: guykuo@carson.u.washington.edu (Guy Kuo)\nSubject: SI Clock Poll - Final 
Call\nSummary: Final call for SI clock reports\nKeywords: 
SI,acceleration,clock,upgrade\nArticle-I.D.: shelley.1qvfo9INNc3s\nOrganization: 
University of Washington\nLines: 11\nNNTP-Posting-Host: carson.u.washington.edu\n\nA 
fair number of brave souls who upgraded their SI clock oscillator have\nshared their 
experiences for this poll. Please send a brief message detailing\nyour experiences with 
the procedure. Top speed attained, CPU rated speed,\nadd on cards and adapters, heat 
sinks, hour of usage per day, floppy disk\nfunctionality with 800 and 1.4 m floppies 
are especially requested.\n\nI will be summarizing in the next two days, so please add 
to the network\nknowledge base if you have done the clock upgrade and haven't answered 
this\npoll. Thanks.\n\nGuy Kuo \n",

'From: twillis@ec.ecn.purdue.edu (Thomas E Willis)\nSubject: 
PB questions...\nOrganization: Purdue University Engineering 
Computer Network\nDistribution: usa\nLines: 36\n\nwell folks, 
my mac plus finally gave up the ghost this weekend after\nstarting 
life as a 512k way back in 1985. sooo, i\'m in the market for 
a\nnew machine a bit sooner than i intended to be...\n\ni\'m looking 
into picking up a powerbook 160 or maybe 180 and have a bunch\nof 
questions that (hopefully) somebody can answer:\n\n* does anybody 
know any dirt on when the next round of powerbook\nintroductions 
are expected? i\'d heard the 185c was supposed to make an\nappearence 
"this summer" but haven\'t heard anymore on it - and since i\ndon\'t 
have access to macleak, i was wondering if anybody out there had\nmore 
info...\n\n* has anybody heard rumors about price drops to the powerbook 
line like the\nones the duo\'s just went through recently?\n\n* what\'s 
the impression of the display on the 180? i could probably swing\na 180 
if i got the 80Mb disk rather than the 120, but i don\'t really have\na 
feel for how much "better" the display is (yea, it looks great in the\nstore, 
but is that all "wow" or is it really that good?). could i solicit\nsome 
opinions of people who use the 160 and 180 day-to-day on if its
worth\ntaking the disk size and money hit to get the active display? 
(i realize\nthis is a real subjective question, but i\'ve only played around 
with the\nmachines in a computer store breifly and figured the opinions 
of somebody\nwho actually uses the machine daily might prove helpful).\n\n* 
how well does hellcats perform? ;)\n\nthanks a bunch in advance for any info - 
if you could email, i\'ll post a\nsummary (news reading time is at a premium 
with finals just around the\ncorner... :
( )\n--\nTom Willis \\ twillis@ecn.purdue.edu \\ Purdue Electrical 
Engineering\n---------------------------------------------------------------------------\
n"Convictions are more dangerous enemies of truth than lies." - F. W.\nNietzsche\n',

'From: jgreen@amber (Joe Green)\nSubject: Re: Weitek P9000 ?\nOrganization: 
Harris Computer Systems Division\nLines: 14\nDistribution: world\nNNTP-Posting-Host: 
amber.ssd.csd.harris.com\nX-Newsreader: TIN [version 1.1 PL9]\n\nRobert 
J.C. Kyanko (rob@rjck.UUCP) wrote:\n >abraxis@iastate.edu writes in article 
:\n> > Anyone know about the 
Weitek P9000 graphics chip?\n > As far as the low-level stuff goes, it looks 
pretty nice. It\'s got this\n> quadrilateral fill command that requires just 
the four points.\n\nDo you have Weitek\'s address/phone number? I\'d like to get 
some information\nabout this chip.\n\n--\nJoe Green\t\t\t\tHarris 
Corporation\njgreen@csd.harris.com\t\t\tComputer Systems Division\n"The only 
thing that really scares me is a person with no sense of humor.
"\n\t\t\t\t\t\t-- Jonathan Winters\n']

先决条件

我们需要NLTK的停用词和Scapy的英语模型。两者都可以如下下载-

import nltk;
nltk.download('stopwords')
nlp = spacy.load('en_core_web_md', disable=['parser', 'ner'])

导入必要的程序包

为了建立LDA模型,我们需要导入以下必要的包-

import re
import numpy as np
import pandas as pd
from pprint import pprint
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
import spacy
import pyLDAvis
import pyLDAvis.gensim
import matplotlib.pyplot as plt

准备停用词

现在,我们需要导入停用词并使用它们-

from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])

清理文字

现在,借助Gensim的simple_preprocess(),我们需要将每个句子标记为单词列表。我们还应该删除标点符号和不必要的字符。为了做到这一点,我们将创建一个名为send_to_words()的函数-

def sent_to_words(sentences):
   for sentence in sentences:
      yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))
data_words = list(sent_to_words(data))

建立Bigram和Trigram模型

众所周知,二字是在文档中经常同时出现的两个单词,而三字母组是在文档中经常同时出现的三个单词。借助Gensim的短语模型,我们可以做到这一点-

bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100)
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)

过滤掉停用词

接下来,我们需要过滤掉停用词。除此之外,我们还将创建函数来生成二元组,三元组和词形化-

def remove_stopwords(texts):
   return [[word for word in simple_preprocess(str(doc))
if word not in stop_words] for doc in texts]
def make_bigrams(texts):
   return [bigram_mod[doc] for doc in texts]
def make_trigrams(texts):
   return [trigram_mod[bigram_mod[doc]] for doc in texts]
def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
   texts_out = []
   for sent in texts:
     doc = nlp(" ".join(sent))
     texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
   return texts_out

为主题模型构建字典和语料库

现在,我们需要构建字典和语料库。我们也在前面的示例中做到了-

id2word = corpora.Dictionary(data_lemmatized)
texts = data_lemmatized
corpus = [id2word.doc2bow(text) for text in texts]

建立LDA主题模型

我们已经实现了训练LDA模型所需的一切。现在,该构建LDA主题模型了。对于我们的实现示例,可以在以下代码行的帮助下完成-

lda_model = gensim.models.ldamodel.LdaModel(
   corpus=corpus, id2word=id2word, num_topics=20, random_state=100, 
   update_every=1, chunksize=100, passes=10, alpha='auto', per_word_topics=True
)

实施实例

让我们看一下构建LDA主题模型的完整实现示例-

import re
import numpy as np
import pandas as pd
from pprint import pprint
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
import spacy
import pyLDAvis
import pyLDAvis.gensim
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])
from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train')
data = newsgroups_train.data
data = [re.sub('\S*@\S*\s?', '', sent) for sent in data]
data = [re.sub('\s+', ' ', sent) for sent in data]
data = [re.sub("\'", "", sent) for sent in data]
print(data_words[:4]) #it will print the data after prepared for stopwords
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100)
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)
def remove_stopwords(texts):
   return [[word for word in simple_preprocess(str(doc)) 
   if word not in stop_words] for doc in texts]
def make_bigrams(texts):
   return [bigram_mod[doc] for doc in texts]
def make_trigrams(texts):
   [trigram_mod[bigram_mod[doc]] for doc in texts]
def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
   texts_out = []
   for sent in texts:
      doc = nlp(" ".join(sent))
      texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
   return texts_out
data_words_nostops = remove_stopwords(data_words)
data_words_bigrams = make_bigrams(data_words_nostops)
nlp = spacy.load('en_core_web_md', disable=['parser', 'ner'])
data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=[
   'NOUN', 'ADJ', 'VERB', 'ADV'
])
print(data_lemmatized[:4]) #it will print the lemmatized data.
id2word = corpora.Dictionary(data_lemmatized)
texts = data_lemmatized
corpus = [id2word.doc2bow(text) for text in texts]
print(corpus[:4]) #it will print the corpus we created above.
[[(id2word[id], freq) for id, freq in cp] for cp in corpus[:4]] 
#it will print the words with their frequencies.
lda_model = gensim.models.ldamodel.LdaModel(
   corpus=corpus, id2word=id2word, num_topics=20, random_state=100, 
   update_every=1, chunksize=100, passes=10, alpha='auto', per_word_topics=True
)

现在,我们可以使用上面创建的LDA模型来获取主题,以计算模型复杂性。