Last modified: 2023-12-03 15:32:37 · Author: Mango

LDA (Latent Dirichlet Allocation) with Scikit-Learn in Python

LDA is a topic modeling technique used to uncover the hidden topics in a collection of documents. It models each document as a mixture of topics, and each topic as a probability distribution over the words in the vocabulary.
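To make the "mixture of topics" idea concrete, here is a minimal sketch of the generative story that LDA assumes, written with the standard library only. The vocabulary, the two topics, and the helper names (sample_dirichlet, generate_document) are made up for illustration; fitting LDA runs this story in reverse, inferring the topics from observed documents.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, vocab, doc_topic_prior, length, rng):
    """Generate one document following LDA's generative story."""
    # 1. Draw this document's mixture over topics.
    theta = sample_dirichlet(doc_topic_prior, rng)
    words = []
    for _ in range(length):
        # 2. Pick a topic for this word position according to the mixture.
        topic = rng.choices(range(len(theta)), weights=theta)[0]
        # 3. Pick a word from that topic's distribution over the vocabulary.
        words.append(rng.choices(vocab, weights=topic_word[topic])[0])
    return theta, words

rng = random.Random(0)
vocab = ["goal", "team", "match", "cpu", "disk", "memory"]
# Two hypothetical topics: one about sports, one about hardware.
topic_word = [
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.4, 0.3, 0.3],
]
theta, words = generate_document(topic_word, vocab, [0.5, 0.5], 8, rng)
print([round(p, 2) for p in theta], words)
```

The printed list shows the document's topic proportions followed by the eight sampled words. A document whose mixture leans toward the first topic will contain mostly sports words, and vice versa.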

In this article, we will demonstrate how to perform LDA using Scikit-Learn in Python.

Loading the Data

Before implementing LDA, we first load a dataset. We will use the 20 Newsgroups dataset, a collection of roughly 18,000 newsgroup posts spread across 20 categories. Scikit-Learn downloads it on first use and caches it locally.

from sklearn.datasets import fetch_20newsgroups

newsgroups_data = fetch_20newsgroups(subset='all', shuffle=True, random_state=42)

Data Preprocessing

Next, we preprocess the data: lowercase the text, strip punctuation and numbers, remove stop words, lemmatize the remaining words, and drop words of one or two letters.
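As a quick illustration of what this kind of cleaning does to a single sentence, here is a dependency-free sketch. It is a simplification, not the article's pipeline: TOY_STOPWORDS is a tiny hand-picked list and toy_clean is a hypothetical helper that skips lemmatization entirely.

```python
import re

# Tiny hand-picked stop-word list, for illustration only.
TOY_STOPWORDS = {"the", "in", "and", "were", "for"}

def toy_clean(text):
    """Lowercase, keep letters only, drop stop words and 1-2 letter words."""
    text = text.lower()
    text = re.sub(r"[^a-z]+", " ", text)  # punctuation and digits become spaces
    words = [w for w in text.split() if w not in TOY_STOPWORDS]
    words = [w for w in words if len(w) > 2]
    return " ".join(words)

cleaned = toy_clean("In 1993, the 2 GPUs were sold for $300 and up!")
print(cleaned)  # gpus sold
```

Digits and punctuation disappear, the stop words are filtered out, and "up" is dropped by the length filter. The article's own version, which follows, uses NLTK's full English stop-word list and adds lemmatization.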

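import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-time download of the NLTK resources used below.
nltk.download('stopwords')
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
stopwords_eng = set(stopwords.words('english'))  # a set gives fast membership tests

def preprocess_text(text):
    text = text.lower()
    # Replace every run of non-letters (punctuation, digits) with a space.
    text = re.sub('[^a-z]+', ' ', text)
    words = text.split()
    # Drop stop words, then lemmatize what remains.
    words = [lemmatizer.lemmatize(word) for word in words if word not in stopwords_eng]
    # Discard very short tokens (one or two letters).
    words = [word for word in words if len(word) > 2]
    return ' '.join(words)

preprocessed_data = [preprocess_text(text) for text in newsgroups_data.data]

Feature Extraction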

Now, we will extract the features from the preprocessed data using the CountVectorizer class provided by Scikit-Learn.

from sklearn.feature_extraction.text import CountVectorizer

count_vectorizer = CountVectorizer(max_features=5000)
feature_matrix = count_vectorizer.fit_transform(preprocessed_data)

LDA Modeling

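We can now fit LDA to the feature matrix using the LatentDirichletAllocation class provided by Scikit-Learn. Here n_components sets the number of topics (20, the same as the number of newsgroup categories), and learning_method='online' fits the model with mini-batch updates.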
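Before running on the full dataset, it can help to see the shapes involved on a toy corpus. The sketch below uses six made-up documents and two topics; the corpus and variable names are assumptions for illustration, and with so little data the learned topics themselves are not meaningful.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = [
    "team win match goal",
    "team lose match referee",
    "goal match team coach",
    "cpu disk memory board",
    "memory cpu cache disk",
    "disk drive cpu memory",
]
counts = CountVectorizer().fit_transform(toy_docs)

toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)
# fit_transform returns one topic distribution per document.
doc_topics = toy_lda.fit_transform(counts)

print(doc_topics.shape)            # (6, 2): documents x topics
print(doc_topics.sum(axis=1))      # each row sums to 1
print(toy_lda.components_.shape)   # topics x vocabulary terms
```

Each row of doc_topics is a document's mixture over topics, while components_ holds unnormalized per-topic word weights. The full-size model on the newsgroups data is fitted the same way.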

from sklearn.decomposition import LatentDirichletAllocation

lda_model = LatentDirichletAllocation(n_components=20, max_iter=10, learning_method='online', random_state=42)
lda_model.fit(feature_matrix)

Displaying Results

To display the results, we can print the top words associated with each topic.

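def display_topics(model, feature_names, num_top_words):
    # Each row of components_ holds one topic's weight for every vocabulary term.
    for index, topic in enumerate(model.components_):
        # argsort() is ascending, so the reversed slice yields the highest-weight terms first.
        top_indices = topic.argsort()[:-num_top_words - 1:-1]
        message = f'Topic {index}: ' + ' '.join(feature_names[i] for i in top_indices)
        print(message)

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() instead.
display_topics(lda_model, count_vectorizer.get_feature_names_out(), 20)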

This will display the top 20 words associated with each of the 20 topics generated by our LDA model.
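The slice topic.argsort()[:-num_top_words - 1:-1] used in display_topics is compact but easy to misread. The sketch below walks through it on a made-up four-term topic; the terms and weights are invented for illustration.

```python
import numpy as np

feature_names = np.array(["apple", "banana", "cherry", "date"])
topic = np.array([0.1, 0.7, 0.2, 0.5])  # made-up weights for one topic
num_top_words = 2

# argsort() returns indices that would sort the weights in ascending order.
order = topic.argsort()
# Walk backwards from the last index and stop after num_top_words items.
top = order[:-num_top_words - 1:-1]

print(order.tolist())               # [0, 2, 3, 1]
print(top.tolist())                 # [1, 3]
print(feature_names[top].tolist())  # ['banana', 'date']
```

The result lists the highest-weight terms first, which is why each printed topic starts with its most characteristic words.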

Conclusion

In this article, we demonstrated how to perform LDA using Scikit-Learn in Python. We loaded the 20 Newsgroups dataset, preprocessed the text, extracted word-count features, and fitted an LDA model. Finally, we displayed the top words associated with each topic.