Have you ever clustered a collection of texts and wondered what predominant topics underlie each group? How can you pinpoint the essence of each cluster when it is made up of a large volume of words? Is there a way to succinctly represent the core topic of each cluster using Python? Text clustering is a powerful technique in natural language processing (NLP) that groups documents into clusters based on their content. Once you’ve clustered your data, a natural follow-up question arises: “What are these clusters about?” In this article, we’ll discuss two different methods to find the dominant topics of text clusters using Python. Meanwhile, check out my post on text clustering: Text Clustering Python Examples – Steps, Algorithms.
We will look into the following two techniques for finding the topics of the text clusters: text aggregation with keyword extraction, and summing TF-IDF scores per cluster.
Before getting into the techniques for finding the topics, let's create text clusters for a set of documents (customer_complaints_1.csv) using the K-Means algorithm:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
# Load the data (path as used in Google Colab)
df = pd.read_csv('/content/customer_complaints_1.csv')
# Create TF-IDF embeddings
tfidf_vectorizer = TfidfVectorizer(max_df=0.85, max_features=1000, stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(df['text'])
# Reduce dimensionality using PCA
pca = PCA(n_components=2)
reduced_tfidf = pca.fit_transform(tfidf_matrix.toarray())
# Cluster using KMeans with 3 clusters (as determined previously)
# n_init is set explicitly because its default differs across scikit-learn versions
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
clusters = kmeans.fit_predict(reduced_tfidf)
# Plot the clusters
plt.figure(figsize=(10, 6))
plt.scatter(reduced_tfidf[:, 0], reduced_tfidf[:, 1], c=clusters, cmap='rainbow')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=200, c='black', marker='X', label='Centroids')
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.title('Clusters of Customer Complaints')
plt.legend()
plt.grid(True)
plt.show()
The above code groups the text (19 rows in the file) into three clusters and plots them along the two PCA components. The code for the Elbow method used to arrive at K = 3 can be found in this blog: Text Clustering Python Examples.
Now that we have created text clusters, let's find the topics for these clusters.
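For readers who do not have the other post at hand, the Elbow method can be sketched roughly as follows. This is a minimal illustration, not the code from that post: the `toy_docs` list is made up, and K-Means is fitted directly on the TF-IDF matrix rather than on the PCA-reduced data used above.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus, used only to make the sketch runnable
toy_docs = [
    "internet outage all weekend",
    "internet connection keeps dropping",
    "contract price went up",
    "contract renewal fee surprise",
    "rude customer service agent",
    "customer service was rude",
]

X = TfidfVectorizer(stop_words="english").fit_transform(toy_docs)

# Fit K-Means for several values of K and record the inertia
# (within-cluster sum of squared distances) for each
inertias = {}
for k in range(1, 5):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = model.inertia_

for k, inertia in inertias.items():
    print(k, round(inertia, 3))
```

Plotting inertia against K and looking for the point where the curve flattens (the "elbow") gives a candidate value for K.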
One of the simplest and most intuitive ways to find the dominant topic of a text cluster is to aggregate all the texts within the cluster and then extract the most frequent terms or keywords. The steps are: join the texts of each cluster into one string, tokenize it into words, count how often each word occurs, and take the most frequent words as the keywords.
Here is the Python code for finding topics using text aggregation and keyword extraction.
from collections import Counter
import re
def extract_keywords(text, num_keywords=5):
    """Extract the most frequent keywords from a given text."""
    # Tokenize the text into words and filter out short and non-alphabetic tokens
    words = [word for word in re.findall(r'\w+', text.lower()) if len(word) > 2 and word.isalpha()]
    # Count the frequency of each word
    word_freq = Counter(words)
    # Return the most common words as keywords
    return [word for word, freq in word_freq.most_common(num_keywords)]
# Aggregate texts for each cluster
cluster_texts = df.groupby(clusters)['text'].apply(' '.join)
# Extract top keywords for each cluster
cluster_keywords = cluster_texts.apply(extract_keywords)
cluster_keywords
The following will get printed:
0 [the, and, out, internet, that]
1 [the, and, for, they, contract]
2 [the, you, service, and, they]
Name: text, dtype: object
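Stop words such as "the", "and" and "they" dominate these lists, so only a few of the keywords are informative. Reading those (internet, contract, service), the three clusters appear to be about internet problems, contracts, and service respectively.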
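One way to get more informative keywords is to drop stop words before counting. The sketch below is a variant of `extract_keywords`; the function name `extract_keywords_no_stopwords`, the tiny `STOP_WORDS` set and the sample sentence are assumptions made for illustration. In practice a fuller list, such as scikit-learn's `ENGLISH_STOP_WORDS`, could be substituted.

```python
from collections import Counter
import re

# Illustrative stop-word list (an assumption; far from complete)
STOP_WORDS = {
    "the", "and", "for", "that", "they", "you", "out",
    "was", "are", "with", "this", "have", "not", "but",
}

def extract_keywords_no_stopwords(text, num_keywords=5, stop_words=STOP_WORDS):
    """Return the most frequent words in text, ignoring stop words."""
    # Keep alphabetic tokens longer than two characters that are not stop words
    words = [
        word for word in re.findall(r"[a-z]+", text.lower())
        if len(word) > 2 and word not in stop_words
    ]
    return [word for word, _ in Counter(words).most_common(num_keywords)]

# Made-up sample text
sample = ("The internet was out and the technician said that "
          "the internet outage was not their fault.")
print(extract_keywords_no_stopwords(sample, num_keywords=3))
```

It can be applied to the aggregated cluster texts in the same way as before, for example `cluster_texts.apply(extract_keywords_no_stopwords)`.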
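Another, more sophisticated approach uses the Term Frequency-Inverse Document Frequency (TF-IDF) statistic, a popular technique in information retrieval. The idea here is to quantify the importance of each term not just within a document, but across the entire corpus. The steps are: select the TF-IDF rows of the documents in a cluster, sum the scores of each term over those documents, and take the terms with the highest sums as the cluster's top terms.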
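For reference, the score being summed can be written out as follows. This assumes the default `TfidfVectorizer` settings (smoothed IDF and L2 row normalization), which the code in this article does not override; here n is the number of documents, df(t) the number of documents containing term t, tf(t, d) the raw count of t in document d, and C_c the set of documents in cluster c.

```latex
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1

w(t, d) = \frac{\mathrm{tf}(t, d)\,\mathrm{idf}(t)}
               {\sqrt{\sum_{t'} \bigl(\mathrm{tf}(t', d)\,\mathrm{idf}(t')\bigr)^{2}}}

S_c(t) = \sum_{d \in C_c} w(t, d)
```

The top terms of cluster c are the terms t with the largest S_c(t).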
Here is the Python code that can be used to implement the above:
import numpy as np

def get_top_terms_by_summing_tfidf(cluster_id, matrix, vectorizer, labels, n_terms=5):
    """Get the top terms for a given cluster by summing the TF-IDF scores."""
    # Keep only the rows (documents) assigned to the given cluster
    cluster_matrix = matrix[labels == cluster_id]
    # Sum the TF-IDF scores of each term and flatten the result to a 1-D array
    summed_tfidf = np.asarray(cluster_matrix.sum(axis=0)).ravel()
    # Indices of the n_terms highest scores, highest first
    top_term_indices = summed_tfidf.argsort()[-n_terms:][::-1]
    # Convert indices to actual terms
    feature_names = vectorizer.get_feature_names_out()
    return [feature_names[index] for index in top_term_indices]

# Extract top terms for each cluster, passing the cluster labels explicitly
cluster_top_terms_by_sum = {
    cluster_id: get_top_terms_by_summing_tfidf(cluster_id, tfidf_matrix, tfidf_vectorizer, clusters)
    for cluster_id in range(3)
}
cluster_top_terms_by_sum
The following will get printed:
{0: ['internet', 'comcast', 'cable', 'tech', 'security'],
1: ['contract', 'speed', 'mbps', 'xfinity', 'customer'],
2: ['rude', 'service', 'customer', 'adding', 'boxes']}
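Looking at these words, the following topics can be suggested: cluster 0 is about internet and cable issues, cluster 1 is about contracts and internet speed, and cluster 2 is about rude customer service.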
With the large volume of documents we encounter in today’s data-driven world, understanding the core essence of clustered texts is more crucial than ever. As we’ve seen, Python offers robust methods, from straightforward keyword extraction to the more intricate TF-IDF analysis, to suggest possible topics for our clusters.
Whether you’re diving into customer feedback, analyzing scholarly articles, or exploring any large text dataset, these methods provide valuable insights, transforming clusters from mere groups of texts into meaningful categories with distinct themes. As you continue your journey in text analytics, remember that the true power of clustering lies not just in grouping similar documents, but in comprehending and conveying the stories they tell.