Bag of words technique to convert to numerical feature vector
Last updated: 6th Jan, 2024
Most machine learning algorithms require numerical input for training the models. Bag of words (BoW) effectively converts text data into numerical feature vectors, making it compatible with a wide range of machine learning algorithms, from linear classifiers like logistic regression to complex ones like neural networks. In this post, you will learn about the concepts of bag-of-words model and how to train a text classification model using Python Sklearn. Some of the most common text classification problems includes sentiment analysis, spam filtering etc. In these problems, one can apply bag-of-words technique to train machine learning models for text classification. It will be good to understand the concepts of bag-of-words model while beginning on learning advanced NLP techniques for text classification in machine learning.
Bag of words model helps convert the text into numerical representation (numerical feature vectors) such that the same can be used to train models using machine learning algorithms. Here are the key steps of fitting a bag-of-words model:
The picture below represents the above concept with the help of an example. Note some of the following:
The following are some of the most common applications of BoW in machine learning:
Let’s write Python Sklearn code to construct the bag-of-words from a sample set of documents. To construct a bag-of-words model based on the word counts in the respective documents, the CountVectorizer class implemented in scikit-learn is used. In the code given below, note the following:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
#
# Create sample set of documents
#
docs = np.array(['Mirabai has won a silver medal in weight lifting in Tokyo olympics 2021',
'Sindhu has won a bronze medal in badminton in Tokyo olympics',
'Indian hockey team is in top four team in Tokyo olympics 2021 after 40 years'])
#
# Fit the bag-of-words model
#
bag = vectorizer.fit_transform(docs)
#
# Get unique words / tokens found in all the documents. The unique words / tokens represents
# the features
#
print(vectorizer.get_feature_names())
#
# Associate the indices with each unique word
#
print(vectorizer.vocabulary_)
#
# Print the numerical feature vector
#
print(bag.toarray())
Here is how the output would look like:
You could learn more about the bags of model from the following video:
In this section, you will learn about how to fit or train a text classification model using bag-of-words technique. Pay attention to some of the following before looking into the Python code:
#
# Creating training data set from bag-of-words and dummy label
#
X = bag.toarray()
y = np.array([1, 1, 0, 0, 1, 0, 0, 1])
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
#
# Create training and test split
#
X_train, X_test, y_train, y_test = train_test_split(X, y)
#
# Create an instance of LogisticRegression classifier
#
lr = LogisticRegression(C=100.0, random_state=1, solver='lbfgs', multi_class='ovr')
#
# Fit the model
#
lr.fit(X_train, y_train)
#
# Create the predictions
#
y_predict = lr.predict(X_test)
# Use metrics.accuracy_score to measure the score
print("LogisticRegression Accuracy %.3f" %metrics.accuracy_score(y_test, y_predict))
We’ve all been in that meeting. The dashboard on the boardroom screen is a sea…
When building a regression model or performing regression analysis to predict a target variable, understanding…
If you've built a "Naive" RAG pipeline, you've probably hit a wall. You've indexed your…
If you're starting with large language models, you must have heard of RAG (Retrieval-Augmented Generation).…
If you've spent any time with Python, you've likely heard the term "Pythonic." It refers…
Large language models (LLMs) have fundamentally transformed our digital landscape, powering everything from chatbots and…