Python – Text Classification using Bag-of-words Model

In this post, you will learn the concepts behind the bag-of-words (BoW) model and how to train a text classification model using Python Sklearn. Some of the most common text classification problems include sentiment analysis and spam filtering. In these problems, the bag-of-words technique can be applied to turn text into features for training machine learning models.

It is useful to understand the bag-of-words model before moving on to more advanced NLP techniques for text classification in machine learning. The following topics are covered in this post:

  • What is a bag-of-words model?
  • How to fit a bag-of-words model using Python Sklearn?
  • How to fit a text classification model using bag-of-words technique?

What is a bag-of-words model?

The bag-of-words model converts text into a numerical representation (numerical feature vectors) that can be used to train models with machine learning algorithms. Here are the key steps of fitting a bag-of-words model:

  1. Create a vocabulary index of the unique words or tokens found in the entire set of documents. The vocabulary index can be created in alphabetical order.
  2. Construct a numerical feature vector for each document that records how frequently each vocabulary word appears in that document. Each feature vector is sparse (mostly zeros), because the words in a single document are only a small subset of all the words (the bag of words) present in the entire set of documents.
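
The two steps above can be sketched in plain Python, without any library. This is only a minimal illustration and not how scikit-learn implements it: the helper names build_vocabulary and vectorize and the three toy sentences are made up for this example, and tokenization is simplified to lowercasing and splitting on whitespace.

```python
from collections import Counter


def build_vocabulary(documents):
    """Step 1: map each unique lowercase token to an index, in alphabetical order."""
    tokens = {token for doc in documents for token in doc.lower().split()}
    return {token: index for index, token in enumerate(sorted(tokens))}


def vectorize(document, vocabulary):
    """Step 2: count how often each vocabulary word occurs in one document."""
    counts = Counter(document.lower().split())
    # The dict keeps insertion order, so positions follow the vocabulary indices
    return [counts.get(token, 0) for token in vocabulary]


# Hypothetical toy documents, used only for illustration
toy_docs = ["the bird sings", "the cat sees the bird", "a cat sleeps"]

vocabulary = build_vocabulary(toy_docs)
vectors = [vectorize(doc, vocabulary) for doc in toy_docs]

print(vocabulary)
for vector in vectors:
    print(vector)
```

In the second toy document the word "the" occurs twice, so its position in that document's vector holds a 2, while words that do not occur in the document stay at 0.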

The picture below illustrates this concept. Note the following:

  • The words in the header row are the unique words found across all three documents listed in the first column.
  • For each document, the numbers give the number of occurrences of each word. For example, in the first document, “bird” occurs five times, “the” occurs two times and “about” occurs once.

Bag of words technique to convert to numerical feature vector

Creating a bag-of-words model using Python Sklearn

Let’s write Python Sklearn code to construct the bag-of-words from a sample set of documents. To construct a bag-of-words model based on the word counts in the respective documents, the CountVectorizer class implemented in scikit-learn is used. In the code given below, note the following:

  • CountVectorizer (sklearn.feature_extraction.text.CountVectorizer) is used to fit the bag-of-words model. Fitting the model does the following:
    • The fit_transform method of CountVectorizer takes a collection of text data, which can be documents or sentences. In the example given below, a numpy array of strings is passed as the argument.
    • The text in the numpy array is used to create a dictionary of vocabulary indices. The vocabulary indices map each unique word to an index, with the words arranged in alphabetical order. In the example given below, there are three documents stored in the numpy array. The first entry is 2021:0, the second is 40:1, the third is after:2, the fourth is badminton:3 and so on. By default, CountVectorizer lowercases the text and ignores single-character tokens, which is why “a” is not part of the vocabulary. The documents stored in the numpy array describe the results of Indian athletes at the Tokyo Olympics.
    • A numerical feature vector is created for each document based on the frequency of the words occurring in that document. For example, the word “medal” in the first document, “Mirabai has won a silver medal in weight lifting in Tokyo olympics 2021”, has index 12 and occurs once in the document. The word “in”, which has index 8, occurs twice in the document. Note the “2” in the first vector.
    • You can additionally use NLTK to preprocess the text, for example for stemming or spelling correction.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
# Create a sample set of documents
docs = np.array(['Mirabai has won a silver medal in weight lifting in Tokyo olympics 2021',
                 'Sindhu has won a bronze medal in badminton in Tokyo olympics',
                 'Indian hockey team is in top four team in Tokyo olympics 2021 after 40 years'])
# Fit the bag-of-words model
bag = vectorizer.fit_transform(docs)
# Get the unique words / tokens found in all the documents. The unique words / tokens
# represent the features (get_feature_names_out requires scikit-learn 1.0 or later)
print(vectorizer.get_feature_names_out())
# Associate the indices with each unique word
print(vectorizer.vocabulary_)
# Print the numerical feature vectors
print(bag.toarray())

Here is what the output looks like:

Fig 1. Bag of words model created using Sklearn Count Vectorizer
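
The index and count claims made above can also be checked programmatically. The following is a small sketch that refits the same three documents, looks up the indices of “medal” and “in”, and then encodes a new sentence with the already fitted vocabulary. The new sentence is made up for this illustration and is not part of the original data set.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = np.array(['Mirabai has won a silver medal in weight lifting in Tokyo olympics 2021',
                 'Sindhu has won a bronze medal in badminton in Tokyo olympics',
                 'Indian hockey team is in top four team in Tokyo olympics 2021 after 40 years'])

vectorizer = CountVectorizer()
bag = vectorizer.fit_transform(docs)

# One row per document, one column per unique word in the vocabulary
print(bag.shape)

# Look up the column index of a word, then read its count in the first document
first_vector = bag.toarray()[0]
medal_index = vectorizer.vocabulary_['medal']
in_index = vectorizer.vocabulary_['in']
print('medal:', medal_index, first_vector[medal_index])
print('in:', in_index, first_vector[in_index])

# Encode a new (hypothetical) sentence with the fitted vocabulary.
# transform() does not extend the vocabulary, so the unseen word
# "wrestling" is simply ignored.
new_doc = ['Indian team has won a medal in wrestling']
new_vector = vectorizer.transform(new_doc).toarray()[0]
print(new_vector)
```

Note that fit_transform learns the vocabulary and encodes the documents in one call, whereas transform only encodes text against a vocabulary that has already been learned. The latter is what you would use on unseen text at prediction time.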

Fitting a Text Classification Model using Bag-of-words Technique

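In this section, you will learn how to fit or train a text classification model using the bag-of-words technique. Note the following before looking at the Python code:

  • The sparse matrix returned by fit_transform is converted to a dense array and used as the feature matrix X.
  • The labels in y are dummy values used only for illustration. There must be exactly one label per document.
  • A LogisticRegression classifier is fitted on the training split and its accuracy is measured on the test split.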

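# Create the training data set from the bag-of-words matrix and dummy labels.
# Note: y needs exactly one label per row of X. The 8 dummy labels below assume
# that bag was fitted on 8 documents; the 3-document sample shown earlier
# yields only 3 rows, so extend the corpus (or shorten y) before running this.
X = bag.toarray()
y = np.array([1, 1, 0, 0, 1, 0, 0, 1])

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
# Create training and test split
X_train, X_test, y_train, y_test = train_test_split(X, y)
# Create an instance of the LogisticRegression classifier.
# multi_class is omitted: it is deprecated in recent scikit-learn releases
# and makes no difference for a binary problem.
lr = LogisticRegression(C=100.0, random_state=1, solver='lbfgs')
# Fit the model
lr.fit(X_train, y_train)
# Create the predictions
y_predict = lr.predict(X_test)
# Use metrics.accuracy_score to measure the score
print("LogisticRegression Accuracy %.3f" % metrics.accuracy_score(y_test, y_predict))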
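
For reference, here is a self-contained sketch of the same workflow that runs end to end. The eight short review sentences are invented for this illustration (label 1 = positive, 0 = negative) so that the number of documents matches the eight dummy labels. Unlike the code above, the sketch wraps CountVectorizer and LogisticRegression in a pipeline and splits the raw text first. That is a design choice of this example, not something prescribed by the technique.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical corpus: 8 made-up sentences with dummy sentiment labels
docs = [
    'great match and a wonderful win',
    'wonderful performance by the team',
    'terrible game and a poor result',
    'poor defence and a bad loss',
    'great effort and a good win',
    'bad day and a terrible loss',
    'poor show by the team',
    'good game and a great result',
]
labels = [1, 1, 0, 0, 1, 0, 0, 1]

# Split the raw text; stratify keeps both classes in the train and test sets
docs_train, docs_test, y_train, y_test = train_test_split(
    docs, labels, test_size=0.25, random_state=1, stratify=labels)

# The pipeline fits the bag-of-words vocabulary on the training text only,
# then trains the classifier on the resulting count vectors
model = make_pipeline(
    CountVectorizer(),
    LogisticRegression(C=100.0, random_state=1, solver='lbfgs'),
)
model.fit(docs_train, y_train)

# Predict on the held-out text and measure accuracy
y_predict = model.predict(docs_test)
accuracy = accuracy_score(y_test, y_predict)
print("LogisticRegression Accuracy %.3f" % accuracy)
```

Fitting the vectorizer on the training split only means that words appearing solely in the test documents do not influence the vocabulary. With a corpus this small the reported accuracy is not meaningful; the point is only to show the mechanics.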
Ajitesh Kumar