
Data Science – Data Cleaning R Commands for Text Classification Problems

This article presents the concepts and the related R commands used to clean text so that it is ready for text classification. The commands come from the tm package. Please feel free to comment or suggest if I missed one or more important points.

Let's load a set of messages, along with their classification, using the following command.

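# Assumes a tab-separated file with no header row and two columns: type, then text
messages <- read.table(file.choose(), sep="\t", stringsAsFactors=FALSE, col.names=c("type", "text"), quote="")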

The messages data frame is expected to have two columns, such as type and text, where each piece of text is associated with an appropriate type. Once that is done, let's go ahead and create a Corpus object out of all the message text. The following command creates the Corpus object. For those of you who are new to the Corpus object, note that it comes as part of the well-known R text mining package named “tm”.

corpus <- Corpus(VectorSource(messages$text))
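To make the shape of the data concrete, here is a minimal sketch that builds a tiny messages data frame by hand instead of reading a file, then creates and inspects the corpus. The sample rows are made up for illustration, and the column names type and text are assumptions taken from the description above.

```r
library(tm)

# Hypothetical sample data standing in for the tab-separated file.
# The column names `type` and `text` are assumptions for illustration.
messages <- data.frame(
  type = c("spam", "ham"),
  text = c("WINNER!! Claim your 1000 prize now",
           "Are we still meeting at 5?"),
  stringsAsFactors = FALSE
)

# One document is created per element of the text column.
corpus <- Corpus(VectorSource(messages$text))

n_docs <- length(corpus)                  # number of documents in the corpus
first_doc <- as.character(corpus[[1]])    # raw text of the first document

print(n_docs)
print(first_doc)
```

Checking the document count against nrow(messages) is a quick way to confirm that no rows were lost while loading.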

Once the Corpus object is created, it is time to clean the text.

 

R Commands for Cleaning Text

The following steps are performed as part of the text cleaning activity:

  • Change the case of all words to lowercase. In earlier versions of the tm package, it was fine to use a command such as tm_map(corpus, tolower). With more recent versions, however, that call can fail with an error such as Error: inherits(doc, "TextDocument") is not TRUE, because tolower returns a plain character vector rather than a text document. With recent versions of tm, it is therefore recommended to wrap the function in content_transformer: tm_map(corpus, content_transformer(tolower))
  • Remove numbers
  • Remove stop words. These are words such as to, and, or, but, etc.
  • Remove punctuation
  • Strip extra whitespace
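The effect of content_transformer in the first step can be checked with a small sketch. VCorpus is used here instead of Corpus purely as an assumption for illustration, since it stores every document as a TextDocument object; the two sample strings are made up.

```r
library(tm)

# VCorpus (an assumption for this illustration) keeps each document
# as a TextDocument object rather than a bare string.
corpus <- VCorpus(VectorSource(c("Hello World", "FREE Entry")))

# content_transformer() wraps tolower so that it modifies the content of
# each document and returns the document itself, not a bare character vector.
lowered <- tm_map(corpus, content_transformer(tolower))

still_document <- inherits(lowered[[1]], "TextDocument")  # class is preserved
first_text <- as.character(lowered[[1]])                  # lower-cased text

print(still_document)
print(first_text)
```

The same wrapper can be applied to any function that takes a character vector and returns one, so custom cleaning steps fit into tm_map in the same way.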

The following command set achieves the above objectives:

# Change all the words to lowercase
corpus_clean <- tm_map(corpus, content_transformer(tolower))

# Remove all the numbers
corpus_clean <- tm_map(corpus_clean, removeNumbers)

# Remove the stop words such as to, and, or etc.
corpus_clean <- tm_map(corpus_clean, removeWords, stopwords())

# Remove punctuation
corpus_clean <- tm_map(corpus_clean, removePunctuation)

# Collapse runs of whitespace into a single space
corpus_clean <- tm_map(corpus_clean, stripWhitespace)
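As a rough end-to-end check, the sketch below runs the same steps, in the same order, on two made-up messages and pulls the cleaned text back out as a character vector. The sample strings and the variable name cleaned are assumptions for illustration. Note that stop words are removed before punctuation: the English stop word list contains contractions such as "don't", which would no longer match once the apostrophe is gone.

```r
library(tm)

# Hypothetical sample messages; any character vector would do.
texts <- c("WINNER!! Claim your 1000 prize now",
           "I don't think the meeting is at 5, is it?")

corpus <- Corpus(VectorSource(texts))

# Same cleaning steps, in the same order as the commands above.
corpus_clean <- tm_map(corpus, content_transformer(tolower))
corpus_clean <- tm_map(corpus_clean, removeNumbers)
corpus_clean <- tm_map(corpus_clean, removeWords, stopwords("en"))
corpus_clean <- tm_map(corpus_clean, removePunctuation)
corpus_clean <- tm_map(corpus_clean, stripWhitespace)

# Extract the cleaned text as a plain character vector for inspection.
cleaned <- unname(sapply(corpus_clean, as.character))
print(cleaned)
```

stripWhitespace collapses repeated whitespace but does not trim, so a single leading or trailing space may remain; base R's trimws() can be applied to the extracted vector if that matters for later steps.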

 

Ajitesh Kumar

