Let's load a set of messages, along with their classifications, using the following command.
messages <- read.table( file.choose(), sep="\t", stringsAsFactors=FALSE)
The messages data frame is expected to have two features, such as type and text, where each piece of text is associated with its type. Once the data is loaded, let's create a Corpus object from all the message text using the following command. If you are new to the Corpus object, note that it comes with the popular R text mining package “tm”.
corpus <- Corpus(VectorSource(messages$text))
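As a minimal sketch of the loading step, assume a tab-separated file with no header row and the message type in the first column. The sample messages, the names passed to col.names, and the header, quote and comment.char arguments are illustrative assumptions rather than part of the original data set. Without column names, read.table would label the columns V1 and V2, and messages$text would be empty.

```r
library(tm)

# Hypothetical sample data written to a temporary file so the sketch is self-contained.
sample_file <- tempfile(fileext = ".txt")
writeLines(c(
  "ham\tAre we still meeting at 5 today?",
  "spam\tWINNER!! You have won 1000 dollars. Call now!"
), sample_file)

# Assumption: no header row, so the column names are supplied explicitly.
# quote = "" and comment.char = "" keep apostrophes and '#' in the text
# from being treated as quoting or comment characters.
messages <- read.table(sample_file, sep = "\t", header = FALSE,
                       col.names = c("type", "text"),
                       quote = "", comment.char = "",
                       stringsAsFactors = FALSE)

corpus <- Corpus(VectorSource(messages$text))
length(corpus)        # number of documents in the corpus
content(corpus[[1]])  # text of the first document
```

With a real file, file.choose() can replace sample_file as in the command above.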
Once the Corpus object is created, it is time to clean the text.
R Commands for Cleaning Text
The text cleaning activity consists of the following steps:
- Change the case of all words to lowercase. In earlier versions of the tm package, it was fine to use a command such as tm_map(corpus, tolower). With a recently installed tm package, however, that form of the tm_map command can lead to an error such as Error: inherits(doc, "TextDocument") is not TRUE. With the latest version of tm, the recommended command for changing to lowercase is therefore: tm_map(corpus, content_transformer(tolower))
- Remove numbers
- Remove punctuation
- Remove stop words. These are words such as to, and, or, and but.
- Strip extra whitespace
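The content_transformer() wrapper used in the lowercase step is not limited to tolower. The sketch below wraps a small custom function in the same way; the sample sentence and the helper name to_space are hypothetical and chosen only for illustration.

```r
library(tm)

# Hypothetical one-document corpus for illustration.
corpus <- Corpus(VectorSource("Call NOW/TODAY for a free quote"))

# Wrap a plain character function so that tm_map() applies it to the
# document content. to_space is a hypothetical helper, not part of tm.
to_space <- content_transformer(function(x, pattern) gsub(pattern, " ", x))

corpus <- tm_map(corpus, to_space, "/")                 # replace "/" with a space
corpus <- tm_map(corpus, content_transformer(tolower))  # then lowercase
content(corpus[[1]])
```

Depending on the installed tm version, tm_map may print a warning for corpora built with Corpus(VectorSource(...)); the transformed text is still returned.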
The following set of commands achieves the above objectives:
# Change all the words to lowercase
corpus_clean <- tm_map(corpus, content_transformer(tolower))
# Remove all the numbers
corpus_clean <- tm_map(corpus_clean, removeNumbers)
# Remove the stop words such as to, and, or etc.
corpus_clean <- tm_map(corpus_clean, removeWords, stopwords())
# Remove punctuation
corpus_clean <- tm_map(corpus_clean, removePunctuation)
# Remove whitespaces
corpus_clean <- tm_map(corpus_clean, stripWhitespace)
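To check what the cleaning does, the steps can be bundled into a function and run on a couple of sample messages. This is only a sketch: the helper name clean_corpus and the sample sentences are hypothetical, and stopwords("english") is written out explicitly on the assumption that the messages are in English.

```r
library(tm)

# clean_corpus is a hypothetical helper that bundles the five cleaning steps.
clean_corpus <- function(corpus) {
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  corpus <- tm_map(corpus, removePunctuation)
  tm_map(corpus, stripWhitespace)
}

# Hypothetical sample messages.
sample_text <- c("WINNER!! You have won 1000 dollars.",
                 "Are we still meeting at 5 today?")
corpus_clean <- clean_corpus(Corpus(VectorSource(sample_text)))

# Pull the cleaned text out of the corpus and show it next to the raw text.
cleaned <- vapply(seq_along(corpus_clean),
                  function(i) content(corpus_clean[[i]]), character(1))
print(data.frame(raw = sample_text, cleaned = cleaned))
```

Stop words are removed before punctuation here, in the same order as the commands above. One likely reason to keep that order is that the English stop word list contains contractions such as "don't", which would no longer match once the apostrophe has been stripped.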