Data Science – R Packages & Methods for naive Bayes Classification

This article describes the R packages and related functions that can be used to create a naive Bayes classifier. Please feel free to comment or suggest additions if I have missed one or more important points.
The key packages described later in this article are:

  • tm
  • wordcloud
  • e1071
  • gmodels

 

The following R packages can be used together for naive Bayes classification:

  • tm package: Originally created by Ingo Feinerer as a dissertation project at the Vienna University of Economics and Business, tm is a very popular package that provides a framework for text mining applications within R. More about the tm package can be found at http://tm.r-forge.r-project.org/ The following are some of the widely used functions in the tm package:
    • Corpus() creates an R object that stores the text documents.
    • Source functions such as VectorSource(), which takes a vector of text as its argument. A corpus can be created from a vector of messages with a command such as Corpus(VectorSource(messages$text)). Here “messages” is a data frame with a column “text” that holds the messages.
    • tm_map() applies a transformation to every document in the corpus. It is used to clean the corpus by removing numbers, stopwords, punctuation and extra whitespace, converting text to lower case, and so on.
    • inspect() displays the contents of a corpus created with Corpus().
    • DocumentTermMatrix() takes a corpus as its argument and returns a sparse matrix. Each row represents a document and each column represents a term (word) that appears in the corpus. Each cell holds the number of times that term appears in that document.
    • findFreqTerms() returns a character vector of terms. It takes a DocumentTermMatrix object and a lower frequency bound (lowfreq), which is the minimum total number of times a term must occur across the corpus.
  • wordcloud package: The following function helps to visualize a corpus as a tag cloud. More about wordcloud can be found at http://cran.r-project.org/web/packages/wordcloud/index.html
    • wordcloud() takes arguments such as the words to plot (a corpus can be passed when tm is installed), min.freq (words with a lower frequency are not plotted), max.words (maximum number of words to plot, with the least frequent dropped first), random.order, scale etc.
  • e1071 package: Developed at the statistics department of the Vienna University of Technology (TU Wien), the e1071 package provides the naiveBayes() function and a matching predict() method, which can be used to train a naive Bayes classifier and make predictions with it. The following shows how they are called:
    library(e1071)

    # Arguments: the training features derived from the DocumentTermMatrix (term counts are
    # usually converted to categorical values first, e.g. a "Yes"/"No" factor per term) and
    # a factor holding the class of each row.
    msg_classifier <- naiveBayes(messages_dtm_train, messages_train$type)
    
    # Arguments: the trained classifier and the test features, prepared the same way as the training features.
    msg_test_pred <- predict(msg_classifier, messages_dtm_test)
    
  • gmodels package: This package helps to evaluate the performance of the naive Bayes classifier. The function used to evaluate the model is:
    • CrossTable() cross-tabulates the predicted values against the actual values, so that correct and incorrect predictions can be counted for each class.
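
To show how the packages above fit together, here is a minimal end-to-end sketch. It is only an illustration under a few assumptions of mine: the tiny “messages” data frame is made up, the train/test split is arbitrary, to_indicator() is a hypothetical helper, VCorpus() is used in place of Corpus(), and laplace = 1 is just one reasonable smoothing choice. Term counts are converted to “Yes”/“No” factors before training, which is a common way to feed a document-term matrix to naiveBayes().

```r
library(tm)       # corpus handling and DocumentTermMatrix()
library(e1071)    # naiveBayes() and its predict() method
library(gmodels)  # CrossTable()

# Hypothetical toy data standing in for the "messages" data frame.
messages <- data.frame(
  type = factor(c("spam", "ham", "spam", "ham", "spam", "ham", "spam", "ham")),
  text = c("Win a free prize now",
           "Are we meeting for lunch today",
           "Free cash prize, claim now",
           "See you at lunch tomorrow",
           "Claim your free prize today",
           "Lunch meeting moved to tomorrow",
           "Free prize, claim cash now",
           "Lunch meeting tomorrow"),
  stringsAsFactors = FALSE
)
train_idx <- 1:6  # assumed split: first six rows train, last two test

# Build the corpus and clean it with tm_map().
corpus <- VCorpus(VectorSource(messages$text))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)

# One row per document, one column per term.
dtm <- DocumentTermMatrix(corpus)

# Keep only terms that occur at least twice in the training documents.
freq_terms <- findFreqTerms(dtm[train_idx, ], lowfreq = 2)
dtm <- dtm[, freq_terms]

# Hypothetical helper: turn a column of counts into a Yes/No factor.
to_indicator <- function(counts) {
  factor(ifelse(counts > 0, "Yes", "No"), levels = c("No", "Yes"))
}
features <- as.data.frame(as.matrix(dtm))
features[] <- lapply(features, to_indicator)

# Train on the training rows, then predict the held-out rows.
msg_classifier <- naiveBayes(features[train_idx, , drop = FALSE],
                             messages$type[train_idx], laplace = 1)
msg_test_pred <- predict(msg_classifier, features[-train_idx, , drop = FALSE])

# Compare predicted and actual classes.
CrossTable(msg_test_pred, messages$type[-train_idx],
           prop.chisq = FALSE, dnn = c("predicted", "actual"))
```

In the printed table, the cells where the predicted and actual labels agree count the correct predictions, and the remaining cells count the misclassified messages. With real data the split would normally be random and much larger than the one used here.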
Ajitesh Kumar