Data Science – Common Exploratory R Commands for Classification Problems

This article presents common exploratory R commands that can be used during the data preparation stage when solving classification problems. I found myself using them while working through the KNN and naive Bayes algorithms. The list below is not exhaustive, and I would love to hear about additional commands from you. Please feel free to comment or suggest if I have missed one or more important points.

 

The commands listed below use a data frame, messages_text, which holds a set of text data loaded with the read.table command, as follows:

messages_text <- read.table( file.choose(), sep="\t", stringsAsFactors=FALSE)
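For a reproducible run without the interactive file.choose() dialog, the same load can be sketched against a small temporary file. The sample lines, the temporary file, and the header = FALSE and quote = "" arguments are assumptions made for illustration and are not part of the original command. The quote = "" argument is there because free text often contains apostrophes, which read.table treats as quote characters by default.

```r
# Hypothetical sample data: two tab-separated columns, no header row.
sample_lines <- c(
  "ham\tI'll call you later",
  "spam\tWin a free prize now",
  "ham\tSee you at lunch"
)

# Write the sample to a temporary file so the example is self-contained.
sample_file <- tempfile(fileext = ".txt")
writeLines(sample_lines, sample_file)

# Same kind of call as above, with the path given explicitly instead of file.choose().
# quote = "" keeps apostrophes in the text from being read as quote characters.
messages_text <- read.table(sample_file, sep = "\t", header = FALSE,
                            quote = "", stringsAsFactors = FALSE)

# Inspect the result: three rows, two character columns named V1 and V2.
str(messages_text)
```

Because the file has no header row, the columns come back with the default names V1 and V2, which is the situation the renaming step in this article deals with.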

Common Exploratory R Commands for Data Preparation Stage

The commands listed below achieve the following:

  • Seeing the summary of loaded data using str command
  • Changing the names of columns to desired names
  • Converting target feature from character vector to factor
  • Analyzing the percentage occurrence of different categories

# Show the structure of the data frame (column names, types and sample values)
# loaded using a command such as read.csv or read.table.
str(messages_text)

# Change the column names to the desired names. When the text file has no header row
# and starts straight away with the data, the features are named V1, V2, etc.
# It is therefore a good idea to name the features appropriately.
names(messages_text) <- c("type", "text")

# The as.factor command is frequently used to convert categorical features to factors.
# With stringsAsFactors=FALSE, this variable is loaded as a character vector.
messages_text$type <- as.factor(messages_text$type)

# The table command, when used on a variable of class factor, gives the number of
# occurrences of the different categories.
table(messages_text$type)

# The prop.table command, when applied to that table, gives the proportion of each
# category; multiplying by 100 turns the proportions into percentages.
prop.table(table(messages_text$type))*100

# The round command, used with prop.table, gives the percentage occurrence of each
# category, rounded to the number of digits specified in the command.
round(prop.table(table(messages_text$type))*100, digits=2)
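
To see what these commands return, the whole sequence can be run on a small in-memory data frame. The five rows below are made up for illustration, and the columns start out as V1 and V2 to mimic a file loaded without a header row.

```r
# Made-up data standing in for a file loaded without a header row.
messages_text <- data.frame(
  V1 = c("ham", "spam", "ham", "ham", "spam"),
  V2 = c("See you at lunch", "Win a free prize now", "Call me later",
         "Running late", "Claim your reward today"),
  stringsAsFactors = FALSE
)

# Rename the columns and convert the target feature to a factor.
names(messages_text) <- c("type", "text")
messages_text$type <- as.factor(messages_text$type)

# Levels are sorted alphabetically by default: "ham", then "spam".
levels(messages_text$type)

# Counts per category: 3 ham, 2 spam.
type_counts <- table(messages_text$type)
type_counts

# Percentages per category, rounded to two digits: 60 for ham, 40 for spam.
type_pct <- round(prop.table(type_counts) * 100, digits = 2)
type_pct
```

If a particular ordering of the categories matters, factor(messages_text$type, levels = c("spam", "ham")) can be used in place of as.factor to set it explicitly.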
Ajitesh Kumar

