# Data Science – Common Exploratory R Commands for Classification Problems

This article presents common exploratory R commands that can be used during the data preparation stage when solving classification problems. I found them useful while working through the KNN and naive Bayes algorithms. The list below is not exhaustive, and I would love to hear about additional commands from you. Please feel free to comment or suggest if I missed one or more important points.

The commands listed below use a data frame, messages_text, which holds a set of text data loaded with the read.table command as follows:

messages_text <- read.table( file.choose(), sep="\t", stringsAsFactors=FALSE)
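Since file.choose() opens an interactive file picker, the load step is hard to reproduce in a script. The following is a minimal sketch, assuming a hypothetical two-column, tab-separated file with no header row; the sample lines and the temporary file are made up for illustration. The quote = "" argument is an addition that is not in the original command. It is often helpful for free text, because apostrophes and quotation marks are otherwise treated as string delimiters.

```r
# Hypothetical sample data: label and message separated by a tab, no header row
sample_lines <- c(
  "ham\tSee you at lunch",
  "spam\tWin a free prize now",
  "ham\tCall me later"
)

# Write the sample to a temporary file so the example is self-contained
tmp_path <- tempfile(fileext = ".txt")
writeLines(sample_lines, tmp_path)

# quote = "" (an assumption, not in the original command) disables quote handling
messages_text <- read.table(tmp_path, sep = "\t", quote = "",
                            stringsAsFactors = FALSE)

# With no header row, the columns get the default names V1 and V2
str(messages_text)
```

Replacing tmp_path with file.choose() gives back the interactive behaviour of the original command.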

###### Common Exploratory R Commands for Data Preparation Stage

The commands listed below achieve the following:

• Seeing the summary of the loaded data using the str command
• Changing the names of columns to desired names
• Converting target feature from character vector to factor
• Analyzing the percentage occurrence of different categories

# View summary information about the loaded data frame using the str command
str(messages_text)

# Change the names of the columns to the desired names. At times, the text file
# starts straight away with the data (no header row). When that happens, the features are named V1, V2, etc.
# Thus, it is a good idea to name the features appropriately.
names(messages_text) <- c( "type", "text")

# The as.factor command is frequently used to convert categorical features to factors.
# When loaded with stringsAsFactors=FALSE, this variable is a character vector.
messages_text$type <- as.factor(messages_text$type)

# The table command, when used on a variable of class factor, gives the number of occurrences of
# the different categories
table(messages_text$type)

# The prop.table command, when applied to the table of a categorical variable (of class factor),
# gives the proportion of each category; multiplying by 100 gives the percentage occurrences
prop.table(table(messages_text$type))*100

# The round command with prop.table gives the percentage occurrence of each category,
# rounded to the number of digits specified in the command
round(prop.table(table(messages_text$type))*100, digits=2)
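To try the steps above without a data file, here is a small end-to-end sketch. It assumes a made-up four-row data frame in place of the loaded file; the variable names type_counts and type_percent are introduced here only for illustration.

```r
# Hypothetical stand-in for the loaded file: default column names V1 and V2
messages_text <- data.frame(
  V1 = c("ham", "spam", "ham", "ham"),
  V2 = c("See you at lunch", "Win a free prize now",
         "Call me later", "Running late"),
  stringsAsFactors = FALSE
)

# Summary of the structure: 4 observations of 2 character variables
str(messages_text)

# Rename the columns to meaningful names
names(messages_text) <- c("type", "text")

# Convert the target feature from a character vector to a factor
messages_text$type <- as.factor(messages_text$type)

# Count of each category: ham = 3, spam = 1
type_counts <- table(messages_text$type)
print(type_counts)

# Percentage of each category, rounded to 2 digits: ham = 75, spam = 25
type_percent <- round(prop.table(type_counts) * 100, digits = 2)
print(type_percent)
```

Storing the table in a variable avoids recomputing it for the counts and the percentages.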