
Data Science – Common Exploratory R Commands for Classification Problems

This article presents common exploratory R commands that can be used during the data preparation stage when solving classification problems. I found them useful while working through the KNN and naive Bayes algorithms. The list below is not exhaustive, and I would love to hear about additional commands from you. Please feel free to comment or make suggestions if I have missed one or more important points.

 

The commands listed below use a data frame, messages_text, which holds text data loaded with the read.table command as follows:

messages_text <- read.table( file.choose(), sep="\t", stringsAsFactors=FALSE)
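
Because file.choose() opens an interactive file dialog, the call above cannot run unattended in a script. The following is a minimal, reproducible sketch of the same load, assuming a two-column, tab-separated file without a header row. The sample rows and the temporary file are made up for illustration. The quote="" argument is an addition, on the assumption that free text may contain apostrophes or quotation marks that read.table would otherwise treat as quoting characters.

```r
# Hypothetical sample data: label <TAB> message text, no header row
sample_lines <- c(
  "ham\tAre we still meeting at 5 today?",
  "spam\tYou have won a free prize, call now",
  "ham\tDon't forget the milk",
  "ham\tSee you tomorrow"
)

# Write the sample to a temporary file so the example is self-contained
sample_file <- tempfile(fileext = ".txt")
writeLines(sample_lines, sample_file)

# header = FALSE is the read.table default, so the columns are named V1 and V2;
# quote = "" keeps apostrophes in the text from being read as quote marks
messages_text <- read.table(sample_file, sep = "\t", quote = "",
                            header = FALSE, stringsAsFactors = FALSE)

# Inspect the structure of the loaded data frame
str(messages_text)
```

In practice, sample_file would be replaced with the path to the actual data file.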

Common Exploratory R Commands for Data Preparation Stage

The commands listed below accomplish the following:

  • Viewing the structure of the loaded data using the str command
  • Changing the names of columns to desired names
  • Converting the target feature from a character vector to a factor
  • Analyzing the percentage occurrence of different categories

# Display a compact summary of the structure (column names, types and first few values)
# of a data frame loaded using a command such as read.csv, read.table, etc.
str(messages_text)

# Change the names of the columns to the desired names. At times, the text file
# starts straight away with the data (no header row). When that happens, the features are named V1, V2, etc.
# Thus, it is a good idea to name the features appropriately.
names(messages_text) <- c( "type", "text")

# as.factor command is frequently used to convert a categorical feature to a factor.
# With stringsAsFactors=FALSE, this variable is loaded as a character vector.
messages_text$type <- as.factor(messages_text$type)

# table command, when used on a variable of class factor, gives the number of occurrences of
# the different categories
table(messages_text$type)

# prop.table command, when applied to the output of table, gives the proportion of each category
# (values between 0 and 1); multiplying by 100 converts the proportions to percentages
prop.table(table(messages_text$type))*100

# round command with prop.table gives the percentage occurrence of each category,
# rounded to the number of decimal places specified by digits
round(prop.table(table(messages_text$type))*100, digits=2)
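
The commands above can be tried end to end without a data file. The sketch below uses a small, made-up data frame in place of the loaded messages; the values are hypothetical and chosen only so that the counts and percentages are easy to check by hand. Looking at the class percentages at this stage is a quick way to spot an imbalanced target before training a classifier.

```r
# Hypothetical toy data standing in for the loaded messages (columns V1, V2)
messages_text <- data.frame(
  V1 = c("ham", "spam", "ham", "ham", "spam", "ham", "ham", "ham"),
  V2 = paste("message", 1:8),
  stringsAsFactors = FALSE
)

# Rename the columns and convert the target feature to a factor
names(messages_text) <- c("type", "text")
messages_text$type <- as.factor(messages_text$type)

# Counts per category: 6 ham, 2 spam
type_counts <- table(messages_text$type)

# Percentages per category, rounded to 2 decimal places: 75 ham, 25 spam
type_percent <- round(prop.table(type_counts) * 100, digits = 2)

print(type_counts)
print(type_percent)
```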
Ajitesh Kumar
