This article presents common exploratory R commands that can be used during the data preparation stage when solving classification problems. I came across them while working through the KNN and naive Bayes algorithms. The list below is likely not exhaustive, and I would love to hear about additional commands from you. Please feel free to comment or suggest if I missed one or more important points.
In the commands listed below, a data frame, messages_text, is used. It holds a set of text data loaded using the read.table command as follows:
messages_text <- read.table( file.choose(), sep="\t", stringsAsFactors=FALSE)
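Because file.choose() opens an interactive file picker, the line above cannot be run as-is in a script. As a minimal sketch, the same kind of load can be tried on a few inline lines through the text argument of read.table. The sample lines and the name sample_df are hypothetical, and the quote and comment.char settings are an assumption about what suits free-form message text, not part of the original command.

```r
# Hypothetical sample: two tab-separated columns and no header row.
sample_lines <- c(
  "ham\tSee you at lunch",
  "spam\tYou have won a prize",
  "ham\tCall me later"
)

# text= reads from the character vector instead of a file.
# quote="" and comment.char="" (assumed settings) keep apostrophes and
# '#' characters inside a message from being treated specially.
sample_df <- read.table(
  text = sample_lines,
  sep = "\t",
  quote = "",
  comment.char = "",
  stringsAsFactors = FALSE
)

# With no header row, the columns get the default names V1 and V2.
str(sample_df)
```

With no header row, read.table assigns the default column names V1 and V2, which is why the renaming step later in this article is useful.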
Common Exploratory R Commands for Data Preparation Stage
The commands listed below achieve the following:
- Seeing the summary of loaded data using str command
- Changing the names of columns to desired names
- Converting target feature from character vector to factor
- Analyzing the percentage occurrence of different categories
# Display the structure of the data frame (column names, types and sample values) loaded using a command
# such as read.csv, read.table etc.
str(messages_text)
# Change the column names to the desired names. At times, the text file starts straight away with the
# data and has no header row. When that happens, the features are named V1, V2 etc. during loading.
# Thus, it is a good idea to name the features appropriately.
names(messages_text) <- c( "type", "text")
# as.factor is frequently used to convert a categorical feature to a factor. When loaded with
# stringsAsFactors=FALSE, this variable is a character vector.
messages_text$type <- as.factor(messages_text$type)
# table, when used on a variable of class factor, gives the number of occurrences of
# each category
table(messages_text$type)
# prop.table, when applied to the result of table, gives the proportion of each category;
# multiplying by 100 converts the proportions to percentages
prop.table(table(messages_text$type))*100
# round, combined with prop.table, gives the percentage occurrence of each category,
# rounded to the number of digits specified in the command
round(prop.table(table(messages_text$type))*100, digits=2)
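The steps above can be tried end to end without a data file. The following is a minimal sketch on a small, made-up data frame; the name demo and its four rows are hypothetical and only stand in for the loaded messages.

```r
# Hypothetical data standing in for the loaded file; columns start as V1 and V2.
demo <- data.frame(
  V1 = c("ham", "spam", "ham", "ham"),
  V2 = c("See you at lunch", "You have won a prize",
         "Call me later", "Running late"),
  stringsAsFactors = FALSE
)

# Rename the columns and convert the target feature to a factor.
names(demo) <- c("type", "text")
demo$type <- as.factor(demo$type)

# Count each category, then express the counts as rounded percentages.
counts <- table(demo$type)
percentages <- round(prop.table(counts) * 100, digits = 2)

print(counts)
print(percentages)
```

In this made-up sample, three of the four messages are ham, so ham accounts for 75% and spam for 25%. Checking this split early is useful in classification work because it shows how balanced the target classes are.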
I found it very helpful. However the differences are not too understandable for me