Data Science – 6 Steps to Perform Data Analysis using R

This article outlines the steps you can take to analyze available datasets using data science techniques (machine learning algorithms) and the R programming language. The aim is to give data science beginners an approach for getting started with data analysis. As you gain experience, you can adopt techniques that work better for you. These are just my thoughts, and there may be better ways of approaching data analysis. Please feel free to comment or suggest anything important that I have missed.
The following key steps can be used as a blueprint for performing data analysis:

  1. Data collection
  2. Data preparation
  3. Identify the machine learning algorithm and related model
  4. Train the model on the data
  5. Evaluate model performance
  6. Optimize model performance

 

Step 1 – Data collection

You start by identifying the data source and working out how to collect or gather the data. The data could come from one or more existing databases or from streaming data sources such as log files. It could also come from the public domain. The challenge here is to determine how large a dataset you need to gather. Depending on what you want to achieve with the analysis, this could be a smaller or a larger dataset. Some of the factors to consider during data collection are the following:

  • Time-boxed: data limited to a specific time window
  • Region-based: data limited to specific regions
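As a rough illustration of these two factors, the sketch below narrows a small data frame to a time window and a region. The `records` data frame, its column names and its values are all made up for this example; they are not from any real data source.

```r
# Hypothetical records; the columns and values are invented for illustration only.
records <- data.frame(
  timestamp = as.Date(c("2023-01-05", "2023-02-10", "2023-03-15", "2023-04-20")),
  region    = c("EU", "US", "EU", "APAC"),
  value     = c(10, 20, 30, 40),
  stringsAsFactors = FALSE
)

# Time-boxed: keep only observations inside the chosen window.
start_date <- as.Date("2023-01-01")
end_date   <- as.Date("2023-03-31")
in_window  <- records$timestamp >= start_date & records$timestamp <= end_date

# Region-based: keep only observations from the regions of interest.
in_region <- records$region %in% c("EU")

# Apply both filters together.
collected <- records[in_window & in_region, ]
print(collected)
```

Widening the window or the list of regions gives a larger dataset, and narrowing them gives a smaller one.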

 

Step 2 – Data preparation

The next step is to explore the data and prepare it for further analysis. The following are some of the tasks you would want to perform in this step:

  • Data Loading: Load the data into the tool. One handy command is read.csv().
  • Exploring Data Types: Identify which features are numeric and which are nominal (see the stringsAsFactors option), and import the data appropriately.
  • Studying the Data Frame: Study the data frame, exploring its observations and features.
  • Observe Data Observations & Features: Commands such as summary() come in very handy for exploring the spread of the data across different features. For nominal data, a command such as table() comes in very handy.
  • Training & Test Datasets: To prepare the data for the actual analysis, you may want to set aside sample datasets for training and testing. The training dataset is used to generate the model, which is then applied to the test dataset to generate predictions for evaluation. This procedure is also known as the holdout method.
  • Visualize Relationships & Data Patterns: You could use one or more plots, such as hist() and plot(), to identify patterns in the data and relationships between different features.
  • Relationships among Features: You may want to understand the relationships that exist between different features. For this, commands such as cor() can be used.
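The tasks above can be sketched in a few lines of base R. This is only a minimal sketch: it assumes the built-in `iris` dataset as a stand-in for your own data, writes it to a temporary CSV file so that read.csv() has something to load, and uses a 75/25 holdout split, which is an arbitrary choice.

```r
# Write a built-in dataset to a temporary CSV so that read.csv() has a file to load.
csv_path <- tempfile(fileext = ".csv")
write.csv(iris, csv_path, row.names = FALSE)

# Data loading: stringsAsFactors = TRUE imports text columns as factors (nominal data).
flowers <- read.csv(csv_path, stringsAsFactors = TRUE)

# Exploring data types and studying the data frame.
str(flowers)             # type of each feature
head(flowers)            # first few observations
summary(flowers)         # spread of each feature
table(flowers$Species)   # counts for a nominal feature

# Holdout method: random 75/25 split into training and test datasets.
set.seed(42)
n         <- nrow(flowers)
train_idx <- sample(n, size = floor(0.75 * n))
train     <- flowers[train_idx, ]
test      <- flowers[-train_idx, ]

# Visualize data patterns and relationships.
if (!interactive()) pdf(NULL)   # avoid writing a plot file when run as a script
hist(flowers$Sepal.Length, main = "Sepal length", xlab = "cm")
plot(flowers$Petal.Length, flowers$Petal.Width,
     xlab = "Petal length", ylab = "Petal width")

# Relationships among the numeric features.
numeric_cols <- sapply(flowers, is.numeric)
cor_matrix   <- cor(flowers[, numeric_cols])
print(round(cor_matrix, 2))
```

Setting the seed makes the random split repeatable, so the training and test datasets stay the same between runs.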

 

Step 3 – Identify the Machine Learning Algorithm and Related Model

The next step is to identify the machine learning algorithm that will be used to train the model. As you gain expertise, this could well become the first step, and the choice comes quite easily. Note that many of these algorithms are available as R packages that can be downloaded from CRAN and then installed and loaded for further use.
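Installing and loading a package can be sketched as follows. The helper name `load_or_install` is made up for this example. It is called here with the `stats` package, which ships with R and provides lm() and glm(), so nothing is downloaded when the snippet runs.

```r
# Hypothetical helper: load a package, installing it from CRAN first if it is missing.
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    # Downloads from the configured CRAN mirror; requires network access.
    install.packages(pkg)
  }
  library(pkg, character.only = TRUE)
  invisible(TRUE)
}

# "stats" is part of every R installation, so this call only loads it.
load_or_install("stats")
```

For a package that is not yet installed, such as rpart for decision trees, the same call would trigger a download, assuming a CRAN mirror is configured.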

 

Step 4 – Train the Model on the Data

Once the algorithm is identified, it is used to train the model on the training dataset. An important part of this step is identifying the R package and related commands that you want to use. By the end of this step, you should understand the result well enough to evaluate the model's performance later on the test dataset. Some of the following become clear at this stage:

  • Relationship between features
  • Data pattern
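As one possible illustration of training, the sketch below fits a linear regression with lm() on a training subset of the built-in `mtcars` dataset. The dataset, the formula `mpg ~ wt + hp` and the 75/25 split are assumptions made for this example, not a recommendation for any particular problem.

```r
# Training subset from the built-in mtcars dataset (75% of the rows).
set.seed(123)
n         <- nrow(mtcars)
train_idx <- sample(n, size = floor(0.75 * n))
train     <- mtcars[train_idx, ]

# Train a linear regression model: fuel efficiency as a function of weight and horsepower.
model <- lm(mpg ~ wt + hp, data = train)

# Inspect the fitted model: the coefficients describe how mpg relates to each feature.
summary(model)
print(coef(model))
```

The sign and size of each coefficient give a first view of the relationship between the features and the target.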

 

Step 5 – Evaluate Model Performance

The primary objective of this phase is to test the model's effectiveness, that is, to assess the chosen machine learning algorithm on the test dataset. There are different approaches to this phase. For many models, the summary() command comes in very handy for inspecting the fitted model. Depending on the package you use, there may be additional commands for evaluating model performance on the test dataset.
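One common way to evaluate a regression model on held-out data is to compare its predictions with the actual values. The sketch below assumes the built-in `mtcars` dataset and a linear model. It uses root mean squared error (RMSE) and mean absolute error (MAE) as example metrics, and other model types would call for other measures.

```r
# Recreate a training/test split and a model so that this example is self-contained.
set.seed(123)
n         <- nrow(mtcars)
train_idx <- sample(n, size = floor(0.75 * n))
train     <- mtcars[train_idx, ]
test      <- mtcars[-train_idx, ]
model     <- lm(mpg ~ wt + hp, data = train)

# Generate predictions for observations the model has not seen.
predicted <- predict(model, newdata = test)

# Compare the predictions with the actual values.
errors <- test$mpg - predicted
rmse   <- sqrt(mean(errors^2))   # root mean squared error
mae    <- mean(abs(errors))      # mean absolute error

cat("RMSE:", round(rmse, 2), " MAE:", round(mae, 2), "\n")
```

Lower values mean the predictions are closer to the observed values on the test dataset.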

 

Step 6 – Optimize Model Performance

The following are the objectives of improving or optimizing a machine learning algorithm:

  • Fine-tune the performance of machine learning models by searching for the optimal set of training conditions
  • Identify methods for combining models into groups that can be used to tackle the most challenging problems
  • Discover techniques for getting the maximum level of performance out of a machine learning algorithm
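Searching for better training conditions can be sketched with k-fold cross-validation in base R. The example below assumes the built-in `mtcars` dataset and compares three candidate formulas chosen only for illustration. The helper `cv_rmse` is a made-up name, and it is hard-coded to the `mpg` target.

```r
# Assign each row of mtcars to one of k folds at random.
set.seed(2024)
k     <- 4
folds <- sample(rep(1:k, length.out = nrow(mtcars)))

# Candidate models to compare (illustrative choices only).
candidates <- list(
  wt_only    = mpg ~ wt,
  wt_hp      = mpg ~ wt + hp,
  wt_hp_qsec = mpg ~ wt + hp + qsec
)

# Hypothetical helper: average RMSE across folds for one formula (target is mpg).
cv_rmse <- function(formula, data, folds) {
  fold_errors <- sapply(sort(unique(folds)), function(i) {
    fit  <- lm(formula, data = data[folds != i, ])       # train on the other folds
    pred <- predict(fit, newdata = data[folds == i, ])   # predict the held-out fold
    sqrt(mean((data$mpg[folds == i] - pred)^2))
  })
  mean(fold_errors)
}

# Score every candidate and pick the one with the lowest cross-validated error.
scores <- sapply(candidates, cv_rmse, data = mtcars, folds = folds)
print(round(scores, 2))

best <- names(which.min(scores))
cat("Lowest cross-validated RMSE:", best, "\n")
```

The same loop could compare different algorithms or parameter values instead of formulas. Combining several models into a group is a separate technique that is not shown here.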

 

Ajitesh Kumar
