- Data collection
- Data preparation
- Identify the machine learning algorithm and related model
- Train the model on the data
- Evaluate model performance
- Optimize model performance
Step 1 – Data collection
You start by identifying the data source and working out how to collect or gather the data. The data could come from one or more existing databases or from streaming data sources such as log files. It could also come from the public domain. The challenge here is to determine how large a dataset you want to gather. Depending on what you want to achieve with the analysis, this could be a smaller or a larger dataset. Several factors come into consideration for data collection.
Step 2 – Data preparation
The next step is to explore the data and prepare it for further analysis. The following are some of the tasks you would want to perform in this step:
- Data Loading: Load the data into the tool. One handy command is read.csv().
- Exploring Data Types: Identify which data is numeric and which is nominal, and import the data appropriately (see the stringsAsFactors option of read.csv()).
- Studying Data Frame: Study the data frame, exploring its different observations and features.
- Observe Data Observations & Features: Commands such as summary() come in very handy for exploring the spread of the data across different features. For nominal data, commands such as table() come in very handy.
- Training & Test Datasets: To prepare data for the actual analysis, you may want to set aside sample datasets for training and testing. The training dataset is used to generate the model, which is then applied to the test dataset to generate predictions for evaluation. This procedure is also termed the holdout method.
- Visualize Relationships & Data Patterns: You could use one or more plots, such as hist() and plot(), to identify patterns in the data and relationships between different features.
- Relationships among Features: You may want to understand the relationships that exist between different features. For this, commands such as cor() can be used.
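The preparation tasks above can be sketched in a few lines of base R. This is only an illustration under stated assumptions: it uses the built-in iris dataset, written to a temporary CSV file so that read.csv() has something to load, and the variable names (flowers, train, test) and the 70/30 split ratio are arbitrary choices, not something the steps prescribe.

```r
# Assumption: the built-in iris dataset stands in for your own data.
# Write it to a temporary CSV so read.csv() has a file to load.
csv_path <- tempfile(fileext = ".csv")
write.csv(iris, csv_path, row.names = FALSE)

# Data loading: stringsAsFactors = TRUE imports text columns as nominal factors.
flowers <- read.csv(csv_path, stringsAsFactors = TRUE)

# Exploring data types and studying the data frame.
str(flowers)
head(flowers)

# Observing observations and features.
summary(flowers$Sepal.Length)  # spread of a numeric feature
table(flowers$Species)         # counts per level of a nominal feature

# Holdout method: a random 70/30 split (the ratio is an arbitrary choice).
set.seed(42)
train_idx <- sample(nrow(flowers), size = round(0.7 * nrow(flowers)))
train <- flowers[train_idx, ]
test  <- flowers[-train_idx, ]

# Visualizing data patterns and relationships between features.
hist(flowers$Sepal.Length)
plot(flowers$Petal.Length, flowers$Petal.Width)

# Relationships among the numeric features.
numeric_cols <- sapply(flowers, is.numeric)
feature_cor <- cor(flowers[, numeric_cols])
print(feature_cor)
```

Setting a seed before sample() keeps the split reproducible, which makes later comparisons between models easier to interpret.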
Step 3 – Identify the Machine Learning Algorithm and Related Model
The next step is to identify the machine learning algorithm that will be used to train the model. As you gain expertise, this could well become the first step and will come quite easily. Note that these algorithms come as R packages that can be downloaded from CRAN and then installed and loaded for further use.
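As a rough sketch of the install-and-load cycle, the helper below (load_or_install is a hypothetical name, not part of R) installs a package from CRAN only if it is missing and then attaches it. The example call uses the stats package, which ships with R, so nothing is downloaded; in practice you would pass the name of the CRAN package that implements your chosen algorithm.

```r
# Hypothetical helper: attach a package, installing it from CRAN first if missing.
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # downloads from CRAN; needs network access
  }
  library(pkg, character.only = TRUE)
}

# "stats" ships with R, so this call only attaches it.
load_or_install("stats")
```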
Step 4 – Train the Model on the Data
Once the algorithm is identified, it is used to train the model on the training dataset. One of the important tasks is to identify the R package and related commands that you want to use at this stage. As an outcome of this step, you should understand the result well enough to evaluate the model's performance when it is applied to the test dataset. The following become clearer at this stage:
- Relationship between features
- Data pattern
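A minimal training sketch follows, assuming linear regression with lm() from the stats package as the chosen algorithm and the built-in mtcars dataset as the data. Both are illustrative choices; the features wt and hp and the training-set size of 22 rows are likewise arbitrary.

```r
# Assumption: mtcars stands in for your data; 22 of its 32 rows form the training set.
set.seed(1)
train_idx <- sample(nrow(mtcars), size = 22)
train <- mtcars[train_idx, ]

# Train a linear regression model that predicts mpg from weight and horsepower.
model <- lm(mpg ~ wt + hp, data = train)

# Inspect the fit: the coefficients describe how each feature relates to mpg.
summary(model)
coef(model)
```

Reading the coefficient table in the summary() output is one way to see the relationships between features that the training step brings out.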
Step 5 – Evaluate Model Performance
The primary objective of this phase is to test the model's effectiveness, that is, to assess the identified machine learning algorithm on the test dataset. There are different approaches to this phase. For various models, the summary() command comes in very handy. Depending on the package that you use, there may be additional commands for evaluating the model's performance on the test dataset.
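One possible way to evaluate a model on held-out data is sketched below. It assumes a linear regression fitted with lm() on the built-in mtcars dataset and uses root mean squared error (RMSE) and the correlation between actual and predicted values as metrics. These are common choices for a numeric target, not the only ones.

```r
# Assumption: mtcars stands in for your data, split with the holdout method.
set.seed(1)
train_idx <- sample(nrow(mtcars), size = 22)
train <- mtcars[train_idx, ]
test  <- mtcars[-train_idx, ]

# Train on the training dataset only.
model <- lm(mpg ~ wt + hp, data = train)
summary(model)  # fit statistics on the training data

# Apply the model to the test dataset to generate predictions.
predicted <- predict(model, newdata = test)

# Compare predictions with the actual values.
rmse <- sqrt(mean((test$mpg - predicted)^2))  # typical size of the prediction error
pred_cor <- cor(test$mpg, predicted)          # agreement between actual and predicted
print(c(rmse = rmse, correlation = pred_cor))
```

Note that summary() reports how well the model fits the training data, while the RMSE computed here reflects performance on observations the model has not seen.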
Step 6 – Optimize Model Performance
The following are the objectives of improving or optimizing a machine learning algorithm:
- Fine-tune the performance of machine learning models by searching for the optimal set of training conditions
- Identify methods for combining models into groups that can be used to tackle the most challenging problems
- Discover techniques for getting the maximum level of performance out of a machine learning algorithm
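The first two objectives can be illustrated with a small base R sketch. It assumes the built-in mtcars dataset, treats the polynomial degree of a regression as the "training condition" to search over, and combines two simple models by averaging their predictions. The names rmse, degrees, fit_a and fit_b are illustrative, and a real project would typically use a dedicated validation set or cross-validation for the search.

```r
# Assumption: mtcars stands in for your data; the held-out rows act as a validation set.
set.seed(7)
train_idx <- sample(nrow(mtcars), size = 22)
train <- mtcars[train_idx, ]
valid <- mtcars[-train_idx, ]

# Helper: root mean squared error between actual and predicted values.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Search over candidate training conditions: the polynomial degree of wt.
degrees <- 1:3
scores <- sapply(degrees, function(d) {
  fit <- lm(mpg ~ poly(wt, d), data = train)
  rmse(valid$mpg, predict(fit, newdata = valid))
})
best_degree <- degrees[which.min(scores)]
print(data.frame(degree = degrees, rmse = scores))

# Combine models: average the predictions of two simple models.
fit_a <- lm(mpg ~ wt, data = train)
fit_b <- lm(mpg ~ hp, data = train)
ensemble_pred <- (predict(fit_a, newdata = valid) + predict(fit_b, newdata = valid)) / 2
ensemble_rmse <- rmse(valid$mpg, ensemble_pred)
```

The candidate with the lowest validation error is kept. Whether the averaged ensemble beats the individual models depends on the data, so it should be checked rather than assumed.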