This article outlines some of the key steps one could take to build an effective model for a given machine learning problem, using different machine learning algorithms. Please feel free to comment or suggest if I have missed one or more important points.
8 Key Steps for Solving A Machine Learning Problem
- Gather the data set: This is one of the most important steps, where the objective is to collect as large a data set as possible. Provided the features have been selected appropriately, a larger data set helps lower the cross-validation error and lets the training and cross-validation errors converge toward a low value. If the features have not been selected appropriately, then beyond a certain data set size adding more data may have little impact on the error, because the model suffers from high bias (under-fitting).
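- Split the data set into the following three data sets:
  - Training data set
  - Cross-validation data set
  - Test data set
The reason for keeping a separate cross-validation data set is this: if new features are developed based on the test data set error, the model ends up fitting the test data set well, and the test error is no longer a fair estimate of how the model performs on new data. Features and parameters should therefore be chosen using the cross-validation data set, with the test data set kept aside for the final evaluation. One could adopt a 60-20-20% split for the training, cross-validation and test data sets.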
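
As a rough illustration of the 60-20-20% split, here is a minimal sketch using only the Python standard library. The function name `split_dataset`, its parameters, and the toy data are assumptions made for this example; in practice a library helper would usually be used instead.

```python
import random


def split_dataset(examples, train_frac=0.6, cv_frac=0.2, seed=0):
    """Shuffle the examples and split them into training,
    cross-validation, and test sets (hypothetical helper)."""
    if train_frac <= 0 or cv_frac <= 0 or train_frac + cv_frac >= 1:
        raise ValueError("fractions must be positive and leave room for a test set")
    shuffled = list(examples)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_cv = round(len(shuffled) * cv_frac)
    train = shuffled[:n_train]
    cv = shuffled[n_train:n_train + n_cv]
    test = shuffled[n_train + n_cv:]  # whatever remains is the test set
    return train, cv, test


# Toy data: 100 example identifiers standing in for real records.
data = list(range(100))
train, cv, test = split_dataset(data)
print(len(train), len(cv), len(test))
```

Shuffling before splitting matters when the data arrives in some order (for example, sorted by date or label), since otherwise the three sets may not resemble each other.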
- Choose the most appropriate algorithm: There are guidelines for selecting a machine learning algorithm based on the problem at hand. For example, if the goal is a predictive model that estimates a number, such as a price, one can choose one of the regression algorithms. If the goal is to assign the input to one of a set of labels, it has to be a classification algorithm.
- Start with a very simple model with a minimal set of the most prominent features. This helps one get started quickly without spending time exploring the most appropriate feature set. Often, a lot of time is otherwise spent on identifying the most appropriate features.
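- Plot learning curves to measure how the error (predicted vs observed) varies with respect to some of the following:
  - Adding more training examples. In simple words, collect more data.
  - Adding more features
  - Regularization parameters
Learning curves can help in diagnosing cases of high bias (under-fitting) or high variance (over-fitting). Typically, high bias shows up as training and cross-validation errors that are both high and close together, whereas high variance shows up as a low training error with a much higher cross-validation error.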
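
The numbers behind a learning curve over training set size could be computed along the following lines. This is only a sketch under stated assumptions: it uses NumPy, synthetic data, and a simple regularized linear regression solved in closed form. The helper names (`fit_ridge`, `mse`, `learning_curve`) and the parameter `lam` are made up for this illustration.

```python
import numpy as np


def add_bias(X):
    """Prepend a column of ones so the model can learn an intercept."""
    return np.hstack([np.ones((X.shape[0], 1)), X])


def fit_ridge(X, y, lam=0.0):
    """Closed-form linear regression with L2 regularization strength lam.
    The intercept term is left unregularized."""
    Xb = add_bias(X)
    penalty = lam * np.eye(Xb.shape[1])
    penalty[0, 0] = 0.0
    # pinv keeps this well-defined even for very small training subsets.
    return np.linalg.pinv(Xb.T @ Xb + penalty) @ (Xb.T @ y)


def mse(theta, X, y):
    """Mean squared error between predictions and observed values."""
    return float(np.mean((add_bias(X) @ theta - y) ** 2))


def learning_curve(X_train, y_train, X_cv, y_cv, sizes, lam=0.0):
    """Train on the first m examples for each m in sizes and record
    the training error and the cross-validation error."""
    train_err, cv_err = [], []
    for m in sizes:
        theta = fit_ridge(X_train[:m], y_train[:m], lam)
        train_err.append(mse(theta, X_train[:m], y_train[:m]))
        cv_err.append(mse(theta, X_cv, y_cv))
    return train_err, cv_err


# Synthetic data (an assumption for the example): y = 2x + 1 plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = 2.0 * X[:, 0] + 1.0 + rng.normal(0, 0.5, size=200)
X_train, y_train = X[:120], y[:120]
X_cv, y_cv = X[120:160], y[120:160]

sizes = [5, 10, 20, 40, 80, 120]
train_err, cv_err = learning_curve(X_train, y_train, X_cv, y_cv, sizes)
for m, tr_e, cv_e in zip(sizes, train_err, cv_err):
    print(f"m={m:4d}  train_error={tr_e:.3f}  cv_error={cv_e:.3f}")
```

Plotting `train_err` and `cv_err` against `sizes` with any plotting library gives the learning curve itself. Repeating the run with different values of `lam`, or with extra feature columns in `X`, gives the corresponding curves for regularization and for added features.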
- Manually examine the errors the model made on the cross-validation data set. This process primarily helps in identifying new features.
- Identify one or more numerical evaluation measures so that the effectiveness of the model, in terms of its errors, can be cross-checked with numbers.
- Optimize the learning algorithm by including additional features, adding more examples, etc., if required, and repeat the process of error analysis.
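
The error analysis and numerical evaluation steps above can be sketched for a binary classification problem as follows. Precision, recall and F1 score are only one possible choice of evaluation measure, picked here for illustration. The function names and the toy labels are assumptions, not part of any particular library.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 score for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def misclassified(examples, y_true, y_pred):
    """Return (example, true label, predicted label) for every error,
    so the errors can be inspected by hand."""
    return [(x, t, p) for x, t, p in zip(examples, y_true, y_pred) if t != p]


# Toy cross-validation results (made up for the example).
examples = ["a", "b", "c", "d", "e", "f", "g", "h"]
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
errors = misclassified(examples, y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
for x, t, p in errors:
    print(f"example {x!r}: expected {t}, predicted {p}")
```

Tracking a single number such as F1 on the cross-validation data set makes it easier to tell whether a new feature or more training data actually improved the model after each round of error analysis.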