12 Most Common Machine Learning Tasks

This article presents some of the most common machine learning tasks that one may come across while solving a machine learning problem. Under each task is a list of machine learning methods that could be used for it. Please feel free to comment or make suggestions if I have missed one or more important points.

The following are the key machine learning tasks described later in this article:

  1. Data gathering
  2. Data preprocessing
  3. Exploratory data analysis (EDA)
  4. Feature engineering
  5. Training machine learning models of the following kinds:
    • Regression
    • Classification
    • Clustering
  6. Multivariate querying
  7. Density estimation
  8. Dimensionality reduction
  9. Model / Algorithm selection
  10. Testing and matching
  11. Model monitoring
  12. Model retraining



The following are the 12 most common machine learning tasks that one comes across while solving an advanced analytics problem:

  1. Data Gathering: Any machine learning problem requires a lot of data for training and testing. Identifying the right data sources and gathering data from them is key. Data can come from databases, external agencies, the internet, etc.
  2. Data Preprocessing: Before training the models, it is of utmost importance to prepare the data appropriately. Data preprocessing includes some of the following:
    • Data cleaning: Identify attributes that have too little data or that show no variance. Such data (rows and columns) need to be removed from the training data set.
    • Missing data imputation: Handle missing data using imputation techniques such as replacing missing values with the mean, median or mode. Here is my post on this topic: Replace missing values with mean, median or mode
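As a minimal sketch of the imputation step described above, the following uses only the Python standard library. The function name `impute` and the sample `ages` column are hypothetical and chosen purely for illustration; missing values are assumed to be represented as `None`.

```python
from statistics import mean, median, mode

def impute(values, strategy="mean"):
    """Replace None entries with the mean, median or mode of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    # Pick the statistic by name and compute it on the observed values only.
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](observed)
    return [fill if v is None else v for v in values]

# Hypothetical column with two missing entries.
ages = [25, None, 31, 25, None, 40]
print(impute(ages, "mean"))    # [25, 30.25, 31, 25, 30.25, 40]
print(impute(ages, "median"))  # [25, 28.0, 31, 25, 28.0, 40]
print(impute(ages, "mode"))    # [25, 25, 31, 25, 25, 40]
```

The median and mode are usually the safer choices when a column is skewed or categorical, since a few extreme values can pull the mean away from typical observations.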
  3. Exploratory Data Analysis (EDA): Once the data is preprocessed, the next step is to perform exploratory data analysis to understand the distribution of the data and the relationships between and within the variables. Some of the following is performed as part of EDA:
    • Correlation analysis
    • Multicollinearity analysis
    • Data distribution analysis
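The correlation and multicollinearity checks listed above can be sketched with a hand-written Pearson coefficient. The helper names, the 0.9 threshold and the housing columns below are assumptions made for illustration, not part of any particular library.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def highly_correlated_pairs(columns, threshold=0.9):
    """Return (name_a, name_b, r) for feature pairs with |r| above the threshold."""
    pairs = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) > threshold:
            pairs.append((a, b, r))
    return pairs

# Hypothetical data: the two area columns carry the same information.
data = {
    "area_sqft": [800, 1000, 1200, 1500, 1800],
    "area_sqm": [74.3, 92.9, 111.5, 139.4, 167.2],
    "age_years": [30, 5, 18, 12, 25],
}
for a, b, r in highly_correlated_pairs(data):
    print(f"{a} and {b} are strongly correlated (r = {r:.4f})")
```

A pair flagged this way is a candidate for dropping one of the two features, since keeping both adds little information and can destabilize linear models.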
  4. Feature Engineering: Feature engineering is one of the critical tasks in building machine learning models. Selecting the right features not only helps build models of higher accuracy but also helps keep models simpler and reduces overfitting. Feature engineering includes tasks such as deriving features from raw features, identifying important features, feature extraction and feature selection. The following are some of the techniques that could be used for feature selection:
    • Filter methods, which help in selecting features based on the outcomes of statistical tests. The following are some of the statistical tests used:
      • Pearson’s correlation
      • Linear discriminant analysis (LDA)
      • Analysis of Variance (ANOVA)
      • Chi-square tests
    • Wrapper methods, which help in feature selection by training a model on a subset of features and measuring its accuracy. The following are some of the algorithms used:
      • Forward selection
      • Backward elimination
      • Recursive feature elimination
    • Regularization techniques, which penalize the coefficients of one or more features so that the most important features stand out. The following are some of the algorithms used:
      • LASSO (L1) regularization
      • Ridge (L2) regularization
      • Elasticnet regularization
      • Regularization with classification algorithms such as logistic regression, SVM, etc.
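To illustrate how the L1 and L2 penalties above differ for feature selection, here is a small sketch under a strong simplifying assumption: the feature columns are orthonormal, so each coefficient can be shrunk independently. Under that assumption, minimizing ½‖y − Xβ‖² + λ‖β‖₁ soft-thresholds each least-squares coefficient, while minimizing ½‖y − Xβ‖² + (λ/2)‖β‖² divides each one by (1 + λ). The coefficient values and feature names are hypothetical.

```python
def lasso_shrink(coef, lam):
    """Soft-thresholding: coefficients with |coef| <= lam become exactly 0."""
    if coef > lam:
        return coef - lam
    if coef < -lam:
        return coef + lam
    return 0.0

def ridge_shrink(coef, lam):
    """Ridge shrinks every coefficient proportionally but never to exactly 0."""
    return coef / (1.0 + lam)

# Hypothetical least-squares coefficients for three features.
ols = {"income": 2.5, "age": -0.8, "zip_digit": 0.1}
lam = 0.5

lasso = {name: lasso_shrink(c, lam) for name, c in ols.items()}
ridge = {name: ridge_shrink(c, lam) for name, c in ols.items()}

# Features that survive the L1 penalty are the selected ones.
selected = [name for name, c in lasso.items() if c != 0.0]
print(selected)  # ['income', 'age']
print(ridge)
```

This is why LASSO is commonly used to select features, whereas ridge mainly stabilizes coefficients; Elastic Net combines both penalties.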
  5. Training Models: Once the features are determined, the next step is to train models on the data for those features. The following is a list of different types of machine learning problems and the related algorithms that can be used to solve them:
    • Regression: Regression tasks mainly deal with the estimation of numerical values (continuous variables). Examples include the estimation of housing prices, product prices, stock prices, etc. Some of the following ML methods could be used for solving regression problems:
      • Kernel regression (Higher accuracy)
      • Gaussian process regression (Higher accuracy)
      • Regression trees
      • Linear regression
      • Support vector regression
      • LASSO / Ridge
      • Deep learning
      • Random forests
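As a minimal example of a regression task, the sketch below fits a one-variable linear regression by ordinary least squares. The function name `fit_line` and the housing numbers are hypothetical; the data is constructed so that price = 2 × area + 50 exactly.

```python
def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept with a single feature."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    intercept = my - slope * mx
    return slope, intercept

# Hypothetical housing data: area in square metres, price in thousands.
area = [50, 70, 90, 110]
price = [150, 190, 230, 270]

slope, intercept = fit_line(area, price)
print(slope, intercept)         # 2.0 50.0
print(slope * 100 + intercept)  # 250.0, the predicted price for an area of 100
```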
    • Classification: Classification tasks are about predicting the category of a data point (discrete variables). One of the most common examples is predicting whether an email is spam or ham. Common use cases are found in healthcare, such as determining whether or not a person is suffering from a particular disease. Classification also has applications in finance, such as determining whether or not a transaction is fraudulent. ML methods such as the following could be applied to solve classification tasks:
      • Kernel discriminant analysis (Higher accuracy)
      • K-Nearest Neighbors (Higher accuracy)
      • Artificial neural networks (ANN) (Higher accuracy)
      • Support vector machine (SVM) (Higher accuracy)
      • Random forests (Higher accuracy)
      • Decision trees
      • Boosted trees
      • Logistic regression
      • Naive Bayes
      • Deep learning
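For the classification task, a brute-force K-Nearest Neighbors classifier is one of the simplest methods to sketch. The two features (number of links and number of exclamation marks per email) and the training examples are invented for illustration.

```python
from collections import Counter
from math import dist

def knn_predict(train, query, k=3):
    """Return the majority label among the k training points closest to the query.

    `train` is a list of (features, label) pairs.
    """
    nearest = sorted(train, key=lambda item: dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Hypothetical emails described by (number of links, number of exclamation marks).
train = [
    ((8, 6), "spam"), ((7, 9), "spam"), ((9, 7), "spam"),
    ((1, 0), "ham"), ((0, 1), "ham"), ((2, 1), "ham"),
]

print(knn_predict(train, (6, 7)))  # spam
print(knn_predict(train, (1, 1)))  # ham
```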
    • Clustering: Clustering tasks are all about finding natural groupings of data and a label associated with each of these groupings (clusters). Common examples include customer segmentation and product feature identification for a product roadmap. The following are some of the common ML methods:
      • Mean-shift (Higher accuracy)
      • Hierarchical clustering
      • K-means
      • Topic models
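The clustering task can be illustrated with a bare-bones K-means loop. This sketch assumes the initial centroids are supplied by the caller and runs a fixed number of iterations instead of testing for convergence; the points are hypothetical.

```python
from math import dist

def kmeans(points, centroids, iterations=10):
    """Plain K-means: assign each point to its nearest centroid, then recompute centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            idx = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[idx].append(p)
        # New centroid = coordinate-wise mean; an empty cluster keeps its old centroid.
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical 2-D points forming two visible groups.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8.5, 9.5)]
centroids, clusters = kmeans(points, centroids=[(1, 1), (8, 8)])
print(centroids)
print(clusters)
```

The result depends on the starting centroids, which is why practical implementations typically run several random initializations and keep the best one.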
  6. Multivariate querying: Multivariate querying is about querying or finding similar objects. Some of the following ML methods could be used for such problems:
    • Nearest neighbors
    • Range search
    • Farthest neighbors
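The three query types above can be sketched with brute-force scans over a small point set. The function names and data are hypothetical; for large data sets, tree-based or approximate indexes are normally used instead of a linear scan.

```python
from math import dist

def nearest_neighbor(points, query):
    """Point with the smallest Euclidean distance to the query."""
    return min(points, key=lambda p: dist(p, query))

def farthest_neighbor(points, query):
    """Point with the largest Euclidean distance to the query."""
    return max(points, key=lambda p: dist(p, query))

def range_search(points, query, radius):
    """All points within `radius` of the query."""
    return [p for p in points if dist(p, query) <= radius]

points = [(0, 0), (1, 1), (2, 2), (5, 5), (9, 9)]
query = (1.2, 0.9)

print(nearest_neighbor(points, query))   # (1, 1)
print(farthest_neighbor(points, query))  # (9, 9)
print(range_search(points, query, 2.0))  # [(0, 0), (1, 1), (2, 2)]
```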
  7. Density estimation: Density estimation problems are concerned with finding the likelihood or frequency of objects. In probability and statistics, density estimation is the construction of an estimate, based on observed data, of an unobservable underlying probability density function. Some of the following ML methods could be used for solving density estimation tasks:
    • Kernel density estimation (Higher accuracy)
    • Mixture of Gaussians
    • Density estimation tree
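As a sketch of kernel density estimation, the following builds a one-dimensional estimate with a Gaussian kernel. The bandwidth of 0.5 and the samples are arbitrary illustrative choices; in practice the bandwidth is a hyperparameter that needs tuning.

```python
from math import exp, pi, sqrt

def gaussian_kde(samples, bandwidth):
    """Return a function estimating the density as an average of Gaussian bumps."""
    norm = 1.0 / (len(samples) * bandwidth * sqrt(2 * pi))

    def density(x):
        return norm * sum(exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return density

# Hypothetical observations: four values near 1 and one outlier at 5.
samples = [1.0, 1.2, 0.8, 1.1, 5.0]
density = gaussian_kde(samples, bandwidth=0.5)

# The estimate is higher where the observations are concentrated.
print(density(1.0) > density(3.0))  # True
```

A smaller bandwidth follows the individual samples more closely, while a larger one produces a smoother estimate.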
  8. Dimensionality reduction (feature extraction): As per the Wikipedia page on dimension reduction, dimension reduction is the process of reducing the number of random variables under consideration, and can be divided into feature selection and feature extraction. The following are some of the ML methods that could be used for dimension reduction:
    • Manifold learning/KPCA (Higher accuracy)
    • Principal component analysis
    • Independent component analysis
    • Gaussian graphical models
    • Non-negative matrix factorization
    • Compressed sensing
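Principal component analysis can be sketched for the special case of two features, where the largest eigenvalue of the 2×2 covariance matrix has a closed form. The function name and data are hypothetical, and the sketch assumes the points are not all identical (otherwise the total variance is zero).

```python
from math import hypot, sqrt

def pca_first_component(points):
    """Return (unit direction, explained variance ratio) of the first principal component."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # Entries of the sample covariance matrix [[a, b], [b, c]].
    a = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    c = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    b = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    # Largest eigenvalue of a symmetric 2x2 matrix in closed form.
    lam = (a + c) / 2 + sqrt(((a - c) / 2) ** 2 + b ** 2)
    # Matching eigenvector; fall back to an axis when the features are uncorrelated.
    if b != 0:
        vx, vy = b, lam - a
    else:
        vx, vy = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = hypot(vx, vy)
    return (vx / norm, vy / norm), lam / (a + c)

# Hypothetical points lying close to the line y = x.
points = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.8), (5, 5.1)]
direction, ratio = pca_first_component(points)
print(direction, ratio)
```

Because these points lie almost on a line, a single component retains nearly all of the variance, so the two features could be replaced by one projected coordinate.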
  9. Model selection / Algorithm selection: Often, multiple models are trained using different algorithms. One of the important tasks is to select the best-performing models for deployment in production. Hyperparameter tuning is the most common task performed as part of model selection. Also, if two models trained using different algorithms have similar performance, one also needs to perform algorithm selection.
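One common way to compare candidate models is k-fold cross-validation. The sketch below compares a mean-only baseline with a one-variable linear fit by cross-validated mean squared error. The helper names, the choice of k = 3 with contiguous folds, and the data are assumptions for illustration only.

```python
def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) for k contiguous folds over n rows."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size

def fit_mean(x, y):
    """Baseline model: always predict the training mean."""
    m = sum(y) / len(y)
    return lambda v: m

def fit_line(x, y):
    """Least-squares line through the training points."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return lambda v: my + slope * (v - mx)

def cv_mse(fit, x, y, k=3):
    """Average squared error on held-out folds."""
    errors = []
    for train, test in k_fold_indices(len(x), k):
        model = fit([x[i] for i in train], [y[i] for i in train])
        errors += [(model(x[i]) - y[i]) ** 2 for i in test]
    return sum(errors) / len(errors)

# Hypothetical data with a roughly linear trend.
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]

scores = {"mean baseline": cv_mse(fit_mean, x, y), "linear fit": cv_mse(fit_line, x, y)}
best = min(scores, key=scores.get)
print(scores)
print(best)  # linear fit
```

The same loop can be reused for hyperparameter tuning by treating each hyperparameter setting as a separate candidate.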
  10. Testing and matching: Testing and matching tasks relate to comparing data sets. The following are some of the methods that could be used for such problems:
    • Minimum spanning tree
    • Bipartite cross-matching
    • N-point correlation
  11. Model monitoring: Once the models are trained and deployed, they need to be monitored at regular intervals. Monitoring a model requires processing the actual values and the predicted values and measuring the model performance with appropriate metrics.
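A monitoring check can be as simple as recomputing a metric on recent predictions and comparing it with the value measured at deployment time. In this sketch the metric is accuracy, and the baseline of 0.9 and tolerance of 0.05 are hypothetical values chosen for illustration.

```python
def accuracy(actual, predicted):
    """Fraction of predictions that match the actual labels."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

def needs_retraining(actual, predicted, baseline, tolerance=0.05):
    """Flag the model when live accuracy falls more than `tolerance` below the baseline."""
    return accuracy(actual, predicted) < baseline - tolerance

# Hypothetical labels collected after deployment and the model's predictions for them.
actual = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
predicted = [1, 0, 0, 1, 0, 1, 1, 0, 0, 1]

print(accuracy(actual, predicted))                         # 0.7
print(needs_retraining(actual, predicted, baseline=0.9))   # True
```

For regression models, the same pattern applies with an error metric such as mean squared error and the comparison reversed.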
  12. Model retraining: If the model performance degrades, the models need to be retrained. The following may be done as part of model retraining:
    • New features are determined
    • New algorithms can be used
    • Hyperparameters can be tuned
    • Model ensembles may be deployed



Ajitesh Kumar