# Category Archives: Machine Learning

## Imputing Missing Data using Sklearn SimpleImputer

In this post, you will learn about how to use Python’s Sklearn SimpleImputer for imputing / replacing numerical & categorical missing data using different strategies. In one of the related article posted sometime back, the usage of fillna method of Pandas DataFrame is discussed. Here is the link, Replace missing values with mean, median and mode. Handling missing values is key part of data preprocessing and hence, it is of utmost importance for data scientists / machine learning Engineers to learn different techniques in relation imputing / replacing numerical or categorical missing values with appropriate value based on appropriate strategies. The following topics will be covered in this post: SimpleImputer explained with Python …

## When to use LabelEncoder – Python Example

In this post, you will learn about when to use LabelEncoder. As a data scientist, you must have a clear understanding on when to use LabelEncoder and when to use other encoders such as One-hot Encoder. Using appropriate type of encoders is key part of data preprocessing in machine learning model building lifecycle. Here are some of the scenarios when you could use LabelEncoder without having impact on model. Use LabelEncoder when there are only two possible values of a categorical features. For example, features having value such as yes or no. Or, maybe, gender feature when there are only two possible values including male or female. Use LabelEncoder for …

## Feature Extraction using PCA – Python Example

In this post, you will learn about how to use principal component analysis (PCA) for extracting important features (also termed as feature extraction technique) from a list of given features. As a machine learning / data scientist, it is very important to learn the PCA technique for feature extraction as it helps you visualize the data in the lights of importance of explained variance of data set. The following topics get covered in this post: What is principal component analysis? PCA algorithm for feature extraction PCA Python implementation step-by-step PCA Python Sklearn example What is Principal Component Analysis? Principal component analysis (PCA) is an unsupervised linear transformation technique which is primarily used …

## PCA Explained Variance Concepts with Python Example

In this post, you will learn about the concepts of explained variance which is one of the key concepts related to principal component analysis (PCA). The explained variance concepts will be illustrated with Python code examples. Some of the following topics will be covered: What is explained variance? Python code examples of explained variance What is Explained Variance? Explained variance refers to the variance explained by each of the principal components (eigenvectors). It can be represented as a function of ratio of related eigenvalue and sum of eigenvalues of all eigenvectors. Let’s say that there are N eigenvectors, then the explained variance for each eigenvector (principal component) can be expressed the …

## Why & When to use Eigenvalues & Eigenvectors?

In this post, you will learn about why and when you need to use Eigenvalues and Eigenvectors? As a data scientist / machine learning Engineer, one must need to have a good understanding of concepts related to Eigenvalues and Eigenvectors as these concepts are used in one of the most popular dimensionality reduction technique – Principal Component Analysis (PCA). In PCA, these concepts help in reducing the dimensionality of the data (curse of dimensionality) resulting in the simpler model which is computationally efficient and provides greater generalization accuracy. In this post, the following topics will be covered: Background – Why need Eigenvalues & Eigenvectors? What are Eigenvalues & Eigenvectors? When to …

## Sklearn SelectFromModel for Feature Importance

In this post, you will learn about how to use Sklearn SelectFromModel class for reducing the training / test data set to the new dataset which consists of features having feature importance value greater than a specified threshold value. This method is very important when one is using Sklearn pipeline for creating different stages and Sklearn RandomForest implementation (such as RandomForestClassifier) for feature selection. You may refer to this post to check out how RandomForestClassifier can be used for feature importance. The SelectFromModel usage is illustrated using Python code example. SelectFromModel Python Code Example Here are the steps and related python code for using SelectFromModel. Determine the feature importance using …

## Feature Importance using Random Forest Classifier – Python

In this post, you will learn about how to use Sklearn Random Forest Classifier (RandomForestClassifier) for determining feature importance using Python code example. This will be useful in feature selection by finding most important features when solving classification machine learning problem. It is very important to understand feature importance and feature selection techniques for data scientists to use most appropriate features for training machine learning models. Recall that other feature selection techniques includes L-norm regularization techniques, greedy search algorithms techniques such as sequential backward / sequential forward selection etc. The following are some of the topics covered in this post: Why feature importance? Random Forest for feature importance Using Sklearn RandomForestClassifier for Feature Importance Why …

## Sequential Forward Selection – Python Example

In this post, you will learn about one of feature selection techniques namely sequential forward selection with Python code example. Refer to my earlier post on sequential backward selection technique for feature selection. Sequential forward selection algorithm is a part of sequential feature selection algorithms. Some of the following topics will be covered in this post: Introduction to sequential feature selection algorithms Sequential forward selection algorithm Python example using sequential forward selection Introduction to Sequential Feature Selection Sequential feature selection algorithms including sequential forward selection algorithm belongs to the family of greedy search algorithms which are used to reduce an initial d-dimensional feature space to a k-dimensional feature subspace where k < d. …

## Sequential Backward Feature Selection – Python Example

In this post, you will learn about a feature selection technique called as Sequential Backward Selection using Python code example. Feature selection is one of the key steps in training the most optimal model in order to achieve higher computational efficiency while training the model, and also reduce the the generalization error of the model by removing irrelevant features or noise. Some of the important feature selection techniques includes L-norm regularization and greedy search algorithms such as sequential forward or backward feature selection, especially for algorithms which don’t support regularization. It is of utmost importance for data scientists to learn these techniques in order to build optimal models. Sequential backward …

## MinMaxScaler vs StandardScaler – Python Examples

In this post, you will learn about concepts and differences between MinMaxScaler & StandardScaler with the help of Python code examples. Note that these are classes provided by sklearn.preprocessing module and used for feature scaling purpose. As a data scientist, you will need to learn these concepts in order to train machine learning models using algorithms which requires features to be on the same scale. For algorithms such as random forests and decision trees which are scale invariant, you do not need to use these feature scaling techniques. The following topics are covered in this post: Why is feature scaling needed? Normalization vs Standardization MinMaxScaler for normalization StandardScaler for standardization …

## One-hot Encoding Concepts & Python Code Examples

In this post, you will learn about One-hot Encoding concepts and code examples using Python programming language. One-hot encoding is also called as dummy encoding. In this post, OneHotEncoder class of sklearn.preprocessing will be used in the code examples. As a data scientist or machine learning engineer, you must learn the one-hot encoding techniques as it comes very handy while training machine learning models. Some of the following topics will be covered in this post: One-hot encoding concepts Disadvantages of one-hot encoding Using OneHotEncoder for single categorical feature Using OneHotEncoder & ColumnTransformer for encoding multiple categorical features Using Pandas get_dummies API for one-hot encoding One-Hot Encoding Concepts Simply speaking, one-hot …

## Pandas – Append Columns to Dataframe

In this post, you will learn different techniques to append or add one column or multiple columns to Pandas Dataframe (Python). There are different scenarios where this could come very handy. For example, when there are two or more data frames created using different data sources, and you want to select a specific set of columns from different data frames to create one single data frame, the methods given below can be used to append or add one or more columns to create one single data frame. It will be good to know these methods as it helps in data preprocessing stage of building machine learning models. In this post, …

## LabelEncoder Example – Single & Multiple Columns

In this post, you will learn about LabelEncoder code examples for handling encoding labels related to categorical features of single and multiple columns in Python Pandas Dataframe. The following are some of the points which will get covered: Background What are labels and why encode them? How to use LabelEncoder to encode single & multiple columns (all at once)? When not to use LabelEncoder? Background When working with dataset having categorical features, you come across two different types of features such as the following. Many machine learning algorithms require the categorical data (labels) to be converted or encoded in the numerical or number form. Ordinal features – Features which has …

## Pandas – Fillna method for replacing missing values

In this post, you will learn about how to use fillna method to replace or impute missing values of one or more feature column with central tendency measures in Pandas Dataframe (Python).The central tendency measures which are used to replace missing values are mean, median and mode. Here is a detailed post on how, what and when of replacing missing values with mean, median or mode. This will be helpful in the data preprocessing stage of building machine learning models. Other technique used for filling missing values is backfill or bfill and forward-fill or ffill. Before going further and learn about fillna method, here is the Pandas sample dataframe we will work with. It represents marks in …

## Pandas Dataframe vs Numpy Array: What to Use?

In this post, you will learn about which data structure to use between Pandas Dataframe and Numpy Array when working with Scikit Learn libraries. As a data scientist, it is very important to understand the difference between Numpy array and Pandas Dataframe and when to use which data structure. Here are some facts: Scikit learn was originally developed to work well with Numpy array Numpy Ndarray provides a lot of convenient and optimized methods for performing several mathematical operations on vectors. Numpy array can be instantiated using the following manner: np.array([4, 5, 6]) Pandas Dataframe is an in-memory 2-dimensional tabular representation of data. In simpler words, it can be seen …

## Random Forest Classifier Python Code Example

In this post, you will learn about how to train a Random Forest Classifier using Python Sklearn library. This code will be helpful if you are a beginner data scientist or just want to quickly get code sample to get started with training a machine learning model using Random Forest algorithm. The following topics will be covered: Brief introduction of Random Forest Python code example for training a random forest classifier Brief Introduction to Random Forest Classifier Random forest can be considered as an ensemble of several decision trees. The idea is to aggregate the prediction outcome of multiple decision trees and create a final outcome based on averaging mechanism …