Category Archives: Data Science

K-Fold Cross Validation – Python Example

In this post, you will learn about K-fold cross validation concepts with a Python code example. It is important to learn cross validation concepts in order to perform model tuning with the end goal of choosing a model which has high generalization performance. As a data scientist / machine learning engineer, you must have a good understanding of cross validation concepts in general. The following topics are covered in this post: what and why of K-fold cross validation; when to select what values of K; K-fold cross validation with Python (using cross-validation generators); K-fold cross validation with Python (using cross_val_score). What and Why of K-fold Cross Validation: K-fold cross validation …

Posted in Data Science, Machine Learning, Python.
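
The two approaches named in the excerpt above (a cross-validation generator and cross_val_score) can be sketched roughly as follows. The iris data set, the logistic regression model and K = 5 are assumptions chosen for illustration; they are not taken from the original post.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Assumed toy data set and model, used only for illustration.
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Cross-validation generator: yields (train, test) index arrays for each fold.
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
fold_sizes = [len(test_idx) for _, test_idx in kfold.split(X)]
print("Test fold sizes:", fold_sizes)

# cross_val_score: fits and scores the model once per fold.
scores = cross_val_score(model, X, y, cv=kfold)
print("Accuracy per fold:", scores)
print("Mean accuracy: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))
```

For classification problems with imbalanced classes, StratifiedKFold can be swapped in for KFold so that each fold keeps roughly the same class proportions.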

Sklearn Machine Learning Pipeline – Python Example

In this post, you will learn about the concepts of a machine learning (ML) pipeline and how to build an ML pipeline using the Python Sklearn Pipeline (sklearn.pipeline) package. Knowing how to use sklearn.pipeline effectively for training/testing machine learning models will help automate activities such as feature scaling, feature selection / extraction and training/testing the models. It is recommended that data scientists (Python) get a good understanding of sklearn.pipeline. The following are some of the topics covered in this post: introduction to ML pipeline; Sklearn ML pipeline Python code example. Introduction to ML Pipeline: A machine learning (ML) pipeline, theoretically, represents the different steps, including data transformation and prediction, through which data …

Posted in Data Science, Machine Learning, Python.
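
A pipeline of the kind described in the excerpt above (feature scaling, feature extraction, then a model) might look like the following sketch. The breast cancer data set, the step names and the choice of PCA plus logistic regression are assumptions made for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumed data set; any numeric classification data would do.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Each step is a (name, estimator) pair; the names are arbitrary labels.
pipeline = Pipeline([
    ("scaler", StandardScaler()),    # feature scaling
    ("pca", PCA(n_components=2)),    # feature extraction
    ("clf", LogisticRegression()),   # final estimator
])

# fit() runs fit_transform on every transformer, then fits the classifier.
pipeline.fit(X_train, y_train)
test_accuracy = pipeline.score(X_test, y_test)
print("Test accuracy: %.3f" % test_accuracy)
```

Because the scaler and PCA are fitted only on the training data inside the pipeline, the test set does not leak into the preprocessing steps.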

Imputing Missing Data using Sklearn SimpleImputer

In this post, you will learn how to use Python’s Sklearn SimpleImputer for imputing / replacing numerical and categorical missing data using different strategies. A related article posted some time back discusses the usage of the fillna method of the Pandas DataFrame. Here is the link: Replace missing values with mean, median and mode. Handling missing values is a key part of data preprocessing, and hence it is of utmost importance for data scientists / machine learning engineers to learn different techniques for imputing / replacing numerical or categorical missing values with an appropriate value based on an appropriate strategy. The following topics will be covered in this post: SimpleImputer explained with Python …

Posted in Data Science, Machine Learning, Python.
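
As a small sketch of the idea in the excerpt above, the snippet below imputes a numerical array with the column mean and a categorical column with the most frequent value. The toy arrays are made up for illustration and do not come from the original post.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical numerical data with missing entries (np.nan).
X_num = np.array([[1.0, 10.0],
                  [np.nan, 20.0],
                  [3.0, np.nan]])

# strategy="mean" replaces each missing value with its column mean.
mean_imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_num_filled = mean_imputer.fit_transform(X_num)
print(X_num_filled)

# Hypothetical categorical column; object dtype keeps strings and NaN together.
X_cat = np.array([["red"], ["blue"], [np.nan], ["red"]], dtype=object)

# strategy="most_frequent" also works for string data.
mode_imputer = SimpleImputer(strategy="most_frequent")
X_cat_filled = mode_imputer.fit_transform(X_cat)
print(X_cat_filled.ravel())
```

Other strategies accepted by SimpleImputer include "median" and "constant" (the latter together with fill_value).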

When to use LabelEncoder – Python Example

In this post, you will learn about when to use LabelEncoder. As a data scientist, you must have a clear understanding of when to use LabelEncoder and when to use other encoders such as the one-hot encoder. Using the appropriate type of encoder is a key part of data preprocessing in the machine learning model building lifecycle. Here are some of the scenarios in which you could use LabelEncoder without impacting the model. Use LabelEncoder when there are only two possible values of a categorical feature, for example a feature having values such as yes or no, or a gender feature when there are only two possible values, male and female. Use LabelEncoder for …

Posted in Data Science, Machine Learning, Python.
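
The two-valued case mentioned in the excerpt above can be sketched as follows; the yes/no values are made up for illustration. Note that the scikit-learn documentation describes LabelEncoder as intended for target labels, so for input features OrdinalEncoder or OneHotEncoder is usually the more conventional choice.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical binary categorical values.
answers = ["yes", "no", "no", "yes"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(answers)

# Classes are sorted alphabetically, so "no" maps to 0 and "yes" maps to 1.
print(list(encoder.classes_))
print(list(encoded))

# The mapping can be reversed with inverse_transform.
decoded = encoder.inverse_transform(encoded)
print(list(decoded))
```

With only two categories the integer codes carry no misleading ordering, which is why this case is harmless for most models.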

Feature Extraction using PCA – Python Example

In this post, you will learn how to use principal component analysis (PCA) for extracting important features (also termed a feature extraction technique) from a list of given features. As a machine learning engineer / data scientist, it is very important to learn the PCA technique for feature extraction, as it helps you visualize the data in light of the explained variance of the data set. The following topics are covered in this post: what principal component analysis is; PCA algorithm for feature extraction; PCA Python implementation step-by-step; PCA Python Sklearn example. What is Principal Component Analysis? Principal component analysis (PCA) is an unsupervised linear transformation technique which is primarily used …

Posted in Data Science, Machine Learning, Python.
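
The step-by-step and Sklearn routes listed in the excerpt above can be compared with a short sketch. The iris data set and the choice of two components are assumptions for illustration, and the manual steps follow the standard covariance / eigendecomposition formulation of PCA rather than the post's exact code.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

# Step 1: standardize the features.
X_std = StandardScaler().fit_transform(X)

# Step 2: covariance matrix and its eigendecomposition
# (eigh is suited to symmetric matrices).
cov = np.cov(X_std, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# Step 3: keep the eigenvectors with the two largest eigenvalues.
order = np.argsort(eigvals)[::-1]
W = eigvecs[:, order[:2]]

# Step 4: project the data onto the new 2-dimensional subspace.
X_manual = X_std @ W

# The same projection with Sklearn (components may differ in sign).
X_sklearn = PCA(n_components=2).fit_transform(X_std)
print(X_manual[:3])
print(X_sklearn[:3])
```

The two results agree up to the sign of each component, since an eigenvector multiplied by -1 is still an eigenvector.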

PCA Explained Variance Concepts with Python Example

In this post, you will learn about the concept of explained variance, which is one of the key concepts related to principal component analysis (PCA). The concept will be illustrated with Python code examples. Some of the following topics will be covered: what explained variance is; Python code examples of explained variance. What is Explained Variance? Explained variance refers to the variance explained by each of the principal components (eigenvectors). It can be represented as the ratio of the related eigenvalue to the sum of the eigenvalues of all eigenvectors. Let’s say that there are N eigenvectors; then the explained variance for each eigenvector (principal component) can be expressed the …

Posted in Data Science, Machine Learning, Python.
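
The ratio described in the excerpt above (an eigenvalue divided by the sum of all eigenvalues) can be computed directly, as in this sketch. The iris data set is an assumption for illustration; the result is checked against Sklearn's explained_variance_ratio_ attribute.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_std = StandardScaler().fit_transform(X)

# Eigenvalues of the covariance matrix, sorted from largest to smallest.
eigvals = np.linalg.eigvalsh(np.cov(X_std, rowvar=False))[::-1]

# Explained variance ratio: eigenvalue_j / sum of all eigenvalues.
explained_ratio = eigvals / eigvals.sum()
cumulative_ratio = np.cumsum(explained_ratio)
print("Explained variance ratio:", explained_ratio)
print("Cumulative:", cumulative_ratio)

# Sklearn exposes the same quantity after fitting PCA.
sklearn_ratio = PCA().fit(X_std).explained_variance_ratio_
print("Sklearn:", sklearn_ratio)
```

The cumulative ratio is commonly used to decide how many principal components to keep.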

Eigenvalues & Eigenvectors with Python Examples

In this post, you will learn how to calculate eigenvalues and eigenvectors using Python code examples. Before going ahead with the code examples, you may want to check out this post on when and why to use eigenvalues and eigenvectors. As a machine learning engineer / data scientist, you must get a good understanding of eigenvalue / eigenvector concepts, as they prove to be very useful in feature extraction techniques such as principal component analysis. The Python Numpy package is used for illustration purposes. The following topics are covered in this post: creating eigenvectors / eigenvalues using the Numpy linalg module; re-creating the original transformation matrix from eigenvalues and eigenvectors. Creating Eigenvectors / Eigenvalues using Numpy: In …

Posted in Data Science, Python.
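
Both topics in the excerpt above can be sketched with Numpy's linalg module. The 2 x 2 matrix below is a made-up example, not the one used in the original post.

```python
import numpy as np

# Hypothetical transformation matrix.
A = np.array([[4.0, 1.0],
              [2.0, 3.0]])

# eig returns the eigenvalues and a matrix whose columns are the eigenvectors.
eigenvalues, eigenvectors = np.linalg.eig(A)
print("Eigenvalues:", eigenvalues)
print("Eigenvectors (as columns):")
print(eigenvectors)

# Each pair satisfies A v = lambda v.
for lam, v in zip(eigenvalues, eigenvectors.T):
    print(np.allclose(A @ v, lam * v))

# Re-create the original matrix: A = V diag(lambda) V^-1.
A_rebuilt = eigenvectors @ np.diag(eigenvalues) @ np.linalg.inv(eigenvectors)
print(A_rebuilt)
```

The reconstruction works whenever the eigenvector matrix is invertible, that is, when the matrix is diagonalizable.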

Why & When to use Eigenvalues & Eigenvectors?

In this post, you will learn about why and when you need to use eigenvalues and eigenvectors. As a data scientist / machine learning engineer, you need a good understanding of the concepts related to eigenvalues and eigenvectors, as these concepts are used in one of the most popular dimensionality reduction techniques, principal component analysis (PCA). In PCA, these concepts help in reducing the dimensionality of the data (mitigating the curse of dimensionality), resulting in a simpler model which is computationally efficient and provides greater generalization accuracy. In this post, the following topics will be covered: Background: why do we need eigenvalues and eigenvectors? What are eigenvalues and eigenvectors? When to …

Posted in Data Science, Machine Learning.

Standard Deviation of Population & Sample – Python

In this post, you will learn about the statistical concept of standard deviation with the help of a Python code example. The following topics are covered in this post: what standard deviation is; different techniques for calculating standard deviation; standard deviation of population vs sample. What is Standard Deviation? The standard deviation (SD) of a data set is a measure of how spread out the data is. Take a look at the following example using two different samples of 4 numbers whose means are the same but whose standard deviations (data spread) are different. Here is the code for calculating the mean of the above sample. One can either write Python code …

Posted in Data Science, Python, statistics.
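
The population vs sample distinction from the excerpt above comes down to dividing the sum of squared deviations by N or by N - 1. The sketch below uses two made-up samples of 4 numbers with the same mean; the numbers are not those of the original post.

```python
import statistics

import numpy as np

# Two hypothetical samples with the same mean but different spread.
sample_a = [4, 5, 5, 6]
sample_b = [1, 3, 7, 9]
print(statistics.mean(sample_a), statistics.mean(sample_b))

# Population standard deviation: divide by N.
pop_sd_a = statistics.pstdev(sample_a)
pop_sd_b = np.std(sample_b)            # numpy defaults to ddof=0 (population)

# Sample standard deviation: divide by N - 1.
sample_sd_a = statistics.stdev(sample_a)
sample_sd_b = np.std(sample_b, ddof=1)

print("Population SD:", pop_sd_a, pop_sd_b)
print("Sample SD:", sample_sd_a, sample_sd_b)
```

The sample version is always slightly larger, and the gap shrinks as the number of observations grows.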

Machine Learning – Feature Selection vs Feature Extraction

In this post, you will learn about the difference between feature extraction and feature selection concepts and techniques. Both feature selection and feature extraction are used for dimensionality reduction, which is key to reducing model complexity and overfitting. Dimensionality reduction is one of the most important aspects of training machine learning models. As a data scientist, you must get a good understanding of dimensionality reduction techniques such as feature extraction and feature selection. In this post, the following topics will be covered: feature selection concepts and techniques; feature extraction concepts and techniques; when to use feature selection and feature extraction. Feature Selection Concepts & Techniques: Simply speaking, feature selection is …

Posted in Data Science, Machine Learning.

Sklearn SelectFromModel for Feature Importance

In this post, you will learn how to use the Sklearn SelectFromModel class for reducing the training / test data set to a new data set which consists of the features having a feature importance value greater than a specified threshold value. This method is very important when one is using a Sklearn pipeline for creating different stages and a Sklearn random forest implementation (such as RandomForestClassifier) for feature selection. You may refer to this post to check out how RandomForestClassifier can be used for feature importance. SelectFromModel usage is illustrated using a Python code example. SelectFromModel Python Code Example: Here are the steps and related Python code for using SelectFromModel. Determine the feature importance using …

Posted in Data Science, Machine Learning, Python.
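
A minimal sketch of the threshold-based reduction described in the excerpt above is given below. The wine data set, the threshold of 0.1 and the forest settings are assumptions picked for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = load_wine(return_X_y=True)

# SelectFromModel fits a clone of the estimator and keeps the features
# whose importance is at least the threshold.
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=0),
    threshold=0.1,
)
selector.fit(X, y)

importances = selector.estimator_.feature_importances_
mask = selector.get_support()          # boolean mask of the kept features
X_selected = selector.transform(X)     # reduced data set

print("Original shape:", X.shape)
print("Reduced shape:", X_selected.shape)
print("Kept feature indices:", [i for i, keep in enumerate(mask) if keep])
```

Because SelectFromModel is a transformer, it can also be placed as a step inside a Sklearn Pipeline ahead of the final estimator.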

Feature Importance using Random Forest Classifier – Python

In this post, you will learn how to use the Sklearn random forest classifier (RandomForestClassifier) for determining feature importance using a Python code example. This will be useful in feature selection by finding the most important features when solving a classification machine learning problem. It is very important for data scientists to understand feature importance and feature selection techniques in order to use the most appropriate features for training machine learning models. Recall that other feature selection techniques include L-norm regularization techniques and greedy search algorithms such as sequential backward / sequential forward selection. The following are some of the topics covered in this post: why feature importance; random forest for feature importance; using Sklearn RandomForestClassifier for feature importance. Why …

Posted in Data Science, Machine Learning, Python.
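
As a sketch of the technique named in the excerpt above, the snippet below fits a RandomForestClassifier and ranks features by its feature_importances_ attribute. The wine data set and the forest settings are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

data = load_wine()
X, y = data.data, data.target

forest = RandomForestClassifier(n_estimators=200, random_state=1)
forest.fit(X, y)

# Impurity-based importances, normalized so that they sum to 1.
importances = forest.feature_importances_

# Rank features from most to least important.
ranking = np.argsort(importances)[::-1]
for rank, idx in enumerate(ranking[:5], start=1):
    print("%d) %-30s %.3f" % (rank, data.feature_names[idx], importances[idx]))
```

Impurity-based importances can be biased toward high-cardinality features; permutation importance is a common cross-check.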

Sequential Forward Selection – Python Example

In this post, you will learn about one of the feature selection techniques, namely sequential forward selection, with a Python code example. Refer to my earlier post on the sequential backward selection technique for feature selection. The sequential forward selection algorithm is part of the family of sequential feature selection algorithms. Some of the following topics will be covered in this post: introduction to sequential feature selection algorithms; sequential forward selection algorithm; Python example using sequential forward selection. Introduction to Sequential Feature Selection: Sequential feature selection algorithms, including the sequential forward selection algorithm, belong to the family of greedy search algorithms which are used to reduce an initial d-dimensional feature space to a k-dimensional feature subspace where k < d. …

Posted in Data Science, Machine Learning, Python.
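
One way to try sequential forward selection is Sklearn's SequentialFeatureSelector (available from scikit-learn 0.24), sketched below. This may not be the library used in the original post; the wine data set, the k-nearest neighbors model and k = 3 target features are assumptions for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)   # d = 13 features

# Scaling lives inside the estimator so each CV split is scaled separately.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

# Start from an empty set and greedily add the feature that most improves
# the cross-validated score, until k = 3 features are selected.
sfs = SequentialFeatureSelector(
    model, n_features_to_select=3, direction="forward", cv=5
)
sfs.fit(X, y)

selected_mask = sfs.get_support()
X_subset = sfs.transform(X)
print("Selected feature indices:", [i for i, s in enumerate(selected_mask) if s])
print("Reduced shape:", X_subset.shape)
```

Being greedy, the search is not guaranteed to find the best 3-feature subset, but it needs far fewer model fits than an exhaustive search.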

Sequential Backward Feature Selection – Python Example

In this post, you will learn about a feature selection technique called sequential backward selection using a Python code example. Feature selection is one of the key steps in training an optimal model: it helps achieve higher computational efficiency while training the model and also reduces the generalization error of the model by removing irrelevant features or noise. Some of the important feature selection techniques include L-norm regularization and greedy search algorithms such as sequential forward or backward feature selection, the latter being especially useful for algorithms which don’t support regularization. It is of utmost importance for data scientists to learn these techniques in order to build optimal models. Sequential backward …

Posted in Data Science, Machine Learning, Python.
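
To make the greedy backward search concrete, here is a hand-written sketch of the loop: start with all features and repeatedly drop the one whose removal hurts the cross-validated score least. It is a simplified illustration, not the original post's implementation; the wine data set, the k-nearest neighbors model, the helper name cv_score and the target of 5 features are all assumptions.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))


def cv_score(feature_idx):
    """Mean 5-fold accuracy using only the given feature columns."""
    return cross_val_score(model, X[:, feature_idx], y, cv=5).mean()


remaining = list(range(X.shape[1]))   # start with all d features
target_k = 5                          # stop at a k-dimensional subspace

while len(remaining) > target_k:
    # Build every subset that drops exactly one of the remaining features.
    candidates = [[f for f in remaining if f != drop] for drop in remaining]
    scores = [cv_score(subset) for subset in candidates]
    # Keep the subset with the best score, i.e. drop the least useful feature.
    remaining = candidates[int(np.argmax(scores))]

print("Selected feature indices:", remaining)
print("CV accuracy with selected features: %.3f" % cv_score(remaining))
```

Sklearn's SequentialFeatureSelector with direction="backward" offers a packaged version of the same idea.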

MinMaxScaler vs StandardScaler – Python Examples

In this post, you will learn about the concepts of and differences between MinMaxScaler and StandardScaler with the help of Python code examples. Note that these are classes provided by the sklearn.preprocessing module and used for feature scaling. As a data scientist, you will need to learn these concepts in order to train machine learning models using algorithms which require features to be on the same scale. For algorithms such as random forests and decision trees, which are scale invariant, you do not need to use these feature scaling techniques. The following topics are covered in this post: why feature scaling is needed; normalization vs standardization; MinMaxScaler for normalization; StandardScaler for standardization …

Posted in Data Science, Machine Learning, Python.
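
The difference between the two scalers from the excerpt above shows up clearly on a tiny made-up column of values; the data below is hypothetical.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical single feature.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

# Normalization: (x - min) / (max - min), which maps values into [0, 1].
X_minmax = MinMaxScaler().fit_transform(X)
print(X_minmax.ravel())

# Standardization: (x - mean) / std, giving zero mean and unit variance.
X_standard = StandardScaler().fit_transform(X)
print(X_standard.ravel())
print("mean:", X_standard.mean(), "std:", X_standard.std())
```

In practice the scaler should be fitted on the training data only and then applied to the test data with transform.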

One-hot Encoding Concepts & Python Code Examples

In this post, you will learn about one-hot encoding concepts and code examples using the Python programming language. One-hot encoding is also called dummy encoding. In this post, the OneHotEncoder class of sklearn.preprocessing will be used in the code examples. As a data scientist or machine learning engineer, you must learn the one-hot encoding technique, as it comes in very handy while training machine learning models. Some of the following topics will be covered in this post: one-hot encoding concepts; using OneHotEncoder for a single categorical feature; using OneHotEncoder and ColumnTransformer for encoding multiple categorical features; using the Pandas get_dummies API for one-hot encoding. One-Hot Encoding Concepts: Simply speaking, one-hot encoding is a technique …

Posted in Data Science, Machine Learning, Python.
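
The single-feature case from the excerpt above can be sketched with OneHotEncoder as follows; the color values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical feature as a 2-D column (n_samples, 1).
colors = np.array([["red"], ["green"], ["blue"], ["green"]])

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; toarray() makes it dense.
one_hot = encoder.fit_transform(colors).toarray()

# Categories are sorted, so the columns are: blue, green, red.
print(encoder.categories_[0])
print(one_hot)
```

Each row contains a single 1 in the column of its category, so no artificial ordering is introduced between the categories.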