In this post, you will learn about one-hot encoding concepts, with code examples in Python. One-hot encoding is also called dummy encoding. The code examples use the OneHotEncoder class of sklearn.preprocessing. As a data scientist or machine learning engineer, you will find one-hot encoding very handy when training machine learning models on data with categorical features. The following topics are covered in this post:
- One-hot encoding concepts
- Using OneHotEncoder for a single categorical feature
- Using OneHotEncoder & ColumnTransformer for encoding multiple categorical features
- Using Pandas get_dummies API for one-hot encoding
One-Hot Encoding Concepts
Simply speaking, one-hot encoding is a technique for converting a categorical feature with K distinct string labels into K numerical features, such that in every row exactly one of the K features (one-of-K) has the value 1 and the remaining K-1 features have the value 0. It is also called dummy encoding because the features it creates are dummy features that don’t represent any real-world feature; they exist only to encode the different values of the categorical feature numerically. The primary reason for using one-hot encoding is to convert categorical features into numerical features so that machine learning libraries can use the values to train a model. Although some machine learning libraries perform this conversion internally, it is recommended to convert categorical features explicitly into numerical (dummy) features. Let’s understand the concept using an example.
Consider a Pandas data frame with three features: gender, weight and degree. Two of these features, gender and degree, have non-numerical values. They are categorical features and need to be converted into numerical features.
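A data frame of this kind can be sketched as follows. The row values are assumptions made purely for illustration; only the three column names come from the description above.

```python
import pandas as pd

# Hypothetical sample data; the values are assumed for illustration only.
df = pd.DataFrame({
    "gender": ["male", "female", "female", "male", "female"],
    "weight": [72.5, 58.0, 63.2, 80.1, 55.4],
    "degree": ["graduate", "highschool", "postgraduate", "graduate", "highschool"],
})

# Columns that do not hold numbers are the categorical features to encode.
categorical_columns = df.select_dtypes(exclude="number").columns.tolist()
print(categorical_columns)
```

Selecting the non-numeric columns is one simple way to find candidates for one-hot encoding; numeric columns such as weight are left as they are.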
When this data frame is one-hot encoded, the categorical feature gender is transformed into two dummy features, gender_male and gender_female. In the same manner, the categorical feature degree is transformed into three dummy features: degree_graduate, degree_highschool and degree_postgraduate. Note that in every row, only one of the dummy features belonging to a given original feature has the value 1; the others have the value 0. For example, when degree_graduate is 1, the two related features degree_highschool and degree_postgraduate are both 0.
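To make the one-of-K idea concrete, here is a minimal sketch that builds the dummy features by hand with NumPy, without any encoding library. The degree values are assumed for illustration.

```python
import numpy as np

# Hypothetical degree values, assumed for illustration.
degree = np.array(["graduate", "highschool", "postgraduate", "graduate"])

# The K distinct labels, sorted alphabetically (K = 3 here).
categories = np.unique(degree)

# Compare every value against every category: each row gets a 1 in the
# column of its own category and a 0 in the other K-1 columns.
one_hot = (degree[:, None] == categories[None, :]).astype(int)

for name, column in zip(categories, one_hot.T):
    print(f"degree_{name}: {column}")

# Every row contains exactly one 1.
print(one_hot.sum(axis=1))
```

In practice you would use one of the library approaches described in the following sections; this sketch only shows what they compute.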
In the following sections, you will learn how to use the OneHotEncoder class of sklearn.preprocessing to one-hot encode categorical features.
OneHotEncoder for Single Categorical Feature
One-hot encoding of a single categorical feature can be achieved using OneHotEncoder. The following code example illustrates the transformation of the categorical feature gender, which has two values. Note the following:
- When OneHotEncoder is instantiated with no arguments, the gender value is converted into two feature columns.
- When OneHotEncoder is instantiated with drop='first', the first dummy feature is dropped. No information is lost, because a row in which all the remaining dummy features are 0 represents the category that was dropped. This is used to avoid multi-collinearity, which can be an issue for certain methods (for instance, methods that require matrix inversion).
Here is the code sample for OneHotEncoder:
    from sklearn.preprocessing import OneHotEncoder

    # df is the data frame described earlier, with a gender column

    # Instantiate OneHotEncoder with default arguments
    ohe = OneHotEncoder()
    # One-hot encode the gender column (reshaped to 2-D, as the encoder expects)
    ohe.fit_transform(df.gender.values.reshape(-1, 1)).toarray()

    # OneHotEncoder with drop set to 'first'
    ohe = OneHotEncoder(drop='first')
    # One-hot encode the gender column, dropping the first dummy feature
    ohe.fit_transform(df.gender.values.reshape(-1, 1)).toarray()
Figure: the OneHotEncoder code sample executed in a Jupyter Notebook.
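As a self-contained sketch of the two cases described in this section, the snippet below encodes a small gender column with and without drop='first'. The data values are assumed for illustration, and the checks at the end are one way to see why the full encoding is redundant.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data, assumed for illustration.
df = pd.DataFrame({"gender": ["male", "female", "female", "male"]})
X = df[["gender"]]  # a one-column DataFrame, since the encoder expects 2-D input

# Full encoding: one column per category, in the order of ohe.categories_.
ohe = OneHotEncoder()
full = ohe.fit_transform(X).toarray()
print(ohe.categories_)
print(full.shape)

# The two columns always sum to 1, so each one is determined by the other.
rows_sum_to_one = np.allclose(full.sum(axis=1), 1.0)
print(rows_sum_to_one)

# drop='first' removes the column of the first category ('female' here).
ohe_drop = OneHotEncoder(drop="first")
reduced = ohe_drop.fit_transform(X).toarray()
print(reduced.shape)

# The single remaining column equals the 'male' column of the full encoding.
same_column = np.array_equal(reduced[:, 0], full[:, 1])
print(same_column)
```

Because the columns of the full encoding add up to a constant, they are linearly dependent on an intercept term; dropping one column removes that dependency.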
OneHotEncoder can be used in the same way to transform the degree feature, which has more than two values.
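A minimal sketch for the degree feature might look like the following. The data values are assumed for illustration; inverse_transform is used at the end to show that the all-zero rows still identify the dropped category.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data, assumed for illustration.
df = pd.DataFrame({"degree": ["graduate", "highschool", "postgraduate", "graduate"]})

# Full encoding: three categories give three dummy columns.
ohe = OneHotEncoder()
encoded = ohe.fit_transform(df[["degree"]]).toarray()
print(ohe.categories_[0])
print(encoded)

# With drop='first', the column of the first category ('graduate') is removed,
# so 'graduate' rows are encoded as all zeros.
ohe_drop = OneHotEncoder(drop="first")
encoded_drop = ohe_drop.fit_transform(df[["degree"]]).toarray()
print(encoded_drop)

# The reduced encoding can still be mapped back to the original labels.
recovered = ohe_drop.inverse_transform(encoded_drop)
print(recovered.ravel())
```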
ColumnTransformer & OneHotEncoder for Multiple Categorical Features
When there is a need for encoding multiple categorical features, OneHotEncoder can be used with ColumnTransformer. ColumnTransformer applies transformers to columns of an array or pandas DataFrame. The ColumnTransformer estimator allows different columns or column subsets of the input to be transformed separately and the features generated by each transformer will be concatenated to form a single feature space.
Here is a code sample that uses the ColumnTransformer class from the sklearn.compose module to transform one or more categorical features with OneHotEncoder:
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder

    # Encode gender and degree; pass the remaining columns through unchanged
    ct = ColumnTransformer([('one-hot-encoder', OneHotEncoder(), ['gender', 'degree'])],
                           remainder='passthrough')

    # For OneHotEncoder with drop='first', the code would look like the following
    ct2 = ColumnTransformer([('one-hot-encoder', OneHotEncoder(drop='first'), ['gender', 'degree'])],
                            remainder='passthrough')

    # Execute fit_transform
    ct.fit_transform(df)
Figure: the ColumnTransformer code sample executed in a Jupyter Notebook.
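The following self-contained sketch runs the same idea end to end on a small data frame whose values are assumed for illustration. Setting sparse_threshold=0 is a choice made here so that the result is always a dense NumPy array that is easy to inspect; it is not part of the code sample above.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data, assumed for illustration.
df = pd.DataFrame({
    "gender": ["male", "female", "female", "male"],
    "weight": [72.5, 58.0, 63.2, 80.1],
    "degree": ["graduate", "highschool", "postgraduate", "graduate"],
})

ct = ColumnTransformer(
    [("one-hot-encoder", OneHotEncoder(), ["gender", "degree"])],
    remainder="passthrough",   # keep weight unchanged
    sparse_threshold=0,        # always return a dense array
)
encoded = ct.fit_transform(df)

# Encoded columns come first (2 for gender, 3 for degree), then weight.
print(encoded.shape)
print(encoded[0])

# Same transformer with drop='first': 1 + 2 dummy columns, then weight.
ct_drop = ColumnTransformer(
    [("one-hot-encoder", OneHotEncoder(drop="first"), ["gender", "degree"])],
    remainder="passthrough",
    sparse_threshold=0,
)
encoded_drop = ct_drop.fit_transform(df)
print(encoded_drop.shape)
```

Note that the passthrough columns are appended after the transformed ones, so the column order of the output differs from the column order of the input data frame.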
Pandas get_dummies API for One-Hot Encoding
The Pandas get_dummies API can also be used to transform one or more categorical features into dummy numerical features. It is one of the most popular ways of doing one-hot encoding because the API is so simple to use. Here is the code sample:
    import pandas as pd

    # Transform the gender and degree features using one-hot encoding
    pd.get_dummies(df, columns=['gender', 'degree'])

    # Same transformation, dropping the first dummy feature of each column
    pd.get_dummies(df, columns=['gender', 'degree'], drop_first=True)
Figure: the get_dummies code sample and its output.
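As a self-contained sketch, the snippet below applies get_dummies to a small data frame whose values are assumed for illustration. Passing dtype=int is a choice made here so that the dummy columns contain 0/1 integers regardless of the pandas version; the default dtype of the dummy columns differs between versions.

```python
import pandas as pd

# Hypothetical data, assumed for illustration.
df = pd.DataFrame({
    "gender": ["male", "female", "female", "male"],
    "weight": [72.5, 58.0, 63.2, 80.1],
    "degree": ["graduate", "highschool", "postgraduate", "graduate"],
})

# Full one-hot encoding: untouched columns first, then the dummy columns.
dummies = pd.get_dummies(df, columns=["gender", "degree"], dtype=int)
print(dummies.columns.tolist())

# drop_first=True removes the first dummy column of each encoded feature.
dummies_drop = pd.get_dummies(
    df, columns=["gender", "degree"], drop_first=True, dtype=int
)
print(dummies_drop.columns.tolist())
```

Unlike OneHotEncoder, get_dummies does not remember the categories it saw, so the same call on new data with different labels can produce different columns.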
Here is the summary of this post:
- One-hot encoding can be used to transform one or more categorical features into numerical dummy features that are useful for training machine learning models.
- One-hot encoding is also called dummy encoding because the transformation of categorical features results in dummy features.
- OneHotEncoder class of sklearn.preprocessing module is used for one-hot encoding.
- ColumnTransformer class of sklearn.compose can be used for transforming multiple categorical features.
- Pandas get_dummies can be used for one-hot encoding.