In linear regression, dummy variables are used to represent the categorical variables in the model. There are a few different ways that dummy variables can be created, and we will explore a few of them in this blog post. We will also take a look at some examples to help illustrate how dummy variables work. We will also understand concepts related to the dummy variable trap. By the end of this post, you should have a better understanding of how to use dummy variables in linear regression models. As a data scientist, it is important to understand how to use linear regression and dummy variables.
While training linear regression models, if all of the predictor variables are quantitative or numerical in nature, it does not pose any problem. However, if there are one or more predictor variables that are categorical in nature, that’s where the challenge comes in. For example, consider a linear regression model to predict the price of a house. Some of the predictor variables in the model could be the size of the house, number of bedrooms, number of bathrooms, location, etc. While size and number of bedrooms are numerical variables that can take on any value, location is a categorical variable. It could take on values like the city center, suburbs, countryside, etc. In order to use this categorical variable in the linear regression model, we need some mechanism. This is where dummy variables come into the picture.
Dummy variables are often used in linear regression models to represent categorical variables or qualitative variables. Consider a scenario when a predictor variable has only two possible values. For example, let’s say there is a predictor variable such as does_smoke which has only two possible values such as the person smokes or does not smoke. For such a scenario, a variable known as a dummy variable is created which takes on values such as the following:
does_smoke = 1 if the person smokes and 0 if the person does not smoke.
Based on the above the model is trained and linear regression is applied. One can also use another coding scheme such as -1 and 1 instead of 0/1. Thus, the following can also hold good:
does_smoke = 1 if the person smokes and -1 if the person does not smoke.
Dummy variables are also used when there are more than two possible values for a predictor variable. The technique is called one-hot encoding. In this technique, a new dummy variable is created for each unique value of the predictor variable. For a particular record, the value of one except all dummy variables will be 1. The ordering of dummy variables doesn’t matter. This is also demonstrated in the previous example in relation to gender having value such as male or female. For example, let’s say we have a variable such as education that can take on the following values:
In this case, we will create three dummy variables for three unique values of education such as high school, graduate, and post-graduate. At any point in time, only one of the dummy variables shown below will be having the value of 1. Others will have the 0 value.
One-hot encoding has advantages over other techniques for creating dummy variables.
A dummy variable trap is defined as the scenario when you include all the dummy variables in the linear regression model even though some of the dummy variables are highly correlated. This leads to multicollinearity. For example, let’s say that the gender can have dummy variables such as the following:
Given that gender can only take above two values, you may note that if the value of is_male is 0, one can predict the value of is_female as 1. This results in redundancy. To avoid this issue, one of the dummy variables such as is_male or is_female can be dropped from the linear regression model.
The following represents some of the techniques one can use to transform categorical variables into numerical ones using dummy variables in Python.
The following represents some of the techniques one can use to transform categorical variables into numerical ones using dummy variables in R.
Dummy variables are used in linear regression models to represent categorical variables or qualitative variable. In linear regression, dummy variables are used to represent the categorical variables in the model. There are a few different ways that dummy variables can be created, and we explored a few of them in this blog post. One should also pay attention to dummy variable trap when working with dummy variables for training regression models. We also looked at some examples to help illustrate how dummy variables work including functions from Python and R. In case you have further questions, please feel free to reach out!
Artificial Intelligence (AI) agents have started becoming an integral part of our lives. Imagine asking…
In the ever-evolving landscape of agentic AI workflows and applications, understanding and leveraging design patterns…
In this blog, I aim to provide a comprehensive list of valuable resources for learning…
Have you ever wondered how systems determine whether to grant or deny access, and how…
What revolutionary technologies and industries will define the future of business in 2025? As we…
For data scientists and machine learning researchers, 2024 has been a landmark year in AI…