[DAY 2] #66DAYSOFDATA: THE DUMMY VARIABLE TRAP


On Day 1 of this #66DaysOfData, I used OneHot Encoding on the locations in my housing prices data set. This pivots each location out into a separate column and uses a 1 or 0 to mark whether a row belongs to that location. One of the things that came up in the tutorial was the issue of The Dummy Variable Trap, and to be honest, it was the first time I had heard of it.

Categorical Data

First, it’s important to understand the two types of categorical variables in statistics, nominal and ordinal variables. 

Nominal variables are used to represent categories that do not have any inherent order. For example, a nominal variable could be type of fruit, where the categories are “apple,” “banana,” and “orange.” It does not make sense to order these categories because they do not have any inherent rank or level of magnitude.

Ordinal variables, on the other hand, are used to represent categories that do have a specific order. For example, an ordinal variable could be the level of education, where the categories are “high school,” “college,” and “graduate school.” In this case, there is an inherent order to the categories, with “graduate school” being the highest level of education and “high school” being the lowest.

Label Encoding vs OneHot Encoding

Knowing these two types of categorical variables, how can we use them in a machine-learning model? Most machine-learning models can’t handle strings, so we have two options:

  • Label Encoding
  • OneHot Encoding

Label Encoding assigns a numeric label to each category in a categorical variable. It is commonly used for ordinal variables because the numbers can preserve the order of the categories. In our level of education example, we could assign the label 1 to “high school,” 2 to “college,” and 3 to “graduate school” as a way to show the hierarchy between the levels of education.
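As a minimal sketch of that mapping in plain Python: the `education_order` list and the sample values below are made up for illustration. The ranking is spelled out by hand because an encoder that assigns labels alphabetically would not reproduce it.

```python
# Categories listed from lowest to highest level (order chosen by hand).
education_order = ["high school", "college", "graduate school"]

# Map each category to its rank, starting at 1 as in the example above.
label_for = {level: rank for rank, level in enumerate(education_order, start=1)}

# Hypothetical sample column to encode.
samples = ["college", "high school", "graduate school", "college"]
encoded = [label_for[level] for level in samples]

print(encoded)  # [2, 1, 3, 2]
```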

OneHot Encoding, on the other hand, is commonly used for nominal variables because it does not assume any order to the categories. It creates a new binary variable, called a dummy variable, for each category in a categorical variable. Each dummy variable uses 1 or 0 to indicate whether that category is present or absent. For example, we could create three binary variables for the fruit example: “is_apple,” “is_banana,” and “is_orange.” Each variable would take on a value of 0 or 1, depending on whether the fruit is an apple, banana, or orange.
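To make the idea concrete, here is a small sketch in plain Python. The `one_hot` helper and the `fruits` list are hypothetical names used only for this illustration. In a real project a library routine would normally do this job.

```python
def one_hot(values, prefix="is_"):
    """Return {column_name: [0/1, ...]} with one column per distinct category."""
    categories = sorted(set(values))
    return {
        f"{prefix}{category}": [int(value == category) for value in values]
        for category in categories
    }

# Hypothetical sample column.
fruits = ["apple", "banana", "orange", "apple", "banana"]
dummies = one_hot(fruits)

for name, column in dummies.items():
    print(name, column)
# is_apple [1, 0, 0, 1, 0]
# is_banana [0, 1, 0, 0, 1]
# is_orange [0, 0, 1, 0, 0]
```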

Overall, the primary difference between nominal and ordinal variables is whether their categories have an inherent order or not. When encoding categorical variables, we use Label Encoding for ordinal variables to preserve the order of the categories, and OneHot Encoding for nominal variables to avoid making any assumptions about the order.

What is The Dummy Variable Trap? 

Now that we understand why we used OneHot Encoding in our Bangalore Price Prediction Model, what does that have to do with The Dummy Variable Trap?

The dummy variable trap is a problem that can arise when we use OneHot Encoded dummy variables to represent a nominal variable in regression analysis. It happens when we include an intercept and all the dummy variables in the regression, so that one variable can be predicted perfectly from the others. The model then has no unique set of coefficients, which makes the estimated coefficients and standard errors unreliable and affects the significance and interpretation of the results.

To avoid the dummy variable trap, we can drop one of the dummy variables. Common choices are the category with the lowest frequency or the one that makes a natural reference category (baseline); the remaining coefficients are then read relative to the dropped category. By dropping one variable, we remove the perfect prediction, so the coefficients can be estimated uniquely.

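With n categories, the regression model that includes every dummy variable is:

Y = β_0 + β_1 X_1 + β_2 X_2 + … + β_{n−1} X_{n−1} + β_n X_n

where X_1, X_2, …, X_n are the dummy variables representing the categorical variable and β_0, β_1, …, β_n are the coefficients of the model. Because every row belongs to exactly one category, X_1 + X_2 + … + X_n = 1 for every row, which duplicates the intercept term β_0. That redundancy is the trap.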

 

Understanding What Causes the Dummy Trap

 

Let’s look at our fruit example from above. The dataset has a nominal categorical variable “fruit” that can take on three values: “apple”, “banana”, or “orange”. To represent this variable in a regression model, we can use OneHot Encoding, which will create three dummy variables: “is_apple”, “is_banana”, and “is_orange”. These dummy variables will take on a value of 1 or 0, depending on whether the fruit is an apple, banana, or orange.

Here is an example dataset:

Fruit    Weight (oz)   is_apple   is_banana   is_orange
apple    8             1          0           0
banana   6             0          1           0
orange   4             0          0           1
apple    7             1          0           0
banana   5             0          1           0

Now, we want to fit a linear regression model to predict the weight of the fruit based on the dummy variables:

Weight = b0 + b1*is_apple + b2*is_banana + b3*is_orange

If we use all three dummy variables in the model together with the intercept b0, we will fall into the dummy variable trap. This is because if we know that the fruit is not an apple and not an orange, we know that it must be a banana; in other words, is_apple + is_banana + is_orange = 1 for every row. Therefore, we have redundant information in the model, and no single set of coefficients fits the data best.
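The redundancy can be checked directly on the example dataset. The sketch below, in plain Python, first confirms that the three dummies always sum to 1, and then shows two different coefficient sets that give identical predictions. The helper names and the coefficient values are hypothetical, picked only to illustrate the point.

```python
# Rows of the example dataset: (fruit, weight in oz).
rows = [("apple", 8), ("banana", 6), ("orange", 4), ("apple", 7), ("banana", 5)]

def dummies(fruit):
    """Return (is_apple, is_banana, is_orange) for one fruit."""
    return tuple(int(fruit == name) for name in ("apple", "banana", "orange"))

# 1) The three dummies sum to 1 on every row, just like the intercept column.
dummy_sums = [sum(dummies(fruit)) for fruit, _ in rows]
print(dummy_sums)  # [1, 1, 1, 1, 1]

def predict(coefs, fruit):
    """Weight = b0 + b1*is_apple + b2*is_banana + b3*is_orange."""
    b0, b1, b2, b3 = coefs
    is_apple, is_banana, is_orange = dummies(fruit)
    return b0 + b1 * is_apple + b2 * is_banana + b3 * is_orange

# 2) Move any amount from the dummy coefficients into the intercept:
#    the predictions do not change, so the data cannot pick one set.
coefs_a = (0.0, 7.5, 5.5, 4.0)  # illustrative values
shift = 10.0
coefs_b = (coefs_a[0] + shift,) + tuple(b - shift for b in coefs_a[1:])

preds_a = [predict(coefs_a, fruit) for fruit, _ in rows]
preds_b = [predict(coefs_b, fruit) for fruit, _ in rows]
print(preds_a == preds_b)  # True
```

Since `shift` can be any number, there are infinitely many coefficient sets with the same fit.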

 

To avoid the dummy variable trap, we need to drop one of the dummy variables. In this case, we can drop “is_orange” since it has the lowest frequency (1) in the dataset. Orange then becomes the reference category, and the model becomes:

Weight = b0 + b1*is_apple + b2*is_banana

By dropping one of the dummy variables, we avoid the dummy variable trap and the coefficients can be estimated uniquely: b0 is the expected weight of an orange, while b1 and b2 are the differences in weight for apples and bananas relative to oranges.
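For a model that contains only an intercept and the dummies of a single categorical variable, the least-squares fit reproduces the mean of each group. That lets us work out the coefficients of the reduced model by hand. The plain Python sketch below does this for the example dataset; the variable names are my own.

```python
# Rows of the example dataset: (fruit, weight in oz).
rows = [("apple", 8), ("banana", 6), ("orange", 4), ("apple", 7), ("banana", 5)]

# Collect the weights observed for each fruit.
weights_by_fruit = {}
for fruit, weight in rows:
    weights_by_fruit.setdefault(fruit, []).append(weight)

# Mean weight per fruit.
group_means = {fruit: sum(ws) / len(ws) for fruit, ws in weights_by_fruit.items()}

# "orange" is the dropped (reference) category.
b0 = group_means["orange"]        # baseline: mean weight of oranges
b1 = group_means["apple"] - b0    # apples relative to oranges
b2 = group_means["banana"] - b0   # bananas relative to oranges

print(b0, b1, b2)  # 4.0 3.5 1.5
```

If you use pandas, `get_dummies(..., drop_first=True)` performs the drop for you, but it removes the first category rather than the least frequent one, so the reference category may differ from the one chosen here.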

 

Multicollinearity

Say we forgot to drop the “is_orange” dummy variable, and our regression model looks like this:

Weight = b0 + b1*is_apple + b2*is_banana + b3*is_orange

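With all three dummy variables plus the intercept, we encounter perfect multicollinearity: one predictor is an exact linear combination of the others (here, is_apple + is_banana + is_orange = 1). Ordinary least squares then has no unique solution, so depending on the software the fit may fail, a column may be dropped automatically, or one arbitrary solution may be returned.

Even when predictors are highly but not perfectly correlated, multicollinearity in a regression model can cause several issues, such as: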

  • Unreliable coefficient estimates: Highly correlated variables can lead to unstable and unreliable coefficient estimates. This is because the model will struggle to determine which variable is truly contributing to the outcome variable, and which variable is just providing redundant information.
  • Inflated standard errors: Multicollinearity can cause the standard errors of the coefficient estimates to be inflated. This means that the coefficient estimates may appear less significant than they actually are, which can lead to incorrect conclusions about the importance of different variables in the model.
  • Difficulty in interpretation: When multicollinearity exists in a model, it can be difficult to interpret the coefficients of individual variables. This is because the effect of each variable on the outcome variable may be confounded with the effects of other variables in the model. 

In summary, when you use nominal categorical data and OneHot Encoding in a regression model, drop one dummy column per categorical variable (here, the category with the lowest frequency) to avoid the dummy variable trap.
