# Dummy Variables in Regression

A dummy variable (also known as an indicator variable) is a numeric variable that represents categorical data, such as gender, race, political affiliation, etc.

Technically, dummy variables are dichotomous, quantitative variables. Their range of values is small; they can take on only two quantitative values. As a practical matter, regression results are easiest to interpret when dummy variables are limited to two specific values, 1 or 0. Typically, 1 represents the presence of a qualitative attribute, and 0 represents the absence.

## How Many Dummy Variables?

The number of dummy variables required to represent a particular categorical variable depends on
the number of values that the categorical variable can assume. To represent a categorical variable
that can assume *k* different values, a researcher would need to define *k - 1*
dummy variables.

For example, suppose we are interested in political affiliation, a categorical variable that might assume three values - Republican, Democrat, or Independent. We could represent political affiliation with two dummy variables:

- X_{1} = 1, if Republican; X_{1} = 0, otherwise.
- X_{2} = 1, if Democrat; X_{2} = 0, otherwise.

In this example, notice that we don't have to create a dummy variable to represent the "Independent" category
of political affiliation. If X_{1} equals zero and X_{2} equals zero, we know the
voter is neither Republican nor Democrat. Therefore, the voter must be Independent.
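The *k - 1* coding can be sketched in a few lines of Python. This is only an illustration: the helper `encode_dummies` and the five-voter sample are hypothetical, not part of the lesson's data.

```python
def encode_dummies(values, reference):
    """Return k - 1 dummy columns for a categorical variable, omitting `reference`."""
    # Keep the non-reference categories in first-seen order.
    categories = [c for c in dict.fromkeys(values) if c != reference]
    return {c: [1 if v == c else 0 for v in values] for c in categories}


# Hypothetical sample of five voters.
voters = ["Republican", "Democrat", "Independent", "Democrat", "Republican"]
dummies = encode_dummies(voters, reference="Independent")

print(dummies["Republican"])  # [1, 0, 0, 0, 1]
print(dummies["Democrat"])    # [0, 1, 0, 1, 0]
```

The third voter has a 0 in both columns, which is how the Independent category is represented without a dummy variable of its own.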

## Avoid the Dummy Variable Trap

When defining dummy variables, a common mistake is to define too many variables. If a categorical variable
can take on *k* values, it is tempting to define *k* dummy variables. Resist this urge. Remember, you
only need *k - 1* dummy variables.

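A *k*th dummy variable is redundant; it carries no new information. And it creates
a severe multicollinearity problem for the analysis: in a model with an intercept, the *k* dummy
variables always sum to 1, so they are perfectly collinear with the intercept and the regression
coefficients cannot be estimated uniquely. Using *k* dummy variables when only *k - 1* dummy
variables are required is known as the dummy variable trap. Avoid this trap!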
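The redundancy can be checked numerically. The sketch below assumes a model with an intercept and uses a small hypothetical sample of voters. It compares a design matrix with *k* dummy columns against one with *k - 1*, and computes the determinant of X'X in each case. A zero determinant means the least-squares equations have no unique solution. The helper functions `determinant` and `gram` are written here for illustration only.

```python
from fractions import Fraction


def determinant(matrix):
    """Determinant by Gaussian elimination, using exact rational arithmetic."""
    m = [[Fraction(x) for x in row] for row in matrix]
    n = len(m)
    det = Fraction(1)
    for col in range(n):
        pivot = next((r for r in range(col, n) if m[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)  # no pivot available: the matrix is singular
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det
        det *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[col])]
    return det


def gram(rows):
    """Compute X'X for a design matrix given as a list of rows."""
    cols = list(zip(*rows))
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]


# Hypothetical voters; columns are [intercept, Republican, Democrat, Independent].
trap = [[1, 1, 0, 0], [1, 0, 1, 0], [1, 0, 0, 1], [1, 1, 0, 0], [1, 0, 1, 0]]
# Same voters with the Independent dummy dropped (k - 1 dummies).
safe = [row[:3] for row in trap]

# With k dummies, the intercept column equals the sum of the dummy columns.
intercept_is_sum = all(row[0] == sum(row[1:]) for row in trap)

det_trap = determinant(gram(trap))
det_safe = determinant(gram(safe))
print(intercept_is_sum)  # True
print(det_trap)          # 0
print(det_safe)          # 4
```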

## How to Interpret Dummy Variables

Once a categorical variable has been recoded as a dummy variable, the dummy variable can be used in regression analysis just like any other quantitative variable.

For example, suppose we wanted to assess the relationship between household income and political affiliation (i.e., Republican, Democrat, or Independent). The regression equation might be:

Income = b_{0} + b_{1}X_{1} + b_{2}X_{2}

where b_{0}, b_{1}, and b_{2} are regression coefficients. X_{1}
and X_{2} are dummy variables defined as:

- X_{1} = 1, if Republican; X_{1} = 0, otherwise.
- X_{2} = 1, if Democrat; X_{2} = 0, otherwise.

The value of the categorical variable that is *not* represented explicitly by a dummy
variable is called the reference group. In this example, the reference group consists of Independent voters.

In analysis, each dummy variable is compared with the reference group. In this example, a positive regression coefficient means that income is higher for the political affiliation identified by the dummy variable than for the reference group; a negative regression coefficient means that income is lower. If the regression coefficient is statistically significant, the income difference from the reference group is also statistically significant.
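Substituting the dummy values into the regression equation makes the comparison explicit. The lines below are a worked restatement of the income equation for each group, using the same symbols:

```latex
\begin{aligned}
\text{Independent } (X_1 = 0,\ X_2 = 0):&\quad \text{Income} = b_0 \\
\text{Republican } (X_1 = 1,\ X_2 = 0):&\quad \text{Income} = b_0 + b_1 \\
\text{Democrat } (X_1 = 0,\ X_2 = 1):&\quad \text{Income} = b_0 + b_2
\end{aligned}
```

So b_{0} is the predicted income for the reference group, b_{1} is the Republican-minus-Independent difference, and b_{2} is the Democrat-minus-Independent difference.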

## Test Your Understanding

In this section, we work through a simple example to illustrate the use of dummy variables in regression analysis. The example begins with two independent variables - one quantitative and one categorical. Notice that once the categorical variable is expressed in dummy form, the analysis proceeds in routine fashion. The dummy variable is treated just like any other quantitative variable.

**Problem 1**

Consider the table below. It uses three variables to describe 10 students. Two of the variables (Test score and IQ) are quantitative. One of the variables (Gender) is categorical.

| Student | Test score | IQ | Gender |
|---|---|---|---|
| 1 | 93 | 125 | Male |
| 2 | 86 | 120 | Female |
| 3 | 96 | 115 | Male |
| 4 | 81 | 110 | Female |
| 5 | 92 | 105 | Male |
| 6 | 75 | 100 | Female |
| 7 | 84 | 95 | Male |
| 8 | 77 | 90 | Female |
| 9 | 73 | 85 | Male |
| 10 | 74 | 80 | Female |

For this problem, we want to test the usefulness of IQ and Gender as predictors of Test Score. To accomplish this objective, we will:

- Recode the categorical variable (Gender) to be a quantitative, dummy variable.
- Define a regression equation to express the relationship between Test Score, IQ, and Gender.
- Conduct a standard regression analysis and interpret the results.

### Dummy Variable Recoding

The first thing we need to do is to express gender as one or more dummy variables. How many dummy
variables will we need to fully capture all of the information inherent in the categorical variable Gender?
To answer that question, we look at the number of values (*k*) Gender can assume. We will
need *k - 1* dummy variables to represent Gender. Since Gender can assume two values (male or female),
we will only need one dummy variable to represent Gender.

Therefore, we can express the categorical variable Gender as a single dummy variable (X_{1}), like so:

- X_{1} = 1 for male students.
- X_{1} = 0 for non-male students.

Now, we can replace Gender with X_{1} in our data table.

| Student | Test score | IQ | X_{1} |
|---|---|---|---|
| 1 | 93 | 125 | 1 |
| 2 | 86 | 120 | 0 |
| 3 | 96 | 115 | 1 |
| 4 | 81 | 110 | 0 |
| 5 | 92 | 105 | 1 |
| 6 | 75 | 100 | 0 |
| 7 | 84 | 95 | 1 |
| 8 | 77 | 90 | 0 |
| 9 | 73 | 85 | 1 |
| 10 | 74 | 80 | 0 |

Note that X_{1} identifies male students explicitly. Non-male students are the reference group.
This was an arbitrary choice. The analysis works just as well if you use X_{1} to identify
female students and make non-female students the reference group.

### The Regression Equation

At this point, we conduct a routine regression analysis. No special tweaks are required to handle the dummy variable. So, we begin by specifying our regression equation. For this problem, the equation is:

ŷ = b_{0} + b_{1}IQ + b_{2}X_{1}

where ŷ is the predicted value of the Test Score, IQ is the IQ score, X_{1} is the dummy variable representing Gender,
and b_{0}, b_{1}, and b_{2} are regression coefficients.

Values for IQ and X_{1} are known inputs from the data table. The only unknowns on the right side of the equation
are the regression coefficients, which we will estimate through least-squares regression.

### Data Analysis With Excel

To complete a good multiple regression analysis, we want to do four things:

- Estimate regression coefficients for our regression equation.
- Assess how well the regression equation predicts test score, the dependent variable.
- Assess the extent of multicollinearity between independent variables.
- Assess the contribution of each independent variable (i.e., IQ and Gender) to the prediction.

#### Prerequisites

The remaining material assumes familiarity with topics covered in previous lessons. Specifically, you need to know:

- How to conduct regression analysis with statistical software.
- How to assess multicollinearity among independent variables.

If you're hazy on either of these topics, review the corresponding lessons before continuing.

#### Regression Coefficients

The first task in our analysis is to assign values to coefficients in our regression equation. Excel does all the hard work behind the scenes, and displays the result in a regression coefficients table:

For now, the key outputs of interest are the least-squares estimates for regression coefficients. They allow us to fully specify our regression equation:

ŷ = 38.6 + 0.4 * IQ + 7 * X_{1}

This is the only linear equation that satisfies the least-squares criterion. That means that, among all linear equations in IQ and X_{1}, this one has the smallest sum of squared differences between observed and predicted test scores for the data from which it was created.
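For readers without Excel, the same estimates can be reproduced with a short script. The sketch below is one possible approach, not what Excel does internally: it builds the normal equations (X'X)b = X'y from the data table and solves them with a small hand-written Gauss-Jordan routine. The function name `solve` and the variable names are ours.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


# Data from the table above, students 1 through 10.
score = [93, 86, 96, 81, 92, 75, 84, 77, 73, 74]
iq = [125, 120, 115, 110, 105, 100, 95, 90, 85, 80]
gender = ["Male", "Female"] * 5
x1 = [1 if g == "Male" else 0 for g in gender]  # dummy recoding

# Design matrix columns: intercept, IQ, X1.
columns = [[1] * len(score), iq, x1]
# Normal equations: (X'X) b = X'y.
xtx = [[sum(p * q for p, q in zip(ci, cj)) for cj in columns] for ci in columns]
xty = [sum(p * y for p, y in zip(ci, score)) for ci in columns]
b0, b1, b2 = solve(xtx, xty)

print(f"y-hat = {b0:.1f} + {b1:.1f} * IQ + {b2:.1f} * X1")
# y-hat = 38.6 + 0.4 * IQ + 7.0 * X1
```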

#### Coefficient of Multiple Determination

The fact that our equation fits the data better than any other linear equation does not guarantee that it fits the data well. We still need to ask: How well does our equation fit the data?

To answer this question, researchers look at the coefficient of multiple determination (R^{2}).
When the regression equation fits the data well, R^{2} will be large (i.e., close to 1);
and vice versa.

Luckily, the coefficient of multiple determination is a standard output of Excel
(and most other analysis packages). Here is what Excel says about R^{2} for our equation:

The coefficient of multiple determination is 0.810. For our sample problem, this means 81% of test score variation can be explained by IQ and by gender. Translation: Our equation fits the data pretty well.
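The reported value can be checked by hand. The sketch below takes the fitted equation from this lesson as given and computes R^{2} as 1 minus the ratio of residual variation to total variation. Variable names are ours.

```python
# Data from the table above, students 1 through 10.
score = [93, 86, 96, 81, 92, 75, 84, 77, 73, 74]
iq = [125, 120, 115, 110, 105, 100, 95, 90, 85, 80]
x1 = [1, 0] * 5  # 1 = male, 0 = non-male

# Predictions from the fitted equation reported in the text.
predicted = [38.6 + 0.4 * q + 7 * d for q, d in zip(iq, x1)]

mean_score = sum(score) / len(score)
ss_total = sum((y - mean_score) ** 2 for y in score)
ss_residual = sum((y - p) ** 2 for y, p in zip(score, predicted))
r_squared = 1 - ss_residual / ss_total

print(f"{r_squared:.3f}")  # 0.810
```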

#### Multicollinearity

At this point, we'd like to assess the relative importance of our independent variables. We do this by testing the statistical significance of regression coefficients.

Before we conduct those tests, however, we need to assess multicollinearity between independent variables. If multicollinearity is high, significance tests on regression coefficients can be misleading. But if multicollinearity is low, the same tests can be informative.

To measure multicollinearity for this problem, we can try to predict IQ based on Gender. That is,
we regress IQ on the Gender dummy variable (X_{1}). The resulting coefficient of multiple determination (R^{2}_{k}) is an
indicator of multicollinearity. As a rule of thumb, when R^{2}_{k} is greater than 0.75, multicollinearity
is a problem.

For this problem, R^{2}_{k} was very small - only 0.03. Given this result, we can
proceed with statistical analysis of our independent variables.
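Because this auxiliary regression has a single predictor, R^{2}_{k} is simply the squared correlation between IQ and X_{1}. A minimal sketch of that calculation, using the data table and our own variable names:

```python
# Independent variables from the data table, students 1 through 10.
iq = [125, 120, 115, 110, 105, 100, 95, 90, 85, 80]
x1 = [1, 0] * 5  # 1 = male, 0 = non-male

n = len(iq)
mean_iq = sum(iq) / n
mean_x1 = sum(x1) / n

# Sums of squares and cross-products about the means.
s_xy = sum((d - mean_x1) * (q - mean_iq) for d, q in zip(x1, iq))
s_xx = sum((d - mean_x1) ** 2 for d in x1)
s_yy = sum((q - mean_iq) ** 2 for q in iq)

# With one predictor, R-squared equals the squared correlation.
r_squared_k = s_xy ** 2 / (s_xx * s_yy)

print(f"{r_squared_k:.2f}")  # 0.03
```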

#### Significance of Regression Coefficients

With multiple regression, there is more than one independent variable; so it is natural to ask whether a particular
independent variable contributes significantly to the regression *after effects of other variables are taken
into account*. The answer to this question can be found in the regression coefficients table:

The regression coefficients table shows the following information for each coefficient: its value, its standard error, a t-statistic, and the significance of the t-statistic. In this example, the t-statistics for IQ and gender are both statistically significant at the 0.05 level. This means that IQ predicts test score beyond chance levels, even after the effect of gender is taken into account. And gender predicts test score beyond chance levels, even after the effect of IQ is taken into account.
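The t-statistics in such a table can be reproduced from the data. The sketch below assumes the usual least-squares standard-error formulas for a model with two predictors, written in terms of sums of squares about the means. It compares each t-statistic with 2.365, the two-tailed 0.05 critical value for 7 degrees of freedom taken from a standard t table. Variable names are ours, and the output is not meant to mirror Excel's exact layout.

```python
import math

# Data from the table above, students 1 through 10.
score = [93, 86, 96, 81, 92, 75, 84, 77, 73, 74]
iq = [125, 120, 115, 110, 105, 100, 95, 90, 85, 80]
x1 = [1, 0] * 5  # 1 = male, 0 = non-male
n = len(score)


def centered(values):
    """Subtract the mean from every value."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


y, q, d = centered(score), centered(iq), centered(x1)
s_qq = sum(a * a for a in q)
s_dd = sum(a * a for a in d)
s_qd = sum(a * b for a, b in zip(q, d))
s_qy = sum(a * b for a, b in zip(q, y))
s_dy = sum(a * b for a, b in zip(d, y))
s_yy = sum(a * a for a in y)

# Slopes from the 2 x 2 normal equations on centered data.
det = s_qq * s_dd - s_qd ** 2
b_iq = (s_qy * s_dd - s_qd * s_dy) / det
b_gender = (s_qq * s_dy - s_qd * s_qy) / det

# Residual variance with n - 3 degrees of freedom (intercept plus two slopes).
ss_residual = s_yy - b_iq * s_qy - b_gender * s_dy
mse = ss_residual / (n - 3)

se_iq = math.sqrt(mse * s_dd / det)
se_gender = math.sqrt(mse * s_qq / det)
t_iq = b_iq / se_iq
t_gender = b_gender / se_gender

T_CRITICAL = 2.365  # assumption: two-tailed 0.05 level, 7 degrees of freedom

print(f"IQ:     b = {b_iq:.1f}, t = {t_iq:.2f}, significant = {abs(t_iq) > T_CRITICAL}")
print(f"Gender: b = {b_gender:.1f}, t = {t_gender:.2f}, significant = {abs(t_gender) > T_CRITICAL}")
# IQ:     b = 0.4, t = 4.28, significant = True
# Gender: b = 7.0, t = 2.61, significant = True
```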

The regression coefficient for gender provides a measure of the difference between the group identified by the dummy variable (males) and the group that serves as a reference (females). Here, the regression coefficient for gender is 7. This suggests that, after effects of IQ are taken into account, males are predicted to score 7 points higher on the test, on average, than the reference group (females). And, because the regression coefficient for gender is statistically significant, we interpret this difference as a real effect - not a chance artifact.