Discriminant Analysis

Discriminant analysis is a statistical technique used to classify observations into non-overlapping groups, based on scores on one or more quantitative predictor variables.

For example, a doctor could perform a discriminant analysis to identify patients at high or low risk for stroke. The analysis might classify patients into high- or low-risk groups, based on personal attributes (e.g., cholesterol level, body mass) and/or lifestyle behaviors (e.g., minutes of exercise per week, packs of cigarettes per day).

Note: There are several different ways to conduct a discriminant analysis. The approach described in this lesson is based on linear regression.

Two-Group Discriminant Analysis

A common research problem involves classifying observations into one of two groups, based on two or more quantitative predictor variables.

When there are only two classification groups, discriminant analysis is really just multiple regression, with a few tweaks.

• The dependent variable is a dichotomous, categorical variable (i.e., a categorical variable that can take on only two values).
• The dependent variable is expressed as a dummy variable (having values of 0 or 1).
• Observations are assigned to groups, based on whether the predicted score is closer to 0 or to 1.
• The regression equation is called the discriminant function.
• The efficacy of the discriminant function is measured by the proportion of correct assignments.

The biggest difference between discriminant analysis and standard regression analysis is the use of a categorical variable as the dependent variable. Other than that, two-group discriminant analysis is just like standard multiple regression analysis. The key steps in the analysis are:

• Estimate regression coefficients.
• Define the regression equation, which is the discriminant function.
• Assess the fit of the regression equation to the data.
• Assess the ability of the regression equation to correctly classify observations.
• Assess the relative importance of predictor variables.

The sample problem at the end of this lesson illustrates each of the above steps for a two-group discriminant analysis.

Multiple Discriminant Analysis

Regression can also be used with more than two classification groups, but the analysis is more complicated. When there are more than two groups, there can also be more than one discriminant function.

For example, suppose you wanted to classify voters into one of three political groups: Democrat, Republican, or Independent. Using two-group discriminant analysis, you might:

• Define one discriminant function to classify voters as Democrats or non-Democrats.
• Define a second discriminant function to classify non-Democrats as Republicans or Independents.

The maximum number of discriminant functions equals the number of predictor variables or the number of groups minus one, whichever is smaller.
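
The rule for the maximum number of discriminant functions can be written as a one-line calculation. The sketch below is only an illustration; the function name `max_discriminant_functions` and the choice of two predictors for the voter example are assumptions made here, not part of the lesson.

```python
def max_discriminant_functions(num_predictors: int, num_groups: int) -> int:
    """Upper bound on the number of discriminant functions.

    Hypothetical helper: the smaller of the number of predictors
    and the number of groups minus one.
    """
    if num_predictors < 1 or num_groups < 2:
        raise ValueError("need at least one predictor and at least two groups")
    return min(num_predictors, num_groups - 1)


# Three voter groups (Democrat, Republican, Independent);
# the two predictors are an assumption for illustration.
three_group_limit = max_discriminant_functions(num_predictors=2, num_groups=3)

# With two groups there is never more than one discriminant function.
two_group_limit = max_discriminant_functions(num_predictors=2, num_groups=2)

print(three_group_limit, two_group_limit)
```

With three groups and at least two predictors, the bound is two, which matches the two functions in the voter example.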

With multiple discriminant analysis, the goal is to define discriminant functions that maximize differences between groups and minimize differences within groups. The calculations to do this make use of canonical correlation, a technique that is beyond the scope of this tutorial.

Sample Problem

The SAT is an aptitude test taken by high school juniors and seniors. College administrators use the SAT along with high school grade point average (GPA) to predict academic success in college.

The table below shows the SAT score and high school GPA for ten students accepted to Acme College, along with whether each student ultimately graduated from college.

Graduate SAT GPA
Yes 1300 2.7
Yes 1260 3.7
Yes 1220 2.9
Yes 1180 2.5
Yes 1060 3.9
No 1140 2.1
No 1100 3.5
No 1020 3.3
No 980 2.3
No 940 3.1

For this exercise, using data from the table, we are going to complete the following tasks:

• Define a discriminant function that classifies incoming students as graduates or non-graduates, based on their SAT score and high school GPA.
• Assess the goodness of fit of the discriminant function.
• Assess how well the discriminant function predicts academic performance (i.e., whether the student graduates).
• Assess the contribution of each independent variable (i.e., SAT and GPA) to the prediction.

To accomplish these tasks, we'll use the regression module in Excel. (We explained how to conduct a regression analysis with Excel in a previous lesson.)

Dummy Variable Recoding

Look at the data table above. The dependent variable (Graduate) is a categorical variable that takes the values "Yes" or "No". To use that variable in regression analysis, we need to make it a quantitative variable.

We can make Graduate a quantitative variable through dummy variable recoding. That is, we can express the categorical variable Graduate as a dummy variable (Y), like so:

• Y = 1 for students who graduate.
• Y = 0 for students who do not graduate.

Now, we replace the categorical variable Graduate with the quantitative variable Y in our data table, setting Y to 1 for students who graduated and to 0 for students who did not.

Y SAT GPA
1 1300 2.7
1 1260 3.7
1 1220 2.9
1 1180 2.5
1 1060 3.9
0 1140 2.1
0 1100 3.5
0 1020 3.3
0 980 2.3
0 940 3.1
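
For readers working outside a spreadsheet, the recoding step might look like the following minimal Python sketch. The variable names (`students`, `recoded`, `y`) are chosen here for illustration only.

```python
# Graduation status, SAT score, and high school GPA for the ten students
students = [
    ("Yes", 1300, 2.7),
    ("Yes", 1260, 3.7),
    ("Yes", 1220, 2.9),
    ("Yes", 1180, 2.5),
    ("Yes", 1060, 3.9),
    ("No", 1140, 2.1),
    ("No", 1100, 3.5),
    ("No", 1020, 3.3),
    ("No", 980, 2.3),
    ("No", 940, 3.1),
]

# Dummy variable recoding: "Yes" becomes 1, "No" becomes 0
recoded = [
    (1 if graduate == "Yes" else 0, sat, gpa)
    for graduate, sat, gpa in students
]

# The dependent variable Y as a list of 0's and 1's
y = [row[0] for row in recoded]
print(y)
```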

We input data from the above table into our statistical software to conduct a standard regression analysis. Outputs from the analysis include a regression coefficients table, a coefficient of multiple determination, and an overall F-test. We discuss each output below.

Discriminant Function

The first task in our analysis is to define a linear, least-squares regression equation to predict academic performance, based on SAT and GPA. That equation will be our discriminant function. Since we have two independent variables, the equation takes the following form:

ŷ = b0 + b1 * SAT + b2 * GPA

In this equation, ŷ is the predicted academic performance (i.e., whether the student graduates or not). The independent variables are SAT and GPA. The regression coefficients are b0, b1, and b2. On the right side of the equation, the only unknowns are the regression coefficients; so to specify the equation, we need to assign values to the coefficients.

To assign values to regression coefficients, we consult the regression coefficients table produced by Excel:

Here, we see that the regression intercept (b0) is -3.8392, the regression coefficient for SAT (b1) is 0.003233, and the regression coefficient for GPA (b2) is 0.23955. So the least-squares regression equation is:

ŷ = -3.8392 + 0.003233 * SAT + 0.23955 * GPA

This is the discriminant function that we can use to classify incoming students as likely graduates or non-graduates.
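
The coefficients can also be reproduced without Excel. The sketch below is one way to do it in plain Python, using the closed-form least-squares solution that applies when there are exactly two predictors. The function name `fit_two_predictor_ols` is hypothetical, and this is not a description of how Excel computes its output.

```python
def fit_two_predictor_ols(y, x1, x2):
    """Least-squares fit of y = b0 + b1*x1 + b2*x2 (two predictors only)."""
    n = len(y)
    mean_y, mean_1, mean_2 = sum(y) / n, sum(x1) / n, sum(x2) / n

    # Sums of squares and cross-products of deviations from the means
    s11 = sum((a - mean_1) ** 2 for a in x1)
    s22 = sum((b - mean_2) ** 2 for b in x2)
    s12 = sum((a - mean_1) * (b - mean_2) for a, b in zip(x1, x2))
    s1y = sum((a - mean_1) * (c - mean_y) for a, c in zip(x1, y))
    s2y = sum((b - mean_2) * (c - mean_y) for b, c in zip(x2, y))

    # Solve the 2x2 normal equations for the slopes
    det = s11 * s22 - s12 ** 2
    b1 = (s1y * s22 - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det

    # The fitted plane passes through the point of means
    b0 = mean_y - b1 * mean_1 - b2 * mean_2
    return b0, b1, b2


y = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
sat = [1300, 1260, 1220, 1180, 1060, 1140, 1100, 1020, 980, 940]
gpa = [2.7, 3.7, 2.9, 2.5, 3.9, 2.1, 3.5, 3.3, 2.3, 3.1]

b0, b1, b2 = fit_two_predictor_ols(y, sat, gpa)
print(f"b0 = {b0:.4f}, b1 = {b1:.6f}, b2 = {b2:.5f}")
```

The closed form keeps the example short; with more than two predictors, a general linear-algebra routine would be the usual choice.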

Goodness of Fit

The fact that our discriminant function satisfies a least-squares criterion does not guarantee that it fits the data well or that it will classify students accurately. To assess goodness of fit, researchers look at the coefficient of multiple determination (R²) and/or they conduct an overall F test.

Coefficient of Multiple Determination

The coefficient of multiple determination measures the proportion of variation in the dependent variable that can be predicted from the set of independent variables in the regression equation. When the regression equation fits the data well, R² will be large (i.e., close to 1); and vice versa.

The coefficient of multiple determination is a standard output of Excel (and most other analysis packages), as shown below.

A quick glance at the output suggests that the regression equation fits the data pretty well. The coefficient of multiple determination is 0.610. This means 61% of variation in academic performance (i.e., graduating vs. not graduating) can be explained by SAT score and by high school GPA.
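
As a rough check on the reported value, the coefficient of multiple determination can be recomputed from the data as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses the rounded coefficients of the discriminant function from this lesson; the names `sse` and `sst` are chosen here for illustration.

```python
y = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
sat = [1300, 1260, 1220, 1180, 1060, 1140, 1100, 1020, 980, 940]
gpa = [2.7, 3.7, 2.9, 2.5, 3.9, 2.1, 3.5, 3.3, 2.3, 3.1]

# Rounded coefficients of the discriminant function from the lesson
B0, B1, B2 = -3.8392, 0.003233, 0.23955

# Predicted score for each student
y_hat = [B0 + B1 * s + B2 * g for s, g in zip(sat, gpa)]

mean_y = sum(y) / len(y)

# Residual (unexplained) and total sums of squares
sse = sum((actual - pred) ** 2 for actual, pred in zip(y, y_hat))
sst = sum((actual - mean_y) ** 2 for actual in y)

r_squared = 1 - sse / sst
print(f"R squared = {r_squared:.3f}")
```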

Overall F Test

Another way to evaluate the discriminant function would be to assess the statistical significance of the regression sum of squares. For that, we examine the ANOVA table produced by Excel:

This table tests the statistical significance of the independent variables as predictors of the dependent variable. The last column of the table shows the results of an overall F test. The p value (0.037) is small. This indicates that SAT and/or GPA has explanatory power beyond what would be expected by chance.

Like the coefficient of multiple determination, the overall F test found in the ANOVA table suggests that the regression equation fits the data well.
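
The overall F statistic can be recovered from the coefficient of multiple determination, the number of observations, and the number of predictors. The sketch below does this in plain Python using the reported value of 0.610. The p value relies on a closed-form expression that holds only when the numerator has two degrees of freedom (i.e., two predictors, as here); it is a shortcut assumed for this example and not a general-purpose F distribution routine.

```python
r_squared = 0.610   # coefficient of multiple determination from the lesson
n = 10              # number of students
k = 2               # number of predictors (SAT and GPA)

df_regression = k
df_residual = n - k - 1

# F = (explained variation per regression df) / (unexplained variation per residual df)
f_statistic = (r_squared / df_regression) / ((1 - r_squared) / df_residual)

# Upper-tail probability of F; this closed form is valid ONLY when
# the numerator degrees of freedom equal 2.
p_value = (1 + df_regression * f_statistic / df_residual) ** (-df_residual / 2)

print(f"F({df_regression}, {df_residual}) = {f_statistic:.2f}, p = {p_value:.3f}")
```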

Validity of Discriminant Function

In the real world, we are probably most interested in how well we can classify observations, based on outputs from the discriminant function. The table below shows actual student performance (Y) and predicted performance (ŷ), computed using the discriminant function.

Y ŷ SAT GPA
1 1.01 1300 2.7
1 1.12 1260 3.7
1 0.80 1220 2.9
1 0.57 1180 2.5
1 0.52 1060 3.9
0 0.35 1140 2.1
0 0.56 1100 3.5
0 0.25 1020 3.3
0 -0.12 980 2.3
0 -0.06 940 3.1

Recall that the discriminant function was designed to predict 0's and 1's. Thus, if predicted performance (ŷ) is less than 0.5, we assign the student to the "not graduating" group; and if it is greater than 0.5, we assign the student to the "graduating" group.

Comparing actual performance (Y) and predicted performance (ŷ) in the table above, we see that the discriminant function correctly classified nine of ten students. The one incorrect classification is the student with an SAT score of 1100 and a GPA of 3.5, who did not graduate but was assigned to the "graduating" group (ŷ = 0.56).

This result seems to indicate that SAT and GPA are useful in predicting graduation status.
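
The classification step can be sketched in a few lines of Python. The example below applies the discriminant function from this lesson, with its coefficients rounded as printed, and uses the 0.5 cutoff. Treating a score of exactly 0.5 as "not graduating" is an assumption made here, since the lesson does not say how to handle that case.

```python
y = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
sat = [1300, 1260, 1220, 1180, 1060, 1140, 1100, 1020, 980, 940]
gpa = [2.7, 3.7, 2.9, 2.5, 3.9, 2.1, 3.5, 3.3, 2.3, 3.1]

# Rounded coefficients of the discriminant function from the lesson
B0, B1, B2 = -3.8392, 0.003233, 0.23955
CUTOFF = 0.5


def discriminant_score(sat_score, gpa_score):
    """Predicted score from the discriminant function."""
    return B0 + B1 * sat_score + B2 * gpa_score


scores = [discriminant_score(s, g) for s, g in zip(sat, gpa)]

# Assign 1 ("graduating") above the cutoff, otherwise 0 ("not graduating").
# Assumption: a score of exactly 0.5 falls in the "not graduating" group.
predicted_groups = [1 if score > CUTOFF else 0 for score in scores]

n_correct = sum(1 for actual, pred in zip(y, predicted_groups) if actual == pred)
accuracy = n_correct / len(y)

for actual, score, pred in zip(y, scores, predicted_groups):
    print(f"Y = {actual}  score = {score:5.2f}  assigned = {pred}")
print(f"Proportion correctly classified: {accuracy:.2f}")
```

The proportion of correct assignments printed at the end is the efficacy measure described earlier in the lesson.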

Note: For this hypothetical example, we used the same data (1) to define the discriminant function and (2) to test the discriminant function. This is poor practice, because it capitalizes on chance variation in the data set. In the real world, we should use one data set to define the discriminant function and a different data set to test its validity.

Significance of Regression Coefficients

When the discriminant function has more than one independent variable, it is natural to ask whether each independent variable contributes significantly to the regression after effects of other variables are taken into account. The answer to this question can be found in the regression coefficients table:

The regression coefficients table shows the following information for each coefficient: its value, its standard error, a t-statistic, and the significance of the t-statistic. In this example, the t-statistic for SAT score was statistically significant at the 0.05 level; the t-statistic for GPA was not. This means that SAT contributed significantly to the regression after the effects of GPA were taken into account, whereas GPA did not contribute significantly after the effects of SAT were taken into account.
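
The t-statistics themselves can be recomputed from the data. The sketch below is a plain-Python illustration for the two-predictor case: each coefficient is divided by its standard error, and the result is compared with 2.365, the two-tailed 0.05 critical value of t with 7 degrees of freedom. Variable names are chosen here for illustration, and the standard-error formulas shown apply only when there are exactly two predictors.

```python
import math

y = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
sat = [1300, 1260, 1220, 1180, 1060, 1140, 1100, 1020, 980, 940]
gpa = [2.7, 3.7, 2.9, 2.5, 3.9, 2.1, 3.5, 3.3, 2.3, 3.1]

n = len(y)
mean_y, mean_sat, mean_gpa = sum(y) / n, sum(sat) / n, sum(gpa) / n

# Sums of squares and cross-products of deviations from the means
s_ss = sum((s - mean_sat) ** 2 for s in sat)
s_gg = sum((g - mean_gpa) ** 2 for g in gpa)
s_sg = sum((s - mean_sat) * (g - mean_gpa) for s, g in zip(sat, gpa))
s_sy = sum((s - mean_sat) * (v - mean_y) for s, v in zip(sat, y))
s_gy = sum((g - mean_gpa) * (v - mean_y) for g, v in zip(gpa, y))
s_yy = sum((v - mean_y) ** 2 for v in y)

# Least-squares slopes for the two-predictor case
det = s_ss * s_gg - s_sg ** 2
b_sat = (s_sy * s_gg - s_sg * s_gy) / det
b_gpa = (s_ss * s_gy - s_sg * s_sy) / det

# Residual mean square: residual sum of squares over n - 3 degrees of freedom
sse = s_yy - b_sat * s_sy - b_gpa * s_gy
df_residual = n - 3
mse = sse / df_residual

# Standard errors of the slopes (two-predictor formulas)
se_sat = math.sqrt(mse * s_gg / det)
se_gpa = math.sqrt(mse * s_ss / det)

t_sat = b_sat / se_sat
t_gpa = b_gpa / se_gpa

T_CRITICAL = 2.365  # two-tailed 0.05 critical value of t with 7 df

print(f"SAT: t = {t_sat:.2f}, significant = {abs(t_sat) > T_CRITICAL}")
print(f"GPA: t = {t_gpa:.2f}, significant = {abs(t_gpa) > T_CRITICAL}")
```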

Note: A separate analysis revealed minimal correlation between SAT score and GPA, so multicollinearity was not an issue.
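
The correlation mentioned in the note can be checked directly from the data table. The sketch below computes the Pearson correlation between SAT score and GPA in plain Python; the function name `pearson_r` is hypothetical.

```python
import math


def pearson_r(x, z):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_z = sum(x) / n, sum(z) / n
    s_xz = sum((a - mean_x) * (b - mean_z) for a, b in zip(x, z))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_zz = sum((b - mean_z) ** 2 for b in z)
    return s_xz / math.sqrt(s_xx * s_zz)


sat = [1300, 1260, 1220, 1180, 1060, 1140, 1100, 1020, 980, 940]
gpa = [2.7, 3.7, 2.9, 2.5, 3.9, 2.1, 3.5, 3.3, 2.3, 3.1]

r_sat_gpa = pearson_r(sat, gpa)
print(f"Correlation between SAT and GPA: {r_sat_gpa:.3f}")
```

A value this close to zero is consistent with the note's statement that multicollinearity was not an issue for these data.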