Teach yourself statistics

Influential Points in Regression

Sometimes in regression analysis, a few data points have disproportionate effects on the slope of the regression equation. In this lesson, we describe how to identify those influential points.

Outliers

Data points that diverge in a big way from the overall pattern are called outliers. There are four ways that a data point might be considered an outlier.

It could have an extreme X value compared to other data points.
It could have an extreme Y value compared to other data points.
It could have extreme X and Y values.
It might be distant from the rest of the data, even without extreme X or Y values.

Each type of outlier is depicted graphically in the scatterplots below.

Extreme X value

Scatterplot with extreme X value

Extreme Y value

Scatterplot with extreme Y value

Extreme X and Y

Scatterplot with extreme X and Y values

Distant data point

Scatterplot with distant data point
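The four cases can also be told apart numerically. The sketch below is one rough way to do it, not a standard procedure: it compares a single point with the remaining points, using standardized distances in X, in Y, and from the line fitted to the remaining points. The function name outlier_kind, the cutoff of 3 standard deviations, and the data are all assumptions made for illustration.

```python
import numpy as np

def outlier_kind(x, y, i, cut=3.0):
    """Label point i relative to the other points (illustrative cutoff)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    rest = np.arange(len(x)) != i            # mask selecting every other point
    xr, yr = x[rest], y[rest]

    # Standardized distance of point i from the other points, in X and in Y.
    zx = abs(x[i] - xr.mean()) / xr.std(ddof=1)
    zy = abs(y[i] - yr.mean()) / yr.std(ddof=1)

    # Standardized distance of point i from the line fitted to the other points.
    b1, b0 = np.polyfit(xr, yr, 1)           # polyfit returns slope first
    resid_sd = np.std(yr - (b0 + b1 * xr), ddof=2)
    zres = abs(y[i] - (b0 + b1 * x[i])) / resid_sd

    if zx > cut and zy > cut:
        return "extreme X and Y"
    if zx > cut:
        return "extreme X"
    if zy > cut:
        return "extreme Y"
    if zres > cut:
        return "distant from the pattern"
    return "not flagged"

# Made-up data with a tight downward trend.
base_x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
base_y = [100, 97, 92, 88, 85, 80, 77, 72, 69, 64]

def kind_of(extra_x, extra_y):
    """Append one hypothetical point to the base data and label it."""
    return outlier_kind(base_x + [extra_x], base_y + [extra_y], i=len(base_x))

for point in [(25, 82), (5.5, 150), (25, 0), (9, 95)]:
    print(point, "->", kind_of(*point))
```

The last point, (9, 95), has ordinary X and Y values but sits far above the trend, which is the fourth case in the list above.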

Influential Points

An influential point is an outlier that greatly affects the slope of the regression line. One way to test the influence of an outlier is to compute the regression equation with and without the outlier.

This type of analysis is illustrated below. The scatterplots are identical, except that one plot includes an outlier. When the outlier is present, the slope is flatter (-3.32 with the outlier vs. -4.10 without), so this outlier would be considered an influential point.

Without Outlier

Scatterplot without outlier

Regression equation: ŷ = 104.78 - 4.10x
Coefficient of determination: R² = 0.94

With Outlier

Scatterplot with outlier

Regression equation: ŷ = 97.51 - 3.32x
Coefficient of determination: R² = 0.55
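The with-and-without comparison is easy to reproduce in code. The sketch below uses made-up data (not the dataset behind the scatterplots above) and a hypothetical helper, fit_line, that returns the intercept, the slope, and the coefficient of determination of a least-squares line.

```python
import numpy as np

def fit_line(x, y):
    """Least-squares line y = b0 + b1*x; returns (b0, b1, r_squared)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    b1, b0 = np.polyfit(x, y, 1)             # polyfit returns slope first
    residuals = y - (b0 + b1 * x)
    ss_res = np.sum(residuals ** 2)          # unexplained variation
    ss_tot = np.sum((y - y.mean()) ** 2)     # total variation
    return b0, b1, 1.0 - ss_res / ss_tot

# Made-up data: a tight downward trend...
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([100, 97, 92, 88, 85, 80, 77, 72, 69, 64], dtype=float)

# ...plus one hypothetical outlier, far to the right and well above the trend.
x_out = np.append(x, 20.0)
y_out = np.append(y, 90.0)

without = fit_line(x, y)
with_outlier = fit_line(x_out, y_out)

print("without outlier: b0=%.2f  b1=%.2f  R2=%.2f" % without)
print("with outlier:    b0=%.2f  b1=%.2f  R2=%.2f" % with_outlier)
```

With this particular data, the added point makes the slope flatter and lowers the coefficient of determination. A different outlier could push either statistic the other way.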

The charts below compare regression statistics for another dataset with and without an outlier. Here, one chart has a single outlier, located at the high end of the X axis (where x = 24). As a result of that single outlier, the slope of the regression line changes greatly, from -2.5 to -1.6; so the outlier would be considered an influential point.

Without Outlier

Scatterplot without outlier

Regression equation: ŷ = 92.54 - 2.5x
Slope: b₁ = -2.5
Coefficient of determination: R² = 0.46

With Outlier

Scatterplot with outlier

Regression equation: ŷ = 87.59 - 1.6x
Slope: b₁ = -1.6
Coefficient of determination: R² = 0.52

Sometimes, an influential point will cause the coefficient of determination to be bigger; sometimes, smaller. In the first example above, the coefficient of determination is smaller when the influential point is present (0.55 with the point vs. 0.94 without). In the second example, it is bigger (0.52 with the point vs. 0.46 without).

If your dataset includes an influential point, here are some things to consider.

An influential point may represent bad data, possibly the result of measurement error. If possible, check the validity of the data point.
Compare the decisions that would be made based on regression equations defined with and without the influential point. If the equations lead to contrary decisions, use caution.
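One simple way to compare the two equations is to look at the predictions they give for the same x values. The sketch below uses the two equations from the first example in this lesson; the x values are arbitrary choices for illustration, since the lesson does not state the range of the data.

```python
def predict(b0, b1, x):
    """Predicted value from the regression equation y-hat = b0 + b1*x."""
    return b0 + b1 * x

# (intercept, slope) pairs from the first example in this lesson.
without_outlier = (104.78, -4.10)
with_outlier = (97.51, -3.32)

# Hypothetical x values at which to compare the two equations.
for x in (5, 10, 15, 20):
    y_without = predict(*without_outlier, x)
    y_with = predict(*with_outlier, x)
    gap = y_with - y_without
    print(f"x={x:>2}: without={y_without:6.2f}  with={y_with:6.2f}  gap={gap:+6.2f}")
```

Because the slopes differ, the gap between the two predictions changes with x. Whether a gap is large enough to change a decision depends on the problem at hand.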

Test Your Understanding

In the context of regression analysis, which of the following statements are true?

I. When the dataset includes an influential point, the dataset is nonlinear.
II. Influential points always reduce the coefficient of determination.
III. All outliers are influential data points.

(A) I only
(B) II only
(C) III only
(D) All of the above
(E) None of the above

Solution

The correct answer is (E). Datasets with influential points can be linear or nonlinear. Influential points do not always reduce the coefficient of determination. In this lesson, we went over an example in which an influential point increased the coefficient of determination. With respect to regression, outliers are influential only if they have a big effect on the regression equation. Sometimes, outliers do not have big effects. For example, when the dataset is very large, a single outlier may not have a big effect on the regression equation.
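The last point, about large datasets, can be checked with a small simulation. The sketch below is illustrative only: the data are randomly generated around an assumed line (y = 100 - 4x plus noise), and the same hypothetical outlier at (20, 90) is added to a small sample and to a large one.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the run is repeatable

def slope_shift(n):
    """Absolute change in slope when one outlier joins n simulated points."""
    x = rng.uniform(0, 10, size=n)
    y = 100 - 4 * x + rng.normal(0, 2, size=n)   # assumed linear pattern plus noise
    slope_clean = np.polyfit(x, y, 1)[0]
    # Add the same hypothetical outlier, (20, 90), and refit.
    slope_with = np.polyfit(np.append(x, 20.0), np.append(y, 90.0), 1)[0]
    return abs(slope_with - slope_clean)

shift_small = slope_shift(10)
shift_large = slope_shift(10_000)

print(f"n = 10:     slope changes by {shift_small:.3f}")
print(f"n = 10000:  slope changes by {shift_large:.3f}")
```

In the small sample the outlier moves the slope substantially; in the large sample the same point barely moves it, so it would be an outlier but not an influential point.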
