Influential Points
Sometimes in regression analysis, a few data points have
disproportionate effects on the slope of the regression equation. In
this lesson, we describe how to identify those influential points.
Outliers
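Data points that diverge markedly from the overall pattern are called
outliers. There are four
ways that a data point might be considered an outlier.
- It could have an extreme X value compared to other data points.
- It could have an extreme Y value compared to other data points.
- It could have extreme X and Y values.
- It might be distant from the rest of the data,
even without extreme X or Y values.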
Each type of outlier is depicted graphically in the scatterplots below.
[Scatterplots: Extreme X value, Extreme Y value, Extreme X and Y, Distant data point]
Influential Points
An influential point is an outlier that greatly affects the slope of the
regression line. One way to test the influence of an outlier is to compute the
regression equation with and without the outlier.
This type of analysis is illustrated below. The scatterplots are identical, except
that one plot includes an outlier. When the outlier is present, the slope is flatter (-3.32 vs. -4.10);
so this outlier would be considered an influential point.
Without Outlier
Regression equation: ŷ = 104.78 - 4.10x
Coefficient of determination: R^{2} = 0.94
With Outlier
Regression equation: ŷ = 97.51 - 3.32x
Coefficient of determination: R^{2} = 0.55
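The with-and-without comparison can be sketched in a few lines of Python. This is a minimal illustration, assuming NumPy is available. The data, the outlier, and the helper name `fit_line` are made up for this sketch; they are not the data behind the charts above.

```python
import numpy as np

def fit_line(x, y):
    """Least-squares fit of y = b0 + b1*x. Returns (b0, b1, r_squared)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    b1, b0 = np.polyfit(x, y, 1)               # coefficients come back highest degree first
    ss_res = np.sum((y - (b0 + b1 * x)) ** 2)  # variation left unexplained by the line
    ss_tot = np.sum((y - y.mean()) ** 2)       # total variation in y
    return b0, b1, 1.0 - ss_res / ss_tot

# Made-up data for illustration only.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [97, 91, 89, 83, 79, 77, 71, 69]
outlier = (9, 95)  # hypothetical point far above the downward trend

b0_without, b1_without, r2_without = fit_line(x, y)
b0_with, b1_with, r2_with = fit_line(x + [outlier[0]], y + [outlier[1]])

print(f"Without outlier: slope = {b1_without:.2f}, R^2 = {r2_without:.2f}")
print(f"With outlier:    slope = {b1_with:.2f}, R^2 = {r2_with:.2f}")
```

With this made-up data, adding the single point flattens the slope and lowers the coefficient of determination, the same pattern as in the example above.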
The charts below compare
regression statistics for another data set with and without an
outlier. Here, one chart has a single outlier,
located at the high end of the X axis (where x = 24).
As a result of that single outlier, the slope of the
regression line changes greatly, from -2.5 to -1.6; so the outlier
would be considered an influential point.
Without Outlier
Regression equation: ŷ = 92.54 - 2.5x
Slope: b_{1} = -2.5
Coefficient of determination: R^{2} = 0.46
With Outlier
Regression equation: ŷ = 87.59 - 1.6x
Slope: b_{1} = -1.6
Coefficient of determination: R^{2} = 0.52
Sometimes, an influential point will cause the
coefficient of determination to be bigger; sometimes, smaller. In the first
example above, the coefficient of determination is smaller when the influential point
is present (0.55 vs. 0.94). In the second example, it is bigger (0.52 vs. 0.46).
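To see that the coefficient of determination can move in either direction, here is a small sketch, assuming NumPy. The data set and both added points are invented for illustration, and `slope_and_r2` is a hypothetical helper name. For a line fitted with one predictor, R^{2} equals the squared correlation between x and y, which is what the helper computes.

```python
import numpy as np

def slope_and_r2(x, y):
    """Return (slope, R^2) for a straight-line least-squares fit."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope = np.polyfit(x, y, 1)[0]   # first coefficient is the slope
    r = np.corrcoef(x, y)[0, 1]      # Pearson correlation between x and y
    return slope, r ** 2

# Made-up, noisy data with a weak downward trend.
x = [1, 2, 3, 4, 5, 6]
y = [10, 4, 9, 3, 8, 2]

base = slope_and_r2(x, y)
raised = slope_and_r2(x + [20], y + [0])    # hypothetical point at an extreme X value
lowered = slope_and_r2(x + [6], y + [20])   # hypothetical point far above the trend

for label, (slope, r2) in [("No outlier", base), ("Outlier A", raised), ("Outlier B", lowered)]:
    print(f"{label}: slope = {slope:.2f}, R^2 = {r2:.2f}")
```

In this sketch both added points change the slope noticeably, yet one raises the coefficient of determination and the other lowers it.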
If your data set includes an influential point, here are some things to consider.
- An influential point may represent bad data, possibly the result of measurement
error. If possible, check the validity of the data point.
- Compare the decisions that would be made based on regression equations defined with
and without the influential point. If the equations lead to contrary decisions,
use caution.
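One way to compare decisions is to evaluate both equations at the X values you care about. The sketch below uses the two regression equations from the first example in this lesson. The cutoff of 30 and the chosen X values are hypothetical, picked only to show how the two equations can agree at one X value and disagree at another.

```python
def predict(equation, x):
    """Evaluate y-hat = b0 + b1*x for an (intercept, slope) pair."""
    b0, b1 = equation
    return b0 + b1 * x

eq_without = (104.78, -4.10)  # first example, outlier excluded
eq_with = (97.51, -3.32)      # first example, outlier included
CUTOFF = 30.0                 # hypothetical decision threshold, not from the lesson

def decisions_agree(x):
    """True if both equations fall on the same side of the cutoff at x."""
    return (predict(eq_without, x) >= CUTOFF) == (predict(eq_with, x) >= CUTOFF)

for x in (5, 10, 20):  # hypothetical X values of interest
    print(x, round(predict(eq_without, x), 2), round(predict(eq_with, x), 2), decisions_agree(x))
```

Under these assumptions, the two equations agree at x = 5 and x = 10 but fall on opposite sides of the cutoff at x = 20, which is the kind of result that calls for caution.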
Test Your Understanding
In the context of regression analysis,
which of the following statements are true?
I. When the data set includes an influential point, the data set is
nonlinear.
II. Influential points always reduce the coefficient of determination.
III. All outliers are influential data points.
(A) I only
(B) II only
(C) III only
(D) All of the above
(E) None of the above
Solution
The correct answer is (E). None of the statements is true.
Statement I is false: data sets with influential points can be linear or nonlinear.
Statement II is false: in this lesson, we went over an example in which an
influential point increased the coefficient of determination.
Statement III is false: with respect to regression,
outliers are influential only if they have a big effect on the
regression equation. Sometimes, outliers do not have big effects. For
example, when the data set is very large, a single outlier may not have
a big effect on the regression equation.
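The effect of sample size can be checked with simulated data. The sketch below assumes NumPy. The true line, the noise level, the outlier location, and the helper name `slope_shift` are all assumptions made for illustration. It adds the same outlier to a small and a large data set and measures how far the slope moves in each case.

```python
import numpy as np

def slope_shift(n, outlier=(10.0, 100.0), seed=0):
    """Absolute change in fitted slope when one outlier is added to n simulated points."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 10.0, n)
    # Simulated data around an assumed line y = 100 - 4x, with noise (sd = 2).
    y = 100.0 - 4.0 * x + rng.normal(0.0, 2.0, size=n)
    slope_without = np.polyfit(x, y, 1)[0]
    slope_with = np.polyfit(np.append(x, outlier[0]), np.append(y, outlier[1]), 1)[0]
    return abs(slope_with - slope_without)

shift_small = slope_shift(10)       # outlier added to 10 points
shift_large = slope_shift(10_000)   # same outlier added to 10,000 points

print(f"Slope change with n = 10:     {shift_small:.4f}")
print(f"Slope change with n = 10,000: {shift_large:.4f}")
```

In this simulation the outlier moves the slope substantially when there are only 10 points, but barely at all when there are 10,000, so the same point is influential in one data set and not in the other.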