What is Linear Regression?
In a cause and effect relationship, the
independent variable is the cause, and the
dependent variable is the effect.
Least squares linear regression is a method
for predicting the value of a dependent variable Y,
based on the value of an independent variable
X.
In this tutorial, we focus on the case where there is only one
independent variable. This is called simple regression. In another tutorial (see
Regression Tutorial), we cover
multiple regression, which handles two or more independent variables.
Tip: The next lesson presents a
simple linear regression example
that shows how to
apply the material covered in this lesson. Because
this lesson is a little dense, you may benefit from
reading the next lesson as well.
Prerequisites for Regression
Simple linear regression is appropriate when the following
conditions are satisfied.
- The dependent variable Y has a linear relationship
to the independent variable X. To check this,
make sure that the XY
scatterplot is linear and that the
residual plot shows a random pattern. (Don't worry. We'll cover residual plots in a
future lesson.)
- For each value of X, the probability distribution of Y has the
same standard deviation σ. When this condition is
satisfied, the variability of the residuals will be relatively
constant across all values of X, which is easily checked in
a residual plot.
- For any given value of X,
The Least Squares Regression Line
Linear regression finds the straight line, called the
least squares regression line or LSRL, that
best represents observations in a
bivariate data set. Suppose Y is a dependent variable,
and X is an independent variable. The population
regression line is:
$$Y = \mathrm{B}_0 + \mathrm{B}_1 X$$
where $\mathrm{B}_0$ is a constant,
$\mathrm{B}_1$ is the regression coefficient,
X is the value of the independent variable, and Y is the
value of the dependent variable.
Given a random sample of observations, the population regression
line is estimated by a sample regression line. The sample regression line is:
$$\hat{y} = b_0 + b_1 x$$
where $b_0$ is a constant,
$b_1$ is the regression coefficient,
x is the value of the independent variable, and $\hat{y}$ is the
predicted value of the dependent variable.
How to Define a Regression Line
Normally, you will
use a computational tool, such as a software package (e.g., Excel) or a graphing calculator,
to find $b_0$ and $b_1$. You enter the
X and Y values into your program or calculator,
and the tool solves for each parameter.
In the unlikely event that you find yourself on a desert island
without a computer or a graphing calculator, you can solve for
$b_0$ and $b_1$ "by hand". Here are the
equations.
$$b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$$
$$b_1 = r \cdot \frac{s_y}{s_x}$$
$$b_0 = \bar{y} - b_1 \bar{x}$$
where $b_0$ is the constant in the regression equation,
$b_1$ is the regression coefficient,
r is the correlation between x and y,
$x_i$ is the X value of observation i,
$y_i$ is the Y value of observation i,
$\bar{x}$ is the mean of X,
$\bar{y}$ is the mean of Y,
$s_x$ is the standard deviation of X, and
$s_y$ is the standard deviation of Y.
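The hand-calculation formulas above can be sketched in a few lines of Python. This is a minimal illustration rather than a definitive implementation: the function names fit_line and slope_from_correlation and the five data points are made up for the example and do not come from the lesson.

```python
from math import sqrt
from statistics import mean, stdev


def fit_line(xs, ys):
    """Return (b0, b1) for the least squares line y-hat = b0 + b1 * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Numerator: sum of products of deviations from the two means.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Denominator: sum of squared deviations of x from its mean.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def slope_from_correlation(xs, ys):
    """Return b1 computed as r * (sy / sx), the second slope formula."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    r = sxy / sqrt(sxx * syy)  # linear correlation between x and y
    return r * stdev(ys) / stdev(xs)


# Made-up illustrative data (not from the lesson).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

b0, b1 = fit_line(xs, ys)
print(round(b0, 4), round(b1, 4))                # intercept and slope
print(round(slope_from_correlation(xs, ys), 4))  # same slope, second formula
```

For this data the two slope formulas agree (about 0.6), and the intercept comes out at about 2.2.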
Properties of the Regression Line
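When the regression parameters ($b_0$ and $b_1$)
are defined as described
above, the regression line has the following properties.
- The line minimizes the sum of squared differences between observed values and predicted values.
- The line passes through the mean of the X values and the mean of the Y values, that is, through the point ($\bar{x}$, $\bar{y}$).
- The regression constant $b_0$ is the y-intercept of the line.
- The regression coefficient $b_1$ is the slope of the line: the average change in the dependent variable for a one-unit change in the independent variable.
The least squares regression line is the only straight line that
has all of these properties.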
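Two well-known properties of a least squares line are that it passes through the point of means and that it has the smallest sum of squared prediction errors of any straight line. The sketch below checks both numerically on made-up data. The helper names (fit_line, sse) and the data are illustrative assumptions, and the comparison against a handful of nearby lines is only a spot check, not a proof.

```python
def fit_line(xs, ys):
    """Return (b0, b1) for the least squares line y-hat = b0 + b1 * x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def sse(b0, b1, xs, ys):
    """Sum of squared differences between observed and predicted y."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


# Made-up illustrative data (not from the lesson).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
b0, b1 = fit_line(xs, ys)

# Property: the line passes through (mean of x, mean of y).
x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)
passes_through_means = abs((b0 + b1 * x_bar) - y_bar) < 1e-9

# Property: nudging the intercept or slope never lowers the squared error.
best = sse(b0, b1, xs, ys)
nearby = [(b0 + d0, b1 + d1)
          for d0 in (-0.1, 0.0, 0.1)
          for d1 in (-0.1, 0.0, 0.1)
          if (d0, d1) != (0.0, 0.0)]
no_better_line = all(sse(c0, c1, xs, ys) > best for c0, c1 in nearby)

print(passes_through_means, no_better_line)
```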
The Coefficient of Determination
The coefficient of determination (denoted by
$R^2$) is a key output of regression analysis.
It is interpreted as the proportion of the variance in the
dependent variable that is predictable from the independent variable.
- An $R^2$ between 0 and 1 indicates the extent to which
the dependent variable is predictable. An $R^2$ of
0.10 means that 10 percent of the variance in Y is
predictable from X; an $R^2$ of 0.20 means
that 20 percent is predictable; and so on.
The formula for computing the coefficient of determination ($R^2$) for a
linear regression model with one independent variable is given below.
$$R^2 = \left\{ \frac{\frac{1}{N} \sum (x_i - \bar{x})(y_i - \bar{y})}{\sigma_x \sigma_y} \right\}^2$$
where N is the number of
observations used to fit the model, $\sum$ is the summation symbol,
$x_i$ is the x value for observation i,
$\bar{x}$ is the mean x value,
$y_i$ is the y value for observation i,
$\bar{y}$ is the mean y value,
$\sigma_x$ is the standard deviation of x, and
$\sigma_y$ is the standard deviation of y.
If you know the linear correlation (r) between two variables, then the coefficient of
determination ($R^2$) is easily computed using the following formula:
$$R^2 = r^2$$
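As a rough illustration of the formula for the coefficient of determination, the sketch below computes r from the averaged cross-products and then squares it. It assumes that the two standard deviations are population standard deviations (dividing by N), which is what keeps the 1/N factor in the formula consistent. The function name correlation and the data are made up for the example.

```python
from math import sqrt


def correlation(xs, ys):
    """Linear correlation r, using population standard deviations."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    # Assumption: sigma_x and sigma_y divide by N, not N - 1.
    sigma_x = sqrt(sum((x - x_bar) ** 2 for x in xs) / n)
    sigma_y = sqrt(sum((y - y_bar) ** 2 for y in ys) / n)
    # (1 / N) * sum of products of deviations from the means.
    mean_cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n
    return mean_cross / (sigma_x * sigma_y)


# Made-up illustrative data (not from the lesson).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = correlation(xs, ys)
r_squared = r ** 2  # coefficient of determination
print(round(r, 4), round(r_squared, 4))
```

For this data r is roughly 0.77 and the coefficient of determination is roughly 0.60, so about 60 percent of the variance in y is predictable from x.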
Standard Error
The standard error about the regression line
(often denoted by SE) is a measure of the average amount that the
regression equation over- or under-predicts. The higher
the coefficient of determination, the lower the standard
error, and the more accurate predictions are likely to be.
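The lesson does not give a formula for the standard error. One common definition, assumed here rather than taken from the lesson, is the square root of SSE / (n - 2), where SSE is the sum of squared residuals and n is the number of observations. The sketch below applies that definition to made-up data; standard_error is a hypothetical helper name.

```python
from math import sqrt


def fit_line(xs, ys):
    """Return (b0, b1) for the least squares line y-hat = b0 + b1 * x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def standard_error(xs, ys):
    """Assumed definition: sqrt(SSE / (n - 2)) for simple regression."""
    b0, b1 = fit_line(xs, ys)
    # Residual = observed y minus predicted y.
    residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
    sse = sum(e ** 2 for e in residuals)
    return sqrt(sse / (len(xs) - 2))


# Made-up illustrative data (not from the lesson).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

se = standard_error(xs, ys)
print(round(se, 4))
```

Under this definition the result is in the same units as the dependent variable, so it can be read as a typical size of the prediction error.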
Test Your Understanding
Problem 1
A researcher uses a regression equation to predict home heating
bills (dollar cost), based on home size (square feet).
The correlation between heating
bills and home size is 0.70. What is the correct interpretation
of this finding?
(A) 70% of the variability in home heating bills can be
explained by home size.
(B) 49% of the variability in home heating bills can be
explained by home size.
(C) For each added square foot of home size, heating bills
increased by 70 cents.
(D) For each added square foot of home size, heating bills
increased by 49 cents.
(E) None of the above.
Solution
The correct answer is (B). The coefficient of determination
measures the proportion of variation in the dependent variable
that is predictable from the independent variable. The
coefficient of determination is equal to the square of the correlation, $R^2 = r^2$;
in this case, $(0.70)^2$ or 0.49. Therefore, 49%
of the variability in heating bills can be explained by home size.