# What is Linear Regression?

In a cause-and-effect relationship, the
**independent variable** is the cause, and the
**dependent variable** is the effect.
**Least squares linear regression** is a method
for predicting the value of a dependent variable *Y*,
based on the value of an independent variable
*X*.

For the next few lessons, we focus on the case where there is only one independent variable. This is called simple regression. Toward the end of the tutorial, we will cover multiple regression, which handles two or more independent variables.

**Tip:** The next lesson presents a
simple linear regression example
that shows how to
apply the material covered in this lesson. Since
this lesson is a little dense, you may benefit by also
reading the next lesson.

## Prerequisites for Regression

Simple linear regression is appropriate when the following conditions are satisfied.

- The dependent variable *Y* has a linear relationship to the independent variable *X*. To check this, make sure that the XY scatterplot is linear and that the residual plot shows a random pattern. (Don't worry. We'll cover residual plots in a future lesson.)
- For each value of X, the probability distribution of Y has the same standard deviation σ. When this condition is satisfied, the variability of the residuals will be relatively constant across all values of X, which is easily checked in a residual plot.
- For any given value of X,

## The Least Squares Regression Line

Linear regression finds the straight line, called the
**least squares regression line** or LSRL, that
best represents observations in a
bivariate data set. Suppose *Y* is a dependent variable,
and *X* is an independent variable. The population
regression line is:

Y = Β_{0} + Β_{1}X

where Β_{0} is a constant,
Β_{1} is the regression coefficient,
X is the value of the independent variable, and Y is the
value of the dependent variable.

Given a random sample of observations, the population regression line is estimated by a sample regression line. The sample regression line is:

ŷ = b_{0} + b_{1}x

where b_{0} is a constant,
b_{1} is the regression coefficient,
x is the value of the independent variable, and ŷ is the
*predicted* value of the dependent variable.

## How to Define a Regression Line

Normally, you will
use a computational tool - a software package (e.g., Excel) or a graphing calculator -
to find b_{0} and b_{1}. You enter the
*x* and *y* values into your program or calculator,
and the tool solves for the regression constant (b_{0}) and for the regression coefficient (b_{1}).

In the unlikely event that you find yourself on a desert island
without a computer or a graphing calculator, you can solve for
b_{0} and b_{1} "by hand". Here are the
equations.

b_{1} = Σ [ (x_{i} - x̄)(y_{i} - ȳ) ] / Σ [ (x_{i} - x̄)^{2} ]

Equivalently, in terms of the correlation:

b_{1} = r * (s_{y} / s_{x})

b_{0} = ȳ - b_{1} * x̄

where b_{0} is the constant in the regression equation,
b_{1} is the regression coefficient,
r is the correlation between x and y,
x_{i} is the *x* value for observation *i*,
y_{i} is the *y* value for observation *i*,
x̄ is the sample mean of *x*,
ȳ is the sample mean of *y*,
s_{x} is the standard deviation of *x*, and
s_{y} is the standard deviation of *y*.
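
The "by hand" equations can be followed step by step in a few lines of Python. This is a minimal sketch on hypothetical data (the values and variable names are made up for illustration); it computes b_{1} both ways to show that the two forms agree.

```python
from math import sqrt

# Hypothetical sample data, made up for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar = sum(x) / n  # sample mean of x
y_bar = sum(y) / n  # sample mean of y

# Sums of cross-products and of squared deviations about the means.
sum_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sum_xx = sum((xi - x_bar) ** 2 for xi in x)
sum_yy = sum((yi - y_bar) ** 2 for yi in y)

# First form: cross-product sum divided by the sum of squared x deviations.
b1 = sum_xy / sum_xx

# Second form: correlation times the ratio of sample standard deviations.
r = sum_xy / sqrt(sum_xx * sum_yy)
s_x = sqrt(sum_xx / (n - 1))
s_y = sqrt(sum_yy / (n - 1))
b1_from_r = r * (s_y / s_x)

# The constant follows from the slope and the two means.
b0 = y_bar - b1 * x_bar

print(f"b1 = {b1:.4f}, b1 via r = {b1_from_r:.4f}, b0 = {b0:.4f}")
```

For this data set both forms give a slope of 0.6, and the constant works out to 2.2.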

## Properties of the Regression Line

When the regression parameters (b_{0} and b_{1})
are defined as described
above, the regression line has the following properties.

- The line minimizes the sum of squared differences between observed values (the *y* values) and predicted values (the ŷ values computed from the regression equation).
- The regression line passes through the mean of the *x* values (x̄) and through the mean of the *y* values (ȳ).
- The regression constant (b_{0}) is equal to the y intercept of the regression line.
- The regression coefficient (b_{1}) is the average change in the dependent variable (*y*) for a 1-unit change in the independent variable (*x*). It is the slope of the regression line.

The least squares regression line is the only straight line that has all of these properties.
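
The first two properties can be spot-checked numerically. The sketch below, on hypothetical data, confirms that the fitted line passes through the point of means and that a handful of slightly nudged lines all have a larger sum of squared errors. The nudge sizes are arbitrary, and this is a sanity check, not a proof.

```python
# Hypothetical sample data, made up for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least squares slope and constant from the "by hand" equations.
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
b0 = y_bar - b1 * x_bar


def sum_squared_errors(intercept, slope):
    """Sum of squared differences between observed and predicted y values."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


best = sum_squared_errors(b0, b1)

# Check: the prediction at the mean of x equals the mean of y.
passes_through_means = abs((b0 + b1 * x_bar) - y_bar) < 1e-9

# Check: each of these arbitrarily nudged lines fits no better than the LSRL.
nudges = [(-0.1, 0.0), (0.1, 0.0), (0.0, -0.05), (0.0, 0.05), (0.1, -0.05)]
nudged_never_better = all(
    sum_squared_errors(b0 + d0, b1 + d1) >= best for d0, d1 in nudges
)

print(passes_through_means, nudged_never_better)  # True True
```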

## The Coefficient of Determination

The **coefficient of determination** (denoted by
R^{2}) is a key output of regression analysis.
It is interpreted as the proportion of the variance in the
dependent variable that is predictable from the independent variable.

- The coefficient of determination ranges from 0 to 1.
- An R^{2} of 0 means that the dependent variable cannot be predicted from the independent variable.
- An R^{2} of 1 means the dependent variable can be predicted without error from the independent variable.
- An R^{2} between 0 and 1 indicates the extent to which the dependent variable is predictable. An R^{2} of 0.10 means that 10 percent of the variance in *y* is predictable from *x*; an R^{2} of 0.20 means that 20 percent is predictable; and so on.

The formula for computing the coefficient of determination for a linear regression model with one independent variable is given below.

**Coefficient of determination.**
The coefficient of determination (R^{2}) for a linear regression model with one independent variable is:

R^{2} = { ( 1 / N ) * Σ [ (x_{i} - x̄) * (y_{i} - ȳ) ] / (σ_{x} * σ_{y}) }^{2}

where N is the number of
observations used to fit the model, Σ is the summation symbol,
x_{i} is the x value for observation i,
x̄ is the mean x value,
y_{i} is the y value for observation i,
ȳ is the mean y value,
σ_{x} is the standard deviation of x, and
σ_{y} is the standard deviation of y.

If you know the linear correlation (r) between two variables, then the coefficient of
determination (R^{2}) is easily computed using the following formula:
R^{2} = r^{2}.
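
To make the formula concrete, here is a small sketch on hypothetical data. It assumes σ_{x} and σ_{y} are population standard deviations (dividing by N), which matches the 1 / N factor in the formula. As a cross-check it also computes one minus the ratio of residual variation to total variation, a standard identity for simple regression that is not stated in this lesson.

```python
from statistics import mean, pstdev

# Hypothetical sample data, made up for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar = mean(x)
y_bar = mean(y)

# Population standard deviations (divide by N), matching the 1/N factor.
sigma_x = pstdev(x)
sigma_y = pstdev(y)

# Correlation: the mean cross-product over the product of standard deviations.
mean_cross_product = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / n
r = mean_cross_product / (sigma_x * sigma_y)
r_squared = r ** 2

# Cross-check (assumed identity): R^2 = 1 - SSE / SST for the fitted line.
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
b0 = y_bar - b1 * x_bar
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))  # unexplained
sst = sum((yi - y_bar) ** 2 for yi in y)                       # total
r_squared_from_fit = 1 - sse / sst

print(f"r = {r:.4f}, R^2 = {r_squared:.4f}, 1 - SSE/SST = {r_squared_from_fit:.4f}")
```

Here both routes give an R^{2} of 0.60, so 60 percent of the variance in this made-up *y* is predictable from *x*.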

## Standard Error

The **standard error** about the regression line
(often denoted by SE) is a measure of the average amount that the
regression equation over- or under-predicts. For a given data set, the higher
the coefficient of determination, the lower the standard
error, and the more accurate predictions are likely to be.
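
This lesson does not give a formula for the standard error, so the sketch below assumes one common definition for simple regression: the square root of the sum of squared residuals divided by n - 2. The data are hypothetical, and other texts may define SE slightly differently.

```python
from math import sqrt

# Hypothetical sample data, made up for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least squares slope and constant.
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
b0 = y_bar - b1 * x_bar

# Residuals: observed y minus predicted ŷ for each observation.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
sse = sum(e ** 2 for e in residuals)

# Assumed definition: divide by n - 2 because two parameters were estimated.
standard_error = sqrt(sse / (n - 2))

print(f"SE = {standard_error:.4f}")  # SE = 0.8944
```

Under this definition, predictions from the fitted line miss the observed values by a typical amount of about 0.89 units of *y*.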

## Test Your Understanding

**Problem 1**

A researcher uses a regression equation to predict home heating bills (dollar cost), based on home size (square feet). The correlation between heating bills and home size is 0.70. What is the correct interpretation of this finding?

(A) 70% of the variability in home heating bills can be
explained by home size.

(B) 49% of the variability in home heating bills can be
explained by home size.

(C) For each added square foot of home size, heating bills
increased by 70 cents.

(D) For each added square foot of home size, heating bills
increased by 49 cents.

(E) None of the above.

**Solution**

The correct answer is (B). The coefficient of determination
measures the proportion of variation in the dependent variable
that is predictable from the independent variable. With one independent variable, the
coefficient of determination (R^{2}) is equal to the square of the correlation (r^{2});
in this case, (0.70)^{2} or 0.49. Therefore, 49%
of the variability in heating bills can be explained by home size.