Teach yourself statistics

Linear Regression Example

In this lesson, we apply regression analysis to some fictitious data, and we show how to interpret the results of our analysis.

Note: Regression computations are usually handled by a software package or a graphing calculator. For this example, however, we will do the computations "manually", since the gory details have educational value.

Problem Statement

Last year, five randomly selected students took a math aptitude test before they began their statistics course. The Statistics Department has three questions.

What linear regression equation best predicts statistics performance, based on math aptitude scores?
If a student made an 80 on the aptitude test, what grade would we expect her to make in statistics?
How well does the regression equation fit the data?

How to Find the Regression Equation

In the table below, the x_i column shows scores on the aptitude test. Similarly, the y_i column shows statistics grades. The last two columns show deviation scores: the difference between each student's score and the mean score on that measurement. The last two rows show the sums and means that we will use to conduct the regression analysis.

Student	x_i	y_i	(x_i - x̄)	(y_i - ȳ)
1	95	85	17	8
2	85	95	7	18
3	80	70	2	-7
4	70	65	-8	-12
5	60	70	-18	-7
Sum	390	385
Mean	78	77
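The deviation scores above can be reproduced with a few lines of Python. This is only a minimal sketch for checking the table; the variable names (x, y, x_dev, y_dev) are chosen here for illustration and are not part of the lesson.

```python
# Data from the table above (variable names are illustrative assumptions).
x = [95, 85, 80, 70, 60]  # aptitude test scores
y = [85, 95, 70, 65, 70]  # statistics grades

# Mean of each measurement.
x_mean = sum(x) / len(x)
y_mean = sum(y) / len(y)

# Deviation score = individual score minus the mean score.
x_dev = [xi - x_mean for xi in x]
y_dev = [yi - y_mean for yi in y]

print(x_mean, y_mean)  # 78.0 77.0
print(x_dev)           # [17.0, 7.0, 2.0, -8.0, -18.0]
print(y_dev)           # [8.0, 18.0, -7.0, -12.0, -7.0]
```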

And for each student, we also need to compute the squares of the deviation scores (the last two columns in the table below).

Student	x_i	y_i	(x_i - x̄)²	(y_i - ȳ)²
1	95	85	289	64
2	85	95	49	324
3	80	70	4	49
4	70	65	64	144
5	60	70	324	49
Sum	390	385	730	630
Mean	78	77

And finally, for each student, we need to compute the product of the deviation scores (the last column in the table below).

Student	x_i	y_i	(x_i - x̄)(y_i - ȳ)
1	95	85	136
2	85	95	126
3	80	70	-14
4	70	65	96
5	60	70	126
Sum	390	385	470
Mean	78	77
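The three sums that feed the regression analysis (730, 630, and 470) can be checked as follows. This is a sketch under the assumption that the five scores are held in plain Python lists; the names ss_x, ss_y, and sp_xy are illustrative labels, not notation from the lesson.

```python
# Data from the tables above (variable names are illustrative assumptions).
x = [95, 85, 80, 70, 60]  # aptitude test scores
y = [85, 95, 70, 65, 70]  # statistics grades

x_mean = sum(x) / len(x)
y_mean = sum(y) / len(y)

# Sum of squared deviations for x and for y.
ss_x = sum((xi - x_mean) ** 2 for xi in x)
ss_y = sum((yi - y_mean) ** 2 for yi in y)

# Sum of the products of paired deviations.
sp_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))

print(ss_x, ss_y, sp_xy)  # 730.0 630.0 470.0
```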

The regression equation is a linear equation of the form: ŷ = b₀ + b₁x . To conduct a regression analysis, we need to solve for b₀ and b₁. Computations are shown below. Notice that all of our inputs for the regression analysis come from the above three tables.

First, we solve for the regression coefficient (b₁):

b₁ = Σ [ (x_i - x̄)(y_i - ȳ) ] / Σ [ (x_i - x̄)² ]

b₁ = 470/730

b₁ = 0.644

Once we know the value of the regression coefficient (b₁), which is the slope of the regression line, we can solve for the intercept (b₀):

b₀ = ȳ - b₁ * x̄

b₀ = 77 - (0.644)(78)

b₀ = 26.768

Therefore, the regression equation is: ŷ = 26.768 + 0.644x .
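The same coefficients can be computed directly from the raw scores. The sketch below assumes the data are stored in plain lists and uses illustrative names (b1, b0). Note that it carries b₁ at full precision, so the intercept comes out near 26.781 rather than 26.768; the lesson's value results from rounding b₁ to 0.644 before computing b₀.

```python
# Data from the tables above (variable names are illustrative assumptions).
x = [95, 85, 80, 70, 60]  # aptitude test scores
y = [85, 95, 70, 65, 70]  # statistics grades

x_mean = sum(x) / len(x)
y_mean = sum(y) / len(y)

# Slope: sum of cross-products divided by sum of squared x deviations.
b1 = (
    sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    / sum((xi - x_mean) ** 2 for xi in x)
)

# Intercept: the fitted line passes through the point of means.
b0 = y_mean - b1 * x_mean

print(round(b1, 3))  # 0.644
print(round(b0, 3))  # 26.781 (unrounded slope; the lesson uses 0.644 and gets 26.768)
```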

How to Use the Regression Equation

Once you have the regression equation, using it is a snap. Choose a value for the independent variable (x), perform the computation, and you have an estimated value (ŷ) for the dependent variable.

In our example, the independent variable is the student's score on the aptitude test. The dependent variable is the student's statistics grade. If a student made an 80 on the aptitude test, the estimated statistics grade (ŷ) would be:

ŷ = b₀ + b₁x

ŷ = 26.768 + 0.644x = 26.768 + 0.644 * 80

ŷ = 26.768 + 51.52 = 78.288

Warning: When you use a regression equation to predict the value of a dependent variable, do not use values for the independent variable that are outside the range of values used to create the equation. That is called extrapolation, and it can produce unreasonable estimates.

In this example, the aptitude test scores used to create the regression equation ranged from 60 to 95. Therefore, only use aptitude test score values between 60 and 95 to predict statistics grades. Using values outside that range (less than 60 or greater than 95) is problematic.
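One way to make the extrapolation warning concrete is to wrap the regression equation in a function that refuses inputs outside the observed range. This is a hypothetical helper written for illustration (predict_grade and the constant names are assumptions), using the rounded coefficients from this lesson.

```python
# Rounded coefficients from this lesson's regression equation.
B0 = 26.768  # intercept
B1 = 0.644   # slope

# Range of aptitude scores used to fit the equation.
X_MIN, X_MAX = 60, 95


def predict_grade(aptitude):
    """Estimate a statistics grade; reject inputs that would extrapolate."""
    if not X_MIN <= aptitude <= X_MAX:
        raise ValueError(
            f"aptitude score {aptitude} is outside the fitted range "
            f"{X_MIN} to {X_MAX}; the prediction would be an extrapolation"
        )
    return B0 + B1 * aptitude


print(round(predict_grade(80), 3))  # 78.288
```

Calling predict_grade(50) or predict_grade(100) raises a ValueError instead of returning an estimate. Raising an error is only one possible design; a warning would also be reasonable.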

How to Find the Coefficient of Determination

Whenever you use a regression equation, you should ask how well the equation fits the data. One way to assess fit is to check the coefficient of determination, which can be computed from the following formula.

R² = r² = { Σ (x_i - x̄) (y_i - ȳ) / [ (n - 1) * s_x * s_y ] }²

where n is the number of observations used to fit the model, x_i is the x value for observation i, x̄ is the mean x value, y_i is the y value for observation i, ȳ is the mean y value, s_x is the sample standard deviation of x, and s_y is the sample standard deviation of y.

Computations for the sample problem of this lesson are shown below. We begin by computing the sample standard deviation of x (s_x):

s_x = sqrt [ Σ ( x_i - x̄ )² / (n-1) ]

s_x = sqrt( 730/4 ) = sqrt(182.5) = 13.51

Next, we find the standard deviation of y (s_y):

s_y = sqrt [ Σ ( y_i - ȳ )² / (n-1) ]

s_y = sqrt( 630/4 ) = sqrt(157.5) = 12.55

And finally, we compute the coefficient of determination (R²):

R² = { Σ (x_i - x̄) (y_i - ȳ) / [ (n - 1) * s_x * s_y ] }²

R² = [ 470 / ( 4 * 13.51 * 12.55 ) ]²

R² = ( 470 / 678.2 )² = ( 0.693 )² = 0.48
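The computation above can be sketched in Python using the standard library's statistics.stdev, which computes the sample standard deviation (dividing by n - 1). The variable names are illustrative assumptions.

```python
from statistics import mean, stdev

# Data from the tables above (variable names are illustrative assumptions).
x = [95, 85, 80, 70, 60]  # aptitude test scores
y = [85, 95, 70, 65, 70]  # statistics grades
n = len(x)

x_mean = mean(x)
y_mean = mean(y)

# Sample standard deviations (n - 1 in the denominator).
s_x = stdev(x)
s_y = stdev(y)

# Sum of the products of paired deviations.
sp_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))

# Correlation coefficient, then its square.
r = sp_xy / ((n - 1) * s_x * s_y)
r_squared = r ** 2

print(round(s_x, 2), round(s_y, 2))  # 13.51 12.55
print(round(r, 3))                   # 0.693
print(round(r_squared, 2))           # 0.48
```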

A coefficient of determination equal to 0.48 indicates that about 48% of the variation in statistics grades (the dependent variable) can be explained by the relationship to math aptitude scores (the independent variable). This would be considered a good fit to the data, in the sense that it would substantially improve an educator's ability to predict student performance in statistics class.
