Teach yourself statistics

Correlation Coefficient

Correlation coefficients measure the strength of association between two variables. The most common correlation coefficient, called the Pearson product-moment correlation coefficient, measures the strength of the linear association between variables measured on an interval or ratio scale.

In this tutorial, when we speak simply of a correlation coefficient, we are referring to the Pearson product-moment correlation. Generally, the correlation coefficient of a sample is denoted by r, and the correlation coefficient of a population is denoted by ρ or R.

How to Interpret a Correlation Coefficient

The sign and the absolute value of a correlation coefficient describe the direction and the magnitude of the relationship between two quantitative variables.

The value of a correlation coefficient ranges between -1 and 1.
The greater the absolute value of the Pearson product-moment correlation coefficient, the stronger the linear relationship.
The strongest linear relationship is indicated by a correlation coefficient of -1 or 1.
The weakest linear relationship is indicated by a correlation coefficient equal to 0.
A positive correlation means that if one variable gets bigger, the other variable tends to get bigger.
A negative correlation means that if one variable gets bigger, the other variable tends to get smaller.

Keep in mind that the Pearson product-moment correlation coefficient only measures linear relationships. Therefore, a correlation of 0 does not mean zero relationship between two variables; rather, it means zero linear relationship. (It is possible for two variables to have zero linear relationship and a strong curvilinear relationship at the same time.)
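To make this concrete, here is a minimal sketch in Python. The data are made up for illustration: one set of y values is an exact quadratic function of x, and the other is an exact linear function of x. The helper name pearson_r is hypothetical and is not part of any library.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation computed from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    sum_xy = sum(a * b for a, b in zip(dx, dy))
    return sum_xy / sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))

# Illustrative data (an assumption, not taken from the lesson).
xs = [-3, -2, -1, 0, 1, 2, 3]
ys_curve = [x ** 2 for x in xs]    # y is completely determined by x, but not linearly
ys_line = [2 * x + 1 for x in xs]  # y is an exact linear function of x

r_curve = pearson_r(xs, ys_curve)
r_line = pearson_r(xs, ys_line)
print(r_curve, r_line)
```

For the quadratic data, the positive and negative cross-products cancel, so r is zero even though y depends entirely on x. For the linear data, r is 1.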

Scatterplots and Correlation Coefficients

The scatterplots below show how different patterns of data produce different degrees of correlation.

Maximum positive correlation (r = 1.0)
Strong positive correlation (r = 0.80)
Zero correlation (r = 0)
Maximum negative correlation (r = -1.0)
Moderate negative correlation (r = -0.43)
Strong correlation & outlier (r = 0.71)

Several points are evident from the scatterplots.

When the slope of the line in the plot is negative, the correlation is negative; when the slope is positive, the correlation is positive.
The strongest correlations (r = 1.0 and r = -1.0 ) occur when data points fall exactly on a straight line.
The correlation becomes weaker as the data points become more scattered.
If the data points fall in a random pattern, the correlation is close to zero.
Correlation is affected by outliers. Compare the first scatterplot with the last scatterplot. The single outlier in the last plot greatly reduces the correlation (from 1.00 to 0.71).
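The effect of an outlier can be checked with a short Python sketch. The data here are hypothetical and are not the data behind the scatterplots, so the resulting value should not be expected to match 0.71. The helper name pearson_r is likewise just an illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation computed from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    sum_xy = sum(a * b for a, b in zip(dx, dy))
    return sum_xy / sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))

# Hypothetical data: ten points lying exactly on the line y = x.
xs = list(range(1, 11))
ys_on_line = list(xs)

# Same data, except the last point is moved far below the line.
ys_with_outlier = ys_on_line[:-1] + [1]

r_on_line = pearson_r(xs, ys_on_line)
r_with_outlier = pearson_r(xs, ys_with_outlier)
print(r_on_line, r_with_outlier)
```

Moving a single point off the line is enough to pull the correlation well below 1, even though the other nine points still fall exactly on the line.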

Pearson Correlation for a Population

If you look in different statistics textbooks, you are likely to find different formulas for computing a correlation coefficient. Here is the formula to use when you want to compute the correlation for a population.

Population correlation coefficient. The correlation ρ between two variables is:

ρ = [ Σ (X_i - μ_x) (Y_i - μ_y) ] / ( N * σ_x * σ_y )

where N is the number of observations in the population, Σ is the summation symbol, X_i is the X value for observation i, μ_x is the population mean for variable X, Y_i is the Y value for observation i, μ_y is the population mean for variable Y, σ_x is the population standard deviation of X, and σ_y is the population standard deviation of Y.

In theory, this might be the best formula to use. In practice, we seldom know the population means (μ_x and μ_y) and population standard deviations (σ_x and σ_y) required to compute a population correlation coefficient (ρ); so the opportunity to use this formula is rare.
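For the rare case where every member of the population is observed, the population formula can be sketched in Python as follows. The five-observation "population" is made up for illustration, and the function name population_correlation is hypothetical. The sketch uses pstdev from the standard library, which divides by N rather than N - 1.

```python
from statistics import fmean, pstdev

def population_correlation(xs, ys):
    """Population formula: sum of cross-products / (N * sigma_x * sigma_y)."""
    n = len(xs)
    mu_x, mu_y = fmean(xs), fmean(ys)
    sigma_x, sigma_y = pstdev(xs), pstdev(ys)  # population standard deviations
    cross = sum((x - mu_x) * (y - mu_y) for x, y in zip(xs, ys))
    return cross / (n * sigma_x * sigma_y)

# Hypothetical population of five observations (an assumption for illustration).
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]

rho = population_correlation(X, Y)
print(round(rho, 4))
```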

Pearson Correlation for a Sample

In applied settings, you will almost always compute a correlation coefficient from sample data. Here are two different-looking (but equivalent) formulas for computing a correlation coefficient from sample data.

Sample correlation coefficient. The correlation r between two variables is:

r = Σ (x_i - x̄) (y_i - ȳ) / [ (n - 1) * s_x * s_y ]

where n is the number of observations in the sample, Σ is the summation symbol, x_i is the x value for observation i, x̄ is the sample mean of x, y_i is the y value for observation i, ȳ is the sample mean of y, s_x is the sample standard deviation of x, and s_y is the sample standard deviation of y.

Sample correlation coefficient. The correlation r between two variables is:

r = Σ (xy) / sqrt [ ( Σ x² ) * ( Σ y² ) ]

where Σ is the summation symbol, x = x_i - x̄, x_i is the x value for observation i, x̄ is the mean x value, y = y_i - ȳ, y_i is the y value for observation i, and ȳ is the mean y value.
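The equivalence of the two sample formulas can be checked numerically. The sketch below implements both in Python on a small made-up data set. The function names are hypothetical, and stdev from the standard library is the sample standard deviation (it divides by n - 1).

```python
from math import sqrt
from statistics import fmean, stdev

def r_standard_deviation_form(xs, ys):
    """First formula: cross-products divided by (n - 1) * s_x * s_y."""
    n = len(xs)
    x_bar, y_bar = fmean(xs), fmean(ys)
    cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return cross / ((n - 1) * stdev(xs) * stdev(ys))

def r_deviation_score_form(xs, ys):
    """Second formula: deviation scores x = x_i - mean(x), y = y_i - mean(y)."""
    x_bar, y_bar = fmean(xs), fmean(ys)
    dx = [x - x_bar for x in xs]
    dy = [y - y_bar for y in ys]
    sum_xy = sum(a * b for a, b in zip(dx, dy))
    sum_x2 = sum(a * a for a in dx)
    sum_y2 = sum(b * b for b in dy)
    return sum_xy / sqrt(sum_x2 * sum_y2)

# Hypothetical sample data (an assumption for illustration).
sample_x = [1, 2, 3, 4, 5]
sample_y = [2, 4, 5, 4, 5]

r_first = r_standard_deviation_form(sample_x, sample_y)
r_second = r_deviation_score_form(sample_x, sample_y)
print(r_first, r_second)
```

The two results agree up to floating-point rounding, because the (n - 1) terms inside s_x and s_y cancel the (n - 1) in the denominator of the first formula.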

Fortunately, you will rarely have to compute a correlation coefficient by hand. Many software packages (e.g., Excel) and most graphing calculators have a correlation function that will do the job for you. And if you take the AP Statistics exam, the first formula will be provided to you at the exam site, so there's no need to memorize the correlation formula for the exam.

What About Bias?

When you work with a sample correlation coefficient, there's good news and there's bad news. First, the bad news. The sample correlation coefficient (r) is a biased estimate of the population coefficient (ρ).

Now, the good news. The bias is noticeable mainly in small samples (e.g., n < 30). The sample correlation r is a consistent estimator of ρ, meaning that it converges to the true value of ρ as n gets bigger. For large sample sizes, the bias is negligible, and r is approximately unbiased.
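A small simulation can illustrate the bias. The sketch below is only an illustration under stated assumptions: the data are drawn from a bivariate normal population with ρ = 0.5, the sample sizes (5 and 50), the number of repetitions, and the random seed are arbitrary choices, and the helper names are hypothetical.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation computed from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    sum_xy = sum(a * b for a, b in zip(dx, dy))
    return sum_xy / sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))

def mean_sample_r(rho, n, reps, rng):
    """Average r over many samples of size n from a bivariate normal population."""
    total = 0.0
    for _ in range(reps):
        xs = [rng.gauss(0, 1) for _ in range(n)]
        # y = rho*x + sqrt(1 - rho^2)*noise has correlation rho with x.
        ys = [rho * x + sqrt(1 - rho ** 2) * rng.gauss(0, 1) for x in xs]
        total += pearson_r(xs, ys)
    return total / reps

rng = random.Random(12345)  # fixed seed so the run is repeatable
rho = 0.5                   # assumed population correlation

mean_r_small = mean_sample_r(rho, n=5, reps=20000, rng=rng)
mean_r_large = mean_sample_r(rho, n=50, reps=5000, rng=rng)
print(mean_r_small, mean_r_large)
```

Under these assumptions, the average r from samples of size 5 falls noticeably below 0.5, while the average from samples of size 50 is much closer to 0.5.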

Test Your Understanding

Problem 1

A national consumer magazine reported the following correlations.

The correlation between car weight and car reliability is -0.30.
The correlation between car weight and annual maintenance cost is 0.20.

Which of the following statements are true?

I. Heavier cars tend to be less reliable.
II. Heavier cars tend to cost more to maintain.
III. Car weight is related more strongly to reliability than to maintenance cost.

(A) I only
(B) II only
(C) III only
(D) I and II only
(E) I, II, and III

Solution

The correct answer is (E). The correlation between car weight and reliability is negative. This means that reliability tends to decrease as car weight increases. The correlation between car weight and maintenance cost is positive. This means that maintenance costs tend to increase as car weight increases.

The strength of a relationship between two variables is indicated by the absolute value of the correlation coefficient. The correlation between car weight and reliability has an absolute value of 0.30. The correlation between car weight and maintenance cost has an absolute value of 0.20. Therefore, the relationship between car weight and reliability is stronger than the relationship between car weight and maintenance cost.
