Hypothesis Test: Difference Between Means

This lesson explains how to conduct a hypothesis test for the difference in the means of two independent populations, using one of three test methods: a two-sample z-test, a two-sample t-test, or Welch's t-test.

When to Use This Analysis

The approach described in this lesson is appropriate when the following conditions are met:

The sampling method for each sample is simple random sampling.
Each population is at least 20 times larger than its respective sample.
The sampling distribution for the difference in means is normal or nearly normal.

Before proceeding with a hypothesis test, ensure that these conditions are met.

When is it normal?

Generally, it is safe to assume the sampling distribution of the difference between means will be approximately normal in shape when at least one of the following statements is true for each sample.

The population distribution is normal.
The sample data are symmetric, unimodal, without outliers, and the sample size is 15 or less.
The sample data are moderately skewed, unimodal, without outliers, and the sample size is between 16 and 29.
Sample size is large (30 or more), without outliers.

General Procedure for Hypothesis Testing

To test any hypothesis, the same five-step procedure is used: (1) state the hypotheses, (2) choose the significance level, (3) compute the test statistic, (4) find the P-value, and (5) interpret results. Here, we apply the general procedure to a hypothesis test of the difference between two means.

State the Hypotheses

The table below shows three sets of null and alternative hypotheses. Each makes a statement about the difference (d) between the mean of one population μ₁ and the mean of another population μ₂. (In the table, the symbol ≠ means "not equal to".)

Null hypothesis	Alternative hypothesis	Number of tails
μ₁ - μ₂ = d	μ₁ - μ₂ ≠ d	2
μ₁ - μ₂ ≥ d	μ₁ - μ₂ < d	1
μ₁ - μ₂ ≤ d	μ₁ - μ₂ > d	1

Choose the Significance Level

The significance level is the probability of rejecting the null hypothesis when it is actually true. Often, researchers choose significance levels equal to 0.01, 0.05, or 0.10.

Compute the Test Statistic

The AP Statistics curriculum describes three methods for testing hypotheses about the difference in the means of two independent populations: a two-sample z-test, a two-sample t-test, and Welch's t-test. Here is how to compute a test statistic for each method.

Two-Sample z-Test

If you know each population standard deviation (which is rare), use the two-sample z-test to test the null hypothesis. The test statistic (z) is a z-score computed from these formulas:

SD = sqrt [ ( σ₁² / n₁ ) + ( σ₂² / n₂ ) ]

z = (x₁ - x₂ - d) / SD

where SD is the standard deviation of the sampling distribution of the difference between means, σ₁ and σ₂ are known population standard deviations, n₁ and n₂ are sample sizes, x₁ and x₂ are sample means, and d is the hypothesized difference between population means.
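
The two formulas above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original lesson: the function name two_sample_z and the input values are hypothetical, and the two-tailed P-value is taken from the standard library's NormalDist.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_z(mean1, mean2, sigma1, sigma2, n1, n2, d=0.0):
    """Return (SD, z) for a two-sample z-test with known population SDs."""
    sd = sqrt(sigma1**2 / n1 + sigma2**2 / n2)
    z = (mean1 - mean2 - d) / sd
    return sd, z


# Hypothetical inputs, chosen only so the arithmetic is easy to follow.
sd, z = two_sample_z(mean1=52.0, mean2=50.0, sigma1=3.0, sigma2=4.0,
                     n1=36, n2=64, d=0.0)

# Two-tailed P-value from the standard normal distribution.
p_two_tailed = 2 * NormalDist().cdf(-abs(z))
print(f"SD = {sd:.4f}, z = {z:.4f}, P = {p_two_tailed:.4f}")
```

For a one-tailed test, the factor of 2 would be dropped and the tail chosen to match the alternative hypothesis.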

Two-Sample t-Test

If the population standard deviations are unknown but equal, use the two-sample t-test. The test statistic is a t-score (t) computed from these formulas:

s_p = sqrt{ [ (n₁ - 1) * s₁² + (n₂ - 1) * s₂² ] / (n₁ + n₂ - 2) }

SE = s_p * sqrt( 1 / n₁ + 1 / n₂ )

t = (x₁ - x₂ - d) / SE

df = n₁ + n₂ - 2

where s_p is the pooled standard deviation, s₁ and s₂ are sample standard deviations, n₁ and n₂ are sample sizes, SE is the standard error of the sampling distribution of the difference between means, x₁ and x₂ are sample means, df is degrees of freedom, and d is the hypothesized difference between population means.
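
As a rough sketch of the pooled calculation, the function below follows the four formulas above term by term. The name pooled_t and the sample values are hypothetical and were picked so that the pooled standard deviation comes out to a round number.

```python
from math import sqrt


def pooled_t(mean1, mean2, s1, s2, n1, n2, d=0.0):
    """Return (s_p, SE, t, df) for the equal-variance two-sample t-test."""
    df = n1 + n2 - 2
    # Pooled standard deviation: weighted average of the sample variances.
    s_p = sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df)
    se = s_p * sqrt(1 / n1 + 1 / n2)
    t = (mean1 - mean2 - d) / se
    return s_p, se, t, df


# Hypothetical inputs: equal sample SDs, so s_p equals that common value.
s_p, se, t, df = pooled_t(mean1=20.0, mean2=18.0, s1=3.0, s2=3.0, n1=10, n2=10)
print(f"s_p = {s_p:.3f}, SE = {se:.3f}, t = {t:.3f}, df = {df}")
```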

Welch's t-Test

If the population standard deviations are unknown and unequal, use Welch's t-test. The test statistic is a t-score (t) defined by the following equations:

df = (s₁²/n₁ + s₂²/n₂)² / { [ (s₁² / n₁)² / (n₁ - 1) ] + [ (s₂² / n₂)² / (n₂ - 1) ] }

SE = sqrt[ (s₁²/n₁) + (s₂²/n₂) ]

t = (x₁ - x₂ - d) / SE

where df is degrees of freedom, s₁ and s₂ are sample standard deviations, n₁ and n₂ are sample sizes, SE is the standard error of the sampling distribution of the difference between means, x₁ and x₂ are sample means, and d is the hypothesized difference between population means.
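
The Welch equations can be sketched the same way. The function name welch_t and the inputs below are hypothetical. Note that the degrees of freedom are generally not a whole number; when a table or calculator needs an integer, the value is rounded.

```python
from math import sqrt


def welch_t(mean1, mean2, s1, s2, n1, n2, d=0.0):
    """Return (df, SE, t) for Welch's unequal-variance t-test."""
    v1 = s1**2 / n1   # s1 squared over n1
    v2 = s2**2 / n2   # s2 squared over n2
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    se = sqrt(v1 + v2)
    t = (mean1 - mean2 - d) / se
    return df, se, t


# Hypothetical inputs with clearly unequal sample SDs.
df, se, t = welch_t(mean1=20.0, mean2=18.0, s1=2.0, s2=4.0, n1=16, n2=16)
print(f"df = {df:.2f}, SE = {se:.3f}, t = {t:.3f}")
```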

Note: The standard deviation (SD) and standard error (SE) equations are approximations. They are valid when the population is at least 20 times bigger than the sample.

Find the P-Value

The P-value is the probability of observing a sample statistic as extreme as the test statistic, assuming the null hypothesis is true. To find the P-value for a z-score test statistic, use a standard normal table or a normal distribution calculator. To find the P-value for a t-score test statistic, use a t-distribution table or a t-distribution calculator. (See the sample problems at the end of this lesson for examples of how this is done with Stat Trek's t-Distribution Calculator.)
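
When a table or calculator is not at hand, the same probabilities can be approximated in code. The sketch below is one possible approach and uses only the Python standard library: NormalDist for z-scores, and a numerical integration (Simpson's rule) of the t density for t-scores. The helper names t_pdf, t_cdf, and p_value, and the "two" / "lower" / "upper" labels, are assumptions made for this illustration.

```python
from math import exp, lgamma, log, pi
from statistics import NormalDist


def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_c = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(df * pi)
    return exp(log_c - (df + 1) / 2 * log(1 + x * x / df))


def t_cdf(x, df, steps=2000):
    """P(T <= x), via Simpson's rule on [0, |x|] plus symmetry about zero."""
    a = abs(x)
    h = a / steps                      # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3               # P(0 <= T <= |x|)
    return 0.5 + area if x >= 0 else 0.5 - area


def p_value(stat, tails, df=None):
    """P-value for a z-score (df=None) or a t-score (df given).

    tails: "two", "lower" (alternative uses <), or "upper" (alternative uses >).
    """
    cdf = NormalDist().cdf if df is None else (lambda v: t_cdf(v, df))
    if tails == "lower":
        return cdf(stat)
    if tails == "upper":
        return 1 - cdf(stat)
    return 2 * cdf(-abs(stat))


print(p_value(-1.96, "two"))          # z-score, two-tailed: close to 0.05
print(p_value(1.0, "upper", df=1))    # t-score with df = 1: close to 0.25
```

The number of tails follows the hypotheses table: a ≠ alternative is two-tailed, a < alternative uses the lower tail, and a > alternative uses the upper tail.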

Interpret Results

If the sample findings are unlikely, given the null hypothesis, the researcher rejects the null hypothesis. This involves comparing the P-value to the significance level, and rejecting the null hypothesis when the P-value is less than the significance level.

Test Your Understanding

In this section, two sample problems illustrate how to conduct a hypothesis test of a difference between mean scores. The first problem involves a two-tailed test; the second problem, a one-tailed test.

Problem 1: Two-Tailed Test

Within a school district, students were randomly assigned to one of two Math teachers - Mrs. Smith and Mrs. Jones. After the assignment, Mrs. Smith had 30 students, and Mrs. Jones had 25 students.

At the end of the year, each class took the same standardized test. Mrs. Smith's students had an average test score of 78, with a standard deviation of 10; and Mrs. Jones' students had an average test score of 85, with a standard deviation of 15.

Test the hypothesis that Mrs. Smith and Mrs. Jones are equally effective teachers. Use a 0.10 level of significance. (Assume that student performance is approximately normal.)

Solution: The solution to this problem takes five steps: (1) state the hypotheses, (2) choose the significance level, (3) compute the test statistic, (4) find the P-value, and (5) interpret results. We work through those steps below:

State the hypotheses. The first step is to state the null hypothesis and an alternative hypothesis.

Null hypothesis: μ₁ - μ₂ = 0

Alternative hypothesis: μ₁ - μ₂ ≠ 0
Note that these hypotheses constitute a two-tailed test. The null hypothesis will be rejected if the difference between sample means is too big or if it is too small.
Choose the significance level. Here, the significance level was set at 0.10.
Compute the test statistic. Because we don't know the standard deviation of both populations and we can't assume that the standard deviations are equal, we will use Welch's t-test to test the null hypothesis. We use the formulas below to compute degrees of freedom (df), standard error (SE), and the t-score test statistic (t) for Welch's t-test.
df = (s₁²/n₁ + s₂²/n₂)² / { [ (s₁² / n₁)² / (n₁ - 1) ] + [ (s₂² / n₂)² / (n₂ - 1) ] }

df = (10²/30 + 15²/25)² / { [ (10² / 30)² / (29) ] + [ (15² / 25)² / (24) ] }

df = (3.33 + 9)² / { [ (3.33)² / (29) ] + [ (9)² / (24) ] } = 152.03 / (0.382 + 3.375) = 152.03/3.757 = 40.47

SE = sqrt[ (s₁²/n₁) + (s₂²/n₂) ]
SE = sqrt[ (100/30) + (225/25) ] = sqrt (3.33 + 9) = 3.51

t = [ (x₁ - x₂) - d ] / SE

t = [ (78 - 85) - 0 ] / 3.51 = -7/3.51 = -1.99

where s₁ is the standard deviation of sample 1, s₂ is the standard deviation of sample 2, n₁ is the size of sample 1, n₂ is the size of sample 2, x₁ is the mean of sample 1, x₂ is the mean of sample 2, and d is the hypothesized difference between population means (here, 0).
Find the P-value. Since we have a two-tailed test, the P-value is the probability that a t-score having 40 degrees of freedom will be more extreme than the test statistic (i.e., smaller than -1.99 or bigger than 1.99).
We use the t Distribution Calculator to find P(t < -1.99) is about 0.027.

Since the t-distribution is symmetric around zero, we know that P(t < -1.99) equals P(t > 1.99). Thus, the P-value = 0.027 + 0.027 = 0.054.

Interpret results. Since the P-value (0.054) is less than the significance level (0.10), we reject the null hypothesis.

Note: If you use this approach on an exam, you may also want to mention why this approach is appropriate. Specifically, the approach is appropriate because the sampling method was simple random sampling, the sample size was much smaller than the population size, and the samples were drawn from a normal population.
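
The arithmetic in Problem 1 can be rechecked with a short script. This is a sketch under stated assumptions rather than the lesson's own method: it uses only the Python standard library, approximates the t-distribution with Simpson's rule (the helper t_cdf is hypothetical), and keeps the fractional degrees of freedom instead of rounding to 40, so its P-value may differ slightly in the third decimal.

```python
from math import exp, lgamma, log, pi, sqrt


def t_cdf(x, df, steps=2000):
    """P(T <= x) for Student's t, via Simpson's rule on [0, |x|] and symmetry."""
    log_c = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(df * pi)

    def pdf(u):
        return exp(log_c - (df + 1) / 2 * log(1 + u * u / df))

    a = abs(x)
    h = a / steps
    total = pdf(0.0) + pdf(a)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area


# Problem 1 inputs: sample 1 = Mrs. Smith, sample 2 = Mrs. Jones.
x1, s1, n1 = 78, 10, 30
x2, s2, n2 = 85, 15, 25
d = 0            # hypothesized difference under the null hypothesis
alpha = 0.10     # significance level

v1, v2 = s1**2 / n1, s2**2 / n2
df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
se = sqrt(v1 + v2)
t = (x1 - x2 - d) / se
p_two_tailed = 2 * t_cdf(-abs(t), df)   # both tails, by symmetry

print(f"df = {df:.2f}, SE = {se:.3f}, t = {t:.3f}, P = {p_two_tailed:.3f}")
print("reject H0" if p_two_tailed < alpha else "fail to reject H0")
```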

Problem 2: One-Tailed Test

The Acme Company has developed a new battery. The engineer in charge claims that the new battery will operate continuously for at least 7 minutes longer than the old battery.

To test the claim, the company selects a simple random sample of 100 new batteries and 100 old batteries. The old batteries run continuously for an average of 200 minutes with a standard deviation of 20 minutes; the new batteries, an average of 200 minutes with a standard deviation of 40 minutes.

Test the engineer's claim that the new batteries run at least 7 minutes longer than the old. Use a 0.05 level of significance. (Assume that there are no outliers in either sample.)

State the hypotheses. The first step is to state the null hypothesis and an alternative hypothesis.
Null hypothesis: μ₁ - μ₂ ≥ 7

Alternative hypothesis: μ₁ - μ₂ < 7

where μ₁ is battery life for the new battery, and μ₂ is battery life for the old battery.
Note that these hypotheses constitute a one-tailed test. The null hypothesis will be rejected if the difference between sample means falls so far below 7 that it is unlikely to be due to chance.
Choose the significance level. Here, the significance level was set at 0.05.
Compute the test statistic. Because we don't know the standard deviation of both populations and we can't assume that the standard deviations are equal, we will use Welch's t-test to test the null hypothesis. We use the formulas below to compute degrees of freedom (df), standard error (SE), and the t-score test statistic (t) for Welch's t-test.
df = (s₁²/n₁ + s₂²/n₂)² / { [ (s₁² / n₁)² / (n₁ - 1) ] + [ (s₂² / n₂)² / (n₂ - 1) ] }

df = (40²/100 + 20²/100)² / { [ (40² / 100)² / (99) ] + [ (20² / 100)² / (99) ] }

df = (20)² / { [ (16)² / (99) ] + [ (4)² / (99) ] } = 400 / (2.586 + 0.162) = 400 / 2.748 = 145.6

SE = sqrt[ (s₁²/n₁) + (s₂²/n₂) ]
SE = sqrt[ (1600/100) + (400/100) ] = sqrt (16 + 4) = 4.472

t = [ (x₁ - x₂) - d ] / SE

t = [ (200 - 200) - 7 ] / 4.472 = -7/4.472 = -1.565

where s₁ is the standard deviation of sample 1, s₂ is the standard deviation of sample 2, n₁ is the size of sample 1, n₂ is the size of sample 2, x₁ is the mean of sample 1, x₂ is the mean of sample 2, and d is the hypothesized difference between population means.
Find the P-value. The P-value is the probability of observing a sample statistic as extreme as the test statistic, assuming the null hypothesis were true. For this problem, the test statistic was a t-score of -1.565 with 145 degrees of freedom. We use the t Distribution Calculator to find P(t ≤ -1.565) is about 0.06.

In this test, the average life for both batteries was 200 minutes, so the observed difference in battery life was 0. Based on our analysis, if the null hypothesis were true and the difference in population means were actually 7, we would expect the observed difference in our experiment to be 0 or less about 6% of the time. Therefore, the P-value in this analysis is 0.06.

Interpret results. Since the P-value (0.06) is greater than the significance level (0.05), we cannot reject the null hypothesis that the new battery will last at least 7 minutes longer than the old battery.
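
The Problem 2 quantities can be recomputed directly from the Welch formulas so the arithmetic can be checked independently. As a hedged sketch: the script below uses only the Python standard library, the helper t_cdf is a hypothetical Simpson's-rule approximation of the t-distribution, and the fractional degrees of freedom are used as computed.

```python
from math import exp, lgamma, log, pi, sqrt


def t_cdf(x, df, steps=2000):
    """P(T <= x) for Student's t, via Simpson's rule on [0, |x|] and symmetry."""
    log_c = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(df * pi)

    def pdf(u):
        return exp(log_c - (df + 1) / 2 * log(1 + u * u / df))

    a = abs(x)
    h = a / steps
    total = pdf(0.0) + pdf(a)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area


# Problem 2 inputs: sample 1 = new batteries, sample 2 = old batteries.
x1, s1, n1 = 200, 40, 100
x2, s2, n2 = 200, 20, 100
d = 7            # hypothesized difference (new minus old), in minutes
alpha = 0.05     # significance level

v1, v2 = s1**2 / n1, s2**2 / n2
df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
se = sqrt(v1 + v2)
t = (x1 - x2 - d) / se
p_lower = t_cdf(t, df)   # one-tailed: the alternative is mu1 - mu2 < 7

print(f"df = {df:.1f}, SE = {se:.3f}, t = {t:.3f}, P = {p_lower:.3f}")
print("reject H0" if p_lower < alpha else "fail to reject H0")
```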
