Teach yourself statistics

Sampling Distribution: Difference Between Means

Statistics problems often involve comparisons between sample means from two independent populations. This lesson describes the sampling distribution for the difference between sample means.

In this lesson, you'll learn how to find the mean of the sampling distribution, how to compute the standard deviation of the sampling distribution, how to compute the standard error of the sampling distribution, and how to find the cumulative probability that the difference between sample means will be less than or equal to any value of interest.

Shape of Sampling Distribution

Consider the following scenario. We have two populations with population means equal to μ₁ and μ₂. We take all possible simple random samples of size n₁ from population 1, and all possible simple random samples of size n₂ from population 2. For each sample from population 1, we compute the sample mean x₁; and for each sample from population 2, we compute the sample mean x₂. Finally, for every possible pairing of a sample from population 1 with a sample from population 2, we compute the difference between sample means; that is, we compute x₁ - x₂.

Assume that the samples are independent; that is, observations in one sample are not affected by observations in the other sample. Given that assumption, we might ask: Is the sampling distribution better described by a normal distribution or by a t distribution?

Here are some "rules of thumb" to answer that question:

If both sample sizes are sufficiently large (n ≥ 30), the sampling distribution for the difference between independent sample means will be approximately normally distributed. We know this from the central limit theorem.
If either sample size is small (n < 30), the sampling distribution will follow a t-distribution, unless both populations are normally distributed and both population standard deviations are known (see next bullet point).
If both populations are normally distributed and both population standard deviations are known, the sampling distribution will be described by a normal distribution.

Note: The "rules of thumb" presented above are guidelines. You may find slightly different "rules" in other places. When in doubt, the t distribution is the more conservative choice.
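The rules of thumb above can be written as a small decision function. This is only a sketch: the function name and its parameters are hypothetical, and the cutoff of 30 is the guideline stated in this lesson, not a strict rule.

```python
def choose_distribution(n1, n2, populations_normal=False, sigmas_known=False):
    """Return "normal" or "t", following the lesson's rules of thumb."""
    # Normal populations with known standard deviations: normal at any sample size.
    if populations_normal and sigmas_known:
        return "normal"
    # Both samples large (n >= 30): approximately normal, by the central limit theorem.
    if n1 >= 30 and n2 >= 30:
        return "normal"
    # Otherwise at least one sample is small: use the t distribution.
    return "t"


print(choose_distribution(100, 50))
print(choose_distribution(12, 40))
print(choose_distribution(12, 40, populations_normal=True, sigmas_known=True))
```

The check for normal populations with known standard deviations comes first because, under the guidelines, it overrides the small-sample rule.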

Mean of Sampling Distribution

The mean of the sampling distribution (μ_d) is the expected value of the difference between all possible sample means. Thus,

μ_d = E(x₁ - x₂) = μ₁ - μ₂

where x_i is the mean of sample i, and μ_i is the mean of population i.

Standard Deviation of Sampling Distribution

The standard deviation of the difference between sample means (σ_d) is approximately equal to:

σ_d = sqrt( σ₁² / n₁ + σ₂² / n₂ )

It is straightforward to derive this formula, based on material covered in previous lessons. The derivation starts with a recognition that the variance of the difference between independent random variables is equal to the sum of the individual variances. Thus,

σ_d² = σ²_(x₁ - x₂) = σ²_x₁ + σ²_x₂

If the population sizes N₁ and N₂ are both large relative to the sample sizes n₁ and n₂, respectively, then

σ²_x₁ = σ₁² / n₁

σ²_x₂ = σ₂² / n₂

σ_d² = σ₁² / n₁ + σ₂² / n₂

σ_d = sqrt( σ₁² / n₁ + σ₂² / n₂ )

Henceforth, to minimize confusion between the various different measures of standard deviation, we will refer to the standard deviation of the sampling distribution of the difference between two means as SD. Thus,

SD = σ_d = sqrt( σ₁² / n₁ + σ₂² / n₂ )
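One way to build confidence in the formulas for μ_d and SD is to check them by simulation. The sketch below assumes two normally distributed populations with made-up parameters, and it draws a few thousand random pairs of samples instead of enumerating all possible samples, so the simulated values should land close to the formulas but will not match them exactly. The function names are hypothetical.

```python
import math
import random
import statistics


def sd_of_difference(sigma1, n1, sigma2, n2):
    """Standard deviation of the sampling distribution of x1 - x2."""
    return math.sqrt(sigma1**2 / n1 + sigma2**2 / n2)


def simulate_differences(mu1, sigma1, n1, mu2, sigma2, n2, trials, seed=0):
    """Draw `trials` pairs of independent samples; return the differences in sample means."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    diffs = []
    for _ in range(trials):
        mean1 = statistics.fmean(rng.gauss(mu1, sigma1) for _ in range(n1))
        mean2 = statistics.fmean(rng.gauss(mu2, sigma2) for _ in range(n2))
        diffs.append(mean1 - mean2)
    return diffs


# Hypothetical populations, chosen only for illustration.
mu1, sigma1, n1 = 15.0, 7.0, 10
mu2, sigma2, n2 = 10.0, 6.0, 20

diffs = simulate_differences(mu1, sigma1, n1, mu2, sigma2, n2, trials=5000)
print("mean: formula =", mu1 - mu2, " simulated =", statistics.fmean(diffs))
print("SD:   formula =", sd_of_difference(sigma1, n1, sigma2, n2),
      " simulated =", statistics.pstdev(diffs))
```

Increasing the number of trials should bring the simulated values closer to the formula values.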

Standard Error of Sampling Distribution

Typically, we don't know the values for population standard deviations, σ₁ and σ₂. And, if we don't know the population standard deviations, we cannot compute the standard deviation of the difference between sample means (SD).

However, we can estimate the population standard deviation from sample data, as shown below:

s = sqrt [ Σ ( x_i - x )² / ( n - 1 ) ]

where s is the sample standard deviation (i.e., the sample estimate of the population standard deviation), x is the sample mean, x_i is the ith element from the sample, n is the number of elements in the sample.

Substituting sample estimates of each population standard deviation into the equation for SD, we get:

SE = sqrt( s²₁ / n₁ + s²₂ / n₂ )

In this equation, SE is a sample estimate of the standard deviation of the difference between sample means (SD). SE is the standard error of the difference between sample means. Also, s₁ is the standard deviation of sample 1 (i.e., the sample estimate of σ₁), s₂ is the standard deviation of sample 2 (i.e., the sample estimate of σ₂), n₁ is the sample size in sample 1, and n₂ is the sample size in sample 2.
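As a rough illustration, the standard error can be computed from raw sample data with Python's standard library. The data below are made up, and the function name is hypothetical; `statistics.variance` uses the n - 1 denominator, which matches the formula for s given above.

```python
import math
import statistics


def standard_error(sample1, sample2):
    """Standard error of the difference between two independent sample means."""
    # statistics.variance divides by n - 1, so it returns s squared.
    var1 = statistics.variance(sample1)
    var2 = statistics.variance(sample2)
    return math.sqrt(var1 / len(sample1) + var2 / len(sample2))


# Made-up observations, for illustration only.
sample1 = [2, 4, 4, 4, 5, 5, 7, 9]
sample2 = [1, 2, 3, 4, 5]
print(standard_error(sample1, sample2))
```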

In future lessons, you will see that being able to compute the standard error from sample data is essential for inferential statistics. It will allow us to compute confidence intervals for the difference between means and to test hypotheses about the difference between means.

How to Find Probability

The sampling distribution of the difference between two sample means is a probability distribution. You can use the sampling distribution to find a cumulative probability for any difference between sample means. Specifically, you can find:

P( x₁ - x₂ ≤ d )

where x₁ is the mean of sample 1, x₂ is the mean of sample 2, and d is a constant.

Finding the probability that the difference between sample means is no greater than the constant d is a four-step process:

Step 1: Find Mean of Distribution

The mean of the sampling distribution of the difference between independent sample means is the mean of population 1 minus the mean of population 2. Thus,

μ = μ₁ - μ₂

where μ is the mean of the sampling distribution, μ₁ is the mean of population 1, and μ₂ is the mean of population 2.

Step 2: Find Standard Deviation

Earlier in this lesson (see above), we explained how to compute standard deviation of the sampling distribution (SD) when you know each population variance. And we showed how to estimate the standard deviation with the standard error (SE) when you don't know the population variance. For convenience, we repeat those formulas below:

SD = sqrt( σ₁² / n₁ + σ₂² / n₂ )

SE = sqrt( s²₁ / n₁ + s²₂ / n₂ )

where SD is the standard deviation of the sampling distribution, SE is the standard error, σ₁ and σ₂ are population standard deviations, s₁ and s₂ are sample estimates of population standard deviations, and n₁ and n₂ are sample sizes from each population.

Step 3: Transform d Into z- or t-Score

In the beginning of this lesson, we offered some "rules of thumb" to describe the shape of the sampling distribution. If those guidelines suggest that the sampling distribution is normal, compute a z-score using this formula:

z = (d - μ) / SD

where d is the value of a constant for which we want to find a probability, μ is the mean of the sampling distribution, and SD is the standard deviation of the sampling distribution.

If the "rules of thumb" suggest that the sampling distribution is shaped like a t distribution, compute a t-score using this formula:

t = (d - μ) / SE

where SE is the standard error of the sampling distribution.

If you compute a t-score, you will also need to find the degrees of freedom. There are different formulas for the degrees of freedom, depending on whether you assume the two populations have equal or unequal variances.

Degrees of Freedom: Equal Variance

If you assume equal variances between the two groups, here is the formula for degrees of freedom (df).

df = n₁ + n₂ - 2

where n₁ is sample size in the first group, and n₂ is sample size in the second group.

If you have 30 observations in the first group (n₁ = 30) and 40 observations in the second group (n₂ = 40), the degrees of freedom would be:

df = 30 + 40 - 2 = 68

Degrees of Freedom: Unequal Variance

If you do not assume equal variances between the two groups, then the degrees of freedom are calculated using a more complex formula that accounts for the difference in variances between the two groups.

num = (s₁²/n₁ + s₂²/n₂)²

den = [(s₁²/n₁)²/(n₁ - 1)] + [(s₂²/n₂)²/(n₂ - 1)]

df = num / den

where s₁² and s₂² are sample variances, and n₁ and n₂ are sample sizes in the two groups.

This formula (known as the Welch-Satterthwaite Approximation) often produces a non-integer value for degrees of freedom. When that happens, round it to the nearest whole number.
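Both degrees-of-freedom formulas are short enough to sketch in Python. The function names are hypothetical, and the inputs to the unequal-variance example are made-up sample standard deviations and sample sizes.

```python
def df_equal_variance(n1, n2):
    """Degrees of freedom when equal variances are assumed."""
    return n1 + n2 - 2


def df_welch(s1, n1, s2, n2):
    """Welch-Satterthwaite degrees of freedom (usually not a whole number)."""
    v1 = s1**2 / n1  # s1 squared over n1
    v2 = s2**2 / n2  # s2 squared over n2
    num = (v1 + v2) ** 2
    den = v1**2 / (n1 - 1) + v2**2 / (n2 - 1)
    return num / den


print(df_equal_variance(30, 40))  # 68, as in the equal-variance example

# Hypothetical sample standard deviations and sample sizes.
df = df_welch(s1=7.0, n1=100, s2=6.0, n2=50)
print(df, "rounds to", round(df))
```

When the two sample variances and the two sample sizes are equal, the Welch-Satterthwaite value reduces to n₁ + n₂ - 2, which is a handy sanity check.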

Step 4: Find Probability

Find the probability for the z-score or t-score that you calculated in Step 3, and you have found the probability that the difference between sample means is no greater than the constant d.

You can find the probability for the z-score or a t-score from a handheld graphing calculator, from a written probability table commonly found in the appendix of introductory statistics texts, or from an online probability calculator, like Stat Trek's normal distribution calculator and t distribution calculator.
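If you prefer code to a calculator or table, the z-score case can be handled with Python's standard library, as in the minimal sketch below. The function name is hypothetical. The standard library has no t distribution, so a t-score would need a third-party package; SciPy's `scipy.stats.t.cdf` is one option.

```python
from statistics import NormalDist


def prob_at_most(z):
    """P(Z <= z) for a standard normal variable Z."""
    return NormalDist(mu=0.0, sigma=1.0).cdf(z)


print(prob_at_most(0.0))              # 0.5: half the distribution lies below the mean
print(round(prob_at_most(1.96), 3))   # 0.975
```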

Difference Between Means: Sample Problem

In this section, we work through a sample problem to show how to find probability, using the four-step solution described above. In this example, we will use Stat Trek's Normal Distribution Calculator to compute probabilities.

Test Your Understanding

Problem 1

For boys, the average number of absences in the first grade is 15 with a standard deviation of 7; for girls, the average number of absences is 10 with a standard deviation of 6.

In a nationwide survey, suppose 100 boys and 50 girls are randomly sampled. What is the probability that the male sample will have at most three more days of absences than the female sample?

(A) 0.025
(B) 0.035
(C) 0.045
(D) 0.055
(E) None of the above

Solution

The correct answer is B. Here's the four-step solution:

Step 1. Find the mean difference (male absences minus female absences) in the population.
μ_d = μ₁ - μ₂ = 15 - 10 = 5
Step 2. Find the standard deviation of the difference.
SD = sqrt( σ₁² / n₁ + σ₂² / n₂ )

SD = sqrt(7²/100 + 6²/50) = sqrt(49/100 + 36/50)

SD = sqrt(0.49 + 0.72) = sqrt(1.21) = 1.1
Step 3. Transform d into a z-score or a t-score. Since both samples are large, we conclude that the sampling distribution of the difference between means is approximately normal. Because the distribution is normal, we find a z-score rather than a t-score. When boys have three more days of absences, the number of male absences minus female absences is 3; so the constant d is 3. And the associated z-score is
z = (d - μ)/SD = (3 - 5)/1.1 = -2/1.1 = -1.818
Step 4. Find the probability. To find this probability, we use Stat Trek's Normal Distribution Calculator. Specifically, we enter the following inputs: -1.818, for the z-score; 0, for the mean; and 1, for the standard deviation. (It is not necessary to compute the mean or standard deviation of the z-score, because every z-score has a mean of 0 and a standard deviation of 1.)

We find that the probability of a z-score being -1.818 or less is about 0.035. This means the probability of our survey finding that boys average at most 3 more days of absences than girls is 0.035.
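The four steps of this solution can also be reproduced in a few lines of Python. This sketch uses the standard library's `NormalDist` in place of the online calculator; the variable names are hypothetical, and the numbers come straight from the problem statement. The printed probability should land near 0.035.

```python
import math
from statistics import NormalDist

# Values given in the problem.
mu_boys, sigma_boys, n_boys = 15, 7, 100
mu_girls, sigma_girls, n_girls = 10, 6, 50
d = 3  # boys have at most three more days of absences than girls

# Step 1: mean of the sampling distribution.
mean_diff = mu_boys - mu_girls

# Step 2: standard deviation of the sampling distribution.
sd = math.sqrt(sigma_boys**2 / n_boys + sigma_girls**2 / n_girls)

# Step 3: transform d into a z-score.
z = (d - mean_diff) / sd

# Step 4: cumulative probability for that z-score.
p = NormalDist().cdf(z)

print(f"mean = {mean_diff}, SD = {sd:.3f}, z = {z:.3f}, P = {p:.3f}")
```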

Note: Some analysts might have used the t distribution to compute probabilities for this problem. We used the normal distribution because both samples were relatively large. If we had used the t distribution, the results would have been very similar, because the t distribution and the normal distribution are very similar when sample size is large. The bigger the sample, the more closely the t distribution resembles the normal distribution. So, for this problem, the choice between a normal distribution and a t distribution was not critical. You can find guidelines for choosing between the normal distribution and the t distribution at https://stattrek.com/statistics/normal-vs-t-distribution.
