Teach yourself statistics

Confidence Intervals With Paired Data

This lesson explains what paired data is, how to measure paired data, and how to construct a confidence interval around the mean difference in paired data. Key points are illustrated step-by-step with a sample problem.

Understanding Paired Data

Paired data consists of a set of observations that are dependent or somehow related. Some ways to achieve pairing include the following:

Take two measurements on the same subject (e.g., before and after treatment).
Take one measurement on each member of a naturally-occurring pair (e.g., twins or spouses).
Take one measurement on each member of an artificial pair (e.g., subjects matched on some attribute, like height, weight, or IQ).

Key Measure

With paired data, a key measure of interest is the difference (d) between paired measurements:

d = x₁ - x₂

where d is the difference for a single pair, and x₁ and x₂ are two related measurements on the same pair.

Key Statistics

If we select a simple random sample of n matched pairs, we can compute the mean difference (d̄) and the standard deviation (s) of the sampled differences:

d̄ = Σd_i / n

s = sqrt [ Σ (d_i - d̄)² / (n - 1) ]

where d_i is the difference for the ith data pair. The sample mean d̄ is a point estimate of the mean difference in the population of pairs.
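
The two formulas above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation; the function name paired_summary and the before/after measurements are hypothetical and do not come from the lesson.

```python
import math

def paired_summary(x1, x2):
    """Return (differences, mean difference, sample standard deviation)."""
    if len(x1) != len(x2):
        raise ValueError("paired samples must have the same length")
    n = len(x1)
    if n < 2:
        raise ValueError("need at least two pairs to compute s")
    # d_i = x1_i - x2_i for each pair
    d = [a - b for a, b in zip(x1, x2)]
    # mean difference: sum of d_i divided by n
    d_bar = sum(d) / n
    # sample standard deviation with the (n - 1) denominator
    s = math.sqrt(sum((di - d_bar) ** 2 for di in d) / (n - 1))
    return d, d_bar, s

# Hypothetical before/after measurements on four subjects
before = [12, 15, 11, 14]
after = [10, 14, 12, 11]
d, d_bar, s = paired_summary(before, after)
print(d, d_bar, round(s, 3))
```

Note that the order of subtraction matters only for the sign of the result; whichever order is chosen should be applied to every pair.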

Requirements for Analysis

This lesson explains how to construct a confidence interval around the point estimate d̄. The approach described is valid when the following conditions are met:

The dataset is a simple random sample of data pairs from a population of data pairs.
The sampling distribution of the mean difference between data pairs (d̄) is approximately normally distributed.

When is it normal?

Generally, it is safe to assume the sampling distribution of the mean difference between data pairs will be approximately normal in shape when any of the following statements are true.

The population distribution of paired differences (i.e., the variable d) is normal.
The distribution of the sampled paired differences is symmetric, unimodal, without outliers, and the sample size is 15 or less.
The distribution of the sampled paired differences is moderately skewed, unimodal, without outliers, and the sample size is between 16 and 29.
The sample size is 30 or more, without outliers.

Variability of the Mean Paired Difference

To construct a confidence interval around the mean difference between sample data pairs (d̄), we need to know how to compute the standard deviation or the standard error of the sampling distribution for d̄.

Standard deviation: When the standard deviation of the population of paired differences is known, the standard deviation (SD) of the sampling distribution of d̄ is:
SD = σ * sqrt{ ( 1/n ) * ( 1 - n/N ) * [ N / ( N - 1 ) ] }
where σ is the standard deviation of the population difference, N is the population size, and n is the sample size. When the population size is much larger (at least 20 times larger) than the sample size, the standard deviation can be approximated by:
SD = σ / sqrt( n )
Standard error: When the standard deviation of the population of paired differences is unknown, the standard deviation of the sampling distribution cannot be calculated. Under these circumstances, use the standard error. The standard error (SE) can be calculated from the equation below.
SE = s * sqrt{ ( 1/n ) * ( 1 - n/N ) * [ N / ( N - 1 ) ] }
where s is the standard deviation of the sample difference, N is the population size, and n is the sample size. When the population size is much larger (at least 20 times larger) than the sample size, the standard error can be approximated by:
SE = s / sqrt( n )

Note: In real-world analyses, the standard deviation of the population is seldom known. Therefore, the standard error is used more often than the standard deviation.
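
As a rough sketch of how the exact and approximate standard error formulas compare, the Python snippet below applies both to the same inputs. The function name standard_error and the values s = 4.0, n = 25, N = 1000 are hypothetical, chosen only for illustration. The same arithmetic gives the standard deviation (SD) if the population value σ is passed in place of s.

```python
import math

def standard_error(s, n, N=None):
    """Standard error of the mean difference.

    If the population size N is given, the finite-population factor
    from the exact formula is applied; otherwise s / sqrt(n) is used.
    """
    if N is None:
        return s / math.sqrt(n)
    return s * math.sqrt((1 / n) * (1 - n / N) * (N / (N - 1)))

# Hypothetical inputs: the population is 40 times larger than the sample
s, n, N = 4.0, 25, 1000
exact = standard_error(s, n, N)
approx = standard_error(s, n)
print(round(exact, 4), round(approx, 4))
```

With a population this much larger than the sample, the two results differ by only about one percent, which is why the approximation is usually considered acceptable under the 20-times rule.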

Alert

The Advanced Placement Statistics Examination only covers the "approximate" formulas for the standard deviation and standard error.

SD = σ / sqrt( n )

SE = s / sqrt( n )

However, students are expected to be aware of the limitations of these formulas; namely, the approximate formulas should only be used when the population size is at least 20 times larger than the sample size, and when the sampling method is simple random sampling.

The Critical Value

The critical value is a factor used to compute the margin of error around a statistic. When the statistic is the mean difference (d̄) between matched data pairs, the critical value can be expressed as a z-score or a t-score.

z-Score. When sample size is large (n ≥ 30) and the standard deviation of the population distribution of d is known, use a z-score.
t-Score. When the sample size is small (n < 30) or the standard deviation of the population of d is unknown, use a t-score.

Warning

If sample size is small (n < 30) and the population distribution is distinctly not normal (e.g., heavily skewed or contains outliers), do not express the critical value as a z-score or a t-score. (Such cases are not part of the AP Statistics curriculum and are beyond the scope of what we cover in this tutorial.)

How to Express Critical Value as t-Score

To express the critical value as a t-score, follow these steps.

Compute alpha (α): α = 1 - (confidence level / 100)
- When the confidence level is 99%, α is 1 - 99/100 or 0.01.
- When the confidence level is 95%, α is 1 - 95/100 or 0.05.
- When the confidence level is 90%, α is 1 - 90/100 or 0.1.
Find the critical probability (p*): p* = 1 - α/2
Find the degrees of freedom (df): df = n - 1 (where n is the number of data pairs from a single sample)
Find the t-score having degrees of freedom equal to df and a cumulative probability equal to the critical probability (p*).

To find the critical t-score, use an online calculator (e.g., Stat Trek's t Distribution Calculator), a graphing calculator, or a t-distribution statistical table (found in the appendix of most introductory statistics texts).
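
For readers who prefer to compute the critical t-score directly, the steps above can be sketched in Python. This is an illustrative, standard-library-only approach and not the method any particular calculator uses: it integrates the t density with Simpson's rule and inverts the cumulative probability by bisection. The function names (t_pdf, t_cdf, t_critical) are hypothetical. In practice, a statistics library such as SciPy (scipy.stats.t.ppf) would normally be used instead.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_cdf(x, df, steps=2000):
    """Cumulative probability P(T <= x) for x >= 0 (Simpson's rule on [0, x])."""
    if x == 0:
        return 0.5
    h = x / steps  # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3

def t_critical(confidence_level, n):
    """Critical t-score for a confidence level (in percent) and n data pairs."""
    alpha = 1 - confidence_level / 100   # step 1: alpha
    p_star = 1 - alpha / 2               # step 2: critical probability
    df = n - 1                           # step 3: degrees of freedom
    # step 4: find t with cumulative probability p_star
    lo, hi = 0.0, 1.0
    while t_cdf(hi, df) < p_star:        # widen the bracket until it contains t
        hi *= 2
    for _ in range(60):                  # bisection
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p_star:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(round(t_critical(90, 22), 2))  # about 1.72 for 21 degrees of freedom
```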

How to Express Critical Value as z-Score

When the critical value is expressed as a z-score, its value depends on the confidence level. Common z-score critical values are 1.645 for a 90% confidence level, 1.96 for a 95% confidence level, and 2.576 for a 99% confidence level.

To express the critical value as a z-score when the confidence level is not 90%, 95%, or 99%, follow these steps.

Compute alpha (α): α = 1 - (confidence level / 100)
Find the critical probability (p*): p* = 1 - α/2
Find the z-score having a cumulative probability equal to the critical probability (p*).

To find the critical z-score, use an online calculator (e.g., Stat Trek's Normal Distribution Calculator), a graphing calculator, or a normal distribution statistical table (found in the appendix of most introductory statistics texts).
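
The same three steps can be sketched in Python using the NormalDist class from the standard library's statistics module (Python 3.8 or later). The function name z_critical is a hypothetical helper introduced here for illustration.

```python
from statistics import NormalDist

def z_critical(confidence_level):
    """Critical z-score for a confidence level given in percent."""
    alpha = 1 - confidence_level / 100   # step 1: alpha
    p_star = 1 - alpha / 2               # step 2: critical probability
    # step 3: z-score whose cumulative probability equals p_star
    return NormalDist().inv_cdf(p_star)

for level in (90, 95, 99):
    print(level, round(z_critical(level), 3))
# 90 1.645
# 95 1.96
# 99 2.576
```

Any other confidence level, such as 92% or 98%, can be passed in the same way.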

A Judgment Call

Technically, when the population standard deviation is unknown, you should express the critical value as a t-score rather than a z-score, regardless of sample size.

As a practical matter, though, the z-score and the t-score are almost identical when sample size is large (n ≥ 100). And the z-score is easier to use, since z-score critical values (e.g., 1.96 for 95% confidence, 2.576 for 99% confidence) do not change with sample size.

Bottom line: With larger samples (n ≥ 100), the choice between a z-score critical value and a t-score critical value is a judgment call. Analysts often choose the z-score for its ease of use.

How to Find the Confidence Interval Around the Mean Paired Difference

Previously, we described how to construct confidence intervals. For convenience, we repeat the five steps below.

Choose the confidence level. The confidence level describes the uncertainty of a sampling plan. Often, researchers choose 90%, 95%, or 99% confidence levels; but any percentage can be used.
Compute the standard deviation or standard error. When the population size is at least 20 times bigger than the sample size, the standard deviation (SD) and the standard error (SE) of the sampling distribution of the mean difference d̄ can be computed from the following formulas:
SD = σ / sqrt( n )

SE = s / sqrt( n )

where σ is the population standard deviation of d, s is the sample standard deviation of d, and n is sample size.
Find the critical value. Follow the instructions for finding z-score and t-score critical values provided above.
Find the margin of error. You can compute the margin of error, based on either of the following equations.
ME = CV * SD

ME = CV * SE

where ME is the margin of error, CV is the critical value, SD is the standard deviation of the sampling distribution of d̄, and SE is the standard error of the sampling distribution of d̄.
Define the confidence interval. The uncertainty is denoted by the confidence level. And the range of the confidence interval is defined by the following equation.
CI = d̄ ± ME

where CI is the confidence interval, d̄ is the mean difference between matched data pairs, and ME is the margin of error.
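
Steps two through five can be condensed into a short Python sketch. It assumes the population is at least 20 times larger than the sample (so the approximate standard error applies) and that the critical value has already been found. The function name paired_confidence_interval and the input values are hypothetical, chosen only for illustration.

```python
import math

def paired_confidence_interval(d_bar, s, n, critical_value):
    """Return (lower, upper) bounds of the interval d_bar ± ME.

    Uses the approximate standard error s / sqrt(n), which assumes the
    population is at least 20 times larger than the sample.
    """
    se = s / math.sqrt(n)        # standard error of the mean difference
    me = critical_value * se     # margin of error
    return d_bar - me, d_bar + me

# Hypothetical inputs: 16 pairs, mean difference 2.0, s = 4.0,
# and a t critical value of 2.131 (95% confidence, 15 degrees of freedom)
lower, upper = paired_confidence_interval(2.0, 4.0, 16, 2.131)
print(round(lower, 3), round(upper, 3))
```

Passing the critical value as an argument keeps the function independent of whether a z-score or a t-score is appropriate for the problem at hand.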

In the next section, we work through a problem that shows how to use this approach to construct a confidence interval for the mean difference between matched pairs.

Test Your Understanding

Problem

Twenty-two students were randomly selected from a population of 1000 students. The sampling method was simple random sampling. All of the students were given a standardized English test and a standardized math test. Test results are summarized below.

Student	English	Math	Diff, d	(d - d̄)²
1	95	90	5	16
2	89	85	4	9
3	76	73	3	4
4	92	90	2	1
5	91	90	1	0
6	53	53	0	1
7	67	68	-1	4
8	88	90	-2	9
9	75	78	-3	16
10	85	89	-4	25
11	90	95	-5	36
12	85	83	2	1
13	87	83	4	9
14	85	83	2	1
15	85	82	3	4
16	68	65	3	4
17	81	79	2	1
18	84	83	1	0
19	71	60	11	100
20	46	47	-1	4
21	75	77	-2	9
22	80	83	-3	16

Σ(d - d̄)² = 270
d̄ = 1

Find the 90% confidence interval for the mean difference (English score minus math score) between student scores on the two tests. Assume that the differences are approximately normally distributed.

Solution

The approach that we used to solve this problem is valid when the following conditions are met.

The sampling method must be simple random sampling. This condition is satisfied; the problem statement says that we used simple random sampling.
The sampling distribution should be approximately normally distributed. The problem statement says that the differences were normally distributed; so this condition is satisfied.

Since the above requirements are satisfied, we can use the following five-step approach to construct a confidence interval for the difference between matched pairs.

Choose a confidence level. In this analysis, the confidence level is defined for us in the problem. We are working with a 90% confidence level.
Compute the standard deviation or standard error. Since we do not know the standard deviation of the population, we cannot compute the standard deviation of the sampling distribution for d̄; instead, we compute the standard error (SE). Since the population size (1000) is more than 20 times larger than the sample size (22), we can use the approximation equation for the standard error. First, we compute the sample standard deviation (s) for d:
s = sqrt [ Σ(d_i - d̄)² / (n - 1) ]

s = sqrt[ 270/(22-1) ]

s = sqrt(12.857) = 3.586

Once we know the sample standard deviation, we can compute the standard error.

SE = s / sqrt( n )

SE = 3.586 / [ sqrt(22) ]

SE = 3.586/4.69 = 0.765
Find the critical value. The critical value is a factor used to compute the margin of error. Because the sample size is small, we express the critical value as a t-score rather than a z-score. To find the critical value, we take these steps.
- Compute alpha (α):
  α = 1 - (confidence level / 100)
  
  α = 1 - 90/100 = 0.10
- Find the critical probability (p*):
  p* = 1 - α/2 = 1 - 0.10/2 = 0.95
- Find the degrees of freedom (df):
  df = n - 1 = 22 - 1 = 21
- The critical value is the t-score having 21 degrees of freedom and a cumulative probability equal to 0.95. From the t Distribution Calculator, we find that the critical value is about 1.72.
Find the margin of error (ME). We use the margin of error formula, as shown below:
ME = critical value * standard error

ME = 1.72 * 0.765 = 1.3
Define the confidence interval (CI). The range of the confidence interval is defined by the sample statistic ± margin of error. In this problem, the statistic is the mean difference between pairs (d̄), which equals 1. Therefore,
CI = statistic ± ME = d̄ ± ME

CI = 1 ± 1.3
And the uncertainty is denoted by the confidence level, which is 90%.

Therefore, the 90% confidence interval is -0.3 to 2.3 or 1 ± 1.3. Or you might use shorthand notation to describe this confidence interval as (-0.3, 2.3).
