Sampling Distribution of a Proportion
Suppose that we draw all possible random samples of size n from a given population. Suppose further that we compute a proportion for each sample. The probability distribution of this statistic is the sampling distribution for the proportion.
How to Represent Sampling Distribution
The sampling distribution of a proportion (or any discrete variable) is typically represented by a table or a histogram.
Here’s a simple example. Suppose we wanted to know the proportion of families that own dogs in a city of 100,000 families. If we surveyed every family in the city, we might find that 40% own dogs, so the actual proportion of dog owners in the population is 0.4.
It would be impractical to survey every single family; but we could sample a subset of families to estimate the proportion of dog owners. If we randomly selected two families for our sample, we could observe three possible outcomes. If nobody in our two-family sample owned a dog, the estimated sample proportion would be 0. If one family in the sample owned a dog, the estimated sample proportion would be 0.5. And if both families in the sample owned a dog, the estimated sample proportion would be 1. Given these three possible outcomes, a sampling distribution for this study might take the form of a table or a histogram, as shown below.
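Table
Sample proportion | Probability
0 | 0.36
0.5 | 0.48
1 | 0.16
Histogram (figure: bars of height 0.36, 0.48, and 0.16 over the sample proportions 0, 0.5, and 1)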
This table and this histogram are both examples of sampling distributions, because both show probabilities for each possible sample outcome. From the table, we see there is a 36% probability that the sample proportion will be 0; a 48% probability that the sample proportion will be 0.5; and a 16% probability that the sample proportion will be 1. That covers every possible outcome, since this study yields only three possible outcomes: 0, 0.5, or 1. The histogram shows the same information: a 36% probability that the sample proportion will be 0; a 48% probability that the sample proportion will be 0.5; and a 16% probability that the sample proportion will be 1.
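The three probabilities above can be reproduced with a short calculation. The sketch below assumes a binomial model, which is reasonable here because the population of 100,000 families is very large compared with a sample of two. The function name sampling_distribution is ours, introduced only for illustration.

```python
from math import comb

def sampling_distribution(n, P):
    """Return {sample proportion: probability} for samples of size n,
    assuming each sampled family owns a dog with probability P."""
    return {k / n: comb(n, k) * P**k * (1 - P)**(n - k) for k in range(n + 1)}

# Dog-ownership example: population proportion 0.4, two families sampled
dist = sampling_distribution(n=2, P=0.4)
for p_hat, prob in dist.items():
    print(f"sample proportion {p_hat:.1f}: probability {prob:.2f}")
```

The printed probabilities are 0.36, 0.48, and 0.16, matching the table.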
What is the Effect of Sample Size?
Suppose we sampled more than two households in our dog-ownership study. The histograms below show sampling distributions for four different sample sizes: n=2, n=5, n=10, and n=25.
From the histograms, we see two effects of interest.
- As sample size increases, the histograms grow narrower, more closely concentrated around the population proportion of 0.4. This reflects greater precision of sample estimates with increased sample size.
- As sample size increases, the histograms become increasingly more bell-shaped, like a normal distribution (as illustrated below by the green normal curve superimposed over the last histogram from the series above).
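The narrowing described in the first point can be checked numerically. The following is a minimal sketch, again assuming a binomial model for the dog-ownership example with a population proportion of 0.4. The helper names pmf and sd_of_sample_proportion are hypothetical, chosen for this illustration.

```python
from math import comb, sqrt

P = 0.4  # population proportion from the dog-ownership example

def pmf(n, P):
    """Binomial probabilities for 0..n dog-owning families in the sample."""
    return [comb(n, k) * P**k * (1 - P)**(n - k) for k in range(n + 1)]

def sd_of_sample_proportion(n, P):
    """Standard deviation of the sample proportion, computed directly
    from the sampling distribution rather than from a formula."""
    probs = pmf(n, P)
    mean = sum((k / n) * pr for k, pr in enumerate(probs))
    var = sum(((k / n) - mean) ** 2 * pr for k, pr in enumerate(probs))
    return sqrt(var)

spreads = {n: sd_of_sample_proportion(n, P) for n in (2, 5, 10, 25)}
for n, sd in spreads.items():
    print(f"n={n:2d}: standard deviation of sample proportion = {sd:.3f}")
```

The spread shrinks steadily as n grows from 2 to 25, which is the numerical counterpart of the histograms growing narrower.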
The tendency of a sampling distribution to approximate a normal distribution has implications for statistical analysis. When the approximation is sufficiently close, we can use the normal distribution to test hypotheses about proportions and to express confidence intervals around proportions, things you will learn to do in future lessons. For now, let's just answer the question: Under what conditions can we safely assume the sampling distribution of a proportion will be approximately normal in shape?
When is Distribution Normal?
It is safe to assume that the shape of the sampling distribution for a proportion will be approximately normal when the following conditions are true:
- Population size (N) is at least 10 times sample size (n).
- The sampling method is simple random sampling.
- n * p ≥ 10, where p is the sample proportion.
- n * (1 - p) ≥ 10.
Note: When the sample proportion p equals 0.5, the last two conditions require that at least 20 observations be sampled from a population for the sampling distribution to be approximately normal. When the sample proportion p is farther from 0.5, more observations are required.
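The four conditions can be written as a small check. This is only a sketch of the rule of thumb stated above; the function approx_normal_ok and its parameters are hypothetical names, and passing N=None is our convention for an effectively infinite population.

```python
def approx_normal_ok(n, p, N=None, simple_random_sample=True):
    """Apply the four rule-of-thumb conditions listed above.
    N=None is treated as an effectively infinite population."""
    large_population = N is None or N >= 10 * n
    enough_successes = n * p >= 10
    enough_failures = n * (1 - p) >= 10
    return (large_population and simple_random_sample
            and enough_successes and enough_failures)

print(approx_normal_ok(n=20, p=0.5))             # 10 successes, 10 failures
print(approx_normal_ok(n=30, p=0.4, N=100_000))  # 12 successes, 18 failures
print(approx_normal_ok(n=20, p=0.4))             # only 8 expected successes
```

The first two calls satisfy every condition; the third fails because n * p falls below 10.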
Standard Deviation of the Sampling Distribution
In a population of size N, suppose that each element can be characterized as a "success" or a "failure". The proportion of successes in the population is P; and the proportion of failures is Q. From this population, suppose that we draw all possible simple random samples of size n. And finally, within each sample, suppose that we determine the proportion of successes p and failures q. In this way, we create a sampling distribution of the proportion.
The standard deviation of the sampling distribution (σp) is determined by the population proportion P, the population size N, and the sample size n, as shown below:
σp = sqrt[ PQ/n ] * sqrt[ (N - n ) / (N - 1) ]
where
Q = 1 - P
When the population size is very large relative to the sample size, the standard deviation formula can be approximated by:
σp = sqrt[ PQ/n ] = sqrt[ P*(1-P)/n ]
You often see this "approximate" formula in introductory statistics texts. As a general rule, it is safe to use the approximate formula when the sample size is no bigger than 1/20 of the population size.
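To see how little the finite population correction matters for a large population, the two formulas can be compared directly. The sketch below uses the dog-ownership numbers (P = 0.4, N = 100,000) with a sample size of 25; the function name sd_proportion is an assumption made for illustration.

```python
from math import sqrt

def sd_proportion(P, n, N=None):
    """Standard deviation of the sampling distribution of a proportion.
    If N is given, the finite population correction is applied;
    if N is None, the approximate (large-population) formula is used."""
    base = sqrt(P * (1 - P) / n)
    if N is None:
        return base
    return base * sqrt((N - n) / (N - 1))

exact = sd_proportion(0.4, 25, N=100_000)   # with the correction
approx = sd_proportion(0.4, 25)             # without the correction
small_pop = sd_proportion(0.4, 25, N=500)   # sample is exactly 1/20 of N

print(f"with correction (N=100,000): {exact:.5f}")
print(f"approximate formula:         {approx:.5f}")
print(f"with correction (N=500):     {small_pop:.5f}")
```

With N = 100,000 the two results agree to about four decimal places. At the 1/20 boundary (N = 500) the correction shrinks the standard deviation by a few percent.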
Standard Error of the Sampling Distribution
Typically, we don't know the value for population parameter P. And, if we don't know P, we cannot compute the standard deviation of the sampling distribution (σp).
However, we do know the sample proportions p and q. Substituting p and q into the equation for σp, we get:
SEp = sqrt[ pq/n ] * sqrt[ (N - n ) / (N - 1) ]
where
q = 1 - p
In this equation, p is the sample estimate of P, q is the sample estimate of Q, and SEp is the standard error of the sampling distribution of the proportion. The standard error (SEp) is a sample estimate of the standard deviation (σp) of the sampling distribution of a proportion.
And when the population size is very large relative to the sample size, the standard error formula can be approximated by:
SEp = sqrt[ pq/n ] = sqrt[ p*(1-p)/n ]
In future lessons, you will see that being able to compute the standard error from sample data is essential for inferential statistics. It will allow us to compute confidence intervals for proportions and to test hypotheses about proportions.
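As a brief illustration of the standard error formulas above, suppose a hypothetical sample of 30 families in which 12 own dogs. These counts are invented for the example and do not come from the lesson; the function name se_proportion is likewise an assumption.

```python
from math import sqrt

def se_proportion(p, n, N=None):
    """Standard error of a sample proportion p from a sample of size n.
    If N is given, the finite population correction is applied."""
    se = sqrt(p * (1 - p) / n)
    if N is not None:
        se *= sqrt((N - n) / (N - 1))
    return se

# Hypothetical sample: 12 dog-owning families out of 30 surveyed
successes, n = 12, 30
p = successes / n
se = se_proportion(p, n)
print(f"sample proportion = {p:.2f}, standard error = {se:.4f}")
```

Unlike the standard deviation of the sampling distribution, this calculation uses only values observed in the sample.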
Summary of Key Points
The key takeaways from this lesson are summarized below.
- The probability distribution of a proportion is called the sampling distribution of a proportion.
- The sampling distribution of a proportion (or any discrete variable) is typically represented by a table or a histogram.
- The sampling distribution for a sample proportion will be approximately normal when:
- Population size (N) is at least 10 times sample size (n).
- The sampling method is simple random sampling.
- n * p ≥ 10, where p is the sample proportion.
- n * (1 - p) ≥ 10.
- If population size is large relative to sample size, the standard error of the sampling distribution can be computed from the following formula:
SEp = sqrt[ pq/n ] = sqrt[ p*(1-p)/n ]
A population is considered "large" if it is at least 20 times bigger than its sample.
Test Your Understanding
In this section, we work through an example to illustrate how sampling distributions are used to solve common statistical problems. In this problem, the population proportion is known; and the sample size is large. So you can use the Normal Distribution Calculator to compute probabilities.
Normal Distribution Calculator
The normal calculator solves common statistical problems, based on the normal distribution. The calculator computes cumulative probabilities, based on three simple inputs. Simple instructions guide you to an accurate solution, quickly and easily. If anything is unclear, frequently-asked questions and sample problems provide straightforward explanations. The calculator is free. It can be found in the Stat Trek main menu under the Stat Tools tab.
Example 1
Suppose it were possible to take a simple random sample of 120 newborns. Find the probability that no more than 40% will be boys. Assume equal probabilities for the births of boys and girls. Assume also that the number of births in the population (N) is very large, essentially infinite.
Solution:
This problem satisfies the conditions that allow us to assume the sampling distribution is approximately normal.
- Population size (N = ∞) is at least 10 times sample size (n = 120).
- The sampling method is simple random sampling.
- n * p = 120 * 0.5 = 60 ≥ 10, where p is the proportion of boys.
- n * (1 - p) = 120 * 0.5 = 60 ≥ 10.
The mean of the sampling distribution will equal the mean of the population distribution. In the population, half of the births result in boys; and half, in girls. Therefore, the probability of boy births in the population is 0.50. Thus, the mean proportion in the sampling distribution should also be 0.50.
The standard deviation of the sampling distribution can be computed using the following formula.
σp = sqrt[ PQ/n ] * sqrt[ (N - n ) / (N - 1) ]
Here, the finite population correction is equal to 1.0, since the population size (N) was assumed to be infinite. Therefore, the standard deviation formula reduces to:
σp = sqrt[ PQ/n ] = sqrt[ P*(1-P)/n ]
σp = sqrt[ (0.5)(0.5)/120 ] = sqrt[0.25/120 ] = 0.04564
Let's review what we know and what we want to know. We know that the sampling distribution of the proportion is approximately normally distributed with a mean of 0.50 and a standard deviation of 0.04564. We want to know the probability that no more than 40% of the sampled births are boys.
Because the sampling distribution is approximately normal, we'll use the normal distribution to find the probability that no more than 40% of sampled births are boys. To find the probability, we plug these inputs into the Normal Distribution Calculator: mean of sampling distribution = 0.5, standard deviation of sampling distribution = 0.04564, and the raw score (i.e., the sample proportion) = 0.4.
The Calculator tells us that the probability that no more than 40% of the sampled births are boys is equal to 0.01422.
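The same calculation can be done without the calculator. The sketch below is one way to do it using only Python's standard library; the helper normal_cdf is our own name, built on the error function math.erf.

```python
from math import erf, sqrt

def normal_cdf(x, mean, sd):
    """Cumulative probability P(X <= x) for a normal distribution."""
    return 0.5 * (1 + erf((x - mean) / (sd * sqrt(2))))

P, n = 0.5, 120                  # proportion of boys, sample size
sd = sqrt(P * (1 - P) / n)       # population assumed infinite, so no correction
prob = normal_cdf(0.40, mean=P, sd=sd)

print(f"standard deviation = {sd:.5f}")
print(f"P(sample proportion <= 0.40) = {prob:.4f}")
```

The result is about 0.0142, in line with the calculator's value; any difference in the last digit comes from rounding the standard deviation before using it.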
Note: This problem can also be treated as a binomial experiment. In a previous lesson, we explained how to analyze a binomial experiment, and we showed how to solve this problem when it is treated as a binomial experiment. The binomial experiment is actually the more exact analysis. When this problem is treated as a binomial experiment, we find a probability of 0.01766 (versus a probability of 0.01422 that we found using the normal distribution).
The use of the normal distribution to estimate binomial probabilities is called the normal approximation to the binomial distribution. The normal approximation to the binomial distribution was used more in the 20th century, before binomial calculators were widely available, than it is used today. It is still a topic in the AP Statistics curriculum, so we include it in this tutorial.
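The exact binomial figure quoted in the note above can be checked with a direct sum. This is a minimal sketch using only the standard library; it counts the outcomes with at most 48 boys (40% of 120) among the equally likely boy/girl sequences. The variable names are ours.

```python
from math import comb

n = 120                      # sample size
max_boys = 40 * n // 100     # "no more than 40%" means at most 48 boys

# With equal probabilities for boys and girls, each of the 2**n
# boy/girl sequences is equally likely, so we count favorable ones.
favorable = sum(comb(n, k) for k in range(max_boys + 1))
prob = favorable / 2**n

print(f"P(at most {max_boys} boys out of {n}) = {prob:.5f}")
```

The result should agree closely with the binomial value given in the note and is somewhat larger than the normal approximation.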