How to Estimate a Mean or Proportion from a Cluster Sample
This lesson describes how to estimate a population mean or proportion, given survey data from a cluster sample. A good
analysis should provide two outputs: a point estimate of the population parameter and a measure of the uncertainty in that estimate.
First, we describe how to conduct a good analysis step by step. Then, we illustrate the analysis
with a sample problem.
How to Analyze Survey Data
A good analysis of survey data from a cluster sample includes seven steps:
- Estimate a population parameter.
- Compute sample variance within each cluster (for two-stage cluster sampling).
- Compute standard error.
- Specify a confidence level.
- Find the critical value (often a z-score or a t-score).
- Compute margin of error.
- Define confidence interval.
Let's look more closely at each step: what we do and why we do it. When you understand what is really going on,
it will be easier for you to apply formulas correctly and to interpret analytical findings.
Note: The formulas presented below are only appropriate for cluster sampling.
Estimating a Population Mean or Proportion
The first step in the analysis is to develop a point estimate
for the population mean or proportion. The sample mean and sample proportion are good point estimates. Use this formula to compute the sample mean:
$$\bar{x} = \frac{N}{n M} \sum_{h=1}^{n} M_h \bar{x}_h$$
where $\bar{x}$ is the sample mean, $N$ is the number of clusters in the population,
$n$ is the number of clusters in the sample,
$M$ is the number of observations in the population,
$M_h$ is the number of observations in cluster $h$,
and $\bar{x}_h$ is the mean score from the sample in cluster $h$.
Use this formula to compute the sample proportion:
$$p = \frac{N}{n M} \sum_{h=1}^{n} M_h p_h$$
where $p$ is the sample proportion, $N$ is the number of clusters in the population,
$n$ is the number of clusters in the sample,
$M$ is the number of observations in the population,
$M_h$ is the number of observations in cluster $h$,
and $p_h$ is the proportion from the sample in cluster $h$.
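Because the mean and proportion formulas share the same form, one small function can cover both. The following is a minimal Python sketch; the function name `cluster_estimate` and the three-cluster data set are hypothetical, chosen only for illustration.

```python
def cluster_estimate(N, M, cluster_sizes, cluster_values):
    """Point estimate of a population mean or proportion from a cluster sample.

    N: number of clusters in the population
    M: number of observations in the population
    cluster_sizes: M_h for each sampled cluster
    cluster_values: sample mean (or sample proportion) for each sampled cluster
    """
    n = len(cluster_sizes)  # number of clusters in the sample
    weighted_sum = sum(M_h * v_h for M_h, v_h in zip(cluster_sizes, cluster_values))
    return N / (n * M) * weighted_sum


# Hypothetical data: 3 clusters sampled from a population of
# 10 clusters holding 200 observations in total.
sizes = [20, 20, 20]
means = [70.0, 75.0, 80.0]
estimate = cluster_estimate(N=10, M=200, cluster_sizes=sizes, cluster_values=means)
print(estimate)
```

When every cluster has the same size, as in this hypothetical data, the estimate reduces to the simple average of the cluster means.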
Because different samples can produce different point estimates, you can be fairly sure that the estimate from your sample does
not equal the true value of the population parameter exactly.
Therefore, you need a way to express the uncertainty inherent in your estimate. The remaining six steps in the analysis are
geared toward quantifying the uncertainty in your estimate.
Computing Variance Within Clusters
If you are using one-stage cluster sampling or if
you are using cluster sampling to estimate a population proportion,
you can skip this step. But if you are using
two-stage cluster sampling to estimate a population mean, you will need
to compute the variance within each sampled cluster.
For a mean score, the variance within each cluster can be estimated from a sample as:
$$s_h^2 = \frac{\sum_{i=1}^{m_h} ( x_{ih} - \bar{x}_h )^2}{m_h - 1}$$
where $s_h^2$ is a sample estimate of population variance in cluster $h$,
$x_{ih}$ is the value of the $i$th element from cluster $h$,
$\bar{x}_h$ is the sample mean from cluster $h$,
and $m_h$ is the number of observations sampled from cluster $h$.
You don't really need to compute the variance within each cluster when you are working with proportions. But, in case anyone is interested,
here is the formula for computing cluster variance with proportions:
$$s_h^2 = \frac{m_h}{m_h - 1} \, p_h ( 1 - p_h )$$
where $s_h^2$ is a sample estimate of the variance within cluster $h$,
$m_h$ is the number of observations sampled from cluster $h$,
and $p_h$ is a sample estimate of the proportion in cluster $h$.
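The two variance formulas above can be sketched in Python as follows. The function names and the sample values are hypothetical. The last lines illustrate that, for data coded as 0 and 1, the proportion formula gives the same result as the general formula.

```python
def within_cluster_variance(values):
    """Sample variance of the observations drawn from one cluster."""
    m_h = len(values)
    mean_h = sum(values) / m_h
    return sum((x - mean_h) ** 2 for x in values) / (m_h - 1)


def within_cluster_variance_proportion(p_h, m_h):
    """Within-cluster variance computed from a sample proportion."""
    return m_h / (m_h - 1) * p_h * (1 - p_h)


# Hypothetical scores sampled from one cluster
var_scores = within_cluster_variance([70, 75, 80])

# Hypothetical yes/no responses (1 = yes, 0 = no) sampled from one cluster
responses = [1, 1, 0, 0, 1]
p_h = sum(responses) / len(responses)
var_general = within_cluster_variance(responses)
var_proportion = within_cluster_variance_proportion(p_h, len(responses))
print(var_scores, var_general, var_proportion)
```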
Computing Standard Error: Mean
The standard error is possibly the most important
output from our analysis. It allows us to compute the
margin of error and the
confidence interval.
When we estimate a population mean from a cluster sample, the standard error (SE) of the estimate is:
$$SE = \frac{1}{M} \sqrt{ \frac{N^2 \left( 1 - \frac{n}{N} \right)}{n} \cdot \frac{\sum_{h=1}^{n} \left( M_h \bar{x}_h - \frac{t}{N} \right)^2}{n - 1} + \frac{N}{n} \sum_{h=1}^{n} \left( 1 - \frac{m_h}{M_h} \right) \frac{M_h^2 \, s_h^2}{m_h} }$$
where $M$ is the number of observations in the population,
$N$ is the number of clusters in the population,
$n$ is the number of clusters in the sample,
$M_h$ is the number of elements from cluster $h$ in the population,
$m_h$ is the number of elements from cluster $h$ in the sample,
$\bar{x}_h$ is the sample mean from cluster $h$,
$s_h^2$ is a sample estimate of the population variance in cluster $h$,
and $t$ is a sample estimate of the population total.
For the equation above, use the following formula to estimate the population total.
$$t = \frac{N}{n} \sum_{h=1}^{n} M_h \bar{x}_h$$
With one-stage cluster sampling, the formula for the standard error reduces to:
$$SE = \frac{1}{M} \sqrt{ \frac{N^2 \left( 1 - \frac{n}{N} \right)}{n} \cdot \frac{\sum_{h=1}^{n} \left( M_h \bar{x}_h - \frac{t}{N} \right)^2}{n - 1} }$$
Think of the standard error as the standard deviation
of a sample statistic.
In survey sampling, there are usually many different subsets of the population that we might choose for analysis. Each different
sample might produce a different estimate of the value of a population parameter. The standard error provides a quantitative
measure of the variability of those estimates.
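The standard error formula for a mean can be sketched in Python as shown here. The function name `se_cluster_mean` and all of the input values are hypothetical. Passing the full cluster sizes as the sampled sizes makes each within-cluster term zero, which reproduces the one-stage formula.

```python
import math


def se_cluster_mean(N, M, sizes, sampled, means, variances):
    """Standard error of the estimated mean under two-stage cluster sampling.

    sizes:     M_h, elements in each sampled cluster (in the population)
    sampled:   m_h, elements observed in each sampled cluster
    means:     sample mean of each sampled cluster
    variances: s_h^2, within-cluster sample variance of each sampled cluster
    """
    n = len(sizes)
    # Estimated population total t
    t = N / n * sum(M_h * x_h for M_h, x_h in zip(sizes, means))
    # Between-cluster component
    between = (N**2 * (1 - n / N) / n) * sum(
        (M_h * x_h - t / N) ** 2 for M_h, x_h in zip(sizes, means)
    ) / (n - 1)
    # Within-cluster component; each term is zero when m_h == M_h (one-stage)
    within = (N / n) * sum(
        (1 - m_h / M_h) * M_h**2 * s2_h / m_h
        for M_h, m_h, s2_h in zip(sizes, sampled, variances)
    )
    return math.sqrt(between + within) / M


# Hypothetical data: 3 of 10 clusters sampled, 200 observations in the population
sizes = [20, 20, 20]
means = [70.0, 75.0, 80.0]
variances = [25.0, 16.0, 36.0]
se_two_stage = se_cluster_mean(10, 200, sizes, [5, 5, 5], means, variances)
se_one_stage = se_cluster_mean(10, 200, sizes, sizes, means, variances)
print(se_two_stage, se_one_stage)
```

With these inputs the two-stage standard error is larger than the one-stage value, because subsampling within clusters adds a second source of variability.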
Computing Standard Error: Proportion
When we estimate a population proportion from a cluster sample, the standard error (SE) of the estimate is:
$$SE = \frac{1}{M} \sqrt{ \frac{N^2 \left( 1 - \frac{n}{N} \right)}{n} \cdot \frac{\sum_{h=1}^{n} \left( M_h p_h - \frac{t}{N} \right)^2}{n - 1} + \frac{N}{n} \sum_{h=1}^{n} \left( 1 - \frac{m_h}{M_h} \right) \frac{M_h^2 \, p_h ( 1 - p_h )}{m_h - 1} }$$
where $M$ is the number of observations in the population,
$N$ is the number of clusters in the population,
$n$ is the number of clusters in the sample,
$M_h$ is the number of elements from cluster $h$ in the population,
$m_h$ is the number of elements from cluster $h$ in the sample,
$p_h$ is the value of the proportion from cluster $h$,
and $t$ is a sample estimate of the population total.
For the equation above, use the following formula to estimate the population total.
$$t = \frac{N}{n} \sum_{h=1}^{n} M_h p_h$$
With one-stage cluster sampling, the formula for the standard error reduces to:
$$SE = \frac{1}{M} \sqrt{ \frac{N^2 \left( 1 - \frac{n}{N} \right)}{n} \cdot \frac{\sum_{h=1}^{n} \left( M_h p_h - \frac{t}{N} \right)^2}{n - 1} }$$
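A Python sketch of the standard error for a proportion follows the same pattern as the one for a mean. The function name `se_cluster_proportion` and the input values are hypothetical, used only to show how the terms fit together.

```python
import math


def se_cluster_proportion(N, M, sizes, sampled, props):
    """Standard error of the estimated proportion under two-stage cluster sampling.

    sizes:   M_h, elements in each sampled cluster (in the population)
    sampled: m_h, elements observed in each sampled cluster
    props:   sample proportion in each sampled cluster
    """
    n = len(sizes)
    # Estimated population total t
    t = N / n * sum(M_h * p_h for M_h, p_h in zip(sizes, props))
    # Between-cluster component
    between = (N**2 * (1 - n / N) / n) * sum(
        (M_h * p_h - t / N) ** 2 for M_h, p_h in zip(sizes, props)
    ) / (n - 1)
    # Within-cluster component; each term is zero when m_h == M_h (one-stage)
    within = (N / n) * sum(
        (1 - m_h / M_h) * M_h**2 * p_h * (1 - p_h) / (m_h - 1)
        for M_h, m_h, p_h in zip(sizes, sampled, props)
    )
    return math.sqrt(between + within) / M


# Hypothetical data: 3 of 10 clusters sampled, 200 observations in the population
sizes = [20, 20, 20]
props = [0.4, 0.5, 0.6]
se_two_stage = se_cluster_proportion(10, 200, sizes, [10, 10, 10], props)
se_one_stage = se_cluster_proportion(10, 200, sizes, sizes, props)
print(se_two_stage, se_one_stage)
```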
Specifying Confidence Level
In survey sampling, different samples can be randomly selected from the same population,
and each sample can produce a different confidence interval.
Some confidence intervals include the true population parameter; others do not.
A confidence level refers to the percentage of all possible samples that produce confidence intervals that include the true population parameter.
For example, suppose all possible samples were selected from the same population, and a confidence interval were computed for each sample.
A 95% confidence level implies that 95% of the confidence intervals would include the true population parameter.
As part of the analysis, survey researchers choose a confidence level. The most frequently chosen confidence level is probably 95%.
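The meaning of a confidence level can be illustrated with a small simulation. The sketch below is a simplification and rests on assumptions that are not part of this lesson: it draws simple random samples from a hypothetical normal population rather than cluster samples, and it uses an arbitrary true mean, standard deviation, sample size, and random seed. It counts how often a 95% interval contains the true mean.

```python
import random
import statistics

rng = random.Random(42)          # fixed seed so the run is repeatable
TRUE_MEAN = 75.0                 # hypothetical population mean
z = statistics.NormalDist().inv_cdf(0.975)  # critical value for 95% confidence


def interval_covers(sample_size):
    """Draw one sample and report whether its 95% interval contains TRUE_MEAN."""
    sample = [rng.gauss(TRUE_MEAN, 8.0) for _ in range(sample_size)]
    mean = statistics.fmean(sample)
    se = statistics.stdev(sample) / sample_size ** 0.5
    return mean - z * se <= TRUE_MEAN <= mean + z * se


trials = 2000
coverage = sum(interval_covers(50) for _ in range(trials)) / trials
print(f"Share of intervals covering the true mean: {coverage:.3f}")
```

The printed share should land close to 0.95, though it will vary somewhat with the seed and the number of trials.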
Finding Critical Value
Often expressed as a t-score or a
z-score, the critical value is a factor used to compute the margin of error.
To find the critical value, follow these steps:
- Compute alpha (α): α = 1 - (confidence level / 100)
- Find the critical probability (p*): p* = 1 - α/2
- To express the critical value as a z-score, find the z-score having a
cumulative probability
equal to the critical probability (p*).
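- To express the critical value as a t-score, find the t statistic having a
cumulative probability equal to the critical probability (p*), given the degrees of freedom for your sample.
Researchers typically use a t-score when the sample size is small and a z-score when it is large (at least 30).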
You can use the Normal Distribution Calculator to find the critical z-score, and the
t Distribution Calculator to find the critical t statistic.
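As an alternative to a calculator, the z-score steps above can be sketched in Python using the standard library's `statistics.NormalDist`. The function name `critical_z` is hypothetical. The standard library has no t distribution, so a critical t-score would need another tool (SciPy's `scipy.stats.t.ppf` is one option).

```python
from statistics import NormalDist


def critical_z(confidence_level):
    """Critical z-score for a confidence level given as a percentage."""
    alpha = 1 - confidence_level / 100   # step 1: compute alpha
    p_star = 1 - alpha / 2               # step 2: critical probability
    # step 3: z-score whose cumulative probability equals p*
    return NormalDist().inv_cdf(p_star)


print(round(critical_z(95), 2))  # 1.96
```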
Computing Margin of Error
The margin of error
expresses the maximum expected difference, at the chosen confidence level, between the true population parameter and a sample estimate of that parameter.
Here is the formula for computing margin of error (ME):
ME = SE * CV
where SE is standard error, and CV is the critical value.
Defining Confidence Interval
Statisticians use a confidence interval to express the degree of uncertainty associated with a sample statistic.
A confidence interval is an interval estimate combined with a probability statement.
Here is how to compute the minimum and maximum values for a confidence interval.
| Mean | Proportion |
| --- | --- |
| CImin = x - SE * CV | CImin = p - SE * CV |
| CImax = x + SE * CV | CImax = p + SE * CV |
In the table above, x is the sample estimate of the population mean, p is the sample estimate of the population proportion,
SE is the standard error, and CV is the critical value (either a z-score or a t-score). The
confidence interval is the interval estimate that ranges from CImin to CImax.
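The margin of error and confidence interval formulas translate directly into code. Here is a minimal Python sketch; the function names and the input values (an estimated proportion of 0.40 with a standard error of 0.05) are hypothetical.

```python
def margin_of_error(se, cv):
    """Margin of error: standard error times critical value."""
    return se * cv


def confidence_interval(estimate, se, cv):
    """Return (CImin, CImax) for a point estimate of a mean or proportion."""
    me = margin_of_error(se, cv)
    return estimate - me, estimate + me


# Hypothetical inputs: estimated proportion 0.40, SE 0.05, z-score 1.96
low, high = confidence_interval(estimate=0.40, se=0.05, cv=1.96)
print(low, high)
```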
Sample Problem
This section presents a sample problem that illustrates how to analyze survey
data when the sampling method is one-stage cluster sampling. (In a
subsequent lesson, we re-visit this problem and see how cluster
sampling compares to other sampling methods.)
Sample Size Calculator
The analysis of data collected via cluster sampling can be complex and
time-consuming. Stat Trek's Sample Size Calculator can help. The calculator computes
standard error, margin of error, and confidence intervals. It assesses sample size requirements, estimates
population parameters, and tests hypotheses. The calculator
is free. You can find the Sample Size Calculator in Stat Trek's
main menu under the Stat Tools tab.
Example 1
At the end of every school year, the state administers a reading test to a
sample of third graders. The school system has 20,000 third graders, grouped in
1000 separate classes. Assume that each class has 20 students. This year, the
test was administered to each student in 36 randomly-sampled classes. Thus,
this is one-stage cluster sampling, with classes serving as clusters. The
average test score from each sampled cluster ($\bar{x}_h$)
is shown below:
55, 60, 65, 67, 67, 70, 70, 70, 72, 72, 72, 72, 73, 73, 75, 75, 75, 75,
75, 77, 77, 78, 78, 78, 78, 80, 80, 80, 80, 80, 80, 83, 83, 85, 85, 85
Using sample data, estimate the mean reading achievement level in the
population. Find the margin
of error and the
confidence interval. Assume a 95%
confidence level.
Solution: To solve this problem, we follow the seven-step process described above.
- Estimate the population mean. To compute the overall sample mean, we use the following formula:
$$\bar{x} = \frac{N}{n M} \sum_{h=1}^{n} M_h \bar{x}_h$$
$$\bar{x} = \frac{1000}{36 \times 20{,}000} \sum_{h=1}^{36} 20 \, \bar{x}_h$$
$$\bar{x} = \frac{1}{36} \sum_{h=1}^{36} \bar{x}_h$$
$$\bar{x} = \frac{55 + 60 + 65 + \dots + 85 + 85 + 85}{36} = 75$$
Therefore, based on data from the cluster sample, we estimate that the mean
reading achievement level in the population is equal to 75.
- Compute sample variance within cluster. If our problem involved two-stage cluster sampling, we would need to compute sample
variance within each cluster. But since our problem uses one-stage cluster sampling, we don't need to compute variance within clusters.
- Compute standard error. Before we can compute the standard error, we first need to estimate the population total:
$$t = \frac{N}{n} \sum_{h=1}^{n} M_h \bar{x}_h$$
$$t = \frac{1000}{36} \times 20 \times \sum_{h=1}^{36} \bar{x}_h$$
$$t = 27.778 \times 20 \times ( 55 + 60 + \dots + 85 + 85 )$$
$$t = 1{,}500{,}000$$
Now that we know the population total, we can compute the standard error (SE):
$$SE = \frac{1}{M} \sqrt{ \frac{N^2 \left( 1 - \frac{n}{N} \right)}{n} \cdot \frac{\sum_{h=1}^{n} \left( M_h \bar{x}_h - \frac{t}{N} \right)^2}{n - 1} }$$
$$SE = \frac{1}{20{,}000} \sqrt{ \frac{1000^2 \left( 1 - \frac{36}{1000} \right)}{36} \cdot \frac{\sum_{h=1}^{36} \left( 20 \, \bar{x}_h - \frac{1{,}500{,}000}{1000} \right)^2}{35} }$$
$$SE = \frac{1}{20{,}000} \sqrt{ \frac{1000^2 \left( 1 - \frac{36}{1000} \right)}{36} \cdot \frac{ ( 20 \times 55 - 1500 )^2 + ( 20 \times 60 - 1500 )^2 + \dots + ( 20 \times 85 - 1500 )^2 + ( 20 \times 85 - 1500 )^2 }{35} }$$
$$SE = \frac{1}{20{,}000} \sqrt{ \frac{1000^2 \left( 1 - \frac{36}{1000} \right)}{36} \times 18{,}217.143 }$$
$$SE \approx 1.1$$
Thus, the standard error of the estimated mean is approximately 1.1.
- Select a confidence level. In this analysis, the confidence level is defined for us in the problem. We are working with a 95%
confidence level.
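- Find the critical value. The critical value is a factor used to compute the margin of error. Because the sample includes 36 clusters (at least 30), we express the critical value as a z-score. With α = 1 - 95/100 = 0.05, the critical probability is p* = 1 - 0.05/2 = 0.975, and the z-score having a cumulative probability of 0.975 is 1.96.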
- Compute the margin of error (ME):
ME = critical value * standard error
ME = 1.96 * 1.1 = 2.16
- Specify the confidence interval. The minimum and maximum values of the confidence interval are:
CImin = x - SE * CV = 75 - 1.1 * 1.96 = 72.84
CImax = x + SE * CV = 75 + 1.1 * 1.96 = 77.16
In summary, here are the results of our analysis. Based on sample data, we estimate that the population mean is 75.
Given a 95% confidence level, the margin of error around that estimate is 2.16; and the 95% confidence interval is 72.84 to 77.16.
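The whole analysis in Example 1 can be reproduced with a short Python script. This is a sketch that follows the one-stage formulas from this lesson; the variable names are our own. It keeps full precision throughout, whereas the worked solution rounds the standard error to 1.1 before computing the margin of error, so intermediate values differ slightly in later decimal places.

```python
import math
from statistics import NormalDist

# Class means from the 36 sampled classes in Example 1
scores = [55, 60, 65, 67, 67, 70, 70, 70, 72, 72, 72, 72, 73, 73, 75, 75, 75, 75,
          75, 77, 77, 78, 78, 78, 78, 80, 80, 80, 80, 80, 80, 83, 83, 85, 85, 85]

N = 1000          # classes (clusters) in the population
n = len(scores)   # classes in the sample
M = 20_000        # third graders in the population
M_h = 20          # students per class

# Step 1: point estimate of the population mean
x_bar = N / (n * M) * sum(M_h * x for x in scores)

# Step 3: estimated population total, then one-stage standard error
t = N / n * sum(M_h * x for x in scores)
between = (N**2 * (1 - n / N) / n) * sum(
    (M_h * x - t / N) ** 2 for x in scores
) / (n - 1)
se = math.sqrt(between) / M

# Steps 4-7: 95% confidence level, z-score, margin of error, interval
cv = NormalDist().inv_cdf(1 - 0.05 / 2)
me = cv * se
ci_min, ci_max = x_bar - me, x_bar + me

print(f"mean={x_bar:.2f} SE={se:.2f} ME={me:.2f} CI=({ci_min:.2f}, {ci_max:.2f})")
```

Rounded to two decimal places, the margin of error and the interval agree with the values reported in the solution.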