How to Analyze Data from Cluster Samples
In this lesson, we describe how to analyze survey data when the sampling
method is cluster sampling.
Notation
The following notation is helpful when we talk about analyzing data from
cluster samples.
- $N$ = The number of clusters in the population.
- $M_i$ = The number of observations in the $i$th cluster.
- $X_i$ = The population mean for the $i$th cluster.
- $M$ = The total number of observations in the population = $\sum M_i$.
- $P$ = The population proportion.
- $P_i$ = The population proportion for the $i$th cluster.
- $n$ = The number of clusters in the sample.
- $m_i$ = The number of sample observations from the $i$th cluster.
- $x_{ij}$ = The measurement for the $j$th observation from the $i$th cluster.
- $x_i$ = The sample estimate of the population mean for the $i$th cluster = $\sum_j x_{ij} / m_i$.
- $p$ = The sample estimate of the population proportion.
- $p_i$ = The sample estimate of the population proportion for the $i$th cluster.
- $s_i^2$ = The sample estimate of the population variance within cluster $i$.
- $t_i$ = The estimated total for the $i$th cluster = $\sum_j ( M_i / m_i ) x_{ij} = M_i x_i$.
- $t_{mean}$ = The sample estimate of the population total = $( N / n ) \sum t_i$.
- $t_{prop(i)}$ = The sample estimate of the number of successes in cluster $i$ = $M_i p_i$.
- $t_{prop}$ = The sample estimate of the number of successes in the population = $( N / n ) \sum t_{prop(i)}$.
- SE = The standard error. (This is an estimate of the standard deviation of the sampling distribution.)
- $\Sigma$ = Summation symbol, used to compute sums over the sample. (To illustrate its use, $\sum x_i = x_1 + x_2 + x_3 + \dots + x_{m-1} + x_m$.)
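To make the notation for totals concrete, here is a minimal Python sketch. The function names and the tiny data set are hypothetical and chosen only for illustration; the formulas are the definitions of the estimated cluster total and the estimated population total given in the list above.

```python
def cluster_total(M_i, values):
    """Estimated total for one cluster: t_i = M_i * x_i,
    where x_i is the mean of the sampled values from that cluster."""
    m_i = len(values)            # number of sample observations in the cluster
    x_i = sum(values) / m_i      # sample estimate of the cluster mean
    return M_i * x_i


def population_total(N, cluster_totals):
    """Estimated population total: t_mean = (N / n) * sum of t_i."""
    n = len(cluster_totals)      # number of clusters in the sample
    return (N / n) * sum(cluster_totals)


# Hypothetical data: N = 10 clusters in the population, n = 2 sampled.
t1 = cluster_total(M_i=5, values=[2.0, 4.0])         # 5 * 3.0 = 15.0
t2 = cluster_total(M_i=4, values=[1.0, 3.0, 5.0])    # 4 * 3.0 = 12.0
print(population_total(N=10, cluster_totals=[t1, t2]))  # (10 / 2) * 27.0 = 135.0
```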
How to Analyze Data From Cluster Samples
Different sampling methods
use different formulas to estimate population
parameters and to estimate
standard errors. The formulas that we have used so far in this tutorial
work for simple random samples and for stratified samples, but they are not
right for cluster samples.
The next two sections of this lesson show the correct formulas to use with
cluster samples. With these formulas, you can readily estimate population
parameters and standard errors. And once you have the standard error, the
procedures for computing other things (e.g.,
margin of error,
confidence interval, and
region of acceptance) are largely the same for cluster samples as for
simple random samples. The sample problem at the end of this lesson shows
how to use these formulas to analyze data from cluster samples.
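As a rough illustration of that last point, the sketch below turns an estimate and its standard error into a margin of error and a confidence interval. It assumes a normal (z-score) critical value; the function name `confidence_interval` is hypothetical and not part of this tutorial.

```python
from statistics import NormalDist


def confidence_interval(estimate, standard_error, confidence=0.95):
    """Return (margin_of_error, (lower, upper)) using a z-score critical value.

    Assumption: the sampling distribution is approximately normal.
    """
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * standard_error
    return margin, (estimate - margin, estimate + margin)


# Hypothetical numbers, for illustration only.
margin, (lower, upper) = confidence_interval(estimate=50.0, standard_error=2.0)
print(f"margin of error = {margin:.2f}, interval = ({lower:.2f}, {upper:.2f})")
```

The same function applies whatever the sampling method; only the way the standard error is computed changes.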
Measures of Central Tendency
The table below shows formulas that can be used with
one-stage and two-stage
cluster samples to estimate a population mean and a population proportion.
| Population parameter | Sample estimate: One-stage | Sample estimate: Two-stage |
|---|---|---|
| Mean | $\frac{N}{n M} \sum ( M_i X_i )$ | $\frac{N}{n M} \sum ( M_i x_i )$ |
| Proportion | $\frac{N}{n M} \sum ( M_i P_i )$ | $\frac{N}{n M} \sum ( M_i p_i )$ |
These formulas produce unbiased
estimates of the population parameters.
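The estimators in the table share one form, a size-weighted sum of cluster values scaled by N / ( n * M ). The following Python sketch implements that form under the notation of this lesson; the function name and the sample numbers are hypothetical.

```python
def cluster_estimate(N, M, cluster_sizes, cluster_values):
    """Estimate a population mean or proportion from a cluster sample.

    N              -- number of clusters in the population
    M              -- total number of observations in the population
    cluster_sizes  -- M_i for each sampled cluster
    cluster_values -- cluster means (X_i or x_i) or proportions (P_i or p_i)
    """
    n = len(cluster_values)
    weighted_sum = sum(M_i * v for M_i, v in zip(cluster_sizes, cluster_values))
    # Algebraically the same as [ N / ( n * M ) ] * sum( M_i * value_i )
    return N * weighted_sum / (n * M)


# Hypothetical population: N = 4 clusters of 10 observations each (M = 40).
# Two clusters are sampled, with cluster means 3.0 and 5.0.
print(cluster_estimate(N=4, M=40, cluster_sizes=[10, 10], cluster_values=[3.0, 5.0]))
# 4 * (30.0 + 50.0) / (2 * 40) = 4.0
```

Passing cluster proportions instead of cluster means gives the proportion estimate from the second row of the table.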
The Variability of the Estimate
The precision of a
sample design is directly related to the variability of the
estimate, which is measured by the
standard error. The tables below show how to compute
the standard error (SE),
when the sampling method is cluster sampling.
The first table shows how to compute the
standard error for a mean score, given one- or two-stage sampling.
| Number of stages | Standard error of mean score |
|---|---|
| One | $\frac{1}{M} \sqrt{ \frac{N^2 ( 1 - n/N )}{n} \cdot \frac{\sum ( M_i x_i - t_{mean} / N )^2}{n - 1} }$ |
| Two | $\frac{1}{M} \sqrt{ \frac{N^2 ( 1 - n/N )}{n} \cdot \frac{\sum ( M_i x_i - t_{mean} / N )^2}{n - 1} + \frac{N}{n} \sum \left[ ( 1 - m_i / M_i ) \frac{M_i^2 s_i^2}{m_i} \right] }$ |
The next table shows how to compute the standard error for a proportion. Like
the previous table, this table shows equations for one- and two-stage designs.
It also shows how the equations differ when the true population proportions are
known versus when they are estimated based on sample data.
| Number of stages | Population proportion | Standard error of proportion |
|---|---|---|
| One | Known | $\frac{1}{M} \sqrt{ \frac{N^2 ( 1 - n/N )}{n} \cdot \frac{\sum ( M_i P_i - t_{prop} / N )^2}{n - 1} }$ |
| One | Estimated | $\frac{1}{M} \sqrt{ \frac{N^2 ( 1 - n/N )}{n} \cdot \frac{\sum ( M_i p_i - t_{prop} / N )^2}{n - 1} }$ |
| Two | Known | $\frac{1}{M} \sqrt{ \frac{N^2 ( 1 - n/N )}{n} \cdot \frac{\sum ( M_i P_i - t_{prop} / N )^2}{n - 1} + \frac{N}{n} \sum \left[ ( 1 - m_i / M_i ) \frac{M_i^2 P_i ( 1 - P_i )}{m_i} \right] }$ |
| Two | Estimated | $\frac{1}{M} \sqrt{ \frac{N^2 ( 1 - n/N )}{n} \cdot \frac{\sum ( M_i p_i - t_{prop} / N )^2}{n - 1} + \frac{N}{n} \sum \left[ ( 1 - m_i / M_i ) \frac{M_i^2 p_i ( 1 - p_i )}{m_i - 1} \right] }$ |
Sample Problem
This section presents a sample problem that illustrates how to analyze survey
data when the sampling method is one-stage cluster sampling. (In a
subsequent lesson, we re-visit this problem and see how cluster
sampling compares to other sampling methods.)
Example 1
At the end of every school year, the state administers a reading test to a
sample of third graders. The school system has 20,000 third graders, grouped in
1,000 separate classes. Assume that each class has 20 students. This year, the
test was administered to each student in 36 randomly sampled classes. Thus,
this is one-stage cluster sampling, with classes serving as clusters. The
average test score from each sampled cluster ($X_i$) is shown below:
55, 60, 65, 67, 67, 70, 70, 70, 72, 72, 72, 72, 73, 73, 75, 75, 75, 75,
75, 77, 77, 78, 78, 78, 78, 80, 80, 80, 80, 80, 80, 83, 83, 85, 85, 85
Using sample data, estimate the mean reading achievement level in the
population. Find the margin
of error and the
confidence interval. Assume a 95%
confidence level.
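Solution: Previously we described
how to compute the confidence interval for a mean score. Below,
we apply that process to the present cluster sampling problem.
- Estimate the population mean. Every class has 20 students, so the one-stage
estimate reduces to the average of the 36 cluster means: 2700 / 36 = 75.
- Compute the standard error. With N = 1000, n = 36, M = 20,000, and every
cluster size equal to 20, the one-stage formula gives a standard error of about 1.10.
- Find the critical value. For a 95% confidence level, the critical value is a
z-score of 1.96.
- Compute the margin of error: 1.96 * 1.10 = 2.16.
Therefore, the 95% confidence interval is 72.84 to 77.16, and the margin
of error is equal to 2.16. That is, we are 95%
confident that the true population mean is in the range
defined by 75 ± 2.16.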
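The figures in this solution can be reproduced with a short Python script. This is a sketch that applies the one-stage formulas from this lesson to the 36 cluster means of Example 1; it assumes a z-score critical value, and the variable names are chosen here for illustration.

```python
import math
from statistics import NormalDist

# Average test score for each of the 36 sampled classes (from Example 1).
scores = [55, 60, 65, 67, 67, 70, 70, 70, 72, 72, 72, 72, 73, 73, 75, 75, 75, 75,
          75, 77, 77, 78, 78, 78, 78, 80, 80, 80, 80, 80, 80, 83, 83, 85, 85, 85]

N = 1000            # classes (clusters) in the population
n = len(scores)     # classes in the sample
M_i = 20            # students per class (assumed equal for every class)
M = N * M_i         # students in the population

# Cluster totals and the one-stage estimate of the population mean.
totals = [M_i * x for x in scores]
estimate = N * sum(totals) / (n * M)

# One-stage standard error of the mean.
t_bar = sum(totals) / n  # equals t_mean / N
between = (N**2 * (1 - n / N) / n) * sum((t - t_bar) ** 2 for t in totals) / (n - 1)
se = math.sqrt(between) / M

# Margin of error and 95% confidence interval (z-score critical value).
z = NormalDist().inv_cdf(0.975)
margin = z * se
lower, upper = estimate - margin, estimate + margin

print(f"estimate = {estimate:.2f}, SE = {se:.2f}, margin of error = {margin:.2f}")
print(f"95% confidence interval: {lower:.2f} to {upper:.2f}")
# estimate = 75.00, SE = 1.10, margin of error = 2.16
# 95% confidence interval: 72.84 to 77.16
```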