How to Analyze Data from Cluster Samples
In this lesson, we describe how to analyze survey data when the sampling
method is cluster sampling.
Notation
The following notation is helpful when we talk about analyzing data from
cluster samples.

N = The number of clusters in the population.

M_{i} = The number of observations in the ith cluster.

X_{i} = The population mean for the ith cluster.

M = The total number of observations in the population = Σ M_{i}.

P = The population proportion.

P_{i} = The population proportion for the ith cluster.

n = The number of clusters in the sample.

m_{i} = The number of sample observations from the ith cluster.

x_{ij} = The measurement for the jth observation from the ith cluster.

x_{i} = The sample estimate of the population mean for the ith cluster
= Σ ( x_{ij} / m_{i} ) summed over j.

p = The sample estimate of the population proportion.

p_{i} = The sample estimate of the population proportion for the ith
cluster.

s_{i}^{2} = The sample estimate of the population variance within cluster i.

t_{i} = The estimated total for the ith cluster
= Σ [ ( M_{i} / m_{i} ) * x_{ij} ] summed over j
= M_{i} * x_{i} .

t_{mean} = The sample estimate of the population total
= ( N / n ) * Σ t_{i} .

t_{prop(i)} = The sample estimate of the number of successes in cluster i
= M_{i} * p_{i} .

t_{prop} = The sample estimate of the number of successes in the population
= ( N / n ) * Σ t_{prop(i)} .

SE = The standard error. (This is an estimate of the standard deviation of the
sampling distribution.)

Σ = Summation symbol, used to compute sums over the sample. (To illustrate its
use, Σ x_{i} = x_{1} + x_{2} + x_{3} + ... + x_{m-1} + x_{m} )
How to Analyze Data From Cluster Samples
Different sampling methods
use different formulas to estimate population
parameters and to estimate
standard errors. The formulas that we have used so far in this tutorial
work for simple random samples and for stratified samples, but they are not
right for cluster samples.
The next two sections of this lesson show the correct formulas to use with
cluster samples. With these formulas, you can readily estimate population
parameters and standard errors. And once you have the standard error, the
procedures for computing other things (e.g.,
margin of error,
confidence interval, and
region of acceptance) are largely the same for cluster samples as for
simple random samples. The sample problem at the end of this lesson shows
how to use these formulas to analyze data from cluster samples.
Measures of Central Tendency
The table below shows formulas that can be used with one-stage and two-stage
cluster samples to estimate a population mean and a population proportion.

Population parameter | Sample estimate: One-stage | Sample estimate: Two-stage
Mean | [ N / ( n * M ) ] * Σ ( M_{i} * X_{i} ) | [ N / ( n * M ) ] * Σ ( M_{i} * x_{i} )
Proportion | [ N / ( n * M ) ] * Σ ( M_{i} * P_{i} ) | [ N / ( n * M ) ] * Σ ( M_{i} * p_{i} )

These formulas produce unbiased
estimates of the population parameters.
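
As a rough illustration of the estimator in the table above, the Python sketch below computes [ N / ( n * M ) ] * Σ ( M_{i} * x_{i} ). The function name, the argument names, and the small data set in the usage lines are assumptions made for this example; they do not come from the lesson.

```python
def cluster_estimate(N, M, cluster_sizes, cluster_values):
    """Estimate a population mean or proportion from n sampled clusters.

    N              -- number of clusters in the population
    M              -- number of observations in the population
    cluster_sizes  -- M_i for each sampled cluster
    cluster_values -- cluster means (X_i or x_i), or cluster
                      proportions (P_i or p_i), in the same order
    """
    n = len(cluster_sizes)
    if n == 0 or n != len(cluster_values):
        raise ValueError("need one value per sampled cluster")
    # Sum of M_i * x_i over the sampled clusters.
    weighted_sum = sum(M_i * v_i for M_i, v_i in zip(cluster_sizes, cluster_values))
    return (N / (n * M)) * weighted_sum


# Hypothetical data: 10 clusters of 10 observations each, 2 clusters sampled.
estimate = cluster_estimate(N=10, M=100, cluster_sizes=[10, 10],
                            cluster_values=[4.0, 6.0])
print(estimate)
```

The same function serves both rows of the table, because the mean and proportion formulas differ only in the per-cluster value that is weighted by M_{i}.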
The Variability of the Estimate
The precision of a
sample design is directly related to the variability of the
estimate, which is measured by the
standard error. The tables below show how to compute
the standard error (SE),
when the sampling method is cluster sampling.
The first table shows how to compute the
standard error for a mean score, given one-stage or two-stage sampling.

Number of stages | Standard error of mean score
One | ( 1 / M ) * sqrt { [ N^{2} * ( 1 - n/N ) / n ] * Σ ( M_{i} * x_{i} - t_{mean} / N )^{2} / ( n - 1 ) }
Two | ( 1 / M ) * sqrt { [ N^{2} * ( 1 - n/N ) / n ] * Σ ( M_{i} * x_{i} - t_{mean} / N )^{2} / ( n - 1 ) + ( N / n ) * Σ [ ( 1 - m_{i} / M_{i} ) * M_{i}^{2} * s_{i}^{2} / m_{i} ] }

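The two rows of the table above differ only by the within-cluster term, which a short Python sketch can make concrete. This is a minimal illustration under stated assumptions: the function name, the optional arguments, and the toy numbers are hypothetical, and the two-stage branch is used only when both the sample sizes m_{i} and the within-cluster variances s_{i}^{2} are supplied.

```python
from math import sqrt


def se_cluster_mean(N, M, cluster_sizes, cluster_means,
                    sample_sizes=None, within_variances=None):
    """Standard error of the estimated mean for cluster sampling.

    One-stage: pass only cluster_sizes (M_i) and cluster_means (x_i).
    Two-stage: also pass sample_sizes (m_i) and within_variances (s_i^2).
    """
    n = len(cluster_sizes)
    if n < 2:
        raise ValueError("at least two sampled clusters are needed")
    # Estimated cluster totals t_i = M_i * x_i, and t_mean = (N / n) * sum(t_i).
    totals = [M_i * x_i for M_i, x_i in zip(cluster_sizes, cluster_means)]
    t_mean = (N / n) * sum(totals)
    # Between-cluster term, shared by both designs.
    spread = sum((t_i - t_mean / N) ** 2 for t_i in totals) / (n - 1)
    variance = (N ** 2 * (1 - n / N) / n) * spread
    # Within-cluster term, added for two-stage designs only.
    if sample_sizes is not None and within_variances is not None:
        variance += (N / n) * sum(
            (1 - m_i / M_i) * M_i ** 2 * s2_i / m_i
            for M_i, m_i, s2_i in zip(cluster_sizes, sample_sizes, within_variances)
        )
    return sqrt(variance) / M


# Hypothetical data: 10 clusters of 10 observations, 2 clusters sampled.
one_stage = se_cluster_mean(10, 100, [10, 10], [4.0, 6.0])
two_stage = se_cluster_mean(10, 100, [10, 10], [4.0, 6.0],
                            sample_sizes=[5, 5], within_variances=[1.0, 1.0])
print(one_stage, two_stage)
```

With identical cluster means, the two-stage standard error is never smaller than the one-stage value, since the within-cluster term adds a non-negative amount under the square root.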
The next table shows how to compute the standard error for a proportion. Like
the previous table, this table shows equations for one-stage and two-stage designs.
It also shows how the equations differ when the true population proportions are
known versus when they are estimated based on sample data.

Number of stages | Population proportion | Standard error of proportion
One | Known | ( 1 / M ) * sqrt { [ N^{2} * ( 1 - n/N ) / n ] * Σ ( M_{i} * P_{i} - t_{prop} / N )^{2} / ( n - 1 ) }
One | Estimated | ( 1 / M ) * sqrt { [ N^{2} * ( 1 - n/N ) / n ] * Σ ( M_{i} * p_{i} - t_{prop} / N )^{2} / ( n - 1 ) }
Two | Known | ( 1 / M ) * sqrt { [ N^{2} * ( 1 - n/N ) / n ] * Σ ( M_{i} * P_{i} - t_{prop} / N )^{2} / ( n - 1 ) + ( N / n ) * Σ [ ( 1 - m_{i} / M_{i} ) * M_{i}^{2} * P_{i} * ( 1 - P_{i} ) / m_{i} ] }
Two | Estimated | ( 1 / M ) * sqrt { [ N^{2} * ( 1 - n/N ) / n ] * Σ ( M_{i} * p_{i} - t_{prop} / N )^{2} / ( n - 1 ) + ( N / n ) * Σ [ ( 1 - m_{i} / M_{i} ) * M_{i}^{2} * p_{i} * ( 1 - p_{i} ) / ( m_{i} - 1 ) ] }

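The four rows of the proportion table can be sketched as one Python function. This is only an illustration under stated assumptions: the function name, the `estimated` flag, and the toy numbers are hypothetical. The flag switches the within-cluster divisor between m_{i} (known proportions) and m_{i} - 1 (estimated proportions), which is the only place the known and estimated formulas differ.

```python
from math import sqrt


def se_cluster_proportion(N, M, cluster_sizes, proportions,
                          sample_sizes=None, estimated=True):
    """Standard error of the estimated proportion for cluster sampling.

    One-stage: leave sample_sizes as None.
    Two-stage: pass sample_sizes (m_i); `estimated` selects the divisor
    m_i - 1 (proportions estimated from data) or m_i (proportions known).
    """
    n = len(cluster_sizes)
    if n < 2:
        raise ValueError("at least two sampled clusters are needed")
    # Estimated successes per cluster, t_prop(i) = M_i * p_i.
    totals = [M_i * p_i for M_i, p_i in zip(cluster_sizes, proportions)]
    t_prop = (N / n) * sum(totals)
    # Between-cluster term, shared by all four rows of the table.
    spread = sum((t_i - t_prop / N) ** 2 for t_i in totals) / (n - 1)
    variance = (N ** 2 * (1 - n / N) / n) * spread
    # Within-cluster term for two-stage designs.
    if sample_sizes is not None:
        for M_i, m_i, p_i in zip(cluster_sizes, sample_sizes, proportions):
            divisor = (m_i - 1) if estimated else m_i
            variance += (N / n) * (1 - m_i / M_i) * M_i ** 2 * p_i * (1 - p_i) / divisor
    return sqrt(variance) / M


# Hypothetical data: 10 clusters of 10 observations, 2 clusters sampled,
# 5 observations drawn from each sampled cluster in the two-stage case.
print(se_cluster_proportion(10, 100, [10, 10], [0.4, 0.6]))
print(se_cluster_proportion(10, 100, [10, 10], [0.4, 0.6], sample_sizes=[5, 5]))
```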
Sample Problem
This section presents a sample problem that illustrates how to analyze survey
data when the sampling method is one-stage cluster sampling. (In a
subsequent lesson, we revisit this problem and see how cluster
sampling compares to other sampling methods.)
Sample Planning Wizard
The analysis of data collected via cluster sampling can be complex and
time-consuming. Stat Trek's Sample Planning Wizard can help. The Wizard computes
survey precision, sample size requirements, and costs; it also estimates
population parameters and tests hypotheses. It creates a summary report
that lists key findings and documents analytical techniques. The Wizard
is free. You can find the Sample Planning Wizard in Stat Trek's
main menu under the Stat Tools tab.
Example 1
At the end of every school year, the state administers a reading test to a
sample of third graders. The school system has 20,000 third graders, grouped in
1000 separate classes. Assume that each class has 20 students. This year, the
test was administered to each student in 36 randomly sampled classes. Thus,
this is one-stage cluster sampling, with classes serving as clusters. The
average test score from each sampled cluster, X_{i},
is shown below:
55, 60, 65, 67, 67, 70, 70, 70, 72, 72, 72, 72, 73, 73, 75, 75, 75, 75,
75, 77, 77, 78, 78, 78, 78, 80, 80, 80, 80, 80, 80, 83, 83, 85, 85, 85

Using sample data, estimate the mean reading achievement level in the
population. Find the margin
of error and the
confidence interval. Assume a 95%
confidence level.
Solution: Previously we described
how to compute the confidence interval for a mean score. Below,
we apply that process to the present cluster sampling problem.
 Identify a sample statistic. For this problem, we use
the sample mean to estimate the population mean, and we use the equation from
the "Measures of Central Tendency" table to compute the sample mean.
x = [ N / ( n * M ) ] * Σ ( M_{i} * X_{i} )
x = [ 1000 / ( 36 * 20,000 ) ] * Σ ( 20 * X_{i} )
x = Σ ( X_{i} ) / 36
x = ( 55 + 60 + 65 + ... + 85 + 85 + 85 ) / 36 = 75
Therefore, based on data from the cluster sample, we estimate that the mean
reading achievement level in the population is equal to 75.
 Select a confidence level. In this analysis, the confidence level
is defined for us in the problem. We are working with a 95%
confidence level.
 Find the margin of error. Elsewhere on this site, we show
how to compute the margin of error when the sampling
distribution is approximately normal. The key steps are
shown below.
 Find standard error of the sampling distribution.
Since we used one-stage cluster sampling, the standard
error is:
SE = ( 1 / M ) * sqrt { [ N^{2} * ( 1 - n/N ) / n ] *
Σ ( M_{i} * X_{i} - t_{mean} / N )^{2} / ( n - 1 ) }
where
t_{mean}
= ( N / n ) * Σ ( M_{i} * X_{i} )
Except for t_{mean}, all of the
terms on the right side of the above equation are known. Therefore, to
compute SE, we must first compute
t_{mean}. The formula for
t_{mean} is:
t_{mean} = ( N / n ) * Σ t_{i}
t_{mean} = ( N / n ) * ΣΣ [ ( M_{i} / m_{i} ) * x_{ij} ]
t_{mean} = ( 1000 / 36 ) * ΣΣ [ ( 20 / 20 ) * x_{ij} ]
t_{mean} = ( 27.778 ) * ΣΣ ( x_{ij} ) = ( 27.778 ) * 20 * Σ ( X_{i} )
t_{mean} = ( 27.778 ) * 20 * ( 55 + 60 + ... + 85 + 85 )
t_{mean} = 1,500,000
After we compute t_{mean},
all of the terms on the right side of the SE equation are known,
so we plug the known values into the standard error equation.
As shown below, the standard error is 1.1.
SE = ( 1 / M ) * sqrt { [ N^{2} * ( 1 - n/N ) / n ] *
Σ ( M_{i} * X_{i} - t_{mean} / N )^{2} / ( n - 1 ) }
SE = ( 1 / 20,000 ) * sqrt { [ 1000^{2} * ( 1 - 36/1000 ) / 36 ] *
Σ ( 20 * X_{i} - 1,500,000 / 1000 )^{2} / 35 }
SE = ( 1 / 20,000 ) * sqrt { [ 1000^{2} * ( 1 - 36/1000 ) / 36 ] *
[ ( 20 * 55 - 1,500,000 / 1000 )^{2} / 35 +
( 20 * 60 - 1,500,000 / 1000 )^{2} / 35
+ ... +
( 20 * 85 - 1,500,000 / 1000 )^{2} / 35 +
( 20 * 85 - 1,500,000 / 1000 )^{2} / 35 ] }
SE = ( 1 / 20,000 ) * sqrt { [ 1000^{2} * ( 1 - 36/1000 ) / 36 ] * 18,217.143 }
SE = 1.1
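 Find critical value. The critical value is a factor used to
compute the margin of error. Based on the
central limit theorem, we can assume that the
sampling distribution
of the mean is approximately normally distributed. Therefore, we express the critical
value as a
z-score.
To find the critical value, we take these steps.
 Compute alpha (α): α = 1 - ( confidence level / 100 ) = 1 - 0.95 = 0.05
 Find the critical probability (p*): p* = 1 - α / 2 = 1 - 0.05 / 2 = 0.975
 The critical value is the z-score with a cumulative probability of 0.975, which is 1.96.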
 Compute margin of error (ME):
ME = critical value * standard error
ME = 1.96 * 1.1 = 2.16
 Specify the confidence interval. The range of the confidence
interval is defined by the sample statistic ±
margin of error. And the uncertainty is denoted
by the confidence level.
Therefore, the 95% confidence interval is 72.84 to 77.16. And the margin
of error is equal to 2.16. That is, we are 95%
confident that the true population mean is in the range
defined by 75 ± 2.16.
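
The arithmetic in Example 1 can be checked with a short Python sketch. The variable names are assumptions made for this illustration, and the critical value comes from the standard library's `statistics.NormalDist` rather than from a z-table. The script carries the unrounded standard error through to the margin of error, whereas the worked solution rounds the standard error to 1.1 first.

```python
from math import sqrt
from statistics import NormalDist

# Population facts stated in Example 1.
N = 1000        # classes (clusters) in the population
M = 20_000      # third graders in the population
M_i = 20        # students per class (the same for every class)

# Average test score X_i for each of the 36 sampled classes.
scores = [55, 60, 65, 67, 67, 70, 70, 70, 72, 72, 72, 72, 73, 73, 75, 75, 75, 75,
          75, 77, 77, 78, 78, 78, 78, 80, 80, 80, 80, 80, 80, 83, 83, 85, 85, 85]
n = len(scores)

# Sample statistic: x = [ N / ( n * M ) ] * sum( M_i * X_i )
x_bar = (N / (n * M)) * sum(M_i * X for X in scores)

# Estimated population total: t_mean = ( N / n ) * sum( M_i * X_i )
t_mean = (N / n) * sum(M_i * X for X in scores)

# One-stage standard error of the mean.
spread = sum((M_i * X - t_mean / N) ** 2 for X in scores) / (n - 1)
se = sqrt((N ** 2 * (1 - n / N) / n) * spread) / M

# Critical value for a 95% confidence level: z-score at cumulative probability 0.975.
z = NormalDist().inv_cdf(0.975)

# Margin of error and confidence interval.
me = z * se
lower, upper = x_bar - me, x_bar + me

print(f"mean={x_bar:.2f}  t_mean={t_mean:,.0f}  SE={se:.3f}")
print(f"z={z:.2f}  ME={me:.2f}  CI=({lower:.2f}, {upper:.2f})")
```

Run as written, this should reproduce the figures in the solution: a mean of 75, a total of 1,500,000, a standard error of about 1.1, and a margin of error of about 2.16.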