How to Analyze Data from Cluster Samples

In this lesson, we describe how to analyze survey data when the sampling method is cluster sampling.

Notation

The following notation is helpful when we talk about analyzing data from cluster samples.

  • N = The number of clusters in the population.
  • Mi = The number of observations in the ith cluster.
  • Xi = The population mean for the ith cluster.
  • M = The total number of observations in the population = Σ Mi.
  • P = The population proportion.
  • Pi = The population proportion for the ith cluster.
  • n = The number of clusters in the sample.
  • mi = The number of sample observations from the ith cluster.
  • xij = The measurement for the jth observation from the ith cluster.
  • xi = The sample estimate of the population mean for the ith cluster = Σ ( xij / mi ) summed over j.
  • p = The sample estimate of the population proportion.
  • pi = The sample estimate of the population proportion for the ith cluster.
  • si^2 = The sample estimate of the population variance within cluster i.
  • ti = The estimated total for the ith cluster = Σ ( Mi / mi ) * xij = Mi * xi .
  • tmean = The sample estimate of the population total = ( N / n ) * Σ ti .
  • tprop(i) = The sample estimate of the number of successes in population i = Mi * pi .
  • tprop = The sample estimate of the number of successes in the population = ( N / n ) * Σ tprop(i) .
  • SE: The standard error. (This is an estimate of the standard deviation of the sampling distribution.)
  • Σ = Summation symbol, used to compute sums over the sample. ( To illustrate its use, when there are m terms, Σ xi = x1 + x2 + x3 + ... + x(m-1) + xm )

How to Analyze Data From Cluster Samples

Different sampling methods use different formulas to estimate population parameters and to estimate standard errors. The formulas that we have used so far in this tutorial work for simple random samples and for stratified samples, but they are not right for cluster samples.

The next two sections of this lesson show the correct formulas to use with cluster samples. With these formulas, you can readily estimate population parameters and standard errors. And once you have the standard error, the procedures for computing other things (e.g., margin of error, confidence interval, and region of acceptance) are largely the same for cluster samples as for simple random samples. The sample problem at the end of this lesson shows how to use these formulas to analyze data from cluster samples.

Measures of Central Tendency

The table below shows formulas that can be used with one-stage and two-stage cluster samples to estimate a population mean and a population proportion.

Population parameter | Sample estimate: One-stage | Sample estimate: Two-stage
Mean | ( N / ( n * M ) ) * Σ ( Mi * Xi ) | ( N / ( n * M ) ) * Σ ( Mi * xi )
Proportion | ( N / ( n * M ) ) * Σ ( Mi * Pi ) | ( N / ( n * M ) ) * Σ ( Mi * pi )

These formulas produce unbiased estimates of the population parameters.
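
As a rough illustration, the estimator in the table can be sketched in Python. The function name and argument names below are hypothetical (they do not come from this lesson), and the same function serves for a mean or a proportion because both rows of the table share one formula.

```python
def cluster_estimate(N, M, cluster_sizes, cluster_values):
    """Estimate a population mean or proportion from a cluster sample.

    Implements ( N / ( n * M ) ) * sum( Mi * value_i ), where value_i is the
    mean (or proportion) of the ith sampled cluster. For one-stage sampling
    value_i is the true cluster value; for two-stage sampling it is the
    estimate computed from the observations sampled within the cluster.
    """
    n = len(cluster_sizes)
    if n == 0 or n != len(cluster_values):
        raise ValueError("need one size and one value per sampled cluster")
    weighted_sum = sum(Mi * v for Mi, v in zip(cluster_sizes, cluster_values))
    return (N / (n * M)) * weighted_sum


# Hypothetical toy data: 3 sampled clusters of 20 observations each,
# drawn from a population of 1000 clusters and 20,000 observations.
toy_mean = cluster_estimate(1000, 20_000, [20, 20, 20], [70, 75, 80])
print(round(toy_mean, 2))  # about 75, the simple average when clusters are equal in size
```

When every cluster has the same size, the weights cancel and the estimate reduces to the simple average of the cluster values, which is what the toy call shows.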

The Variability of the Estimate

The precision of a sample design is directly related to the variability of the estimate, which is measured by the standard error. The tables below show how to compute the standard error (SE), when the sampling method is cluster sampling.

The first table shows how to compute the standard error for a mean score, given one- or two-stage sampling.

Number of stages | Standard error of mean score
One | ( 1 / M ) * sqrt { [ N^2 * ( 1 - n/N ) / n ] * Σ ( Mi * xi - tmean / N )^2 / ( n - 1 ) }
Two | ( 1 / M ) * sqrt { [ N^2 * ( 1 - n/N ) / n ] * Σ ( Mi * xi - tmean / N )^2 / ( n - 1 ) + ( N / n ) * Σ [ ( 1 - mi / Mi ) * Mi^2 * si^2 / mi ] }
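
The two rows of this table differ only in the within-cluster term, so a single function can cover both. The following is a minimal sketch under that reading; the function name, argument names, and toy numbers are assumptions made for illustration.

```python
import math


def se_cluster_mean(N, M, cluster_sizes, cluster_means,
                    sample_sizes=None, within_variances=None):
    """Standard error of the estimated mean for a cluster sample.

    One-stage: leave sample_sizes and within_variances as None.
    Two-stage: pass mi (observations sampled per cluster) and si^2
    (sample variance within each cluster).
    """
    n = len(cluster_sizes)
    if n < 2:
        raise ValueError("at least two sampled clusters are required")
    totals = [Mi * xi for Mi, xi in zip(cluster_sizes, cluster_means)]
    t_mean = (N / n) * sum(totals)  # estimated population total
    # Between-cluster component: spread of cluster totals around tmean / N.
    between = sum((t - t_mean / N) ** 2 for t in totals) / (n - 1)
    variance = N ** 2 * (1 - n / N) / n * between
    if sample_sizes is not None and within_variances is not None:
        # Within-cluster component, present only in two-stage designs.
        variance += (N / n) * sum(
            (1 - mi / Mi) * Mi ** 2 * s2 / mi
            for Mi, mi, s2 in zip(cluster_sizes, sample_sizes, within_variances)
        )
    return math.sqrt(variance) / M


# Hypothetical toy data: N = 10 clusters, M = 100 observations, n = 2 sampled.
one_stage = se_cluster_mean(10, 100, [10, 10], [4, 6])
two_stage = se_cluster_mean(10, 100, [10, 10], [4, 6],
                            sample_sizes=[5, 5], within_variances=[2, 2])
print(one_stage, two_stage)
```

Because the second term is never negative, the two-stage standard error is at least as large as the one-stage value for the same cluster means.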

The next table shows how to compute the standard error for a proportion. Like the previous table, this table shows equations for one- and two-stage designs. It also shows how the equations differ when the true population proportions are known versus when they are estimated based on sample data.

Number of stages | Population proportion | Standard error of proportion
One | Known | ( 1 / M ) * sqrt { [ N^2 * ( 1 - n/N ) / n ] * Σ ( Mi * Pi - tprop / N )^2 / ( n - 1 ) }
One | Estimated | ( 1 / M ) * sqrt { [ N^2 * ( 1 - n/N ) / n ] * Σ ( Mi * pi - tprop / N )^2 / ( n - 1 ) }
Two | Known | ( 1 / M ) * sqrt { [ N^2 * ( 1 - n/N ) / n ] * Σ ( Mi * Pi - tprop / N )^2 / ( n - 1 ) + ( N / n ) * Σ [ ( 1 - mi / Mi ) * Mi^2 * Pi * ( 1 - Pi ) / mi ] }
Two | Estimated | ( 1 / M ) * sqrt { [ N^2 * ( 1 - n/N ) / n ] * Σ ( Mi * pi - tprop / N )^2 / ( n - 1 ) + ( N / n ) * Σ [ ( 1 - mi / Mi ) * Mi^2 * pi * ( 1 - pi ) / ( mi - 1 ) ] }
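
The four rows above can be sketched as one Python function. In the table, the known and estimated cases differ only in the divisor of the within-cluster term ( mi versus mi - 1 ). The function name, the `known` flag, and the toy numbers are hypothetical, chosen only for this illustration.

```python
import math


def se_cluster_proportion(N, M, cluster_sizes, cluster_props,
                          sample_sizes=None, known=False):
    """Standard error of the estimated proportion for a cluster sample.

    One-stage: leave sample_sizes as None.
    Two-stage: pass mi for each sampled cluster; set known=True when the
    cluster proportions are true values rather than sample estimates.
    """
    n = len(cluster_sizes)
    if n < 2:
        raise ValueError("at least two sampled clusters are required")
    totals = [Mi * p for Mi, p in zip(cluster_sizes, cluster_props)]
    t_prop = (N / n) * sum(totals)  # estimated number of successes
    between = sum((t - t_prop / N) ** 2 for t in totals) / (n - 1)
    variance = N ** 2 * (1 - n / N) / n * between
    if sample_sizes is not None:
        # Within-cluster term: divide by mi if known, by mi - 1 if estimated.
        variance += (N / n) * sum(
            (1 - mi / Mi) * Mi ** 2 * p * (1 - p) / (mi if known else mi - 1)
            for Mi, mi, p in zip(cluster_sizes, sample_sizes, cluster_props)
        )
    return math.sqrt(variance) / M


# Hypothetical toy data: N = 10 clusters, M = 100 observations, n = 2 sampled.
print(se_cluster_proportion(10, 100, [10, 10], [0.4, 0.6]))
print(se_cluster_proportion(10, 100, [10, 10], [0.4, 0.6], sample_sizes=[5, 5]))
```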


Sample Problem

This section presents a sample problem that illustrates how to analyze survey data when the sampling method is one-stage cluster sampling. (In a subsequent lesson, we re-visit this problem and see how cluster sampling compares to other sampling methods.)

Example 1

At the end of every school year, the state administers a reading test to a sample of third graders. The school system has 20,000 third graders, grouped in 1,000 separate classes. Assume that each class has 20 students. This year, the test was administered to each student in 36 randomly sampled classes. Thus, this is one-stage cluster sampling, with classes serving as clusters. The average test score from each sampled cluster ( Xi ) is shown below:

55, 60, 65, 67, 67, 70, 70, 70, 72, 72, 72, 72, 73, 73, 75, 75, 75, 75,
75, 77, 77, 78, 78, 78, 78, 80, 80, 80, 80, 80, 80, 83, 83, 85, 85, 85  

Using sample data, estimate the mean reading achievement level in the population. Find the margin of error and the confidence interval. Assume a 95% confidence level.

Solution: Previously we described how to compute the confidence interval for a mean score. Below, we apply that process to the present cluster sampling problem.

  • Identify a sample statistic. For this problem, we use the sample mean to estimate the population mean, and we use the equation from the "Measures of Central Tendency" table to compute the sample mean.

    x = [ N / ( n * M ) ] * Σ ( Mi * Xi )
    x = [ 1000 / ( 36 * 20,000 ) ] * Σ ( 20 * Xi ) = Σ ( Xi ) / 36
    x = ( 55 + 60 + 65 + ... + 85 + 85 + 85 ) / 36 = 75

    Therefore, based on data from the cluster sample, we estimate that the mean reading achievement level in the population is equal to 75.

  • Select a confidence level. In this analysis, the confidence level is defined for us in the problem. We are working with a 95% confidence level.

  • Find the margin of error. Elsewhere on this site, we show how to compute the margin of error when the sampling distribution is approximately normal. The key steps are shown below.

    • Find the standard error of the sampling distribution. Since we used one-stage cluster sampling, the standard error is:

      SE = ( 1 / M ) * sqrt { [ N^2 * ( 1 - n/N ) / n ] * Σ ( Mi * Xi - tmean / N )^2 / ( n - 1 ) }
              where tmean = ( N / n ) * Σ ( Mi * Xi )

      Except for tmean, all of the terms on the right side of the above equation are known. Therefore, to compute SE, we must first compute tmean. The formula for tmean is:

      tmean = ( N / n ) * Σ ti
      tmean = ( N / n ) * ΣΣ [( Mi / mi ) * xij ]
      tmean = ( 1000 / 36 ) * ΣΣ [( 20 / 20 ) * xij ]
      tmean = ( 27.778 ) * ΣΣ ( xij ) = ( 27.778 ) * 20 * Σ ( Xi )
      tmean = ( 27.778 ) * 20 * ( 55 + 60 + 65 + ... + 85 + 85 + 85 ) = 1,500,000

      After we compute tmean, all of the terms on the right side of the SE equation are known, so we plug the known values into the standard error equation. As shown below, the standard error is approximately 1.1.

      SE = ( 1 / M ) * sqrt { [ N^2 * ( 1 - n/N ) / n ] * Σ ( Mi * Xi - tmean / N )^2 / ( n - 1 ) }
      SE = ( 1 / 20,000 ) * sqrt { [ 1000^2 * ( 1 - 36/1000 ) / 36 ] * Σ ( 20 * Xi - 1,500,000 / 1000 )^2 / 35 }
      SE = ( 1 / 20,000 ) * sqrt { [ 1000^2 * ( 1 - 36/1000 ) / 36 ] *
      [ ( 20 * 55 - 1,500,000 / 1000 )^2 / 35 + ( 20 * 60 - 1,500,000 / 1000 )^2 / 35
      + ... +
      ( 20 * 85 - 1,500,000 / 1000 )^2 / 35 + ( 20 * 85 - 1,500,000 / 1000 )^2 / 35 ] }
      SE = ( 1 / 20,000 ) * sqrt { [ 1000^2 * ( 1 - 36/1000 ) / 36 ] * 18,217.143 }
      SE ≈ 1.1

    • Find the critical value. The critical value is a factor used to compute the margin of error. Based on the central limit theorem, we can assume that the sampling distribution of the mean is approximately normal. Therefore, we express the critical value as a z score. To find the critical value, we take these steps.

      • Compute alpha (α): α = 1 - (confidence level / 100) = 1 - 95/100 = 0.05
      • Find the critical probability (p*): p* = 1 - α/2 = 1 - 0.05/2 = 0.975
      • The critical value is the z score having a cumulative probability equal to 0.975. From the Normal Distribution Calculator, we find that the critical value is 1.96.

    • Compute the margin of error (ME): ME = critical value * standard error = 1.96 * 1.1 = 2.16

  • Specify the confidence interval. The range of the confidence interval is defined by the sample statistic ± margin of error, and the uncertainty is denoted by the confidence level.

Therefore, the 95% confidence interval is 72.84 to 77.16, and the margin of error is 2.16. That is, we are 95% confident that the true population mean is in the range defined by 75 ± 2.16.
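
For readers who want to check the arithmetic, the steps of Example 1 can be retraced in a short Python script. This is only a sketch: the variable names are our own, and it uses the standard library's `statistics.NormalDist` in place of the Normal Distribution Calculator mentioned above. Because the script carries the unrounded standard error through to the end, its figures may differ from the text in the last decimal place.

```python
import math
from statistics import NormalDist

# Average test score for each of the 36 sampled classes (from Example 1).
scores = [55, 60, 65, 67, 67, 70, 70, 70, 72, 72, 72, 72, 73, 73, 75, 75, 75, 75,
          75, 77, 77, 78, 78, 78, 78, 80, 80, 80, 80, 80, 80, 83, 83, 85, 85, 85]

N = 1000       # classes (clusters) in the population
M = 20_000     # third graders in the population
Mi = 20        # students per class, assumed equal for every class
n = len(scores)

# Point estimate: ( N / ( n * M ) ) * sum( Mi * Xi )
estimate = (N / (n * M)) * sum(Mi * x for x in scores)

# Estimated population total, then the one-stage standard error.
t_mean = (N / n) * sum(Mi * x for x in scores)
between = sum((Mi * x - t_mean / N) ** 2 for x in scores) / (n - 1)
se = math.sqrt(N ** 2 * (1 - n / N) / n * between) / M

# Critical value for a 95% confidence level, then the margin of error.
alpha = 1 - 95 / 100
z = NormalDist().inv_cdf(1 - alpha / 2)
me = z * se

print(f"estimate = {estimate:.2f}, SE = {se:.2f}, z = {z:.2f}")
print(f"ME = {me:.2f}, CI = {estimate - me:.2f} to {estimate + me:.2f}")
```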