How to Analyze Stratified Random Samples
In this lesson, we describe how to analyze survey data from stratified
random samples.
Notation
The following notation is helpful when we talk about analyzing data from
stratified samples.
- H: The number of strata in the population.
- N: The number of observations in the population.
- $N_h$: The number of observations in stratum h of the population.
- $P_h$: The true proportion in stratum h of the population.
- $\sigma^2$: The known variance of the population.
- $\sigma$: The known standard deviation of the population.
- $\sigma_h$: The known standard deviation in stratum h of the population.
- $x$: The sample estimate of the population mean.
- $x_h$: The mean of observations from stratum h of the sample.
- $p_h$: The proportion of successes in stratum h of the sample.
- $s_h$: The sample estimate of the population standard deviation in stratum h.
- $s_h^2$: The sample estimate of the population variance in stratum h.
- n: The number of observations in the sample.
- $n_h$: The number of observations in stratum h of the sample.
- SD: The standard deviation of the sampling distribution.
- SE: The standard error. (This is an estimate of the standard deviation of the sampling distribution.)
- Σ: Summation symbol. (To illustrate the use of the symbol, $\sum x_h = x_1 + x_2 + \dots + x_{H-1} + x_H$.)
How to Analyze Data From Stratified Samples
When it comes to analyzing data from stratified samples, there is good news and
there is bad news.
First, the bad news. Different
sampling methods use different formulas to estimate population
parameters and to estimate
standard errors. The formulas that we have used so far in this tutorial
work for simple random samples, but they are not right for stratified samples.
Now, the good news. Once you know the correct formulas, you can readily estimate
population parameters and standard errors. And once you have the standard
error, the procedures for computing other things (e.g.,
margin of error,
confidence interval, and
region of acceptance) are largely the same for stratified samples as
for simple random samples. The next two sections provide formulas that can be
used with stratified sampling. The sample problem at the end of this lesson
shows how to use these formulas to analyze data from stratified samples.
Measures of Central Tendency
The table below shows formulas that can be used with stratified sampling to
estimate a population mean and a population proportion.
| Population parameter | Formula for sample estimate |
| --- | --- |
| Mean | $\sum (N_h / N) \cdot x_h$ |
| Proportion | $\sum (N_h / N) \cdot p_h$ |
Note that $N_h / N$ is the stratum weight: the share of the population that
falls in stratum h. Thus, to compute a sample estimate of the population
mean or population proportion, we need to know the stratum weights (i.e., we
need to know the relative size of each stratum).
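As a minimal sketch of the weighting described above, the Python snippet below multiplies each stratum statistic by N_h / N and sums the results. The function name `stratified_estimate` and the three-stratum population are hypothetical, chosen only for illustration.

```python
def stratified_estimate(stratum_sizes, stratum_stats):
    """Weight each stratum statistic (a mean or a proportion) by N_h / N."""
    if len(stratum_sizes) != len(stratum_stats):
        raise ValueError("need exactly one statistic per stratum")
    total = sum(stratum_sizes)  # N, the population size
    return sum(size / total * stat
               for size, stat in zip(stratum_sizes, stratum_stats))


# Hypothetical population with three strata of unequal size.
sizes = [5000, 3000, 2000]   # N_h
means = [60.0, 70.0, 80.0]   # x_h
print(stratified_estimate(sizes, means))
```

The same function works for proportions: pass the stratum proportions in place of the stratum means.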
The Variability of the Estimate
The precision of a
sample design is directly related to the variability of the
estimate, which is measured by the
standard deviation or
standard error. The tables below show how to compute
the standard deviation (SD) and standard error (SE),
assuming that the sample method is stratified random sampling.
The first table shows how to compute the variability for a mean score. Note
that the table shows four sample designs. In two of the designs, the true
population variance is known; and in two, it is estimated from sample data.
Also, in two of the designs, the researcher sampled with replacement; and in
two, without replacement.
| Population variance | Replacement strategy | Variability |
| --- | --- | --- |
| Known | With replacement | $SD = \frac{1}{N} \sqrt{\sum \frac{N_h^2 \, \sigma_h^2}{n_h}}$ |
| Known | Without replacement | $SD = \frac{1}{N} \sqrt{\sum \frac{N_h^3}{N_h - 1} \left(1 - \frac{n_h}{N_h}\right) \frac{\sigma_h^2}{n_h}}$ |
| Estimated | With replacement | $SE = \frac{1}{N} \sqrt{\sum \frac{N_h^2 \, s_h^2}{n_h}}$ |
| Estimated | Without replacement | $SE = \frac{1}{N} \sqrt{\sum N_h^2 \left(1 - \frac{n_h}{N_h}\right) \frac{s_h^2}{n_h}}$ |
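The four formulas for a mean can be sketched as one Python function. This is an illustrative sketch, not a reference implementation: the function name, the two keyword flags, and the two-stratum numbers are all assumptions made for the example.

```python
import math


def stratified_mean_sd(pop_sizes, sample_sizes, variances,
                       known_variance=False, with_replacement=False):
    """SD (known variance) or SE (estimated variance) of a stratified mean.

    pop_sizes: N_h per stratum; sample_sizes: n_h per stratum;
    variances: sigma_h^2 if known_variance is True, otherwise s_h^2.
    """
    total_pop = sum(pop_sizes)  # N
    total = 0.0
    for N_h, n_h, var in zip(pop_sizes, sample_sizes, variances):
        if with_replacement:
            term = N_h**2 * var / n_h
        elif known_variance:
            term = N_h**3 / (N_h - 1) * (1 - n_h / N_h) * var / n_h
        else:
            term = N_h**2 * (1 - n_h / N_h) * var / n_h
        total += term
    return math.sqrt(total) / total_pop


# Hypothetical design: two strata, variances estimated from the sample.
se_wr = stratified_mean_sd([600, 400], [30, 20], [25.0, 16.0],
                           with_replacement=True)
se_wor = stratified_mean_sd([600, 400], [30, 20], [25.0, 16.0])
print(round(se_wr, 4), round(se_wor, 4))
```

Sampling without replacement gives the smaller value here because each stratum's term is multiplied by 1 - n_h / N_h, which is less than 1 whenever a stratum is sampled.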
The next table shows how to compute the variability for a proportion. Like
the previous table, this table shows four sample designs. In this case,
however, the designs are based on whether the true population proportion is
known and whether the design calls for sampling with or without replacement.
| Population proportion | Replacement strategy | Variability |
| --- | --- | --- |
| Known | With replacement | $SD = \frac{1}{N} \sqrt{\sum \frac{N_h^2 \, P_h (1 - P_h)}{n_h}}$ |
| Known | Without replacement | $SD = \frac{1}{N} \sqrt{\sum \frac{N_h^3}{N_h - 1} \left(1 - \frac{n_h}{N_h}\right) \frac{P_h (1 - P_h)}{n_h}}$ |
| Estimated | With replacement | $SE = \frac{1}{N} \sqrt{\sum \frac{N_h^2 \, p_h (1 - p_h)}{n_h - 1}}$ |
| Estimated | Without replacement | $SE = \frac{1}{N} \sqrt{\sum N_h^2 \left(1 - \frac{n_h}{N_h}\right) \frac{p_h (1 - p_h)}{n_h - 1}}$ |
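The proportion formulas follow the same pattern and can be sketched the same way. As before, the function name, the keyword flags, and the sample numbers are hypothetical. Note that the estimated cases divide by n_h - 1, while the known cases divide by n_h.

```python
import math


def stratified_proportion_sd(pop_sizes, sample_sizes, proportions,
                             known_proportion=False, with_replacement=False):
    """SD (known P_h) or SE (estimated p_h) of a stratified proportion."""
    total_pop = sum(pop_sizes)  # N
    total = 0.0
    for N_h, n_h, p in zip(pop_sizes, sample_sizes, proportions):
        pq = p * (1 - p)
        if known_proportion:
            if with_replacement:
                term = N_h**2 * pq / n_h
            else:
                term = N_h**3 / (N_h - 1) * (1 - n_h / N_h) * pq / n_h
        else:
            term = N_h**2 * pq / (n_h - 1)
            if not with_replacement:
                term *= 1 - n_h / N_h
        total += term
    return math.sqrt(total) / total_pop


# Hypothetical design: two strata, proportions estimated from the sample.
se_prop = stratified_proportion_sd([600, 400], [31, 21], [0.5, 0.2],
                                   with_replacement=True)
print(round(se_prop, 4))
```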
Sample Problem
This section presents a sample problem that illustrates how to analyze survey
data when the sampling method is proportionate stratified sampling. (In a
subsequent lesson, we re-visit this problem and see how stratified
sampling compares to other sampling methods.)
Problem 1
At the end of every school year, the state administers a reading test to a
sample of third graders. The school system has 20,000 third graders, half boys
and half girls.
This year, a proportionate stratified sample was used to select 36 students for
testing. Because the population is half boys and half girls, the sample
included 18 boys from one stratum and 18 girls from the other. Test scores
from each sampled student are shown below:
| Stratum | Test scores |
| --- | --- |
| Boys | 50, 55, 60, 62, 62, 65, 67, 67, 70, 70, 73, 73, 75, 78, 78, 80, 85, 90 |
| Girls | 70, 70, 72, 72, 75, 75, 78, 78, 80, 80, 82, 82, 85, 85, 88, 88, 90, 90 |
Using sample data, estimate the mean reading achievement level in the
population. Find the margin
of error and the
confidence interval. Assume a 95%
confidence level.
Solution: Previously we described
how to compute the confidence interval for a mean score. We
follow that process below.
- Identify a sample statistic. For this problem, we use
the overall sample mean to estimate the population mean. To compute the overall
sample mean, we need to compute the sample means for each stratum. The stratum
mean for boys is equal to:
$x_{boys} = \sum x_i / n_{boys}$
$x_{boys} = (50 + 55 + 60 + \dots + 80 + 85 + 90) / 18 = 70$
The stratum mean for girls is computed similarly. It is equal to 80. Therefore,
the overall sample mean is:
$x = \sum (N_h / N) \cdot x_h$
$x = (10{,}000 / 20{,}000) \cdot 70 + (10{,}000 / 20{,}000) \cdot 80 = 75$
Therefore, based on data from the sample strata, we estimate that the mean
reading achievement level in the population is equal to 75.
- Select a confidence level. In this analysis, the confidence level
is defined for us in the problem. We are working with a 95%
confidence level.
- Find the margin of error. Elsewhere on this site, we show
how to compute the margin of error when the sampling
distribution is approximately normal. The key steps are
shown below.
- Find the standard error of the sampling distribution.
First, we estimate the variance of the test scores
($s_h^2$) within each stratum.
Then, we compute the standard
error (SE). For boys, the
within-stratum sample variance is equal to:
$s_h^2 = \sum (x_i - x_h)^2 / (n_h - 1)$
$s_h^2 = [(50 - 70)^2 + (55 - 70)^2 + (60 - 70)^2 + \dots + (85 - 70)^2 + (90 - 70)^2] / 17 = 105.41$
The within-stratum sample variance for girls is computed similarly. It is equal
to 45.41.
Using results from the above computations, we compute the
standard error (SE):
$SE = \frac{1}{N} \sqrt{\sum N_h^2 \left(1 - \frac{n_h}{N_h}\right) \frac{s_h^2}{n_h}}$
$SE = \frac{1}{20{,}000} \sqrt{[100{,}000{,}000 \cdot (1 - 18/10{,}000) \cdot 105.41 / 18] + [100{,}000{,}000 \cdot (1 - 18/10{,}000) \cdot 45.41 / 18]}$
$SE = \frac{1}{20{,}000} \sqrt{[99{,}820{,}000 \cdot 105.41 / 18] + [99{,}820{,}000 \cdot 45.41 / 18]} = 1.45$
Thus, the standard error of the sampling distribution
of the mean is 1.45.
- Find the critical value. The critical value is a factor used to
compute the margin of error. Based on the
central limit theorem, we can assume that the
sampling distribution
of the mean is approximately normally distributed. Therefore, we express the
critical value as a
z score.
To find the critical value, we take these steps.
- Compute alpha (α): α = 1 - (confidence level / 100) = 1 - 95/100 = 0.05
- Find the critical probability (p*): p* = 1 - α/2 = 1 - 0.05/2 = 0.975
- The critical value is
the z score having a
cumulative probability
equal to 0.975. From the
Normal Distribution Calculator,
we find that the critical value is
1.96.
- Compute margin of error (ME): ME = critical value * standard error
= 1.96 * 1.45 = 2.84
- Specify the confidence interval. The range of the confidence
interval is defined by the sample statistic ±
margin of error. And the uncertainty is denoted
by the confidence level.
Therefore, the 95% confidence interval is 72.16 to 77.84. And the margin
of error is equal to 2.84. That is, we are 95%
confident that the true population mean is in the range
defined by 75 ± 2.84.
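The steps of this solution can be reproduced with a short Python sketch that uses only the standard library. The variable names are illustrative. The script assumes sampling without replacement with estimated variances, as in the solution above, and it takes the critical value from `statistics.NormalDist` instead of a lookup table.

```python
import math
from statistics import NormalDist, mean, variance

boys = [50, 55, 60, 62, 62, 65, 67, 67, 70, 70, 73, 73, 75, 78, 78, 80, 85, 90]
girls = [70, 70, 72, 72, 75, 75, 78, 78, 80, 80, 82, 82, 85, 85, 88, 88, 90, 90]

# Each stratum: (population size N_h, sampled scores).
strata = [(10_000, boys), (10_000, girls)]
N = sum(N_h for N_h, _ in strata)

# Step 1: weight each stratum mean by N_h / N.
estimate = sum(N_h / N * mean(scores) for N_h, scores in strata)

# Step 3a: standard error, without replacement, variance estimated.
# statistics.variance divides by n_h - 1, matching the s_h^2 formula.
se = math.sqrt(sum(
    N_h**2 * (1 - len(scores) / N_h) * variance(scores) / len(scores)
    for N_h, scores in strata
)) / N

# Step 3b: z score with cumulative probability 0.975.
alpha = 1 - 95 / 100
z = NormalDist().inv_cdf(1 - alpha / 2)

# Step 3c and step 4: margin of error and confidence interval.
margin = z * se
print(f"mean = {estimate:.2f}, SE = {se:.2f}, z = {z:.2f}")
print(f"95% CI: {estimate - margin:.2f} to {estimate + margin:.2f}")
```

Because the script does not round the standard error or the critical value before multiplying, its margin of error comes out slightly below the hand-computed 2.84 (about 2.83), so the interval endpoints can differ from the text in the second decimal place.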