How to Analyze Survey Data for Hypothesis Tests
Traditionally, researchers analyze survey data to estimate population parameters. But very similar analytical
techniques can also be applied to test hypotheses.
In this lesson, we describe how to analyze survey data to test statistical hypotheses.
The Logic of the Analysis
In a big-picture sense, the analysis of survey sampling data is easy. When you use sample data to test a hypothesis,
the analysis includes the same seven steps:
 Estimate a population parameter.
 Estimate population variance.
 Compute standard error.
 Set the significance level.
 Find the critical value (often a z-score or a t-score).
 Define the upper limit of the region of acceptance.
 Define the lower limit of the region of acceptance.
It doesn't matter whether the sampling method is simple random sampling,
stratified sampling, or cluster sampling. And it doesn't matter whether the parameter of interest is a mean score, a proportion, or a
total score. The analysis of survey sampling data always includes the same seven steps.
However, formulas used in the first three steps of the analysis can differ, based on the sampling method and the parameter of interest.
In the next section, we'll list the formulas to use for each step. By the end of the lesson, you'll know how to test hypotheses about mean
scores, proportions, and total scores using data from simple random samples, stratified samples, and cluster samples.
Data Analysis for Hypothesis Testing
Now, let's look in a little more detail at the seven steps required to conduct a hypothesis test, when you are working with data from a survey sample.
 Estimate a population parameter. The first step in the analysis is to estimate the value of the population parameter that appears in the null hypothesis.
To accomplish this, we compute a point estimate
of the population parameter; that is, we compute a sample statistic. Here are formulas for different scenarios:
 Mean score (simple random sampling): Use this formula to estimate the population mean, using data from a
simple random sample:
Sample mean = x̄ = Σx / n
where x̄ is a sample estimate of the population mean,
Σx is the sum of all the sample observations, and n is the number of sample observations.
 Proportion (simple random sampling): A proportion is a special case of the mean.
It represents the number of observations that have a particular attribute divided by the total
number of observations in the group. Use this formula to estimate the population proportion:
Sample proportion = p = ( Observations with attribute ) / ( Total sample size n )
 Total score (simple random sampling): If we know the sample mean, we can estimate the population total (t)
from the following formula:
Population total = t = N * x̄
where N is the number of observations in the population, and x̄ is the sample mean.
Or, if we know the sample proportion, we can estimate the population total (t) as:
Population total = t = N * p
where t is an estimate of the number of elements in the population that have a specified attribute,
N is the number of observations in the population, and p is the sample proportion.
 Mean score (stratified sampling): Use this formula to estimate the population mean from a stratified sample:
Sample mean = x̄ = Σ [ ( N_{h} / N ) * x_{h} ]
where N_{h} is the number of observations in stratum h of the population, N is the number of observations in the
population, and x_{h} is the mean score from the sample in stratum h.
 Proportion (stratified sampling): Use this formula to estimate the population proportion from a stratified sample:
Sample proportion = p = Σ( N_{h} / N ) * p_{h}
where N_{h} is the number of observations in stratum h of the population, N is the number of observations in the
population, and p_{h} is the sample proportion in stratum h.
 Total score (stratified sampling): If we know the sample mean in each stratum, we can estimate the
population total (t) from the following formula:
Population total = t = ΣN_{h} * x_{h}
where N_{h} is the number of observations in the population from stratum h,
and x_{h} is the sample mean from stratum h.
Or, if we know the sample proportion in each stratum, we can use this formula to estimate a population total:
Population total = t = ΣN_{h} * p_{h}
where t is an estimate of the number of observations in the population that have a specified attribute,
N_{h} is the number of observations from stratum h in the population,
and p_{h} is the sample proportion from stratum h.
 Mean score (cluster sampling): Use this formula to compute the sample mean from a cluster sample:
Sample mean = x̄ = [ N / ( n * M ) ] * Σ ( M_{h} * x_{h} )
where N is the number of clusters in the population,
n is the number of clusters in the sample,
M is the number of observations in the population,
M_{h} is the number of observations in cluster h,
and x_{h} is the mean score from the sample in cluster h.
 Proportion (cluster sampling): Use this formula to compute the sample proportion from a cluster sample:
Sample proportion = p = [ N / ( n * M ) ] * Σ ( M_{h} * p_{h} )
where N is the number of clusters in the population,
n is the number of clusters in the sample,
M is the number of observations in the population,
M_{h} is the number of observations in cluster h,
and p_{h} is the proportion from the sample in cluster h.
 Total score (cluster sampling): If we know the sample mean in each cluster,
we can estimate the population total (t) from the following formula:
Population total = t = N/n * ΣM_{h} * x_{h}
where N is the number of clusters in the population,
n is the number of clusters in the sample,
M_{h} is the number of observations in the population from cluster h,
and x_{h} is the sample mean from cluster h.
And, if we know the sample proportion for each cluster, we can estimate a population total:
Population total = t = N/n * ΣM_{h} * p_{h}
where t is an estimate of the number of elements in the population that have a specified attribute,
N is the number of clusters in the population,
n is the number of clusters in the sample,
M_{h} is the number of observations from cluster h in the population,
and p_{h} is the sample proportion from cluster h.
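The point estimates above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and the tuple layouts (for example, `(N_h, x_h)` pairs for strata) are assumptions made for this example, not part of the lesson.

```python
def srs_mean(values):
    """Sample mean from a simple random sample: sum(x) / n."""
    return sum(values) / len(values)


def srs_total(values, N):
    """Population total from a simple random sample: N * sample mean."""
    return N * srs_mean(values)


def stratified_mean(strata):
    """Stratified mean: sum((N_h / N) * x_h).

    strata is a list of (N_h, x_h) pairs, where N_h is the population
    size of stratum h and x_h is the sample mean (or sample proportion)
    in stratum h. N is taken to be the sum of the N_h.
    """
    N = sum(N_h for N_h, _ in strata)
    return sum(N_h / N * x_h for N_h, x_h in strata)


def cluster_total(N, clusters):
    """Cluster total: (N / n) * sum(M_h * x_h).

    N is the number of clusters in the population; clusters is a list of
    (M_h, x_h) pairs for the n sampled clusters.
    """
    n = len(clusters)
    return N / n * sum(M_h * x_h for M_h, x_h in clusters)


def cluster_mean(N, M, clusters):
    """Cluster mean: [N / (n * M)] * sum(M_h * x_h), i.e. total / M."""
    return cluster_total(N, clusters) / M


# Hypothetical data, for illustration only.
print(srs_mean([4, 6, 8]))
print(stratified_mean([(600, 10.0), (400, 20.0)]))
print(cluster_total(10, [(20, 5.0), (30, 4.0)]))
print(cluster_mean(10, 250, [(20, 5.0), (30, 4.0)]))
```

Passing proportions instead of means to the stratified and cluster functions gives the corresponding proportion and attribute-count estimates, since a proportion is a special case of the mean.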
 Estimate population variance. The formula(s) to estimate variance will vary, depending on the sampling method and
the parameter in the null hypothesis.
 Proportions. If you are testing a hypothesis about a population proportion, use this formula
to estimate population variance (s^{2}):
s^{2} = P * (1 - P)
where s^{2} is an estimate of population variance, and P is the value of the proportion in the null hypothesis.
 Simple random sampling with means or totals. If you use a simple random sample to test a hypothesis about a
mean or a total score, use this formula to estimate variance:
s^{2} = Σ ( x_{i} - x̄ )^{2} / ( n - 1 )
where s^{2} is a sample estimate of population variance, x̄ is
the sample mean, x_{i} is the ith element from the sample, and n
is the number of elements in the sample.
 Stratified sampling. If you use a stratified sample to test a hypothesis about a mean or
a total score, you will need to estimate variance within each stratum. Use this formula:
s^{2}_{h} = Σ ( x_{i}_{h} - x_{h} )^{2} / ( n_{h} - 1 )
where s^{2}_{h} is a sample estimate of population variance in stratum h,
x_{i}_{h} is the value of the ith element from stratum h,
x_{h} is the sample mean from stratum h,
and n_{h} is the number of sample observations from stratum h.
 Variance within clusters. If you use two-stage cluster sampling
to test a hypothesis about a mean or total score, you need to estimate the variance within clusters.
Use this formula:
s^{2}_{h} = Σ ( x_{i}_{h} - x_{h} )^{2} / ( m_{h} - 1 )
where s^{2}_{h} is a sample estimate of population variance in cluster h,
x_{i}_{h} is the value of the ith element from cluster h,
x_{h} is the sample mean from cluster h,
and m_{h} is the number of observations sampled from cluster h.
 Variance between clusters. If you use cluster sampling to estimate a total score,
you need to estimate the variance between clusters. Use this formula:
s^{2}_{b} = Σ ( t_{h} - t/N )^{2} / ( n - 1 )
where s^{2}_{b} is a sample estimate of the variance between sampled clusters,
t_{h} is the total from cluster h,
t is the sample estimate of the population total,
N is the number of clusters in the population,
and n is the number of clusters in the sample.
You can estimate the population total (t) from the following formula:
Population total = t = N/n * ΣM_{h} * x_{h}
where M_{h} is the number of observations in the population from cluster h,
and x_{h} is the sample mean from cluster h.
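As a rough illustration, the variance estimates above might be coded as follows. The function names are hypothetical, and the between-cluster function assumes that each t_{h} is supplied directly as an estimated cluster total (for example, M_{h} * x_{h}).

```python
def proportion_variance(P0):
    """Variance for a proportion test: P * (1 - P), with P from the null hypothesis."""
    return P0 * (1 - P0)


def sample_variance(values):
    """Sample variance: sum((x_i - mean)^2) / (n - 1).

    The same formula applies within a stratum or within a cluster,
    using only the observations sampled from that stratum or cluster.
    """
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def between_cluster_variance(cluster_totals, N):
    """Between-cluster variance: sum((t_h - t/N)^2) / (n - 1).

    cluster_totals holds t_h for each of the n sampled clusters, and N is
    the number of clusters in the population. The population total t is
    estimated as (N / n) * sum(t_h).
    """
    n = len(cluster_totals)
    t_hat = N / n * sum(cluster_totals)
    return sum((t_h - t_hat / N) ** 2 for t_h in cluster_totals) / (n - 1)


# Hypothetical data, for illustration only.
print(proportion_variance(0.8))
print(sample_variance([2, 4, 4, 6]))
print(between_cluster_variance([100, 120, 140], N=10))
```

Note that t/N is simply the average of the sampled cluster totals, so the between-cluster variance is the ordinary sample variance of those totals.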
 Compute standard error. The right formula to compute standard error will vary, depending on the
sampling method and the parameter under study.
 Simple random sampling (mean or proportion). When we estimate a mean or a proportion from a simple random sample,
the standard error (SE) of the estimate is:
SE = sqrt [ (1 - n/N) * s^{2} / n ]
where n is the sample size, N is the population size,
and s is a sample estimate of the population standard deviation.
 Simple random sampling (total score). When we use a mean or a proportion to estimate a
population total from a simple random sample, the standard error (SE) of the estimate is:
SE = sqrt [ N^{2} * (1 - n/N) * s^{2} / n ]
where N is the population size, n is the sample size,
and s^{2} is a sample estimate of the population variance.
 Stratified sampling (mean or proportion). When we estimate a mean or a proportion from a
stratified random sample, the standard error (SE) of the estimate is:
SE = (1 / N) * sqrt { Σ [ N_{h}^{2} * ( 1 - n_{h}/N_{h} ) * s_{h}^{2} / n_{h} ] }
where n_{h} is the number of sample observations from stratum h,
N_{h} is the number of elements from stratum h in the population,
N is the number of elements in the population,
and s^{2}_{h} is a sample estimate of the population variance in stratum h.
 Stratified sampling (total score). When we estimate a total from a stratified random sample, the standard error (SE) of the estimate is:
SE = sqrt { Σ [ N_{h}^{2} * ( 1 - n_{h}/N_{h} ) * s_{h}^{2} / n_{h} ] }
where N_{h} is the number of elements from stratum h in the population,
n_{h} is the number of sample observations from stratum h,
and s^{2}_{h} is a sample estimate of the population variance in stratum h.
 Cluster sampling (mean). When we estimate a population mean from a cluster sample,
the standard error (SE) of the estimate is:
SE = ( 1 / M ) * sqrt { [ N^{2} * ( 1 - n/N ) / n ] * Σ ( M_{h} * x_{h} - t / N )^{2} / ( n - 1 )
+ ( N / n ) * Σ [ ( 1 - m_{h} / M_{h} ) * M_{h}^{2} * s_{h}^{2} / m_{h} ] }
where M is the number of observations in the population,
N is the number of clusters in the population,
n is the number of clusters in the sample,
M_{h} is the number of elements from cluster h in the population,
m_{h} is the number of elements from cluster h in the sample,
x_{h} is the sample mean from cluster h,
s^{2}_{h} is a sample estimate of the population variance in cluster h,
and t is a sample estimate of the population total.
For the equation above, use the following formula to estimate the population total.
t = N/n * Σ M_{h}x_{h}
With one-stage cluster sampling, the formula for
the standard error reduces to:
SE = ( 1 / M ) * sqrt { [ N^{2} * ( 1 - n/N ) / n ] * Σ ( M_{h} * x_{h} - t / N )^{2} / ( n - 1 ) }
 Cluster sampling (proportion). When we estimate a population proportion from a cluster sample,
the standard error (SE) of the estimate is:
SE = ( 1 / M ) * sqrt { [ N^{2} * ( 1 - n/N ) / n ] * Σ ( M_{h} * p_{h} - t / N )^{2} / ( n - 1 )
+ ( N / n ) * Σ [ ( 1 - m_{h} / M_{h} ) * M_{h}^{2} * p_{h} * ( 1 - p_{h} ) / ( m_{h} - 1 ) ] }
where M is the number of observations in the population,
N is the number of clusters in the population,
n is the number of clusters in the sample,
M_{h} is the number of elements from cluster h in the population,
m_{h} is the number of elements from cluster h in the sample,
p_{h} is the value of the proportion from cluster h,
and t is a sample estimate of the population total.
For the equation above, use the following formula to estimate the population total.
t = N/n * Σ M_{h}p_{h}
With one-stage cluster sampling, the formula for the standard error reduces to:
SE = ( 1 / M ) * sqrt { [ N^{2} * ( 1 - n/N ) / n ] * Σ ( M_{h} * p_{h} - t / N )^{2} / ( n - 1 ) }
 Cluster sampling (total score). When we estimate a population total from a cluster sample,
the standard error (SE) of the estimate is:
SE = sqrt { N^{2} * [ ( 1 - n/N ) / n ] * s^{2}_{b}
+ ( N / n ) * Σ [ ( 1 - m_{h}/M_{h} ) * M^{2}_{h} * s^{2}_{h} / m_{h} ] }
where N is the number of clusters in the population,
n is the number of clusters in the sample,
s^{2}_{b} is a sample estimate of the variance between clusters,
m_{h} is the number of elements from cluster h in the sample,
M_{h} is the number of elements from cluster h in the population,
and s^{2}_{h} is a sample estimate of the population variance in cluster h.
With one-stage cluster sampling, the formula for the standard error reduces to:
SE = N * sqrt { [ ( 1 - n/N ) / n ] * s^{2}_{b} }
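To make the standard error formulas concrete, here is a minimal Python sketch. The function names and tuple layouts are assumptions made for this example. The cluster functions follow the cluster-sampling formula for a mean given above, and the cluster total is taken to be M times the cluster mean, so its standard error is M times as large; treat that relationship as an assumption of the sketch.

```python
from math import sqrt


def se_srs_mean(s2, n, N):
    """Simple random sample, mean or proportion: sqrt((1 - n/N) * s^2 / n)."""
    return sqrt((1 - n / N) * s2 / n)


def se_srs_total(s2, n, N):
    """Simple random sample, total: sqrt(N^2 * (1 - n/N) * s^2 / n)."""
    return N * se_srs_mean(s2, n, N)


def se_stratified_total(strata):
    """Stratified total; strata is a list of (N_h, n_h, s2_h) tuples."""
    return sqrt(sum(N_h ** 2 * (1 - n_h / N_h) * s2_h / n_h
                    for N_h, n_h, s2_h in strata))


def se_stratified_mean(strata):
    """Stratified mean: (1 / N) times the stratified total's standard error."""
    N = sum(N_h for N_h, _, _ in strata)
    return se_stratified_total(strata) / N


def se_cluster_total(N, clusters):
    """Two-stage cluster total.

    N is the number of clusters in the population; clusters is a list of
    (M_h, m_h, x_h, s2_h) tuples for the n sampled clusters. When every
    m_h equals M_h (one-stage sampling), the within-cluster term is zero.
    """
    n = len(clusters)
    totals = [M_h * x_h for M_h, _, x_h, _ in clusters]
    mean_total = sum(totals) / n  # equals t / N
    s2_b = sum((t_h - mean_total) ** 2 for t_h in totals) / (n - 1)
    between = N ** 2 * (1 - n / N) * s2_b / n
    within = N / n * sum((1 - m_h / M_h) * M_h ** 2 * s2_h / m_h
                         for M_h, m_h, _, s2_h in clusters)
    return sqrt(between + within)


def se_cluster_mean(N, M, clusters):
    """Cluster mean: (1 / M) times the cluster total's standard error."""
    return se_cluster_total(N, clusters) / M


# Hypothetical data, for illustration only.
print(se_srs_mean(400, 50, 100_000))
print(se_stratified_mean([(600, 30, 4.0), (400, 20, 9.0)]))
print(se_cluster_total(10, [(20, 20, 5.0, 1.0), (30, 30, 4.0, 1.0)]))
```

For a proportion, pass P * (1 - P) as the variance in the simple random sampling functions.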
 Choose a significance level. The significance level (denoted by α) is the probability of committing a
Type I error. Researchers often set the significance level equal to 0.05 or 0.01.
 Find the critical value. Often expressed as a t-score or a
z-score, the critical value is a factor used to
determine upper and lower limits of the region of acceptance.
When the null hypothesis is two-tailed, the critical value is the z-score or t-score that has a
cumulative probability
equal to 1 - α/2. When the null hypothesis is one-tailed, the critical value has a
cumulative probability
equal to 1 - α.
Researchers use a t-score when the sample size is small, and a z-score when it is large (at least 30).
You can use the Normal Distribution Calculator to find the critical z-score, and the
t Distribution Calculator to find the critical t-score.
If you use a t-score, you will have to find the
degrees of freedom (df). With simple random samples, df is
often equal to the sample size minus one.
Note: The critical value for a one-tailed hypothesis does not equal the
critical value for a two-tailed hypothesis. At the same significance level, the critical value for a one-tailed hypothesis is smaller.

Find the upper limit (UL) of the region of acceptance. There are two possibilities,
depending on the form of the null hypothesis.

If the null hypothesis is μ < M or if the null hypothesis is μ = M: The upper
limit of the region of acceptance will be:
UL = M + SE * CV
where M is the parameter value in the null hypothesis, SE is the standard error, and CV is the
critical value.

If the null hypothesis is μ > M: The theoretical upper
limit of the region of acceptance is plus infinity,
unless the parameter in the null hypothesis is a proportion or a percentage.
The upper limit is 1 for a proportion, and 100 for a percentage.

In a similar way, we find the lower limit (LL) of the region of acceptance.
There are two possibilities, depending on the form of the null
hypothesis.

If the null hypothesis is μ > M or if the null hypothesis is μ = M: The lower limit of
the region of acceptance will be:
LL = M - SE * CV
where M is the parameter value in the null hypothesis, SE is the standard error, and CV is the
critical value.

If the null hypothesis is μ < M: The theoretical lower
limit of the region of acceptance is minus infinity,
unless the test statistic is a proportion or a percentage.
The lower limit for a proportion or a percentage is zero.
The region of acceptance is the range of values between LL and UL. If the sample estimate of the population parameter
falls outside the region of acceptance, the researcher rejects the null hypothesis. If the sample estimate falls
within the region of acceptance, the researcher does not reject the null hypothesis.
By following the steps outlined above, you define the region of acceptance in such a way that the chance of making a
Type I error
is equal to the
significance level.
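The last three steps (critical value, upper limit, lower limit) can be sketched in Python as shown below. This sketch covers only the z-score case, using `statistics.NormalDist` from the standard library (Python 3.8 or later). The function names, the `null` argument with its "=", "<=", and ">=" codes, and the `bounds` argument are all assumptions made for this example.

```python
import math
from statistics import NormalDist


def critical_z(alpha, two_tailed):
    """z-score with cumulative probability 1 - alpha/2 (two-tailed) or 1 - alpha (one-tailed)."""
    p = 1 - alpha / 2 if two_tailed else 1 - alpha
    return NormalDist().inv_cdf(p)


def region_of_acceptance(M0, se, alpha, null="=", bounds=(-math.inf, math.inf)):
    """Return (LL, UL) for a null hypothesis about the value M0.

    null is "=" for a two-tailed test, "<=" when the null hypothesis says
    the parameter is at most M0, and ">=" when it says the parameter is
    at least M0. bounds gives the natural limits of the parameter,
    for example (0, 1) for a proportion.
    """
    cv = critical_z(alpha, two_tailed=(null == "="))
    lower = M0 - se * cv if null in ("=", ">=") else bounds[0]
    upper = M0 + se * cv if null in ("=", "<=") else bounds[1]
    return max(lower, bounds[0]), min(upper, bounds[1])


def reject_null(estimate, lower, upper):
    """Reject the null hypothesis when the estimate falls outside [LL, UL]."""
    return not (lower <= estimate <= upper)


# Hypothetical inputs, for illustration only.
ll, ul = region_of_acceptance(M0=50, se=2.0, alpha=0.05, null="=")
print(ll, ul, reject_null(47.0, ll, ul))
```

For small samples, a t-score would replace the z-score; the standard library has no inverse t distribution, so that case is left out of the sketch.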
Test Your Understanding
In this section, two hypothesis testing examples illustrate how to define the
region of acceptance. The first problem shows a two-tailed test with
a mean score; and the second problem, a one-tailed test with
a proportion.
Sample Size Calculator
As you probably noticed, defining the region of acceptance can be complex and
time-consuming. Stat Trek's Sample Size Calculator can do the same job quickly,
easily, and error-free. The calculator is easy to use, and it
is free. You can find the Sample Size Calculator in Stat Trek's
main menu under the Stat Tools tab.
Problem 1
An inventor has developed a new, energy-efficient lawn mower engine. He
claims that the engine will run continuously for 5 hours (300 minutes)
on a single ounce of regular gasoline. Suppose a random sample
of 50 engines is tested. The engines run for an average of 295
minutes, with a standard deviation of 20 minutes.
Consider the null hypothesis that the mean run time is 300 minutes
against the alternative hypothesis that the mean run time is not
300 minutes. Use a 0.05 level of significance. Find the region of
acceptance. Based on the region of acceptance, would you reject
the null hypothesis?
Solution: The analysis of survey data to test
a hypothesis takes seven steps. We work through those steps below:
 Estimate a population parameter. For this problem, we are given the sample mean. It is 295 minutes.
However, if we had to compute the sample mean from raw data, we could do it, using the following formula:
Sample mean = x̄ = Σx / n
where Σx is the sum of all the sample observations, and n is the number of sample observations.
 Estimate population variance. For this problem, we are given a sample estimate of the standard deviation. It is
20 minutes. Since the variance is the square of the standard deviation, we can estimate that the population
variance is 20^{2} or 400.
If we hadn't been given the standard deviation, we could have computed it from the raw sample data, using
the following formula:
s^{2} = Σ ( x_{i} - x̄ )^{2} / ( n - 1 )
where s^{2} is a sample estimate of population variance, x̄ is
the sample mean, x_{i} is the ith element from the sample, and n
is the number of elements in the sample.
 Compute standard error. The right formula to compute standard error will vary, depending on the
sampling method and the parameter under study. The right equation for a mean
score from a simple random sample is:
SE = sqrt [ (1 - n/N) * s^{2} / n ]
where n is the sample size, N is the population size, and s is a sample estimate of the population standard deviation.
For this problem, we know that the sample size is 50, and the standard deviation is 20. The population size is not stated
explicitly; but, in theory, the manufacturer could produce an infinite number of motors. Therefore, the population size
is a very large number. For the purpose of the analysis, we'll assume that the population size is 100,000. Plugging
those values into the formula, we find that the standard error is:
SE = sqrt [ (1 - n/N) * s^{2} / n ]
SE = sqrt [ (1 - 50/100,000) * 20^{2} / 50 ]
SE = sqrt(0.9995 * 8) = 2.828
 Choose a significance level. The significance level (α) is chosen for us in the problem. It is 0.05.
(Researchers often set the significance level equal to 0.05 or 0.01.)
 Find the critical value. The critical value is a factor used to
determine upper and lower limits of the region of acceptance. When the sample size is large (at least 30),
researchers can express the critical value as a t-score or a
z-score. Here, the
sample size is larger than 30 (n=50), so we will express the critical value as a z-score.
When the null hypothesis is two-tailed, the critical value has a
cumulative probability
equal to 1 - α/2. When the null hypothesis is one-tailed, the critical value has a
cumulative probability
equal to 1 - α.
For this problem, the null hypothesis and the alternative hypothesis can be expressed as:
Null hypothesis | Alternative hypothesis | Number of tails
μ = 300 | μ ≠ 300 | 2
Since this problem deals with a two-tailed hypothesis, the critical value will be the z-score that has a
cumulative probability
equal to 1 - α/2. Here, the significance level (α) is 0.05, so the
critical value will be the z-score that has a cumulative probability equal to 0.975.
We use the Normal Distribution Calculator to find that the z-score
with a cumulative probability of 0.975 is 1.96. Thus, the critical value is 1.96.
 Find the lower limit of the region of acceptance. The lower limit (LL) of the region of acceptance is:
LL = M - SE * CV
where M is the parameter value in the null hypothesis, SE is the standard error, and CV is the critical value. So, for this problem, we
compute the lower limit of the region of acceptance as:
LL = 300 - 2.828 * 1.96
LL = 300 - 5.54
LL = 294.46
 Find the upper limit of the region of acceptance. The upper limit (UL) of the region of acceptance is:
UL = M + SE * CV
where M is the parameter value in the null hypothesis, SE is the standard error, and CV is the critical value. So, for this problem, we
compute the upper limit of the region of acceptance as:
UL = 300 + 2.828 * 1.96
UL = 300 + 5.54
UL = 305.54
Thus, given a significance level of 0.05, the region of acceptance is the range of values
between 294.46 and 305.54. In the tests, the engines
ran for an average of 295 minutes. That value is within the region of acceptance, so we cannot reject
the null hypothesis that the engines run for 300 minutes on an ounce of fuel.
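As a cross-check, the arithmetic for Problem 1 can be reproduced with a short Python sketch. The function name `problem_1` is made up for this example, and the population size of 100,000 is the same working assumption used in the solution.

```python
from math import sqrt
from statistics import NormalDist


def problem_1():
    n, N = 50, 100_000          # sample size; assumed population size
    sample_mean, s = 295, 20    # sample mean and standard deviation, in minutes
    M0, alpha = 300, 0.05       # null-hypothesis mean; significance level

    se = sqrt((1 - n / N) * s ** 2 / n)        # standard error of the mean
    cv = NormalDist().inv_cdf(1 - alpha / 2)   # two-tailed critical z-score
    lower, upper = M0 - se * cv, M0 + se * cv  # region of acceptance
    reject = not (lower <= sample_mean <= upper)
    return se, cv, lower, upper, reject


se, cv, lower, upper, reject = problem_1()
print(f"SE = {se:.3f}, CV = {cv:.2f}")
print(f"Region of acceptance: {lower:.2f} to {upper:.2f}")
print(f"Reject the null hypothesis: {reject}")
```

Rounded to two decimal places, the limits agree with the hand calculation (294.46 and 305.54), and the sample mean of 295 falls inside the region.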
Problem 2
Suppose the CEO of a large software company
claims that at least 80 percent of the company's
1,000,000 customers are very satisfied. A survey of 100 randomly
sampled customers finds that 73 percent are very satisfied.
To test the CEO's hypothesis, find the region of acceptance.
Assume a significance level of 0.05.
Solution: The analysis of survey data to test
a hypothesis takes seven steps. We work through those steps below:
 Estimate a population parameter. For this problem, we are interested in the population proportion;
and we are given the sample proportion as an estimate. It is 0.73.
However, if we had to compute the sample proportion from raw data, we could do it by using the following formula:
Sample proportion = p = ( Observations with attribute ) / ( Total sample size n )
 Estimate population variance. To compute the population variance when the true population proportion is P,
we use the following formula:
s^{2} = P * (1 - P)
where s^{2} is the population variance when the true population proportion is P,
and P is the value of the proportion in the null hypothesis.
For the purpose of estimating population variance, we assume the null hypothesis is true. In this problem, the null hypothesis states
that the true proportion of satisfied customers is 0.8. Therefore, to estimate population variance, we insert that value in the formula:
s^{2} = 0.8 * (1 - 0.8)
s^{2} = 0.8 * 0.2 = 0.16
 Compute standard error. The right formula to compute standard error will vary, depending on the
sampling method and the parameter under study. The right equation for a proportion
from a simple random sample is:
SE = sqrt [ (1 - n/N) * s^{2} / n ]
where n is the sample size, N is the population size, and s is a sample estimate of the population standard deviation.
For this problem, we know that the sample size is 100, the variance (s^{2}) is 0.16, and the population size is 1,000,000.
Plugging those values into the formula, we find that the standard error is:
SE = sqrt [ (1 - n/N) * s^{2} / n ]
SE = sqrt [ (1 - 100/1,000,000) * 0.16 / 100 ]
SE = sqrt(0.9999 * 0.0016) = 0.04
 Choose a significance level. The significance level (α) is chosen for us in the problem. It is 0.05.
(Researchers often set the significance level equal to 0.05 or 0.01.)
 Find the critical value. The critical value is a factor used to
determine upper and lower limits of the region of acceptance. When the sample size is large (at least 30),
researchers can express the critical value as a t-score or a
z-score. Here, the
sample size is much larger than 30 (n=100), so we will express the critical value as a z-score.
When the null hypothesis is two-tailed, the critical value has a
cumulative probability
equal to 1 - α/2. When the null hypothesis is one-tailed, the critical value has a
cumulative probability
equal to 1 - α.
For this problem, the null hypothesis and the alternative hypothesis can be expressed as:
Null hypothesis | Alternative hypothesis | Number of tails
P ≥ 0.8 | P < 0.8 | 1
Since this problem deals with a one-tailed hypothesis, the critical value will be the z-score that has a
cumulative probability
equal to 1 - α. Here, the significance level (α) is 0.05, so the
critical value will be the z-score that has a cumulative probability equal to 0.95.
We use the Normal Distribution Calculator to find that the z-score
with a cumulative probability of 0.95 is 1.645. Thus, the critical value is 1.645.
 Find the lower limit of the region of acceptance. The lower limit (LL) of the region of acceptance is:
LL = M - SE * CV
where M is the parameter value in the null hypothesis, SE is the standard error, and CV is the critical value. So, for this problem, we
compute the lower limit of the region of acceptance as:
LL = 0.8 - 0.04 * 1.645
LL = 0.8 - 0.0658 = 0.7342
 Find the upper limit of the region of acceptance. For this type of onetailed hypothesis,
the theoretical upper limit of the region of acceptance is 1;
since any proportion greater than 0.8 is consistent with the null hypothesis, and 1 is the
largest value that a proportion can have.
Thus, given a significance level of 0.05, the region of acceptance is the range of values
between 0.7342 and 1.0. In the sample survey, the proportion of satisfied customers was 0.73.
That value is outside the region of acceptance, so the null hypothesis must be rejected.
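The same kind of cross-check works for Problem 2. In this sketch, the function name `problem_2` is made up for the example, and the null hypothesis is treated as "the proportion is at least 0.8", so only the lower limit depends on the critical value.

```python
from math import sqrt
from statistics import NormalDist


def problem_2():
    n, N = 100, 1_000_000   # sample size; population size
    p_hat = 0.73            # sample proportion of very satisfied customers
    P0, alpha = 0.80, 0.05  # null-hypothesis proportion; significance level

    s2 = P0 * (1 - P0)                     # variance under the null hypothesis
    se = sqrt((1 - n / N) * s2 / n)        # standard error of the proportion
    cv = NormalDist().inv_cdf(1 - alpha)   # one-tailed critical z-score
    lower, upper = P0 - se * cv, 1.0       # a proportion cannot exceed 1
    reject = not (lower <= p_hat <= upper)
    return se, cv, lower, upper, reject


se, cv, lower, upper, reject = problem_2()
print(f"SE = {se:.4f}, CV = {cv:.3f}")
print(f"Region of acceptance: {lower:.4f} to {upper:.1f}")
print(f"Reject the null hypothesis: {reject}")
```

The lower limit comes out at about 0.7342, matching the hand calculation, and the sample proportion of 0.73 falls just below it.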