# Sampling Distributions

Suppose that we draw all possible samples of size *n* from a given
population. Suppose further that we compute a
statistic (e.g., a mean, proportion, or standard deviation) for each
sample. The probability
distribution of this statistic is called a **sampling distribution**, and
the standard deviation of this statistic is called the **standard error**.
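
To make the definition concrete, here is a minimal sketch that enumerates every possible sample from a tiny population and tabulates the resulting sample means. The five-value population and the sample size of 2 are hypothetical, chosen only so the full list of samples stays small; the samples are drawn without replacement.

```python
from collections import Counter
from itertools import combinations
from statistics import pstdev

# Hypothetical toy population; the values are for illustration only.
population = [1, 2, 3, 4, 5]
n = 2  # sample size

# Every possible sample of size n, drawn without replacement.
samples = list(combinations(population, n))

# The statistic of interest (here, the mean) computed for each sample.
sample_means = [sum(s) / n for s in samples]

# The sampling distribution: each distinct value of the statistic and
# the fraction of all possible samples that produce it.
counts = Counter(sample_means)
sampling_distribution = {
    value: count / len(samples) for value, count in sorted(counts.items())
}

# The standard error is the standard deviation of the statistic
# taken over all possible samples.
standard_error = pstdev(sample_means)

for value, prob in sampling_distribution.items():
    print(f"sample mean = {value}: probability {prob:.1f}")
print(f"standard error = {standard_error:.3f}")
```

With a realistic population, listing every sample is not feasible, which is why the formulas in the following sections are used instead.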

## Variability of a Sampling Distribution

The variability of a sampling distribution is measured by its variance or its standard deviation. The variability of a sampling distribution depends on three factors:

- N: The number of observations in the population.
- n: The number of observations in the sample.
- The way that the random sample is chosen.

If the population size is much larger than the sample size, then the sampling distribution has roughly the same standard error whether we sample with or without replacement. On the other hand, if the sample represents a significant fraction (say, more than 1/20) of the population, the standard error will be meaningfully smaller when we sample without replacement.
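
The effect of the sampling method can be checked by brute force on a small case. The sketch below assumes a hypothetical population of five values and a sample of size 2 (so the sample is 2/5 of the population), and compares the standard error of the mean under the two sampling schemes.

```python
from itertools import combinations, product
from statistics import pstdev

population = [1, 2, 3, 4, 5]  # hypothetical toy population, N = 5
n = 2                         # the sample is 2/5 of the population

# With replacement: every ordered draw of n values is equally likely.
means_with = [sum(s) / n for s in product(population, repeat=n)]

# Without replacement: every subset of n distinct members is equally likely.
means_without = [sum(s) / n for s in combinations(population, n)]

se_with = pstdev(means_with)
se_without = pstdev(means_without)

print(f"SE with replacement:    {se_with:.3f}")
print(f"SE without replacement: {se_without:.3f}")
```

Because the sample here is a large share of the population, the standard error without replacement comes out noticeably smaller. With a population thousands of times larger than the sample, the two values would be nearly identical.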

## Sampling Distribution of the Mean

Suppose we draw all possible samples of size *n* from a population of size *N*.
Suppose further that we compute a mean score for each sample. In this way, we
create a sampling distribution of the mean.

We know the following about the sampling distribution of the mean.
The mean of the sampling distribution (μ_{x})
is equal to the mean of the population (μ).
And the standard error of the sampling distribution (σ_{x})
is determined by the standard deviation of the population (σ),
the population size (N), and the sample size (n). These relationships are shown in the
equations below:

μ_{x} = μ

σ_{x} = [ σ / sqrt(n) ] * sqrt[ (N - n ) / (N - 1) ]

In the standard error formula, the factor sqrt[ (N - n ) / (N - 1) ] is called the finite population correction or fpc. When the population size is very large relative to the sample size, the fpc is approximately equal to one; and the standard error formula can be approximated by:

σ_{x} = σ / sqrt(n).

You often see this "approximate" formula in introductory statistics texts. As a general rule, it is safe to use the approximate formula when the sample size is no bigger than 1/20 of the population size.
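
As a rough illustration of how much the finite population correction matters, the sketch below computes the exact and approximate standard errors for a few sample sizes. The function names `fpc` and `standard_error_mean` and the input values are assumptions made for this example, not part of any library.

```python
from math import sqrt

def fpc(N, n):
    """Finite population correction: sqrt[(N - n) / (N - 1)]."""
    return sqrt((N - n) / (N - 1))

def standard_error_mean(sigma, n, N=None):
    """Standard error of the mean.

    If N is None, the population is treated as very large and the
    approximate formula sigma / sqrt(n) is used.
    """
    se = sigma / sqrt(n)
    return se if N is None else se * fpc(N, n)

sigma, N = 20, 10_000  # hypothetical population values
for n in (50, 500, 5_000):
    exact = standard_error_mean(sigma, n, N)
    approx = standard_error_mean(sigma, n)
    print(f"n/N = {n / N:.3f}  fpc = {fpc(N, n):.4f}  "
          f"exact SE = {exact:.3f}  approximate SE = {approx:.3f}")
```

At the 1/20 threshold (n = 500 here), the fpc is about 0.975, so the approximate formula overstates the standard error by only about 2.5%. When the sample is half the population, the overstatement is far larger.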

## Sampling Distribution of the Proportion

In a population of size *N*, suppose that the probability of the occurrence
of an event (dubbed a "success") is P; and the probability of the event's
non-occurrence (dubbed a "failure") is Q. From this population, suppose that we
draw all possible samples of size *n*. And finally, within each sample,
suppose that we determine the proportion of successes *p* and failures *q*.
In this way, we create a sampling distribution of the proportion.

We find that the mean of the sampling distribution of the proportion (μ_{p})
is equal to the probability of success in the population (P). And the standard
error of the sampling distribution (σ_{p})
is determined by the standard deviation of the population (σ),
the population size, and the sample size. These relationships are shown in the
equations below:

μ_{p} = P

σ_{p} = [ σ / sqrt(n) ] * sqrt[ (N - n ) / (N - 1) ]

σ_{p} = sqrt[ PQ/n ] * sqrt[ (N - n ) / (N - 1) ]

where σ = sqrt[ PQ ].

Like the formula for the standard error of the mean, the formula for the standard error of the proportion uses the finite population correction, sqrt[ (N - n ) / (N - 1) ]. When the population size is very large relative to the sample size, the fpc is approximately equal to one; and the standard error formula can be approximated by:

σ_{p} = sqrt[ PQ/n ]

You often see this "approximate" formula in introductory statistics texts. As a general rule, it is safe to use the approximate formula when the sample size is no bigger than 1/20 of the population size.
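
The same calculation can be sketched for a proportion. The helper name `standard_error_proportion` and the example inputs (P = 0.3, n = 100, N = 1,000) are hypothetical.

```python
from math import sqrt

def standard_error_proportion(P, n, N=None):
    """Standard error of a sample proportion.

    P is the population probability of success and Q = 1 - P.
    If N is None, the population is treated as very large and the
    finite population correction is omitted.
    """
    Q = 1 - P
    se = sqrt(P * Q / n)
    if N is None:
        return se
    return se * sqrt((N - n) / (N - 1))

# Sample of 100 from a population of 1,000 (1/10 of the population),
# so the correction makes a visible difference.
print(f"with fpc:    {standard_error_proportion(0.3, 100, N=1_000):.4f}")
print(f"without fpc: {standard_error_proportion(0.3, 100):.4f}")
```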

## Central Limit Theorem

The **central limit theorem** states that the
sampling distribution of the mean of
independent observations drawn at random from any population
(with finite variance) will be normal or nearly normal,
if the sample size is large enough.

How large is "large enough"? The answer depends on two factors.

- Requirements for accuracy. The more closely the sampling distribution needs to resemble a normal distribution, the more sample points will be required.
- The shape of the underlying population. The more closely the original population resembles a normal distribution, the fewer sample points will be required.

In practice, some statisticians say that a sample size of 30 is large enough when the population distribution is roughly bell-shaped. Others recommend a sample size of at least 40. But if the original population is distinctly not normal (e.g., is badly skewed, has multiple peaks, and/or has outliers), researchers like the sample size to be even larger.
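
A small simulation can show this effect. The sketch below is only an illustration under stated assumptions: it uses an exponential population with mean 1 as a stand-in for a "badly skewed" population, draws 5,000 random samples per sample size rather than all possible samples, and uses skewness (0 for a normal distribution) as a crude measure of how far the sample means are from normal.

```python
import random
from math import sqrt
from statistics import mean, pstdev

rng = random.Random(0)  # fixed seed so repeated runs give the same numbers

def skewness(values):
    """Population skewness; 0 for a symmetric distribution such as the normal."""
    m, s = mean(values), pstdev(values)
    return mean([((v - m) / s) ** 3 for v in values])

def simulate_sample_means(n, reps=5_000):
    """Means of `reps` random samples of size n from an exponential population."""
    return [sum(rng.expovariate(1.0) for _ in range(n)) / n for _ in range(reps)]

results = {}
for n in (2, 10, 40):
    means = simulate_sample_means(n)
    results[n] = (mean(means), pstdev(means), skewness(means))
    avg, se, skew = results[n]
    print(f"n = {n:2d}: mean = {avg:.3f}, SE = {se:.3f} "
          f"(theory {1 / sqrt(n):.3f}), skewness = {skew:.2f}")
```

The skewness of the sample means shrinks toward 0 as n grows, which is the sense in which a larger sample compensates for a non-normal population.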

## How to Choose Between T-Distribution and Normal Distribution

The t distribution and the normal distribution can both be used with statistics that have a bell-shaped distribution. This suggests that we might use either the t-distribution or the normal distribution to analyze sampling distributions. Which should we choose?

Guidelines exist to help you make that choice. Some focus on the population standard deviation.

- If the population standard deviation is known, use the normal distribution.
- If the population standard deviation is unknown, use the t-distribution.

Other guidelines focus on sample size.

- If the sample size is large, use the normal distribution. (See the discussion above in the section on the Central Limit Theorem to understand what is meant by a "large" sample.)
- If the sample size is small, use the t-distribution.

In practice, researchers employ a mix of the above guidelines. On this site, we use the normal distribution when the population standard deviation is known and the sample size is large. We might use either distribution when the population standard deviation is unknown and the sample size is very large. We use the t-distribution when the sample size is small, unless the underlying distribution is not normal. The t-distribution should not be used with small samples from populations that are not approximately normal.

## Test Your Understanding

In this section, we offer two examples that illustrate how sampling distributions are used to solve common statistical problems. In each of these problems, the population standard deviation is known and the sample size is large. So you can use the Normal Distribution Calculator, rather than the t-Distribution Calculator, to compute probabilities for these problems.

Would it be wrong to use the t-distribution when you know the population standard deviation and the sample size is large? Not at all. When the sample size is large, the t-distribution and the normal distribution yield approximately the same results.

**Example 1**

Assume that a school district has 10,000 6th graders. In this district, the
average weight of a 6th grader is 80 pounds, with a standard deviation of 20
pounds. Suppose you draw a random sample of 50 students. What is the
probability that the average weight of a sampled student will be less than 75
pounds?

*Solution:* To solve this problem, we need to define the sampling
distribution of the mean. Because our sample size is greater than
30, the Central Limit Theorem tells us that the sampling distribution will
approximate a normal distribution.

To define our normal distribution, we need to know both the mean of the sampling distribution and the standard deviation. Finding the mean of the sampling distribution is easy, since it is equal to the mean of the population. Thus, the mean of the sampling distribution is equal to 80.

The standard deviation of the sampling distribution can be computed using the following formula.

σ_{x} = [ σ / sqrt(n) ] * sqrt[ (N - n ) / (N - 1) ]

σ_{x} = [ 20 / sqrt(50) ] * sqrt[ (10,000 - 50 ) / (10,000 - 1) ]

σ_{x} = (20/7.071) * (0.9975) = 2.82

Let's review what we know and what we want to know. We know that the sampling distribution of the mean is normally distributed with a mean of 80 and a standard deviation of 2.82. We want to know the probability that a sample mean is less than 75 pounds.

Because we know the population standard deviation and the sample size is large, we'll use the normal distribution to find the probability. To solve the problem, we plug these inputs into the Normal Distribution Calculator: mean = 80, standard deviation = 2.82, and normal random variable = 75. The Calculator tells us that the probability that the average weight of a sampled student is less than 75 pounds is equal to 0.038.

**Note:** Since the population size is more than 20 times greater than the sample size,
we could have used the "approximate" formula σ_{x} = [ σ / sqrt(n) ]
to compute the standard error. Had we done that, we would have found a standard error equal to
[ 20 / sqrt(50) ] or 2.83.
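
For readers who prefer code to a calculator, the steps of Example 1 can be sketched with Python's standard library. `NormalDist` from the `statistics` module stands in for the calculator here; the variable names are this sketch's own.

```python
from math import sqrt
from statistics import NormalDist

N, n = 10_000, 50      # population size and sample size
mu, sigma = 80, 20     # population mean and standard deviation (pounds)

# Standard error of the mean, including the finite population correction.
se = (sigma / sqrt(n)) * sqrt((N - n) / (N - 1))

# Probability that the sample mean falls below 75 pounds.
prob = NormalDist(mu=mu, sigma=se).cdf(75)

print(f"standard error = {se:.2f}")
print(f"P(sample mean < 75) = {prob:.3f}")
```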

**Example 2**

Find the probability that of the next 120 births, no more than 40% will be
boys. Assume equal probabilities for the births of boys and girls. Assume
also that the number of births in the population (N) is very large, essentially
infinite.

*Solution:* The Central Limit Theorem tells us that the proportion of boys
in 120 births will be approximately normally distributed.

The mean of the sampling distribution will be equal to the mean of the population distribution. In the population, half of the births result in boys; and half, in girls. Therefore, the probability of boy births in the population is 0.50. Thus, the mean proportion in the sampling distribution should also be 0.50.

The standard deviation of the sampling distribution (i.e., the standard error) can be computed using the following formula.

σ_{p} = sqrt[ PQ/n ] * sqrt[ (N - n ) / (N - 1) ]

Here, the finite population correction is equal to 1.0, since the population size (N) was assumed to be infinite. Therefore, the standard error formula reduces to:

σ_{p} = sqrt[ PQ/n ]

σ_{p} = sqrt[ (0.5)(0.5)/120 ] = sqrt[0.25/120 ] = 0.04564

Let's review what we know and what we want to know. We know that the sampling distribution of the proportion is normally distributed with a mean of 0.50 and a standard deviation of 0.04564. We want to know the probability that no more than 40% of the sampled births are boys.

Because we know the population standard deviation and the sample size is large, we'll use the normal distribution to find the probability. To solve the problem, we plug these inputs into the Normal Distribution Calculator: mean = 0.5, standard deviation = 0.04564, and normal random variable = 0.4. The Calculator tells us that the probability that no more than 40% of the sampled births are boys is equal to 0.014.

**Note:** This problem can also be treated as a
binomial experiment. Elsewhere, we showed
how to analyze a binomial experiment. The binomial experiment
is actually the more exact analysis. It produces a probability
of 0.018 (versus a probability of 0.014 that we found using the normal distribution). Without a computer,
the binomial approach is computationally demanding. Therefore,
many statistics texts emphasize the approach presented above,
which uses the normal distribution to approximate the binomial.
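
With a computer, both analyses of Example 2 take only a few lines. The sketch below computes the normal approximation and the exact binomial probability side by side, using only the standard library; the variable names are this sketch's own, and 48 is hard-coded as 40% of 120 births.

```python
from math import comb, sqrt
from statistics import NormalDist

n, P = 120, 0.5   # number of births and probability that a birth is a boy
max_boys = 48     # 40% of 120 births

# Normal approximation: proportion ~ Normal(P, sqrt(PQ/n)).
se = sqrt(P * (1 - P) / n)
normal_prob = NormalDist(mu=P, sigma=se).cdf(0.40)

# Exact binomial: sum the probabilities of 0, 1, ..., 48 boys.
binomial_prob = sum(
    comb(n, k) * P**k * (1 - P) ** (n - k) for k in range(max_boys + 1)
)

print(f"normal approximation: {normal_prob:.3f}")
print(f"exact binomial:       {binomial_prob:.3f}")
```

The gap between the two results arises mostly because the normal curve is continuous while the number of boys is a whole number.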