The Bonferroni correction (also known as the Bonferroni adjustment, Bonferroni test, or Bonferroni method) is a way to control error rate familywise in experiments that test multiple comparisons.
This lesson is all about the Bonferroni correction: what it is, why it is needed, when to use it, and how to implement it.
Prerequisites: This lesson assumes familiarity with multiple comparisons for follow-up testing in ANOVA. You should know how to represent a statistical hypothesis mathematically by a comparison. You should be able to compute the sum of squares associated with a comparison. You should understand how the probability of committing a Type I error is affected by the number of comparisons tested. And you should know how to use an F ratio to test multiple comparisons. If you don't know these things, review the following lessons:
- Comparison of Treatment Means. This lesson defines an ordinary comparison. It explains how to represent a statistical hypothesis mathematically by a comparison. And it explains how to compute the sum of squares for a comparison.
- Multiple Comparisons. This lesson describes how the probability of committing a Type I error is affected by the number of comparisons tested.
- F Ratio for Planned Comparisons. This lesson explains how to use an F ratio with analysis of variance to test statistical hypotheses represented by planned comparisons.
What is Bonferroni's Correction?
Bonferroni's correction is an adjustment to the significance level used to evaluate the statistical significance of an individual comparison. Here's how it works:
- Step 1. Set a significance level for the error rate familywise.
- Step 2. Divide the significance level by the number of comparisons to be tested.
- Step 3. Use the value from Step 2 as the significance level to test individual comparisons.
For example, suppose an experimenter set the error rate familywise at 0.05. If the experiment included four comparisons to be tested, the significance level to test each individual comparison would be 0.05/4 or 0.0125.
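The three steps above can be sketched in a few lines of Python. The function name `bonferroni_alpha` is a hypothetical helper introduced here for illustration; it is not part of any library.

```python
def bonferroni_alpha(familywise_alpha: float, n_comparisons: int) -> float:
    """Return the significance level for each individual comparison.

    Step 1: familywise_alpha is the chosen error rate familywise.
    Step 2: divide it by the number of comparisons to be tested.
    Step 3: the result is used to test each individual comparison.
    """
    if n_comparisons < 1:
        raise ValueError("n_comparisons must be at least 1")
    return familywise_alpha / n_comparisons


# Example from the text: error rate familywise of 0.05, four comparisons.
alpha_per_comparison = bonferroni_alpha(0.05, 4)
print(alpha_per_comparison)  # 0.0125
```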
Why Do We Need Bonferroni's Correction?
When comparisons in the family are orthogonal, the probability of incorrectly rejecting at least one null hypothesis is easily calculated as:
ERF = 1 - (1 - α)^C
where ERF is the probability of making at least one Type I error when testing C orthogonal hypotheses (i.e., ERF is the error rate familywise), α is the significance level for a single hypothesis test, and C is the number of orthogonal comparisons being tested.
Suppose an experimenter wants to set the error rate familywise at 0.05. What happens if the experimenter tests four true, independent null hypotheses, using a significance level of 0.05 for each test? The probability of incorrectly rejecting at least one null hypothesis is greater than 0.05, because there are four chances to go wrong. The probability of making at least one Type I error is:
ERF = 1 - (1 - 0.05)^4 = 0.185
However, what happens if the experimenter uses the Bonferroni correction? The experimenter specifies a significance level of 0.05/4, or 0.0125, for each individual comparison. The probability of making at least one Type I error is then:
ERF = 1 - (1 - 0.0125)^4
ERF = 1 - (0.9875)^4 = 0.049
With the Bonferroni correction, the chances of making at least one Type I error are approximately equal to the desired error rate familywise of 0.05.
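The two calculations above can be checked with a short Python sketch. It assumes, as the text does, that the comparisons are orthogonal (independent); the function name `familywise_error_rate` is a hypothetical helper.

```python
def familywise_error_rate(alpha: float, n_comparisons: int) -> float:
    """Probability of at least one Type I error across independent tests.

    Implements ERF = 1 - (1 - alpha)^C, which assumes the C comparisons
    are orthogonal and every null hypothesis is true.
    """
    return 1 - (1 - alpha) ** n_comparisons


# Four tests, each at the unadjusted 0.05 level.
uncorrected = familywise_error_rate(0.05, 4)

# Four tests, each at the Bonferroni-adjusted level of 0.05 / 4.
corrected = familywise_error_rate(0.05 / 4, 4)

print(f"uncorrected: {uncorrected:.3f}")  # uncorrected: 0.185
print(f"corrected:   {corrected:.3f}")    # corrected:   0.049
```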
When to Use Bonferroni's Correction
In some situations, the Bonferroni correction is a good technique for testing the statistical significance of multiple comparisons. In other situations, it is not so good.
There are several things to like about the Bonferroni correction, including the following:
- The Bonferroni correction does a great job of controlling error rate familywise. The experimenter sets the significance level for a family of hypothesis tests, and the Bonferroni correction specifies the right significance level for each individual hypothesis test.
- The Bonferroni correction can be used with planned comparisons and with post hoc comparisons.
- The Bonferroni correction is easy to apply. It just requires simple arithmetic.
For an experimenter who is most concerned with controlling error rate familywise, the Bonferroni correction is a good choice.
The Bonferroni correction has one major disadvantage: It increases the likelihood of Type II errors. Each additional comparison lowers the significance level for every individual test, so the more hypotheses you test, the more likely it is that you will fail to reject at least one hypothesis that should be rejected.
When an experiment calls for many hypothesis tests, the Bonferroni correction may be a poor choice.
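To see why many tests are a problem, it may help to look at how quickly the significance level for an individual comparison shrinks as the number of comparisons grows. The sketch below assumes an error rate familywise of 0.05; the comparison counts are arbitrary illustrative choices, and the helper name is hypothetical.

```python
FAMILYWISE_ALPHA = 0.05  # assumed error rate familywise


def per_comparison_alphas(familywise_alpha, comparison_counts):
    """Map each number of comparisons to its Bonferroni-adjusted level."""
    return {count: familywise_alpha / count for count in comparison_counts}


# Illustrative counts only; they do not come from the experiment in this lesson.
levels = per_comparison_alphas(FAMILYWISE_ALPHA, [2, 5, 10, 20])

for count, alpha in levels.items():
    print(f"{count:>2} comparisons -> significance level per test = {alpha:.4f}")
```

With 20 comparisons, each test must reach a significance level of 0.0025, so only very large effects (or very large samples) will lead to rejection.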
What Do Statisticians Say?
While many statisticians offer conflicting advice about when to use the Bonferroni correction, most agree on the following points:
- The Bonferroni correction is an effective technique for controlling error rate familywise.
- The Bonferroni correction is most appropriate when an experiment calls for only a few hypothesis tests. It controls error rate familywise, yet holds error rate per comparison to reasonably acceptable levels.
- The Bonferroni correction is least appropriate when an experiment calls for many hypothesis tests. The correction reduces statistical power, making hypothesis tests unacceptably vulnerable to Type II errors.
Note: Statistical power is affected by sample size. The bigger the sample, the greater the power. As a result, tests that use the Bonferroni correction are more vulnerable to Type II errors when the sample size is small, and less vulnerable when it is large.
A Step-By-Step Example
In this section, we'll work through a simple example to illustrate how to apply the Bonferroni correction to test multiple comparisons in a single study.
To test the long-term effect of aerobic exercise on resting pulse rate, an investigator conducts a controlled experiment. The experiment uses a completely randomized design, consisting of three treatment groups:
- Control. Subjects do not participate in an exercise program.
- Low-effort. Subjects jog 1 mile on Monday, Wednesday, and Friday.
- High-effort. Subjects jog 2 miles every day, except Sunday.
Five subjects are randomly assigned to each group. After 28 days of treatment, their resting pulse rate is measured on day 29.
Before collecting any data, the investigator poses the research questions to be answered, states statistical hypotheses implied by each research question, and identifies the analytical technique used to test each statistical hypothesis.
For this experiment, the researcher is interested in three research questions. Those questions, and the associated statistical hypotheses, appear below:
- Overall research question. Will mean pulse rate in one treatment group differ from mean pulse rate in any other treatment group?
H0: μi = μj
H1: μi ≠ μj
- Follow-up question 1. Will mean pulse rate of subjects in the control group (Group 1) differ from the mean pulse rate of subjects in the low-effort group (Group 2)?
H0: μ1 = μ2
H1: μ1 ≠ μ2
- Follow-up question 2. Will mean pulse rate of subjects in the control group (Group 1) differ from the mean pulse rate of subjects in the high-effort group (Group 3)?
H0: μ1 = μ3
H1: μ1 ≠ μ3
Note: The two follow-up questions are nonorthogonal. Therefore, we will not use an unadjusted F ratio to test their hypotheses.
The overall research question asks whether the mean pulse rate in one treatment group differs from the mean pulse rate in any other group. The null hypothesis implied by this research question can be tested by an omnibus analysis of variance, using a significance level of 0.05.
The remaining questions are follow-up questions. The null hypothesis associated with each follow-up question can be represented mathematically by a unique comparison. To determine whether to reject the null hypothesis for a follow-up question, we test its associated comparison for statistical significance.
Assume that the investigator wants to maintain an error rate familywise of 0.05 when testing follow-up questions. To accomplish this objective, the experimenter will apply a Bonferroni correction to the significance level for each follow-up test. Since the experiment calls for two follow-up tests, the significance level for each follow-up test becomes 0.05/2 or 0.025.
Note: The experimenter is comfortable using a Bonferroni correction for follow-up tests, because the experiment calls for a small number of follow-up tests (only two).
Pulse rate measurements for each subject in each treatment group appear below:
Table 1. Pulse Rate for Each Subject in Each Group
|Group 1 (control)||Group 2 (low effort)||Group 3 (high effort)|
The overall research question is: Will mean pulse rate in one treatment group differ from mean pulse rate in any other treatment group? The statistical hypotheses implied by that question are:
H0: μi = μj
H1: μi ≠ μj
We can test this null hypothesis with a standard, omnibus analysis of variance. Here is the ANOVA table from that analysis.
Table 2. ANOVA Summary Table
The P-value for the between-groups (BG) effect is 0.046, which is less than the significance level of 0.05. Therefore, we reject the null hypothesis of no difference in pulse rates between treatment groups.
Note: We explained how to conduct a one-way analysis of variance in previous lessons. If you're wondering how to produce the ANOVA table shown above, see One-Way Analysis of Variance: Example or One-Way Analysis of Variance With Excel.
For this experiment, the investigator planned to conduct two follow-up tests to supplement the omnibus analysis of variance. In case you've forgotten, here are the two follow-up questions:
- Follow-up question 1. Will mean pulse rate of subjects in the control group (Group 1) differ from the mean pulse rate of subjects in the low-effort group (Group 2)?
- Follow-up question 2. Will mean pulse rate of subjects in the control group (Group 1) differ from the mean pulse rate of subjects in the high-effort group (Group 3)?
Each of these questions can be addressed by testing the statistical significance of a particular comparison. To illustrate the process, we'll work through a step-by-step analysis for the first follow-up question.
Step 1. Compute Mean Scores
Mean pulse rate within each group (computed from raw scores in Table 1) appears below:
Table 3. Mean Pulse Rate in Each Treatment Group
|Group 1 (control)||Group 2 (low effort)||Group 3 (high effort)|
Step 2. Define a Comparison
Next, we define a comparison that represents our research question. For the first follow-up question, we want to compare the mean score in the control group (Group 1) with the mean score in the low-effort treatment group (Group 2). Therefore, this is the comparison we need to use:
L1 = X1 - X2
L1 = 90 - 80 = 10
where L1 is the value of the comparison, X1 is the mean score in Group 1, and X2 is the mean score in Group 2.
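More generally, a comparison is a weighted sum of group means. The sketch below uses a hypothetical helper, `comparison_value`, with the coefficients 1 and -1 for Groups 1 and 2. Group 3 has a coefficient of zero in this comparison, so it is left out of the calculation.

```python
def comparison_value(coefficients, means):
    """Return the value of a comparison: the sum of coefficient * group mean."""
    if len(coefficients) != len(means):
        raise ValueError("need one coefficient per group mean")
    return sum(c * m for c, m in zip(coefficients, means))


# Group means used in the text for Group 1 (control) and Group 2 (low effort).
means = [90, 80]
coefficients = [1, -1]  # L1 = X1 - X2

l1 = comparison_value(coefficients, means)
print(l1)  # 10
```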
Step 3. Compute Sum of Squares
With a balanced design, the sum of squares for a given comparison ( Li ) can be computed from the following formula:
SSi = n * Li^2 / Σ cij^2
where SSi is the sum of squares for comparison Li , Li is the value of the comparison, n is the sample size in each group, and cij is the coefficient (weight) for level j in the formula for comparison Li.
Plugging values from our sample problem into the formula, we get:
SS1 = 5 * 10^2 / [ (1)^2 + (-1)^2 ]
SS1 = 500 / 2 = 250
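The same arithmetic can be written as a small Python function. This is a minimal sketch that assumes a balanced design (equal sample size in every group); the function name is hypothetical.

```python
def comparison_sum_of_squares(n_per_group, comparison, coefficients):
    """Sum of squares for a comparison in a balanced design.

    Implements SS = n * L^2 / (sum of squared coefficients).
    """
    sum_sq_coefficients = sum(c ** 2 for c in coefficients)
    if sum_sq_coefficients == 0:
        raise ValueError("at least one coefficient must be nonzero")
    return n_per_group * comparison ** 2 / sum_sq_coefficients


# n = 5 subjects per group, L1 = 10, coefficients (1, -1, 0) for Groups 1-3.
ss1 = comparison_sum_of_squares(5, 10, [1, -1, 0])
print(ss1)  # 250.0
```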
Step 4. Produce ANOVA Summary Table
The summary table from an omnibus analysis of variance includes two outputs that we can use to test the statistical significance of a comparison. Those outputs are (1) the value of the within-groups mean square and (2) the degrees of freedom for the within-groups mean square.
We generated the ANOVA summary table earlier. For convenience, here it is again.
Table 2. ANOVA Summary Table
Step 5. Find the F Ratio
The F ratio for a comparison equals its sum of squares divided by the within-groups mean square (from the ANOVA table).
F(1, v2) = SSi / MSWG
where F is the value of the F ratio, SSi is the sum of squares for comparison i, and MSWG is the within-groups mean square. The numerator of any F ratio for a comparison has one degree of freedom. The degrees of freedom (v2) for the denominator equal the degrees of freedom (from the ANOVA table) for the within-groups mean square.
For this problem, the F ratio is:
F(1, 12) = 250 / 125 = 2.0
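As a quick check, the F ratio can be computed directly from the two inputs named above. The values 250 (sum of squares for the comparison) and 125 (within-groups mean square) are the ones used in the text; the function name is a hypothetical helper.

```python
def comparison_f_ratio(ss_comparison, ms_within):
    """F ratio for a comparison: its sum of squares over the within-groups mean square.

    A comparison has one degree of freedom, so its mean square equals its
    sum of squares.
    """
    if ms_within <= 0:
        raise ValueError("within-groups mean square must be positive")
    return ss_comparison / ms_within


f1 = comparison_f_ratio(250, 125)
print(f1)  # 2.0
```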
Step 6. Find the P-Value
With a planned comparison, the P-value is the probability that an F statistic would be more extreme (i.e., bigger) than the actual F ratio computed from experimental data.
We can use Stat Trek's F Distribution Calculator to find the probability that an F statistic will be bigger than the actual F ratio observed in the experiment. Enter the numerator degrees of freedom (1), the denominator degrees of freedom (12), and the observed F ratio (2.0) into the calculator; then, click the Calculate button.
From the calculator, we see that the P ( F > 2.0 ) equals about 0.18. Therefore, the P-Value is 0.18.
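If the calculator is not at hand, the same probability can be approximated in Python. The sketch below is one possible approach, not the method the calculator uses: it relies on the fact that an F statistic with 1 and v2 degrees of freedom is the square of a t statistic with v2 degrees of freedom, and it integrates the t density numerically with Simpson's rule. All function names are hypothetical. In practice, a statistics library (for example, `scipy.stats.f.sf` in SciPy) would normally be used instead.

```python
import math


def t_density(x, df):
    """Probability density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def f_upper_tail_1df(f_ratio, df_within, n_intervals=2000):
    """Approximate P(F > f_ratio) for an F statistic with (1, df_within) df.

    Because F(1, v) = t(v)^2, P(F > f) = 1 - 2 * P(0 < t < sqrt(f)).
    The inner probability is integrated with Simpson's rule, which needs
    an even number of intervals.
    """
    if f_ratio < 0:
        raise ValueError("f_ratio must be non-negative")
    if n_intervals % 2:
        raise ValueError("n_intervals must be even")
    upper = math.sqrt(f_ratio)
    if upper == 0:
        return 1.0
    h = upper / n_intervals
    total = t_density(0.0, df_within) + t_density(upper, df_within)
    for i in range(1, n_intervals):
        weight = 4 if i % 2 else 2
        total += weight * t_density(i * h, df_within)
    area = total * h / 3
    return 1 - 2 * area


p_value = f_upper_tail_1df(2.0, 12)
print(f"{p_value:.2f}")  # 0.18
```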
Step 7. Test the Hypothesis
If the P-value for a comparison is less than the significance level, we reject the associated hypothesis. Otherwise, we fail to reject.
In this example, the P-value (0.18) is greater than the adjusted significance level (0.025) for a follow-up test. Therefore, we cannot reject the null hypothesis that the mean score in the control group (Group 1) is equal to the mean score in the low-effort treatment group (Group 2).
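The decision rule can be expressed as a single comparison against the adjusted significance level. In this sketch, the function name is hypothetical; the P-value of 0.18 comes from Step 6, and the error rate familywise of 0.05 with two follow-up tests comes from the experimental plan.

```python
def reject_null(p_value, familywise_alpha, n_comparisons):
    """Return True if the null hypothesis is rejected at the Bonferroni-adjusted level."""
    adjusted_alpha = familywise_alpha / n_comparisons
    return p_value < adjusted_alpha


# First follow-up test: P-value of 0.18, adjusted significance level 0.05 / 2 = 0.025.
decision = reject_null(0.18, 0.05, 2)
print(decision)  # False: we fail to reject the null hypothesis
```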
What About the Other Follow-up Test?
As part of this experiment, the investigator planned to conduct two follow-up tests to supplement the omnibus analysis of variance. In the paragraphs above, we described a seven-step process for conducting the first follow-up test. You would use the same seven-step process to conduct the second follow-up test.
ANOVA Summary Table
Follow-up tests are often reported in an enhanced ANOVA summary table. The enhanced table shows all of the results from a standard ANOVA summary table. In addition, it shows results (sum of squares, mean square, degrees of freedom, F ratio, and P-value) for each follow-up comparison (L1 and L2).
Here is the enhanced ANOVA summary table for the present experiment.
Table 4. Enhanced ANOVA Summary Table
In this example, the between-groups effect (BG) is statistically significant (p=0.046) at the 0.05 significance level, indicating that the mean pulse rate in at least one group is significantly different from the mean pulse rate in another group. The comparison effects (L1 and L2) are tested using an adjusted significance level of 0.025. At the adjusted significance level, only the L2 effect is statistically significant (p=0.02), indicating that the mean pulse rate in the high-effort group is significantly different from the mean pulse rate in the control group. Based on these findings, the investigator concludes that the high-effort treatment affects resting pulse rate.
Note: The mean square for a comparison is computed just like the mean square for any other treatment effect:
MS = SS / df
where MS is the mean square, SS is the sum of squares, and df is the degrees of freedom.
Every comparison has one degree of freedom. Therefore, the mean square for a comparison equals the sum of squares for the comparison.