Chapter 19 Multiple comparison tests
19.1 Introduction
In the [One-way ANOVA in R] chapter we learned how to use ANOVA to examine the global hypothesis of no difference between means—we did not learn how to evaluate which means might be driving such a significant result. For example, we found evidence for a significant difference between the means in the corn crake example, but we were not able to say which supplements are better or worse, and did not make any statements about which supplements differ significantly from each other. The purpose of this chapter is to examine one method for assessing where the differences actually lie.
The general method we will use is called a post hoc multiple comparisons test. The phrase ‘post hoc’ refers to the fact that these tests are conducted without any particular prior comparisons in mind. The words ‘multiple comparisons’ refer to the fact that they consider many different pairwise comparisons. There are quite a few multiple comparison tests—Scheffé’s test, the Student-Newman-Keuls test, Duncan’s new multiple range test, Dunnett’s test, … (the list goes on and on). Each one is applicable to particular circumstances, and none is universally accepted as ‘the best’. We are going to work with the most widely used test: the Tukey multiple comparison test. This test is also known as Tukey’s Honestly Significant Difference (Tukey HSD) test16.
People tend to favour Tukey’s HSD test because it is ‘conservative’: the test has a low false positive rate compared to the alernatives. A false positive occurs when a test turns up a statistically significant result for an effect that is not really there. A low false positive rate is a good thing. It means that if we find a significant difference we can be more confident it is ‘real’. The cost of using the Tukey HSD test is that it isn’t as powerful as alternatives: the test turns up a lot of false negatives. A false negative occurs when a test fails to produce a statistically significant result for an effect when it is really present. A test with a high false negative rate tends to miss effects.
There is one line of thinking that says post hoc multiple comparisons tests of any kind should never be undertaken. We shouldn’t carry out an experiment without a prior prediction of what will happen—we should know which comparisons need to be made and should only undertake those particular comparisons rather than making every possible comparison. Nonetheless, post hoc multiple comparisons test are easy to apply and widely used, so there is value in knowing how to use them. The Tukey HSD test at least tends to guard against picking up non-existent effects.
19.2 Tukey’s HSD in R
Walk through example
We are going to work with the corn crake example again. If you kept the CORN_CRAKE.CSV file from before all you need to do is make sure your working directory is set to this location. Otherwise you’ll need to download it again.
Let’s work through the corn crake example again. Read the data into R we and fit an ANOVA model using the lm
function.
# 1. read in data
corn_crake <- read.csv(file = "CORN_CRAKE.CSV")
# 2. fit anova model
corncrake_model <- lm(WeightGain ~ Supplement, data = corn_crake)
We stored the model object in corncrake_model
. In the [One-way ANOVA in R] chapter we saw how to evaluate whether the means differ using anova
on this object. We use a couple of new functions to carry out a Tukey HSD test. First, we have to convert the linear model object into a different kind of model object using the aov
function:
We don’t really need to understand what this is doing—aov
prepares the model so that we can perform a Tukey HSD test. Notice that we gave the new object its own name (corncrake_aov
) because we need to use it in the next step.
It is easy to perform a Tukey HSD test once we have the ‘aov’ version of our model. There are a few different options. Here is how to do this using the TukeyHSD
function:
Pay attention! We applied the TukeyHSD
function to the ‘aov’ object, not the original lm
object. We have suppressed the output for now. Before we review it we need to get an idea of what it is going to show us.
The ordered = TRUE
tells TukeyHSD
that we want to order the treatment means from smallest to largest, and then apply every pairwise comparison, starting with the smallest mean (‘Allvit’) and working up through the order. Here are the means ordered from smallest to largest, working left to right:
Supplement | Allvit | None | Linseed | Sizefast | Earlybird |
Mean | 14.3 | 15.6 | 18.5 | 19.0 | 22.9 |
So the TukeyHSD
with ordered = TRUE
will first compare ‘Allvit’ to ‘None’, then ‘Allvit’ to ‘Linseed’, then ‘Allvit’ to ‘Sizefast’, then ‘Allvit’ to ‘Earlybird’, then ‘None’ to ‘Linseed’, then ‘None’ to ‘Sizefast’, … and so on, until we get to ‘Sizefast’ vs. ‘Earlybird’. Let’s look at the output:
## Tukey multiple comparisons of means
## 95% family-wise confidence level
## factor levels have been ordered
##
## Fit: aov(formula = corncrake_model)
##
## $Supplement
## diff lwr upr p adj
## None-Allvit 1.375 -4.627565 7.377565 0.9638453
## Linseed-Allvit 4.250 -1.752565 10.252565 0.2707790
## Sizefast-Allvit 4.750 -1.252565 10.752565 0.1771593
## Earlybird-Allvit 8.625 2.622435 14.627565 0.0018764
## Linseed-None 2.875 -3.127565 8.877565 0.6459410
## Sizefast-None 3.375 -2.627565 9.377565 0.4971994
## Earlybird-None 7.250 1.247435 13.252565 0.0113786
## Sizefast-Linseed 0.500 -5.502565 6.502565 0.9992352
## Earlybird-Linseed 4.375 -1.627565 10.377565 0.2447264
## Earlybird-Sizefast 3.875 -2.127565 9.877565 0.3592201
This table look confusing at first glance. It enables you to look up every pair of treatments, and see whether they are significantly different from each other. Lets see how this works… The first four lines compare the Allvit treatment (‘Allvit’) with each of the other treatments in turn:
## diff lwr upr p adj
## None-Allvit 1.375 -4.627565 7.377565 0.9638453
## Linseed-Allvit 4.250 -1.752565 10.252565 0.2707790
## Sizefast-Allvit 4.750 -1.252565 10.752565 0.1771593
## Earlybird-Allvit 8.625 2.622435 14.627565 0.0018764
So to look up the difference between the control treatment and the ‘Allvit’ treatment, we read the first results row in the table. This says the means differ by 1.375, the confidence interval associated with this difference is [-4.63, 7.38], and that the comparison has a p-value of 0.96. So in this case we would conclude that there was a no significant difference between the control treatment and the ‘Allvit’ treatment. We could look up any comparison of the ‘Allvit’ treatment with a different treatment in the next three lines of this portion of the table.
This basic logic extends to the rest of the table. If we want to know whether the ‘Earlybird’ treatment is different from the control, we look up the ‘Earlybird-None’ row:
## diff lwr upr p adj
## Earlybird-None 7.250 1.247435 13.252565 0.0113786
It looks like the means of the ‘Earlybird’ and ‘None’ levels are significantly different at the p < 0.05 level.
Now we know how to look up any set of comparisons we need to see whether the difference is significant. The next question is: How should we summarise such a table?
19.3 How to summarise multiple-comparison results
Summarising the results of multiple comparison tests can be a tricky business. The first rule is: don’t present the results like the TukeyHSD
function does! A clear summary of the results will help us to interpret them correctly and makes it easier to explain them to others. How should we do this? Let’s list the means in order from smallest to largest again:
Supplement | Allvit | None | Linseed | Sizefast | Earlybird |
Mean | 14.3 | 15.6 | 18.5 | 19.0 | 22.9 |
Now, using the table of Tukey results we can perform a sequence of pair-wise comparisons between the supplements starting with the smallest pair… ‘Allvit’ and ‘None’. The appropriate test is in the first table:
## diff lwr upr p adj
## 1.3750000 -4.6275646 7.3775646 0.9638453
The last column gives the p-value, which in this case is certainly not significant (it is much greater than 0.05), so we conclude there is no difference between the ‘Allvit’ and ‘None’ treatments. So now we continue with ‘Allvit’, but compare it to the next larger mean (‘Linseed’). In this case the values are:
## diff lwr upr p adj
## 4.250000 -1.752565 10.252565 0.270779
The last column gives the p-value, which again is not significant, so we conclude there is no difference between the ‘Allvit’ and ‘Linseed’ treatments. So now we continue with ‘Allvit’, but compare it to the next larger mean (‘Sizefast’). In this case the values are:
## diff lwr upr p adj
## 4.7500000 -1.2525646 10.7525646 0.1771593
Once again, this difference is not significant, so we conclude there is no difference between the ‘Allvit’ and ‘Sizefast’ treatments either. So again, we continue with ‘Allvit’, which we compare to the next larger mean (‘Earlybird’).
## diff lwr upr p adj
## 8.6250000 2.6224354 14.6275646 0.0018764
This time the p-value is clearly less than 0.05 so we conclude that this pair of treatments are significantly different. We record this by marking ‘Allvit’, ‘None’, ‘Linseed’ and ‘Sizefast’ to indicate that they don’t differ from each other. We’ll use the letter ‘b’ to do this.
Supplement | Allvit | None | Linseed | Sizefast | Earlybird |
Mean | 14.3 | 15.6 | 18.5 | 19.0 | 22.9 |
b | b | b | b |
The ‘b’ defines a group of treatment means—‘Allvit’, ‘None’, ‘Linseed’ and ‘Sizefast’—that are not significantly different from one another. It doesn’t matter which letter we use by the way (the reason for using ‘b’ here will become apparent in a moment).
The means are ordered from smallest to largest which means we can forget about ‘None’, ‘Linseed’ and ‘Sizefast’ treatments for a moment—if they are not significantly different from ‘Allvit’ they can’t be significantly different from one another.
We move on to ‘Earlybird’, but now, we work back down the treatments to see if we can define another overlapping group of means that are not significantly different from one another. When we do this, we find that ‘Earlybird’ is not significantly different from ‘Linseed’ and ‘Sizefast’, but that it is significantly different from ‘None’. This forms a second ‘not significantly different’ group. We will denote this with a new letter (‘a’) in our table:
Supplement | Allvit | None | Linseed | Sizefast | Earlybird |
Mean | 14.3 | 15.6 | 18.5 | 19.0 | 22.9 |
b | b | b | b | ||
a | a | a |
If there were additional treatments with a mean that was greater than ‘Earlybird’ we would have to carry on this process, working back up from ‘Earlybird’. Thankfully, there are no more treatments, so we are finished.
This leaves us with a concise and complete summary of where the differences between treatments are, which greatly simplifies the task of interpreting the results. Treatments that share a letter form groups within which means are not different from each other. Treatments that are not linked are significantly different.
19.4 Doing it the easy way…
The results table we just produced is concise and complete, but no reasonable person would say they were easy to arrive at. As you might expect, someone has written an R function to do this for us. It isn’t part of ‘base R’ though, so we have to install a package to use it. The package we need is called agricolae
:
Once this has been installed we use the library
function to tell R that we want to use the package in the current session:
Now that the package is loaded we can carry out the Tukey HSD test and find the ‘not significantly different’ groups using the HSD.test
function:
##
## Study: corncrake_aov ~ "Supplement"
##
## HSD Test for WeightGain
##
## Mean Square Error: 17.43571
##
## Supplement, means
##
## WeightGain std r Min Max
## Allvit 14.250 5.599745 8 5 22
## Earlybird 22.875 3.482097 8 17 27
## Linseed 18.500 3.023716 8 15 24
## None 15.625 3.420004 8 10 20
## Sizefast 19.000 4.780914 8 10 25
##
## Alpha: 0.05 ; DF Error: 35
## Critical Value of Studentized Range: 4.065949
##
## Minimun Significant Difference: 6.002565
##
## Treatments with the same letter are not significantly different.
##
## WeightGain groups
## Earlybird 22.875 a
## Sizefast 19.000 ab
## Linseed 18.500 ab
## None 15.625 b
## Allvit 14.250 b
The console = TRUE
argument tells the function to print the results for us. That’s a lot of output, but we can ignore most of it. The part that matters most is the table at the very end. This shows the group identities as letters, the treatment names, and the treatment means. If we compare that table with the one we just made, we can see they convey the same information. The package labels each group with a letter. For example, we can see that ‘Linseed’ and ‘SizeFast’ are both members of the ‘a’ and ‘b’ group. Hopefully it is obvious why we used ‘b’ and then ‘a’ when building the table manually above.
So, there is no need to build these Tukey HSD tables by hand. We just use the HSD.test
function in the agricolae
package. So why did we do it the long way? Well, the usual reasoning applies: it is important to know how to do this because it improves our understanding of what the ‘letters notation’ actually means.
19.5 Summarising and presenting the results of a Tukey test
As with any statistical test it will usually be necessary to summarise the result from the Tukey HSD test in a written form. With an ANOVA test of global significance and multiple comparisons of the means, this can become quite complex and hard to follow. In most cases it is best to summarise the ANOVA results and either the main differences between means, or concentrate on those comparisons which relate to the original hypothesis we were interested in, and then refer to a table or figure for the additional detail. For example…
There was a significant effect of supplement on the weight gain of hatchlings (ANOVA: F=5.1; d.f.= 4,35; p<0.01) (Figure 1). The only supplement that led to a significantly higher rate of weight gain than the control group was Earlybird (Tukey multiple comparisons test, p < 0.05).
When it comes to presenting the results in a report, we really need some way of presenting the means, and the results of the multiple comparison test, as the statement above cannot entirely capture the form of the results. The information can often be conveniently incorporated into a table or figure, using more or less the same format as the output from the HSD.test
function in the agricolae
package.
An example table might be:
Supplement | Mean weight gain (g) |
---|---|
Earlybird | 22.9a |
Sizefast | 19.0ab |
Linseed | 18.5ab |
None | 15.6b |
Allvit | 14.3b |
Note that, however we present it, we need to provide some explanation saying: (a) what test we did, (b) what the letter codes mean, and (c) the critical threshold we used to judge significance. In this case the information could be presented in a table legend:
Table 1: Mean weight gain of hatchlings in the five supplement treatments. Means followed by the same letter did not differ significantly (Tukey test, p>0.05).
Letter coding can also be used effectively in a figure. Again, we must ensure all the relevant information is given in the figure legend.
19.6 Significant ANOVA but no differences in a Tukey test?
It is possible to get a significant result from the global ANOVA test but find no significant differences in a Tukey test, though it doesn’t happen often. ANOVA and the Tukey HSD test (or indeed other multiple comparison tests) are different tests, with different null hypotheses. Because of this it is possible to end up with a significant result from ANOVA, indicating at least one difference between means, but fail to get any differences detected by the Tukey test. This usually happens when the ANOVA result is marginal (close to p = 0.05). If it happens then aside from running the experiment again with more replication, there isn’t much we can do except make the best interpretation we can from inspecting the data, and be suitably cautious in the conclusions we draw17. If in doubt seek some expert advice!
N.B. Try to avoid a common mistake: it is Tukey, after its originator, the statistician Prof. John Tukey, not Turkey, a large domesticated bird which has made no useful contributions to statistical theory or practice.↩︎
It might be tempting to run a new post hoc analysis using a different kind of test. Don’t do this. It is a terrible strategy for doing statistics because this kind of practise is guaranteed to increase the overall false positive rate.↩︎