A Choosing models and tests
A.1 Introduction
One of the more difficult skills in data analysis is deciding which statistical models and tests to use in a particular situation. This book has introduced a range of different approaches, and has demonstrated a variety of biological questions that can be addressed with these tools. Here we draw together the statistical tools we’ve encountered and explain how to match them to particular kinds of question or hypothesis. However, before diving into an analysis, there are a few things to consider…
A.1.1 Do we need to carry out a statistical analysis?
This may seem like an odd question to ask, having just spent a considerable amount of time learning statistics. But it is an important one. There are many situations in which we don’t need, or can’t use, statistical tools. Here are two common ones:
There are no statistical procedures that will allow us to analyse our data correctly36. This happens sometimes. Even with careful planning, things don’t always work out as anticipated and we end up with data that cannot be analysed with a technique we know about. If the data can’t be analysed in a sensible way there is no point doing any old analysis just because we feel we have to produce a p-value. Instead, it’s time to seek some advice.
We could quite correctly apply a statistical test to your data, but it would be entirely superfluous. We don’t always need statistics to tell us what is going on. An effect may be exceptionally strong and clear, or it may be that the importance of the result is not something captured by applying a particular statistical model or test37. This caveat is particularly relevant to exploratory studies, where the goal is to use the data to generate new hypotheses, rather than test a priori hypotheses.
Also in the category of superfluous statistics are situations where we are using statistics in a technically correct way, but we’re evaluating effects that simply are not interesting or relevant to the question in hand. This often arises from a misplaced worry that unless we have lots of statistics in our study it somehow isn’t ‘scientific’, so we apply them to everything in the hope that a reader will be impressed. Resist the temptation! This strategy will have the opposite effect—a competent reader will just assume you don’t know what you’re doing if they see a load of pointless analyses.
A.2 Getting started…
If there is a need for statistical analysis then the first thing to do is read the data into R and then… resist the temptation to start generating p-values! There are a number of things to run through before leaping into a statistical analysis:
- Be sure to carefully review the data after importing it into R (functions like
View
andglimpse
is good for this). There are a number of things to look out for…
- _Understand how the data is organised_ Most importantly, is the data set 'tidy'? Each variable should be one column and each row should correspond to one observation. The majority of statistical modelling tools in R expect the data to be organised in this format, as do **dplyr** and **ggplot2**. If it isn't already tidy, a data set will need to be reorganised first. We can always do this by hand, but it is almost always quicker to do it with R (the **tidyr** package is really good at this).
- _Understand how R has encoded the variables_. Examine the data using functions like `glimpse` or `str`. Pay close attention to the variable types---is it numeric, a character, or a factor? If a variable is not appropriate for the planned analysis, make any necessary changes. For example, if we plan to treat a variable as a factor because we're going to carry out ANOVA, but it has been read in as a number, we'd need to convert it to a factor before preceding.
- _Check whether or not there are any missing values_. These appear as `NA` in R. If they are present, were they expected? If not, check the original data source to determine what has happened. If we're absolutely certain the missing values represent an error in the way the data were coded then it might be sensible to fix the source data. However, if there is any doubt about how they arose, it is better to leave the source data alone and deal with the miscoded `NA`s in the R script.
- _Ensure there are no miscoded values_. In an ideal world we should leave the source data alone and fix these miscoded values in the R script. Why? Because changing the source data is slow and runs the risk of introducing new errors. It's easy to edit an R script and rerun it. Editing source data is time consuming and error prone. If this is too difficult then fix these in the source data. Either way, there's no point starting an analysis until the data are error free.
Spend some time thinking about the variables in the data set. Which ones are relevant to the question in hand? If appropriate, decide which variable is the dependent variable (the ‘y’ variable) and which variable(s) is (are) the independent variable(s)? What kind of variables are we dealing with—ratio or interval scale numeric variables, ordinal or nominal categorical variables? It is much easier to determine which analysis options are available once these details are straightened out.
Make at least one figure to visualise the data. We have done this throughout this book. This wasn’t to fill the time—it is a crucial step in any data analysis exercise. Informative figures allow us spot potential problems with the data and they give us a way to evaluate our question before diving into an analysis. If we can’t see an appropriate way to visualise the data, we probably aren’t ready to start doing the statistics!
Steps 1-3 are all critical components of ‘real world’ data analysis. It may be tempting to skip these and just get on with the statistics, especially when pressed for time. Don’t do this! The ‘skip to the stats’ strategy always leads to a lot of time being wasted, either because we fail to spot potential problems or because we end up carrying out an inappropriate analysis.
A.3 A key to choosing statistical models and tests
The choice of statistical model/test is affected by two things:
The kind of question we are asking.
The nature of data we have:
what type of variables: ratio, interval, ordinal or nominal?
are the assumptions of a particular model or test satisfied by the data?
The schematic key (below) provides a overview of the statistical models and tests we’ve covered in this book, structured in the form of a key. The different choices in the key are determined by a combination of the type of question being asked, and the nature of the data under consideration.
The notes that follow the key expand on some of the issues it summarises and explain some of the trickier elements of deciding what to do. The key is quite large, so it is hard to read easily a web browser (it also doesn’t render very well in Firefox for some reason). Either download a PDF copy of the key or right click on the figure to open it on its own in a new tab and then zoom in.
A.4 Four main types of question
Question 1: Are there significant differences between the means or medians (‘central tendency’) of a variable in two or more groups, or between a single mean or median and a theoretical value?
This first question is relevant when we have measurements of one variable (e.g. plant height) on each experimental unit (e.g. individual plants) and experimental units are in different groups. If there is more than one group, one or more variables (factors) are used to encode group membership (given by the factor levels). Keep in mind that these grouping factors are distinct from the variable being analysed—they essentially describe the study design. This type of question includes anything where a comparison is being made between the variable in one group and…
a single theoretical value
another group whose values are independent of those in the first group (independent design)
more than one other group
a second group which has values that form logical pairs with those in the first group (paired design)38
The measurement scale of the variable in these situations may be ratio, interval, or ordinal. The only variable for which the statistical tools described here would not be suitable are categorical variables
Question 2: Is there a association between two variables? What is the equation that describes the dependence of one variable on the other?
Where Question 1 is concerned with comparing distinct groups, where we have measurements of one variable on each experimental units, Question 2 occurs where we’ve taken two different measurements for each experimental unit (e.g., plant size and number of seeds produced). Here we are interested in asking whether, and perhaps how, the two variables are associated with each other.
Here, again, the measurement scale of the variable in these situations may be ratio, interval, or ordinal.
Question 3: Are there significant differences between the observed and expected number of objects or events classified by one or more categorical variables?
Question 3 is the only question which is focused on the analysis of categorical variables. Here we have situations where the ‘measurements’ on objects can be things like colour, species, sex, etc. In these situations we analyse the data by counting up how many of the objects fall into each category, or combination of categories. The frequency of counts across the categories can then be tested against the against some predicted pattern.
A.5 Question 1 –- Comparison of group means or medians
A.5.1 Question 1 How many groups?
Within the set of situations covered by Question 1, there are some further subdivisions: we need to decide whether: we have one group only (and a theoretical or expected value to compare it against), we have two groups, or we have more than two groups. Usually this should be a fairly straightforward decision.
Confusion sometimes arises in situations similar to Festuca experiment, where the observations are classified by two factors (pH and and Calluna). What exactly do we mean by ‘a group’ in this situation? The answer is that if we have more than one factor, then we must think of each group as being the set of observations defined by each combination of factor levels. So in the case of the Festuca experiment, there were four different combinations of the two factors (with and without Calluna, at either pH level). The simplest thing to remember is that if we have more than one factor (regardless of how many levels the factors have) then we must be dealing with more than two groups.
A.5.2 [Question 1] Single group
When we have a single group (1a) the only thing that remains to be done is check the type and distribution of the variable. If the data are approximately normally distributed then the obvious test is a one-sample t-test. If the data are not suitable for a t-test, even after transformation, then we could use a Wilcoxon test (we studied this in terms of a paired design, but remember, a paired experimental design is ultimately reduced to a one-sample test).
A.5.3 [Question 1] Two groups
If we have two groups then there is a further choice to be made: whether there is a logical pairing between variable measured in the two groups, or whether the data can be regarded as independent. This sometimes causes problems, particularly where the pairing is not of the obvious sort. One useful rule-of-thumb is to ask whether there is more than one way the data could be ‘paired’. If there is any uncertainty about how the pairing should be done, that is probably an indication that it isn’t a paired situation. The most common problem, however, is failing to recognise pairing when it exists.
When faced with paired design the test involves first deriving a new variable from the differences among pairs, and then using this variable in a one-sample test. Either a one-sample t-test or Wilcoxon paired-sample test is required, depending on whether the new variable (the differences) is approximately normally distributed or not.
If the data are independent then a two-sample t-test or Mann-Whitney -test will be the best approach.
A.5.4 [Question 1] More than two groups
The first decision here is about the structure of the data. This sometimes causes problems. There are a variety of different situations in which we may be interested in testing for differences between several means (or perhaps medians). Most of the time these will involve either a one-way comparison in which each observation can be classified as coming from one set of treatments (one factor), or a two-way comparison in which each value comes from a combination of two different sets of treatments (two factors). It is easy to mix these situations up if we’re not paying attention. One way to try and establish the structure of the data is to draw up a table…
If the data fit neatly into the sort of table below, where there is one factor (e.g. factor ‘A’) which has two or more levels (e.g. level 1, 2, 3) then we have a one-way design. The treatments designated by the levels of the factor in this situation are typically related in some way (e.g. concentrations of pesticide, temperatures). The only question it makes sense to address with these data is whether there are differences among the means of the three treatments.
FACTOR A | |
---|---|
Treatment 1 | 1,4,6,2,9 |
Treatment 2 | 7,3,8,9,4 |
Treatment 3 | 5,3,7,6,4 |
If there are two different types of treatment factor (factors ‘A’ and ‘B’) and within each factor there are two or more levels then we have a two-way design. The treatments designated by the levels of a particular factor are typically related in some way, but the set of treatments associated with each factors are typically not related to one another. We could not draw up a table like the first one (above) and fit these data into it, because each observation occurs simultaneously in one treatment level from each of the two factors (below).
Treatment B1 | Treatment B2 | Treatment B3 | |
---|---|---|---|
Treatment A1 | 1,4,6 | 3,9,1 | 2,2,7 |
Treatment A2 | 7,3,8 | 2,3,6 | 9,3,4 |
Treatment A3 | 5,3,7 | 1,8,6 | 2,2,6 |
From the data in this design, we can ask more than one question (i.e. there are several different tests associated with this design). It clearly makes sense to ask questions about the main effects and the interaction:
main effect of factor A: difference among row means
main effect of factor B: difference among column means
interaction: differences among means for each different combination of treatments of factors A and B (individual cells of the table).
When a data set can be described by the second table, but there is only one value in each cell of the table, then we still have a two-way design, but now without replication. It is possible to analyse this design, but we there is no way to assess the interaction between treatments—we can only test main effects with this kind of design. This design is most often associated with a Randomized Complete Block Design with one treatment factor (the thing we care about) and one blocking factor (a nuisance factor).
Having established whether we have a one-way or two-way design we need to determine whether the data are likely to satisfy the assumptions of the model we’re presented with. We can start to make this evaluation by plotting the raw data (e.g. using a series dot plots). Occasionally it will be obvious at this stage that the data are not suitable for ANOVA. However, things are often not so clear so it’s better to fit the ANOVA model and use regression diagnostics to check whether the assumptions are satisfied.
In the case of a one-way design we have the option of a non-parametric Kruskal-Wallis test when the data are not suitable for ANOVA. Otherwise we need to find a suitable transformation and then use a one-way ANOVA. Remember, before turning to a Kruskal-Wallis test it is a good idea to see if the data can be made suitable for ANOVA by transformation.
In the case of the two-way design, we don’t have the option of a non-parametric test. If the data are suitable (or can be made suitable by transformation), then use a two-way ANOVA, otherwise there is no choice but to break the analysis down into its component parts, which can be analysed as one-way designs using non-parametric methods. This is far from ideal though because we lose all information about the interactions by doing this (which somewhat defeats to purpose of designing a two-way experiment).
(N.B. In the special case of a two-way design without replication, then there is a non-parametric test—Friedman’s test—that can be used instead of normal two-way ANOVA.)
Finally, we should consider the option of multiple comparison tests. These should only be used if the global significance test from an ANOVA is significant. Additionally, in the two-way ANOVA, there is a further consideration, which is whether the main effects (one or both of them) or the interaction are significant. If the interaction is significant then the multiple comparison should be done for the interaction means (i.e. the means in each treatment combination). If the interaction is not significant, then the significance of the differences between the main effect means (whichever are significant) should be evaluated.
A.6 Question 2 – Associations between two variables?
Assuming it is an association we’re after—not a difference between groups (see box below)—then the main decision we need to make is whether to use a correlation or regression analysis.
A.6.1 [Question 2] Testing \(y\) as a function of \(x\), or an association between \(x\) and \(y\)?
The choice between regression and correlation techniques, for analysing the relationship between two variables, depends on the nature of the data and the purpose of the analysis. If the purpose of the analysis is to find (and describe mathematically) the relationship that describes the dependence of one variable (\(y\)) on another (\(x\)), then this points to a regression being the most appropriate technique. If we just want to know whether there is an association between two variables, then this would suggest correlation.
However, it is not just the goal of an analysis that matters, unfortunately! The two techniques also make different assumptions about the data. For example, regression makes the assumption that the \(x\)-variable is measured with little error relative to the \(y\)-variable, but doesn’t require the \(x\)-variable to be normally distributed. Pearson’s correlation assumes that both \(x\) and \(y\) are normally distributed.
So the final decision about which method to use may depend on trying to match up both the question and the nature of the data. It sometimes happens that we want to ask a question that requires a regression approach (e.g. we need an equation that describes a relationship) but the data are not suitable. In this situation it can be worth proceeding with regression, bearing in mind that the answer we get may be less accurate than it should be (though careful use of transformations may improve things).
Once we decide a correlation is appropriate, then the choice of parametric or non-parametric test should be based on the extent to which the data match the assumptions of the parametric test.
A final point here that can cause difficulties is the issue of dependence of one variable on another. In biological terms we are often interested in the relationship between two variables, one of which we know is biologically dependent on the other. However, designating one variable the dependent variable (\(y\)) and the other the independent variable (\(x\)) does not imply that we think ‘y depends on x’, or that ‘x causes y’. For example, tooth wear in a mammal is dependent on age. However, imagine that we have collected a number of samples of mammalian teeth from individuals of a species which have died for various reasons, and for which we also have an estimate of age at death. We may want to find the an equation for the relationship between age (\(y\)) and tooth wear (\(x\)). Why? Well for example, in a population study you might often recover remains of individuals that have died, from which you can obtain the teeth (even if not much else remains). It could be useful to be able to use the measurement of tooth wear to estimate the age of the individuals, and to do this you want the equation that describes the ‘dependence’ of age on tooth wear. So here the direction of dependence in the analysis is not the same as the causal dependence of the two variables. The point is that the choice of analysis does not determine the direction of biological dependence—it is up to us to do the analysis in a way that makes sense for the purpose of the study.
Qustion 1 or Question 2?
Although it seems straightforward to choose between Question 1 and Question 2, it does sometimes cause a problem in the situation there are two groups, the data are paired, and the same variable has been measured in each ‘group’. Because two sets of measurements on the same objects (say individual organisms) fit the structure a paired-sample t-test or a regression (or correlation), it is very important to identify clearly the effect we want to test for. A concrete example will help make this clearer…
The situation where confusion most easily arises is when the same variable has been measured in both groups. Imagine we’ve collected data on bone strength from males and females in twenty families, where the males and females are siblings—a brother and sister from each family. The pairing clearly makes sense because the siblings are genetically related and likely to have grown up in similar environments. Consider the following two situations…
-
If our goal is to test whether males and females tend to have different bone strengths, then a paired-sample t-test makes sense: it compares males with females controlling for differences due to relatedness and environment.
-
If our goal is to test whether bone strength runs in families then the t-test is no use. In this case it makes sense to evaluate whether there is a correlation in the bone strength of sibling pairs (i.e. if one sibling has high bone strength then does the other as well?).
So while the data can be analysed in either way, it is the question we’re asking that is the critical thing to consider. Just because data can be analysed in a particular way, doesn’t mean that analysis will tell us what we want to know. One way to do tackle this sort of situation is to imagine what the result of each test using those data might look like and think about how one might interpret that result. Does that interpretation answer the right question?
A.7 Question 3 -– Frequencies of categorical data
This kind of question relates to categorical or qualitative data, (e.g. the number of male versus female offspring; the number of black versus red ladybirds, the number of plants of different species). The data are frequencies (counts, not means) of the number of objects or events belonging to each category. The principle of testing such data is that the observed frequencies in each category are compared with expected (predicted) frequencies in each category.
Deciding between goodness of fit tests and contingency tables is generally fairly straightforward once we’ve determined whether counts are classified by a single categorical variable, or more than one categorical variable. If there is more than one categorical variable, then it should be possible to classify each observation (organism, event, habitat, location, etc.) into one category of each categorical variable, and that allocation should be unique. Each observation should fit in only one combination of categories.
There is a further difference which may also affect our choice. In the case of a single factor, the question we ask is whether the numbers of objects in each category differ from some expected value. The expected values might be that there are equal numbers in each category, but could be something more complicated—it is entirely dependent on the question we’re asking. In the case of the two-variable classification, the test addresses one specific question: is there an association between the two factors? The expected numbers are generated automatically based on what would be expected if the frequencies of objects in the different categories of one factor were unaffected by their classification by the other factor.
A.8 Variables or categories?
One final issue that sometimes causes questions is when a variable is treated as categorical when it is really a continuous measure or when a continuous variable is made into categories. There is indeed some blurring of the boundaries here. Two situations are discussed below.
A.8.1 ANOVA vs. regression
There are many situations in which data may be suitable for analysis by regression or one-way ANOVA, even though they are different kinds of models. For example, if a farmer wishes to determine the optimal amount of fertiliser to add to fields to achieve maximum crop yield, he might set up a trial with 5 control plots and 5 replicate plots for each of 4 levels of fertiliser treatment: 10, 20, 40 and 80 kg ha NPK (nitrogen, phosphorus and potassium fertiliser) and measure the crop yield in each replicate plot at the end of the growing season (kg ha year).
If we are simply interested in determining whether there is a significant difference in yields from different fertiliser treatments, and if so which dose from the levels we have used is best, then ANOVA (and multiple comparisons) is probably the best technique. On the other hand we might be interested in working out the general relationship between fertiliser dose and yield, perhaps in order to be able to make predictions about the yield at other doses than those we have tested. If the relationship between fertiliser and yield was linear, or could be made so by transformation, then we could use a regression to determine whether there was a significant relationship between the two, and describe it mathematically39.
A.8.2 Making categories out of continuous measures
Sometimes you will have data which are, at least in principle, continuous measurements (e.g. the abundance of an organism at different sites), but have been grouped into categories (e.g. abundance categories such as 1-100, 101-200, 201-300, etc. ). One question is whether these count as categories – and hence whether for example you could look at the association between abundance categories and habitat by taking the frequencies of samples in each abundance category across two different habitats and examining the association using a contingency table test. The answer is that this would indeed be a perfectly legitimate procedure (though it may not be the most powerful).
It is a good idea not to group numeric data into categories if it can be avoided because this throws away information. However, it isn’t always possible to make a proper measurement, though we can at least assign observations to ordinal categories. For example, when observing flocks of birds we might find that it’s impossible to count them properly, but we can reliably place the numbers of birds in a flock into abundance categories (1-100, 101-200, etc. ). Many variables that we treat as ordinal categories could in principle be measured in a more continuous form: ‘unpigmented’, ‘lightly pigmented’, ‘heavily pigmented’; ‘yellow’, ‘orange’, ‘red’; ‘large’, ‘small’… These are all convenient categories, but in one sense they are fairly arbitrary. This doesn’t mean that we can’t construct an analysis using these categories. However, one thing to bear in mind is that when we divide a continuous variable into categories, decisions where to draw the boundaries will affect the pattern of the results.
And, of course, there are many ‘true’ categorical data: on the whole things like male/female, species, dead/alive, can be regarded as fairly unambiguous categories.
Alternatively, an appropriate technique exists, but we don’t know about it!↩︎
As the old joke goes: What does a statistician call it when ten mice have their heads cut off and one survives? Not significant.↩︎
In a paired design the two groups aren’t really separate, independent entities, in the sense that pairs of measurements have been taken from the same physical ‘thing’ (site, animal, tree, etc).↩︎
One additional potential advantage of regression in this kind of situation is that it might result in more a powerful statistical test of fertiliser effects than ANOVA. This is because a regression model only ‘uses up’ two degrees of freedom—one for each of the intercept and slope—while ANOVA uses four (n-1). A regression makes stronger assumptions about the data though, because it assumes a linear relationship between crop yield and fertiliser.↩︎