Our next review exercise will focus on ANOVA and simple linear regression. You learned to do these in your introductory statistics classes, but they were probably taught as two distinct types of analysis, one that is good for comparing means between groups (ANOVA), and one that is good for measuring the straight-line relationship between two numeric variables (regression). For interpretation of results, ANOVA focuses on differences between group means, and regression focuses on the slope of the best fit line (which measures change in the response per unit change in the predictor variable) and the coefficient of determination (r2, which measures the strength of the relationship between predictor and response).
Even though they seem to be really different, one hint that regression and ANOVA share underlying similarity is that we produce an ANOVA table for both of them. In both cases, we calculate how much variation in a measured response variable can be accounted for by a predictor variable. This is fairly obvious with a regression analysis, because both the predictor and response are numeric - in any regression data set they will each be in a column, with the variable name used as the column header. It's less obvious with ANOVA, because it is possible to arrange ANOVA data with a column of data for each group being compared. However, it is better to think of ANOVA as having a numeric response variable, whose values are all stored in one column, and a categorical predictor, stored in another column. The categorical predictor is called a factor by R, and factors are made up of the groups that are being compared (the levels of the factor). If we were doing an experiment on crop yield for fields fertilized with one of three different fertilizers, we would put all the measurements of yield into one column (called Yield), and the three different fertilizers would all be in another (called Treatment, or something like that), with the type of fertilizer used on the fields being the levels. Each row of the data would be one field, with the fertilizer type indicated in the Treatment column, and the amount of crops harvested from the field recorded in the Yield column.
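The fertilizer example above can be sketched as a small data frame in this "long" format (the treatment labels and yield numbers here are invented purely for illustration):

```r
# Toy version of the crop-yield example: one row per field, with the
# categorical predictor in Treatment and the numeric response in Yield.
crops <- data.frame(
  Treatment = rep(c("A", "B", "C"), each = 2),  # fertilizer used on each field
  Yield     = c(4.2, 4.5, 5.1, 5.3, 3.8, 4.0)   # made-up harvest amounts
)
crops  # six rows: each field's fertilizer and its yield, side by side
```
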
So, if we consider both regression and ANOVA to be an analysis of the effect of a predictor variable on a response measured on our experimental subjects, the fact that both use an ANOVA table is easier to understand. In both cases, we partition the variation in the response variable into a component that can be explained by the predictor and a component that is unpredictable, random variation. We then compare these two different sources of variation with an F test. The ANOVA table collects and organizes all of the information needed for the F test into a standard form that allows the reader to see exactly how the p-value is obtained. As we will be learning shortly, the fact that ANOVA and regression both generate an ANOVA table is not a coincidence: they are actually special cases of the same linear model.
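This partitioning can be sketched in a few lines of R with made-up numbers (nothing here comes from today's data sets):

```r
# Invented data: two groups of three measurements each, purely for illustration.
resp  <- c(4.1, 3.9, 4.3, 5.8, 6.1, 5.9)          # numeric response
group <- factor(c("a", "a", "a", "b", "b", "b"))  # categorical predictor

fit <- aov(resp ~ group)
tab <- anova(fit)  # the ANOVA table for this little model

# Total variation splits exactly into explained + residual components:
ss.total    <- sum((resp - mean(resp))^2)
ss.model    <- tab$`Sum Sq`[1]  # variation explained by the predictor
ss.residual <- tab$`Sum Sq`[2]  # leftover, unexplained variation
all.equal(ss.total, ss.model + ss.residual)  # TRUE - the partition is exact

# The F statistic compares the two sources of variation as mean squares
# (sum of squares divided by degrees of freedom):
f.stat <- (ss.model / 1) / (ss.residual / 4)
f.stat  # matches tab$`F value`[1]
```
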
But, for today we will review what you have already learned about ANOVA and regression - we will save the demonstration that ANOVA and regression are special cases of the same model for next week.
One-way ANOVA
Daphnia are small aquatic organisms, like the one pictured to the left. Cyanobacteria are low-quality food for Daphnia because cyanobacteria produce toxins that slow down growth and development. Consequently, Daphnia fed on cyanobacteria have slow growth rates.
However, over time it is expected that Daphnia living in water bodies with high levels of cyanobacteria will evolve resistance to cyanobacterial toxins, so that they can take advantage of the abundant food source. Hairston and colleagues (1999) tested this hypothesis experimentally by measuring the rate of weight gain in Daphnia fed on cyanobacteria, using individuals that came from lakes with high, medium, or low densities of cyanobacteria. Weight gain rate is used as an indication of resistance to cyanobacteria toxins, with large rates of weight gain indicating high levels of resistance.
1. Start RStudio, and create a new project called "review2". Download the Excel file with today's data sets into the project folder. Download this R markdown file and place it in the review2 folder, then open it in R Studio.
Import the worksheet "anova" from "review_data.xls" into a data set called "daphnia". Enter the import command in the code chunk called "anova.import.data" of your R markdown file (you need to load the readxl library and then import the anova worksheet). Note that you will need to use the readxl commands for your Rmd file to knit - don't use R Studio's graphical import tool anymore.
Once the daphnia data set is in your environment, click on its name to open it for viewing. If you look at how the data are organized, you'll see that you have one column called "cyandensity" that indicates the cyanobacteria density where each Daphnia was sampled, and "resistance" gives the growth rate of the Daphnia. Each row is a replicate, which in this case is an individual Daphnia. To address Hairston's question we will need to compare the means between high, medium, and low cyanobacteria densities - with three group means to compare, we will use an ANOVA. If growth rate is highest where cyanobacteria are most abundant, you would conclude that the difference indicates increased resistance to cyanobacterial toxins.
2. We are going to make a graph of the data, but strangely enough there is not a graph type in "base" R that plots means and error bars. Yes, this is strange, since this is a dead standard kind of graph for grouped data, which is an extremely common type of data. But, one of the side-effects of using a programming language to do data analysis is that some seemingly pretty basic things are left up to you to do yourself.
Adding capabilities to R
When there is something you need to do in R that is not part of base R you have several options:
- Do the calculations step by step with functions that are included in R
- Write your own functions
- Install an extension library that has the capabilities you need
Let's look at each of these.
Using built-in functions to get the summary you need
What we need in order to plot means with standard errors for the error bars is a table that gives each level of cyandensity, a mean resistance for each, and a standard error for each mean resistance. R provides a function called aggregate() that can be used to make the table we need, but it will require several steps. The syntax for aggregate() is (don't write this yet, just explaining still):
aggregate(formula, data, function.to.use)
The simplest way to use aggregate() to calculate one of the statistics we need, such as the mean, is (in the console):
aggregate(resistance ~ cyandensity, data = daphnia, FUN = mean) -> resistance.mean
You can read the formula, resistance ~ cyandensity, as "resistance grouped by cyandensity", so this will calculate mean resistance for each cyandensity group. If you type the name resistance.mean in the console you will see a table with mean resistances at the three cyanobacteria densities, like so:
cyandensity resistance
1 high 0.8000000
2 low 0.6833333
3 med 0.7830000

We could use aggregate() to calculate other statistics, such as the standard deviation (in the console):
aggregate(resistance ~ cyandensity, data = daphnia, FUN = sd) -> resistance.sd
which gives us:
cyandensity resistance
1 high 0.07859884
2 low 0.10333822
3 med 0.04738729

Those are standard deviations in the resistance column - one of the disadvantages of using aggregate() is that it uses the name of the variable as the column name, so you need to be careful to identify what the numbers are (we used .sd in the name of the object for that reason). To get the sample size we could use (in the console):
aggregate(resistance ~ cyandensity, data = daphnia, FUN = length) -> resistance.n
Which gives us:
cyandensity resistance
1 high 10
2 low 12
3 med 10

You may be wondering, the length of what, exactly? The way aggregate() works is to split the resistance data based on the cyandensity level, and feed one set of data at a time as a vector of numbers to the function we identify with FUN =. The length of a vector is the number of elements it contains, so the length of one of these subsets of resistance data is also the sample size for that group.
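You can watch this splitting happen yourself with the split() function, which is essentially what aggregate() is doing internally - sketched here with a toy version of the two daphnia columns (the six values are invented):

```r
# Stand-in vectors for the daphnia columns, invented for illustration:
resistance  <- c(0.80, 0.75, 0.62, 0.70, 0.78, 0.79)
cyandensity <- factor(c("high", "high", "low", "low", "med", "med"))

# split() breaks the response into one vector per factor level...
groups <- split(resistance, cyandensity)
groups$high  # the two "high" values

# ...and the supplied FUN is then applied to each vector in turn:
sapply(groups, length)  # sample size per group
sapply(groups, mean)    # group means, matching aggregate(..., FUN = mean)
```
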
Now we have pieces of the table we need in three different objects, and we need to assemble these into a single summary data set for graphing. We can do this with the data.frame() command (in the console):
data.frame(cyandensity = resistance.mean$cyandensity, mean = resistance.mean$resistance, sd = resistance.sd$resistance, n = resistance.n$resistance, se = resistance.sd$resistance/sqrt(resistance.n$resistance)) -> daphnia.sumstats
Notice that in addition to making a column for each statistic and assigning the appropriate object's column to it, we calculated the standard error as well - the standard error is the standard deviation divided by the square root of the sample size, which is calculated and assigned to a column called se as the last part of our data frame.
If you open daphnia.sumstats (by clicking on it in the Environment tab) you will see:
cyandensity mean sd n se
1 high 0.8000000 0.07859884 10 0.02485514
2 low 0.6833333 0.10333822 12 0.02983117
3 med 0.7830000 0.04738729 10 0.01498518

This is exactly what we need - it took several steps and several intermediate objects to get it, but there it is!
You could write your own function
You can add your own functions to R - for example, we could put all these commands into a function definition like so (this is to illustrate the approach, you don't need to do this):
my.summarySE <- function(data.set, measure.var, group.var) {
  aggregate(formula(paste(measure.var, "~", group.var)), data = data.set, FUN = mean) -> mean
  aggregate(formula(paste(measure.var, "~", group.var)), data = data.set, FUN = sd) -> sd
  aggregate(formula(paste(measure.var, "~", group.var)), data = data.set, FUN = length) -> n
  data.frame(mean[group.var], mean[measure.var], sd = sd[[measure.var]], n = n[[measure.var]], se = sd[[measure.var]]/sqrt(n[[measure.var]])) -> sumstats
  return(sumstats)
}
You'll recognize the commands from above - once the function is created it would work by writing the command my.summarySE(daphnia, 'resistance', 'cyandensity')
The syntax of the commands that make up the function is a little different because we need a way to make the formula argument out of the input values - so, for example formula(paste(measure.var, '~', group.var)) converts the name of the measured variable and the name of the group variable into a formula that can be used by aggregate().
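You can see this conversion for yourself in the console - the variable names here are just examples:

```r
# The two pieces a caller would pass into the function:
measure.var <- "resistance"
group.var   <- "cyandensity"

# paste() builds a plain character string...
paste(measure.var, "~", group.var)  # "resistance ~ cyandensity"

# ...and formula() converts that string into a real formula object,
# which is what aggregate() (and aov(), lm(), etc.) expect:
f <- formula(paste(measure.var, "~", group.var))
class(f)  # "formula"
```
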
Writing your own functions works, but you might want to wait until you are a little more expert in R before you pursue this option.
Or, the easiest solution, you could use an extension library
We can take advantage of the massive number of contributed libraries available for R, and find a function that does this data summary for us. There is such a command, called summarySE(), in the Rmisc library - install Rmisc if you are working at home. It has the same syntax as the hand-made my.summarySE() function, and it produces all of these statistics, plus a 95% confidence interval to boot. You can even use more than one grouping variable if needed. Very nice.
Okay, then, to get the summary statistics you need for your graph add these commands to the daphnia.sumstats chunk in your Rmd file:
library(Rmisc)
summarySE(daphnia, "resistance", "cyandensity") -> daphnia.sumstats
In the daphnia.sumstats data set you now have the cyandensity labels we need for the x-axis, the means we need for y-axis, and the standard errors we need for the error bars (as well as a sample size, N, the standard deviation, sd, and the uncertainty value needed for a 95% confidence interval, ci). We are ready to graph them.
3. Now we will plot the means found in the daphnia.sumstats data frame. We will use ggplot2 again, and the following instructions help you learn how ggplot() works by building the graph one step at a time. For each new version of the command just edit the previous one until you have your final graph (don't write a separate graph command for each one, you'll just have one graph with all the bells and whistles at the end).
We will use daphnia.sumstats, and will assign cyandensity levels to the x axis, and the mean resistance values to the y-axis. In your R markdown file, in the daphnia.plot.means chunk, enter the lines:
library(ggplot2)
ggplot(daphnia.sumstats, aes(x = cyandensity, y = resistance))
and submit the code chunk. You'll see a graph with cyandensities on the x-axis, a y-axis labeled "resistance", and with nothing in it. This is because so far all we did was load the library, and then use the ggplot() command to identify the data set (daphnia.sumstats) and which variables to use for the x and y axes (the aesthetic mappings, aes()), but we haven't added anything to the plot yet.
We will represent the means with points, so we need to add those to our ggplot() command - modify the command you entered already to read:
ggplot(daphnia.sumstats, aes(x = cyandensity, y = resistance)) + geom_point()
The plus sign extends the ggplot() command - we are adding points as a geometric object (geom) to the plot. We now have a point for each mean on the plot. We can change the appearance of the points by adding some arguments to the geom_point() command - we can make them big and red with:
ggplot(daphnia.sumstats, aes(x = cyandensity, y = resistance)) + geom_point(size = 3, color = "red")
To add the error bars we add an additional geom, using geom_errorbar():
ggplot(daphnia.sumstats, aes(x = cyandensity, y = resistance)) + geom_point(size = 3, color = "red") + geom_errorbar(aes(ymax = resistance + se, ymin = resistance - se))
The geom_errorbar() command also has an aes() within it, which sets the top of the errorbar (ymax) to be the mean for each point plus the standard error for each point, and the minimum to be the mean minus the standard error. Bear in mind that both mean and se are columns in our data set, so when we calculate the top and bottom in the aes() statement we use the mean and se for each row, which are all different numbers.
The error bars are a little wide and are drawing over the points. We can make the points draw over the error bars by moving the geom_errorbar() statement before the geom_point() statement:
ggplot(daphnia.sumstats, aes(x = cyandensity, y = resistance)) + geom_errorbar(aes(ymax = resistance + se, ymin = resistance - se)) + geom_point(size = 3, color = "red")
and we can make the widths look better by setting their width within the geom_errorbar() command - note that we will want this width to apply to all of the error bars, and the width isn't present in a column in the data set, so we put it outside of the aes() statement:
ggplot(daphnia.sumstats, aes(x = cyandensity, y = resistance)) + geom_errorbar(aes(ymax = resistance + se, ymin = resistance - se), width = 0.1) + geom_point(size = 3, color = "red")
which looks much nicer. Finally, the labels are variable names, which are okay for analysis but don't make a lot of sense for a graph (the y-axis plots resistance, and the points are the means, but the y-axis isn't the mean of resistance - confusing at best, inaccurate at worst). We can change the labels using the labs() command:
ggplot(daphnia.sumstats, aes(x = cyandensity, y = resistance)) + geom_errorbar(aes(ymax = resistance + se, ymin = resistance - se), width = 0.1) + geom_point(size = 3, color = "red") + labs(x = "Cyanobacteria density", y = "Resistance")
Based on this graph it looks like resistance to cyanobacterial toxins is greatest at "High" cyanobacteria densities, least at "Low" densities, and in between at moderate densities.
4. We need to fix the order of the levels of cyandensity on the graph - they are in alphabetical order, which doesn't match the order of the levels.
Currently, cyandensity is an unordered factor, which is R's data type for a variable that is used for grouping. When you look at the data set all you see is the labels for each cyandensity level, but in addition to these text labels a factor also has a number assigned to each level that R uses internally but does not display by default. In an un-ordered factor these numbers are assigned to the levels alphabetically, and their ordering is never used in data analysis (they are used for other purposes, such as picking colors on plots). When we have a factor that is ordinal like this one, we need to make an ordered factor, which assigns the underlying numbering to match the order of the labels. To do this go back to the daphnia.sumstats code chunk of your Rmd file and add a command at the end that says:
daphnia.sumstats$cyandensity <- ordered(daphnia.sumstats$cyandensity, levels = c("low","med","high"))
You won't see an obvious difference, but if you type daphnia.sumstats$cyandensity at the console you'll see that now the levels are listed in the correct order (low, med, high), with less than symbols between them to indicate the ordering assigned.
If you run the daphnia.plot.means code chunk again the graph will show the levels in the correct order. This, by the way, is an example of why the Rmd system is so nice - if you decide to make a change in one command, you can then run each code chunk that comes after and your output will all reflect the change.
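If you want to see exactly what ordered() changes, here is a quick console sketch with a toy factor:

```r
# An unordered factor: levels default to alphabetical order.
x <- factor(c("med", "low", "high", "low"))
levels(x)      # "high" "low" "med"  - alphabetical
as.numeric(x)  # 3 2 1 2 - the hidden codes follow that alphabetical order

# Making it ordered reassigns the underlying codes to match the stated order:
x <- ordered(x, levels = c("low", "med", "high"))
levels(x)      # "low" "med" "high" - printed as low < med < high
as.numeric(x)  # 2 1 3 1 - codes now match the low/med/high ordering
```
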
5. Check your normality assumption and HOV assumption. This is done using the same commands as you used in your two-sample t-tests. We will be using the daphnia data frame for this - put these tests into daphnia.normality and daphnia.hov chunks of your R markdown file.
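As a sketch of what those chunks might contain - assuming shapiro.test() and bartlett.test() were the commands from your t-test review; substitute whichever commands your class actually used - with invented stand-in data so the lines are runnable on their own:

```r
# Toy stand-in for the real daphnia data (values invented for illustration);
# with the real data you would use the daphnia data frame instead.
toy.daphnia <- data.frame(
  resistance  = c(0.80, 0.75, 0.83, 0.62, 0.70, 0.65, 0.78, 0.79, 0.81),
  cyandensity = factor(rep(c("high", "low", "med"), each = 3))
)

# Normality, checked within each group (daphnia.normality chunk):
by(toy.daphnia$resistance, toy.daphnia$cyandensity, shapiro.test)

# Homogeneity of variance across the groups (daphnia.hov chunk):
bartlett.test(resistance ~ cyandensity, data = toy.daphnia)
```
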
6. Now you can conduct the ANOVA. Because R is object-oriented we will be doing this analysis in two stages - first, the ANOVA model is calculated (or "fitted" in statistical modeling terminology), and once that is done the ANOVA table is obtained as a summary of the fitted model. To fit the model, in the daphnia.run.anova chunk of your R markdown file use:
daphnia.aov <- aov(resistance ~ cyandensity, data = daphnia)
The work is being done by the aov() function, which identifies the data set being used (data = daphnia), and specifies the model formula. We will be using these in all of our analyses, because they specify the dependent variable (on the left of the tilde) that is to be explained by one or more independent variables (on the right of the tilde). This model formula, resistance ~ cyandensity, says that we will explain variation in resistance as a function of the cyandensity levels.
You won't see anything on the screen, because the output of the aov() function is put into the daphnia.aov object. If you type daphnia.aov at the console you'll see that the output isn't organized as an ANOVA table yet. To get an ANOVA table we need to apply the anova() command to the fitted model in daphnia.aov (in your R markdown file, on the next line after the aov() command):
anova(daphnia.aov)
Remember that when you do an ANOVA the p-value tells you if there is reason to think that there are differences between means, but with three or more means being compared you can't tell which are different - it could be that all three means are different from one another, but any combination of significant and non-significant differences between the pairs of means is possible. After receiving a p-value less than 0.05 with our ANOVA we need to conduct a post-hoc procedure to determine which of the groups are different from which. You learned about Tukey tests in Biol 215, so we will stick with those for now.
Naming objects
We have made a few objects, and I've been choosing names for them for you. Object names can be anything (as long as you don't use characters like spaces, quotes, dollar signs, or other "reserved" characters that are used by R for other things), but they should be sensible. Notice that I used a name for the dataset that told us something about what it contained (rather than something generic, like dataset1). Each time we created objects from an analysis of those data we used the name of the dataset first, followed by a period, followed by something that indicated what that object contained. For example, daphnia.sumstats tells us that object contains summary statistics for the daphnia data.
When you're using Rmd files you're less likely to confuse yourself with poor object naming, but even with an Rmd file to help you document your work it's always a good idea to make your code as self-documenting as possible. Adopting sensible naming practices helps.
To get Tukey post-hoc comparisons use the command (in your R markdown file, anova.tukey code chunk):
TukeyHSD(daphnia.aov)
You will see all three of the possible comparisons (med-low, high-low, high-med) listed, with a difference between the means (diff), the lower and upper limits of a confidence interval on the difference (lwr and upr), and a p-value. We will go over post-hocs in much greater detail in week 4; for now just identify which comparisons have p-values less than 0.05 and consider those groups to be significantly different.
You can interpret these differences in the context of your graph of means - if there is a statistically significant difference refer to the graph to see which mean is greater than which.
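If you ever want the comparison table programmatically rather than just printed, the TukeyHSD() result can be indexed like a matrix - a sketch with invented data standing in for the daphnia data set:

```r
# Toy groups with made-up means, purely to have something to run TukeyHSD() on:
set.seed(2)
toy <- data.frame(
  resistance  = c(rnorm(10, 0.80, 0.05), rnorm(10, 0.68, 0.05), rnorm(10, 0.78, 0.05)),
  cyandensity = factor(rep(c("high", "low", "med"), each = 10))
)

toy.aov <- aov(resistance ~ cyandensity, data = toy)
tuk <- TukeyHSD(toy.aov)

# The result is a list with one matrix per factor in the model:
tuk$cyandensity                    # diff, lwr, upr, p adj for each pair
tuk$cyandensity[, "p adj"] < 0.05  # which comparisons are significant
```
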
Regression
The worksheet "regression" has some (instructor generated) data on environmental temperature, abdomen temperature, and thorax temperature in winter moths. Winter moths are ectotherms that live in cold climates, yet are able to fly in the winter months by using a "pre-flight warmup" in which they move their wings to generate heat in their flight muscles. The high temperatures allow them to sustain the high rates of energy use that are needed to power flight. Because the heat is generated locally in the thorax, we expect that the temperature in other parts of the body that are far from the thorax (such as the abdomen) would not warm up, and should more closely match ambient temperature.
1. Import the sheet "regression" into an R dataset called "moth". If you open the data set you'll see that there is a column for environmental temperature, which will be the independent variable, and then two different dependent variables, Thorax.temp and Abdomen.temp. We will do two different regression analyses, one for each body part.
2. Before we analyze the data, we should plot it. If you open the moth data set you'll see that there are three columns, one called Env.temp, one called Thorax.temp, and one called Abdomen.temp. We will want to plot Thorax.temp and Abdomen.temp against Env.temp, color coded to indicate which is which, with a regression line plotted through each.
We will build the ggplot() command one step at a time. First we'll set the x variable to be Env.temp, since that is true for both of the body temperatures. In the moth.plot chunk of your R markdown file enter:
ggplot(moth, aes(x = Env.temp))
If you run this you'll see just an empty graph with Env.temp on the x-axis.
To add the abdomen points, we add on to this command:
ggplot(moth, aes(x = Env.temp)) + geom_point(aes(y = Abdomen.temp, color = "Abdomen"))
You'll now see the abdomen temperatures plotted. Using "Abdomen" as the color may seem odd, but having this statement in the aes() for geom_point() causes ggplot() to add the data to the legend.
We can add the regression line by adding a geom_smooth() command:
ggplot(moth, aes(x = Env.temp)) + geom_point(aes(y = Abdomen.temp, color = "Abdomen")) + geom_smooth(aes(y = Abdomen.temp, color = "Abdomen"), method = "lm", se = F)
We need to tell geom_smooth() that it needs to use the Abdomen.temp variable for the y data, but it picks up the x data from the original ggplot() command. We add the line to the legend with the color = "Abdomen" argument inside of the aes(). Since geom_smooth() can use a variety of different methods to put lines through data we need to tell it to use the lm (or "linear model") method, which produces straight lines - we will use the lm() function in just a moment to do the regression analysis for these data. Finally, the se = F argument suppresses the gray confidence band that geom_smooth() would otherwise draw around the line.
To get the thorax temperatures into the graph we just add on a geom_point() and a geom_smooth() for the thorax temperature:
ggplot(moth, aes(x = Env.temp)) + geom_point(aes(y = Abdomen.temp, color = "Abdomen")) + geom_smooth(aes(y = Abdomen.temp, color = "Abdomen"), method = "lm", se = F) + geom_point(aes(y = Thorax.temp, color = "Thorax")) + geom_smooth(aes(y = Thorax.temp, color = "Thorax"), method = "lm", se = F)
You'll see you now have both body temperatures plotted, color coded by the body part - with a numeric x-axis and a numeric y-axis this is a scatter plot. Based on this graph it looks like abdomen temperature changes as the environmental temperature changes, but thorax temperature does not.
3. Now we can run the regression analysis. The function that fits a linear model (i.e. a straight line) to the data is called lm(), and it is a lot like the aov() function we used above - we identify a data set, and a model formula to get our fitted model (in the moth.run.lm chunk of your Rmd file):
lm(Abdomen.temp ~ Env.temp, data = moth) -> abdomen.lm
The object abdomen.lm is the fitted model for the relationship between abdomen temperature and environmental temperature. Just like with the ANOVA, though, the fitted model is not itself the ANOVA table that we need for our significance test. We will get the ANOVA table we need shortly.
First, though, repeat this procedure with thorax temperature - in the next line of the same code chunk write the command:
lm(Thorax.temp ~ Env.temp, data = moth) -> thorax.lm
run this command and you will now also have an object called thorax.lm in your Environment.
4. The ANOVA table is obtained with the command (in the moth.anova.tables chunk of your R markdown file):
anova(abdomen.lm)
Note that the response variable is given above the table, and the table itself has the test of significance for environmental temperature as a predictor of abdomen temperature.
You can do the same with thorax.lm to get its ANOVA table. You'll see that (as we thought) abdomen temperature increases significantly with environmental temperature but thorax temperature does not.
5. To get r2 and the regression equation we need to ask for a summary() of each fitted model. Try getting the summary() of abdomen.lm (enter the command in the moth.regression.coeff chunk of your R markdown file), and you should see this:
Call:
lm(formula = Abdomen.temp ~ Env.temp, data = moth)
Residuals:
     Min       1Q   Median       3Q      Max
-1.16297 -0.57980  0.03181  0.61661  1.04760

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.31194    0.96637  -0.323    0.755
Env.temp     1.01080    0.01775  56.931 1.01e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.8063 on 8 degrees of freedom
Multiple R-squared: 0.9975, Adjusted R-squared: 0.9972
F-statistic: 3241 on 1 and 8 DF, p-value: 1.006e-11
You'll find the r2 labeled "Multiple R-squared" - ignore adjusted r2 for now, we'll learn about it later. The regression equation is built from the coefficients listed in this output - you'll see them listed as the "Estimate" for (Intercept) and Env.temp. A straight-line equation is y = mx + b, where m is the slope and b is the y-intercept. Our predictor variable here is environmental temperature, so the regression equation for abdomen temperature is:
Abdomen.temp = -0.31194 + 1.01080 Env.temp
The coefficients are each reported with a test of the hypothesis that they are equal to 0 - you'll see that the test of the slope has a p-value equal to the one in the ANOVA table, and it is highly statistically significant, while the intercept is not significantly different from zero. In other words, the slope is very close to 1 and the intercept is essentially 0: abdomen temperature tracks environmental temperature almost one-to-one.
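The coefficients and r2 can also be pulled directly out of the fitted model, which is handy when assembling the equation - a sketch with invented data standing in for the moth columns (with your real fitted model, the same calls work on abdomen.lm):

```r
# Toy stand-in for the moth data, with a known straight-line relationship
# plus a little noise (all values invented for illustration):
set.seed(42)
toy.moth <- data.frame(Env.temp = 1:10)
toy.moth$Abdomen.temp <- 2 + 1.5 * toy.moth$Env.temp + rnorm(10, sd = 0.1)

toy.lm <- lm(Abdomen.temp ~ Env.temp, data = toy.moth)

coef(toy.lm)                 # (Intercept) first, then the slope for Env.temp
coef(toy.lm)[["Env.temp"]]   # just the slope, for use in the equation
summary(toy.lm)$r.squared    # the "Multiple R-squared" value from summary()
```
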
You can do the same set of steps to get r2 and coefficients for thorax temperature and put together the regression equation for that model.
Based on your regression analysis you now know that the relationship between environmental temperature and abdomen temperature is significant, but the relationship between environmental temperature and thorax temperature is not.
Knit, quit and save
That's it! Knit your R markdown file to produce the Word file, and submit it to the course web site. Don't forget to save when you quit.