Canonical correlation analysis finds the multivariate correlations between two sets of variables. Previously, we used linear combinations of raw variables to understand structure in a single matrix of data (PCA) and to understand how group centroids differed from one another (DFA); now, with canonical correlation, we will use linear combinations to see how collections of variables in two different matrices are correlated with one another.
In canonical correlation analysis we have two different sets of variables, all measured on the same subjects, and we want to know how the two sets relate to one another. We would come close to canonical correlation analysis if we conducted a PCA on each set and then correlated the PCA scores with one another. But CC finds the linear combinations that are maximally correlated between the two data sets, and the CC axes are independent of one another. PCA, in contrast, would find the axes that explain the maximum variation within each matrix, and the correlation between the PCA axes for the two different data sets would not be as high. Additionally, although the axes would be independent within each PCA, if you correlated the PCA scores for one data set with the PCA scores for the other, the first PCA axis for data set 1 would not be independent of the second PCA axis for data set 2 (that is, the cross-correlations would not be 0).
We will be working today with data on blood samples taken from sea turtles in Florida over an 8 year period (I compiled it from here - feel free to look over the web site if you want to learn more about the data set). Several variables were measured from each sample, including the concentrations of various ions, blood sugar, and several enzymes. We will focus on the correlation between the ions in the blood and several of the other constituents (a handful of big molecules, as well as glucose).
Start a new project for today, download and open today's Rmd file, then download and import this data set into a data frame called blood.
The data set has 31 columns, but we don't want to use all of them. We need to separate the columns of ions (all but one of which needed to be log-transformed) from the columns of big molecules (three of the six of which needed to be log-transformed), and we can do this easily by making two different lists of the column names we want to use. Note that I already did some data screening, and log-transformed the variables that needed it.
The list of ions is (make.variable.lists):
ions <- c("log.ca", "log.p", "log.mg", "log.na", "log.k", "cl" )
and the list of big molecules is (also in make.variable.lists):
big.molecules <- c("glucose", "urea.nitrogen", "log.uric.acid", "cholesterol", "log.total.protein", "log.albumin")
Next, you can look at the bivariate correlations between these two different sets of variables - this will give you some idea of what patterns of correlation to expect in the multivariate analysis (correlations):
cor(blood[big.molecules],blood[ions])
which will give you this set of correlations:
                  log.ca  log.p  log.mg log.na  log.k       cl
glucose            0.156 0.4039 -0.0406  0.233  0.145  0.28010
urea.nitrogen     -0.559 0.0153 -0.2324 -0.439 -0.493  0.19066
log.uric.acid      0.287 0.0661  0.2770  0.165  0.404 -0.17322
cholesterol        0.252 0.2509  0.2194  0.192  0.179 -0.07031
log.total.protein  0.577 0.1173  0.1422  0.304  0.403  0.00397
log.albumin        0.325 0.0928  0.1952  0.397  0.378  0.08040
Since we're correlating two different sets of variables, rather than a set of variables with itself, each correlation is unique in this table, and you should look at all of them. You'll see that these bivariate correlations are not extremely high, but it's possible that across sets of variables the correlation is stronger.
Now that you have some idea of which variables are correlated with which, we can move on to the multivariate analysis.
1. Load the vegan library. We will use the vegan library to conduct canonical correlation analysis. There are actually several libraries available with various strengths and weaknesses, but I selected vegan because we will also use it next week to conduct canonical correspondence analysis, so we can cut down on the number of libraries you need to install if we use it for canonical correlation analysis today. Load the library with (in chunk load.vegan):
library(vegan)
2. Run the canonical correlation analysis. The function in vegan that runs CC is CCorA() - enter the command (run.cc):
CCorA(X = blood[ions], Y = blood[big.molecules], stand.X = T, stand.Y = T) -> blood.cc
The command syntax is pretty simple - we identify the two matrices to be correlated (X and Y), and then tell CCorA() to standardize the variables in each (standardizing means to calculate z-scores for them). Note an annoying quirk of the CCorA() function - you might expect that the first matrix entered would be treated as X by CCorA() and the second matrix would be treated as Y, but CCorA() instead uses the first matrix as Y and the second as X. As you know, if we label the inputs the order doesn't matter, so I've included X = and Y = to make it clear which is which. The choice of X and Y has no effect on the analysis, but since the output is labeled by X and Y it's important to be clear about which is which to avoid confusion.
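If you want to see what standardizing actually does, here is a small self-contained sketch (toy numbers, not the blood data) showing that z-scores computed by hand match what R's scale() function produces:

```r
# Toy vector, not the blood data - z-scores by hand vs. scale():
x <- c(2, 4, 6, 8)
z <- (x - mean(x)) / sd(x)           # subtract the mean, divide by the sd
all.equal(as.numeric(scale(x)), z)   # TRUE - scale() does the same thing
```

Standardizing puts every variable on the same scale, so no one variable dominates the linear combinations just because it is measured in bigger units.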
To see the output write the name of the object in the code chunk and run it (same run.cc code chunk, next line):
blood.cc
which will give you:
Canonical Correlation Analysis

Call:
CCorA(Y = blood[big.molecules], X = blood[ions], stand.Y = T, stand.X = T)

              Y  X
Matrix Ranks  6  6

Pillai's trace: 1.26

Significance of Pillai's trace:
from F-distribution: 3e-08

                       CanAxis1 CanAxis2 CanAxis3 CanAxis4 CanAxis5 CanAxis6
Canonical Correlations    0.793    0.522    0.458    0.299    0.212      0.1

                     Y | X  X | Y
RDA R squares        0.306   0.27
adj. RDA R squares   0.247   0.21
There is much more information in blood.cc that we will look at in a minute, but this output tells us, among other things, whether the relationship between the matrices is significant - below the "Call" and the "Matrix Ranks" output there is a Pillai's trace test statistic, and below that there is a p-value (labeled "Significance of Pillai's trace") of 3e-08, which tells us that the correlation between the matrices is significant. This is good to know, but it doesn't tell us which of the individual canonical correlation axes are significant, so before we start interpreting any of the canonical correlations we should get significance tests for the individual axes.
3. Get significance tests for each axis. The vegan library provides a test of significance of the correlation between the X and Y matrices overall, but it doesn't test individual canonical axes separately. We would like to know which of the axes represent significant correlations that are worth interpreting, but we need to install the CCP library to get those tests.
Install the CCP library, then load it and use the p.asym() command to get significance tests for each CC axis (chunk blood.cc.pvals):
library(CCP)
p.asym(rho = c(blood.cc$CanCorr), N = 77, p = 6, q = 6)
This command calculates the p-values for the canonical correlations (argument rho = blood.cc$CanCorr), given the sample size (N = 77) and the number of variables in each matrix (p = 6, q = 6; it doesn't matter which is which, as long as the number of variables in one matrix is used for p and the other is used for q).
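If you'd rather not hard-code those numbers, an equivalent call (assuming the blood data frame and the ions and big.molecules lists from earlier chunks are still in your workspace) might look like this:

```r
# Same test, with the sizes taken from the data instead of typed in:
p.asym(rho = c(blood.cc$CanCorr), N = nrow(blood),
       p = length(ions), q = length(big.molecules))
```

Pulling N, p, and q from the data means the command still works if you later add or drop rows or variables.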
The output looks like this:
Wilks' Lambda, using F-approximation (Rao's F):
stat approx df1 df2 p.value
1 to 6: 0.183 3.779 36 288 1.31e-10
2 to 6: 0.494 2.062 25 247 2.88e-03
3 to 6: 0.679 1.730 16 205 4.33e-02
4 to 6: 0.860 1.173 9 166 3.15e-01
5 to 6: 0.945 0.987 4 138 4.17e-01
6 to 6: 0.990 0.739 1 70 3.93e-01
The p-values are calculated for all six axes (1 to 6), for all but the first (2 to 6), for all but the first and second (3 to 6), and so on. You'll see that we have significant p-values for the first three rows, but once the first three axes are dropped the remaining sets (4 to 6, 5 to 6, and 6 to 6) are no longer significant. This is interpreted to mean that the first three axes are significant, but the remaining three are not.
The significant canonical correlations are the first three, which are R = 0.793, R = 0.522, and R = 0.458.
4. Plot the canonical correlation axes. The scores for big molecules and ions are in blood.cc$Cy and blood.cc$Cx, respectively. If you correlated these two sets of scores (in the Console):
diag(cor(blood.cc$Cy, blood.cc$Cx))
you get:
CanAxis1 CanAxis2 CanAxis3 CanAxis4 CanAxis5 CanAxis6
   0.793    0.522    0.458    0.299    0.212    0.102
These are the correlations of the scores for big molecules with the scores for ions for each of the canonical axes - you'll see they are the same as the canonical correlations from your output, above. You can think of the canonical correlations as correlations between the CC axis scores.
Given this, we can visualize the multivariate correlation between the two data sets by making a scatterplot of the scores for ions and big molecules for each of the statistically significant CC axes.
In chunk cc.scatters enter the following to get the plot of CC1 scores for ions on the x-axis, and CC1 scores of big molecules on the y-axis of a scatter plot:
par(mfrow = c(1,3))
plot(blood.cc$Cy[,1]~blood.cc$Cx[,1], xlab = "Ions CC1", ylab = "Big molecules CC1", main = "CC1, R = 0.793")
This graph shows the scores for big molecules on the first CC axis (using blood.cc$Cy[,1]) against the scores for ions on the first CC axis (using blood.cc$Cx[,1]). The x and y axis are each labeled appropriately, and the graph title is set to the name of the CC axis. You'll see that there is a fairly strong positive relationship between big molecule and ion scores on this first axis, which has a correlation of 0.793.
Add the second graph - the command is the same, but change to the second axis (using [,2] as column indexes), and update the graph labels to use CC2 (the canonical correlation is 0.522). Then add the third graph, using the third CC axis (its canonical correlation is 0.458).
The axes represent successively weaker correlations, so the first will always have the highest correlation.
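In case it helps, the second and third panels could be drawn like this (same cc.scatters chunk; the canonical correlations in the titles are copied from the output above):

```r
# Second and third panels, assuming blood.cc from the run.cc chunk:
plot(blood.cc$Cy[,2] ~ blood.cc$Cx[,2], xlab = "Ions CC2",
     ylab = "Big molecules CC2", main = "CC2, R = 0.522")
plot(blood.cc$Cy[,3] ~ blood.cc$Cx[,3], xlab = "Ions CC3",
     ylab = "Big molecules CC3", main = "CC3, R = 0.458")
```

With par(mfrow = c(1,3)) already set, these fill the second and third plotting slots next to the CC1 panel.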
5. Interpreting the axes - loadings. The scatter plots of scores illustrate what the CC is actually a correlation of, but to use these canonical correlations to make sense of our data we still need to figure out what the axes mean.
CCorA() gives us loadings for the ions and for big molecules separately. The loadings are named elements in blood.cc, so we can get the loadings for ions with (ion.loadings):
blood.cc$corr.X.Cx
You'll see the loadings reported:
       CanAxis1 CanAxis2 CanAxis3 CanAxis4 CanAxis5 CanAxis6
log.ca   0.8359   -0.217   0.3613    0.160   0.1724   -0.261
log.p    0.2435    0.585  -0.2450   -0.673  -0.1649    0.241
log.mg   0.2934   -0.429  -0.4403   -0.402   0.0411   -0.610
log.na   0.6206    0.077  -0.2757    0.396  -0.3771   -0.484
log.k    0.7134   -0.258  -0.2719    0.166   0.4981    0.274
cl      -0.0669    0.697  -0.0451    0.265   0.4480   -0.487
We interpret these like all the other loadings we've encountered - these are the correlations of the ions (row labels) with each canonical axis (column labels). You'll see that most of the ions are positively correlated with CanAxis1, except for cl, which is weakly negatively correlated (so weakly that you should consider it uncorrelated with Axis 1). The variables have a mix of positive and negative loadings on CanAxis2, so log.p, log.na, and cl are inversely related to log.ca, log.mg, and log.k (with cl and log.mg being the ions with the largest positive and negative loadings, respectively).
We can get the loadings for big molecules next (big.molecules.loadings):
blood.cc$corr.Y.Cy
which gives:
                  CanAxis1 CanAxis2 CanAxis3 CanAxis4 CanAxis5 CanAxis6
glucose              0.432   0.8605  -0.1454   -0.210   0.0682   0.0553
urea.nitrogen       -0.787   0.3907   0.0597   -0.339   0.3102  -0.1114
log.uric.acid        0.469  -0.3996  -0.3420   -0.303   0.4077   0.4952
cholesterol          0.452   0.0416  -0.0616   -0.717  -0.2905  -0.4381
log.total.protein    0.758   0.0225   0.3567   -0.130   0.3647  -0.3844
log.albumin          0.561   0.0853  -0.4490    0.363   0.1315  -0.5723
The first axis is positively correlated with every variable except urea.nitrogen, and the second has positive correlations with everything except log.uric.acid (albeit with very small correlations for cholesterol, log.total.protein, and log.albumin).
You can interpret how ions and big molecules covary by comparing the loadings between these two sets - ion variables with positive loadings will be positively correlated with big molecule variables with positive loadings; conversely, if the signs are different for the loadings on the same axis the variables will be negatively related to one another along that axis.
Probably the easiest way to interpret how the variables relate to one another between the two matrices, though, is by making a biplot for the ions and for the big molecules, which we will do next.
1. Add unicode plot symbols to the blood data. We need to work around a limitation of the biplot() function as applied to a CCorA() object - biplot() will only let us use text for the points on the graph, and doesn't give us the option of using plot symbols. This very quickly makes the graph hard to interpret because the over-plotting of the text labels makes them impossible to read.
We can get around this problem by adding plot symbols as a column in our blood data set, and then use that column as the labels in our graph. We make this trick work by entering unicode character codes for the plot symbols we want into a column in the blood data set.
If you've ever used the "Insert special character" functions in Word (or whatever word processor you use) you'll get a list of all the characters that are supported in the font typeface you're using. If you scroll through them you'll see not only letters, numbers, and punctuation symbols like you see on your keyboard, you'll see mathematical symbols, Greek letters, and a variety of dots, circles, and other shapes. Each of these has an underlying alphanumeric code that can be used to identify them, and one of the systems used to encode them is called unicode (you can see a comprehensive list here, if you're interested). Unicode is a standard for identifying symbols that allows for, among other things, changes between different languages that use different alphabets in electronic documents.
We're going to enter unicode alphanumeric codes for the plot symbols we want into a column in the data set, and R will render the codes as the plot symbols they stand for in the graph.
For example, code \U25CB is rendered by R as the symbol ○, and code \U25CF is rendered as the symbol ●. We will use the open circle for green sea turtles, and the filled circle for loggerhead sea turtles in our biplots.
To add these as a column in the blood data set, do the following (unicode.symbols):
plotsymbols <- c("\U25CB", "\U25CF")
Now we want to assign these codes to the green and loggerhead turtles - there's a clever trick for getting the codes assigned properly to each row in the blood data set; let's see how it works one step at a time.
First, we can convert the species column in blood into a factor, like so (in the Console):
factor(blood$species)
You'll see species, and the levels that are present in the species column:
 [1] Loggerhead Loggerhead Loggerhead ... (entries 1-42 are all Loggerhead)
[43] Green      Green      Green      ... (entries 43-77 are all Green)
Levels: Green Loggerhead
We happen to have loggerhead turtles entered in the first 42 rows, but when R makes a factor out of the column it puts the levels in alphabetical order - thus, the levels are listed as Green Loggerhead, rather than the order they were entered in the data set.
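A tiny self-contained example (toy vector, not the blood data) shows the alphabetical ordering, and how you could override it with the levels argument if you ever wanted data-entry order instead:

```r
sp <- c("Loggerhead", "Loggerhead", "Green")
levels(factor(sp))                        # "Green" "Loggerhead" - alphabetical
levels(factor(sp, levels = unique(sp)))   # "Loggerhead" "Green" - entry order
```

We won't need to override the ordering here, but it's worth knowing that the alphabetical default is just a default.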
Next, since factors have numeric codes assigned to each level that are replaced by the matching text label in the Levels: list, we can get the underlying numeric codes with (in the Console):
as.numeric(factor(blood$species))
This time you'll see the numeric values assigned to each level:
 [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1
[53] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
The 2's are the second level (Loggerhead), and the 1's are the first level (Green).
Now, our list of unicode characters has a character for each level of the species factor, with the first one being the code for green turtles, and the second being the code for loggerheads. This list of 2's and 1's can be used to repeat the loggerhead code for every 2, and the green turtle code for every 1, like so (still in the Console):
plotsymbols[as.numeric(factor(blood$species))]
R renders the symbols when it displays them, so even though you entered codes in plotsymbols you'll see the plot symbols themselves:
 [1] "●" "●" "●" "●" "●" "●" "●" "●" "●" "●" "●" "●" "●" "●" "●" "●" "●" "●" "●" "●" "●" "●" "●" "●" "●" "●"
[27] "●" "●" "●" "●" "●" "●" "●" "●" "●" "●" "●" "●" "●" "●" "●" "●" "○" "○" "○" "○" "○" "○" "○" "○" "○" "○"
[53] "○" "○" "○" "○" "○" "○" "○" "○" "○" "○" "○" "○" "○" "○" "○" "○" "○" "○" "○" "○" "○" "○" "○" "○" "○"
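The indexing trick works for any vector, not just unicode codes - here is a toy version (made-up labels, not the turtle data) you can run to convince yourself:

```r
# A vector of numeric codes picks elements out of the symbol list,
# repeating them as many times as each code appears:
syms <- c("open", "filled")
codes <- c(2, 2, 1, 1, 1)
syms[codes]   # "filled" "filled" "open" "open" "open"
```

This is exactly what plotsymbols[as.numeric(factor(blood$species))] is doing, just with 77 codes instead of 5.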
To finish this step we just need to enter these symbols into a new column in the blood data set called plotsym (chunk unicode.symbols):
blood$plotsym <- plotsymbols[as.numeric(factor(blood$species))]
We now have the symbols we need where we need them.
Before we move on to the graph, though, there are also a couple of points I want to highlight on the graph, so we'll replace the symbols in blood$plotsym for these rows with letters to make them easy to see (same chunk, next lines):
blood$plotsym[45] <- c("G")
blood$plotsym[18] <- c("L")
Row 45 is a green sea turtle, so we'll use a G to represent it on the graph, and row 18 is loggerhead, so it gets an L.
2. Make the biplots. Now you can visualize the results with the command (plot.cca):
biplot(blood.cc, plot.type = "b", xlabs = blood$plotsym)
This command will give you two plots. The first is a biplot of the scores of observations (as dots) with the loadings for the big molecules (as arrows) overlaid. The second is the same biplot for the ions. You'll get a warning message in the console about length(x) = 77 > 1, but I'm not sure what its problem is, and it's not affecting the graph, so ignore it.
Remember that the open circles are green sea turtles and the filled circles are loggerheads. Note that species is not used in any way by the CC, but differences between the species can contribute to the correlation between the data sets - for example, loggerhead turtles seem to have high levels of urea.nitrogen and low levels of log.albumin and log.total.protein compared with green sea turtles, and this difference can cause those variables to be negatively correlated when we put both species together in one analysis. This is not a bad thing, just something to be aware of (if you want, you can do the optional steps below that remove this difference statistically, so that you can see what the pattern is when the species no longer differ on average in their blood chemistries).
The two points that we set to letters (G for the green turtle in row 45, L for the loggerhead in row 18) appear in each graph because the graphs show the same set of turtles scored on CC1 and CC2 for both the big molecules and the ions. You'll find these points either on the far right or at the top of each graph - there are a couple of questions in your Rmd file about the blood chemistry for these two turtles.
Now that we have measures of correlation between the two matrices, we would like to know how much of the variation in ions is shared with variation in large molecules - this is like an R2 in a regression analysis, but based on multivariate correlations.
We would also like to know how well each set of CC axes represents each data set - if we have a strong correlation between CC axes that only capture a small amount of variation in the measured data, then we probably don't want to attach much importance to the pattern.
1. Communality. We have been answering the second question using communalities, which are just sums of squared loadings across the axes. We'll use the same approach we used for PCA - we'll make a matrix with an upper triangle containing 1's, and then use a matrix multiplication of the squared loadings by this matrix to get a running total of variance explained across the CC axes. We can get the communalities for the X (ion) data set with (ion.communalities):
ones.matrix <- round(upper.tri(matrix(nrow = 6, ncol = 6), diag = T))
blood.cc$corr.X.Cx^2 %*% ones.matrix -> ions.communalities
ions.communalities
You can modify these commands to get communalities for the big molecules matrix as well (big.molecules.communalities - you can use the same ones.matrix, since the number of columns is the same for both data matrices, just use the corr.Y.Cy loadings in place of the corr.X.Cx loadings).
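One way the modified commands might look (assuming ones.matrix and blood.cc from the chunks above are still in your workspace):

```r
# Same running-total trick, but with the big-molecule loadings:
blood.cc$corr.Y.Cy^2 %*% ones.matrix -> big.molecules.communalities
big.molecules.communalities
```

The only change from the ion version is swapping corr.X.Cx for corr.Y.Cy; the ones.matrix is reused as-is.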
The communalities will all reach 1 by the last axis for both matrices if you have the same number of variables in each matrix - if so, then you will have as many CC axes as you have variables in each matrix, and the CC axes will account for all of the variation in each data matrix, just like a PCA does. You'll see that the communalities all go to 1 by the last CC axis for both ions and big molecules.
If the number of variables is different between the two data matrices you'll only get as many CC axes as you have variables in the smaller data matrix - if this were the case we would have fewer CC axes than there are data variables for the bigger matrix, and less than 100% of the variation would be accounted for.
2. Redundancy. To finish up, you can address how much shared variation there is between the two data sets by calculating redundancy index values for each of the matrices. If you look at the original set of output we did get redundancy index values for the entire set of CC axes - it looks like this:
                    Y | X  X | Y
RDA R squares       0.306   0.27
adj. RDA R squares  0.247   0.21
This output is telling us that the amount of variation in big molecules (Y) that is shared with ions (X) is 0.306, and the amount of variation in ions (X) shared with big molecules (Y) is 0.27. The second row is an adjusted estimate that accounts for the number of variables (we can ignore these; they are mostly useful for model selection purposes, which we are not doing here).
To calculate redundancy for each CC axis, we will multiply the averages of the squared loadings by the squared canonical correlations. For the proportion of variance in the ion matrix (X) explained by the big molecules matrix (Y), that would be (redundancy.ions):
colMeans(blood.cc$corr.X.Cx^2) * blood.cc$CanCorr^2 -> ion.redundancy
ion.redundancy
sum(ion.redundancy)
The values stored in ion.redundancy are the variance in the ions matrix explained by each CCA axis, and the sum of these is the total variance explained in ions by the CCA. You'll see that the variation in ions shared with big molecules through the CanAxis1 is 0.18287, or 18.3%. The sum of all six of the redundancy values is 0.273, which is the same as the X|Y redundancy in the CCorA() output. That value of 0.18 is 0.18287/0.273 = 0.67, or two thirds of the total redundancy between the two matrices. The second axis adds another 0.05141/0.273 = 0.188, so between the first two we're accounting for 0.858, or 85.8% of the variation that is shared between the two. Ignoring the rest of the axes is only omitting 14.2% of the variation in the data that is shared between the two matrices.
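If you'd rather not do that division by hand, something like this (assuming ion.redundancy from the chunk above) would print the cumulative shares directly:

```r
# Running proportion of the total redundancy accounted for by each axis:
cumsum(ion.redundancy) / sum(ion.redundancy)
```

The first two entries should match the 0.67 and 0.858 worked out above.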
The variance in big molecules explained by ions will be different, so we need to calculate it separately as (redundancy.big.molecules):
colMeans(blood.cc$corr.Y.Cy^2) * blood.cc$CanCorr^2 -> big.molecule.redundancy
big.molecule.redundancy
sum(big.molecule.redundancy)
You'll see that about 22% of the variation in big molecules is shared with ions through CanAxis1. The sum of the redundancy coefficients is 0.306, or 30.6%, and this matches the CCorA() output.
That completes the required part of the assignment.
But, we detected a difference between the two species that likely produced some of the correlation between the ions and big molecules. If we want to see how ions and big molecules are related after this difference between the species is accounted for, we can statistically eliminate the differences between species and run the analysis again. This is a form of partial analysis, in which we statistically account for an effect and then see how the variables are related after that effect is removed from each of them. After removing the differences between the species we can focus on whether blood chemistry varies in a consistent way within both of the species, without the differences between species as a confounding factor.
1. Calculate residuals for ions and big molecules after accounting for species differences. One way of accounting for a variable is to use it as a predictor in a model, and then use the residuals as the variables - in the case of species as a predictor, the residuals will be differences between each turtle's measurements and the mean for the species they belong to. This has the effect of setting the species to the same mean of 0 for every variable, but it has no effect on the correlations between the variables within the species.
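A toy example (made-up numbers, not the turtle data) shows why this works - residuals from a group-predictor model are just deviations from each group's mean, so the group means of the residuals are all zero:

```r
# Two groups with very different means:
y <- c(1, 3, 10, 14)
g <- factor(c("a", "a", "b", "b"))
r <- residuals(lm(y ~ g))   # deviations from each group's mean
tapply(r, g, mean)          # both group means of the residuals are 0
```

The spread within each group is untouched, which is why within-group correlations survive the procedure.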
We can get residuals from a MANOVA relating all of the variables (both ions and big molecules) to species. In chunk residuals enter:
manova(as.matrix(blood[c(ions,big.molecules)])~species, data = blood) -> blood.sp.manova
data.frame(residuals(blood.sp.manova)) -> resids
The manova() command just fits the MANOVA model and assigns it to the blood.sp.manova object. We have done this several times, the only thing that's different is that I concatenated the two lists of response variables into one, rather than making a separate one - thus, c(ions, big.molecules) is used to identify all of the variables to use as responses.
The data.frame(residuals()) command extracts the residuals from the MANOVA model and makes a data frame out of them.
We can now do a canonical correlation on these residuals and there will be no difference in blood chemistry on average between the species to affect the results.
2. Run a canonical correlation on the residuals. Just like before, but using the residuals - in chunk cc.resids:
CCorA(resids[big.molecules], resids[ions], stand.Y = T, stand.X = T) -> resids.cc
resids.cc
You'll see that the sizes of the canonical correlations and the redundancy coefficients have gone down a little - this is an indication that the differences in species means were responsible for some of the correlation between the variables.
3. Test the canonical correlations for significance. Use p.asym() from the CCP library (cc.resids.pvals):
p.asym(rho = resids.cc$CanCorr, N = 77, p = 6, q = 6)
which gives us:
Wilks' Lambda, using F-approximation (Rao's F):
stat approx df1 df2 p.value
1 to 6: 0.290 2.61076 36 288 5.68e-06
2 to 6: 0.586 1.52710 25 247 5.63e-02
3 to 6: 0.792 1.01927 16 205 4.38e-01
4 to 6: 0.925 0.59720 9 166 7.98e-01
5 to 6: 0.986 0.24441 4 138 9.13e-01
6 to 6: 1.000 0.00154 1 70 9.69e-01
The first canonical correlation axis is still highly significant, but none of the rest are. We should restrict our interpretation to the first axis.
If you look at the loadings for ions (chunk ion.loadings.resids):
resids.cc$corr.X.Cx
and for big molecules (chunk big.molecule.loadings.resids):
resids.cc$corr.Y.Cy
you'll see:
(ions)
       CanAxis1 CanAxis2 CanAxis3 CanAxis4 CanAxis5 CanAxis6
log.ca  0.63795  -0.6072   0.0909   -0.295    0.358  -0.0255
log.p   0.49601   0.4725   0.1637    0.427   -0.473   0.3129
log.mg  0.00629   0.3092   0.6937   -0.126    0.560   0.3071
log.na  0.46060   0.2493  -0.2743    0.108    0.779  -0.1768
log.k   0.44019   0.0444   0.1921   -0.779   -0.224  -0.3336
cl      0.29044   0.1985  -0.5870   -0.393    0.238   0.5665
(big molecules)
                  CanAxis1 CanAxis2 CanAxis3 CanAxis4 CanAxis5 CanAxis6
glucose             0.8960   0.3912   -0.158   0.0307  -0.1295   0.0375
urea.nitrogen      -0.2397   0.5389    0.409  -0.0454   0.0217  -0.6943
log.uric.acid       0.0996   0.0687    0.511  -0.5198  -0.6412   0.2085
cholesterol         0.5163   0.0745    0.614   0.3036   0.3678   0.3520
log.total.protein   0.7323  -0.4293    0.231  -0.3173   0.3301  -0.1283
log.albumin         0.3581   0.2981   -0.280  -0.4755   0.6568   0.2165
All of the ions are now positively correlated with the first axis, and all but urea.nitrogen are positive for the big molecules. This is a very different pattern, so species differences were producing a good deal of the patterning we saw before. Within species, there is a tendency for the total concentration of all of these blood components to increase or decrease together, except for urea.nitrogen, which is negatively related to the rest.
4. Plot the results. We can add plot symbols by species so that we can confirm that species differences were eliminated (unicode.symbols.resid):
resids$plotsym <- plotsymbols[as.numeric(factor(blood$species))]
We can then get the biplots (biplot.resids):
biplot(resids.cc, xlabs = resids$plotsym, plot.type = "b")
You'll see that the open and filled dots are now intermingled, and don't differ on average anymore. The graphs give you the first two CC axes by default, but remember that only the first is statistically significant, so restrict your interpretation to the left to right patterning.
Knit your Rmd file and upload your Word document to complete the assignment.