Arrays and matrices are closely related terms. Arrays are computer data structures that store data in rows and columns. Matrices are mathematical structures made up of rows and columns. Mathematicians and computer scientists may be able to tell you how they are different, but we will use the terms interchangeably in here.

Array formulas in Excel work on ranges of cells, but apply their calculations to the cells one at a time.

We will look at two applications of array formulas:

Download this file and save it to your OneDrive (or your H: drive, or your own computer... whatever is working for you).

Array formulas for summarizing grouped data

Array formulas can greatly increase the clarity and interpretability of a spreadsheet by allowing you to do complex calculations in a single cell, so that only the result of the calculation is displayed.

As an example we will calculate a mean and standard deviation from data on numbers of aphids on a sample of 100 tomato leaves from a garden.

Switch to sheet "Grouped data". You'll see that you have a column of data called "Number of aphids" and one called "Frequency" giving the number of times that each number of aphids was counted out of 100 plants.

The challenge here is that the data are already in a summarized form - instead of having 100 individual measurements we have a column showing the possible data values (column "Number of aphids") and another that shows how often they occurred (column Frequency). We can't just average the numbers of aphids in column A to get the mean, since some of counts of aphid numbers occurred frequently (2 aphids were found 27 times) and some didn't occur at all (no leaf had 8 aphids).

A. The arithmetic mean is Σxi/n, which says that we need to sum all the data values and divide ty the sample size.

The sample size is easy - it's the sum of the frequencies (n = 100).

To get the sum of the data values we need to add each number of aphids for every time they appear in the data set - we need to add zeros ten times, 0 + 0 + 0 + 0 + 0 + 0 + 0 +0 + 0 + 0, and then add that to the 23 times we counted 1 aphid (1+1+1... out to 23 ones), then add that to the 27 times we observed 2 aphids, and so on. Equivalently, we multiply the numbers of aphids by their frequencies, and sum the products to get the sum of the data values for the numerator of our mean calculation.

This is pretty simple, so we'll do just do this as an array formula:

As an array formula the ranges of cells are each treated as arrays, and a2:a12*b2:b12 is interpreted to mean "a2 x b2, a3 x b3, a4 x b4... a12 x b12". These products are themselves a new array, and this new array is used as the argument for the sum() function, which sums all the products. The sum of the frequencies in the denominator is the sample size, and it doesn't do any array calculations - it works the same as if it was not part of an array formula, it just sums the range of cells in its argument.

New in Excel 365 - dynamic arrays

Up until Excel 365 it was required that you use CTRL+SHIFT+ENTER to use an array formula. Now when you enter an array formula if you just hit ENTER Excel will treat the entry as an array formula and adjust the output as needed (if the output range will spill over into cells that already have contents Excel will let you know). Formulas entered this way are called dynamic array formulas - if you select a cell within the output range it will put a gray order around the cells so you can see that they are all output from a single dynamic array formula, but the curly brackets don't show up in the formula bar.

This is so new that I'm not sure what the implications are - I have tried out the calculations we use for this exercise and have not yet found one that fails if you hit ENTER instead of CTRL+SHIFT+ENTER. I'll continue to keep CTRL+SHIFT+ENTER in the instructions, but if you forget and use ENTER instead and everything works there's no need for you to re-do the entry with CTRL+SHIFT+ENTER, it seems they're essentially the same thing.

B. We're going to calculate the standard deviation for these grouped data now. The standard deviation is a measure of dispersion in the data, equal to the average distance of observations from the mean. We calculate it as:

Std dev

Complex cell formulas - use with caution

Before we move on to the calculation, though, a quick caution about complicated formulas. If you recall from last week, we avoided having cells with intermediate steps of calculation in our spreadsheet by nesting calculations within a single cell formula, such as sin(radians(B2)) - this allows us to do a unit conversion on an angle stored in B2 and then calculate the sin() of the converted value without having to devote a column in the spreadsheet to the conversion. Array formulas allow us to take this approach to the next level, which allows us to make cleaner, less cluttered spreadsheets, which is good.

However, nested formulas makes it harder to detect errors, because some of the calculations are hidden within the nested parts of the formula.

Because of this you should be especially vigilant about double-checking your results if you use array formulas.

Our approach will be do do the calculations in two ways - once without using array formulas at all, in small steps that are easy to error check, and then when we're confident we got it right, we will do the entire calculation again with an array formula in a single cell.

This means that we're not saving ourselves any work - if we have to do the calculation of the standard deviation without array formulas anyway to check our answers then using the array formula is redundant, and if our only goal was to save time we wouldn't bother with the array formula. But, doing it both ways allows us to get a version that's simple and clean at the end, while ensuring that we didn't make any calculation errors in our complex formula - the journey will be longer, but we'll be happier with the destination.

Let's do the version that doesn't use array formulas first - to get the sums of squares (the numerator of a standard deviation) we need to:

Now that we have the squared differences we can finish the standard deviation calculation in a single cell:

This is not an array formula, because there is no per-cell operation being done - both of the cell ranges in the numerator and denominator sums are just used as arguments to a sum() function.

C. Now that we're confident we know what the correct standard deviation is, we can do all these calculations in a single cell, using an array formula.

You should see the same value reported in B16 as you got for cell B15.

If you look at this formula it does exactly what we did without using array formulas, but does it all in one cell:

Presumably you agree that this formula is more complicated and harder to interpret what is going on than the first approach that split the calculation into a series of smaller steps.

The advantage of this complicated formula is that you wouldn't need the "squared differences" or "Diffs x freqs" columns - they aren't meant to be interpreted, they are just steps in the calculation of the standard deviation, and having them there invites confusion. If you were doing this calculation for your own data analysis project, you could delete those columns and the sd calculation in B15, and just retain the sd calculation in cell B16, once you know it's right, which would leave only the data and the summary statistics that are meant to be interpreted - don't do that, though, because I want to see both of your calculations.

Responses to selection on genetically correlated traits

Chinook

Size-selective harvesting is very common in hunted or fished wild populations, and usually the sizes selected are the largest individuals in the population. This image of a Chinook salmon on the left is an example of a species that is harvested in this way - both commercial harvesting and sport fishing focus on the largest individuals. We can expect that removing the larger individuals from the population will leave the smaller individuals that escaped harvest to be the parents for the next generation. If we make it dangerous for them to be large we can expect them to evolve smaller body size.

Body length is a phenotypic trait, which means it is part of the way the organism looks or functions. The length of the salmon's body is a quantitative trait (i.e. varying continuously), and most quantitative traits are polygenic (i.e. are affected by multiple genes).

Environmental conditions experienced by a salmon as it grew will also affect its adult length, so there is an environmental component to length as well.

Although both the genes they carry and the environment they grow in will affect their lengths, only the genes can be passed on from parent to offspring. The ability of a parent to pass its trait value on to its offspring is called the heritability of the trait. Heritability is symbolized as h2, and it can fall between 0 (no genetic contribution to the trait value) and 1 (complete genetic control over the trait value, no environmental contribution).

We predict how selection will change a single trait using the breeder's formula: R = h2S, where R is the amount of response to selection, h2 is the heritability of the trait, and S is the selection differential.

Response to selection

The graph to the left illustrates how this works. This example uses a heritability of h2 = 0.5, which means that 1/2 of the variation in length of the salmon is due to their genes. The selection differential is equal to the difference in mean for the population as a whole and the population that is selected to be the breeders. In this case we're selecting the smaller individuals to be breeders as a side effect of taking the larger individuals from the population, and we can expect the offspring will fall halfway between mean for the population as a whole and the mean for the breeders.


The breeder's formula works well for one trait at a time, but organisms are integrated units, not collections of independent traits. When you select against salmon with big body lengths by removing them from the breeding pool you end up changing all sorts of other traits as well. Fishing may not directly target fish that are fast growing, but fish with big lengths may have gotten so large by growing faster - selecting against long fish could end up also selecting against fish that are fast growing, and result in slower growth rates over time. The selective pressure put on traits that are not under selection themselves, but are correlated with traits that are, is called indirect selection. We predict how indirect selection will affect response to selection on traits using a matrix of genetic variances and covariances, called a G matrix.

If you switch to the "Matrix" tab you'll see an example of a G matrix that comes from a study by Hard (2004). Variance-covariance matrices are organized like a correlation matrix, which you may have seen before - the row labels and column labels are the same, and the body of the matrix holds variances when the row and column labels match, and holds covariances between two variables when the row and column labels are different. Since the row and column labels match in a diagonal from the upper left to the lower right corner of the matrix, this is called the main diagonal of the matrix. The main diagonal is made up of genetic variances, which are used in the same way as heritabilities in the breeder's formula, to predict the effects of direct selection on traits.

The other elements in the G matrix are covariances. Covariances are like correlations that haven't been standardized to fall between -1 and 1 - they indicate how two variables change relative to one another, but they still have the original data units.

It's harder to compare covariances to one another to see which variables have the strongest inter-relationship because different pairs of variables have different units. But, but you should at least be able to see that all of the covariances in the G matrix are positive - this means that in every case as one variable increases so does the other (and, as one decreases the other decreases too).

The equivalent of a selection differential when we work with responses to selection across multiple correlated traits is the selection gradient. We won't worry too much about how selection gradients differ from selection differentials for the purposes of this exercise because they serve the same purpose - they represent how much selection pressure is being put on a trait, and in what direction (positive selection gradients select for an increase in the trait value, and negative selection gradients select for a decrease).

When we select for a single trait, like length of the body, we put a direct selection pressure on that trait, and an indirect selection pressure on any trait it's correlated with. The way we will predict how direct and indirect selection produce change in all the organism's traits is to use a version of the breeder's formula designed for multiple correlated traits, which tells us:

R =

R is now a column of responses, each row of which will hold the predicted response to selection for each variable we are tracking. G is the genetic variance-covariance matrix, and β is a column of selection gradients.

We will use this model to explore how these traits are expected to change as different patterns of selection are applied to them.

A. First, we need to set up the model:

You can already see that there are changes in all of the traits, even though we're only selecting for length, because of indirect selection via their covariation with weight.

We have a problem, though - the responses are all giving expected amounts of change in the trait, but all of the trait values have different units. The biggest number is -62.583 for fork length, but we don't know if that's a big response because it could be that fork lengths are really big numbers (salmon are big fish, and the lengths are in mm). If we divide these responses by the mean trait value they will all give us a relative change on a proportional scale, which we can format as a percent - this will make it much easier to compare the responses across variables.

The publication I took the G-matrix from did not provide the means for these variables, so we will use some ballpark estimates taken from other sources.

These means are all in their natural, untransformed units, but you may have noticed that the labels in the G matrix show that some of the variables have been transformed to make the relationships between the variables more linear. We need to apply the same transformation to the means that were used in the G matrix before we can use them.

Now we can divide the responses in column J by these transformed means and format the result as a percentage to get a percent change per generation:

Note that the biggest percent change was not in the trait that was under direct selection (fork length), it was in adult age, which has a large correlation with fork length.

We're going to change the gradients a few times, but we want to record our results for each ones:

B. The first scenario in step A imagined that fishing only exerted direct pressure on one trait, adult fork length, but in fact it exerts direct pressure on other traits as well (perhaps inadvertently, but direct pressure nevertheless). For example, growing rapidly so that you can breed before you get caught may be beneficial in a harvested population, and although fishing doesn't specifically target the slow-growing fish, there may still be a direct benefit to rapid growth.

The positive gradient value for growth rate means we're selecting directly for faster growth as well as for shorter length. The fact that the selection on growth rate is positive while selection on fork length is negative actually reduces the percent responses for all of the traits. The positive selection on growth rate did not overcome the negative selection on fork length, but it did decrease the amount of reduction in growth rate that we saw when growth rate was only indirectly affected by selection on fork length.

C. In addition to growing faster we might expect that breeding younger would be beneficial. Salmon migrate out to sea, spend 2-4 years growing, and then return to spawn. Although a salmon can only choose to spawn at 2 years, 3 years, or 4 years, the average age can decrease or increase by fractional amounts, depending on how many animals of each age are spawning in a year.

To select for younger breeding:

D. We are now going to graph the results - select the range of cells that has the trait names, the responses, and the column headings (A11 to D16) and insert a line graph.

The graph has some issues we need to fix, the first of which is that the traits are on the x-axis, and the series are the different selection gradients we used. We want the opposite - the gradients should be on the x-axis, and each line should be a trait. To swap them, select any one of the lines, right-click, and select "Select data".

You now have a different line for each trait, and you can more easily see how changing the selection gradients affect them.

Next, since we have both positive and negative changes the default of having the x-axis cross the y-axis at 0 is resulting in the labels being located in the middle of the graph. We would like them at the bottom where they belong. This is done, a little counter-intuitively, by changing a y-axis setting:

You will now have the labels for the selection gradients at the bottom of the graph.

You can change the chart title to "Responses to selection", and add a y-axis label of "Percent response".

E. Now that you have the results, notice a few interesting patterns:

Assignment

That's it! Upload your completed worksheet to the class web site.