Introduction

Today we are going to re-visit the lung cancer survival data, but instead of log-transforming our troubles away we're going to analyze the data with a Mann-Whitney U-test and a randomization test. All three approaches are alternative ways of testing whether there is a difference in survival time between the groups, but each addresses the question in a different way. We will see at the end whether they all arrive at the same conclusions.

The randomization test will take more time to do, and you will need to wait for a class data set to finish it. Do your part in generating the class data set and enter your data into the class database first; you can then work on the Mann-Whitney U-test while you're waiting for the rest of the class data.

Randomization testing

We will re-visit the lung cancer survival time data for patients with small cell or large cell cancers that we worked with last time to see how randomization tests can be used to test for differences between means, even when the data aren't normal.

Under the null hypothesis, the survival times are the same for small and large-cell cancers - the cell type grouping is a distinction without a difference. If the null is true, the 94.4-week difference between the large and small cell cancer mean survival times we observed should be easy to generate by randomly shuffling the data between the groups. This random shuffling mimics the process of repeatedly drawing two samples from a single population, in which both samples share the same population mean. The differences between the means of the randomly shuffled groups serve as our sampling distribution, in place of the t-distribution.

(The app displays: Large mean, Small mean, Difference, Number of random shuffles, Number exceeding 94.4, and p-value.)

The app to the left shows the cancer survival data in the table - initially the groups are correctly assigned, and the large mean, small mean, and difference between them are all equal to their observed values. The difference between means is our test statistic for this test - it is the measure of difference we are using to test whether the difference between the population means is 0. The difference is highlighted in red because any difference between means we generate with an absolute value of 94.4 or larger shows in red and adds to the "Number exceeding 94.4" count. Under the null hypothesis the observed difference of 94.4 is treated as just another randomly generated difference, so it counts both as one of the random shuffles and as one of the shuffles exceeding 94.4 - consequently, the "Number of random shuffles" and "Number exceeding 94.4" counters both start at 1.

The p-value is just the number of random shuffles exceeding 94.4 divided by the total number of random shuffles generated, with 1 added to both the numerator and the denominator to account for the observed data. Before you start generating random shuffles only the observed data contribute to the numerator and denominator, so the p-value starts at a value of 1.
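If it helps to see that arithmetic written out, here is a minimal sketch of the same calculation in Python - the numbers fed to it are made-up stand-ins for whatever the app has tracked so far, not the real data:

# A minimal p-value calculation matching the app's counters, assuming we have
# a list of shuffled differences and the observed difference of 94.4 weeks.
def randomization_p_value(shuffled_diffs, observed_diff=94.4):
    # Count shuffles at least as extreme as the observed difference; absolute
    # values are used because this is a two-tailed test.
    n_exceeding = sum(1 for d in shuffled_diffs if abs(d) >= abs(observed_diff))
    # Add 1 to the numerator and denominator so the observed data count as one
    # "shuffle" that trivially exceeds itself.
    return (n_exceeding + 1) / (len(shuffled_diffs) + 1)

print(randomization_p_value([]))       # no shuffles yet: p = 1.0
print(randomization_p_value([12.3]))   # one shuffle below 94.4: p = 0.5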

Now that you understand what the app is showing you, you can use it to see how randomization testing works.

1. Hit "Randomize" once to generate your first random shuffle. When you hit the "Randomize" button the labels for cancer type are randomly shuffled, while the survival times stay in place, which has the effect of randomly assigning the data values to new groups. The means for the new groups are calculated, and the resulting difference between them is reported. If the difference is not as big as the observed difference of 94.4 the difference turns black (which is very likely what happened). The number of random shuffles increases to 2, but the number exceeding 94.4 remains at 1, so the p-value is now 0.5.

If you hit "Randomize" again a few times you'll see that the p-value declines each time you get a random shuffle with a difference between means smaller than 94.4.

2. Continue to hit "Randomize" until p is less than 0.05. Since the denominator of our p-value calculation is the number of random shuffles we generate, it will take at least 20 random shuffles (which is actually the observed data plus 19 random differences) for p to equal 0.05. When the number of random shuffles counter hits 21, p will drop below 0.05 (unless you happen to get a rare difference exceeding 94.4 along the way, in which case the numerator also increases by 1 and you will need to keep shuffling quite a bit longer).

3. Continue to hit "Randomize" until p is less than 0.01. Record how many random shuffles it took, and how many differences exceeded 94.4 on your worksheet.

4. Clicking buttons is fun and all, but it would get tedious to do this enough times to get a p-value of 1 in 10,000 (0.0001), so I generated 10,000 differences for you. Download this MINITAB worksheet, which contains 10,000 random differences between randomly shuffled groups. Make a histogram of the differences, and you will see that it resembles a bell-shaped sampling distribution of test statistics, much like a t-distribution.
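If you're curious how a file of shuffled differences like that could be generated outside of MINITAB, here is a rough sketch in Python - the survival times in it are placeholders, not the actual data set, and numpy and matplotlib are assumed to be available:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)

# Placeholder survival times (weeks) - substitute the real large-cell and
# small-cell values from the data set.
large = np.array([143.0, 220.0, 340.0, 112.0, 201.0, 184.0])
small = np.array([25.0, 45.0, 87.0, 30.0, 61.0])

pooled = np.concatenate([large, small])
n_large = len(large)

# Shuffle the pooled values 10,000 times; each shuffle re-assigns values to
# groups of the original sizes and records the difference between the new
# group means.
diffs = np.empty(10_000)
for i in range(diffs.size):
    shuffled = rng.permutation(pooled)
    diffs[i] = shuffled[:n_large].mean() - shuffled[n_large:].mean()

plt.hist(diffs, bins=50)
plt.xlabel("Difference between shuffled group means (weeks)")
plt.ylabel("Frequency")
plt.show()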

5. To see how many of the 10,000 differences exceed 94.4, open the calculator ("Calc" → "Calculator"), store the result in a new column called "Exceeds", and enter the expression:

abs("Differences") >= 94.4

This expression calculates the absolute value of each difference (abs()), and compares each to 94.4. If the value is greater than or equal to 94.4 the expression returns a 1, and if it is less than 94.4 it returns a 0.

Now, to count up how many of the differences exceeded 94.4 you just need to sum up the Exceeds column. Do this with "Display descriptives", using "Exceeds" as the "Variable", and include only the "Sum" as a statistic to display.

Note that you could count the number of differences that exceed 94.4 in a single step using the calculator, with the expression:

sum(abs("Differences") >= 94.4)

which compares the absolute values of the differences to 94.4 and then sums the resulting 0s and 1s before reporting the total in the "Exceeds" column. This is more compact, but it's easier to see what is happening if you do the calculation in two steps.
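For comparison, the same two-step and one-step calculations could be done in Python with numpy; the file name below is hypothetical, standing in for wherever you saved the 10,000 differences:

import numpy as np

# The 10,000 shuffled differences, read from wherever you saved them (the
# file name here is hypothetical).
diffs = np.loadtxt("differences.txt")

# Two-step version: build a 0/1 "Exceeds" column, then sum it.
exceeds = (np.abs(diffs) >= 94.4).astype(int)
print(exceeds.sum())

# One-step version, analogous to sum(abs("Differences") >= 94.4).
print(int(np.sum(np.abs(diffs) >= 94.4)))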

Answer the questions on your worksheet about randomization testing.

Mann-Whitney U test

The last analysis we'll do on these data is to compare the medians between the two groups using a Mann-Whitney U-test. Mann-Whitney U tests don't require that the data are normally distributed, but the distributions should have the same shape in the two groups. Since both distributions are right-skewed, we can apply this test to our survival data.

You can use this file for the analysis, which is the same survival data again but organized in an "unstacked" manner, with a column for each cell type. MINITAB can be a funny beast at times, and it isn't always consistent in its approach from one test type to the next - since we're making a comparison between two independent samples we should be able to use stacked data, with a column of measured data and a column of group identifiers (like we did for two-sample t-tests), but MINITAB doesn't want that here. Other than the organization, these are the same data that you analyzed using a log transformation in the previous activity, and that you just analyzed using a randomization test.

1. To run the test, select "Stat" → "Nonparametrics" → "Mann-Whitney".

2. In the form that pops up, put "large cell survival" into the first sample box, and "small cell survival" into the second. Leave the confidence level at 95% and the alternative as "not equal", and click OK.

The output gives median survival times for each group (Greek "eta" symbols represent the population median), and an "Estimation for difference", which is reported to be 89.5. Given the labeling it seems as though this should just be the difference in medians between the groups, but it isn't - instead, MINITAB calculates differences between all possible pairs of data points for small and large-cell survival times, and then reports the median of these differences as the point estimate.
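If you want to convince yourself that the point estimate really is the median of all pairwise differences, here is a small Python sketch of that calculation - the two arrays are placeholders, not the actual survival times:

import numpy as np

# Placeholder values - substitute the actual large-cell and small-cell columns.
large = np.array([143.0, 220.0, 340.0, 112.0, 201.0])
small = np.array([25.0, 45.0, 87.0, 30.0])

# Differences between all possible pairs (each large value minus each small
# value), followed by the median of those pairwise differences.
pairwise = large[:, None] - small[None, :]
print(np.median(pairwise))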

There are two reported results of the null hypothesis test - the first is labeled "Not adjusted for ties", and the second is labeled "Adjusted for ties". Ties refer to data values that are the same, and that therefore have the same rank. The first result, not adjusted for ties, uses the W statistic to calculate a p-value based on a normal approximation (we talked about this in lecture - if the sample sizes are over 30 this is the preferred approach). The second option, which adjusts for ties, uses the Mann-Whitney U distribution for the p-value (which is preferred when the sample sizes are less than 30). We have over 30 in one group, but under 30 in the other, so we should prefer the adjusted-for-ties approach. In this instance the two results are identical, so it doesn't matter which we report, but you should understand the difference between them so you can pick the correct one when the two results do not agree.
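If you ever need to run the same test outside of MINITAB, something roughly equivalent is available in Python through scipy.stats.mannwhitneyu. Keep in mind that scipy reports the U statistic for the first sample rather than MINITAB's W (the rank sum of the first sample), and its tie and continuity corrections may not match MINITAB's exactly, so treat this as a sketch rather than a drop-in replacement; the data below are placeholders:

import numpy as np
from scipy.stats import mannwhitneyu

# Placeholder survival times - substitute the two unstacked columns from the file.
large = np.array([143.0, 220.0, 340.0, 112.0, 201.0])
small = np.array([25.0, 45.0, 87.0, 30.0])

# Two-sided test; method="asymptotic" uses a normal approximation, while
# method="exact" uses the exact distribution of U (appropriate when there
# are no ties).
result = mannwhitneyu(large, small, alternative="two-sided", method="asymptotic")
print(result.statistic, result.pvalue)

# MINITAB reports W, the rank sum of the first sample, which relates to
# scipy's U by W = U + n1 * (n1 + 1) / 2.
n1 = len(large)
print(result.statistic + n1 * (n1 + 1) / 2)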

3. Answer the questions on your assignment sheet pertaining to the test results.

Challenge Question

Many species come into breeding condition only under particular circumstances. Many are seasonal breeders, and during non-breeding seasons females will be non-receptive and males will often be non-aggressive towards each other. Cichlid fish become reproductively active due to social interactions, in that males will increase reproductive hormone levels (such as "gonadotropin releasing hormone", or GnRH) when they are in the presence of other males.

To test whether the presence of other males affects levels of GnRH, researchers placed males in tanks either by themselves, making them non-territorial, or in the presence of other males, making them territorial. Then they measured the GnRH levels in territorial and non-territorial males. The data are here in the first two columns, Territorial status and GnRH level.

1. Test for HOV (homogeneity of variance) using an F-test. Do you have HOV?

2. Test for normality. Do you pass the normality test?

3. Why might you want to avoid doing a two-sample t-test on these data? Note that the sample sizes are not identical between the two groups.

The researchers decided to do a randomization test to compare mean GnRH levels between the groups. The differences between randomly re-grouped GnRH levels are given in column "Rand diffs". The differences are sorted from smallest to largest.

4. What is the observed difference between means?

5. How many of the random differences exceeded the observed difference (that is, how many absolute values of random differences were greater than or equal to the absolute value of the observed difference)?

6. What is the p-value for this test?

7. Was this a one-tailed or two-tailed test?

8. Now, conduct a Mann-Whitney U test for the same data. Report the difference in medians, the W test statistic, and the p-value.

9. What conclusion can you draw from this test regarding the effect of territoriality on GnRH levels in male cichlids? Do the two different non-parametric tests lead to the same conclusion?