Introduction

Bernie saves Pluto

You may have heard the myth that the Swiss would tie barrels of brandy to the necks of St. Bernard rescue dogs to warm up cold skiers in the Alps (this gif is from a Disney cartoon that depicts a St. Bernard giving a very cold Pluto a shot from his barrel. Binge drinking is not usually depicted in children's cartoons anymore - it was 1936, times have changed). The basis of this myth is the belief that drinking alcohol increases body temperature because people experience the sensation of warmth after drinking it. But, this subjective feeling of warmth may or may not reflect an actual increase in core body temperature.

How would we know if drinking brandy actually changes body temperature? As data-driven scientists, we should try to answer this question by devising an experiment to test whether drinking brandy changes core body temperature (and, no, this is not what we'll be doing in class this week, sorry to disappoint you).

As you know from our work on experimental design, to test the effect of brandy on body temperature we need to apply a treatment (brandy drinking) and measure a response (body temperature). One possible experiment that would test the hypothesis that drinking brandy affects core body temperature would be:

Brandy experiment: results

x̄ = 98.7°

s = 0.3

n = 20

If we conducted the experiment and got the results in the box on the right, the decision rules tell us that since the sample mean body temperature of x̄ = 98.7° is above 98.6° we should conclude that brandy increases body temperature.

That's what the data says, and you can't argue with the data. Right?

Random sampling strikes again

Unfortunately, it isn't so simple.

The problem is that the decision rules do not account for random sampling. According to the decision rules we can only conclude that drinking brandy doesn't affect core temperature if the mean for our data is exactly equal to 98.6°, and any difference, no matter how small, would cause us to conclude that brandy either increases or decreases body temperature.

This is a problem because the random samples we work with will produce some random variation in sample means. So, even if the population mean, μ, is exactly equal to 98.6°, the mean of a sample selected from that population, x̄, probably won't be. If we used decision rules that didn't account for random sampling, these differences would cause us to conclude that brandy affects body temperature when it does not.

The simulation to the right illustrates the problem - the data points represent a sample of 20 body temperatures selected from a population with a mean body temperature of μ = 98.6°, and the mean of the sample is reported below the graph. If you click the "Select a random sample" button a new sample of 20 is selected from the same population, which gives you a new sample mean. As you click repeatedly, note that the sample mean is rarely equal to 98.6°. Because of this random sampling variation, if we conclude that any mean other than exactly 98.6° indicates that brandy has an effect on body temperature then we will mistakenly conclude that brandy has an effect most of the time.

If you keep hitting "Select a random sample" you'll see that even though the mean is changing randomly, it isn't completely unpredictable. Means exactly equal to 98.6° don't happen very often, and means only rarely go below 98.5° or above 98.7°. This suggests a solution to our problem - even though a mean that is exactly equal to 98.6° isn't likely, we can change our decision rules to reflect the fact that small differences can happen by chance. If we can characterize the range of variation in sample means that random sampling usually produces then we can treat any observed difference that's within this range as probably being due to random chance, but treat sample means that fall outside of this usual range as probably being due to an actual effect of brandy on body temperature.
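If you'd like to reproduce the simulation yourself, here is a minimal Python sketch (numpy shown purely for illustration; the population standard deviation of 0.3 is an assumption borrowed from the sample in the results box):

```python
import numpy as np

rng = np.random.default_rng(42)

# Population values matching the example; the standard deviation
# of 0.3 is an assumption borrowed from the sample statistics.
mu, sigma, n = 98.6, 0.3, 20

# Draw five random samples of 20 temperatures and record each mean.
# The means scatter around 98.6 without usually equaling it exactly.
sample_means = [rng.normal(mu, sigma, size=n).mean() for _ in range(5)]
print([round(m, 2) for m in sample_means])
```

Each run of the loop plays the role of one click of the "Select a random sample" button.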

The question is, then, how different from 98.6° would our sample mean have to be to conclude that brandy has an actual effect on core temperature?

To answer that question we need inferential statistics, which are methods that allow us to draw conclusions about a population based on a sample of data. This week we will learn how to do a type of null hypothesis significance test (NHST), called a one-sample t-test, to determine if our sample mean of x̄ = 98.7° is different enough from 98.6° to conclude that drinking brandy increases body temperature.

Inferences are based on hypotheses

Before we learn about the one-sample t-test we will use in particular, we need to spend a little time on NHSTs in general.

In the sciences, we base our conclusions on tests of hypotheses. A hypothesis is simply a possible explanation for some phenomenon. In inferential statistics we will call these sorts of statements of the way we think a system is working scientific hypotheses, although sometimes we will simply refer to the scientific hypothesis as the question we are trying to answer. For the current example, the scientific hypothesis is that brandy increases body temperature. The experiment we designed to test this question has us measuring body temperature, and we came up with some decision rules that tell us how to interpret the three possible outcomes of our experiment (i.e. an increase in temperature, no change, or a decrease in temperature). We use the scientific hypotheses to guide us in developing statistical hypotheses, and we revisit the scientific hypothesis after our statistical analysis is complete to draw a conclusion about whether the experiment supports our scientific hypothesis. This final decision will be based on our decision rules, modified to account for the effects of random sampling.

Statistical hypotheses are about how randomness affects our data. To decide if the data gives us enough evidence to conclude that drinking brandy changes body temperature we will test a null hypothesis. Null hypotheses are always hypotheses of no difference, or randomness. We can express a null hypothesis about our test of the effect of brandy on core body temperature as:

Null hypothesis: at a population level, average body temperature (μ) for people drinking brandy is 98.6°.

In other words, since we know that body temperature for people who aren't drinking brandy is 98.6°, body temperature should still be at the normal human body temperature of 98.6° if brandy has no effect. If the null hypothesis is true, any difference between our experiment's sample mean and this hypothetical population value is just due to random sampling variation.

We test the null hypothesis, but there is another possibility - brandy may have an actual effect on body temperature. We do not know what the body temperature of people who drink brandy should be, but if brandy affects core temperature it definitely shouldn't be 98.6° anymore. We can thus express this alternative hypothesis as:

Alternative hypothesis: at a population level, average body temperature (μ) is not equal to 98.6° for people drinking brandy.

You can see that a null hypothesis is very specific - it specifies a hypothetical value of the population mean exactly. In contrast, alternative hypotheses are not specific at all - they just say "whatever the population mean might be, it is not equal to the null value".

So far this doesn't actually look any different from the decision rules that we set up before. There is a very important difference, though - the null hypothesis is about the population mean, not about the sample mean. We will use the value of the population mean of 98.6° as the center of a sampling distribution, and will ask if the sample mean of 98.7° that we observed is likely to occur by chance, or unlikely to occur by chance when the null hypothesis is true and brandy has no effect on body temperature. This will account for random sampling, and will allow us to draw a conclusion about our scientific question in a way that properly acknowledges the way that randomness affects our data.

The general procedure that all null hypothesis significance tests follow is:

1. State the null (and the alternative) hypothesis.
2. Calculate a test statistic.
3. Compare the test statistic to a sampling distribution to obtain a p-value.
4. Compare p to α - reject the null hypothesis if p < α, and retain the null hypothesis if p ≥ α.
5. Draw a scientific conclusion.

Formal null hypothesis significance testing

The one-sample t-test

The above set of steps is common to all NHSTs, but the details differ depending on the data type and experimental design used. In this body temperature example, we have a single sample mean to compare to a specified hypothetical population mean.

The appropriate null hypothesis significance testing procedure to compare a single sample mean to a specified hypothetical population mean is called a one-sample t-test. The steps in the procedure are as follows:

1. State the null (and the alternative)

The null hypothesis will always be the hypothesis of no effect or randomness, and it will be expressed in terms of a population parameter. If there is no effect of brandy drinking we expect the population mean to equal 98.6°.

To make the null hypothesis as precise and clear as possible we can express it in symbolic form. The symbol for "null hypothesis" is Ho, or "h-naught", with the o representing a subscripted zero. We can write the null hypothesis symbolically as:

Ho: μ = 98.6°

or, equivalently, as:

Ho: μ - 98.6° = 0

This makes it absolutely clear that our null hypothesis is that the mean of the population our data was selected from is 98.6°.

Once the null hypothesis is stated, the alternative hypothesis is just that the null hypothesis is false. Symbolically the alternative hypothesis is:

HA: μ ≠ 98.6°

This statement makes it absolutely clear that if we decide that the null is not true then all our null hypothesis test will tell us is that the population mean is something other than the null value of 98.6°.

2. Calculate a test statistic

The test statistic tells us how different our sample mean is from the null hypothetical value. Mathematically, "difference" implies subtraction, so we could find the difference by subtracting our hypothetical population mean of 98.6° from the sample mean of 98.7°, which gives us 98.7° - 98.6° = 0.1°.

If you recall from our work on confidence intervals, we can use the t-distribution as a mathematical model of randomly sampling means with a given sample size from a population. The advantage of using the t-distribution is that it will allow us to calculate a probability of obtaining a difference of 0.1° from a population with a mean of 98.6° based on the data in just a single sample. But, the units on the x-axis of a t-distribution are standard errors, not degrees Fahrenheit. To use the t-distribution we need to convert the units for our difference into standard errors.

To do this unit conversion we first need to know the standard error for the data set. The standard error for a single sample of data is just the standard deviation divided by the square root of the sample size. Our sample of 20 people had a standard deviation of 0.3, which means that the standard error would be:

standard error = s/√n = 0.3/√20 = 0.067°

The units on a standard error are the data units, so there are 0.067° in one standard error for this sample of data.

Now to get our observed t-value test statistic we just need to know how many standard errors 0.1° is. This is a simple unit conversion, like converting 120 inches into feet - to make that conversion we would divide 120 inches by the number of inches per foot to get 120/12 = 10 feet. Dividing our observed difference of 0.1° by the number of degrees in a standard error gives us:

t = (x̄ - μ)/standard error = (98.7° - 98.6°)/0.067° = 1.49

This tells us that 98.7° is 1.49 standard errors above 98.6° (to be clear, it's 1.49 standard errors away from 98.6°, and since the value is positive it's above 98.6°). This observed t-value is now in the correct units to compare to a t-distribution to obtain a probability for our experimental result.
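The two calculations above fit in a few lines of code. A minimal Python sketch (variable names are our own, chosen to match the example):

```python
import math

xbar, mu0, s, n = 98.7, 98.6, 0.3, 20

# Standard error: the standard deviation divided by the square
# root of the sample size.
se = s / math.sqrt(n)              # about 0.067 degrees

# Observed t-value: the difference from the null value expressed
# in units of standard errors.
t_obs = (xbar - mu0) / se          # about 1.49

print(round(se, 4), round(t_obs, 2))
```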

3. Compare the test statistic to a sampling distribution to obtain a p-value

To reach our conclusion about the null hypothesis we need to calculate a probability, called a p-value. The p-value is the probability of a very specific thing:

The p-value is the probability of randomly sampling from a population with a mean equal to the null value (μ = 98.6°) and obtaining a sample mean that is at least as different from the null value as the observed sample mean (x̄ - μ = 0.1°).

We will use the t-distribution to obtain the p-value for our observed t-value. Remember from the confidence interval exercise that the t-distribution is a good mathematical model of a sampling distribution for means, and its shape depends on degrees of freedom. For a one-sample t-test degrees of freedom equals the sample size minus 1, which in this example is 20 - 1 = 19.

One-tailed p-value

The t-distribution with 19 degrees of freedom is shown in the graph to the left.

The probability of getting a sample mean at least t = 1.49 standard errors above 98.6° by chance is the red shaded area from 1.49 to infinity, which is equal to:

p = 0.076

Finding the probability for an observed t-value is best left to the computer - we will use MINITAB to calculate all of our p-values for us.
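MINITAB is what we'll use in class, but as an illustration the same one-tailed p-value can be computed in Python with scipy's t-distribution functions:

```python
from math import sqrt
from scipy import stats

xbar, mu0, s, n = 98.7, 98.6, 0.3, 20
t_obs = (xbar - mu0) / (s / sqrt(n))      # observed t, about 1.49

# One-tailed p-value: the area of the t-distribution with
# df = n - 1 = 19 that lies above the observed t-value.
p_one = stats.t.sf(t_obs, df=n - 1)
print(round(p_one, 3))                    # about 0.076
```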

This p-value is called a one-tailed p-value, because we are only using one tail of the t-distribution to obtain it. We suspect that drinking brandy increases body temperature because it feels like it does, so we may only be interested in the chances that random sampling would produce sample means that are higher than 98.6° - if that's the case then using a one-tailed p-value is appropriate.

Two-tailed test

However, it's possible that the sensation of warming when we drink brandy is misleading, and that body temperature actually decreases even though we feel warmer. This sort of unexpected result can happen, and we may want to leave open the possibility of the unexpected occurring when we do our work. When we're doing our null hypothesis test we're considering the possibility that our experimental result is only different from 98.6° due to random sampling, and from that perspective the direction of the observed difference isn't important - if it's a randomly generated difference it could just as easily have gone the other way. We account for this possibility by looking at random differences that are just as big as we observed (0.1°, or 1.49 standard errors), but in the opposite direction.

We account for the possibility of unexpected results by using a two-tailed p-value. To include differences as big as we observed in either the positive or negative direction, we would want to include the blue region, which starts at t = -1.49 and extends to negative infinity, as well as the red region, to calculate our p-value.

Since the t-distribution is symmetrical, the area in the blue region is also 0.076, and the two-tailed p-value that uses both the red and blue regions is p = 0.152.
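In code, the symmetry argument above amounts to doubling the one-tailed area (again a Python/scipy sketch for illustration, not the MINITAB workflow we'll use in class):

```python
from math import sqrt
from scipy import stats

t_obs = (98.7 - 98.6) / (0.3 / sqrt(20))

# The t-distribution is symmetric, so the lower-tail area below
# -t_obs equals the upper-tail area above +t_obs; the two-tailed
# p-value is therefore twice the one-tailed area.
p_two = 2 * stats.t.sf(abs(t_obs), df=19)
print(round(p_two, 3))                    # about 0.152
```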

The way to indicate if your p-value is based on one tail or two is in the way you represent the alternative hypothesis. If you are only using the upper tail for the p-value you are only testing for a warming effect, and your alternative hypothesis should be:

HA: μ > 98.6°

If you are testing for an effect of brandy on body temperature, and are interested in finding either an increase or decrease in body temperature, then your alternative hypothesis should be:

HA: μ ≠ 98.6°

We're using the HA: μ ≠ 98.6° alternative, so we would be able to detect either an increase or a decrease in body temperature. Obviously, if you waited to see which direction the difference went and then tested for that direction of difference, you would not be conducting a fair, unbiased test. You should always decide whether you're testing for just one direction of change or either direction before looking at your results.

Given this explanation, what should the alternative hypothesis be if you were only interested in testing for a reduction in body temperature from drinking brandy? Click here to see if you're right.

As a general rule, I would recommend using two-tailed tests unless you have a really good reason to use one-tailed tests. It's better to leave open the possibility of an unexpected result. For the brandy experiment, we may think that brandy increases body temperature because drinking it makes you feel warm. But, it is possible that drinking brandy produces that feeling of warmth by moving some of the heat from the core of your body to the surface, which could decrease core body temperature in spite of making your skin feel warm. Either possibility would be interesting, and we should use the two-tailed p-value for our test so that either result is detectable.

4. Compare p to α - reject the null hypothesis if p < α, and retain the null hypothesis if p ≥ α

We now have a p-value of 0.152. You can think of this as telling you that if drinking brandy has no effect on core body temperature, a difference of at least 0.1° from 98.6° (in either direction) would happen 15.2% of the time.

This seems like a pretty high probability of the result being due to random chance, and thus not due to a real effect of brandy, but how do we know that p = 0.152 is big enough to conclude that brandy has no effect?

Unfortunately, the answer is that we can't know for sure, because the only probabilities that indicate certainty are p = 0 (impossible) or p = 1 (definite). Any time we draw a conclusion based on a probability other than 0 or 1, we leave open the possibility that we are wrong. We can't achieve absolute certainty about whether our result is due to random chance or not, but we can decide in advance of conducting the test how low the p-value has to be before we conclude that brandy has an effect. We call this threshold value against which we compare our p-value the alpha level, or simply α.

It's up to the data analyst to set alpha, but even though any value could be used it's traditional to set alpha to 0.05. We draw our conclusion about the null hypothesis using the decision rule: reject the null hypothesis if p < α, and retain the null hypothesis if p ≥ α.

Our p-value is p = 0.152, greater than 0.05, so we retain the null.

COMMIT THIS DECISION RULE TO MEMORY! This is a very simple rule, but it only works if you remember it!
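To help the rule stick, it can be written as a tiny function - a Python sketch (the function name is our own invention for illustration):

```python
def nhst_decision(p, alpha=0.05):
    """Decision rule: reject the null if p < alpha, retain it if p >= alpha."""
    return "reject the null" if p < alpha else "retain the null"

# The brandy example: two-tailed p = 0.152 with the traditional alpha of 0.05.
print(nhst_decision(0.152))
# A hypothetical much smaller p-value would lead to rejection.
print(nhst_decision(0.003))
```

Note that a p-value exactly equal to alpha retains the null, since the rule rejects only when p is strictly less than α.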

We already met this decision rule when we were testing normality of data, but in a different form. We "failed" the normality test when p was less than 0.05, and "passed" it when p was greater than 0.05. The AD test of normality is in fact a null hypothesis test, with a null hypothesis that your data are normally distributed. If p is greater than 0.05 you retain this null (and "pass" the test), whereas if p is less than 0.05 you reject this null (and "fail" the test).

By the way, statisticians often prefer to use fail to reject or retain when they talk about the null, rather than accept, because doing so acknowledges that this hypothetical value may still be incorrect, but that our data provides too little evidence against 98.6° as the value of μ to reject it.

5. Draw a scientific conclusion

We retained the null hypothesis that body temperature is equal to 98.6° at the population level. Given this, scientifically speaking we do not have sufficient evidence from our sample mean of 98.7° to conclude that people who drink brandy have a body temperature that is different from 98.6° on average.

If you retain the null hypothesis, you can say your result is not statistically significant. If p had been less than 0.05 we would have rejected the null, and we would have been able to call the difference statistically significant. This phrase can cause mischief, because "significant" and "important" are synonyms in everyday usage, but not in statistics. If you say your results are significant, all you are saying is that your results are unlikely to be due just to random chance, but that isn't the same thing as saying they are biologically important. Biological importance can only be judged based on what you know about the system - if an increase of 0.1 had in fact been statistically significant, would that much increase be enough to use brandy drinking as a way to stay warm? Would such a small increase in body temperature be worth the mental impairment in a survival situation? Such questions aren't addressed by the null hypothesis test - all you learn from the null hypothesis test is whether you're justified in considering your results to indicate a real effect of the treatment you used, or more likely to be due to random chance.

So, now that we understand null hypothesis significance testing, let's return to the logical setup for our experiment. The only change we need to make is in the decision rules:

  • If the sample mean is significantly greater than 98.6°, conclude that brandy increases body temperature.
  • If the sample mean is not significantly different from 98.6°, conclude that brandy does not affect body temperature.
  • If the sample mean is significantly less than 98.6°, conclude that brandy decreases body temperature.

Now when we assess our results we don't base the decision on whether the sample mean is equal to 98.6°, we base the conclusion on whether the sample mean is significantly different from 98.6°. This change properly accounts for the way that randomness affects our data, and prevents us from interpreting small, random fluctuations as though they indicate real biological effects.

What does it mean to accept the alternative?

When you get a statistically significant result, you reject the null hypothesis and accept the alternative hypothesis. The alternative hypothesis is HA: μ ≠ 98.6°, which just states that the null is wrong. The alternative hypothesis is not that the population mean is equal to the sample mean; that is, definitely not HA: μ = x̄ = 98.7°.

It's important to realize this, because researchers often act as though rejecting the null is evidence that the population mean is equal to the sample mean. This is actually a perfectly sensible thing to do, because when the null is rejected the best estimate we have about the actual value of the population mean is the estimate we get from our sample mean. But bear in mind that the reason for treating 98.7° as the value for μ is because it's the value estimated by x̄. Whether it's at all a good estimate should be evaluated by calculating a confidence interval - a narrow interval around 98.7° would indicate that you have good reason to treat it as the correct value for μ. Understand, though, that the alternative hypothesis doesn't give evidence for any particular value of μ, only against a very specific value of μ, and is therefore by itself not good evidence that μ is equal to the sample mean.

Formal null hypothesis testing - the critical t-value approach

So far, you've learned to compare a p-value to an alpha level to determine whether to reject or retain the null. An alternative approach that you will also see frequently in the scientific literature is to compare the observed t-value to a critical t-value.


This graph of a t-distribution illustrates the approach. Once we pick an alpha level of 0.05, we can put 1/2 of it in the upper tail and 1/2 in the lower tail, which defines the upper (red) and lower (blue) rejection regions. The t-values that define the ends of the shaded regions are the critical t-values. The critical t-value for our experiment, with 19 degrees of freedom, is tcrit = ±2.093 (roughly ±2.1). If an observed t-value falls into either of these rejection regions the probability of observing a t-value of that size is less than 0.05, and we would reject the null.

Between the critical t-values of -2.1 and 2.1 is the retention region. Any t-value that falls within the retention region has a probability of occurring that is greater than 0.05, and would cause us to retain the null.

Our observed t-value of 1.49 (shown by the green arrow below the x-axis) falls inside of the retention region - thus, we retain the null, and conclude that our sample mean of 98.7 is not significantly different from 98.6.

If you click on the picture you'll see that it changes between a p-value approach and a critical t-value approach so you can see how the two methods are related.


        α(2):   t0.20    t0.10    t0.05
df      α(1):   t0.10    t0.05    t0.025
 1              3.078    6.314    12.706
10              1.372    1.812    2.228
15              1.341    1.753    2.131
16              1.337    1.746    2.120
17              1.333    1.740    2.110
18              1.330    1.734    2.101
19              1.328    1.729    2.093

Critical t-values come from t-tables. A portion of a typical t-table is shown here (with some of the degrees of freedom rows removed to make it shorter).

The column labels refer to the alpha level for either two-tailed (α(2)) or one-tailed (α(1)) t-tests. Since we are doing a two-tailed test, we use the top row (the α(2) column headings) to pick the column - with an alpha level of 0.05 we need the t0.05 column.

The row to use is determined by degrees of freedom. With 20 data points, degrees of freedom is 20 - 1 = 19, so the critical t-value is in the row labeled with 19 degrees of freedom and the column labeled t0.05 under α(2) - the critical t-value is 2.093.

The graph of rejection and retention regions is a nice way of visualizing how this is working, but we can express all of this with another simple decision rule:

  • If the absolute value of the observed t-value is greater than the critical t-value, reject the null. Observed t-values that are greater than the critical t-value fall into the rejection region of the graph.
  • If the absolute value of the observed t is less than the critical value, retain the null. Observed t-values that are less than the critical t-value fall into the retention region of the graph.
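The two bullets above can be sketched directly in code. The critical value comes from the inverse of the t-distribution (a Python/scipy illustration; in class the table or MINITAB would supply the same number):

```python
from scipy import stats

alpha, df = 0.05, 19
t_obs = 1.49          # observed t-value from the brandy example

# Two-tailed critical value: alpha/2 = 0.025 in each tail.
t_crit = stats.t.ppf(1 - alpha / 2, df)   # matches the table's 2.093

if abs(t_obs) > t_crit:
    print("reject the null")
else:
    print("retain the null")
```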

Now you have two methods for doing the same thing, evaluating the null hypothesis, so which should you use?

The good news is that it doesn't matter much. Since the p-value comes from the observed t-value, and the critical t-value comes from the alpha level, both methods are making essentially the same comparison, and the two approaches will always lead to the same conclusion about the null hypothesis (assuming you interpret them correctly).

However, we will primarily use the p-value approach in class because it is simple, and because p-values carry some additional information that critical t-values do not - a p-value of 0.00001 is stronger evidence against the null hypothesis than a p-value of 0.049. The critical t-value approach only leads to a "reject/retain" decision, but doesn't give us any information about how strong the evidence supporting our conclusion is, so we will opt for the more informative p-value approach.

Relationship between a one-sample t-test and a confidence interval

If you look at the curve illustrating rejection and retention regions above, it should remind you of how we illustrated a confidence interval - it illustrates an interval within which 95% of sample means should be found, if the null is true.

Let's look at how a one-sample t-test compares with a 95% confidence interval for a sample mean, graphically:

Rejection/retention regions in a t-test

95% confidence interval for the sample mean


The retention region is centered on the hypothetical population mean, and it captures 95% of the possible sample means. If you click on the image you can toggle between using t-values as the x-axis and using degrees (remember, calculating t is just a matter of doing a unit conversion to express data units as standard errors, so we can convert back and forth between the data units and t-values without changing the shape of the distribution).

The sample mean 98.7° is the black arrow, and it's within the retention region - we retain the null, and conclude that 98.6° remains a plausible value for the population mean.

The sample mean is a sample-based estimate of the population mean. A 95% confidence interval centered on the sample mean of 98.7° uses the t-distribution to find limits that capture 95% of possible sample means. The shaded areas in the tails are the 5% of the possible sample means that are not in the confidence interval. Any mean falling within the confidence interval is considered a possible value for the population parameter.

The hypothetical population mean of 98.6° is shown by the black arrow. It falls inside of the 95% confidence interval, and we would conclude that 98.6° is a possible value for the population mean.

With the graph of a t-test on the left set to show units of degrees, the only difference between the left and right graph is that the left is centered on the hypothetical value of 98.6°, whereas the graph on the right is centered on the sample mean of 98.7°. If we center on the hypothetical population mean, as we do with the t-test, we check if the sample mean falls inside of the interval around the population mean. If we center on the sample mean, as we do with a 95% confidence interval, we check if the hypothetical population mean falls inside of the interval around the sample mean. In both cases we are using sample information to infer the properties of the population the sample comes from.
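The agreement between the two views is easy to verify numerically - a Python/scipy sketch under the example's numbers:

```python
from math import sqrt
from scipy import stats

xbar, mu0, s, n = 98.7, 98.6, 0.3, 20
se = s / sqrt(n)
t_crit = stats.t.ppf(0.975, n - 1)        # two-tailed, alpha = 0.05

# 95% confidence interval centered on the sample mean.
lo, hi = xbar - t_crit * se, xbar + t_crit * se
print(round(lo, 2), round(hi, 2))

# The null value of 98.6 falls inside the interval - the same
# conclusion the two-tailed t-test reached by retaining the null.
print(lo < mu0 < hi)
```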

Why don't we test what we want to know?

Our null hypothesis is that body temperature is equal to 98.6°, but we didn't calculate the probability that the population mean is equal to 98.6°.

Our sample mean is 98.7°, but we didn't calculate the probability that the population mean is equal to 98.7°.

Instead, we assumed that the population mean is 98.6°, and calculated the probability that a random sample from the population would result in a sample mean that is at least 0.1° away from 98.6°. This might strike you as being a little more complicated and indirect than just calculating the probability of the hypothetical value for μ. If so, you are correct - it is, in fact, more complicated and indirect. But there are a couple of reasons we calculate the p-value the way we do.

The first reason is that we were able to specify a null hypothetical value for μ exactly, and in advance. This may seem a little odd to you, but null hypotheses are usually easier to specify, because we understand random sampling well enough to predict the average outcome of random sampling exactly. Alternative hypotheses represent what we expect if our experimental treatments are having an effect, and if we had the ability to specify alternative hypotheses exactly we would already know the system so well that hypothesis testing would be unnecessary. We test null hypotheses because we can.

Another advantage people give for null hypothesis testing is that using "no difference" as a default conclusion protects us against confirmation bias, and we need all the help we can get in that regard. For example, since drinking brandy makes us feel warmer, we might be predisposed to believe that it raises body temperature. However, we don't test the hypothesis that drinking brandy makes us warmer - we test the hypothesis that drinking brandy has no effect on body temperature, and only conclude that brandy actually makes us warmer if the mean of our experimental data shows a big enough effect to be statistically significant. This makes it more difficult for us to favor pet hypotheses, and makes it easier for us to objectively evaluate the evidence.

Assumptions of the one-sample t-test

Statistical assumptions are conditions that have to be met in order for a statistical analysis to work correctly. There are two common sources of statistical assumptions: A) general assumptions that ensure that data values are representative of the population from which they are selected, and B) specific requirements needed by particular test procedures for their p-values to be accurate.

General assumptions (A) are common to essentially all statistical hypothesis tests, regardless of how they obtain their p-values. The two most common ones are:

  • Independent observations are needed so that the sample size we report is equal to the number of distinct measurements of the population we have. For example, if we selected a single person and measured their body temperature 20 times we would have less information about human body temperature than if we measured the temperatures of 20 different people once each, because repeated measurements of the same person are not independent.
  • Random sampling is needed to ensure that group means are unbiased estimates of the population mean.

An example of a specific assumption for a one-sample t-test (B) is:

We assume that the data are normally distributed, because if this is true the t-distribution is a very accurate model of a sampling distribution for a continuous variable.

The t-distribution is robust to violations of the normality assumption - meaning that the p-values are still accurate if the data are not normally distributed - provided that the non-normality is not too severe, and that the sample size is large. Large sample sizes allow us to rely on the central limit theorem (CLT) - remember, the CLT tells us that the distribution of sample means is bell-shaped for large sample sizes even when the distribution of data is not, and since the t-distribution is a bell-shaped sampling distribution it works well for large sample sizes with any distribution of data. At small sample sizes a skewed or bimodal distribution of data in the population results in a sampling distribution that is not bell-shaped, and the t-distribution is less accurate as a model. Sample sizes of 30 or more are enough to justify using the t-distribution for all but the most severe violations of normality.
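The robustness claim can be checked by simulation. The sketch below (an illustration, not part of the original example) draws samples from a strongly right-skewed exponential population with a known mean, runs a one-sample t-test of a true null on each sample, and counts false positives. Despite the non-normal data, at n = 30 the false positive rate should land near the nominal 5%.

```python
# Sketch: how non-normal data affects t-test accuracy at small vs. large n.
# The exponential distribution is strongly right-skewed; its true mean here
# is 1.0, so H0: mu = 1.0 is true and every rejection is a false positive.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha = 0.05
trials = 4000

for n in (5, 30):
    rejections = 0
    for _ in range(trials):
        sample = rng.exponential(scale=1.0, size=n)  # true mean = 1.0
        result = stats.ttest_1samp(sample, popmean=1.0)
        if result.pvalue < alpha:
            rejections += 1
    print(f"n = {n:2d}: false positive rate = {rejections / trials:.3f}")
```

With a larger n the rate sits closer to alpha, which is the central limit theorem doing its work.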

We will routinely test for normality before we do t-tests. If we violate normality we will look at the sample size, and if n is 30 or larger we will go ahead and conduct the t-test in spite of the non-normality of our data.
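A minimal sketch of that workflow in Python, using SciPy's Shapiro-Wilk normality test (the simulated temperature data and the 0.05 normality cutoff are assumptions for illustration):

```python
# Sketch: check normality first, then decide whether the t-test is justified.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
temps = rng.normal(loc=98.6, scale=0.3, size=30)  # hypothetical temperature data

w, p_normal = stats.shapiro(temps)  # Shapiro-Wilk test of normality
print(f"Shapiro-Wilk: W = {w:.3f}, p = {p_normal:.3f}")

if p_normal > 0.05 or len(temps) >= 30:
    # Normality is plausible, or n is large enough to lean on the CLT
    result = stats.ttest_1samp(temps, popmean=98.6)
    print(f"t = {result.statistic:.3f}, p = {result.pvalue:.3f}")
else:
    print("Non-normal data and n < 30: the t-test p-value may be inaccurate.")
```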

Errors in hypothesis testing

When we reject or retain the null, we are making a definite conclusion based on a probability, which inevitably means that some of our conclusions will be wrong. We can't eliminate errors from statistical hypothesis testing, but we can quantify the chances that we are making one class of mistakes (false positives), and we can take steps to minimize others (false negatives).

The errors we could make depend on the conclusion we draw.

[Figure: t-distribution when the null is true, with rejection and retention regions marked]

If we reject the null, we are drawing the right conclusion as long as the null is false. If the null is true, rejecting it is a mistake.

We call rejection of a true null a Type I error. Since we think of a rejected null as a positive result, a Type I error is also called a false positive. If we reject the null, the only mistake we could be making is a false positive, Type I error.

We have a lot of control over our Type I error rate - we actually get to specify how big we want it to be. The t-distribution we use in our hypothesis test represents the null hypothesis, so the "Null is true" graph shows a t-distribution with the rejection and retention regions identified. When the null is true, any time a sample mean falls into the rejection region we would reject the null, which would be a mistake.

The rejection region is set by our choice of alpha level. With an alpha level of 0.05 and a true null hypothesis, 5% of sample means will cause us to mistakenly reject the null - thus, the alpha level is our false positive, Type I error rate. Since we set alpha to 0.05 for every test we run, it is always the same regardless of sample size.
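A short sketch of how that rejection region is defined, using the brandy example's n = 20 (so df = 19). The critical value comes from the t-distribution, and by construction the false positive rate under a true null equals alpha:

```python
# Sketch: the two-tailed rejection region at alpha = 0.05 for df = 19.
from scipy import stats

alpha = 0.05
df = 19  # n = 20 in the brandy example

t_crit = stats.t.ppf(1 - alpha / 2, df)  # upper critical value
print(f"Reject H0 when |t| > {t_crit:.3f}")

# When the null is true, exactly alpha of sample means land in the
# rejection region, so the Type I error rate equals alpha by construction.
p_reject = 2 * (1 - stats.t.cdf(t_crit, df))
print(f"P(reject | null true) = {p_reject:.3f}")
```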

[Figure: beta and power - sampling distributions for the null and for an alternative at t = 4]

If we retain the null we reach the correct conclusion if the null hypothesis is true, but we make a mistake if the null hypothesis is false.

We don't know what the actual population mean is if the null is false, but to illustrate the mistake we can make when we retain the null we need to pick one so that a sampling distribution can be drawn. You'll see that this graph uses a value of t = 4 to represent the alternative, which is equivalent to a population mean that is 4 standard errors above the null value (that is, a body temperature 4 × 0.067 = 0.268 degrees above 98.6°, or μ = 98.868).

When we do a t-test, the rejection and retention regions are defined by the null hypothesis, and are the same here as above. However, if the null is false we are randomly sampling from this alternative distribution instead of the null distribution above. The population mean of μ = 98.868 is far enough away from the null that random samples from this population fall into the rejection region most of the time - the yellow part of the curve represents the sample means that cause us to correctly reject the false null.

However, some of the sample means will be far enough below the population mean of μ = 98.868 that they fall into the retention region. When this happens we retain the null in error. The blue shaded part of the distribution is the portion that falls into the retention region, and represents the rate at which we would fail to reject the false null hypothesis.

Failing to reject a null hypothesis that is false is called a Type II error, also known as a false negative. The blue shaded area under the curve is the probability of a Type II error, which is called beta (β).

The yellow portion of the curve also has a name: statistical power. It represents the cases in which we detect an actual difference from the null value, and the probability of correctly rejecting a false null is 1-β. We want to be able to detect real biological effects when they occur, so rejecting a false null is a good thing - we want our experiments to give us good statistical power, as in this illustration.
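The beta and power in this illustration can be computed directly. The sketch below follows the figure in treating the alternative as a t-distribution shifted 4 standard errors above the null (a simplification; an exact calculation would use the noncentral t-distribution), with df = 19 from the brandy example:

```python
# Sketch: beta and power for an alternative 4 standard errors above the null.
from scipy import stats

df = 19          # n = 20
alpha = 0.05
shift = 4.0      # alternative mean, in standard-error (t) units

t_crit = stats.t.ppf(1 - alpha / 2, df)
# beta: probability that a sample mean drawn from the alternative lands
# in the retention region (-t_crit, t_crit)
beta = stats.t.cdf(t_crit - shift, df) - stats.t.cdf(-t_crit - shift, df)
power = 1 - beta
print(f"beta = {beta:.3f}, power = {power:.3f}")
```

With the alternative this far from the null, beta is small and power is high, matching the mostly-yellow curve in the figure.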

But, there is a problem with calculating Type II error rate and power.

If the null hypothesis is false we don't know what the actual population mean is (if we knew we wouldn't have to do any hypothesis testing!). If we picked a different alternative than this one there would be a different amount of overlap with the retention region, and the Type II error rate (and thus power) would be different. Consequently, if we don't know what the actual population mean is then we can't know what the Type II error rate is either.

So, not only do we not get to specify Type II error rate like we do Type I error rate, we don't even know what the Type II error rate is. What we can do, though, is adopt practices in our experiments that make Type II error rates as low as possible, and thereby increase power, even if we don't know what the rates actually are.

Minimizing false negatives, increasing statistical power

So, how do we make β smaller when we don't know its value? In a general sense, we do this by doing one of two things:

1) make the differences we are trying to detect large, or

2) make the size of the standard error small

[Figure: effect of difference size on power - small vs. large difference between the null and alternative means]


Let's look first at how the amount of difference between the null value and the actual value of the population mean affects Type II error rate and power. In the graph on the left the t-distribution for the hypothetical value of 98.6° is shown with a thick gray line, and the rejection regions are also shaded in gray. The alternative (which we're assuming now is true) is shown in black. Instead of using t-values for the x-axis these graphs use body temperature to make it more clear what is happening.

You can click on the graph to change it from a small difference to a large difference. The null curve is in the same place for both, but the alternative switches from 98.9° to 99.5° as you switch from small to large difference, respectively.

The blue-shaded region represents false negative (Type II) errors, and the yellow part of the curve is statistical power. When the actual population mean is close to the null value (small difference) most of the sample means fall into the retention region - which means that we can expect most of our experimental results to give us Type II errors. Power to detect a small difference is therefore low (lots of blue, not much yellow).

When the difference is large, most sample means fall into the rejection region - most of our experimental results will give us correct positive test results, so there will be few false negatives and power will be high (little blue, lots of yellow).

So, for a given sample size and standard error, the bigger the difference is that we're trying to detect the greater our power to detect it will be.
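To put numbers on this, the sketch below compares power for the two alternatives in the graphs (98.9° and 99.5°), approximating each alternative as a central t-distribution shifted by the difference in standard-error units. The standard error of 0.2 is an assumption chosen for illustration, since the figure's scale isn't specified:

```python
# Sketch: power grows with the size of the difference, holding SE constant.
from scipy import stats

df, alpha, null = 19, 0.05, 98.6
se = 0.2  # assumed standard error (illustrative; not from the figure)
t_crit = stats.t.ppf(1 - alpha / 2, df)

powers = {}
for mu_alt in (98.9, 99.5):  # the small and large differences in the graphs
    shift = (mu_alt - null) / se  # difference in standard-error units
    beta = stats.t.cdf(t_crit - shift, df) - stats.t.cdf(-t_crit - shift, df)
    powers[mu_alt] = 1 - beta
    print(f"mu = {mu_alt}: power = {powers[mu_alt]:.3f}")
```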

In experiments the way we influence the size of difference we are trying to detect is in our choice of the size of the treatment. In our brandy experiment we could increase the difference by having our subjects drink more brandy - if drinking brandy affects core body temperature, then drinking more should cause a bigger change.

However, we don't always have much control over the amount of difference we are trying to detect - it is often a property of the system we study, rather than something we get to set. We also have to be careful about increasing dosages of treatments, because we don't want to overdose the subjects.

[Figure: effect of standard error size on power - large vs. small standard error]

The other way of increasing power is to decrease the size of the standard error. You can see the effect of reducing standard error by clicking on the graph to the left, which will switch between curves with large standard errors (broad curves) and small ones (narrow curves). Small standard errors increase the amount of yellow (power), and decrease the amount of blue (Type II error).

It's important to note that the alternative curves for both the large and small standard errors have the same mean of 98.9°, so the increase in power isn't due to the size of the difference getting bigger. Instead, the increase in power comes from the fact that narrower curves give us more precise estimates of the actual population mean. This is reflected in the graph as a smaller retention region and a larger rejection region when the standard error is small - the two curves with a small standard error overlap less, which makes it more likely that we'll sample a mean that falls into the rejection region and (correctly) reject the null.

As we learned when we encountered standard errors as components of confidence intervals, there are two ways to reduce the size of a standard error:

1) Reducing the variability of the data (the sample standard deviation, s), for example by using more precise measurement methods or more homogeneous subjects.

2) Increasing the sample size (n). This is considered the best way to minimize standard error size, and is always beneficial to the study. It may not always be possible to increase sample sizes for practical reasons (e.g. expense, time constraints, ethical considerations), but bear in mind that using small sample sizes can produce such low statistical power that the experiment is not worth doing.
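The sketch below shows both effects of sample size at once: the standard error s/√n shrinks as n grows, and power rises. It uses s = 0.3 from the brandy example and a "small difference" alternative of 98.9°, again approximating the alternative sampling distribution as a shifted central t-distribution (an exact calculation would use the noncentral t):

```python
# Sketch: larger n -> smaller standard error -> higher power.
import math
from scipy import stats

s, null, mu_alt, alpha = 0.3, 98.6, 98.9, 0.05
powers = []
for n in (5, 20, 80):
    df = n - 1
    se = s / math.sqrt(n)                    # standard error shrinks as n grows
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    shift = (mu_alt - null) / se             # difference in standard-error units
    beta = stats.t.cdf(t_crit - shift, df) - stats.t.cdf(-t_crit - shift, df)
    powers.append(1 - beta)
    print(f"n = {n:2d}: se = {se:.3f}, power = {1 - beta:.3f}")
```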

Summing up - statistical errors and power

To summarize, here is a table of the errors you can make depending on whether the null is true or false:


                  Conclusion drawn
Reality           Retain                               Reject
Null is true      No error (1-α)                       Type I error (α), false positive
Null is false     Type II error (β), false negative    No error (power, 1-β)

Bear in mind - you never know if you made an error, because you never know if the null is true or false under real-world experimental conditions. But, if you retain the null the only error you could make is a Type II error, and if you reject the null you could only have made a Type I error.

So, what does brandy do to body temperature?

You may know this already (Mythbusters even did an episode on it), but drinking alcohol does not actually raise your core body temperature. Alcohol causes the capillaries in your skin to dilate (i.e. increase in diameter), which allows more blood to flow from your core to the surface of your body. This makes you feel warm, but by increasing circulation of blood to your body's surface it actually increases heat loss from your core. As long as you aren't in a cold environment that's not a big problem, but for people who already have low core temperatures an increase in heat loss can cause their temperature to drop further, and can cause hypothermia and death.

So, believe it or not, Disney cartoons from the 1930s are not a reliable source of medical information.

Next activity

In this week's activity we will test whether our height-to-stride-length ratios are equal to 1.19, which is the value used to estimate the heights of suspects from stride lengths found at crime scenes.