# 15 Decisions

[latex]\newcommand{\pr}[1]{P(#1)} \newcommand{\pv}{$P$} \newcommand{\var}[1]{\mbox{var}(#1)} \newcommand{\mean}[1]{\mbox{E}(#1)} \newcommand{\sd}[1]{\mbox{sd}(#1)} \newcommand{\Binomial}[3]{#1 \sim \mbox{Binomial}(#2,#3)} \newcommand{\Student}[2]{#1 \sim \mbox{Student}(#2)} \newcommand{\Normal}[3]{#1 \sim \mbox{Normal}(#2,#3)} \newcommand{\Poisson}[2]{#1 \sim \mbox{Poisson}(#2)} \newcommand{\se}[1]{\mbox{se}(#1)} \newcommand{\prbig}[1]{P\left(#1\right)}[/latex]

A confidence interval gives a range of plausible values for a population parameter, based on the sample data, at some level of confidence. A **hypothesis test** uses the same underlying ideas but instead looks to see if a particular value is plausible, based on the sample data.

# Means

Consider the 10 subjects in Alice’s table in Chapter 2 who drank the decaffeinated cola. Based on this sample, is there any evidence that the cola without caffeine increases pulse rate?

The population parameter here is [latex]\mu[/latex], the mean increase in pulse rate experienced by people in the population these subjects were taken from 30 minutes after drinking 250 mL of the decaffeinated cola. The null hypothesis is that there is no increase, so we’d write [latex]H_0: \mu = 0[/latex]. We’re interested in seeing if there is actually an increase, so we have a one-sided alternative, [latex]H_1: \mu \gt 0[/latex].

The 10 observations have mean [latex]\overline{x}[/latex] = 5.10 bpm with standard deviation [latex]s = 5.587[/latex] bpm. The [latex]P[/latex]-value for this test is the probability of getting a mean increase of [latex]\overline{x}[/latex] = 5.10 bpm if the real population mean was [latex]\mu = 0[/latex]. We calculate this by finding out how many standard errors 5.10 is above the expected value of 0. The standard error of the sample mean is

\[ \se{\overline{x}} = \frac{5.587}{\sqrt{10}} = 1.767 \mbox{ bpm}, \]

with [latex]10 - 1 = 9[/latex] degrees of freedom.

The [latex]P[/latex]-value is thus

\[ \pr{\overline{X} \ge 5.10} = \prbig{T_9 \ge \frac{5.10 – 0}{5.587/\sqrt{10}}} = \pr{T_9 \ge 2.89}. \]

Using Student’s T distribution table we see that this probability is between 0.005 and 0.01, while with software we can calculate that

\[ \pr{T_9 \ge 2.89} = 0.009, \]

but either way we can conclude that there is strong evidence of an increase in pulse rate for people drinking 250 mL of decaffeinated cola.

The general method for a test of [latex]H_0: \mu = \mu_0[/latex] uses the [latex]t[/latex] statistic

\[ t_{n-1} = \frac{\overline{x} - \mu_0}{s/\sqrt{n}}, \]

the number of standard errors that [latex]\mu_0[/latex] is from the estimate [latex]\overline{x}[/latex].
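As an illustration (not part of the original example), this [latex]t[/latex] statistic can be sketched in Python from the summary statistics alone; the function name `one_sample_t` is our own choice:

```python
import math

def one_sample_t(xbar, s, n, mu0):
    """t statistic for H0: mu = mu0, computed from summary statistics."""
    se = s / math.sqrt(n)        # standard error of the sample mean
    return (xbar - mu0) / se

# Decaffeinated-cola example: xbar = 5.10 bpm, s = 5.587 bpm, n = 10
t_stat = one_sample_t(5.10, 5.587, 10, 0)
print(round(t_stat, 2))          # 2.89, with 10 - 1 = 9 degrees of freedom
```

The corresponding [latex]P[/latex]-value then comes from the [latex]t[/latex] distribution with [latex]n-1[/latex] degrees of freedom, via tables or software.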

## Cavendish

The Appendix includes 23 measurements of the mean density of the Earth made by Cavendish after he changed the suspension wire. These have a mean of 5.4835 g/cm[latex]^3[/latex] and a standard deviation of 0.1904 g/cm[latex]^3[/latex]. The currently accepted value of the mean density of the Earth is 5.517 g/cm[latex]^3[/latex]. Is there any evidence that Cavendish’s measurements were biased?

Suppose [latex]\mu[/latex] is the mean of all measurements from Cavendish’s experimental apparatus. You can think of this as an infinite population since, in principle, you could repeat the experiment as many times as desired. Our null hypothesis is [latex]H_0 : \mu = 5.517[/latex], that the equipment is giving the true mean density. The alternative hypothesis is **two-sided** since we are interested in any possible bias, either positive or negative. That is, [latex]H_1 : \mu \ne 5.517[/latex].

We want to know how likely a sample mean of 5.4835 is if the real value was 5.517. Of course the probability of 5.4835 exactly is 0 (for continuous data). Instead we look at how likely it is to get a value as extreme or more extreme than the one observed. Any sample mean less than 5.4835 would be just as much evidence of bias, so we calculate [latex]\pr{\overline{X} \le 5.4835}[/latex]. We can do this by standardising, noting that we use [latex]T[/latex] instead of [latex]Z[/latex] since our standard error involves a sample standard deviation:

\[ \pr{\overline{X} \le 5.4835} = \prbig{T \le \frac{5.4835 - 5.517}{0.1904/\sqrt{23}}} = \pr{T \le -0.84}. \]

Remember that we subtract the expected value [latex]E(\overline{X}) = \mu[/latex] to standardise. Since we are assuming that [latex]H_0[/latex] is true we can substitute 5.517 for this in the formula.

Thus 5.4835 is 0.84 standard errors below the true value and we can use software or the tables to find the [latex]P[/latex]-value. However, if we had seen a value 0.84 standard errors **above** the true value then this would have been the same level of bias. For a two-sided test, we include both areas in calculating the [latex]P[/latex]-value. Since the [latex]t[/latex] distribution is symmetric, we can just find one side and then double it.

Student’s T distribution table gives [latex]\pr{T \le -0.84} \simeq 0.20[/latex] so the two-sided [latex]P[/latex]-value is about 0.40. This suggests that a value of 5.4835 is quite likely to happen due to sampling variability. Thus there is no evidence to suggest that Cavendish’s equipment was subject to measurement bias.
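The standardisation step for the Cavendish data can be checked with a short Python sketch (ours, not from the text); the two-sided [latex]P[/latex]-value would then come from tables or software by doubling the one-sided tail area:

```python
import math

# Cavendish's 23 post-modification measurements: summary statistics
xbar, s, n = 5.4835, 0.1904, 23
mu0 = 5.517                      # currently accepted mean density (g/cm^3)

se = s / math.sqrt(n)            # standard error of the sample mean
t_stat = (xbar - mu0) / se
print(round(t_stat, 2))          # -0.84: 0.84 standard errors below mu0
```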

# Types of Errors

In the previous section we saw how to use hypothesis testing to find evidence of an effect. For example, in the caffeine experiment the 10 students who drank the caffeinated cola had a mean increase in pulse rate of [latex]\overline{x} = 15.80[/latex] bpm with standard deviation [latex]s = 8.324[/latex] bpm. Testing the null hypothesis [latex]H_0: \mu = 0[/latex], that there is no mean increase in pulse rate, uses the [latex]t[/latex] statistic

\[ t_{9} = \frac{15.80 - 0}{8.324/\sqrt{10}} = 6.00. \]

With a one-sided alternative, [latex]H_1: \mu \gt 0[/latex], the [latex]P[/latex]-value is then 0.0001, very strong evidence against the null hypothesis. Hence we conclude that there is evidence that drinking the caffeinated cola does increase pulse rate.

But what if our test result was not significant? Does this mean that [latex]H_0[/latex] is true and there really is no effect?

Consider a small study which tried to show that drinking lemonade increased pulse rate. Five subjects had their pulse rate measured beforehand and then 30 minutes after a 250 mL drink of lemonade. The following table shows the resulting measurements of pulse rate.

## Pulse rates (bpm) before and after lemonade

| Subject | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| Before | 66 | 88 | 65 | 77 | 88 |
| After | 66 | 91 | 69 | 75 | 100 |

Here the mean increase is [latex]\overline{x} = 3.40[/latex] bpm with standard deviation [latex]s = 5.367[/latex] bpm. On average there has been an increase for these 5 subjects. However, with the same hypotheses as before, the [latex]t[/latex] statistic

\[ t_{4} = \frac{3.40 - 0}{5.367/\sqrt{5}} = 1.42 \]

gives a one-sided [latex]P[/latex]-value of 0.115. Thus there is no evidence against [latex]H_0: \mu = 0[/latex], so we retain our null hypothesis.
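Since the raw lemonade measurements are given in the table above, the whole calculation can be reproduced from scratch. A minimal Python sketch (our own illustration):

```python
import math
import statistics

before = [66, 88, 65, 77, 88]
after  = [66, 91, 69, 75, 100]
diffs = [a - b for a, b in zip(after, before)]   # increases in pulse rate

xbar = statistics.mean(diffs)                    # 3.4 bpm
s = statistics.stdev(diffs)                      # 5.367 bpm
t_stat = xbar / (s / math.sqrt(len(diffs)))      # test of H0: mu = 0
print(round(t_stat, 2))                          # 1.42, with 4 degrees of freedom
```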

So does this mean that indeed [latex]\mu = 0[/latex], that on average drinking lemonade does not increase pulse rate?

Recall the role of the [latex]P[/latex]-value in hypothesis testing. For the caffeinated cola example the [latex]P[/latex]-value of 0.0001 means that there is a 1 in 10,000 chance that we could see a mean increase in pulse rate of 15.80 bpm from 10 subjects if in fact caffeinated cola had no effect on pulse rate. This seems pretty unlikely so we reject the assumption of no effect, concluding that there is evidence of an increase. But of course there is a 1 in 10,000 chance that we are wrong, that really there was no effect and we got the result by chance. If this was the case then we’d say we’ve made a **Type I error**. The ‘Reject’ column of the table below shows these two possibilities.

## Four scenarios for significance testing

| Decision | Retain | Reject |
|---|---|---|
| [asciimath]H_0[/asciimath] is true | Correct [asciimath](1-\alpha)[/asciimath] | Type I Error [asciimath](\alpha)[/asciimath] |
| [asciimath]H_0[/asciimath] is false | Type II Error [asciimath](\beta)[/asciimath] | Correct [asciimath](1-\beta)[/asciimath] |

The probability of making a Type I error is the probability of getting extreme data just by chance when there is really no effect present. If we reject the null hypothesis for any [latex]P[/latex]-value below 0.05 then the upper limit on the probability of making a Type I error is also 0.05. This probability is usually denoted by [latex]\alpha[/latex], the significance level used for testing.

The other kind of error we can make is if the null hypothesis is actually false but we don’t find any evidence against it. For the lemonade example the [latex]P[/latex]-value of 0.115 gave no significant evidence of an increase in pulse rate. However, the data for that example actually came from a population where lemonade *did* increase pulse rate, with mean [latex]\mu = 5[/latex] bpm and [latex]\sigma = 6[/latex] bpm, so our decision to retain the null hypothesis was wrong! In this case we say we’ve made a **Type II error**.

What then is the probability, [latex]\beta[/latex], of making a Type II error? Well, what is the probability of *not* making a Type II error, of actually rejecting [latex]H_0[/latex] when it’s false?

We can estimate this probability by randomly generating samples of size [latex]n=5[/latex] from a distribution where [latex]\mu=5[/latex] bpm and [latex]\sigma=6[/latex] bpm and seeing how often we get a [latex]P[/latex]-value less than 5%. For example, the table below shows the results from ten samples, each of size 5, taken from a [latex]\mbox{Normal}(5, 6)[/latex] distribution. The mean and standard error from each sample is given, along with the resulting [latex]t[/latex] statistic and [latex]P[/latex]-value for a test of [latex]H_0: \mu=0[/latex] against [latex]H_1: \mu \gt 0[/latex].

## Tests of [asciimath]H_0: \mu=0[/asciimath] from ten samples where [asciimath]n=5, \mu=5, \sigma=6[/asciimath]

| Sample | | | | | [asciimath]\overline{x}[/asciimath] | [latex]\mathrm{se}({\overline{x}})[/latex] | [asciimath]t[/asciimath] | [asciimath]p[/asciimath] |
|---|---|---|---|---|---|---|---|---|
| 6.1 | 17.2 | 3.3 | 8.0 | 6.4 | 8.20 | 2.3738 | 3.45 | 0.013 |
| 20.6 | 0.6 | 12.1 | -6.9 | 0.1 | 5.30 | 4.8903 | 1.08 | 0.170 |
| 6.4 | -0.9 | 5.8 | 1.2 | -2.6 | 1.98 | 1.7890 | 1.11 | 0.165 |
| 5.5 | 10.5 | -5.4 | 8.6 | -4.3 | 2.98 | 3.2993 | 0.90 | 0.209 |
| 0.5 | 15.2 | 1.4 | 10.7 | 1.7 | 5.90 | 2.9714 | 1.99 | 0.059 |
| 0.4 | 3.7 | -2.3 | 15.2 | 3.4 | 4.08 | 2.9875 | 1.37 | 0.122 |
| -1.2 | 0.5 | 5.4 | 3.2 | 14.7 | 4.52 | 2.7841 | 1.62 | 0.090 |
| 0.2 | -3.0 | -3.1 | 11.1 | 9.0 | 2.84 | 3.0210 | 0.94 | 0.200 |
| 5.6 | 9.2 | -3.3 | 10.1 | 11.3 | 6.58 | 2.6468 | 2.49 | 0.034 |
| 9.8 | -0.3 | 4.9 | 3.4 | 6.4 | 4.84 | 1.6663 | 2.90 | 0.022 |

In three of these ten samples the [latex]P[/latex]-value was significant at the 5% level and for each of these we would reject the null hypothesis that there was no change in pulse rate. The null hypothesis is actually wrong here, since there is an underlying increase of [latex]\mu = 5[/latex] bpm, so these three decisions are correct. In the remaining seven samples the [latex]P[/latex]-value was more than 0.05, giving no evidence against [latex]H_0[/latex]. Each of those seven samples results in a Type II error.

So trying to detect a 5 bpm increase in pulse rate with a sample of size 5 does not seem very likely to be successful. Most of the time we fail to find evidence even though there is an effect. To get a more accurate estimate of the probability of success we could continue doing this simulation. The figure below shows the distribution of [latex]P[/latex]-values from 10000 tests of [latex]H_0: \mu=0[/latex] where the underlying mean increase was 5 bpm. The shaded area gives the proportion of tests that successfully found evidence of the increase, equal to 0.4649.

We call this probability, 0.465, the **power** of the test. If we didn’t find evidence then we’d be wrong, since here [latex]\mu \gt 0[/latex], so the probability of making a Type II error in this case is [latex]1 - 0.465 = 0.535[/latex]. Thus we estimate [latex]\beta = 0.535[/latex].

Software can be used to directly calculate power for various models. Assuming the data is coming from a Normal distribution with [latex]\mu=5[/latex] bpm and [latex]\sigma=6[/latex] bpm, the power of the (one-sided) [latex]t[/latex] test is actually 0.460, very close to the value from our simulation, giving [latex]\beta = 0.540[/latex].
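The simulation described above is easy to reproduce. Here is a minimal Python sketch (our own, using the tabled one-sided 5% critical value 2.132 for 4 degrees of freedom as the rejection rule); the exact proportion will vary slightly with the random seed:

```python
import math
import random
import statistics

random.seed(1)
T_CRIT = 2.132          # one-sided 5% critical value for t with 4 df (from tables)

def reject(sample, mu0=0):
    """One-sided t test of H0: mu = mu0, rejecting when t > T_CRIT."""
    n = len(sample)
    t = (statistics.mean(sample) - mu0) / (statistics.stdev(sample) / math.sqrt(n))
    return t > T_CRIT

# Draw 10000 samples of size 5 from Normal(mu=5, sigma=6) and count rejections
trials = 10000
power = sum(reject([random.gauss(5, 6) for _ in range(5)]) for _ in range(trials)) / trials
print(power)            # close to the true power of 0.46
```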

# Improving Power

Having [latex]\beta = 0.540[/latex] is pretty depressing, since it means we have a less than even chance of detecting an effect that exists. Why waste the time and resources setting up this experiment to detect a 5 bpm increase in pulse rate when it only has a 46% chance of finding evidence? The obvious question, then, is how we can make our test more powerful.

There are two main limitations with the current situation. Firstly, the effect we are trying to detect ([latex]\mu = 5[/latex] bpm) is relatively small, particularly in comparison with the standard deviation ([latex]\sigma = 6[/latex] bpm). Secondly, we are trying to detect this small effect with only a small number of subjects ([latex]n = 5[/latex]) in our study.

The following figure shows the relationship between power and the mean increase we are trying to detect. For example, the power is 46% when the effect size is 5 bpm, as before. However, if the true increase was actually 10 bpm then a study with 5 subjects would have a 92% chance of detecting the effect. So the power of the test is greater if the effect is greater.

This may seem a strange point to consider since in practice we don’t have any control over this parameter. However, we can usually decide what effect size we would be interested in detecting, and it should be clear from our discussion of power that this should be part of the study’s design. For example, perhaps an increase of 5 bpm in pulse rate would not be of any physiological concern and we would only be interested in detecting increases of at least 10 bpm. In that case 5 subjects may be sufficient, giving power of at least 92%.

The actual relationship between power and effect size will also depend on the variability, [latex]\sigma[/latex]. If the effect size is [latex]\mu[/latex] then the standardised increase that will appear in the [latex]t[/latex] test is

\[ \frac{\mu}{\sigma/\sqrt{n}} = \frac{\mu}{\sigma} \sqrt{n}. \]

We call the ratio [latex]\phi = \mu/\sigma[/latex] the **signal-to-noise ratio**, or **noncentrality parameter**, a measure of the effect size relative to the variability. The figure below shows the same plot as the previous figure but with the horizontal scale transformed to the corresponding signal-to-noise ratio.
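For the pulse rate example this ratio is quick to compute (an illustration of ours, not from the text):

```python
import math

mu, sigma = 5, 6                             # effect size and variability (bpm)
phi = mu / sigma                             # signal-to-noise ratio, about 0.83
for n in (5, 10):
    print(n, round(phi * math.sqrt(n), 2))   # standardised effect phi * sqrt(n)
```

Doubling the sample size from 5 to 10 raises the standardised effect from about 1.86 to about 2.64 standard errors, which is why the larger study has substantially higher power.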

The signal-to-noise ratio, and hence power, can be improved by reducing the variability, [latex]\sigma[/latex]. This could be done by improving measurement error, for example, but the most common way of reducing [latex]\sigma[/latex] is to incorporate **covariates** that help explain the variability.

If we were still interested in detecting a smaller effect, such as an increase of 5 bpm, then power can be increased by increasing the sample size. The figure above also shows the power curve for [latex]n = 10[/latex]. For a 5 bpm increase the power is then 78%, substantially better than the 46% chance for [latex]n = 5[/latex].
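The effect of sample size on power can be checked by extending the earlier simulation idea. This sketch (ours, assuming the same Normal(5, 6) model and the tabled one-sided 5% critical values, 2.132 for 4 df and 1.833 for 9 df) compares [latex]n = 5[/latex] with [latex]n = 10[/latex]:

```python
import math
import random
import statistics

random.seed(1)

def sim_power(n, t_crit, mu=5, sigma=6, trials=10000):
    """Estimate the power of a one-sided t test by simulation."""
    hits = 0
    for _ in range(trials):
        sample = [random.gauss(mu, sigma) for _ in range(n)]
        t = statistics.mean(sample) / (statistics.stdev(sample) / math.sqrt(n))
        hits += t > t_crit                   # reject H0: mu = 0 when t > t_crit
    return hits / trials

print(sim_power(5, 2.132))                   # about 0.46
print(sim_power(10, 1.833))                  # about 0.78
```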

## Significance Level

While sample size is obviously something we can control in our study design, there is also a relationship between power and the significance level of the test. The previous two figures of signal-to-noise ratio both gave power curves assuming we were looking for significance at the 5% level. Suppose we wanted to make fewer Type I errors and decide to test at the 1% level instead. The following figure shows the effect on power. The curve for 1% is well below that for 5%. For our original example with [latex]\mu = 5[/latex] bpm ([latex]\sigma = 6[/latex] bpm) the power drops to just 7% if we demand significance at the 1% level.

There is thus a clear trade-off between significance and power. Reducing [latex]\alpha[/latex] increases [latex]\beta[/latex], and vice versa. All things being equal, trying to make fewer Type I errors results in making more Type II errors.

Related to significance level is the issue of whether you are conducting a one-sided or two-sided test. In our discussion here we have used a one-sided [latex]t[/latex] test as illustration since it is a little easier to think about the direction of significance. However, in practice, two-sided tests are standard and most statistical methods assume you want two-sided [latex]P[/latex]-values.

The figure below shows the sample size required for a one-sample [latex]t[/latex] test to obtain 80% power in detecting the desired signal-to-noise ratio. You should thus work with the solid line when choosing the sample size for your design — the savings from using a one-sided test are small relative to the other factors affecting power.

We will look at many other hypothesis tests in the coming chapters and each has its corresponding power analysis. For example, Chapter 17 includes the simple power calculations required for a Binomial test. However in general the power calculations required become more complicated and so we will not cover these in this book. There are many good references which include power curves for particular methods, as well as details on the calculations involved (Cohen, 1988). The paper by Muller and Benignus (1992) gives a good overview of the role of power analysis and further considerations when using it in practice.

# Summary

- A test result showing significant evidence of an effect would be incorrect if the null hypothesis was actually true. This would be a Type I error. The probability of a Type I error is the significance level, [latex]\alpha[/latex], used in testing.
- A test result showing no significant effect does not mean the null hypothesis is true. We may have made a Type II error.
- A study design should include a statement of what effect size is of interest. This will allow determination of sample size needed for the desired power.
- There is a trade-off between the rates of Type I errors ([latex]\alpha[/latex]) and Type II errors ([latex]\beta[/latex]).

## Exercise 1

Based on the data in the Appendix, is there any evidence that Newcomb’s apparatus was biased in estimating the passage time of light?

## Exercise 2

A person’s height is roughly 6 times the length of their forearm. Based on the survey data, is there any evidence that the ratio of Islander height to forearm length is different to 6?

## Exercise 3

What is the intercept, the power when the mean increase is 0, in the previous figure?

## Exercise 4

Suppose that drinking two standard (alcoholic) drinks increases reaction time by [latex]\mu = 2[/latex] seconds with [latex]\sigma = 3[/latex] seconds. If we want to carry out a significance test with [latex]\alpha = 0.05[/latex] and [latex]\beta = 0.20[/latex] (so power is 80%) what sample size is needed to detect this increase?

## Exercise 5

Using the data in the Appendix, replicate the first ever [latex]t[/latex] test to determine whether there is any evidence that substance 2 gives more additional hours of sleep on average than substance 1.