
2 Designing Studies

In Chapter 1 we introduced the basic notion of variability and defined the types of variables that can be recorded, ending with a survey that showed examples of these. In Chapter 3 we will start looking at how we can describe and summarise the variability we see in such data. However, before we do so it is worth looking a little more closely at how we set up studies to give us data that is worth analysing in the first place.

Comparative Experiments

Alice is a student in an introductory statistics course. Suppose Alice has a drink of caffeinated cola and notices that her pulse rate seems to be higher sometime afterwards. Does this mean that the drink increases pulse rate?

Of course not. This is an anecdotal association between a precursor and an effect, but it is not necessarily a causal association. There could be many other explanations for the link. Perhaps Alice climbed some stairs in the meantime, or received an exciting text message. Or perhaps she expected the drink to be a stimulant, since it contained caffeine, and the increase in pulse rate was a purely psychosomatic response. This is known as the placebo effect, where an inactive drug can have a real effect based on expectation.

So what could Alice do to determine whether the caffeine intake really did produce a physiological response? She could start by designing a clear protocol for making measurements that would help reduce some of these issues. For example, she might measure her pulse rate and then drink exactly 250 mL of the cola, measuring her pulse rate again after remaining seated for 30 minutes. Doing this, she finds her initial and final pulse rates were 65 bpm and 77 bpm, respectively, an increase of 12 bpm.

This is a good first step since the increase has now been quantified, so we can get an idea of the magnitude of the increase. The protocol is also useful because it allows the experiment to be replicated. Alice could repeat the protocol the next day and see if she obtains similar results.

However, this design still has a few flaws. It doesn’t address the possibility that the increase is psychological, since Alice is still expecting the effect (and may now even expect it more strongly since it has been quantified). A related problem is that the measurements are for Alice alone. Perhaps the cola only has this effect on her and for other people there is no change in pulse rate. Finally, even if it was the drink that was having the effect, it wouldn’t mean that the caffeine is responsible. The cola also contains other ingredients, such as sugar, which may have an effect on pulse rate.

The next steps then are to get more subjects to participate in this study and to introduce a control group into the design. Alice finds 20 friends to be subjects in her study and she uses her original protocol for each one, recording pulse rate before and after a 250 mL drink of cola. However, she gets 10 of the subjects to drink the caffeinated cola while the other 10 drink a decaffeinated cola. We call these two different drinks the treatments of the experiment. The decaffeinated treatment acts as a control, a placebo treatment in terms of caffeine content, against which we can compare the treatment of interest. If the other ingredients of the drinks are similar, and Alice finds a difference between the treatments in terms of the average increase in pulse rate, then this will more strongly suggest that it is really the caffeine that is responsible. This type of study is called a comparative experiment, as illustrated in the figure below.

Structure of a comparative experiment

Replication within the two groups is very important. If we only had one subject in each group and we found a difference between them we would have no idea whether the difference was between the treatments or whether it was just due to the natural variability between the subjects. Having replication in each treatment allows us to estimate the scale of the natural variability. We can then see if the difference between the treatments is significantly more than what could be explained by natural variability.

Comparative experiments are great, but there are still two possible reasons why a difference found may not be the result of the caffeine. The first goes back to the psychological effect mentioned above. It may be that a subject drinking caffeine might expect their pulse rate to increase and so their pulse rate may indeed increase because of this psychological expectation, rather than because of the caffeine itself. The easy solution to this problem is to make the experiment blind by not telling the subjects which drink they have. It is even better to have a double-blind experiment in which Alice doesn’t know either, just in case her interactions with the subjects or her measurements of pulse rate are biased by knowing which have the caffeine. (A third party, not involved in the experiment process, can be asked to pour the drinks and record which subject gets which treatment, keeping the allocation secret until all the data has been collected.)

The second problem is related to a bias that comes from confounding variables. Suppose that 10 of Alice’s friends were male and 10 were female. A simple allocation to the treatments might give caffeinated cola to the 10 males and decaffeinated cola to the 10 females. This would be a poor design, because any effect of sex could not be separated from the effect of caffeine. For example, males might have a different response to the sugar in the drinks than females. Supposing caffeine had no effect on pulse rate, we might still see a higher increase in the caffeine group because the males experienced a higher increase from the sugar.

There are two solutions to avoid possible confounding. If we know or suspect that a variable, such as sex, does have an effect on the response we are measuring then we can use blocks in our design. For example, Alice could first split her subjects into two blocks, one with males and one with females, and then allocate the two treatments within each block, as shown in the figure below. We will then be able to compare the caffeine effect relative to each sex, as well as comparing the difference between the sexes (see Chapter 21).

Structure of a block design
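To make the idea of allocating treatments within blocks concrete, here is a minimal Python sketch. The subject labels (M1 to M10, F1 to F10), the function name `block_randomise` and the seed are all hypothetical choices for illustration, not part of Alice's study. Within each block the subjects are shuffled and then dealt out to the treatments in turn, so each treatment appears equally often in each block.

```python
import random

def block_randomise(blocks, treatments, seed=None):
    """Randomly allocate treatments within each block, as evenly as possible."""
    rng = random.Random(seed)
    allocation = {}
    for block_name, subjects in blocks.items():
        shuffled = list(subjects)
        rng.shuffle(shuffled)  # random order within this block only
        # Deal the shuffled subjects out to the treatments in turn.
        for i, subject in enumerate(shuffled):
            allocation[subject] = (block_name, treatments[i % len(treatments)])
    return allocation

# Hypothetical labels: M1..M10 are the male subjects, F1..F10 the female subjects.
blocks = {
    "male": [f"M{i}" for i in range(1, 11)],
    "female": [f"F{i}" for i in range(1, 11)],
}
treatments = ["caffeinated", "decaffeinated"]
allocation = block_randomise(blocks, treatments, seed=1)

for subject, (block, treatment) in sorted(allocation.items()):
    print(subject, block, treatment)
```

With 10 subjects per block and two treatments, each combination of sex and drink ends up with exactly 5 subjects, whatever the random order turns out to be.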

If we don’t particularly suspect that another variable has an effect then the safest option is to randomly assign our subjects to the treatment groups. This randomisation is an important step in comparative experiments. It means that on average the two groups here will have the same proportions of males and females. Even if there is a difference in increase in pulse rate between the sexes, it will not matter much, since both treatments receive a fair allocation of males and females. The two groups could be quite different if there were only a couple of subjects in each, but as the number of subjects increases they will become more and more similar in terms of the sexes. They will also be similar, on average, in their distributions of heights, ages, weights, fitness levels and any other variable that might have some influence on changes in pulse rate, even ones we’ve never thought of. Thus if we find a difference in the average increases between the caffeine group and the decaffeinated group that is larger than natural variability could explain, then we can conclude that the caffeine caused the difference.
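The claim that random groups become more alike as the number of subjects grows can be explored with a small simulation. The sketch below is only an illustration under assumed conditions: equal numbers of males and females, two equal groups, and arbitrary choices for the seed and the 5000 repetitions. It records the proportion of males that land in the first group each time.

```python
import random
from statistics import mean, pstdev

def proportion_male_in_first_group(n_per_sex, rng):
    """Randomly split n_per_sex males and n_per_sex females into two equal
    groups and return the proportion of males in the first group."""
    subjects = ["M"] * n_per_sex + ["F"] * n_per_sex
    rng.shuffle(subjects)
    first_group = subjects[:n_per_sex]  # half of the 2 * n_per_sex subjects
    return first_group.count("M") / n_per_sex

rng = random.Random(2024)  # arbitrary seed, for reproducibility
results = {}
for n_per_sex in (2, 10, 50):
    props = [proportion_male_in_first_group(n_per_sex, rng) for _ in range(5000)]
    results[n_per_sex] = (mean(props), pstdev(props))
    print(n_per_sex, round(results[n_per_sex][0], 3), round(results[n_per_sex][1], 3))
```

The average proportion should sit close to 0.5 for every group size, while the spread around 0.5 shrinks as the groups get larger.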

Data

In all we have needed a lot of careful consideration about how we will obtain our data before we start our measurements. Finally Alice was able to conduct her experiment. She did not think sex would be an important factor in pulse rate so she stuck with the simple comparative design of the previous figure, randomly allocating her 20 friends to the caffeinated and decaffeinated drinks. The results are shown in the following table.

Pulse rates (bpm) before and after diet cola

  Caffeinated
Before 70 70 75 81 72 80 67 75 64 76
After 87 92 96 97 78 78 94 90 80 96
Decaffeinated
Before 96 65 90 90 86 89 73 69 75 64
After 100 75 97 81 91 93 78 76 81 76

Now of course Alice would like to know the results of her experiment. Did the caffeine group experience a higher average increase in pulse rate than the decaffeinated group? How to answer this type of question will be the aim of the remainder of this book.

Note that we have by no means considered all the issues involved in experimental design here. For example, Alice had 20 friends to be subjects in this experiment. She split them into two groups of 10 each, but is this the best thing to do? Could it be better to have 15 drinking the caffeinated cola, since this was the main substance of interest, with just 5 drinking the decaffeinated cola as the control? It turns out that for various reasons it is actually mathematically optimal to have equal group sizes. We will discuss one of these reasons in Chapter 16.
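One common argument for equal group sizes, which may or may not be the one taken up in Chapter 16, is that the variance of a difference between two group means is proportional to 1/n1 + 1/n2 when the two groups are independent and equally variable. The sketch below assumes exactly that and compares every way of splitting 20 subjects; the function name `relative_variance` is our own.

```python
def relative_variance(n1, n2):
    """Variance of the difference in group means, in units of the common
    variance of a single observation (assumed equal in both groups)."""
    return 1 / n1 + 1 / n2

# Every split of 20 subjects into two non-empty groups.
splits = [(n1, 20 - n1) for n1 in range(1, 20)]
best = min(splits, key=lambda split: relative_variance(*split))

print(best)                       # (10, 10)
print(relative_variance(10, 10))  # 0.2
print(relative_variance(15, 5))   # about 0.267
```

Under these assumptions the 15 and 5 split gives a difference that is about a third more variable than the 10 and 10 split, so it would be harder to detect the same caffeine effect.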

There is also no reason why the experiment should only compare two treatments. For example, we might also consider the effect on pulse rate of three different amounts of extra sugar in the cola. We could choose to add 10 g or 20 g of sugar to each 250 mL drink, or leave the drink without extra sugar. Our experiment would then have two factors, the caffeine content and the sugar content, with two levels and three levels, respectively. The number of possible treatments is the product of the numbers of levels for each factor, so here we would have six treatments to consider.
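The six treatments can be listed by pairing every level of one factor with every level of the other. A short Python sketch, with variable names of our own choosing:

```python
from itertools import product

caffeine_levels = ["caffeinated", "decaffeinated"]  # factor 1: two levels
extra_sugar_g = [0, 10, 20]                         # factor 2: three levels

# Each treatment is one combination of a caffeine level and a sugar level.
treatments = list(product(caffeine_levels, extra_sugar_g))

for caffeine, sugar in treatments:
    print(f"{caffeine} cola with {sugar} g extra sugar")
print(len(treatments))  # 2 * 3 = 6
```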

The amount of sugar can actually be considered either a categorical or continuous variable in this context, and the way we view it will dictate how we will analyse the data from the experiment. The later chapters of this book will essentially work through the different types of variables we could have in an experiment and look at how the resulting data should be analysed.

Scientific Hypotheses

The term “hypothesis” has a particular meaning in the context of statistical decision making, as we will see starting in our section on evidence. However, our motivation for using statistical methods in this book is as a framework for scientific investigation and in the philosophy of science the term hypothesis has a different meaning. There it is used to describe a tentative explanation for some phenomenon or association between phenomena. This is in contrast to a prediction which describes an expected outcome, based on the hypothesis (Niaz, 2004).

For Alice her hypothesis was that the caffeine in the cola was causing an increase in pulse rate. Her prediction was that if she carried out the randomised comparative experiment shown in our earlier figure then the caffeine group would show a higher mean increase than the decaffeinated group. Being able to make predictions and test them is an important attribute of a good hypothesis (Popper, 1974). The cycle of making a hypothesis and testing its predictions is known as the hypothetico-deductive method, since the predictions are deduced from the hypothesis.

Although the above discussion of Alice’s experimental design was somewhat fictional, the data given in the previous table is from a real experiment conducted by two statistics students at the University of Queensland. What makes this a nice example is that the students had a simple scientific hypothesis they wanted to test – that caffeine in cola increases your pulse rate. They devised this elegant experiment that would allow them to test the prediction while eliminating as many confounding factors as possible.

Observational Studies

An observational study has the same aims as an experiment but is passive, often working with existing data such as medical records. The problem here is that it is difficult to establish causation. There are many studies where it has been shown that people who smoke have higher rates of lung cancer than those who don’t. However, on its own this does not mean that smoking is causing the lung cancer. It could be that the people who are predisposed to smoking are also predisposed to developing lung cancer, perhaps because of some other factor.

The solution would be to do an experiment where you randomly split a group of subjects and force half of them to smoke and half of them not to smoke. If you found higher rates of lung cancer in the smoking group then this would be proof that smoking caused the higher rate, whether physically or through some other chain of factors, since all other factors were equal in the groups. Of course, we cannot do such experiments for ethical reasons. It is for this reason that observational studies are an important part of medical research, and efforts must be made to establish causation in addition to the association seen in the study.

Randomisation

Suppose Alice’s experiment does show that the caffeinated cola increases pulse rate by around 10 bpm more than the decaffeinated cola. Can she conclude that the caffeine increases pulse rate for everybody?

Well, yes and no. We will think of Alice’s 20 friends as a sample from a larger population. For example, if Alice’s friends were all fit young adults then it might be reasonable to conclude that caffeine will increase pulse rate by roughly an extra 10 bpm on average for the population of all fit young adults. However, caffeine may have a different effect on infants or the elderly and it would not be appropriate to make conclusions about them based on Alice’s sample. The figure below emphasises the population context in this way.

Experiments as sampling from a population

The two steps in the figure should both use randomisation where possible. In Alice’s story we noted the importance of randomisation in helping reduce bias and protect against confounding variables, while using random samples from populations allows us to make more representative conclusions from our results. Given this key role in experimental design, it is worth considering how to randomise in practice.

For example, if we wish to compare 2 treatments and have 20 subjects available, how should we assign the subjects to the treatments? What is needed to do this is a way of choosing subjects randomly, and the randomness has to be an objective kind of randomness. People are very poor at generating randomness, so having someone select subjects by “picking at random” is not sufficient for this purpose.

Random digits

                   
2 4 5 8 1 6 2 6 6 3
6 7 3 3 4 0 4 0 6 5
1 4 1 6 8 8 1 8 4 8
8 9 0 0 1 8 8 5 9 6
3 8 5 5 0 2 7 8 3 5
5 5 0 8 0 8 3 5 6 5
4 6 1 3 1 8 5 5 5 3
1 9 7 3 0 6 2 4 2 8
3 8 6 9 0 6 7 6 2 0
9 4 9 8 8 2 0 8 7 4
8 8 9 9 8 2 2 0 2 6
0 7 4 1 1 5 8 1 5 9
3 3 9 6 6 3 1 5 9 4
4 4 7 4 6 9 3 4 5 8
6 8 5 4 1 4 4 4 1 2
3 3 9 8 3 2 0 0 5 6
1 6 1 8 5 8 9 2 6 6
9 0 7 1 5 9 7 3 5 7
6 9 1 1 1 8 7 9 2 1
8 3 3 3 1 4 3 4 7 5
2 9 3 5 5 9 4 4 3 5
1 6 7 9 1 2 0 2 4 3
3 4 2 5 7 8 3 7 6 3
9 3 8 4 4 5 0 3 9 7
0 7 7 9 4 8 3 9 2 2
6 8 7 5 0 6 9 8 7 1
1 6 8 7 9 7 1 8 6 6
5 0 4 7 8 8 0 8 7 8
9 9 7 3 7 9 1 6 9 2
9 9 2 6 1 8 4 0 3 0
3 7 3 8 8 4 6 3 1 9
8 4 6 6 9 6 3 5 6 9
4 6 9 0 1 1 5 1 0 4
5 5 9 1 1 9 9 2 4 7
7 0 4 0 7 8 4 1 8 0
7 2 4 1 3 5 9 9 1 8
2 2 4 7 8 8 6 1 9 2
2 1 3 6 9 5 6 6 7 5
8 4 7 1 4 0 5 7 9 7
2 8 2 7 8 4 2 2 9 6
1 8 5 2 5 2 1 3 8 9
3 1 1 0 5 7 5 6 7 1
9 8 9 9 1 5 8 8 1 9
7 8 8 3 4 3 8 1 1 5
2 0 5 1 9 0 1 0 9 9
1 3 7 5 8 1 7 8 8 1
6 7 5 0 3 6 2 7 2 2
2 6 1 4 2 5 3 1 3 8
3 4 7 5 6 4 1 9 9 6
7 1 0 9 0 2 6 5 8 8
6 4 6 5 4 1 9 2 8 0
1 3 3 1 6 9 4 3 7 4
0 0 4 8 7 6 8 9 1 5
4 7 3 7 3 0 8 9 0 2
6 4 8 6 6 8 4 0 5 1
4 2 3 6 8 4 6 4 6 1
4 0 9 0 0 1 5 1 3 0
6 6 3 1 8 6 2 8 6 7
3 4 4 8 1 9 2 0 3 2
2 8 1 3 0 4 6 9 9 8
8 7 2 9 9 0 5 0 7 2
4 6 2 0 3 5 5 3 5 8
8 9 8 3 2 4 5 7 5 1
8 2 9 2 6 1 0 4 7 2
6 8 5 3 7 8 1 8 1 0
8 1 2 5 0 6 0 8 3 1
7 5 4 8 6 4 1 2 2 3
0 2 8 7 6 7 7 0 7 8
0 7 5 5 9 0 6 0 3 4
2 5 2 1 2 4 1 6 7 2
2 7 6 5 0 4 6 6 8 4
9 6 5 6 6 3 4 2 9 8
9 6 1 6 3 5 2 6 6 8
7 5 5 1 6 7 5 2 2 5
2 5 4 9 9 2 2 7 5 1
0 0 2 2 0 6 0 8 5 3
0 2 2 6 6 3 2 6 9 3
9 1 2 2 4 7 4 9 6 7
3 8 8 6 0 0 6 3 5 5
2 7 5 9 5 3 8 4 6 3
3 0 4 2 1 2 3 1 6 8
9 4 1 3 2 7 9 5 1 7
2 5 4 4 3 6 6 2 9 4
5 1 4 7 5 2 5 2 4 8
5 5 5 9 9 9 3 3 9 1
1 2 4 3 5 8 6 5 5 8
8 1 1 2 0 3 4 6 4 9
6 1 8 3 4 0 8 0 9 8
9 2 7 1 1 1 7 2 9 6
9 3 2 4 0 3 2 8 3 9
2 8 9 1 9 5 0 8 6 9
1 1 9 8 5 7 4 1 8 9
7 8 8 5 4 8 4 1 7 7
5 9 2 5 6 0 2 3 2 7
8 8 6 0 5 1 7 6 5 6
0 2 0 0 3 2 0 1 7 6
7 4 6 7 1 9 5 2 9 5
6 2 5 7 6 1 8 3 6 5
5 2 2 7 4 6 1 5 8 3
7 8 1 0 6 8 2 0 3 7
4 5 8 3 7 8 3 1 3 2
2 6 0 6 7 4 7 1 2 5
5 2 0 3 4 2 7 4 0 0
7 3 5 5 8 2 6 2 0 5
6 2 5 9 7 0 8 8 1 4
5 0 9 7 2 9 8 4 5 3
1 8 7 8 3 7 8 2 3 6
5 8 1 8 1 4 7 9 5 1
3 4 3 9 0 8 9 6 0 6
8 0 5 9 7 9 1 8 0 1
8 9 5 4 3 6 8 2 3 7
2 9 7 6 8 0 1 1 1 8
2 0 9 8 4 0 1 6 2 6
5 8 6 1 5 0 9 5 7 0
4 2 2 7 7 1 8 2 4 4
5 8 8 2 6 5 3 4 3 1
5 5 6 9 0 2 8 6 7 6
7 8 6 9 8 3 6 8 3 7
2 6 8 7 0 3 0 6 1 6
7 4 5 9 5 3 6 1 7 4
8 1 0 3 4 1 4 6 4 0
7 7 6 5 1 3 0 4 2 5
3 7 7 1 0 7 5 0 3 6
8 1 5 8 3 3 8 2 2 0
7 6 4 9 5 2 5 0 3 6
1 2 4 4 3 2 3 7 9 2
5 6 3 1 7 0 9 4 6 5
4 7 4 1 4 9 9 7 0 6
5 1 7 3 1 5 5 5 8 6
5 8 9 5 0 6 3 2 8 7
1 6 9 9 5 7 1 0 9 8
0 8 3 5 3 4 8 4 3 0
8 7 7 2 7 1 8 9 5 2
1 2 7 6 9 1 4 5 5 5
6 9 6 7 0 6 9 0 2 3
2 2 2 3 6 6 2 8 4 9
5 3 9 7 4 3 7 6 0 8
1 1 9 6 9 8 1 1 4 0
6 3 2 6 0 4 2 0 4 6
1 6 2 2 8 7 3 9 3 2
6 3 4 8 9 6 8 9 9 2
5 9 2 8 3 9 7 4 1 2
5 3 0 5 6 9 9 4 4 9
5 7 7 7 5 9 0 6 8 5
9 6 1 2 4 8 9 1 9 5
1 8 7 7 6 7 8 7 5 6
7 6 0 2 8 3 8 9 5 0
1 9 6 3 5 4 5 7 4 8
3 9 8 9 1 6 5 7 8 6
4 5 9 4 3 2 3 1 8 0
6 7 4 8 1 2 8 8 3 1
2 1 7 8 7 1 7 3 1 7
3 4 0 2 2 1 8 2 2 1
2 5 7 6 0 1 0 0 5 5
5 7 8 0 2 0 6 0 9 0
8 4 3 3 6 8 7 5 2 1
7 2 9 1 4 4 2 2 3 4
0 5 1 6 4 9 1 1 0 4
9 1 3 6 0 0 1 6 3 3
8 8 6 8 7 1 4 4 3 0
4 7 3 3 2 1 5 0 2 4
9 1 1 0 0 4 9 5 8 5
0 0 1 1 0 0 9 8 4 3
9 6 7 4 7 6 2 8 9 6
3 9 1 5 7 6 9 7 1 8
2 5 3 4 1 4 7 2 3 4
4 8 3 2 0 9 4 9 2 4
7 1 2 4 3 1 8 4 7 9
4 4 1 6 4 9 0 8 7 7
8 9 0 3 3 2 3 0 2 7
7 7 7 6 5 1 1 2 5 0
6 0 9 4 2 4 6 5 5 5
1 8 9 3 6 9 3 3 9 9
8 7 3 5 9 2 6 3 9 8
6 9 2 7 8 6 9 3 5 0
6 8 1 6 3 2 2 2 5 1
5 2 6 7 5 1 7 0 5 9
2 5 2 7 8 3 7 5 1 5
4 0 1 1 7 9 6 7 3 1
0 5 8 9 0 2 5 8 9 1

The table above gives a list of 1800 random digits, obtained using a mathematical procedure which is known to produce “good” randomness. Some of the conditions such randomness should satisfy are explored in Chapter 22.

You can use the previous table to assign subjects to treatments in a number of ways. A simple one is to go through your 20 subjects, looking at each of the first 20 digits in the first line of this table. If the digit is even then assign them to the control group and if it is odd then assign them to the treatment group. Once you have 10 subjects in one group, put the remaining subjects into the other group. The table below shows the result of this process.

Example random assignment

Group Subject
Control 1 2 4 6 8 9 13 15 18 20
Treatment 3 5 7 10 11 12 14 16 17 19
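The even and odd rule is simple enough to write as a short function. In the sketch below the function name `assign_by_parity` is our own, and the digits are generated by Python purely as a stand-in; they are not the digits from the table above, so the resulting groups will differ from the example assignment.

```python
import random

def assign_by_parity(digits, group_size):
    """Assign subjects 1, 2, ... to control (even digit) or treatment (odd digit),
    sending everyone left over to the other group once one group is full."""
    n_subjects = 2 * group_size
    if len(digits) < n_subjects:
        raise ValueError("need at least one digit per subject")
    control, treatment = [], []
    for subject, digit in zip(range(1, n_subjects + 1), digits):
        if len(control) == group_size:
            treatment.append(subject)   # control is full
        elif len(treatment) == group_size:
            control.append(subject)     # treatment is full
        elif digit % 2 == 0:
            control.append(subject)
        else:
            treatment.append(subject)
    return control, treatment

# Stand-in digits (an assumption for illustration, not taken from the table).
stand_in_digits = random.Random(7).choices(range(10), k=20)
control, treatment = assign_by_parity(stand_in_digits, group_size=10)
print("Control  ", control)
print("Treatment", treatment)
```

To reproduce an assignment by hand, replace `stand_in_digits` with the list of digits read from a printed table.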

Cross the numbers out as you use them, so you have new random numbers for your next experiment. If you run out of numbers then there are other tables available, such as the Rand Corporation’s classic book of 1000000 random digits (Rand Corporation, 1955). The digits in that book were produced by a physical random process, rather than a mathematical one. The table below shows a sample of random digits generated from an astronomical image (Pimbblet & Bulmer, 2005). However, such random processes are usually difficult to harness in producing random numbers.

Cosmic random digits

                   
5 9 3 7 9 4 0 7 8 6
9 2 6 6 7 2 3 5 5 5
0 3 4 1 2 2 6 7 2 9
2 0 0 2 5 2 0 3 6 3
2 8 0 5 2 9 8 1 4 9
3 1 5 6 8 9 2 3 5 8
3 1 4 6 4 6 1 1 2 6
7 2 5 5 3 0 9 0 4 1
2 7 4 8 7 0 6 8 2 0
1 4 5 3 6 2 3 5 2 4
0 7 5 3 7 6 3 0 2 9
4 6 0 2 0 2 6 5 7 9
7 5 5 1 9 7 3 2 1 2
7 4 1 7 0 5 2 4 9 8
1 5 9 2 4 0 0 3 1 8
9 5 2 5 6 3 2 8 0 0
7 9 6 9 9 8 4 5 9 0
0 9 3 9 0 5 1 3 6 2
0 3 7 5 0 5 9 4 9 6
8 0 4 1 9 7 8 3 8 3
0 3 9 7 7 9 9 9 9 9
6 7 9 0 8 6 7 2 7 1
8 9 9 0 6 7 2 0 2 1
2 1 6 8 3 9 5 6 2 8
6 3 1 1 8 1 8 0 3 6
2 1 1 2 3 1 0 0 7 5
9 0 3 6 6 7 2 5 1 8
4 0 0 1 9 5 2 4 9 2
9 6 3 8 1 8 1 0 3 4
4 8 6 0 4 4 9 7 6 5
3 0 3 1 8 4 1 7 0 5
8 3 5 1 6 4 1 2 9 7
5 6 7 2 9 1 5 1 3 4
2 2 3 9 1 8 4 0 0 0
5 5 7 0 3 2 2 7 7 7
6 4 4 1 3 9 8 7 2 6
9 2 6 5 5 6 8 6 9 8
0 4 2 8 7 0 8 9 1 5
9 1 7 1 4 7 6 6 6 2
2 1 7 7 6 5 7 2 0 7
7 4 0 1 4 6 3 5 9 6
7 3 8 9 7 7 2 4 3 9
5 5 9 8 7 8 8 8 9 7
8 7 0 7 7 9 6 6 9 5
6 8 3 6 5 3 8 5 4 2
2 6 3 7 8 4 6 3 6 7
9 7 6 8 1 9 6 9 3 9
7 2 2 8 9 4 6 7 2 7
4 6 4 5 2 4 2 2 3 5
4 9 9 8 4 7 3 6 4 5
2 8 6 7 0 4 5 9 4 3
5 9 7 7 2 5 5 7 9 3
9 0 5 2 8 4 9 0 8 5
6 4 5 5 8 8 6 8 4 7
1 6 4 3 8 7 8 9 0 6
4 3 8 3 9 0 3 8 4 7
6 8 5 2 0 1 5 6 7 9
1 0 5 9 3 6 2 1 7 4
9 2 8 0 4 1 4 8 6 3
0 7 3 8 0 0 7 0 5 4
6 1 1 2 8 5 8 1 7 6
3 5 3 1 7 0 3 6 3 9
2 9 8 6 2 9 1 6 8 4
2 3 3 7 1 9 0 9 6 1
0 2 6 2 7 8 4 3 7 6
2 3 5 1 6 4 7 8 9 7
6 4 2 4 6 5 4 6 3 9
5 8 3 7 3 5 1 4 5 4
9 2 4 1 8 8 6 8 0 7
7 8 9 4 0 2 6 1 7 8
5 1 2 6 3 5 6 2 4 0
0 5 4 7 1 8 5 4 7 2
2 7 7 6 8 5 4 6 4 4
2 9 8 1 3 7 1 8 3 9
0 4 4 3 5 7 2 4 2 6
2 5 0 5 0 7 2 1 3 1
0 8 9 8 3 7 8 4 1 7
5 9 7 4 1 8 6 9 2 9
4 2 4 3 9 8 6 2 8 0
4 4 8 1 7 9 5 5 7 6
6 8 3 4 9 7 3 0 5 1
6 9 8 5 0 8 7 4 2 3
2 9 0 3 2 1 9 0 4 9
1 4 3 5 9 9 5 0 0 8
2 3 8 7 0 6 5 3 9 6
3 1 8 1 5 5 9 7 1 8
7 1 7 4 0 1 6 7 2 4
3 3 6 9 7 2 6 8 2 9
7 4 8 4 1 7 6 2 0 9
4 0 2 0 1 3 9 0 3 5
3 4 5 4 5 1 4 7 2 5
8 2 1 0 5 6 8 6 5 7
9 7 8 8 3 3 8 7 9 5
4 0 7 6 3 6 3 9 7 5
7 9 2 2 6 4 7 2 4 6
3 1 3 2 7 2 5 8 8 3
2 0 7 1 5 0 2 2 1 5
0 0 0 5 1 8 4 2 8 7
1 5 5 8 2 6 8 5 7 7
0 4 7 7 3 4 3 6 8 9
7 3 6 0 9 4 6 7 3 3
7 2 6 9 2 3 5 7 0 2
4 2 6 9 4 2 8 8 0 1
6 8 3 8 0 9 7 4 0 4
4 5 1 5 5 0 8 1 3 4
6 8 2 3 7 6 0 9 3 8
9 0 6 1 0 1 7 8 5 2
9 0 9 1 8 2 0 1 3 6
4 7 0 1 2 7 5 8 9 5
7 9 7 5 3 9 8 8 1 8
7 5 0 2 7 3 3 1 6 3
6 0 0 0 6 9 5 9 9 1
9 9 1 0 9 8 3 2 4 5
2 4 2 0 0 9 4 1 0 5
7 3 0 2 3 4 3 2 0 1
1 4 2 7 1 5 8 8 0 8
3 8 3 7 7 5 5 0 8 0
5 1 5 3 2 8 4 1 8 5
2 7 6 3 8 9 4 8 6 1
2 7 4 3 3 6 1 4 4 0
2 5 8 3 0 5 6 6 4 0
7 2 6 6 4 0 7 4 5 1
0 0 5 4 7 8 6 4 5 6
8 1 4 0 7 0 9 6 6 1
8 3 6 4 8 2 1 3 3 3
1 0 0 3 2 0 3 5 3 5
9 1 2 6 5 7 8 0 8 0
4 0 3 8 2 9 2 1 6 5
3 5 1 8 3 8 8 5 8 9
8 2 0 3 4 7 9 5 8 1
3 7 9 3 1 9 5 3 0 5
0 6 3 4 6 0 7 5 7 2
1 0 7 2 8 5 5 0 6 6
7 7 5 7 7 6 2 0 4 6
9 2 7 1 4 8 2 0 7 3
1 6 2 9 2 8 4 0 5 3
6 2 5 1 5 8 5 4 4 5
9 8 7 7 3 5 9 7 7 2
6 5 6 5 4 9 3 4 0 0
9 5 2 7 2 8 4 7 5 7
2 1 3 3 8 5 2 8 2 2
8 3 7 3 7 3 6 2 1 0
1 5 5 8 1 0 9 1 1 0
0 5 9 6 9 2 5 0 8 5
3 0 5 7 4 4 9 7 0 9
6 5 5 9 8 0 8 8 0 9
2 2 2 1 7 3 1 1 8 6
2 9 6 3 7 6 4 4 1 5
1 9 2 2 6 9 8 6 5 9
9 7 0 4 3 7 2 3 6 0
6 7 7 2 1 2 1 8 7 6
7 3 1 6 3 8 0 5 1 6
3 3 5 1 5 8 7 9 8 7
5 8 6 8 3 1 2 7 8 5
3 1 4 6 8 4 7 9 6 0
0 6 7 4 2 3 1 2 9 7
6 8 0 0 0 1 5 5 6 2
8 1 4 0 7 3 7 0 3 6
0 1 0 3 8 5 4 5 5 6
6 4 6 3 0 8 0 9 6 7
4 1 3 9 4 3 5 3 8 8
5 6 2 3 3 2 6 7 8 3
1 3 8 9 8 9 2 1 2 4
1 8 8 0 4 0 2 6 8 8
2 6 3 8 6 7 2 9 6 3
7 8 5 8 2 1 2 3 9 5
4 9 5 6 9 3 5 4 0 3
8 9 1 7 3 9 0 2 4 6
4 4 2 6 2 0 7 5 1 0
5 0 9 2 0 0 8 6 3 1
9 8 5 3 2 7 3 0 0 4
5 8 6 2 0 3 4 7 1 8
9 5 9 1 2 8 0 8 6 7
3 2 6 9 2 0 6 0 9 0
8 5 6 1 6 5 5 8 2 4
0 8 0 8 9 6 0 6 2 0
5 4 1 5 0 0 5 7 4 6
3 9 5 1 6 0 0 8 2 8
3 8 3 1 6 4 0 7 9 5
9 0 3 6 7 1 2 1 3 7

These two tables of random digits are only included for historical interest, and in case you don’t have access to a computer or calculator. Many software packages can generate good random numbers and can do so within a specified range, such as 1 to 20, so you can pick subjects directly. The basic idea and the importance of randomisation have not changed though.
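As one example of doing this in software, the following sketch uses Python's standard `random` module to pick 10 of the 20 subjects for the treatment group, leaving the rest as controls. The seed value is arbitrary and is fixed only so that the allocation can be reproduced later.

```python
import random

rng = random.Random(42)  # arbitrary fixed seed so the allocation is reproducible
subjects = list(range(1, 21))

# Choose 10 distinct subjects at random; every set of 10 is equally likely.
treatment = sorted(rng.sample(subjects, 10))
control = [s for s in subjects if s not in treatment]

print("Treatment", treatment)
print("Control  ", control)
```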

Evidence

So is there any evidence that caffeine increases pulse rate? The table below shows the changes in pulse rate during the study for the two groups. The mean increase in pulse rate for the decaffeinated group was 5.1 bpm, so perhaps other ingredients in the cola have an effect or there is a psychological response, or perhaps a mean increase of 5.1 bpm was just due to chance. (We will return to that question in Chapter 15.) However, the mean increase in pulse rate for the caffeine group was 15.8 bpm, 10.7 bpm higher than without caffeine.

Changes in pulse rates (bpm) for caffeine study

Caffeinated 17 22 21 16 6 -2 27 15 16 20
Decaffeinated 4 10 7 -9 5 4 5 7 6 12
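The changes and their means can be checked directly from the before and after pulse rates recorded earlier in the chapter. A minimal Python sketch (the variable and function names are our own):

```python
# Pulse rates (bpm) from the table of results earlier in the chapter.
before_caf = [70, 70, 75, 81, 72, 80, 67, 75, 64, 76]
after_caf = [87, 92, 96, 97, 78, 78, 94, 90, 80, 96]
before_decaf = [96, 65, 90, 90, 86, 89, 73, 69, 75, 64]
after_decaf = [100, 75, 97, 81, 91, 93, 78, 76, 81, 76]

def changes(before, after):
    """Change in pulse rate (after minus before) for each subject."""
    return [a - b for b, a in zip(before, after)]

def mean(values):
    return sum(values) / len(values)

change_caf = changes(before_caf, after_caf)
change_decaf = changes(before_decaf, after_decaf)
difference = mean(change_caf) - mean(change_decaf)

print(change_caf)
print(change_decaf)
print(round(mean(change_caf), 1), round(mean(change_decaf), 1), round(difference, 1))
# 15.8 5.1 10.7
```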

We will consider two explanations for the 10.7 bpm difference:

  1. The caffeine has no effect on pulse rate and the observed difference of 10.7 bpm was just due to chance variability in pulse rates.
  2. The difference of 10.7 bpm arose because caffeine does increase pulse rate.

How can we decide which is the correct explanation? A difference of 10.7 seems fairly high but there is one person with caffeine whose pulse rate actually went down so perhaps the first explanation is reasonable.

The standard approach is to start by assuming that the first explanation is correct and try to quantify how likely it would be to obtain the observed outcome by chance. If it was quite likely then the observations are consistent with the first explanation and so we wouldn’t find any evidence for the second. However, if it was very unlikely to observe the outcome we did then we would be suspicious about the first explanation and favour the second instead.

So suppose that the first explanation is true and that caffeine has no effect on pulse rate. If this was the case then instead of having two different groups of observations we have really made 20 observations from the same process (involving other factors, like sugar or psychological effects) and it was just by chance that these ended up in the groups that they did. For example, the table below shows an alternative random allocation of the 20 observations to two groups. If the first explanation was correct then this outcome should have been just as likely as the one we originally observed. Here the mean changes are 11.7 bpm with caffeine and 9.2 bpm without caffeine, giving a mean difference between the groups of 2.5 bpm. This is positive but not as big as 10.7 bpm.

Random allocation of pulse rate changes (bpm) to groups

Caffeinated -2 6 21 7 16 15 5 12 17 20
Decaffeinated 10 4 27 -9 4 16 7 22 5 6
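A single random re-allocation like this one is easy to produce by shuffling the 20 changes and treating the first 10 as the caffeine group. The sketch below does this once, with an arbitrary seed, so its allocation will generally differ from the one in the table above; it also confirms the 2.5 bpm difference for the table's allocation.

```python
import random

# The 20 observed changes (bpm): first 10 caffeinated, last 10 decaffeinated.
observed_changes = [17, 22, 21, 16, 6, -2, 27, 15, 16, 20,
                    4, 10, 7, -9, 5, 4, 5, 7, 6, 12]

def mean_difference(values):
    """Mean of the first 10 values minus the mean of the last 10."""
    first, second = values[:10], values[10:]
    return sum(first) / len(first) - sum(second) / len(second)

# The alternative allocation shown in the table above.
table_allocation = [-2, 6, 21, 7, 16, 15, 5, 12, 17, 20,
                    10, 4, 27, -9, 4, 16, 7, 22, 5, 6]

# One further re-allocation, using an arbitrary seed.
rng = random.Random(0)
shuffled = observed_changes[:]
rng.shuffle(shuffled)

print(round(mean_difference(observed_changes), 1))  # 10.7
print(round(mean_difference(table_allocation), 1))  # 2.5
print(round(mean_difference(shuffled), 1))
```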

We could repeat this process again, randomly allocating the 20 observations to two groups and calculating the mean difference, to see if we get anything like 10.7 bpm by chance. However with the aid of a computer we can just work through all possible randomisations and find out exactly how likely it is to get a 10.7 bpm difference. There are quite a lot of possibilities — the number of randomisations is the number of ways you can pick the 10 observations for the first group from the 20 available, or
\[ ^{20}C_{10} = \binom{20}{10} = 184756. \]
A computer doesn’t mind the size of this number (though it may become disgruntled if we had 40 observations to split — see below). The figure below shows the distribution of the 184756 mean differences, based on the assumption that there was no caffeine effect. It seems clear from the figure that a value of 10.7 is not that likely. In fact only 351 of the 184756 randomisations gave a mean difference as unusual as 10.7, a probability of
\[ \frac{351}{184756} = 0.0019. \]
The first explanation was that there was no caffeine effect and that the results were due to chance, but this is now hard to believe since the probability of such results arising by chance is so small. Thus for this study we favour the second explanation, that the difference of 10.7 bpm arose because caffeine does increase pulse rate.

Distribution of mean difference for all possible allocations of subjects to groups
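The enumeration of all allocations can be sketched in a few lines of Python. This is one possible implementation rather than the code used to produce the figure. It assumes a one-sided comparison, counting allocations whose mean difference is at least as large as the observed one. Because both groups always have 10 members, that is the same as asking whether the sum of the changes in the caffeine group is at least the observed sum, which lets us work in whole numbers.

```python
from itertools import combinations
from math import comb

# The 20 observed changes (bpm): first 10 caffeinated, last 10 decaffeinated.
observed_changes = [17, 22, 21, 16, 6, -2, 27, 15, 16, 20,
                    4, 10, 7, -9, 5, 4, 5, 7, 6, 12]
observed_sum = sum(observed_changes[:10])  # total change in the caffeine group

n_allocations = 0
n_as_extreme = 0
# Each choice of 10 positions out of 20 is one possible caffeine group.
for group in combinations(range(20), 10):
    group_sum = sum(observed_changes[i] for i in group)
    n_allocations += 1
    # With fixed group sizes, a larger group sum means a larger mean difference.
    if group_sum >= observed_sum:
        n_as_extreme += 1

p_value = n_as_extreme / n_allocations
print(n_allocations, comb(20, 10))
print(n_as_extreme, round(p_value, 4))
```

The count and probability printed on the last line can be compared with the 351 out of 184756 reported above.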

Randomisation Test

The above sequence of reasoning is known as a randomisation test (Ernst, 2004; Cobb, 2007). This is a very general and flexible approach for assessing evidence from an experiment or study.

In the previous figure we looked at all possible randomisations to the two groups but this is not always practical. For example, with 40 subjects to split between two groups of 20 there are
\[ \binom{40}{20} = 137846528820 \]
possible randomisations. This number is rather daunting, though not impossible to compute. However in practice we don’t need to know the exact probability of obtaining the data if there was no effect. Like the example in our previous table, we can instead randomly allocate the subject observations to the groups many times and see how often we get 10.7 or more. The figure below shows the results of doing this 10000 times, an approximation to the exact distribution pictured in the previous figure. We find 17 out of the 10000 differences were 10.7 or higher, giving an estimated probability of
\[ \frac{17}{10000} = 0.0017. \]
This in turn is an estimate of the exact probability calculated above. It is pretty close to the exact value and is certainly close enough to evaluate the evidence from the experiment. This is how randomisation tests are used in practice.

Distribution of mean difference for 10000 randomisations of subjects to groups
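A randomisation test of this kind might be coded as follows. The seed is arbitrary, so the count obtained will not necessarily be the 17 found above, and the comparison is again one-sided, working with group sums since both groups always have 10 members.

```python
import random

# The 20 observed changes (bpm): first 10 caffeinated, last 10 decaffeinated.
observed_changes = [17, 22, 21, 16, 6, -2, 27, 15, 16, 20,
                    4, 10, 7, -9, 5, 4, 5, 7, 6, 12]
observed_sum = sum(observed_changes[:10])

rng = random.Random(1)  # arbitrary seed
n_randomisations = 10000
n_as_extreme = 0
for _ in range(n_randomisations):
    # Pick 10 of the 20 observations at random to act as the caffeine group.
    group = rng.sample(observed_changes, 10)
    if sum(group) >= observed_sum:
        n_as_extreme += 1

estimated_p = n_as_extreme / n_randomisations
print(n_as_extreme, estimated_p)
```

Running this with different seeds gives slightly different estimates, which is a reminder that the result is an approximation to the exact probability.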

The Language of Hypothesis Testing

The randomisation test is one example of statistical hypothesis testing and we will see many other examples of this sequence of reasoning in later chapters. We conclude this chapter with an introduction to the language that is used when describing the steps in a hypothesis test.

In the caffeine analysis, the first explanation for the observed difference in pulse rates between the groups is referred to as the null hypothesis of the test. The null hypothesis will usually be a statement of “no effect”. For example, if we were trying to show that a new drug helped a medical condition then our null hypothesis would be that it had no benefit. Note that this sense of “hypothesis” is quite different to the scientific hypotheses described earlier in this chapter. Here our null hypothesis is that the mean increase in pulse rate is the same for caffeinated and decaffeinated cola.

The null hypothesis is usually denoted [latex]H_0[/latex] when discussing the theory of hypothesis testing but you will rarely find this notation appearing in scientific papers that use hypothesis tests. In fact it is rare for authors to specify the null hypothesis at all, though it is usually easy to infer what it was, based on the statement of results.

Assuming the null hypothesis is true, we calculate the probability that we could get data like what we saw just by chance. This probability is called the P-value of the test and it is almost always reported in scientific papers that use hypothesis testing. It is usually denoted by [latex]p[/latex].

If the P-value is large then data like those observed were quite likely by chance and so there is no reason to doubt the null hypothesis. For example, suppose we are wondering if a coin is more likely to come up heads than it should. Our null hypothesis would be that the coin is fair, an even chance for heads or tails. Someone tosses the coin once and it comes up heads — should we be suspicious? Well the probability of obtaining that result, the P-value, is [latex]p = 0.5[/latex]. This is quite likely and so we would have inconclusive evidence against the null hypothesis. Note that this doesn’t mean that the null hypothesis is true since the coin could have heads on both sides! It just means that our experimental design (a single coin toss) may not have been good enough to detect any effect.
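The coin example can be extended to any number of tosses. As a sketch, the function below (a name of our own) gives the probability of seeing at least a given number of heads from a fair coin, which is the P-value when we are looking for a coin that favours heads.

```python
from math import comb

def p_value_heads(heads, tosses):
    """Probability of at least `heads` heads in `tosses` tosses of a fair coin."""
    favourable = sum(comb(tosses, k) for k in range(heads, tosses + 1))
    return favourable / 2 ** tosses

print(p_value_heads(1, 1))    # 0.5, the single toss in the text
print(p_value_heads(10, 10))  # 1/1024, about 0.001
```

A single head tells us very little, whereas ten heads in ten tosses would be strong evidence against a fair coin, which illustrates how the design of the experiment limits the evidence it can provide.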

If the P-value is small then we have evidence against the null hypothesis. For the caffeine example the P-value was [latex]p = 0.0019[/latex]. P-values provide a continuous scale for strength of evidence against the null hypothesis. The figure below illustrates some standard adjectives that are used when assessing the evidence from an experiment.

Interpreting strength of evidence from a [latex]P[/latex]-value

Related to the null hypothesis is the alternative hypothesis. Denoted by [latex]H_1[/latex], this is usually what we want to show and it gives us the direction by which we judge a possible outcome to be as unusual as the one actually observed. For the caffeine study Alice was trying to show that caffeine increased pulse rate and so the alternative hypothesis was that the mean difference between the groups would be more than 0.

Decisions

A traditional use of hypothesis testing has been as a tool for decision making. To do this a threshold is chosen, such as 0.05, and if we find a P-value which is less than 0.05 then we say that “the results were significant at the 5% level”. This will also often be written in journal articles as “the results were found to be significant ([latex]p \lt 0.05[/latex])”.

Such decisions are often appropriate. A classic example is in process control, where samples are used to check that quality is being maintained. With a null hypothesis that the quality target is being met, a hypothesis test and its P-value can be used to decide when the process is “out of control” and some action needs to be taken.

However, hypothesis tests are often used in this decision-making role in scientific research, where there is usually no need to make such binary decisions. In terms of assessing evidence from a scientific study there is no practical difference between a P-value of 0.045 and one of 0.055, even though the former is significant at the 5% level while the latter is not. It is preferable to always report the exact P-value so that the reader can assess the level of evidence in more detail than a statement such as “[latex]p \lt 0.05[/latex]” provides. Hubert and Lombardi (2009) give an extensive survey of the historical development of hypothesis testing in this context.
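The contrast between a binary decision and an exact P-value can be shown with a few lines of Python. This is a minimal sketch; the function name `significant_at` is hypothetical and the threshold of 0.05 is simply the conventional choice mentioned above.

```python
def significant_at(p_value, alpha=0.05):
    """Binary decision rule: is the P-value below the chosen threshold?"""
    return p_value < alpha

# Two nearly identical levels of evidence give opposite decisions
for p in (0.045, 0.055):
    print(f"p = {p}: significant at the 5% level? {significant_at(p)}")
# p = 0.045: significant at the 5% level? True
# p = 0.055: significant at the 5% level? False
```

The decision rule discards the information that the two P-values are almost the same, which is why reporting the exact value is more useful to the reader.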

Summary

  • Randomised comparative experiments are the ideal way of detecting a difference between treatments.
  • Observational studies are an important methodology when experimentation is not possible but you need to be aware of the limitations in their conclusions.
  • Think about the population that your sample of subjects might represent. The conclusions that can be drawn from experimental results will be most valid for this population.
  • Hypothesis tests assess evidence against a null hypothesis ([latex]H_0[/latex]) in favour of an alternative hypothesis ([latex]H_1[/latex]) using a P-value.
  • Smaller P-values give stronger evidence against [latex]H_0[/latex].

Exercise 1

Use the table of random digits to randomly assign 20 subjects to a study involving three treatment groups. You will need to adapt the method given in the section on randomisation.

Exercise 2

In this figure, what is the highest possible mean difference?

Exercise 3

Alice has seven close friends, three male and four female, who are all about the same height as each other. She is wondering if the males around this height tend to be heavier than females on average.
She sets up a scale in her floor and secretly weighs each friend when they visit. The resulting weights are given in the table below.

Weights (kg) of Alice's closest friends

Female 56 39 50 70
Male 72 68 86

  1. State in words the null hypothesis ([latex]H_0[/latex]) and alternative hypothesis ([latex]H_1[/latex]) that Alice is interested in testing.
  2. What is the observed difference in mean weight between males and females?
  3. Calculate the exact P-value for a randomisation test of [latex]H_0[/latex] and interpret the result.

Exercise 4

Inspired by the work of Casini et al. (2013), Anna Solberg from Hofn conducted an experiment to see whether sleep deprivation affected the internal clock. She recruited ten friends for one night and randomly assigned them to either a rested group or a sleep deprived group. For the latter, Anna followed the original protocol and “remained with participants throughout the night to ensure they were awake. Access to television and games were provided. No food was allowed after midnight and caffeinated beverages were discontinued for 24 hours prior to the study” (Casini et al., 2013).
In the morning all participants completed a duration-production task where they were asked to press and hold a button for 1100 ms. The actual durations recorded are given in the table below. Calculate the exact P-value for a randomisation test to determine whether the mean duration produced tends to be lower with sleep deprivation.

Recorded durations (ms)

Rested 1822 1568 1366 1460 1739
Deprived 1629 1045 1182 1051 1444

Exercise 5

Discuss any limitations with Anna’s experiment in Exercise 4. For example, Casini et al. (2013) had each subject perform the duration-production task on two occasions, once when rested and once when sleep deprived. Why might this affect results?

 

Licence

A Portable Introduction to Data Analysis Copyright © 2024 by The University of Queensland is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License, except where otherwise noted.