# 16 Comparing Two Means

[latex]\newcommand{\pr}[1]{P(#1)} \newcommand{\var}[1]{\mbox{var}(#1)} \newcommand{\mean}[1]{\mbox{E}(#1)} \newcommand{\sd}[1]{\mbox{sd}(#1)} \newcommand{\Binomial}[3]{#1 \sim \mbox{Binomial}(#2,#3)} \newcommand{\Student}[2]{#1 \sim \mbox{Student}(#2)} \newcommand{\Normal}[3]{#1 \sim \mbox{Normal}(#2,#3)} \newcommand{\Poisson}[2]{#1 \sim \mbox{Poisson}(#2)} \newcommand{\se}[1]{\mbox{se}(#1)} \newcommand{\prbig}[1]{P\left(#1\right)}[/latex]

For the caffeinated cola study in Chapter 2 we can think of the two groups of subjects as coming from two different populations. As people they came from the same population, but in terms of their pulse rate response one group came from a population of people who drank caffeinated cola while the other came from a population of people who drank decaffeinated cola. We now want to determine, based on our samples, whether those populations are different.

# Standard Error

Suppose we take two independent samples from two populations. Suppose the first sample was of size [latex]n_1[/latex] and came from a population with mean [latex]\mu_1[/latex] and standard deviation [latex]\sigma_1[/latex], and that the second sample was of size [latex]n_2[/latex] and came from a population with mean [latex]\mu_2[/latex] and standard deviation [latex]\sigma_2[/latex]. We estimate [latex]\mu_1[/latex] with [latex]\overline{x}_1[/latex], [latex]\mu_2[/latex] with [latex]\overline{x}_2[/latex], [latex]\sigma_1[/latex] with [latex]s_1[/latex], and [latex]\sigma_2[/latex] with [latex]s_2[/latex].

We would like to compare [latex]\mu_1[/latex] and [latex]\mu_2[/latex] to see if there is a difference in the mean responses for two treatments or between two groups. We can make this comparison by looking at [latex]\mu_1 - \mu_2[/latex] and seeing how far away it is from 0. Of course, we don’t know what [latex]\mu_1 - \mu_2[/latex] is but we can estimate it with the statistic [latex]\overline{x}_1 - \overline{x}_2[/latex]. This is the difference between two sample means but it is useful to think of it as one value, an outcome of the random variable [latex]\overline{X}_1 - \overline{X}_2[/latex], the process of taking two random samples and returning the difference between their means.

To work out a confidence interval for [latex]\mu_1 - \mu_2[/latex] we need to know the sampling distribution of [latex]\overline{X}_1 - \overline{X}_2[/latex]. Now

\[ \mean{\overline{X}_1 - \overline{X}_2} = \mean{\overline{X}_1} - \mean{\overline{X}_2} = \mu_1 - \mu_2, \]

as we would like, and

\[ \var{\overline{X}_1 - \overline{X}_2} = \var{\overline{X}_1} + \var{\overline{X}_2} = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}, \]

since we are assuming the samples are independent. This gives the standard deviation

\[ \sd{\overline{X}_1 - \overline{X}_2} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}. \]
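The variance formula can be checked by simulation. The Python sketch below repeatedly draws two independent Normal samples and compares the variance of the resulting differences [latex]\overline{x}_1 - \overline{x}_2[/latex] with [latex]\sigma_1^2/n_1 + \sigma_2^2/n_2[/latex]. The population means and standard deviations, the sample sizes, the seed and the number of repetitions are arbitrary values assumed for illustration; they are not data from this chapter, and the function name is hypothetical.

```python
import random
import statistics

def simulated_var_of_difference(mu1, sigma1, n1, mu2, sigma2, n2, reps=10000, seed=1):
    """Estimate var(Xbar1 - Xbar2) by drawing many pairs of independent Normal samples."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(reps):
        xbar1 = statistics.fmean(rng.gauss(mu1, sigma1) for _ in range(n1))
        xbar2 = statistics.fmean(rng.gauss(mu2, sigma2) for _ in range(n2))
        diffs.append(xbar1 - xbar2)
    return statistics.variance(diffs)

# Assumed population values, chosen only for illustration
sigma1, n1 = 7.0, 15
sigma2, n2 = 6.0, 15

theory = sigma1**2 / n1 + sigma2**2 / n2
simulated = simulated_var_of_difference(80.0, sigma1, n1, 41.0, sigma2, n2)
print(f"formula: {theory:.3f}   simulation: {simulated:.3f}")
```

The two numbers should agree to within simulation error. The Normal populations are an assumption of the sketch; the variance result itself only relies on the two samples being independent.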

As usual, we don’t know [latex]\sigma_1[/latex] or [latex]\sigma_2[/latex], but we can estimate them with the sample standard deviations. This gives the standard error

\[ \se{\overline{x}_1 - \overline{x}_2} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}. \]

The [latex]t[/latex] distribution was introduced to cope with the extra variability from using one sample standard deviation. Here we have two, so unfortunately we cannot use the [latex]t[/latex] distribution directly with this standard error. However, we can use the [latex]t[/latex] distribution to give a conservative approximation to the real distribution by taking

\[ \mbox{df } = \min(n_1 - 1, n_2 - 1), \]

the smaller of the two degrees of freedom. By “conservative” we mean that a 95% confidence interval will probably be a bit wider than it needs to be and hypothesis tests will give larger, less significant, [latex]P[/latex]-values.
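These two quantities are simple to compute from summary statistics. The following Python sketch evaluates them for the plant growth summaries used in the lighting example in this chapter (two samples of 15 with standard deviations 7.039 mm and 6.024 mm). The function names `se_difference` and `conservative_df` are illustrative, not from any package.

```python
import math

def se_difference(s1, n1, s2, n2):
    """Standard error of xbar1 - xbar2, without assuming equal population SDs."""
    return math.sqrt(s1**2 / n1 + s2**2 / n2)

def conservative_df(n1, n2):
    """The smaller of the two sample degrees of freedom."""
    return min(n1 - 1, n2 - 1)

# Plant growth summaries (mm) for high and normal lighting
se = se_difference(7.039, 15, 6.024, 15)
df = conservative_df(15, 15)
print(round(se, 3), df)  # 2.392 14
```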

# Confidence Intervals

We can use the above discussion to give a general formula for a confidence interval for the difference between two population means,

\[ (\overline{x}_1 - \overline{x}_2) \pm t_{\small{\mbox{df}}}^{*} \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}, \]

where [latex]\mbox{df } = \min(n_1 - 1, n_2 -1)[/latex].

Lighting and Plant Growth

The table below gives summary statistics for the seedling growth data in the Chapter 4 example.

## Summary statistics for plant growth (mm) by lighting

Lighting | [asciimath]n[/asciimath] | [asciimath]\overline{x}[/asciimath] | [asciimath]s[/asciimath] |
---|---|---|---|
High | 15 | 79.9 | 7.039 |
Normal | 15 | 41.0 | 6.024 |

To have a break from 95% intervals, suppose we want a 90% confidence interval for the mean increase in plant growth resulting from the continuous fluorescent lighting. This estimated mean difference could be important in deciding whether the cost of the lighting is justified in terms of the benefit.

For 90% confidence we require the two tail probabilities each to be 5%, so we look at the 0.05 column in Student’s T distribution. The smaller of the two degrees of freedom here is 14, giving [latex]t_{14}^{*}[/latex] = 1.761. The interval is thus

\[ (79.9 - 41.0) \pm 1.761 \sqrt{\frac{7.039^2}{15} + \frac{6.024^2}{15}} = 38.9 \pm 4.21 \mbox{ mm}.\]

So we are 90% sure that the continuous lighting results in between 34.7 mm and 43.1 mm extra growth on average.
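The same arithmetic as a short Python sketch. The critical value [latex]t_{14}^{*} = 1.761[/latex] is typed in from the table rather than computed, and `two_sample_ci` is a hypothetical helper name.

```python
import math

def two_sample_ci(xbar1, s1, n1, xbar2, s2, n2, t_star):
    """(xbar1 - xbar2) +/- t_star * se, with t_star supplied by the caller."""
    diff = xbar1 - xbar2
    margin = t_star * math.sqrt(s1**2 / n1 + s2**2 / n2)
    return diff - margin, diff + margin

# High vs normal lighting, 90% confidence with conservative df = 14
low, high = two_sample_ci(79.9, 7.039, 15, 41.0, 6.024, 15, t_star=1.761)
print(f"{low:.1f} mm to {high:.1f} mm")  # 34.7 mm to 43.1 mm
```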

## The Welch Approximation

The conservative degrees of freedom we have used is easy to calculate by hand but is almost always too conservative. We are underselling our confidence intervals. The above 90% confidence interval is probably closer to a 91% confidence interval in reality. Computer packages use a more complicated calculation for the appropriate degrees of freedom to use. Remember the aim of this is to approximate the real distribution that arises from using two sample standard deviations by a [latex]t[/latex] distribution. The better degrees of freedom gives the **Welch approximation** (Welch, 1936) and is calculated by

\[ \mbox{df } = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{1}{n_1-1}\left(\frac{s_1^2}{n_1}\right)^2 + \frac{1}{n_2-1} \left(\frac{s_2^2}{n_2}\right)^2}. \]

For example, with the above comparison of plant growth, the Welch degrees of freedom would be 27.35 instead of 14. This is still conservative but far less so. The margin of error becomes 4.07 mm instead of 4.21 mm, suggesting we know the difference a bit more precisely than we said before.

The best degrees of freedom we could hope to use would be to combine the two degrees of freedom from [latex]s_1[/latex] and [latex]s_2[/latex], in this case (15-1) + (15-1) = 28. It is justifiable to do this in some cases, as we will see later in this chapter. However, when it is justifiable to do so the Welch approximation would give this as well. In practice you can let the software package take care of this issue.
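A minimal Python version of the Welch formula, assuming only the summary statistics are available. It reproduces the 27.35 degrees of freedom quoted above for the plant growth data, and also evaluates the formula for the transformed worm count summaries used later in this chapter. If SciPy is installed, `scipy.stats.ttest_ind(x1, x2, equal_var=False)` carries out the Welch test directly from raw data; the function below, whose name is illustrative, is only meant to show the formula at work.

```python
def welch_df(s1, n1, s2, n2):
    """Welch (1936) approximate degrees of freedom for the two-sample t procedure."""
    a = s1**2 / n1
    b = s2**2 / n2
    return (a + b) ** 2 / (a**2 / (n1 - 1) + b**2 / (n2 - 1))

print(round(welch_df(7.039, 15, 6.024, 15), 2))   # plant growth: 27.35
print(round(welch_df(0.7167, 13, 0.4394, 13), 1))  # transformed worm counts: 19.9
```

Note that the result is never larger than [latex]n_1 + n_2 - 2[/latex] and never smaller than [latex]\min(n_1 - 1, n_2 - 1)[/latex].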

Treatment of Worms in Native Rats

A study carried out by Renee Sternberg and Hamish McCallum at the University of Queensland involved trapping and releasing a number of native rats near Mount Glorious. Two species of rat were involved: *Rattus fuscipes* and *Melomys cervinipes*. Before releasing, half of the rats were given a treatment in an attempt to reduce their worm burden while the others were given distilled water instead, as a control group.

The two tables below show the worm count data obtained at the end of this experiment. The following figure shows a dot plot comparing the number of worms found in the small intestine of each rat at the end of the study.

## Worm count data - Water Group

Species | Sex | Liver/Heart/Lungs | Stomach | Small Intestine | Caecum | Large Intestine |
---|---|---|---|---|---|---|
Melomys | Male | 0 | 0 | 84 | 0 | 0 |
Melomys | Female | 0 | 0 | 8 | 4 | 4 |
Melomys | Female | 0 | 0 | 50 | 7 | 0 |
Melomys | Female | 0 | 0 | 20 | 1 | 0 |
Melomys | Male | 0 | 0 | 0 | 0 | 1 |
Rattus | Female | 0 | 7 | 71 | 0 | 1 |
Rattus | Female | 7 | 22 | 217 | 0 | 0 |
Rattus | Female | 2 | 16 | 145 | 2 | 0 |
Rattus | Male | 0 | 12 | 71 | 19 | 5 |
Rattus | Male | 0 | 2 | 30 | 7 | 4 |
Rattus | Male | 23 | 9 | 234 | 9 | 2 |
Rattus | Male | 10 | 9 | 246 | 16 | 2 |
Rattus | Male | 4 | 6 | 470 | 60 | 4 |

## Worm count data - Treatment Group

Species | Sex | Liver/Heart/Lungs | Stomach | Small Intestine | Caecum | Large Intestine |
---|---|---|---|---|---|---|
Melomys | Female | 0 | 0 | 28 | 1 | 0 |
Melomys | Male | 0 | 0 | 10 | 0 | 0 |
Melomys | Male | 0 | 0 | 3 | 0 | 0 |
Melomys | Male | 0 | 0 | 2 | 0 | 0 |
Melomys | Male | 0 | 0 | 4 | 0 | 0 |
Rattus | Female | 0 | 2 | 9 | 0 | 0 |
Rattus | Female | 0 | 1 | 5 | 0 | 0 |
Rattus | Female | 0 | 3 | 1 | 0 | 0 |
Rattus | Female | 0 | 11 | 8 | 28 | 0 |
Rattus | Female | 23 | 6 | 0 | 3 | 0 |
Rattus | Male | 0 | 9 | 0 | 9 | 0 |
Rattus | Male | 0 | 6 | 10 | 0 | 0 |
Rattus | Male | 0 | 2 | 1 | 2 | 0 |

The distributions of worm counts are highly skewed and so they need to be transformed if [latex]t[/latex] methods are to be applied, as was done in Chapter 14. Taking logarithms would be a first step, but this data contains many zero counts, where no worms were found, and [latex]\log(0)[/latex] is undefined. This can be overcome by adding 1 to all observations before taking logarithms, and the results of this transformation, using logarithms to the base 10, are shown in the figure below.

The transformed data is much more symmetric, though there is now a slightly unusual value in the Water group. We can proceed to calculate a 95% confidence interval for the difference between the groups using the summary data in the following table.

## Summary statistics for transformed worm count data

Group | [asciimath]n[/asciimath] | [asciimath]\overline{x}[/asciimath] | [asciimath]s[/asciimath] |
---|---|---|---|
Water | 13 | 1.774 | 0.7167 |
Treatment | 13 | 0.666 | 0.4394 |

The Welch approximation gives 19.9 degrees of freedom, which we round down to 19 to use the table. This reflects some difference between the sample standard deviations, but is still much higher than the 12 degrees of freedom, [latex]\min(13 - 1, 13 - 1)[/latex], given by the conservative approach.

The 95% confidence interval for the effect of the treatment over the placebo is

\[ (0.666 - 1.774) \pm 2.093 \sqrt{\frac{0.4394^2}{13} + \frac{0.7167^2}{13}} = -1.108 \pm 0.488, \]

giving a range of -1.596 to -0.620. As with any interval found on a transformed scale, we then need to undo the transformation to get an interval we can interpret. We have found an interval for

\[ \log(\mbox{Treatment}) - \log(\mbox{Water}) = \log\left(\frac{\mbox{Treatment}}{\mbox{Water}}\right), \]

so a 95% confidence interval for the ratio of worms in the treatment group to the control group is

\[ \left(10^{-1.596}, 10^{-0.620}\right) = (0.025, 0.240). \]

Thus we are 95% sure that native rats undergoing the treatment will have between only 2.5% and 24% of the worms in their small intestines that a rat would otherwise have.
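The whole calculation can be run from the raw small intestine counts in the two worm count tables. This Python sketch applies the [latex]\log_{10}(x + 1)[/latex] transformation, computes the summaries, forms the interval with the tabled [latex]t_{19}^{*} = 2.093[/latex], and back-transforms the endpoints. Because it works from unrounded means and standard deviations, the last printed digit may differ slightly from the hand calculation; the helper name `log10_plus_one` is illustrative.

```python
import math
import statistics

# Small intestine worm counts from the Water and Treatment tables
water = [84, 8, 50, 20, 0, 71, 217, 145, 71, 30, 234, 246, 470]
treatment = [28, 10, 3, 2, 4, 9, 5, 1, 8, 0, 0, 10, 1]

def log10_plus_one(counts):
    """Apply log10(x + 1) so that zero counts remain defined."""
    return [math.log10(x + 1) for x in counts]

log_water = log10_plus_one(water)
log_treat = log10_plus_one(treatment)
mean_w, sd_w = statistics.mean(log_water), statistics.stdev(log_water)
mean_t, sd_t = statistics.mean(log_treat), statistics.stdev(log_treat)

# Treatment minus water on the log scale, with t*_19 = 2.093 typed in from the table
diff = mean_t - mean_w
se = math.sqrt(sd_t**2 / len(log_treat) + sd_w**2 / len(log_water))
t_star = 2.093
low, high = diff - t_star * se, diff + t_star * se

# Undo the log10 transformation: the endpoints become a range for a ratio
ratio_low, ratio_high = 10**low, 10**high
print(f"water {mean_w:.3f} (sd {sd_w:.4f}), treatment {mean_t:.3f} (sd {sd_t:.4f})")
print(f"log scale: {low:.3f} to {high:.3f}")
print(f"ratio: {ratio_low:.3f} to {ratio_high:.3f}")
```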

# Hypothesis Tests

Suppose we want to test the hypothesis [latex]H_0: \mu_1 = \mu_2[/latex]. In this case, when calculating the [latex]P[/latex]-value we would expect [latex]\mu_1 - \mu_2[/latex] = 0. Combining this with the standard error formula gives the [latex]t[/latex] statistic

\[ t_{\small{\mbox{df}}} = \frac{(\overline{x}_1 - \overline{x}_2) - 0}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}, \]

where [latex]\mbox{df } = \min(n_1 - 1, n_2 -1)[/latex] or the value from the Welch approximation. The expected value for the difference is 0 and we have left it there to emphasise that this is just the usual process of standardising.

Caffeine and Cola

We can now use a [latex]t[/latex] test to answer Alice’s question from Chapter 2 and see whether the caffeinated cola gives an increase in pulse rate that is significantly higher than the decaffeinated cola.

We would like to see if there is a difference between the mean increase in pulse rate with caffeine, [latex]\mu_Y[/latex], and the mean increase without caffeine, [latex]\mu_N[/latex]. We will test [latex]H_0: \mu_Y = \mu_N[/latex] against the one-sided alternative [latex]H_1: \mu_Y \gt \mu_N[/latex]. It is one-sided because Alice was trying to show that the presence of caffeine would give a higher mean increase.

The figure below shows a side-by-side dot plot of the pulse rate increases for the 20 subjects in the Chapter 2 example, while the following table gives the summary statistics we need to calculate the [latex]t[/latex] statistic.

## Summary statistics for pulse rate increases (bpm)

Caffeine | [asciimath]n[/asciimath] | [asciimath]\overline{x}[/asciimath] | [asciimath]s[/asciimath] |
---|---|---|---|
Yes | 10 | 15.80 | 8.324 |
No | 10 | 5.10 | 5.587 |

From these summaries we find

\[ t_9 = \frac{(15.80 - 5.10) - 0}{\sqrt{\frac{8.324^2}{10} + \frac{5.587^2}{10}}} = \frac{10.7}{3.17} = 3.38, \]

where the 9 degrees of freedom come from the conservative approximation. Since we are expecting to find [latex]\mu_Y \gt \mu_N[/latex], the [latex]P[/latex]-value is [latex]\pr{T_9 \ge 3.38}[/latex]. From Student’s T distribution we find this [latex]P[/latex]-value is between 0.001 and 0.005, very strong evidence that the mean increase is higher for the caffeinated cola than it is for the decaffeinated cola.
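Python's standard library has no [latex]t[/latex] distribution, so the sketch below computes the tail probability by numerically integrating the [latex]t[/latex] density with Simpson's rule. This is an illustrative approach chosen here, not how statistical packages do it; with SciPy available, `scipy.stats.t.sf(t, df)` gives the upper-tail probability directly. The function names are hypothetical.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_upper_tail(t, df, steps=2000):
    """P(T >= t) for t >= 0: 0.5 minus the integral of the density over [0, t] (Simpson's rule)."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

def two_sample_t(xbar1, s1, n1, xbar2, s2, n2):
    """Unpooled two-sample t statistic for H0: mu1 = mu2."""
    return (xbar1 - xbar2 - 0) / math.sqrt(s1**2 / n1 + s2**2 / n2)

t = two_sample_t(15.80, 8.324, 10, 5.10, 5.587, 10)
p = t_upper_tail(t, df=min(10 - 1, 10 - 1))  # conservative df = 9
print(f"t = {t:.2f}, one-sided P-value = {p:.4f}")
```

The printed [latex]P[/latex]-value falls between 0.001 and 0.005, in line with the table lookup.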

## The Alice Distribution

Note that this is the same level of evidence we found using the randomisation test in Chapter 2. There we wanted to know how likely it was that we could obtain 10.7 through the random allocation of subjects to the two groups and we gave a fairly informal argument regarding this probability and the associated evidence it suggested.

We can now be more specific about this process using the language of random variables. Let [latex]A[/latex] be the difference between group means when the 20 values in the table below are randomly split into two groups of equal size. This is an example of what is variously known as a **randomisation** distribution (Ernst, 2004), a **re-randomisation** distribution (Pfannkuch et al., 2011) or a **scrambling** distribution (Finzer, 2006).

## Changes in pulse rate

-2 | -9 | 4 | 4 | 5 | 5 | 6 | 6 | 7 | 7 |

10 | 12 | 15 | 16 | 16 | 17 | 20 | 21 | 22 | 27 |

Here we will give our distribution a name and say that the random variable [latex]A[/latex] has the **Alice distribution**. This is a very special distribution, intimately tied to the 20 values. However, we could still make an exact statistical table of this distribution by calculating the mean difference for each of the [latex]\binom{20}{10} = 184756[/latex] possible splits and using these to give cumulative probabilities, as shown in the table below.

## Alice distribution

[asciimath]a[/asciimath] | .0 | .1 | .2 | .3 | .4 | .5 | .6 | .7 | .8 | .9 |
---|---|---|---|---|---|---|---|---|---|---|
0 | 0.500 | 0.500 | 0.481 | 0.481 | 0.461 | 0.461 | 0.442 | 0.442 | 0.423 | 0.423 |
1 | 0.404 | 0.404 | 0.385 | 0.385 | 0.366 | 0.366 | 0.348 | 0.348 | 0.330 | 0.330 |
2 | 0.312 | 0.312 | 0.295 | 0.295 | 0.278 | 0.278 | 0.262 | 0.262 | 0.246 | 0.246 |
3 | 0.231 | 0.231 | 0.216 | 0.216 | 0.202 | 0.202 | 0.188 | 0.188 | 0.174 | 0.174 |
4 | 0.161 | 0.161 | 0.149 | 0.149 | 0.138 | 0.138 | 0.127 | 0.127 | 0.116 | 0.116 |
5 | 0.106 | 0.106 | 0.097 | 0.097 | 0.089 | 0.089 | 0.081 | 0.081 | 0.073 | 0.073 |
6 | 0.066 | 0.066 | 0.059 | 0.059 | 0.053 | 0.053 | 0.047 | 0.047 | 0.042 | 0.042 |
7 | 0.037 | 0.037 | 0.033 | 0.033 | 0.029 | 0.029 | 0.026 | 0.026 | 0.022 | 0.022 |
8 | 0.019 | 0.019 | 0.017 | 0.017 | 0.014 | 0.014 | 0.012 | 0.012 | 0.011 | 0.011 |
9 | 0.009 | 0.009 | 0.008 | 0.008 | 0.006 | 0.006 | 0.005 | 0.005 | 0.004 | 0.004 |
10 | 0.004 | 0.004 | 0.003 | 0.003 | 0.002 | 0.002 | 0.002 | 0.002 | 0.002 | 0.002 |
11 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.000 | 0.000 |

This table gives [asciimath]P(A \ge a)[/asciimath], where the random variable [asciimath]A[/asciimath] has the Alice distribution; rows give the integer part of [asciimath]a[/asciimath] and columns its first decimal place. Only values for [asciimath]a \ge 0[/asciimath] are given since the distribution is symmetric about 0.

The [latex]P[/latex]-value for our test is then

\[ \pr{A \ge 10.7} = 0.002, \]

from the table.
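Enumerating all the splits is quick on a modern computer. The Python sketch below goes through every way of choosing 10 of the 20 pulse rate changes for the first group and records the difference in means. It works in integer tenths of a bpm (each difference is a difference of group sums divided by 10), which avoids floating-point comparisons that can silently drop values lying exactly on a cut-off. The name `upper_tail` is illustrative.

```python
from collections import Counter
from itertools import combinations

# The 20 pulse rate changes (bpm) from the table of changes in pulse rate
changes = [-2, -9, 4, 4, 5, 5, 6, 6, 7, 7,
           10, 12, 15, 16, 16, 17, 20, 21, 22, 27]
total = sum(changes)  # 209

# For a split with group-1 sum s1, A = (s1 - (total - s1)) / 10,
# so 10 * A = 2 * s1 - total is always an integer.
# Count how often each value of 10 * A occurs over all C(20, 10) splits.
tenths = Counter(2 * sum(group1) - total for group1 in combinations(changes, 10))
n_splits = sum(tenths.values())

def upper_tail(a):
    """Exact P(A >= a) for a cut-off a given to one decimal place."""
    cutoff = round(10 * a)  # convert the cut-off to integer tenths of a bpm
    return sum(count for k, count in tenths.items() if k >= cutoff) / n_splits

print(n_splits)  # 184756
print(f"P(A >= 10.7) = {upper_tail(10.7):.3f}")
```

Because the 20 values sum to 209, an odd number, every possible difference is an odd number of tenths, so for example [latex]\pr{A \ge 0.2}[/latex] and [latex]\pr{A \ge 0.3}[/latex] are necessarily equal.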

However this table is of very limited value. It could only be used by a future researcher who happened to obtain the same set of 20 values in their experiment! This explains the historical utility of the [latex]t[/latex] tests. By using the standard error to transform our difference of 10.7 bpm to a standardised [latex]t[/latex] statistic of 3.38 we can then obtain the [latex]P[/latex]-value from a single set of [latex]t[/latex] distribution tables, rather than having to determine the distribution of the original statistic. In this way the [latex]t[/latex] distribution is a short cut in transforming our data into a [latex]P[/latex]-value:

\[ \mbox{Data} \;\; \longrightarrow \;\; \overline{X}_1 - \overline{X}_2 \;\; \longrightarrow \;\; T \;\; \longrightarrow \;\; P \]

The price we pay for this utility is the need to make the assumptions required by the [latex]t[/latex] test for this short cut to be sufficiently accurate.

# Pooling Standard Deviations

We have seen that the test statistic

\[ t = \frac{(\overline{x}_1 - \overline{x}_2) - 0}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} \]

does not have an exact [latex]t[/latex] distribution. This is essentially because there are now two sources of variability in the standard error, since we don’t know [latex]\sigma_1[/latex] or [latex]\sigma_2[/latex], and the [latex]t[/latex] distribution was only intended to capture the extra variability from one. However if we were happy to assume that the two populations have the **same standard deviations**, that [latex]\sigma_1[/latex] = [latex]\sigma_2[/latex], then the denominator would only involve a single estimate of variability. We could then use the [latex]t[/latex] distribution without having to approximate it using a conservative estimate of degrees of freedom.

To estimate the common standard deviation we **pool** together the squared deviations and the degrees of freedom from the two samples. This gives the **pooled variance**

\[ s_p^2 = \frac{\sum (x_{1j} - \overline{x}_1)^2 + \sum (x_{2j} - \overline{x}_2)^2}{(n_1 - 1) + (n_2 - 1)} \]

and the pooled standard deviation [latex]s_p[/latex]. Another way of writing this formula comes from the definition of the sample standard deviation

\[ s = \sqrt{\frac{\sum (x_{j} - \overline{x})^2}{n - 1}}. \]

This can be rearranged to give

\[ \sum (x_{j} - \overline{x})^2 = (n - 1)s^2 \]

so that

\[ s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{(n_1 - 1) + (n_2 - 1)}. \]

This shows [latex]s_p^2[/latex] is a weighted average of the two sample variances. It is also handy if you only have a calculator since you can usually get each [latex]s[/latex] but not [latex]\sum (x_{j} - \overline{x})^2[/latex]. A software package, of course, calculates this for you.
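As a check that the two forms of the pooled variance agree, the Python sketch below computes both for two small made-up samples (hypothetical numbers, not data from this chapter). The function names are illustrative.

```python
import statistics

def pooled_variance_from_deviations(x1, x2):
    """Squared deviations from each sample's own mean, over (n1 - 1) + (n2 - 1)."""
    m1, m2 = statistics.fmean(x1), statistics.fmean(x2)
    ss = sum((x - m1) ** 2 for x in x1) + sum((x - m2) ** 2 for x in x2)
    return ss / ((len(x1) - 1) + (len(x2) - 1))

def pooled_variance_from_sds(s1, n1, s2, n2):
    """Weighted average of the two sample variances, with weights n - 1."""
    return ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / ((n1 - 1) + (n2 - 1))

# Hypothetical samples, used only to show that the two formulas agree
x1 = [12.0, 15.0, 11.0, 14.0, 13.0]
x2 = [9.0, 10.0, 8.0, 11.0]
v_dev = pooled_variance_from_deviations(x1, x2)
v_sd = pooled_variance_from_sds(statistics.stdev(x1), len(x1),
                                statistics.stdev(x2), len(x2))
print(v_dev, v_sd)  # both equal 15/7, up to floating-point rounding
```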

The degrees of freedom we can now use are [latex]n_1 + n_2 - 2[/latex], higher than the very conservative [latex]\min(n_1 - 1, n_2-1)[/latex] and at least as high as the Welch approximation. A [latex]t[/latex] distribution with higher degrees of freedom is less variable and so confidence intervals we calculate will be narrower and hypothesis tests will be more significant. This is great but the difficulty is in deciding whether the population standard deviations are equal or not. This is really another hypothesis test question, but unfortunately the methods available for handling this test are not very reliable. It is better to compare the sample distributions graphically instead and see whether the assumption is plausible.

As usual the [latex]t[/latex] test will use the pooled standard deviation, rather than the pooled variance. However in Chapter 19 we will extend the idea of pooling to more than two samples. There we will focus on variance instead.

Height and Sex

The following figure shows a side-by-side box plot of height by sex for the sample of 60 Islanders in the survey data. The spread of the two distributions seems similar from this plot and so it might be reasonable to assume that the populations have the same standard deviations.

From the summary statistics in the following table we can calculate the pooled variance

\[ s_p^2 = \frac{(34 - 1) 6.367^2 + (26 - 1) 5.900^2}{34 + 26 - 2} = \frac{2208}{58} = 38.07, \]

giving pooled standard deviation [latex]s_p = 6.17[/latex] cm. This will always be between the two sample standard deviations.

## Summary statistics for height by sex

Sex | [asciimath]n[/asciimath] | [asciimath]\overline{x}[/asciimath] | [asciimath]s[/asciimath] |
---|---|---|---|
Male | 34 | 177.06 | 6.367 |
Female | 26 | 167.42 | 5.900 |

The [latex]t[/latex] statistic for testing whether there is a difference between male and female heights is then

\[ t_{58} = \frac{(177.06 - 167.42) - 0}{6.17\sqrt{\frac{1}{34} + \frac{1}{26}}} = 6.00, \]

giving very strong evidence of a difference. This is not a very exciting result since it is well known that males are on average taller than females. Of more interest in this setting would be a confidence interval for how much taller males are. A 95% interval would be

\[ 9.64 \pm 2.002 \left(6.17 \sqrt{\frac{1}{34} + \frac{1}{26}}\right) = 9.64 \pm 3.22, \]

so we are 95% confident that the mean height for males is between about 6.42 cm and 12.86 cm higher than for females. (Here 2.002 came from the [latex]t(58)[/latex] distribution.)
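The pooled calculation for the height data, as a Python sketch working from the summary statistics. The value 2.002 for [latex]t_{58}^{*}[/latex] is taken from the text rather than computed, and `pooled_sd` is an illustrative helper name.

```python
import math

def pooled_sd(s1, n1, s2, n2):
    """Pooled standard deviation: square root of the weighted average of the two variances."""
    return math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))

# Summary statistics for height (cm) by sex
n_m, xbar_m, s_m = 34, 177.06, 6.367
n_f, xbar_f, s_f = 26, 167.42, 5.900

sp = pooled_sd(s_m, n_m, s_f, n_f)
se = sp * math.sqrt(1 / n_m + 1 / n_f)
diff = xbar_m - xbar_f
t = (diff - 0) / se      # t statistic with n_m + n_f - 2 = 58 df
margin = 2.002 * se      # 2.002 is t*_58 for 95% confidence, as quoted in the text
print(f"s_p = {sp:.2f} cm, t = {t:.2f}")
print(f"95% CI: {diff - margin:.2f} to {diff + margin:.2f} cm")
```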

## Power

The figure below shows the power of a two-sided two-sample [latex]t[/latex] test for varying signal-to-noise ratio.

Here the signal-to-noise ratio is the difference you want to detect between the group means divided by the pooled standard deviation. The sample sizes shown are for each group: “[latex]n[/latex] = 40” indicates that you would need a total sample of 80 to compare the two groups.

The figure below gives a more useful plot for practice, showing the sample size required in each group to obtain 80% power in detecting the desired signal-to-noise ratio.
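Without the figure, a rough sample size can be obtained from the usual Normal approximation, [latex]n \approx 2(z_{1-\alpha/2} + z_{1-\beta})^2/\delta^2[/latex] per group, where [latex]\delta[/latex] is the signal-to-noise ratio and [latex]1 - \beta[/latex] is the power. This is a swapped-in approximation rather than the [latex]t[/latex]-based calculation behind the figure; because it ignores the extra variability from estimating the standard deviation, it tends to give slightly smaller sample sizes. The sketch uses only Python's standard library, and the function name is hypothetical.

```python
import math
from statistics import NormalDist

def n_per_group_normal_approx(signal_to_noise, power=0.80, alpha=0.05):
    """Approximate n per group for a two-sided two-sample test (Normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # 1.96 for alpha = 0.05
    z_beta = z.inv_cdf(power)           # 0.84 for 80% power
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / signal_to_noise**2)

for ratio in (0.5, 1.0):
    print(ratio, n_per_group_normal_approx(ratio))
```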

# Choosing Sample Sizes

As in Chapter 14, we can find the sample sizes [latex]n_1[/latex] and [latex]n_2[/latex] that give a desired margin of error, [latex]m[/latex], by rearranging the equation involving the standard deviation of the difference. Again we need some estimates of [latex]\sigma_1[/latex] and [latex]\sigma_2[/latex] to proceed. There is also the subtlety that the value of [latex]t^*[/latex] depends on [latex]n_1[/latex] and [latex]n_2[/latex], and in quite a complicated way when using the Welch method. However, this can all be managed by trial and error.

However, there is a more interesting question in this setting. Consider Alice’s experiment on the effects of caffeine on pulse rate. She had 20 friends available and chose to put 10 in each of her groups, giving equal sample sizes. This seems the intuitive thing to do, but is it the best use of the 20 subjects available?

The general formula for the margin of error when comparing the two groups is

\[ m = t^* \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}. \]

Based on Alice’s sample, it seems plausible that the population standard deviations are the same, [latex]\sigma_1 = \sigma_2 = \sigma[/latex]. Assuming that [latex]n = n_1 + n_2[/latex] is fixed we can write

\[ m = t^* \sigma \sqrt{\frac{1}{n_1} + \frac{1}{n - n_1}}. \]

For a fixed confidence level, we have no control in this formula over [latex]t^*[/latex], [latex]\sigma[/latex], or [latex]n[/latex]. The only choice we have is in the value of [latex]n_1[/latex]. Our aim should be to choose [latex]n_1[/latex], and hence [latex]n_2 = n - n_1[/latex], so that we make [latex]m[/latex] as small as possible, giving the best precision in our estimate of the difference in the mean pulse rate increases.

The figure above shows a plot of

\[ \sqrt{\frac{1}{n_1} + \frac{1}{20 - n_1}}, \]

for choices of [latex]n_1[/latex] between 1 and 19. It should be clear that [latex]n_1 = 10[/latex] gives the lowest value, so that [latex]n_1 = n_2 = 10[/latex] is the best choice for splitting the friends between the groups. Thus Alice was right in using equal sample sizes in her experiment. Note however that small differences in the sample sizes would not have had much effect on [latex]m[/latex].

In general, when [latex]\sigma_1 \ne \sigma_2[/latex], it can be shown that the minimum margin of error comes from choosing [latex]n_1[/latex] and [latex]n_2[/latex] so that

\[ \frac{n_1}{n_2} = \frac{\sigma_1}{\sigma_2}. \]

So when [latex]\sigma_1 = \sigma_2[/latex], as above, we should choose [latex]n_1 = n_2[/latex]. If, for example, we suspect that [latex]\sigma_1[/latex] is three times [latex]\sigma_2[/latex] then we would choose [latex]n_1[/latex] to be three times [latex]n_2[/latex]. If [latex]n = 20[/latex] this would give [latex]n_1 = 15[/latex] and [latex]n_2 = 5[/latex].
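A brute-force search makes the allocation rule concrete. The Python sketch below treats [latex]t^*[/latex] as fixed, as in the argument above, and for a total of [latex]n = 20[/latex] subjects finds the [latex]n_1[/latex] that minimises [latex]\sqrt{\sigma_1^2/n_1 + \sigma_2^2/(n - n_1)}[/latex]. Only the ratio of the standard deviations matters, so the values 1 and 3 are arbitrary, and the function names are illustrative.

```python
import math

def relative_margin(n1, n, sigma1, sigma2):
    """sqrt(sigma1^2/n1 + sigma2^2/(n - n1)): the margin of error without the factor t*."""
    return math.sqrt(sigma1**2 / n1 + sigma2**2 / (n - n1))

def best_split(n, sigma1, sigma2):
    """The n1 between 1 and n - 1 that gives the smallest margin of error."""
    return min(range(1, n), key=lambda n1: relative_margin(n1, n, sigma1, sigma2))

print(best_split(20, 1.0, 1.0))  # equal SDs: 10
print(best_split(20, 3.0, 1.0))  # sigma1 three times sigma2: 15
```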

However, even if the original data shows unequal standard deviations we will often use transformations to **stabilise** the variability, as illustrated in Chapter 19. The above arguments refer to the values you do the calculations with, and so equal sample sizes will still be appropriate if you transform your data in this way to have similar standard deviations.

Summary

- The sampling distribution of [latex]\overline{X}_1 - \overline{X}_2[/latex] gives a basis for calculating confidence intervals and carrying out hypothesis tests for a comparison of two population means.
- When population standard deviations are different the Welch degrees of freedom give a conservative approximation to this sampling distribution.
- For common population standard deviations a pooled standard deviation allows the use of a [latex]t[/latex] distribution with maximum degrees of freedom.
- Confidence intervals calculated for a difference in logarithms give a range for a ratio in the original units.
- For common population standard deviations it is optimal to split subjects evenly between the two treatment groups.

Exercise 1

Forty plastic cups were each filled with 20 mm of water stained with a blue food colouring. A celery stalk with leaves was placed in each cup with a toothpick through the centre for stabilisation. For twenty of the stalks the leaves were coated with petroleum jelly. All cups were placed behind a glass shield in the sun and left for 5 hours. Each celery stalk was then cut from the bottom up and the distance to where the blue stain could no longer be seen in the vascular tissue was recorded. The results are given in the table below.

## Dye uptake (mm) with or without coated leaves

Leaves | Dye uptake (mm) |
---|---|
Uncoated | 155, 144, 151, 139, 146, 131, 143, 156, 117, 125, 134, 142, 157, 146, 153, 140, 156, 138, 147, 156 |
Coated | 92, 110, 119, 104, 93, 86, 96, 107, 114, 96, 118, 111, 95, 106, 115, 108, 119, 89, 92, 101 |

Based on this data, calculate a 95% confidence interval for the difference in dye uptake between plants with leaves coated with petroleum jelly and those without.

Exercise 2

Alcoholic beverages are known to slow reaction times but can this effect be offset by adding caffeine to the drink? Two groups of 8 males were given five drinks of rum and coke over a two-hour period. One group had regular diet coke in their drinks while the other had decaffeinated diet coke. Reaction times were measured before drinking and then after the two hours they were measured again.

Reaction times came from a “ruler test” where a ruler was released at the 0 cm mark between a subject’s thumb and forefinger. The result was the distance the ruler travelled before it was caught. For each subject, the table below reports averages from three repetitions of the ruler test before drinking and three after drinking. Is there any evidence that the increase in reaction time is less for the group receiving regular diet coke?

## Average reaction times before and after alcohol (cm)

Regular, before | Regular, after | Decaffeinated, before | Decaffeinated, after |
---|---|---|---|
11.83 | 18.30 | 12.02 | 19.79 |
11.16 | 17.80 | 11.14 | 17.40 |
11.94 | 19.00 | 11.10 | 17.90 |
12.04 | 19.01 | 11.89 | 16.60 |
12.14 | 19.20 | 12.09 | 17.50 |
12.61 | 18.70 | 11.93 | 19.00 |
12.07 | 18.55 | 12.16 | 18.60 |
12.16 | 18.20 | 12.64 | 20.00 |

Exercise 3

Ingrid Ibsen, a student at Colmar University, was interested in whether reaction times differed between males and females. Using a sample of other people in the village, Ingrid had each subject press a button as quickly as possible after seeing a light flash. The results are shown in the table below. Is there any evidence of a difference in reaction times between males and females?

## Reaction times (ms) between sexes

Sex | Reaction times (ms) |
---|---|
Female | 293, 214, 297, 275, 285, 279, 245, 290, 244, 257, 279, 262, 254, 262, 276, 254, 267, 293, 289, 280, 262, 274, 238, 275, 269, 283, 289, 288 |
Male | 258, 269, 283, 243, 299, 264, 245, 292, 294, 258, 287, 254, 269, 314, 251, 304 |

Exercise 4

Modafinil is a wake promoting agent that has been used in the treatment of daytime sleepiness associated with narcolepsy and shift-work. Müller et al. (2013) conducted a double-blind study comparing the effect on creative thinking of 200 mg of modafinil ([latex]n_1 = 32[/latex]) or placebo ([latex]n_2 = 32[/latex]) in non-sleep deprived healthy volunteers. In one task the mean creativity score was 5.1 ([latex]s_1 = 3.4[/latex]) for the modafinil group compared to 6.5 ([latex]s_2 = 3.8[/latex]) for the placebo group. Does this give any evidence of an effect of modafinil on creative thinking?

Exercise 5

Robertson et al. (2013) followed a cohort of individuals from birth to age 26 years, conducting assessments at birth and then at ages 5, 7, 9, 11, 13, 15, 18, 21 and 26. At the assessments from ages 5 to 15 they asked parents the average amount of time these individuals spent watching television each weekday. For the 523 boys in the study the mean value was 2.42 hours with standard deviation 0.86 hours. For the 495 girls in the study the corresponding mean was 2.24 hours with standard deviation 0.88 hours. Does this give any evidence of a difference between boys and girls in the time spent watching television?

Exercise 6

Carry out a two-sample [latex]t[/latex] test for the sleep deprivation and internal clock study in Exercise 4. Compare your results with the exact [latex]P[/latex]-value from the randomisation test.

Exercise 7

Inspired by the work of Nascimbene et al. (2012), William Favreau conducted a study to compare the plant biodiversity between 11 conventional vineyards and 9 organic vineyards around Talu. Along with the area and the number of years it had been organic, William counted the number of annuals and perennials present in each vineyard. His results are shown in the table below.

## Plant biodiversity in vineyards

Management | Area (ha) | Years Organic | Annuals | Perennials |
---|---|---|---|---|
Organic | 29 | 21 | 10 | 18 |
Organic | 33 | 21 | 13 | 20 |
Organic | 37 | 7 | 12 | 16 |
Organic | 15 | 8 | 12 | 14 |
Organic | 22 | 25 | 12 | 30 |
Organic | 24 | 5 | 14 | 17 |
Organic | 13 | 16 | 16 | 18 |
Organic | 18 | 21 | 12 | 21 |
Organic | 23 | 6 | 13 | 12 |
Conventional | 45 | | 10 | 17 |
Conventional | 55 | | 15 | 13 |
Conventional | 35 | | 7 | 17 |
Conventional | 33 | | 12 | 13 |
Conventional | 26 | | 13 | 13 |
Conventional | 52 | | 17 | 12 |
Conventional | 32 | | 13 | 11 |
Conventional | 28 | | 15 | 13 |
Conventional | 51 | | 12 | 14 |
Conventional | 58 | | 16 | 14 |
Conventional | 63 | | 14 | 12 |

The mean total number of plant species for the organic vineyards was 31.1 with standard deviation 5.21. For the conventional vineyards the mean was 26.6 with standard deviation 1.96. Does this give evidence that the organic management practices have resulted in higher plant biodiversity?