# 13 Sampling Distribution of the Mean

[latex]\newcommand{\pr}[1]{P(#1)} \newcommand{\var}[1]{\mbox{var}(#1)} \newcommand{\mean}[1]{\mbox{E}(#1)} \newcommand{\sd}[1]{\mbox{sd}(#1)} \newcommand{\Binomial}[3]{#1 \sim \mbox{Binomial}(#2,#3)} \newcommand{\Student}[2]{#1 \sim \mbox{Student}(#2)} \newcommand{\Normal}[3]{#1 \sim \mbox{Normal}(#2,#3)} \newcommand{\Poisson}[2]{#1 \sim \mbox{Poisson}(#2)}[/latex]

We can now move on to the fundamental idea behind statistical inference. Suppose we carry out a study on the effect of drinking 250 mL of caffeinated cola, as in Alice’s experiment. We take a sample of 10 subjects and find that the average increase in pulse rate is 15.80 bpm. We are not really interested in just the 10 subjects in the sample but would instead like to use this number to say something about all people in the population that this sample came from.

There is a problem with this though. If we take another sample of 10 subjects from the same population then we probably won’t get 15.80 bpm again. So what is the use of the sample mean?

The solution is to quantify how much the sample mean could vary if we took many samples. In practice we will take only one sample, but we can use our understanding of this sampling behaviour to get a feel for the accuracy of the sample mean.

The sample mean is a random variable because if we were to repeat the sampling process from the same population then we would usually not get the same sample mean. To make the sample mean at all useful we need to know the nature and size of its randomness.

# Simulation

The figure below shows a histogram of the ages of the 2525 residents of Arcadia. The distribution has a definite skew to the right. The mean age in this population is [latex]\mu[/latex] = 32.9 years with standard deviation [latex]\sigma[/latex] = 20.5 years. It is worth emphasising here that you can always talk about the mean and standard deviation of a population or sample even if they are skewed. These values always exist regardless of the distribution. What you cannot always do is then say that they are the parameters of a Normal model for the distribution.

Suppose we take a sample of [latex]n = 25[/latex] residents from this population and record their ages. A histogram from one sample is shown in the figure below. The aim of the sample is to describe the population it came from, and we see here that the histogram shows a skewed distribution. The sample mean [latex]\overline{x}[/latex] = 29.4 years is close to the population mean, while the sample standard deviation [latex]s[/latex] = 22.8 years is close to the population standard deviation. Again this is not surprising, since that is the whole point of taking a sample.

But now suppose we took another sample of [latex]n = 25[/latex] residents from the same population. Because it is a random sample it would be unlikely that we would get [latex]\overline{x}[/latex] = 29.4 years again. A further 19 samples of size 25 were taken, and altogether the 20 sample means of the ages were as follows:

## Mean ages (years) of the 20 samples

29.4 | 31.7 | 28.6 | 38.6 | 29.9 | 32.6 | 32.5 | 32.0 | 32.5 | 31.9 |

29.2 | 32.3 | 38.1 | 29.7 | 38.6 | 23.9 | 25.2 | 42.3 | 34.4 | 35.8 |

Now we have 20 observations, each of which is a sample mean. What we want to do is to describe the distribution of the sample mean. You can imagine a whole population of sample means, the sample means you would calculate as you repeated the sampling process again and again. We will later give a precise description of this population, but for now we will explore it by visualising and summarising the values.
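The repeated-sampling process described above can be sketched in a few lines of code. The Arcadia ages themselves are not reproduced here, so this sketch substitutes a hypothetical right-skewed Gamma population with a similar mean and spread; the particular values will differ from the table above, but the behaviour is the same.

```python
import random

random.seed(1)

# Hypothetical stand-in for the 2525 Arcadia ages: a right-skewed
# Gamma population with mean about 32.5 and sd about 20.6.
population = [random.gammavariate(2.5, 13.0) for _ in range(2525)]

def sample_mean(pop, n):
    """Draw a random sample of size n (without replacement) and return its mean."""
    sample = random.sample(pop, n)
    return sum(sample) / n

# Repeat the sampling process 20 times, as in the text.
means = [sample_mean(population, 25) for _ in range(20)]
print([round(m, 1) for m in means])
```

Running this gives 20 different sample means scattered around the population mean, just like the table of mean ages.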

The figure above shows the distribution of the sample means from the 20 age samples. This looks quite different to the histogram of age from a single sample shown in the earlier figure. There is now more of a central peak and a less pronounced skewness. The sample mean of the sample means is 32.5 years, again close to the population mean, but the sample standard deviation of the sample means is 4.56 years, much less than both the standard deviation of the single sample and the population standard deviation.

So the distribution of the sample means seems to be different to the distribution of the population from which the samples were taken. The centres are about the same but the sample means are less spread out and have a different shape of distribution.

This is a fundamental idea and a turning point in the way we will look at data. Instead of just being able to give summary values like the sample mean, once we understand their distributions we will also be able to say other things about them, such as how accurate they are in describing the population parameters of interest.

To understand the nature of the sample mean’s distribution, let us look at some larger simulations of the sampling process and see how the sample size affects the results. The figures below show the distribution of 1000 sample means of age samples of various sizes. Study the changes in the distribution, in terms of the centre, spread, and shape, as the sample size increases.

The following table gives the summary statistics for each simulation of 1000 sample means. As we noted above, the mean of the sample means is pretty close to the population mean. This is a useful **mantra** to meditate on:

**The mean of the sample mean is the population mean.**

If you understand the role of the three different “means” in this sentence then the rest of this book should be easy going.

However, the standard deviations of the sample means become smaller as the sample size increases. The last column of the following table shows the ratio of the standard deviation of the sample means to the population standard deviation ([latex]\sigma[/latex] = 20.5 years). Can you see the relationship between the ratios and [latex]n[/latex]? Try to find the pattern before reading on.

## Summary statistics for 1000 sample means of age

[asciimath]n[/asciimath] | Mean | St Dev | Ratio
---|---|---|---
4 | 32.11 | 10.18 | 0.497
16 | 33.23 | 5.16 | 0.252
100 | 32.86 | 2.05 | 0.100

The first ratio is about [latex]\frac{1}{2}[/latex] for [latex]n=4[/latex]. The second is about half that again, roughly [latex]\frac{1}{4}[/latex] for [latex]n=16[/latex]. You might guess that the ratio in general is [latex]1/\sqrt{n}[/latex], and this fits all of the ratios.
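The [latex]1/\sqrt{n}[/latex] pattern can be checked by rerunning the simulation for several sample sizes. This sketch again uses a hypothetical Gamma population standing in for the Arcadia ages (which are not reproduced here) and compares each observed ratio with [latex]1/\sqrt{n}[/latex].

```python
import math
import random

random.seed(2)

# Hypothetical right-skewed population with spread close to sigma = 20.5.
population = [random.gammavariate(2.5, 13.0) for _ in range(2525)]

def sd(values):
    """Sample standard deviation (n - 1 denominator)."""
    m = sum(values) / len(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

sigma = sd(population)
ratios = {}
for n in [4, 16, 100]:
    means = [sum(random.sample(population, n)) / n for _ in range(1000)]
    ratios[n] = sd(means) / sigma
    print(n, round(ratios[n], 3), round(1 / math.sqrt(n), 3))
```

The observed ratio and [latex]1/\sqrt{n}[/latex] agree to within simulation error for each sample size.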

To write this as a formula, consider the random variable [latex]\overline{X}[/latex]. Values for this random variable are found by taking a random sample from a population and calculating the sample mean of the observations. In this way each of the simulations above is just 1000 values of [latex]\overline{X}[/latex], for a particular sample size.

In words we have observed that “the mean of the sample mean is the population mean”. That is, the long-run average value of the sample mean, if we were to do the sampling over and over again, is the population mean. This is just an expected value and so we can write

\[ \mean{\overline{X}} = \mu. \]

We call estimators like this **unbiased** since on average they give the value we are trying to estimate. When we speak of an **accurate** estimator we mean that it has low bias and is also **precise** in the sense that it has low variability (the aspect that standard deviation captures). All of the estimators we consider will be unbiased, and so our focus will be on precision.

The ratio of the standard deviation of the sample means to the population standard deviation is [latex]1/\sqrt{n}[/latex], so

\[ \sd{\overline{X}} =\frac{\sigma}{\sqrt{n}}. \]

These two formulas will underlie a lot of our procedures. In particular, the standard deviation formula tells us how much variability there is in our sample mean. Thus the formula gives us a measure of precision.
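Both formulas can be checked directly by simulation. The sketch below assumes a Normal population with the [latex]\mu[/latex] = 32.9 and [latex]\sigma[/latex] = 20.5 quoted earlier (any population shape would do), draws many samples of size 25, and compares the mean and standard deviation of the sample means with [latex]\mu[/latex] and [latex]\sigma/\sqrt{n}[/latex].

```python
import math
import random

random.seed(3)

mu, sigma, n = 32.9, 20.5, 25

# Simulate the sampling distribution of the mean: 20000 sample means,
# each from a sample of n = 25 draws from a Normal(mu, sigma) model.
means = [sum(random.gauss(mu, sigma) for _ in range(n)) / n
         for _ in range(20000)]

mean_of_means = sum(means) / len(means)
sd_of_means = math.sqrt(sum((m - mean_of_means) ** 2 for m in means)
                        / (len(means) - 1))

print(round(mean_of_means, 2))  # close to mu = 32.9
print(round(sd_of_means, 2))    # close to sigma / sqrt(n) = 4.1
```

The simulated mean of the sample means lands near [latex]\mu[/latex], and their standard deviation near [latex]\sigma/\sqrt{25} = 4.1[/latex], as the two formulas predict.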

## Central Limit Theorem

We have now quantified the mean and standard deviation for the process of obtaining a sample mean. Like the standard deviation, the shape of the distribution of the sample mean also differs from that of the population, becoming more Normal as the sample size increases.

We saw in Chapter 12 that the density curve for a Normal distribution is given by a complicated function involving exponentials and [latex]\pi[/latex], the area of the unit circle. Why on earth does biological data, such as the heights of males or females, have a distribution with such a strange density curve?

It turns out that this is related to the sampling distribution of the sample mean. No matter what the distribution being sampled from, the **Central Limit Theorem** tells us that the sample mean will have a roughly Normal distribution, getting closer and closer to Normal as the sample size increases. We saw this effect through the figures in the previous section. This theorem underpins much of our later calculations of confidence intervals and significance tests.
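A quick way to see the Central Limit Theorem at work is to track a measure of skewness as the sample size grows. The sketch below samples from an Exponential population (skewness 2, chosen here purely for illustration) and shows the skewness of the sample means shrinking towards the 0 of a symmetric Normal distribution.

```python
import random

random.seed(4)

def skewness(values):
    """Sample skewness: third central moment over the cube of the sd."""
    n = len(values)
    m = sum(values) / n
    s2 = sum((v - m) ** 2 for v in values) / n
    s3 = sum((v - m) ** 3 for v in values) / n
    return s3 / s2 ** 1.5

# Sample means from a strongly right-skewed Exponential population.
skews = {}
for n in [1, 5, 30]:
    means = [sum(random.expovariate(1.0) for _ in range(n)) / n
             for _ in range(20000)]
    skews[n] = skewness(means)
    print(n, round(skews[n], 2))
```

For means of [latex]n[/latex] independent Exponentials the theoretical skewness is [latex]2/\sqrt{n}[/latex], so the printed values shrink towards 0 as [latex]n[/latex] increases, matching the figures in the previous section.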

And why then do heights follow a Normal distribution? Think about the origin of the variability of heights. A person’s height is influenced by genetic factors and then by a range of environmental factors. That is, a person’s height is an average of various factors. And the Central Limit Theorem says that whenever you have the result of an averaging process, it will be approximately Normal.

# Some Algebra

As with the formulas for Binomial counts in Chapter 11, we can use some straightforward algebra to prove that the formulas we have inferred from the above simulations are correct.

Suppose [latex]X_1, X_2, \ldots, X_n[/latex] are [latex]n[/latex] independent samples from the same population with population mean [latex]\mu[/latex] and standard deviation [latex]\sigma[/latex]. Then

\begin{eqnarray*}
\mean{\overline{X}} & = & E\left(\frac{X_1 + X_2 + \cdots + X_n}{n}\right) \\
& = & E\left(\frac{1}{n}(X_1 + X_2 + \cdots + X_n)\right) \\
& = & \frac{1}{n} \mean{X_1 + X_2 + \cdots + X_n} \\
& = & \frac{1}{n} \left( \mean{X_1} + \mean{X_2} + \cdots + \mean{X_n}\right) \\
& = & \frac{1}{n} (\mu + \mu + \cdots + \mu) \\
& = & \frac{1}{n} (n \mu) = \mu.
\end{eqnarray*}

Similarly for variance, since samples are independent we have

\begin{eqnarray*}
\var{\overline{X}} & = & \mbox{var}\left(\frac{X_1 + X_2 + \cdots + X_n}{n}\right) \\
& = & \mbox{var}\left(\frac{1}{n}(X_1 + X_2 + \cdots + X_n)\right) \\
& = & \frac{1}{n^2} \var{X_1 + X_2 + \cdots + X_n} \\
& = & \frac{1}{n^2} \left( \var{X_1} + \var{X_2} + \cdots + \var{X_n}\right) \\
& = & \frac{1}{n^2} (\sigma^2 + \sigma^2 + \cdots + \sigma^2) \\
& = & \frac{1}{n^2} (n \sigma^2) = \frac{\sigma^2}{n},
\end{eqnarray*}

and so

\[ \sd{\overline{X}} = \frac{\sigma}{\sqrt{n}}. \]

It is worth reflecting on how fortunate it is that the [latex]n[/latex] doesn’t cancel when finding [latex]\var{\overline{X}}[/latex], as it did for [latex]\mean{\overline{X}}[/latex]. If [latex]n[/latex] did not appear in the denominator of the formula then there would be no point replicating experiments to improve precision!

# Summary

If [latex]\overline{X}[/latex] is the mean of [latex]n[/latex] independent observations of a random variable [latex]X[/latex] with mean [latex]\mu[/latex] and standard deviation [latex]\sigma[/latex], and we define

\[ Z = \frac{\overline{X} - \mu}{\sigma/\sqrt{n}}, \]

then [latex]\Normal{Z}{0}{1}[/latex].

This is the exact distribution of [latex]Z[/latex] if [latex]X[/latex] is Normal and is an approximation, by the Central Limit Theorem, if [latex]X[/latex] is not Normal.
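Probabilities for the sample mean can then be found by standardising and using the standard Normal distribution. The sketch below computes one such probability; the population values ([latex]\mu[/latex] = 100, [latex]\sigma[/latex] = 15, [latex]n[/latex] = 9) are illustrative and not taken from the text.

```python
import math

def normal_cdf(z):
    """Standard Normal cumulative probability, via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Illustrative example: a Normal population with mu = 100 and sigma = 15,
# and a sample of n = 9 independent observations.
mu, sigma, n = 100, 15, 9
xbar = 108

z = (xbar - mu) / (sigma / math.sqrt(n))
print(round(z, 2))                  # standardised sample mean
print(round(1 - normal_cdf(z), 4))  # P(sample mean exceeds 108)
```

Here the sample mean has standard deviation [latex]15/\sqrt{9} = 5[/latex], so [latex]\overline{x} = 108[/latex] standardises to [latex]z = 1.6[/latex], and the upper-tail probability follows from the standard Normal distribution.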

## Exercise 1

Suppose that the height of a randomly chosen male has a mean of 179.1 cm and a standard deviation of 7.18 cm. What is the standard deviation of the mean height of eight independently chosen males?

## Exercise 2

The potassium level from a blood test for a particular group has a Normal distribution with mean 4.3 mmol/L and standard deviation 0.383 mmol/L. Suppose we take a sample of three independent blood tests. What is the probability that the mean potassium level from the sample is more than 5.0 mmol/L?

## Exercise 3

The table below gives the dissolving times of 200 pain relief tablets, 100 dissolved in cold water and 100 dissolved in hot water. For the cold water trials, 75 mL of water between 11[latex]^\circ[/latex]C and 14[latex]^\circ[/latex]C was placed in a glass. The tablet was stirred with a rod until it had dissolved. For the hot water trials the water was between 79[latex]^\circ[/latex]C and 83[latex]^\circ[/latex]C.

## Dissolving times for Solprin tablets (s)

**Cold Water**

28.0 | 26.4 | 26.8 | 25.7 | 25.9 | 25.9 | 30.3 | 23.1 | 22.5 | 27.3 |

26.1 | 26.1 | 25.6 | 26.6 | 26.1 | 24.5 | 28.3 | 25.3 | 26.6 | 28.6 |

26.2 | 27.3 | 27.2 | 25.0 | 26.0 | 25.0 | 26.4 | 24.7 | 26.7 | 26.1 |

29.9 | 23.8 | 23.5 | 24.0 | 27.2 | 25.3 | 24.7 | 27.0 | 29.8 | 28.0 |

24.1 | 26.9 | 24.4 | 28.4 | 23.2 | 27.4 | 28.2 | 23.1 | 24.2 | 21.9 |

28.9 | 26.1 | 28.6 | 26.5 | 25.0 | 25.4 | 26.7 | 31.3 | 27.6 | 28.1 |

26.2 | 24.1 | 31.2 | 25.8 | 25.9 | 28.5 | 27.4 | 24.7 | 28.8 | 24.0 |

26.8 | 26.6 | 27.3 | 21.5 | 22.7 | 32.5 | 33.2 | 25.6 | 33.0 | 25.8 |

24.5 | 30.6 | 27.4 | 24.1 | 23.1 | 27.3 | 29.1 | 26.1 | 22.3 | 28.8 |

25.2 | 27.4 | 22.0 | 26.6 | 28.6 | 33.6 | 21.9 | 24.4 | 26.6 | 25.7 |

**Hot Water**

19.4 | 16.9 | 21.6 | 18.6 | 21.3 | 15.6 | 16.9 | 17.3 | 16.6 | 20.7 |

22.0 | 21.6 | 16.0 | 18.0 | 18.9 | 22.2 | 15.8 | 19.0 | 21.6 | 16.8 |

14.7 | 15.7 | 17.1 | 14.5 | 15.0 | 15.5 | 16.3 | 18.2 | 15.9 | 15.9 |

16.3 | 16.8 | 15.1 | 17.9 | 17.9 | 16.2 | 17.7 | 16.2 | 14.3 | 16.5 |

13.9 | 17.8 | 18.1 | 14.8 | 19.4 | 17.4 | 16.4 | 15.6 | 17.6 | 16.8 |

17.4 | 14.4 | 16.6 | 15.2 | 21.3 | 14.2 | 16.0 | 16.3 | 20.8 | 17.5 |

17.6 | 16.0 | 16.2 | 15.4 | 14.4 | 14.2 | 16.1 | 19.0 | 20.4 | 13.2 |

17.2 | 22.3 | 20.8 | 16.7 | 17.3 | 11.9 | 14.4 | 12.7 | 15.5 | 17.9 |

17.3 | 15.0 | 19.1 | 17.4 | 16.4 | 12.8 | 15.9 | 15.4 | 14.6 | 14.4 |

17.7 | 15.4 | 15.5 | 16.4 | 16.2 | 17.5 | 16.8 | 20.1 | 15.2 | 15.5 |

You can treat these as two populations of tablets whose dissolving times are known. The cold water times have mean [latex]\mu_C[/latex] = 26.43 s with standard deviation [latex]\sigma_C[/latex] = 2.49 s, while the hot water times have mean [latex]\mu_H[/latex] = 16.92 s with standard deviation [latex]\sigma_H[/latex] = 2.24 s.

Each row of data, containing 10 tablets, can be taken as a sample from the respective population. Work out the sample mean of the dissolving times in each row and then work out the (sample) mean and (sample) standard deviation of the sample means. Compare these with the formulas we found above.