# 5 Averages

In Chapter 4 we used quantiles to summarise the location, spread, and shape of a variable. We now turn to an alternative approach, based on averaging.

# Sample Mean

The **sample mean** or **sample average** of [latex]n[/latex] observations [latex]x_1, x_2, \ldots, x_n[/latex] is defined by

\[ \overline{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}. \]

Many statistical methods are based on adding things up, so we will often use the “sigma” notation for such sums. In general we define

\[ \sum_{j=1}^n x_j = x_1 + x_2 + \cdots + x_n, \]

where [latex]\sum[/latex] is sigma, the Greek capital letter ‘S’, chosen for ‘sum’.

The sample we are working with is usually obvious and so for simplicity we will drop the bounds on the sum, writing

\[ \sum x_j = x_1 + x_2 + \cdots + x_n \]

instead. The sample mean can then be written concisely as

\[ \overline{x} = \frac{\sum x_j}{n}. \]

The sample mean is popular because it is easy to calculate. Adding up a hundred numbers is a straightforward task whereas putting a hundred numbers in order to calculate the median takes a long time by hand.
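As a quick illustration, the definition translates directly into code. This is a minimal sketch in Python with made-up data values:

```python
# Sample mean: add up the observations and divide by the sample size n.
observations = [163, 166, 178]  # illustrative heights in cm

n = len(observations)
x_bar = sum(observations) / n
print(x_bar)  # 169.0
```

The built-in `statistics.mean` function performs the same calculation.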

The mean weight of Islanders in the survey data is 67.0 kg. You can think of this sample mean as a rate. The total weight of the 60 Islanders was 4020 kg. Suppose they were initially all in a room and were coming out a door one at a time. The average rate at which the weight would be coming out of the room is 67.0 kg per person.

The sample mean is badly affected by the presence of outliers in the data, or by strong skewness. For example, consider the forearm lengths in the survey data. If we leave in the observation of 260 cm, the sample mean is 29.1 cm. Nobody has a forearm length like that: it is higher than all the other observations, yet nowhere near 260 cm, so it is not capturing the “average” value of the distribution at all. Changing that single observation to 26.0 cm instead reduces the sample mean to 25.2 cm, a dramatic shift. In contrast, the sample median of the forearm lengths is 25.0 cm with or without the outlier.
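To see this effect in code, here is a small Python illustration. The numbers are made up to mimic the situation described, not the actual survey values:

```python
from statistics import mean, median

# Illustrative forearm lengths (cm), not the actual survey data.
# The last value mimics a data-entry error: 26.0 recorded as 260.
forearms = [24.0, 25.0, 25.0, 26.0, 260.0]

print(mean(forearms))    # 72.0, dragged far above every typical value
print(median(forearms))  # 25.0, unaffected by the outlier

forearms[-1] = 26.0      # correct the error
print(mean(forearms))    # 25.2
print(median(forearms))  # 25.0, unchanged
```

The median only looks at the middle of the ordered values, which is why correcting the outlier leaves it untouched.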

Newcomb

The Appendix gives data from Newcomb’s experiments in 1882 to measure the passage time of light. A histogram of these measurements is shown in the figure below. His aim in doing this experiment was to determine the speed of light, since he knew the distance the light had travelled. What value of the passage time should he use when estimating the speed of light?

The values here have been coded to simplify analysis, as described in the appendix. The sample mean of the 66 observations is 26.21 ns. However, this includes the outlying value of -44 ns. If we were happy to drop this then the sample mean of the remaining observations changes to 27.29 ns. This difference will have a big impact on the estimate of the speed of light. The value of -2 ns is also a bit unusual. Dropping it changes the sample mean to 27.75 ns and also moves the sample median from 27 ns to 27.5 ns.

In the background we know there is a true speed of light and we have to make decisions like these carefully. If we remove values which are genuine then we may be biasing our results and will end up with a poor estimate of the thing we are interested in.

With the difficulties of outliers, why then do we bother using the sample mean? One pragmatic reason is that it is easy to calculate. Adding numbers up is a much simpler problem than putting numbers in order, particularly without a computer. The sample mean is also more powerful because it does take into account all the values in the data. The sample median is robust because it isn’t affected much by values far from the middle but this is also a weakness because it is potentially wasting information. Finally, the sample mean has a rich theory, which we will explore in Chapter 13, and in particular has an important relationship with Normal distributions.

# Sample Standard Deviation

Using the interquartile range to measure spread is a natural accompaniment to using the median to measure centre, both using the idea of a quantile. What then do we use to measure spread in conjunction with the sample mean [latex]\overline{x}[/latex]?

The answer to this question is well known but it is not a trivial one. To help justify what we do, suppose we have just three observations of height: 163, 166, and 178 cm. The sample mean of these is [latex]\overline{x} = 169[/latex] cm.

## Deviations and Degrees of Freedom

To measure how spread out these values are we start by adding up the differences of each value from [latex]\overline{x}[/latex]. If values were more spread out then the differences would be bigger, so this seems a sensible measure. Here we find that

\[ (163 - 169) + (166 - 169) + (178 - 169) = (-6) + (-3) + (9) = 0 \mbox{ cm}. \]

Unfortunately this always happens, a result of the definition of the sample mean, so we can’t add up differences directly as a measure of spread. However, the fact that this happens is important for our later analysis. If we know 2 of the deviations from [latex]\overline{x}[/latex] then we always know the last one, since the deviations must sum to 0. In general we say that there are [latex]n-1[/latex] **free** deviations, or that the **degrees of freedom** are [latex]n-1[/latex]. We lost a degree of freedom because we had to estimate [latex]\overline{x}[/latex] from the data before we could calculate the deviations. If we had been told a specific value for the mean in advance then the deviations would not necessarily sum to 0.
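The fact that the deviations always cancel is easy to check in code. A minimal Python sketch using the three heights:

```python
# Deviations of each observation from the sample mean always sum to 0.
heights = [163, 166, 178]
x_bar = sum(heights) / len(heights)  # 169.0

deviations = [x - x_bar for x in heights]
print(deviations)       # [-6.0, -3.0, 9.0]
print(sum(deviations))  # 0.0, always, by the definition of the mean
```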

An obvious way around this problem of negative differences is to take absolute values of the differences, so that we use

\[ |163 - 169| + |166 - 169| + |178 - 169| = 6 + 3 + 9 = 18 \mbox{ cm} \]

as a measure of spread instead. This is a perfectly good method to use but there is a mathematical reason for avoiding it. The absolute value function is shown in the figure below.

It has a sharp point at the origin which means you can’t find its slope there. That means if we used absolute values to measure spread we wouldn’t be able to use all the tools of calculus, such as derivatives, to work with them.

Another way of getting rid of negatives is to square numbers, so we could use

\[ (163 - 169)^2 + (166 - 169)^2 + (178 - 169)^2 = (-6)^2 + (-3)^2 + (9)^2 = 126 \mbox{ cm}^2 \]

as our measure of spread. The graph of the squaring function, shown in the figure above, is a lovely smooth parabola. This is perfect for working with and so you will see squared deviations used everywhere in statistics and data analysis.

## Standard Deviation

The story is not quite over though. If we had a data set with 100 observations instead of 3, but with the same spread, then just adding up the squared deviations would naturally give a bigger number. To compensate for this we average the squared deviations. When we averaged to get the sample mean we divided by the sample size [latex]n[/latex], but that is not what we do here. Instead we average by the degrees of freedom, [latex]n-1[/latex], since we only really have [latex]n-1[/latex] free deviations. Thus we use

\[ \frac{126}{3-1} = 63 \mbox{ cm}^2 \]

to measure spread. Another way to think about this is to suppose you just had 1 observation in your data. You could use it to estimate the centre but it doesn’t give you any information about the variability. If you had 2 observations then you have got information about the variability but the sample mean is exactly in the middle of the observations so the two squared deviations are the same. You only have 1 piece of information about the variability. In general the information about spread will always be one step behind the sample size, [latex]n-1[/latex], since you have to start by estimating the centre before you can estimate spread.

Finally, since the units of our data were in centimetres then when we squared the deviations they became cm[latex]^2[/latex]. It is not so good to have a measure of spread that is in different units to the data and so the last thing we do is to take the square root of the average squared deviations,

\[ \sqrt{63} = 7.94 \mbox{ cm} \]

in our little example.

Altogether, this measure of spread is called the **sample standard deviation** and is denoted by [latex]s[/latex], with

\[ s = \sqrt{\frac{(x_1 - \overline{x})^2 + (x_2 - \overline{x})^2 + \cdots + (x_n - \overline{x})^2}{n-1}}, \]

or, in compact notation,

\[ s = \sqrt{\frac{\sum (x_j - \overline{x})^2}{n - 1}}. \]
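Putting the steps together for the three heights, the formula translates directly into Python. This is a sketch of the calculation; note that the built-in `statistics.stdev` uses the same [latex]n-1[/latex] divisor:

```python
from math import sqrt

heights = [163, 166, 178]
n = len(heights)
x_bar = sum(heights) / n  # 169.0

# Sum of squared deviations, then average by the degrees of freedom n - 1.
ss = sum((x - x_bar) ** 2 for x in heights)  # 126.0
s = sqrt(ss / (n - 1))                       # sqrt(63)
print(round(s, 2))  # 7.94
```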

It should be clear from the above discussion that [latex]s[/latex] involves a lot of ideas and it is not surprising that there is no easy way to interpret its value, unlike the interquartile range. It is also very susceptible to the effects of outliers, particularly since it squares the deviations and so a large deviation will have a very large effect. Why then do we bother using it? The main reason is that it is related to the Normal distribution, and this distribution will play a central part in our later methods of analysis.

Male and Female Heights

For the male heights in the survey data we find that the mean height is 177.1 cm with a sum of squared deviations of 1337.9 cm[latex]^2[/latex], from [latex]n = 34[/latex] subjects. This gives

\[ s = \sqrt{\frac{1337.9}{34 - 1}} = 6.37 \mbox{ cm}. \]

For females the mean height is 167.4 cm with a sum of squared deviations of 870.3 cm[latex]^2[/latex], from [latex]n = 26[/latex] subjects. This gives

\[ s = \sqrt{\frac{870.3}{26 - 1}} = 5.90 \mbox{ cm}. \]
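These calculations are easy to reproduce from the quoted sums of squared deviations. A quick Python check:

```python
from math import sqrt

# Sums of squared deviations and sample sizes quoted above.
male_ss, male_n = 1337.9, 34
female_ss, female_n = 870.3, 26

print(round(sqrt(male_ss / (male_n - 1)), 2))      # 6.37
print(round(sqrt(female_ss / (female_n - 1)), 2))  # 5.9
```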

Although the sample standard deviation does not have a simple interpretation like that of the interquartile range, we can use these values to say that male and female heights seem to have roughly the same amount of variability, as we noted from an earlier figure.

## Sample Variance

The square of the sample standard deviation, [latex]s^2[/latex], is called the **sample variance**. This is just the sum of the squared deviations divided by the degrees of freedom. Most of the time when we are summarising data we will use the standard deviation [latex]s[/latex]. However in Chapter 19 we will start looking at this process of averaging squared deviations more closely and show how it can be used to analyse a range of statistical problems.

## Prediction Errors

In the above discussion we have used a difference from the mean, such as [latex]178 - 169 = 9[/latex] cm, as part of the measure of the spread of the observations around the mean. Suppose we were instead trying to make predictions about the heights of people and wanted to choose a single value, [latex]b[/latex], that was going to be our guess based on our data. If we let [latex]b = 169[/latex] then in this case we could view the difference [latex]178 - 169 = 9[/latex] as a **prediction error**: we guessed 169 cm but their height was 178 cm so we made an error, underestimating by 9 cm.

What would be the best choice for our guess? It would be nice to minimise our prediction errors, making our guess as close as possible to the observed data. As before, we can’t just minimise the sum of the errors since this can always be made 0 (when [latex]b[/latex] is the sample mean) or negative (for larger values of [latex]b[/latex]). One criterion for choosing [latex]b[/latex] is to minimise the sum of the squared prediction errors, as we did when defining the sample standard deviation. That is, we choose [latex]b[/latex] to minimise

\[ \sum_{j=1}^n (x_j - b)^2. \]

For the three height observations this is

\[ (163 - b)^2 + (166 - b)^2 + (178 - b)^2. \]

What value of [latex]b[/latex] makes this sum as small as possible? If you expand this out you get a simple quadratic involving [latex]b[/latex] so a little bit of algebra or calculus can give you the answer. Alternatively the figure below shows a plot of this function of [latex]b[/latex] and it is reasonably clear that the minimum value occurs when [latex]b = 169[/latex] cm.

It is not a coincidence that this is just the sample mean of the three observations (see Exercise 4). In fact we could have given our definition of the sample mean to be the value that minimises the sum of the squared deviations from the observed values. We will expand on this view further in Chapter 21.
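We can also confirm this numerically by scanning candidate guesses over a fine grid. A small Python sketch:

```python
# Sum of squared prediction errors for a guess b, using the three heights.
heights = [163, 166, 178]

def sse(b):
    return sum((x - b) ** 2 for x in heights)

# Scan guesses from 160 to 180 in steps of 0.1 and keep the best one.
candidates = [160 + 0.1 * i for i in range(201)]
best = min(candidates, key=sse)
print(round(best, 1))  # 169.0, the sample mean
```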

# Causality

It is worth emphasising the issue of establishing **causation** mentioned in Chapter 2. We can demonstrate a simple example of this using data from a survey of students in a statistics class. The figure below shows a side-by-side dot plot of heights split by whether the student regularly watched the classic Australian soap opera *Neighbours*.

It seems from this plot that people who watch *Neighbours* tend on average to be shorter than people who don’t watch *Neighbours*, and indeed there is some statistical evidence from the data that this is the case. Does this mean watching *Neighbours* stunts your growth? Probably not. There is certainly an **association** between height and watching *Neighbours* but that doesn’t mean there is a causal relationship.

In this case there is a third variable which explains both of the others. The following figure shows a bar chart which indicates a strong relationship between sex and whether the student regularly watches *Neighbours*. About 73% of those watching *Neighbours* were female, well above the 54% of females in the data set. We have already seen the relationship between height and sex. Together these give the association between height and watching *Neighbours*.

So does this mean that watching *Neighbours* has no effect on growth and height? Not necessarily. The problem with this data is that it came from a survey. If we really wanted to see if there was any relationship we would need to do an experiment, randomly assigning a sample of subjects to either watch or not watch *Neighbours* for a few years and then look for any differences in height.

## Summary

- The sample mean is used as a measure of a distribution’s centre.
- The sample standard deviation gives a measure of the spread about the sample mean.
- The sample standard deviation involves degrees of freedom, [latex]n-1[/latex].
- Both the sample mean and sample standard deviation are badly affected by outliers.
- A difference in means between groups does not necessarily imply the groups are causing the difference.

## Exercise 1

The mean height of 8 people in a room is 172 cm. One person leaves the room. The mean height is now 171 cm. What is the height of the person who left the room?

## Exercise 2

Verify the sample standard deviations for male and female heights given earlier in this chapter.

## Exercise 3

Twelve seedlings, starting with 2 leaves each, were planted in two large boxes, 6 in one and 6 in the other. Both boxes contained soil from the same batch except that one had fertiliser added. The boxes were kept undercover and equal quantities of water and sunlight were received by each plant. After a three-week period the number of leaves present on each plant were recorded as a measure of growth. The results are shown in the table below. The plant with 0 leaves after three weeks appeared to have died from a fungal infection.

## Number of leaves after three weeks

| Group | Plant 1 | Plant 2 | Plant 3 | Plant 4 | Plant 5 | Plant 6 |
|---|---|---|---|---|---|---|
| Control | 8 | 6 | 6 | 4 | 5 | 6 |
| Fertiliser | 11 | 8 | 9 | 8 | 8 | 0 |

Calculate the mean and standard deviation of the number of leaves of fertilised plants in the above table. How do these statistics change if we ignore the plant that died? Repeat this for the median and interquartile range.

## Exercise 4

If you know some calculus, show that the estimate that minimises the sum of squared prediction errors is the sample mean.