# 3 Visualising Distributions

We now look at some plots and language for describing the variability present in a single quantitative variable. The pattern of variability we see is called the **distribution** of the variable and this pattern typically involves a **central tendency**, where observations tend to gather around a central value, with fewer observations further away.

We will be interested in describing three main aspects of the distribution we see:

- The **location** or **centre** of the variability, a typical value taken by the variable.
- The **spread** of the variability, how far the values extend from the centre.
- The **shape** of the variability, in particular whether or not values are spread symmetrically on either side of the centre.

Once we have tried describing the variability using these aspects, we then look for values or patterns which don’t match this general description. For example, we may find **outliers**, values which don’t match the rest of the pattern, or **bimodal** distributions where there may actually be two distributions of variability in our values. We will discuss all of these patterns as they arise in examples.

The starting point for describing variability is to **visualise** the data values. If the pattern we see is well behaved then it may be possible to **summarise** the distribution with some numbers. We will look at summarising a single quantitative variable in Chapters 4 and 5. For categorical data the task of describing and summarising a distribution is much simpler, as discussed later in this chapter.

# Dot Plots

A very simple plot for continuous data is the **dot plot**. To make a dot plot we draw an axis for the variable of interest, covering a range from the smallest to the largest observed value, and then put a dot for each observation. The figure below shows a dot plot of the heights of the 60 Islanders in the survey data. For some heights, such as 174 cm, there was more than one observation with that value. So that these repeated observations can all be seen, some random **jitter** has been added to the vertical position of the points.
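The jitter step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the heights are the female heights from the Exercise 1 table at the end of this chapter (not the 60 Islanders in the figure), and the function name `jitter_points`, the offset size `amount` and the fixed `seed` are hypothetical choices made for the example.

```python
import random

# Female heights (cm) from the Exercise 1 table; 166, 170 and 175 each occur twice.
heights = [175, 158, 166, 175, 160, 165, 166, 170, 170, 172]

def jitter_points(values, amount=0.2, seed=1):
    """Pair each value with a small random vertical offset in [-amount, amount]."""
    rng = random.Random(seed)  # fixed seed so the same picture is produced each time
    return [(v, rng.uniform(-amount, amount)) for v in values]

points = jitter_points(heights)
for x, y in sorted(points):
    print(f"height = {x} cm, vertical offset = {y:+.3f}")
```

The horizontal position of each dot is still the observed height; only the vertical position is random, so repeated heights no longer sit exactly on top of each other.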

We need to describe the distribution of variability we see in a dot plot. The three features that can be described from a dot plot are

- the **location** or **centre** of the distribution
- the **spread** of the distribution
- any deviations from the general pattern

For the height data the middle seems to be around 174 cm. The idea of using this to describe location is that it is a typical value for height, with similar variability on either side. Most of the observations are between 160 cm and 185 cm, a range of 25 cm. That is, the spread of the variability in height covers a range of 25 cm. The lowest and highest observations seem a bit unusual but are within the general pattern of variability.

## Side-by-Side Plots

To explore relationships between a quantitative variable and a categorical variable, we can split the quantitative values into the various categories and then make **side-by-side** dot plots of the distributions.

The figure above shows a side-by-side dot plot of the relationship between height and sex. It is apparent that the males tend to be taller on average but the amount of variability in heights is similar for males and females.
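As a rough sketch of the splitting step, the following Python groups heights by sex and prints one text row of dots per group on a shared scale. The data are the heights from the Exercise 1 table at the end of this chapter, not the survey data in the figure, and the helper names `split_by_category` and `text_dot_row` are hypothetical.

```python
from collections import Counter, defaultdict

# Heights (cm) from the Exercise 1 table, labelled by sex.
female = [175, 158, 166, 175, 160, 165, 166, 170, 170, 172]
male = [184, 182, 180, 191, 189, 181, 180, 170, 176, 185]
records = [("Female", h) for h in female] + [("Male", h) for h in male]

def split_by_category(pairs):
    """Collect the quantitative values separately for each category."""
    groups = defaultdict(list)
    for category, value in pairs:
        groups[category].append(value)
    return dict(groups)

def text_dot_row(values, lo, hi):
    """One character per cm: 'o' for a single value, a digit for repeated values."""
    counts = Counter(values)
    return "".join(
        " " if counts[x] == 0 else ("o" if counts[x] == 1 else str(counts[x]))
        for x in range(lo, hi + 1)
    )

groups = split_by_category(records)
lo = min(h for _, h in records)
hi = max(h for _, h in records)
for sex, values in groups.items():
    print(f"{sex:>6} |{text_dot_row(values, lo, hi)}|")
print(f"        scale runs from {lo} cm to {hi} cm, one column per cm")
```

Using the same `lo` and `hi` for every row is what makes the rows comparable; separate scales for each group would hide the difference in location.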

## Paired Experiments

A study looked at the effects of carrying a load during exercise on change in pulse rate. Ten subjects recorded their resting pulse rates with pulse rate monitors. Half began the experiment with no extra weight whilst the other half strapped 4 kg of weights on their legs. The subjects then stepped on and off a 25 cm stepping block at a pace of 75 steps per minute for two minutes. Pulse rates were recorded and the subjects then rested until their pulse rate returned to its original resting rate. The subjects then switched weights and the procedure was repeated. The same process was carried out by four more pairs of subjects. The results are shown in the table below.

## Pulse rate (bpm) for different exercise loads

Subject | Resting | 0 kg | 4 kg |
---|---|---|---|
1 | 71 | 76 | 132 |
2 | 73 | 130 | 146 |
3 | 77 | 133 | 138 |
4 | 98 | 163 | 171 |
5 | 66 | 168 | 179 |
6 | 71 | 137 | 151 |
7 | 75 | 142 | 159 |
8 | 72 | 131 | 146 |
9 | 81 | 148 | 159 |
10 | 76 | 133 | 148 |

The figure below compares the pulse rates before and after the exercises with the 4 kg load.

The plot shows 10 points in each group, 20 in all, but this is misleading since there were only 10 subjects in the study. This is **paired data** since we have made two measurements on each subject. We would like to see how much an individual’s pulse rate increases with the exercise. Instead of making side-by-side plots, it is better to first take the difference in pulse rate for each individual and then plot these differences, as shown in the figure below. Now we see that one individual had a very large increase in pulse rate with a 4 kg load, a feature not apparent from the first plot.
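The differences can be computed directly from the pulse-rate table above. The short Python sketch below subtracts each subject's resting rate from their rate with the 4 kg load; the variable names are illustrative only.

```python
# Resting and 4 kg pulse rates (bpm) for subjects 1 to 10, from the table above.
resting = [71, 73, 77, 98, 66, 71, 75, 72, 81, 76]
load_4kg = [132, 146, 138, 171, 179, 151, 159, 146, 159, 148]

# One difference per subject: the increase in pulse rate with the 4 kg load.
differences = [after - before for before, after in zip(resting, load_4kg)]

for subject, diff in enumerate(differences, start=1):
    print(f"subject {subject:>2}: increase of {diff} bpm")

# Subject number (counting from 1) with the largest increase.
largest_subject = differences.index(max(differences)) + 1
print(f"largest increase: subject {largest_subject}")
```

The ten differences form a single set of numbers, one per subject, which can then be shown in an ordinary dot plot.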

Later on we will also see mathematical reasons for taking differences. The methods we will use assume independent samples, and because these measurements come in pairs from the same subjects they will not be independent: if someone has a relatively high pulse rate to start with then we would expect them to have a relatively high pulse rate at the end. This setting is a particular example of **repeated measures** data. Taking differences to give a single set of numbers is one way to deal with such data. We will look at further examples in Chapter 6.

# Histograms

For large samples a dot plot can get very crowded. An alternative is to make a **histogram**. We start by breaking the range of the values into a number of **bins** or **classes**. We tally the counts of values falling in each bin and then make the plot by drawing rectangles whose bases are the bin intervals and whose heights are the counts.

For example, the figure above shows a histogram of the 60 heights in the survey data. Here 7 bins were used (see the Appendix). Rather than using raw counts, the vertical axis here gives the **proportion** in each class, the **relative frequency**, defined by

$$\text{proportion} = \frac{\text{count}}{\text{total}}$$
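The tallying and the relative-frequency calculation can be sketched as follows. This is a minimal illustration, not the method used for the figure: it assumes equal-width bins running from the smallest to the largest value, with the largest value placed in the last bin, and it uses the ten resting pulse rates from the paired-experiment table because the survey heights are not listed here. The function name `histogram` is hypothetical.

```python
def histogram(values, n_bins):
    """Tally values into n_bins equal-width bins; return edges, counts, proportions."""
    lo, hi = min(values), max(values)
    if lo == hi:
        raise ValueError("all values are identical, so no bin width can be chosen")
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The largest value sits on the final edge, so it is placed in the last bin.
        index = min(int((v - lo) / width), n_bins - 1)
        counts[index] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    proportions = [c / len(values) for c in counts]  # count / total
    return edges, counts, proportions

# Resting pulse rates (bpm) from the paired-experiment table.
resting = [71, 73, 77, 98, 66, 71, 75, 72, 81, 76]
edges, counts, proportions = histogram(resting, n_bins=4)
for left, right, c, p in zip(edges, edges[1:], counts, proportions):
    print(f"{left:5.1f} to {right:5.1f}: count = {c}, proportion = {p:.2f}")
```

Changing `n_bins` and re-running is a quick way to see how sensitive the picture is to the choice of bins.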

We want to be able to describe the variability we see in such a histogram. Histograms also give an impression of centre and spread but are particularly useful for visualising shape.

Here the variability seems quite **symmetric** about the middle value of around 174 cm. That is, the pattern to the left and the right of 174 cm is roughly the same. (You will rarely see perfect symmetry with real data.)

Compare this to the figure below which shows a histogram of the Islander weights. Now there is a dense group of observations to the left which then tail out towards the larger values to the right. This is clearly not symmetric around any weight. We say that the distribution is **skewed to the right** or **positively skewed**. The direction of the skewness is the direction that the tail of values is pointing.

The following figure shows the distribution of forearm lengths in the survey data. The histogram was made using 7 bins but most of these are empty. The reason is that there is one unusual value which distorts the horizontal scale: the forearm length recorded for Noel Swift is 260, a long way from the rest of the pattern of variability. We call such values **outliers**, though you should be careful with this terminology. Just because a value deviates from the general pattern does not necessarily mean it is a mistake of any sort. Nor is every extreme value an outlier: in the weight data, Colin Kennedy’s value of 109 kg is a long way from the main peak of values but is just part of the long tail.

For the forearm data, however, the measurements were supposed to have been recorded in centimetres, and it seems likely that Noel used millimetres instead or that the person entering the data into the software missed the decimal place. Rather than drop this value we simply change it to 26 cm. In general, when you see an unusual observation like this you should do some sleuthing to try to determine whether it is a genuine observation or not. If we cannot find an explanation then we should try to make that measurement again, but if that is not possible then we should consider dropping such observations to focus on the main pattern. We can return to them later and see whether our conclusions would change if they were included.

We should explore distributions using a range of bins. The figure below shows a histogram of heights using 10 bins rather than 7. Rather than just one peak of values there seem to be two peaks, one located around 167 cm and one located around 177 cm. Is this genuine or just a result of the bins we chose? Why might there be two peaks?

We call a peak of a distribution a **mode** and call a distribution with just one peak, such as the weights in the earlier histogram, a **unimodal** distribution. A distribution with two modes, like the heights, is called **bimodal**. If there are more than two modes then we call the distribution **multimodal**. Seeing two or more modes often indicates the presence of a categorical variable which is related to the quantitative variable we are exploring. Here that variable is sex; there is a height distribution for females and a height distribution for males and they have different locations.

# Density Plots

The histogram is an example of a plot that estimates the **density** of a distribution. It is not a very good one, though, because it is not smooth and so, for instance, it can change dramatically as you change the width of the bins. One solution to this is to use what are called **kernel density estimators** (Silverman, 1986). The idea behind these is to put a little lump, the **kernel** of the density estimate, centred on each data value and then add all their densities together to get a picture of the distribution.

Suppose we are exploring the distribution of weights of the 60 Islanders in the survey data. The figure above shows the density plot you get using narrow kernels on each data point. An individual kernel is easiest to see for the weight of 109 kg, since it is isolated from the other points, but you can also see distinct lumps for other points, such as the weights of 95 and 100 kg. The plot then adds up the heights of the kernels above each weight value. Two Islanders were 85 kg, so the lump there is simply twice as high, while at other points the sum leads to more complicated shapes.

When you make the lumps wider the resulting shapes merge together more, giving a better overall picture of the distribution. The figure below shows an intermediate stage where the kernels have been doubled in width from the previous figure.

The width of the kernels is referred to as the **bandwidth** of the density estimate, increasing from 1 kg to 2 kg in these two plots. You can see how distinct peaks are now joining together. Finally, the following figure shows the default picture generated by software where the bandwidth was automatically chosen to be 4.59 kg. This is quite a large width and so the resulting curve is very smooth. Indeed density curve estimation can be thought of as a way of smoothing data. Further details and criteria for the automatic choice of bandwidth are given by Scott (1992).
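The idea of adding up kernels can be sketched in Python. Several assumptions are made here and should be read as such: the kernel is taken to be a Gaussian (normal) lump, the bandwidth is interpreted as the standard deviation of each lump, and the `weights` list is a small hypothetical sample containing only the values mentioned above (85 kg twice, 95, 100 and 109 kg), not the full survey data. The function names are illustrative.

```python
import math

def gaussian_kernel(u):
    """Standard normal density, used here as the shape of each lump (an assumption)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def kde(x, data, bandwidth):
    """Density estimate at x: the average of one scaled kernel per data value."""
    n = len(data)
    return sum(gaussian_kernel((x - xi) / bandwidth) for xi in data) / (n * bandwidth)

# Hypothetical small sample of weights (kg); 85 appears twice.
weights = [85, 85, 95, 100, 109]

for bandwidth in (1, 2):
    at_85 = kde(85, weights, bandwidth)
    at_109 = kde(109, weights, bandwidth)
    print(f"bandwidth {bandwidth} kg: density at 85 kg = {at_85:.4f}, at 109 kg = {at_109:.4f}")
```

With a bandwidth of 1 kg the lumps barely overlap, so the estimate at 85 kg is almost exactly twice the estimate at 109 kg, matching the description of the repeated value. Dividing by `n * bandwidth` keeps the total area under the curve equal to 1 whatever bandwidth is chosen.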

Note that density plots are useful for comparing multiple distributions on a single plot. For example, the figure below shows a comparison of the height distribution between the males and females in the survey data.

# Categorical Variables

So far we have been concerned with describing the behaviour of continuous variables. In contrast, summarising the distribution of a categorical variable is very simple. We tally up all of the observations in each of the categories and divide each count by the total number of observations. This gives the sample **proportions** in each category. The table below shows the counts and proportions for the five pizza toppings in the survey data.

## Proportions of preferred pizza toppings

Pizza | Mushroom | Pineapple | Prawns | Sausage | Spinach |
---|---|---|---|---|---|
Count | 10 | 17 | 5 | 11 | 17 |
Proportion | 0.167 | 0.283 | 0.083 | 0.183 | 0.283 |

A proportion can always be expressed as a **percentage** and we will use both in this text. For example, the proportion of Islanders who liked spinach was 0.283, so 28.3% liked spinach. People often find a percentage easier to understand than a proportion, and so percentages are useful for communicating results. However, rules that involve multiplying proportions together do not carry over directly to percentages, so we will mainly use proportions when doing our analyses.
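The proportions and percentages in the pizza topping table can be reproduced from the counts alone. The following Python is a small sketch of that arithmetic; the variable names are illustrative.

```python
# Counts of preferred pizza topping, from the table above.
counts = {"Mushroom": 10, "Pineapple": 17, "Prawns": 5, "Sausage": 11, "Spinach": 17}

total = sum(counts.values())  # total number of observations
proportions = {topping: count / total for topping, count in counts.items()}

for topping, count in counts.items():
    p = proportions[topping]
    print(f"{topping:<10} count = {count:>2}  proportion = {p:.3f}  percentage = {100 * p:.1f}%")
```

Because every observation falls in exactly one category, the proportions add up to 1 (apart from rounding in the displayed values).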

## Bar Charts

To display the distribution of a categorical variable we make a **bar chart** of the proportions. The figure below shows a bar chart of the proportions in the previous table. Since this is a nominal variable there is no ordering of the categories. For an ordinal variable, such as age group, we would order the categories as appropriate.
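As a rough text-only stand-in for such a chart, the sketch below draws one bar per topping using the counts from the pizza topping table, with one `#` per Islander and the proportion printed at the end of each bar. The function name `text_bar_chart` is hypothetical, and the categories are simply kept in the order of the table since the variable is nominal.

```python
# Counts of preferred pizza topping, from the table earlier in this section.
counts = {"Mushroom": 10, "Pineapple": 17, "Prawns": 5, "Sausage": 11, "Spinach": 17}

def text_bar_chart(category_counts):
    """Return one line per category: label, a bar of '#' marks, and the proportion."""
    total = sum(category_counts.values())
    return [
        f"{category:<10} {'#' * count} {count / total:.3f}"
        for category, count in category_counts.items()
    ]

lines = text_bar_chart(counts)
print("\n".join(lines))
```

The bar lengths are proportional to the counts, so they are also proportional to the sample proportions; only the scale on the axis would differ.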

## Summary

- Quantitative plots are used to explore the distribution of a variable in terms of location, spread, and shape, as well as deviations away from the overall pattern.
- Dot plots are a simple tool for visualising a quantitative variable and are particularly good for detecting clusters and gaps in a distribution.
- Histograms are useful for visualising larger data sets and give a good picture of the shape of a distribution, though you need to try various numbers of bins to confirm the pattern.
- Dot plots and density curves are useful for comparing distributions.
- Data from paired experiments should be reduced to one difference per subject before plotting.
- Categorical data can be summarised using percentages or proportions.
- Bar charts are used to visualise the distribution of counts or proportions for a categorical variable.

## Exercise 1

## Height (cm) and time breath held (s)

Female height | Female breath held | Male height | Male breath held |
---|---|---|---|
175 | 22.22 | 184 | 60.75 |
158 | 30.57 | 182 | 67.41 |
166 | 17.47 | 180 | 42.19 |
175 | 22.39 | 191 | 59.74 |
160 | 26.90 | 189 | 52.64 |
165 | 36.85 | 181 | 43.37 |
166 | 27.33 | 180 | 73.27 |
170 | 29.55 | 170 | 59.09 |
170 | 13.87 | 176 | 51.15 |
172 | 34.66 | 185 | 58.32 |

Make a histogram of the breath holding times, ignoring sex. Make a side-by-side dot plot of the same data for males and females. Use both of these plots to describe the distribution of variability.

## Exercise 2

Two plastic trays were lined with a bed of cotton wool and 25 bean seeds were placed with equal spacing in each tray. One litre of water was distributed evenly in each tray. One tray was placed in a dark cupboard with a portable radio tuned to a music station. The other tray was placed in an identical cupboard, but with no music. Both cupboards were kept closed with the exception of 30-second daily inspections to check moisture and sound volume. On the fourteenth day the trays were removed and the bean plants measured from the base of the stalk to the tip of the longest leaf. The results are given in the table below.

## Plant growth (mm) with or without music

Group | Growth | | | | |
---|---|---|---|---|---|
With music | 304 | 257 | 174 | 214 | 69 |
With music | 317 | 387 | 47 | 157 | 0 |
With music | 332 | 308 | 317 | 286 | 236 |
With music | 299 | 206 | 278 | 188 | 43 |
With music | 0 | 0 | 0 | 0 | 0 |
Without music | 292 | 270 | 47 | 288 | 324 |
Without music | 292 | 364 | 316 | 287 | 75 |
Without music | 282 | 149 | 274 | 319 | 213 |
Without music | 3 | 324 | 2 | 128 | 219 |
Without music | 94 | 164 | 0 | 0 | 0 |

Compare the plant growth distributions between the two groups using dot plots and histograms.

## Exercise 3

The figure below shows a bar chart of eye colour counts from a sample of Islanders from Eden. What proportion of the sample have blue eyes?