4 Quantiles
Having obtained an impression of the variability from dot plots and histograms, we now look at summarising the distribution of a variable with numbers.
Sample Median
The simplest way of measuring the location of a distribution is to put the observations in order, from smallest to largest, and record the middle observation. This value is called the sample median, [latex]M[/latex], and is the value which has 50% of the distribution above and 50% of the distribution below. We have already used this intuitive idea when describing the centres we saw in dot plots and histograms.
Lighting Level and Seedling Growth
Seedling height (mm) for different lighting levels
Normal | 41 | 52 | 49 | 34 | 38 | 42 | 38 | 43 |
| 51 | 41 | 42 | 32 | 39 | 40 | 33 |
High | 79 | 82 | 67 | 75 | 89 | 71 | 77 | 78 |
| 89 | 77 | 75 | 76 | 85 | 88 | 90 |
Resting Pulse Rate
The middle here is between the 5th and the 6th observation, 73 and 75 bpm. In this case we let the median be the average of these two values. Thus the median resting pulse rate is (73+75)/2 = 74 bpm.
The sample median is very robust against the effects of outliers. Suppose one measurement in the bean seedling data for normal lighting above had been recorded in centimetres instead of millimetres, so that 32 mm was recorded as 3.2. The sample median would still be 41 mm. If 52 mm had been recorded as 5.2 then the median would change, but only to 40 mm.
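The robustness claim can be checked numerically. The following is a minimal Python sketch using the standard library's `statistics.median` and the normal-lighting seedling heights from the table above; the variable names are illustrative only.

```python
from statistics import median

# Seedling heights (mm) under normal lighting, from the table above.
normal = [41, 52, 49, 34, 38, 42, 38, 43, 51, 41, 42, 32, 39, 40, 33]

# With 15 observations the median is the 8th value in sorted order.
original_median = median(normal)

# Hypothetical recording error: 32 mm entered as 3.2.
low_error = [3.2 if x == 32 else x for x in normal]

# Hypothetical recording error: 52 mm entered as 5.2.
high_error = [5.2 if x == 52 else x for x in normal]

print(original_median)     # 41
print(median(low_error))   # 41
print(median(high_error))  # 40
```

An error in the smallest value leaves the middle observation where it was, while an error that moves the largest value to the bottom shifts the middle position by only one place.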
Quartiles
The median is the middle value of the distribution, the value with 50% of observations to the left. We can similarly define other quantities. The first quartile, [latex]Q_1[/latex], is the value with 25% of observations to the left while the third quartile, [latex]Q_3[/latex], is the value with 75% of observations to the left. (What is the second quartile?) You can estimate [latex]Q_1[/latex] by hand by taking the median of the observations below the position of the median. Similarly, [latex]Q_3[/latex] can be estimated by taking the median of the observations above the position of the median. Computer software will estimate quartiles using a weighted average, a generalisation of the method of estimating the median when there is an even number of observations. The results you get by hand might be a bit different, particularly for small data sets, but will still be useful for describing variability.
For example, the 15 bean seedling measurements in the seedling growth example had a sample median of 41 mm, the 8th observation. [latex]Q_1[/latex] can be estimated by taking the median of the observations below this position. There are 7 of these, with a median of 38, so we estimate [latex]Q_1[/latex] = 38 mm. Similarly, we estimate [latex]Q_3[/latex] = 43 mm.
The pulse rates in the previous example have 10 observations, an even number. The median is taken to be half way between the 5th and 6th observations, 74 bpm. The median of the 5 numbers below this position gives [latex]Q_1[/latex] = 71 bpm. Similarly, we estimate [latex]Q_3[/latex] = 77 bpm.
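The by-hand method can be sketched in Python as follows. The function name `quartiles_by_hand` is hypothetical, and the comparison uses `statistics.quantiles` with `method="inclusive"` as just one example of a software convention; other packages may use different weighted averages and give slightly different values.

```python
from statistics import median, quantiles

def quartiles_by_hand(values):
    """Estimate Q1 and Q3 as the medians of the observations
    below and above the position of the median."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]               # observations below the median position
    upper = data[len(data) - half:]   # observations above the median position
    return median(lower), median(upper)

# Seedling heights (mm) under normal lighting.
normal = [41, 52, 49, 34, 38, 42, 38, 43, 51, 41, 42, 32, 39, 40, 33]

q1, q3 = quartiles_by_hand(normal)
print(q1, q3)  # 38 43

# One software convention (an assumption; packages differ): linear interpolation.
soft_q1, soft_median, soft_q3 = quantiles(normal, n=4, method="inclusive")
print(soft_q1, soft_median, soft_q3)  # 38.0 41.0 42.5
```

For this small sample the interpolated third quartile is 42.5 mm rather than 43 mm, the kind of minor difference between hand and software results described above.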
The five-number summary for a sample is a list of the minimum, the first quartile, the median, the third quartile, and the maximum. For the weight observations in the survey data, the five-number summary is
\[ 45, \; 58.5, \; 65, \; 74.5, \; 109. \]
Note that the distance from [latex]Q_1[/latex] to [latex]M[/latex] is 6.5 kg while the distance from [latex]M[/latex] to [latex]Q_3[/latex] is 9.5 kg. The 25% of values to the right of the median cover a wider range than the 25% to the left. Since there are the same number of observations in each range the ones to the right must be more thinly spread than those to the left. This suggests that the weight distribution is skewed to the right.
In contrast, the five-number summary for the pulse rates in the survey data is
\[ 48, \; 60, \; 68, \; 76, \; 92. \]
Here there is an even range on either side of the median suggesting that this distribution is more symmetric. In both of these cases the minimum and maximum are not of particular interest in themselves as a summary, since they are often extreme values which require further investigation, but they do give an impression of the greatest variability present.
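The comparison of distances on either side of the median can be written as a short Python sketch. The helper `quartile_gaps` is a hypothetical name; the two summaries are the ones quoted above.

```python
def quartile_gaps(summary):
    """Return (M - Q1, Q3 - M) for a five-number summary
    given as (minimum, Q1, median, Q3, maximum)."""
    _, q1, m, q3, _ = summary
    return m - q1, q3 - m

weight = (45, 58.5, 65, 74.5, 109)   # kg
pulse = (48, 60, 68, 76, 92)         # bpm

print(quartile_gaps(weight))  # (6.5, 9.5)
print(quartile_gaps(pulse))   # (8, 8)
```

A noticeably larger second gap, as for weight, hints at right skew, while equal gaps, as for pulse rate, are consistent with a symmetric distribution.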
Interquartile Range
The distance between [latex]Q_1[/latex] and [latex]Q_3[/latex], the interquartile range
\[ \text{IQR} = Q_3 - Q_1,\]
is the range of values covered by the middle 50% of observations. If we think of the middle 50% as being the most typical of observations in the sample then this distance is a useful measure of spread. For example, the [latex]\text{IQR}[/latex] for the weight observations in the survey data is
\[ 74.5 - 58.5 = 16 \text{ kg}. \]
Thus the middle half of the Islanders have weights in a range of 16 kg.
Quantile Plots
We have used the median, the value with 50% of observations to its left, to measure centre. We have now used the first quartile, with 25% to its left, and the third quartile, with 75% to its left, to measure spread. There is no reason to stop at these three percentiles, however. We can similarly define the 10th percentile, the value with 10% of observations to its left, and so on.
A quantile is the same as a percentile but instead of speaking about the 10th percentile we speak about the 0.1 quantile. That is, a certain quantile is a value with a certain proportion to the left, rather than a certain percentage. The median is the 0.5 quantile while [latex]Q_1[/latex] is the 0.25 quantile. Again, we can imagine any quantile, though calculating them by hand becomes a bit tedious.
A computer package can produce a quantile plot which shows all of the observations for a variable together with their corresponding quantiles. For example, the figure above shows a quantile plot of the weight data in the survey data. The vertical dotted lines correspond to the positions of the median and the first and third quartiles. You can read off the plot that the median is around 65 kg and that the interquartile range, the distance on the vertical axis between where the outer dotted lines are crossed, is around 15 kg.
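The coordinates behind a quantile plot can be sketched in a few lines of Python. This is only an illustration: the plotting position (i - 0.5)/n used here is one common convention and is an assumption, since packages differ. The function name `quantile_points` is hypothetical, and the normal-lighting seedling heights stand in for the survey data.

```python
def quantile_points(values):
    """Pair each sorted observation with the proportion of the
    sample taken to lie to its left, using (i - 0.5) / n."""
    data = sorted(values)
    n = len(data)
    return [((i - 0.5) / n, x) for i, x in enumerate(data, start=1)]

# Seedling heights (mm) under normal lighting.
normal = [41, 52, 49, 34, 38, 42, 38, 43, 51, 41, 42, 32, 39, 40, 33]

points = quantile_points(normal)
print(points[7])  # (0.5, 41): the 8th of 15 observations sits at the 0.5 quantile
```

Plotting the proportions on the horizontal axis against the values on the vertical axis gives a display of the kind described above.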
Plotting quantiles becomes particularly useful when comparing distributions. The figure below shows a quantile plot of the 60 heights in the survey data which has been split into females and males.
You can see from the pattern in the quantiles that the shapes of the two distributions are similar with both covering a similar range of values (the distance covered in vertical direction). However, the male plot is shifted upwards by about 10 cm, suggesting that the male distribution is located higher than the female distribution. We have seen this before in the side-by-side dot plot in Chapter 3, but a quantile plot like this gives an additional way of visualising the distributions in detail.
Quantile-Quantile Plots
This figure gives a quantile-quantile plot of height by sex.
Rather than plotting the quantiles for each height and for each group, as in the previous plot, each point in the quantile-quantile plot corresponds to the male height and female height for a particular quantile. For example, the median height for males is 176.5 cm and the median height for females is 166.5 cm, so there is a point (166.5, 176.5) on the plot.
If the two distributions were the same then the quantile-quantile plot would lie on the identity line. That is, if a male quantile was 170 cm then the female quantile should also be 170 cm. This is not the case here. The pattern observed lies above the identity line so that, for the same quantile, male heights are higher than female heights.
The pattern is roughly parallel to the identity line which suggests that the distributions are similar in other respects.
The above figure shows a quantile-quantile plot of pulse rate for the same Islanders.
Now most of the pattern lies closer to the identity line so that the distributions for males and females are more similar.
However, for the lower values the male quantiles are lower than the female quantiles while for the higher values the male quantiles are higher than the female quantiles. Thus the male pulse rate distribution must be a little more spread out than the female distribution.
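The construction of a quantile-quantile plot can be sketched in Python. This is a simplified illustration that assumes the two samples have the same size, so that matching sorted observations pairs up equal quantiles; unequal sizes would need interpolation. The function name `qq_points` is hypothetical, and the two seedling groups from earlier in the chapter stand in for the height data.

```python
def qq_points(sample_x, sample_y):
    """Pair the quantiles of two equal-sized samples by matching
    their observations in sorted order."""
    if len(sample_x) != len(sample_y):
        raise ValueError("this sketch assumes equal sample sizes")
    return list(zip(sorted(sample_x), sorted(sample_y)))

# Seedling heights (mm) for the two lighting levels.
normal = [41, 52, 49, 34, 38, 42, 38, 43, 51, 41, 42, 32, 39, 40, 33]
high = [79, 82, 67, 75, 89, 71, 77, 78, 89, 77, 75, 76, 85, 88, 90]

pairs = qq_points(normal, high)
print(pairs[7])                        # (41, 78): the pair of medians
print(all(y > x for x, y in pairs))    # True: every point lies above the identity line
```

Every point lying above the identity line corresponds to one distribution being located higher than the other at every quantile.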
Box Plots
Quantile plots show the position of all of the quantiles in a set of observations but this is often too much information. A box plot is a representation of just the quantiles in the five-number summary instead, giving a very simple description of the data. However, there is still a lot you can say about a distribution from a box plot, and their simplicity makes them very useful for comparing several distributions together.
The figure below shows a box plot of the heights in the survey data. The box goes from the first quartile to the third quartile, with a vertical line at the median. The “whiskers” of the box then extend to the minimum and maximum observations.
As was discussed earlier in this chapter, the relationships among the distances between the values in the five-number summary allow you to judge the shape of the distribution. In the height data the median is roughly in the middle of the box and the whiskers extend a similar distance in each direction, so the distribution seems symmetric. In contrast, the following figure shows a box plot of the weight observations. Here the top whisker and top half of the box are quite long, suggesting that the distribution is skewed to the right.
If the extreme values in a box plot appear to be unusual then the whiskers are only drawn to the highest and lowest values which are not unusual. The unusual values are then plotted separately.
How do we define “unusual” in this context? We use the [latex]1.5\times\text{IQR}[/latex] rule, which flags as unusual any value more than [latex]1.5\times\text{IQR}[/latex] above [latex]Q_3[/latex] or more than [latex]1.5\times\text{IQR}[/latex] below [latex]Q_1[/latex]. This is a fairly robust rule since outliers do not have much effect on the quartiles. In the weight data the two values of 100 kg and 109 kg are flagged as unusual but these are really just part of the tail of the distribution, as observed in the histogram in Chapter 3. We will discuss this rule further in Chapter 12.
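The rule can be sketched in Python using the weight quartiles from the five-number summary given earlier. The function name `iqr_fences` is hypothetical, and only the three weights mentioned in the text (the minimum and the two flagged values) are checked, since the full data set is not reproduced here.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the lower and upper cut-offs of the k x IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles of the weight data (kg) from the five-number summary.
lower, upper = iqr_fences(58.5, 74.5)
print(lower, upper)  # 34.5 98.5

# Check the minimum and the two largest weights mentioned in the text.
flagged = [w for w in (45, 100, 109) if w < lower or w > upper]
print(flagged)  # [100, 109]
```

The minimum of 45 kg lies inside the lower cut-off of 34.5 kg, so only values in the upper tail are flagged.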
The above figure shows a side-by-side box plot of the height values. In describing such plots we use the same points as we did when looking at a single variable, except now we can compare attributes. From the box plots it appears that male heights are located higher than female heights, though there is some overlap. While the centres are different, the spread of both distributions appears roughly the same.
Note that some values are now flagged as unusual, even though none were flagged in the original box plot. By adding this new information about each individual (their sex) we obtain a better picture of the pattern of variability and deviations from it.
Summary
- Sample quantiles give a robust description of the distribution of a variable.
- The sample median can be used to measure the centre of a distribution.
- The quartiles and interquartile range give a measure of the spread of a distribution.
- The five-number summary and box plot give a compact description of a distribution, including a rough picture of its shape.
- The [latex]1.5\times\text{IQR}[/latex] rule is a standard method for flagging unusual observations in data. Such observations are plotted separately in box plots.
Exercise 1
Calculate the median age for the Islanders in the survey data.
Exercise 2
The figure below shows a histogram of (discrete) forearm lengths for a sample of Islanders in Shinobi. Calculate the median forearm length for this data.
Exercise 3
Verify the calculation of the quartiles for the weight data given in the section on quartiles.
Exercise 4
An experiment compared the reaction times of two groups of subjects, males and females. Each group was composed of twenty subjects selected on the criteria that they were between 18 and 21 years of age and participated in some form of sports/physical activity at least three times a week. Each subject had a ruler placed between their thumb and forefinger at the 0 cm mark. The ruler was dropped and the distance it had travelled before they caught it was recorded. The results are shown in the table below.
Reaction times (cm) between sexes
Male | 14.2 | 16.0 | 19.8 | 21.9 | 15.3 |
| 18.8 | 18.5 | 15.2 | 15.0 | 18.5 |
| 16.1 | 15.2 | 17.4 | 12.8 | 17.3 |
| 20.0 | 14.3 | 16.1 | 17.0 | 16.3 |
Female | 18.9 | 14.1 | 15.5 | 13.4 | 17.3 |
| 19.7 | 15.5 | 14.0 | 18.4 | 19.4 |
| 16.5 | 17.8 | 14.7 | 15.2 | 16.6 |
| 15.9 | 21.0 | 16.4 | 15.6 | 19.2 |
Make side-by-side dot plots and box plots of the reaction times. Describe the differences, if any, between male and female reaction times.
Exercise 5
Make side-by-side box plots of the plant growth data seen in Chapter 3. Does a box plot give a good picture of the distributions?