6 Visualising Relationships
We have previously used side-by-side plots to explore the relationship between a quantitative variable and a categorical variable. This involved making a univariate plot for each group in the data. With two quantitative variables the situation is more complicated, reflecting the richer information in quantitative variables.
Scatter Plots
We visualise patterns between two quantitative variables using a scatter plot. These are easily drawn by making two axes, one for each variable, and then using the values of the variables as the coordinates of a point to plot each case. The figure below shows a scatter plot of the weights and heights of the 60 Islanders in the survey data. As an example, Taylor Jones was 171 cm tall and weighed 63 kg. She is represented on the plot as the point (171, 63).
How do you decide which variable should go on which axis? Often you will be interested in how one variable affects the other. Suppose we want to predict the weight of a person from their height. We call height the predictor variable, since we want to make predictions from it, or the explanatory variable, since we think that height will help ‘explain’ the weight; it would be natural for bigger people to be heavier, for example. The weight is then called the response variable, giving the response to the value of the other variable.
The simple rule for drawing scatter plots is
- Response variables go on the vertical axis
- Predictor variables go on the horizontal axis
Note that you will not always be interested in the relationship between a response and a predictor variable. In such cases it does not matter which way around you put your axes. The following figure shows a scatter plot of the lengths of the iris petals and sepals in the Appendix. Here we are not necessarily trying to predict one of the values from the other. We would simply be interested in the nature of the relationship between these dimensions. There seems to be a fairly clear relationship among the 100 points with longer petals, but the irises with shorter petals seem to belong to a separate cluster of points. The symbols in the plot correspond to different iris varieties. One goal with data like this is to find a way of automatically discriminating between varieties based on observations of the variables, and one method for doing this is known as discriminant analysis. Another approach is to use logistic regression, as discussed in Chapter 23.
There are many ways of exploring the relationships between multiple variables. For example, the figure below shows a scatter plot matrix of the Iris data. This shows all the scatter plots for pairs of variables in the data. With 4 variables there are 6 possible pairs but the matrix shows each choice of response and predictor roles, giving 12 plots in total. Look for where the small copy of the previous figure is embedded in this bigger plot.
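The count of panels in a scatter plot matrix can be checked with a few lines of Python. This is only a sketch: the four variable names below are assumed for illustration, since the text says only that the Iris data has 4 variables.

```python
from itertools import combinations, permutations

# Assumed variable names; the chapter only states that there are 4 variables.
variables = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

# Unordered pairs: each pair of variables is counted once.
unordered_pairs = list(combinations(variables, 2))

# Ordered pairs: (response, predictor) and its swap are separate panels.
ordered_pairs = list(permutations(variables, 2))

print(len(unordered_pairs))  # 6
print(len(ordered_pairs))    # 12
```

In general, p variables give p(p - 1)/2 unordered pairs and p(p - 1) off-diagonal panels.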
Time Plots
One case where the choice of axes is made for you is in a time plot. A time plot shows the behaviour of a variable over time, with the vertical axis giving the variable value and the horizontal axis giving time. This is done because the eye is used to following along from left to right and so it is easier to “read” the behaviour of the variable in this way. To emphasise this, time plots also join successive points together with lines to accentuate the ups and downs.
Learning Effects
The figure below shows Newcomb’s measurements of the passage time of light from the Appendix in the order in which he made them.
It is clear from this plot that the two outliers that we identified in Chapter 3 occur early on in Newcomb’s work. It is likely then that the explanation for these is some learning effect, as Newcomb became proficient with using his apparatus. Rather than just removing the two outliers from the data when calculating the mean, it may then be better to drop the first 15 or 20 observations altogether. This is another example of why it is useful to visualise your data with more than just one plot.
Drug Concentration Profiles
A standard study in pharmacokinetics involves testing the bioequivalence of a test product to a reference product. It is of interest to know whether the new product will deliver a drug in a similar way to existing products. This is usually done as a cross-over design where all subjects get both treatments.
Professor Maree Smith from the University of Queensland provided data from such a study involving 24 subjects. Multiple variables were recorded during the trials, but here we will summarise the results of the drug concentration measurements. The tables below show the measurements made on the first subject over an 8-hour period after each product was administered.
Drug concentrations ([latex]\mu[/latex]g/L) over time (hours): reference formulation
Time | 0.0 | 0.3 | 0.7 | 1.0 | 1.4 | 1.7 | 2.0 | 2.3 | 2.7 |
---|---|---|---|---|---|---|---|---|---|
Concentration | 0 | 0 | 0 | 0 | 10 | 74 | 73 | 100 | 92 |

Time | 3.0 | 3.5 | 4.0 | 4.5 | 5.0 | 6.0 | 7.0 | 8.0 |
---|---|---|---|---|---|---|---|---|
Concentration | 87 | 594 | 627 | 404 | 217 | 85 | 36 | 15 |

Drug concentrations ([latex]\mu[/latex]g/L) over time (hours): test formulation
Time | 0.0 | 0.4 | 0.7 | 1.0 | 1.3 | 1.7 | 2.0 | 2.3 | 2.7 |
---|---|---|---|---|---|---|---|---|---|
Concentration | 0 | 0 | 0 | 37 | 704 | 544 | 329 | 240 | 138 |

Time | 3.0 | 3.5 | 4.0 | 4.5 | 5.0 | 6.0 |
---|---|---|---|---|---|---|
Concentration | 136 | 119 | 125 | 79 | 48 | 32 |
A time plot of this data is shown in the figure below. One line shows the concentration profile for the test tablet while the other shows the profile for the reference tablet. It is easy to compare the different speeds at which the drug becomes available to the body from this plot. Which line is which?
Data such as this is referred to as repeated measurements data. We have similar plots for 23 other subjects in the experiment and want to make an overall comparison between the two formulations. This is a complex problem since there is so much information for each subject. To simplify the problem it is common to try and summarise a response curve for a subject by one or more characteristic numbers. For example, we could record the maximum concentration of the drug during each trial of a formulation. Here these values are 704 [latex]\mu[/latex]g/L for the test formulation and 627 [latex]\mu[/latex]g/L for the reference formulation. We will explore these results further in Chapter 14. Alternatively, we could record the times of the maximum concentrations and compare those. Another common measurement is to calculate the area under each curve, giving an estimate of the total drug available to the body during the trial.
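The summary numbers described above can be computed directly from the first subject's measurements. The sketch below uses the trapezoidal rule for the area under the curve, which is one common choice rather than necessarily the method used in the original study. The function name `summarise_profile` is made up for this illustration.

```python
# First subject's measurements, copied from the tables above
# (time in hours, concentration in micrograms per litre).
ref_time = [0.0, 0.3, 0.7, 1.0, 1.4, 1.7, 2.0, 2.3, 2.7,
            3.0, 3.5, 4.0, 4.5, 5.0, 6.0, 7.0, 8.0]
ref_conc = [0, 0, 0, 0, 10, 74, 73, 100, 92,
            87, 594, 627, 404, 217, 85, 36, 15]
test_time = [0.0, 0.4, 0.7, 1.0, 1.3, 1.7, 2.0, 2.3, 2.7,
             3.0, 3.5, 4.0, 4.5, 5.0, 6.0]
test_conc = [0, 0, 0, 37, 704, 544, 329, 240, 138,
             136, 119, 125, 79, 48, 32]


def summarise_profile(times, concs):
    """Return (maximum concentration, time of maximum, trapezoidal area)."""
    c_max = max(concs)
    t_max = times[concs.index(c_max)]  # time of the first occurrence of the maximum
    # Trapezoidal rule: width of each interval times the average of its two heights.
    auc = sum((t1 - t0) * (c0 + c1) / 2
              for t0, t1, c0, c1 in zip(times, times[1:], concs, concs[1:]))
    return c_max, t_max, auc


ref_summary = summarise_profile(ref_time, ref_conc)
test_summary = summarise_profile(test_time, test_conc)
print("reference:", ref_summary)
print("test:     ", test_summary)
```

Note that the test series stops at 6 hours while the reference series runs to 8 hours, so the two areas cover different time windows and should be compared with that in mind.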
Describing Relationships
Scatter plots are simple to draw but because they involve two variables there is a lot more richness in the types of patterns you may see. The general model we will work with is that there is an average response for a particular value of the explanatory variable and that this average response might change as the explanatory variable changes. For example, we can imagine taking all the people in the world who are 160 cm tall and looking at the distribution of their weights. The mean of that distribution is the average response at 160 cm, and repeating this at other heights traces out the trend. You can fit a relationship by hand in this way by drawing a line which follows the vertical mean as you move horizontally.
There is a mathematical procedure called loess fitting which captures this idea of following the changes in average response. Loess (Cleveland, 1993) is a local regression method which fits a straight line at each value of the explanatory variable using only the points which are in the neighbourhood of that value. The response level of the loess fit is the point on the locally fitted line. The figure below shows the weight and height data again with a loess fit. It appears that there is a generally increasing relationship but that below 165 cm the relationship is uncertain.
Loess is a useful tool for exploring bivariate data because it doesn’t make any assumptions about the nature of the relationship. This is similar to the density curve estimation we saw in Chapter 3 for visualising a single variable, another way of smoothing the data to see an overall pattern.
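The idea of fitting a straight line in the neighbourhood of each value can be sketched in plain Python. This is a simplified local linear fit with tricube weights, not Cleveland's full loess procedure (it has no robustness iterations). The function names and the `span` default are assumptions made for this illustration. For data it uses the Vitamin C dissolving times tabulated later in this chapter.

```python
def tricube(u):
    """Tricube weight: close to 1 near the target value, falling to 0 at the edge."""
    u = abs(u)
    return (1 - u ** 3) ** 3 if u < 1 else 0.0


def local_linear_fit(xs, ys, x0, span=0.5):
    """Fitted response at x0 from a weighted straight line through nearby points."""
    k = max(2, round(span * len(xs)))            # number of points in the neighbourhood
    h = sorted(abs(x - x0) for x in xs)[k - 1]   # distance to the k-th nearest point
    w = [tricube((x - x0) / h) if h > 0 else 0.0 for x in xs]
    sw = sum(w)
    if sw == 0:  # degenerate neighbourhood: fall back to a plain local mean
        near = [y for x, y in zip(xs, ys) if abs(x - x0) <= h]
        return sum(near) / len(near)
    # Weighted least squares for a straight line.
    xbar = sum(wi * x for wi, x in zip(w, xs)) / sw
    ybar = sum(wi * y for wi, y in zip(w, ys)) / sw
    sxx = sum(wi * (x - xbar) ** 2 for wi, x in zip(w, xs))
    sxy = sum(wi * (x - xbar) * (y - ybar) for wi, x, y in zip(w, xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return ybar + slope * (x0 - xbar)            # point on the locally fitted line


# Vitamin C data from this chapter: temperature (deg C) and dissolving time (s).
temperature = [0, 5, 10, 14, 15, 20, 25, 30, 40, 55, 57, 61,
               70, 79, 81, 83, 88, 90.5, 94, 95, 97, 100, 100]
dissolve_time = [1532, 1326, 1152, 1012, 961, 842, 680, 568, 419, 265, 246, 209,
                 174, 117, 87, 83, 87, 81, 66, 64, 57, 44, 42]

for t in (10, 50, 90):
    print(t, round(local_linear_fit(temperature, dissolve_time, t), 1))
```

A smaller `span` follows the data more closely, while a larger one gives a smoother curve. In practice a library routine would normally be used instead of hand-written code.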
Direction
If an increase in the explanatory variable tends to correspond to an increase in the response then we say there is a positive association between the variables. For example, the relationship between weight and height is positive, since taller people tend to be heavier.
If an increase in the explanatory variable tends to correspond to a decrease in the response then we say there is a negative association. The following table shows the results from an experiment where Vitamin C tablets were placed into beakers that each had 250 mL of water in them. The temperature of the water was controlled, ranging from 0[latex]^\circ[/latex]C to 100[latex]^\circ[/latex]C. Each beaker was stirred constantly until the tablet had completely dissolved, and the time taken was recorded.
Vitamin C dissolving times (s) by temperature (°C)
Temperature | 0 | 5 | 10 | 14 | 15 | 20 | 25 | 30 |
---|---|---|---|---|---|---|---|---|
Time | 1532 | 1326 | 1152 | 1012 | 961 | 842 | 680 | 568 |

Temperature | 40 | 55 | 57 | 61 | 70 | 79 | 81 | 83 |
---|---|---|---|---|---|---|---|---|
Time | 419 | 265 | 246 | 209 | 174 | 117 | 87 | 83 |

Temperature | 88 | 90.5 | 94 | 95 | 97 | 100 | 100 |
---|---|---|---|---|---|---|---|
Time | 87 | 81 | 66 | 64 | 57 | 44 | 42 |
The figure below shows a plot of the relationship between time taken to dissolve and temperature. This association is negative, with higher temperatures giving shorter dissolving times.
Linearity
A simple pattern to look for in a scatter plot is a straight line relationship between the two variables. That is, as the explanatory variable changes the average response changes by following a straight line. We call such relationships linear. The relationship between weight and height might be modelled with a straight line, while the relationship in the previous figure is clearly nonlinear.
Microwaving Seeds
Thirty seeds of the same variety were grown for three days under the same conditions. The length of each seedling was measured and they were split into 6 groups of 5 seedlings each. Each group was then placed in a microwave on ‘high’ for an amount of time which was different for each group. On the following day the seedlings were measured again and their growth since the microwaving treatment was recorded. The results are shown in the table below.
Seedling growth (cm) for different microwave exposure times (s)
Exposure | Growth | ||||
---|---|---|---|---|---|
10 | 2.2 | 2.4 | 2.5 | 2.2 | 2.3 |
20 | 2.7 | 2.5 | 2.6 | 2.9 | 2.8 |
30 | 3.0 | 2.9 | 3.4 | 3.6 | 3.5 |
40 | 2.2 | 2.1 | 2.0 | 2.4 | 1.9 |
50 | 2.2 | 1.4 | 1.0 | 1.1 | 1.1 |
60 | 0.0 | 0.0 | 0.2 | 0.2 | 0.1 |
The following figure shows the growth of seedlings against the different amounts of time in a microwave oven. This is another type of nonlinear relationship. This type of pattern occurs frequently when a small dosage of something is beneficial but a large dosage is harmful. An optimal level, in this case around 30 seconds of radiation, is often sought.
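The rise and fall in growth can be seen numerically by averaging the five seedlings at each exposure time. The sketch below uses the values from the table above. The variable names are chosen for this illustration.

```python
from statistics import mean

# Growth (cm) of the five seedlings in each group, keyed by exposure time (s).
growth_cm = {
    10: [2.2, 2.4, 2.5, 2.2, 2.3],
    20: [2.7, 2.5, 2.6, 2.9, 2.8],
    30: [3.0, 2.9, 3.4, 3.6, 3.5],
    40: [2.2, 2.1, 2.0, 2.4, 1.9],
    50: [2.2, 1.4, 1.0, 1.1, 1.1],
    60: [0.0, 0.0, 0.2, 0.2, 0.1],
}

# Average response at each value of the explanatory variable.
mean_growth = {seconds: mean(values) for seconds, values in growth_cm.items()}

# Exposure time with the largest mean growth among the levels tested.
best_exposure = max(mean_growth, key=mean_growth.get)

for seconds, m in mean_growth.items():
    print(f"{seconds:>2} s: mean growth {m:.2f} cm")
print("largest mean growth at", best_exposure, "s")
```

This only identifies the best of the six exposure times that were tried. The true optimum could lie between the tested levels.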
Strength
A strong relationship is one which does not have much variability about the general trend, while a weak relationship is one which does have a lot of variability. The relationship between weight and height is rather weak. For a particular height there is a wide range of matching weights. The previous two figures show stronger relationships.
Two-Way Tables
Describing a single categorical variable simply involves making a table of proportions and then showing the distribution of these proportions with a bar chart. The method is the same for exploring relationships between two variables, except that there are more distributions to look at.
The table below shows the data from a survey of 200 residents, aged 17 to 40, from the northern island of Ironbard. Since the counts are classified by two variables, sex and preferred pizza, we call this a two-way table or a contingency table.
Counts of preferred pizza by sex
Sex | Mushroom | Pineapple | Prawns | Sausage | Spinach | Total |
---|---|---|---|---|---|---|
Female | 10 | 39 | 17 | 13 | 23 | 102 |
Male | 18 | 10 | 13 | 36 | 21 | 98 |
Total | 28 | 49 | 30 | 49 | 44 | 200 |
Marginal Proportions
The last row in the table gives the totals of each preferred pizza, combining both sexes. Taking proportions of the total 200 gives the distribution shown in the table below.
Marginal distribution of preferred pizza
Pizza | Mushroom | Pineapple | Prawns | Sausage | Spinach |
---|---|---|---|---|---|
Proportion | 0.140 | 0.245 | 0.150 | 0.245 | 0.220 |
This is just the distribution of preferred pizza, ignoring sex, similar to the one we saw in Chapter 3. In this context we call it a marginal distribution since it appears in a margin of the two-way table. The other marginal distribution, that of sex ignoring preferred pizza, is shown in the table below.
Marginal distribution of sex
Sex | Female | Male |
---|---|---|
Proportion | 0.51 | 0.49 |
Working with these marginal distributions will be part of testing for a relationship between two categorical variables, as discussed in Chapter 22.
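Both marginal distributions can be reproduced from the counts in the two-way table. The sketch below stores the table as a dictionary of rows, which is one convenient layout among several. The names are chosen for this illustration.

```python
pizzas = ["Mushroom", "Pineapple", "Prawns", "Sausage", "Spinach"]

# Counts from the two-way table of preferred pizza by sex.
counts = {
    "Female": [10, 39, 17, 13, 23],
    "Male":   [18, 10, 13, 36, 21],
}

total = sum(sum(row) for row in counts.values())

# Marginal distribution of pizza: column totals divided by the grand total.
pizza_marginal = {
    pizza: sum(row[i] for row in counts.values()) / total
    for i, pizza in enumerate(pizzas)
}

# Marginal distribution of sex: row totals divided by the grand total.
sex_marginal = {sex: sum(row) / total for sex, row in counts.items()}

print(total)           # 200
print(pizza_marginal)
print(sex_marginal)
```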
Conditional Proportions
The following table shows the proportions within each combination of categories, obtained by dividing the counts in the original table by the total of 200. For example, the percentage of people who are male and prefer sausage pizza is 18%.
Proportions of pizza preferences
Sex | Mushroom | Pineapple | Prawns | Sausage | Spinach | Total |
---|---|---|---|---|---|---|
Female | 0.050 | 0.195 | 0.085 | 0.065 | 0.115 | 0.510 |
Male | 0.090 | 0.050 | 0.065 | 0.180 | 0.105 | 0.490 |
Total | 0.140 | 0.245 | 0.150 | 0.245 | 0.220 | 1.000 |
Now what proportion of males prefer sausage pizza? This is a different question since we are now asking just about males rather than the whole group. The answer can be found easily from the table since 49% of the whole group are male and 18% of the whole group are males who prefer sausage, giving the proportion
\[ \frac{0.18}{0.49} = 0.367. \]
Thus about 37% of males prefer sausage pizza. We call this the conditional proportion of sausage pizza given that the person is male. There are many conditional distributions here, such as that of pizza preference given male, or sex given pineapple pizza.
Note that if there was no association between the variables we would expect the conditional distribution of pizza preference given male to be similar to the conditional distribution of pizza preference given female. The sex of the person should not tell us anything about the chance that they prefer a particular kind of pizza.
The figure above shows a segmented bar chart of conditional pizza preference proportions where we have split the groups by sex. You should be able to identify the 37% of males who prefer sausage pizza. Compare this to the figure below which shows proportions of sex conditional on pizza. The proportion of sausage eaters who are male is
\[ \frac{0.180}{0.245} = 0.735, \]
so now we have a 74% bar representing the same 36 males.
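The two conditional proportions above differ only in which total is used as the denominator. The sketch below works from the original counts rather than the rounded proportions. The variable names are chosen for this illustration.

```python
pizzas = ["Mushroom", "Pineapple", "Prawns", "Sausage", "Spinach"]

# Counts from the two-way table of preferred pizza by sex.
counts = {
    "Female": [10, 39, 17, 13, 23],
    "Male":   [18, 10, 13, 36, 21],
}

# Pizza preference given male: divide the male row by the male row total.
male_total = sum(counts["Male"])
pizza_given_male = {
    pizza: counts["Male"][i] / male_total for i, pizza in enumerate(pizzas)
}

# Sex given sausage: divide the sausage column by the sausage column total.
j = pizzas.index("Sausage")
sausage_total = sum(row[j] for row in counts.values())
sex_given_sausage = {sex: row[j] / sausage_total for sex, row in counts.items()}

print(round(pizza_given_male["Sausage"], 3))   # 0.367
print(round(sex_given_sausage["Male"], 3))     # 0.735
```

The same 36 males appear in both numerators. Only the group being conditioned on changes.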
Three-Way Tables
A second survey of pizza preference was conducted with 200 Islanders, again aged 17 to 40, from the eastern island of Providence. Combining the results with those from Ironbard gives the table below. This is a three-way table since the Islanders are now classified by three variables.
Counts of preferred pizza by sex and island
Pizza | Ironbard Female | Ironbard Male | Providence Female | Providence Male |
---|---|---|---|---|
Mushroom | 10 | 18 | 22 | 21 |
Pineapple | 39 | 10 | 33 | 8 |
Prawns | 17 | 13 | 17 | 11 |
Sausage | 13 | 36 | 14 | 30 |
Spinach | 23 | 21 | 18 | 26 |
Visualising relationships between more than two categorical variables at a time becomes difficult. One option is to use a mosaic plot, as shown in the following figure. These plots extend the basic idea of a segmented bar chart by breaking the areas down by the additional variables. They can be challenging to read but with only three variables it is not too hard to compare the two-way relationships between Ironbard and Providence.
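One way to work with a three-way table is to collapse it over one variable, leaving a two-way table that can be handled as before. The sketch below sums over sex to give counts of preferred pizza by island. Keying the cells by (island, sex) is just one possible layout, assumed here for illustration.

```python
# Counts from the three-way table, keyed by (island, sex).
counts = {
    ("Ironbard", "Female"): {"Mushroom": 10, "Pineapple": 39, "Prawns": 17,
                             "Sausage": 13, "Spinach": 23},
    ("Ironbard", "Male"): {"Mushroom": 18, "Pineapple": 10, "Prawns": 13,
                           "Sausage": 36, "Spinach": 21},
    ("Providence", "Female"): {"Mushroom": 22, "Pineapple": 33, "Prawns": 17,
                               "Sausage": 14, "Spinach": 18},
    ("Providence", "Male"): {"Mushroom": 21, "Pineapple": 8, "Prawns": 11,
                             "Sausage": 30, "Spinach": 26},
}

# Collapse over sex: add the female and male counts within each island.
pizza_by_island = {}
for (island, _sex), row in counts.items():
    island_row = pizza_by_island.setdefault(island, {})
    for pizza, n in row.items():
        island_row[pizza] = island_row.get(pizza, 0) + n

for island, row in pizza_by_island.items():
    print(island, row, "total:", sum(row.values()))
```

Collapsing discards information about the variable that was summed over, so it complements the mosaic plot rather than replacing it.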
Summary
- Scatter plots are used to visualise the relationship between two quantitative variables.
- A response variable in a study is the quantity we are interested in. A predictor variable is a variable that we use to try and estimate the response.
- Response variables go on the vertical axis, with predictor variables on the horizontal axis.
- A time plot is a plot of a response variable that has been measured over time.
- Our model for bivariate relationships is a mean trend in the response combined with variability about the trend.
- We describe associations in terms of being positive or negative, linear or nonlinear, and strong or weak.
- Two-way tables give conditional distributions of one variable with respect to the other. These should be compared when making a bar chart, rather than using the overall proportions from the table.
Exercise 1
Make a scatter plot of height against forearm length for the observations in the survey data. Draw a curve that follows the general pattern of the relationship. Describe the association you see.
Exercise 2
Calculate conditional proportions and make a bar chart showing the relationship between sex and pizza preference for the Providence data in the previous table.
Exercise 3
Make a time plot of Cavendish’s measurements of the mean density of the Earth in the Appendix. Describe the pattern before and after the change in the suspension wire.
Exercise 4
Make a bar chart comparing the distribution of education level between the three towns in the survey data.