6 Visualising Relationships

We have previously used side-by-side plots to explore the relationship between a quantitative variable and a categorical variable. This involved making a univariate plot for each group in the data. With two quantitative variables the situation is more complicated, reflecting the richer information in quantitative variables.

Scatter Plots

We visualise patterns between two quantitative variables using a scatter plot. These are easily drawn by making two axes, one for each variable, and then using the values of the variables as the coordinates of a point to plot each case. The figure below shows a scatter plot of the weights and heights of the 60 Islanders in the survey data. As an example, Taylor Jones was 171 cm tall and weighed 63 kg. She is represented on the plot as the point (171, 63).

Weight by height for the 60 Islanders

How do you decide which variable should go on which axis? It is common that you will be interested in how one variable affects the other. Suppose we want to predict the weight of a person from their height. We call height the predictor variable, since we want to make predictions from it, or the explanatory variable, since we think that height will help ‘explain’ the weight — it would be natural for bigger people to be heavier, for example. The weight is then called the response variable, giving the response to the value of the other variable.

The simple rule for drawing scatter plots is

  • Response variables go on the vertical axis
  • Predictor variables go on the horizontal axis

Note that you will not always be interested in the relationship between a response and a predictor variable. In such cases it does not matter which way around you put your axes. The following figure shows a scatter plot of the lengths of the iris petals and sepals in the Appendix. Here we are not necessarily trying to predict one of the values from the other. We would simply be interested in the nature of the relationship between these dimensions. There seems to be a fairly clear relationship among the 100 points with longer petals, but the irises with shorter petals seem to belong to a separate cluster of points. The symbols in the plot correspond to different iris varieties. One goal with data like this is to find a way of automatically discriminating between varieties based on observations of the variables, and one method for doing this is known as discriminant analysis. Another approach is to use logistic regression, as discussed in Chapter 23.

Petal length by sepal length for Iris data

There are many ways of exploring the relationships between multiple variables. For example, the figure below shows a scatter plot matrix of the Iris data. This shows all the scatter plots for pairs of variables in the data. With 4 variables there are 6 possible pairs but the matrix shows each choice of response and predictor roles, giving 12 plots in total. Look for where the small copy of the previous figure is embedded in this bigger plot.

Scatter plot matrix for Iris data
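The counts of 6 pairs and 12 panels can be checked with a short sketch. Python is an assumption here, since the text does not name any software, and the four variable names are the usual Iris measurements rather than names confirmed by the Appendix.

```python
from itertools import combinations, permutations

# Assumed names for the four Iris measurements (not taken from the Appendix).
variables = ["sepal length", "sepal width", "petal length", "petal width"]

# Unordered pairs: each pair of variables is counted once.
pairs = list(combinations(variables, 2))

# Ordered pairs (response, predictor): each pair appears in both roles,
# which is what the off-diagonal panels of a scatter plot matrix show.
panels = list(permutations(variables, 2))

print(len(pairs))   # 6
print(len(panels))  # 12
```

Each unordered pair shows up twice among the panels, once with each variable on the vertical axis.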

Time Plots

One case where the choice of axes is made for you is in a time plot. A time plot shows the behaviour of a variable over time, with the vertical axis giving the variable value and the horizontal axis giving time. This is done because the eye is used to following along from left to right and so it is easier to “read” the behaviour of the variable in this way. To emphasise this, time plots also join successive points together with lines to accentuate the ups and downs.

Learning Effects

The figure below shows Newcomb’s measurements of the passage time of light from the Appendix in the order in which he made them.

Newcomb’s passage time measurements

It is clear from this plot that the two outliers that we identified in Chapter 3 occur early on in Newcomb’s work. It is likely then that the explanation for these is some learning effect, as Newcomb became proficient with using his apparatus. Rather than just removing the two outliers from the data when calculating the mean, it may then be better to drop the first 15 or 20 observations altogether. This is another example of why it is useful to visualise your data with more than just one plot.

Drug Concentration Profiles

A standard study in pharmacokinetics involves testing the bioequivalence of a test product to a reference product. It is of interest to know whether the new product will deliver a drug in a similar way to existing products. This is usually done as a cross-over design where all subjects get both treatments.

Professor Maree Smith from the University of Queensland provided data from such a study involving 24 subjects. Multiple variables were recorded during the trials, but here we will summarise the results of the drug concentration measurements. The tables below show the measurements made on the first subject over an 8-hour period after each product was administered.

Drug concentrations (µg/L) from two formulations over time - reference

Time 0.0 0.3 0.7 1.0 1.4 1.7 2.0 2.3 2.7
Concentration 0 0 0 0 10 74 73 100 92
Time 3.0 3.5 4.0 4.5 5.0 6.0 7.0 8.0
Concentration 87 594 627 404 217 85 36 15

Drug concentrations (µg/L) from two formulations over time - test

Time 0.0 0.4 0.7 1.0 1.3 1.7 2.0 2.3 2.7
Concentration 0 0 0 37 704 544 329 240 138
Time 3.0 3.5 4.0 4.5 5.0 6.0
Concentration 136 119 125 79 48 32

A time plot of this data is shown in the figure below. One line shows the concentration profile for the test tablet while the other shows the profile for the reference tablet. It is easy to compare the different speeds at which the drug becomes available to the body from this plot. Which line is which?

Drug concentration over time

Data such as this is referred to as repeated measurements data. We have similar plots for 23 other subjects in the experiment and want to make an overall comparison between the two formulations. This is a complex problem since there is so much information for each subject. To simplify the problem it is common to try and summarise a response curve for a subject by one or more characteristic numbers. For example, we could record the maximum concentration of the drug during each trial of a formulation. Here these values are 704 µg/L for the test formulation and 627 µg/L for the reference formulation. We will explore these results further in Chapter 14. Alternatively, we could record the times of the maximum concentrations and compare those. Another common measurement is to calculate the area under each curve, giving an estimate of the total drug available to the body during the trial.
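The three summaries described here can be sketched in a few lines of Python (an assumed choice of language). The data are the first subject's measurements from the tables above, with time taken to be in hours. The area is approximated with the trapezoidal rule, which is one common choice and an assumption here rather than the method used in the study. Note that the test table stops at 6.0 hours while the reference table runs to 8.0 hours, so the two areas do not cover the same period.

```python
# First subject's measurements, copied from the tables above.
ref_time = [0.0, 0.3, 0.7, 1.0, 1.4, 1.7, 2.0, 2.3, 2.7,
            3.0, 3.5, 4.0, 4.5, 5.0, 6.0, 7.0, 8.0]
ref_conc = [0, 0, 0, 0, 10, 74, 73, 100, 92,
            87, 594, 627, 404, 217, 85, 36, 15]
test_time = [0.0, 0.4, 0.7, 1.0, 1.3, 1.7, 2.0, 2.3, 2.7,
             3.0, 3.5, 4.0, 4.5, 5.0, 6.0]
test_conc = [0, 0, 0, 37, 704, 544, 329, 240, 138,
             136, 119, 125, 79, 48, 32]


def summarise(times, concs):
    """Return (maximum concentration, time of the maximum, trapezoidal area)."""
    c_max = max(concs)
    t_max = times[concs.index(c_max)]  # first time at which the maximum occurs
    # Trapezoidal rule: width of each interval times the average of its two heights.
    area = sum((t1 - t0) * (c0 + c1) / 2
               for t0, t1, c0, c1 in zip(times, times[1:], concs, concs[1:]))
    return c_max, t_max, area


ref_summary = summarise(ref_time, ref_conc)
test_summary = summarise(test_time, test_conc)
print(ref_summary[:2])   # (627, 4.0)
print(test_summary[:2])  # (704, 1.3)
```

Reducing each curve to a few numbers like this gives one row per subject and formulation, which is much easier to compare across all 24 subjects than the full curves.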

Describing Relationships

Scatter plots are simple to draw but because they involve two variables there is a lot more richness in the types of patterns you may see. The general model we will work with is that there is an average response for a particular value of the explanatory variable and that this average response might change as the explanatory variable changes. For example, we can imagine taking all the people in the world who are 160 cm tall and looking at the distribution of their weights. The mean of that distribution is the average response at 160 cm, and repeating this for other heights shows how the average response changes. You can fit a relationship by hand in this way by drawing a line which follows the vertical mean as you move horizontally.

There is a mathematical procedure called loess fitting which captures this idea of following the changes in average response. Loess (Cleveland, 1993) is a local regression method which fits a straight line at each value of the explanatory variable using only the points which are in the neighbourhood of that value. The response level of the loess fit is the point on the locally fitted line. The figure below shows the weight and height data again with a loess fit. It appears that there is a generally increasing relationship but that below 165 cm the relationship is uncertain.

Weight by height with loess line

Loess is a useful tool for exploring bivariate data because it doesn’t make any assumptions about the nature of the relationship. This is similar to the density curve estimation we saw in Chapter 3 for visualising a single variable, another way of smoothing the data to see an overall pattern.
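The idea of fitting a straight line in a neighbourhood can be sketched as follows. This is a minimal illustration in Python (an assumed language), not Cleveland's full procedure: the tricube weights, the `frac` parameter and the function names are assumptions made for this example, and the check uses made-up, exactly linear data rather than the survey data.

```python
def tricube(u):
    """Tricube weight: near 1 close to the target value, 0 at distance 1 or more."""
    u = abs(u)
    return (1 - u ** 3) ** 3 if u < 1 else 0.0


def local_line_fit(xs, ys, x0, frac=0.5):
    """Fit a weighted straight line to the points near x0; return its height at x0."""
    n = len(xs)
    k = min(n, max(3, round(frac * n)))         # size of the neighbourhood
    h = sorted(abs(x - x0) for x in xs)[k - 1]  # distance to the k-th nearest point
    if h > 0:
        w = [tricube((x - x0) / h) for x in xs]
    else:
        w = [float(x == x0) for x in xs]        # all neighbours sit exactly at x0
    if sum(w) == 0:
        # every neighbour lies on the edge of the window: weight them equally
        w = [float(abs(x - x0) <= h) for x in xs]
    sw = sum(w)
    x_bar = sum(wi * x for wi, x in zip(w, xs)) / sw
    y_bar = sum(wi * y for wi, y in zip(w, ys)) / sw
    sxx = sum(wi * (x - x_bar) ** 2 for wi, x in zip(w, xs))
    if sxx == 0:
        return y_bar                            # no spread in x: use the local mean
    sxy = sum(wi * (x - x_bar) * (y - y_bar) for wi, x, y in zip(w, xs, ys))
    return y_bar + (sxy / sxx) * (x0 - x_bar)


# Sanity check on made-up, exactly linear data (not from the survey).
xs = list(range(11))
ys = [2 * x + 1 for x in xs]
curve = [local_line_fit(xs, ys, x0) for x0 in xs]
```

Evaluating the local fit at a grid of values, as in the last line, traces out the smooth curve. On exactly linear data the curve should reproduce the line, which is a quick way to check the arithmetic.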

Direction

If an increase in the explanatory variable tends to correspond to an increase in the response then we say there is a positive association between the variables. For example, the relationship between weight and height is positive, since taller people tend to be heavier.

If an increase in the explanatory variable tends to correspond to a decrease in the response then we say there is a negative association. The following table shows the results from an experiment where Vitamin C tablets were placed into beakers that each had 250 mL of water in them. The temperature of the water was controlled, ranging from 0 °C to 100 °C. Each beaker was stirred constantly until all of the tablet had dissolved, and the time taken to completely dissolve was recorded.

Vitamin C dissolving times (s) by temperature (°C)

Temperature 0 5 10 14 15 20 25 30
Time 1532 1326 1152 1012 961 842 680 568
Temperature 40 55 57 61 70 79 81 83
Time 419 265 246 209 174 117 87 83
Temperature 88 90.5 94 95 97 100 100
Time 87 81 66 64 57 44 42

The figure below shows a plot of the relationship between time taken to dissolve and temperature. This association is negative, with higher temperatures giving shorter dissolving times.

Time taken to dissolve against temperature
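The direction of an association can also be checked crudely in numbers. The sketch below, in Python (an assumed language), compares every pair of beakers in the table above and counts how often the dissolving time rises or falls as the temperature rises. This pair-counting check is an illustration added here, not a method from the text, and the function name is hypothetical.

```python
from itertools import combinations

# Vitamin C data copied from the table above.
temperature = [0, 5, 10, 14, 15, 20, 25, 30, 40, 55, 57, 61,
               70, 79, 81, 83, 88, 90.5, 94, 95, 97, 100, 100]
time_s = [1532, 1326, 1152, 1012, 961, 842, 680, 568, 419, 265, 246, 209,
          174, 117, 87, 83, 87, 81, 66, 64, 57, 44, 42]


def direction_counts(xs, ys):
    """Count pairs of cases where y rises, and where y falls, as x rises."""
    up = down = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        if x1 == x2 or y1 == y2:
            continue  # tied pairs say nothing about direction
        if (x2 - x1) * (y2 - y1) > 0:
            up += 1
        else:
            down += 1
    return up, down


up, down = direction_counts(temperature, time_s)
print(up, down)
```

Far more pairs fall than rise, which agrees with the negative association seen in the plot.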

Linearity

A simple pattern to look for in a scatter plot is a straight line relationship between the two variables. That is, as the explanatory variable changes the average response changes by following a straight line. We call such relationships linear. The relationship between weight and height might be modelled with a straight line, while the relationship in the previous figure is clearly nonlinear.

Microwaving Seeds

Thirty seeds of the same variety were grown for three days under the same conditions. The length of each seedling was measured and they were split into 6 groups of 5 seedlings each. Each group was then placed in a microwave on ‘high’ for an amount of time which was different for each group. On the following day the seedlings were measured again and their growth since the microwaving treatment was recorded. The results are shown in the table below.

Seedling growth (cm) for different microwave exposure times (s)

Exposure Growth
10 2.2 2.4 2.5 2.2 2.3
20 2.7 2.5 2.6 2.9 2.8
30 3.0 2.9 3.4 3.6 3.5
40 2.2 2.1 2.0 2.4 1.9
50 2.2 1.4 1.0 1.1 1.1
60 0.0 0.0 0.2 0.2 0.1

The following figure shows the growth of seedlings against the different amounts of time in a microwave oven. This is another type of nonlinear relationship. This type of pattern occurs frequently when a small dosage of something is beneficial but a large dosage is harmful. An optimal level, in this case around 30 seconds of radiation, is often sought.

Effect of microwave radiation on seedling growth (with jitter)
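The average response at each exposure time can be computed directly from the table, which makes the rise and fall in growth easy to see. This is a small sketch in Python (an assumed language); the variable names are illustrative.

```python
# Seedling growth (cm) for each microwave exposure time (s), from the table above.
growth = {
    10: [2.2, 2.4, 2.5, 2.2, 2.3],
    20: [2.7, 2.5, 2.6, 2.9, 2.8],
    30: [3.0, 2.9, 3.4, 3.6, 3.5],
    40: [2.2, 2.1, 2.0, 2.4, 1.9],
    50: [2.2, 1.4, 1.0, 1.1, 1.1],
    60: [0.0, 0.0, 0.2, 0.2, 0.1],
}

# Mean growth within each group: the average response at that exposure time.
mean_growth = {t: sum(values) / len(values) for t, values in growth.items()}

# Exposure time with the largest mean growth among those tested.
best_time = max(mean_growth, key=mean_growth.get)

for t, m in mean_growth.items():
    print(t, round(m, 2))
print(best_time)  # 30
```

The means increase up to 30 seconds and then fall away, so a straight line would be a poor description of this relationship. The best time found here is only the best of the six times tested.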

Strength

A strong relationship is one which does not have much variability about the general trend, while a weak relationship is one which does have a lot of variability. The relationship between weight and height is rather weak. For a particular height there is a wide range of matching weights. The previous two figures show stronger relationships.

Two-Way Tables

Describing a single categorical variable simply involves making a table of proportions and then showing the distribution of these proportions with a bar chart. The method is the same for exploring relationships between two variables, except that there are more distributions to look at.

The table below shows the data from a survey of 200 residents, aged 17 to 40, from the northern island of Ironbard. Since the counts are classified by two variables, sex and preferred pizza, we call this a two-way table or a contingency table.

Counts of preferred pizza by sex

  Mushroom Pineapple Prawns Sausage Spinach Total
Female 10 39 17 13 23 102
Male 18 10 13 36 21 98
Total 28 49 30 49 44 200

Marginal Proportions

The last row in the table gives the totals of each preferred pizza, combining both sexes. Taking proportions of the total 200 gives the distribution shown in the table below.

Marginal distribution of preferred pizza

Pizza Mushroom Pineapple Prawns Sausage Spinach
Proportion 0.140 0.245 0.150 0.245 0.220

This is just the distribution of preferred pizza, ignoring sex, similar to the one we saw in Chapter 3. In this context we call it a marginal distribution since it appears in a margin of the two-way table. The other marginal distribution, that of sex ignoring preferred pizza, is shown in the table below.

Marginal distribution of sex

Sex Female Male
Proportion 0.51 0.49

Working with these marginal distributions will be part of testing for a relationship between two categorical variables, as discussed in Chapter 22.
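Both marginal distributions can be reproduced from the table of counts. The following is a minimal sketch in Python (an assumed language), storing the two-way table as a dictionary of rows; the variable names are illustrative.

```python
# Counts of preferred pizza by sex for the Ironbard survey, from the table above.
counts = {
    "Female": {"Mushroom": 10, "Pineapple": 39, "Prawns": 17, "Sausage": 13, "Spinach": 23},
    "Male":   {"Mushroom": 18, "Pineapple": 10, "Prawns": 13, "Sausage": 36, "Spinach": 21},
}

total = sum(sum(row.values()) for row in counts.values())

# Marginal distribution of pizza: column totals divided by the grand total.
pizza_marginal = {
    pizza: sum(counts[sex][pizza] for sex in counts) / total
    for pizza in counts["Female"]
}

# Marginal distribution of sex: row totals divided by the grand total.
sex_marginal = {sex: sum(row.values()) / total for sex, row in counts.items()}

print(total)  # 200
print({p: round(v, 3) for p, v in pizza_marginal.items()})
print({s: round(v, 2) for s, v in sex_marginal.items()})
```

Each marginal distribution ignores the other variable entirely, so its proportions add to 1.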

Conditional Proportions

The following table shows the proportions within each combination of categories, the counts in the original table divided by the total 200. For example, the percentage of people who are male and prefer sausage pizza is 18%.

Proportions of pizza preferences

  Mushroom Pineapple Prawns Sausage Spinach Total
Female 0.050 0.195 0.085 0.065 0.115 0.510
Male 0.090 0.050 0.065 0.180 0.105 0.490
Total 0.140 0.245 0.150 0.245 0.220 1.000

Now what proportion of males prefer sausage pizza? This is a different question since we are now asking just about males rather than the whole group. The answer can be found easily from the table: 49% of those surveyed are male and 18% are males who prefer sausage, giving the proportion
\[ \frac{0.18}{0.49} = 0.367. \]
Thus about 37% of males prefer sausage pizza. We call this the conditional proportion of sausage pizza given that the person is male. There are many conditional distributions here, such as that of pizza preference given male, or sex given pineapple pizza.

Note that if there was no association between the variables we would expect the conditional distribution of pizza preference given male to be similar to the conditional distribution of pizza preference given female. The sex of the person should not tell us anything about the chance that they prefer a particular kind of pizza.

Bar chart of pizza preference conditional on sex

The figure above shows a segmented bar chart of conditional pizza preference proportions where we have split the groups by sex. You should be able to identify the 37% of males who prefer sausage pizza. Compare this to the figure below which shows proportions of sex conditional on pizza. The proportion of sausage eaters who are male is
\[ \frac{0.180}{0.245} = 0.735, \]
so now we have a 74% bar representing the same 36 males.

Bar chart of sex conditional on pizza preference
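The two conditional proportions worked out above use the same count of 36 males but different denominators. A short sketch in Python (an assumed language, with illustrative function names) makes the difference explicit by working from the Ironbard counts.

```python
# Counts of preferred pizza by sex for the Ironbard survey.
counts = {
    "Female": {"Mushroom": 10, "Pineapple": 39, "Prawns": 17, "Sausage": 13, "Spinach": 23},
    "Male":   {"Mushroom": 18, "Pineapple": 10, "Prawns": 13, "Sausage": 36, "Spinach": 21},
}


def pizza_given_sex(counts, sex):
    """Distribution of pizza preference within one sex (divide by the row total)."""
    row_total = sum(counts[sex].values())
    return {pizza: n / row_total for pizza, n in counts[sex].items()}


def sex_given_pizza(counts, pizza):
    """Distribution of sex within one pizza preference (divide by the column total)."""
    column_total = sum(counts[sex][pizza] for sex in counts)
    return {sex: counts[sex][pizza] / column_total for sex in counts}


print(round(pizza_given_sex(counts, "Male")["Sausage"], 3))  # 0.367
print(round(sex_given_pizza(counts, "Sausage")["Male"], 3))  # 0.735
```

Dividing by the row total conditions on sex, while dividing by the column total conditions on pizza preference.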

Three-Way Tables

A second survey of pizza preference was conducted with 200 Islanders, again aged 17 to 40, from the eastern island of Providence. Combining the results with those from Ironbard gives the table below. This is a three-way table since the Islanders are now classified by three variables.

Counts of preferred pizza by sex and island

  Ironbard Ironbard Providence Providence
  Female Male Female Male
Mushroom 10 18 22 21
Pineapple 39 10 33 8
Prawns 17 13 17 11
Sausage 13 36 14 30
Spinach 23 21 18 26
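One way to hold a three-way table like this is as nested dictionaries, indexed by island, then sex, then pizza. The sketch below uses Python (an assumed language) with illustrative names, and shows how totals and a combined two-way table can be recovered by summing over one of the variables.

```python
# Counts from the table above, indexed as counts[island][sex][pizza].
counts = {
    "Ironbard": {
        "Female": {"Mushroom": 10, "Pineapple": 39, "Prawns": 17, "Sausage": 13, "Spinach": 23},
        "Male":   {"Mushroom": 18, "Pineapple": 10, "Prawns": 13, "Sausage": 36, "Spinach": 21},
    },
    "Providence": {
        "Female": {"Mushroom": 22, "Pineapple": 33, "Prawns": 17, "Sausage": 14, "Spinach": 18},
        "Male":   {"Mushroom": 21, "Pineapple": 8, "Prawns": 11, "Sausage": 30, "Spinach": 26},
    },
}

# Number of people of each sex on each island (summing over pizza).
sex_totals = {
    island: {sex: sum(row.values()) for sex, row in table.items()}
    for island, table in counts.items()
}

# Two-way table of sex by pizza for both islands combined (summing over island).
pizzas = list(counts["Ironbard"]["Female"])
combined = {
    sex: {pizza: sum(counts[island][sex][pizza] for island in counts) for pizza in pizzas}
    for sex in ("Female", "Male")
}

print(sex_totals)
print(combined)
```

Summing over island throws away any difference between Ironbard and Providence, which is why the separate two-way tables are usually compared first.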

Visualising relationships between more than two categorical variables at a time becomes difficult. One option is to use a mosaic plot, as shown in the following figure. These plots extend the basic idea of a segmented bar chart by breaking the areas down by the additional variables. They can be challenging to read but with only three variables it is not too hard to compare the two-way relationships between Ironbard and Providence.

Mosaic plot of island, sex and preferred pizza

Summary

  • Scatter plots are used to visualise the relationship between two quantitative variables.
  • A response variable in a study is the quantity we are interested in. A predictor variable is a variable that we use to try and estimate the response.
  • Response variables go on the vertical axis, with predictor variables on the horizontal axis.
  • A time plot is a plot of a response variable that has been measured over time.
  • Our model for bivariate relationships is a mean trend in the response combined with variability about the trend.
  • We describe associations in terms of being positive or negative, linear or nonlinear, and strong or weak.
  • Two-way tables give conditional distributions of one variable with respect to the other. These should be compared when making a bar chart, rather than using the overall proportions from the table.

Exercise 1

Make a scatter plot of height against forearm length for the observations in the survey data. Draw a curve that follows the general pattern of the relationship. Describe the association you see.

Exercise 2

Calculate conditional proportions and make a bar chart showing the relationship between sex and pizza preference for the Providence data in the previous table.

Exercise 3

Make a time plot of Cavendish’s measurements of the mean density of the Earth in the Appendix. Describe the pattern before and after the change in the suspension wire.

Exercise 4

Make a bar chart comparing the distribution of education level between the three towns in the survey data.

Licence


A Portable Introduction to Data Analysis Copyright © 2024 by The University of Queensland is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License, except where otherwise noted.
