7 Linear Relationships
A linear relationship is the simplest association to analyse between two quantitative variables. A straight line relationship between [latex]y[/latex] and [latex]x[/latex] can be written in a number of ways, such as [latex]y = a + bx[/latex] or [latex]y = mx + c[/latex]. Here we will use the form
\[ y = b_{0} + b_{1}x, \]
so that [latex]b_{1}[/latex] is the slope of the line and [latex]b_{0}[/latex] is the intercept (the value of [latex]y[/latex] when [latex]x = 0[/latex]). Using the subscripts will allow us to easily talk about more complicated relationships later.
Pearson Correlation
To summarise the strength of a linear relationship we can use a statistic called the Pearson correlation coefficient, [latex]r[/latex]. If the points in our scatter plot are [latex](x_1, y_1)[/latex], [latex](x_2, y_2)[/latex], [latex]\ldots[/latex], [latex](x_n, y_n)[/latex] then the correlation is defined by
\[ r = \frac{1}{n-1} \sum \left( \frac{x_j - \overline{x}}{s_x} \right) \left( \frac{y_j - \overline{y}}{s_y} \right), \]
where [latex]s_x[/latex] and [latex]s_y[/latex] are the sample standard deviations of the [latex]x[/latex] and [latex]y[/latex] values of the points. For each point this formula standardises the [latex]x[/latex] and [latex]y[/latex] values into how many standard deviations they are above or below their respective means. This is a terrible formula and you would never calculate it by hand in practice, though you may want to try it on a small data set to see how it works.
If you have a positive association, the [latex]x[/latex] values above the [latex]x[/latex] mean will tend to correspond to [latex]y[/latex] values above the [latex]y[/latex] mean, and [latex]x[/latex] values below the [latex]x[/latex] mean will tend to correspond to [latex]y[/latex] values below the [latex]y[/latex] mean. The result is that [latex]r[/latex] will add up a lot of positive terms and so it will be large and positive. The maximum value [latex]r[/latex] can take is 1.
If you have a negative association, the [latex]x[/latex] values above the [latex]x[/latex] mean will tend to correspond to [latex]y[/latex] values below the [latex]y[/latex] mean, and [latex]x[/latex] values below the [latex]x[/latex] mean will tend to correspond to [latex]y[/latex] values above the [latex]y[/latex] mean. The result is that [latex]r[/latex] will add up a lot of negative terms and so it will be large and negative. The minimum value [latex]r[/latex] can take is [latex]-1[/latex].
If there is no relationship then [latex]r[/latex] will be adding up terms which are sometimes positive and sometimes negative, giving a value of around 0 for [latex]r[/latex].
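The sign argument above can be checked on a small data set. The following Python sketch computes each standardised product and then [latex]r[/latex]. The five points are made up for illustration and the function names are our own, not part of any library.

```python
from math import sqrt

def sample_sd(values):
    """Sample standard deviation (divisor n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))

def standardised_products(x, y):
    """One term of the correlation sum for each point."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_x, s_y = sample_sd(x), sample_sd(y)
    return [((xj - x_bar) / s_x) * ((yj - y_bar) / s_y)
            for xj, yj in zip(x, y)]

def pearson_r(x, y):
    """Pearson correlation: sum of standardised products divided by n - 1."""
    return sum(standardised_products(x, y)) / (len(x) - 1)

# Illustrative data only (not from the text): a positive association.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]

for xj, yj, term in zip(x, y, standardised_products(x, y)):
    print(f"({xj}, {yj}): term = {term:+.3f}")
print(f"r = {pearson_r(x, y):+.3f}")
```

For these points no term is negative, and the sum gives [latex]r = +0.8[/latex].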
The following six figures give examples of some scatter plots of relationships which give certain values of [latex]r[/latex]. The plots for [latex]r = 0[/latex] and [latex]r = +0.20[/latex] are very similar but are not identical. This illustrates that 0.20 is not a very strong correlation.
Note that the correlation coefficient does not actually distinguish between a response and explanatory variable in its formula. If you reverse the roles of the variables you still get the same correlation value.
Height Relationships
The relationship between weight and height, shown in a previous figure, gives a correlation of [latex]r = +0.6834[/latex]. Compare the strength of association you see there with the examples in the six figures above to confirm that this is reasonable.
The figure above shows the relationship between pulse rate and height for the same sample of Islanders. What is the correlation likely to be here? From the previous six figures (examples of correlation coefficients) you can see it will be pretty close to 0, since there doesn’t seem to be any real trend in the data. You should never use the correlation coefficient without first plotting the data. It should be used to support the patterns you see. In this case we find [latex]r = +0.0153[/latex].
Least-Squares Lines
The correlation coefficient implicitly draws a straight line through the data and gives a measure of how close the data lie to this line. An obvious refinement is to actually draw the line that correlation is using so that we can see how well the line summarises the data and where any deviations might be. How then do we find this line?
Now suppose the points in our scatter plot are
\[ (x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n). \]
For any [latex]x_j[/latex] we can use a line to predict what the [latex]y[/latex] value would be by substituting it into the formula:
\[ \hat{y}_j = b_{0} + b_{1} x_j. \]
We will often use this “hat” ([latex]\; \hat{ } \;[/latex]) notation to describe an estimate or prediction; here [latex]\hat{y}_j[/latex] is the estimated [latex]y[/latex] value for the [latex]j[/latex]th observation. Unless the data points lie perfectly along a straight line, we will make errors in our predictions given by
\[ e_j = y_j - \hat{y}_j. \]
An obvious criterion for choosing a line (that is, choosing [latex]b_0[/latex] and [latex]b_1[/latex]) would be to minimise the total of these prediction errors for all the points. Unfortunately, this doesn’t work very well since the positive and negative errors tend to cancel out. Instead we minimise the sum of the squared errors. This is similar to the sum of squared deviations used to define the sample standard deviation. We will return to this idea in Chapter 18.
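The cancellation problem can be seen numerically. In this sketch (illustrative data and hypothetical helper names, not from the text) two very different lines both pass through the point of means [latex](\overline{x}, \overline{y})[/latex], so both have a total error of zero, yet their sums of squared errors differ greatly.

```python
def errors(x, y, b0, b1):
    """Prediction errors e_j = y_j - (b0 + b1 * x_j)."""
    return [yj - (b0 + b1 * xj) for xj, yj in zip(x, y)]

# Illustrative data only (not from the text).
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
x_bar, y_bar = sum(x) / len(x), sum(y) / len(y)

results = {}
# Any line through (x_bar, y_bar) has intercept b0 = y_bar - b1 * x_bar.
for b1 in (0.8, -2.0):
    b0 = y_bar - b1 * x_bar
    e = errors(x, y, b0, b1)
    total = sum(e)                      # positive and negative errors cancel
    sse = sum(ej ** 2 for ej in e)      # squared errors cannot cancel
    results[b1] = (total, sse)
    print(f"slope {b1:+.1f}: |total error| = {abs(total):.3f}, SSE = {sse:.3f}")
```

Both lines have total error 0, but the sums of squared errors are 3.6 and 82, so only the squared criterion separates them.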
You can use calculus to show that the values of [latex]b_0[/latex] and [latex]b_1[/latex] that minimise the sum of the squared prediction errors are
\begin{eqnarray*}
b_1 & = & \frac{\sum (x_j - \overline{x})(y_j - \overline{y})}{\sum (x_j - \overline{x})^2}, \\
b_0 & = & \overline{y} - b_{1}\overline{x}.
\end{eqnarray*}
These are called the least-squares estimates for the slope and intercept of the linear relationship. As with the formula for the correlation coefficient, you would never actually calculate these by hand.
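In practice a computer does the arithmetic. A minimal Python sketch of the two formulas is given here, using made-up data and our own function names. It also checks that nudging either estimate away from its least-squares value increases the sum of squared errors.

```python
def least_squares(x, y):
    """Return (b0, b1) minimising the sum of squared prediction errors."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((xj - x_bar) * (yj - y_bar) for xj, yj in zip(x, y))
    s_xx = sum((xj - x_bar) ** 2 for xj in x)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1

def sse(x, y, b0, b1):
    """Sum of squared prediction errors for the line y = b0 + b1 * x."""
    return sum((yj - (b0 + b1 * xj)) ** 2 for xj, yj in zip(x, y))

# Illustrative data only (not from the text).
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]

b0, b1 = least_squares(x, y)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, SSE = {sse(x, y, b0, b1):.3f}")

# Moving away from the least-squares values makes the fit worse.
print(f"SSE with intercept + 0.1: {sse(x, y, b0 + 0.1, b1):.3f}")
print(f"SSE with slope + 0.1:     {sse(x, y, b0, b1 + 0.1):.3f}")
```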
Oxytocin and Emotions
Inspired by the work of Turner et al. (1999), Hanne Blomgren from the University of Arcadia carried out a study of the effects of emotions on oxytocin levels in the blood. She recruited twenty-four women to participate in the study, twelve who were in a relationship and twelve who were single. At the start of the trial each subject had their plasma oxytocin level (pg/mL) measured and they were then randomly assigned to one of three stimulus events: reliving happy memories for one minute, reliving sad memories for one minute, or receiving a two-minute Swedish massage. After the intervention their plasma oxytocin level was measured again. The results of the experiment, along with the age and weight of the subjects, are shown in the following table.
Results from oxytocin and emotions study
Single | Group | Name | Age | Weight | Basal | After |
---|---|---|---|---|---|---|
Yes | Happy | Katie Sato | 64 | 50 | 4.40 | 4.40 |
Yes | Happy | Jana Clausen | 18 | 62 | 4.50 | 4.56 |
Yes | Happy | Nanako Connolly | 60 | 46 | 4.17 | 4.21 |
Yes | Happy | Abigail Jones | 21 | 58 | 4.67 | 4.70 |
Yes | Sad | Kelly Brown | 31 | 42 | 4.88 | 4.75 |
Yes | Sad | Marie Sorensen | 55 | 47 | 4.41 | 4.13 |
Yes | Sad | Asuka McCarthy | 26 | 64 | 4.19 | 4.09 |
Yes | Sad | Tyra Carlsen | 20 | 47 | 4.69 | 4.41 |
Yes | Massage | Britt Solberg | 33 | 63 | 4.62 | 5.38 |
Yes | Massage | Jeneve Bager | 79 | 68 | 3.92 | 4.25 |
Yes | Massage | Gerda Jensen | 25 | 58 | 4.44 | 4.63 |
Yes | Massage | Kaya Solberg | 41 | 61 | 4.26 | 4.70 |
No | Happy | Vanessa Solberg | 38 | 58 | 5.06 | 5.05 |
No | Happy | Bronwyn Kimura | 32 | 45 | 4.91 | 5.00 |
No | Happy | Miyu Morris | 42 | 68 | 4.42 | 4.46 |
No | Happy | Berit Eklund | 65 | 60 | 4.83 | 5.11 |
No | Sad | Louise Murphy | 24 | 85 | 4.64 | 4.64 |
No | Sad | Kelly White | 49 | 61 | 4.44 | 4.29 |
No | Sad | Chloe Regan | 22 | 57 | 5.09 | 4.98 |
No | Sad | Ayano Collins | 19 | 64 | 5.38 | 4.97 |
No | Massage | Yui Moore | 28 | 48 | 4.99 | 5.61 |
No | Massage | Ayaka Price | 23 | 61 | 4.86 | 5.16 |
No | Massage | Ursula Lund | 77 | 54 | 4.57 | 5.05 |
No | Massage | Miho Connolly | 51 | 59 | 4.06 | 4.60 |
We will look at the analysis of the emotional effects on oxytocin in later chapters. However, an initial question of interest is whether basal plasma oxytocin levels are related to age. For example, if there was an association then it may be important to consider how the design and analysis of the study might be affected by the ages of the subjects involved.
The figure above shows a scatter plot of oxytocin levels against age. The horizontal line is at the mean basal level, [latex]\overline{y} = 4.6[/latex] pg/mL, ignoring the age variable. This is one possible line to describe the relationship between the two variables, one where the predicted value never changes. In fact, in Chapter 5 we saw that the sample mean gives the minimum sum of squared prediction errors for a single variable. Here that sum of squared errors is 2.915 pg[latex]^2[/latex]/mL[latex]^2[/latex].
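The claim that the mean is the best constant prediction can be checked directly from the basal column of the table. The sketch below (helper name `sse_constant` is our own) compares the sum of squared errors at the mean with a few other constant predictions chosen arbitrarily for illustration.

```python
# Basal plasma oxytocin levels (pg/mL) from the table, in row order.
basal = [4.40, 4.50, 4.17, 4.67, 4.88, 4.41, 4.19, 4.69, 4.62, 3.92, 4.44, 4.26,
         5.06, 4.91, 4.42, 4.83, 4.64, 4.44, 5.09, 5.38, 4.99, 4.86, 4.57, 4.06]

def sse_constant(values, c):
    """Sum of squared errors when every value is predicted by the constant c."""
    return sum((v - c) ** 2 for v in values)

mean_basal = sum(basal) / len(basal)
print(f"mean basal level = {mean_basal:.2f} pg/mL")

# Arbitrary alternative constants, for comparison with the mean.
for c in (4.4, 4.5, mean_basal, 4.7, 4.8):
    print(f"prediction {c:.2f}: SSE = {sse_constant(basal, c):.3f}")
```

The smallest value in the printed list is the one at the mean, matching the 2.915 quoted above.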
However, knowing the age of the subjects gives us more information, and it appears from the plot that there is a negative association between oxytocin and age, with older women tending to have lower mean levels of plasma oxytocin. The least-squares calculation gives the line
\[ \mbox{Basal} = 4.98 - 0.0097 \; \mbox{Age}, \]
as shown in the figure below. The slope of this line suggests that for every year older, the mean plasma oxytocin level of women is around 0.01 pg/mL less.
The sum of squared prediction errors from this line is now 2.138 pg[latex]^2[/latex]/mL[latex]^2[/latex]. This is better than 2.915 pg[latex]^2[/latex]/mL[latex]^2[/latex], so a prediction of basal oxytocin level will be more accurate if we know the age of the subject. It is not a whole lot better though: there is still a lot of variability in oxytocin that we cannot account for by knowing the age. We call this unexplained variability the residual variability in our model of oxytocin level based on age. In Chapter 18 we will look at how to compare the size of the slope with this residual variability to decide whether age is a significant predictor of oxytocin level.
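The fitted line and both sums of squared errors can be reproduced from the Age and Basal columns of the table. This is a sketch using our own helper functions rather than any particular statistics package.

```python
# Age (years) and basal plasma oxytocin (pg/mL), in the row order of the table.
age = [64, 18, 60, 21, 31, 55, 26, 20, 33, 79, 25, 41,
       38, 32, 42, 65, 24, 49, 22, 19, 28, 23, 77, 51]
basal = [4.40, 4.50, 4.17, 4.67, 4.88, 4.41, 4.19, 4.69, 4.62, 3.92, 4.44, 4.26,
         5.06, 4.91, 4.42, 4.83, 4.64, 4.44, 5.09, 5.38, 4.99, 4.86, 4.57, 4.06]

def least_squares(x, y):
    """Return (b0, b1) minimising the sum of squared prediction errors."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((xj - x_bar) * (yj - y_bar) for xj, yj in zip(x, y))
    s_xx = sum((xj - x_bar) ** 2 for xj in x)
    b1 = s_xy / s_xx
    return y_bar - b1 * x_bar, b1

def sse(x, y, b0, b1):
    """Sum of squared prediction errors for the line y = b0 + b1 * x."""
    return sum((yj - (b0 + b1 * xj)) ** 2 for xj, yj in zip(x, y))

b0, b1 = least_squares(age, basal)
mean_basal = sum(basal) / len(basal)
sse_mean = sse(age, basal, mean_basal, 0.0)  # horizontal line at the mean
sse_line = sse(age, basal, b0, b1)           # least-squares line

print(f"Basal = {b0:.2f} + ({b1:.4f}) Age")
print(f"SSE using the mean only: {sse_mean:.3f}")
print(f"SSE using the line:      {sse_line:.3f}")
```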
Influential Points
Correlation and least-squares fitting are both based on means and standard deviations, and so both are susceptible to outliers. In regression there are now two types of outliers. A point may have an unusual response value, which will tend to inflate the estimate of residual variability. More significant here is a point with an unusual value for the horizontal variable.
As an example, in the oxytocin study Jeneve Bager was the oldest subject at 79 years of age. Suppose her basal measurement had been 5.40 pg/mL instead of the observed 3.92 pg/mL. The above figure shows the resulting relationship. The revised least-squares line is now
\[ \mbox{Basal} = 4.76 - 0.0026 \; \mbox{Age}. \]
A change to one observation out of 24 has dramatically changed the slope. While the general relationship is still negative the least-squares line is now almost horizontal. We call points like this influential points because of this effect.
The correlation has similarly been affected. For the original data the Pearson correlation between oxytocin level and age was [latex]r = -0.5165[/latex], reasonably strong in reference to the examples earlier in this chapter. With the change the correlation drops in strength to [latex]r = -0.1351[/latex], even though the bulk of the relationship is unchanged.
The general solution in this case is to examine any influential points carefully. Consider the analysis with the points and without them and if there is an important difference in conclusions then investigate further. Here we could repeat the measurement of Jeneve’s basal level, for example.
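A with-and-without comparison of this kind is easy to script. The sketch below uses the Age and Basal columns of the table, applies the hypothetical change to the oldest subject's basal value described above, and then refits with that point left out. The helper `fit_summary` is our own name.

```python
from math import sqrt

# Age (years) and basal plasma oxytocin (pg/mL), in the row order of the table.
age = [64, 18, 60, 21, 31, 55, 26, 20, 33, 79, 25, 41,
       38, 32, 42, 65, 24, 49, 22, 19, 28, 23, 77, 51]
basal = [4.40, 4.50, 4.17, 4.67, 4.88, 4.41, 4.19, 4.69, 4.62, 3.92, 4.44, 4.26,
         5.06, 4.91, 4.42, 4.83, 4.64, 4.44, 5.09, 5.38, 4.99, 4.86, 4.57, 4.06]

def fit_summary(x, y):
    """Return (intercept, slope, Pearson r) for the least-squares line."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((xj - x_bar) * (yj - y_bar) for xj, yj in zip(x, y))
    s_xx = sum((xj - x_bar) ** 2 for xj in x)
    s_yy = sum((yj - y_bar) ** 2 for yj in y)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope, s_xy / sqrt(s_xx * s_yy)

i = age.index(79)                 # the oldest subject
altered = list(basal)
altered[i] = 5.40                 # hypothetical value used in the text

original = fit_summary(age, basal)
changed = fit_summary(age, altered)
# Same altered data with the influential point removed.
dropped = fit_summary(age[:i] + age[i + 1:], altered[:i] + altered[i + 1:])

for label, (b0, b1, r) in [("original", original), ("altered", changed),
                           ("altered, point removed", dropped)]:
    print(f"{label:>24}: b0 = {b0:.2f}, b1 = {b1:.4f}, r = {r:.4f}")
```

The first two rows should agree with the lines and correlations quoted above; the third shows how the fit behaves once the altered point is set aside.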
We will also describe a more robust method for summarising association in Chapter 24.
Transforming Nonlinear Relationships
The relationship between dissolving time and temperature in the previous chapter looks like an exponential decay, given by
\[ y = a 10^{bx}, \; \mbox{ or } \; y = a e^{bx}. \]
We could actually use iterative algorithms to find the values of [latex]a[/latex] and [latex]b[/latex] that minimise the sum of squared prediction errors. However, since routines for linear regression are more readily available (even on standard scientific calculators), it is common practice to try to transform the nonlinear relationship into one that is linear.
Suppose we want to model dissolving time with the base 10 relationship [latex]y = a 10^{bx}[/latex]. If we take base 10 logarithms of both sides of this equation we get
\[ \log(y) = \log(a) + bx \log(10). \]
(If you are rusty with logarithms then see the Appendix for some background.) Since [latex]\log(10) = 1[/latex] this is just
\[ \log(y) = \log(a) + bx. \]
This is a linear relationship between [latex]\log(y)[/latex] and [latex]x[/latex], with intercept [latex]\log(a)[/latex] and slope [latex]b[/latex]. Thus if we plot the logarithm of dissolving time against temperature we would hope to have a relationship closer to a straight line. The following figure shows this plot together with the least-squares line and indeed it is a good fit.
The least-squares line is
\[ \log(y) = 3.22 - 0.015x, \]
so we estimate [latex]b[/latex] by [latex]-0.015[/latex] and [latex]\log(a)[/latex] by 3.22. This gives [latex]a = 10^{3.22} \approx 1660[/latex], so our exponential model of dissolving time is
\[ y = 1660 \times 10^{-0.015x}. \]
The figure below shows the original data with this exponential fit.
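The transform, fit and back-transform steps can be sketched in Python. The tablet measurements are not reproduced in this chapter, so the values below are synthetic: they are generated exactly from the fitted model at some assumed [latex]x[/latex] values, purely to show the mechanics. With real data the recovered estimates would not match so neatly.

```python
from math import log10

def least_squares(x, y):
    """Return (b0, b1) minimising the sum of squared prediction errors."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((xj - x_bar) * (yj - y_bar) for xj, yj in zip(x, y))
    s_xx = sum((xj - x_bar) ** 2 for xj in x)
    b1 = s_xy / s_xx
    return y_bar - b1 * x_bar, b1

# Synthetic values only (NOT the measured data): assumed x values and
# responses generated from y = 10^3.22 * 10^(-0.015 x).
x = [20, 30, 40, 50, 60, 70]
y = [10 ** 3.22 * 10 ** (-0.015 * xj) for xj in x]

# Step 1: transform the response with base 10 logarithms.
log_y = [log10(yj) for yj in y]

# Step 2: fit a straight line to log(y) against x.
log_a, b = least_squares(x, log_y)

# Step 3: back-transform the intercept to recover a.
a = 10 ** log_a
print(f"log(a) = {log_a:.2f}, b = {b:.3f}, a = {a:.0f}")
```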
Least-squares fitting can also be used to model data with the power relationship
\[ y = ax^b. \]
This model is also known as an allometric scaling relationship, and has an important role to play in biology. It can be turned into a linear relationship by taking logarithms of both sides. This transformation gives
\[ \log(y) = \log(ax^b) = \log(a) + b\log(x). \]
This is a linear relationship between [latex]\log(y)[/latex] and [latex]\log(x)[/latex], with intercept [latex]\log(a)[/latex] and slope [latex]b[/latex]. As above we can fit a least-squares line to the relationship between [latex]\log(y)[/latex] and [latex]\log(x)[/latex] and then work backwards to get estimates for [latex]a[/latex] and [latex]b[/latex].
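The same steps apply to the power relationship, except that both variables are log-transformed. In this sketch the data are synthetic, generated from hypothetical values [latex]a = 3[/latex] and [latex]b = 0.75[/latex] chosen only for illustration, so the fit recovers them exactly.

```python
from math import log10

def least_squares(x, y):
    """Return (b0, b1) minimising the sum of squared prediction errors."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((xj - x_bar) * (yj - y_bar) for xj, yj in zip(x, y))
    s_xx = sum((xj - x_bar) ** 2 for xj in x)
    b1 = s_xy / s_xx
    return y_bar - b1 * x_bar, b1

# Synthetic values only: hypothetical a = 3 and b = 0.75 in y = a * x^b.
x = [1, 2, 5, 10, 20, 50]
y = [3.0 * xj ** 0.75 for xj in x]

# Fit a line to log(y) against log(x): intercept log(a), slope b.
log_x = [log10(xj) for xj in x]
log_y = [log10(yj) for yj in y]
log_a, b_hat = least_squares(log_x, log_y)

# Work backwards to the original scale.
a_hat = 10 ** log_a
print(f"a = {a_hat:.3f}, b = {b_hat:.3f}")
```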
Summary
- The Pearson correlation coefficient, [latex]r[/latex], is a measure of the strength and direction of a linear relationship.
- The least-squares line is the line that minimises the sum of the squared prediction errors.
- Least-squares lines are badly affected by influential points.
- Certain nonlinear relationships can be transformed into linear relationships.
Exercise 1
Another tablet experiment looked at the effect of volume on the time taken for tablets to dissolve. This table shows the results from dissolving one tablet in a range of different volumes.
Dissolving times (min) by volume (mL)
Volume (mL) | 2 | 8 | 10 | 15 | 25 | 50 | 65 | 100 | 200 |
---|---|---|---|---|---|---|---|---|---|
Time (min) | 19.1 | 8.5 | 7.5 | 5.5 | 4.8 | 2.5 | 2.0 | 1.9 | 1.5 |
Make a scatter plot of the logarithm of dissolving time against volume for the data. Describe the relationship. How could a least-squares line be used to model the original relationship?
Exercise 2
The figure below shows the relationship between forearm length and height for the survey data (with the unusual forearm value corrected). Based on the least-squares line drawn in the plot, estimate the forearm length of somebody who was 100 cm tall.
Exercise 3
Daniel wanted to fit the relationship between a response variable, [latex]y[/latex], and a predictor variable, [latex]x[/latex], using a power model with [latex]y=a x^b[/latex]. He log-transformed the [latex]x[/latex] and [latex]y[/latex] values and fitted the least-squares line
\[ \log_{10} y = 0.810 + 2.669 \log_{10} x. \]
Based on this fit, what is the predicted response for the power model when [latex]x=3[/latex]?
Exercise 4
The correlation coefficient, [latex]r[/latex], measures the strength and direction of a linear association. Draw a scatterplot that shows a strong nonlinear association but which would have a correlation close to 0.
Exercise 5
Suppose you have a data set with just two points, [latex](x_1,y_1)[/latex] and [latex](x_2,y_2)[/latex]. What will be the correlation coefficient, [latex]r[/latex], for this data? Is it possible to choose two points such that [latex]r = 0[/latex]?