7 Linear Relationships
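A linear relationship is the simplest association to analyse between two quantitative variables. A straight line relationship between $x$ and $y$ can be written as $y = b_0 + b_1 x$,
so that $b_0$ is the intercept (the value of $y$ when $x = 0$) and $b_1$ is the slope (the change in $y$ for each unit increase in $x$).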
Pearson Correlation
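To summarise the strength of a linear relationship we can use a statistic called the Pearson correlation coefficient, $r$, defined by
$$r = \frac{1}{n-1} \sum_{j=1}^{n} \left( \frac{x_j - \bar{x}}{s_x} \right) \left( \frac{y_j - \bar{y}}{s_y} \right),$$
where $\bar{x}$ and $s_x$ are the mean and standard deviation of the $x$ values, and $\bar{y}$ and $s_y$ are the mean and standard deviation of the $y$ values.
If you have a positive association then $r > 0$.
If you have a negative association then $r < 0$.
If there is no linear relationship then $r$ is close to 0.
The following six figures give examples of some scatter plots of relationships which give certain values of $r$.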
Note that the correlation coefficient does not actually distinguish between a response and explanatory variable in its formula. If you reverse the roles of the variables you still get the same correlation value.
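The symmetry described above can be checked numerically. The following is a minimal sketch, assuming the usual definition of the Pearson coefficient as the sum of products of standardised values divided by $n - 1$; the function names and the five data points are made up for illustration and do not come from the text.

```python
import math

def mean(values):
    return sum(values) / len(values)

def sample_sd(values):
    # Sample standard deviation, with n - 1 as the divisor.
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def pearson_r(x, y):
    # Sum of products of standardised x and y values, divided by n - 1.
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    s_x, s_y = sample_sd(x), sample_sd(y)
    total = sum(((xj - x_bar) / s_x) * ((yj - y_bar) / s_y)
                for xj, yj in zip(x, y))
    return total / (n - 1)

# Made-up illustrative data (not from the text).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r_xy = pearson_r(x, y)  # x treated as explanatory, y as response
r_yx = pearson_r(y, x)  # roles reversed
print(round(r_xy, 4), round(r_yx, 4))
```

Both calls print the same value because each term in the sum is a product, and swapping the two factors does not change it.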
Height Relationships
The relationship between weight and height, shown in a previous figure, gives a correlation of
The figure above shows the relationship between pulse rate and height for the same sample of Islanders. What is the correlation likely to be here? From the previous six figures (examples of correlation coefficients) you can see it will be pretty close to 0, since there doesn’t seem to be any real trend in the data. You should never use the correlation coefficient without first plotting the data. It should be used to support the patterns you see. In this case we find
Least-Squares Lines
The correlation coefficient implicitly draws a straight line through the data and gives a measure of how close the data lie to this line. An obvious refinement is to actually draw the line that correlation is using so that we can see how well the line summarises the data and where any deviations might be. How then do we find this line?
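Now suppose the points in our scatter plot are $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$.
For any $x_j$, a line with intercept $b_0$ and slope $b_1$ gives the predicted value $\hat{y}_j = b_0 + b_1 x_j$.
We will often use this “hat” ($\hat{\ }$) notation to indicate a predicted or estimated value.
An obvious criterion for choosing a line (that is, choosing $b_0$ and $b_1$) is to make the prediction errors $y_j - \hat{y}_j$ as small as possible overall, by minimising the sum of squared prediction errors $\sum_{j=1}^{n} (y_j - \hat{y}_j)^2$.
Writing $r$ for the correlation coefficient, $\bar{x}$ and $\bar{y}$ for the means, and $s_x$ and $s_y$ for the standard deviations of the $x$ and $y$ values, you can use calculus to show that the values of $b_0$ and $b_1$ which minimise this sum are
$$b_1 = r \frac{s_y}{s_x} \quad \text{and} \quad b_0 = \bar{y} - b_1 \bar{x}.$$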
These are called the least-squares estimates for the slope and intercept of the linear relationship. As with the formula for the correlation coefficient, you would never actually calculate these by hand.
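Software does this calculation for you, but it can help to see how little is involved. The sketch below is one way to compute the estimates, assuming the line is written as intercept plus slope times $x$. It uses the form slope = $S_{xy} / S_{xx}$ (sums of products and squares of deviations from the means), which is algebraically the same as $r s_y / s_x$. The function names and data are made up for illustration.

```python
def least_squares(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xj - x_bar) ** 2 for xj in x)
    s_xy = sum((xj - x_bar) * (yj - y_bar) for xj, yj in zip(x, y))
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope

def sum_squared_errors(x, y, intercept, slope):
    """Sum of squared prediction errors for the line intercept + slope * x."""
    return sum((yj - (intercept + slope * xj)) ** 2 for xj, yj in zip(x, y))

# Made-up illustrative data (not from the text).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b0, b1 = least_squares(x, y)
best = sum_squared_errors(x, y, b0, b1)
nudged = sum_squared_errors(x, y, b0, b1 + 0.1)  # a slightly steeper line
print(b0, b1, best, nudged)
```

Changing the slope or intercept away from the least-squares values, as in the `nudged` line, can only increase the sum of squared prediction errors.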
Oxytocin and Emotions
Inspired by the work of Turner et al. (1999), Hanne Blomgren from the University of Arcadia carried out a study of the effects of emotions on oxytocin levels in the blood. She recruited twenty-four women to participate in the study, twelve who were in a relationship and twelve who were single. At the start of the trial each subject had their plasma oxytocin level (pg/mL) measured and they were then randomly assigned to one of three stimulus events: reliving happy memories for one minute, reliving sad memories for one minute, or receiving a two-minute Swedish massage. After the intervention their plasma oxytocin level was measured again. The results of the experiment, along with the age and weight of the subjects, are shown in the following table.
Results from oxytocin and emotions study
Single | Group | Name | Age (years) | Weight | Basal (pg/mL) | After (pg/mL) |
---|---|---|---|---|---|---|
Yes | Happy | Katie Sato | 64 | 50 | 4.40 | 4.40 |
Yes | Happy | Jana Clausen | 18 | 62 | 4.50 | 4.56 |
Yes | Happy | Nanako Connolly | 60 | 46 | 4.17 | 4.21 |
Yes | Happy | Abigail Jones | 21 | 58 | 4.67 | 4.70 |
Yes | Sad | Kelly Brown | 31 | 42 | 4.88 | 4.75 |
Yes | Sad | Marie Sorensen | 55 | 47 | 4.41 | 4.13 |
Yes | Sad | Asuka McCarthy | 26 | 64 | 4.19 | 4.09 |
Yes | Sad | Tyra Carlsen | 20 | 47 | 4.69 | 4.41 |
Yes | Massage | Britt Solberg | 33 | 63 | 4.62 | 5.38 |
Yes | Massage | Jeneve Bager | 79 | 68 | 3.92 | 4.25 |
Yes | Massage | Gerda Jensen | 25 | 58 | 4.44 | 4.63 |
Yes | Massage | Kaya Solberg | 41 | 61 | 4.26 | 4.70 |
No | Happy | Vanessa Solberg | 38 | 58 | 5.06 | 5.05 |
No | Happy | Bronwyn Kimura | 32 | 45 | 4.91 | 5.00 |
No | Happy | Miyu Morris | 42 | 68 | 4.42 | 4.46 |
No | Happy | Berit Eklund | 65 | 60 | 4.83 | 5.11 |
No | Sad | Louise Murphy | 24 | 85 | 4.64 | 4.64 |
No | Sad | Kelly White | 49 | 61 | 4.44 | 4.29 |
No | Sad | Chloe Regan | 22 | 57 | 5.09 | 4.98 |
No | Sad | Ayano Collins | 19 | 64 | 5.38 | 4.97 |
No | Massage | Yui Moore | 28 | 48 | 4.99 | 5.61 |
No | Massage | Ayaka Price | 23 | 61 | 4.86 | 5.16 |
No | Massage | Ursula Lund | 77 | 54 | 4.57 | 5.05 |
No | Massage | Miho Connolly | 51 | 59 | 4.06 | 4.60 |
We will look at the analysis of the emotional effects on oxytocin in later chapters. However, an initial question of interest is whether basal plasma oxytocin levels are related to age. For example, if there was an association then it may be important to consider how the design and analysis of the study might be affected by the ages of the subjects involved.
The figure above shows a scatter plot of oxytocin levels against age. The horizontal line is at the mean basal level, $\bar{y} = 4.60$ pg/mL. If we used this mean to predict every subject's basal level, the sum of squared prediction errors would be 2.915 (pg/mL)$^2$.
However, knowing the age of the subjects gives us more information, and it appears from the plot that there is a negative association between oxytocin and age, with older women tending to have lower mean levels of plasma oxytocin. The least-squares calculation gives the line
$$\widehat{\text{Basal}} = 4.98 - 0.0097 \times \text{Age},$$
as shown in the figure below. The slope of this line suggests that for every year older, the mean plasma oxytocin level of women is around 0.01 pg/mL less.
The sum of squared prediction errors from this line is now 2.138 (pg/mL)$^2$.
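The fit for the oxytocin data can be reproduced from the Age and Basal columns of the table. The sketch below is one way to do it in plain Python; the variable names are illustrative choices, and the data are typed in table order.

```python
# Age (years) and basal plasma oxytocin (pg/mL), in the order of the table.
age = [64, 18, 60, 21, 31, 55, 26, 20, 33, 79, 25, 41,
       38, 32, 42, 65, 24, 49, 22, 19, 28, 23, 77, 51]
basal = [4.40, 4.50, 4.17, 4.67, 4.88, 4.41, 4.19, 4.69, 4.62, 3.92, 4.44, 4.26,
         5.06, 4.91, 4.42, 4.83, 4.64, 4.44, 5.09, 5.38, 4.99, 4.86, 4.57, 4.06]

n = len(age)
x_bar = sum(age) / n
y_bar = sum(basal) / n

# Sums of squares and products of deviations from the means.
s_xx = sum((x - x_bar) ** 2 for x in age)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(age, basal))

slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

# Prediction errors from the horizontal mean line and from the fitted line.
sse_mean = sum((y - y_bar) ** 2 for y in basal)
sse_line = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(age, basal))

print(f"mean basal level: {y_bar:.2f} pg/mL")
print(f"fitted line: basal = {intercept:.3f} + ({slope:.4f}) * age")
print(f"sum of squared errors, mean line:   {sse_mean:.3f}")
print(f"sum of squared errors, fitted line: {sse_line:.3f}")
```

The last printed value should agree with the 2.138 quoted above, and comparing it with the value for the mean line shows how much of the variability in basal levels the age relationship accounts for.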
Influential Points
Correlation and least-squares fitting are both based on means and standard deviations, and so both are susceptible to outliers. There are two types of outliers now for regression. A point may have an unusual response value which will tend to inflate the estimate of residual variability. More significant here is a point with an unusual value for the horizontal variable.
As an example, in the oxytocin study Jeneve Bager was the oldest subject at 79 years of age. Suppose her basal measurement had been 5.40 pg/mL instead of the observed 3.92 pg/mL. The above figure shows the resulting relationship. The revised least-squares line is now
$$\widehat{\text{Basal}} = 4.76 - 0.0026 \times \text{Age}.$$
A change to one observation out of 24 has dramatically changed the slope. While the general relationship is still negative, the least-squares line is now almost horizontal. We call points like this influential points because of this effect.
The correlation has similarly been affected. For the original data the Pearson correlation between oxytocin level and age was $r = -0.52$, whereas with the altered measurement it is only $r = -0.14$.
The general solution in this case is to examine any influential points carefully. Consider the analysis with the points and without them and if there is an important difference in conclusions then investigate further. Here we could repeat the measurement of Jeneve’s basal level, for example.
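The with-and-without comparison is easy to automate. The following sketch, using the Age and Basal columns of the oxytocin table, fits the line three times: to the observed data, to the data with Jeneve Bager's basal level changed to the hypothetical 5.40 pg/mL, and to the data with her point left out. The helper `fit` and the variable names are illustrative, not part of any particular package.

```python
import math

def fit(x, y):
    """Return (intercept, slope, r) for the least-squares line of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xj - x_bar) ** 2 for xj in x)
    s_yy = sum((yj - y_bar) ** 2 for yj in y)
    s_xy = sum((xj - x_bar) * (yj - y_bar) for xj, yj in zip(x, y))
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope, s_xy / math.sqrt(s_xx * s_yy)

# Age (years) and basal plasma oxytocin (pg/mL), in the order of the table.
age = [64, 18, 60, 21, 31, 55, 26, 20, 33, 79, 25, 41,
       38, 32, 42, 65, 24, 49, 22, 19, 28, 23, 77, 51]
basal = [4.40, 4.50, 4.17, 4.67, 4.88, 4.41, 4.19, 4.69, 4.62, 3.92, 4.44, 4.26,
         5.06, 4.91, 4.42, 4.83, 4.64, 4.44, 5.09, 5.38, 4.99, 4.86, 4.57, 4.06]

k = age.index(79)  # position of Jeneve Bager, the oldest subject

altered = list(basal)
altered[k] = 5.40  # the hypothetical measurement discussed in the text

original = fit(age, basal)
changed = fit(age, altered)
dropped = fit(age[:k] + age[k + 1:], basal[:k] + basal[k + 1:])

for label, (b0, b1, r) in [("observed data", original),
                           ("altered point", changed),
                           ("point removed", dropped)]:
    print(f"{label}: intercept {b0:.3f}, slope {b1:.4f}, r {r:.2f}")
```

If the three fits lead to noticeably different slopes or correlations, that is a signal to look more closely at the point before drawing conclusions.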
We will also describe a more robust method for summarising association in Chapter 24.
Transforming Nonlinear Relationships
The relationship between dissolving time and temperature in the previous chapter looks like an exponential decay, given by
We could actually use iterative algorithms to find the values of
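Suppose we want to model dissolving time, $y$, in terms of temperature, $x$, with the base 10 relationship $y = a \times 10^{bx}$.
(If you are rusty with logarithms then see the Appendix for some background.) Since $\log_{10}(10^{bx}) = bx$, taking base 10 logarithms of both sides gives $\log_{10} y = \log_{10} a + bx$.
This is a linear relationship between $\log_{10} y$ and $x$, with intercept $\log_{10} a$ and slope $b$.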
The least-squares line is
so we estimate
The figure below shows the original data with this exponential fit.
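The log-transform approach for a base 10 relationship can be sketched as follows. The notation $y = a \times 10^{bx}$ is an assumption made for this example, and the temperature and time values are made up and noise-free (generated with $a = 20$ and $b = -0.01$) so that the recovered constants can be checked. They are not the tablet data from the previous chapter.

```python
import math

def least_squares(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xj - x_bar) ** 2 for xj in x)
    s_xy = sum((xj - x_bar) * (yj - y_bar) for xj, yj in zip(x, y))
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope

# Made-up, noise-free values following y = 20 * 10**(-0.01 * x).
temperature = [10, 20, 30, 40, 50, 60]
dissolve_time = [20 * 10 ** (-0.01 * t) for t in temperature]

# Fit a straight line to log10(time) against temperature.
log_time = [math.log10(y) for y in dissolve_time]
intercept, slope = least_squares(temperature, log_time)

# Undo the transformation: intercept estimates log10(a), slope estimates b.
a_hat = 10 ** intercept
b_hat = slope

def predict_time(t):
    """Predicted dissolving time on the original scale."""
    return a_hat * 10 ** (b_hat * t)

print(a_hat, b_hat, predict_time(35))
```

With real data the points will not lie exactly on a line after transformation, and the fit minimises squared errors on the log scale rather than on the original scale.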
Least-squares fitting can also be used to model data with the power relationship $y = a x^b$.
This model is also known as an allometric scaling relationship, and has an important role to play in biology. It can be turned into a linear relationship by taking logarithms of both sides. This transformation gives $\log y = \log a + b \log x$.
This is a linear relationship between $\log y$ and $\log x$.
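For a power relationship both variables are transformed before fitting. The sketch below assumes the model is written $y = a x^b$ and uses made-up, noise-free values (generated with $a = 3$ and $b = 0.75$) purely to show the mechanics; none of the numbers come from the text.

```python
import math

def least_squares(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xj - x_bar) ** 2 for xj in x)
    s_xy = sum((xj - x_bar) * (yj - y_bar) for xj, yj in zip(x, y))
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope

# Made-up, noise-free values following y = 3 * x**0.75.
x = [1, 2, 4, 8, 16, 32]
y = [3 * xj ** 0.75 for xj in x]

# Fit a straight line to log10(y) against log10(x).
log_x = [math.log10(xj) for xj in x]
log_y = [math.log10(yj) for yj in y]
intercept, slope = least_squares(log_x, log_y)

# The slope estimates the power b; the intercept estimates log10(a).
a_hat = 10 ** intercept
b_hat = slope
print(a_hat, b_hat)
```

Any base of logarithm works here, as long as the same base is used for both variables and for undoing the transformation of the intercept.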
Summary
- The Pearson correlation coefficient, $r$, is a measure of the strength and direction of a linear relationship.
- The least-squares line is the line that minimises the sum of the squared prediction errors.
- Least-squares lines are badly affected by influential points.
- Certain nonlinear relationships can be transformed into linear relationships.
Exercise 1
Another tablet experiment looked at the effect of volume on the time taken for tablets to dissolve. This table shows the results from dissolving one tablet in a range of different volumes.
Dissolving times (min) by volume (mL)
Volume | 2 | 8 | 10 | 15 | 25 | 50 | 65 | 100 | 200 |
Time | 19.1 | 8.5 | 7.5 | 5.5 | 4.8 | 2.5 | 2.0 | 1.9 | 1.5 |
Make a scatter plot of the logarithm of dissolving time against volume for the data. Describe the relationship. How could a least-squares line be used to model the original relationship?
Exercise 2
The figure below shows the relationship between forearm length and height for the survey data (with the unusual forearm value corrected). Based on the least-squares line drawn in the plot, estimate the forearm length of somebody who was 100 cm tall.
Exercise 3
Daniel wanted to fit the relationship between a response variable,
Based on this fit, what is the predicted response for the power model when
Exercise 4
The correlation coefficient,
Exercise 5
Suppose you have a data set with just two points,