# 21 Multiple Regression

[latex]\newcommand{\pr}[1]{P(#1)} \newcommand{\var}[1]{\mbox{var}(#1)} \newcommand{\mean}[1]{\mbox{E}(#1)} \newcommand{\sd}[1]{\mbox{sd}(#1)} \newcommand{\Binomial}[3]{#1 \sim \mbox{Binomial}(#2,#3)} \newcommand{\Student}[2]{#1 \sim \mbox{Student}(#2)} \newcommand{\Normal}[3]{#1 \sim \mbox{Normal}(#2,#3)} \newcommand{\Poisson}[2]{#1 \sim \mbox{Poisson}(#2)} \newcommand{\se}[1]{\mbox{se}(#1)} \newcommand{\prbig}[1]{P\left(#1\right)} \newcommand{\degc}{$^{\circ}$C}[/latex]

In this chapter we briefly introduce **multiple regression** as a generalisation of linear regression and analysis of variance, and show how the ideas we have seen can be extended to problems with more variables.

# Indicator Variables

We have used the method of least-squares to determine the values of [latex]b_0[/latex] and [latex]b_1[/latex] which make the line

\[ y = b_0 + b_1 x \]

best fit some data, where [latex]y[/latex] and [latex]x[/latex] were continuous variables, such as weight and height. However, we can turn a categorical variable into a continuous variable by defining an **indicator variable**. For example, we could define

\[ x = \left\{ \begin{array}{ll} 1, & \mbox{ if male} \\ 0, & \mbox{ if female} \end{array} \right. \]

This [latex]x[/latex] is also known as a **dummy variable**. We can then carry out least-squares regression in the usual way.
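As a minimal sketch (using NumPy and made-up heights, not the chapter's survey data), we can check that a least-squares fit on an indicator variable recovers the two group means: the intercept is the mean for the 0 group and the slope is the difference in means.

```python
import numpy as np

# Made-up heights in cm (not the chapter's survey data), for illustration only.
heights_f = np.array([160.0, 165.0, 170.0, 168.0])
heights_m = np.array([175.0, 172.0, 180.0, 178.0])

y = np.concatenate([heights_f, heights_m])
x = np.concatenate([np.zeros(len(heights_f)),
                    np.ones(len(heights_m))])  # indicator: 1 if male

# Least-squares fit of y = b0 + b1*x via the design matrix [1, x].
X = np.column_stack([np.ones_like(x), x])
(b0, b1), *_ = np.linalg.lstsq(X, y, rcond=None)

# b0 is the female mean; b1 is the male minus female difference in means.
```

This is exactly the pattern seen in the regression summary below, where the "slope" for Male equals the difference between the mean male and female heights.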

## Regression summary for height by sex

| | Estimate | SE | T | P |
|---|---|---|---|---|
| Constant | 167.42 | 1.210 | 138.4 | <0.001 |
| Male | 9.64 | 1.608 | 5.99 | <0.001 |

For example, suppose we want to use linear regression to model sex differences in height. The table above shows the results obtained by software, giving the line

\[ y = 167.4 + 9.64 x, \]

where [latex]x[/latex] is the indicator for sex, as above. Compare this result to the pooled [latex]t[/latex] test in Chapter 16. The “slope” of this line, 9.64, is the same as the difference between the mean male and female heights. This makes sense, since moving [latex]x[/latex] from 0 to 1 is essentially asking how much height is added by 1 unit of “maleness”. The answer is 9.64 cm on average.

The [latex]t[/latex] statistic given by regression is (essentially) the same as the [latex]t[/latex] statistic in the pooled two-sample test, and the resulting confidence interval is also the same. It can be shown that this will always be the case, and so regression is a generalisation of the simple [latex]t[/latex] procedures we began with.

Similarly, the table below shows the ANOVA table for the above regression.

## ANOVA table for height by sex regression

| Source | df | SS | MS | F | P |
|---|---|---|---|---|---|
| Sex | 1 | 1368 | 1368 | 35.93 | <0.001 |
| Residuals | 58 | 2208 | 38.07 | | |
| Total | 59 | 3576 | | | |

We have now seen close relationships between the [latex]t[/latex] procedures, regression and correlation, and analysis of variance. It is these methods, developed by R. A. Fisher and not yet a century old, that have become the cornerstone of scientific data analysis. If you can appreciate the unity of these ideas then you should be able to approach more advanced methods with confidence.

# Multiple Predictors

A table in Chapter 3 gave the time for which 20 students could hold their breath, along with the height and sex of each student. The following figure shows the relationship between the time that breath could be held and height. A least-squares line has been fitted, and there seems to be a strong association between the time that breath was held and height. The following table shows the regression results, including the highly significant [latex]P[/latex]-value of 0.001.

## Regression summary for time breath held by height

| | Estimate | SE | T | P |
|---|---|---|---|---|
| Constant | -181.7 | 59.29 | -3.07 | 0.007 |
| Height | 1.277 | 0.3388 | 3.77 | 0.001 |

However, this analysis is misleading. The figure below shows the same relationship but with different symbols to show males and females. Least-squares lines have been fitted for males and females separately, and there now appears to be little association between the time that breath was held and height. The line in the previous figure was essentially just joining two clouds of points together, giving the spurious association.

How can we detect this effect in our regression? One approach is to fit more than one variable at a time to estimate the time that breath is held. For example, let [latex]x_1[/latex] be the height of a subject and let

\[ x_2 = \left\{ \begin{array}{ll} 1, & \mbox{ if male} \\ 0, & \mbox{ if female} \end{array} \right. \]

be an indicator variable for their sex. We can then use least-squares regression to fit the “line”

\[ y = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_1 x_2. \]

Here [latex]b_1[/latex] will give the effect of height on breath holding, [latex]b_2[/latex] will give the sex difference in breath holding, while [latex]b_3[/latex] will measure the difference in the slopes between the males and females. To see the role of [latex]b_3[/latex], note that if a subject is female then this equation will estimate [latex]y[/latex] by

\[ y = b_0 + b_1 x_1 + b_2 \times 0 + b_3 x_1 \times 0 = b_0 + b_1 x_1. \]

If a subject is male then we have

\[ y = b_0 + b_1 x_1 + b_2 \times 1 + b_3 x_1 \times 1 = (b_0 + b_2) + (b_1 + b_3) x_1. \]

So [latex]b_2[/latex] is the overall change in time for being male, while [latex]b_3[/latex] is the change in the slope of the relationship with height for being male. If [latex]b_3 = 0[/latex] then the male and female lines would be parallel.
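The algebra above can be checked numerically. In this sketch (synthetic data, not the chapter's), fitting the full interaction model by least squares gives exactly the same two lines as fitting males and females separately: the female line has intercept [latex]b_0[/latex] and slope [latex]b_1[/latex], and the male line has intercept [latex]b_0 + b_2[/latex] and slope [latex]b_1 + b_3[/latex].

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (not the chapter's): heights x1 and an indicator x2 for sex.
n = 10
x1_f = rng.uniform(150.0, 175.0, n)
x1_m = rng.uniform(165.0, 190.0, n)
y_f = 20.0 + 0.10 * x1_f + rng.normal(0.0, 5.0, n)
y_m = 40.0 + 0.20 * x1_m + rng.normal(0.0, 5.0, n)

x1 = np.concatenate([x1_f, x1_m])
x2 = np.concatenate([np.zeros(n), np.ones(n)])  # 1 if male
y = np.concatenate([y_f, y_m])

# Full interaction model: y = b0 + b1*x1 + b2*x2 + b3*x1*x2.
X = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2])
(b0, b1, b2, b3), *_ = np.linalg.lstsq(X, y, rcond=None)

# Separate straight-line fits for each sex (polyfit returns slope first).
bf1, bf0 = np.polyfit(x1_f, y_f, 1)
bm1, bm0 = np.polyfit(x1_m, y_m, 1)
# Female line: (b0, b1).  Male line: (b0 + b2, b1 + b3).
```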

## Summary for full multiple regression model

| | Estimate | SE | T | P |
|---|---|---|---|---|
| Constant | 84.33 | 86.88 | 0.97 | 0.346 |
| Male | -41.75 | 125.0 | -0.33 | 0.743 |
| Height | -0.3468 | 0.5178 | -0.67 | 0.513 |
| Male [asciimath]\times[/asciimath] Height | 0.4249 | 0.7158 | 0.59 | 0.561 |

The table above shows the least squares estimates for these four parameters, along with their standard errors. Each of these can be used to test the null hypothesis

\[ H_0: \beta_j = 0 \]

**conditional on the other variables being included in the model**. For example, here the [latex]R^2[/latex] value is 0.7847, suggesting the overall model does a reasonable job of explaining the variability, yet none of the individual coefficients is significantly different from 0.

In particular, there is no evidence that [latex]\beta_3[/latex] differs from 0. This suggests we could try a simpler model where the interaction between sex and height is dropped. That is, we look at the linear relationship

\[ y = b_0 + b_1 x_1 + b_2 x_2, \]

where the variables and coefficients are as before. The following table shows the results of the regression analysis. The [latex]R^2[/latex] value is now 0.7800, only slightly lower than for the full regression model. We now find very strong evidence of an overall sex difference, as you might expect from the previous figure. There is no evidence of a relationship between the time that breath was held and the height of the subject.

## Summary for reduced multiple regression model

| | Estimate | SE | T | P |
|---|---|---|---|---|
| Constant | 47.05 | 58.87 | 0.80 | 0.435 |
| Male | 32.37 | 6.327 | 5.12 | <0.001 |
| Height | -0.1244 | 0.3506 | -0.35 | 0.727 |

We can also use analysis of variance for the regression model to assess the effect of each factor on the time subjects can hold their breath. The table below gives the ANOVA table for the full regression model, corresponding to the estimates in the earlier summary of that model. You will notice that the three components are essentially the same (except that we no longer need an explicit indicator variable) but the [latex]P[/latex]-values for each are different.

## ANOVA table for time breath held (s) by sex and height (cm)

| Source | df | SS | MS | F | P |
|---|---|---|---|---|---|
| Sex | 1 | 4685 | 4685 | 57.845 | <0.001 |
| Height | 1 | 10 | 10 | 0.121 | 0.732 |
| Sex [asciimath]\times[/asciimath] Height | 1 | 29 | 29 | 0.352 | 0.561 |
| Residuals | 16 | 1296 | 81 | | |
| Total | 19 | 6020 | | | |

For example, the sum of squares for Sex (4685 s[latex]^2[/latex]) is a measure of the variability in the response explained by knowing the sex of a subject. This value is quite high and so the [latex]P[/latex]-value for the Sex component is highly significant. This is in contrast to the [latex]P[/latex]-value of 0.743 in the summary of the full model, which assessed the significance of Sex given that all the other variables were in the model.

The sum of squares for Height (10 s[latex]^2[/latex]) in the previous table is then a measure of how much more variability is explained by knowing the height of a subject, **given that you already know their sex**. We have already seen that there is little association with height once sex is taken into account, as in the previous figure, so this small value is not surprising.

Lastly, the sum of squares for Sex [latex]\times[/latex] Height (29 s[latex]^2[/latex]) is a measure of how much more variability is explained by knowing the combination of sex and height for a subject, given that you already have the other two variables in the model. Note that this case **is** equivalent to the previous [latex]t[/latex] tests since this is the last variable added to the model. We thus get the same [latex]P[/latex]-value of 0.561 in both cases.

These last two [latex]F[/latex] tests are examples of **partial F tests**, testing for the contribution of an additional variable added to the model. In both cases the null hypothesis was that the new variable does not significantly add to the prediction of the response, given the existing variables in the model. In general this procedure is a useful tool in helping decide which variables to include in a model. Further discussion of model building is beyond the scope of this book but there are many texts on regression analysis, such as Kleinbaum et al. (1988), that give more details on this topic.
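A partial F test can be sketched directly from its definition: the extra sum of squares explained by the added terms, divided by their degrees of freedom, relative to the full model's residual mean square. Using the chapter's rounded sums of squares for the Sex [latex]\times[/latex] Height interaction:

```python
def partial_f(sse_reduced, sse_full, df_extra, dfe_full):
    """Partial F statistic for adding df_extra terms to a model."""
    return ((sse_reduced - sse_full) / df_extra) / (sse_full / dfe_full)

# Sex x Height interaction: full model SSE = 1296 on 16 df (from the ANOVA
# table); dropping the interaction returns its SS of 29 to the residual.
f = partial_f(1296 + 29, 1296, 1, 16)
print(round(f, 3))  # about 0.358; the table's 0.352 differs only by rounding
```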

# Two-Way Analysis of Variance

The basic idea behind analysis of variance, of separating variance into components due to the effect of a variable and to residual error, can be easily extended to analyse more variables. In this section we look at the special case of a design with two factors.

## Oxytocin and Emotions

We saw in Chapter 19 that there was evidence of a difference between the three stimulus events in their effect on plasma oxytocin level. However we also saw earlier in Chapter 19 that there was a difference in basal oxytocin between women in and out of a relationship. Could relationship status also be related to the observed change in oxytocin level?

The figure below shows a dot plot comparing the changes in oxytocin level between single women and those in a relationship. There does not seem to be a relationship effect: even though the mean change was 0.051 pg/mL lower for single women, there is a lot of variability in the response. The analysis of variance in the following table shows that, relative to this variability, the difference in the means is not significant. The corresponding [latex]R^2[/latex] value is just 0.0074.

## ANOVA table for change in oxytocin level (pg/mL) by relationship status

| Source | df | SS | MS | F | P |
|---|---|---|---|---|---|
| Single | 1 | 0.0155 | 0.0155 | 0.17 | 0.689 |
| Residuals | 22 | 2.0673 | 0.0940 | | |
| Total | 23 | 2.0828 | | | |

However this analysis may be rather unfair on relationship status as a predictor. We already know one reason why there is so much variability in the response: Chapter 19 demonstrated that a lot of the variability ([latex]R^2 = 0.7996[/latex]) is explained by the different stimulus events experienced by the women. We repeat the ANOVA table from there in the table below for comparison. Note again that the total sum of squares is the same for both tables since they are modelling the same response.

## ANOVA table for change in oxytocin (pg/mL) by stimulus event

| Source | df | SS | MS | F | P |
|---|---|---|---|---|---|
| Group | 2 | 1.6655 | 0.8327 | 41.91 | <0.001 |
| Residuals | 21 | 0.4173 | 0.0199 | | |
| Total | 23 | 2.0828 | | | |

So instead of looking at the effect of relationship status in isolation we should look at both relationship status and stimulus event simultaneously. The figure below shows a **trellis plot** (Cleveland, 1993; Sarkar, 2008) which separates out these two factors. The relationship effect is still quite small but now it is relative to the much smaller variability within each of the six treatment combinations.

As with one-way analysis of variance, we will assume that the residual variability is constant across the six treatments. We could write the model for the response [latex]y[/latex] as

\[ y = \mu_{rc} + U, \]

where [latex]\mu_{rc}[/latex] is the mean response for the row [latex]r[/latex] and column [latex]c[/latex] combination, and [latex]\Normal{U}{0}{\sigma}[/latex], where [latex]\sigma[/latex] does not depend on [latex]r[/latex] and [latex]c[/latex]. To estimate [latex]\sigma[/latex] we can pool together individual estimates to obtain a combined estimate of the residual variability. The standard deviations within each combination and the corresponding sums of squares are shown in the following table. Pooling these together, we find

\[ \mbox{SSE } = 0.4003, \]

and

\[ \mbox{DFE } = 6 \times 3 = 18. \]

## Standard deviations within treatment combinations

| Single | Stimulus | [asciimath]s[/asciimath] | df | SS |
|---|---|---|---|---|
| No | Happy | 0.1268 | 3 | 0.0482 |
| | Massage | 0.1360 | 3 | 0.0555 |
| | Sad | 0.1737 | 3 | 0.0905 |
| Yes | Happy | 0.0250 | 3 | 0.0019 |
| | Massage | 0.2426 | 3 | 0.1766 |
| | Sad | 0.0960 | 3 | 0.0277 |
| | Total | | 18 | 0.4003 |

The total degrees of freedom are [latex]24 - 1 = 23[/latex] with sum of squares 2.0828. From the earlier table we have 1 degree of freedom with sum of squares 0.0155 for the relationship status while from the next table we have 2 degrees of freedom with sum of squares 1.6655 for the stimulus event. Combining these with the residual sum of squares we have a total of

\[ 0.0155 + 1.6655 + 0.4003 = 2.0813 \]

for the sum of squares from our model components, with degrees of freedom

\[ 1 + 2 + 18 = 21. \]

Both of these fall short of the corresponding totals, so something is missing from our breakdown of the variability in the observed oxytocin changes.

This missing component comes from a source of variability due to a possible **interaction** in the effects of the stimulus events and the relationship status on the change in oxytocin level. We can quantify the significance of an interaction effect using analysis of variance. By difference, the sum of squares for the interaction effect is 2.0828 - 2.0813 = 0.0015, with 2 degrees of freedom.
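The pooling and differencing steps above amount to simple arithmetic, sketched here with the values from the text:

```python
# Pooled residual sum of squares from the per-cell standard deviations:
# SSE is the sum of df_i * s_i**2 over the six treatment combinations.
cell_sd = [0.1268, 0.1360, 0.1737, 0.0250, 0.2426, 0.0960]
df_per_cell = 3
sse = round(sum(df_per_cell * s**2 for s in cell_sd), 4)  # 0.4003, as in text

# Interaction sum of squares, found by difference from the total.
ss_total = 2.0828
ss_single, ss_group = 0.0155, 1.6655
ss_interaction = ss_total - (ss_single + ss_group + sse)  # about 0.0015
```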

The table below shows the full ANOVA table for this two-way analysis of variance. Note that the sum of squares for the relationship effect is the same as before but the [latex]F[/latex] statistic is now calculated relative to the much smaller residual error, giving a higher [latex]F[/latex] value and a correspondingly lower [latex]P[/latex]-value. However we still find inconclusive evidence for a difference in oxytocin response between women in and out of a relationship.

## ANOVA table for change in oxytocin (pg/mL) by stimulus event and relationship status

| Source | df | SS | MS | F | P |
|---|---|---|---|---|---|
| Single | 1 | 0.0155 | 0.0155 | 0.70 | 0.415 |
| Group | 2 | 1.6655 | 0.8327 | 37.44 | <0.001 |
| Single [asciimath]\times[/asciimath] Group | 2 | 0.0015 | 0.0007 | 0.03 | 0.968 |
| Residuals | 18 | 0.4003 | 0.0222 | | |
| Total | 23 | 2.0828 | | | |

It is important to note that the calculations involved in two-way analysis of variance are only valid if there are the same number of observations in each cell. Such an experimental design is called **balanced**, for obvious reasons. An unbalanced design, where some cells have different numbers of observations, cannot be analysed directly in this way. There are alternatives, such as replacing each cell by its mean, to give a trivially balanced design. However, the simplest approach is to use multiple regression techniques, as outlined in the following example.

## Intestinal Worms and Native Rat Species

A table in Chapter 16 gives data from a study into worms in native rats. In that section we used log transforms to show that there was a significant difference in worm counts in the small intestine between those rats which had been treated and those which were in a control group. We will return to this data now and look to see if there is any difference between the two species of native rat in the study.

Since we already know that the treatment has a strong effect on worm counts, it is important to keep this factor in the analysis; otherwise spurious differences may be seen between the species, particularly if there is an interaction between the treatment and the species. With two factors, a two-way ANOVA may be appropriate. However, we don't have the same number of observations in each of the four combinations of factor levels. This means it is an unbalanced design and so we cannot use the simple method discussed in the oxytocin example above. Instead we will use a regression model for the data.

Both explanatory variables are categorical factors and so we start by defining indicator variables for each of these. Firstly, let

\[ x_1 = \left\{ \begin{array}{ll} 1, & \mbox{ if control} \\ 0, & \mbox{ if treated,} \end{array} \right. \]

and then let

\[ x_2 = \left\{ \begin{array}{ll} 1, & \mbox{ if Melomys} \\ 0, & \mbox{ if Rattus.} \end{array} \right. \]

We then use least-squares regression to fit the linear relationship

\[ \log(y) = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_1 x_2. \]

Here [latex]b_1[/latex] will give the effect of not being treated, [latex]b_2[/latex] will give the effect of being Melomys rather than Rattus, while [latex]b_3[/latex] will measure the interaction between these. As before we use the log response since the observations are skewed.

## Regression summary for transformed worm data

| | Estimate | SE | T | P |
|---|---|---|---|---|
| Constant | 0.547 | 0.1743 | 3.14 | 0.005 |
| Control | 1.596 | 0.2465 | 6.48 | <0.001 |
| Melomys | 0.309 | 0.2812 | 1.10 | 0.283 |
| Control [asciimath]\times[/asciimath] Melomys | -1.270 | 0.3975 | -3.20 | 0.004 |

The table above shows the usual regression summary, except now there are three variables plus a constant. The [latex]P[/latex]-value for [latex]b_1[/latex] suggests a strong treatment effect, with a 95% confidence interval of (1.08, 2.11). This is an estimate, in the log scale, of how many more worms there are for untreated rats. When we just looked at treatment alone in Chapter 16 we found the interval (0.620, 1.596). Adding the extra variables has allowed us to get a better estimate of the residual variability by removing the variability attributable to the presence of the different species.

There is inconclusive evidence of a species effect by itself ([latex]p = 0.283[/latex]). However, there is strong evidence of an interaction effect. When [latex]x_1[/latex] and [latex]x_2[/latex] are both 1 (the only way that [latex]x_1 x_2[/latex] can be nonzero) there is a large decrease (1.270 in the log scale) in the predicted worm count. This corresponds to the Melomys rats which haven't been treated, suggesting that the treatment has less effect on that species. This effect can be seen in the interaction effects plot below.

Finally, the table below shows an analysis of variance for this regression. Note that the model degrees of freedom are 3 because there are three variables in our model. The [latex]R^2[/latex] value here is 0.6750, compared to 0.4847 when only treatment is considered.

## ANOVA table for model of transformed worm data

| Source | df | SS | MS | F | P |
|---|---|---|---|---|---|
| Group | 1 | 7.977 | 7.977 | 32.81 | <0.001 |
| Species | 1 | 0.652 | 0.652 | 2.68 | 0.116 |
| Group [asciimath]\times[/asciimath] Species | 1 | 2.481 | 2.481 | 10.21 | 0.004 |
| Residuals | 22 | 5.348 | 0.243 | | |
| Total | 25 | 16.459 | | | |
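As a quick check, the [latex]R^2[/latex] quoted above can be read straight off the ANOVA table as the proportion of the total sum of squares not left in the residual:

```python
# R^2 from the ANOVA table: share of the total sum of squares explained
# by the model, i.e. everything except the residual.
ss_total = 16.459
ss_residual = 5.348
r2 = 1 - ss_residual / ss_total  # about 0.675, matching the text's 0.6750
```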

# A Very Simple Model

Suppose we wanted to model the increase in pulse rate of the 10 students in the Chapter 2 example who drank the caffeinated cola. A very simple model would be

\[ y = b_0. \]

This model has no explanatory variables, just the constant [latex]b_0[/latex]. Yet we can still use the method of least-squares to determine the estimate for [latex]b_0[/latex] that minimises the sum of the squared prediction errors. That is, we want to find [latex]b_0[/latex] to minimise

\[ \sum (y_j - b_0)^2. \]

Software was used to do this, giving the regression output in the following table. The estimate of [latex]b_0[/latex] is 15.80 bpm. This is just the sample mean, [latex]\overline{y}[/latex]. In a way, our whole discussion so far has really been about least-squares estimation. The sample mean is just the value that minimises the sum of squared prediction errors for a single variable.
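This claim is easy to verify numerically. In this sketch (an arbitrary made-up sample, not the chapter's pulse data), the least-squares fit of a constant-only model returns the sample mean, and the square root of the resulting mean squared error is the sample standard deviation:

```python
import numpy as np

# An arbitrary made-up sample (not the chapter's pulse data).
y = np.array([12.0, 20.0, 8.0, 24.0, 15.0])

# Constant-only model y = b0: the design matrix is a single column of ones.
X = np.ones((len(y), 1))
(b0,), *_ = np.linalg.lstsq(X, y, rcond=None)
# b0 equals the sample mean of y.

# Residual sum of squares and mean squared error (residual df = n - 1 here).
sse = np.sum((y - b0) ** 2)
mse = sse / (len(y) - 1)
# The square root of mse equals the sample standard deviation of y.
```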

## Regression summary for simple pulse rate model

| | Estimate | SE | T | P | 95% CI |
|---|---|---|---|---|---|
| [asciimath]b_0[/asciimath] | 15.80 | 2.632 | 6.00 | 0.0002 | (9.85, 21.75) |

The above table also gives the standard error and a 95% confidence interval for [latex]b_0[/latex]. Compare these with the values found in Chapter 14. They are the same as the standard error and confidence interval for the sample mean.

Finally, the table below shows an ANOVA table for this model. The residual error and total values are the same since there is no variable to help explain the variability observed. The square root of MSE is [latex]\sqrt{69.29} = 8.324[/latex] bpm, and this is just the sample standard deviation of the increases.

## ANOVA table for simple pulse rate model

| Source | df | SS | MS | F | P |
|---|---|---|---|---|---|
| Residuals | 9 | 623.6 | 69.29 | | |
| Total | 9 | 623.6 | | | |

# R-Squared Revisited

In this chapter we have looked at a variety of models for a quantitative response variable based on one or more predictor variables. The coefficient of determination, [latex]R^2[/latex], has been a useful measure for comparing these models. For example, the relationship status model for change in oxytocin in the previous section only had [latex]R^2 = 0.0074[/latex], compared to the much more impressive value of [latex]R^2 = 0.7996[/latex] for the model based on the actual stimulus event.

However there are some problems with using [latex]R^2[/latex] to compare models in this way. Consider the ANOVA results in the oxytocin example where we included relationship status and stimulus event, along with a possible interaction between them, in our model. Even after taking into account the stimulus event effect we still didn't find any evidence of a relationship effect, so a model containing relationship status is not going to be useful in practice. Yet the [latex]R^2[/latex] value for this model is

\[ R^2 = \frac{0.0155 + 1.6655 + 0.0015}{2.0828} = 0.8078, \]

higher than for the better model that just has the stimulus event as predictor. The reason for this is that [latex]R^2[/latex] is measuring how much of the variability is explained by a model and so as we add more terms to our model this can only go up.

If we want to use [latex]R^2[/latex] to compare models with different numbers of predictor variables then we need to take the number of variables into account. The standard method for doing this gives the **adjusted R[latex]^2[/latex]** value

\[ \overline{R}^2 = 1 - (1 - R^2)\frac{\mbox{DFT}}{\mbox{DFE}}. \]

If the model has lots of terms then [latex]\mbox{DFE}[/latex] will be much less than [latex]\mbox{DFT}[/latex], increasing the weight on [latex]1 - R^2[/latex] and thus decreasing the value of [latex]\overline{R}^2[/latex] (Triola & Triola, 2006). For example, the adjusted [latex]R^2[/latex] value for the model based only on the stimulus event is

\[ \overline{R}^2 = 1 - (1 - 0.7996)\frac{23}{21} = 0.7805. \]

In contrast, the adjusted [latex]R^2[/latex] value for the full interaction model in the oxytocin example is

\[ \overline{R}^2 = 1 - (1 - 0.8078)\frac{23}{18} = 0.7544. \]

Thus, based on the adjusted [latex]R^2[/latex] values, we would say that the model for change in oxytocin level based on stimulus event alone is better than the model that adds relationship status as well.
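The adjusted [latex]R^2[/latex] calculation is a one-line function; here it is sketched in Python, reproducing the two values above:

```python
def adjusted_r2(r2, dft, dfe):
    """Adjusted R^2, penalising models that use up degrees of freedom."""
    return 1 - (1 - r2) * dft / dfe

# Stimulus-event-only model versus the full interaction model from the text.
print(round(adjusted_r2(0.7996, 23, 21), 4))  # 0.7805
print(round(adjusted_r2(0.8078, 23, 18), 4))  # 0.7544
```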

The downside of the adjusted [latex]R^2[/latex] is that it doesn’t have the same interpretation as a proportion that the original [latex]R^2[/latex] does since it no longer has to be between 0 and 1. Exercise 3 gives an example of this in practice.

# Summary

- Multiple regression is a generalisation of least-squares regression to multiple predictor variables. Least-squares fitting is used to estimate coefficients in the multiple regression model.
- Indicator variables can be used to include categorical predictors in the regression model.
- Two-way analysis of variance partitions the total variability in a response due to two categorical factors.
- Two-way analysis of variance can help detect an interaction effect between two factors on the response.

## Exercise 1

Based on the survey data, is the slope of a linear relationship between height and forearm length the same for males and females? You can test this using multiple regression by making an indicator variable for sex. Fit a model for height with a constant term, forearm length, sex, and a forearm length [latex]\times[/latex] sex interaction. Look at the coefficient estimates and their associated tests. Discuss the results.

## Exercise 2

A student was interested in whether you could obtain more juice from an orange by warming it first. Thirty Navel oranges were weighed to the nearest 5 g and divided into two groups, with 15 microwaved (at 600 W) for one minute prior to juicing. Each orange was cut in half and juiced immediately, using a manual juice press, and the amount of juice was recorded to the nearest 5 mL. The results are given in the table below.

## Juice (mL) from oranges by weight (g) and heating

| Treatment | Weight | Juice | Weight | Juice |
|---|---|---|---|---|
| None | 230 | 105 | 260 | 80 |
| | 250 | 120 | 285 | 105 |
| | 310 | 125 | 245 | 115 |
| | 290 | 120 | 280 | 125 |
| | 265 | 125 | 250 | 110 |
| | 275 | 90 | 285 | 90 |
| | 275 | 115 | 250 | 105 |
| | 330 | 100 | | |
| Heat | 330 | 110 | 340 | 120 |
| | 280 | 85 | 230 | 100 |
| | 245 | 100 | 270 | 110 |
| | 275 | 105 | 275 | 105 |
| | 340 | 125 | 250 | 95 |
| | 275 | 110 | 285 | 100 |
| | 290 | 85 | 235 | 90 |
| | 235 | 95 | | |

Of interest here is the effect of heating on the amount of juice obtained, but of course the weight of the orange is going to have an effect. Here the weight is called a **covariate**, a variable we need to take into account when looking at the effect of the main factor. Use multiple regression to test for the effect of the heating by also including the initial weight as a covariate in the model.

## Exercise 3

Calculate the adjusted [latex]R^2[/latex] value for the one-way ANOVA model shown in the oxytocin example.

## Exercise 4

The treatments in the potato yield example used in Chapter 20 corresponded to combinations of nitrogen and phosphate levels, as outlined in a table in the Appendix. A two-way analysis of variance broke the total sum of squares (357387 lbs[latex]^2[/latex]) into sums of squares for the phosphate (164872 lbs[latex]^2[/latex]) and nitrogen (77191 lbs[latex]^2[/latex]) factors, and a residual sum of squares (109207 lbs[latex]^2[/latex]). Use these values to complete an ANOVA table and calculate the [latex]P[/latex]-values for each component. Is there evidence of an interaction between phosphate and nitrogen in their effect on potato yield?