21 Multiple Regression
[latex]\newcommand{\pr}[1]{P(#1)} \newcommand{\var}[1]{\mbox{var}(#1)} \newcommand{\mean}[1]{\mbox{E}(#1)} \newcommand{\sd}[1]{\mbox{sd}(#1)} \newcommand{\Binomial}[3]{#1 \sim \mbox{Binomial}(#2,#3)} \newcommand{\Student}[2]{#1 \sim \mbox{Student}(#2)} \newcommand{\Normal}[3]{#1 \sim \mbox{Normal}(#2,#3)} \newcommand{\Poisson}[2]{#1 \sim \mbox{Poisson}(#2)} \newcommand{\se}[1]{\mbox{se}(#1)} \newcommand{\prbig}[1]{P\left(#1\right)} \newcommand{\degc}{$^{\circ}$C}[/latex]
In this chapter we briefly introduce multiple regression as a generalisation of linear regression and analysis of variance, and show how the ideas we have seen can be extended to problems with more variables.
Indicator Variables
We have used the method of least-squares to determine the values of [latex]b_0[/latex] and [latex]b_1[/latex] which make the line
\[ y = b_0 + b_1 x \]
best fit some data, where [latex]y[/latex] and [latex]x[/latex] were continuous variables, such as weight and height. However, we can turn a categorical variable into a continuous variable by defining an indicator variable. For example, we could define
\[ x = \left\{ \begin{array}{ll}
1, & \mbox{ if male} \\
0, & \mbox{ if female} \\
\end{array} \right. \]
This [latex]x[/latex] is also known as a dummy variable. We can then carry out least-squares regression in the usual way.
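As a minimal sketch, here is how such a fit could be carried out in Python with NumPy's least-squares routine. The height values and sex codes below are hypothetical stand-ins, invented purely for illustration; they are not the survey data.

```python
import numpy as np

# Hypothetical heights (cm) with a 0/1 indicator variable for sex.
height = np.array([172.0, 168.5, 181.2, 165.0, 176.3, 170.1])
male = np.array([1, 0, 1, 0, 1, 0])

# Design matrix: a constant column plus the indicator column.
X = np.column_stack([np.ones(len(height)), male])

# Least-squares estimates: b[0] is the mean female height and
# b[1] is the male minus female difference in mean height.
b, *_ = np.linalg.lstsq(X, height, rcond=None)
print(b)
```

Because the only values of [latex]x[/latex] are 0 and 1, the fitted line passes through the two group means, which is why the "slope" in the table below equals the difference between the mean male and female heights.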
Regression summary for height by sex
| | Estimate | SE | T | P |
|---|---|---|---|---|
| Constant | 167.42 | 1.210 | 138.4 | <0.001 |
| Male | 9.64 | 1.608 | 5.99 | <0.001 |
For example, suppose we want to use linear regression to model sex differences in height. The table above shows the results obtained by software, giving the line
\[ y = 167.4 + 9.64 x, \]
where [latex]x[/latex] is the indicator for sex, as above. Compare this result to the pooled [latex]t[/latex] test in Chapter 16. The “slope” of this line, 9.64, is the same as the difference between the mean male and female heights. This makes sense, since moving [latex]x[/latex] from 0 to 1 is essentially asking how much height is added by 1 unit of “maleness”. The answer is 9.64 cm on average.
The [latex]t[/latex] statistic given by regression is (essentially) the same as the [latex]t[/latex] statistic in the pooled two-sample test, and the resulting confidence interval is also the same. It can be shown that this will always be the case, and so regression is a generalisation of the simple [latex]t[/latex] procedures we began with.
Similarly, the table below shows the ANOVA table for the above regression.
ANOVA table for height by sex regression
| Source | df | SS | MS | F | P |
|---|---|---|---|---|---|
| Sex | 1 | 1368 | 1368 | 35.93 | <0.001 |
| Residuals | 58 | 2208 | 38.07 | | |
| Total | 59 | 3576 | | | |
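As a quick arithmetic check of how the regression and ANOVA tables fit together, using only the rounded values printed above:

```python
# The ANOVA F statistic is the square of the regression t statistic,
# and each mean square is its sum of squares divided by its df.
t = 5.99
print(t ** 2)        # about 35.88, matching F = 35.93 up to rounding
print(2208 / 58)     # residual mean square: 38.07
print(1368 / 38.07)  # F statistic: about 35.93
```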
We have now seen close relationships between the [latex]t[/latex] procedures, regression and correlation, and analysis of variance. It is these methods, developed by R. A. Fisher and not yet a century old, that have become the cornerstone of scientific data analysis. If you can appreciate the unity of these ideas then you should be able to approach more advanced methods with confidence.
Multiple Predictors
A table in Chapter 3 gave the time that 20 students could hold their breath, along with the height and sex of each student. The following figure shows the relationship between the time that breath could be held and height. A least-squares line has been fitted, and there seems to be a strong association between the time that breath was held and height. The following table shows the regression results, including the small [latex]P[/latex]-value of 0.001 for the height coefficient.
Regression summary for time breath held by height
| | Estimate | SE | T | P |
|---|---|---|---|---|
| Constant | -181.7 | 59.29 | -3.07 | 0.007 |
| Height | 1.277 | 0.3388 | 3.77 | 0.001 |
However, this analysis is misleading. The figure below shows the same relationship but with different symbols to show males and females. Least-squares lines have been fitted for males and females separately, and there now appears to be little association between the time that breath was held and height. The line in the previous figure was essentially just joining two clouds of points together, giving the spurious association.
How can we detect this effect in our regression? One approach is to fit more than one variable at a time when estimating the time that breath is held. For example, let [latex]x_1[/latex] be the height of a subject and let
\[ x_2 = \left\{ \begin{array}{ll}
1, & \mbox{ if male} \\
0, & \mbox{ if female}
\end{array} \right. \]
be an indicator variable for their sex. We can then use least-squares regression to fit the “line”
\[ y = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_1 x_2. \]
Here [latex]b_1[/latex] will give the effect of height on breath holding, [latex]b_2[/latex] will give the sex difference in breath holding, while [latex]b_3[/latex] will measure the difference in the slopes between the males and females. To see the role of [latex]b_3[/latex], note that if a subject is female then this equation will estimate [latex]y[/latex] by
\[ y = b_0 + b_1 x_1 + b_2 \times 0 + b_3 x_1 \times 0 = b_0 + b_1 x_1. \]
If a subject is male then we have
\[ y = b_0 + b_1 x_1 + b_2 \times 1 + b_3 x_1 \times 1 = (b_0 + b_2) + (b_1 + b_3) x_1. \]
So [latex]b_2[/latex] is the overall change in time for being male, while [latex]b_3[/latex] is the change in the slope of the relationship with height for being male. If [latex]b_3 = 0[/latex] then the male and female lines would be parallel.
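A sketch of fitting this interaction model by least squares in Python follows; the breath-hold times, heights, and sex codes are hypothetical stand-ins for the Chapter 3 data, used only to show the mechanics.

```python
import numpy as np

# Hypothetical data: time breath held (s), height (cm), sex indicator.
time = np.array([35.0, 42.0, 55.0, 61.0, 30.0, 48.0, 58.0, 66.0])
height = np.array([160.0, 165.0, 175.0, 180.0, 158.0, 168.0, 178.0, 185.0])
male = np.array([0, 0, 1, 1, 0, 0, 1, 1])

# Columns: constant, x1 (height), x2 (sex), x1 * x2 (interaction).
X = np.column_stack([np.ones(len(time)), height, male, height * male])
b, *_ = np.linalg.lstsq(X, time, rcond=None)

# Female line: b0 + b1 * x1.  Male line: (b0 + b2) + (b1 + b3) * x1.
print("female intercept and slope:", b[0], b[1])
print("male intercept and slope:  ", b[0] + b[2], b[1] + b[3])
```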
Summary for full multiple regression model
| | Estimate | SE | T | P |
|---|---|---|---|---|
| Constant | 84.33 | 86.88 | 0.97 | 0.346 |
| Male | -41.75 | 125.0 | -0.33 | 0.743 |
| Height | -0.3468 | 0.5178 | -0.67 | 0.513 |
| Male [asciimath]\times[/asciimath] Height | 0.4249 | 0.7158 | 0.59 | 0.561 |
The table above shows the least squares estimates for these four parameters, along with their standard errors. Each of these can be used to test the null hypothesis
\[ H_0: \beta_j = 0 \]
conditional on the other variables being included in the model. For example, here the [latex]R^2[/latex] value is 0.7847, suggesting the overall model does a reasonable job of explaining the variability, yet none of the individual coefficients is significantly different from 0.
In particular, there is no evidence that [latex]\beta_3[/latex] is not 0. This suggests we could try a simpler model where the interaction between sex and height is dropped. That is, we look at the linear relationship
\[ y = b_0 + b_1 x_1 + b_2 x_2, \]
where the variables and coefficients are as before. The following table shows the results of the regression analysis. The [latex]R^2[/latex] value is now 0.7800, only slightly lower than for the full regression model. We now find very strong evidence of an overall sex difference, as you might expect from the previous figure. There is no evidence of a relationship between the time that breath was held and the height of the subject.
Summary for reduced multiple regression model
| | Estimate | SE | T | P |
|---|---|---|---|---|
| Constant | 47.05 | 58.87 | 0.80 | 0.435 |
| Male | 32.37 | 6.327 | 5.12 | <0.001 |
| Height | -0.1244 | 0.3506 | -0.35 | 0.727 |
We can also use analysis of variance for the regression model to assess the effect of each factor on the time subjects can hold their breath. The table below gives the ANOVA table for the full regression model, corresponding to the estimates presented in the previous table for the full model. You will notice that the three components are essentially the same (except that we no longer need an explicit indicator variable) but the [latex]P[/latex]-values for each are different.
ANOVA table for time breath held (s) by sex and height (cm)
| Source | df | SS | MS | F | P |
|---|---|---|---|---|---|
| Sex | 1 | 4685 | 4685 | 57.845 | <0.001 |
| Height | 1 | 10 | 10 | 0.121 | 0.732 |
| Sex [asciimath]\times[/asciimath] Height | 1 | 29 | 29 | 0.352 | 0.561 |
| Residuals | 16 | 1296 | 81 | | |
| Total | 19 | 6020 | | | |
For example, the sum of squares for Sex (4685 s[latex]^2[/latex]) is a measure of the variability in the response explained by knowing the sex of a subject. This value is quite high and so the [latex]P[/latex]-value for the Sex component is very significant. This is in contrast to the [latex]P[/latex]-value of 0.743 in the table of the full model which assessed the significance of Sex given that all the other variables were in the model.
The sum of squares for Height (10 s[latex]^2[/latex]) in the previous table is then a measure of how much more variability is explained by knowing the height of a subject, given that you already know their sex. We have already seen that there is little association with height once sex is taken into account, as in the previous figure, so this small value is not surprising.
Lastly, the sum of squares for Sex [latex]\times[/latex] Height (29 s[latex]^2[/latex]) is a measure of how much more variability is explained by knowing the combination of sex and height for a subject, given that you already have the other two variables in the model. Note that this case is equivalent to the previous [latex]t[/latex] tests since this is the last variable added to the model. We thus get the same [latex]P[/latex]-value of 0.561 in both cases.
These last two [latex]F[/latex] tests are examples of partial F tests, testing for the contribution of an additional variable added to the model. In both cases the null hypothesis was that the new variable does not significantly add to the prediction of the response, given the existing variables in the model. In general this procedure is a useful tool in helping decide which variables to include in a model. Further discussion of model building is beyond the scope of this book but there are many texts on regression analysis, such as Kleinbaum et al. (1988), that give more details on this topic.
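For example, a partial F statistic can be computed directly from the sums of squares in the ANOVA table above. The sketch below uses the rounded table values, and assumes SciPy is available for the tail probability of the F distribution.

```python
from scipy import stats

# Partial F test for adding Sex x Height to a model that already
# contains Sex and Height, using (rounded) values from the ANOVA table.
ss_extra, df_extra = 29.0, 1   # extra sum of squares for the interaction
mse, df_error = 81.0, 16       # residual mean square and its df

F = (ss_extra / df_extra) / mse
p = stats.f.sf(F, df_extra, df_error)
print(F, p)  # about 0.36 and 0.56, matching the table up to rounding
```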
Two-Way Analysis of Variance
The basic idea behind analysis of variance, of separating variance into components due to the effect of a variable and to residual error, can be easily extended to analyse more variables. In this section we look at the special case of a design with two factors.
Oxytocin and Emotions
We saw in Chapter 19 that there was evidence of a difference between the three stimulus events in their effect on plasma oxytocin level. However we also saw earlier in Chapter 19 that there was a difference in basal oxytocin between women in and out of a relationship. Could relationship status also be related to the observed change in oxytocin level?
The figure below shows a dot plot comparing the changes in oxytocin level between single women and those in a relationship. There does not seem to be a relationship effect: even though the mean change was 0.051 pg/mL less for single women, there is a lot of variability in the response. The analysis of variance in the following table shows that, relative to this variability, the difference in the means is not significant. The corresponding [latex]R^2[/latex] value is just 0.0074.
ANOVA table for change in oxytocin level (pg/mL) by relationship status
| Source | df | SS | MS | F | P |
|---|---|---|---|---|---|
| Single | 1 | 0.0155 | 0.0155 | 0.17 | 0.689 |
| Residuals | 22 | 2.0673 | 0.0940 | | |
| Total | 23 | 2.0828 | | | |
However this analysis may be rather unfair on relationship status as a predictor. We already know one reason why there is so much variability in the response — we have demonstrated in Chapter 19 that a lot of the variability ([latex]R^2= 0.7996[/latex]) is explained by the different stimulus events experienced by the women. We repeat the ANOVA table from there in the table below for comparison. Note again that the total sum of squares is the same for both tables since they are modelling the same response.
ANOVA table for change in oxytocin (pg/mL) by stimulus event
| Source | df | SS | MS | F | P |
|---|---|---|---|---|---|
| Group | 2 | 1.6655 | 0.8327 | 41.91 | <0.001 |
| Residuals | 21 | 0.4173 | 0.0199 | | |
| Total | 23 | 2.0828 | | | |
So instead of looking at the effect of relationship status in isolation we should look at both relationship status and stimulus event simultaneously. The figure below shows a trellis plot (Cleveland, 1993; Sarkar, 2008) which separates out these two factors. The relationship effect is still quite small but now it is relative to the much smaller variability within each of the six treatment combinations.
As with one-way analysis of variance, we will assume that the residual variability is constant across the six treatments. We could write the model for the response [latex]y[/latex] as
\[ y = \mu_{rc} + U, \]
where [latex]\mu_{rc}[/latex] is the mean response for the row [latex]r[/latex] and column [latex]c[/latex] combination, and [latex]\Normal{U}{0}{\sigma}[/latex], where [latex]\sigma[/latex] does not depend on [latex]r[/latex] and [latex]c[/latex]. To estimate [latex]\sigma[/latex] we can pool together individual estimates to obtain a combined estimate of the residual variability. The standard deviations within each combination and the corresponding sums of squares are shown in the following table. Pooling these together, we find
\[ \mbox{SSE } = 0.4003, \]
and
\[ \mbox{DFE } = 6 \times 3 = 18. \]
Standard deviations within treatment combinations
| Single | Stimulus | [asciimath]s[/asciimath] | df | SS |
|---|---|---|---|---|
| No | Happy | 0.1268 | 3 | 0.0482 |
| | Massage | 0.1360 | 3 | 0.0555 |
| | Sad | 0.1737 | 3 | 0.0905 |
| Yes | Happy | 0.0250 | 3 | 0.0019 |
| | Massage | 0.2426 | 3 | 0.1766 |
| | Sad | 0.0960 | 3 | 0.0277 |
| Total | | | 18 | 0.4003 |
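The pooling itself is just a weighted combination of the squared standard deviations; a short sketch using the values in the table above:

```python
# Pool the within-cell standard deviations into a combined residual
# sum of squares: SS = df * s^2 within each cell, then add them up.
sds = [0.1268, 0.1360, 0.1737, 0.0250, 0.2426, 0.0960]
df_per_cell = 3  # four observations in each of the six cells

sse = sum(df_per_cell * s ** 2 for s in sds)
dfe = df_per_cell * len(sds)
print(sse, dfe)  # about 0.4003 and 18
```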
The total degrees of freedom are [latex]24 - 1 = 23[/latex] with sum of squares 2.0828. From the earlier table we have 1 degree of freedom with sum of squares 0.0155 for the relationship status while from the next table we have 2 degrees of freedom with sum of squares 1.6655 for the stimulus event. Combining these with the residual sum of squares we have a total of
\[ 0.0155 + 1.6655 + 0.4003 = 2.0813 \]
for the sum of squares from our model components, with degrees of freedom
\[ 1 + 2 + 18 = 21. \]
In both cases we are missing something from the total variability in the observed oxytocin changes.
This missing component comes from a source of variability due to a possible interaction in the effects of the stimulus events and the relationship status on the change in oxytocin level. We can quantify the significance of an interaction effect using analysis of variance. By difference, the sum of squares for the interaction effect is 2.0828 – 2.0813 = 0.0015, with 2 degrees of freedom.
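The whole decomposition can be checked with a few lines of arithmetic; this sketch reproduces the interaction sum of squares by difference and the F statistics in the table that follows, using the quantities derived above.

```python
# Two-way ANOVA decomposition from the quantities in the text.
ss_total, df_total = 2.0828, 23
ss_single, df_single = 0.0155, 1
ss_group, df_group = 1.6655, 2
ss_resid, df_resid = 0.4003, 18

# Interaction sum of squares and df are found by difference.
ss_inter = ss_total - (ss_single + ss_group + ss_resid)
df_inter = df_total - (df_single + df_group + df_resid)

# F statistics relative to the pooled residual mean square.
mse = ss_resid / df_resid
print(ss_inter, df_inter)           # about 0.0015 and 2
print(ss_single / df_single / mse)  # about 0.70
print(ss_group / df_group / mse)    # about 37.4
print(ss_inter / df_inter / mse)    # about 0.03
```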
The table below shows the full ANOVA table for this two-way analysis of variance. Note that the sum of squares for the relationship effect is the same as before but the [latex]F[/latex] statistic is now calculated relative to the much smaller residual error, giving a higher [latex]F[/latex] value and correspondingly lower [latex]P[/latex]-value. However we still find inconclusive evidence for a difference in oxytocin response between women in and out of a relationship.
ANOVA table for change in oxytocin (pg/mL) by stimulus event and relationship status
| Source | df | SS | MS | F | P |
|---|---|---|---|---|---|
| Single | 1 | 0.0155 | 0.0155 | 0.70 | 0.415 |
| Group | 2 | 1.6655 | 0.8327 | 37.44 | <0.001 |
| Single [asciimath]\times[/asciimath] Group | 2 | 0.0015 | 0.0007 | 0.03 | 0.968 |
| Residuals | 18 | 0.4003 | 0.0222 | | |
| Total | 23 | 2.0828 | | | |
It is important to note that the calculations involved in two-way analysis of variance are only valid if there are the same number of observations in each cell. Such an experimental design is called balanced, for obvious reasons. An unbalanced design, where some cells have different numbers of observations, cannot be analysed directly in this way. There are alternatives, such as replacing each cell by its mean, to give a trivially balanced design. However, the simplest approach is to use multiple regression techniques, as outlined in the following example.
Intestinal Worms and Native Rat Species
A table in Chapter 16 gives data from a study into worms in native rats. In that section we used log transforms to show that there was a significant difference in worm counts in the small intestine between those rats which had been treated and those which were in a control group. We will return to this data now and look to see if there is any difference between the two species of native rat in the study.
Since we already know that the treatment has a strong effect on worm counts, it is good to keep this factor in the analysis. Spurious results may be seen between the species otherwise, particularly if there is an interaction between the treatment and the species. With two factors, a two-way ANOVA may be appropriate. However, we don't have the same number of observations in each of the four combinations of treatment and species. This means it is an unbalanced design and so we cannot use the simple method discussed in the above oxytocin example. Instead we will use a regression model for the data.
Both explanatory variables are categorical factors and so we start by defining indicator variables for each of these. Firstly, let
\[ x_1 = \left\{ \begin{array}{ll}
1, & \mbox{ if control} \\
0, & \mbox{ if treated,} \\
\end{array} \right. \]
and then let
\[ x_2 = \left\{ \begin{array}{ll}
1, & \mbox{ if Melomys} \\
0, & \mbox{ if Rattus.} \\
\end{array} \right. \]
We then use least-squares regression to fit the linear relationship
\[ \log(y) = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_1 x_2. \]
Here [latex]b_1[/latex] will give the effect of not being treated, [latex]b_2[/latex] will give the effect of being Melomys over Rattus, while [latex]b_3[/latex] will measure the interaction between these. As before we will use the log response since the observations are skewed.
Regression summary for transformed worm data
| | Estimate | SE | T | P |
|---|---|---|---|---|
| Constant | 0.547 | 0.1743 | 3.14 | 0.005 |
| Control | 1.596 | 0.2465 | 6.48 | <0.001 |
| Melomys | 0.309 | 0.2812 | 1.10 | 0.283 |
| Control [asciimath]\times[/asciimath] Melomys | -1.270 | 0.3975 | -3.20 | 0.004 |
The table above shows the usual regression summary, except now there are three variables plus a constant. The [latex]P[/latex]-value for [latex]b_1[/latex] suggests a strong treatment effect, with a 95% confidence interval of (1.08, 2.11). This is an estimate, in the log scale, of how many more worms there are for untreated rats. When we just looked at treatment alone in Chapter 16 we found the interval (0.620, 1.596). Adding the extra variables has allowed us to get a better estimate of the residual variability by removing the variability attributable to the presence of the different species.
There is inconclusive evidence of a species effect by itself ([latex]p=0.283[/latex]). However, there is strong evidence of an interaction effect. When [latex]x_1[/latex] and [latex]x_2[/latex] are both 1 (the only way that [latex]x_1 x_2[/latex] can be nonzero) there is a large decrease (1.270 in the log scale) in the predicted worm counts. This corresponds to the Melomys rats which haven't been treated, suggesting that the treatment has less effect on that species. This effect can be seen in the interaction effects plot below.
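One way to see the interaction is to reconstruct the four predicted log counts from the fitted coefficients; the sketch below uses the estimates from the regression summary above.

```python
# Predicted log worm counts for each treatment-species combination,
# built from the coefficient estimates in the regression summary.
b0, b1, b2, b3 = 0.547, 1.596, 0.309, -1.270

combos = [(0, 0, "treated Rattus"), (1, 0, "control Rattus"),
          (0, 1, "treated Melomys"), (1, 1, "control Melomys")]
for x1, x2, label in combos:
    log_y = b0 + b1 * x1 + b2 * x2 + b3 * x1 * x2
    print(f"{label:16s} {log_y:.3f}")

# The control-minus-treated gap is 1.596 for Rattus but only
# 1.596 - 1.270 = 0.326 for Melomys: a much weaker treatment effect.
```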
Finally, the table below shows an analysis of variance for this regression. Note that the model degrees of freedom are 3 because there are three terms, besides the constant, in our model. The [latex]R^2[/latex] value here is 0.6750, compared to 0.4847 when only treatment is considered.
ANOVA table for model of transformed worm data
| Source | df | SS | MS | F | P |
|---|---|---|---|---|---|
| Group | 1 | 7.977 | 7.977 | 32.81 | <0.001 |
| Species | 1 | 0.652 | 0.652 | 2.68 | 0.116 |
| Group [asciimath]\times[/asciimath] Species | 1 | 2.481 | 2.481 | 10.21 | 0.004 |
| Residuals | 22 | 5.348 | 0.243 | | |
| Total | 25 | 16.459 | | | |
A Very Simple Model
Suppose we wanted to model the increase in pulse rate of the 10 students in the Chapter 2 example who drank the caffeinated cola. A very simple model would be
\[ y = b_0. \]
This model has no explanatory variables, just the constant [latex]b_0[/latex]. Yet we can still use the method of least-squares to determine the estimate for [latex]b_0[/latex] that minimises the sum of the squared prediction errors. That is, we want to find [latex]b_0[/latex] to minimise
\[ \sum (y_j - b_0)^2. \]
Software was used to do this, giving the regression output in the following table. The estimate of [latex]b_0[/latex] is 15.80 bpm. This is just the sample mean, [latex]\overline{x}[/latex]. In a way, our whole discussion so far has really been about least-squares estimation. The sample mean is just the value that minimises the sum of squared prediction errors for a single variable.
Regression summary for simple pulse rate model
| | Estimate | SE | T | P | CI |
|---|---|---|---|---|---|
| [asciimath]b_0[/asciimath] | 15.80 | 2.632 | 6.00 | 0.0002 | (9.85, 21.75) |
The above table also gives the standard error and a 95% confidence interval for [latex]b_0[/latex]. Compare these with the values found in Chapter 14. They are the same as the standard error and confidence interval for the sample mean.
Finally, the table below shows an ANOVA table for this model. The residual and total values are the same since there is no variable to help explain the variability observed. The square root of MSE is [latex]\sqrt{69.29} = 8.324[/latex] bpm, and this is just the sample standard deviation of the increases.
ANOVA table for simple pulse rate model
| Source | df | SS | MS | F | P |
|---|---|---|---|---|---|
| Residuals | 9 | 623.6 | 69.29 | | |
| Total | 9 | 623.6 | | | |
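A sketch confirming both facts with hypothetical data (ten values invented to have mean 15.8 bpm, not the Chapter 2 measurements): the least-squares constant is the sample mean, and the root of the residual mean square is the sample standard deviation.

```python
import numpy as np

# Hypothetical pulse-rate increases (bpm) for ten students.
y = np.array([4.0, 8.0, 10.0, 12.0, 14.0, 18.0, 18.0, 20.0, 24.0, 30.0])

# Least-squares fit of the constant-only model y = b0.
X = np.ones((len(y), 1))
b, *_ = np.linalg.lstsq(X, y, rcond=None)
b0 = b[0]

sse = np.sum((y - b0) ** 2)  # residual sum of squares
mse = sse / (len(y) - 1)     # nine residual degrees of freedom

print(np.isclose(b0, y.mean()))                 # True: b0 is the sample mean
print(np.isclose(np.sqrt(mse), y.std(ddof=1)))  # True: root MSE is the sd
```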
R-Squared Revisited
In this chapter we have looked at a variety of models for a quantitative response variable based on one or more predictor variables. The coefficient of determination, [latex]R^2[/latex], has been a useful measure for comparing these models. For example, the relationship status model for change in oxytocin in the previous section only had [latex]R^2 = 0.0074[/latex], compared to the much more impressive value of [latex]R^2 = 0.7996[/latex] for the model based on the actual stimulus event.
However there are some problems with using [latex]R^2[/latex] to compare models in this way. Consider the ANOVA results in the oxytocin example where we included relationship status and stimulus event, along with a possible interaction between them, in our model. Even after taking the stimulus event effect into account we still didn't find any evidence of a relationship effect, so a model containing relationship status is not going to be useful in practice. Yet the [latex]R^2[/latex] value for this model is
\[ R^2 = \frac{0.0155 + 1.6655 + 0.0015}{2.0828} = 0.8078, \]
higher than for the better model that just has the stimulus event as predictor. The reason for this is that [latex]R^2[/latex] measures how much of the variability is explained by a model, so adding more terms to the model can only make it go up.
If we want to use [latex]R^2[/latex] to compare models with different numbers of predictor variables then we need to take the number of variables into account. The standard method for doing this gives the adjusted R[latex]^2[/latex] value
\[ \overline{R}^2 = 1 - (1 - R^2)\frac{\mbox{DFT}}{\mbox{DFE}}. \]
If the model has lots of terms then [latex]\mbox{DFE}[/latex] will be much less than [latex]\mbox{DFT}[/latex], increasing the weight on [latex]1 - R^2[/latex] and thus decreasing the value of [latex]\overline{R}^2[/latex] (Triola & Triola, 2006). For example, the adjusted [latex]R^2[/latex] value for the model based only on the stimulus event is
\[ \overline{R}^2 = 1 - (1 - 0.7996)\frac{23}{21} = 0.7805. \]
In contrast, the adjusted [latex]R^2[/latex] value for the full interaction model in the oxytocin example is
\[ \overline{R}^2 = 1 - (1 - 0.8078)\frac{23}{18} = 0.7544. \]
Thus, based on the adjusted [latex]R^2[/latex] values, we would say that the model for change in oxytocin level based on stimulus event alone is better than the model that adds relationship status as well.
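The adjustment is easy to code; a small sketch (the helper name is ours) verifying the two values above:

```python
def adjusted_r2(r2, df_total, df_error):
    """Adjusted R-squared: 1 - (1 - R^2) * DFT / DFE."""
    return 1 - (1 - r2) * df_total / df_error

print(adjusted_r2(0.7996, 23, 21))  # stimulus event only: about 0.7805
print(adjusted_r2(0.8078, 23, 18))  # full interaction model: about 0.7544
```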
The downside of the adjusted [latex]R^2[/latex] is that it doesn't have the same interpretation as a proportion that the original [latex]R^2[/latex] has, since it no longer needs to lie between 0 and 1. Exercise 3 gives an example of this in practice.
Summary
- Multiple regression is a generalisation of least-squares regression to multiple predictor variables. Least-squares fitting is used to estimate coefficients in the multiple regression model.
- Indicator variables can be used to include categorical predictors in the regression model.
- Two-way analysis of variance partitions the total variability in a response due to two categorical factors.
- Two-way analysis of variance can help detect an interaction effect between two factors on the response.
Exercise 1
Based on the survey data, is the slope of a linear relationship between height and forearm length the same for males and females? You can test this using multiple regression by making an indicator variable for sex. Fit a model for height with a constant term, forearm length, sex, and a forearm length [latex]\times[/latex] sex interaction. Look at the coefficient estimates and their associated tests. Discuss the results.
Exercise 2
A student was interested in whether you could obtain more juice from an orange by warming it first. After weighing 30 Navel oranges to the nearest 5 g, they were divided into two groups with 15 microwaved (at 600 W) for one minute prior to juicing. Each orange was cut in half and juiced immediately, using a manual juice press, and the amount of juice was recorded to the nearest 5 mL. The results are given in the table below.
Juice (mL) from oranges by weight (g) and heating
| Treatment | Weight | Juice | Weight | Juice |
|---|---|---|---|---|
| None | 230 | 105 | 260 | 80 |
| | 250 | 120 | 285 | 105 |
| | 310 | 125 | 245 | 115 |
| | 290 | 120 | 280 | 125 |
| | 265 | 125 | 250 | 110 |
| | 275 | 90 | 285 | 90 |
| | 275 | 115 | 250 | 105 |
| | 330 | 100 | | |
| Heat | 330 | 110 | 340 | 120 |
| | 280 | 85 | 230 | 100 |
| | 245 | 100 | 270 | 110 |
| | 275 | 105 | 275 | 105 |
| | 340 | 125 | 250 | 95 |
| | 275 | 110 | 285 | 100 |
| | 290 | 85 | 235 | 90 |
| | 235 | 95 | | |
Of interest here is the effect of heating on the amount of juice obtained, but of course the weight of the orange is going to have an effect. Here the weight is called a covariate, a variable we need to take into account when looking at the effect of the main factor. Use multiple regression to test for the effect of the heating by also including the initial weight as a covariate in the model.
Exercise 3
Calculate the adjusted [latex]R^2[/latex] value for the one-way ANOVA model shown in the oxytocin example.
Exercise 4
The treatments in the potato yield example used in Chapter 20 corresponded to combinations of nitrogen and phosphate levels, as outlined in a table in the Appendix. A two-way analysis of variance broke the total sum of squares (357387 lbs[latex]^2[/latex]) into sums of squares for the phosphate (164872 lbs[latex]^2[/latex]) and nitrogen (77191 lbs[latex]^2[/latex]) factors, and a residual sum of squares (109207 lbs[latex]^2[/latex]). Use these values to complete an ANOVA table and calculate the [latex]P[/latex]-values for each component. Is there evidence of an interaction between phosphate and nitrogen in their effect on potato yield?