# 18 Inferences for Regression

[latex]\newcommand{\pr}[1]{P(#1)} \newcommand{\var}[1]{\mbox{var}(#1)} \newcommand{\mean}[1]{\mbox{E}(#1)} \newcommand{\sd}[1]{\mbox{sd}(#1)} \newcommand{\Binomial}[3]{#1 \sim \mbox{Binomial}(#2,#3)} \newcommand{\Student}[2]{#1 \sim \mbox{Student}(#2)} \newcommand{\Normal}[3]{#1 \sim \mbox{Normal}(#2,#3)} \newcommand{\Poisson}[2]{#1 \sim \mbox{Poisson}(#2)} \newcommand{\se}[1]{\mbox{se}(#1)} \newcommand{\prbig}[1]{P\left(#1\right)}[/latex]

# Statistical Models

We saw in Chapter 7 that least-squares fitting could be used to summarise a linear relationship that we saw in data. Fitting the line gave us an estimate of the **intercept**, [latex]b_0[/latex], and **slope**, [latex]b_1[/latex], for the line. It should be clear by now that we would not be happy to report [latex]b_0[/latex] and [latex]b_1[/latex] by themselves. We would like to be able to give some idea about how precise they are as estimates.

The first question to ask is what are [latex]b_0[/latex] and [latex]b_1[/latex] estimating? We will imagine that there really is a true linear relationship between the response variable, [latex]y[/latex], and the explanatory variable, [latex]x[/latex], and that the line for this relationship gives the mean response, [latex]\mu_y[/latex], for a particular value of [latex]x[/latex]. That is, if [latex]\beta_0[/latex] and [latex]\beta_1[/latex] are the intercept and slope of this **population** line, then

\[ \mu_y = \beta_0 + \beta_1 x. \]

An actual response we see is this average together with some natural variability in the response. We will assume that this variability has a Normal distribution and that the amount of variability does not depend on [latex]x[/latex]. That is, the response is

\[ Y = \beta_0 + \beta_1 x + U, \]

where [latex]\Normal{U}{0}{\sigma}[/latex], and [latex]\sigma[/latex] is constant for all [latex]x[/latex]. This is known as the linear regression model. The requirement that the standard deviation [latex]\sigma[/latex] is constant for all values of [latex]x[/latex] is known as the **homoscedasticity** assumption. We will return to the problem of checking this assumption, as well as the assumptions of linearity and Normality, later in this chapter.
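To make the model concrete, it can be simulated directly. The following sketch (in Python, with hypothetical values for [latex]\beta_0[/latex], [latex]\beta_1[/latex] and [latex]\sigma[/latex]) draws responses by adding Normal noise with constant standard deviation to the population line:

```python
import random

random.seed(1)

# Hypothetical population parameters for illustration only
beta0, beta1, sigma = 2.0, 0.5, 1.0

def simulate_response(x):
    """One draw from Y = beta0 + beta1*x + U, where U ~ Normal(0, sigma)."""
    return beta0 + beta1 * x + random.gauss(0, sigma)

xs = [float(x) for x in range(20)]
ys = [simulate_response(x) for x in xs]
```

Plotting `xs` against `ys` would show a linear trend surrounded by a band of scatter of roughly constant width, which is exactly what the homoscedasticity assumption describes.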

When we take a random sample of [latex]n[/latex] subjects we obtain values of [latex]x[/latex] and then values of [latex]y[/latex] from this relationship. There are actually two different types of [latex]x[/latex] variable. It might be a **fixed factor**, where it has been controlled, such as the microwave times used for the experimental data seen in Chapter 6. Alternatively, it may be a **random factor**, such as height, which has its own natural variability. In most of our discussion these can be treated in the same way, but when making predictions (see the section on prediction intervals) there are naturally going to be extra issues if you do not have precision in your explanatory values.

Each time we take a random sample we will obtain new values of [latex]b_0[/latex] and [latex]b_1[/latex], so we can think of them as coming from random processes [latex]B_0[/latex] and [latex]B_1[/latex], respectively. These random variables have sampling distributions, just as the sample mean and sample proportion did. It turns out that both of these distributions are Normal, provided that the above assumption of Normal response variability holds. Their expected values are

\[ \mean{B_0} = \beta_0 \mbox{ and } \mean{B_1} = \beta_1, \]

so both are **unbiased**. Their standard deviations are

\[ \sd{B_0} = \sigma \sqrt{\frac{1}{n} + \frac{\overline{x}^2}{\sum(x_j - \overline{x})^2}}, \]

and

\[ \sd{B_1} = \frac{\sigma}{\sqrt{\sum(x_j - \overline{x})^2}}, \]

where the [latex]\sigma[/latex] in each is the Normal variability in the response. Note that there will be less variability in the slope estimate if the sum of squared deviations of the [latex]x[/latex] values is larger. This is just a measure of the spread of the [latex]x[/latex] values, and so the slope will be more accurate if the [latex]x[/latex] values are more spread out. Draw yourself pictures to see why. Since the sum of squared deviations is not averaged, it will also be bigger if the sample size is larger, another way to improve precision.

If we knew [latex]\sigma[/latex] then we could use these standard deviations to make confidence intervals and carry out tests. Of course, we don’t know it in practice but we can estimate it using the same idea we used for the sample standard deviation, taking squared deviations from the sample mean.

The difference in this case is that we believe the mean is changing with [latex]x[/latex]. However, we can use our least squares line,

\[ \hat{y} = b_0 + b_1 x, \]

to estimate this mean for each value of [latex]x[/latex] in the data. So a pair [latex](x_j, y_j)[/latex] in the data gives a **residual** deviation

\[ e_j = y_j - \hat{y}_j = y_j - (b_0 + b_1 x_j). \]

We call this a **prediction error**, and we saw in Chapter 7 that the least-squares criterion minimises the sum of the squares of these prediction errors. We use this sum of squared deviations here to estimate [latex]\sigma[/latex]. We average by its degrees of freedom but instead of [latex]n-1[/latex], as in the one-sample case, the degrees of freedom are now [latex]n-2[/latex]. This comes from having to estimate a slope and an intercept before we can measure variability. Thus our estimate of [latex]\sigma[/latex] is

\[ s_U = \sqrt{\frac{\sum e_j^2}{n-2}}. \]

We call this the **residual standard error**, and use the subscript [latex]U[/latex] to distinguish it from the sample standard deviation. This is a terrible statistic to calculate by hand, because you have to plug each [latex]x[/latex] value into the line equation and then take differences, square them, and add them up. We will assume all of this is done with a computer. However, it is useful to see the form of the formulas for standard deviations and standard errors, as discussed above.
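Although the arithmetic is tedious by hand, it is straightforward to script. A minimal sketch in Python, using a small made-up data set, fits the least-squares line and then computes the residual standard error with [latex]n-2[/latex] degrees of freedom:

```python
from math import sqrt

# Hypothetical (x, y) pairs for illustration only
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]
n = len(x)

xbar = sum(x) / n
ybar = sum(y) / n

# Least-squares slope and intercept
sxx = sum((xj - xbar) ** 2 for xj in x)
sxy = sum((xj - xbar) * (yj - ybar) for xj, yj in zip(x, y))
b1 = sxy / sxx
b0 = ybar - b1 * xbar

# Residuals e_j = y_j - (b0 + b1*x_j) and the residual standard error
residuals = [yj - (b0 + b1 * xj) for xj, yj in zip(x, y)]
s_u = sqrt(sum(e ** 2 for e in residuals) / (n - 2))
```

Note that the residuals always sum to zero for a least-squares fit, which is a useful sanity check on any implementation.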

# Inferences for the Slope

The intercept of the line gives the estimated response for [latex]x = 0[/latex]. This is occasionally of interest, but far more frequently we are interested in the slope. If [latex]\beta_1 = 0[/latex] in the population line then there is no association between the response and explanatory variables. A standard hypothesis test is then to consider [latex]H_0: \beta_1 = 0[/latex] and see if there is any evidence against it.

Oxytocin and Age

In Chapter 7 we found a least-squares fit for the relationship between basal plasma oxytocin level and subject age to be

\[ \mbox{Basal} = 4.98 - 0.0097 \; \mbox{Age}. \]

Computer software gives [latex]s_U[/latex] = 0.312 pg/mL for this fit, together with further output shown in the table below.

## Regression summary for oxytocin level (pg/mL) by age (years)

 | Estimate | SE | T | P |
---|---|---|---|---|
Constant | 4.98 | 0.1493 | 33.38 | < 0.001 |
Age | -0.0097 | 0.003436 | -2.83 | 0.0098 |

The “Constant” row refers to the intercept estimate while the “Age” row refers to the slope estimate, the coefficient of “Age” in the regression equation. For this slope we see that its standard error was 0.003436 pg/mL/year so that a [latex]t[/latex] test of [latex]H_0: \beta_1 = 0[/latex] would use the statistic

\[ t_{22} = \frac{-0.0097 - 0}{0.003436} = -2.83. \]

Here we have 24 - 2 = 22 degrees of freedom. The corresponding two-sided [latex]P[/latex]-value is 0.0098, giving strong evidence that the slope is not 0 and thus that plasma oxytocin level depends on age.

Knowing the standard errors for the estimates given in this table, we can also construct confidence intervals for the regression parameters. For example, a 95% confidence interval for the slope would use the critical value [latex]t_{22}^{*} = 2.074[/latex], giving a range of

\[ -0.0097 \pm 2.074 \times 0.003436 = -0.0097 \pm 0.0071, \]

or (-0.0168, -0.0026) pg/mL/year. This means that we are 95% confident that the mean oxytocin level of women drops by 0.0026 to 0.0168 pg/mL per year of age.
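These calculations can be reproduced directly from the table output. A quick sketch in Python, taking the slope estimate, its standard error, and the critical value [latex]t_{22}^{*} = 2.074[/latex] straight from the chapter:

```python
b1 = -0.0097        # slope estimate from the regression table
se_b1 = 0.003436    # its standard error
t_star = 2.074      # 95% critical value for 22 degrees of freedom

# Test statistic for H0: beta1 = 0
t = (b1 - 0) / se_b1

# 95% confidence interval for the slope
ci = (b1 - t_star * se_b1, b1 + t_star * se_b1)
```

The statistic comes out at about -2.82 rather than the reported -2.83 only because the printed slope estimate has been rounded; software uses the unrounded value.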

# Confidence and Prediction Intervals

Suppose we want to estimate the average basal oxytocin level, [latex]\mu_y[/latex], of all women who are 40 years old. The estimate itself comes from the equation of the line,

\[ \hat{\mu}_y = 4.98 - 0.0097 \times 40 = 4.592 \mbox{ pg/mL}. \]

How accurate is this estimate? Its accuracy depends on the accuracy of [latex]b_0[/latex] and [latex]b_1[/latex], so we can calculate the standard deviation of this new estimate using the standard deviations of these existing estimates. We find

\[ \se{\hat{\mu}_y} = s_U \sqrt{\frac{1}{n} + \frac{(x^{*} - \overline{x})^2}{\sum(x_j - \overline{x})^2}}, \]

where here [latex]x^{*} = 40[/latex]. Note from this that the estimate will be more precise if it is closer to [latex]\overline{x}[/latex], in this case 39.29 years. This makes intuitive sense because estimates about responses a long way from the typical explanatory value are naturally going to be less reliable.

For [latex]x^{*} = 40[/latex] we find [latex]\se{\hat{\mu}_y} = 0.0637[/latex] pg/mL. A 95% confidence interval uses [latex]t_{22}^{*} = 2.074[/latex], giving

\[ 4.592 \pm 2.074 \times 0.0637 = 4.592 \pm 0.132 \mbox{ pg/mL}. \]

Thus we can be reasonably precise in our estimate of the **mean** oxytocin level. The figure below shows the original regression line with 95% confidence **bands**. These bands are formed by plotting the limits of the confidence intervals as you move along the plot. Note that they are most narrow around the mean age but get progressively wider as you move away.

Suppose though we want to predict the oxytocin level, [latex]\hat{y}[/latex], of an **individual** who is 40 years old. Our estimate is the same, 4.592 pg/mL, but now in addition to the variability of the estimate we also need to account for the natural variability in oxytocin levels. This is estimated by [latex]s_U[/latex] and so the standard error of [latex]\hat{y}[/latex] adds an extra ‘1’ under the square root, giving

\[ \se{y - \hat{y}} = s_U \sqrt{1 + \frac{1}{n} + \frac{(x^{*} - \overline{x})^2}{\sum(x_j - \overline{x})^2}}. \]

Here [latex]s_U = 0.3117[/latex] and for [latex]x^{*} = 40[/latex] we find [latex]\se{y - \hat{y}} = 0.3181[/latex]. This gives the 95% **prediction interval**

\[ 4.592 \pm 2.074 \times 0.3181 = 4.592 \pm 0.660 \mbox{ pg/mL}, \]

so a woman who is 40 years old could have a basal oxytocin level anywhere from about 3.932 pg/mL to 5.252 pg/mL, quite a large range. However, a 95% prediction interval for the oxytocin level without knowing the woman’s age is [latex]4.6 \pm 0.752 \mbox{ pg/mL}[/latex], so we are better off knowing her age since it reduces the margin of error for the prediction.
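The two intervals at [latex]x^{*} = 40[/latex] differ only in their standard errors. A brief sketch in Python, reusing the values quoted above:

```python
t_star = 2.074                  # 95% critical value, 22 degrees of freedom
mu_hat = 4.98 - 0.0097 * 40     # estimate from the fitted line, 4.592 pg/mL

se_mean = 0.0637    # standard error for the mean response at age 40
se_pred = 0.3181    # standard error for an individual prediction at age 40

# Confidence interval for the mean, prediction interval for an individual
ci = (mu_hat - t_star * se_mean, mu_hat + t_star * se_mean)
pi = (mu_hat - t_star * se_pred, mu_hat + t_star * se_pred)
```

The point estimate is the same for both; only the margin of error changes, with the prediction interval roughly five times wider here because it must also absorb the residual variability [latex]s_U[/latex].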

The following figure shows the regression line with 95% prediction bands, instead of the confidence bands shown in the previous figure. These are obviously wider than the confidence bands. They also get wider as you move away from the mean age, but this effect is less noticeable relative to the residual variability in oxytocin levels.

# Checking Assumptions

In addition to the usual requirement of independence, the key assumptions for linear regression are that the underlying relationship is linear and that the residual variability is Normal with constant standard deviation. The best way of checking these three assumptions is to look at the residuals after fitting the least-squares line.

To check linearity, make a plot of the residuals against the explanatory variable, as in the figure above. Since the straight line has been subtracted by calculating the residuals, if the relationship was linear then there should be no obvious pattern left over. In the relationship of oxytocin by age, there does seem to be some pattern present in this residual plot. There still appears to be a positive trend and this is probably the result of a poor fit caused by some of the unusual values. We would proceed by looking at the fit without certain values and then deciding whether those values should remain when making our conclusions.

A plot of residuals like that in the previous figure also allows you to assess whether the variability is the same across the explanatory variable. Here it seems fairly constant.

An alternative plot for checking the residuals is the **fitted value plot** shown in the following figure. Instead of using the explanatory variable, Age, here we plot the residuals against the predicted values from the straight line equation. Note that since this is a linear transformation of Age it does not change the overall pattern of points (except that the direction of the axis is reversed, since the coefficient of Age was negative), so it provides nothing new. However, the advantage of plotting against the fitted values is that this plot can still be used when we have multiple explanatory variables, as in Chapter 21.

Finally, the following two figures show two pictures of the distribution of the residuals, a density plot and a Normal probability plot. These both suggest that the residuals are reasonably Normal.

# Inferences for Correlation

## Hypothesis Tests

Correlation is best used as a descriptive statistic, rather than a statistic with which to make decisions. Correlations of 0.2 can be significantly different from 0 in the statistical sense but if you look at the examples in this figure from Chapter 7 you will see that a value of 0.2 is probably of little practical interest.

However, a test for correlation is included to try and convince you that once you understand the basic ideas of estimates and standard errors, then it is easy to apply these methods to new settings. When looking at confidence intervals, in the next section, we also see another application of transformations.

A standard test for the correlation coefficient is based on the null hypothesis [latex]H_0: \rho = 0[/latex], involving the **population correlation** coefficient, [latex]\rho[/latex], the Greek letter ‘r’. That is, we assume the correlation coefficient for the population from which the sample was drawn is 0. We estimate [latex]\rho[/latex] by [latex]r[/latex]. All we need now is the standard error of [latex]r[/latex] and its distribution. The standard error is

\[ \se{r} = \sqrt{\frac{1 - r^2}{n - 2}}, \]

and this gives a [latex]t[/latex] statistic

\[ t_{n-2} = \frac{r - 0}{\sqrt{\frac{1 - r^2}{n - 2}}}, \]

which is compared to the [latex]t[/latex] distribution with [latex]n-2[/latex] degrees of freedom.

Oxytocin and Age

The correlation between the basal oxytocin level and age for the data in the original oxytocin example is [latex]r = -0.5165[/latex] from [latex]n = 24[/latex] observations. To test the null hypothesis [latex]H_0: \rho = 0[/latex] against the alternative [latex]H_1: \rho \ne 0[/latex], we calculate the [latex]t[/latex] statistic

\[ t_{22} = \frac{-0.5165 - 0}{\sqrt{\frac{1 - (-0.5165)^2}{22}}} = -2.83. \]

This is exactly the same value, with the same degrees of freedom, that we found earlier in this chapter when testing the hypothesis [latex]H_0: \beta_1 = 0[/latex] using the least-squares estimates. In fact these numbers will always be the same, highlighting the connection between regression and correlation. The conclusions are thus the same as before.
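The equivalence is easy to verify numerically. A short sketch in Python, using the reported [latex]r[/latex] and [latex]n[/latex]:

```python
from math import sqrt

r, n = -0.5165, 24

# Standard error of r under H0: rho = 0
se_r = sqrt((1 - r ** 2) / (n - 2))

# Compared against t with n - 2 = 22 degrees of freedom
t = (r - 0) / se_r

print(round(t, 2))   # -2.83
```

This matches the slope test statistic from the regression table, as promised.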

## Confidence Intervals

Confidence intervals are more complicated since the [latex]t[/latex] distribution above is only valid under the assumption that [latex]\rho = 0[/latex]. When calculating a confidence interval you do not have any hypothesised value to work with, and so an alternative approach is needed.

The problem is that the correlation [latex]r[/latex] can only be between -1 and 1. If the population [latex]\rho[/latex] was 0.99 then possible values of [latex]r[/latex] can only go a little way to the right but can go a long way to the left. Thus the sampling distribution will be skewed and so the [latex]t[/latex] distribution above will not be appropriate when [latex]\rho[/latex] is close to these extremes. Fisher (1915) discovered a transformation which can be used to generate a new statistic that actually has an approximate Normal distribution. Known as **Fisher’s Z transformation**, this is calculated as

\[ z = \mbox{arctanh}(r) = \frac{1}{2} \ln \left(\frac{1 + r}{1 - r}\right), \]

where [latex]\mbox{arctanh}[/latex] is the **inverse hyperbolic tangent**, often denoted by [latex]\mbox{tanh}^{-1}[/latex] on calculators. The standard error of this statistic is

\[ \se{z} = \sqrt{\frac{1}{n-3}}. \]

The inverse transformation, to get from [latex]z[/latex] back to [latex]r[/latex], is

\[ r = \tanh(z) = \frac{e^{2z} - 1}{e^{2z} + 1}, \]

where [latex]\tanh[/latex] is the **hyperbolic tangent** function.

Returning to our example of oxytocin level and age, we can now find a 95% confidence interval for the true correlation based on the sample correlation [latex]r = -0.5165[/latex]. First we calculate

\[ z = \mbox{arctanh}(-0.5165) = -0.5716, \]

with

\[ \se{z} = \sqrt{\frac{1}{21}} = 0.2182. \]

Since [latex]z[/latex] is approximately Normal, a 95% confidence interval in terms of [latex]z[/latex] is

\[ -0.5716 \pm 1.96 \times 0.2182 = -0.5716 \pm 0.4277, \]

a range of -0.9993 to -0.1439. Transforming these endpoints back to correlations gives the 95% confidence interval for [latex]\rho[/latex] as

\[ \left( \mbox{tanh}(-0.9993), \mbox{tanh}(-0.1439) \right) = (-0.7613, -0.1429). \]

Thus we are 95% confident that the true correlation between oxytocin level and age is between about -0.76 and -0.14.
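The whole calculation uses only the hyperbolic tangent and its inverse, both available in Python's standard `math` module. A sketch reproducing the interval above:

```python
from math import atanh, tanh, sqrt

r, n = -0.5165, 24

z = atanh(r)              # Fisher's Z transformation of r
se_z = sqrt(1 / (n - 3))  # standard error of z
z_star = 1.96             # 95% Normal critical value

# Interval on the z scale, then transformed back to correlations
lo, hi = z - z_star * se_z, z + z_star * se_z
ci = (tanh(lo), tanh(hi))

print(round(ci[0], 2), round(ci[1], 2))   # -0.76 -0.14
```

Note that the interval is not symmetric about [latex]r = -0.5165[/latex]; the transformation back through [latex]\tanh[/latex] reintroduces the skewness of the sampling distribution.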

# A Summary of Parametric Methods

The confidence intervals and tests we have discussed so far have all been **parametric methods**. These methods work by using a sample statistic to estimate some parameter of the population. For example, the sample mean [latex]\overline{x}[/latex] is used to estimate the population mean [latex]\mu[/latex] and the sample least-squares slope [latex]b_1[/latex] is used to estimate the population slope [latex]\beta_1[/latex]. This process also involves making an assumption about the distribution of the estimate of the parameter, and so far we have assumed that it is a Normal distribution. The Central Limit Theorem says that this is okay for large enough samples, but we have to be more careful about checking the assumption for smaller samples. We will look at some **nonparametric** procedures in Chapter 24 which avoid having to make the normality assumption.

In essence then we have only really looked at one method of inference. Every confidence interval has had the form

\[ \mbox{estimate} \pm t^{*} \mbox{se(estimate)}, \]

though for proportions we used [latex]z^{*}[/latex] instead of [latex]t^{*}[/latex]. We have used Greek letters to denote population parameters so we will now use the letter [latex]\theta[/latex] to denote an arbitrary parameter (such as [latex]\mu[/latex] or [latex]\beta_1[/latex]).

We can then write the above general confidence interval more mathematically as

\[ \hat{\theta} \pm t^{*} \se{\hat{\theta}}. \]

Similarly, every significance test has had the form

\[ t = \frac{\mbox{estimate} - \mbox{hypothesised}}{\mbox{se(estimate)}}, \]

or, in mathematical notation,

\[ t = \frac{\hat{\theta} - \theta_0}{\se{\hat{\theta}}}, \]

where [latex]\theta_0[/latex] is the value of [latex]\theta[/latex] given by the null hypothesis.

All we need to calculate a confidence interval or a test statistic is the appropriate standard error for the estimate we are using. These are listed in the table below.

## Parameter estimates and their standard errors

Parameter | Estimate | Standard Error |
---|---|---|
[asciimath]\theta[/asciimath] | [asciimath]\hat{\theta}[/asciimath] | [latex]\mathrm{se}({\hat{\theta}})[/latex] |
[asciimath]\mu[/asciimath] | [asciimath]\overline{x}[/asciimath] | [asciimath]\frac{s}{\sqrt{n}}[/asciimath] |
[asciimath]p[/asciimath] | [asciimath]\hat{p}[/asciimath] | [asciimath]\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}[/asciimath] |
[asciimath]\mu_1 - \mu_2[/asciimath] | [asciimath]\overline{x}_1 - \overline{x}_2[/asciimath] | [asciimath]\sqrt{\frac{s_{1}^{2}}{n_1} + \frac{s_{2}^{2}}{n_2}}[/asciimath] |
[asciimath]p_1 - p_2[/asciimath] | [asciimath]\hat{p}_1 - \hat{p}_2[/asciimath] | [asciimath]\sqrt{\frac{\hat{p}_{1}(1 - \hat{p}_{1})}{n_1} + \frac{\hat{p}_{2}(1 - \hat{p}_{2})}{n_2}}[/asciimath] |
[asciimath]\beta_0[/asciimath] | [asciimath]b_0[/asciimath] | [asciimath]s_U \sqrt{\frac{1}{n} + \frac{\overline{x}^2}{\sum(x_j - \overline{x})^2}}[/asciimath] |
[asciimath]\beta_1[/asciimath] | [asciimath]b_1[/asciimath] | [asciimath]\frac{s_U}{\sqrt{\sum(x_j - \overline{x})^2}}[/asciimath] |
[asciimath]\mu_y[/asciimath] | [asciimath]\hat{\mu}_y[/asciimath] | [asciimath]s_U \sqrt{\frac{1}{n} + \frac{(x^{*} - \overline{x})^2}{\sum(x_j - \overline{x})^2}}[/asciimath] |
[asciimath]y[/asciimath] | [asciimath]\hat{y}[/asciimath] | [asciimath]s_U \sqrt{1 + \frac{1}{n} + \frac{(x^{*} - \overline{x})^2}{\sum(x_j - \overline{x})^2}}[/asciimath] |
[asciimath]\rho[/asciimath] | [asciimath]r[/asciimath] | [asciimath]\sqrt{\frac{1 - r^2}{n - 2}}[/asciimath] |

Summary

- Regression analysis involves modelling the response variable in terms of a linear relationship between the mean response and the predictor variable with residual variability about the mean.
- The prediction errors for least-squares fitting are used to estimate the residual standard error, with degrees of freedom [latex]n-2[/latex].
- The standard inference for regression is to see whether the slope is zero or not. A slope of zero indicates no association between the variables while a significant nonzero slope indicates an association.
- Confidence intervals and prediction intervals can be calculated for estimated means and predicted outcomes, respectively, based on the least-squares fit.
- The assumptions of linear regression are that the relationship is linear and that the residual variability is Normally distributed with constant standard deviation. These assumptions should be checked using plots of the residuals.
- Hypothesis tests for the Pearson correlation coefficient give a method of detecting linear association that is equivalent to a regression test of slope.
- Confidence intervals for correlation require a transformation using the hyperbolic tangent function since the sampling distribution may not be symmetric.

Exercise 1

Based on the data given in Exercise 1 of Chapter 3, is there any evidence that the time breath can be held is related to height?

Exercise 2

Repeat the previous question for males and females separately. Does your decision regarding breath holding and height change? If so, why?

Exercise 3

Calculate the Pearson correlation coefficient between height and forearm length for the Islanders in the survey data. Based on the correlation, is there any evidence of an association?

Exercise 4

Calculate a 95% confidence interval for the correlation coefficient between height and forearm length for the Islanders based on the survey data.

Exercise 5

The correlation coefficient for the relationship between total plant biodiversity and the years of organic management for the vineyard data in Exercise 7 of Chapter 16 is 0.7060. Based on this correlation, is there any evidence of an increase in biodiversity over time?