23 Logistic Regression
[latex]\newcommand{\pr}[1]{P(#1)} \newcommand{\var}[1]{\mbox{var}(#1)} \newcommand{\mean}[1]{\mbox{E}(#1)} \newcommand{\sd}[1]{\mbox{sd}(#1)} \newcommand{\Binomial}[3]{#1 \sim \mbox{Binomial}(#2,#3)} \newcommand{\Student}[2]{#1 \sim \mbox{Student}(#2)} \newcommand{\Normal}[3]{#1 \sim \mbox{Normal}(#2,#3)} \newcommand{\Poisson}[2]{#1 \sim \mbox{Poisson}(#2)} \newcommand{\se}[1]{\mbox{se}(#1)} \newcommand{\prbig}[1]{P\left(#1\right)} \newcommand{\degc}{$^{\circ}$C}[/latex]
Logistic Curves
We saw in Chapter 3 that we could look for a relationship between height and sex by making a side-by-side plot of the height values split into two groups. We can use this to predict the height of a person based on knowing whether they are male or female, using the sample mean for the appropriate group as the prediction. In the language of Chapter 6, height is the response variable and sex is the explanatory variable.
Suppose though we are interested in the case where the variables have opposite roles, so that sex is now the response with height as the predictor. One way to plot this relationship is to give two numerical values for sex, such as 0 for female and 1 for male, and then make a scatter plot as usual. The figure below shows the resulting plot for the survey data.
The plot suggests that sex tends to be at the higher level for larger values of height, but there is clearly no rule that can predict sex exactly from a height value. Instead we are interested in estimating the probability that a person is male, for example, given a known height. We could try estimating probabilities by fitting a least-squares line to the scatter plot, as shown in the figure above. This seems reasonable for most height values, but above 185 cm the estimate is more than 100%, which is not a valid probability. Similarly, below around 155 cm the estimated probability is negative, which again is not valid.
The solution to this problem is to use odds instead of proportions since odds are not restricted to the range 0 to 1. In fact it is better to work with log odds, since these can also be negative, and try to fit them with a straight line. That is, we find estimates for [latex]b_0[/latex] and [latex]b_1[/latex] such that
\[ \ln \left( \frac{p}{1-p} \right) = b_0 + b_1 x, \]
where [latex]\ln[/latex] is the natural logarithm. Unfortunately this cannot be done using the least-squares approach — we can make our predictions using the line but we can’t calculate the prediction errors because the left-hand side is undefined for observed [latex]p=0[/latex] (female) or [latex]p=1[/latex] (male).
Instead computer software can estimate [latex]b_0[/latex] and [latex]b_1[/latex] by what is known as the maximum likelihood method. This method chooses [latex]b_0[/latex] and [latex]b_1[/latex] to maximise the probability of observing the data that was observed, in a similar way to that discussed in Chapter 11 for proportions. Details of this approach for logistic regression are beyond the scope of this book but are given by Hosmer & Lemeshow (2000).
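To make the idea concrete, here is a minimal sketch of the log-likelihood that such software maximises. The helper name and the four toy height/sex values are illustrative only, not from the survey data:

```python
import math

def log_likelihood(b0, b1, xs, ys):
    """Binomial log-likelihood for logistic regression.

    xs are predictor values; ys are 0/1 responses.  Software searches
    for the (b0, b1) pair that maximises this quantity.
    """
    total = 0.0
    for x, y in zip(xs, ys):
        p = 1 / (1 + math.exp(-(b0 + b1 * x)))  # modelled probability
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total

# toy illustration: four heights (cm) with sex coded 0 = female, 1 = male
heights = [150, 160, 180, 190]
sexes = [0, 0, 1, 1]
```

Parameter values that track the data give a higher log-likelihood than, say, a flat line with zero slope, and the maximum likelihood method simply searches for the values where this is largest.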
For sex by height, where [latex]p[/latex] is the probability of being male, we find [latex]b_0 = -51.71[/latex] and [latex]b_1 = 0.3021[/latex]. Once you have these values you can rearrange the above formula to find [latex]p[/latex] in terms of [latex]x[/latex], giving
\[ \hat{p} = \frac{e^{-51.71 + 0.3021x}}{1 + e^{-51.71 + 0.3021x}} \]
where [latex]x[/latex] is height and [latex]\hat{p}[/latex] is the estimated probability of being male. When you plot this function for [latex]\hat{p}[/latex] you get a curve known as a logistic curve. This type of curve was introduced by Verhulst (1838) and is an important tool in modelling population growth (Campbell & Reece, 2002).
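With the fitted values the curve can be evaluated directly. The sketch below (the coefficients are taken from the text; the function name is our own) computes the estimated probability at a given height:

```python
import math

b0, b1 = -51.71, 0.3021  # fitted coefficients from the text

def p_male(height_cm):
    """Estimated probability of being male, from the fitted logistic curve."""
    z = b0 + b1 * height_cm
    return math.exp(z) / (1 + math.exp(z))

# the curve crosses 50% where b0 + b1*x = 0
even_bet_height = -b0 / b1
```

Solving [latex]b_0 + b_1 x = 0[/latex] gives a height of roughly 171 cm, the point at which being male or female is an even bet.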
The figure below shows the logistic curve fitted to the relationship between sex and height. From this plot you can see that for heights less than about 165 cm the person is most likely female while for heights greater than about 176 cm they are most likely male. It is an even bet for someone who is around 171 cm tall.
Odds Ratios
To plot a logistic curve we rearranged the formula for log odds to get [latex]p[/latex] in terms of [latex]x[/latex], as in the previous figure. However, we can also use it directly to calculate odds and odds ratios. We have already seen odds ratios in Chapter 17. One reason that odds ratios are often used instead of differences in proportions or relative risk is because of their link to logistic regression.
As an example, suppose we want to estimate how much more likely it is for a person to be male if they are 165 cm tall rather than 160 cm tall. We could estimate the individual odds by substituting 165 and 160 into the linear equation
\[ \ln \left( \frac{p}{1-p} \right) = -51.71 + 0.3021x, \]
taking exponentials of each, and then finding the ratio of the odds. However, it is slightly easier to remember that the log of a ratio is the difference of the logs, and then use this to work out [latex]\ln(\mbox{OR})[/latex] directly. Here we have
\begin{eqnarray*}
\ln(\mbox{OR}) & = & (-51.71 + 0.3021 \times 165) - (-51.71 + 0.3021 \times 160) \\
& = & 0.3021 \times 5 = 1.5105.
\end{eqnarray*}
Thus the odds ratio is [latex]e^{1.5105}[/latex] = 4.53, so the odds of a person who is 165 cm tall being male are 4.53 times the odds of a person who is only 160 cm tall.
Note that the intercepts cancel out in this calculation, so all that matters is the slope. This value, 0.3021, can thus be interpreted as the rate of increase in the log odds for each unit increase in the explanatory variable. This is analogous to the interpretation of slope for standard linear regression, as discussed in Chapter 18.
Assuming that log odds can be described by a straight line means that the rate of increase is constant across all values of the explanatory variable. Thus the odds ratio of being male between 185 cm and 180 cm will be the same as the odds ratio between 165 cm and 160 cm, 4.53 from above. Of course, this may not always be a realistic assumption in practice.
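Since only the slope matters, the odds ratio for any 5 cm gap is quick to check numerically (a sketch; the slope is the fitted value from the text):

```python
import math

b1 = 0.3021  # fitted slope from the text

# odds ratio for a 5 cm increase in height, e.g. 165 cm vs 160 cm
odds_ratio_5cm = math.exp(b1 * 5)
```

This reproduces the value 4.53 found above, and the same number applies to 185 cm versus 180 cm under the constant-slope assumption.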
Inference for Logistic Regression
So is there evidence that sex is related to height? Our model for logistic regression will be that log odds of being male is given by
\[ \ln \left( \frac{p}{1-p} \right) = \beta_0 + \beta_1 x, \]
where [latex]\beta_0[/latex] and [latex]\beta_1[/latex] are the underlying intercept and slope of the relationship, as in the usual linear regression. If [latex]\beta_1 = 0[/latex] then the log odds never change with different values of [latex]x[/latex]. Thus our null hypothesis of no association will be [latex]H_0: \beta_1 = 0[/latex].
We already have an estimate for [latex]\beta_1[/latex] from above, [latex]b_1 = 0.3021[/latex], so we just need a measure of the variability of this estimate and its corresponding sampling distribution. As with the log of the odds ratio in Chapter 17, the distribution here can also be approximated by a Normal distribution. The standard error of the estimate requires further maximum likelihood calculations; here we find [latex]\se{b_1} = 0.07807[/latex]. Combining these we have a [latex]z[/latex] statistic of
\[ z = \frac{0.3021 - 0}{0.07807} = 3.87. \]
From the Normal distribution table, the two-sided [latex]P[/latex]-value is about 0.0001, very strong evidence that there is a relationship between sex and height. The table below gives the regression summary for this analysis.
Logistic regression summary for sex by height (cm)
| | Estimate | SE | Z | P |
|---|---|---|---|---|
| Constant | -51.71276 | 13.41209 | -3.856 | 0.000115 |
| Height | 0.30206 | 0.07807 | 3.869 | 0.000109 |
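The [latex]z[/latex] statistic and its [latex]P[/latex]-value can be reproduced from the estimate and standard error alone, using the Normal approximation described above (a sketch with the values from the summary table):

```python
import math

b1, se_b1 = 0.3021, 0.07807  # estimate and standard error from the table

# test statistic for H0: beta1 = 0
z = (b1 - 0) / se_b1

# two-sided P-value from the standard Normal distribution
p_value = math.erfc(abs(z) / math.sqrt(2))
```

The `erfc` form is just the complement of the standard Normal cumulative probability, doubled for a two-sided test.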
Life-Changing Events and Smoking Risk
A researcher on the Islands, Jessica Kennedy, was interested in possible risk factors for smoking. One hypothesis was that major life-changing events could increase the incidence of smoking due to stress. Using a sample of 40 Islanders, Jessica counted the number of life-changing events each of them had experienced. These events were any of illnesses, marriages, child births, loss of spouses and migration between villages. She recorded their current smoking status along with other variables of possible interest, such as age and the number of children they had. The results are shown in the table below.
Life-changing events and smoking status
Age | Sex | Married | Children | Events | Smoker |
---|---|---|---|---|---|
42 | F | Yes | 0 | 3 | Yes |
42 | M | No | 0 | 1 | Yes |
30 | F | No | 0 | 5 | No |
43 | F | Yes | 5 | 5 | Yes |
30 | F | Yes | 0 | 3 | Yes |
20 | F | No | 0 | 7 | No |
38 | M | Yes | 0 | 2 | Yes |
30 | M | Yes | 2 | 3 | Yes |
43 | F | Yes | 4 | 5 | Yes |
110 | M | Yes | 6 | 10 | No |
36 | F | Yes | 6 | 9 | No |
52 | F | Yes | 2 | 3 | Yes |
55 | F | Yes | 0 | 2 | Yes |
36 | M | Yes | 6 | 9 | No |
70 | M | Yes | 6 | 10 | No |
44 | M | Yes | 6 | 11 | Yes |
65 | M | Yes | 6 | 6 | Yes |
29 | F | Yes | 0 | 3 | Yes |
69 | F | Yes | 4 | 5 | Yes |
36 | M | No | 0 | 1 | Yes |
67 | F | Yes | 2 | 2 | Yes |
47 | F | Yes | 3 | 5 | Yes |
32 | M | Yes | 0 | 1 | Yes |
83 | F | Yes | 3 | 2 | Yes |
44 | M | Yes | 6 | 9 | No |
33 | M | Yes | 4 | 4 | Yes |
27 | M | Yes | 2 | 5 | Yes |
51 | M | Yes | 3 | 8 | Yes |
25 | M | Yes | 4 | 6 | No |
26 | F | Yes | 0 | 3 | Yes |
44 | M | Yes | 4 | 11 | No |
33 | M | Yes | 5 | 5 | Yes |
39 | M | Yes | 5 | 11 | No |
31 | M | Yes | 2 | 5 | No |
40 | M | Yes | 5 | 7 | No |
105 | F | Yes | 0 | 2 | Yes |
37 | F | Yes | 5 | 8 | No |
116 | F | Yes | 0 | 7 | No |
84 | F | Yes | 4 | 9 | No |
126 | M | Yes | 4 | 1 | Yes |
Taking smoking status as the response variable, with the number of life-changing events as the predictor, the logistic regression output is given in the following table.
Logistic regression summary for smoking status by life-changing events

| | Estimate | SE | Z | P |
|---|---|---|---|---|
| Constant | 4.9676 | 1.4450 | 3.438 | 0.00059 |
| Life-Changing Events | -0.7630 | 0.2263 | -3.371 | 0.00075 |
The [latex]P[/latex]-value of 0.00075 is strong evidence of an association between the two variables, but the association is actually negative! Increasing the number of life-changing events decreases the log odds, and thus the probability, of smoking. The figure below shows the logistic fit to the data (where some vertical jitter has been added to the observations to make their distribution more apparent).
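Using the reported coefficients, the fitted curve can be evaluated at any event count (a sketch; the coefficients are from the summary table and the function name is our own):

```python
import math

b0, b1 = 4.9676, -0.7630  # fitted coefficients from the table

def p_smoker(events):
    """Estimated probability of being a smoker for a given event count."""
    z = b0 + b1 * events
    return 1 / (1 + math.exp(-z))
```

The negative slope means the estimated probability falls as events increase: it is roughly 0.94 at three events but only about 0.07 at ten.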
So why is there a negative association between these variables? It may be very similar to the example of Simpson’s Paradox given in Chapter 22. It could be that life-changing events do lead to smoking, but that smokers then tend to die younger than nonsmokers, removing the opportunity to accumulate further life-changing events. This effect is known as survivor bias. The possibility of such effects is always an issue with observational studies like this; further investigation is needed to explore any causal link.
Odds Ratios
Logistic regression can also be used to compare odds between groups, just as we did with odds ratios directly in Chapter 17. Consider the nicotine inhaler study again where there were two groups, the placebo inhaler and the nicotine inhaler. This will be the predictor variable in our logistic regression and we can represent it using an indicator variable, as we did in Chapter 21. Here we define a variable
\[ x = \left\{ \begin{array}{ll}
1, & \mbox{ if in nicotine group} \\
0, & \mbox{ if in placebo group} \\
\end{array} \right. \]
We can then find a linear relationship of the form
\[ \ln \left( \frac{p}{1-p} \right) = b_0 + b_1 x \]
using logistic regression to estimate the odds of sustaining a reduction in smoking.
Since the log of a ratio is the difference of the logs, this model gives the log of the odds ratio for a reduction in smoking between those in the nicotine group and those in the placebo group to be
\[ \ln(\mbox{OR}) = (b_0 + b_1 \times 1) - (b_0 + b_1 \times 0) = b_1. \]
Thus the odds ratio is simply [latex]e^{b_1}[/latex]. The logistic regression summary for this data is given in the table below.
Logistic regression summary for smoking reduction and inhaler group
| | Estimate | SE | Z | P |
|---|---|---|---|---|
| Constant | -2.3136 | 0.2471 | -9.364 | <0.001 |
| Nicotine | 1.2677 | 0.2950 | 4.297 | <0.001 |
The regression equation here is
\[ \ln \left( \frac{p}{1-p} \right) = -2.3136 + 1.2677 x. \]
The coefficient of the nicotine indicator may look familiar — it is the value of [latex]\ln(\mbox{OR})[/latex] we calculated by hand in Chapter 17. As noted above, the odds ratio is
\[ \mbox{OR} = e^{1.2677} = 3.55, \]
again the value we obtained previously.
Note that the standard error of the nicotine coefficient in the regression summary, 0.2950, is also the same as the value we calculated for the standard error of [latex]\ln(\mbox{OR})[/latex] in Chapter 17. We can use this to calculate confidence intervals for the regression parameters, since we are assuming the estimates have approximate Normal distributions. For example, a 95% confidence interval for the log of the odds ratio is
\[ 1.2677 \pm 1.96 \times 0.2950 = (0.6895, 1.8459). \]
Taking exponentials gives a confidence interval for the odds ratio to be
\[ (e^{0.6895}, e^{1.8459}) = (1.99, 6.33), \]
again what we found in Chapter 17.
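The interval arithmetic above is easy to reproduce (a sketch using the estimate and standard error from the regression summary):

```python
import math

b1, se_b1 = 1.2677, 0.2950  # nicotine coefficient and its standard error

odds_ratio = math.exp(b1)

# 95% confidence interval on the log-odds-ratio scale, then exponentiate
half_width = 1.96 * se_b1
ci_log = (b1 - half_width, b1 + half_width)
ci_or = (math.exp(ci_log[0]), math.exp(ci_log[1]))
```

Note that the interval is computed symmetrically on the log scale and only then exponentiated, which is why the final interval for the odds ratio is not symmetric about 3.55.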
Summary
- Logistic regression is a method for estimating the probability or odds of an outcome based on a quantitative predictor variable.
- As with multiple linear regression, logistic regression can be extended to involve multiple predictor variables, including indicator variables.
- Logistic regression is based on maximum likelihood estimation and thus requires computer software for most applications.
Exercise 1
For a logistic regression, what value of [latex]x[/latex] gives an estimated equal chance for the two values of the response?
Exercise 2
Based on the survey data, is there an association between whether a person said “Yes” to the question about kissing on the first date and their height? Use logistic regression to assess the evidence.
Exercise 3
A study by Robertson et al. (2013) (see Exercise 5 of Chapter 16) investigated the association between an individual having any criminal conviction by age 26 and the average hours of television they watched as children. The table below gives an (incomplete) summary of the logistic regression analysis. Calculate the appropriate [latex]P[/latex]-value to determine whether there is any evidence that watching more television as a child increases the chance of a criminal conviction by age 26.
Logistic regression summary for criminal convictions by childhood weekday hours of television
| | Estimate | SE |
|---|---|---|
| Constant | -2.5956 | 0.0834 |
| Television | 0.4383 | 0.1012 |
Exercise 4
Based on the previous table (in exercise 3), give a 95% confidence interval for the increase in odds of a criminal conviction by age 26 for a one-hour increase in mean weekday television viewing.