22 Categorical Data
[latex]\newcommand{\pr}[1]{P(#1)} \newcommand{\var}[1]{\mbox{var}(#1)} \newcommand{\mean}[1]{\mbox{E}(#1)} \newcommand{\sd}[1]{\mbox{sd}(#1)} \newcommand{\Binomial}[3]{#1 \sim \mbox{Binomial}(#2,#3)} \newcommand{\Student}[2]{#1 \sim \mbox{Student}(#2)} \newcommand{\Normal}[3]{#1 \sim \mbox{Normal}(#2,#3)} \newcommand{\Poisson}[2]{#1 \sim \mbox{Poisson}(#2)} \newcommand{\se}[1]{\mbox{se}(#1)} \newcommand{\prbig}[1]{P\left(#1\right)} \newcommand{\degc}{$^{\circ}$C}[/latex]
Testing Randomness
In contrast to the algorithmic random digits seen in Chapter 2, the table following this chapter gives 1800 digits from a human asked to create a random sequence. Are these digits random?
We will look at the first half of this sequence, the first 900 digits, leaving an analysis of the second half as an exercise. There are actually many different criteria for a sequence being “random” in this context, one of which is that the outcomes should all be equally likely. Here we would expect each digit to appear 900/10 = 90 times. The observed values are given in the table below with a bar chart of these shown in the following figure. This is a one-way table, giving observed counts for a single categorical variable.
Observed and expected counts for the first 900 digits
Digit | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|---|
Observed | 47 | 101 | 109 | 90 | 145 | 111 | 132 | 75 | 50 | 40 |
Expected | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90 |
There are certainly deviations from what we would expect, ranging from only 40 occurrences for ‘9’ up to 145 for ‘4’. Even if the digits were truly random we would not expect to get exactly 90 of each one appearing. But are the observed deviations plausible if the digits were truly random? This is a standard hypothesis test setting. We want to know the probability of getting counts at least as far from the expected values as those observed, purely by chance, if the digits really were equally likely.
This is a good chance to reflect on the basic ideas of hypothesis testing. We are not estimating a parameter here, and so will not be talking about confidence intervals. Instead we write [latex]H_0[/latex] in words as
[latex]H_0[/latex]: observations follow hypothesised distribution.
The alternative is the very general statement
[latex]H_1[/latex]: observations do not follow hypothesised distribution.
This is known as a goodness-of-fit test and we need some way of measuring how close the observed counts are to the expected counts. An obvious measure is to add up all the differences between the observed and expected counts, since we would expect this to be bigger if there were bigger deviations. However this sum is always 0 because the positive and negative differences always cancel out. (Why?) We could fix this by adding up the absolute differences, but as usual we add up the squared differences, just as we did for the sample standard deviation. This gives the statistic
\[ \sum (\mbox{observed} - \mbox{expected})^2, \]
where the sum is over all the categories (the 10 digits).
This, however, is not perfect as it does not take into account the relative size of the deviations. For example, an observed value of 20 is the same distance from an expected value of 10 as an observed value of 1010 is from an expected value of 1000. However, the first deviation is far more striking, since the observation is double the expected value, while the second is hardly a difference at all. To capture this we divide each squared difference by its expected value, giving
\[ \chi^2 = \sum \frac{(\mbox{observed} - \mbox{expected})^2}{\mbox{expected}}. \]
Here [latex]\chi[/latex] is the Greek letter chi, and this statistic is called the chi-square statistic. If there is evidence against the null hypothesis then we would expect [latex]\chi^2[/latex] to be large. Here we find
\[ \chi^2 = \frac{(47 - 90)^2}{90} + \frac{(101 - 90)^2}{90} + \cdots + \frac{(40 - 90)^2}{90} = 132.07. \]
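The arithmetic above can be checked with a few lines of Python. This is only a minimal sketch built from the observed counts in the table; the variable names (`observed`, `expected`, `chi_square`, `total_difference`) are illustrative choices rather than anything defined in the text.

```python
# Observed counts for the digits 0-9 in the first 900 human-generated digits
observed = [47, 101, 109, 90, 145, 111, 132, 75, 50, 40]

# Under H0 every digit is equally likely, so each expected count is 900/10
n = sum(observed)
expected = [n / len(observed)] * len(observed)

# Chi-square statistic: sum of (observed - expected)^2 / expected
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# The raw differences cancel out, which is why they are squared first
total_difference = sum(o - e for o, e in zip(observed, expected))

print(round(chi_square, 2))  # 132.07
print(total_difference)      # 0.0
```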
How do we know if 132.07 could simply be due to sampling variability? We need to know the sampling distribution of this statistic, assuming that [latex]H_0[/latex] is true. This distribution is called the chi-square distribution. Like the [latex]t[/latex] distribution, there is a different chi-square distribution for each number of categories. Here we have 10 categories but the sum of the differences between observed and expected is always 0, so there are only 9 free differences in the analysis. As before, we call this the degrees of freedom of the chi-square statistic.
The figure below shows the [latex]\chi^2_9[/latex] distribution, the chi-square distribution with 9 degrees of freedom, along with the [latex]\chi^2_1[/latex], [latex]\chi^2_4[/latex] and [latex]\chi^2_8[/latex] distributions for comparison. Since we are squaring everything the value of [latex]\chi^2[/latex] can never be negative but there is no real limit on how big [latex]\chi^2[/latex] can be, so this is a rather skewed distribution.
Like the other continuous distributions we have seen, there is no simple way of working out areas under the [latex]\chi^2_9[/latex] density curve. The table below gives the areas under the [latex]\chi^2_1[/latex] distribution as an example, but it is impractical to provide a table for every number of degrees of freedom, and also unnecessary since computer packages can provide these areas easily.
[latex]\chi^2(1)[/latex] distribution
Columns give the first decimal place of [latex]x[/latex].
[latex]x[/latex] | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|---|
0.0 | 1.000 | 0.752 | 0.655 | 0.584 | 0.527 | 0.480 | 0.439 | 0.403 | 0.371 | 0.343 |
1.0 | 0.317 | 0.294 | 0.273 | 0.254 | 0.237 | 0.221 | 0.206 | 0.192 | 0.180 | 0.168 |
2.0 | 0.157 | 0.147 | 0.138 | 0.129 | 0.121 | 0.114 | 0.107 | 0.100 | 0.094 | 0.089 |
3.0 | 0.083 | 0.078 | 0.074 | 0.069 | 0.065 | 0.061 | 0.058 | 0.054 | 0.051 | 0.048 |
4.0 | 0.046 | 0.043 | 0.040 | 0.038 | 0.036 | 0.034 | 0.032 | 0.030 | 0.028 | 0.027 |
5.0 | 0.025 | 0.024 | 0.023 | 0.021 | 0.020 | 0.019 | 0.018 | 0.017 | 0.016 | 0.015 |
6.0 | 0.014 | 0.014 | 0.013 | 0.012 | 0.011 | 0.011 | 0.010 | 0.010 | 0.009 | 0.009 |
7.0 | 0.008 | 0.008 | 0.007 | 0.007 | 0.007 | 0.006 | 0.006 | 0.006 | 0.005 | 0.005 |
8.0 | 0.005 | 0.004 | 0.004 | 0.004 | 0.004 | 0.004 | 0.003 | 0.003 | 0.003 | 0.003 |
9.0 | 0.003 | 0.003 | 0.002 | 0.002 | 0.002 | 0.002 | 0.002 | 0.002 | 0.002 | 0.002 |
10.0 | 0.002 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 |
11.0 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 |
12.0 | 0.001 | 0.001 |
This table gives [latex]\pr{X^2 \ge x}[/latex] where [latex]X^2 \sim \chi^2_1[/latex].
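Entries in this table can be reproduced directly, because a [latex]\chi^2_1[/latex] variable is the square of a standard Normal variable and so its upper-tail area can be written with the complementary error function. The sketch below assumes only Python's standard `math` module; the function name `chi2_1_upper_tail` is a hypothetical helper introduced here for illustration.

```python
import math

def chi2_1_upper_tail(x):
    """P(X^2 >= x) for one degree of freedom, via the complementary error function."""
    return math.erfc(math.sqrt(x / 2))

# Rebuild the first three rows of the table (x = 0.0, 0.1, ..., 2.9)
for row in range(3):
    values = [chi2_1_upper_tail(row + col / 10) for col in range(10)]
    print(f"{row}.0", " ".join(f"{v:.3f}" for v in values))
```

For example, `chi2_1_upper_tail(0.4)` should agree with the table entry in row 0.0, column 4.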
The following table provides the critical values for a range of degrees of freedom so you can see their general pattern. Unlike the [latex]t[/latex] distributions, the critical values here keep getting higher as the degrees of freedom increase, not surprising since [latex]\chi^2[/latex] is the sum of more and more terms.
[latex]\chi^2[/latex] distribution
Columns give the upper-tail probability [latex]p[/latex].
df | 0.975 | 0.95 | 0.25 | 0.10 | 0.05 | 0.025 | 0.01 | 0.005 | 0.001 |
---|---|---|---|---|---|---|---|---|---|
1 | 0.001 | 0.004 | 1.323 | 2.706 | 3.841 | 5.024 | 6.635 | 7.879 | 10.83 |
2 | 0.051 | 0.103 | 2.773 | 4.605 | 5.991 | 7.378 | 9.210 | 10.60 | 13.82 |
3 | 0.216 | 0.352 | 4.108 | 6.251 | 7.815 | 9.348 | 11.34 | 12.84 | 16.27 |
4 | 0.484 | 0.711 | 5.385 | 7.779 | 9.488 | 11.14 | 13.28 | 14.86 | 18.47 |
5 | 0.831 | 1.145 | 6.626 | 9.236 | 11.07 | 12.83 | 15.09 | 16.75 | 20.52 |
6 | 1.237 | 1.635 | 7.841 | 10.64 | 12.59 | 14.45 | 16.81 | 18.55 | 22.46 |
7 | 1.690 | 2.167 | 9.037 | 12.02 | 14.07 | 16.01 | 18.48 | 20.28 | 24.32 |
8 | 2.180 | 2.733 | 10.22 | 13.36 | 15.51 | 17.53 | 20.09 | 21.95 | 26.12 |
9 | 2.700 | 3.325 | 11.39 | 14.68 | 16.92 | 19.02 | 21.67 | 23.59 | 27.88 |
10 | 3.247 | 3.940 | 12.55 | 15.99 | 18.31 | 20.48 | 23.21 | 25.19 | 29.59 |
11 | 3.816 | 4.575 | 13.70 | 17.28 | 19.68 | 21.92 | 24.72 | 26.76 | 31.26 |
12 | 4.404 | 5.226 | 14.85 | 18.55 | 21.03 | 23.34 | 26.22 | 28.30 | 32.91 |
13 | 5.009 | 5.892 | 15.98 | 19.81 | 22.36 | 24.74 | 27.69 | 29.82 | 34.53 |
14 | 5.629 | 6.571 | 17.12 | 21.06 | 23.68 | 26.12 | 29.14 | 31.32 | 36.12 |
15 | 6.262 | 7.261 | 18.25 | 22.31 | 25.00 | 27.49 | 30.58 | 32.80 | 37.70 |
16 | 6.908 | 7.962 | 19.37 | 23.54 | 26.30 | 28.85 | 32.00 | 34.27 | 39.25 |
17 | 7.564 | 8.672 | 20.49 | 24.77 | 27.59 | 30.19 | 33.41 | 35.72 | 40.79 |
18 | 8.231 | 9.390 | 21.60 | 25.99 | 28.87 | 31.53 | 34.81 | 37.16 | 42.31 |
19 | 8.907 | 10.12 | 22.72 | 27.20 | 30.14 | 32.85 | 36.19 | 38.58 | 43.82 |
20 | 9.591 | 10.85 | 23.83 | 28.41 | 31.41 | 34.17 | 37.57 | 40.00 | 45.31 |
21 | 10.28 | 11.59 | 24.93 | 29.62 | 32.67 | 35.48 | 38.93 | 41.40 | 46.80 |
22 | 10.98 | 12.34 | 26.04 | 30.81 | 33.92 | 36.78 | 40.29 | 42.80 | 48.27 |
23 | 11.69 | 13.09 | 27.14 | 32.01 | 35.17 | 38.08 | 41.64 | 44.18 | 49.73 |
24 | 12.40 | 13.85 | 28.24 | 33.20 | 36.42 | 39.36 | 42.98 | 45.56 | 51.18 |
25 | 13.12 | 14.61 | 29.34 | 34.38 | 37.65 | 40.65 | 44.31 | 46.93 | 52.62 |
26 | 13.84 | 15.38 | 30.43 | 35.56 | 38.89 | 41.92 | 45.64 | 48.29 | 54.05 |
27 | 14.57 | 16.15 | 31.53 | 36.74 | 40.11 | 43.19 | 46.96 | 49.64 | 55.48 |
28 | 15.31 | 16.93 | 32.62 | 37.92 | 41.34 | 44.46 | 48.28 | 50.99 | 56.89 |
29 | 16.05 | 17.71 | 33.71 | 39.09 | 42.56 | 45.72 | 49.59 | 52.34 | 58.30 |
30 | 16.79 | 18.49 | 34.80 | 40.26 | 43.77 | 46.98 | 50.89 | 53.67 | 59.70 |
40 | 24.43 | 26.51 | 45.62 | 51.81 | 55.76 | 59.34 | 63.69 | 66.77 | 73.40 |
50 | 32.36 | 34.76 | 56.33 | 63.17 | 67.50 | 71.42 | 76.15 | 79.49 | 86.66 |
60 | 40.48 | 43.19 | 66.98 | 74.40 | 79.08 | 83.30 | 88.38 | 91.95 | 99.61 |
70 | 48.76 | 51.74 | 77.58 | 85.53 | 90.53 | 95.02 | 100.4 | 104.2 | 112.3 |
80 | 57.15 | 60.39 | 88.13 | 96.58 | 101.9 | 106.6 | 112.3 | 116.3 | 124.8 |
90 | 65.65 | 69.13 | 98.65 | 107.6 | 113.1 | 118.1 | 124.1 | 128.3 | 137.2 |
100 | 74.22 | 77.93 | 109.1 | 118.5 | 124.3 | 129.6 | 135.8 | 140.2 | 149.4 |
This table gives [latex]x^{*}[/latex] such that [latex]\pr{X^2 \ge x^{*}} = p[/latex], where [latex]X^2 \sim \chi^2(\mbox{df})[/latex].
The [latex]P[/latex]-value we want is [latex]\pr{X^2 \ge 132.07}[/latex], where [latex]X^2 \sim \chi^2_9[/latex]. For 9 degrees of freedom in the table above this [latex]P[/latex]-value is far below 0.001. Note that this is a two-sided [latex]P[/latex]-value already, since positive and negative deviations have been squared and combined, so there is no need to multiply it by 2. This is very strong evidence against [latex]H_0[/latex] so in conclusion there is very strong evidence to suggest that the human-generated numbers are not uniformly distributed.
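In practice a statistics package would report this tail area directly. As a rough illustration of how such an area can be computed, the sketch below uses a standard recurrence for the regularised upper incomplete gamma function, starting from the closed forms for 1 and 2 degrees of freedom. The helper name `chi2_upper_tail` is hypothetical, and the code assumes the degrees of freedom are a positive integer.

```python
import math

def chi2_upper_tail(x, df):
    """P(X^2 >= x) for a chi-square distribution with a positive integer df."""
    if x <= 0:
        return 1.0
    h = x / 2
    if df % 2 == 0:
        a, q = 1.0, math.exp(-h)               # exact tail area for df = 2
    else:
        a, q = 0.5, math.erfc(math.sqrt(h))    # exact tail area for df = 1
    # Each pass adds two degrees of freedom:
    # Q(a + 1, h) = Q(a, h) + h^a * exp(-h) / Gamma(a + 1)
    while a < df / 2:
        q += math.exp(a * math.log(h) - h - math.lgamma(a + 1))
        a += 1.0
    return q

p_value = chi2_upper_tail(132.07, 9)
print(p_value < 0.001)  # True
```

A quick sanity check is that the critical values in the table are recovered: for instance, `chi2_upper_tail(16.92, 9)` should be close to 0.05.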
Assumptions
Note that the observed counts in our data are discrete and so the [latex]\chi^2[/latex] statistic is also discrete, even though it might look like a continuous decimal number. However, the [latex]\chi^2[/latex] distribution we are using for hypothesis testing is continuous, so the underlying assumption is that this continuous approximation to the real discrete distribution is a good one. This is analogous to using the Normal distribution to approximate the Binomial distribution for proportion tests.
To satisfy this assumption we use the rule of thumb that all expected counts should be at least 1 and at least 80% of them should be at least 5. In the random digits example this is justified, with all expected counts equal to 90. In the section below we’ll see an example where we need to combine groups to satisfy the assumption.
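The rule of thumb is easy to automate. The sketch below is one possible reading of it, interpreting "80% of them" as "at least 80% of the expected counts"; the function name `counts_ok` is a hypothetical helper.

```python
def counts_ok(expected):
    """Rule of thumb: every expected count >= 1 and at least 80% of them >= 5."""
    all_at_least_one = all(e >= 1 for e in expected)
    share_at_least_five = sum(e >= 5 for e in expected) / len(expected)
    return all_at_least_one and share_at_least_five >= 0.8

# Random digits example: ten expected counts of 90
print(counts_ok([90] * 10))  # True
```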
Correlation Test of Randomness
Before we continue discussing categorical data, note that there are other requirements that a genuinely random sequence of numbers should satisfy. One important one is that consecutive numbers should be independent. This is particularly important if we were using a random number generator on a calculator or computer to help choose random samples for an experiment. All of the statistical tests we have described assume that samples are independent of each other and so a poor random number generator could undermine our studies.
One way of testing for independence here is to tally occurrences of the 100 possible pairs of digits. If the outcomes were independent then these 100 pairs should be equally likely and we could use a chi-square statistic with 99 degrees of freedom to test this uniformity.
Another method is to make a scatter plot where each point represents a digit and the digit that followed it in the sequence. Such a plot is shown in the following figure, where jittering has been used to separate points which would otherwise be obscured (since this is discrete data). If one digit and the next were independent then there should be no association present in this plot. However, there are noticeable gaps (an ‘8’ was never followed by a ‘1’) and very dense combinations (‘6’ was often followed by ‘5’). This leads to a smoothed line suggesting that perhaps there is a positive association between one digit and the next.
We can support the visual impression by calculating the correlation coefficient, [latex]r = +0.232[/latex]. This is not particularly large but it is significantly different from 0 ([latex]p \lt 0.001[/latex]). This gives strong evidence that low digits tend to be followed by low digits and high digits tend to be followed by high digits. Thus there is further evidence that the digits produced are not genuinely random.
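The correlation between each digit and the one that follows it can be computed as sketched below. The function name `lag1_correlation` is a hypothetical helper, and the input is just the first row of the human random digits table, used as a short illustration; it is not expected to reproduce the value of [latex]r[/latex] quoted for the full 900 digits.

```python
import math

def lag1_correlation(digits):
    """Pearson correlation between each digit and the digit that follows it."""
    xs = [int(d) for d in digits[:-1]]   # every digit except the last
    ys = [int(d) for d in digits[1:]]    # the digit that followed it
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# First row of the human random digits table (60 digits), for illustration only
first_row = "45632 68450 63215 64789 62354 56121 33654 12126 44789 50112 35641 13254"
sample = "".join(first_row.split())
print(lag1_correlation(sample))
```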
Parametric Distributions
Poisson Yeast Cells
The table below shows the counts of yeast cells made by Student (1907), along with the expected counts from the Poisson distribution with [latex]\lambda = 4.68[/latex]. Is there any evidence to suggest that the Poisson distribution is not appropriate for this data?
Observed and expected counts of yeast cells
Yeast Cells | 0-1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9+ |
---|---|---|---|---|---|---|---|---|---|
Observed | 20 | 43 | 53 | 86 | 70 | 54 | 37 | 18 | 19 |
Expected | 21.1 | 40.6 | 63.4 | 74.2 | 69.4 | 54.2 | 36.2 | 21.2 | 19.7 |
For [latex]\Poisson{X}{4.68}[/latex] we have [latex]\pr{X=0} = 0.009279[/latex] so that the expected count is [latex]400 \times 0.009279 = 3.71[/latex]. Since this is quite low we have combined the 0 and 1 counts together to satisfy the assumptions given earlier. Similarly, since the Poisson probabilities all get small for large [latex]x[/latex], we combine 9 and over into a single group. The expected value for this group can be calculated using complements since
\[ \pr{X \ge 9} = 1 - \pr{X \le 8} = 1 - \sum_{x=0}^8 \frac{e^{-\lambda} \lambda^x}{x!}. \]
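The expected counts in the table can be rebuilt from these formulas. The following is a minimal sketch using [latex]\lambda = 4.68[/latex] and the total of 400 counts from the text; the names `poisson_pmf`, `probs` and `expected` are illustrative.

```python
import math

lam = 4.68   # Poisson mean used in the text
n = 400      # total number of counts in the table

def poisson_pmf(x, lam):
    """P(X = x) for a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam ** x / math.factorial(x)

probs = [poisson_pmf(x, lam) for x in range(9)]   # P(X = 0), ..., P(X = 8)

expected = [n * (probs[0] + probs[1])]            # combined 0-1 group
expected += [n * p for p in probs[2:]]            # groups 2, 3, ..., 8
expected.append(n * (1 - sum(probs)))             # 9+ group, by complement

print([round(e, 1) for e in expected])
```

The nine values should match the Expected row of the table to one decimal place, and they sum to 400 by construction.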
We calculate the [latex]\chi^2[/latex] statistic
\[ x = \frac{(20 - 21.1)^2}{21.1} + \frac{(43 - 40.6)^2}{40.6} + \cdots + \frac{(19 - 19.7)^2}{19.7} = 4.31. \]
The basic degrees of freedom are 9 - 1 = 8. However, for this example we did not actually know the theoretical distribution before we looked at the data, unlike the previous example where we could specify the uniform expected values beforehand. To get the expected values here we first needed to estimate the Poisson parameter [latex]\lambda[/latex] from the data. When we do this we lose one degree of freedom, just as we lost one degree of freedom when calculating the sample standard deviation because we had to estimate the mean, and two degrees of freedom when calculating the residual standard error because we had to estimate the intercept and slope. Hence the degrees of freedom for this test are 9 - 2 = 7.
From the [latex]\chi^2[/latex] table we find the [latex]P[/latex]-value is greater than 0.25. This gives no evidence against the null hypothesis and so the observed counts are consistent with a Poisson(4.68) distribution.
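The statistic and the degrees of freedom for this example can be checked as follows. This sketch uses the rounded expected counts from the table and compares the result with the tabled critical value for 7 degrees of freedom rather than computing an exact [latex]P[/latex]-value; the variable names are illustrative.

```python
observed = [20, 43, 53, 86, 70, 54, 37, 18, 19]
expected = [21.1, 40.6, 63.4, 74.2, 69.4, 54.2, 36.2, 21.2, 19.7]  # rounded table values

# Chi-square statistic over the nine groups
x = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# One degree of freedom is lost for the total, one more for estimating lambda
groups = len(observed)
estimated_parameters = 1
df = groups - 1 - estimated_parameters

# Chi-square table entry for df = 7 and p = 0.25
critical_025 = 9.037

print(round(x, 2), df, x < critical_025)
```

Since the statistic is below the critical value for [latex]p = 0.25[/latex], the [latex]P[/latex]-value must be greater than 0.25.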
Relationship to Proportion Test
Mendel Revisited
In Chapter 17 we analysed Mendel’s experiment concerning the inheritance of pea plant flower colours. The two counts, 705 purple and 224 white, can also be written as a one-way table. In the table below we have put the counts together with the expected counts based on Mendel’s theory of a 3:1 ratio.
Observed and expected counts for Mendel's experiment
Colour | Purple | White |
---|---|---|
Observed | 705 | 224 |
Expected | 696.75 | 232.25 |
To test the significance of the deviations from the expected values we calculate the [latex]\chi^2[/latex] statistic
\[ x = \frac{(705 - 696.75)^2}{696.75} + \frac{(224 - 232.25)^2}{232.25} = 0.39. \]
This statistic has 1 degree of freedom. From the [latex]\chi^2(1)[/latex] table we see the [latex]P[/latex]-value is around 0.527, so there is no evidence that the observed results differ from the theory.
Note that this is almost identical to the result we obtained in Chapter 17 when we used the hypothesised value of [latex]p[/latex] to estimate the standard deviation of [latex]\hat{p}[/latex]. In fact the [latex]z[/latex] value there was 0.626 and [latex]0.626^2[/latex] = 0.39, the value of [latex]x[/latex]. The [latex]\chi_1^2[/latex] distribution is just the distribution of the square of a standard Normal variable. The chi-square test for one-way tables is thus a generalisation of the one-sample test of a proportion using the Normal distribution, allowing us to test a distribution with more than one free proportion. This is similar to the relationship we saw in Chapter 19, where the [latex]F[/latex] test for comparing two means gave identical results to the pooled two-sample [latex]t[/latex] test.
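The equivalence between the two tests can be verified numerically. The sketch below assumes the usual one-sample [latex]z[/latex] statistic with the hypothesised proportion in the standard deviation; the unrounded [latex]z[/latex] it produces can differ slightly in the third decimal place from the rounded value quoted above. Variable names are illustrative.

```python
import math

purple, white = 705, 224
n = purple + white
p0 = 0.75                          # hypothesised proportion of purple (3:1 ratio)

# One-sample z statistic, using p0 for the standard deviation of p-hat
p_hat = purple / n
z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)

# Chi-square statistic from the one-way table
expected = [n * p0, n * (1 - p0)]
x = sum((o - e) ** 2 / e for o, e in zip([purple, white], expected))

# Two-sided Normal P-value and chi-square (1 df) P-value
p_normal = math.erfc(abs(z) / math.sqrt(2))
p_chi2 = math.erfc(math.sqrt(x / 2))

print(round(z ** 2, 4), round(x, 4))
print(round(p_normal, 4), round(p_chi2, 4))
```

Both printed pairs should agree, since squaring the [latex]z[/latex] statistic gives the chi-square statistic exactly.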
Two-Way Tables
In Chapter 17 we looked at the effect of nicotine inhalers on smoking reduction using a comparison between two proportions. We can try to determine whether the inhalers are beneficial by testing for an association in the two-way table of counts. The following table shows this data again, with marginal totals included. Our null hypothesis is that the inhaler contents and the reduction outcome are independent, while the alternative hypothesis is simply that they are not.
Sustained reductions after 4 months of inhaler use
Outcome | Nicotine | Placebo | Total |
---|---|---|---|
Reduction | 52 | 18 | 70 |
No Reduction | 148 | 182 | 330 |
Total | 200 | 200 | 400 |
From the marginal distributions we see that 200/400 = 0.5 of the subjects had the nicotine inhaler, while 70/400 = 0.175 of the subjects had a reduction. If there was no association between inhaler contents and reduction then these outcomes should be independent of each other. We should then be able to multiply their proportions together to estimate the proportion of subjects having nicotine and having a reduction,
\[ 0.5 \times 0.175 = 0.0875. \]
Thus we would expect 8.75% of all subjects to have this combination. Now 8.75% of 400 is 35, compared to the observed value of 52. Is this a significant difference? We can use a chi-square test to find out.
Firstly, we calculate the other three expected counts. Note that we divide by 400 twice in getting our proportions but then multiply by it at the end. We can save one step and give the simple formula
\[ \mbox{expected count } = \frac{\mbox{row total } \times \mbox{ column total}}{\mbox{total}}. \]
For example, we would expect the count of subjects in the placebo group who don’t sustain a reduction to be
\[ \frac{200 \times 330}{400} = 165. \]
The table below gives all the expected counts.
Expected counts for inhaler data
Outcome | Nicotine | Placebo |
---|---|---|
Reduction | 35 | 35 |
No Reduction | 165 | 165 |
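The expected count formula applies to a table of any size. A minimal sketch, with the hypothetical helper name `expected_counts`:

```python
def expected_counts(table):
    """Expected counts under independence: row total * column total / total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]

# Rows: Reduction, No Reduction; columns: Nicotine, Placebo
inhaler = [[52, 18], [148, 182]]
print(expected_counts(inhaler))  # [[35.0, 35.0], [165.0, 165.0]]
```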
We can simply work out the chi-square statistic as before,
\[ \chi^2 = \frac{(52 - 35)^2}{35} + \cdots + \frac{(182 - 165)^2}{165} = 20.02. \]
This statistic has a [latex]\chi^2_{\mbox{df}}[/latex] distribution with
\[ \mbox{df } = (\mbox{rows } - 1) \times (\mbox{columns } - 1), \]
since the degrees of freedom from the two variables multiply in the same way that you multiply the number of rows and columns to find the number of cells. For this example, df = 1. The [latex]\chi^2(1)[/latex] table or the [latex]\chi^2[/latex] table tell us the [latex]P[/latex]-value is very close to 0. Thus there is very strong evidence of an association between the type of inhaler and whether a reduction was sustained.
Note that [latex]\sqrt{20.02} = 4.47[/latex], the [latex]z[/latex] statistic we found in Chapter 17 with the pooled sample proportion, so this test for association is identical to the two-sample proportion test using the Normal approximation. The advantage of the chi-square test is that it can be applied to tables with more than two rows or columns.
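The link between the two tests can be checked numerically. The sketch below assumes that the Chapter 17 statistic is the usual two-sample [latex]z[/latex] with a pooled sample proportion; the variable names are illustrative.

```python
import math

observed = [[52, 18], [148, 182]]    # rows: Reduction, No Reduction
expected = [[35, 35], [165, 165]]    # expected counts under independence

# Chi-square statistic summed over all four cells
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(observed) - 1) * (len(observed[0]) - 1)

# Two-sample z statistic with the pooled proportion of reductions
n1 = n2 = 200
p1, p2 = 52 / n1, 18 / n2
pooled = (52 + 18) / (n1 + n2)
z = (p1 - p2) / math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))

print(round(chi_square, 2), df, round(z, 2))
```

Squaring `z` reproduces `chi_square`, which is why the two tests give the same [latex]P[/latex]-value.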
Pizza Preference and Sex
The table below gives the two-way table of counts of 200 Islanders by pizza preference and sex that we first saw in Chapter 6.
Counts of preferred pizza by sex
Sex | Mushroom | Pineapple | Prawns | Sausage | Spinach | Total |
---|---|---|---|---|---|---|
Female | 10 | 39 | 17 | 13 | 23 | 102 |
Male | 18 | 10 | 13 | 36 | 21 | 98 |
Total | 28 | 49 | 30 | 49 | 44 | 200 |
The table below shows the expected counts if the two variables were independent.
Expected counts of preferred pizza by sex
Sex | Mushroom | Pineapple | Prawns | Sausage | Spinach | Total |
---|---|---|---|---|---|---|
Female | 14.3 | 25.0 | 15.3 | 25.0 | 22.4 | 102 |
Male | 13.7 | 24.0 | 14.7 | 24.0 | 21.6 | 98 |
Total | 28 | 49 | 30 | 49 | 44 | 200 |
The [latex]\chi^2[/latex] statistic is [latex]30.8[/latex] with 4 degrees of freedom (2 rows and 5 columns). From the [latex]\chi^2[/latex] table we find the [latex]P[/latex]-value is less than 0.001, giving substantial evidence from this data to suggest that pizza preference differs between males and females.
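These figures can be reproduced from the observed counts alone. The sketch below computes the expected counts, the statistic and the degrees of freedom, and then uses the closed form of the upper-tail area that holds specifically for 4 degrees of freedom, [latex]e^{-x/2}(1 + x/2)[/latex]. Variable names are illustrative.

```python
import math

pizza = [[10, 39, 17, 13, 23],    # Female
         [18, 10, 13, 36, 21]]    # Male

row_totals = [sum(row) for row in pizza]
col_totals = [sum(col) for col in zip(*pizza)]
total = sum(row_totals)

# Add (observed - expected)^2 / expected over all ten cells
chi_square = 0.0
for i, row in enumerate(pizza):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / total
        chi_square += (o - e) ** 2 / e

df = (len(pizza) - 1) * (len(pizza[0]) - 1)

# Upper-tail area for 4 degrees of freedom only
p_value = math.exp(-chi_square / 2) * (1 + chi_square / 2)

print(round(chi_square, 1), df, p_value < 0.001)
```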
Fisher’s Exact Test
For small sample sizes the test based on the [latex]\chi^2[/latex] distribution will usually give a poor approximation to the [latex]P[/latex]-values. An alternative is to use Fisher’s exact test (Glantz, 2002). Assuming no association, this procedure enumerates all the possible tables that would be at least as unusual as the one obtained, or simulates the generation of such tables if the number of possibilities is too big. Either method is straightforward but can require a lot of calculation and so is usually left to a computer.
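For a 2 x 2 table the enumeration can be sketched in a few lines. With the row and column totals held fixed, the top-left count follows a hypergeometric distribution, and one common convention for the two-sided [latex]P[/latex]-value adds up the probabilities of every table that is no more likely than the observed one. The function name `fisher_exact_two_sided` and the small example table are hypothetical, chosen only for illustration.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact P-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(k):
        # Hypergeometric probability of k in the top-left cell, margins fixed
        return comb(row1, k) * comb(row2, col1 - k) / comb(n, col1)

    lo = max(0, col1 - row2)   # smallest possible top-left count
    hi = min(row1, col1)       # largest possible top-left count
    p_obs = prob(a)
    # Sum over all tables no more likely than the observed one
    # (the small tolerance guards against floating-point ties)
    return sum(prob(k) for k in range(lo, hi + 1) if prob(k) <= p_obs * (1 + 1e-9))

# Hypothetical small table: 3 and 1 in the first row, 1 and 3 in the second
print(round(fisher_exact_two_sided(3, 1, 1, 3), 4))  # 0.4857
```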
Simpson’s Paradox
Appleton et al. (1996) surveyed women twenty years after they were part of a study in 1972-1974 on thyroid and heart disease. The following table shows the survival status at the time of the second survey of the 1314 women who had been classified as either a current smoker or as never having smoked in the original survey.
Two-way table of smoking and survival
Survival | Smoker: Yes | Smoker: No | Total |
---|---|---|---|
Dead | 139 | 230 | 369 |
Alive | 443 | 502 | 945 |
Total | 582 | 732 | 1314 |
Of the 582 women who smoked, 443 were still alive after twenty years, a survival rate of 76%. Of the 732 women who didn’t smoke, 502 were still alive, a survival rate of 69%. That is interesting: it seems that, for the population these women came from, smoking might actually help survival. A chi-square test of association gives a [latex]P[/latex]-value of 0.003, strong evidence that smoking status and survival are related.
This seems like good news for smokers! But now consider the table below which shows the same data but as a three-way table with an extra variable for age group.
Three-way table of age group, smoking and survival
Survival | 18-44, Smoker: Yes | 18-44, Smoker: No | 45-64, Smoker: Yes | 45-64, Smoker: No | 65+, Smoker: Yes | 65+, Smoker: No | Total |
---|---|---|---|---|---|---|---|
Dead | 19 | 13 | 78 | 52 | 42 | 165 | 369 |
Alive | 269 | 327 | 167 | 147 | 7 | 28 | 945 |
Total | 288 | 340 | 245 | 199 | 49 | 193 | 1314 |
We can now look at the relationship between smoking status and survival for the different age groups.
- For 18-44 year olds, 269 out of 288 smokers survived (93%) compared to 327 out of 340 nonsmokers (96%). For this age group it was better to be a nonsmoker.
- For 45-64 year olds, 167 out of 245 smokers survived (68%) compared to 147 out of 199 nonsmokers (74%). For this age group it was also better to be a nonsmoker.
- For women 65 and over, 7 out of 49 smokers survived (14%) compared to 28 out of 193 nonsmokers (15%). For this age group it was also slightly better to be a nonsmoker.
So for each age group the survival rate was higher for nonsmokers. That seems odd since above we saw that altogether it was smokers who had the better survival rate. The same data are giving us completely opposite conclusions!
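The reversal can be confirmed directly from the three-way table. The sketch below computes the survival rate within each age group and again after collapsing over age; the data structure and names (`data`, `survival_rate`, `by_age`, `overall`) are illustrative choices.

```python
# (dead, alive) counts from the three-way table, by age group and smoking status
data = {
    "18-44": {"smoker": (19, 269), "nonsmoker": (13, 327)},
    "45-64": {"smoker": (78, 167), "nonsmoker": (52, 147)},
    "65+":   {"smoker": (42, 7),   "nonsmoker": (165, 28)},
}

def survival_rate(dead, alive):
    return alive / (dead + alive)

# Survival rates within each age group
by_age = {
    age: {status: survival_rate(*counts) for status, counts in groups.items()}
    for age, groups in data.items()
}

# Survival rates after collapsing over age group
overall = {}
for status in ("smoker", "nonsmoker"):
    dead = sum(data[age][status][0] for age in data)
    alive = sum(data[age][status][1] for age in data)
    overall[status] = survival_rate(dead, alive)

for age, rates in by_age.items():
    print(age, round(rates["smoker"], 2), round(rates["nonsmoker"], 2))
print("all ages", round(overall["smoker"], 2), round(overall["nonsmoker"], 2))
```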
This phenomenon is known as Simpson’s paradox. We will leave you to figure out which is the correct conclusion and why the other one gives the opposite result.
Summary
- The chi-square test is a general procedure for comparing observed and expected counts.
- For a one-way table the expected counts need to come from some hypothetical distribution.
- Comparing counts to a uniform distribution is a simple test of randomness.
- For a two-way table the expected counts come from the null hypothesis of no association between the two variables.
- Simpson’s paradox is an example where ignoring a variable can dramatically change conclusions from a study.
Exercise 1
Consider the last 900 digits of the human random digits. Are these digits uniformly random?
Exercise 2
A table in the Appendix gives 1800 decimal places of [latex]e[/latex]. Are the decimal digits of [latex]e[/latex] uniformly random?
Exercise 3
A table in the Appendix gives 1800 decimal places of [latex]\pi[/latex]. Are the decimal digits of [latex]\pi[/latex] uniformly random?
Exercise 4
The small town of Shinobi is located near a lake that has been associated with mystical powers. Nathan Yamada, a resident of Shinobi, rolled a die 120 times to give the results shown in the table below. Is there evidence of anything unusual with the dice rolls?
Outcomes of 120 dice rolls in the Island village of Shinobi
65612 | 64662 | 32226 | 35535 | 22565 | 25525 |
16132 | 12451 | 33635 | 66521 | 35553 | 41332 |
42565 | 21541 | 35113 | 55624 | 55362 | 65635 |
14232 | 11532 | 52635 | 56661 | 54544 | 44152 |
Exercise 5
Is there any evidence that the data collected for Exercise 9 of Chapter 11 does not follow a Poisson distribution?
Exercise 6
A table in Chapter 6 gives a three-way table of pizza preference by sex and island. Collapse this data into a two-way table of pizza preference and island. Is there any evidence that pizza preference differs between the two islands?
Exercise 7
Discuss the smoking and survival data from the Simpson’s Paradox section of this chapter. What is the correct conclusion to make and why does the other table suggest the opposite conclusion?
Exercise 8
Green roofs are becoming a standard way of introducing vegetation in dense urban areas. Fernandez-Cañero et al. (2013) conducted a survey to assess attitudes towards green roof systems. From responses to 450 questionnaires they obtained the data shown in the table below. Is there evidence of an age difference in the interest in green roof systems?
Interest in green roof systems by age group
Age group | Under 18 | 18-25 | 26-40 | Over 40 |
---|---|---|---|---|
Interested | 136 | 42 | 48 | 75 |
Not interested | 83 | 18 | 31 | 17 |
Human random digits
45632 | 68450 | 63215 | 64789 | 62354 | 56121 | 33654 | 12126 | 44789 | 50112 | 35641 | 13254 |
46877 | 46521 | 11254 | 45789 | 64423 | 65789 | 65121 | 21523 | 65498 | 95632 | 45630 | 12186 |
65452 | 64478 | 96542 | 36512 | 45879 | 86614 | 23211 | 25354 | 40708 | 96362 | 13354 | 68755 |
21246 | 57252 | 01230 | 65456 | 98726 | 32548 | 96542 | 36544 | 21224 | 31675 | 98382 | 01645 |
82053 | 46725 | 64352 | 16546 | 53411 | 54653 | 82861 | 54653 | 60452 | 56563 | 74124 | 61365 |
67553 | 42116 | 56535 | 21454 | 62351 | 07467 | 53210 | 34411 | 32124 | 56467 | 67589 | 54624 |
22132 | 12466 | 72580 | 86549 | 79889 | 98764 | 52013 | 62656 | 05463 | 21154 | 84643 | 79164 |
58360 | 43461 | 25648 | 70586 | 32132 | 54060 | 96735 | 64945 | 46538 | 64594 | 65043 | 62165 |
04257 | 46686 | 44213 | 74663 | 46564 | 57986 | 93568 | 21326 | 45821 | 26432 | 80246 | 42467 |
63421 | 54679 | 47365 | 03253 | 61213 | 45767 | 67360 | 03431 | 26342 | 46157 | 28240 | 64560 |
84327 | 20542 | 32727 | 32373 | 16314 | 32789 | 21327 | 09127 | 13297 | 15321 | 41461 | 16420 |
45214 | 65542 | 57046 | 44141 | 44501 | 47147 | 68234 | 26146 | 50434 | 23516 | 46382 | 46151 |
46594 | 67246 | 15179 | 73986 | 73246 | 18467 | 97237 | 53488 | 74797 | 94653 | 15464 | 95267 |
34641 | 67897 | 97978 | 46121 | 21346 | 45797 | 91521 | 34543 | 14123 | 03451 | 32751 | 32761 |
76213 | 76842 | 73683 | 72014 | 00154 | 07404 | 17010 | 19480 | 76210 | 07104 | 46045 | 40174 |
00147 | 90545 | 71747 | 47448 | 07106 | 70075 | 15746 | 72727 | 07501 | 10457 | 10471 | 24178 |
10220 | 47579 | 46210 | 76867 | 51376 | 15746 | 12374 | 41645 | 32873 | 79783 | 14618 | 42556 |
42150 | 64274 | 06754 | 71419 | 87672 | 99874 | 62135 | 16143 | 45619 | 78495 | 64315 | 12036 |
43458 | 77847 | 31648 | 54546 | 41818 | 79798 | 46598 | 45146 | 12121 | 30615 | 44949 | 74825 |
36416 | 54894 | 97494 | 94530 | 32041 | 49050 | 74694 | 16731 | 45106 | 06085 | 05074 | 61810 |
58150 | 64154 | 51241 | 41079 | 78707 | 40079 | 15460 | 24641 | 87914 | 24645 | 27300 | 41948 |
46421 | 46745 | 42767 | 23154 | 68442 | 16427 | 35149 | 18246 | 15645 | 12465 | 16541 | 24376 |
54914 | 87216 | 45461 | 24681 | 94216 | 24900 | 97040 | 60504 | 05030 | 60405 | 12400 | 64867 |
91046 | 84142 | 70160 | 16421 | 97364 | 76915 | 61464 | 35247 | 97586 | 45356 | 45163 | 25798 |
83164 | 51356 | 48135 | 60804 | 50604 | 08780 | 98075 | 02032 | 06504 | 25010 | 20560 | 87875 |
77814 | 54110 | 74870 | 14749 | 74070 | 45781 | 07181 | 48779 | 87114 | 00413 | 21457 | 46512 |
49687 | 46541 | 32786 | 15762 | 13761 | 65745 | 16740 | 01567 | 84270 | 76491 | 84017 | 94047 |
59407 | 44974 | 51654 | 74974 | 97497 | 79712 | 04176 | 75746 | 54401 | 46404 | 46752 | 28346 |
41516 | 07288 | 58462 | 52346 | 75465 | 98497 | 72561 | 52432 | 64578 | 52612 | 40145 | 76761 |
24204 | 65413 | 12768 | 21846 | 31043 | 54196 | 84657 | 31224 | 68149 | 84216 | 45124 | 63149 |
These digits were generated by a human asked to type a sequence of 1800 random digits.