"

22 Categorical Data

[latex]\newcommand{\pr}[1]{P(#1)} \newcommand{\var}[1]{\mbox{var}(#1)} \newcommand{\mean}[1]{\mbox{E}(#1)} \newcommand{\sd}[1]{\mbox{sd}(#1)} \newcommand{\Binomial}[3]{#1 \sim \mbox{Binomial}(#2,#3)} \newcommand{\Student}[2]{#1 \sim \mbox{Student}(#2)} \newcommand{\Normal}[3]{#1 \sim \mbox{Normal}(#2,#3)} \newcommand{\Poisson}[2]{#1 \sim \mbox{Poisson}(#2)} \newcommand{\se}[1]{\mbox{se}(#1)} \newcommand{\prbig}[1]{P\left(#1\right)} \newcommand{\degc}{$^{\circ}$C}[/latex]

Testing Randomness

In contrast to the algorithmic random digits seen in Chapter 2, the table following this chapter gives 1800 digits from a human asked to create a random sequence. Are these digits random?

We will look at the first half of this sequence, the first 900 digits, leaving an analysis of the second half as an exercise. There are actually many different criteria for a sequence being “random” in this context, one of which is that the outcomes should all be equally likely. Here we would expect each digit to appear 900/10 = 90 times. The observed values are given in the table below with a bar chart of these shown in the following figure. This is a one-way table, giving observed counts for a single categorical variable.

Observed and expected counts for the first 900 digits

Digit 0 1 2 3 4 5 6 7 8 9
Observed 47 101 109 90 145 111 132 75 50 40
Expected 90 90 90 90 90 90 90 90 90 90

Bar chart of observed frequency for the first 900 digits

There are certainly deviations from what we would expect, ranging from only 40 occurrences for ‘9’ up to 145 for ‘4’. Even if the digits were truly random we would not expect to get exactly 90 of each one appearing. But are the observed deviations plausible if the digits were truly random? This is a standard hypothesis test setting. We want to know the probability of getting counts at least as far from the expected values as those observed, purely by chance, if the digits really were equally likely.

This is a good chance to reflect on the basic ideas of hypothesis testing. We are not estimating a parameter here, and so will not be talking about confidence intervals. Instead we write [latex]H_0[/latex] in words as

[latex]H_0[/latex]: observations follow hypothesised distribution.

The alternative is the very general statement

[latex]H_1[/latex]: observations do not follow hypothesised distribution.

This is known as a goodness-of-fit test and we need some way of measuring how close the observed counts are to the expected counts. An obvious measure is to add up all the differences between the observed and expected counts, since we would expect this to be bigger if there were bigger deviations. However this sum is always 0 because the positive and negative differences always cancel out. (Why?) We could fix this by adding up the absolute differences, but as usual we add up the squared differences, just as we did for the sample standard deviation. This gives the statistic
\[ \sum (\mbox{observed} - \mbox{expected})^2, \]
where the sum is over all the categories (the 10 digits).
This, however, is not perfect as it does not take into account the relative size of deviations. For example, an observed value of 20 would be the same distance from an expected value of 10 as an observed value of 1010 would be from an expected value of 1000. However the first is much more significant since the observation was double the expected, while the second is not much of a difference at all. To capture this we divide each squared difference by its expected value, giving
\[ \chi^2 = \sum \frac{(\mbox{observed} - \mbox{expected})^2}{\mbox{expected}}. \]
Here [latex]\chi[/latex] is the Greek letter chi, and this statistic is called the chi-square statistic. If there is evidence against the null hypothesis then we would expect [latex]\chi^2[/latex] to be large. Here we find
\[ \chi^2 = \frac{(47 - 90)^2}{90} + \frac{(101 - 90)^2}{90} + \cdots + \frac{(40 - 90)^2}{90} = 132.07. \]
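The calculation above can be checked with a few lines of Python. This is a minimal sketch using only the observed counts from the table; the variable names are illustrative and not part of the original text.

```python
# Observed counts of the digits 0-9 in the first 900 human-generated digits
observed = [47, 101, 109, 90, 145, 111, 132, 75, 50, 40]

# Under the null hypothesis every digit is equally likely: 900 / 10 = 90 each
n_categories = len(observed)
expected = [sum(observed) / n_categories] * n_categories

# Chi-square statistic: sum of (observed - expected)^2 / expected
statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# The differences must sum to zero, so one of them is not free
df = n_categories - 1

print(round(statistic, 2), df)
```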

How do we know if 132.07 could simply be due to sampling variability? We need to know the sampling distribution of this statistic, assuming that [latex]H_0[/latex] is true. This distribution is called the chi-square distribution. Like the [latex]t[/latex] distribution, there is a different chi-square distribution for each number of categories. Here we have 10 categories but the sum of the differences between observed and expected is always 0, so there are only 9 free differences in the analysis. As before, we call this the degrees of freedom of the chi-square statistic.

The figure below shows the [latex]\chi^2_9[/latex] distribution, the chi-square distribution with 9 degrees of freedom, along with the [latex]\chi^2_1[/latex], [latex]\chi^2_4[/latex] and [latex]\chi^2_8[/latex] distributions for comparison. Since we are squaring everything the value of [latex]\chi^2[/latex] can never be negative but there is no real limit on how big [latex]\chi^2[/latex] can be, so this is a rather skewed distribution.

[latex]\chi^2_1[/latex], [latex]\chi^2_4[/latex], [latex]\chi^2_8[/latex] and [latex]\chi^2_9[/latex] distributions

Like the other continuous distributions we have seen, there is no simple way of working out areas under the [latex]\chi^2_9[/latex] density curve. The table below gives the areas under the [latex]\chi^2_1[/latex] distribution as an example, but it is impractical to provide tables for every number of degrees of freedom, and also unnecessary since computer packages can provide these areas easily.

[latex]\chi^2(1)[/latex] distribution

  First decimal place of [latex]x[/latex]
[latex]x[/latex] 0 1 2 3 4 5 6 7 8 9
0.0 1.000 0.752 0.655 0.584 0.527 0.480 0.439 0.403 0.371 0.343
1.0 0.317 0.294 0.273 0.254 0.237 0.221 0.206 0.192 0.180 0.168
2.0 0.157 0.147 0.138 0.129 0.121 0.114 0.107 0.100 0.094 0.089
3.0 0.083 0.078 0.074 0.069 0.065 0.061 0.058 0.054 0.051 0.048
4.0 0.046 0.043 0.040 0.038 0.036 0.034 0.032 0.030 0.028 0.027
5.0 0.025 0.024 0.023 0.021 0.020 0.019 0.018 0.017 0.016 0.015
6.0 0.014 0.014 0.013 0.012 0.011 0.011 0.010 0.010 0.009 0.009
7.0 0.008 0.008 0.007 0.007 0.007 0.006 0.006 0.006 0.005 0.005
8.0 0.005 0.004 0.004 0.004 0.004 0.004 0.003 0.003 0.003 0.003
9.0 0.003 0.003 0.002 0.002 0.002 0.002 0.002 0.002 0.002 0.002
10.0 0.002 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001
11.0 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001
12.0 0.001 0.001

This table gives [latex]\pr{X^2 \ge x}[/latex] where [latex]X^2 \sim \chi^2_1[/latex].

The following table provides the critical values for a range of degrees of freedom so you can see their general pattern. Unlike the [latex]t[/latex] distributions, the critical values here keep getting higher as the degrees of freedom increase, not surprising since [latex]\chi^2[/latex] is the sum of more and more terms.

[latex]\chi^2[/latex] distribution

  Probability [latex]p[/latex]
df 0.975 0.95 0.25 0.10 0.05 0.025 0.01 0.005 0.001
1 0.001 0.004 1.323 2.706 3.841 5.024 6.635 7.879 10.83
2 0.051 0.103 2.773 4.605 5.991 7.378 9.210 10.60 13.82
3 0.216 0.352 4.108 6.251 7.815 9.348 11.34 12.84 16.27
4 0.484 0.711 5.385 7.779 9.488 11.14 13.28 14.86 18.47
5 0.831 1.145 6.626 9.236 11.07 12.83 15.09 16.75 20.52
6 1.237 1.635 7.841 10.64 12.59 14.45 16.81 18.55 22.46
7 1.690 2.167 9.037 12.02 14.07 16.01 18.48 20.28 24.32
8 2.180 2.733 10.22 13.36 15.51 17.53 20.09 21.95 26.12
9 2.700 3.325 11.39 14.68 16.92 19.02 21.67 23.59 27.88
10 3.247 3.940 12.55 15.99 18.31 20.48 23.21 25.19 29.59
11 3.816 4.575 13.70 17.28 19.68 21.92 24.72 26.76 31.26
12 4.404 5.226 14.85 18.55 21.03 23.34 26.22 28.30 32.91
13 5.009 5.892 15.98 19.81 22.36 24.74 27.69 29.82 34.53
14 5.629 6.571 17.12 21.06 23.68 26.12 29.14 31.32 36.12
15 6.262 7.261 18.25 22.31 25.00 27.49 30.58 32.80 37.70
16 6.908 7.962 19.37 23.54 26.30 28.85 32.00 34.27 39.25
17 7.564 8.672 20.49 24.77 27.59 30.19 33.41 35.72 40.79
18 8.231 9.390 21.60 25.99 28.87 31.53 34.81 37.16 42.31
19 8.907 10.12 22.72 27.20 30.14 32.85 36.19 38.58 43.82
20 9.591 10.85 23.83 28.41 31.41 34.17 37.57 40.00 45.31
21 10.28 11.59 24.93 29.62 32.67 35.48 38.93 41.40 46.80
22 10.98 12.34 26.04 30.81 33.92 36.78 40.29 42.80 48.27
23 11.69 13.09 27.14 32.01 35.17 38.08 41.64 44.18 49.73
24 12.40 13.85 28.24 33.20 36.42 39.36 42.98 45.56 51.18
25 13.12 14.61 29.34 34.38 37.65 40.65 44.31 46.93 52.62
26 13.84 15.38 30.43 35.56 38.89 41.92 45.64 48.29 54.05
27 14.57 16.15 31.53 36.74 40.11 43.19 46.96 49.64 55.48
28 15.31 16.93 32.62 37.92 41.34 44.46 48.28 50.99 56.89
29 16.05 17.71 33.71 39.09 42.56 45.72 49.59 52.34 58.30
30 16.79 18.49 34.80 40.26 43.77 46.98 50.89 53.67 59.70
40 24.43 26.51 45.62 51.81 55.76 59.34 63.69 66.77 73.40
50 32.36 34.76 56.33 63.17 67.50 71.42 76.15 79.49 86.66
60 40.48 43.19 66.98 74.40 79.08 83.30 88.38 91.95 99.61
70 48.76 51.74 77.58 85.53 90.53 95.02 100.4 104.2 112.3
80 57.15 60.39 88.13 96.58 101.9 106.6 112.3 116.3 124.8
90 65.65 69.13 98.65 107.6 113.1 118.1 124.1 128.3 137.2
100 74.22 77.93 109.1 118.5 124.3 129.6 135.8 140.2 149.4

This table gives [latex]x^{*}[/latex] such that [latex]\pr{X^2 \ge x^{*}} = p[/latex], where [latex]X^2 \sim \chi^2(\mbox{df})[/latex].

The [latex]P[/latex]-value we want is [latex]\pr{X^2 \ge 132.07}[/latex], where [latex]X^2 \sim \chi^2_9[/latex]. For 9 degrees of freedom in the table above this [latex]P[/latex]-value is far below 0.001. Note that this is a two-sided [latex]P[/latex]-value already, since positive and negative deviations have been squared and combined, so there is no need to multiply it by 2. This is very strong evidence against [latex]H_0[/latex] so in conclusion there is very strong evidence to suggest that the human-generated numbers are not uniformly distributed.
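For readers without a statistics package, the upper-tail area for a whole number of degrees of freedom can be computed from standard-library functions. The sketch below is one possible approach, not the method used to build the tables: it relies on the fact that the tail area for 1 degree of freedom is `erfc(sqrt(x / 2))`, that for 2 degrees of freedom it is `exp(-x / 2)`, and that a standard recurrence steps up by 2 degrees of freedom at a time. The function name `chi2_sf` is an assumption made here for illustration.

```python
import math

def chi2_sf(x, df):
    """Upper-tail area P(X^2 >= x) for a chi-square with integer df >= 1."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    # Starting values: df = 1 (a = 1/2) or df = 2 (a = 1)
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))
    # Recurrence Q(a + 1, y) = Q(a, y) + y^a e^(-y) / Gamma(a + 1)
    term = y ** a * math.exp(-y) / math.gamma(a + 1.0)
    for _ in range((df - 1) // 2):
        q += term
        a += 1.0
        term *= y / a
    return q

# P-value for the digit counts: statistic 132.07 with 9 degrees of freedom
p_value = chi2_sf(132.07, 9)
print(p_value)
```

As a sanity check, `chi2_sf(3.841, 1)` and `chi2_sf(16.92, 9)` both come out very close to 0.05, matching the critical values in the table.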

Assumptions

Note that the observed counts in our data are discrete and so the [latex]\chi^2[/latex] statistic is also discrete, even though it might look like a continuous decimal number. However, the [latex]\chi^2[/latex] distribution we are using for hypothesis testing is continuous, so the underlying assumption is that this continuous approximation to the real discrete distribution is a good one. This is analogous to using the Normal distribution to approximate the Binomial distribution for proportion tests.

To satisfy this assumption we use the rule of thumb that all expected counts should be at least 1 and 80% of them should be at least 5. In the random digits example this is justified, with all expected counts equal to 90. In the section below we’ll see an example where we need to combine groups to satisfy the assumption.
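The rule of thumb is easy to check mechanically. The helper below is a small sketch of that check; the function name `expected_counts_ok` is hypothetical.

```python
def expected_counts_ok(expected):
    """Rule of thumb: all expected counts >= 1 and at least 80% of them >= 5."""
    all_at_least_1 = all(e >= 1 for e in expected)
    share_at_least_5 = sum(e >= 5 for e in expected) / len(expected)
    return all_at_least_1 and share_at_least_5 >= 0.8

# Random digits example: every expected count is 90
print(expected_counts_ok([90] * 10))
```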

Correlation Test of Randomness

Before we continue discussing categorical data, note that there are other requirements that a genuinely random sequence of numbers should satisfy. One important one is that consecutive numbers should be independent. This is particularly important if we were using a random number generator on a calculator or computer to help choose random samples for an experiment. All of the statistical tests we have described assume that samples are independent of each other and so a poor random number generator could undermine our studies.

One way of testing for independence here is to tally occurrences of the 100 possible pairs of digits. If the outcomes were independent then these 100 pairs should be equally likely and we could use a chi-square statistic with 99 degrees of freedom to test this uniformity.

Another method is to make a scatter plot where each point represents a digit and the digit that followed it in the sequence. Such a plot is shown in the following figure, where jittering has been used to separate points which would otherwise be obscured (since this is discrete data). If one digit and the next were independent then there should be no association present in this plot. However, there are noticeable gaps (an ‘8’ was never followed by a ‘1’) and very dense combinations (‘6’ was often followed by ‘5’). This leads to a smoothed line suggesting that perhaps there is a positive association between one digit and the next.

Jittered scatter plot of consecutive digits

We can support the visual impression by calculating the correlation coefficient, [latex]r = +0.232[/latex]. This is not particularly large but it is significantly different from 0 ([latex]p \lt 0.001[/latex]). This gives strong evidence that low digits tend to be followed by low digits and high digits tend to be followed by high digits. Thus there is further evidence that the digits produced are not genuinely random.
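The correlation between each digit and the one that follows it can be computed directly from the sequence. The sketch below assumes the digits are supplied as a string (spaces are ignored) and uses the usual sample correlation formula; the function name `lag1_correlation` is illustrative. The short string in the usage line is just the first few groups of the human digit table, so its result will not match the value quoted for the first 900 digits.

```python
import math

def lag1_correlation(digits):
    """Correlation between each digit and the digit that follows it."""
    xs = [int(c) for c in digits if c.isdigit()]
    first, second = xs[:-1], xs[1:]
    n = len(first)
    mean_a = sum(first) / n
    mean_b = sum(second) / n
    sxy = sum((a - mean_a) * (b - mean_b) for a, b in zip(first, second))
    sxx = sum((a - mean_a) ** 2 for a in first)
    syy = sum((b - mean_b) ** 2 for b in second)
    return sxy / math.sqrt(sxx * syy)

# Usage on a short excerpt of the human-generated digits
print(lag1_correlation("45632 68450 63215 64789"))
```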

Parametric Distributions

Poisson Yeast Cells

The table below shows the counts of yeast cells made by Student (1907), along with the expected counts from the Poisson distribution with [latex]\lambda = 4.68[/latex]. Is there any evidence to suggest that the Poisson distribution is not appropriate for this data?

Observed and expected counts of yeast cells

Yeast Cells 0-1 2 3 4 5 6 7 8 9+
Observed 20 43 53 86 70 54 37 18 19
Expected 21.1 40.6 63.4 74.2 69.4 54.2 36.2 21.2 19.7

For [latex]\Poisson{X}{4.68}[/latex] we have [latex]\pr{X=0} = 0.009279[/latex] so that the expected count is [latex]400 \times 0.009279 = 3.71[/latex]. Since this is quite low we have combined the 0 and 1 counts together to satisfy the assumptions given earlier. Similarly, since the Poisson probabilities all get small for large [latex]x[/latex], we combine 9 and over into a single group. The expected value for this group can be calculated using complements since
\[ \pr{X \ge 9} = 1 - \pr{X \le 8} = 1 - \sum_{x=0}^8 \frac{e^{-\lambda} \lambda^x}{x!}. \]

We calculate the [latex]\chi^2[/latex] statistic
\[ x = \frac{(20 - 21.1)^2}{21.1} + \frac{(43 - 40.6)^2}{40.6} + \cdots + \frac{(19 - 19.7)^2}{19.7} = 4.31. \]
The basic degrees of freedom are 9 - 1 = 8. However, for this example we did not actually know the theoretical distribution before we looked at the data, unlike the previous example where we could specify the uniform expected values beforehand. To get the expected values here we first needed to estimate the Poisson parameter [latex]\lambda[/latex] from the data. When we do this we lose one degree of freedom, just as we lost one degree of freedom when calculating the sample standard deviation because we had to estimate the mean, and two degrees of freedom when calculating the residual standard error because we had to estimate the intercept and slope. Hence the degrees of freedom for this test are 9 - 2 = 7.

From the [latex]\chi^2[/latex] table we find the [latex]P[/latex]-value is greater than 0.25. This gives no evidence against the null hypothesis and so the observed counts are consistent with a Poisson(4.68) distribution.
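The expected counts and the statistic for the yeast cell data can be reproduced as follows. This is a sketch under two assumptions taken from the text: the total count is 400 and [latex]\lambda = 4.68[/latex]. The statistic is computed from the rounded expected counts in the table so that it matches the hand calculation; using the unrounded values would change it slightly.

```python
import math

lam = 4.68   # estimated Poisson mean, from the text
n = 400      # total number of counts
observed = [20, 43, 53, 86, 70, 54, 37, 18, 19]  # groups 0-1, 2, ..., 8, 9+

def poisson_pmf(x, lam):
    """P(X = x) for a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam ** x / math.factorial(x)

# Group probabilities: pool 0 and 1, keep 2..8, pool 9 and over
probs = [poisson_pmf(0, lam) + poisson_pmf(1, lam)]
probs += [poisson_pmf(x, lam) for x in range(2, 9)]
probs.append(1.0 - sum(probs))  # P(X >= 9) by complement
expected = [n * p for p in probs]

# Statistic from the rounded expected counts shown in the table
table_expected = [21.1, 40.6, 63.4, 74.2, 69.4, 54.2, 36.2, 21.2, 19.7]
statistic = sum((o - e) ** 2 / e for o, e in zip(observed, table_expected))

# One degree of freedom lost for the total, one for estimating lambda
df = len(observed) - 1 - 1

print([round(e, 1) for e in expected])
print(round(statistic, 2), df)
```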

Relationship to Proportion Test

Mendel Revisited

In Chapter 17 we analysed Mendel’s experiment concerning the inheritance of pea plant flower colours. The two counts, 705 purple and 224 white, can also be written as a one-way table. In the table below we have put the counts together with the expected counts based on Mendel’s theory of a 3:1 ratio.

Observed and expected counts for Mendel's experiment

Colour Purple White
Observed 705 224
Expected 696.75 232.25

To test the significance of the deviations from the expected values we calculate the [latex]\chi^2[/latex] statistic
\[ x = \frac{(705 - 696.75)^2}{696.75} + \frac{(224 - 232.25)^2}{232.25} = 0.39. \]
This statistic has 1 degree of freedom. From the [latex]\chi^2(1)[/latex] table we see the [latex]P[/latex]-value is around 0.527, so there is no evidence that the observed results differ from the theory.

Note that this is almost identical to the result we obtained in Chapter 17 when we used the hypothesised value of [latex]p[/latex] to estimate the standard deviation of [latex]\hat{p}[/latex]. In fact the [latex]z[/latex] value there was 0.626 and [latex]0.626^2[/latex] = 0.39, the value of [latex]x[/latex]. The [latex]\chi_1^2[/latex] distribution is just the distribution of the square of a standard Normal variable. The chi-square test for one-way tables is thus a generalisation of the one-sample test of a proportion using the Normal distribution, allowing us to test a distribution with more than one free proportion. This is similar to the relationship we saw in Chapter 19, where the [latex]F[/latex] test for comparing two means gave identical results to the pooled two-sample [latex]t[/latex] test.
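The link between the two tests can be verified numerically. The sketch below computes the chi-square statistic and the one-sample [latex]z[/latex] statistic for Mendel's counts and confirms that the first is the square of the second, up to floating-point rounding. It also uses the fact that the upper-tail area of the [latex]\chi^2_1[/latex] distribution equals the two-sided Normal tail area, which the standard library exposes through `math.erfc`.

```python
import math

observed = [705, 224]  # purple, white
n = sum(observed)
p0 = 0.75              # hypothesised proportion of purple (3:1 ratio)

# Chi-square statistic for the one-way table
expected = [n * p0, n * (1 - p0)]
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# One-sample z statistic using the hypothesised p in the standard deviation
p_hat = observed[0] / n
z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)

# P(chi-square_1 >= chi2) = P(|Z| >= sqrt(chi2)) = erfc(sqrt(chi2 / 2))
p_value = math.erfc(math.sqrt(chi2 / 2))

print(round(chi2, 2), round(z ** 2, 2), p_value)
```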

Two-Way Tables

In Chapter 17 we looked at the effect of nicotine inhalers on smoking reduction using a comparison between two proportions. We can try to determine whether the inhalers are beneficial by testing for an association in the two-way table of counts. The following table shows this data again, with marginal totals included. Our null hypothesis is that the inhaler contents and the reduction outcome are independent, while the alternative hypothesis is simply that they are not.

Sustained reductions after 4 months of inhaler use

  Nicotine Placebo Total
Reduction 52 18 70
No Reduction 148 182 330
Total 200 200 400

From the marginal distributions we see that 200/400 = 0.5 of the subjects had the nicotine inhaler, while 70/400 = 0.175 of the subjects had a reduction. If there was no association between inhaler contents and reduction then these outcomes should be independent of each other. We should then be able to multiply their proportions together to estimate the proportion of subjects having nicotine and having a reduction,
\[ 0.5 \times 0.175 = 0.0875. \]
Thus we would expect 8.75% of all subjects to have this combination. Now 8.75% of 400 is 35, compared to the observed value of 52. Is this a significant difference? We can use a chi-square test to find out.

Firstly, we calculate the other three expected counts. Note that we divide by 400 twice in getting our proportions but then multiply by it at the end. We can save one step and give the simple formula
\[ \mbox{expected count } = \frac{\mbox{row total } \times \mbox{ column total}}{\mbox{total}}. \]
For example, we would expect the count of subjects in the placebo group who don’t sustain a reduction to be
\[ \frac{200 \times 330}{400} = 165. \]
The table below gives all the expected counts.

Expected counts for inhaler data

  Nicotine Placebo
Reduction 35 35
No Reduction 165 165

We can simply work out the chi-square statistic as before,
\[ \chi^2 = \frac{(52 - 35)^2}{35} + \cdots + \frac{(182 - 165)^2}{165} = 20.02. \]
This statistic has a [latex]\chi^2_{\mbox{df}}[/latex] distribution with
\[ \mbox{df } = (\mbox{rows } - 1) \times (\mbox{columns } - 1), \]
since the degrees of freedom from the two variables multiply in the same way that you multiply the number of rows and columns to find the number of cells. For this example, df = 1. Either the [latex]\chi^2(1)[/latex] table or the [latex]\chi^2[/latex] table tells us the [latex]P[/latex]-value is very close to 0. Thus there is very strong evidence of an association between the type of inhaler and whether a reduction was sustained.

Note that [latex]\sqrt{20.02} = 4.47[/latex], the [latex]z[/latex] statistic we found in Chapter 17 with the pooled sample proportion, so this test for association is identical to the two-sample proportion test using the Normal approximation. The advantage of the chi-square test is that it can be applied to tables with more than two rows or columns.
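The two-way calculation generalises to any number of rows and columns. The following sketch implements the expected-count formula (row total times column total, divided by the overall total) and the degrees of freedom rule, then applies them to the inhaler data. The function names are illustrative assumptions.

```python
def expected_counts(table):
    """Expected counts under independence: row total * column total / total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]

def chi_square_two_way(table):
    """Return the chi-square statistic and its degrees of freedom."""
    expected = expected_counts(table)
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df

# Rows: reduction, no reduction. Columns: nicotine, placebo.
inhaler = [[52, 18], [148, 182]]
statistic, df = chi_square_two_way(inhaler)
print(expected_counts(inhaler), round(statistic, 2), df)
```

The same function can be applied unchanged to larger tables, where the two-sample proportion test no longer applies.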

Pizza Preference and Sex

The table below gives the two-way table of counts of 200 Islanders by pizza preference and sex that we first saw in Chapter 6.

Counts of preferred pizza by sex

  Mushroom Pineapple Prawns Sausage Spinach Total
Female 10 39 17 13 23 102
Male 18 10 13 36 21 98
Total 28 49 30 49 44 200

The table below shows the expected counts if the two variables were independent.

Expected counts of preferred pizza by sex

  Mushroom Pineapple Prawns Sausage Spinach Total
Female 14.3 25.0 15.3 25.0 22.4 102
Male 13.7 24.0 14.7 24.0 21.6 98
Total 28 49 30 49 44 200

The [latex]\chi^2[/latex] statistic is [latex]30.8[/latex] with 4 degrees of freedom (2 rows and 5 columns). From the [latex]\chi^2[/latex] table we find the [latex]P[/latex]-value is less than 0.001, giving substantial evidence from this data to suggest that pizza preference differs between males and females.

Fisher’s Exact Test

For small sample sizes the test based on the [latex]\chi^2[/latex] distribution will usually give a poor approximation to the [latex]P[/latex]-values. An alternative is to use Fisher’s exact test (Glantz, 2002). This procedure enumerates all the possible tables that would be at least as unusual as the one obtained, assuming no association, or simulates the generation of such tables if the number of possibilities is too large. Either method is straightforward but can require a lot of calculation and so is usually left to a computer.
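For a 2 by 2 table the enumeration is short enough to sketch. With the row and column totals held fixed, the count in the top-left cell follows a hypergeometric distribution, and one common two-sided version of the test adds up the probabilities of every table that is no more likely than the one observed. The code below is a minimal illustration of that idea, not a full implementation; the function name and the small example table are hypothetical and do not come from the text.

```python
from math import comb

def fisher_exact_two_sided(table):
    """Two-sided Fisher exact P-value for a 2x2 table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def prob(k):
        # Hypergeometric probability that the top-left cell equals k
        return comb(row1, k) * comb(row2, col1 - k) / comb(n, col1)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Sum over all tables at least as unusual as the observed one
    total = sum(
        prob(k)
        for k in range(lowest, highest + 1)
        if prob(k) <= p_observed * (1 + 1e-9)
    )
    return min(1.0, total)

# Hypothetical small table with all margins equal to 4
print(fisher_exact_two_sided([[3, 1], [1, 3]]))
```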

Simpson’s Paradox

Appleton et al. (1996) surveyed women twenty years after they were part of a study in 1972-1974 on thyroid and heart disease. The following table shows the survival status at the time of the second survey of the 1314 women who had been classified as either a current smoker or as never having smoked in the original survey.

Two-way table of smoking and survival

  Smoking  
Survival Yes No Total
Dead 139 230 369
Alive 443 502 945
Total 582 732 1314

Of the 582 women who smoked, 443 were still alive after twenty years, a survival rate of 76%. Of the 732 women who didn’t smoke, 502 were still alive, a survival rate of 69%. That is interesting: it seems that, for the population these women came from, smoking might actually help survival. A chi-square test of association gives a [latex]P[/latex]-value of 0.003, strong evidence that smoking status and survival are related.

This seems like good news for smokers! But now consider the table below which shows the same data but as a three-way table with an extra variable for age group.

Three-way table of age group, smoking and survival

Age group 18-44 18-44 45-64 45-64 65+ 65+  
Survival / Smoking Yes No Yes No Yes No Total
Dead 19 13 78 52 42 165 369
Alive 269 327 167 147 7 28 945
Total 288 340 245 199 49 193 1314

We can now look at the relationship between smoking status and survival for the different age groups.

  • For 18-44 year olds, 269 out of 288 smokers survived (93%) compared to 327 out of 340 nonsmokers (96%). For this age group it was better to be a nonsmoker.
  • For 45-64 year olds, 167 out of 245 smokers survived (68%) compared to 147 out of 199 nonsmokers (74%). For this age group it was also better to be a nonsmoker.
  • For women 65 and over, 7 out of 49 smokers survived (14%) compared to 28 out of 193 nonsmokers (15%). For this age group it was also slightly better to be a nonsmoker.

So for each age group the survival rate was higher for nonsmokers. That seems odd since above we saw that altogether it was smokers who had the better survival rate. The same data are giving us completely opposite conclusions!
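The reversal can be confirmed by computing the survival rates from the three-way table, both within each age group and after pooling the groups. The sketch below stores each cell as an (alive, total) pair; the data structure and names are illustrative.

```python
# (alive, total) for smokers and nonsmokers in each age group
groups = {
    "18-44": {"smoker": (269, 288), "nonsmoker": (327, 340)},
    "45-64": {"smoker": (167, 245), "nonsmoker": (147, 199)},
    "65+":   {"smoker": (7, 49),    "nonsmoker": (28, 193)},
}

def pooled(status):
    """Add up alive and total counts over all age groups."""
    alive = sum(g[status][0] for g in groups.values())
    total = sum(g[status][1] for g in groups.values())
    return alive, total

def rate(alive, total):
    """Survival rate as a proportion."""
    return alive / total

# Survival rates within each age group
for name, g in groups.items():
    print(name, round(rate(*g["smoker"]), 3), round(rate(*g["nonsmoker"]), 3))

# Survival rates after collapsing over age group
print("All", round(rate(*pooled("smoker")), 3), round(rate(*pooled("nonsmoker")), 3))
```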

This phenomenon is known as Simpson’s paradox. We will leave you to figure out which is the correct conclusion and why the other one gives the opposite result.

Summary

  • The chi-square test is a general procedure for comparing observed and expected counts.
  • For a one-way table the expected counts need to come from some hypothetical distribution.
  • Comparing counts to a uniform distribution is a simple test of randomness.
  • For a two-way table the expected counts come from the null hypothesis of no association between the two variables.
  • Simpson’s paradox is an example where ignoring a variable can dramatically change conclusions from a study.

Exercise 1

Consider the last 900 digits of the human random digits. Are these digits uniformly random?

Exercise 2

This table in the Appendix gives 1800 decimal places of [latex]e[/latex]. Are the decimal digits of [latex]e[/latex] uniformly random?

Exercise 3

This table in the Appendix gives 1800 decimal places of [latex]\pi[/latex]. Are the decimal digits of [latex]\pi[/latex] uniformly random?

Exercise 4

The small town of Shinobi is located near a lake that has been associated with mystical powers. Nathan Yamada, a resident of Shinobi, rolled a die 120 times to give the results shown in the table below. Is there evidence of anything unusual with the dice rolls?

Outcomes of 120 dice rolls in the Island village of Shinobi

65612 64662 32226 35535 22565 25525
16132 12451 33635 66521 35553 41332
42565 21541 35113 55624 55362 65635
14232 11532 52635 56661 54544 44152

Exercise 5

Is there any evidence that the data collected for Exercise 9 of Chapter 11 does not follow a Poisson distribution?

Exercise 6

A table in Chapter 6 gives a three-way table of pizza preference by sex and island. Collapse this data into a two-way table of pizza preference and island. Is there any evidence that pizza preference differs between the two islands?

Exercise 7

Discuss the data in Chapter 22. What is the correct conclusion to make and why does the other table suggest the opposite conclusion?

Exercise 8

Green roofs are becoming a standard way of introducing vegetation in dense urban areas. Fernandez-Cañero et al. (2013) conducted a survey to assess attitudes towards green roof systems. From responses to 450 questionnaires they obtained the data shown in the table below. Is there evidence of an age difference in the interest in green roof systems?

Interest in green roof systems by age group

Age group Under 18 18-25 26-40 Over 40
Interested 136 42 48 75
Not interested 83 18 31 17

Human random digits

45632 68450 63215 64789 62354 56121 33654 12126 44789 50112 35641 13254
46877 46521 11254 45789 64423 65789 65121 21523 65498 95632 45630 12186
65452 64478 96542 36512 45879 86614 23211 25354 40708 96362 13354 68755
21246 57252 01230 65456 98726 32548 96542 36544 21224 31675 98382 01645
82053 46725 64352 16546 53411 54653 82861 54653 60452 56563 74124 61365
67553 42116 56535 21454 62351 07467 53210 34411 32124 56467 67589 54624
22132 12466 72580 86549 79889 98764 52013 62656 05463 21154 84643 79164
58360 43461 25648 70586 32132 54060 96735 64945 46538 64594 65043 62165
04257 46686 44213 74663 46564 57986 93568 21326 45821 26432 80246 42467
63421 54679 47365 03253 61213 45767 67360 03431 26342 46157 28240 64560
84327 20542 32727 32373 16314 32789 21327 09127 13297 15321 41461 16420
45214 65542 57046 44141 44501 47147 68234 26146 50434 23516 46382 46151
46594 67246 15179 73986 73246 18467 97237 53488 74797 94653 15464 95267
34641 67897 97978 46121 21346 45797 91521 34543 14123 03451 32751 32761
76213 76842 73683 72014 00154 07404 17010 19480 76210 07104 46045 40174
00147 90545 71747 47448 07106 70075 15746 72727 07501 10457 10471 24178
10220 47579 46210 76867 51376 15746 12374 41645 32873 79783 14618 42556
42150 64274 06754 71419 87672 99874 62135 16143 45619 78495 64315 12036
43458 77847 31648 54546 41818 79798 46598 45146 12121 30615 44949 74825
36416 54894 97494 94530 32041 49050 74694 16731 45106 06085 05074 61810
58150 64154 51241 41079 78707 40079 15460 24641 87914 24645 27300 41948
46421 46745 42767 23154 68442 16427 35149 18246 15645 12465 16541 24376
54914 87216 45461 24681 94216 24900 97040 60504 05030 60405 12400 64867
91046 84142 70160 16421 97364 76915 61464 35247 97586 45356 45163 25798
83164 51356 48135 60804 50604 08780 98075 02032 06504 25010 20560 87875
77814 54110 74870 14749 74070 45781 07181 48779 87114 00413 21457 46512
49687 46541 32786 15762 13761 65745 16740 01567 84270 76491 84017 94047
59407 44974 51654 74974 97497 79712 04176 75746 54401 46404 46752 28346
41516 07288 58462 52346 75465 98497 72561 52432 64578 52612 40145 76761
24204 65413 12768 21846 31043 54196 84657 31224 68149 84216 45124 63149

These digits were generated by a human asked to type a sequence of 1800 random digits.

Licence


A Portable Introduction to Data Analysis Copyright © 2024 by The University of Queensland is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License, except where otherwise noted.