24 Nonparametric Methods

[latex]\newcommand{\pr}[1]{P(#1)} \newcommand{\var}[1]{\mbox{var}(#1)} \newcommand{\mean}[1]{\mbox{E}(#1)} \newcommand{\sd}[1]{\mbox{sd}(#1)} \newcommand{\Binomial}[3]{#1 \sim \mbox{Binomial}(#2,#3)} \newcommand{\Student}[2]{#1 \sim \mbox{Student}(#2)} \newcommand{\Normal}[3]{#1 \sim \mbox{Normal}(#2,#3)} \newcommand{\Poisson}[2]{#1 \sim \mbox{Poisson}(#2)} \newcommand{\se}[1]{\mbox{se}(#1)} \newcommand{\prbig}[1]{P\left(#1\right)} \newcommand{\degc}{$^{\circ}$C}[/latex]

Nonparametric methods provide an alternative to methods based on the [latex]t[/latex] distribution when the assumptions for those methods are not satisfied. Although they come with their own assumptions, nonparametric tests are typically more robust in the presence of outliers or strong skewness. We start by highlighting the susceptibility of the [latex]t[/latex] methods to even a single outlier, before introducing a number of nonparametric equivalents to the various procedures we have covered so far.

Effect of Outliers

Darwin (1902) published data from an experiment comparing the growth of cross-fertilised and self-fertilised plants, with pairs of plants of the same age grown together in pots to eliminate other factors. The table below gives the differences in heights between the cross- and self-fertilised plants in each of 15 pots, converted from the original units of eighths of an inch into centimetres. The original data table is given in the Appendix.

Differences in heights (cm) between 15 pairs of cross- and self-fertilised plants

15.6 -21.3 2.5 5.1 1.9 7.3 8.9 13.0
4.4 9.2 17.8 7.6 23.8 19.1 -15.2

Darwin’s hypothesis was that cross-fertilised plants would grow more vigorously than those that had been self-fertilised. Thus if [latex]\mu[/latex] was the population mean of the differences then we would test [latex]H_0: \mu = 0[/latex] against the one-sided alternative [latex]H_1: \mu \gt 0[/latex].

The sample mean of the differences is 6.65 cm with standard deviation 11.990 cm. This does not seem like a very big increase in plant growth but the statistic
\[ t_{14} = \frac{6.65 - 0}{11.990/\sqrt{15}} = 2.148 \]
gives a [latex]P[/latex]-value of 0.025, moderate evidence that the cross-fertilised plants are growing taller than their self-fertilised counterparts.

Suppose now that the experimenter mistakenly omitted the decimal place when entering the first observation. The results are repeated in the table below but with the much higher difference given for the first pair.

Differences in heights (cm) between 15 pairs of cross- and self-fertilised plants with outlier

156 -21.3 2.5 5.1 1.9 7.3 8.9 13.0
4.4 9.2 17.8 7.6 23.8 19.1 -15.2

Think about the effect you would expect this to have on the analysis. Darwin wanted to show that the cross-fertilised plants were doing better and we have already found some evidence for this based on the original data. This modified data set now includes a pair for which the cross-fertilised plant did amazingly well, growing a metre and a half taller than its competitor! We would expect to get even stronger evidence now, and indeed the new sample mean is 16.0 cm, suggesting the mean [latex]\mu[/latex] is even further away from 0 than before.

However, the [latex]t[/latex] statistic is now
\[ t_{14} = \frac{16.0 - 0}{40.465/\sqrt{15}} = 1.531, \]
no longer significant at the 5% level. Strangely we now have no evidence that cross-fertilisation is superior, despite including a prime example of its benefits!

The reason for this seemingly paradoxical result is that while the sample mean is increased by the outlying value, the sample standard deviation is inflated even more. The net result is that our standardised statistic is smaller than before, giving the weaker evidence. We say that the outlier has diluted the test results. This is a perfect example of why visualisation of your data is so important before doing any kind of testing.

In this case the experimenter could easily have corrected the value, or even repeated the measurement if there was uncertainty about it. But what if we have outlying values that we cannot justify removing? One way around this problem is to use robust methods for estimating parameters. For example, a trimmed mean is calculated by removing the most extreme observations, such as the smallest 5% and the largest 5%, and finding the regular mean of what is left. However, the distributions of such statistics can be difficult to determine.

An alternative is to use a nonparametric method to analyse the data, one that doesn’t involve using a parameter estimate, like the sample mean, which is susceptible to outliers (Higgins, 2004). We will start by looking at a very simple method in the next section, the sign test, and then continue to some more sophisticated settings.

Sign Test

Returning to the data in the original table (without the outlier), note that 13 out of 15 pairs have the cross-fertilised plant doing better. What would happen if the null hypothesis were true and the fertilisation method had no effect on growth? It is unlikely that the heights would be identical since there is also natural variability in plant growth. However it is reasonable to expect that in about half of the pairs the cross-fertilised plants would do better while in the other half it would be the self-fertilised plants. If the alternative hypothesis were true instead then we would expect the chance of the cross-fertilised plant being taller to be more than half.

If we let [latex]p = \pr{\mbox{cross-fertilised plant taller}}[/latex] then we can make this idea precise with
\[ H_0 : p = 0.5 \mbox{ versus } H_1 : p \gt 0.5. \]
This is a good opportunity to reflect on the basic ideas of hypothesis testing. We want to calculate the probability of observing the data we did assuming that [latex]H_0[/latex] is true. If this were the case then each pair would have a 0.5 chance of having the cross-fertilised plant taller, and so the number, [latex]X[/latex], of pairs where the cross-fertilised plant is taller simply has the Binomial distribution
\[ \Binomial{X}{15}{0.5}. \]
We observed [latex]X = 13[/latex] and so the [latex]P[/latex]-value is the probability of getting a result as extreme or more extreme than 13. This is a one-sided test and we would expect [latex]X[/latex] to be larger if [latex]H_1[/latex] were true, so the [latex]P[/latex]-value is
\begin{eqnarray*}
\pr{X \ge 13} & = & \pr{X = 13} + \pr{X = 14} + \pr{X = 15} \\
& = & 0.003 + 0.000 + 0.000 \\
& = & 0.003.
\end{eqnarray*}
Thus the sign test gives very strong evidence that cross-fertilised plants are doing better than self-fertilised plants. However, being nonparametric, it gives no estimate of how much taller they are.

Now consider the modified data in the previous table where an outlier was introduced. The [latex]t[/latex] test was badly affected by this single value, with the new [latex]P[/latex]-value being only 0.074, but what happens with the sign test? Counting, we find that there are still 13 positive values and so the [latex]P[/latex]-value doesn’t change at all! The sign test is very wasteful of information, but this is precisely what makes it so robust against the effects of outliers.

Note that the null hypothesis for the sign test here was that the 0 difference in plant growth had 50% of values above it and 50% of values below it. This is saying that 0 is the median difference in plant growth, so the sign test can be viewed as a test of the median, just as the [latex]t[/latex] test is a test of the mean. If we let the Greek letter [latex]\eta[/latex] (Moore & McCabe, 1999) be the population median of the growth differences then we could write the test hypotheses as [latex]H_0: \eta = 0[/latex] and [latex]H_1: \eta \gt 0[/latex]. However, we don’t explicitly estimate the population median to test the null hypothesis, and that is why we call this a nonparametric test.

If the hypothesised median, [latex]\eta_0[/latex], were not 0 then rather than counting the positive signs we would count the observations greater than [latex]\eta_0[/latex] for the test statistic.
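Readers following along in software can reproduce the sign test with a short script. Below is a minimal sketch in Python, assuming a recent version of the SciPy library is available; it counts the positive differences and computes the exact Binomial probability (the unrounded version of the 0.003 calculated above).

```python
# Sign test for Darwin's paired differences (cm).
from scipy.stats import binomtest

diffs = [15.6, -21.3, 2.5, 5.1, 1.9, 7.3, 8.9, 13.0,
         4.4, 9.2, 17.8, 7.6, 23.8, 19.1, -15.2]
x = sum(d > 0 for d in diffs)   # 13 pairs with the cross-fertilised plant taller

# Exact P(X >= 13) for X ~ Binomial(15, 0.5), about 0.0037.
result = binomtest(x, n=len(diffs), p=0.5, alternative='greater')
print(x, result.pvalue)
```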

Arbuthnot

Nonparametric tests have a longer history than those involving estimates of means and standard deviations, largely because the latter have complex sampling distributions whereas the former are based on simple counting. The idea of the sign test dates back to at least 1710, when the polymath John Arbuthnot published a study of births in London over the previous 82 years (Arbuthnott, 1710). In each of those 82 years, more males than females had been born. The null hypothesis was that any individual birth was equally likely to be male or female, and hence that in each year it was equally likely that more males or more females were born. Thus if [latex]X[/latex] is the number of years in which there were more males than females then we would assume [latex]\Binomial{X}{82}{0.5}[/latex]. The [latex]P[/latex]-value is then
\[ \pr{X \ge 82} = 0.5^{82} \simeq 0.0000000000000000000000002068, \]
giving very strong evidence that the chance of more males being born than females is greater than 0.5. Arbuthnot calculated this probability and concluded

But it is very improbable (if mere Chance govern’d) that they would never reach as far as the Extremities: But this Event is wisely prevented by the wise Oeconomy of Nature; and to judge of the wisdom of the Contrivance, we must observe that the external Accidents to which Males are subject (who must seek their Food with danger) do make a great havock of them, and that this loss exceeds far that of the other Sex, occasioned by Diseases incident to it, as Experience convinces us. To repair that Loss, provident Nature, by the Disposal of its wise Creator, brings forth more Males than Females; and that in almost a constant proportion.

Signed-Rank Test


For Darwin’s data the [latex]P[/latex]-value we obtained from the sign test ([latex]p = 0.003[/latex]) is much lower and thus gives stronger evidence than the original [latex]t[/latex] test ([latex]p = 0.025[/latex]). However, the sign test may be a little optimistic here. It is rather wasteful since it just looks at the sign of each observation, disregarding the magnitude of the value. Only two pots had the self-fertilised plants doing better, but both did a lot better (by 15.2 and 21.3 cm) than the cross-fertilised plants. The sign test gives the same weight to the positive difference of 1.9 as it does to the negative difference of 21.3.

One way to add a measure of the magnitude of the observations while retaining robustness is to use observation ranks. Wilcoxon (1945) gives a test based on a signed-rank statistic that combines the signs of the differences, as in the sign test, with the ranks of the differences.

Begin by ranking the absolute differences. Give the smallest value (1.9 cm) a rank of 1, the second smallest value (2.5 cm) a rank of 2, and so on up to the largest value (23.8 cm) with a rank of 15. The table below shows these ranks for the above data. Note that although there are only two negative values, they do have quite large ranks.

Ranked absolute differences in lengths between cross- and self-fertilised plants

11 14 2 4 1 5 7 9
3 8 12 6 15 13 10

The signed-rank statistic, [latex]S[/latex], is the sum of the ranks corresponding to positive differences. Here
\[ S = 11 + 2 + 4 + 1 + 5 + 7 + 9 + 3 + 8 + 12 + 6 + 15 + 13 = 96. \]
If the alternative hypothesis were true, we would expect to see large values of [latex]S[/latex]. Thus the [latex]P[/latex]-value is [latex]\pr{S \ge 96}[/latex]. This is a discrete statistic, like a Binomial count, but its distribution when [latex]H_0[/latex] is true is a little more complicated than the Binomial. The following table gives critical values of [latex]S[/latex] for small [latex]n[/latex]. From the table we find that [latex]\pr{S \ge 96}[/latex] is between 0.01 and 0.025, some evidence of a difference but not as strong as the sign test result. Again, this may be a more accurate result since it takes into account the large magnitudes of the negative differences.

Signed-Rank critical values

  Probability [latex]p[/latex]
[latex]n[/latex] 0.25 0.10 0.05 0.025 0.01 0.005 0.001 0.0005 0.0001
2 3
3 5
4 8 10
5 11 13 15
6 15 18 19 21
7 19 23 25 26 28
8 24 28 31 33 35 36
9 30 35 37 40 42 44
10 35 41 45 47 50 52 55
11 42 49 53 56 59 61 65 66
12 49 57 61 65 69 71 76 77
13 56 65 70 74 79 82 87 89
14 65 74 80 84 90 93 99 101 105
15 73 84 90 95 101 105 112 114 118
16 82 94 101 107 113 117 125 128 133
17 92 105 112 119 126 130 139 142 148
18 102 116 124 131 139 144 153 157 163
19 113 128 137 144 153 158 169 172 180
20 124 141 150 158 167 173 184 189 197
21 136 154 164 173 182 189 201 206 214
22 149 167 178 188 198 205 218 223 233
23 162 182 193 203 214 222 236 241 252
24 175 196 209 219 231 239 255 260 272
25 189 212 225 236 249 257 274 280 292
26 203 227 241 253 267 276 293 300 313
27 218 244 259 271 286 295 314 321 335
28 234 261 276 290 305 315 335 342 357
29 250 278 295 309 325 335 356 364 380
30 267 296 314 328 345 356 379 387 404

This table gives [latex]S^{*}[/latex] such that [latex]\pr{S \ge S^{*}} \le p[/latex], where [latex]S[/latex] is a random Wilcoxon signed-rank statistic when the null hypothesis is true. Empty cells indicate that it is not possible to achieve the given probability. In fact it is impossible to get significance at the 5% level when [latex]n \lt 5[/latex].

If [latex]H_0[/latex] is true then it can be shown that
\[ \mean{S} = \frac{n(n+1)}{4}, \]
and
\[ \sd{S} = \sqrt{\frac{n(n+1)(2n+1)}{24}}, \]
where [latex]n[/latex] is the sample size.
These can be used with a Normal distribution to find approximate [latex]P[/latex]-values, particularly for [latex]n \gt 20[/latex]. For [latex]n = 15[/latex], [latex]\mean{S} = 60[/latex] and [latex]\sd{S} = 17.61[/latex], so
\[ \pr{S \ge 96} \simeq \prbig{Z \ge \frac{96 - 60}{17.61}} = \pr{Z \ge 2.04} = 0.021, \]
in agreement with the bounds from the critical values table above.

Methods based on ranks are naturally robust since they ignore the absolute size of observations. As a simple example, if the difference of 23.8 cm had been given as 238 mm by mistake then it would still have the rank 15 and the results would be unchanged.
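As a software sketch, SciPy’s wilcoxon function carries out the signed-rank test directly; with a small sample and no tied absolute differences it can use the exact null distribution of [latex]S[/latex]. In recent SciPy versions the statistic reported for this one-sided alternative is the sum of the positive ranks, here [latex]S = 96[/latex].

```python
# Signed-rank test for Darwin's paired differences (cm).
from scipy.stats import wilcoxon

diffs = [15.6, -21.3, 2.5, 5.1, 1.9, 7.3, 8.9, 13.0,
         4.4, 9.2, 17.8, 7.6, 23.8, 19.1, -15.2]
s, p = wilcoxon(diffs, alternative='greater')
print(s, p)   # S = 96, with an exact P-value of about 0.02
```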

Assumptions for the Signed-Rank Test

While the signed-rank test is nonparametric, it is still based on assumptions. To start with, as always, the observations should be independent of each other. The [latex]P[/latex]-value is then calculated under the assumption that the sign of any rank is equally likely to be positive or negative. This means that the ranks should be distributed evenly on both sides of the hypothesised median. For this to hold, the distribution of the data should be roughly symmetric.

If this assumption holds then the [latex]t[/latex] test could be used anyway, since any symmetric distribution will give an approximately Normal sample mean for even small values of [latex]n[/latex]. In most circumstances it will indeed be preferable to use a [latex]t[/latex] procedure rather than a nonparametric procedure. However, one or two unusual values can have a much bigger effect on the [latex]t[/latex] result than on a method like the signed-rank test.

Rank-Sum Test

Wilcoxon (1945) also describes a test for comparing two independent samples, referred to as the Wilcoxon rank-sum test. It again does this by working with the ranks of the observations. As with the signed-rank test, this makes the rank-sum test resistant to the effects of outliers. A slightly more general rank-sum test was published two years later by Mann and Whitney (1947). The rank-sum test is thus also referred to as the Mann-Whitney test, and software packages vary in the name that they use.

Caffeine and Pulse Rate

To see how the rank-sum test works, consider Alice’s caffeine study one last time. We analysed the data from Chapter 2 with a [latex]t[/latex] test in Chapter 16. The table below shows the observed differences between before and after pulse rates for the 20 subjects, ranked from smallest to largest.

Ranked increases in pulse rate

Caffeinated 16 19 18 14.5 7.5 2 20 13 14.5 17
Decaffeinated 3.5 11 9.5 1 5.5 3.5 5.5 9.5 7.5 12

The smallest increase was -9 bpm, so it gets rank 1 while the second smallest increase was -2 bpm, getting rank 2. Note that this is different to the signed-rank test where we ranked the absolute values and then looked at which were positive or negative. Here we are ranking the whole scale and then comparing the ranks between the two groups.

The next smallest increase, 4 bpm, occurs twice and so we give both of them the average of the ranks they would have had if they were not tied, 3 and 4. This particular tie won’t matter for the test since both observations are in the same group. However, there are later tied values that appear in both groups, and these do change the sum of ranks in each group.

The null hypothesis is that there is no difference between the distributions of the ‘Caffeinated’ and ‘Decaffeinated’ groups. If this was the case then we would expect the 10 ‘Caffeinated’ labels to be scattered randomly over the 20 ranks. We measure where the ‘Caffeinated’ labels are by summing the ranks they appear over. This gives the statistic
\[ W = 16 + 19 + 18 + 14.5 + 7.5 + 2 + 20 + 13 + 14.5 + 17 = 141.5. \]
If the subjects with caffeine tended to have higher increases in pulse rate then they would tend to have higher ranks and so [latex]W[/latex] would tend to be bigger. The [latex]P[/latex]-value is the probability of getting a value as extreme or more extreme, so here we want [latex]\pr{W \ge 141.5}[/latex].

[latex]W[/latex] is a discrete random variable and so it has a distribution, similar to the Binomial, which can be tabulated. The following table gives the critical values needed for common significance levels for [latex]n_1[/latex] and [latex]n_2[/latex] up to 10, and computer software can be used to give exact probabilities for particular values of [latex]W[/latex]. The figure below shows the null distribution of [latex]W[/latex] for data with two groups of sizes [latex]n_1 = 5[/latex] and [latex]n_2 = 5[/latex].

Wilcoxon rank-sum distribution ([latex]n_1 = 5, n_2 =5[/latex]) with Normal approximation

For [latex]n_1 = 10[/latex] and [latex]n_2 = 10[/latex], the Wilcoxon rank-sum table says that we need a value of at least 136 for [latex]W[/latex] in order to get significance at the 1% level. We have [latex]W = 141.5[/latex], so this is again strong evidence that caffeine is producing a higher increase in pulse rate.

Wilcoxon rank-sum critical values

    [latex]n_1[/latex]
[latex]n_2[/latex] [latex]p[/latex] 2 3 4 5 6 7 8 9 10
2 0.100 12 18 24 32 41 50 61 72
0.050 25 33 42 51 62 74
0.010
0.001
3 0.100 14 21 28 36 45 55 67 79
0.050 15 22 29 37 47 57 68 81
0.010 49 60 71 84
0.001
4 0.100 11 17 23 31 40 50 61 72 85
0.050 18 25 33 42 52 63 75 88
0.010 35 44 55 66 78 92
0.001 95
5 0.100 12 19 26 35 44 55 66 78 92
0.050 13 20 28 36 46 57 68 81 94
0.010 30 39 49 60 72 85 99
0.001 76 89 104
6 0.100 14 21 29 38 48 59 71 84 98
0.050 15 22 31 40 50 62 74 87 101
0.010 33 43 54 66 78 92 107
0.001 70 83 97 112
7 0.100 16 23 32 42 52 64 76 90 104
0.050 17 25 34 44 55 66 79 93 108
0.010 27 37 47 59 71 85 99 114
0.001 63 76 90 105 120
8 0.100 17 25 35 45 56 68 81 95 111
0.050 18 27 37 47 59 71 85 99 115
0.010 30 40 51 63 77 91 106 122
0.001 55 68 82 96 112 129
9 0.100 19 28 37 48 60 73 86 101 117
0.050 20 29 40 51 63 76 90 105 121
0.010 32 43 55 68 82 97 112 129
0.001 59 73 88 103 119 137
10 0.100 20 30 40 52 64 77 92 107 123
0.050 22 32 43 54 67 81 96 111 128
0.010 35 47 59 73 87 103 119 136
0.001 50 64 78 93 110 127 145

This table gives [latex]W^{*}[/latex] such that [latex]\pr{W \ge W^{*}} \le p[/latex], where [latex]W[/latex] is a random Wilcoxon rank-sum statistic under the null hypothesis that two groups have the same distributions. Empty cells indicate that it is not possible to achieve the given probability.

Randomisation Revisited

The critical values in the previous table are actually only correct for data sets that are free of ties. Each time we replace two or more different ranks with a single tied rank we are changing the distribution of [latex]W[/latex] slightly. In practice this will only be an issue if many of the ranks were tied or if the value of [latex]W[/latex] is on the borderline of significance. Neither of these is the case with our example and so we can be quite confident that our conclusion is correct. However it is worth reflecting on the similarities between the Wilcoxon rank-sum test and the randomisation test.

Recall our original discussion in Chapter 2 of whether Alice had found evidence that caffeine increases pulse rate. If the null hypothesis was true and there was no difference between her two groups then she had really just made 20 observations of the same effect. The explanation for the observed difference in means of 10.7 bpm was then that it had happened by chance due to the random allocation of the subjects to the groups. We can calculate the [latex]P[/latex]-value, the probability that this could happen by chance, by going through all the possible random allocations and seeing how often it actually does happen. Allocating 20 subjects into two groups of 10 can be done in [latex]{20 \choose 10} = 184756[/latex] ways. We found that of all these only 351 gave a value as unusual as the observed 10.7, a probability of 0.0019. Thus we had very strong evidence against the null hypothesis of no difference, suggesting that caffeine did increase pulse rate.

But this is exactly the same reasoning behind the rank-sum test. Our null hypothesis of no difference means that the 20 ranks could have been assigned to the two groups in any way. The [latex]P[/latex]-value is the probability of finding an allocation where the sum of ranks in the first group is 141.5 or more. We can calculate the exact probability, incorporating the presence of the ties, by counting through the 184756 allocations of these ranks. The full distribution of [latex]W[/latex] is shown in the figure below. There were 379 allocations where [latex]W \ge 141.5[/latex], giving a [latex]P[/latex]-value of 0.0021.

Wilcoxon rank-sum distribution with tied values
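This count can be reproduced by brute force. The sketch below, in Python, enumerates all 184756 allocations of the 20 observed ranks into two groups of 10 and counts those giving a rank sum of at least 141.5.

```python
# Exact randomisation distribution of the rank-sum statistic W.
from itertools import combinations
from math import comb

ranks = [16, 19, 18, 14.5, 7.5, 2, 20, 13, 14.5, 17,    # Caffeinated
         3.5, 11, 9.5, 1, 5.5, 3.5, 5.5, 9.5, 7.5, 12]  # Decaffeinated
w_obs = sum(ranks[:10])   # 141.5

# Choose which 10 of the 20 ranks form the 'Caffeinated' group.
count = sum(sum(group) >= w_obs for group in combinations(ranks, 10))
print(count, count / comb(20, 10))   # 379 allocations, P-value 0.0021
```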

Normal Approximation

The expected value for [latex]W[/latex], if the null hypothesis is true, is easy to understand. The null hypothesis says that the groups are the same and so the rank of any observation is really just a random number between 1 and 20. A rank chosen randomly will be 10.5 on average, in the same way that rolling a die gives 3.5 on average. In general, the expected values of individual ranks will be
\[ \frac{(n_1 + n_2 + 1)}{2}. \]
If [latex]n_1[/latex] corresponds to the group whose ranks we are summing then [latex]W[/latex] is simply the sum of [latex]n_1[/latex] random ranks, each with this expected value, so that
\[ \mean{W} = n_1 \frac{(n_1 + n_2 + 1)}{2}. \]
The standard deviation requires a bit more algebra, but can be shown to be
\[ \sd{W} = \sqrt{\frac{n_1 n_2 (n_1 + n_2 + 1)}{12}}. \]
The figure within the previous example also shows a Normal approximation to the Wilcoxon distribution using
these values for [latex]\mean{W}[/latex] and [latex]\sd{W}[/latex]. For this example, [latex]\mean{W} = 10 \times 10.5 = 105[/latex] and
\[ \sd{W} = \sqrt{\frac{10 \times 10 \times 21}{12}} = 13.23.\]
We can use this in the same way as we used the Normal approximation for Binomial probabilities. Here the Normal approximation gives
\[ \pr{W \ge 141.5} = \prbig{Z \ge \frac{141.5 - 105}{13.23}} = \pr{Z \ge 2.76} = 0.003, \]
very close to the exact value of 0.002.
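The Normal approximation is equally easy to script. A minimal sketch in Python using the formulas above:

```python
# Normal approximation to P(W >= 141.5) for n1 = n2 = 10.
from math import sqrt
from scipy.stats import norm

n1, n2, w = 10, 10, 141.5
mean_w = n1 * (n1 + n2 + 1) / 2                # E(W) = 105
sd_w = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)      # sd(W) = 13.23
print(norm.sf((w - mean_w) / sd_w))            # P(Z >= 2.76) = 0.003
```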

Note that ties don’t change the expected value of [latex]W[/latex] since the total sum of the available ranks stays the same. However the variability does change — the exact standard deviation from all the possible allocations of the tied ranks, shown in the previous figure, is 13.20. Of course this is only slightly different from the untied value of 13.23 and so it will not matter much in practice.

Water Uptake

A table in a Chapter 16 example gives the height of water uptake in 40 celery stalks, 20 of which had the tops of their leaves coated with petroleum jelly. The table below gives the ranks of the combined observations.

Ranked water uptake height

Uncoated 36 30 34 26 31.5 23 29 38 18 22
24 28 40 31.5 35 27 38 25 33 38
Coated 3.5 14 20.5 10 5 1 7.5 12 16 7.5
19 15 6 11 17 13 20.5 2 3.5 9

Let [latex]W[/latex] be the sum of the ranks of the water uptake heights for the coated stalks. Thus
\[ W = 3.5 + 14 + 20.5 + 10 + 5 + 1 + \cdots + 3.5 + 9 = 213. \]
The alternative hypothesis is that the coated stalks will not lift water up as high as the uncoated stalks. Thus we expect [latex]W[/latex] to be small and so the [latex]P[/latex]-value for this test will be [latex]\pr{W \le 213}[/latex]. The Wilcoxon table does not include the critical values for these sample sizes ([latex]n_1 = n_2 = 20[/latex]) so we will need to use the Normal approximation. Using the above formulas we calculate [latex]\mean{W} = 410[/latex] and [latex]\sd{W} = 36.95[/latex], giving
\[ \pr{W \le 213} = \prbig{Z \le \frac{213 - 410}{36.95}} = \pr{Z \le -5.33} \approx 0, \]
very strong evidence that coating the leaves results in lower water uptake.

Assumptions for the Rank-Sum Test

The calculation of the [latex]P[/latex]-value is based on the assumption that the ranks are randomly assigned to the two groups. For this to hold under the null hypothesis, the two groups should have the same distribution for the quantitative variable. Any difference detected by the rank-sum test then suggests a difference between these distributions. If you want it to show specifically that the population medians are different then the other features, the variability and shape of the two distributions, should be the same.

Kruskal-Wallis Test

The simplest method for comparing more than two groups is the Kruskal-Wallis test (Kruskal & Wallis, 1952). It works in a similar way to the rank-sum test, starting off by ranking all of the observations regardless of group and then calculating a test statistic.

There are direct formulas for calculating the test statistic (Daniel, 1990) but there is also a method which relates back to the [latex]F[/latex] test. After ranking all the observations, carry out the usual ANOVA on the ranks, rather than the original data. Instead of doing an [latex]F[/latex] test, we calculate the Kruskal-Wallis statistic

\[ H = \frac{SSG}{MST}, \]

where [latex]MST = SST/(n-1)[/latex] is the mean square for the total variation in the ranks and [latex]n[/latex] is the total number of observations.

We calculate the [latex]P[/latex]-value by looking at the distribution of [latex]H[/latex] if there was no difference in response between the groups. The exact distribution can be calculated for small numbers of groups and small sample sizes. A table at the end of this chapter gives critical values for 3 groups with up to 6 observations in each group, while the table following it gives critical values for 4 groups with up to 4 observations in each group. Exact calculations are impractical for larger experiments and so a [latex]\chi^2[/latex] approximation is usually applied, using the [latex]\chi^2_{\small{\mbox{DFG}}}[/latex] distribution. Significance can then be tested by referring to the [latex]\chi^2[/latex] distribution table.

Oxytocin and Emotion

The following table gives the changes in plasma oxytocin level for the 12 women not in a relationship in the original oxytocin example. We use this smaller sample to illustrate the use of the exact probabilities given in the 3-group Kruskal-Wallis table — we leave the analysis of the full data set, using the [latex]\chi^2[/latex] approximation, to Exercise 5.

Changes in oxytocin level (pg/mL) by stimulus event

Stimulus Oxytocin Change
Sad 0.00 -0.15 -0.11 -0.41
Happy -0.01 0.09 0.04 0.28
Massage 0.62 0.30 0.48 0.54

The first step in the Kruskal-Wallis test is to rank the data across the three groups from lowest to highest, taking the signs into account. The results of this are given in the table below.

Ranked change in oxytocin level by stimulus event

Stimulus Ranked change
Sad 5 2 3 1
Happy 4 7 6 8
Massage 12 9 10 11

From this ranked data we find SST = 143 and SSG = 120.5, so
\[ H = \frac{120.5}{143/11} = 9.27. \]
From the Kruskal-Wallis table, the critical value for a [latex]P[/latex]-value of 0.01 is 7.654, so the [latex]P[/latex]-value for [latex]H=9.27[/latex] is less than 0.01, giving substantial evidence of a difference between the stimulus events in terms of oxytocin change.

For the [latex]\chi^2[/latex] approximation, with 2 degrees of freedom, the [latex]\chi^2[/latex] distribution table gives a [latex]P[/latex]-value close to 0.01. This is somewhat conservative in comparison to the exact value from the Kruskal-Wallis table.
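Both routes to [latex]H[/latex] can be checked in software. The Python sketch below ranks the oxytocin changes, computes [latex]H = SSG/MST[/latex] from an ANOVA on the ranks, and then compares this with SciPy’s kruskal function, whose [latex]P[/latex]-value uses the [latex]\chi^2[/latex] approximation.

```python
# Kruskal-Wallis test for the changes in oxytocin level (pg/mL).
import numpy as np
from scipy.stats import kruskal, rankdata

sad = [0.00, -0.15, -0.11, -0.41]
happy = [-0.01, 0.09, 0.04, 0.28]
massage = [0.62, 0.30, 0.48, 0.54]

ranks = rankdata(np.concatenate([sad, happy, massage]))  # ranks 1 to 12
r_sad, r_happy, r_massage = np.split(ranks, [4, 8])      # back into groups

grand = ranks.mean()
ssg = sum(len(g) * (g.mean() - grand) ** 2
          for g in (r_sad, r_happy, r_massage))          # SSG = 120.5
sst = ((ranks - grand) ** 2).sum()                       # SST = 143
print(ssg / (sst / (len(ranks) - 1)))                    # H = 9.27

h, p = kruskal(sad, happy, massage)
print(h, p)   # H = 9.27, chi-squared P-value of about 0.0097
```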

Spearman’s Rank Correlation

In Chapter 6 we defined the Pearson correlation coefficient, [latex]r[/latex], for the linear relationship between two quantitative variables. This was based on multiplying standardised scores for [latex]x[/latex] and [latex]y[/latex] values, and since it involved means and standard deviations the result is susceptible to the effects of outliers.

The reason why [latex]r[/latex] is formally known as Pearson’s correlation coefficient is to distinguish it from other measures of association. Here we will describe a measure published by Spearman (1904). As with the other methods in this chapter, Spearman’s correlation coefficient is based on ranks, and we will see how it is calculated using an example.

Oxytocin and Age

The figure below shows the basal plasma oxytocin levels and ages of the 12 single women in the oxytocin example. There are two variables now, oxytocin level and age, and we start by ranking these separately. The original values are shown together with their ranks in the following table.

Basal oxytocin by age for 12 single women

Basal oxytocin and age with ranks

Name Age Age rank Basal Basal rank [latex]d_j[/latex]
Katie Sato 64 11 4.40 5 6
Jana Clausen 18 1 4.5 8 -7
Nanako Connolly 60 10 4.17 2 8
Abigail Jones 21 3 4.67 10 -7
Kelly Brown 31 6 4.88 12 -6
Marie Sorensen 55 9 4.41 6 3
Asuka McCarthy 26 5 4.19 3 2
Tyra Carlsen 20 2 4.69 11 -9
Britt Solberg 33 7 4.62 9 -2
Jeneve Bager 79 12 3.92 1 11
Gerda Jensen 25 4 4.44 7 -3
Kaya Solberg 41 8 4.26 4 4

If there were a perfect negative association between the variables then we would expect each person’s two ranks to be far apart. For example, Jeneve is ranked 12 in age but 1 in oxytocin level, while Jana is ranked 1 in age and 8 in oxytocin level. We can measure how close the two rankings are in the usual way, by adding up the squared differences between the ranks. The last column of our table gives the differences [latex]d_j[/latex], so we calculate
\[ \sum d_j^2 = 478. \]
If there was perfect positive correlation then this would be 0. What would happen if there was perfect negative correlation? In that case, the person ranked 12 in one variable would be ranked 1 in the other variable, the person ranked 11 in one would be ranked 2 in the other, and so on. This would give the biggest differences possible between the ranks, and the sum of the squared differences would be
\[ 2\frac{n(n^2-1)}{6}, \]
where here [latex]n = 12[/latex]. We would like Spearman’s correlation to have the same range of values as Pearson’s, between -1 and 1, and so we define
\[ r_S = 1 - \frac{6 \sum d_j^2}{n(n^2 - 1)}. \]
You can convince yourself that this measure has the desired range.

For our example,
\[ r_S = 1 - \frac{6 \times 478}{12(12^2 - 1)} = -0.6713. \]
As seen in the previous figure, this data does not contain any outliers or influential points, so [latex]r_S[/latex] is quite close to the Pearson correlation [latex]r = -0.6618[/latex]. However, suppose the first person had their basal value recorded as “440” instead of “4.40”. Pearson’s correlation becomes [latex]r = 0.3917[/latex], suggesting a positive relationship, while Spearman’s becomes [latex]r_S = -0.3566[/latex], much less affected by the outlier and still suggesting the correct direction.
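These calculations can be verified with SciPy’s spearmanr and pearsonr functions. A minimal sketch using the data from the table above:

```python
# Spearman versus Pearson correlation for the basal oxytocin data.
from scipy.stats import pearsonr, spearmanr

age = [64, 18, 60, 21, 31, 55, 26, 20, 33, 79, 25, 41]
basal = [4.40, 4.50, 4.17, 4.67, 4.88, 4.41,
         4.19, 4.69, 4.62, 3.92, 4.44, 4.26]

rs, _ = spearmanr(age, basal)   # -0.6713
r, _ = pearsonr(age, basal)     # -0.6618
print(rs, r)

basal[0] = 440                  # mistype the first basal value
rs, _ = spearmanr(age, basal)   # -0.3566, direction preserved
r, _ = pearsonr(age, basal)     # 0.3917, sign reversed by the outlier
print(rs, r)
```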

Spearman’s [latex]r_S[/latex] can also be used as a test statistic for the null hypothesis of no association. The following table gives the critical values of the null distribution for small values of [latex]n[/latex]. This null distribution is symmetric so a test for negative association would use the [latex]P[/latex]-value from the right of the corresponding positive value. A two-sided test would simply double the one-sided result.
For example, if our aim had been to use this data to establish whether there was a negative relationship between oxytocin level and age for single women then the [latex]P[/latex]-value for the test would be [latex]\pr{R_S \le -0.6713}[/latex]. This value lies somewhere between 0.01 and 0.025 on the [latex]n=12[/latex] row of the table, giving moderate evidence of a negative association.

Spearman's rank correlation critical values

[latex]n[/latex] [latex]p=[/latex]0.25 0.10 0.05 0.025 0.01 0.005 0.001 0.0005 0.0001
2
3 1.000
4 0.600 1.000 1.000
5 0.500 0.800 0.900 1.000 1.000
6 0.371 0.657 0.829 0.886 0.943 1.000
7 0.321 0.571 0.714 0.821 0.893 0.964
8 0.310 0.524 0.643 0.738 0.833 0.905 0.976 0.976 1.000
9 0.267 0.483 0.600 0.700 0.783 0.833 0.917 0.933 0.967
10 0.248 0.455 0.564 0.648 0.745 0.794 0.879 0.903 0.927
11 0.236 0.427 0.536 0.618 0.709 0.764 0.855 0.873 0.909
12 0.217 0.406 0.503 0.587 0.678 0.734 0.825 0.846 0.881
13 0.209 0.385 0.484 0.560 0.648 0.703 0.797 0.824 0.863
14 0.200 0.367 0.464 0.538 0.626 0.679 0.771 0.802 0.846
15 0.189 0.354 0.446 0.521 0.604 0.657 0.750 0.779 0.829
16 0.182 0.341 0.429 0.503 0.585 0.635 0.729 0.759 0.812
17 0.176 0.328 0.414 0.488 0.566 0.618 0.711 0.743 0.797
18 0.170 0.317 0.401 0.472 0.550 0.600 0.692 0.723 0.781
19 0.165 0.309 0.391 0.460 0.535 0.584 0.675 0.709 0.767
20 0.161 0.299 0.380 0.447 0.522 0.570 0.662 0.693 0.753
21 0.156 0.292 0.370 0.436 0.509 0.556 0.647 0.678 0.739
22 0.152 0.284 0.361 0.425 0.497 0.544 0.633 0.665 0.726
23 0.148 0.278 0.353 0.416 0.486 0.532 0.621 0.652 0.713
24 0.144 0.271 0.344 0.407 0.476 0.521 0.609 0.640 0.702
25 0.142 0.265 0.337 0.398 0.466 0.511 0.597 0.628 0.690
26 0.138 0.259 0.331 0.390 0.457 0.501 0.586 0.618 0.679
27 0.136 0.255 0.324 0.383 0.449 0.492 0.576 0.607 0.668
28 0.133 0.250 0.318 0.375 0.441 0.483 0.567 0.597 0.658
29 0.130 0.245 0.312 0.368 0.433 0.475 0.558 0.588 0.649
30 0.128 0.240 0.306 0.362 0.425 0.467 0.549 0.579 0.640
40 0.110 0.207 0.264 0.313 0.368 0.405 0.479 0.506 0.563
50 0.097 0.184 0.235 0.279 0.329 0.363 0.430 0.456 0.508
60 0.089 0.168 0.214 0.255 0.301 0.331 0.394 0.417 0.467
70 0.082 0.155 0.198 0.235 0.278 0.307 0.365 0.387 0.434
80 0.076 0.145 0.185 0.220 0.260 0.287 0.342 0.363 0.407
90 0.072 0.136 0.174 0.207 0.245 0.271 0.323 0.343 0.385
100 0.068 0.129 0.165 0.197 0.233 0.257 0.307 0.326 0.366

This table gives [latex]r_S^{*}[/latex] such that [latex]\pr{R_S \ge r_S^{*}} \le p[/latex], where [latex]R_S[/latex] is the Spearman's rank correlation from two rankings of [latex]n[/latex] objects when no association is present ([latex]H_0[/latex]). Note that it is impossible to get significance at the 5% level when [latex]n \lt 4[/latex].

For large [latex]n[/latex], the null distribution of [latex]z = r_S \sqrt{n-1}[/latex] is approximately the standard Normal distribution. This can be used instead of the table above for testing with large samples.
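For example, at [latex]n = 30[/latex] the 5% critical value in the table is 0.306, and [latex]0.306\sqrt{29} = 1.65[/latex], the familiar 5% point of the standard Normal distribution. A quick check of this in Python:

```python
# Large-sample Normal approximation for Spearman's rank correlation.
from math import sqrt
from scipy.stats import norm

n, rs = 30, 0.306                 # 5% critical value at n = 30
z = rs * sqrt(n - 1)
print(z, norm.sf(z))              # z = 1.65, P = 0.05
```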

Final Words

A discussion of nonparametric methods is an appropriate way to conclude our story of data analysis and statistical inference. In methods such as the sign test we have revisited the essence of hypothesis testing, a logical argument based on probability models for observed data. We have also seen the importance of visualising data to confirm the assumptions underlying the [latex]t[/latex] tests and [latex]F[/latex] tests, and how the nonparametric methods can be used if suitable transformations of the data cannot be found. The range of methods presented also gives a good overview of the types of data and experimental designs we have analysed in this book.

A notable absence from this chapter has been the calculation of confidence intervals. Such intervals are usually more important than a hypothesis test since they give an idea of how big an effect is, rather than simply whether it exists. Daniel (1990) gives details of methods for nonparametric confidence intervals.

Of course this book has only given a taste of the wide variety of experimental designs and statistical methods that are used in research. The numerous references along the way should give you a start on finding appropriate methods to help you design and analyse your own scientific studies, or to help critically understand the work of others. A strong understanding of the basic principles covered in this book will give you a good basis for your future learning.

We conclude in the following chapter with a reflection on significance and power, illustrated through a gallery of data.

Summary

  • Nonparametric methods can provide a robust alternative to methods based on means and standard deviations.
  • The sign test and signed-rank test are nonparametric equivalents of the one-sample [latex]t[/latex] test (Chapter 15).
  • The Wilcoxon rank-sum test is a nonparametric equivalent of the two-sample [latex]t[/latex] test (Chapter 16).
  • The Kruskal-Wallis test is a nonparametric equivalent of the one-way ANOVA [latex]F[/latex] test (Chapter 19).
  • Spearman’s rank correlation is a nonparametric equivalent of Pearson’s correlation (Chapter 18).

Exercise 1

A study measured the hopping speed of 11 subjects on their left and right legs. The number of hops made in one minute on one leg was counted, ten minutes were then given for rest, and then the number of hops made in one minute on the other leg was counted. Five subjects hopped on their right leg first while the remaining six started with their left leg. The counts are shown in the table below.

Number of hops in one minute

  Subject
Leg 1 2 3 4 5 6 7 8 9 10 11
Left 124 86 98 112 104 190 135 110 78 82 94
Right 110 98 110 110 108 195 125 120 70 80 98

Use the sign test to see whether there is evidence that people hop faster on their right leg. Obtain the exact [latex]P[/latex]-value from the cumulative binomial distribution table.

Exercise 2

Repeat the previous exercise using the signed-rank test instead. Obtain bounds for the [latex]P[/latex]-value from the signed-rank critical values table and also estimate it using a Normal approximation.

Exercise 3

Islander weights appear to have a skewed distribution, as seen in Chapter 3, and so a [latex]t[/latex] test may not be appropriate for working with small samples of weights. Use a rank-sum test instead to see whether there is a difference between male and female weights in the survey data. Compare your results to those from a [latex]t[/latex] test.

Exercise 4

Based on the data in the celery bending example, use a Kruskal-Wallis test to see whether there is a difference in celery bend angle between the three storage conditions. Obtain exact bounds for the [latex]P[/latex]-value from the 3-group Kruskal-Wallis table and also estimate it using the [latex]\chi^2[/latex] approximation.

Exercise 5

Use a Kruskal-Wallis test with the [latex]\chi^2[/latex] approximation to see whether there is a relationship between change in oxytocin level and stimulus event in the full data set from the original oxytocin example.

Exercise 6

Carry out a rank-sum test for the sleep deprivation and internal clock study in Exercise 4 of Chapter 2. Compare your results with the exact [latex]P[/latex]-value from the randomisation test and the results from a two-sample [latex]t[/latex] test.

Exercise 7

Calculate the sampling distribution of Spearman’s rank correlation coefficient for [latex]n = 4[/latex]. Use this to verify the critical values in the table of Spearman’s rank correlation critical values.

Kruskal-Wallis critical values for 3 groups

[latex]n_1[/latex] [latex]n_2[/latex] [latex]n_3[/latex] [latex]p=[/latex]0.25 0.10 0.05 0.025 0.01
2 2 2 3.714 4.571
2 2 3 3.429 4.500 4.714
2 2 4 3.125 4.458 5.333 5.500
2 2 5 3.240 4.373 5.160 6.000 6.533
2 2 6 3.018 4.545 5.345 5.745 6.655
2 3 3 3.139 4.556 5.361 5.556
2 3 4 3.111 4.511 5.444 6.000 6.444
2 3 5 3.022 4.651 5.251 6.004 6.909
2 3 6 2.970 4.682 5.348 6.136 6.970
2 4 4 3.055 4.555 5.455 6.327 7.036
2 4 5 2.914 4.541 5.273 6.068 7.205
2 4 6 3.058 4.494 5.340 6.186 7.340
2 5 5 3.023 4.623 5.338 6.346 7.338
2 5 6 3.033 4.596 5.338 6.196 7.376
2 6 6 3.010 4.438 5.410 6.210 7.467
3 3 3 3.289 4.622 5.600 5.956 7.200
3 3 4 3.027 4.709 5.791 6.155 6.745
3 3 5 2.970 4.533 5.648 6.315 7.079
3 3 6 2.987 4.590 5.615 6.436 7.410
3 4 4 2.932 4.545 5.598 6.394 7.144
3 4 5 2.953 4.549 5.656 6.410 7.445
3 4 6 2.940 4.604 5.610 6.538 7.500
3 5 5 2.936 4.545 5.705 6.549 7.578
3 5 6 2.897 4.535 5.602 6.667 7.590
3 6 6 2.900 4.558 5.625 6.725 7.725
4 4 4 3.038 4.654 5.692 6.615 7.654
4 4 5 2.918 4.668 5.657 6.673 7.760
4 4 6 2.895 4.595 5.681 6.667 7.795
4 5 5 2.931 4.523 5.666 6.760 7.823
4 5 6 2.896 4.523 5.661 6.750 7.936
4 6 6 2.882 4.548 5.724 6.812 8.000
5 5 5 2.960 4.560 5.780 6.740 8.000
5 5 6 2.853 4.547 5.729 6.788 8.028
5 6 6 2.895 4.542 5.765 6.848 8.124
6 6 6 2.889 4.643 5.801 6.889 8.222

This table gives [latex]H^{*}[/latex] such that [latex]\pr{H \ge H^{*}} \le p[/latex], where [latex]H[/latex] is a random Kruskal-Wallis statistic for three groups under the null hypothesis that the groups have the same distributions.

Kruskal-Wallis critical values for 4 groups

[latex]n_1[/latex] [latex]n_2[/latex] [latex]n_3[/latex] [latex]n_4[/latex] [latex]p=[/latex]0.25 0.10 0.05 0.025 0.01
2 2 2 2 4.667 5.667 6.167 6.667 6.667
2 2 2 3 4.378 5.711 6.333 6.978 7.133
2 2 2 4 4.473 5.755 6.545 7.064 7.391
2 2 3 3 4.436 5.745 6.527 7.055 7.727
2 2 3 4 4.348 5.750 6.621 7.326 7.871
2 2 4 4 4.308 5.808 6.731 7.538 8.346
2 3 3 3 4.364 5.879 6.727 7.515 8.015
2 3 3 4 4.327 5.872 6.795 7.564 8.333
2 3 4 4 4.275 5.901 6.874 7.747 8.621
2 4 4 4 4.271 5.914 6.957 7.914 8.871
3 3 3 3 4.385 6.026 7.000 7.667 8.538
3 3 3 4 4.302 6.016 6.984 7.775 8.659
3 3 4 4 4.267 6.019 7.038 7.929 8.876
3 4 4 4 4.254 6.042 7.142 8.079 9.075
4 4 4 4 4.235 6.088 7.235 8.228 9.287

This table gives [latex]H^{*}[/latex] such that [latex]\pr{H \ge H^{*}} \le p[/latex], where [latex]H[/latex] is a random Kruskal-Wallis statistic for four groups under the null hypothesis that the groups have the same distributions.

Licence


A Portable Introduction to Data Analysis Copyright © 2024 by The University of Queensland is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License, except where otherwise noted.
