# 24 Nonparametric Methods

[latex]\newcommand{\pr}[1]{P(#1)} \newcommand{\var}[1]{\mbox{var}(#1)} \newcommand{\mean}[1]{\mbox{E}(#1)} \newcommand{\sd}[1]{\mbox{sd}(#1)} \newcommand{\Binomial}[3]{#1 \sim \mbox{Binomial}(#2,#3)} \newcommand{\Student}[2]{#1 \sim \mbox{Student}(#2)} \newcommand{\Normal}[3]{#1 \sim \mbox{Normal}(#2,#3)} \newcommand{\Poisson}[2]{#1 \sim \mbox{Poisson}(#2)} \newcommand{\se}[1]{\mbox{se}(#1)} \newcommand{\prbig}[1]{P\left(#1\right)} \newcommand{\degc}{$^{\circ}$C}[/latex]

Nonparametric methods provide an alternative to methods based on the [latex]t[/latex] distribution when the assumptions for those methods are not satisfied. Although they come with their own assumptions, nonparametric tests are typically more robust in the presence of outliers or strong skewness. We start by highlighting the susceptibility of the [latex]t[/latex] methods to even a single outlier before then introducing a number of nonparametric equivalents to the various procedures we have covered so far.

# Effect of Outliers

Darwin (1902) published data from an experiment comparing the growth of cross-fertilised and self-fertilised plants, with pairs of plants of the same age grown together in pots to eliminate other factors. The table below gives the differences in heights between the cross- and self-fertilised plants in each of 15 pots, converted from the original units of eighths of an inch into centimetres. The original data table is given in the Appendix.

## Differences in heights (cm) between 15 pairs of cross- and self-fertilised plants

15.6 | -21.3 | 2.5 | 5.1 | 1.9 | 7.3 | 8.9 | 13.0 |

4.4 | 9.2 | 17.8 | 7.6 | 23.8 | 19.1 | -15.2 |

Darwin’s hypothesis was that cross-fertilised plants would grow more vigorously than those that had been self-fertilised. Thus if [latex]\mu[/latex] was the population mean of the differences then we would test [latex]H_0: \mu = 0[/latex] against the one-sided alternative [latex]H_1: \mu \gt 0[/latex].

The sample mean of the differences is 6.65 cm with standard deviation 11.990 cm. This does not seem like a very big increase in plant growth but the statistic

\[ t_{14} = \frac{6.65 - 0}{11.990/\sqrt{15}} = 2.148 \]

gives a [latex]P[/latex]-value of 0.025, moderate evidence that the cross-fertilised plants are growing taller than their self-fertilised counterparts.

Suppose now that the experimenter mistakenly omitted the decimal place when entering the first observation. The results are repeated in the table below but with the much higher difference given for the first pair.

## Differences in heights (cm) between 15 pairs of cross- and self-fertilised plants with outlier

156 | -21.3 | 2.5 | 5.1 | 1.9 | 7.3 | 8.9 | 13.0 |

4.4 | 9.2 | 17.8 | 7.6 | 23.8 | 19.1 | -15.2 |

Think about the effect you would expect this to have on the analysis. Darwin wanted to show that the cross-fertilised plants were doing better and we have already found some evidence for this based on the original data. This modified data set now includes a pair for which the cross-fertilised plant did amazingly well, growing a metre and a half taller than its competitor! We would expect to get even stronger evidence now, and indeed the new sample mean is 16.0 cm, suggesting the mean [latex]\mu[/latex] is even further away from 0 than before.

However, the [latex]t[/latex] statistic is now

\[ t_{14} = \frac{16.0 - 0}{40.465/\sqrt{15}} = 1.531, \]

no longer significant at the 5% level. Strangely we now have no evidence that cross-fertilisation is superior, despite including a prime example of its benefits!

The reason for this seemingly paradoxical result is that while the sample mean is increased by the outlying value, the sample standard deviation is inflated even more. The net result is that our standardised statistic is smaller than before, giving the weaker evidence. We say that the outlier has **diluted** the test results. This is a perfect example of why visualisation of your data is so important before doing any kind of testing.
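The dilution effect is easy to verify numerically. The following is a minimal sketch in Python using only the standard library; the helper name `one_sample_t` is ours, not from any statistics package.

```python
import math
from statistics import mean, stdev

def one_sample_t(sample, mu0=0.0):
    """One-sample t statistic for H0: mu = mu0."""
    n = len(sample)
    return (mean(sample) - mu0) / (stdev(sample) / math.sqrt(n))

# Darwin's 15 height differences (cm), cross- minus self-fertilised
darwin = [15.6, -21.3, 2.5, 5.1, 1.9, 7.3, 8.9, 13.0,
          4.4, 9.2, 17.8, 7.6, 23.8, 19.1, -15.2]

t_clean = one_sample_t(darwin)                 # about 2.15

# Same data but with the decimal point omitted in the first value
t_dirty = one_sample_t([156.0] + darwin[1:])   # about 1.53: the outlier
                                               # inflates the sd more than the mean
```

Although the outlier raises the sample mean, the statistic falls below the one-sided 5% critical value for 14 degrees of freedom, exactly as described above.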

In this case the experimenter could easily have corrected the value, or even repeated the measurement if there was uncertainty about it. But what if we have outlying values for which we cannot justify removing them? One way around this problem is to use **robust** methods for estimating parameters. For example, the **trimmed mean** is calculated by removing the extreme 5% of observations and finding the regular mean of what is left. However, the distributions of such statistics can be difficult to determine.

An alternative is to use a **nonparametric** method to analyse the data, one that doesn’t involve using a parameter estimate, like the sample mean, which is susceptible to outliers (Higgins, 2004). We will start by looking at a very simple method in the next section, the **sign test**, and then continue to some more sophisticated settings.

# Sign Test

Returning to the data in the original table (without the outlier), note that 13 out of 15 pairs have the cross-fertilised plant doing better. What would happen if the null hypothesis were true and the fertilisation method had no effect on growth? It is unlikely that the heights would be identical since there is also natural variability in plant growth. However, it is reasonable to expect that in about half of the pairs the cross-fertilised plant would do better, while in the other half it would be the self-fertilised plant. If the alternative hypothesis were true instead then we would expect the chance of the cross-fertilised plant being taller to be more than half.

If we let [latex]p = \pr{\mbox{cross-fertilised plant taller}}[/latex] then we can make this idea precise with

\[ H_0 : p = 0.5 \mbox{ versus } H_1 : p \gt 0.5. \]

This is a good opportunity to reflect on the basic ideas of hypothesis testing. We want to calculate the probability of observing the data we did assuming that [latex]H_0[/latex] is true. If this were the case then each pair would have a 0.5 chance of having the cross-fertilised plant taller and so counting the number, [latex]X[/latex], of pairs where the cross-fertilised plant is taller simply has the Binomial distribution

\[ \Binomial{X}{15}{0.5}. \]

We observed [latex]X = 13[/latex] and so the [latex]P[/latex]-value is the probability of getting a result as extreme or more extreme than 13. This is a one-sided test and we would expect [latex]X[/latex] to be larger if [latex]H_1[/latex] were true, so the [latex]P[/latex]-value is

\begin{eqnarray*}
\pr{X \ge 13} & = & \pr{X = 13} + \pr{X = 14} + \pr{X = 15} \\
& = & 0.003 + 0.000 + 0.000 \\
& = & 0.003
\end{eqnarray*}

Thus the sign test gives very strong evidence that cross-fertilised plants are doing better than self-fertilised plants. Being nonparametric, however, it gives no estimate of how much taller they are.
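The whole calculation is simple enough to sketch in a few lines of Python; `sign_test_pvalue` is our own helper name, and the convention of dropping observations equal to the hypothesised median is an assumption (Darwin's data has none).

```python
from math import comb

def sign_test_pvalue(diffs, eta0=0.0):
    """Upper-tail sign test P-value: P(X >= x) with X ~ Binomial(n, 0.5),
    where x counts observations above the hypothesised median eta0.
    Observations equal to eta0 are dropped."""
    nonzero = [d for d in diffs if d != eta0]
    n = len(nonzero)
    x = sum(d > eta0 for d in nonzero)
    return sum(comb(n, k) for k in range(x, n + 1)) / 2 ** n

darwin = [15.6, -21.3, 2.5, 5.1, 1.9, 7.3, 8.9, 13.0,
          4.4, 9.2, 17.8, 7.6, 23.8, 19.1, -15.2]

p = sign_test_pvalue(darwin)   # exactly 121/32768, about 0.0037
```

The exact sum is 121/32768; the 0.003 above comes from rounding each term before summing.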

Now consider the modified data in the previous table where an outlier was introduced. The [latex]t[/latex] test was badly affected by this single value, the new [latex]P[/latex]-value being only 0.074, but what happens with the sign test? Counting, we find that there are still 13 positive values and so the [latex]P[/latex]-value doesn’t change at all! Precisely because it discards so much information, the sign test is incredibly robust against the effects of outliers.

Note that the null hypothesis for the sign test here was that the 0 difference in plant growth had 50% of values above it and 50% of values below it. This is saying that 0 is the **median** difference in plant growth, so the sign test can be viewed as a test of the median, just as the [latex]t[/latex] test is a test of the mean. If we let the Greek letter [latex]\eta[/latex] (Moore & McCabe, 1999) be the population median of the growth differences then we could write the test hypotheses as [latex]H_0: \eta = 0[/latex] and [latex]H_1: \eta \gt 0[/latex]. However, we don’t explicitly estimate the population median to test the null hypothesis, and that is why we call this a nonparametric test.

If the hypothesised median, [latex]\eta_0[/latex], was not 0 then rather than counting the positive signs for the test statistic we count the observations that were greater than [latex]\eta_0[/latex].

## Arbuthnot

Nonparametric tests have a longer history than those involving estimates of means and standard deviations, largely because the latter have complex sampling distributions whereas the former are based on simple counting. The idea of the sign test dates back to at least 1710, when the polymath John Arbuthnot published a study of births in London over the previous 82 years (Arbuthnott, 1710). In each of those 82 years, there had been more males born than females. The null hypothesis was that each individual birth is equally likely to be male or female, and hence that in any given year more males or more females are equally likely to be born. Thus if [latex]X[/latex] is the number of years in which there were more males than females then we would assume [latex]\Binomial{X}{82}{0.5}[/latex]. The [latex]P[/latex]-value is then

\[ \pr{X \ge 82} = 0.5^{82} \simeq 0.0000000000000000000000002068, \]

giving very strong evidence that the chance of more males being born than females is greater than 0.5. Arbuthnot calculated this probability and concluded

But it is very improbable (if mere Chance govern’d) that they would never reach as far as the Extremities: But this Event is wisely prevented by the wise Oeconomy of Nature; and to judge of the wisdom of the Contrivance, we must observe that the external Accidents to which Males are subject (who must seek their Food with danger) do make a great havock of them, and that this loss exceeds far that of the other Sex, occasioned by Diseases incident to it, as Experience convinces us. To repair that Loss, provident Nature, by the Disposal of its wise Creator, brings forth more Males than Females; and that in almost a constant proportion.

# Signed-Rank Test

For Darwin’s data the [latex]P[/latex]-value we obtained from the sign test ([latex]p = 0.003[/latex]) is much lower, and thus gives more significant evidence, than the original [latex]t[/latex] test ([latex]p = 0.025[/latex]). However, the sign test may be a little optimistic here. It is rather wasteful since it just looks at the sign of each observation, disregarding the magnitude of the value. Only two pots had the self-fertilised plant doing better, but both did a lot better (15.2 and 21.3 cm) than the cross-fertilised plants. The sign test gives the same weight to the positive difference of 1.9 as it does to the negative difference of 21.3.

One way to add a measure of the magnitude of the observations while retaining robustness is to use observation **ranks**. Wilcoxon (1945) gives a test based on a **signed-rank** statistic that combines the signs of the differences, as in the sign test, with the ranks of the differences.

Begin by ranking the absolute differences. Give the smallest value (1.9 cm) a rank of 1, the second smallest value (2.5 cm) a rank of 2, and so on up to the largest value (23.8 cm) with a rank of 15. The table below shows these ranks for the above data. Note that although there are only two negative values, they do have quite large ranks.

## Ranked absolute differences in lengths between cross- and self-fertilised plants

11 | 14 | 2 | 4 | 1 | 5 | 7 | 9 |

3 | 8 | 12 | 6 | 15 | 13 | 10 |

The signed-rank statistic, [latex]S[/latex], is the sum of the ranks corresponding to positive differences. Here

\[ S = 11 + 2 + 4 + 1 + 5 + 7 + 9 + 3 + 8 + 12 + 6 + 15 + 13 = 96. \]

If the alternative hypothesis were true, we would expect to see large values of [latex]S[/latex]. Thus the [latex]P[/latex]-value is [latex]\pr{S \ge 96}[/latex]. This is a discrete statistic, like a Binomial count, but its distribution when [latex]H_0[/latex] is true is a little more complicated than the Binomial. The following table gives critical values of [latex]S[/latex] for small [latex]n[/latex]. From the table we find that [latex]\pr{S \ge 96}[/latex] is between 0.01 and 0.025, some evidence of a difference but not as strong as the sign test result. Again, this may be a more accurate result since it takes into account the large magnitudes of the negative differences.

## Signed-Rank critical values

[latex]n[/latex] | [latex]p[/latex] = 0.25 | 0.10 | 0.05 | 0.025 | 0.01 | 0.005 | 0.001 | 0.0005 | 0.0001 |
---|---|---|---|---|---|---|---|---|---|
2 | 3 | | | | | | | | |
3 | 5 | | | | | | | | |
4 | 8 | 10 | | | | | | | |
5 | 11 | 13 | 15 | | | | | | |
6 | 15 | 18 | 19 | 21 | | | | | |
7 | 19 | 23 | 25 | 26 | 28 | | | | |
8 | 24 | 28 | 31 | 33 | 35 | 36 | | | |
9 | 30 | 35 | 37 | 40 | 42 | 44 | | | |
10 | 35 | 41 | 45 | 47 | 50 | 52 | 55 | | |
11 | 42 | 49 | 53 | 56 | 59 | 61 | 65 | 66 | |
12 | 49 | 57 | 61 | 65 | 69 | 71 | 76 | 77 | |
13 | 56 | 65 | 70 | 74 | 79 | 82 | 87 | 89 | |
14 | 65 | 74 | 80 | 84 | 90 | 93 | 99 | 101 | 105 |
15 | 73 | 84 | 90 | 95 | 101 | 105 | 112 | 114 | 118 |
16 | 82 | 94 | 101 | 107 | 113 | 117 | 125 | 128 | 133 |
17 | 92 | 105 | 112 | 119 | 126 | 130 | 139 | 142 | 148 |
18 | 102 | 116 | 124 | 131 | 139 | 144 | 153 | 157 | 163 |
19 | 113 | 128 | 137 | 144 | 153 | 158 | 169 | 172 | 180 |
20 | 124 | 141 | 150 | 158 | 167 | 173 | 184 | 189 | 197 |
21 | 136 | 154 | 164 | 173 | 182 | 189 | 201 | 206 | 214 |
22 | 149 | 167 | 178 | 188 | 198 | 205 | 218 | 223 | 233 |
23 | 162 | 182 | 193 | 203 | 214 | 222 | 236 | 241 | 252 |
24 | 175 | 196 | 209 | 219 | 231 | 239 | 255 | 260 | 272 |
25 | 189 | 212 | 225 | 236 | 249 | 257 | 274 | 280 | 292 |
26 | 203 | 227 | 241 | 253 | 267 | 276 | 293 | 300 | 313 |
27 | 218 | 244 | 259 | 271 | 286 | 295 | 314 | 321 | 335 |
28 | 234 | 261 | 276 | 290 | 305 | 315 | 335 | 342 | 357 |
29 | 250 | 278 | 295 | 309 | 325 | 335 | 356 | 364 | 380 |
30 | 267 | 296 | 314 | 328 | 345 | 356 | 379 | 387 | 404 |

This table gives [latex]S^{*}[/latex] such that [latex]\pr{S \ge S^{*}} \le p[/latex], where [latex]S[/latex] is a random Wilcoxon signed-rank statistic when the null hypothesis is true. Empty cells indicate that it is not possible to achieve the given probability. In fact it is impossible to get significance at the 5% level when [latex]n \lt 5[/latex].
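For small [latex]n[/latex] the exact null distribution behind these critical values can be built directly: when [latex]n = 15[/latex] there are only [latex]2^{15} = 32768[/latex] equally likely sign patterns under [latex]H_0[/latex]. A Python sketch (standard library only; helper names are ours; assumes no zero or tied absolute differences, which holds for Darwin's data):

```python
def signed_rank_stat(diffs):
    """S = sum of the ranks of |d| over the positive differences."""
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    return sum(rank for rank, i in enumerate(order, start=1) if diffs[i] > 0)

def exact_upper_p(s_obs, n):
    """P(S >= s_obs) under H0, by tabulating all 2^n equally likely sign patterns."""
    counts = {0: 1}                       # number of patterns giving each value of S
    for rank in range(1, n + 1):
        new = {}
        for s, c in counts.items():
            for t in (s, s + rank):       # this rank's difference negative / positive
                new[t] = new.get(t, 0) + c
        counts = new
    return sum(c for s, c in counts.items() if s >= s_obs) / 2 ** n

darwin = [15.6, -21.3, 2.5, 5.1, 1.9, 7.3, 8.9, 13.0,
          4.4, 9.2, 17.8, 7.6, 23.8, 19.1, -15.2]

S = signed_rank_stat(darwin)   # 96
p = exact_upper_p(S, 15)       # between 0.01 and 0.025, consistent with the table
```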

If [latex]H_0[/latex] is true then it can be shown that

\[ \mean{S} = \frac{n(n+1)}{4}, \]

and

\[ \sd{S} = \sqrt{\frac{n(n+1)(2n+1)}{24}}, \]

where [latex]n[/latex] is the sample size.

These can be used with a Normal distribution to find approximate [latex]P[/latex]-values, particularly for [latex]n \gt 20[/latex]. For [latex]n = 15[/latex], [latex]\mean{S} = 60[/latex] and [latex]\sd{S} = 17.61[/latex], so

\[ \pr{S \ge 96} \simeq \prbig{Z \ge \frac{96 - 60}{17.61}} = \pr{Z \ge 2.04} = 0.021, \]

in agreement with the range found from the previous table of critical values.
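This Normal approximation is easy to reproduce; a sketch using only the standard library (the function name is ours):

```python
import math

def signed_rank_normal_p(s_obs, n):
    """Approximate P(S >= s_obs) using the null mean and sd of S."""
    mu = n * (n + 1) / 4
    sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (s_obs - mu) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2))  # upper tail of the standard Normal

p = signed_rank_normal_p(96, 15)   # about 0.02
```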

Methods based on ranks are naturally robust since they ignore the absolute size of observations. As a simple example, if the difference of 23.8 cm had been given as 238 mm by mistake then it would still have the rank 15 and the results would be unchanged.

## Assumptions for the Signed-Rank Test

While the signed-rank test is nonparametric, it is still based on assumptions. To start with, as always, the observations should be independent of each other. The [latex]P[/latex]-value is then calculated under the assumption that the sign of any rank is equally likely to be positive or negative. This means that the ranks should be distributed evenly on both sides of the hypothesised median. For this to hold, the distribution of the data should be roughly symmetric.

If this assumption holds then the [latex]t[/latex] test could be used anyway, since any symmetric distribution will give an approximately Normal sample mean for even small values of [latex]n[/latex]. In most circumstances it will indeed be preferable to use a [latex]t[/latex] procedure rather than a nonparametric procedure. However, one or two unusual values can have a much bigger effect on the [latex]t[/latex] result than on a method like the signed-rank test.

# Rank-Sum Test

Wilcoxon (1945) also describes a test for comparing two independent samples, referred to as the **Wilcoxon rank-sum test**. It again does this by working with the ranks of the observations. As with the signed-rank test, this makes the rank-sum test resistant to the effects of outliers. A slightly more general rank-sum test was published two years later by Mann and Whitney (1947). The rank-sum test is thus also referred to as the **Mann-Whitney test**, and software packages vary in the name that they use.

Caffeine and Pulse Rate

To see how the rank sum test works, consider Alice’s caffeine study one last time. We analysed the data from Chapter 2 with a [latex]t[/latex] test in Chapter 16. The table below shows the observed differences between before and after pulse rates for the 20 subjects, ranked from smallest to largest.

## Ranked increases in pulse rate

Caffeinated | 16 | 19 | 18 | 14.5 | 7.5 | 2 | 20 | 13 | 14.5 | 17 |

Decaffeinated | 3.5 | 11 | 9.5 | 1 | 5.5 | 3.5 | 5.5 | 9.5 | 7.5 | 12 |

The smallest increase was -9 bpm, so it gets rank 1 while the second smallest increase was -2 bpm, getting rank 2. Note that this is different to the signed-rank test where we ranked the absolute values and then looked at which were positive or negative. Here we are ranking the whole scale and then comparing the ranks between the two groups.

The next smallest increase, 4 bpm, occurs twice and so we give both of them the average of the ranks that they would have had if they were not tied, 3 and 4. This actually won’t matter for the test since both observations are in the same group. However there are also later tied values that appear in both groups. These will change the sum of ranks in each group.

The null hypothesis is that there is no difference between the distributions of the ‘Caffeinated’ and ‘Decaffeinated’ groups. If this was the case then we would expect the 10 ‘Caffeinated’ labels to be scattered randomly over the 20 ranks. We measure where the ‘Caffeinated’ labels are by summing the ranks they appear over. This gives the statistic

\[ W = 16 + 19 + 18 + 14.5 + 7.5 + 2 + 20 + 13 + 14.5 + 17 = 141.5. \]

If the subjects with caffeine tended to have higher increases in pulse rate then they would tend to have higher ranks and so [latex]W[/latex] would tend to be bigger. The [latex]P[/latex]-value is the probability of getting a value as extreme or more extreme, so here we want [latex]\pr{W \ge 141.5}[/latex].

[latex]W[/latex] is a discrete random variable and so it has a distribution, similar to the Binomial, which can be tabulated. The following table gives the critical values needed for common significance levels for [latex]n_1[/latex] and [latex]n_2[/latex] up to 10, and computer software can be used to give exact probabilities for particular values of [latex]W[/latex]. The figure below shows the null distribution of [latex]W[/latex] for data with two groups of sizes [latex]n_1 = 5[/latex] and [latex]n_2 = 5[/latex].

For [latex]n_1 =10[/latex] and [latex]n_2 = 10[/latex], the Wilcoxon rank-sum table says that we need a value of at least 136 for [latex]W[/latex] in order to get significance at the 1% level. We have [latex]W = 141.5[/latex], so this is again strong evidence that caffeine is producing a higher increase in pulse rate.

## Wilcoxon rank-sum critical values

[latex]n_2[/latex] | [latex]p[/latex] | [latex]n_1[/latex] = 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
---|---|---|---|---|---|---|---|---|---|---|
2 | 0.100 | | 12 | 18 | 24 | 32 | 41 | 50 | 61 | 72 |
2 | 0.050 | | | | 25 | 33 | 42 | 51 | 62 | 74 |
2 | 0.010 | | | | | | | | | |
2 | 0.001 | | | | | | | | | |
3 | 0.100 | | 14 | 21 | 28 | 36 | 45 | 55 | 67 | 79 |
3 | 0.050 | | 15 | 22 | 29 | 37 | 47 | 57 | 68 | 81 |
3 | 0.010 | | | | | | 49 | 60 | 71 | 84 |
3 | 0.001 | | | | | | | | | |
4 | 0.100 | 11 | 17 | 23 | 31 | 40 | 50 | 61 | 72 | 85 |
4 | 0.050 | | 18 | 25 | 33 | 42 | 52 | 63 | 75 | 88 |
4 | 0.010 | | | | 35 | 44 | 55 | 66 | 78 | 92 |
4 | 0.001 | | | | | | | | | 95 |
5 | 0.100 | 12 | 19 | 26 | 35 | 44 | 55 | 66 | 78 | 92 |
5 | 0.050 | 13 | 20 | 28 | 36 | 46 | 57 | 68 | 81 | 94 |
5 | 0.010 | | | 30 | 39 | 49 | 60 | 72 | 85 | 99 |
5 | 0.001 | | | | | | | 76 | 89 | 104 |
6 | 0.100 | 14 | 21 | 29 | 38 | 48 | 59 | 71 | 84 | 98 |
6 | 0.050 | 15 | 22 | 31 | 40 | 50 | 62 | 74 | 87 | 101 |
6 | 0.010 | | | 33 | 43 | 54 | 66 | 78 | 92 | 107 |
6 | 0.001 | | | | | | 70 | 83 | 97 | 112 |
7 | 0.100 | 16 | 23 | 32 | 42 | 52 | 64 | 76 | 90 | 104 |
7 | 0.050 | 17 | 25 | 34 | 44 | 55 | 66 | 79 | 93 | 108 |
7 | 0.010 | | 27 | 37 | 47 | 59 | 71 | 85 | 99 | 114 |
7 | 0.001 | | | | | 63 | 76 | 90 | 105 | 120 |
8 | 0.100 | 17 | 25 | 35 | 45 | 56 | 68 | 81 | 95 | 111 |
8 | 0.050 | 18 | 27 | 37 | 47 | 59 | 71 | 85 | 99 | 115 |
8 | 0.010 | | 30 | 40 | 51 | 63 | 77 | 91 | 106 | 122 |
8 | 0.001 | | | | 55 | 68 | 82 | 96 | 112 | 129 |
9 | 0.100 | 19 | 28 | 37 | 48 | 60 | 73 | 86 | 101 | 117 |
9 | 0.050 | 20 | 29 | 40 | 51 | 63 | 76 | 90 | 105 | 121 |
9 | 0.010 | | 32 | 43 | 55 | 68 | 82 | 97 | 112 | 129 |
9 | 0.001 | | | | 59 | 73 | 88 | 103 | 119 | 137 |
10 | 0.100 | 20 | 30 | 40 | 52 | 64 | 77 | 92 | 107 | 123 |
10 | 0.050 | 22 | 32 | 43 | 54 | 67 | 81 | 96 | 111 | 128 |
10 | 0.010 | | 35 | 47 | 59 | 73 | 87 | 103 | 119 | 136 |
10 | 0.001 | | | 50 | 64 | 78 | 93 | 110 | 127 | 145 |

This table gives [latex]W^{*}[/latex] such that [latex]\pr{W \ge W^{*}} \le p[/latex], where [latex]W[/latex] is a random Wilcoxon rank-sum statistic under the null hypothesis that two groups have the same distributions. Empty cells indicate that it is not possible to achieve the given probability.

## Randomisation Revisited

The critical values in the previous table are actually only correct for data sets that are free of ties. Each time we replace two or more different ranks with a single tied rank we are changing the distribution of [latex]W[/latex] slightly. In practice this will only be an issue if many of the ranks were tied or if the value of [latex]W[/latex] is on the borderline of significance. Neither of these is the case with our example and so we can be quite confident that our conclusion is correct. However it is worth reflecting on the similarities between the Wilcoxon rank-sum test and the **randomisation test**.

Recall our original discussion in Chapter 2 of whether Alice had found evidence that caffeine increases pulse rate. If the null hypothesis was true and there was no difference between her two groups then she had really just made 20 observations of the same effect. The explanation for the observed difference in means of 10.7 bpm was then that it had happened by chance due to the random allocation of the subjects to the groups. We can calculate the [latex]P[/latex]-value, the probability that this could happen by chance, by going through all the possible random allocations and seeing how often it actually does happen. Allocating 20 subjects into two groups of 10 can be done in [latex]{20 \choose 10} = 184756[/latex] ways. We found that of all these only 351 gave a value as unusual as the observed 10.7, a probability of 0.0019. Thus we had very strong evidence against the null hypothesis of no difference, suggesting that caffeine did increase pulse rate.

But this is exactly the same reasoning behind the rank-sum test. Our null hypothesis of no difference means that the 20 ranks could have been assigned to the two groups in any way. The [latex]P[/latex]-value is the probability of finding an allocation where the sum of ranks in the first group is 141.5 or more. We can calculate the exact probability, incorporating the presence of the ties, by counting through the 184756 allocations of these ranks. The full distribution of [latex]W[/latex] is shown in the figure below. There were 379 allocations where [latex]W \ge 141.5[/latex], giving a [latex]P[/latex]-value of 0.0021.
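This enumeration is small enough to reproduce directly. A Python sketch using the ranks from the earlier table (standard library only; variable names are ours):

```python
from itertools import combinations
from math import comb

# Ranks of the 20 pulse-rate increases, with ties already averaged
caffeinated   = [16, 19, 18, 14.5, 7.5, 2, 20, 13, 14.5, 17]
decaffeinated = [3.5, 11, 9.5, 1, 5.5, 3.5, 5.5, 9.5, 7.5, 12]
all_ranks = caffeinated + decaffeinated

W_obs = sum(caffeinated)   # 141.5

# Count, over every possible allocation of 10 of the 20 ranks to the
# 'Caffeinated' group, how many give a rank sum at least as large as W_obs
extreme = sum(1 for alloc in combinations(all_ranks, 10)
              if sum(alloc) >= W_obs)
p_value = extreme / comb(20, 10)   # about 0.002
```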

## Normal Approximation

The expected value for [latex]W[/latex], if the null hypothesis is true, is easy to understand. The null hypothesis says that the groups are the same and so the rank of any observation is really just a random number between 1 and 20. A rank chosen randomly will be 10.5 on average, in the same way that rolling a die gives 3.5 on average. In general, the expected values of individual ranks will be

\[ \frac{(n_1 + n_2 + 1)}{2}. \]

If [latex]n_1[/latex] corresponds to the group whose ranks we are summing then [latex]W[/latex] is simply the sum of [latex]n_1[/latex] random ranks, each with this expected value, so that

\[ \mean{W} = n_1 \frac{(n_1 + n_2 + 1)}{2}. \]

The standard deviation requires a bit more algebra, but can be shown to be

\[ \sd{W} = \sqrt{\frac{n_1 n_2 (n_1 + n_2 + 1)}{12}}. \]

The figure within the previous example also shows a Normal approximation to the Wilcoxon distribution using these values for [latex]\mean{W}[/latex] and [latex]\sd{W}[/latex]. For this example, [latex]\mean{W} = 10 \times 10.5 = 105[/latex] and

\[ \sd{W} = \sqrt{\frac{10 \times 10 \times 21}{12}} = 13.23.\]

We can use this in the same way as we used the Normal approximation for Binomial probabilities. Here the Normal approximation gives

\[ \pr{W \ge 141.5} = \prbig{Z \ge \frac{141.5 - 105}{13.23}} = \pr{Z \ge 2.76} = 0.003, \]

very close to the exact value of 0.002.

Note that ties don’t change the expected value of [latex]W[/latex] since the total sum of the available ranks stays the same. However the variability does change — the exact standard deviation from all the possible allocations of the tied ranks, shown in the previous figure, is 13.20. Of course this is only slightly different from the untied value of 13.23 and so it will not matter much in practice.

Water Uptake

A table in a Chapter 16 example gives the height of water uptake in 40 celery stalks, 20 of which had the tops of their leaves coated with petroleum jelly. The table below gives the ranks of the combined observations.

## Ranked water uptake height

Uncoated | 36 | 30 | 34 | 26 | 31.5 | 23 | 29 | 38 | 18 | 22 |

24 | 28 | 40 | 31.5 | 35 | 27 | 38 | 25 | 33 | 38 | |

Coated | 3.5 | 14 | 20.5 | 10 | 5 | 1 | 7.5 | 12 | 16 | 7.5 |

19 | 15 | 6 | 11 | 17 | 13 | 20.5 | 2 | 3.5 | 9 |

Let [latex]W[/latex] be the sum of the ranks of the water uptake heights for the coated stalks. Thus

\[ W = 3.5 + 14 + 20.5 + 10 + 5 + 1 + \cdots + 3.5 + 9 = 213. \]

The alternative hypothesis is that the coated stalks will not lift water as high as the uncoated stalks. Thus we expect [latex]W[/latex] to be small and so the [latex]P[/latex]-value for this test will be [latex]\pr{W \le 213}[/latex]. The Wilcoxon table does not include the critical values for these sample sizes ([latex]n_1 = n_2 = 20[/latex]) so we will need to use the Normal approximation. Using the above formulas, with a small adjustment for the tied ranks, we calculate [latex]\mean{W} = 410[/latex] and [latex]\sd{W} = 36.95[/latex], giving

\[ \pr{W \le 213} = \prbig{Z \le \frac{213 - 410}{36.95}} = \pr{Z \le -5.33} \approx 0, \]

very strong evidence that coating the leaves results in lower water uptake.
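This lower-tail calculation can be sketched as follows (function name ours). Note the sketch uses the untied standard-deviation formula, giving 36.97 rather than the tie-adjusted 36.95; the difference is negligible here.

```python
import math

def rank_sum_normal_lower_p(w_obs, n1, n2):
    """Approximate lower-tail P(W <= w_obs), ignoring any tie adjustment."""
    mu = n1 * (n1 + n2 + 1) / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w_obs - mu) / sigma
    return 0.5 * math.erfc(-z / math.sqrt(2))  # lower tail of the standard Normal

p = rank_sum_normal_lower_p(213, 20, 20)   # effectively zero
```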

## Assumptions for the Rank-Sum Test

The calculation of the [latex]P[/latex]-value is based on the assumption that the ranks are randomly assigned to the two groups, which requires the two groups to have the **same distribution** for the quantitative variable under the null hypothesis. Any difference detected by the rank-sum test thus suggests a difference between these distributions. If you want it to show that the population medians are different then the other features, the variability and shape of the two distributions, should be the same.

# Kruskal-Wallis Test

The simplest method for comparing more than two groups is the **Kruskal-Wallis test** (Kruskal & Wallis, 1952). It works in a similar way to the rank-sum test, starting off by ranking all of the observations regardless of group and then calculating a test statistic.

There are direct formulas for calculating the test statistic (Daniel, 1990) but there is also a method which relates back to the [latex]F[/latex] test. After ranking all the observations, carry out the usual ANOVA on the ranks, rather than the original data. Instead of doing an [latex]F[/latex] test, we calculate the Kruskal-Wallis statistic

\[ H = \frac{SSG}{MST}. \]

We calculate the [latex]P[/latex]-value by looking at the distribution of [latex]H[/latex] if there was no difference in response between the groups. The exact distribution can be calculated for small numbers of groups and small sample sizes. A table at the end of this chapter gives critical values for 3 groups with up to 6 observations in each group, while the table following it gives critical values for 4 groups with up to 4 observations in each group. Exact calculations are impractical for larger experiments and so a [latex]\chi^2[/latex] approximation is usually applied, using the [latex]\chi^2_{\small{\mbox{DFG}}}[/latex] distribution. Significance can then be tested by referring to the [latex]\chi^2[/latex] distribution table.

Oxytocin and Emotion

The following table gives the changes in plasma oxytocin level for the 12 women not in a relationship in the original oxytocin example. We use this smaller sample to illustrate the use of the exact probabilities given in the 3-group Kruskal-Wallis table — we leave the analysis of the full data set, using the [latex]\chi^2[/latex] approximation, to Exercise 5.

## Changes in oxytocin level (pg/mL) by stimulus event

Stimulus | Oxytocin Change | |||
---|---|---|---|---|

Sad | 0.00 | -0.15 | -0.11 | -0.41 |

Happy | -0.01 | 0.09 | 0.04 | 0.28 |

Massage | 0.62 | 0.30 | 0.48 | 0.54 |

The first step in the Kruskal-Wallis test is to rank the data across the three groups, from lowest to highest, taking the signs into account (so the most negative change receives rank 1). The results of this are given in the table below.

## Ranked change in oxytocin level by stimulus event

Stimulus | Ranked change | |||
---|---|---|---|---|

Sad | 5 | 2 | 3 | 1 |

Happy | 4 | 7 | 6 | 8 |

Massage | 12 | 9 | 10 | 11 |

From this ranked data we find SST = 143 and SSG = 120.5, so

\[ H = \frac{120.5}{143/11} = 9.27. \]

From the Kruskal-Wallis table, the critical value for a [latex]P[/latex]-value of 0.01 is 7.654, so the [latex]P[/latex]-value for [latex]H=9.27[/latex] is less than 0.01, giving substantial evidence of a difference between the stimulus events in terms of oxytocin change.

For the [latex]\chi^2[/latex] approximation, with 2 degrees of freedom, the [latex]\chi^2[/latex] distribution table gives a [latex]P[/latex]-value close to 0.01. This is somewhat conservative in comparison to the exact value from the Kruskal-Wallis table.
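The whole procedure can be sketched in Python (standard library only; the helper name is ours, and it assumes untied data like this example; tied values would need average ranks):

```python
import math

def kruskal_wallis_H(groups):
    """H = SSG / MST from an ANOVA carried out on the pooled ranks."""
    pooled = sorted((value, g) for g, group in enumerate(groups) for value in group)
    ranks = [[] for _ in groups]
    for rank, (value, g) in enumerate(pooled, start=1):
        ranks[g].append(rank)
    n = sum(len(r) for r in ranks)
    grand = (n + 1) / 2                    # mean of the ranks 1..n
    ssg = sum(len(r) * (sum(r) / len(r) - grand) ** 2 for r in ranks)
    sst = sum(rank_val - grand for r in ranks for rank_val in r if False) or \
          sum((rank_val - grand) ** 2 for r in ranks for rank_val in r)
    return ssg / (sst / (n - 1))           # MST = SST / (n - 1)

sad     = [0.00, -0.15, -0.11, -0.41]
happy   = [-0.01, 0.09, 0.04, 0.28]
massage = [0.62, 0.30, 0.48, 0.54]

H = kruskal_wallis_H([sad, happy, massage])  # about 9.27

# The chi-squared upper tail with 2 df has the closed form exp(-H/2)
p_approx = math.exp(-H / 2)   # just under 0.01
```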

# Spearman’s Rank Correlation

In Chapter 6 we defined the Pearson correlation coefficient, [latex]r[/latex], for the linear relationship between two quantitative variables. This was based on multiplying standardised scores for [latex]x[/latex] and [latex]y[/latex] values, and since it involved means and standard deviations the result is susceptible to the effects of outliers.

The reason why [latex]r[/latex] is formally known as Pearson’s correlation coefficient is to distinguish it from other measures of association. Here we will describe a measure published by Spearman (1904). As with the other rank-based methods in this chapter, Spearman’s correlation coefficient is based on ranks, and we will see how it is calculated using an example.

Oxytocin and Age

The figure below shows the basal plasma oxytocin levels and ages of the 12 single women in the oxytocin example. There are two variables now, oxytocin level and age, and we start by ranking these separately. The original values are shown together with their ranks in the following table.

## Basal oxytocin and age with ranks

Name | Age | Rank | Basal | Rank | [latex]d_j[/latex] |
---|---|---|---|---|---|

Katie Sato | 64 | 11 | 4.40 | 5 | 6 |

Jana Clausen | 18 | 1 | 4.5 | 8 | -7 |

Nanako Connolly | 60 | 10 | 4.17 | 2 | 8 |

Abigail Jones | 21 | 3 | 4.67 | 10 | -7 |

Kelly Brown | 31 | 6 | 4.88 | 12 | -6 |

Marie Sorensen | 55 | 9 | 4.41 | 6 | 3 |

Asuka McCarthy | 26 | 5 | 4.19 | 3 | 2 |

Tyra Carlsen | 20 | 2 | 4.69 | 11 | -9 |

Britt Solberg | 33 | 7 | 4.62 | 9 | -2 |

Jeneve Bager | 79 | 12 | 3.92 | 1 | 11 |

Gerda Jensen | 25 | 4 | 4.44 | 7 | -3 |

Kaya Solberg | 41 | 8 | 4.26 | 4 | 4 |

If there were a perfect negative association between the variables then we would expect each individual's two ranks to be far apart. For example, Jeneve is ranked 12 in age but 1 in oxytocin level, while Jana is ranked 1 in age and 8 in oxytocin level. We can measure how close the two sets of ranks are in the usual way, by adding up the squared deviations. The last column of our table gives the differences [latex]d_j[/latex] between the ranks, so we calculate

\[ \sum d_j^2 = 478. \]

If there were perfect **positive** correlation then this sum would be 0. What would happen if there were perfect negative correlation? In that case, the person ranked 12 in one variable would be ranked 1 in the other variable, the person ranked 11 in one would be ranked 2 in the other, and so on. This gives the largest possible differences between the ranks, and the sum of the squared differences would be

\[ \frac{n(n^2-1)}{3}, \]

where here [latex]n = 12[/latex]. We would like Spearman’s correlation to have the same range of values as Pearson’s, between -1 and 1, and so we define

\[ r_S = 1 - \frac{6 \sum d_j^2}{n(n^2 - 1)}. \]

You can convince yourself that this measure has the desired range.
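One way to convince yourself is to check the two extreme cases numerically; a short Python sketch:

```python
# Spearman's r_S computed from two lists of ranks (no ties assumed).
def spearman_from_ranks(rank_x, rank_y):
    n = len(rank_x)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

n = 12
ranks = list(range(1, n + 1))
print(spearman_from_ranks(ranks, ranks))        # identical rankings: 1.0
print(spearman_from_ranks(ranks, ranks[::-1]))  # reversed rankings: -1.0
```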

For our example,

\[ r_S = 1 - \frac{6 \times 478}{12(12^2 - 1)} = -0.6713. \]

As seen in the previous figure, this data does not contain any outliers or influential points, so [latex]r_S[/latex] is quite close to the Pearson correlation [latex]r = -0.6618[/latex]. However, suppose the first person had their basal value recorded as “440” instead of “4.40”. Pearson’s correlation becomes [latex]r = 0.3917[/latex], suggesting a positive relationship, while Spearman’s becomes [latex]r_S = -0.3566[/latex], much less affected by the outlier and still suggesting the correct direction.
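This calculation, and the effect of the hypothetical “440” typo, can be reproduced from the raw values; a Python sketch with the data transcribed from the table above:

```python
# Convert raw values to ranks (1 = smallest); there are no ties here,
# so simple ordinal ranks suffice.
def to_ranks(values):
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(x, y):
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(to_ranks(x), to_ranks(y)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

ages  = [64, 18, 60, 21, 31, 55, 26, 20, 33, 79, 25, 41]
basal = [4.40, 4.50, 4.17, 4.67, 4.88, 4.41, 4.19, 4.69, 4.62, 3.92, 4.44, 4.26]

print(round(spearman(ages, basal), 4))       # -0.6713
basal_typo = [440.0] + basal[1:]             # "440" recorded for "4.40"
print(round(spearman(ages, basal_typo), 4))  # -0.3566
```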

Spearman’s [latex]r_S[/latex] can also be used as a test statistic for the null hypothesis of no association. The following table gives critical values of the null distribution for small values of [latex]n[/latex]. Since this null distribution is symmetric, a test for negative association uses the tail probability of the corresponding positive value, and a two-sided test simply doubles the one-sided [latex]P[/latex]-value.

For example, if our aim had been to use this data to establish whether there was a negative relationship between oxytocin level and age for single women then the [latex]P[/latex]-value for the test would be [latex]\pr{R_S \le -0.6713}[/latex]. This value lies somewhere between 0.01 and 0.025 on the [latex]n=12[/latex] row of the table, giving moderate evidence of a negative association.

## Spearman's rank correlation critical values

[latex]n[/latex] | 0.25 | 0.10 | 0.05 | 0.025 | 0.01 | 0.005 | 0.001 | 0.0005 | 0.0001 |
---|---|---|---|---|---|---|---|---|---|

2 | |||||||||

3 | 1.000 | ||||||||

4 | 0.600 | 1.000 | 1.000 | ||||||

5 | 0.500 | 0.800 | 0.900 | 1.000 | 1.000 | ||||

6 | 0.371 | 0.657 | 0.829 | 0.886 | 0.943 | 1.000 | |||

7 | 0.321 | 0.571 | 0.714 | 0.821 | 0.893 | 0.964 | |||

8 | 0.310 | 0.524 | 0.643 | 0.738 | 0.833 | 0.905 | 0.976 | 0.976 | 1.000 |

9 | 0.267 | 0.483 | 0.600 | 0.700 | 0.783 | 0.833 | 0.917 | 0.933 | 0.967 |

10 | 0.248 | 0.455 | 0.564 | 0.648 | 0.745 | 0.794 | 0.879 | 0.903 | 0.927 |

11 | 0.236 | 0.427 | 0.536 | 0.618 | 0.709 | 0.764 | 0.855 | 0.873 | 0.909 |

12 | 0.217 | 0.406 | 0.503 | 0.587 | 0.678 | 0.734 | 0.825 | 0.846 | 0.881 |

13 | 0.209 | 0.385 | 0.484 | 0.560 | 0.648 | 0.703 | 0.797 | 0.824 | 0.863 |

14 | 0.200 | 0.367 | 0.464 | 0.538 | 0.626 | 0.679 | 0.771 | 0.802 | 0.846 |

15 | 0.189 | 0.354 | 0.446 | 0.521 | 0.604 | 0.657 | 0.750 | 0.779 | 0.829 |

16 | 0.182 | 0.341 | 0.429 | 0.503 | 0.585 | 0.635 | 0.729 | 0.759 | 0.812 |

17 | 0.176 | 0.328 | 0.414 | 0.488 | 0.566 | 0.618 | 0.711 | 0.743 | 0.797 |

18 | 0.170 | 0.317 | 0.401 | 0.472 | 0.550 | 0.600 | 0.692 | 0.723 | 0.781 |

19 | 0.165 | 0.309 | 0.391 | 0.460 | 0.535 | 0.584 | 0.675 | 0.709 | 0.767 |

20 | 0.161 | 0.299 | 0.380 | 0.447 | 0.522 | 0.570 | 0.662 | 0.693 | 0.753 |

21 | 0.156 | 0.292 | 0.370 | 0.436 | 0.509 | 0.556 | 0.647 | 0.678 | 0.739 |

22 | 0.152 | 0.284 | 0.361 | 0.425 | 0.497 | 0.544 | 0.633 | 0.665 | 0.726 |

23 | 0.148 | 0.278 | 0.353 | 0.416 | 0.486 | 0.532 | 0.621 | 0.652 | 0.713 |

24 | 0.144 | 0.271 | 0.344 | 0.407 | 0.476 | 0.521 | 0.609 | 0.640 | 0.702 |

25 | 0.142 | 0.265 | 0.337 | 0.398 | 0.466 | 0.511 | 0.597 | 0.628 | 0.690 |

26 | 0.138 | 0.259 | 0.331 | 0.390 | 0.457 | 0.501 | 0.586 | 0.618 | 0.679 |

27 | 0.136 | 0.255 | 0.324 | 0.383 | 0.449 | 0.492 | 0.576 | 0.607 | 0.668 |

28 | 0.133 | 0.250 | 0.318 | 0.375 | 0.441 | 0.483 | 0.567 | 0.597 | 0.658 |

29 | 0.130 | 0.245 | 0.312 | 0.368 | 0.433 | 0.475 | 0.558 | 0.588 | 0.649 |

30 | 0.128 | 0.240 | 0.306 | 0.362 | 0.425 | 0.467 | 0.549 | 0.579 | 0.640 |

40 | 0.110 | 0.207 | 0.264 | 0.313 | 0.368 | 0.405 | 0.479 | 0.506 | 0.563 |

50 | 0.097 | 0.184 | 0.235 | 0.279 | 0.329 | 0.363 | 0.430 | 0.456 | 0.508 |

60 | 0.089 | 0.168 | 0.214 | 0.255 | 0.301 | 0.331 | 0.394 | 0.417 | 0.467 |

70 | 0.082 | 0.155 | 0.198 | 0.235 | 0.278 | 0.307 | 0.365 | 0.387 | 0.434 |

80 | 0.076 | 0.145 | 0.185 | 0.220 | 0.260 | 0.287 | 0.342 | 0.363 | 0.407 |

90 | 0.072 | 0.136 | 0.174 | 0.207 | 0.245 | 0.271 | 0.323 | 0.343 | 0.385 |

100 | 0.068 | 0.129 | 0.165 | 0.197 | 0.233 | 0.257 | 0.307 | 0.326 | 0.366 |

This table gives [latex]r_S^{*}[/latex] such that [latex]\pr{R_S \ge r_S^{*}} \le p[/latex], where [latex]R_S[/latex] is Spearman's rank correlation from two rankings of [latex]n[/latex] objects when no association is present ([latex]H_0[/latex]). Note that it is impossible to get significance at the 5% level when [latex]n \lt 4[/latex].

For large [latex]n[/latex], the null distribution of [latex]z = r_S \sqrt{n-1}[/latex] is approximately the standard Normal distribution. This can be used instead of the table above for testing with large samples.
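Applied to the worked example (with the caveat that [latex]n = 12[/latex] is hardly large), a Python sketch:

```python
import math

# Large-sample test: under H0, z = r_S * sqrt(n - 1) is approximately
# standard Normal. The standard Normal CDF is written via the error
# function: Phi(z) = (1 + erf(z / sqrt(2))) / 2.
r_s, n = -0.6713, 12
z = r_s * math.sqrt(n - 1)
p = 0.5 * (1 + math.erf(z / math.sqrt(2)))  # one-sided P(Z <= z)

print(round(z, 2))   # -2.23
print(round(p, 3))   # 0.013, consistent with the 0.01-0.025 table bounds
```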

# Final Words

A discussion of nonparametric methods is an appropriate way to conclude our story of data analysis and statistical inference. In methods such as the sign test we have revisited the essence of hypothesis testing, a logical argument based on probability models for observed data. We have also seen the importance of visualising data to confirm the assumptions underlying the [latex]t[/latex] tests and [latex]F[/latex] tests, and how the nonparametric methods can be used when suitable transformations of the data cannot be found. The range of methods presented also gives a good overview of some of the types of data and experimental designs we have analysed in this book.

A notable absence from this chapter has been the calculation of confidence intervals. These intervals are usually more important than a hypothesis test, since they give an idea of how big an effect is rather than simply telling us that it exists. Daniel (1990) gives details of methods for nonparametric confidence intervals.

Of course this book has only given a taste of the wide variety of experimental designs and statistical methods that are used in research. The numerous references along the way should give you a start on finding appropriate methods to help you design and analyse your own scientific studies, or to help critically understand the work of others. A strong understanding of the basic principles covered in this book will give you a good basis for your future learning.

We conclude in the following chapter with a reflection on significance and power, illustrated by a gallery of data.

Summary

- Nonparametric methods can provide a robust alternative to methods based on means and standard deviations.
- The sign test and signed-rank test are nonparametric equivalents of the one-sample [latex]t[/latex] test (Chapter 15).
- The Wilcoxon rank-sum test is a nonparametric equivalent of the two-sample [latex]t[/latex] test (Chapter 16).
- The Kruskal-Wallis test is a nonparametric equivalent of the one-way ANOVA [latex]F[/latex] test (Chapter 19).
- Spearman’s rank correlation is a nonparametric equivalent of Pearson’s correlation (Chapter 18).

Exercise 1

A study measured the hopping speed of 11 subjects on their left and right legs. The number of hops made in one minute on one leg were counted, ten minutes were then given for rest, and then the number of hops made in one minute on the other leg were counted. Five subjects hopped on their right leg first while the remaining six started with their left leg. The counts are shown in the table below.

## Number of hops in one minute

Subject | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|

Leg | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |

Left | 124 | 86 | 98 | 112 | 104 | 190 | 135 | 110 | 78 | 82 | 94 |

Right | 110 | 98 | 110 | 110 | 108 | 195 | 125 | 120 | 70 | 80 | 98 |

Use the sign test to see whether there is evidence that people hop faster on their right leg. Obtain the exact [latex]P[/latex]-value from the cumulative binomial distribution table.

Exercise 2

Repeat the previous exercise using the signed-rank test instead. Obtain bounds for the [latex]P[/latex]-value from the signed-rank critical values table and also estimate it using a Normal approximation.

Exercise 3

Islander weights appear to have a skewed distribution, as seen in Chapter 3, and so a [latex]t[/latex] test may not be appropriate for working with small samples of weights. Use a rank-sum test instead to see whether there is a difference between male and female weights in the survey data. Compare your results to those from a [latex]t[/latex] test.

Exercise 4

Based on the data in the celery bending example, use a Kruskal-Wallis test to see whether there is a difference in celery bend angle between the three storage conditions. Obtain exact bounds for the [latex]P[/latex]-value from the 3-group Kruskal-Wallis table and also estimate it using the [latex]\chi^2[/latex] approximation.

Exercise 5

Use a Kruskal-Wallis test with the [latex]\chi^2[/latex] approximation to see whether there is a relationship between change in oxytocin level and stimulus event in the full data from the original oxytocin example.

Exercise 6

Carry out a rank-sum test for the sleep deprivation and internal clock study in Exercise 4 of Chapter 2. Compare your results with the exact [latex]P[/latex]-value from the randomisation test and the results from a two-sample [latex]t[/latex] test.

Exercise 7

Calculate the sampling distribution of Spearman’s rank correlation coefficient for [latex]n = 4[/latex]. Use this to verify the critical values in the table of Spearman’s rank correlation critical values.

## Kruskal-Wallis critical values for 3 groups

[latex]n_1[/latex] | [latex]n_2[/latex] | [latex]n_3[/latex] | [latex]p=[/latex]0.25 | 0.10 | 0.05 | 0.025 | 0.01 |
---|---|---|---|---|---|---|---|

2 | 2 | 2 | 3.714 | 4.571 | |||

2 | 2 | 3 | 3.429 | 4.500 | 4.714 | ||

2 | 2 | 4 | 3.125 | 4.458 | 5.333 | 5.500 | |

2 | 2 | 5 | 3.240 | 4.373 | 5.160 | 6.000 | 6.533 |

2 | 2 | 6 | 3.018 | 4.545 | 5.345 | 5.745 | 6.655 |

2 | 3 | 3 | 3.139 | 4.556 | 5.361 | 5.556 | |

2 | 3 | 4 | 3.111 | 4.511 | 5.444 | 6.000 | 6.444 |

2 | 3 | 5 | 3.022 | 4.651 | 5.251 | 6.004 | 6.909 |

2 | 3 | 6 | 2.970 | 4.682 | 5.348 | 6.136 | 6.970 |

2 | 4 | 4 | 3.055 | 4.555 | 5.455 | 6.327 | 7.036 |

2 | 4 | 5 | 2.914 | 4.541 | 5.273 | 6.068 | 7.205 |

2 | 4 | 6 | 3.058 | 4.494 | 5.340 | 6.186 | 7.340 |

2 | 5 | 5 | 3.023 | 4.623 | 5.338 | 6.346 | 7.338 |

2 | 5 | 6 | 3.033 | 4.596 | 5.338 | 6.196 | 7.376 |

2 | 6 | 6 | 3.010 | 4.438 | 5.410 | 6.210 | 7.467 |

3 | 3 | 3 | 3.289 | 4.622 | 5.600 | 5.956 | 7.200 |

3 | 3 | 4 | 3.027 | 4.709 | 5.791 | 6.155 | 6.745 |

3 | 3 | 5 | 2.970 | 4.533 | 5.648 | 6.315 | 7.079 |

3 | 3 | 6 | 2.987 | 4.590 | 5.615 | 6.436 | 7.410 |

3 | 4 | 4 | 2.932 | 4.545 | 5.598 | 6.394 | 7.144 |

3 | 4 | 5 | 2.953 | 4.549 | 5.656 | 6.410 | 7.445 |

3 | 4 | 6 | 2.940 | 4.604 | 5.610 | 6.538 | 7.500 |

3 | 5 | 5 | 2.936 | 4.545 | 5.705 | 6.549 | 7.578 |

3 | 5 | 6 | 2.897 | 4.535 | 5.602 | 6.667 | 7.590 |

3 | 6 | 6 | 2.900 | 4.558 | 5.625 | 6.725 | 7.725 |

4 | 4 | 4 | 3.038 | 4.654 | 5.692 | 6.615 | 7.654 |

4 | 4 | 5 | 2.918 | 4.668 | 5.657 | 6.673 | 7.760 |

4 | 4 | 6 | 2.895 | 4.595 | 5.681 | 6.667 | 7.795 |

4 | 5 | 5 | 2.931 | 4.523 | 5.666 | 6.760 | 7.823 |

4 | 5 | 6 | 2.896 | 4.523 | 5.661 | 6.750 | 7.936 |

4 | 6 | 6 | 2.882 | 4.548 | 5.724 | 6.812 | 8.000 |

5 | 5 | 5 | 2.960 | 4.560 | 5.780 | 6.740 | 8.000 |

5 | 5 | 6 | 2.853 | 4.547 | 5.729 | 6.788 | 8.028 |

5 | 6 | 6 | 2.895 | 4.542 | 5.765 | 6.848 | 8.124 |

6 | 6 | 6 | 2.889 | 4.643 | 5.801 | 6.889 | 8.222 |

This table gives [latex]H^{*}[/latex] such that [latex]\pr{H \ge H^{*}} \le p[/latex], where [latex]H[/latex] is a random Kruskal-Wallis statistic for three groups under the null hypothesis that the groups have the same distributions.

## Kruskal-Wallis critical values for 4 groups

[latex]n_1[/latex] | [latex]n_2[/latex] | [latex]n_3[/latex] | [latex]n_4[/latex] | [latex]p=[/latex]0.25 | 0.10 | 0.05 | 0.025 | 0.01 |
---|---|---|---|---|---|---|---|---|

2 | 2 | 2 | 2 | 4.667 | 5.667 | 6.167 | 6.667 | 6.667 |

2 | 2 | 2 | 3 | 4.378 | 5.711 | 6.333 | 6.978 | 7.133 |

2 | 2 | 2 | 4 | 4.473 | 5.755 | 6.545 | 7.064 | 7.391 |

2 | 2 | 3 | 3 | 4.436 | 5.745 | 6.527 | 7.055 | 7.727 |

2 | 2 | 3 | 4 | 4.348 | 5.750 | 6.621 | 7.326 | 7.871 |

2 | 2 | 4 | 4 | 4.308 | 5.808 | 6.731 | 7.538 | 8.346 |

2 | 3 | 3 | 3 | 4.364 | 5.879 | 6.727 | 7.515 | 8.015 |

2 | 3 | 3 | 4 | 4.327 | 5.872 | 6.795 | 7.564 | 8.333 |

2 | 3 | 4 | 4 | 4.275 | 5.901 | 6.874 | 7.747 | 8.621 |

2 | 4 | 4 | 4 | 4.271 | 5.914 | 6.957 | 7.914 | 8.871 |

3 | 3 | 3 | 3 | 4.385 | 6.026 | 7.000 | 7.667 | 8.538 |

3 | 3 | 3 | 4 | 4.302 | 6.016 | 6.984 | 7.775 | 8.659 |

3 | 3 | 4 | 4 | 4.267 | 6.019 | 7.038 | 7.929 | 8.876 |

3 | 4 | 4 | 4 | 4.254 | 6.042 | 7.142 | 8.079 | 9.075 |

4 | 4 | 4 | 4 | 4.235 | 6.088 | 7.235 | 8.228 | 9.287 |

This table gives [latex]H^{*}[/latex] such that [latex]\pr{H \ge H^{*}} \le p[/latex], where [latex]H[/latex] is a random Kruskal-Wallis statistic for four groups under the null hypothesis that the groups have the same distributions.