Historical Data Sets

[latex]\newcommand{\myei}[2]{#1\frac{#2}{8}} \newcommand{\myeib}[1]{#1\hspace{0.63em}}[/latex]

This appendix gives several data sets which have become famous in the statistical literature and have been used as examples in this book.

The Passage Time of Light

This data set and the next were collected by Stigler (1977). Stigler looked at how effective estimators, such as the sample mean and sample median, were at estimating population values. We have given examples of this using simulations of the sampling process, but Stigler’s idea was to try this with real data. The difficulty is that this requires settings in which the true values are known. A solution is to use historical experiments which were trying to determine physical quantities, such as the speed of light, which are now known very accurately. Stigler considered three such experiments, involving the determination of the parallax of the sun, the speed of light, and the mean density of the earth.

Coded measurements of the passage time of light (ns)

28 26 33 24 34 -44 27 16 40 -2
29 22 24 21 25 30 23 29 31 19
24 20 36 32 36 28 25 21 28 29
37 25 28 26 30 32 36 26 30 22
36 23 27 27 28 27 31 27 26 33
26 32 32 24 39 28 24 25 32 25
29 27 28 29 16 23

The table above shows the coded results of Newcomb’s third series of measurements of the passage time of light, made from 24 July 1882 to 5 September 1882, as presented by Stigler (1977). The values given are the differences between the actual values and 24800 ns. For example, ‘28’ is the coded value for a passage time of 24828 ns, while ‘-44’ is the coded value for 24756 ns. The observations are given in the order in which they were made, reading across the rows.

The currently accepted value for the true passage time of light for this experiment would be coded as 33.02 ns.
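The coding and the comparison with the accepted value can be sketched in a few lines of Python. This is only an illustration using the standard library; the names `coded`, `OFFSET_NS`, `TRUE_CODED` and `decode` are ours and do not come from Stigler.

```python
from statistics import mean, median

# Coded passage times (ns), in the order given in the table, reading across rows.
coded = [
    28, 26, 33, 24, 34, -44, 27, 16, 40, -2,
    29, 22, 24, 21, 25, 30, 23, 29, 31, 19,
    24, 20, 36, 32, 36, 28, 25, 21, 28, 29,
    37, 25, 28, 26, 30, 32, 36, 26, 30, 22,
    36, 23, 27, 27, 28, 27, 31, 27, 26, 33,
    26, 32, 32, 24, 39, 28, 24, 25, 32, 25,
    29, 27, 28, 29, 16, 23,
]

OFFSET_NS = 24800    # coded value = actual passage time - 24800 ns
TRUE_CODED = 33.02   # currently accepted value on the coded scale


def decode(value):
    """Convert a coded value back to a passage time in nanoseconds."""
    return OFFSET_NS + value


sample_mean = mean(coded)
sample_median = median(coded)

print(f"n = {len(coded)}")
print(f"mean   = {sample_mean:.2f} (error {sample_mean - TRUE_CODED:+.2f})")
print(f"median = {sample_median:.2f} (error {sample_median - TRUE_CODED:+.2f})")
```

With these 66 values the sample mean is about 26.21 and the sample median is 27, so both estimates fall below the accepted value, with the median slightly closer.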

The Mean Density of the Earth

The table below shows 29 determinations of the mean density of the Earth made by Cavendish in 1798 using a torsion balance devised earlier by Michell. There were many experiments around this time of a similar nature, using the results to estimate Newton’s gravitational constant [latex]G[/latex], but Stigler notes that Cavendish’s experiment is generally considered the best.

The first six observations were made under the same conditions, but Cavendish then replaced a suspension wire in the apparatus with a stiffer one.

Measurements of the mean density of the Earth (g/cm[latex]^3[/latex])

5.50 5.61 4.88 5.07 5.26 5.55 5.36 5.29 5.58 5.65
5.57 5.53 5.62 5.29 5.44 5.34 5.79 5.10 5.27 5.39
5.42 5.47 5.63 5.34 5.46 5.30 5.75 5.68 5.85
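One simple way to look at the change of suspension wire is to summarise the first six determinations separately from the rest. The sketch below assumes the values are read across the rows in the order observed; the variable names are ours.

```python
from statistics import mean

# Cavendish's 29 determinations (g/cm^3), reading across the rows of the table.
density = [
    5.50, 5.61, 4.88, 5.07, 5.26, 5.55, 5.36, 5.29, 5.58, 5.65,
    5.57, 5.53, 5.62, 5.29, 5.44, 5.34, 5.79, 5.10, 5.27, 5.39,
    5.42, 5.47, 5.63, 5.34, 5.46, 5.30, 5.75, 5.68, 5.85,
]

first_wire = density[:6]    # observations made before the wire was replaced
second_wire = density[6:]   # observations made with the stiffer wire

print(f"all 29:       mean = {mean(density):.3f}")
print(f"first six:    mean = {mean(first_wire):.3f}")
print(f"remaining 23: mean = {mean(second_wire):.3f}")
```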

Effects of Cross and Self Fertilization

Darwin (1902, originally published in 1876) gives results from a large number of experiments to compare the effects of cross-fertilisation and self-fertilisation in plants. One of these used thirty seeds from Zea mays (maize):

Some plants were raised in the greenhouse, and were crossed with pollen taken from a distinct plant; and a single plant, growing quite separately in a different part of the house, was allowed to fertilise itself spontaneously. The seeds thus obtained were placed on damp sand, and as they germinated in pairs of equal age were planted on the opposite sides of four very large pots.

Darwin (1902)

The table below gives the heights of the fifteen pairs of plants when initial measurements were made, along with the difference between the cross-fertilised and self-fertilised height in each pair. In this table the measurements are given in inches, reproducing the table of values given by Darwin. In Chapter 24 the data is converted to centimetres for use there.

Height (inches) of Zea mays

Pot Crossed Self-Fertilised Difference ([latex]\times \frac{1}{8}[/latex])
1 [latex]\myei{23}{4}[/latex] [latex]\myei{17}{3}[/latex] 49
1 [latex]\myeib{12}[/latex] [latex]\myei{20}{3}[/latex] -67
1 [latex]\myeib{21}[/latex] [latex]\myeib{20}[/latex] 8
2 [latex]\myeib{22}[/latex] [latex]\myeib{20}[/latex] 16
2 [latex]\myei{19}{1}[/latex] [latex]\myei{18}{3}[/latex] 6
2 [latex]\myei{21}{4}[/latex] [latex]\myei{18}{5}[/latex] 23
3 [latex]\myei{22}{1}[/latex] [latex]\myei{18}{5}[/latex] 28
3 [latex]\myei{20}{3}[/latex] [latex]\myei{15}{2}[/latex] 41
3 [latex]\myei{18}{2}[/latex] [latex]\myei{16}{4}[/latex] 14
3 [latex]\myei{21}{5}[/latex] [latex]\myeib{18}[/latex] 29
3 [latex]\myei{23}{2}[/latex] [latex]\myei{16}{2}[/latex] 56
4 [latex]\myeib{21}[/latex] [latex]\myeib{18}[/latex] 24
4 [latex]\myei{22}{1}[/latex] [latex]\myei{12}{6}[/latex] 75
4 [latex]\myeib{23}[/latex] [latex]\myei{15}{4}[/latex] 60
4 [latex]\myeib{12}[/latex] [latex]\myeib{18}[/latex] -48
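The difference column is expressed in eighths of an inch, which can be checked with exact fraction arithmetic. The following sketch stores each height as a pair (whole inches, eighths); this representation and the names `crossed`, `selfed` and `inches` are our own choices for illustration.

```python
from fractions import Fraction

# Heights as (whole inches, eighths of an inch), one entry per pair, in table order.
crossed = [(23, 4), (12, 0), (21, 0), (22, 0), (19, 1), (21, 4), (22, 1), (20, 3),
           (18, 2), (21, 5), (23, 2), (21, 0), (22, 1), (23, 0), (12, 0)]
selfed = [(17, 3), (20, 3), (20, 0), (20, 0), (18, 3), (18, 5), (18, 5), (15, 2),
          (16, 4), (18, 0), (16, 2), (18, 0), (12, 6), (15, 4), (18, 0)]


def inches(whole, eighths):
    """Exact height in inches from whole inches plus eighths."""
    return whole + Fraction(eighths, 8)


# Crossed minus self-fertilised, in eighths of an inch, as in the last column.
diff_eighths = [int((inches(*c) - inches(*s)) * 8) for c, s in zip(crossed, selfed)]

mean_diff_inches = Fraction(sum(diff_eighths), 8 * len(diff_eighths))
mean_diff_cm = float(mean_diff_inches) * 2.54  # 1 inch = 2.54 cm

print(diff_eighths)
print(f"mean difference = {float(mean_diff_inches):.3f} in = {mean_diff_cm:.2f} cm")
```

The computed differences agree with the table, and the mean difference is 157/60 inches, a little over two and a half inches in favour of the cross-fertilised plants.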

Counting Yeast Cells with a Hemocytometer

Student (1907) describes an experiment in which he spread a thin layer of mixture containing yeast cells, gelatine and water on a plate of glass. A small area was divided into 400 squares of equal size and a hemocytometer was used to count the number of yeast cells in each square. The resulting counts are given in the table below.

Counts of yeast cells in 400 squares

2 2 4 4 4 5 2 4 7 7 4 7 5 2 8 6 7 4 3 4
3 3 2 4 2 5 4 2 8 6 3 6 6 10 8 3 5 6 4 4
7 9 5 2 7 4 4 2 4 4 4 3 5 6 5 4 1 4 2 6
4 1 4 7 3 2 3 5 8 2 9 5 3 9 5 5 2 4 3 4
4 1 5 9 3 4 4 6 6 5 4 6 5 5 4 3 5 9 6 4
4 4 5 10 4 4 3 8 3 2 1 4 1 5 6 4 2 3 3 3
3 7 4 5 1 8 5 7 9 5 8 9 5 6 6 4 3 7 4 4
7 5 6 3 6 7 4 5 8 6 3 3 4 3 7 4 4 4 5 3
8 10 6 3 3 6 5 2 5 3 11 3 7 4 7 3 5 5 3 4
1 3 7 2 5 5 5 3 3 4 6 5 6 1 6 4 4 4 6 4
4 2 5 4 8 6 3 4 6 5 2 6 6 1 2 2 2 5 2 2
5 9 3 5 6 4 6 5 7 1 3 6 5 4 2 8 9 5 4 3
2 2 11 4 6 6 4 6 2 5 3 5 7 2 6 5 5 1 2 7
5 12 5 8 2 4 2 1 6 4 5 1 2 9 1 3 4 7 3 6
5 6 5 4 4 5 2 7 6 2 7 3 5 4 4 5 4 7 5 4
8 4 6 6 5 3 3 5 7 4 5 5 5 6 10 2 3 8 3 5
6 6 4 2 6 6 7 5 4 5 8 6 7 6 4 2 6 1 1 4
7 2 5 7 4 6 4 5 1 5 10 8 7 5 4 6 4 4 7 5
4 3 1 6 2 5 3 3 3 7 4 3 7 8 4 7 3 1 4 4
7 6 7 2 4 5 1 3 12 4 2 2 8 7 6 7 6 3 5 4

The First [latex]t[/latex] Test

The first [latex]t[/latex] test was presented by Student (1908), who analysed data from a study to determine whether a new sedative increased sleep duration more than an existing sedative (Cushny & Peebles, 1905). In the study each of 10 patients was given a tablet of one of the drugs on alternate evenings and the researchers recorded the mean increase in hours of sleep that each drug gave relative to a previous control. The results are reproduced in the table below.

Additional sleep (hours) gained by use of hyoscyamine hydrobromide

Patient 1 (Dextro-) 2 (Laevo-) Difference (2-1)
1 0.7 1.9 1.2
2 -1.6 0.8 2.4
3 -0.2 1.1 1.3
4 -1.2 0.1 1.3
5 -0.1 -0.1 0.0
6 3.4 4.4 1.0
7 3.7 5.5 1.8
8 0.8 1.6 0.8
9 0.0 4.6 4.6
10 2.0 3.4 1.4

In addition to being used for the first ever [latex]t[/latex] test, this data set is also famous because it appeared in Fisher’s Statistical Methods for Research Workers (Fisher, 1925). However, it turns out that the table above is wrong. Senn and Richardson (1994) detail a correspondence which explains that Student had incorrect labels for the two compounds. They also raise other concerns, including that the values in the table are variously the means of 3 to 6 observations and so have no common [latex]\sigma[/latex] to estimate. Zabell (2008) gives a general overview of these issues and the historical context and significance of the paper.
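Taking the table at face value, the paired comparison can be sketched as a one-sample [latex]t[/latex] statistic on the differences. This is the modern textbook form of the calculation, given here as an assumption about how one would analyse the table today rather than as a reproduction of Student's original working; the variable names are ours.

```python
from math import sqrt
from statistics import mean, stdev

# Additional hours of sleep for each of the 10 patients, as given in the table.
dextro = [0.7, -1.6, -0.2, -1.2, -0.1, 3.4, 3.7, 0.8, 0.0, 2.0]
laevo = [1.9, 0.8, 1.1, 0.1, -0.1, 4.4, 5.5, 1.6, 4.6, 3.4]

# Differences (2 - 1), rounded to one decimal place to match the table.
diff = [round(l - d, 1) for d, l in zip(dextro, laevo)]

n = len(diff)
standard_error = stdev(diff) / sqrt(n)   # stdev uses the n - 1 divisor
t_stat = mean(diff) / standard_error     # compare with a t distribution on n - 1 df

print(f"mean difference = {mean(diff):.2f} hours")
print(f"t = {t_stat:.2f} on {n - 1} degrees of freedom")
```

The mean difference is 1.58 hours and the statistic comes out at about 4.06 on 9 degrees of freedom.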

Yields of Potatoes

Many early advances in the history of experimental design and statistical analysis came through agricultural trials. The following table shows the results of an agricultural experiment taken from the 1932 Report of the Rothamsted Experimental Station and presented as an example by Fisher (1951). Here the six treatments have been arranged in a Latin square, in which each treatment occurs exactly once in each row and in each column. This design allows you to simultaneously take into account possible effects in either the row or column direction, such as gradients in soil fertility or prevailing patterns in sun and rain.

Yields of potatoes (lbs)

E 633 B 527 F 652 A 390 C 504 D 416
B 489 C 475 D 415 E 488 F 571 A 282
A 384 E 481 C 483 B 422 D 334 F 646
F 620 D 448 E 505 C 439 A 323 B 384
D 452 A 432 B 411 F 617 E 594 C 466
C 500 F 505 A 259 D 366 B 326 E 420

The treatments themselves are combinations of two factors, phosphate and nitrogen. The table below shows the levels of each factor in each treatment.

Treatment coding

  No Nitrogen Nitrogen
No phosphate A D
Single phosphate B E
Double phosphate C F
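The Latin square property and the treatment totals can be checked directly. The sketch below assumes the 36 plots are read six to a row in the order listed, which does give one occurrence of each treatment in every row and column. The names `square`, `coding` and `is_latin_square` are ours.

```python
# (treatment, yield in lbs) for each plot, one inner list per row of the square.
square = [
    [("E", 633), ("B", 527), ("F", 652), ("A", 390), ("C", 504), ("D", 416)],
    [("B", 489), ("C", 475), ("D", 415), ("E", 488), ("F", 571), ("A", 282)],
    [("A", 384), ("E", 481), ("C", 483), ("B", 422), ("D", 334), ("F", 646)],
    [("F", 620), ("D", 448), ("E", 505), ("C", 439), ("A", 323), ("B", 384)],
    [("D", 452), ("A", 432), ("B", 411), ("F", 617), ("E", 594), ("C", 466)],
    [("C", 500), ("F", 505), ("A", 259), ("D", 366), ("B", 326), ("E", 420)],
]

# Treatment coding from the table: (phosphate level, nitrogen applied?).
coding = {
    "A": ("none", False), "B": ("single", False), "C": ("double", False),
    "D": ("none", True), "E": ("single", True), "F": ("double", True),
}


def is_latin_square(sq):
    """True if every treatment occurs exactly once in each row and each column."""
    n = len(sq)
    labels = {t for row in sq for t, _ in row}
    if len(labels) != n or any(len(row) != n for row in sq):
        return False
    rows_ok = all({t for t, _ in row} == labels for row in sq)
    cols_ok = all({row[j][0] for row in sq} == labels for j in range(n))
    return rows_ok and cols_ok


# Total yield for each treatment over its six plots.
totals = {t: 0 for t in coding}
for row in square:
    for treatment, plot_yield in row:
        totals[treatment] += plot_yield

# Mean yield per plot with and without nitrogen (18 plots each).
with_n = sum(v for t, v in totals.items() if coding[t][1]) / 18
without_n = sum(v for t, v in totals.items() if not coding[t][1]) / 18

print("Latin square:", is_latin_square(square))
print("Treatment totals:", totals)
print(f"Mean yield with nitrogen {with_n:.1f}, without {without_n:.1f}")
```

These are raw summaries only; a full analysis of the design would also separate out the row and column effects.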

Dimensions of Irises

The dataset iris in R gives the measurements of 150 irises from the Gaspé Peninsula in Québec, collected by Edgar Anderson (Anderson, 1935) and first analysed by Fisher (1936b). This data set has become a famous challenge for testing classification algorithms (Bezdek, Keller, Krishnapuram, Kuncheva, & Pal, 1999; Frank & Asuncion, 2010).

Licence

A Portable Introduction to Data Analysis Copyright © 2024 by The University of Queensland is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License, except where otherwise noted.
