1 Variability
Why is data analysis necessary? What is it about scientific studies and the data they give us that makes statistical analysis so important?
The title of this chapter gives it away, but think about these questions before reading on.
Sources of Variability
As an example, suppose we do a simple study involving some measurements of forearm lengths. Ten students were asked to measure the length of their forearm in centimetres using a tape measure. The following table shows the data obtained:
Sex | Forearm (cm) |
---|---|
Female | 27 |
Male | 29 |
Male | 30 |
Male | 30 |
Female | 27 |
Female | 24 |
Male | 27 |
Female | 28 |
Female | 28 |
Male | 30 |
This may seem like a silly question, but why are the 10 values we obtained different numbers? After all, we measured the same quantity, the length of a forearm, each time, so why aren’t our measurements all the same?
Error Variability
The obvious answer is that people come in different shapes and sizes and so we expect to get different forearm length measurements. This is just the natural variability of the thing we are measuring.
Also fairly obvious is that the measurement process may be prone to error, giving measurement variability. Deciding where a forearm begins and ends is not trivial. Even if a person measures their own forearm twice they may well come up with different answers. When different people make the measurements, this problem is compounded. However, measurements can be made more accurate in a systematic way, such as by giving a protocol for how a forearm is to be measured or by getting just one person to do all the measurements.
Note also that all the values have been rounded to the nearest centimetre. In this case it might be realistic to do this, given the general uncertainty of the measurements, but in general rounding can contribute to measurement error as well.
There is usually no way to distinguish between measurement variability and natural variability in the data. We cannot tell whether the first two males, with forearms of 29 cm and 30 cm, recorded different lengths because their forearms were really of different length or were in fact of the same length but just measured differently. We collectively call this variability the error variability since it gets in the way of making inferences from our data. For example, if all males had the same length forearm and all females had the same length forearm then it would be easy to decide whether there was a difference between males and females.
The presence of error variability makes it necessary to replicate our experiments. Taking a single male or female and measuring their forearm length tells us very little about forearm lengths in general. Having the 10 observations of forearm lengths not only gives us information about the typical forearm lengths, it also gives us information about the nature and magnitude of the variability present in them, which is equally important.
Group Variability
In this example we also find differences in our observations because the two groups, males and females, do tend to have different forearm lengths. This is the variability we are really interested in.
Note that we cannot make any conclusion like “males have longer forearms than females” because this statement is not universally true. There is a male who has a shorter forearm than some of the females. Instead we will talk about averages, so we might claim more correctly that “the average forearm length of males is longer than the average forearm length of females”. From the data we find that the average for males is 29.2 cm while for females it is 26.8 cm, a 2.4 cm difference.
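The averages quoted above can be checked directly from the table of ten measurements. The following is a minimal sketch; the helper name `group_mean` is our own and is only for illustration.

```python
from statistics import mean

# Forearm lengths (cm) copied from the table of ten students.
forearm = [
    ("Female", 27), ("Male", 29), ("Male", 30), ("Male", 30), ("Female", 27),
    ("Female", 24), ("Male", 27), ("Female", 28), ("Female", 28), ("Male", 30),
]

def group_mean(data, sex):
    """Average forearm length for one group (illustrative helper)."""
    return mean(length for group, length in data if group == sex)

male_mean = group_mean(forearm, "Male")
female_mean = group_mean(forearm, "Female")
difference = male_mean - female_mean

print(f"Male mean:   {male_mean:.1f} cm")    # 29.2 cm
print(f"Female mean: {female_mean:.1f} cm")  # 26.8 cm
print(f"Difference:  {difference:.1f} cm")   # 2.4 cm
```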
A standard statistical approach to seeing if there is a difference between the groups is to see if this group variability is larger than the error variability.
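One common way to make this comparison concrete is the F ratio from a one-way analysis of variance. The text does not name a particular method here, so treat the choice of the F ratio as our assumption. The sketch below measures group variability by how far the two group averages sit from the overall average, and error variability by how far individuals sit from their own group average.

```python
from statistics import mean

# Forearm lengths (cm) from the table, split by group.
males = [29, 30, 30, 27, 30]
females = [27, 27, 24, 28, 28]
groups = [males, females]

grand_mean = mean(males + females)

# Group variability: spread of the group means around the overall mean.
between_ss = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
between_ms = between_ss / (len(groups) - 1)

# Error variability: spread of individuals around their own group mean.
within_ss = sum(sum((x - mean(g)) ** 2 for x in g) for g in groups)
within_ms = within_ss / (len(males) + len(females) - len(groups))

ratio = between_ms / within_ms
print(f"Group mean square: {between_ms:.2f}")
print(f"Error mean square: {within_ms:.2f}")
print(f"Ratio:             {ratio:.2f}")
```

A ratio well above 1 indicates that the differences between groups are large relative to the background variability. Deciding how large is large enough requires accounting for sampling variability.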
Sampling Variability
As noted, we will have to make conclusions in terms of averages. Suppose we calculate the average length for each group, obtaining 29.2 cm and 26.8 cm. This is a straightforward calculation and we wouldn’t expect there to be any variability in the result. But there is! If we took another 5 males and 5 females and carried out the measurements again then we would most likely get different average lengths, and the difference would most likely not be 2.4 cm again. The average we calculate depends on the sample.
This is a very important point. What we would like to do is to use the averages we calculate in our experiment to say something about people in general, such as “males tend to have longer forearms than females”. But if we did the experiment again then we might get different data which say something else! Fortunately, we are able to quantify this sampling variability, particularly if the experiment has been properly designed.
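Sampling variability can be seen by simulating the study many times. The sketch below is purely illustrative: the population averages and spread (`MALE_MEAN`, `FEMALE_MEAN`, `SPREAD`) are hypothetical values we have made up, and we assume forearm lengths follow a normal distribution. None of these figures come from the data above.

```python
import random
from statistics import mean

# ASSUMPTION: hypothetical population values, chosen only for illustration.
MALE_MEAN, FEMALE_MEAN, SPREAD = 29.0, 27.0, 1.5

def one_study(rng, n=5):
    """Simulate one study of n males and n females.

    Returns the difference between the two sample averages (cm).
    """
    males = [rng.gauss(MALE_MEAN, SPREAD) for _ in range(n)]
    females = [rng.gauss(FEMALE_MEAN, SPREAD) for _ in range(n)]
    return mean(males) - mean(females)

rng = random.Random(1)  # fixed seed so the run is repeatable
differences = [one_study(rng) for _ in range(1000)]

print(f"Smallest difference: {min(differences):.2f} cm")
print(f"Largest difference:  {max(differences):.2f} cm")
print(f"Average difference:  {mean(differences):.2f} cm")
```

Even though every simulated study draws from the same two populations, the difference between the sample averages changes from study to study. The range between the smallest and largest values gives a feel for how much a single small study can mislead.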
As a result of this, statistics can be viewed as a communication skill. If a researcher wants to communicate her findings to someone then she has to use the language of statistics in order to incorporate sampling variability. Most research articles in the biological sciences, particularly in medical and other human-related settings, are full of statistical statements and conclusions.
Types of Variables
Before we start looking at designing experiments and analyzing data we need to introduce some terminology to describe the different types of information we might be interested in obtaining from our subjects.
A variable is a characteristic that we can record about the subjects or objects in a study. These can be measurements we make, like a forearm length or blood pressure, or can be attributes, like sex or age. It is important to classify variables as being either quantitative or categorical. These two types will require different tools for exploration and analysis.
Quantitative Variables
Quantitative variables represent measurements, such as the height of a person or the temperature of an environment. These are quite often continuous, taking any value over some range. Continuous variables capture the idea that measurements can always be made more precisely.
Discrete variables can take only particular, separated values, typically whole numbers, such as a count of some outcomes or an age measured in whole years.
Categorical Variables
Categorical variables represent groups of objects with a particular characteristic. For example, recording the sex of subjects is essentially the same as making a group of males and a group of females. Variables like sex are called nominal because they are arbitrary categories with no order between them.
Ordinal variables are those whose categories do have an order. A common example of this is in recording the age group someone falls into. We can put these groups in order because we can put ages in order. In most of this book we will not make much of the distinction between nominal and ordinal variables.
It is important to be able to distinguish between quantitative and categorical variables because the two types require different methods for visualizing, summarizing, and analyzing. We will see this distinction throughout this book.
Categorical variables are sometimes referred to as qualitative variables. We will avoid that usage since the term qualitative data is used to describe the type of data that comes from investigations that examine people’s opinions, behaviours and experiences, usually captured through written answers to surveys, transcripts of interviews or field-based observations. Qualitative data may be studied in a number of ways, using text or visual artifacts, that don’t rely on quantitative measurements, but can sometimes be encoded as either quantitative or categorical variables for statistical analysis. Hence qualitative data and categorical data are generally quite distinct.
Survey
To illustrate the different types of variables, Alice conducted a survey that was completed by 60 people. The questions asked in the survey gave values for the following variables:
Variable | Question |
---|---|
Sex | Are you male or female? |
Age | How many years old are you? |
Height | How tall are you in centimetres? |
Mass | What is your mass in kilograms? |
Forearm | How long is your right forearm (measuring from your elbow to your wrist) in centimetres? |
Pulse | What is your pulse rate while you are completing this survey? (Count your pulse for 30 seconds and multiply by 2 to obtain your pulse rate in beats/minute.) |
Eyes | What is the predominant colour of your eyes? |
Pizza | What is your favourite pizza topping? Sausage, Prawns, Pineapple, Mushroom or Spinach |
Education | What is your highest level of education attained? Primary, Secondary, University or Postgrad |
Kiss | Do you approve of kissing on the first date? |
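As a rough guide, the survey variables could be classified as follows. This is our own reading based on the definitions given earlier, not a classification stated by the survey itself, and some choices are debatable. Pulse, for example, is recorded as a count but is often treated as continuous in practice.

```python
from enum import Enum

class VarType(Enum):
    CONTINUOUS = "quantitative (continuous)"
    DISCRETE = "quantitative (discrete)"
    NOMINAL = "categorical (nominal)"
    ORDINAL = "categorical (ordinal)"

# ASSUMPTION: one plausible classification of the survey variables.
SURVEY_TYPES = {
    "Sex": VarType.NOMINAL,
    "Age": VarType.DISCRETE,        # whole years
    "Height": VarType.CONTINUOUS,
    "Mass": VarType.CONTINUOUS,
    "Forearm": VarType.CONTINUOUS,
    "Pulse": VarType.DISCRETE,      # a count of beats
    "Eyes": VarType.NOMINAL,
    "Pizza": VarType.NOMINAL,
    "Education": VarType.ORDINAL,   # Primary < Secondary < University < Postgrad
    "Kiss": VarType.NOMINAL,
}

def is_quantitative(name):
    """True if the named variable is classified as quantitative."""
    return SURVEY_TYPES[name] in (VarType.CONTINUOUS, VarType.DISCRETE)

quantitative = [name for name in SURVEY_TYPES if is_quantitative(name)]
categorical = [name for name in SURVEY_TYPES if not is_quantitative(name)]

print("Quantitative:", quantitative)
print("Categorical: ", categorical)
```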
Sensitive Questions
Asking people a sensitive question like “Do you approve of kissing on the first date?” can be difficult because participants may be reluctant to give an honest answer. To overcome this problem we aim to make the survey anonymous but this can also be difficult in practice, especially if the questions are being asked by an interviewer. One of the ways to give the subject control over their anonymity is the following randomised response technique:
- Toss a coin twice and note the results, without telling anyone what they were.
- If the result of the first toss was heads then answer the question truthfully.
- If the result of the first toss was tails then look at the second toss: if it was heads then answer “Yes” and if it was tails then answer “No”.
Now if someone responds with “Yes” to the question then we cannot know whether they do approve of kissing on the first date or whether the coin told them to say “Yes”, giving perfect anonymity. We used this particular question since Fidler and Kleinknecht (1977) conducted a study where they found responses differed between direct questioning and the randomised response. They noted that “almost all respondents in the direct-questioning sample approved of kissing on the first date. In contrast, respondents in the randomized-response sample reported much less approval.” Other studies have shown similar effects over a wider range of topics.
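Although individual answers are hidden, the overall approval rate can still be estimated. Assuming fair coins and that everyone follows the instructions, half of the respondents answer truthfully, a quarter are told to say “Yes” and a quarter are told to say “No”. If the true approval rate is p, the expected proportion of “Yes” answers is therefore 0.5 × p + 0.25, which can be rearranged to recover p. The sketch below checks this with a simulation; the true rate of 0.6 is a hypothetical value chosen only for illustration, and the function names are our own.

```python
import random

def estimate_true_rate(observed_yes_rate):
    """Invert P(yes) = 0.5 * p + 0.25 to estimate p, clipped to [0, 1]."""
    p = 2 * (observed_yes_rate - 0.25)
    return min(1.0, max(0.0, p))

def randomised_answer(truth, rng):
    """Answer given by one respondent following the two-toss procedure."""
    first_heads = rng.random() < 0.5
    second_heads = rng.random() < 0.5
    # Heads on the first toss: tell the truth.
    # Tails on the first toss: the second toss decides the answer.
    return truth if first_heads else second_heads

rng = random.Random(7)  # fixed seed so the run is repeatable
TRUE_RATE = 0.6         # ASSUMPTION: hypothetical true approval rate

answers = [randomised_answer(rng.random() < TRUE_RATE, rng) for _ in range(10_000)]
observed = sum(answers) / len(answers)
estimate = estimate_true_rate(observed)

print(f"Observed 'Yes' rate:     {observed:.3f}")
print(f"Estimated approval rate: {estimate:.3f}")
```

The price of anonymity is precision: because half of the answers carry no information about the respondent, the estimate is noisier than one from the same number of direct answers.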
The table at the end of this chapter shows the results obtained from this survey. Alternatively, it can be downloaded.
Summary
- The need for data analysis comes from the variability present in data.
- Separating the differences between groups from background variability is a fundamental task of statistical analysis.
- It is important to be able to identify the types of variables recorded in a study. Data from quantitative and categorical variables will be described and analysed in different ways.
Survey Data
Name | Sex | Age | Height | Mass | Forearm | Pulse | Eyes | Pizza | Town | Education | Kiss |
---|---|---|---|---|---|---|---|---|---|---|---|
Scott Davies | Male | 22 | 174 | 74 | 25 | 80 | Blue | Mushroom | Arcadia | Secondary | Yes |
Orla Morris | Female | 15 | 174 | 67 | 26 | 70 | Green | Pineapple | Arcadia | Primary | No |
Anna Sorensen | Female | 39 | 160 | 68 | 24 | 66 | Blue | Pineapple | Hofn | University | Yes |
Lena Larsen | Female | 43 | 169 | 53 | 25 | 50 | Green | Pineapple | Colmar | University | Yes |
Lara Solberg | Female | 30 | 174 | 65 | 26 | 66 | Green | Pineapple | Hofn | Secondary | No |
Hannah Watanabe | Female | 19 | 166 | 59 | 25 | 74 | Blue | Spinach | Arcadia | Secondary | Yes |
David Eklund | Male | 36 | 173 | 52 | 25 | 78 | Green | Sausage | Hofn | University | Yes |
Brigit Lund | Female | 19 | 166 | 60 | 25 | 58 | Green | Sausage | Arcadia | Secondary | No |
Adam Connolly | Male | 23 | 179 | 72 | 25 | 78 | Brown | Spinach | Arcadia | Secondary | Yes |
Kerstin Bager | Female | 15 | 166 | 72 | 25 | 76 | Blue | Spinach | Arcadia | Primary | No |
Jasmin Blomgren | Female | 42 | 171 | 59 | 26 | 70 | Green | Pineapple | Colmar | University | Yes |
Jun Wilson | Male | 20 | 182 | 62 | 26 | 70 | Green | Spinach | Hofn | Secondary | Yes |
Dr William Summers | Male | 45 | 185 | 100 | 27 | 86 | Brown | Spinach | Arcadia | Postgrad | Yes |
Lea Herbert | Female | 35 | 166 | 58 | 25 | 66 | Purple | Spinach | Colmar | University | Yes |
Ian McCarthy | Male | 17 | 176 | 79 | 24 | 72 | Brown | Pineapple | Hofn | Secondary | Yes |
Kaito Price | Male | 24 | 177 | 81 | 24 | 62 | Brown | Mushroom | Arcadia | Secondary | Yes |
Jack Brown | Male | 39 | 173 | 55 | 25 | 78 | Purple | Pineapple | Arcadia | University | Yes |
Kaya Carlsen | Female | 61 | 164 | 45 | 26 | 82 | Green | Spinach | Arcadia | University | Yes |
Ella Edwards | Female | 37 | 169 | 72 | 25 | 62 | Blue | Mushroom | Arcadia | Secondary | Yes |
Daiki Yamada | Male | 56 | 169 | 64 | 23 | 64 | Brown | Sausage | Arcadia | University | Yes |
Leif Thorn | Male | 16 | 179 | 59 | 27 | 90 | Purple | Spinach | Hofn | Primary | No |
Thomas Hardy | Male | 19 | 171 | 65 | 25 | 82 | Purple | Prawns | Arcadia | Secondary | Yes |
Anthony Hall | Male | 25 | 171 | 71 | 24 | 54 | Brown | Spinach | Colmar | University | Yes |
Nathan Collins | Male | 60 | 184 | 60 | 25 | 58 | Purple | Mushroom | Colmar | University | No |
Dr Kristjana Erickson | Female | 27 | 165 | 57 | 25 | 66 | Blue | Mushroom | Hofn | Postgrad | Yes |
Ren Kimura | Male | 20 | 169 | 62 | 24 | 62 | Brown | Sausage | Colmar | Secondary | Yes |
Emma Simon | Female | 27 | 177 | 77 | 26 | 70 | Brown | Pineapple | Colmar | University | Yes |
Lamont Dupont | Male | 48 | 175 | 68 | 25 | 70 | Purple | Pineapple | Colmar | Secondary | Yes |
Shota Burke | Male | 28 | 177 | 81 | 25 | 68 | Blue | Spinach | Arcadia | University | Yes |
Ava Suzuki | Female | 43 | 170 | 54 | 26 | 66 | Green | Spinach | Colmar | University | No |
Amy Sato | Female | 36 | 178 | 79 | 27 | 76 | Blue | Prawns | Colmar | University | No |
Michael Pallesen | Male | 57 | 178 | 75 | 25 | 54 | Brown | Prawns | Colmar | University | Yes |
Colin Kennedy | Male | 54 | 193 | 109 | 29 | 92 | Blue | Mushroom | Colmar | University | Yes |
Zoe Jackson | Female | 29 | 173 | 60 | 25 | 54 | Purple | Pineapple | Arcadia | University | No |
Naoto Mori | Male | 41 | 175 | 76 | 24 | 52 | Blue | Mushroom | Colmar | University | Yes |
Ayaka Murphy | Female | 52 | 175 | 71 | 27 | 74 | Blue | Pineapple | Arcadia | University | Yes |
Michelle Regan | Female | 33 | 155 | 57 | 24 | 76 | Brown | Prawns | Hofn | Secondary | No |
Liam Moore | Male | 58 | 178 | 76 | 25 | 60 | Brown | Sausage | Colmar | University | Yes |
Halden Ibsen | Male | 40 | 173 | 54 | 25 | 66 | Green | Spinach | Colmar | University | Yes |
Taylor Jones | Female | 16 | 171 | 63 | 27 | 84 | Purple | Pineapple | Hofn | Primary | No |
Lisa Jensen | Female | 36 | 156 | 62 | 24 | 80 | Blue | Spinach | Colmar | University | No |
Andrew White | Male | 22 | 174 | 79 | 25 | 68 | Blue | Sausage | Hofn | Secondary | No |
Maxine Page | Female | 18 | 165 | 57 | 25 | 78 | Blue | Pineapple | Hofn | Secondary | Yes |
Antoine Abel | Male | 25 | 193 | 80 | 27 | 48 | Green | Mushroom | Colmar | University | Yes |
Jeremy Lavigne | Male | 19 | 166 | 64 | 24 | 64 | Blue | Pineapple | Colmar | Secondary | Yes |
Noel Swift | Male | 19 | 184 | 85 | 260 | 68 | Brown | Spinach | Colmar | Secondary | Yes |
Alexander Svendsen | Male | 21 | 185 | 95 | 26 | 74 | Blue | Sausage | Arcadia | Secondary | Yes |
Jermaine Gagnon | Male | 23 | 178 | 66 | 25 | 50 | Green | Sausage | Hofn | Secondary | Yes |
Ragnar Madsen | Male | 23 | 176 | 59 | 25 | 62 | Purple | Sausage | Hofn | Secondary | Yes |
Mallory Perrot | Female | 37 | 165 | 46 | 25 | 74 | Purple | Mushroom | Colmar | Secondary | No |
Anund Clausen | Male | 23 | 170 | 57 | 24 | 72 | Green | Pineapple | Colmar | Secondary | Yes |
Dionne Delacroix | Female | 45 | 165 | 54 | 24 | 60 | Purple | Spinach | Colmar | University | No |
Sanna Olsen | Female | 33 | 167 | 53 | 26 | 80 | Purple | Pineapple | Hofn | Secondary | Yes |
Raum Holst | Male | 36 | 179 | 68 | 25 | 48 | Purple | Spinach | Hofn | University | Yes |
Raphael Favreau | Male | 38 | 182 | 69 | 26 | 78 | Purple | Sausage | Colmar | University | No |
Maximilian Blomgren | Male | 23 | 168 | 67 | 23 | 56 | Blue | Sausage | Hofn | Secondary | Yes |
Florian Eklund | Male | 41 | 179 | 85 | 25 | 64 | Blue | Prawns | Colmar | Secondary | Yes |
Kate Connolly | Female | 47 | 159 | 48 | 23 | 54 | Green | Pineapple | Arcadia | Secondary | No |
Leon Sorensen | Male | 58 | 175 | 74 | 25 | 86 | Blue | Mushroom | Colmar | University | Yes |
Anika Sorensen | Female | 35 | 167 | 61 | 26 | 66 | Brown | Spinach | Hofn | University | Yes |