1 Variability

Why is data analysis necessary? What is it about scientific studies and the data they give us that makes statistical analysis so important?

The title of this chapter gives it away, but think about these questions before reading on.

Sources of Variability

As an example, suppose we do a simple study involving some measurements of forearm lengths. Ten students were asked to measure the length of their forearm in centimetres using a tape measure. The following table shows the data obtained:

Sex Forearm (cm)
Female 27
Male 29
Male 30
Male 30
Female 27
Female 24
Male 27
Female 28
Female 28
Male 30

This may seem like a silly question, but why are the 10 values we obtained different numbers? After all, we measured the same quantity, the length of a forearm, each time, so why aren’t our measurements all the same?

Error Variability

The obvious answer is that people come in different shapes and sizes and so we expect to get different forearm length measurements. This is just the natural variability of the thing we are measuring.

Also fairly obvious is that the measurement process may be prone to error, giving measurement variability. Deciding where a forearm begins and ends is not trivial. Even if a person measures their own forearm twice they may well come up with different answers. When different people make the measurements this problem is compounded. However, measurements can be made more accurate in a systematic way, such as by giving a protocol for how a forearm is to be measured or by getting just one person to do all the measurements.

Note also that all the values have been rounded to the nearest centimetre. In this case it might be realistic to do this, given the general uncertainty of the measurements, but in general rounding can contribute to measurement error as well.

There is usually no way to distinguish between measurement variability and natural variability in the data. We cannot tell whether the first two males, with forearms of 29 cm and 30 cm, recorded different lengths because their forearms were really of different length or were in fact of the same length but just measured differently. We collectively call this variability the error variability since it gets in the way of making inferences from our data. For example, if all males had the same length forearm and all females had the same length forearm then it would be easy to decide whether there was a difference between males and females.

The presence of error variability makes it necessary to replicate our experiments. Taking a single male or female and measuring their forearm length tells us very little about forearm lengths in general. Having the 10 observations of forearm lengths not only gives us information about the typical forearm lengths, it also gives us information about the nature and magnitude of the variability present in them, which is equally important.

Group Variability

In this example we also find differences in our observations because the two groups, males and females, do tend to have different lengths. This is the variability we are really interested in.

Note that we cannot make any conclusion like “males have longer forearms than females” because this statement is not universally true. There is a male who has a shorter forearm than some of the females. Instead we will talk about averages, so we might claim more correctly that “the average forearm length of males is longer than the average forearm length of females”. From the data we find that the average for males is 29.2 cm while for females it is 26.8 cm, a 2.4 cm difference.
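
As a minimal sketch of this calculation, the following Python snippet recomputes the two group averages from the ten observations in the table above. The variable and function names are our own, chosen only for illustration.

```python
from statistics import mean

# Forearm lengths (cm) copied from the table of ten students.
data = [
    ("Female", 27), ("Male", 29), ("Male", 30), ("Male", 30), ("Female", 27),
    ("Female", 24), ("Male", 27), ("Female", 28), ("Female", 28), ("Male", 30),
]


def group_mean(sex):
    """Average forearm length (cm) of the students with the given sex."""
    return mean(length for s, length in data if s == sex)


male_mean = group_mean("Male")
female_mean = group_mean("Female")
difference = male_mean - female_mean

print(f"Male average:   {male_mean:.1f} cm")
print(f"Female average: {female_mean:.1f} cm")
print(f"Difference:     {difference:.1f} cm")
```

Running this prints averages of 29.2 cm and 26.8 cm and a difference of 2.4 cm, matching the values quoted above.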

A standard statistical approach to seeing if there is a difference between the groups is to see if this group variability is larger than the error variability.
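
One common way of making this comparison is the F ratio from a one-way analysis of variance, which divides a measure of the variability between group averages by a measure of the variability within groups. The text above does not name a specific method, so the choice of the F ratio here is our assumption, and the sketch below should be read as an illustration rather than as the approach this book will necessarily take. All names in the code are our own.

```python
from statistics import mean

# Forearm lengths (cm) from the table of ten students, split by sex.
males = [29, 30, 30, 27, 30]
females = [27, 27, 24, 28, 28]
groups = [males, females]

grand_mean = mean(males + females)

# Group variability: how far each group average sits from the overall average.
ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)

# Error variability: how far each person sits from their own group average.
ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

# Degrees of freedom: (number of groups - 1) and (observations - groups).
df_between = len(groups) - 1
df_within = sum(len(g) for g in groups) - len(groups)

ms_between = ss_between / df_between
ms_within = ss_within / df_within
f_ratio = ms_between / ms_within

print(f"Between-group mean square: {ms_between:.2f}")
print(f"Within-group mean square:  {ms_within:.2f}")
print(f"F ratio:                   {f_ratio:.2f}")
```

For these data the between-group mean square is 14.4 and the within-group mean square is 2.2, giving a ratio of about 6.5. Whether a ratio of this size is convincing evidence of a difference is a separate question that this sketch does not answer.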

Sampling Variability

As noted, we will have to make conclusions in terms of averages. Suppose we calculate the average lengths for each group, the 29.2 cm and 26.8 cm found above. This is a straightforward calculation and we wouldn’t expect there to be any variability in the result. But there is! This is because if we took another 5 males and 5 females and carried out the measurements again then we would most likely get different average lengths. The difference would most likely not be 2.4 cm again. The average we calculate depends on the sample.

This is a very important point. What we would like to do is to use the averages we calculate in our experiment to say something about people in general, such as “males tend to have longer forearms than females”. But if we did the experiment again then we might get different data which say something else! Fortunately, we are able to quantify this sampling variability, particularly if the experiment has been properly designed.
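
To see sampling variability in action, we can simulate repeating the study many times. The sketch below assumes, purely for illustration, that male and female forearm lengths follow normal distributions centred at 29.2 cm and 26.8 cm with a common standard deviation of 1.5 cm. These population values are hypothetical (we have only borrowed the sample averages as convenient centres), and the function and variable names are our own.

```python
import random
from statistics import mean, stdev

# Hypothetical population values, assumed only for this illustration.
MALE_MEAN, FEMALE_MEAN, SD = 29.2, 26.8, 1.5


def one_study(rng, n=5):
    """Simulate one study of n males and n females.

    Returns the difference between the two sample averages (cm).
    """
    males = [rng.gauss(MALE_MEAN, SD) for _ in range(n)]
    females = [rng.gauss(FEMALE_MEAN, SD) for _ in range(n)]
    return mean(males) - mean(females)


rng = random.Random(1)  # fixed seed so the simulation is repeatable
differences = [one_study(rng) for _ in range(1000)]

print(f"Smallest difference: {min(differences):.2f} cm")
print(f"Largest difference:  {max(differences):.2f} cm")
print(f"Average difference:  {mean(differences):.2f} cm")
print(f"Spread (std dev):    {stdev(differences):.2f} cm")
```

Each simulated study gives a different difference in averages even though the underlying populations never change. The spread of these differences is one way to picture the sampling variability described above.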

As a result of this, statistics can be viewed as a communication skill. If a researcher wants to communicate her findings to someone then she has to use the language of statistics in order to incorporate sampling variability. Most research articles in the biological sciences, particularly in medical and other human-related settings, are full of statistical statements and conclusions.

Types of Variables

Before we start looking at designing experiments and analyzing data we need to introduce some terminology to describe the different types of information we might be interested in obtaining from our subjects.

A variable is a characteristic that we can record about the subjects or objects in a study. These can be measurements we make, like a forearm length or blood pressure, or can be attributes, like sex or age. It is important to classify variables as being either quantitative or categorical. These two types will require different tools for exploration and analysis.

Quantitative Variables

Quantitative variables represent measurements, such as the height of a person or the temperature of an environment. These are quite often continuous, taking any value over some range. Continuous variables capture the idea that measurements can always be made more precisely.

Discrete variables, in contrast, can take only separate, distinct values, such as a count of some outcomes or an age measured in whole years.

Categorical Variables

Categorical variables represent groups of objects with a particular characteristic. For example, recording the sex of subjects is essentially the same as making a group of males and a group of females. Variables like sex are called nominal because they are arbitrary categories with no order between them.

Ordinal variables are those whose categories do have an order. A common example of this is in recording the age group someone falls into. We can put these groups in order because we can put ages in order. In most of this book we will not make much of the distinction between nominal and ordinal variables.

It is important to be able to distinguish between quantitative and categorical variables because the two types require different methods for visualizing, summarizing, and analyzing. We will see this distinction throughout this book.
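
As a small illustration of this distinction, the sketch below summarises the two variables from the forearm example in different ways: the categorical variable (sex) by counting how many subjects fall in each category, and the quantitative variable (forearm length) by an average and a median. The choice of summaries and the variable names are ours, offered only as an example.

```python
from collections import Counter
from statistics import mean, median

# The ten observations from the forearm example, in table order.
sex = ["Female", "Male", "Male", "Male", "Female",
       "Female", "Male", "Female", "Female", "Male"]
forearm = [27, 29, 30, 30, 27, 24, 27, 28, 28, 30]

# Categorical variable: count the subjects in each category.
sex_counts = Counter(sex)

# Quantitative variable: summarise with numerical measures.
forearm_mean = mean(forearm)
forearm_median = median(forearm)

print("Sex counts:", dict(sex_counts))
print(f"Forearm average: {forearm_mean} cm, median: {forearm_median} cm")
```

An average of the sex variable would be meaningless, while a table of counts for forearm length would hide how far apart the values are.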

Categorical variables are sometimes referred to as qualitative variables. We will avoid that usage since the term qualitative data is used to describe the type of data that comes from investigations that examine people’s opinions, behaviours and experiences, usually captured through written answers to surveys, transcripts of interviews or field-based observations. Qualitative data may be studied in a number of ways, using text or visual artifacts, that don’t rely on quantitative measurements, but can sometimes be encoded as either quantitative or categorical variables for statistical analysis. Hence qualitative data and categorical data are generally quite distinct.

Survey

To illustrate the different types of variables, Alice conducted a survey that was completed by 60 people. The questions asked in the survey gave values for the following variables:

Variable Question
Sex Are you male or female?
Age How many years old are you?
Height How tall are you in centimetres?
Mass What is your mass in kilograms?
Forearm How long is your right forearm (measuring from your elbow to your wrist) in centimetres?
Pulse What is your pulse rate while you are completing this survey? (Count your pulse for 30 seconds and multiply by 2 to obtain your pulse rate in beats/minute.)
Eyes What is the predominant colour of your eyes?
Pizza What is your favourite pizza topping? (Sausage, Prawn, Pineapple, Mushroom or Spinach)
Education What is your highest level of education attained? (Primary, Secondary, University or Postgrad)
Kiss Do you approve of kissing on the first date?

Sensitive Questions

Asking people a sensitive question like “Do you approve of kissing on the first date?” can be difficult because participants may be reluctant to give an honest answer. To overcome this problem we aim to make the survey anonymous but this can also be difficult in practice, especially if the questions are being asked by an interviewer. One of the ways to give the subject control over their anonymity is the following randomised response technique:

  • Toss a coin twice and note the results, without telling anyone what they were.
  • If the result of the first toss was heads then answer the question truthfully.
  • If the result of the first toss was tails then look at the second toss: if it was heads then answer “Yes” and if it was tails then answer “No”.

Now if someone responds with “Yes” to the question then we cannot know whether they do approve of kissing on the first date or whether the coin told them to say “Yes”, giving perfect anonymity. We used this particular question since Fidler and Kleinknecht (1977) conducted a study where they found responses differed between direct questioning and the randomised response. They noted that “almost all respondents in the direct-questioning sample approved of kissing on the first date. In contrast, respondents in the randomized-response sample reported much less approval.” Other studies have shown similar effects over a wider range of topics.
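
Although individual answers are hidden, the overall proportion who approve can still be estimated. Assuming fair coins and that everyone follows the three steps, a respondent answers truthfully with probability 1/2 and is told to say “Yes” with probability 1/4, so the proportion of “Yes” responses is p/2 + 1/4, where p is the true proportion who approve. Rearranging gives p = 2 × (proportion of “Yes” - 1/4). The sketch below simulates the technique to check this reasoning. The true proportion of 0.6, the sample size and all function names are hypothetical choices made for illustration; they are not taken from the survey.

```python
import random


def randomised_response(truth, rng):
    """Return the answer one respondent gives (True means "Yes")."""
    first_heads = rng.random() < 0.5
    second_heads = rng.random() < 0.5
    if first_heads:
        return truth          # first toss heads: answer truthfully
    return second_heads       # first toss tails: second toss decides


def estimate_true_proportion(responses):
    """Estimate the true proportion of "Yes" from randomised responses.

    Uses P(Yes) = p/2 + 1/4, which assumes fair coins. With small
    samples the estimate can fall outside the range 0 to 1.
    """
    p_yes = sum(responses) / len(responses)
    return 2 * (p_yes - 0.25)


rng = random.Random(42)   # fixed seed so the simulation is repeatable
TRUE_P = 0.6              # hypothetical true proportion who approve
N = 10_000                # hypothetical number of respondents

truths = [rng.random() < TRUE_P for _ in range(N)]
responses = [randomised_response(t, rng) for t in truths]
estimate = estimate_true_proportion(responses)

print(f"Observed proportion of Yes: {sum(responses) / N:.3f}")
print(f"Estimated true proportion:  {estimate:.3f}")
```

The price of the added anonymity is precision: only about half of the responses carry information about the question, so the estimate is more variable than one from the same number of direct answers.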

The table at the end of this chapter shows the results obtained from this survey. Alternatively, it can be downloaded.

Survey data (CSV, 4KB)

Summary

  • The need for data analysis comes from the variability present in data.
  • Separating the differences between groups from background variability is a fundamental task of statistical analysis.
  • It is important to be able to identify the types of variables recorded in a study. Data from quantitative and categorical variables will be described and analysed in different ways.

Survey Data

Name Sex Age Height Mass Forearm Pulse Eyes Pizza Town Education Kiss
Scott Davies Male 22 174 74 25 80 Blue Mushroom Arcadia Secondary Yes
Orla Morris Female 15 174 67 26 70 Green Pineapple Arcadia Primary No
Anna Sorensen Female 39 160 68 24 66 Blue Pineapple Hofn University Yes
Lena Larsen Female 43 169 53 25 50 Green Pineapple Colmar University Yes
Lara Solberg Female 30 174 65 26 66 Green Pineapple Hofn Secondary No
Hannah Watanabe Female 19 166 59 25 74 Blue Spinach Arcadia Secondary Yes
David Eklund Male 36 173 52 25 78 Green Sausage Hofn University Yes
Brigit Lund Female 19 166 60 25 58 Green Sausage Arcadia Secondary No
Adam Connolly Male 23 179 72 25 78 Brown Spinach Arcadia Secondary Yes
Kerstin Bager Female 15 166 72 25 76 Blue Spinach Arcadia Primary No
Jasmin Blomgren Female 42 171 59 26 70 Green Pineapple Colmar University Yes
Jun Wilson Male 20 182 62 26 70 Green Spinach Hofn Secondary Yes
Dr William Summers Male 45 185 100 27 86 Brown Spinach Arcadia Postgrad Yes
Lea Herbert Female 35 166 58 25 66 Purple Spinach Colmar University Yes
Ian McCarthy Male 17 176 79 24 72 Brown Pineapple Hofn Secondary Yes
Kaito Price Male 24 177 81 24 62 Brown Mushroom Arcadia Secondary Yes
Jack Brown Male 39 173 55 25 78 Purple Pineapple Arcadia University Yes
Kaya Carlsen Female 61 164 45 26 82 Green Spinach Arcadia University Yes
Ella Edwards Female 37 169 72 25 62 Blue Mushroom Arcadia Secondary Yes
Daiki Yamada Male 56 169 64 23 64 Brown Sausage Arcadia University Yes
Leif Thorn Male 16 179 59 27 90 Purple Spinach Hofn Primary No
Thomas Hardy Male 19 171 65 25 82 Purple Prawns Arcadia Secondary Yes
Anthony Hall Male 25 171 71 24 54 Brown Spinach Colmar University Yes
Nathan Collins Male 60 184 60 25 58 Purple Mushroom Colmar University No
Dr Kristjana Erickson Female 27 165 57 25 66 Blue Mushroom Hofn Postgrad Yes
Ren Kimura Male 20 169 62 24 62 Brown Sausage Colmar Secondary Yes
Emma Simon Female 27 177 77 26 70 Brown Pineapple Colmar University Yes
Lamont Dupont Male 48 175 68 25 70 Purple Pineapple Colmar Secondary Yes
Shota Burke Male 28 177 81 25 68 Blue Spinach Arcadia University Yes
Ava Suzuki Female 43 170 54 26 66 Green Spinach Colmar University No
Amy Sato Female 36 178 79 27 76 Blue Prawns Colmar University No
Michael Pallesen Male 57 178 75 25 54 Brown Prawns Colmar University Yes
Colin Kennedy Male 54 193 109 29 92 Blue Mushroom Colmar University Yes
Zoe Jackson Female 29 173 60 25 54 Purple Pineapple Arcadia University No
Naoto Mori Male 41 175 76 24 52 Blue Mushroom Colmar University Yes
Ayaka Murphy Female 52 175 71 27 74 Blue Pineapple Arcadia University Yes
Michelle Regan Female 33 155 57 24 76 Brown Prawns Hofn Secondary No
Liam Moore Male 58 178 76 25 60 Brown Sausage Colmar University Yes
Halden Ibsen Male 40 173 54 25 66 Green Spinach Colmar University Yes
Taylor Jones Female 16 171 63 27 84 Purple Pineapple Hofn Primary No
Lisa Jensen Female 36 156 62 24 80 Blue Spinach Colmar University No
Andrew White Male 22 174 79 25 68 Blue Sausage Hofn Secondary No
Maxine Page Female 18 165 57 25 78 Blue Pineapple Hofn Secondary Yes
Antoine Abel Male 25 193 80 27 48 Green Mushroom Colmar University Yes
Jeremy Lavigne Male 19 166 64 24 64 Blue Pineapple Colmar Secondary Yes
Noel Swift Male 19 184 85 260 68 Brown Spinach Colmar Secondary Yes
Alexander Svendsen Male 21 185 95 26 74 Blue Sausage Arcadia Secondary Yes
Jermaine Gagnon Male 23 178 66 25 50 Green Sausage Hofn Secondary Yes
Ragnar Madsen Male 23 176 59 25 62 Purple Sausage Hofn Secondary Yes
Mallory Perrot Female 37 165 46 25 74 Purple Mushroom Colmar Secondary No
Anund Clausen Male 23 170 57 24 72 Green Pineapple Colmar Secondary Yes
Dionne Delacroix Female 45 165 54 24 60 Purple Spinach Colmar University No
Sanna Olsen Female 33 167 53 26 80 Purple Pineapple Hofn Secondary Yes
Raum Holst Male 36 179 68 25 48 Purple Spinach Hofn University Yes
Raphael Favreau Male 38 182 69 26 78 Purple Sausage Colmar University No
Maximilian Blomgren Male 23 168 67 23 56 Blue Sausage Hofn Secondary Yes
Florian Eklund Male 41 179 85 25 64 Blue Prawns Colmar Secondary Yes
Kate Connolly Female 47 159 48 23 54 Green Pineapple Arcadia Secondary No
Leon Sorensen Male 58 175 74 25 86 Blue Mushroom Colmar University Yes
Anika Sorensen Female 35 167 61 26 66 Brown Spinach Hofn University Yes
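
When working from the table as printed above rather than from the downloadable file, note that names contain a varying number of words (for example “Dr William Summers”), so splitting each row on spaces from the left does not line the columns up. A minimal sketch of one workaround, assuming every column other than Name is a single word, is to split from the right instead. The function name and the three sample rows used here are only for illustration.

```python
COLUMNS = ["Name", "Sex", "Age", "Height", "Mass", "Forearm", "Pulse",
           "Eyes", "Pizza", "Town", "Education", "Kiss"]
NUMERIC = ["Age", "Height", "Mass", "Forearm", "Pulse"]


def parse_row(line):
    """Turn one printed row of the survey table into a dictionary."""
    # Split off the last 11 fields; whatever remains on the left is the name.
    fields = line.rsplit(maxsplit=len(COLUMNS) - 1)
    row = dict(zip(COLUMNS, fields))
    for name in NUMERIC:
        row[name] = int(row[name])  # quantitative variables become numbers
    return row


# Three rows copied from the table above.
sample_lines = [
    "Scott Davies Male 22 174 74 25 80 Blue Mushroom Arcadia Secondary Yes",
    "Orla Morris Female 15 174 67 26 70 Green Pineapple Arcadia Primary No",
    "Dr William Summers Male 45 185 100 27 86 Brown Spinach Arcadia Postgrad Yes",
]
rows = [parse_row(line) for line in sample_lines]

for row in rows:
    print(row["Name"], row["Sex"], row["Forearm"], row["Kiss"])
```

The quantitative columns are converted to integers while the categorical columns are left as text, reflecting the two types of variables described in this chapter.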

Licence


A Portable Introduction to Data Analysis Copyright © 2024 by The University of Queensland is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License, except where otherwise noted.
