# 8 Probability

We now move from describing data distributions and relationships to thinking about where our data comes from. This chapter gives some important terminology which we will use throughout our later development.

# Populations and Sampling

A **population** is a complete set of individuals or objects that we want information about. Ideally we could do a **census** and collect the information we want about the whole population. However, this is typically too expensive in time or money, or simply impossible. For example, how would you do a census of fish in the sea?

Instead we take a **sample**, a subset of the population, and use data from the sample to say something about the population as a whole. The sample should be chosen so that it is representative of the population but also so that it is not biased in any way (see below). One way of achieving this is to take a **random** sample from the population.

A **population parameter** is a numerical characteristic of a population. For example, suppose the total [latex]N = 7837[/latex] residents in the main towns of Hofn, Arcadia and Colmar is our population of interest. The proportion of females in this population is [latex]p = 3946/7837 = 0.504[/latex]. This [latex]p[/latex] is a parameter since it says something about the population.

Similarly, the average height of females in this population is [latex]\mu[/latex] = 166.0 cm, another parameter. The symbol ‘[latex]\mu[/latex]’ is the Greek letter ‘m’, as in ‘mean’. We will almost always use lowercase Greek letters to denote population parameters. The main exception to this rule is the population proportion [latex]p[/latex].

In contrast, a **statistic** is a numerical characteristic of a sample. We can view the individuals in the survey as a sample of the population of 7837 residents in those three towns. There are [latex]n = 60[/latex] individuals in this sample. The proportion of females in the sample is [latex]\hat{p} = 26/60 = 0.433[/latex]. This [latex]\hat{p}[/latex] is the statistic we calculate to estimate the (usually unknown) population parameter [latex]p[/latex]. Similarly, the mean height of females in the sample is [latex]\overline{x}[/latex] = 167.4 cm and we calculate this sample mean to estimate the (usually unknown) population parameter [latex]\mu[/latex].
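As a quick numerical check, the parameter and statistic above can be reproduced directly from the counts quoted in the text. This is a minimal Python sketch (the variable names are ours, not standard notation):

```python
# Population parameter: proportion of females among the N = 7837 residents.
N = 7837
p = 3946 / N          # population proportion, approximately 0.504

# Sample statistic: proportion of females among the n = 60 surveyed.
n = 60
p_hat = 26 / n        # sample proportion, approximately 0.433

print(round(p, 3), round(p_hat, 3))
```

The statistic would change from sample to sample, while the parameter stays fixed; estimating the latter with the former is the central task of the later chapters.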

## Sampling Bias

If we want to draw wider conclusions from an experiment we need to be clear about what the population of interest will be and we need a way of obtaining a representative sample from that population. Issues of **sampling bias** are particularly evident in conducting **surveys**. The survey data came from a simple survey of Islanders. In contrast to an experiment, this was essentially a passive process in that the subjects were not changed in any way by collecting the data. The main control you have in a survey is on how you select your subjects.

**Selection bias** occurs when the sample itself is unrepresentative of the population you are trying to describe. A common survey method is to select names at random from a telephone directory and conduct a telephone interview. This will under-represent young people since they will typically live in families or in shared accommodation, where there are several people per telephone, while older people may live as a couple or alone. We call this **undercoverage bias**.

Another form of bias is **self-selection bias**. This is common in surveys where people ring in to give their opinions, such as in television polls or with Big Brother voting. The people who select themselves for these surveys often have strong opinions which may differ from the population at large.

The solution here is to take random samples from the population, in the same way we used randomisation in the comparative experiment to reduce bias. Random samples are difficult in practice since you often don’t have a full list of the population to select from. However, you can try to ensure that, in principle, anyone in the population has an equal chance of being chosen in your study.

Another solution is to gather details from the respondents, such as residence and age, and then look to see if the sample distribution of these match the population distributions provided by the regular national census. If there is a discrepancy then some results can be weighted more highly to compensate.

Even if you can make a perfect random sample there are bias issues to consider, particularly in surveys. The people who don’t respond may systematically differ from those who do, leading to **nonresponse bias**. For example, university students typically do not have the leisure time to complete surveys or interviews and their opinions may differ from the rest of the population.

If you get a response, question wording, survey format, and interviewer effects may give **response bias**.

### Aircraft Survivability

An interesting anecdote, related to the work of Abraham Wald, perfectly captures the problem of survey bias (Mangel & Samaniego, 1984). During World War II, aircraft would often return from missions with bullet holes in various parts of their bodies. The standard practice was to record which areas seemed most likely to receive holes in them and then reinforce those areas with extra metal.

It took a statistician to point out that this approach was flawed. What you really wanted to know was where bullets had hit the aircraft that *did not* come back. These may well have been the opposite areas to those that were being reinforced. The sample consisting of aircraft returning was extremely biased towards those aircraft that survived.

# Probability

Since we try to take random samples from our population we need to describe the behaviour of this random sampling process. A general **random process** is one for which individual results cannot be predicted exactly but for which long-term behaviour can be described. The main random process we will consider is that of picking a person (or object) at random from a population and recording some measurement about them (or it). We will look, in particular, at the results of doing this several times, the process we go through when taking a sample.

In this section we give a brief summary of the basic definitions of probability.

## Sample Spaces and Events

The **sample space**, [latex]\Omega[/latex], for a random process is the set of all the possible **outcomes** that might be observed.

For example, suppose our random process was taking a sample of 3 people from a population of males and females. We could represent the sample space of outcomes as

\[ \Omega = \{\mbox{MMM}, \mbox{MMF}, \mbox{MFM}, \mbox{MFF}, \mbox{FMM}, \mbox{FMF}, \mbox{FFM}, \mbox{FFF} \}, \]

if we were interested in the order of males and females in the sample. Alternatively we could use the sample space

\[ \Omega' = \{ 0, 1, 2, 3 \} \]

if we just wanted to know how many males or females were in our sample. While it takes up more space, the advantage of the first representation is that it is usually easier to calculate probabilities for these simpler outcomes.
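Both representations are small enough to enumerate by hand here, but a short script shows the pattern and scales to larger samples. A Python sketch, assuming a sample of 3 people:

```python
from itertools import product

# Ordered sample space: every sequence of M/F for 3 sampled people.
omega = ["".join(seq) for seq in product("MF", repeat=3)]
print(omega)         # ['MMM', 'MMF', 'MFM', 'MFF', 'FMM', 'FMF', 'FFM', 'FFF']

# Coarser sample space: just the number of females in the sample.
omega_counts = sorted({seq.count("F") for seq in omega})
print(omega_counts)  # [0, 1, 2, 3]
```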

Sample spaces don’t have to be finite. If our random process was to toss a coin until the first head appears then we would have

\[ \Omega = \{ \mbox{H}, \mbox{TH}, \mbox{TTH}, \mbox{TTTH}, \ldots \}, \]

since there is no upper limit to the number of tails we might have to see before the first head (although of course the probabilities get very small).

An **event**, [latex]A[/latex], is a subset of a sample space, [latex]\Omega[/latex]. An event **occurs** if the outcome of the random process is an element of the event.

For example, the event of obtaining at least two females in a sample of size 3 is

\[ A = \{\mbox{MFF}, \mbox{FMF}, \mbox{FFM}, \mbox{FFF}\}. \]

## Probability Axioms

A **probability function** for [latex]\Omega[/latex] assigns a real number to every subset (event) of [latex]\Omega[/latex]. We denote the probability of [latex]A \subseteq \Omega[/latex] by [latex]P(A)[/latex]. This function must satisfy three axioms:

- [latex]P(\Omega) = 1[/latex]
- [latex]P(A) \ge 0[/latex] for all [latex]A \subseteq \Omega[/latex]
- [latex]P(A \cup B) = P(A) + P(B)[/latex] if [latex]A[/latex] and [latex]B[/latex] are disjoint ([latex]A \cap B = \emptyset[/latex])

### Complements

The **complement**, [latex]\overline{A}[/latex], of an event [latex]A[/latex] is the set of all outcomes in [latex]\Omega[/latex] not in [latex]A[/latex].

We can use the above axioms to obtain a formula for [latex]P(\overline{A})[/latex]. Firstly, note that [latex]A \cup \overline{A} = \Omega[/latex] so that

\[ P(A \cup \overline{A}) = P(\Omega) = 1. \]

Now [latex]A[/latex] and [latex]\overline{A}[/latex] are disjoint so

\[ P(A \cup \overline{A}) = P(A) + P(\overline{A}). \]

Putting these together gives [latex]P(\overline{A}) = 1 - P(A)[/latex].

### Equally Likely Outcomes

Let [latex]\Omega = \{\mbox{Brown}, \mbox{Blue}, \mbox{Green}, \mbox{Purple}\}[/latex] be the sample space for the eye colour of a randomly chosen Islander.

Suppose we believed that the four eye colours were all equally likely in our population. Since probabilities must add up to 1, if the outcomes are equally likely then

\[ P(A) = \frac{\mbox{number of outcomes in } A}{\mbox{number of outcomes in } \Omega}. \]

For this example [latex]P(\mbox{Brown}) = 0.25[/latex], [latex]P(\mbox{Blue}) = 0.25[/latex], and so on. We will see how to test whether data matches such a distribution in Chapter 22.
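Under the equally likely assumption, the counting formula is just a ratio of set sizes. As an illustration, here is the earlier "at least two females" event computed this way in Python (assuming, for illustration only, that M and F are equally likely):

```python
from itertools import product

# Equally likely outcomes: P(A) = (outcomes in A) / (outcomes in Omega).
omega = ["".join(seq) for seq in product("MF", repeat=3)]

# Event A: at least two females in a sample of 3.
A = [outcome for outcome in omega if outcome.count("F") >= 2]
print(A)                     # ['MFF', 'FMF', 'FFM', 'FFF']
print(len(A) / len(omega))   # 0.5
```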

Outcomes are not always equally likely, as we’ll see in the following section.

# Discrete Random Variables

General random processes can have outcomes such as male or female, for the sex of a random person, or heads or tails, for the outcome of a coin toss. However, since these are not numbers we can’t do calculations with them directly. Instead we will focus on **random variables**, random processes with numerical outcomes. In this section we look at **discrete** random variables, random variables with discrete outcomes.

A function takes an input and returns an output. For example, the squaring function takes 2 and gives 4, 3 and gives 9, and so on. Since a random variable gives a number we can define a **probability function** for it that simply returns the probability that each value will occur.

### Keno

Although our focus is on sampling from populations, games of chance provide a useful source of general probability functions which we can build ideas on.

In the game of Keno there are 80 balls, numbered from 1 to 80, from which 20 are chosen at random. You are offered a variety of bets you can make on these balls. The simplest is to pick one number as your bet. If your number comes up in the 20 chosen then you win $3, otherwise you don’t win anything.

The probability of winning is easy to calculate. There are 20 balls chosen from 80 so you have a 1 in 4 chance of your number coming up, a probability of 0.25. If [latex]X[/latex] is the amount you win on a particular game then we can write the probability function as a table:

| [latex]x[/latex] | 0 | 3 |
|---|---|---|
| [latex]P(X = x)[/latex] | [latex]\frac{3}{4}[/latex] | [latex]\frac{1}{4}[/latex] |

There are only two outcomes, $3 with probability [latex]\frac{1}{4}[/latex] and $0 with probability [latex]\frac{3}{4}[/latex] (since they must add to 1).
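The 1-in-4 chance is easy to see directly, but it can also be checked by simulation. A minimal Monte Carlo sketch in Python (the bet number 7 is arbitrary):

```python
import random

random.seed(1)  # fixed seed so the run is reproducible
trials = 100_000

# Each game: draw 20 balls from 1..80 and see if our number is among them.
wins = sum(7 in random.sample(range(1, 81), 20) for _ in range(trials))
print(wins / trials)  # close to the exact probability 20/80 = 0.25
```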

# Odds

In addition to probability, another way of describing how likely something is to occur is to talk about its **odds**. If an outcome occurs proportion [latex]p[/latex] of the time then the odds of it occurring are

\[ \frac{p}{1-p}. \]

For example, there were 14 Islanders with green eyes in the survey so if we picked one person at random from our survey the probability of finding someone with green eyes is [latex]p = 14/60 = 0.233[/latex]. The odds of having green eyes are then

\[ \frac{0.233}{1 - 0.233} = \frac{0.233}{0.767} = 0.304. \]

We would thus say the odds of someone having green eyes are 0.304 to 1, or 1 to 3.3. This is simply the statement of a ratio, and we could just as well have said that the odds were 10 to 33 or 0.233 to 0.767. However, it is useful to normalise one side of the ratio to 1 so that it is easy to compare odds.

When the odds are less than 1 to 1, it is common to reverse the order and say that they are the odds **against** the outcome. Here the odds of having green eyes are 3.3 to 1 against.
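These conversions are one-liners. A Python sketch reproducing the green-eyes figures:

```python
def odds(p):
    """Odds in favour of an outcome with probability p."""
    return p / (1 - p)

p = 14 / 60                     # probability of green eyes in the survey
print(round(odds(p), 3))        # 0.304 (odds in favour)
print(round(1 / odds(p), 1))    # 3.3   (odds against)
```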

The nice thing about odds is that they have a wider range of possible values than probabilities. While probabilities are always stuck between 0 and 1, odds can go from 0 (when [latex]p=0[/latex]) up to infinity (as [latex]p[/latex] tends towards 1). Even better are the **log odds**, defined by

\[ \ln\left(\frac{p}{1-p}\right). \]

Here [latex]\ln[/latex] is the **natural** logarithm (see the Appendix for background). When the odds are 1, the log odds are [latex]\ln(1) = 0[/latex]. For odds less than 1 the log odds are negative. For example, the log odds of having green eyes in the above example are

\[ \ln(0.304) = -1.190. \]

As probability and odds tend towards 0, log odds tends towards negative infinity. Thus log odds cover the whole range from negative to positive infinity. This makes them very useful for modelling, as we will see in Chapter 23.
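The log odds transformation, and the inverse mapping back to a probability, can be sketched as follows (slight differences from the text's -1.190 come from rounding the odds to 0.304 before taking the logarithm):

```python
import math

def log_odds(p):
    return math.log(p / (1 - p))     # natural logarithm of the odds

def inv_log_odds(ell):
    return 1 / (1 + math.exp(-ell))  # maps any real number back into (0, 1)

p = 14 / 60
print(round(log_odds(p), 2))                # -1.19
print(round(inv_log_odds(log_odds(p)), 3))  # 0.233
```

The inverse mapping here is the logistic function, which covers the whole real line and returns a valid probability.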

# Surprisal and Information

In this book we will focus on probabilities and odds for describing the likelihood of certain outcomes. However, there are other measures related to probabilities that are also important. Here we give a brief overview of one role of probability in information theory. This is particularly relevant in the context of coded information, such as genetic sequences, but also gives another nice application of logarithms.

The **surprisal** of an outcome of a random variable is a measure of how surprised we would be to see that outcome occur (Tribus, 1961). If [latex]p[/latex] is the probability of the outcome then the surprisal, [latex]u[/latex], is defined by

\[ u = \log_2 \frac{1}{p} = - \log_2 p. \]

For example, if [latex]p = 1[/latex] then the outcome always occurs so we would not be at all surprised to see it and indeed [latex]u = \log_2 1 = 0[/latex]. If [latex]p = \frac{1}{2}[/latex], such as getting heads on the toss of a coin, then [latex]u = \log_2 2 = 1[/latex]. Since a single coin toss is like having a random 0 or 1, we say that the surprisal is 1 **bit** (binary digit). Winning the simple game of Keno has surprisal

\[ u = \log_2 4 = 2 \mbox{ bits}. \]

Suppose someone rolls a die 10 times and gets 10 sixes. How much surprisal would there be? Well if the die was fair then [latex]p = (\frac{1}{6})^{10}[/latex] and so

\[ u = \log_2 (6^{10}) = 10 \log_2(6) = 25.8, \]

giving 25.8 bits of surprisal (see the Appendix for how to calculate base 2 logarithms). This is equivalent to tossing around 26 coins and getting heads each time. In this way surprisal is a nice measure for outcomes that are unlikely to happen, giving a standard reference back to repeated coin tosses.
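The surprisal examples above can be collected into one small function. A Python sketch:

```python
import math

def surprisal(p):
    """Surprisal in bits of an outcome with probability p."""
    return math.log2(1 / p)

print(surprisal(1))                        # 0.0  -- a certain outcome is no surprise
print(surprisal(1 / 2))                    # 1.0  -- one coin toss
print(surprisal(1 / 4))                    # 2.0  -- winning the simple Keno bet
print(round(surprisal((1 / 6) ** 10), 1))  # 25.8 -- ten sixes in a row
```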

Suppose a random process has [latex]n[/latex] outcomes with probabilities [latex]p_1, \ldots, p_n[/latex]. The **information** or **information entropy** of the process, [latex]H[/latex], is the weighted sum of the surprisals,

\[ H = \sum_{j=1}^{n} p_j \log_2 \frac{1}{p_j} = – \sum_{j=1}^{n} p_j \log_2 p_j. \]

Consider the process of tossing a coin, with two outcomes and [latex]p_1 = p_2 = \frac{1}{2}[/latex]. The information in an outcome of the process is

\[ H = \frac{1}{2} \log_2 2 + \frac{1}{2} \log_2 2 = 2 \times \frac{1}{2} \log_2 2 = 1 \mbox{ bit}. \]

For the roll of a fair die the information is

\[ H = 6 \times \frac{1}{6} \log_2 6 = \log_2 6 = 2.58 \mbox{ bits}. \]

This information measure has a practical application in giving the minimum average number of bits for encoding the outcome of a random process. For a coin we need one bit, 0 or 1. For a single die roll we need three bits, giving an encoding such as 000, 001, 010, 011, 100, and 101 for the six outcomes, but we might be able to encode the outcomes of a sequence of dice rolls with an average of 2.58 bits. Thus information entropy is the fundamental basis of electronic communication (Shannon, 1948).

Note that the outcomes do not all have to be equally likely, allowing information encoding to be analysed for more complex random processes such as the sequence of characters in an email, the sequence of pixels in an image, or the sequence of bases in a protein sequence.
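The entropy sum is equally short to compute. A Python sketch, including an unequal-probability example (the 0.9/0.1 biased coin is our own illustration, not from the text):

```python
import math

def entropy(probs):
    """Information entropy in bits of a process with the given probabilities."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

print(entropy([1 / 2, 1 / 2]))         # 1.0  -- fair coin
print(round(entropy([1 / 6] * 6), 2))  # 2.58 -- fair die
print(round(entropy([0.9, 0.1]), 2))   # 0.47 -- a biased coin needs less than 1 bit
```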

# Continuous Random Variables

While it is straightforward to write down a probability function for a discrete random variable, continuous random variables are more subtle. For example, what is the probability that a randomly chosen person will be 160 cm tall? A continuous quantity is one that can be measured to arbitrary precision. If we found someone who was 160 cm we should measure it more accurately to check; they may turn out to actually be 160.00001 cm tall. In fact, it is extremely unlikely that someone would be exactly 160 cm tall; as we measure more and more accurately, we’re almost certain to find out they’re not quite 160 cm or are just a bit over. The probability of any particular height is essentially 0.

Instead we have to work with probabilities for intervals. For example, it makes sense to talk about the probability of a height being between 160.0 cm and 160.1 cm. A probability function for a continuous random variable is called a **probability density function** (pdf) and is plotted as a **density curve**. We use the word “density” since a density curve shows where we can expect values to be more dense or less dense. The pdf must always be positive (or zero) and the total area under the density curve must be 1. The probability of an interval event is simply the area under the density curve above that interval.

Our aim for continuous random variables is to provide models for experiments involving continuous variables, such as height measurements. We will illustrate this idea with two simple density curves that might be used as models for the heights in a population. In Chapter 12 we will introduce the Normal density curve, a much better model for heights.

## Uniform Density

We have seen earlier that the general height distribution is bimodal since there are really two distributions added together, one for males and one for females. To keep our model simple we’ll just look at the female distribution.

Female heights range from roughly 150 cm to 190 cm. A very simple model might be to say that all parts of this range are equally likely as the height of a randomly chosen female. Here is the density curve which captures this model:

It is not a very curvy curve since it has the same value all the way along, a **uniform** density. We have left the vertical value of the density curve as [latex]h[/latex] but you can work out what this value must be. The area of the rectangle is the base length times the height and for a density curve the area must be 1. Thus

\[ (190 - 150) \times h = 1, \]

so [latex]h = 0.025[/latex]. We can work out probabilities from this model by finding the corresponding area under the density curve. For example, if [latex]X[/latex] is the height of a randomly chosen female then the probability that a randomly chosen female is 180 cm or taller is

\[ P(X \ge 180) = (190 - 180) \times 0.025 = 0.25, \]

or 25%. Similarly, the probability that a randomly chosen female is between 160 cm and 175 cm tall is

\[ P(160 \le X \le 175) = (175 - 160) \times 0.025 = 0.375, \]

or 37.5%.

Of course this is a terrible model of female heights. In reality we would not find 25% of females being over 180 cm tall. However this simple model should make it clear what a continuous probability model involves and how probabilities can be calculated from it.
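A sketch of these rectangle-area calculations in Python (interval endpoints outside 150–190 cm are clipped to the support; the helper name is ours):

```python
a, b = 150.0, 190.0   # range of heights in the uniform model
h = 1 / (b - a)       # density height: total area (b - a) * h must be 1

def p_interval(lo, hi):
    # Probability of a height in [lo, hi]: the area of a rectangle.
    lo, hi = max(lo, a), min(hi, b)
    return max(hi - lo, 0.0) * h

print(round(p_interval(180, 190), 5))  # 0.25
print(round(p_interval(160, 175), 5))  # 0.375
```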

## Triangular Density

Before we look at a more realistic model in Chapter 12, let us try one more simple model of female heights for practice. Clearly the uniform model above was inappropriate because we find extreme heights, either very short or very tall, are far less common in observed data than heights nearer the average. The uniform model says that these should all be equally likely. A better model is a “triangular” density curve:

Again we have left [latex]h[/latex] as the height of the triangle but you can work this out. The area of a triangle is a half the base length times the height which again should equal 1. Thus

\[ \frac{1}{2} (190 - 150) \times h = 1, \]

so [latex]h = 0.05[/latex]. Now the probability that a randomly chosen female is 180 cm or taller is a little harder to calculate since we need to know the height of the triangle at 180 cm. Since 180 is half way between 170 and 190, the height here is half way between 0.05 (the height at 170) and 0 (the height at 190). This gives a height of 0.025 so that

\[ P(X \ge 180) = \frac{1}{2} (190 - 180) \times 0.025 = 0.125. \]

It is hard to work out the probability that a randomly chosen female is between 160 cm and 175 cm tall directly since the area required is not a simple triangle. However, the area left over is made up of two triangles whose areas we can easily calculate. Since the total area is 1 we must have

\[ P(160 \le X \le 175) = 1 - (P(X \le 160) + P(X \ge 175)). \]

Now

\[ P(X \le 160) = \frac{1}{2} (160 - 150) \times 0.025 = 0.125, \]

which we already knew since the density is symmetric around 170 cm and [latex]P(X \ge 180) = 0.125[/latex]. The height at 175 cm is 0.0375, half way between 0.05 (the height at 170) and 0.025 (the height at 180). We can use this to calculate

\[ P(X \ge 175) = \frac{1}{2} (190 - 175) \times 0.0375 = 0.28125, \]

so

\[ P(160 \le X \le 175) = 1 - (0.125 + 0.28125) = 1 - 0.40625 = 0.59375, \]

about 59%. This seems more reasonable than the 37.5% that the uniform model above would predict, though it is still not great.
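The same triangle-area reasoning can be written out as code. A Python sketch of the model (peak height 0.05 at 170 cm; the helper names are ours):

```python
a, c, b = 150.0, 170.0, 190.0   # support endpoints and the peak location

def density(x):
    # Piecewise-linear triangle rising to height 0.05 at the peak.
    if a <= x <= c:
        return 0.05 * (x - a) / (c - a)
    if c < x <= b:
        return 0.05 * (b - x) / (b - c)
    return 0.0

def p_upper(x):
    # P(X >= x) for x at or beyond the peak: area of the right-hand triangle.
    return 0.5 * (b - x) * density(x)

def p_lower(x):
    # P(X <= x) for x at or before the peak: area of the left-hand triangle.
    return 0.5 * (x - a) * density(x)

print(round(p_upper(180), 5))                       # 0.125
print(round(p_lower(160), 5))                       # 0.125
print(round(1 - (p_lower(160) + p_upper(175)), 5))  # 0.59375
```

Splitting the awkward middle interval into "one minus two corner triangles" is exactly the complement trick used in the text.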

For the Normal density curve in Chapter 12 it will be impractical to calculate actual areas by hand. However, it is important to be able to reason about areas, such as splitting up a complicated area or working with the opposite area, as we have done with this example.

## Summary

- We will use probabilities to describe the process of sampling from a population.
- A random variable is a random process with a numerical outcome.
- A discrete random variable is a random variable with discrete outcomes. Probabilities can be assigned using a probability function.
- A continuous random variable is a random variable with continuous outcomes. The probability of an individual outcome is always 0, with interval probabilities assigned by areas under a probability density function.
- Odds are an alternative description of the likelihood of an outcome.
- Surprisal is another measure of likelihood for discrete random variables.

## Exercise 1

Suppose a random process has three possible outcomes, [latex]A[/latex], [latex]B[/latex], and [latex]C[/latex]. Which of the following are valid probability models for this process?

- [latex]P(A) = 0.5, P(B) = 0.3, P(C) = 0.2[/latex]
- [latex]P(A) = 0.8, P(B) = 0.5, P(C) = -0.3[/latex]
- [latex]P(A) = 0.8, P(B) = 0.5, P(C) = 0.3[/latex]
- [latex]P(A) = 0.33, P(B) = 0.33, P(C) = 0.33[/latex]

## Exercise 2

The odds of an event are 1.78. What is the probability of that event occurring?

## Exercise 3

The log-odds of an event are -1.266. What is the probability of that event occurring?

## Exercise 4

Suppose an outcome has a surprisal of 2.5. What is the probability of that outcome occurring?

## Exercise 5

For the uniform density shown earlier, find a formula that gives the height [latex]x[/latex] such that [latex]P(X \ge x) = p[/latex], for any [latex]p[/latex].

## Exercise 6

For the triangular density shown earlier, find a formula that gives the height [latex]x[/latex] such that [latex]P(X \ge x) = p[/latex], for any [latex]p[/latex].