12 The Normal Distribution

[latex]\newcommand{\pr}[1]{P(#1)} \newcommand{\var}[1]{\mbox{var}(#1)} \newcommand{\mean}[1]{\mbox{E}(#1)} \newcommand{\sd}[1]{\mbox{sd}(#1)} \newcommand{\Binomial}[3]{#1 \sim \mbox{Binomial}(#2,#3)} \newcommand{\Student}[2]{#1 \sim \mbox{Student}(#2)} \newcommand{\Normal}[3]{#1 \sim \mbox{Normal}(#2,#3)} \newcommand{\Poisson}[2]{#1 \sim \mbox{Poisson}(#2)} \newcommand{\IQR}{\mbox{IQR}}[/latex]

Normal Density Curves

In Chapter 8 we looked at two continuous random variables that might model female height, one with a uniform density curve and one with a triangular density curve. These were both simple shapes and so it was relatively easy to calculate probabilities from areas under the curves. However, the most famous density curve has a much more complicated shape, given by the probability density function
\[ f(x) = \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2}, \]
where [latex]\mu[/latex] and [latex]\sigma[/latex] are parameters of the function.
This curve is called the Normal density curve. There is actually a family of Normal density curves since you get a different function for each pair of values of the parameters [latex]\mu[/latex] and [latex]\sigma[/latex]. The Normal curve is symmetric about the value [latex]x = \mu[/latex], as you might see from the fact that it involves [latex](x - \mu)^2[/latex]. To compare with the uniform and triangular models, the figure below shows a Normal model for female heights where [latex]\mu = 167[/latex] and [latex]\sigma = 6.6[/latex].

Normal density model for female height
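The density formula above can be evaluated directly. The following is a minimal sketch in Python; the function name `normal_density` is our own choice, and the comparison with `statistics.NormalDist` from the standard library is only a sanity check.

```python
import math
from statistics import NormalDist

def normal_density(x, mu, sigma):
    """Evaluate the Normal density f(x) with parameters mu and sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (math.sqrt(2 * math.pi) * sigma)

mu, sigma = 167, 6.6  # the female height model used in this chapter
for height in (154, 160.4, 167, 173.6, 180):
    ours = normal_density(height, mu, sigma)
    library = NormalDist(mu, sigma).pdf(height)  # same value from the library
    print(height, round(ours, 5), round(library, 5))
```

The printed densities are largest at 167 and equal at heights the same distance either side of 167, which is the symmetry about [latex]x = \mu[/latex] described above.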

If [latex]X[/latex] is a continuous random variable whose probabilities are given by a Normal density curve with parameters [latex]\mu[/latex] and [latex]\sigma[/latex] then we say that [latex]X[/latex] has a Normal distribution and write [latex]\Normal{X}{\mu}{\sigma}[/latex].

Note that the name is a bit misleading since a random variable having a Normal distribution is not really ‘normal’ in the usual sense of the word. We will write “Normal” with a capital letter to emphasise the difference. Another name for the function is the Gaussian density curve and so you will sometimes see this used to avoid confusion.

Although the density function for the Normal distribution is complicated, it is possible to use the formulas from Chapter 10 to show that the expected value (or mean) of the Normal distribution is [latex]\mu[/latex] and the standard deviation is [latex]\sigma[/latex] (which is why we use these symbols).
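The claim about the mean and standard deviation can be checked numerically. The sketch below assumes the Chapter 10 formulas are the usual ones, in which the mean is the area under [latex]x f(x)[/latex] and the variance is the area under [latex](x - \mu)^2 f(x)[/latex]. It approximates both areas with a simple midpoint rule; the helper names, the number of steps and the integration width are all our own choices.

```python
import math

def normal_density(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (math.sqrt(2 * math.pi) * sigma)

def moments_by_midpoint_rule(mu, sigma, steps=20000, width=10):
    """Approximate the mean and sd by summing over mu +/- width * sigma."""
    lo = mu - width * sigma
    dx = 2 * width * sigma / steps
    xs = [lo + (i + 0.5) * dx for i in range(steps)]  # midpoints of each strip
    mean = sum(x * normal_density(x, mu, sigma) * dx for x in xs)
    var = sum((x - mean) ** 2 * normal_density(x, mu, sigma) * dx for x in xs)
    return mean, math.sqrt(var)

mean, sd = moments_by_midpoint_rule(167, 6.6)
print(round(mean, 3), round(sd, 3))
```

For the female height model the approximations agree with [latex]\mu = 167[/latex] and [latex]\sigma = 6.6[/latex] to several decimal places.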

Unfortunately there is no simple formula for working out areas under the Normal density curve. Instead areas can be found using the Normal distribution table or with computer software. However, as a rough rule:

  • the area within 1 standard deviation of the mean is about 68%,
  • the area within 2 standard deviations of the mean is about 95%,
  • the area within 3 standard deviations of the mean is about 99.7%.

This is sometimes called the 68-95-99.7 rule, for obvious reasons. We will make particular use of the 95% part.
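The three percentages in the rule can be reproduced with software. This sketch uses `NormalDist` from Python's standard `statistics` module; the helper name `area_within` is ours.

```python
from statistics import NormalDist

Z = NormalDist(0, 1)  # standard Normal distribution

def area_within(k):
    """Area under the Normal curve within k standard deviations of the mean."""
    return Z.cdf(k) - Z.cdf(-k)

for k in (1, 2, 3):
    print(k, round(area_within(k), 4))
```

The areas come out at roughly 0.6827, 0.9545 and 0.9973, which round to the 68%, 95% and 99.7% of the rule.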

Standard Units

There is a whole family of Normal distributions but it would be impractical to have a table of areas for each one. Instead we focus on one Normal distribution, with mean 0 and standard deviation 1, the standard Normal distribution.

Suppose female heights are Normally distributed with mean 167 cm and standard deviation 6.6 cm, and consider a female whose height is 180 cm. The idea of standard units is to describe how many standard deviations a value is from the mean. Here 180 is 13 cm above the mean, which is 13/6.6 = 1.97 standard deviations. Note that the units cancel, so 1.97 is dimensionless; its units are just “standard deviations”. Subtracting and dividing by constants doesn’t change the shape of the distribution, so the 1.97 is still from a Normal distribution but with mean 0 and standard deviation 1.

The historical advantage of doing this is that we only need one table of Normal probabilities. To find an area we first standardise our value and then consult the standard Normal table. The standardised value is called a z score. For example, the [latex]z[/latex] score for a female’s height of 180 cm in the Normal(167, 6.6) distribution is 1.97, as above. The probability that a random standardised value, [latex]Z[/latex], is 1.97 or greater is [latex]\pr{Z \ge 1.97} = 0.024[/latex], from the Normal distribution table. Thus 2.4% of females are taller than a female who is 180 cm tall.

In general, standardising takes a random variable [latex]\Normal{X}{\mu}{\sigma}[/latex] and gives
\[ Z = \frac{X - \mu}{\sigma} \]
so that [latex]\Normal{Z}{0}{1}[/latex].
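As an illustration, the height example can be worked in a few lines of Python. This is only a sketch: `z_score` is our own helper, and the [latex]z[/latex] score is rounded to two decimal places to mimic looking it up in the table.

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above the mean."""
    return (x - mu) / sigma

z = z_score(180, 167, 6.6)
# Upper-tail area P(Z >= z), using z rounded as it would be for the table
upper_tail = 1 - NormalDist(0, 1).cdf(round(z, 2))
print(round(z, 2), round(upper_tail, 3))  # prints: 1.97 0.024
```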

Such calculations are not so important these days since software and graphics calculators can work out these probabilities from scratch, without having to standardise. Just like the logarithm tables of thirty years ago, standard Normal tables will one day disappear.

However, it is still useful to think in terms of standard units. For example, if male heights are Normally distributed with mean 179 cm and standard deviation 7.6 cm, then a male who is 194 cm tall is also 1.97 standard deviations above the mean. Even though 180 and 194 are quite different, they are the same relative to their distributions.
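The comparison between the two heights can be sketched in Python as follows. Here the tail areas are computed directly from each height distribution, without standardising first; the variable names are ours.

```python
from statistics import NormalDist

female = NormalDist(mu=167, sigma=6.6)
male = NormalDist(mu=179, sigma=7.6)

# Standard units: how many standard deviations above the mean?
z_female = (180 - female.mean) / female.stdev
z_male = (194 - male.mean) / male.stdev

# Proportion taller, computed on the original scale
p_female = 1 - female.cdf(180)
p_male = 1 - male.cdf(194)

print(round(z_female, 2), round(z_male, 2))
print(round(p_female, 3), round(p_male, 3))
```

Both [latex]z[/latex] scores round to 1.97 and both tail areas round to 0.024, so the two people sit at about the same relative position in their own distributions.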

How many standard units will put you in the middle 95% of the distribution? To have 95% in the middle would mean there is 5% left over. Since the Normal distribution is symmetric this gives 2.5% in each tail. Hunting around in the Normal distribution table we find that 1.96 standard units has an area of 0.025 to the right. Thus 95% of observations will occur within 1.96 standard deviations of the mean. This is where the rough “2 standard deviations” in the 68-95-99.7 rule comes from.
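Instead of hunting through the table, software can invert the calculation. The sketch below uses `NormalDist.inv_cdf` from the standard library; `central_cutoff` is a name we have made up for the helper.

```python
from statistics import NormalDist

Z = NormalDist(0, 1)

def central_cutoff(coverage):
    """Standard units that leave `coverage` of the area in the middle."""
    tail = (1 - coverage) / 2   # area left over in each tail
    return Z.inv_cdf(1 - tail)  # z value with that area to its right

print(round(central_cutoff(0.95), 2))  # prints: 1.96
```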

Normal Probability Plots

The standard Normal density curve can be viewed as a theoretical distribution and so we can work out properties of the distribution. For instance, the median is 0, the same as the mean since the distribution is symmetric. The third quartile is the [latex]z[/latex] value with 25% of the area above it. Looking through the Normal distribution table we see this is about 0.67. In fact, [latex]Q_3 = 0.674[/latex]. Then [latex]Q_1 = -0.674[/latex] because of the symmetry about 0.
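These theoretical quartiles can also be found with software rather than the table. A minimal sketch using the standard library:

```python
from statistics import NormalDist

Z = NormalDist(0, 1)

q1 = Z.inv_cdf(0.25)      # 25% of the area lies below Q1
median = Z.inv_cdf(0.50)
q3 = Z.inv_cdf(0.75)      # 25% of the area lies above Q3

print(round(q1, 3), round(median, 3), round(q3, 3))
```

The values agree with [latex]Q_1 = -0.674[/latex], a median of 0 and [latex]Q_3 = 0.674[/latex].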

This gives us a simple way of assessing whether observed values of a continuous variable could be described by a Normal density curve. We can work out our sample values of the median, [latex]Q_1[/latex] and [latex]Q_3[/latex], as well as more quantiles if we wish, and compare them to the theoretical values. This can be done visually by making a plot where the vertical axis is our variable and the horizontal axis is for the Normal distribution, and each point compares a quantile from the data with a theoretical quantile. Such a plot is called a Normal probability plot or a Normal quantile plot.

Normal probability plot of pulse rate (bpm)

The figure above shows a Normal probability plot for the pulse rate observations in the survey data. The horizontal axis is given in standard deviations away from the mean. The median pulse rate was 68 bpm and so on the plot there is a point at (0, 68). Similarly the first quartile was 62 bpm and so there is a point at (-0.674, 62). All the points lie roughly along a straight line which indicates a consistent correspondence between the quantiles of the data and the quantiles of the Normal distribution. This suggests the Normal distribution is a useful model of the data and does so more objectively than looking at a histogram. The horizontal stacking of points arises from the somewhat discrete nature of the pulse rate measurements.

Normal probability plot of weight (kg)

The figure above shows a Normal probability plot for the weight observations in the survey data. This time the points do not seem to follow a straight line. The few points to the right have higher weight values than they should for a Normal density curve, which suggests that the right tail of the data is more stretched out than it should be. At the same time the points to the left are not as low as they should be for a Normal density, which suggests that the left tail of the data does not extend as far as it would if it were Normal. If you sketch a density curve using this description you will see you have a distribution which is skewed to the right. Normal probability plots are thus also useful for detecting skewness in data.

In later chapters we will use methods which are based on the assumption that data is close to Normal. Probability plots are a useful way of assessing how valid this assumption is for a particular set of observations.
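To make the construction of a Normal probability plot concrete, the sketch below computes the coordinates of the points. It makes two assumptions that are not from the text: the pulse rates are invented for illustration, and the theoretical quantile for the [latex]i[/latex]th smallest of [latex]n[/latex] values is taken at probability [latex](i - 0.5)/n[/latex], which is one common convention among several.

```python
from statistics import NormalDist

def normal_plot_points(values):
    """Pair each sorted value with a standard Normal quantile."""
    ordered = sorted(values)
    n = len(ordered)
    Z = NormalDist(0, 1)
    # enumerate starts at 0, so (i + 0.5) / n is the (i - 0.5) / n
    # convention for 1-based ranks (an assumption; other choices exist)
    return [(Z.inv_cdf((i + 0.5) / n), x) for i, x in enumerate(ordered)]

# Hypothetical pulse rates (bpm), made up for this example only
pulse = [62, 70, 68, 74, 58, 66, 80, 64, 72, 68, 60, 76]

for z, x in normal_plot_points(pulse):
    print(f"{z:6.2f}  {x}")
```

Plotting the second column against the first with any graphing tool gives the Normal probability plot; points falling close to a straight line suggest the Normal model is reasonable.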

The 1.5 [latex]\times[/latex] IQR Rule Revisited

In Chapter 4 we introduced the “1.5 [latex]\times[/latex] IQR rule” for flagging possible outliers in data, typically used in making box plots. But why is this an appropriate rule and where does the “1.5” come from?

Suppose our population does in fact have a perfect Normal distribution and we take a sample from it without error. What is the probability that a genuine observation will be flagged as an outlier by the rule?

We noted in the previous section that [latex]Q_1 = -0.674[/latex] and [latex]Q_3 = 0.674[/latex] so [latex]\IQR = Q_3 - Q_1 = 1.348[/latex]. Thus we would reject any value with a [latex]z[/latex] score of
\[ 0.674 + 1.5 \times 1.348 = 2.696 \]
or higher. The probability of a real value being this high is just [latex]\pr{Z \ge 2.696} = 0.003[/latex] from the Normal distribution table. The chance of a genuine observation being rejected on the lower side is also 0.003, so together the chance of being flagged as an outlier is about 0.006. This is then the probability of making a mistake when using this rule with Normal data and is not too bad, less than 1%.

However, this again illustrates why you should always investigate observations flagged as outliers before thinking of removing them from your analysis. With a sample of 100 observations in which there really are no “outliers”, an error rate of just under 1% means it would not be surprising to see an observation or two flagged. This error rate is much higher for skewed distributions, so you should be particularly cautious then.
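The calculation for the 1.5 [latex]\times[/latex] IQR rule can be repeated with software, which avoids the rounding in the three-decimal table. This is a sketch only, with variable names of our own choosing.

```python
from statistics import NormalDist

Z = NormalDist(0, 1)

q1, q3 = Z.inv_cdf(0.25), Z.inv_cdf(0.75)
iqr = q3 - q1

upper_fence = q3 + 1.5 * iqr   # values above this are flagged
lower_fence = q1 - 1.5 * iqr   # values below this are flagged

# Chance that a genuine Normal observation falls outside the fences
p_flagged = (1 - Z.cdf(upper_fence)) + Z.cdf(lower_fence)

print(round(upper_fence, 3), round(p_flagged, 4))
print(round(100 * p_flagged, 2), "expected flags per 100 observations")
```

Without the table rounding the chance of a flag comes out at about 0.007 rather than 0.006, which is still less than 1%.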

Histograms

For large numbers of observations, histograms and other density estimates are often very useful in assessing symmetry, the most important aspect of Normality to test. However, be careful with histograms from small data sets. For example, the figure below shows ten histograms from samples of size 20.

Example histograms of Normal data

None of these look as nice as the idealised Normal density in the previous figure but in fact all of them show samples from a mathematical process that generates data from a perfect Normal distribution. In particular, each of these histograms is showing data from a symmetric distribution, so this is a chance to get a feel for how far from symmetric a histogram should look before you declare it “skewed”.
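You can repeat this experiment yourself. The sketch below draws ten samples of size 20 from the standard Normal distribution and prints the bin counts of a crude histogram for each. The bin edges, the seed and the function name `text_histogram` are arbitrary choices of ours, not the settings used for the figure.

```python
import random

def text_histogram(values, lo=-3.0, hi=3.0, bins=6):
    """Count how many values fall in each of `bins` equal-width bins."""
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        if lo <= v < hi:  # values outside [lo, hi) are ignored
            index = min(int((v - lo) / width), bins - 1)
            counts[index] += 1
    return counts

rng = random.Random(1)  # fixed seed (arbitrary) so reruns give the same samples
for _ in range(10):
    sample = [rng.gauss(0, 1) for _ in range(20)]
    print(text_histogram(sample))
```

Each printed row is one histogram, reading the bins from left to right. Comparing the rows gives a feel for how lopsided the counts can be even though every sample comes from a symmetric distribution.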

Random Numbers

To conclude this section, the table below gives 540 random numbers taken from the standard Normal distribution, with mean 0 and standard deviation 1. These can be used for making your own data sets for practice exercises. If you want to generate random data from a variable [latex]\Normal{X}{\mu}{\sigma}[/latex] then note that the formula we have used for standardising can be rearranged to give
\[ X = \mu + \sigma Z, \]
where [latex]\Normal{Z}{0}{1}[/latex].

Random numbers from Normal(0,1)

                   
-0.96 0.46 1.22 0.10 1.49 -0.58 -1.81 -0.89 0.93 -0.28
1.54 -0.86 0.93 0.74 -0.44 -1.61 -0.22 1.09 -0.28 -1.34
-1.28 -0.51 -1.10 -0.04 -0.36 -0.62 0.77 -2.27 -0.09 -0.33
1.13 -0.14 0.15 1.32 -0.41 -2.00 0.12 0.11 -1.17 -0.69
-0.92 1.05 -0.74 -0.08 0.56 -0.23 0.62 0.05 -0.96 0.02
1.28 0.99 0.56 1.36 1.80 -0.00 -1.76 0.29 -0.36 -0.99
1.65 0.86 0.44 0.65 0.15 0.78 1.83 0.18 -0.28 0.64
-0.24 1.16 -0.50 1.10 1.55 0.80 -0.88 -1.03 1.34 -1.62
0.66 -0.19 -1.49 0.27 -0.25 0.07 0.64 -0.25 -0.05 -0.67
1.94 -0.27 0.67 -1.37 0.45 1.48 0.28 -0.07 0.07 0.46
0.10 -0.80 0.34 0.14 -1.41 -2.41 0.96 0.06 0.19 -1.89
-0.52 1.55 -0.06 -1.26 -0.73 0.48 -0.72 -0.93 -1.79 0.59
0.94 0.69 -1.10 -0.60 1.88 1.86 -1.93 1.81 0.78 -0.65
-1.60 -0.42 -2.12 -0.50 0.55 1.09 1.07 -1.66 -1.20 -0.43
-2.29 -0.82 0.07 0.68 -0.10 -0.51 0.72 -0.12 0.29 0.26
-0.48 0.42 0.01 -0.12 1.25 0.74 -0.11 1.71 2.16 -0.23
0.58 1.58 0.84 -2.67 0.10 0.08 -1.61 0.59 0.80 0.64
-0.26 -0.92 -1.63 0.22 1.37 1.55 -0.44 0.03 1.14 -2.12
-1.20 -1.81 -0.94 0.06 1.02 -1.55 1.00 -0.10 0.11 0.64
1.56 0.05 0.01 1.03 -0.57 1.33 2.18 -0.27 -1.60 2.57
1.36 0.60 -0.61 0.37 0.99 1.58 -0.65 -0.24 0.18 -0.54
-1.71 0.92 1.39 -0.50 -2.06 0.06 0.11 -1.14 0.34 0.10
-0.24 0.16 1.23 2.44 -0.59 -0.41 0.70 -1.22 -1.35 -1.95
-0.57 1.46 -1.41 0.36 -1.00 -0.80 0.57 1.01 -0.05 -0.73
1.13 -0.52 0.16 -0.71 -0.42 -0.67 1.77 0.00 -0.44 -0.23
0.22 1.65 0.28 -1.14 1.71 -1.35 0.72 -1.23 -0.96 0.02
0.09 -0.86 0.00 -1.85 1.69 1.24 0.93 0.96 -0.28 0.12
0.62 0.36 -0.66 0.06 1.91 -1.16 -0.61 -0.45 -1.04 0.56
-0.74 0.13 2.13 0.50 -0.62 -0.74 -1.17 1.55 -0.15 0.36
-1.36 0.05 0.07 1.17 0.63 -1.11 -0.94 -0.51 -1.62 -0.04
-2.43 0.34 0.43 -1.18 1.50 1.30 -1.13 0.45 0.21 0.55
-0.07 -0.30 0.01 -1.14 2.51 1.54 0.46 -0.44 -1.41 1.17
0.44 1.07 -0.16 0.70 2.22 -0.55 -1.14 -0.55 2.29 1.17
0.30 -0.33 -0.48 1.02 2.01 -1.59 1.02 -0.49 -0.28 0.10
0.78 -0.23 -2.18 -0.34 -0.72 -1.52 -0.00 0.06 -0.53 -1.16
-0.55 0.78 0.56 0.71 1.96 -1.40 0.64 -0.23 -0.06 -1.13
-0.00 -0.46 0.24 -1.44 0.57 -0.03 -0.20 -0.95 -0.60 -0.56
-0.22 1.37 -0.66 -1.38 -0.60 -0.58 0.71 -1.85 -0.99 -1.39
0.78 0.08 0.69 1.01 0.30 -0.48 2.64 0.74 0.01 -0.16
-1.60 -0.58 0.94 -0.78 -0.39 1.04 0.27 0.63 0.40 -0.42
-1.28 -0.40 -0.37 2.18 0.47 1.29 0.14 -0.25 0.24 0.66
1.26 -1.31 -0.58 -0.68 1.62 0.80 -0.08 -0.75 1.32 -0.03
1.05 -1.22 -0.32 -0.46 -0.20 1.10 -0.29 -0.87 -1.02 0.04
0.68 0.74 -0.37 -1.16 -1.77 -1.62 -1.02 -0.22 -1.76 -1.17
0.25 0.44 0.52 1.29 -2.37 0.08 1.12 -0.26 -1.10 -2.57
1.20 0.03 1.89 0.13 -0.74 0.22 -0.16 0.20 0.35 -0.21
0.74 1.50 -1.15 0.14 1.03 -1.96 0.25 -0.29 0.43 -0.21
0.66 0.58 -1.53 -0.79 0.66 -0.64 -0.59 -0.79 0.23 -1.49
1.33 1.40 0.81 -0.22 -0.21 -0.65 1.94 -0.46 -1.39 -2.78
0.40 0.88 0.85 0.23 0.18 -0.34 0.50 1.28 -0.92 -0.21
0.30 -1.21 -0.80 -0.56 0.65 0.59 -1.12 -0.56 -0.27 0.68
-0.07 1.77 -0.66 -1.24 -0.01 -0.93 0.12 -0.47 0.78 -0.88
-1.08 -0.12 -0.48 -0.52 0.15 1.90 0.51 0.55 -0.50 1.18
1.58 0.08 0.97 2.14 0.31 -0.30 -1.92 1.00 -0.52 0.48

For example, suppose you want to make some data that could have come from a population of female heights with the Normal(167,6.6) distribution. The first number given in the table above is -0.96 so we would obtain a height of
\[ 167 + 6.6\times(-0.96) = 167 - 6.34 = 160.66, \]
so the first female would have a height of 160.7 cm. The second number, 0.46, gives a height of 170.0 cm, and so on. You could continue this process for all the 10 numbers on the first line, giving 10 random heights. You could then try visualising these heights, or analyse them with one of the methods later in the book. The advantage is that you know what population they have come from and so you can compare your answers based on the random data with the truth about the population.
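The conversion of a whole row of the table can be sketched in Python as below. The list `first_row` is copied from the first line of the table; the function name `to_normal` is our own.

```python
def to_normal(z_values, mu, sigma):
    """Convert standard Normal numbers to Normal(mu, sigma) using X = mu + sigma * Z."""
    return [mu + sigma * z for z in z_values]

# First row of the table of random numbers from Normal(0,1)
first_row = [-0.96, 0.46, 1.22, 0.10, 1.49, -0.58, -1.81, -0.89, 0.93, -0.28]

heights = [round(h, 1) for h in to_normal(first_row, mu=167, sigma=6.6)]
print(heights)
```

The first two values in the result are 160.7 and 170.0, matching the heights worked out by hand above.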

Summary

  • The Normal distribution is a model for continuous random variables. We will use it to describe sampling from a population but will more importantly use it for describing the sampling distribution of statistics.
  • [latex]\Normal{X}{\mu}{\sigma}[/latex] indicates that [latex]X[/latex] is a continuous random variable having the Normal distribution with mean [latex]\mu[/latex] and standard deviation [latex]\sigma[/latex].
  • If [latex]\Normal{X}{\mu}{\sigma}[/latex] then [latex]\Normal{Z}{0}{1}[/latex] where [latex]Z = (X-\mu)/\sigma[/latex] is the standardised Normal score.
  • Normality can be assessed using Normal probability plots.

Exercise 1

Suppose the lymphocyte count from a blood test has a Normal distribution with mean [latex]2.5\times 10^9[/latex]/L and standard deviation [latex]0.765\times 10^9[/latex]/L. What is the probability that a randomly chosen blood test will have a lymphocyte count between [latex]2.3 \times 10^9[/latex]/L and [latex]2.9 \times 10^9[/latex]/L?

Exercise 2

Suppose the copper level from a blood test has a Normal distribution with mean 18.5 [latex]\mu[/latex]mol/L and standard deviation 3.827 [latex]\mu[/latex]mol/L. What is the lowest copper level that would put a blood test result in the highest 1%?

Exercise 3

Make a statistical table, similar to the one in the Normal distribution table, for the distribution pictured in the uniform density model from Chapter 8. Use 10 cm units on the left and 1 cm units at the top. Stop once you are clear what is involved, or complete the whole table in a group.

Exercise 4

Make a statistical table, similar to the one in the Normal distribution table, for the distribution pictured in the triangular density model. Use 10 cm units on the left and 1 cm units at the top. Stop once you are clear what is involved, or complete the whole table in a group.

Exercise 5

Make a statistical table, similar to the one in the Normal distribution table, for the distribution pictured earlier in this chapter. Use 10 cm units on the left and 1 cm units at the top. Stop once you are clear what is involved, or complete the whole table in a group.

Exercise 6

Use each row of Normal random numbers in the previous table to make a box plot. As with the example histograms, the box plots you make will come from a symmetric distribution. Use these to get a feel for what a box plot might look like even when the population is symmetric.

Exercise 7

The lift in a university building has a sign stating a maximum of 20 persons or 1360 kg. Suppose the population of students at the university has a mean weight of 67.1 kg with standard deviation 14.08 kg and we pack a random sample of 20 students into the lift. What is the probability that their combined weight will exceed the 1360 kg limit?

Standard Normal distribution

[latex]z[/latex] 0 1 2 3 4 5 6 7 8 9
0.0 0.500 0.496 0.492 0.488 0.484 0.480 0.476 0.472 0.468 0.464
0.1 0.460 0.456 0.452 0.448 0.444 0.440 0.436 0.433 0.429 0.425
0.2 0.421 0.417 0.413 0.409 0.405 0.401 0.397 0.394 0.390 0.386
0.3 0.382 0.378 0.374 0.371 0.367 0.363 0.359 0.356 0.352 0.348
0.4 0.345 0.341 0.337 0.334 0.330 0.326 0.323 0.319 0.316 0.312
0.5 0.309 0.305 0.302 0.298 0.295 0.291 0.288 0.284 0.281 0.278
0.6 0.274 0.271 0.268 0.264 0.261 0.258 0.255 0.251 0.248 0.245
0.7 0.242 0.239 0.236 0.233 0.230 0.227 0.224 0.221 0.218 0.215
0.8 0.212 0.209 0.206 0.203 0.200 0.198 0.195 0.192 0.189 0.187
0.9 0.184 0.181 0.179 0.176 0.174 0.171 0.169 0.166 0.164 0.161
1.0 0.159 0.156 0.154 0.152 0.149 0.147 0.145 0.142 0.140 0.138
1.1 0.136 0.133 0.131 0.129 0.127 0.125 0.123 0.121 0.119 0.117
1.2 0.115 0.113 0.111 0.109 0.107 0.106 0.104 0.102 0.100 0.099
1.3 0.097 0.095 0.093 0.092 0.090 0.089 0.087 0.085 0.084 0.082
1.4 0.081 0.079 0.078 0.076 0.075 0.074 0.072 0.071 0.069 0.068
1.5 0.067 0.066 0.064 0.063 0.062 0.061 0.059 0.058 0.057 0.056
1.6 0.055 0.054 0.053 0.052 0.051 0.049 0.048 0.047 0.046 0.046
1.7 0.045 0.044 0.043 0.042 0.041 0.040 0.039 0.038 0.038 0.037
1.8 0.036 0.035 0.034 0.034 0.033 0.032 0.031 0.031 0.030 0.029
1.9 0.029 0.028 0.027 0.027 0.026 0.026 0.025 0.024 0.024 0.023
2.0 0.023 0.022 0.022 0.021 0.021 0.020 0.020 0.019 0.019 0.018
2.1 0.018 0.017 0.017 0.017 0.016 0.016 0.015 0.015 0.015 0.014
2.2 0.014 0.014 0.013 0.013 0.013 0.012 0.012 0.012 0.011 0.011
2.3 0.011 0.010 0.010 0.010 0.010 0.009 0.009 0.009 0.009 0.008
2.4 0.008 0.008 0.008 0.008 0.007 0.007 0.007 0.007 0.007 0.006
2.5 0.006 0.006 0.006 0.006 0.006 0.005 0.005 0.005 0.005 0.005
2.6 0.005 0.005 0.004 0.004 0.004 0.004 0.004 0.004 0.004 0.004
2.7 0.003 0.003 0.003 0.003 0.003 0.003 0.003 0.003 0.003 0.003
2.8 0.003 0.002 0.002 0.002 0.002 0.002 0.002 0.002 0.002 0.002
2.9 0.002 0.002 0.002 0.002 0.002 0.002 0.002 0.001 0.001 0.001
3.0 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001
3.1 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001
3.2 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001
3.3

This table gives [latex]P(Z \ge z)[/latex] for [latex]\Normal{Z}{0}{1}[/latex]. Critical values of the Normal distribution, the [latex]z^{*}[/latex] values such that [latex]\pr{Z \ge z^{*}} = p[/latex] for a particular [latex]p[/latex], can be found from the [latex]\infty[/latex] row of the table of critical values of Student’s T distribution.
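To read the table, find the row for the first decimal place of [latex]z[/latex] and the column for the second decimal place. For example, [latex]z = 1.96[/latex] is in row 1.9 and column 6. A row of the table can be reproduced with software, as in this sketch (the function name `table_row` is ours):

```python
from statistics import NormalDist

Z = NormalDist(0, 1)

def table_row(first_decimal):
    """Upper-tail areas P(Z >= z) for z = first_decimal + 0.00, ..., + 0.09."""
    return [round(1 - Z.cdf(first_decimal + j / 100), 3) for j in range(10)]

print(table_row(1.9))
```

The entries in positions 6 and 7 of this row are 0.025 and 0.024, the values used earlier for [latex]z = 1.96[/latex] and [latex]z = 1.97[/latex].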

Licence


A Portable Introduction to Data Analysis Copyright © 2024 by The University of Queensland is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License, except where otherwise noted.
