Chapter 26 Computations Related to Normal Models

Welcome back to Quantitative Reasoning! A common assumption in statistics is that quantitative data are consistent with a so-called “normal” model. A normal model is a mathematical idealisation, which fits many, but not all, real-world data. In this tutorial, we learn how to

draw the bell-shaped curve that represents the probability density of a normal model,
find the area under the bell curve to the left of a given x-value,
find the x-value that corresponds to a given area and
generate random numbers that are consistent with a normal model.

In R, the mathematical equation for the bell curve is implemented by the function dnorm(). For example, here is the y-value that corresponds to an x-value of 600 on the bell curve with mean 500 and standard deviation 100.

dnorm(600, mean = 500, sd = 100)

## [1] 0.002419707
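To see what dnorm() computes, we can type in the bell curve equation ourselves. The sketch below assumes the usual formula for the normal density, \(f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-(x-\mu)^2/(2\sigma^2)}\). The helper name bell_curve() is made up for this illustration and isn't part of R.

```r
# Hypothetical helper (not part of R): the normal density written out by hand.
bell_curve <- function(x, mu, sigma) {
  1 / (sigma * sqrt(2 * pi)) * exp(-(x - mu)^2 / (2 * sigma^2))
}

by_hand <- bell_curve(600, mu = 500, sigma = 100)
built_in <- dnorm(600, mean = 500, sd = 100)

# Compare both values up to floating-point rounding.
all.equal(by_hand, built_in)
```

If the formula is typed in correctly, both values should agree with the output of dnorm() shown above.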

As our textbook points out, it’s often a good idea to make a picture when working with normal models. We can draw graphs of mathematical functions with the R function curve(). For our purposes, we need three of its arguments:

the function we want to draw, which itself contains an argument x for the values to be plotted along the x-axis,
the argument from, which specifies the minimum x-coordinate, and
the argument to, which specifies the maximum x-coordinate.

For example, here is the bell curve for the normal model with mean 500 and standard deviation 100, plotted between the x-coordinates 200 and 800.

curve(dnorm(x, mean = 500, sd = 100),
      from = 200,
      to = 800)
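As a possible extension, we can mark features of the model on top of the bell curve. The sketch below uses the base R functions abline() and points() to add a dashed vertical line at the mean and a dot at the point whose y-value we computed with dnorm() earlier. The variable name bell is chosen only for this example.

```r
# Draw the bell curve; curve() also returns the plotted coordinates invisibly.
bell <- curve(dnorm(x, mean = 500, sd = 100),
              from = 200,
              to = 800)

# Dashed vertical line at the mean of the model.
abline(v = 500, lty = 2)

# Solid dot at x = 600, with the y-value taken from dnorm().
points(600, dnorm(600, mean = 500, sd = 100), pch = 19)
```

The list bell contains the x- and y-coordinates that curve() used, which can be handy if we want to inspect the plotted values.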

If the normal model is appropriate for our data, we can use the function pnorm() to find the fraction of data points that are smaller than a given value x. Let’s look at the example on page 138 of our textbook.

Each part of the SAT Reasoning Test has a distribution that is roughly unimodal and symmetric and is designed to have an overall mean of about 500 and a standard deviation of 100 for all test takers. Suppose you earned a 600 on one part of your SAT. Where do you stand among all students who took the test?

Let’s assume that the normal model is suitable for the SAT scores. We wouldn’t have to use R to solve this problem. Because a score of 600 lies exactly one standard deviation above the mean, the numbers follow easily from the 68-95-99.7 rule stated on page 136 of our textbook. Still, it’s good to know that pnorm() can solve problems like these. Here is the command.

pnorm(600, mean = 500, sd = 100)

## [1] 0.8413447

The output confirms the textbook’s solution:

My score of 600 is higher than about 84% of all scores on this test.
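We can also check the 68-95-99.7 rule itself with pnorm(). The following sketch computes the fraction of scores within 1, 2 and 3 standard deviations of the mean, and then repeats the rule-of-thumb reasoning for a score of 600. The variable names are chosen only for this example.

```r
mu <- 500
sigma <- 100
k <- 1:3

# Fraction of scores within k standard deviations of the mean, for k = 1, 2, 3.
within_k <- pnorm(mu + k * sigma, mean = mu, sd = sigma) -
  pnorm(mu - k * sigma, mean = mu, sd = sigma)
round(within_k, 4)

# Rule of thumb for a score one standard deviation above the mean:
# half of all scores lie below the mean, plus half of the middle 68%.
rule_of_thumb <- 0.5 + 0.68 / 2
rule_of_thumb
```

The three fractions round to roughly 68%, 95% and 99.7%, and the rule of thumb gives 0.84, in line with the output of pnorm() above.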

We can think of the value returned by pnorm() as the area under the bell curve to the left of the given x-value (the red area in the plot). Note that the total area under the curve (from \(x=-\infty\) to \(x=+\infty\)) is “normalised” to be equal to 1.
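To convince ourselves that pnorm() really is an area, we can integrate the bell curve numerically with the base R function integrate(). This is only a sketch: instead of starting at \(-\infty\), it starts 10 standard deviations below the mean, on the assumption that the area further to the left is negligibly small.

```r
# Numerically integrate dnorm() from 10 standard deviations below the mean up to 600.
# The extra arguments mean and sd are passed on to dnorm().
area <- integrate(dnorm,
                  lower = 500 - 10 * 100,
                  upper = 600,
                  mean = 500,
                  sd = 100)

area$value                        # area found by numerical integration
pnorm(600, mean = 500, sd = 100)  # area according to pnorm()
```

The two numbers should agree up to the numerical error of the integration.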

With this picture in mind, it’s easy to solve the next problem in the textbook (page 140).

What proportion of SAT scores falls between 450 and 600?

The answer is equal to the area to the left of 600 (the red area in the upper plot) minus the area to the left of 450 (the blue area in the lower plot).

Here is how we calculate the answer with pnorm().

pnorm(600, mean = 500, sd = 100) - pnorm(450, mean = 500, sd = 100)

## [1] 0.5328072

Our answer agrees with the textbook.

The normal model estimates that about 53.3% of SAT scores fall between 450 and 600.
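If we need many calculations of this kind, we could wrap the subtraction in a small function. The helper prob_between() below is hypothetical (it isn't part of R) and is meant only as a sketch. The last lines use the fact that the total area under the bell curve equals 1 to find the fraction of scores above 600.

```r
# Hypothetical helper (not part of R): fraction of a normal model between lower and upper.
prob_between <- function(lower, upper, mean = 0, sd = 1) {
  pnorm(upper, mean = mean, sd = sd) - pnorm(lower, mean = mean, sd = sd)
}

between_450_600 <- prob_between(450, 600, mean = 500, sd = 100)
between_450_600

# Fraction of scores above 600: total area 1 minus the area to the left of 600.
above_600 <- 1 - pnorm(600, mean = 500, sd = 100)
above_600
```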

Thanks to pnorm(), we can find the fraction of data points below a given score, but sometimes we want to do the opposite: given a fraction, we want to know the score below which that fraction of data points falls. This score is called a quantile, and we can find it with qnorm(). Consider this example from our textbook (page 141):

Suppose a college says it admits only people with SAT Verbal test scores among the top 10%. How high a score does it take to be eligible?

Here is R’s answer. We call the qnorm() function and use as its first argument 0.9, which is the fraction of applicants below the score that we’re looking for.

qnorm(0.9, mean = 500, sd = 100)

## [1] 628.1552

We conclude that we need a score of about 628 points.
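A quick way to check this result is to feed the cutoff back into pnorm(), because pnorm() and qnorm() undo each other. The sketch below also shows an alternative call that uses qnorm()'s argument lower.tail = FALSE, so that we can enter the top 10% directly as 0.1. The variable names are chosen only for this example.

```r
# Cutoff score with 90% of applicants below it.
cutoff <- qnorm(0.9, mean = 500, sd = 100)

# Feeding the cutoff back into pnorm() recovers the fraction 0.9.
pnorm(cutoff, mean = 500, sd = 100)

# Same cutoff, specified by the fraction of applicants ABOVE it.
cutoff_upper <- qnorm(0.1, mean = 500, sd = 100, lower.tail = FALSE)
all.equal(cutoff, cutoff_upper)
```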

Sometimes it’s useful to generate random numbers that match the normal model. For example, we may want to confirm a mathematical result with simulations. In R, we generate normally distributed random numbers with rnorm(). Here is a vector containing 10 random numbers generated by a normal model with mean \(\mu = 500\) and standard deviation \(\sigma = 100\).

rnorm(10, mean = 500, sd = 100)

##  [1] 575.2168 568.4259 531.1997 532.1635 542.3912 386.4918 516.6964 472.3145
##  [9] 381.9218 537.7838

Because these numbers are random, your numbers are very likely to be different from mine. You will get yet another set of numbers when you run the same command a second time.

rnorm(10, mean = 500, sd = 100)

##  [1] 630.2161 603.6085 341.4453 489.1401 390.6306 350.1065 516.8770 683.0398
##  [9] 515.0545 624.7085
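If we do want the same random numbers on every run (e.g. so that other people can reproduce our results), we can fix the state of R's random number generator with set.seed() before calling rnorm(). The seed value 26 in this sketch is arbitrary. Any other integer would work equally well.

```r
set.seed(26)  # arbitrary seed, chosen only for illustration
first_draw <- rnorm(10, mean = 500, sd = 100)

set.seed(26)  # resetting the same seed restarts the same sequence
second_draw <- rnorm(10, mean = 500, sd = 100)

# Both vectors contain exactly the same numbers.
identical(first_draw, second_draw)
```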

The mean of such a small sample isn’t exactly 500, and its standard deviation isn’t exactly 100. Let’s draw another 10 numbers, store them as r_10 and check.

r_10 <- rnorm(10, mean = 500, sd = 100)
mean(r_10)

## [1] 494.0917

sd(r_10)

## [1] 90.43627

However, if we generate a much larger sample of random numbers (e.g. 10,000 instead of 10), the mean and standard deviation of the sample tend to come closer to the specified parameters 500 and 100.

r_10000 <- rnorm(10000, mean = 500, sd = 100)
mean(r_10000)

## [1] 500.3215

sd(r_10000)

## [1] 100.1846
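Large samples like this let us confirm mathematical results with simulations. As a sketch, the code below compares the fraction of simulated scores below 600 with the exact value from pnorm(). The sample size of 100,000 and the seed are arbitrary choices for this example.

```r
set.seed(1)  # arbitrary seed, only so that the run is repeatable
scores <- rnorm(100000, mean = 500, sd = 100)

# Fraction of simulated scores below 600.
simulated <- mean(scores < 600)

# Exact fraction according to the normal model.
exact <- pnorm(600, mean = 500, sd = 100)

c(simulated = simulated, exact = exact)
```

The simulated fraction won't match the exact value perfectly, but with a sample this large the two should differ only slightly.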

So far, we’ve always specified the arguments mean and sd when calling dnorm(), pnorm(), qnorm() or rnorm(). Because both arguments have default values, it’s possible to leave them out.

If the argument mean is missing, R assumes that the mean equals 0.
If the argument sd is missing, R assumes that the standard deviation equals 1.

For example, pnorm(-1.5) returns the same value as pnorm(-1.5, mean = 0, sd = 1).

pnorm(-1.5)

## [1] 0.0668072

pnorm(-1.5, mean = 0, sd = 1)

## [1] 0.0668072

The normal model with mean 0 and standard deviation 1 is called the “standard normal model”. It’s an important special case because the standard normal model describes the distribution of z-scores for any normally distributed data, irrespective of the mean and standard deviation of the data before taking the z-scores.
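Here is a short sketch of this idea with the SAT example. We convert the score of 600 to a z-score by subtracting the mean and dividing by the standard deviation, and then use pnorm() without the arguments mean and sd. The last lines go in the opposite direction with qnorm(). The variable names are chosen only for this example.

```r
# z-score of a score of 600 in a model with mean 500 and standard deviation 100.
z <- (600 - 500) / 100

# Both calls return the same area under their respective bell curves.
pnorm(z)
pnorm(600, mean = 500, sd = 100)

# Conversely, convert a z-score quantile back to the SAT scale.
cutoff_from_z <- 500 + 100 * qnorm(0.9)
all.equal(cutoff_from_z, qnorm(0.9, mean = 500, sd = 100))
```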

Let’s summarise what we learned in this tutorial.

The R function dnorm() implements the mathematical bell curve function that characterises the probability density of a normal model. For a given x-value, dnorm() returns the y-value on the bell curve.
We can plot mathematical functions (e.g. the bell curve of a normal model) with the R function curve().
We compute the area under the bell curve to the left of a given x-value with pnorm().
Conversely, if we’re given the area under the bell curve, we can compute the corresponding x-value with qnorm().
We generate normally distributed random numbers with rnorm().
If we leave out the arguments mean and sd, R assumes that we want to find values for the standard normal model.

As an exercise, confirm the results of the textbook’s worked example on pages 142-143 with R.

Next time, we learn how to assess whether a normal model is appropriate for a given data set.

See you soon.