Chapter 25 z-scores with R

Welcome back to Quantitative Reasoning! In this tutorial, we calculate z-scores with R. Our textbook uses heptathlon scores in the 2012 Olympics as a motivating example to introduce z-scores. You can find the data at the URL linked below this video (https://michaelgastner.com/data_for_QR/hept2012.csv). Let’s import the data and shorten the name of the data frame to hept.

hept <- read.csv("hept2012.csv")
head(hept)
##                      athlete run200   lj
## 1              Jessica Ennis  22.83 6.48
## 2          Lilli Schwarzkopf  24.77 6.30
## 3             Austra Skujyte  25.43 6.25
## 4 Antoinette Nana Djimou Ida  24.72 6.13
## 5            Jessica Zelinka  23.32 5.91
## 6        Kristina Savitskaya  24.46 6.21

The data frame has three columns. The first column contains the names of the athletes. The second column shows their performance in the 200-metre run in seconds. The third column stores the long-jump results in metres.

Our textbook compares the results of two athletes: Jessica Ennis and Tatyana Chernova. By the way, Chernova was later disqualified because of doping, so her results shouldn’t be viewed as signs of athletic achievement, but let’s suppose we could treat the numbers at face value.

hept[hept$athlete == "Jessica Ennis" | hept$athlete == "Tatyana Chernova", ]
##             athlete run200   lj
## 1     Jessica Ennis  22.83 6.48
## 38 Tatyana Chernova  23.67 6.54

In this parallel universe, we would conclude that Ennis is the better runner, and Chernova is the better long-jumper. Which performance is more remarkable?

The z-score is a way to standardize the performance. It’s defined as \[ z = \frac{y - \bar{y}}{s}\ , \] where \(y\) is a data value, \(\bar{y}\) is the mean of all data values, and \(s\) is their standard deviation. I’ll keep this definition visible in the bottom right. The translation of this equation into R is straightforward. Let’s add a column z_run200 with the z-scores of the 200-metre run.

hept$z_run200 <- (hept$run200 - mean(hept$run200)) / sd(hept$run200)

When we look at the spreadsheet, we notice that all z-scores are NA.

head(hept)
##                      athlete run200   lj z_run200
## 1              Jessica Ennis  22.83 6.48       NA
## 2          Lilli Schwarzkopf  24.77 6.30       NA
## 3             Austra Skujyte  25.43 6.25       NA
## 4 Antoinette Nana Djimou Ida  24.72 6.13       NA
## 5            Jessica Zelinka  23.32 5.91       NA
## 6        Kristina Savitskaya  24.46 6.21       NA

The problem is that some athletes either didn’t start or didn’t finish the 200-metre run. Their performance appears as NA. We can confirm the existence of NAs with any(is.na(hept$run200)).

any(is.na(hept$run200))
## [1] TRUE

On one hand, R is correct to point out that we can’t determine a z-score if we don’t know all values. On the other hand, in this application, it’s sensible to exclude missing values from the analysis and calculate the z-score for all athletes that started and finished the race. From tutorial 18, we know that we can remove missing values with na.rm = TRUE.

hept$z_run200 <-
  (hept$run200 - mean(hept$run200, na.rm = TRUE)) /
  sd(hept$run200, na.rm = TRUE)

We can calculate the z-score for the long jump similarly.

hept$z_lj <-
  (hept$lj - mean(hept$lj, na.rm = TRUE)) /
  sd(hept$lj, na.rm = TRUE)

The commands for calculating z-scores is quite long, so we might wonder whether R has a built-in function for z-scores. Unfortunately, the answer is no. R’s base installation contains a function called scale(), which returns the z-scores, but not as a vector, so it isn’t directly useful for us. In later tutorials, we’ll learn how to write our own functions, so we’ll be able to write a z-score function ourselves. Until we reach that level, I recommend to use the explicit method shown here.

Let’s take a look at the z-scores for Ennis and Chernova.

hept[hept$athlete == "Jessica Ennis" | hept$athlete == "Tatyana Chernova", ]
##             athlete run200   lj  z_run200     z_lj
## 1     Jessica Ennis  22.83 6.48 -2.067166 1.005307
## 38 Tatyana Chernova  23.67 6.54 -1.017618 1.111769

We conclude that Ennis ran the 200 metres 2.1 standard deviations faster than the mean. Chernova jumped 1.1 standard deviations better than the mean. These numbers are consistent with the results stated in our textbook.

In summary, we compute the z-score of a numeric vector v with (v - mean(v)) / sd(v). If v contains missing values and if it’s appropriate to remove them from the analysis, we include the argument na.rm = TRUE in the mean() and sd() functions.

Next time we learn how to work with the so-called normal model in R.

See you soon.