Chapter 25 z-scores with R
Welcome back to Quantitative Reasoning!
In this tutorial, we calculate z-scores with R.
Our textbook uses heptathlon scores in the 2012 Olympics as a motivating
example to introduce z-scores.
You can find the data at the URL linked below this video
(https://michaelgastner.com/data_for_QR/hept2012.csv).
Let’s import the data and shorten the name of the data frame to hept
.
read.csv("hept2012.csv")
hept <-head(hept)
## athlete run200 lj
## 1 Jessica Ennis 22.83 6.48
## 2 Lilli Schwarzkopf 24.77 6.30
## 3 Austra Skujyte 25.43 6.25
## 4 Antoinette Nana Djimou Ida 24.72 6.13
## 5 Jessica Zelinka 23.32 5.91
## 6 Kristina Savitskaya 24.46 6.21
The data frame has three columns. The first column contains the names of the athletes. The second column shows their performance in the 200-metre run in seconds. The third column stores the long-jump results in metres.
Our textbook compares the results of two athletes: Jessica Ennis and Tatyana Chernova. By the way, Chernova was later disqualified because of doping, so her results shouldn’t be viewed as signs of athletic achievement, but let’s suppose we could treat the numbers at face value.
$athlete == "Jessica Ennis" | hept$athlete == "Tatyana Chernova", ] hept[hept
## athlete run200 lj
## 1 Jessica Ennis 22.83 6.48
## 38 Tatyana Chernova 23.67 6.54
In this parallel universe, we would conclude that Ennis is the better runner, and Chernova is the better long-jumper. Which performance is more remarkable?
The z-score is a way to standardize the performance.
It’s defined as
\[
z = \frac{y - \bar{y}}{s}\ ,
\]
where \(y\) is a data value, \(\bar{y}\) is the mean of all data values, and \(s\)
is their standard deviation.
I’ll keep this definition visible in the bottom right.
The translation of this equation into R is straightforward.
Let’s add a column z_run200
with the z-scores of the 200-metre run.
$z_run200 <- (hept$run200 - mean(hept$run200)) / sd(hept$run200) hept
When we look at the spreadsheet, we notice that all z-scores are NA
.
head(hept)
## athlete run200 lj z_run200
## 1 Jessica Ennis 22.83 6.48 NA
## 2 Lilli Schwarzkopf 24.77 6.30 NA
## 3 Austra Skujyte 25.43 6.25 NA
## 4 Antoinette Nana Djimou Ida 24.72 6.13 NA
## 5 Jessica Zelinka 23.32 5.91 NA
## 6 Kristina Savitskaya 24.46 6.21 NA
The problem is that some athletes either didn’t start or didn’t finish the
200-metre run.
Their performance appears as NA
.
We can confirm the existence of NA
s with any(is.na(hept$run200))
.
any(is.na(hept$run200))
## [1] TRUE
On one hand, R is correct to point out that we can’t determine a z-score if we
don’t know all values.
On the other hand, in this application, it’s sensible to exclude missing values
from the analysis and calculate the z-score for all athletes that started and
finished the race.
From tutorial 18, we know that we can remove missing values with
na.rm = TRUE
.
$z_run200 <-
hept (hept$run200 - mean(hept$run200, na.rm = TRUE)) /
sd(hept$run200, na.rm = TRUE)
We can calculate the z-score for the long jump similarly.
$z_lj <-
hept (hept$lj - mean(hept$lj, na.rm = TRUE)) /
sd(hept$lj, na.rm = TRUE)
The commands for calculating z-scores is quite long, so we might wonder
whether R has a built-in function for z-scores.
Unfortunately, the answer is no.
R’s base installation contains a function called scale()
, which returns the
z-scores, but not as a vector, so it isn’t directly useful for us.
In later tutorials, we’ll learn how to write our own functions, so we’ll be
able to write a z-score function ourselves.
Until we reach that level, I recommend to use the explicit method shown here.
Let’s take a look at the z-scores for Ennis and Chernova.
$athlete == "Jessica Ennis" | hept$athlete == "Tatyana Chernova", ] hept[hept
## athlete run200 lj z_run200 z_lj
## 1 Jessica Ennis 22.83 6.48 -2.067166 1.005307
## 38 Tatyana Chernova 23.67 6.54 -1.017618 1.111769
We conclude that Ennis ran the 200 metres 2.1 standard deviations faster than the mean. Chernova jumped 1.1 standard deviations better than the mean. These numbers are consistent with the results stated in our textbook.
In summary, we compute the z-score of a numeric vector v
with
(v - mean(v)) / sd(v)
.
If v
contains missing values and if it’s appropriate to remove them from the
analysis, we include the argument na.rm = TRUE
in the mean()
and sd()
functions.
Next time we learn how to work with the so-called normal model in R.
See you soon.