Chapter 18 Missing Values (NA)

Welcome back to Quantitative Reasoning! When we worked with the Titanic data in previous tutorials, we noticed that many cells contain a special value NA.

titanic <- read.csv("~/QR/titanic/titanic.csv")

NA symbolises a missing value and stands for “Not Available”. For example, the ticket price in the second row of the data frame titanic is NA because this row contains information about a crew member who did not have to buy a ticket to be on board. Missing values in data sets can also arise for many other reasons. In an experimental study, the measurement equipment may fail. In a survey, a participant may refuse to give a response. And, on a professor’s score sheet, an assignment may be marked as missing because the dog ate the student’s homework.

Data sets with missing values must be handled carefully. A single NA in a vector causes almost all calculations involving this vector to return NA. Let’s consider a simple example.

v <- c(1, 2, 3, NA, 4, 5)

If we apply the mean() function to v, the return value is NA.

mean(v)
## [1] NA

On the one hand, this result makes sense: we can’t determine the mean if one of the values is missing. On the other hand, it’s also sensible to compute the mean of the remaining (i.e. non-NA) values. We can do so by adding the argument na.rm = TRUE, which tells mean() to remove the NAs before it calculates the result.

mean(v, na.rm = TRUE)
## [1] 3

In tutorial 13, we saw many more R functions for summary statistics besides mean(). Most of them accept the optional argument na.rm = TRUE. It’s worth checking a function’s R documentation to see whether it accepts this argument.
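As a brief sketch of this point, the lines below apply na.rm = TRUE to three other base R summary functions. The choice of sum(), median() and max() is an assumption made for illustration; they may or may not be the ones covered in tutorial 13.

```r
v <- c(1, 2, 3, NA, 4, 5)

# Without na.rm = TRUE, the single NA makes the result NA.
sum(v)
## [1] NA

# With na.rm = TRUE, the NA is removed before the calculation.
sum(v, na.rm = TRUE)
## [1] 15
median(v, na.rm = TRUE)
## [1] 3
max(v, na.rm = TRUE)
## [1] 5
```

The help page of each function, for example ?median, lists na.rm among its arguments if the function supports it.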

Summary functions aren’t the only R operations that return NA. Comparison operators also produce NA if one of the operands is NA. For example, v == 3 produces an NA in the fourth element.

v == 3
## [1] FALSE FALSE  TRUE    NA FALSE FALSE
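Arithmetic behaves in the same way as comparison. The sketch below also shows why we said “almost all” calculations return NA: a few logical operations have a result that doesn’t depend on the missing value. It assumes that the operators & (and) and | (or) are already familiar.

```r
v <- c(1, 2, 3, NA, 4, 5)

# Arithmetic passes the NA on to the matching position in the result.
v + 1
## [1]  2  3  4 NA  5  6

# "Something and FALSE" is FALSE, whatever the missing value is.
NA & FALSE
## [1] FALSE

# "Something or TRUE" is TRUE, whatever the missing value is.
NA | TRUE
## [1] TRUE

# Here the answer does depend on the missing value, so the result is NA.
NA & TRUE
## [1] NA
```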

How can we find out whether an element in v is NA? At first glance, we might expect that v == NA would be able to tell us that the fourth element is NA. Alas, v == NA doesn’t return TRUE in the fourth element, but NA. In fact, all elements of v == NA are NA.

v == NA
## [1] NA NA NA NA NA NA

On second thought, this output makes sense. NA is a vector of length 1, so R applies vectorisation in the same way that we encountered in tutorial 07. That is, the comparison with NA is carried out six times, namely once for each element in v. Each of these comparisons returns NA because operations with NA generally return NA. As a consequence, v == NA can never return TRUE, not even in the position where v is NA. Instead, we need the special function is.na(). It returns TRUE if an element is NA. Otherwise, is.na() returns FALSE.

is.na(v)
## [1] FALSE FALSE FALSE  TRUE FALSE FALSE
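The short sketch below adds three related checks. It relies on the fact that R treats TRUE as 1 and FALSE as 0 in arithmetic, and on the negation operator !, which is assumed to be familiar already.

```r
v <- c(1, 2, 3, NA, 4, 5)

# Even comparing NA with itself returns NA rather than TRUE.
NA == NA
## [1] NA

# TRUE counts as 1 and FALSE as 0, so sum() counts the missing values.
sum(is.na(v))
## [1] 1

# The operator ! flips the result and marks the values that are present.
!is.na(v)
## [1]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE
```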

Sometimes we’d like to know whether a vector contains a missing value. That is, does is.na() return any element equal to TRUE? We can find the answer with the function any(), which returns TRUE if at least one of the elements in the argument is TRUE. Otherwise any() returns FALSE. Applied to is.na(v), we obtain the expected result.

any(is.na(v))
## [1] TRUE
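Two details are worth a quick sketch here. Base R offers anyNA() as a shortcut for any(is.na()), and any() itself can return NA when its argument contains a missing value but no TRUE. Like the summary functions above, any() accepts na.rm = TRUE.

```r
v <- c(1, 2, 3, NA, 4, 5)

# anyNA(v) gives the same answer as any(is.na(v)).
anyNA(v)
## [1] TRUE

# One comparison is TRUE, so the answer is TRUE despite the NA.
any(v == 3)
## [1] TRUE

# No comparison is TRUE, but the NA might hide a 6, so the answer is NA.
any(v == 6)
## [1] NA

# After removing the NA, no TRUE remains.
any(v == 6, na.rm = TRUE)
## [1] FALSE
```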

Let’s consider a slightly less obvious case. Was there anybody on the Titanic whose age is unknown according to our data?

any(is.na(titanic$age))
## [1] TRUE

The answer is yes. Who were these travellers with unknown age? Trying to find them by scrolling through the spreadsheet is like trying to find a needle in a haystack. In our next tutorial, we will learn how R can tell us which rows in the spreadsheet contain travellers with unknown age.
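Until then, we can at least count the missing values. The sketch below uses a small made-up data frame called toy instead of the real Titanic data, so its column names and values are purely hypothetical. It combines is.na() with sum() for one column and with colSums() for all columns at once.

```r
# A made-up data frame for illustration only (not the Titanic data).
toy <- data.frame(
  name = c("A", "B", "C"),
  age  = c(29, NA, 41),
  fare = c(NA, NA, 7.25)
)

# Number of missing values in a single column.
sum(is.na(toy$age))
## [1] 1

# Number of missing values in every column.
colSums(is.na(toy))
## name  age fare 
##    0    1    2 
```

Applied to the real data, sum(is.na(titanic$age)) would return the number of travellers whose age is unknown.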

Let’s summarise the main points of this tutorial.

  • NA is R’s special value for missing data.
  • If an operation involves NA as an argument, its result is almost always NA.
  • Many summary statistics functions accept the argument na.rm = TRUE, which removes NAs before the value is calculated.
  • is.na() returns TRUE if an element is NA. Otherwise is.na() returns FALSE.
  • The function any() returns TRUE if any element in the argument is TRUE. Otherwise any() returns FALSE.

Next time, we will learn how to find the indices of TRUE values in a logical vector.

See you soon.