Chapter 18 Missing Values (NA)
Welcome back to Quantitative Reasoning!
When we worked with the Titanic data in previous tutorials, we noticed that
many cells contain a special value NA.
titanic <- read.csv("~/QR/titanic/titanic.csv")
NA symbolises a missing value and stands for “Not Available”.
For example, the ticket price in the second row of the data frame titanic
is NA because this row contains information about a crew member who did not
have to buy a ticket to be on board.
Missing values in data sets can also arise for many other reasons.
In an experimental study, the measurement equipment may fail.
In a survey, a participant may refuse to give a response.
And, on a professor’s score sheet, an assignment may be marked as missing
because the dog ate the student’s homework.
Data sets with missing values must be handled carefully.
A single NA in a vector causes almost all calculations involving this vector
to return NA.
Let’s consider a simple example.
v <- c(1, 2, 3, NA, 4, 5)
If we apply the mean() function to v, the return value is NA.
mean(v)
## [1] NA
On one hand, this result makes sense: we can’t determine the mean if
one of the values is missing.
On the other hand, it’s also sensible to compute the mean for the remaining
(i.e. non-NA) values.
We can compute the mean of the non-NA values by adding the argument
na.rm = TRUE.
mean(v, na.rm = TRUE)
## [1] 3
In tutorial 13, we saw many more R functions for summary statistics besides
mean().
Most of them accept the optional argument na.rm = TRUE.
Before relying on it, it’s worth checking a function’s R documentation
to confirm that the function accepts this argument.
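As a small sketch of this point, the calls below apply na.rm = TRUE to a few base R summary functions. The choice of sum(), median() and max() is an assumption made for illustration; they may or may not be the functions covered in tutorial 13.

```r
# Same example vector as in this tutorial
v <- c(1, 2, 3, NA, 4, 5)

# Without na.rm, the NA propagates into the result
sum(v)
## [1] NA

# With na.rm = TRUE, the NA is dropped before the calculation
sum(v, na.rm = TRUE)
## [1] 15
median(v, na.rm = TRUE)
## [1] 3
max(v, na.rm = TRUE)
## [1] 5

# length() has no na.rm argument; it counts the NA as an element
length(v)
## [1] 6
```

The last call shows why checking the documentation matters: not every function that summarises a vector has an na.rm argument.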
Summary functions aren’t the only R operations that return NA.
Comparison operators also produce NA if one of the operands is NA.
For example, v == 3 produces an NA in the fourth element.
v == 3
## [1] FALSE FALSE TRUE NA FALSE FALSE
How can we find out whether an element in v is NA?
At first glance, we might expect that v == NA would be able to tell us that
the fourth element is NA.
Alas, v == NA doesn’t return TRUE in the fourth element, but NA.
In fact, all elements of v == NA are NA.
v == NA
## [1] NA NA NA NA NA NA
On second thought, this output makes sense.
NA is a vector of length 1, so R applies vectorisation in the same way that
we encountered in tutorial 07.
That is, the comparison v == NA is carried out six times, namely once for
each element in v.
Each of these comparisons returns NA because operations with NA generally
return NA, so v == NA won’t return TRUE even if an element in v is NA.
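The word “generally” can be illustrated with a few one-line experiments. This is only a sketch: the logical operators & and | are used here as an assumption that they are already familiar, and they show the rare cases where R can determine the result no matter what the missing value is.

```r
# Arithmetic and comparisons with NA return NA
NA + 1
## [1] NA
NA > 3
## [1] NA
NA == NA
## [1] NA

# Exceptions: the result is known whatever the missing value would be
NA & FALSE
## [1] FALSE
NA | TRUE
## [1] TRUE
```

Even NA == NA is NA: two unknown values might or might not be equal, so R declines to answer TRUE.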
Instead we need the special function is.na().
It returns TRUE if an element is NA.
Otherwise is.na() returns FALSE.
is.na(v)
## [1] FALSE FALSE FALSE TRUE FALSE FALSE
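Because is.na() returns an ordinary logical vector, it can be combined with other functions. The sketch below assumes that indexing a vector with a logical vector is already familiar; it counts the missing values and reproduces the result of na.rm = TRUE by hand.

```r
v <- c(1, 2, 3, NA, 4, 5)

# TRUE counts as 1 and FALSE as 0, so sum() counts the missing values
sum(is.na(v))
## [1] 1

# The ! operator negates is.na(), flagging the non-missing elements
!is.na(v)
## [1]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE

# Keeping only the non-missing elements gives the same mean as na.rm = TRUE
mean(v[!is.na(v)])
## [1] 3
```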
Sometimes we’d like to know whether a vector contains a missing value.
That is, does is.na() return any element equal to TRUE?
We can find the answer with the function any(), which returns TRUE
if at least one of the elements in the argument is TRUE.
Otherwise any() returns FALSE.
Applied to is.na(v), we obtain the expected result.
any(is.na(v))
## [1] TRUE
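As an aside that is not part of the original tutorial, base R also provides the function anyNA(), which gives the same answer as any(is.na()) in a single call. A minimal sketch:

```r
v <- c(1, 2, 3, NA, 4, 5)

# anyNA(v) answers the same question as any(is.na(v))
anyNA(v)
## [1] TRUE

# A vector without missing values gives FALSE
anyNA(c(1, 2, 3))
## [1] FALSE
```

Both spellings are fine; any(is.na(v)) makes the two steps explicit, which is why this tutorial uses it.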
Let’s consider a slightly less obvious case. Was there anybody on the Titanic whose age is unknown according to our data?
any(is.na(titanic$age))
## [1] TRUE
The answer is yes. Who were these travellers with unknown age? Trying to find them by scrolling through the spreadsheet is like trying to find a needle in a haystack. In our next tutorial, we learn how R can tell us which rows in the spreadsheet contain travellers with unknown age.
Let’s summarize the main points of this tutorial.
- NA is R’s special value for missing data.
- If an operation involves NA as an argument, its result is almost always NA.
- Many summary statistics functions have an argument na.rm = TRUE to remove NAs before calculating a value.
- is.na() returns TRUE if an element is NA. Otherwise is.na() returns FALSE.
- The function any() returns TRUE if any element in the argument is TRUE. Otherwise any() returns FALSE.
Next time we learn how to find indices of TRUE values in a logical vector.
See you soon.