Chapter 6 Extracting Values from Data Frames

Welcome back to Quantitative Reasoning! In the previous tutorial, we learned how to find out about the format of a data frame. For example, how many rows and columns are there? What are the column names?

In this video, we learn how we can access individual elements in a data frame and build subsets consisting of selected rows or columns.

Do you remember how we accessed elements of a vector? We used square brackets! For example, scores[3] returns the third element of the vector scores.

scores <- c(5, 6, 1, 6, 3)
scores[3]

## [1] 1

For data frames, it’s similar. The only difference is that for vectors it was enough to put a single number inside the square brackets. But data frames have rows and columns, so we must identify elements with two numbers. Let’s look at an example.

If you don’t have our titanic project open right now, please reopen it and click on titanic.R to open the script in the editor pane. Next we run the read.csv() command to bring the titanic data frame back into our environment.

titanic <- read.csv("~/QR/titanic/titanic.csv")

Let’s open the data frame as a spreadsheet by clicking on the variable name in the Environment tab. Suppose we’re interested in the element in the 2nd row and 4th column. We access this element with titanic[2, 4]. The first number in the square brackets stands for the row, the second number for the column.

titanic[2, 4]

## [1] 21

In the console, we see that the element is 21. Indeed, this value matches the number in the spreadsheet. We can get a sequence of consecutive rows with the colon operator. For example, to get the 2nd to the 5th row of the 4th column, we type

titanic[2:5, 4]

## [1] 21 13 16 39

In the console, the numbers aren’t shown as a column, but as a vector (i.e. from left to right instead of from top to bottom). Apart from this difference, the numbers are the same.

We can also use the colon operator after the comma (e.g. to obtain the third and fourth columns).

titanic[2:5, 3:4]

##   gender age
## 2   Male  21
## 3   Male  13
## 4   Male  16
## 5 Female  39
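
If you'd like to try this row-and-column indexing without the titanic file, here is a minimal sketch on a made-up data frame. The name toy and its contents are invented for illustration only, and data.frame() is used here simply to build a small data frame by hand.

```r
# Hypothetical toy data frame (NOT the titanic data), for illustration only
toy <- data.frame(
  name  = c("Ann", "Ben", "Cat", "Dan"),
  age   = c(31, 25, 40, 18),
  class = c("1st", "3rd", "2nd", "3rd")
)

# Single element: row 2, column 2
one_value <- toy[2, 2]
print(one_value)    # 25

# Rows 2 to 4 of column 2: the result is a plain vector
some_ages <- toy[2:4, 2]
print(some_ages)    # 25 40 18

# Rows 2 to 4 of columns 2 to 3: the result is still a data frame
block <- toy[2:4, 2:3]
print(block)
```

The first entry in the square brackets always refers to rows and the second to columns, no matter whether we type a single number or a range.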

If we want all columns, we could type 1:11 inside the square brackets: titanic[2:5, 1:11]. But there’s a shortcut. If we want all columns, we just leave the second argument blank.

titanic[2:5, ]

##   fam_name        given_name gender age class survived ticket pax_on_tckt pnd
## 2   ABBOTT       Ernest Owen   Male  21  Crew    FALSE   <NA>          NA  NA
## 3   ABBOTT     Eugene Joseph   Male  13   3rd    FALSE CA2673           3  20
## 4   ABBOTT   Rossmore Edward   Male  16   3rd    FALSE CA2673           3  20
## 5   ABBOTT Rhoda Mary 'Rosa' Female  39   3rd     TRUE CA2673           3  20
##   shl pnc
## 2  NA  NA
## 3   5   0
## 4   5   0
## 5   5   0

What can we do if we don’t want every row between rows 2 and 5? For example, if we want only the 2nd and 4th? Here the c() function from tutorial 02 makes a comeback. Instead of 2:5, we write c() and in parentheses only those row numbers that we want to keep.

titanic[c(2, 4), ]

##   fam_name      given_name gender age class survived ticket pax_on_tckt pnd shl
## 2   ABBOTT     Ernest Owen   Male  21  Crew    FALSE   <NA>          NA  NA  NA
## 4   ABBOTT Rossmore Edward   Male  16   3rd    FALSE CA2673           3  20   5
##   pnc
## 2  NA
## 4   0
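
The two shortcuts we've just seen (a blank entry for "everything" and c() for hand-picked rows) can be sketched on a small made-up data frame. As before, toy and its values are assumptions for illustration, not part of the titanic data.

```r
# Hypothetical toy data frame (NOT the titanic data), for illustration only
toy <- data.frame(
  name  = c("Ann", "Ben", "Cat", "Dan"),
  age   = c(31, 25, 40, 18),
  class = c("1st", "3rd", "2nd", "3rd")
)

# Leaving the column entry blank keeps every column
rows_2_to_3 <- toy[2:3, ]
print(dim(rows_2_to_3))    # 2 3

# c() picks rows that are not next to each other
picked <- toy[c(2, 4), ]
print(picked)
```

Note that the comma stays in place even when the column entry is empty.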

OK, now over to you. What do you think happens when we run the following?

titanic[, c(1, 3)]

There are many lines of output. Let me scroll up to see the beginning. We left the first argument in the square brackets blank, so we take all rows. But we included only the first and third columns. Because there are too many rows to display, R doesn’t print all of them in the console. But, although hidden, the remaining rows are still part of the output. This fact is what R is telling us by saying that it “omitted 1708 rows”.

Instead of addressing the first and third column by their index numbers 1 and 3, it’s also possible to address them by their names (i.e. "fam_name" and "gender").

titanic[, c("fam_name", "gender")]
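
To convince ourselves that index numbers and column names select the same thing, here is a small sketch. The data frame toy and its column names are made up for this example.

```r
# Hypothetical toy data frame (NOT the titanic data), for illustration only
toy <- data.frame(
  name  = c("Ann", "Ben", "Cat", "Dan"),
  age   = c(31, 25, 40, 18),
  class = c("1st", "3rd", "2nd", "3rd")
)

# Select columns 1 and 3 by position ...
by_number <- toy[, c(1, 3)]

# ... and the same two columns by name
by_name <- toy[, c("name", "class")]

print(names(by_number))    # "name" "class"

# identical() checks whether the two results are exactly the same object
print(identical(by_number, by_name))
```

Names are usually the safer choice, because they keep working even if the order of the columns changes.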

If we want to extract a single column of a data frame, say the column age, we don’t need the c() inside the square brackets.

titanic[, "age"]

If all we want is a single column, there is an even shorter alternative: the dollar operator.

titanic$age

When we extract a single column, the result isn’t a data frame, but a vector. Consequently, we can apply the same subsetting operations that we learned in tutorial 02. For example,

titanic$age[2]

## [1] 21

returns the 2nd element of titanic$age. Here we don’t need a comma inside the square brackets because titanic$age is a simple vector, not a data frame.

If we want the 2nd to the 5th element of titanic$age, we can use the colon operator.

titanic$age[2:5]

## [1] 21 13 16 39

Do you remember which of our earlier commands returned the same numbers? It was titanic[2:5, 4].

titanic[2:5, 4]

## [1] 21 13 16 39
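
Here is a short sketch of the same comparison on a made-up data frame, so you can check for yourself that the dollar notation and the square brackets agree. The name toy and its contents are assumptions for illustration.

```r
# Hypothetical toy data frame (NOT the titanic data), for illustration only
toy <- data.frame(
  name  = c("Ann", "Ben", "Cat", "Dan"),
  age   = c(31, 25, 40, 18),
  class = c("1st", "3rd", "2nd", "3rd")
)

# Two ways to extract the single column age
ages_dollar  <- toy$age
ages_bracket <- toy[, "age"]

print(is.data.frame(ages_dollar))              # FALSE: it is a vector
print(identical(ages_dollar, ages_bracket))    # TRUE

# A vector needs only one index, so there is no comma here
print(toy$age[2:4])    # 25 40 18

# The same numbers with two indices on the data frame
print(toy[2:4, 2])     # 25 40 18
```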

You may be wondering why we even need the dollar operator if we can accomplish the same without it. The answer is that more complex subsetting operations are often easier with the dollar notation. Suppose we want only those rows from the data frame that correspond to 40-year-old travellers. We could of course scroll through the spreadsheet and note down the row numbers. For example, there’s a 40-year-old traveller in row 20, and again in rows 90 and 107. But searching through the spreadsheet is tedious. The purpose of software such as R is to automate such tasks. Here is how to do it with the dollar notation.

titanic[titanic$age == 40, ]

Don’t worry if you don’t immediately understand why this command did the trick. It’ll become clearer in tutorial 08. For now, let’s view the syntax as a recipe that we can adjust when we face similar problems during our next in-class activity. Just bear in mind that, in order to test for equality, we must type two equals signs.
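
To practise adjusting the recipe, here is a minimal sketch on a made-up data frame. The name toy and its values are invented, and we assume that the age column has no missing values.

```r
# Hypothetical toy data frame (NOT the titanic data), for illustration only.
# Assumption: no missing values in the age column.
toy <- data.frame(
  name = c("Ann", "Ben", "Cat", "Dan", "Eve"),
  age  = c(31, 40, 25, 40, 18)
)

# The recipe: data_frame[data_frame$column == value, ]
forty <- toy[toy$age == 40, ]
print(forty)
print(nrow(forty))    # 2

# Same recipe, different value
eighteen <- toy[toy$age == 18, ]
print(eighteen)
```

Only the column name after the dollar sign and the value after the two equals signs change from one use of the recipe to the next.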

A common mistake is to forget the comma before the closing square bracket. The comma is essential: without it, R interprets the expression inside the brackets as a selection of columns instead of rows, and we get an error.

titanic[titanic$age == 40]

## Error in `[.data.frame`(titanic, titanic$age == 40): undefined columns selected
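
If you want to see the difference side by side without stopping your script, here is a sketch on a made-up data frame. The name toy is invented, and tryCatch() is a helper we haven't covered in these tutorials. It is used here only to capture the error message instead of halting.

```r
# Hypothetical toy data frame (NOT the titanic data), for illustration only
toy <- data.frame(
  name  = c("Ann", "Ben", "Cat", "Dan"),
  age   = c(31, 25, 18, 40),
  class = c("1st", "3rd", "2nd", "3rd")
)

# With the comma: the rows where age equals 40
with_comma <- toy[toy$age == 40, ]
print(with_comma)

# Without the comma: R tries to select columns and fails.
# tryCatch() returns the error message as a character string.
msg <- tryCatch(
  {
    toy[toy$age == 40]
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)
```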

Following this pattern, do you see how we can create a subset of the titanic data frame that only contains second-class passengers? Let’s remind ourselves that this information is in the column called class. The character string to indicate a second-class passenger is “2nd”. We follow the same recipe as before, but we change “age” to “class” and “40” to “2nd”.

titanic[titanic$class == "2nd", ]

Note that we must surround character strings such as “2nd” with quotes, as we learned in tutorial 02.

Even restricted to second-class passengers, there are still many rows of output. So let’s save the output in a variable called sec_class. Variable names mustn’t start with a number and they mustn’t contain spaces, but we can, for example, use an underscore.

sec_class <- titanic[titanic$class == "2nd", ]

Because sec_class is a data frame, we can apply all functions we learned in the previous tutorial. For example, we can find out with nrow() how many passengers were in the second class. While we’re typing, RStudio tries to autocomplete the variable name. Autocompletion can be very convenient because it reduces the probability of making a typo. If RStudio proposes the correct name (and here it does), then all we need to do is hit either tab or return.

nrow(sec_class)

## [1] 271

So there were 271 second-class passengers according to our data.
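
The same two steps (save the subset in a variable, then count its rows) can be sketched on a made-up data frame. The names toy and sec_toy and all values are invented for illustration.

```r
# Hypothetical toy data frame (NOT the titanic data), for illustration only
toy <- data.frame(
  name  = c("Ann", "Ben", "Cat", "Dan", "Eve"),
  class = c("1st", "2nd", "3rd", "2nd", "Crew")
)

# Step 1: save the second-class rows in a new variable
sec_toy <- toy[toy$class == "2nd", ]

# Step 2: the subset is itself a data frame, so nrow() works on it
print(nrow(sec_toy))    # 2
```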

We’ll keep working with the variable sec_class in the next tutorial, but for now let’s return once more briefly to the original titanic data frame.

A common question we have about a column in a data frame is how many different values it contains and what these values are. For example, how many different classes of passengers and crew are in the class column? We can find out with the function unique(). With unique(titanic$class) we can see that there are four classes: "3rd", "Crew", "2nd", "1st". The result is in the order of first appearance in the column.

unique(titanic$class)

## [1] "3rd"  "Crew" "2nd"  "1st"
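
To see the "order of first appearance" rule at work, here is a small sketch on a made-up vector. The name classes and its values are invented for this example.

```r
# Hypothetical vector with repeated values, for illustration only
classes <- c("3rd", "Crew", "3rd", "2nd", "Crew", "1st")

# unique() drops the repeats and keeps the order of first appearance
distinct <- unique(classes)
print(distinct)            # "3rd" "Crew" "2nd" "1st"

# length() then tells us how many different values there are
print(length(distinct))    # 4
```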

Here is a summary of the main points of this tutorial.

We learned how we can create subsets of a data frame with square brackets.
We separate the row indices from the column indices with a comma.
If we want all rows or all columns, we can leave the corresponding entry in the square brackets blank. But we mustn’t forget the comma.
If we want a single column of a data frame, we can alternatively use the dollar notation. After the dollar, we type the name of the column we want to keep. We can use either column numbers or column names when working with square brackets, but we can only use column names when using the $-notation.
We can obtain the unique values in a vector with the function unique().

Besides extracting individual values or subsets from a data frame, there are many situations where we’d like to add a new column to a data frame. Next time we’ll learn how to accomplish that task.

See you soon.