Chapter 6 Extracting Values from Data Frames
Welcome back to Quantitative Reasoning! In the previous tutorial, we learned how to find out about the format of a data frame. For example, how many rows and columns are there? What are the column names?
In this video, we learn how we can access individual elements in a data frame and build subsets consisting of selected rows or columns.
Do you remember how we accessed elements of a vector?
We used square brackets!
For example, scores[3] returns the third element of the vector scores.
scores <- c(5, 6, 1, 6, 3)
scores[3]
## [1] 1
For data frames, it’s similar. The only difference is that for vectors it was enough to put a single number inside the square brackets. But data frames have rows and columns, so we must identify elements with two numbers. Let’s look at an example.
If you don’t have our titanic project open right now, please reopen it and click on titanic.R to open the script in the editor pane. Next we run the read.csv() command to bring the titanic data frame back into our environment.
titanic <- read.csv("~/QR/titanic/titanic.csv")
Let’s open the data frame as a spreadsheet by clicking on the variable name in
the Environment tab.
Suppose we’re interested in the element in the 2nd row and 4th column. We access this element with titanic[2, 4]. The first number in the square brackets stands for the row, the second number for the column.
titanic[2, 4]
## [1] 21
In the console, we see that the element is 21. Indeed, this value matches the number in the spreadsheet. We can get a sequence of consecutive rows with the colon operator. For example, to get the 2nd to the 5th row of the 4th column, we type
titanic[2:5, 4]
## [1] 21 13 16 39
In the console, the numbers aren’t shown as a column, but as a vector (i.e. from left to right instead of from top to bottom). Apart from this difference, the numbers are the same.
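If you'd like to try the row-and-column idea without loading the Titanic file, here is a minimal sketch on a small made-up data frame. The name toy and all of its values are our own invention for illustration; they are not part of the course data.

```r
# A small made-up data frame; the name `toy` and its values are hypothetical.
toy <- data.frame(
  name = c("Ann", "Ben", "Cal", "Dee", "Eve"),
  age  = c(34, 21, 13, 16, 39),
  paid = c(TRUE, FALSE, FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# One row index and one column index give a single value.
one_value <- toy[2, 2]
one_value

# Several rows from a single column come back as a plain vector,
# which is why the console prints them from left to right.
several <- toy[2:5, 2]
several
is.vector(several)
```

The first number always refers to rows and the second to columns, exactly as with titanic[2, 4].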
We can also use the colon operator after the comma (e.g. to obtain the third and fourth columns).
titanic[2:5, 3:4]
##   gender age
## 2   Male  21
## 3   Male  13
## 4   Male  16
## 5 Female  39
If we want all columns, we could type 1:11 inside the square brackets: titanic[2:5, 1:11]. But there’s a shortcut. If we want all columns, we just leave the second argument blank.
titanic[2:5, ]
##   fam_name        given_name gender age class survived ticket pax_on_tckt pnd
## 2   ABBOTT       Ernest Owen   Male  21  Crew    FALSE   <NA>          NA  NA
## 3   ABBOTT     Eugene Joseph   Male  13   3rd    FALSE CA2673           3  20
## 4   ABBOTT   Rossmore Edward   Male  16   3rd    FALSE CA2673           3  20
## 5   ABBOTT Rhoda Mary 'Rosa' Female  39   3rd     TRUE CA2673           3  20
##   shl pnc
## 2  NA  NA
## 3   5   0
## 4   5   0
## 5   5   0
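To convince ourselves that a blank column argument really means "all columns", we can compare the two spellings on a small made-up data frame. This is only a sketch: toy and its values are hypothetical, and all.equal() is a base R helper we haven't met in these tutorials; here it simply reports whether two objects match.

```r
# A small made-up data frame; the name `toy` and its values are hypothetical.
toy <- data.frame(
  name = c("Ann", "Ben", "Cal", "Dee", "Eve"),
  age  = c(34, 21, 13, 16, 39),
  paid = c(TRUE, FALSE, FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# Spell out every column number ...
all_cols_explicit <- toy[2:4, 1:ncol(toy)]

# ... or leave the column argument blank (the comma stays).
all_cols_blank <- toy[2:4, ]

# Check that both spellings give the same subset.
same_result <- isTRUE(all.equal(all_cols_explicit, all_cols_blank))
same_result
```

The blank version has the advantage that we don't need to know how many columns the data frame has.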
What can we do if we don’t want every row between rows 2 and 5? For example, if we want only the 2nd and 4th? Here the c() function from tutorial 02 makes a comeback. Instead of 2:5, we write c() and in parentheses only those row numbers that we want to keep.
titanic[c(2, 4), ]
##   fam_name      given_name gender age class survived ticket pax_on_tckt pnd shl
## 2   ABBOTT     Ernest Owen   Male  21  Crew    FALSE   <NA>          NA  NA  NA
## 4   ABBOTT Rossmore Edward   Male  16   3rd    FALSE CA2673           3  20   5
##   pnc
## 2  NA
## 4   0
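Here is the same pattern as a small sketch on a made-up data frame (toy and its values are hypothetical). It also shows a detail that is visible in the output above: the row labels of the subset are the original row numbers. The function rownames() is base R, although we haven't used it in these tutorials so far.

```r
# A small made-up data frame; the name `toy` and its values are hypothetical.
toy <- data.frame(
  name = c("Ann", "Ben", "Cal", "Dee", "Eve"),
  age  = c(34, 21, 13, 16, 39),
  stringsAsFactors = FALSE
)

# Keep only the 2nd and 4th rows, with all columns.
picked <- toy[c(2, 4), ]
picked

# The subset has two rows ...
nrow(picked)

# ... and they are still labelled with their original positions.
rownames(picked)
```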
OK, now over to you. What do you think happens when we run the following?
titanic[, c(1, 3)]
There are many lines of output. Let me scroll up to see the beginning. We left the first argument in the square brackets blank, so we take all rows. But we included only the first and third column. Because there are too many rows to display, R doesn’t print all of them in the console. But, although hidden, the remaining rows are still part of the output. This fact is what R is telling us by saying that it “omitted 1708 rows”.
Instead of addressing the first and third column by their index numbers 1 and 3, it’s also possible to address them by their names (i.e. "fam_name" and "gender").
titanic[, c("fam_name", "gender")]
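As a quick check that numbers and names select the same columns, here is a minimal sketch on a made-up data frame. The name toy, its columns and its values are hypothetical, and all.equal() is a base R helper that we use here only to compare the two results.

```r
# A small made-up data frame; the name `toy` and its values are hypothetical.
toy <- data.frame(
  name = c("Ann", "Ben", "Cal"),
  age  = c(34, 21, 13),
  paid = c(TRUE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# Select the first and third column by position ...
by_number <- toy[, c(1, 3)]

# ... or by name.
by_name <- toy[, c("name", "paid")]

# Both routes lead to the same two-column data frame.
same_columns <- isTRUE(all.equal(by_number, by_name))
same_columns
names(by_name)
```

Names tend to be the safer choice when the order of the columns might change later, because a name keeps pointing at the same column.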
If we want to extract a single column in a data frame, say the column age, we don’t need the c() inside the square brackets.
titanic[, "age"]
If all we want is a single column, there is an even shorter alternative: the dollar operator.
titanic$age
When we extract a single column, the result isn’t a data frame, but a vector. Consequently, we can apply the same subsetting operations that we learned in tutorial 02. For example,
titanic$age[2]
## [1] 21
returns the 2nd element of titanic$age. Here we don’t need a comma inside the square brackets because titanic$age is a simple vector, not a data frame.
If we want the 2nd to the 5th element of titanic$age, we can use the colon operator.
titanic$age[2:5]
## [1] 21 13 16 39
Do you remember which of our earlier commands returned the same numbers? It was titanic[2:5, 4].
titanic[2:5, 4]
## [1] 21 13 16 39
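The three ways of reaching the same numbers can be put side by side. The following is a small sketch on a made-up data frame (toy and its values are hypothetical), using identical() from base R to confirm that the results match exactly.

```r
# A small made-up data frame; the name `toy` and its values are hypothetical.
toy <- data.frame(
  name = c("Ann", "Ben", "Cal", "Dee", "Eve"),
  age  = c(34, 21, 13, 16, 39),
  stringsAsFactors = FALSE
)

# Dollar notation first, then vector subsetting (no comma needed).
via_dollar <- toy$age[2:5]

# Square brackets with the column name.
via_name <- toy[2:5, "age"]

# Square brackets with the column number.
via_number <- toy[2:5, 2]

# All three give the same vector.
identical(via_dollar, via_name)
identical(via_name, via_number)
```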
You may be wondering why we even need the dollar operator if we can accomplish the same without it. The answer is that more complex subsetting operations are often easier with the dollar notation. Suppose we want only those rows from the data frame that correspond to 40-year-old travellers. We could of course scroll through the spreadsheet and note down the row numbers. For example, there’s a 40-year-old traveller in row 20, and again in rows 90 and 107. But searching through the spreadsheet is tedious. The purpose of software such as R is to automate such tasks. Here is how to do it with the dollar notation.
titanic[titanic$age == 40, ]
Don’t worry if you don’t immediately understand why this command did the trick. It’ll become clearer in tutorial 08. For right now, let’s view the syntax as a recipe that we can adjust when we face similar problems during our next in-class activity. Just bear in mind that, in order to test for equality, we must type two equals signs.
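For the curious, here is a small preview of what happens inside the square brackets, sketched on a made-up data frame (toy, is_forty and all values are hypothetical names and numbers of our own). The comparison produces one TRUE or FALSE per row, and the square brackets keep the rows marked TRUE.

```r
# A small made-up data frame; the name `toy` and its values are hypothetical.
toy <- data.frame(
  name = c("Ann", "Ben", "Cal", "Dee", "Eve"),
  age  = c(34, 40, 13, 40, 39),
  stringsAsFactors = FALSE
)

# Step 1: the comparison gives one TRUE/FALSE per row.
is_forty <- toy$age == 40
is_forty

# Step 2: putting that vector before the comma keeps the TRUE rows.
forty <- toy[is_forty, ]
forty

# Writing both steps in one line is the recipe from above.
forty_one_line <- toy[toy$age == 40, ]
identical(forty, forty_one_line)
```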
A common mistake is to forget the comma before the closing square bracket. But the comma is important because we get an error without it.
titanic[titanic$age == 40]
## Error in `[.data.frame`(titanic, titanic$age == 40): undefined columns selected
Following this pattern, do you see how we can create a subset of the titanic data frame that only contains second-class passengers? Let’s remind ourselves that this information is in the column called class. The character string to indicate a second-class passenger is “2nd”. We follow the same recipe as before, but we change “age” to “class” and “40” to “2nd”.
titanic[titanic$class == "2nd", ]
Note that we must surround character objects such as “2nd” by quotes, as we learned in tutorial 02.
Even restricted to second-class passengers, there are still many rows of output. So let’s save the output in a variable called sec_class. Variable names mustn’t start with a number and they mustn’t contain spaces, but we can, for example, use an underscore.
sec_class <- titanic[titanic$class == "2nd", ]
Because sec_class is a data frame, we can apply all functions we learned in the previous tutorial. For example, we can find out with nrow() how many passengers were in the second class. While we’re typing, RStudio tries to autocomplete the variable name. Autocompletion can be very convenient because it reduces the probability of making a typo. If RStudio proposes the correct name (and here it does), then all we need to do is hit either tab or return.
nrow(sec_class)
## [1] 271
So there were 271 second-class passengers according to our data.
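The whole sequence (filter by a character value, save the result, count the rows) can be rehearsed on a small made-up data frame. In this sketch, toy, second and all values are hypothetical; only the pattern is the same as for sec_class.

```r
# A small made-up data frame; the name `toy` and its values are hypothetical.
toy <- data.frame(
  name  = c("Ann", "Ben", "Cal", "Dee", "Eve"),
  class = c("3rd", "2nd", "Crew", "2nd", "1st"),
  stringsAsFactors = FALSE
)

# Keep the rows whose class is "2nd" (quotes needed, comma needed).
second <- toy[toy$class == "2nd", ]
second

# Count how many rows ended up in the subset.
nrow(second)
```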
We’ll keep working with the variable sec_class in the next tutorial, but for now let’s return once more briefly to the original titanic data frame.
A common question we have about a column in a data frame is how many different values it contains and what these values are. For example, how many different classes of passengers and crew are in the class column? We can find out with the function unique(). With unique(titanic$class) we can see that there are four classes: "3rd", "Crew", "2nd", "1st". The result is in the order of first appearance in the column.
unique(titanic$class)
## [1] "3rd" "Crew" "2nd" "1st"
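If we'd rather have R count the different values for us, we can wrap unique() in length(), which returns the number of elements in a vector. Here is a minimal sketch on a made-up vector; the name classes and its values are hypothetical.

```r
# A made-up character vector; the name `classes` and its values are hypothetical.
classes <- c("3rd", "Crew", "3rd", "2nd", "Crew", "1st")

# The distinct values, in order of first appearance.
distinct <- unique(classes)
distinct

# How many distinct values there are.
n_distinct <- length(distinct)
n_distinct
```

On the real data, the same idea would read length(unique(titanic$class)).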
Here is a summary of the main points of this tutorial.
- We learned how we can create subsets of a data frame with square brackets.
- We separate the row indices from the column indices with a comma.
- If we want all rows or all columns, we can leave the corresponding entry in the square brackets blank. But we mustn’t forget the comma.
- If we want a single column of a data frame, we can alternatively use the dollar notation. After the dollar, we type the name of the column we want to keep. We can use either column numbers or column names when working with square brackets, but we can only use column names when using the $-notation.
- We can obtain the unique values in a vector with the function unique().
Besides extracting individual values or subsets from a data frame, there are many situations where we’d like to add a new column to a data frame. Next time we’ll learn how to accomplish that task.
See you soon.