Chapter 3 Data Frames

Welcome back to Quantitative Reasoning! In the previous tutorial, we learned how to work with vectors, an R data structure that stores either multiple numbers or multiple character strings in a single variable. In this tutorial, we’ll learn how to combine several vectors into a data frame.

Last time, we defined two vectors: players and scores.

players <- c("Rachel", "Tatyana", "Noah", "Quentin", "Aisha")
scores <- c(5, 6, 1, 6, 3)

Suppose we want to make a spreadsheet where players and scores are in two adjacent columns so that it’s easy to link an individual player to a score. In R, such a spreadsheet-like data structure is called a “data frame”. We create a data frame with the function data.frame(). Inside the parentheses, we first give the name of the first column (here player), followed by an equals sign and then the vector that we want to put in the first column. Then we type a comma, and we repeat the pattern for the second column score.

data.frame(player = c("Rachel", "Tatyana", "Noah", "Quentin", "Aisha"),
           score = c(5, 6, 1, 6, 3))
##    player score
## 1  Rachel     5
## 2 Tatyana     6
## 3    Noah     1
## 4 Quentin     6
## 5   Aisha     3

It isn’t strictly necessary to insert a line break after the first column, but it makes code more readable if we keep lines shorter than 80 characters. If you chose the same settings that I selected in tutorial 01, you see a vertical line on the right side of the editor that shows the maximum recommended line length. R mostly ignores extra spaces, and it lets us break a line wherever the command is still incomplete (for example, after a comma inside parentheses). We should use this freedom consistently to keep code readable.
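To check that the line breaks really make no difference, here is a small sketch that builds the same data frame twice, once on a single long line and once spread over several lines, and then compares the two results with identical(). The variable names one_line and multi_line are made up for this illustration.

```r
# The whole call on one line (longer than 80 characters)
one_line <- data.frame(player = c("Rachel", "Tatyana", "Noah", "Quentin", "Aisha"), score = c(5, 6, 1, 6, 3))

# The same call with a line break after each comma between columns
multi_line <- data.frame(
  player = c("Rachel", "Tatyana", "Noah", "Quentin", "Aisha"),
  score = c(5, 6, 1, 6, 3)
)

# identical() returns TRUE if both objects are exactly the same
identical(one_line, multi_line)
## [1] TRUE
```

Both versions produce the same data frame, so the choice between them is purely a matter of readability.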

In our example, we have only two columns: player and score. If we wanted to add more columns, we would simply add more pairs of column names and column values, each pair joined by an equals sign and separated from the previous pair by a comma. But for the sake of simplicity, let’s limit ourselves to two columns for this exercise.
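Just to illustrate the pattern, here is a sketch with a third column. The column turn and its values are hypothetical and not part of the course data; they stand for the order in which the players might have rolled the dice.

```r
# A data frame with three columns; "turn" is a made-up example column
three_cols <- data.frame(player = c("Rachel", "Tatyana", "Noah", "Quentin", "Aisha"),
                         score = c(5, 6, 1, 6, 3),
                         turn = c(1, 2, 3, 4, 5))

# ncol() counts the columns of a data frame
ncol(three_cols)
## [1] 3
```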

It’s possible to pass predefined vectors as arguments to data.frame(). For example, we can take advantage of the fact that we already defined the vectors players and scores. The combination of these three commands

players <- c("Rachel", "Tatyana", "Noah", "Quentin", "Aisha")
scores <- c(5, 6, 1, 6, 3)
data.frame(player = players, score = scores)
##    player score
## 1  Rachel     5
## 2 Tatyana     6
## 3    Noah     1
## 4 Quentin     6
## 5   Aisha     3

achieves the same outcome as the previous call to data.frame(). Here I chose singular words (e.g. player) for the column names, but plural for the vectors (players). This naming convention isn’t strictly necessary, but column names in printed tables are usually singular words, so I follow this pattern here.

We can assign the data frame to a variable (here called dice).

dice <- data.frame(player = players, score = scores)

After we run this assignment, dice appears in the environment tab. Unlike vectors, which are listed under “Values”, data frames appear under the heading “Data”. If we click the arrow button next to the data frame’s name, we can see the names of the columns contained in the data frame. If we want to view a data frame as a spreadsheet, we can click on its name in the environment tab. The spreadsheet opens in the top left pane.
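The same information is also available from the console. The following sketch uses a few base R functions that report the column names and the size of a data frame. The call to View() is commented out because it only works in an interactive session such as RStudio, where it opens the same spreadsheet view as a click on the name in the environment tab.

```r
players <- c("Rachel", "Tatyana", "Noah", "Quentin", "Aisha")
scores <- c(5, 6, 1, 6, 3)
dice <- data.frame(player = players, score = scores)

# Names of the columns
names(dice)
## [1] "player" "score"

# Number of rows and number of columns
nrow(dice)
## [1] 5
ncol(dice)
## [1] 2

# Compact overview with one line per column
str(dice)

# Spreadsheet view (interactive sessions only)
# View(dice)
```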

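Here is an important constraint when working with data frames: all columns must be equally long. If the constituent vectors have different lengths, R usually throws an error. (The exception is a shorter vector whose length divides the number of rows evenly; R silently repeats such a vector to fill the column.) For example, here is what happens if we add a sixth number to score.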

data.frame(player = c("Rachel", "Tatyana", "Noah", "Quentin", "Aisha"),
           score = c(5, 6, 1, 6, 3, 2))
## Error in data.frame(player = c("Rachel", "Tatyana", "Noah", "Quentin", : arguments imply differing number of rows: 5, 6
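One way to avoid this error is to compare the lengths of the vectors before calling data.frame(). The sketch below uses length() for this purpose; the vector name too_many_scores is made up for this illustration.

```r
players <- c("Rachel", "Tatyana", "Noah", "Quentin", "Aisha")
too_many_scores <- c(5, 6, 1, 6, 3, 2)  # hypothetical vector with a sixth number

# length() returns the number of elements in a vector
length(players)
## [1] 5
length(too_many_scores)
## [1] 6

# FALSE tells us that the two vectors cannot be columns of the same data frame
length(players) == length(too_many_scores)
## [1] FALSE
```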

Let’s summarise the main points of this tutorial.

  • We can combine several vectors into a spreadsheet-like data structure called a “data frame”.
  • All vectors contained in a data frame must be equally long.
  • We can create a data frame from scratch with the function data.frame().
  • We can get a spreadsheet view of a data frame by clicking on its variable name in the environment tab.

We’re going to work a lot more with data frames throughout this course. There are many operations with data frames that we still need to learn. For example, so far we’ve had to type all names and numbers by hand. For larger data sets, typing would become tedious. In our next tutorial, we’ll learn how to import data from a file.

See you soon.