Chapter 7 Appending and Removing Columns from Data Frames

Welcome back to Quantitative Reasoning! In the previous tutorial, we learned how to extract elements and subsets of a data frame. In this video, we’ll learn how to perform calculations with columns of a data frame, append new columns to a data frame and remove columns that we no longer need.

Let’s open our titanic project again and run the read.csv() command to import the data.

titanic <- read.csv("~/QR/titanic/titanic.csv")

Rerun the command with which we defined the subset sec_class

sec_class <- titanic[titanic$class == "2nd", ]

and open sec_class as a spreadsheet. The last three columns of sec_class indicate how much each passenger paid for the ticket. If there were multiple passengers travelling with the same ticket (e.g. a family), the price stated here is the total price paid by this passenger group. In 1912, Britain used a complicated system of pounds, shillings and pence. There were 12 pence in a shilling and 20 shillings in a pound.

For example, the passenger with ticket number 1166 paid 12 pounds, 10 shillings and 6 pence. This pounds-shillings-pence system is really inconvenient. One goal of this tutorial is to convert the prices to decimals so that the numbers are more easily interpretable. The formula to use is:

Total = Pounds + Shillings / 20 + Pence / 240.

I keep this equation visible in the bottom right to remind ourselves what our objective is.

Let’s break down the challenge into smaller pieces. We first deal with the term Shillings / 20.

Here is how we divide all numbers in the column shl by 20. We first use the dollar notation to subset the data frame and then we divide by 20.

sec_class$shl / 20

Strictly speaking, we haven’t yet seen the syntax “vector divided by a single number”. In tutorial 02, we saw that we can perform arithmetic operations involving two vectors of equal length. In that case, the operation is carried out elementwise. The situation here is different because we divide the vector sec_class$shl by a single number, namely 20. We may view 20 as a vector, but it has only one element, whereas sec_class$shl has 271 elements.

By the way, we can find out the number of elements in a vector with the length() function. length(sec_class$shl) confirms that the argument vector contains 271 elements.

length(sec_class$shl)

## [1] 271

By contrast, the length of 20 is 1.

length(20)

## [1] 1

What happens when we divide a vector of length 271 by a single number? It’s actually quite intuitive. In R, an arithmetic operation that involves a single number is carried out for every element in the longer vector. That is, sec_class$shl / 20 is a vector with as many elements as sec_class$shl, where each element is divided by 20. In the R jargon, this rule is called vectorisation. If one operand is a vector with multiple elements and the other operand a single number, R carries out the operation for all elements in the longer vector.

We can currently find all numbers in the console, but this output isn’t very useful because the numbers aren’t linked to the remaining passenger information (e.g. name or gender). It would be better if we could add the numbers to the data frame as a new column called, for example, shl_to_pnd.

Appending a new column is easily done with the $-notation. First we type the name of the data frame to which we want to add the column. Here it’s sec_class. Then we type a dollar sign followed by the name of the new column shl_to_pnd. Finally we type the assignment operator. Let’s run the command

sec_class$shl_to_pnd <- sec_class$shl / 20

and take a look at the spreadsheet. We may need to scroll towards the right to see the new column.

So far, so good. But our real goal was to convert not only the shillings to pounds, but to calculate the total ticket price as shown in the equation in the bottom right. Here is how the equation translates into R code.

sec_class$pnd + sec_class$shl / 20 + sec_class$pnc / 240

R applies the same rules of operator precedence that we learned in high school maths: R carries out the divisions first and only then carries out the additions. There isn’t any need for parentheses here. Let’s add the result of the command as a new column called price to the data frame sec_class.

sec_class$price <- sec_class$pnd + sec_class$shl / 20 + sec_class$pnc / 240

Because the purpose of the arithmetic may not be obvious any longer to us when we look at the code a few weeks later, we should add a reminder to ourselves. We can insert comments into R scripts with the hash symbol (#). Everything that follows on the same line after a hash symbol is ignored by R, so we can insert any text that we think may be helpful to us or our teammates when we read our code later on. For example, we can insert the comment:

# Convert ticket price from pounds-shillings-pence to decimal

Let’s click “Run” to check whether we indeed have a new column price.

OK, we now know how to add a column to a spreadsheet. Let’s talk about doing the opposite. How can I remove a column? For example, we don’t really need the column with the name shl_to_pnd any longer. Let’s get rid of it.

There are several ways how we can remove a column in R. We already know one method: we take a subset with the square bracket notation. Afterwards we can assign the subset back to the variable sec_class. Here we want to take the column numbers 1 to 11 and 13, and omit the 12th column.

sec_class[, c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13)]

But this command needs a lot of typing. When we want to keep all but a few elements, R has a convenient shortcut. Instead of typing all the indices we want to keep, we can use a minus sign and specify only those indices that we want to remove. In our example, we want to remove the 12th column.

sec_class <- sec_class[, -12]

Let’s take a look at the spreadsheet to confirm that shl_to_pnd has disappeared. I’m going to show you yet another method to remove a column from a data frame, which is in my opinion even better than the method with the minus sign, at least if all we want to remove is a single column. Let me first restore our previous data frame by running the entire script titanic.R again. Instead of clicking “Run” repeatedly, a convenient shortcut is the button “Source” to its right. The name “Source” doesn’t really reveal in clear language what this button does. To cut a long story short, we can think of “Source” as equivalent to “Run all lines in the script”. Because I want to keep the 12th column for demonstration purposes, I’m commenting out our last command so that R won’t carry it out. Let’s check in the spreadsheet whether the column shl_to_pnd is back. OK, here it is.

To remove this column, we can use the $-notation and assign a special value to the column: NULL.

sec_class$shl_to_pnd <- NULL

Let’s run the command and check the spreadsheet again. As promised, the shl_to_pnd column is gone again. In my opinion, the method that involves the $-notation and NULL is in this example a better choice than numeric indexing because we don’t need to count columns in the spreadsheet to determine which numeric index to remove. It’s easy to make a mistake when counting columns.

Great! Let’s summarise the main points of this tutorial.

We learned about vectorization in arithmetic operations such as addition, subtraction, multiplication or division. If one operand is a vector with multiple elements and the other operand a single number, R carries out the operation for all elements in the longer vector.
We can find the length of a vector with the length() function.
We can assign a new column to a data frame simply by typing $ followed by the name of the new column.
We can remove a column with a minus sign in front of the indices to be removed. By the way, negative indexing works not only for data frames, but also for vectors.
Alternatively, we can remove a column by using the $-notation and assigning the special value NULL.
We can run all lines in a script by clicking “Source”.
We also learned that we can insert comments into our script after a hash symbol.

Next time we’ll work again with the Titanic data to learn some more advanced subsetting methods with logical vectors.

See you soon.