Chapter 4 Importing Spreadsheet Data into R

Welcome back to Quantitative Reasoning! In the previous tutorial, we learned how to combine vectors into a data frame. So far, we’ve had to type all our data by hand, which is tedious. In this tutorial, we cover how to import data into R from a file.

A convenient file type for saving spreadsheet data is CSV, which stands for “comma-separated values”. We can open CSV files with all common spreadsheet programs (e.g. Excel) and, as we learn soon, also with R.

In Excel, we can save a spreadsheet to a CSV file by choosing the menu option “Save as”. Other spreadsheet programs have similar menu options for exporting to a CSV.

Let’s take a look at a concrete example of a CSV file. Please download titanic.csv with a web browser from the URL in the description below this video [https://michaelgastner.com/data_for_QR/titanic.csv]. If you’ve kept your default browser settings, the file will be in your Downloads folder.

Let’s have a glimpse at the content of titanic.csv. Go to the Downloads folder and open titanic.csv with a spreadsheet program. Here is how the data look like in Excel. We won’t have to open the data with Excel in the future, but sometimes it’s a good idea to take a quick look before trying to import data into R.

This spreadsheet contains information about the passengers and crew on board of the Titanic during its maiden voyage in 1912, which ended tragically when the ship hit an iceberg and sank. I found these data on the website of the “Encyclopedia Titanica” (https://www.encyclopedia-titanica.org/). By the way, our textbook discusses statistics related to the sinking of the Titanic in chapter 2.

In this CSV, every row stands for one passenger on the Titanic. There are eleven columns:

  • family name,
  • given name,
  • gender,
  • age,
  • class,
  • whether the passenger survived,
  • ticket number,
  • number of passengers travelling with this ticket and
  • three columns that indicate the ticket price in British pounds, shillings and pence.

Why is this file format called CSV (i.e. comma-separated values)? There are no visible commas anywhere right now. Excel has formatted the data, so the commas have disappeared. We can see them when we open the file with a simple text editor such as TextEdit on a Mac or Notepad on Windows. Here are the commas. They indeed separate the columns, which explains the “C” in the name CSV. Because this file format is so simple, it is the standard way of exchanging spreadsheet data between different software. It’s also the easiest file format to import into R, so we recommend that you always save your data as a CSV. We can import data into R with only a few clicks in RStudio. But, before we import anything, let’s briefly talk about how to organize our files by using RStudio projects.

We can think of an RStudio project as a directory on our computer. Compared to conventional directories, an RStudio project has additional features that help us organize our workflow. We start a new project by going to the menu item “File” -> “New Project”. In the pop-up window, we pick the option “New Directory” and then “New Project”. Now a dialog box appears that prompts us to type a directory name. We should choose a name that clearly communicates what the topic of our project is. In the present example, “titanic” is a good choice for the project name. On the line below, the dialog box wants to know the directory that is supposed to contain the “titanic” project as a subdirectory. You can choose any directory on your computer. I have a directory called QR on my computer to store the projects we’re going to create in this course, so I will choose the QR directory. Then I click “Create Project”.

RStudio has now set the “titanic” directory as our working directory. In practice, this means that we must move the data to the “titanic” directory. Go to the bottom right RStudio pane and highlight the “Files” tab. Then click on “More” next to the cogwheel symbol and choose “Show Folder in New Window”. Shift titanic.csv from the Downloads folder to the “titanic” folder. I’m demonstrating this procedure on a Mac, but it works similarly on Windows and Linux. Simply use the File Explorer to shift the CSV to your project folder.

Now we’ve set the scene for our data analysis: we have a project directory dedicated to our Titanic project with the relevant data. Next we need to import the spreadsheet into R. I’m first going to show you how to import data with RStudio’s graphical user interface. Later I’m going to tell you how we can accomplish the same task with an R script.

In the top right pane, highlight the Environment tab. In the menu, we can find a button labelled “Import Dataset”. From the dropdown menu, choose “From text (base)…” This opens a dialog window with which we can select the file we wish to import. Double-click on titanic.csv. RStudio now opens a new dialog window. A nice feature of this dialog window is that we can see a preview of the imported data. It’s always a good idea to check whether the preview looks as expected. If there’s a problem with the format of the CSV file, we would already be able to see it clearly at this stage. There are many settings in this dialog window that we can adjust. In this example, the defaults are fine, so we can go ahead and click the “Import” button.

In the top left pane, RStudio displays the imported data as a spreadsheet. I recommend that we always briefly check whether RStudio displays the data as we expect. It can save us a lot of frustration down the road. For example, sometimes the column headers are wrong, especially if the imported file contains characters that aren’t in the English alphabet. In that case, we would need to tweak the settings in the import dialog window, but here everything looks fine.

Besides opening the data as a spreadsheet, there was one more change in the RStudio window when we clicked “Import”: RStudio automatically generated two commands in the console. The first command opened the CSV file with the read.csv() function and assigned the result to a data frame called titanic().

titanic <- read.csv("~/QR/titanic/titanic.csv")

Consequently, titanic is currently listed in the Environment tab. This data frame consists of “2208 obs. of 11 variables”. That is, there are 2208 rows and 11 columns.

RStudio also automatically ran the View() function.

View(titanic)

View() opens the data frame as a spreadsheet in the top left pane, so it has the same effect as clicking on the variable name in the Environment tab.

Let’s learn one more thing before ending this tutorial. We’re going to save the import command to a script so that we won’t have to go through the graphical user interface next time we open titanic.csv. Let’s open a new script. Then we copy the command that contains the read.csv() function from the console and paste it into the editor. Let’s pretend that we haven’t yet imported titanic.csv. So let’s remove titanic from the environment. We can remove it by clicking on the broom icon in the Environment tab. A dialog window pops up asking whether we really want to remove all objects from the environment. In this case, we click “Yes”. Looking at the environment now, titanic has indeed disappeared, so the environment is empty again. Now run the command in the script. Voilà! titanic is now back in the environment. The nice thing about working with scripts is that we don’t need to go through graphical user interfaces over and over again if we want to perform the same task repeatedly. We’ll work with the Titanic data again next time. So, thanks to this script, we’ll be able to import the data again with a single click.

We save the script by clicking on the disk symbol at the top of the editor pane. Let’s call the file titanic.R. All R scripts have the extension “.R”. In the dialog box, we can leave out the “.R”. If the extension is missing, Rstudio will automatically append it. If we click Save, RStudio adds titanic.R to the files in our project.

Let me be crystal clear about one bad habit: never save R code by copying it into a Microsoft Word document. Word is not a text editor for R. For example, Word uses a special character encoding for quotation marks. If you copy R commands to and from a Word document, I can almost guarantee that the result will be unusable as R source code.

Instead, please type R code directly into the RStudio editor and save it from there. While working on an R script, save it frequently so that you don’t lose your work.

Here is a summary of what we learned in this tutorial.

  • RStudio projects help us stay organized. As a rule of thumb, whenever we work on a new data set, we create a new project dedicated to these data. The project folder is where we store the data files and save all R scripts that have to do with these data. Don’t hesitate to create many small projects. It’s much better than dumping many unrelated files into one and the same directory.
  • The standard file format for exchanging spreadsheet data between different software is CSV. When we import a CSV file into R for the first time, we click the “Import Dataset” button in the Environment tab. At this stage, check carefully whether R has imported the data correctly. For example, are the column headers correct? Do we have the correct number of columns? If the imported data look indeed correct, then save the read.csv() command in a script.
  • Always write scripts with the RStudio editor. Never use Microsoft Word!
  • We also learned how to save scripts in the project folder.
  • If we want to remove all variables from the environment, we click the broom icon in the Environment tab. This action can be good for housekeeping because we generally want to keep the number of variables in the environment small. Otherwise we may get confused. If we save our commands in a script, we can easily run all necessary commands again to retrieve previous variables.

Next time we’ll work more with data frames.

See you soon.