Chapter 5 Inspecting the Format of a Data Frame

Welcome back to Quantitative Reasoning! In the previous tutorial, we learned how to import spreadsheet data into R. Remember that R’s version of a spreadsheet is called a data frame. In this video, we’ll learn how we can find out essential information about a data frame such as the numbers of rows and columns, the column names and the overall structure.

Last time we wrote the script titanic.R to import information about passengers and crew on board the Titanic. We’re going to add more commands to this script in this tutorial. If the Titanic project is no longer open on your computer, use the “Recent Projects” menu item to reopen it. In the Files tab, click on titanic.R to open the script in the editor pane. Run the read.csv() command so that the data frame titanic is back in the environment.

titanic <- read.csv("~/QR/titanic/titanic.csv")

Let’s click on the variable name titanic in the Environment tab to view the data frame as a spreadsheet.

So far, so good. But looking through an entire spreadsheet is usually a very inefficient way to answer questions about a data frame. For example, one of our first objectives when working with a data frame is to quickly find out how many data points there are. That is, how many rows and columns are we dealing with? We don’t need to scroll through the entire spreadsheet to answer this question. Last time we noticed that the Environment tab shows this information already: 2208 observations of 11 variables. Here “observation” is a synonym for row and “variable” is a synonym for column.

Another way to obtain the same information is to use the functions nrow() and ncol(). nrow(titanic) reveals that there are 2208 rows and ncol(titanic) shows that we have 11 columns.

nrow(titanic)

## [1] 2208

ncol(titanic)

## [1] 11

Yet another way to obtain the same information is the dim() function, where dim stands for dimension. dim(titanic) returns two numbers: first the number of rows, then the number of columns.

dim(titanic)

## [1] 2208   11
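The relationship between these three functions can be checked on any data frame. The following is a minimal sketch that uses a small made-up data frame called toy (an assumption for illustration, not part of the Titanic data), so it runs without the CSV file.

```r
# Hypothetical toy data frame, invented for illustration only
toy <- data.frame(
  name = c("Ann", "Bob", "Cy"),
  age  = c(31, 45, 27)
)

# dim() returns a vector: number of rows first, then number of columns
d <- dim(toy)
d

# The two elements match the results of nrow() and ncol()
d[1] == nrow(toy)
d[2] == ncol(toy)
```

Both comparisons should print TRUE, because dim() simply bundles the two counts into one result.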

By the way, R is case-sensitive. In our example, we must be careful because R happens to have a preinstalled object called Titanic that starts with an upper-case T. But that object isn’t a data frame, so we get some strange answers if we accidentally make a typo. For example, dim(Titanic) with a capital T returns four numbers instead of two.

dim(Titanic)

## [1] 4 2 2 2

So please be careful to use only the lower-case variable titanic in this tutorial.
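If you’re curious why the capitalised object behaves differently, a short check along these lines shows what it is. This is only a sketch; it relies on the Titanic object from R’s built-in datasets package, which is normally available without loading anything.

```r
# Titanic (capital T) ships with R's built-in datasets package
class(Titanic)          # reports "table", not "data.frame"
is.data.frame(Titanic)  # FALSE: it is not a data frame
length(dim(Titanic))    # 4: the table has four dimensions instead of two
```

Because Titanic is a four-dimensional table rather than a data frame, dim() reports four numbers for it.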

Besides knowing the dimensions of a data frame, we often would like to see some snippets of the data. A very useful function for this purpose is head(), which prints the first six rows.

head(titanic)

##   fam_name                    given_name gender age class survived ticket
## 1   ABBING                       Anthony   Male  41   3rd    FALSE   5547
## 2   ABBOTT                   Ernest Owen   Male  21  Crew    FALSE   <NA>
## 3   ABBOTT                 Eugene Joseph   Male  13   3rd    FALSE CA2673
## 4   ABBOTT               Rossmore Edward   Male  16   3rd    FALSE CA2673
## 5   ABBOTT             Rhoda Mary 'Rosa' Female  39   3rd     TRUE CA2673
## 6 ABELSETH Kalle (Karen) Marie Kristiane Female  16   3rd     TRUE 348125
##   pax_on_tckt pnd shl pnc
## 1           1   7  11   0
## 2          NA  NA  NA  NA
## 3           3  20   5   0
## 4           3  20   5   0
## 5           3  20   5   0
## 6           1   7  13   0

The counterpart of head() is tail(), which prints the last six rows.

tail(titanic)

##        fam_name         given_name gender age class survived ticket pax_on_tckt
## 2203      YVOIS Henriette Virginie Female  22   2nd    FALSE 248747           1
## 2204   ZAKARIAN        Mapriededer   Male  22   3rd    FALSE   2656           1
## 2205   ZAKARIAN              Ortin   Male  27   3rd    FALSE   2670           1
## 2206    ZANETTI              Minio   Male  20  Crew    FALSE   <NA>          NA
## 2207  ZARRACCHI                 L.   Male  26  Crew    FALSE   <NA>          NA
## 2208 ZIMMERMANN                Leo   Male  29   3rd    FALSE 315082           1
##      pnd shl pnc
## 2203  13   0   0
## 2204   7   4   6
## 2205   7   4   6
## 2206  NA  NA  NA
## 2207  NA  NA  NA
## 2208   7  17   6
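Six rows is only the default. Both head() and tail() accept an argument n that sets how many rows to print. Here is a minimal sketch on a made-up data frame called toy (an assumption for illustration, not the Titanic data).

```r
# Hypothetical toy data frame with ten rows, invented for illustration only
toy <- data.frame(id = 1:10, square = (1:10)^2)

first_three <- head(toy, n = 3)  # rows 1 to 3
last_two    <- tail(toy, n = 2)  # rows 9 and 10

first_three
last_two
```

The same argument works for our data: head(titanic, n = 3) would print only the first three passengers.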

The rows printed by head(titanic) and tail(titanic) take up quite a lot of space in the console. Depending on the width of your console pane, R may not be able to fit all columns side by side. The output then wraps around, so the last few columns are printed below the other columns. This output isn’t easy to read. Fortunately there’s a more user-friendly alternative, str(), which stands for “structure”.

str(titanic)

## 'data.frame':	2208 obs. of  11 variables:
##  $ fam_name   : chr  "ABBING" "ABBOTT" "ABBOTT" "ABBOTT" ...
##  $ given_name : chr  "Anthony" "Ernest Owen" "Eugene Joseph" "Rossmore Edward" ...
##  $ gender     : chr  "Male" "Male" "Male" "Male" ...
##  $ age        : int  41 21 13 16 39 16 25 30 28 45 ...
##  $ class      : chr  "3rd" "Crew" "3rd" "3rd" ...
##  $ survived   : logi  FALSE FALSE FALSE FALSE TRUE TRUE ...
##  $ ticket     : chr  "5547" NA "CA2673" "CA2673" ...
##  $ pax_on_tckt: int  1 NA 3 3 3 1 1 2 2 1 ...
##  $ pnd        : int  7 NA 20 20 20 7 7 24 24 7 ...
##  $ shl        : int  11 NA 5 5 5 13 13 0 0 4 ...
##  $ pnc        : int  0 NA 0 0 0 0 0 0 0 6 ...

The output of str(titanic) shows that titanic is a data frame with 2208 rows and 11 columns. The column names appear after the dollar signs, followed by the data class of the column (chr for character, int for integer or logi for logical). str() also prints the first few values in each column. By the way, FALSE and TRUE in the column called survived aren’t character objects. If they were, they would appear between quotation marks. Instead FALSE and TRUE are special values that are called “logical” in R jargon. We’ll learn more about logical vectors in tutorial 08.
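The three data classes can also be checked on single values with the class() function. The following sketch uses values copied from the output above; the L suffix in 41L is R’s way of writing an integer.

```r
# class() reports the data class of a value
class_text    <- class("Male")   # "character"
class_whole   <- class(41L)      # "integer"
class_logical <- class(FALSE)    # "logical"

# With quotation marks, FALSE is just text, not a logical value
class_quoted  <- class("FALSE")  # "character"

class_text
class_whole
class_logical
class_quoted
```

The last line illustrates why the missing quotation marks in the survived column matter: "FALSE" in quotes would be a character string.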

You may also have wondered what NA stands for in the console output. It is the abbreviation for “Not Available” and symbolizes a missing value. One of our later tutorials is dedicated to missing values, but right now let’s not worry too much about them.

If we’re only interested in the column names, we can use the names() function.

names(titanic)

##  [1] "fam_name"    "given_name"  "gender"      "age"         "class"      
##  [6] "survived"    "ticket"      "pax_on_tckt" "pnd"         "shl"        
## [11] "pnc"
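The result of names() is an ordinary character vector with one element per column. A minimal sketch, again on a made-up data frame called toy (an assumption for illustration, not the Titanic data):

```r
# Hypothetical toy data frame, invented for illustration only
toy <- data.frame(
  name     = c("Ann", "Bob"),
  age      = c(31, 45),
  survived = c(TRUE, FALSE)
)

col_names <- names(toy)

col_names                       # the three column names
class(col_names)                # "character"
length(col_names) == ncol(toy)  # TRUE: one name per column
```

For titanic, this means length(names(titanic)) gives the same answer as ncol(titanic).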

Let’s summarise the functions we’ve encountered in this tutorial to find information about the format and content of a data frame.

We can find out the number of rows with nrow() and the number of columns with ncol(). The dim() function returns both of these numbers with a single function call.
We can inspect the first and last few rows of a data frame with head() and tail(), respectively.
str() gives a compact summary of a data frame.
The names() function returns a character vector containing the column names of a data frame.

Exploratory data analysis often starts with the functions we’ve just learned. If we want to dig deeper, we’ll have to learn how to extract values from data frames. We’ll address that topic in the next tutorial.

See you soon.