Chapter 5 Inspecting the Format of a Data Frame
Welcome back to Quantitative Reasoning! In the previous tutorial, we learned how to import spreadsheet data into R. Remember that R’s version of a spreadsheet is called a data frame. In this video, we’ll learn how we can find out essential information about a data frame such as the numbers of rows and columns, the column names and the overall structure.
Last time we wrote the script titanic.R to import information about passengers
and crew on board the Titanic.
We’re going to add more commands to this script in this tutorial.
If the Titanic project is no longer open on your computer, use the “Recent
Projects” menu item to reopen it.
In the Files tab, click on titanic.R to open the script in the editor pane.
Run the read.csv() command so that the data frame titanic is back in the environment.
titanic <- read.csv("~/QR/titanic/titanic.csv")
Let’s click on the variable name titanic in the Environment tab to view the data frame as a spreadsheet.
So far, so good. But looking through an entire spreadsheet is usually a very inefficient way to answer questions about a data frame. For example, one of our first objectives when working with a data frame is to quickly find out how many data points there are. That is, how many rows and columns are we dealing with? We don’t need to scroll through the entire spreadsheet to answer this question. Last time we noticed that the Environment tab shows this information already: 2208 observations of 11 variables. Here “observation” is a synonym for row and “variable” is a synonym for column.
A different way to obtain the same information is to use the functions nrow() and ncol().
nrow(titanic) reveals that there are 2208 rows and ncol(titanic) shows that we have 11 columns.
nrow(titanic)
## [1] 2208
ncol(titanic)
## [1] 11
Yet another way to obtain the same information is the dim() function, where dim stands for dimension.
dim(titanic) returns two numbers: first the number of rows, then the number of columns.
dim(titanic)
## [1] 2208 11
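To see how nrow(), ncol() and dim() relate to each other, here is a minimal sketch on a small made-up data frame. The name toy and its contents are assumptions for illustration only; they are not part of the Titanic data.

```r
# Hypothetical toy data frame (not the Titanic data): 3 rows, 2 columns.
toy <- data.frame(
  name = c("A", "B", "C"),
  age  = c(41, 21, 13)
)

nrow(toy)  # number of rows
ncol(toy)  # number of columns

# dim() returns both numbers in one vector: rows first, then columns.
dims <- dim(toy)
dims[1] == nrow(toy)
dims[2] == ncol(toy)
```

Both comparisons should return TRUE, because dim() simply bundles the row count and the column count.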
By the way, R is case-sensitive.
In our example, we must be careful because R happens to have a preinstalled object called Titanic that starts with an upper-case T.
But that object isn’t a data frame, so we get some strange answers if we accidentally make a typo.
For example, dim(Titanic) with a capital T returns four numbers instead of two.
dim(Titanic)
## [1] 4 2 2 2
So please be careful to use only the lower-case variable titanic in this tutorial.
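If we’re ever unsure which of the two objects we’re looking at, a quick check along the following lines can help. This is only a sketch: class() and is.data.frame() are base R functions that haven’t been introduced in this tutorial yet.

```r
# Titanic (capital T) ships with R in the built-in datasets package.
class(Titanic)          # "table", not "data.frame"
is.data.frame(Titanic)  # FALSE

# A table can have more than two dimensions, which is why
# dim(Titanic) returns four numbers instead of two.
length(dim(Titanic))
```

Running is.data.frame() on our own lower-case titanic should return TRUE instead, because read.csv() produces a data frame.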
Besides knowing the dimensions of a data frame, we often would like to see some snippets of the data.
A very useful function for this purpose is head(), which prints the first six rows.
head(titanic)
## fam_name given_name gender age class survived ticket
## 1 ABBING Anthony Male 41 3rd FALSE 5547
## 2 ABBOTT Ernest Owen Male 21 Crew FALSE <NA>
## 3 ABBOTT Eugene Joseph Male 13 3rd FALSE CA2673
## 4 ABBOTT Rossmore Edward Male 16 3rd FALSE CA2673
## 5 ABBOTT Rhoda Mary 'Rosa' Female 39 3rd TRUE CA2673
## 6 ABELSETH Kalle (Karen) Marie Kristiane Female 16 3rd TRUE 348125
## pax_on_tckt pnd shl pnc
## 1 1 7 11 0
## 2 NA NA NA NA
## 3 3 20 5 0
## 4 3 20 5 0
## 5 3 20 5 0
## 6 1 7 13 0
The counterpart of head() is tail(), which prints the last six rows.
tail(titanic)
## fam_name given_name gender age class survived ticket pax_on_tckt
## 2203 YVOIS Henriette Virginie Female 22 2nd FALSE 248747 1
## 2204 ZAKARIAN Mapriededer Male 22 3rd FALSE 2656 1
## 2205 ZAKARIAN Ortin Male 27 3rd FALSE 2670 1
## 2206 ZANETTI Minio Male 20 Crew FALSE <NA> NA
## 2207 ZARRACCHI L. Male 26 Crew FALSE <NA> NA
## 2208 ZIMMERMANN Leo Male 29 3rd FALSE 315082 1
## pnd shl pnc
## 2203 13 0 0
## 2204 7 4 6
## 2205 7 4 6
## 2206 NA NA NA
## 2207 NA NA NA
## 2208 7 17 6
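Six rows is only the default. Both head() and tail() accept an optional argument n that sets how many rows to return. The sketch below uses a made-up data frame called toy (an assumption for illustration, not the Titanic data).

```r
# Hypothetical toy data frame with 10 rows.
toy <- data.frame(id = 1:10, square = (1:10)^2)

# Without n, head() returns the first six rows.
default_head <- head(toy)

# With n, we choose the number of rows ourselves.
first_three <- head(toy, n = 3)  # rows 1 to 3
last_two    <- tail(toy, n = 2)  # rows 9 and 10

first_three
last_two
```

The same calls work on titanic, for example head(titanic, n = 3).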
The rows take up quite a lot of space in the console.
Depending on the width of your console pane, R may not be able to fit all columns side-by-side.
The output then wraps around, so the last few columns are printed below the other columns.
This output isn’t easily legible.
Fortunately, there’s a more user-friendly alternative, str(), which stands for “structure”.
str(titanic)
## 'data.frame': 2208 obs. of 11 variables:
## $ fam_name : chr "ABBING" "ABBOTT" "ABBOTT" "ABBOTT" ...
## $ given_name : chr "Anthony" "Ernest Owen" "Eugene Joseph" "Rossmore Edward" ...
## $ gender : chr "Male" "Male" "Male" "Male" ...
## $ age : int 41 21 13 16 39 16 25 30 28 45 ...
## $ class : chr "3rd" "Crew" "3rd" "3rd" ...
## $ survived : logi FALSE FALSE FALSE FALSE TRUE TRUE ...
## $ ticket : chr "5547" NA "CA2673" "CA2673" ...
## $ pax_on_tckt: int 1 NA 3 3 3 1 1 2 2 1 ...
## $ pnd : int 7 NA 20 20 20 7 7 24 24 7 ...
## $ shl : int 11 NA 5 5 5 13 13 0 0 4 ...
## $ pnc : int 0 NA 0 0 0 0 0 0 0 6 ...
The output of str(titanic) shows that titanic is a data frame with 2208 rows and 11 columns.
The column names appear after the dollar signs, followed by the data class of the column (character, integer or logical).
str() also prints the first few values in each column.
By the way, FALSE and TRUE in the column called survived aren’t character objects.
If they were, they would appear between quotation marks.
Instead, FALSE and TRUE are special values that are called “logical” in R jargon.
We’ll learn more about logical vectors in tutorial 08.
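The difference between a logical value and a character string that merely spells the same word can be checked directly with class(). The following sketch also runs str() on a small made-up data frame; the name toy and its columns are assumptions for illustration only.

```r
# Without quotation marks, TRUE is a logical value.
class(TRUE)    # "logical"

# With quotation marks, "TRUE" is just text.
class("TRUE")  # "character"

# Hypothetical toy data frame with one column of each class
# that appears in the str(titanic) output.
toy <- data.frame(
  name     = c("A", "B", "C"),
  age      = c(41L, 21L, 13L),
  survived = c(FALSE, FALSE, TRUE),
  stringsAsFactors = FALSE  # keep name as character in older R versions too
)

# str() labels the columns chr, int and logi.
str(toy)
```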
You may also have wondered what NA stands for in the console output.
It is the abbreviation for “Not Available” and symbolizes a missing value.
One of our later tutorials is dedicated to missing values, but right now
let’s not worry too much about them.
If we’re only interested in the column names, we can use the names() function.
names(titanic)
## [1] "fam_name" "given_name" "gender" "age" "class"
## [6] "survived" "ticket" "pax_on_tckt" "pnd" "shl"
## [11] "pnc"
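Because names() returns an ordinary character vector, we can store the result and work with it like any other vector. Here is a minimal sketch on a made-up data frame; the name toy, its columns and the use of the %in% operator are assumptions for illustration and haven’t been covered in the tutorials so far.

```r
# Hypothetical toy data frame with two columns.
toy <- data.frame(
  name = c("A", "B"),
  age  = c(41L, 21L),
  stringsAsFactors = FALSE
)

col_names <- names(toy)
col_names

# The result is a character vector with one element per column.
class(col_names)
length(col_names) == ncol(toy)

# Check whether a column with a given name exists.
"age" %in% col_names
"ticket" %in% col_names
```

The last two lines should return TRUE and FALSE, respectively, because toy has a column called age but none called ticket.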
Let’s summarise the functions we’ve encountered in this tutorial to find information about the format and content of a data frame.
- We can find out the number of rows with nrow() and the number of columns with ncol(). The dim() function returns both of these numbers with a single function call.
- We can inspect the first and last few rows of a data frame with head() or tail() respectively.
- str() gives a compact summary of a data frame.
- The names() function returns a character vector containing the column names of a data frame.
Exploratory data analysis often starts with the functions we’ve just learned. If we want to dig deeper, we’ll have to learn how to extract values from data frames. We’ll address that topic in the next tutorial.
See you soon.