Chapter 13 Summary Statistics in R
Welcome back to Quantitative Reasoning!
In the previous tutorials, we learned how to summarize and visualise
categorical data.
In this video, we turn our attention to quantitative data.
We’re going to work with a data frame called iris.
This data frame contains measurements of iris flowers by the American botanist
Edgar Anderson.
It’s preinstalled in R, so there’s no need to import these data from a file.
Let’s start a new project with the name iris.
If we want to take a look at the iris data frame, we can run the command View(iris) from the console.
There are five columns: sepal length, sepal width, petal length, petal width
and the species.
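The column names and the size of the data frame can also be checked from the console. The following is a minimal sketch using only base R functions; nothing here is specific to this course.

```r
# Column names of the built-in iris data frame
names(iris)

# Number of rows and columns
dim(iris)

# First six rows, as a quick alternative to View(iris)
head(iris)
```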
Let’s inspect the data a little bit more.
As we learned in tutorial 05, we can find out the number of rows with nrow(iris).
nrow(iris)
## [1] 150
So there are 150 rows.
Which species appear in the data frame?
We know from tutorial 06 that we can find the answer with the function unique().
Running unique(iris$Species) shows that there are three different species: setosa, versicolor and virginica.
unique(iris$Species)
## [1] setosa versicolor virginica
## Levels: setosa versicolor virginica
By the way, here is a photo showing the three species.
Don’t worry about the second line of the R output, which starts with the word “Levels”.
R prints the second line because the Species column is, technically speaking, not a character vector but a different type of object called a factor.
We won’t cover factors in detail in this course.
In our applications, we can usually treat factors as if they were character
vectors.
For example, we can find out how many times each species appears in the data
frame with the same method that we learned in tutorial 11.
table(iris$Species)
##
##     setosa versicolor  virginica
##         50         50         50
The output shows that we have 50 measurements for each of the three species.
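The remark that factors can usually be treated like character vectors can be checked directly. This is a small sketch; the variable name species_chr is only illustrative.

```r
# The Species column is stored as a factor
class(iris$Species)

# Convert the factor to a plain character vector
species_chr <- as.character(iris$Species)
class(species_chr)

# table() gives the same counts for the character version
table(species_chr)
```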
For more information, we can use the ? operator from tutorial 01.
After we run ?iris from the console, we can read the documentation of the data frame in the bottom right pane, including references to the source of the data.
Let’s shift our attention from the Species column, which contains categorical data, to one of the numeric columns (e.g. Sepal.Width).
To summarize quantitative data, functions such as unique() or table() aren’t as useful as they are for categorical data.
In principle, we can run table(iris$Sepal.Width).
table(iris$Sepal.Width)
##
##   2 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9   3 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9   4
##   1   3   4   3   8   5   9  14  10  26  11  13   6  12   6   4   3   6   2   1
## 4.1 4.2 4.4
##   1   1   1
It shows all the distinct values in the column and how often they appear, but this output contains more information than what we usually need. We often would like to summarize quantitative data with only a few numbers.
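To see why such a table gets unwieldy for quantitative data, we can count how many distinct values the column contains. This is a short sketch; the variable name widths is only illustrative.

```r
widths <- iris$Sepal.Width

# Number of measurements in the column
length(widths)

# Number of distinct values, i.e. the number of columns in the table above
length(unique(widths))
```

Three species give a table with three entries, whereas a numeric column can easily produce dozens of entries.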
Let’s start a new script called iris.R in which we apply R functions that compute summary statistics.
The most common summary statistic is the mean (i.e. the sum of all data
divided by the number of measurements).
In R, calculating the mean is easy.
All you need is the function mean() and a numeric vector.
In our example, we type mean(iris$Sepal.Width).
mean(iris$Sepal.Width)
## [1] 3.057333
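As a check of the definition, the mean can also be computed by hand as the sum divided by the number of measurements. The variable names below are only illustrative.

```r
widths <- iris$Sepal.Width

# Sum of all values divided by the number of values
manual_mean <- sum(widths) / length(widths)
manual_mean

# Agrees with the built-in function (up to floating-point rounding)
all.equal(manual_mean, mean(widths))
```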
Besides the mean, another common summary statistic is the median.
The median is the middle value of the sorted data, so that about half of the data are smaller and about half are larger than the median.
In R, you obtain the median with the function named, guess what, median().
median(iris$Sepal.Width)
## [1] 3
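The definition can be illustrated by sorting the data and picking the middle. Because the column has an even number of values (150), there is no single middle value, and median() averages the two middle ones. The variable names in this sketch are only illustrative.

```r
widths_sorted <- sort(iris$Sepal.Width)
n <- length(widths_sorted)

# With an even n, the two middle positions are n/2 and n/2 + 1
middle_two <- widths_sorted[c(n / 2, n / 2 + 1)]
middle_two

# Their average is the median
manual_median <- mean(middle_two)
all.equal(manual_median, median(iris$Sepal.Width))
```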
For the iris data, there isn’t much difference between the median and the mean.
This is often the case when the data are spread roughly symmetrically around the centre.
But when the mean and median are clearly different, we may want to report both values.
Sometimes we also want a measure of the spread in the data.
For example, what are the smallest and largest numbers?
R returns the minimum of a vector with the function min().
min(iris$Sepal.Width)
## [1] 2
Similarly, we get the maximum with the function max().
max(iris$Sepal.Width)
## [1] 4.4
The range() function combines the minimum and maximum into a single vector.
range(iris$Sepal.Width)
## [1] 2.0 4.4
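The relationship between min(), max() and range() can be verified with a short sketch. Note that range() returns the two extremes, not their difference; if we want the width of the range as a single number, one option is diff(). The variable names are only illustrative.

```r
widths <- iris$Sepal.Width

# range() gives the same two numbers as min() and max() combined
width_range <- range(widths)
all.equal(width_range, c(min(widths), max(widths)))

# Difference between the largest and smallest value
range_width <- diff(width_range)
range_width
```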
While the range tells us the extremes in the distribution, the standard
deviation gives a better impression of typical differences from the mean.
For the precise definition of the standard deviation, have a look at our
textbook.
In R, you calculate the standard deviation with the function sd().
sd(iris$Sepal.Width)
## [1] 0.4358663
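The textbook remains the place for the precise definition, but as a rough sketch, sd() computes the sample standard deviation, which divides the sum of squared deviations from the mean by n - 1 before taking the square root. The variable names below are only illustrative.

```r
widths <- iris$Sepal.Width
n <- length(widths)

# Squared differences between each value and the mean
squared_deviations <- (widths - mean(widths))^2

# Sample standard deviation with n - 1 in the denominator
manual_sd <- sqrt(sum(squared_deviations) / (n - 1))
manual_sd

all.equal(manual_sd, sd(widths))
```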
Another measure for the spread of a distribution is the interquartile range.
It’s the difference between the upper and lower quartile, so about 50% of the data lie between these two quartiles.
The R command for the interquartile range is IQR(), written in capital letters.
IQR(iris$Sepal.Width)
## [1] 0.5
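The quartiles themselves can be obtained with the base R function quantile(), which makes it possible to reproduce the interquartile range by hand. This is a sketch with illustrative variable names; names = FALSE simply drops the "25%" and "75%" labels from the result.

```r
widths <- iris$Sepal.Width

# Lower (25%) and upper (75%) quartile
quartiles <- quantile(widths, probs = c(0.25, 0.75), names = FALSE)
quartiles

# Interquartile range as upper minus lower quartile
manual_iqr <- quartiles[2] - quartiles[1]
all.equal(manual_iqr, IQR(widths))
```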
Instead of typing mean(), median(), range() and IQR() one by one, we can get all this information with the function summary().
summary(iris$Sepal.Width)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   2.000   2.800   3.000   3.057   3.300   4.400
It returns a vector with six numbers: the minimum, the lower or first quartile (i.e. about 25% of the data are lower), the median, the mean, the upper or third quartile (i.e. about 25% of the data are bigger) and the maximum. These numbers are usually enough information to quickly convey the essence of our data.
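Because the result of summary() carries names, individual statistics can be picked out by name. The following sketch assumes we store the result in a variable called width_summary, which is only an illustrative name. The printed summary is rounded, so small differences from mean() or median() in the displayed digits are expected.

```r
width_summary <- summary(iris$Sepal.Width)

# Names of the six statistics
names(width_summary)

# Pick out a single statistic by name
width_summary[["Median"]]

# The interquartile range follows from the two quartiles
width_summary[["3rd Qu."]] - width_summary[["1st Qu."]]

# summary() also accepts the whole data frame and summarizes every column
summary(iris)
```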
Let’s summarise the most common R functions for summary statistics.
- mean() for the mean,
- median() for the median,
- min() for the minimum,
- max() for the maximum,
- sd() for the standard deviation and
- IQR() for the interquartile range.
- The summary() function combines several of these statistics into one function call.
Looking at numbers is fine, but can feel a bit dry. Often it’s more appealing to summarize quantitative data with diagrams. In our next video, we find out how R can visualise quantitative data with box plots.
See you soon.