Chapter 13 Summary Statistics in R

Welcome back to Quantitative Reasoning! In the previous tutorials, we learned how to summarize and visualise categorical data. In this video, we turn our attention to quantitative data. We’re going to work with a data frame called iris. This data frame contains measurements of iris flowers by the American botanist Edgar Anderson. It’s preinstalled in R, so there’s no need to import these data from a file.

Let’s start a new project with the name iris. If we want to take a look at the iris data frame, we can run the command View(iris) from the console. There are five columns: sepal length, sepal width, petal length, petal width and the species.
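
View() opens a spreadsheet-style viewer, which is convenient in RStudio but leaves no trace in the console. As a rough alternative, the sketch below uses the base R functions head(), names() and str(), which this tutorial does not cover, to print a similar overview as text.

```r
# Print the first six rows of the data frame in the console
first_rows <- head(iris)
print(first_rows)

# List the column names of the data frame
column_names <- names(iris)
print(column_names)

# str() gives a compact overview: number of rows, columns and column types
str(iris)
```

The column names contain dots instead of spaces (e.g. Sepal.Width), which is why we later write iris$Sepal.Width.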

Let’s inspect the data a little bit more. As we learned in tutorial 05, we can find out the number of rows with nrow(iris).

nrow(iris)
## [1] 150

So there are 150 rows. Which species appear in the data frame? We know from tutorial 06 that we can find the answer with the function unique(). unique(iris$Species) shows that there are three different species: setosa, versicolor and virginica.

unique(iris$Species)
## [1] setosa     versicolor virginica 
## Levels: setosa versicolor virginica

By the way, here is a photo showing the three species. Don’t worry about the second line of output that starts with the word “Levels”. R prints the second line because the Species column is, technically speaking, not a character vector but a different type of object called a factor. We won’t cover factors in detail in this course. In our applications, we can usually treat factors as if they were character vectors. For example, we can find out how many times each species appears in the data frame with the same method that we learned in tutorial 11.

table(iris$Species)
## 
##     setosa versicolor  virginica 
##         50         50         50

The output shows that we have 50 measurements for each of the three species. For more information, we can use the ?-operator from tutorial 01. After we run ?iris from the console, we can read the documentation of the data frame in the bottom right pane, including references to the source of the data.
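
As a brief aside on the Species column, the following sketch checks that it really is a factor and that counting works the same way after converting it to a character vector. The functions is.factor(), levels() and as.character() are base R functions that go beyond this tutorial, and the variable names are only illustrative.

```r
# Species is stored as a factor, not as a character vector
species_is_factor <- is.factor(iris$Species)
print(species_is_factor)

# levels() lists the distinct categories of a factor
species_levels <- levels(iris$Species)
print(species_levels)

# as.character() converts the factor into an ordinary character vector
species_chr <- as.character(iris$Species)

# Counting with table() works the same way on both versions
counts_factor <- table(iris$Species)
counts_chr <- table(species_chr)
print(counts_chr)
```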

Let’s shift our attention from the Species column, which contains categorical data, to one of the numeric columns (e.g. Sepal.Width). To summarize quantitative data, functions such as unique() or table() aren’t as useful as they are for categorical data. In principle, we can run table(iris$Sepal.Width).

table(iris$Sepal.Width)
## 
##   2 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9   3 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9   4 
##   1   3   4   3   8   5   9  14  10  26  11  13   6  12   6   4   3   6   2   1 
## 4.1 4.2 4.4 
##   1   1   1

It shows all the distinct values in the column and how often they appear, but this output contains more information than what we usually need. We often would like to summarize quantitative data with only a few numbers.
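
To see how unwieldy such a table can get, here is a small sketch that counts the distinct values and picks out the most frequent one. The helper function which.max() is base R but not part of this tutorial, and the variable names are made up for illustration.

```r
widths <- iris$Sepal.Width

# Number of distinct values in the column
n_distinct <- length(unique(widths))
print(n_distinct)

# Frequency table and the value that appears most often
width_counts <- table(widths)
most_common <- names(width_counts)[which.max(width_counts)]
print(most_common)
print(max(width_counts))
```

Note that the names of a table are character strings, so most_common is the text "3" rather than the number 3.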

Let’s start a new script called iris.R in which we apply R functions that compute summary statistics. The most common summary statistic is the mean (i.e. the sum of all data divided by the number of measurements). In R, calculating the mean is easy. All you need is the function mean() and a numeric vector. In our example, we type mean(iris$Sepal.Width).

mean(iris$Sepal.Width)
## [1] 3.057333
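
To connect the definition with the function, the sketch below computes the mean by hand with sum() and length() and compares it with the result of mean(). The variable names are only illustrative.

```r
widths <- iris$Sepal.Width

# Sum of all measurements divided by the number of measurements
manual_mean <- sum(widths) / length(widths)
print(manual_mean)

# The built-in function should give the same value
builtin_mean <- mean(widths)
print(all.equal(manual_mean, builtin_mean))
```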

Besides the mean, another common summary statistic is the median. The median is the middle value of the sorted data: at most half of the data are smaller and at most half are larger than the median. In R, you obtain the median with the function named, guess what, median().

median(iris$Sepal.Width)
## [1] 3
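
The following sketch works out the median by hand for our 150 measurements. With an even number of values, the median is the average of the two middle values in the sorted data. Because several flowers share the same sepal width, the sketch also counts how many values lie below, at and above the median. The variable names are only illustrative.

```r
widths <- iris$Sepal.Width

# Sort the data and average the two middle values (n is even here)
sorted_widths <- sort(widths)
n <- length(sorted_widths)
manual_median <- (sorted_widths[n / 2] + sorted_widths[n / 2 + 1]) / 2
print(manual_median)

# Count the values below, equal to and above the median
n_below <- sum(widths < median(widths))
n_equal <- sum(widths == median(widths))
n_above <- sum(widths > median(widths))
print(c(below = n_below, equal = n_equal, above = n_above))
```

The counts show that fewer than half of the values are strictly smaller and fewer than half are strictly larger than the median, because many values are equal to it.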

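For the iris data, there isn’t much difference between the median and the mean. This is often the case when the data are spread roughly symmetrically around the centre. But when the mean and median are clearly different, for example because of a few extreme values, we may want to report both values.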
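
To illustrate when the two statistics disagree, here is a tiny made-up example. The numbers are hypothetical and have nothing to do with the iris data; they only show how a single very large value pulls the mean upwards while the median stays put.

```r
# Hypothetical measurements with one unusually large value (not from iris)
skewed <- c(1, 2, 3, 4, 100)

skewed_mean <- mean(skewed)      # pulled upwards by the value 100
skewed_median <- median(skewed)  # middle value of the sorted data

print(skewed_mean)
print(skewed_median)
```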

Sometimes we also want a measure of the spread in the data. For example, what are the smallest and largest numbers? R returns the minimum of a vector with the function min().

min(iris$Sepal.Width)
## [1] 2

Similarly, we get the maximum with the function max().

max(iris$Sepal.Width)
## [1] 4.4

The range() function combines the minimum and maximum into a single vector.

range(iris$Sepal.Width)
## [1] 2.0 4.4
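
As a quick check, the sketch below builds the same vector from min() and max() and then uses diff(), a base R function not covered in this tutorial, to turn the two extremes into a single number for the width of the range.

```r
widths <- iris$Sepal.Width

# range() returns the minimum and maximum as a vector of length two
width_range <- range(widths)
manual_range <- c(min(widths), max(widths))
print(identical(width_range, manual_range))

# Difference between the maximum and the minimum
range_width <- diff(width_range)
print(range_width)
```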

While the range tells us the extremes in the distribution, the standard deviation gives a better impression of typical differences from the mean. For the precise definition of the standard deviation, have a look at our textbook. In R, you calculate the standard deviation with the function sd().

sd(iris$Sepal.Width)
## [1] 0.4358663
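
For readers who want to see what sd() does, here is a sketch of the calculation by hand. R’s sd() computes the sample standard deviation, which divides the sum of squared differences from the mean by the number of measurements minus one. The variable names are only illustrative.

```r
widths <- iris$Sepal.Width

# Squared differences of each measurement from the mean
squared_diffs <- (widths - mean(widths))^2

# Sample standard deviation: divide by n - 1, then take the square root
manual_sd <- sqrt(sum(squared_diffs) / (length(widths) - 1))
print(manual_sd)

# Compare with the built-in function
print(all.equal(manual_sd, sd(widths)))
```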

Another measure for the spread of a distribution is the interquartile range. It’s the difference between the upper and lower quartile, so roughly the middle 50% of the data lie between these two quartiles. The R command for the interquartile range is IQR() in capital letters.

IQR(iris$Sepal.Width)
## [1] 0.5
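
The sketch below reproduces this number from the two quartiles. It uses quantile(), a base R function that this tutorial does not cover, with its default settings. The variable names are only illustrative.

```r
widths <- iris$Sepal.Width

# Lower (25%) and upper (75%) quartile
quartiles <- quantile(widths, probs = c(0.25, 0.75))
print(quartiles)

# Interquartile range: upper quartile minus lower quartile
manual_iqr <- unname(quartiles[2] - quartiles[1])
print(manual_iqr)

# Compare with the built-in function
print(all.equal(manual_iqr, IQR(widths)))
```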

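Instead of typing mean(), median(), range() and IQR() one by one, we can get the same information with the function summary(). It doesn’t print the interquartile range directly, but it shows the two quartiles from which it is calculated.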

summary(iris$Sepal.Width) 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   2.800   3.000   3.057   3.300   4.400

It returns a vector with six numbers: the minimum, the lower or first quartile (i.e. about 25% of the data are lower), the median, the mean, the upper or third quartile (i.e. about 25% of the data are bigger) and the maximum. These numbers are usually enough information to quickly convey the essence of our data.
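
If we need one of these numbers for a later calculation, we can store the result of summary() and pick out an element by its name. The sketch below assumes the labels shown in the printed output and uses made-up variable names.

```r
width_summary <- summary(iris$Sepal.Width)

# The labels printed above are the names of the elements
print(names(width_summary))

# Extract a single statistic by name
width_median <- width_summary[["Median"]]
print(width_median)

# The interquartile range follows from the two quartiles
iqr_from_summary <- width_summary[["3rd Qu."]] - width_summary[["1st Qu."]]
print(iqr_from_summary)
```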

Let’s summarise the most common R functions for summary statistics.

  • mean() for the mean,
  • median() for the median,
  • min() for the minimum,
  • max() for the maximum,
  • sd() for the standard deviation and
  • IQR() for the interquartile range.
  • The summary() function combines several of these statistics into one function call.
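
The functions in this list can also be bundled into a small helper of our own. The function describe_numeric() below is hypothetical (it is not part of R) and is only a sketch of how the individual statistics might be collected in one named vector.

```r
# Hypothetical helper, not part of base R: collect the statistics
# from the list above in a single named vector
describe_numeric <- function(x) {
  c(
    mean = mean(x),
    median = median(x),
    min = min(x),
    max = max(x),
    sd = sd(x),
    IQR = IQR(x)
  )
}

# Apply the helper to the sepal widths
width_stats <- describe_numeric(iris$Sepal.Width)
print(round(width_stats, 3))

# The same helper works for any other numeric column
length_stats <- describe_numeric(iris$Petal.Length)
print(round(length_stats, 3))
```

Unlike summary(), this helper includes the standard deviation, which can be handy when we want all measures of spread in one place.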

Looking at numbers is fine, but can feel a bit dry. Often it’s more appealing to summarize quantitative data with diagrams. In our next video, we find out how R can visualise quantitative data with box plots.

See you soon.