Chapter 20 aggregate()

Welcome back to Quantitative Reasoning! In tutorial 11, we learned how we can count the occurrences of values in a vector with the table() function. In this tutorial, we learn how to compute other kinds of summary statistics for data subsets.

Let’s consider as a concrete example the data frame iris.

View(iris)

In tutorial 14, we produced this box plot. It shows that setosa tends to have the widest sepals and versicolor the narrowest.

Suppose we want to find numeric support for the claim that the values for setosa are largest and those for versicolor smallest. For example, we may wish to calculate the mean of the sepal widths for each of the three species. Applying the mean() function directly to iris$Sepal.Width doesn’t give us the answer because it returns the mean of all elements in the column.

mean(iris$Sepal.Width)
## [1] 3.057333

Instead we would like to have an answer consisting of three numbers, namely the means calculated separately for setosa, versicolor and virginica. The desired output should look similar to this data frame.

##      Species mean_sepal_width
## 1     setosa            3.428
## 2 versicolor            2.770
## 3  virginica            2.974

It has two columns: the species and the mean restricted to this species.

Let’s look at a simple example to clarify the steps involved in this calculation. Here is a data frame with two columns: Sepal.Width and Species.

As a first step, we have to split the input data frame into three separate data frames (i.e. one for each species that occurs in the Species column).

Then we apply the mean() function to the column with the sepal widths, separately for each of the three species.

Finally we combine the three means and the three species names into a data frame.
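The three steps can also be written out by hand. The following is only a minimal sketch of the idea, not a description of how aggregate() works internally. It assumes the base R helpers split() and sapply(), which this tutorial has not introduced, and the variable names (widths_by_species, species_means, result) are made up for illustration.

```r
# Step 1 ("split"): divide the sepal widths into one vector per species.
widths_by_species <- split(iris$Sepal.Width, iris$Species)

# Step 2 ("apply"): calculate the mean of each of the three vectors.
species_means <- sapply(widths_by_species, mean)

# Step 3 ("combine"): put the species names and the means into a data frame.
result <- data.frame(Species = names(species_means),
                     mean_sepal_width = unname(species_means))
result
##      Species mean_sepal_width
## 1     setosa            3.428
## 2 versicolor            2.770
## 3  virginica            2.974
```

Even for a single column and a single function, the manual version needs three separate commands and two intermediate objects.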

It would be a lot of work to carry out all three steps (i.e. “split”, “apply” and “combine”) ourselves. R has a preinstalled function that bundles the three steps into a single command: aggregate(). In its basic form, aggregate() needs three arguments:

  • a formula to communicate which column in the data frame we want to split and which other column is the criterion for the split,
  • the name of the data frame that contains these columns and
  • the function we want to apply to each of the subsets.

In our case, the first argument is Sepal.Width ~ Species. The tilde is the same symbol that we encountered when we worked with box plots in tutorial 14. We learned back then that the tilde stands for “as a function of”, so we view the sepal widths as a function of the species. In the form of aggregate() that we use in this tutorial, the first argument always contains a tilde. Usually, there is a numeric column to the left of the tilde (here Sepal.Width) and a categorical column to the right (here Species). The second argument specifies the data frame in the form data = iris. The third argument in our example is the function mean(), so we write FUN = mean, without parentheses after the function name. When we run this command, we receive the species name in the first column and the mean for the corresponding species in the second column.

aggregate(Sepal.Width ~ Species,
          data = iris,
          FUN = mean)
##      Species Sepal.Width
## 1     setosa       3.428
## 2 versicolor       2.770
## 3  virginica       2.974
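Note that the column of means is named after the original column, Sepal.Width, rather than mean_sepal_width as in the data frame shown earlier. One possible way to rename it, sketched here with the base R function names() and a made-up variable name species_means, is to store the result and assign a new name to its second column.

```r
# Store the data frame returned by aggregate() in a variable.
species_means <- aggregate(Sepal.Width ~ Species,
                           data = iris,
                           FUN = mean)

# Replace the name of the second column with a more descriptive one.
names(species_means)[2] <- "mean_sepal_width"
species_means
##      Species mean_sepal_width
## 1     setosa            3.428
## 2 versicolor            2.770
## 3  virginica            2.974
```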

This example shows the basic use of aggregate(), but sometimes we want to apply the same function to more than one column. For example, the iris data frame contains more information than just the sepal widths. There’s also a column with petal widths, so we may be interested in the species-dependent means of the petal widths too. We can obtain the means of sepal widths and petal widths with the cbind() function. In the first argument of aggregate(), we replace the column name to the left of the tilde by cbind(). Inside the parentheses, we insert the names of the columns whose means we want to calculate: Sepal.Width and Petal.Width.

aggregate(cbind(Sepal.Width, Petal.Width) ~ Species,
          data = iris,
          FUN = mean)
##      Species Sepal.Width Petal.Width
## 1     setosa       3.428       0.246
## 2 versicolor       2.770       1.326
## 3  virginica       2.974       2.026

The returned data frame has one more column than before: the means of the petal widths for each species. By the way, the “c” in cbind() stands for “column”. That is, we bind the columns Sepal.Width and Petal.Width together. There are more applications for cbind(), but at this stage we only need it in combination with aggregate().
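To get a feeling for what cbind() does outside of aggregate(), here is a small sketch with two made-up vectors, a and b. The vector names and values are purely illustrative and do not come from iris.

```r
# Two short, made-up numeric vectors of equal length.
a <- c(1, 2, 3)
b <- c(10, 20, 30)

# cbind() places the vectors side by side, one column per vector.
ab <- cbind(a, b)

# The result has three rows and two columns, named after the vectors.
dim(ab)
## [1] 3 2
colnames(ab)
## [1] "a" "b"
```

Inside the formula of aggregate(), cbind() plays the same role: it bundles several columns so that the function in FUN is applied to each of them.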

In summary, we learned that we can calculate summary statistics of data subsets with the function aggregate(). It performs the operations “split”, “apply” and “combine” in a single function call. Usually, aggregate() needs three arguments:

  • a formula involving the tilde operator,
  • the name of the data frame that contains the data to be summarised and
  • the function to apply to each data subset. Common choices are mean() and sum(), but we can also use other summary statistics functions.

If we want to apply the function in the last argument to multiple columns in the data frame, we use the cbind() function in the first argument.
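As a sketch of how these pieces fit together with a different summary statistic, the following call swaps mean for median(), a base R function that was not used earlier in this tutorial. The variable name species_medians is made up for illustration.

```r
# Same formula and data frame as before; only the FUN argument changes.
species_medians <- aggregate(cbind(Sepal.Width, Petal.Width) ~ Species,
                             data = iris,
                             FUN = median)

# One row per species, one column per summarised variable.
species_medians
```

The structure of the result is the same as for the means: one row per species, with the Species column followed by one column for each variable inside cbind().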

In the next video, we learn how to visualise data subsets with multi-panel plots.

See you soon.