Chapter 14 Box Plots

Welcome back to Quantitative Reasoning! Last time we learned that R’s summary() function gives a six-number summary of a numeric vector.

summary(iris$Sepal.Width)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   2.800   3.000   3.057   3.300   4.400

A box plot displays information that is very similar to the output of the summary() function, but as an image instead of a list of numbers. In R, we make a box plot with the boxplot() function.

boxplot(iris$Sepal.Width)

We may want to resize the plots tab to get a nicer view of the box plot. The thick line in the middle is the median. The upper and lower edges of the box are the upper and lower quartiles. Values that lie more than 1.5 times the interquartile range beyond either edge of the box are considered outliers and are shown as circles. The whiskers that extend from the box reach to the minimum and maximum of the remaining, non-outlier values.

We can fine-tune the box plot by adding more arguments inside the parentheses. For example, to give the box a light green colour, we add the argument “col” for colour, an equals sign and, in quotes, “lightgreen”.

boxplot(iris$Sepal.Width, col = "lightgreen")
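Returning briefly to the anatomy of the box plot: the numbers behind the box, the whiskers and the circles can be inspected directly. The following is a minimal sketch that uses boxplot.stats(), which is available in a standard R session. The variable names s, box_height, lower_fence, upper_fence and inside are our own choices for illustration. Note that boxplot.stats() uses Tukey's hinges for the box edges, which for some data sets can differ slightly from the quartiles reported by summary().

```r
# Statistics behind the single-box plot of sepal width
s <- boxplot.stats(iris$Sepal.Width)

# Five numbers: lower whisker, lower box edge, median,
# upper box edge, upper whisker
s$stats

# Values that are drawn as circles (outliers)
s$out

# Recompute the 1.5 * IQR rule by hand
box_height  <- s$stats[4] - s$stats[2]  # height of the box
lower_fence <- s$stats[2] - 1.5 * box_height
upper_fence <- s$stats[4] + 1.5 * box_height

# The whiskers end at the most extreme values inside the fences
inside <- iris$Sepal.Width >= lower_fence &
  iris$Sepal.Width <= upper_fence
range(iris$Sepal.Width[inside])
```

The last line should reproduce the first and fifth entries of s$stats, which are the ends of the two whiskers.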

Light green is only one of the many colours that R can produce. A complete list of the colour names that R understands is available on this web page: http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf.
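The colour names can also be looked up without leaving R. As a small sketch, the built-in colors() function returns all names as a character vector; the variable name all_colours and the search term "green" are just our choices for this example.

```r
# All colour names known to R
all_colours <- colors()
length(all_colours)

# Check that a particular name is valid before using it
"lightgreen" %in% all_colours

# List the names that contain the word "green"
grep("green", all_colours, value = TRUE)
```

Any name from this vector can be used, in quotes, as the value of the col argument.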

We can label the vertical axis of the box plot with the argument ylab (for y-axis label), an equals sign and, in quotes, something informative such as "Sepal width (cm)".

boxplot(iris$Sepal.Width, col = "lightgreen", ylab = "Sepal width (cm)")

We can give the plot a title, for example “Anderson’s Iris Data”, with the argument main = "Anderson's Iris Data". The command is now quite long for a single line, so to make the code more readable, let’s insert a few line breaks and stay within the recommended 80-character line length.

boxplot(iris$Sepal.Width,
        col = "lightgreen", 
        ylab = "Sepal width (cm)",
        main = "Anderson's Iris Data")

After you run the command, you’ll see the title at the top of the plot.

A plot with a single box can be informative, but the real strength of box plots is that we can put multiple boxes next to each other. For example, let’s make one box for each of the three iris species (setosa, versicolor and virginica) to determine whether they tend to have different sepal widths. Thanks to our expertise in subsetting data frames, we could in principle make three separate box plots, one for each species, but let me show you a trick for getting all three boxes into the same plot.

Instead of using iris$Sepal.Width as our first argument, we tell R to split the data by species. We do this by adding a tilde followed by the column by which we want to split the data; in our example, that is the column iris$Species. The tilde means “as a function of”, so we’re telling R to view the sepal width as a function of the species. When we run this code, we get a figure with three boxes.

boxplot(iris$Sepal.Width ~ iris$Species,
        col = "lightgreen", 
        ylab = "Sepal width (cm)",
        main = "Anderson's Iris Data")

From this plot, it’s evident that setosa tends to have the widest sepals and versicolor the narrowest. Multiple-box plots are a great way of visualizing data because we can easily compare distributions without overwhelming the viewer with details.
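The visual impression can be checked numerically. As a minimal sketch, tapply() applies median() to the sepal widths of each species separately, which gives the heights of the three thick lines in the plot. The variable name species_medians is our own choice.

```r
# Median sepal width for each species
# (the thick lines in the three boxes)
species_medians <- tapply(iris$Sepal.Width, iris$Species, median)
species_medians

# Species with the largest and the smallest median
names(which.max(species_medians))
names(which.min(species_medians))
```

The species with the largest median should be setosa and the one with the smallest versicolor, in agreement with the plot.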

There is an alternative to our current argument list for the boxplot() function. We can remove the data frame name and the $ from both column references in the first argument and instead pass another argument that names the data frame. Here’s what I mean: I erase both occurrences of iris$ in the first argument and add another argument, data = iris.

boxplot(Sepal.Width ~ Species,
        data = iris,
        col = "lightgreen", 
        ylab = "Sepal width (cm)",
        main = "Anderson's Iris Data")

In our example, both code options are about equally readable, but when the data frame’s name is longer than the four letters in the word “iris”, the second option is, in my opinion, a little more readable.
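To confirm that the two ways of writing the command describe the same plot, here is a small sketch that uses the plot = FALSE argument of boxplot(). With this argument, R computes the statistics for each box and returns them as a list instead of drawing a figure. The variable names with_dollar and with_data are our own choices.

```r
# Compute the box plot statistics without drawing anything
with_dollar <- boxplot(iris$Sepal.Width ~ iris$Species, plot = FALSE)
with_data   <- boxplot(Sepal.Width ~ Species, data = iris, plot = FALSE)

# Group names, one per box
with_data$names

# One column of five numbers per box, as in the single-box case
with_data$stats

# Both notations lead to the same numbers
identical(with_dollar$stats, with_data$stats)
```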

R has many more graphics tricks up its sleeve, and we will continue exploring R graphics in our next tutorial. For now, let’s summarise what we’ve learned so far.

  • Box plots show the median, the lower and upper quartiles, and the minimum and maximum of the non-outlier values in a single diagram; outliers are drawn as individual circles.
  • R makes box plots with the boxplot() function.
  • We can show multiple boxes in one plot, split by category, with the tilde notation.

Box plots are great for comparing data across different groups (e.g. species), but sometimes we want to see the distribution for one group in greater detail. In such cases, we can visualise the data with a histogram. Next time, we will learn how to make histograms with R.

See you soon.