Chapter 22 Scatter Plots and Line Charts

Welcome back to Quantitative Reasoning! In this tutorial, we learn how to make plots that visualise the association between two quantitative variables. We treat one variable as the x-coordinate and the other as the y-coordinate. In a scatter plot, we show these coordinate pairs as points in a diagram. In a line chart, we connect consecutive coordinate pairs with lines. It’s best explained with an example.

Below this video, you will find a link (https://michaelgastner.com/data_for_QR/hopkins.csv) to the spreadsheet hopkins.csv. Please import the data into R as a data frame called hopkins to follow along with me. The spreadsheet contains wind speed measurements in Hopkins Memorial Forest, a 2500-acre reserve in the northeastern United States. Our textbook discusses these data in chapter 4.
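As a rough sketch of the import step, the code below parses the first six rows of the file, copied from the head() output shown in this tutorial, so that it runs without a download. The name hopkins_preview and the assumption that the file is comma-separated with a header row are ours, not confirmed by the video.

```r
# First six rows of hopkins.csv, copied from the head() output in this tutorial.
# Assumption: the file is comma-separated and starts with a header row.
csv_text <- "date,day,month,wind
2011-01-01,1,1,0.53
2011-01-02,2,1,2.13
2011-01-03,3,1,4.43
2011-01-04,4,1,0.43
2011-01-05,5,1,3.38
2011-01-06,6,1,0.82"

# read.csv() can parse text directly, which keeps this sketch self-contained.
hopkins_preview <- read.csv(text = csv_text)
str(hopkins_preview)

# With the downloaded file in the working directory, the call would be:
# hopkins <- read.csv("hopkins.csv")
```

In RStudio, the "Import Dataset" button in the Environment pane is an alternative to typing read.csv() by hand.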

head(hopkins)
##         date day month wind
## 1 2011-01-01   1     1 0.53
## 2 2011-01-02   2     1 2.13
## 3 2011-01-03   3     1 4.43
## 4 2011-01-04   4     1 0.43
## 5 2011-01-05   5     1 3.38
## 6 2011-01-06   6     1 0.82

The first column in the hopkins data frame shows the dates of the measurements. In the second column, each date is converted into the day of the year from 1 to 365. The third column contains the month, and the fourth column is the average wind speed in miles per hour.
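We do not know how the day and month columns were actually produced. As one possible way, assuming the dates are stored as text in year-month-day form, base R can derive both from the date column. The variable names below are illustrative.

```r
# The six dates from the head() output, plus the last day of 2011 for comparison.
dates <- as.Date(c("2011-01-01", "2011-01-02", "2011-01-03",
                   "2011-01-04", "2011-01-05", "2011-01-06",
                   "2011-12-31"))

# "%j" is the day of the year as text ("001" to "366");
# as.integer() drops the leading zeros.
day_of_year <- as.integer(format(dates, "%j"))

# "%m" is the month as text ("01" to "12").
month_number <- as.integer(format(dates, "%m"))

data.frame(date = dates, day = day_of_year, month = month_number)
```

Because 2011 is not a leap year, 31 December is day 365.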

How can we find out whether the wind speed varies during the year? Let’s make a scatter plot in which the day of the year is the x-coordinate, and wind speed is the y-coordinate. In R, we make scatter plots with the function plot(). We can enter the variable names of the x-coordinate and y-coordinate in three different ways. The first option is to insert the variable names in the order “x-coordinate comma y-coordinate”. In our example, we type

plot(hopkins$day, hopkins$wind)

The scatter plot then appears in the bottom right pane.

The second option is to insert the x-coordinate and y-coordinate using R’s formula notation: “y-coordinate as a function of x-coordinate”. In our case, we type

plot(hopkins$wind ~ hopkins$day)

A third option is to leave out the name of the data frame in the formula and put it in a separate data argument.

plot(wind ~ day, data = hopkins)
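To compare the three options, here is a small sketch that draws them side by side. It uses only the six rows from the head() output as stand-in data, stored under the illustrative name hopkins_preview. The points are identical in all three panels; only the default axis labels differ.

```r
# Stand-in data: the six rows printed by head(hopkins) in this tutorial.
hopkins_preview <- data.frame(
  day  = 1:6,
  wind = c(0.53, 2.13, 4.43, 0.43, 3.38, 0.82)
)

# Draw three panels side by side and restore the previous layout afterwards.
old_par <- par(mfrow = c(1, 3))
plot(hopkins_preview$day, hopkins_preview$wind, main = "x, y")
plot(hopkins_preview$wind ~ hopkins_preview$day, main = "y ~ x")
plot(wind ~ day, data = hopkins_preview, main = "y ~ x with data")
par(old_par)
```

The third form gives the tidiest default labels ("day" and "wind"), because the data frame name does not appear in the formula.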

To customise the appearance, the plot() function accepts arguments similar to those of boxplot() and hist(). For example, we can change

  • the plot title with main,
  • the x-axis label with xlab,
  • the y-axis label with ylab and
  • the colour of the points with col.

plot(wind ~ day,
     data = hopkins,
     main = "Wind Speed in Hopkins Memorial Forest (Year 2011)",
     xlab = "Day of year",
     ylab = "Average wind speed (mph)",
     col = "red")

We can change the point size with the argument cex, which stands for “character expansion factor”. The default value of cex is 1. A value of cex less than 1 makes the points smaller. For example, here is the plot after setting cex to 0.75.

plot(wind ~ day,
     data = hopkins,
     main = "Wind Speed in Hopkins Memorial Forest (Year 2011)",
     xlab = "Day of year",
     ylab = "Average wind speed (mph)",
     col = "red",
     cex = 0.75)

We can change the point symbols with the argument pch, which stands for “plotting character”. For example, we change the points from circles to squares with pch = 0.

plot(wind ~ day,
     data = hopkins,
     main = "Wind Speed in Hopkins Memorial Forest (Year 2011)",
     xlab = "Day of year",
     ylab = "Average wind speed (mph)",
     col = "red",
     cex = 0.75,
     pch = 0)

We can find the options for the pch argument by typing ?points in the console. When we scroll down in the R documentation page, we can see a list of point symbols with their numeric codes. The default for pch is 1, which corresponds to an open circle. If we change pch to 0, we get squares. If we change pch to 2, we get triangles, and so on.
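As an alternative to scrolling through the help page, a quick sketch like the following draws the symbols for the codes 0 to 25 in one row. The variable name symbol_codes and the layout choices are our own.

```r
# Numeric pch codes 0 to 25.
symbol_codes <- 0:25

# Plot one point per code, all at the same height, each with its own symbol.
plot(symbol_codes, rep(1, length(symbol_codes)),
     pch = symbol_codes,
     ylim = c(0.5, 1.5),
     yaxt = "n",            # hide the y-axis, which carries no information here
     xlab = "pch code",
     ylab = "",
     main = "Point symbols for pch = 0 to 25")

# Print each code above its symbol.
text(symbol_codes, 1.2, labels = symbol_codes, cex = 0.7)
```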

Our scatter plot has time along the x-axis. Such plots are often shown with lines that connect the points for consecutive dates. For example, stock market trends are often plotted in this manner. We can make a line chart by adding the argument type = "l" to the plot() function.

plot(wind ~ day,
     data = hopkins,
     main = "Wind Speed in Hopkins Memorial Forest (Year 2011)",
     xlab = "Day of year",
     ylab = "Average wind speed (mph)",
     col = "red",
     cex = 0.75,
     pch = 0,
     type = "l")
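One detail worth keeping in mind: with type = "l", R joins the points in the order in which the rows appear, not in order of the x-coordinate. The sketch below illustrates this with the six rows from the head() output, deliberately shuffled. The names hopkins_preview, shuffled and sorted are illustrative.

```r
# Stand-in data: the six rows printed by head(hopkins) in this tutorial.
hopkins_preview <- data.frame(
  day  = 1:6,
  wind = c(0.53, 2.13, 4.43, 0.43, 3.38, 0.82)
)

# Put the rows in an arbitrary order, then sort them by day again.
shuffled <- hopkins_preview[c(3, 1, 6, 2, 5, 4), ]
sorted <- shuffled[order(shuffled$day), ]

# Left: the line zigzags back and forth. Right: the line runs from day 1 to day 6.
old_par <- par(mfrow = c(1, 2))
plot(wind ~ day, data = shuffled, type = "l", main = "Rows out of order")
plot(wind ~ day, data = sorted, type = "l", main = "Rows sorted by day")
par(old_par)
```

The rows shown by head(hopkins) are already in date order, so sorting is likely only needed for data that arrive unsorted.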

It’s also possible to show points and lines. In this case, we can use the argument type = "b", where "b" stands for “both”.

plot(wind ~ day,
     data = hopkins,
     main = "Wind Speed in Hopkins Memorial Forest (Year 2011)",
     xlab = "Day of year",
     ylab = "Average wind speed (mph)",
     col = "red",
     cex = 0.75,
     pch = 0,
     type = "b")

Looking at the plot, we get the overall impression that wind speeds tend to be higher at the beginning and end of the year than in the middle. However, there are a lot of fluctuations, which make it difficult to see the trend clearly. Sometimes it helps to add a smooth trend curve to the plot. Our textbook mentions a popular tool to compute a trend curve: LOWESS, which stands for Locally Weighted Scatterplot Smoothing. Adding a LOWESS curve to an R scatter plot is very easy.

Let’s first revert to our earlier version of the plot which only showed points but no lines so that we can see an additional curve more clearly. Then we add a LOWESS curve with the command lines(lowess(hopkins$day, hopkins$wind)).

plot(wind ~ day,
     data = hopkins,
     main = "Wind Speed in Hopkins Memorial Forest (Year 2011)",
     xlab = "Day of year",
     ylab = "Average wind speed (mph)",
     col = "red",
     cex = 0.75,
     pch = 0)
lines(lowess(hopkins$day, hopkins$wind))

The LOWESS curve confirms that wind speeds tend to have a minimum around day 220, which is in early August.
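Instead of judging the minimum by eye, we can read it off the values that lowess() returns, namely a list with components x and y. The sketch below uses simulated data, NOT the real Hopkins measurements, so the day it reports says nothing about the actual forest. The seasonal pattern, the noise level and all variable names are assumptions made for illustration.

```r
# Simulated stand-in for the wind data (NOT the real measurements):
# a seasonal cosine pattern plus random noise, cut off at zero.
set.seed(1)
day <- 1:365
wind <- pmax(3 + 1.5 * cos(2 * pi * day / 365) + rnorm(365, sd = 1), 0)

# lowess() returns the smoothed curve as a list with components x and y.
# The argument f (default 2/3) is the smoother span; smaller values follow
# the data more closely and give a wigglier curve.
smooth <- lowess(day, wind)

# Scatter plot with the trend curve drawn in a contrasting colour.
plot(day, wind, col = "red", cex = 0.75, pch = 0)
lines(smooth, col = "blue", lwd = 2)

# The day at which the smoothed curve is lowest.
min_day <- smooth$x[which.min(smooth$y)]
min_day
```

With the real data, the same steps would use hopkins$day and hopkins$wind in place of the simulated vectors.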

In summary, we learned that we can make scatter plots with the R function plot(). There are many arguments to customise the plot. For example, cex changes the size of the points, and pch changes the point symbol. We can use the argument type = "l" to produce a line chart. We can add a smooth trend curve to the plot with the command lines(lowess(..., ...)), where the first argument in the parentheses is the variable plotted along the x-axis, and the second argument is the variable on the y-axis.

Next time, we will learn how to show multiple pairs of variables in a scatter plot.

See you soon.