Chapter 22 Scatter Plots and Line Charts
Welcome back to Quantitative Reasoning! In this tutorial, we learn how to make plots that visualise the association between two quantitative variables. We treat one variable as the x-coordinate and the other variable as the y-coordinate. In a scatter plot, we show these coordinate pairs as points in a diagram. In a line chart, we connect consecutive coordinate pairs with lines. It’s best explained with an example.
Below this video, you will find a link (https://michaelgastner.com/data_for_QR/hopkins.csv) to the spreadsheet hopkins.csv. Please import the data into R to follow along with me. The spreadsheet contains wind speed measurements from Hopkins Memorial Forest, a 2500-acre reserve in the northeastern United States. Our textbook discusses these data in chapter 4.
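The import step itself is not shown here. As a minimal sketch, assuming hopkins.csv is a comma-separated file with a header row containing date, day, month and wind, read.csv() would load it. So that the block runs without the download, it reads only the six rows that head() prints, typed in by hand; the name hopkins_first6 is made up for this illustration.

```r
# With the downloaded file in the working directory, the import would
# presumably look like this (assumption: comma-separated with a header row):
# hopkins <- read.csv("hopkins.csv")

# Stand-in for the file: only the first six rows, typed in by hand.
csv_text <- "date,day,month,wind
2011-01-01,1,1,0.53
2011-01-02,2,1,2.13
2011-01-03,3,1,4.43
2011-01-04,4,1,0.43
2011-01-05,5,1,3.38
2011-01-06,6,1,0.82"

hopkins_first6 <- read.csv(text = csv_text)

# Check the column names and types after importing.
str(hopkins_first6)
```

Running str() right after an import is a quick way to confirm that the numeric columns were read as numbers rather than as text.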
head(hopkins)
##         date day month wind
## 1 2011-01-01   1     1 0.53
## 2 2011-01-02   2     1 2.13
## 3 2011-01-03   3     1 4.43
## 4 2011-01-04   4     1 0.43
## 5 2011-01-05   5     1 3.38
## 6 2011-01-06   6     1 0.82
The first column in the hopkins data frame shows the dates of the measurements.
In the second column, each date is converted into the day of the year from 1
to 365.
The third column contains the month, and the fourth column is the average wind
speed in miles per hour.
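How the day column was produced is not stated in this tutorial. One way to obtain the day of the year from a date in base R, shown as a sketch and not necessarily the method used for hopkins.csv, is the "%j" format code. The two example dates are chosen for illustration.

```r
# Two example dates: 4 January 2011 and the last day of 2011.
dates <- as.Date(c("2011-01-04", "2011-12-31"))

# "%j" formats a date as the day of the year ("001" to "366");
# as.integer() drops the leading zeros.
day_of_year <- as.integer(format(dates, "%j"))
day_of_year

# "%m" gives the month number in the same way.
month_number <- as.integer(format(dates, "%m"))
month_number
```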
How can we find out whether the wind speed varies during the year?
Let’s make a scatter plot in which the day of the year is the x-coordinate, and
wind speed is the y-coordinate.
In R, we make scatter plots with the function plot().
We can enter the variable names of the x-coordinate and y-coordinate in three
different ways.
The first option is to insert the variable names in the order “x-coordinate
comma y-coordinate”.
In our example, we type
plot(hopkins$day, hopkins$wind)
The scatter plot then appears in the bottom right pane.
The second option is to insert the x-coordinate and y-coordinate using R’s formula notation: “y-coordinate as a function of x-coordinate”. In our case, we type
plot(hopkins$wind ~ hopkins$day)
A third option is to leave out the name of the data frame in the formula and
put it in a separate data argument.
plot(wind ~ day, data = hopkins)
To customise the appearance, the plot() function accepts similar arguments
as boxplot() and hist().
For example, we can change
- the plot title with main,
- the x-axis label with xlab,
- the y-axis label with ylab and
- the colour of the points with col.
plot(wind ~ day,
     data = hopkins,
     main = "Wind Speed in Hopkins Memorial Forest (Year 2011)",
     xlab = "Day of year",
     ylab = "Average wind speed (mph)",
     col = "red")
We can change the point size with the argument cex, which stands for
“character expansion factor”.
The default value of cex is 1.
A value of cex less than 1 makes the points smaller.
For example, here is the plot after setting cex to 0.75.
plot(wind ~ day,
     data = hopkins,
     main = "Wind Speed in Hopkins Memorial Forest (Year 2011)",
     xlab = "Day of year",
     ylab = "Average wind speed (mph)",
     col = "red",
     cex = 0.75)
We can change the point symbols with the argument pch, which stands for
“point character”.
For example, we change the points from circles to squares with pch = 0.
plot(wind ~ day,
     data = hopkins,
     main = "Wind Speed in Hopkins Memorial Forest (Year 2011)",
     xlab = "Day of year",
     ylab = "Average wind speed (mph)",
     col = "red",
     cex = 0.75,
     pch = 0)
We can find the options for the pch argument by typing ?points in the console.
When we scroll down in the R documentation page, we can see a list of point
symbols with their numeric codes.
The default for pch is 1, which corresponds to an open circle.
If we change pch to 0, we get squares.
If we change pch to 2, we get triangles, and so on.
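Instead of scrolling through the documentation, we can also draw the symbols ourselves. The following is a minimal sketch that plots the numeric codes 0 to 25 on a grid; the grid layout and the variable names codes, grid_x and grid_y are choices made for this illustration only.

```r
# Symbol codes 0 to 25 are the numeric pch values listed under ?points.
codes <- 0:25

# Arrange the codes on a grid with seven symbols per row.
grid_x <- codes %% 7    # column index: 0 to 6
grid_y <- codes %/% 7   # row index: 0 to 3

plot(grid_x, grid_y,
     pch = codes,
     cex = 1.5,
     xlim = c(-0.5, 6.5),
     ylim = c(3.5, -0.5),  # reversed limits so that code 0 is in the top row
     xlab = "", ylab = "", axes = FALSE,
     main = "Point symbols for pch = 0 to 25")

# Print each code to the right of its symbol.
text(grid_x, grid_y, labels = codes, pos = 4)
```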
Our scatter plot has time along the x-axis.
Such plots are often shown with lines that connect the points for consecutive
dates.
For example, stock market trends are often plotted in this manner.
We can make a line chart by adding the argument type = "l" to the plot()
function.
plot(wind ~ day,
     data = hopkins,
     main = "Wind Speed in Hopkins Memorial Forest (Year 2011)",
     xlab = "Day of year",
     ylab = "Average wind speed (mph)",
     col = "red",
     cex = 0.75,
     pch = 0,
     type = "l")
It’s also possible to show points and lines. In this case, we must use the
argument type = "b", where "b" stands for “both”.
plot(wind ~ day,
     data = hopkins,
     main = "Wind Speed in Hopkins Memorial Forest (Year 2011)",
     xlab = "Day of year",
     ylab = "Average wind speed (mph)",
     col = "red",
     cex = 0.75,
     pch = 0,
     type = "b")
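To compare the plot types directly, it can help to draw them side by side. The sketch below uses a small made-up data set (not the Hopkins measurements) and the par(mfrow = ...) layout setting, which is not covered in this tutorial. The value type = "p" is the default of plot() and shows points only.

```r
# Made-up data for illustration only (not the Hopkins measurements).
x <- 1:10
y <- c(3, 5, 4, 6, 8, 7, 9, 8, 10, 12)

# Show three panels in one row and remember the previous layout.
old_par <- par(mfrow = c(1, 3))

plot(x, y, type = "p", main = 'type = "p" (points)')
plot(x, y, type = "l", main = 'type = "l" (lines)')
plot(x, y, type = "b", main = 'type = "b" (both)')

# Restore the previous layout so that later plots fill the whole pane.
par(old_par)
```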
Looking at the plot, we get the overall impression that wind speeds tend to be higher at the beginning and end of the year than in the middle. However, there are a lot of fluctuations, which make it difficult to see the trend clearly. Sometimes it helps to add a smooth trend curve to the plot. Our textbook mentions a popular tool to compute a trend curve: LOWESS, which stands for Locally Weighted Scatterplot Smoothing. Adding a LOWESS curve to an R scatter plot is very easy.
Let’s first revert to our earlier version of the plot which only showed points
but no lines so that we can see an additional curve more clearly.
Then we add a LOWESS curve with the command
lines(lowess(hopkins$day, hopkins$wind)).
plot(wind ~ day,
     data = hopkins,
     main = "Wind Speed in Hopkins Memorial Forest (Year 2011)",
     xlab = "Day of year",
     ylab = "Average wind speed (mph)",
     col = "red",
     cex = 0.75,
     pch = 0)
lines(lowess(hopkins$day, hopkins$wind))
The LOWESS curve confirms that wind speeds tend to have a minimum around day 220.
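We do not have to read the minimum off the plot by eye. The function lowess() returns a list with the smoothed coordinates in the components x and y, so we can look up where the curve is lowest. The sketch below uses synthetic data as a stand-in, assuming a seasonal pattern plus random noise, so its numbers say nothing about the real Hopkins measurements. It also tries the argument f, the smoother span, whose default value is 2/3.

```r
# Synthetic stand-in for the Hopkins data (assumption: a seasonal pattern
# with random noise). With the real data, use hopkins$day and hopkins$wind.
set.seed(1)
day <- 1:365
wind <- 3 + 1.5 * cos(2 * pi * day / 365) + rnorm(365, sd = 0.5)

# lowess() returns a list with the smoothed coordinates in $x and $y.
fit <- lowess(day, wind)

# Day at which the smoothed curve is lowest.
day_of_minimum <- fit$x[which.min(fit$y)]
day_of_minimum

plot(day, wind, col = "red", cex = 0.75, pch = 0)
lines(fit)                                  # default smoother span f = 2/3
lines(lowess(day, wind, f = 0.2), lty = 2)  # smaller f follows the data more closely
```

A smaller value of f makes the curve follow short-term fluctuations more closely, whereas a larger value gives a smoother curve.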
In summary, we learned that we can make scatter plots with the R function
plot().
There are many arguments to customise the plot.
For example, cex changes the size of the points, and pch changes the point
symbol.
We can use the argument type = "l" to produce a line chart.
We can add a smooth trend curve to the plot with the command
lines(lowess(..., ...)), where the first argument in the parentheses is the
variable plotted along the x-axis, and the second argument is the variable on
the y-axis.
Next time we learn how we can show multiple pairs of variables in a scatter plot.
See you soon.