Chapter 37 Functions

Welcome back to Quantitative Reasoning! We’ve worked with many preinstalled functions during this course. All instructions that are followed by a pair of parentheses (e.g. hist(), sd() or lm()) are functions that perform useful operations when we supply appropriate arguments as input. Sometimes R doesn’t have a preinstalled function for operations that would be useful for us. In this tutorial, we learn how to fill this gap by writing our own functions.

In tutorial 25, we talked about z-scores. We used this code snippet to calculate z-scores for the performance of heptathletes in two disciplines (200-metre run and long jump) of the 2012 Olympic heptathlon. You can find the data at the URL linked below this video (https://michaelgastner.com/data_for_QR/hept2012.csv).

hept <- read.csv("hept2012.csv")
hept$z_run200 <-
  (hept$run200 - mean(hept$run200, na.rm = TRUE)) /
  sd(hept$run200, na.rm = TRUE)
hept$z_lj <-
  (hept$lj - mean(hept$lj, na.rm = TRUE)) /
  sd(hept$lj, na.rm = TRUE)

This code did the job for us back then, but the commands are long and complicated. For an outsider, it wouldn’t be obvious what we are trying to accomplish. We can, of course, insert a comment such as “The following lines calculate z-scores”.

# The following lines calculate z-scores
hept$z_run200 <-
  (hept$run200 - mean(hept$run200, na.rm = TRUE)) /
  sd(hept$run200, na.rm = TRUE)
hept$z_lj <-
  (hept$lj - mean(hept$lj, na.rm = TRUE)) /
  sd(hept$lj, na.rm = TRUE)

Still, it would be even better if the commands were replaced by a function with a descriptive name (e.g. zscore()).

hept$z_run200 <- zscore(hept$run200)
hept$z_lj <- zscore(hept$lj)

Code that can be easily understood by another person without specific domain knowledge is called self-documenting. In general, we should aim to write self-documenting code instead of long lines of code with complicated instructions that need explicit comments to be intelligible.

Right now, our code doesn’t work because R doesn’t have a preinstalled function zscore(), but it’s easy to write this function ourselves. Here is the general pattern that we need to follow when we write our own functions.

function_name <- function(param1, param2, ...) {
  # Body of the function:
  #   commands that do something with param1, param2, ...
  #   last evaluated object is returned as function value
}

The objects param1, param2, … are called the parameters of the function. Our zscore() function only has one parameter: a numeric vector. We can give the parameter any name we wish. Let me call it v for “vector”. The function body consists of the general commands that calculate z-scores for a numeric vector v. It’s also a good idea to include a comment that explains the function’s main purpose.

zscore <- function(v) {
  
  # zscore() returns distance from mean in units of standard deviations
  (v - mean(v, na.rm = TRUE)) / sd(v, na.rm = TRUE)
}

It’s best practice to place a function definition before the first call to that function in the script. Before we can call any function written by ourselves, we must first add it to our environment. To accomplish this task, we run the function definition (e.g. by placing the cursor on the line with the keyword function and clicking the Run button). Now the zscore() function appears in the Environment tab. Adding the function to our environment doesn’t directly calculate any concrete z-scores, but we have provided R with two important pieces of information.

R is aware that we may, at some point, call a function with the name zscore.
If we call zscore(), followed by an object inside a pair of parentheses, R makes a copy of this object and calls it v. Then R performs the commands in the function body of zscore(). When finished, R returns the value of the last evaluated expression in the function body, which in our case is the vector of z-scores calculated by (v - mean(v, na.rm = TRUE)) / sd(v, na.rm = TRUE).
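
We can check the copying behaviour with a small experiment. The sketch below uses a hypothetical function add_one() and a made-up vector x, neither of which appears elsewhere in these tutorials. It changes v inside the function body and then confirms that the object we passed in is unchanged.

```r
# add_one() is a hypothetical helper used only for illustration
add_one <- function(v) {

  # Changing v here only affects the function's own copy
  v <- v + 1
  v  # last evaluated object, so this is the returned value
}

x <- c(1, 2, 3)
y <- add_one(x)

print(y)  # 2 3 4
print(x)  # 1 2 3 (the original object is untouched)
```

The same applies to zscore(): calling it never changes hept$run200 itself, which is why we assign the result to a new column.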

The concrete value that we give to a function parameter is called an argument. For example, when we run zscore(hept$run200), the argument is hept$run200. The main advantage of a function is that it can perform a specific set of operations on different objects. If we swap hept$run200 for hept$lj in the parentheses, we still calculate z-scores, but now they are z-scores of a different vector.

hept$z_run200 <- zscore(hept$run200)
hept$z_lj <- zscore(hept$lj)
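
To see what zscore() returns, it can help to try it on a vector that is small enough to check by hand. The numbers below are made up for illustration and are not taken from the heptathlon data. Their mean is 4 and their standard deviation is 2, so we expect the z-scores -1, 0 and 1, with the missing value staying missing.

```r
zscore <- function(v) {

  # zscore() returns distance from mean in units of standard deviations
  (v - mean(v, na.rm = TRUE)) / sd(v, na.rm = TRUE)
}

# Made-up example values (not from hept2012.csv)
toy <- c(2, 4, NA, 6)
z <- zscore(toy)
print(z)  # -1  0 NA  1

# By construction, z-scores have mean 0 and standard deviation 1
print(mean(z, na.rm = TRUE))  # 0
print(sd(z, na.rm = TRUE))    # 1
```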

Functions are mainly used to avoid command sequences that are almost identical copies of each other. Let’s consider another piece of code that we wrote in an earlier lesson. In tutorial 21, we wrote a script that generated a multi-panel plot with histograms of sepal widths for different iris species. I made only a few small edits to make the structure of the code more obvious.

par(mfrow = c(3, 1))
iris_breaks <- seq(1.8, 4.6, 0.1)
iris_ylim <- c(0, 12)
iris_xlab <- "Sepal width (cm)"
hist(iris$Sepal.Width[iris$Species == "setosa"],
     breaks = iris_breaks,
     ylim = iris_ylim,
     xlab = iris_xlab,
     main = "setosa")
grid()
hist(iris$Sepal.Width[iris$Species == "versicolor"],
     breaks = iris_breaks,
     ylim = iris_ylim,
     xlab = iris_xlab,
     main = "versicolor")
grid()
hist(iris$Sepal.Width[iris$Species == "virginica"],
     breaks = iris_breaks,
     ylim = iris_ylim,
     xlab = iris_xlab,
     main = "virginica")
grid()
par(mfrow = c(1, 1))

The structure consists of three repetitions of the hist() function, each followed by a call to the grid() function. The arguments passed to the hist() function only differ in one detail: the name of the species (setosa, versicolor or virginica). We can make the code more legible by turning the repeating parts of the code into a function, say hist_panel(). We’ll treat the species name as a function parameter called species_name. Following the general syntax for a function definition, we write:

hist_panel <- function(species_name) {
  # Function body
}

For the function body, we first copy one instance from our previous code that contains the repeating combination of the hist() and grid() functions. Then we replace the explicit name of the species (e.g. "setosa") by the parameter species_name.

hist_panel <- function(species_name) {
  
  # Show histogram of sepal width distribution for a given species
  hist(iris$Sepal.Width[iris$Species == species_name],
       breaks = iris_breaks,
       ylim = iris_ylim,
       xlab = iris_xlab,
       main = species_name)
  grid()
}

Now we can replace the repeating code blocks by calls to the function hist_panel().

par(mfrow = c(3, 1))
iris_breaks <- seq(1.8, 4.6, 0.1)
iris_ylim <- c(0, 12)
iris_xlab <- "Sepal width (cm)"
hist_panel("setosa")
hist_panel("versicolor")
hist_panel("virginica")
par(mfrow = c(1, 1))

The new code is much shorter, but we can make it even better. We don’t need to repeat the function name hist_panel on three consecutive lines. Setosa, versicolor and virginica are the three species that occur in the column iris$Species, so we can call hist_panel() with a for-loop that iterates over the unique values in iris$Species.

par(mfrow = c(3, 1))
iris_breaks <- seq(1.8, 4.6, 0.1)
iris_ylim <- c(0, 12)
iris_xlab <- "Sepal width (cm)"
for (species in unique(iris$Species)) {
  hist_panel(species)
}

par(mfrow = c(1, 1))
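
It may not be obvious what the loop variable contains, because iris$Species is a factor rather than a character vector. R’s for-loop coerces a factor to a character vector, so species takes the values "setosa", "versicolor" and "virginica" in turn. The following sketch, which is not part of the original script, replaces the plotting call with print() to show this.

```r
# iris is one of R's built-in data sets
for (species in unique(iris$Species)) {
  print(species)
}
# [1] "setosa"
# [1] "versicolor"
# [1] "virginica"
```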

The version with the for-loop makes it clearer how we choose the species to be included in the plot. Consequently, this version comes closer to our aim of writing self-documenting code.
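
One detail worth noting is that hist_panel() as written relies on iris_breaks, iris_ylim and iris_xlab already existing in our environment. A possible variation, sketched here under the hypothetical name hist_panel_with_settings(), passes these settings as additional parameters instead. This is only one way to organise the code, and the parameter names breaks, ylim and xlab are my own choice for illustration.

```r
# Hypothetical variant of hist_panel(): the plot settings are passed
# as parameters instead of being looked up in the environment
hist_panel_with_settings <- function(species_name, breaks, ylim, xlab) {

  # Show histogram of sepal width distribution for a given species
  hist(iris$Sepal.Width[iris$Species == species_name],
       breaks = breaks,
       ylim = ylim,
       xlab = xlab,
       main = species_name)
  grid()
}

par(mfrow = c(3, 1))
iris_breaks <- seq(1.8, 4.6, 0.1)
iris_ylim <- c(0, 12)
iris_xlab <- "Sepal width (cm)"
for (species in unique(iris$Species)) {
  hist_panel_with_settings(species, iris_breaks, iris_ylim, iris_xlab)
}
par(mfrow = c(1, 1))
```

With this variant, everything the function needs is visible in the call itself, at the cost of a longer argument list.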

In summary, we learned how to write our own functions in R.

Functions are defined with the following syntax.

function_name <- function(param1, param2, ...) {
  # Body of the function:
  #   commands that do something with param1, param2, ...
  #   last evaluated object is returned as function value
}

The objects in the parentheses (param1, param2, …) are called parameters.
When a function is called with concrete arguments, R replaces the parameters in the function body by the values of the arguments.
Functions are mainly used to replace repeating code sequences. The resulting code is usually shorter, more readable and easier to maintain.
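
To illustrate the difference between parameters and arguments once more, here is a minimal sketch of a function with more than one parameter. The function rescale() and its parameter names new_min and new_max are hypothetical and not part of these tutorials. The sketch assumes that v contains at least two different values, because otherwise we would divide by zero.

```r
# rescale() is a hypothetical example: it stretches the values in v
# linearly so that they run from new_min to new_max
rescale <- function(v, new_min, new_max) {
  old_range <- range(v, na.rm = TRUE)  # c(smallest, largest) value in v
  (v - old_range[1]) / (old_range[2] - old_range[1]) *
    (new_max - new_min) + new_min
}

# Arguments can be matched to parameters by position ...
print(rescale(c(10, 15, 20), 0, 1))  # 0.0 0.5 1.0

# ... or by name
print(rescale(c(10, 15, 20), new_min = 0, new_max = 100))  # 0 50 100
```

Here v, new_min and new_max are the parameters, whereas c(10, 15, 20), 0 and 1 are the arguments of the first call.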

With this tutorial, our R lessons are drawing to a close. We’ve come a long way from simple vector operations to more complex scripts with for-loops and functions. There are many more R features that are worth exploring. For example, R can make geographic maps, animations and web apps. It’s even possible to write complete books with R. I hope R has piqued your interest. Thank you for watching these tutorials. Goodbye!